2312.17742
Report |
Learning Vision from Models Rivals Learning Vision from Data |
Yonglong Tian, Lijie Fan, Kaifeng Chen, Dina Katabi, Dilip Krishnan, Phillip Isola |
We introduce SynCLR, a novel approach for learning visual representations
exclusively from synthetic images and synthetic captions, without any real
data. We synthesize a large dataset of image captions using LLMs, then use an
off-the-shelf text-to-image model to generate multiple images corresponding to
each synthetic caption. We perform visual representation learning on these
synthetic images via contrastive learning, treating images sharing the same
caption as positive pairs. The resulting representations transfer well to many
downstream tasks, competing favorably with other general-purpose visual
representation learners such as CLIP and DINO v2 in image classification tasks.
Furthermore, in dense prediction tasks such as semantic segmentation, SynCLR
outperforms previous self-supervised methods by a significant margin, e.g.,
improving over MAE and iBOT by 6.2 and 4.3 mIoU on ADE20k for ViT-B/16. |
This paper introduces SynCLR, a novel approach for learning visual representations exclusively from synthetic images and captions generated by LLMs and text-to-image models, without relying on real data. |
SynCLR addresses the limitations of relying on large-scale real datasets for visual representation learning, which can be costly, ethically challenging, and may introduce biases. |
The methodology involves synthesizing a large dataset of image captions using LLMs, generating multiple images per caption using a text-to-image model, and training a visual representation model via contrastive learning and masked image modeling on this synthetic dataset. |
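The contrastive objective described above, in which images generated from the same caption are treated as positives, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the temperature value, and the choice of a uniform target over positives are assumptions made for this sketch, and the masked image modeling term is omitted.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def multi_positive_contrastive_loss(embeddings, caption_ids, temperature=0.1):
    """Cross-entropy between a uniform target over same-caption images and
    the softmax over similarities to all other images in the batch.

    embeddings:  list of feature vectors, one per synthetic image
    caption_ids: caption index for each image (same id means positive pair)
    temperature: assumed value, chosen only for illustration
    """
    z = [l2_normalize(e) for e in embeddings]
    losses = []
    for i in range(len(z)):
        others = [j for j in range(len(z)) if j != i]
        positives = [j for j in others if caption_ids[j] == caption_ids[i]]
        if not positives:
            continue  # an image without a same-caption partner is skipped
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in others
        }
        # Numerically stable log-sum-exp over all other images.
        peak = max(logits.values())
        log_denom = peak + math.log(
            sum(math.exp(v - peak) for v in logits.values())
        )
        # Average negative log-probability assigned to the positives.
        losses.append(
            -sum(logits[j] - log_denom for j in positives) / len(positives)
        )
    if not losses:
        raise ValueError("batch contains no pair of images sharing a caption")
    return sum(losses) / len(losses)
```

With this formulation the loss is small when images from the same caption have nearby embeddings and large when they are closer to images from other captions.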
SynCLR achieves performance comparable to OpenAI's CLIP and to DINO v2 on ImageNet linear evaluation and fine-grained classification tasks, despite relying solely on synthetic data.
SynCLR demonstrates strong transferability to dense prediction tasks, outperforming MAE and iBOT on ADE20K semantic segmentation.
SynCLR's performance scales with the volume of synthetic data, with larger models benefiting more from increased data scale. |
While SynCLR shows promising results, it still lags behind models like DINO v2, which benefit from distillation from larger architectures and high-resolution training.
The dependence on the quality of the synthetic data introduces a limitation, as improvements in generative models can directly influence SynCLR's performance. |
representation learning, synthetic data, contrastive learning, masked image modeling, text-to-image generation |
2312.17681
Report |
FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis |
Feng Liang, Bichen Wu, Jialiang Wang, Licheng Yu, Kunpeng Li, Yinan Zhao, Ishan Misra, Jia-Bin Huang, Peizhao Zhang, Peter Vajda, Diana Marculescu |
Diffusion models have transformed image-to-image (I2I) synthesis and are
now permeating into videos. However, the advancement of video-to-video (V2V)
synthesis has been hampered by the challenge of maintaining temporal
consistency across video frames. This paper proposes a consistent V2V synthesis
framework by jointly leveraging spatial conditions and temporal optical flow
clues within the source video. Contrary to prior methods that strictly adhere
to optical flow, our approach harnesses its benefits while handling the
imperfection in flow estimation. We encode the optical flow via warping from
the first frame and serve it as a supplementary reference in the diffusion
model. This enables our model to synthesize video by editing the first frame
with any prevalent I2I model and then propagating the edits to successive frames.
Our V2V model, FlowVid, demonstrates remarkable properties: (1) Flexibility:
FlowVid works seamlessly with existing I2I models, facilitating various
modifications, including stylization, object swaps, and local edits. (2)
Efficiency: Generation of a 4-second video with 30 FPS and 512x512 resolution
takes only 1.5 minutes, which is 3.1x, 7.2x, and 10.5x faster than CoDeF,
Rerender, and TokenFlow, respectively. (3) High-quality: In user studies, our
FlowVid is preferred 45.7% of the time, outperforming CoDeF (3.5%), Rerender
(10.2%), and TokenFlow (40.4%). |
This paper presents FlowVid, a novel video-to-video synthesis method that leverages optical flow as a soft constraint in conjunction with spatial conditions to enhance temporal consistency in generated videos. |
Maintaining temporal consistency in video-to-video synthesis is challenging. Existing methods relying solely on spatial-temporal attention or rigidly constrained optical flow often fall short. FlowVid addresses this by jointly using spatial conditions and flexibly incorporating optical flow, leading to more consistent and high-quality video synthesis. |
FlowVid employs a two-stage edit-propagate approach: 1) Edit the first frame using existing image-to-image models. 2) Propagate edits to subsequent frames using a trained video diffusion model conditioned on spatial information (e.g., depth maps) and temporal information from flow-warped frames. |
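The "flow-warped frames" used as temporal conditioning can be illustrated with a small backward-warping sketch. It assumes that the flow stored at each target pixel is the displacement `(dx, dy)` pointing to its source location in the first frame, and it uses nearest-neighbour lookup; both choices, along with the function name, are assumptions for illustration rather than details from the paper. The validity mask marks pixels whose source falls outside the frame, one simple case where a warped reference is unreliable and should only be treated as a soft hint.

```python
def warp_first_frame(first_frame, flow, fill=0):
    """Backward-warp the first frame to a later frame.

    first_frame: 2D list of pixel values (H x W)
    flow:        2D list of (dx, dy) displacements, one per target pixel,
                 pointing from the target pixel to its source in first_frame
    fill:        value written where the source lies outside the frame
    Returns (warped, valid), where valid[y][x] is False for unfilled pixels.
    """
    height, width = len(first_frame), len(first_frame[0])
    warped = [[fill] * width for _ in range(height)]
    valid = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            src_x, src_y = round(x + dx), round(y + dy)
            if 0 <= src_x < width and 0 <= src_y < height:
                warped[y][x] = first_frame[src_y][src_x]
                valid[y][x] = True
    return warped, valid
```

A production pipeline would typically use bilinear sampling and an explicit occlusion test; those are left out to keep the sketch short.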
Outperforms state-of-the-art methods like CoDeF, Rerender, and TokenFlow in user studies, exhibiting superior prompt alignment and overall video quality.
Significantly faster than competing methods, generating a 4-second, 512x512 resolution video at 30 FPS in just 1.5 minutes.
Flexible enough to support various video editing tasks, including stylization, object swaps, and local edits. |
Relies heavily on the structural alignment of the edited first frame with the original.
May struggle with large occlusions caused by rapid camera or object motion. |
video-to-video synthesis, diffusion models, optical flow, temporal consistency, spatial conditions |
2312.17561
Report |
Informative Rays Selection for Few-Shot Neural Radiance Fields |
Marco Orsingher, Anthony Dell'Eva, Paolo Zani, Paolo Medici, Massimo Bertozzi |
Neural Radiance Fields (NeRF) have recently emerged as a powerful method for
image-based 3D reconstruction, but the lengthy per-scene optimization limits
their practical usage, especially in resource-constrained settings. Existing
approaches solve this issue by reducing the number of input views and
regularizing the learned volumetric representation with either complex losses
or additional inputs from other modalities. In this paper, we present KeyNeRF,
a simple yet effective method for training NeRF in few-shot scenarios by
focusing on key informative rays. Such rays are first selected at camera level
by a view selection algorithm that promotes baseline diversity while
guaranteeing scene coverage, then at pixel level by sampling from a probability
distribution based on local image entropy. Our approach performs favorably
against state-of-the-art methods, while requiring minimal changes to existing
NeRF codebases. |
This paper proposes KeyNeRF, a method to improve the efficiency of Neural Radiance Fields (NeRF) in few-shot scenarios by focusing on key informative cameras and pixels during training. |
Standard NeRF training is computationally expensive, especially when the number of input views is limited. Existing few-shot NeRF methods address this issue by introducing complex losses or relying on additional input modalities, which adds complexity and might hinder practicality. |
KeyNeRF employs a two-stage selection process: (1) View selection: A minimal set of cameras covering the entire scene is selected and augmented with additional views based on baseline diversity. (2) Ray sampling: For each selected camera, pixels are sampled from a probability distribution based on local image entropy to prioritize high-frequency details. |
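The pixel-level step, sampling rays in proportion to local image entropy, might look like the sketch below. The window radius, the small `eps` floor, the fixed seed, and sampling with replacement are assumptions made for this example; the paper's exact entropy definition and sampling scheme may differ.

```python
import math
import random
from collections import Counter


def local_entropy(gray, radius=1):
    """Shannon entropy (bits) of intensity values in a window around each pixel."""
    height, width = len(gray), len(gray[0])
    entropy = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                gray[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            total = len(window)
            entropy[y][x] = -sum(
                (c / total) * math.log2(c / total)
                for c in Counter(window).values()
            )
    return entropy


def sample_pixels(gray, num_rays, radius=1, eps=1e-6, seed=0):
    """Draw pixel coordinates with probability proportional to local entropy.

    eps keeps flat regions reachable with a tiny probability; sampling is
    with replacement (an assumption made for simplicity).
    """
    entropy = local_entropy(gray, radius)
    coords = [(y, x) for y in range(len(gray)) for x in range(len(gray[0]))]
    weights = [entropy[y][x] + eps for y, x in coords]
    return random.Random(seed).choices(coords, weights=weights, k=num_rays)
```

On an image whose left half is flat and whose right half is textured, nearly all sampled pixels land in or next to the textured half, which is the intended bias toward high-frequency detail.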
KeyNeRF outperforms state-of-the-art few-shot NeRF methods on both synthetic (Blender) and real-world (CO3D) datasets in terms of rendering quality.
The view selection strategy demonstrates faster and more stable convergence, especially with very few input views.
Entropy-based ray sampling leads to better rendering of fine-grained details and intricate structures compared to uniform sampling. |
The current method assumes an object-centric acquisition trajectory with the object having higher entropy than the background.
Future work will focus on addressing these limitations and extending the approach to other neural reconstruction methods. |
neural radiance fields, novel view synthesis, few-shot learning, 3d reconstruction, view selection |
2312.17505
Report |
Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation |
Tuan-Anh Vu, Duc Thanh Nguyen, Qing Guo, Binh-Son Hua, Nhat Minh Chung, Ivor W. Tsang, Sai-Kit Yeung |
Text-to-image diffusion techniques have shown exceptional capability of
producing high-quality images from text descriptions. This indicates that there
exists a strong correlation between the visual and textual domains. In
addition, text-image discriminative models such as CLIP excel in image
labelling from text prompts, thanks to the rich and diverse information
available from open concepts. In this paper, we leverage these technical
advances to solve a challenging problem in computer vision: camouflaged
instance segmentation. Specifically, we propose a method built upon a
state-of-the-art diffusion model, empowered by open-vocabulary to learn
multi-scale textual-visual features for camouflaged object representations.
Such cross-domain representations are desirable in segmenting camouflaged
objects where visual cues are subtle to distinguish the objects from the
background, especially in segmenting novel objects which are not seen in
training. We also develop technically supportive components to effectively fuse
cross-domain features and engage relevant features towards respective
foreground objects. We validate our method and compare it with existing ones on
several benchmark datasets of camouflaged instance segmentation and generic
open-vocabulary instance segmentation. Experimental results confirm the
advances of our method over existing ones. We will publish our code and
pre-trained models to support future research. |
This paper proposes a novel method for Camouflaged Instance Segmentation (CIS) that leverages text-to-image diffusion and text-image transfer techniques with open-vocabulary. |
CIS is a challenging problem in computer vision due to the subtle visual differences between camouflaged objects and their surroundings. Existing methods often struggle with novel objects unseen during training. This method aims to overcome these challenges by incorporating rich textual information from open-vocabulary. |
The method combines a pre-trained Stable Diffusion model for image feature extraction, a pre-trained CLIP model for text embedding generation, and a mask generator based on Mask2Former. It uses a multi-scale feature fusion module to integrate image features and text embeddings at different scales. A textual-visual aggregation module highlights object-relevant features, while a camouflaged instance normalisation module refines the segmentation masks. |
The proposed method achieves state-of-the-art performance on benchmark camouflaged object datasets (COD10K-v3 and NC4K).
It demonstrates strong generalization ability, effectively segmenting novel object categories.
The method achieves comparable performance to state-of-the-art open-vocabulary instance segmentation methods on generic datasets (ADE20K and Cityscapes) while using significantly fewer parameters. |
The method may face difficulty in separating touching/overlapping instances with highly similar appearances.
Severe object occlusions can hinder accurate segmentation, leading to potential misclassifications. |
camouflaged instance segmentation, open-vocabulary, text-to-image diffusion, text-image transfer, computer vision |
2312.17448
Report |
Tracking with Human-Intent Reasoning |
Jiawen Zhu, Zhi-Qi Cheng, Jun-Yan He, Chenyang Li, Bin Luo, Huchuan Lu, Yifeng Geng, Xuansong Xie |
Advances in perception modeling have significantly improved the performance
of object tracking. However, the current methods for specifying the target
object in the initial frame are either by 1) using a box or mask template, or
by 2) providing an explicit language description. These approaches are cumbersome
and do not allow the tracker to have self-reasoning ability. Therefore, this
work proposes a new tracking task -- Instruction Tracking, which involves
providing implicit tracking instructions that require the trackers to perform
tracking automatically in video frames. To achieve this, we investigate the
integration of knowledge and reasoning capabilities from a Large
Vision-Language Model (LVLM) for object tracking. Specifically, we propose a
tracker called TrackGPT, which is capable of performing complex reasoning-based
tracking. TrackGPT first uses LVLM to understand tracking instructions and
condense the cues of what target to track into referring embeddings. The
perception component then generates the tracking results based on the
embeddings. To evaluate the performance of TrackGPT, we construct an
instruction tracking benchmark called InsTrack, which contains over one
thousand instruction-video pairs for instruction tuning and evaluation.
Experiments show that TrackGPT achieves competitive performance on referring
video object segmentation benchmarks, including a new state-of-the-art
performance of 66.5 $\mathcal{J}\&\mathcal{F}$ on Refer-DAVIS. It also
demonstrates a superior performance of instruction tracking under new
evaluation protocols. The code and models are available at
https://github.com/jiawen-zhu/TrackGPT. |
This paper introduces the task of "instruction tracking", a new paradigm in object tracking where implicit instructions, rather than explicit bounding boxes or language descriptions, guide the tracker to locate and follow a target object in a video. |
Current tracking methods rely on cumbersome and impractical methods (bounding boxes, precise masks, detailed descriptions) to specify the object to be tracked. Instruction tracking aims to make this interaction more natural and intuitive, mimicking how humans would guide a tracker. |
The authors propose TrackGPT, a novel tracker powered by a Large Vision-Language Model (LVLM). TrackGPT leverages the LVLM's reasoning ability to interpret human instructions and translate them into referring cues for object tracking. The system also features a 'rethinking mechanism' to adjust tracking based on how well the results match the instruction's intent, and a 'cross-frame referring propagation' module for temporal consistency. |
TrackGPT achieves state-of-the-art performance on the newly proposed InsTrack benchmark for instruction tracking.
It also demonstrates competitive performance on established referring video object segmentation benchmarks, including a new state-of-the-art result on Refer-DAVIS₁₇.
Ablation studies confirm the effectiveness of the rethinking mechanism, cross-frame referring propagation, and instruction tuning strategies. |
TrackGPT currently doesn't support tracking multiple objects from a single instruction.
Future work could explore multi-object instruction tracking and improve efficiency for real-time applications. |
instruction tracking, object tracking, large vision-language model, referring video object segmentation, reasoning |
2312.17432
Report |
Video Understanding with Large Language Models: A Survey |
Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, Chenliang Xu |
With the burgeoning growth of online video platforms and the escalating
volume of video content, the demand for proficient video understanding tools
has intensified markedly. Given the remarkable capabilities of Large Language
Models (LLMs) in language and multimodal tasks, this survey provides a detailed
overview of the recent advancements in video understanding harnessing the power
of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly
advanced, particularly their ability for open-ended spatial-temporal reasoning
combined with commonsense knowledge, suggesting a promising path for future
video understanding. We examine the unique characteristics and capabilities of
Vid-LLMs, categorizing the approaches into four main types: LLM-based Video
Agents, Vid-LLMs Pretraining, Vid-LLMs Instruction Tuning, and Hybrid Methods.
Furthermore, this survey presents a comprehensive study of the tasks, datasets,
and evaluation methodologies for Vid-LLMs. Additionally, it explores the
expansive applications of Vid-LLMs across various domains, highlighting their
remarkable scalability and versatility in real-world video understanding
challenges. Finally, it summarizes the limitations of existing Vid-LLMs and
outlines directions for future research. For more information, readers are
recommended to visit the repository at
https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding. |
This paper surveys recent advancements in video understanding using Large Language Models (Vid-LLMs), analyzing models, datasets, tasks, and applications. |
The integration of LLMs with video understanding is crucial due to the growing volume of video content and the demand for intelligent analysis tools. |
The paper categorizes Vid-LLM approaches into four types: LLM-based Video Agents, Vid-LLM Pretraining, Vid-LLM Instruction Tuning, and Hybrid Methods. It also examines common tasks, datasets, and evaluation metrics in video understanding. |
Vid-LLMs demonstrate promising capabilities in various video understanding tasks, including recognition, captioning, grounding, and question answering.
The use of LLMs enables more sophisticated multimodal understanding, allowing for the processing of complex interactions between visual, textual, and auditory data.
Existing Vid-LLMs have limitations in fine-grained and long-term video understanding, multi-modal integration, and addressing hallucination issues. |
Current Vid-LLMs face challenges in handling fine-grained and long-term video understanding, multi-modal integration, and human interaction.
Future research should focus on addressing hallucination issues, improving computational efficiency, and developing more robust evaluation metrics. |
video understanding, large language models, multimodal learning, computer vision, artificial intelligence |
2312.17250
Report |
iFusion: Inverting Diffusion for Pose-Free Reconstruction from Sparse Views |
Chin-Hsuan Wu, Yen-Chun Chen, Bolivar Solarte, Lu Yuan, Min Sun |
We present iFusion, a novel 3D object reconstruction framework that requires
only two views with unknown camera poses. While single-view reconstruction
yields visually appealing results, it can deviate significantly from the actual
object, especially on unseen sides. Additional views improve reconstruction
fidelity but necessitate known camera poses. However, assuming the availability
of pose may be unrealistic, and existing pose estimators fail in sparse view
scenarios. To address this, we harness a pre-trained novel view synthesis
diffusion model, which embeds implicit knowledge about the geometry and
appearance of diverse objects. Our strategy unfolds in three steps: (1) We
invert the diffusion model for camera pose estimation instead of synthesizing
novel views. (2) The diffusion model is fine-tuned using provided views and
estimated poses, turned into a novel view synthesizer tailored for the target
object. (3) Leveraging registered views and the fine-tuned diffusion model, we
reconstruct the 3D object. Experiments demonstrate strong performance in both
pose estimation and novel view synthesis. Moreover, iFusion seamlessly
integrates with various reconstruction methods and enhances them. |
Presents iFusion, a novel 3D object reconstruction framework that requires only two views with unknown camera poses, leveraging a pre-trained diffusion model (Zero123) to estimate poses and enhance reconstruction fidelity. |
Existing 3D reconstruction methods either rely on single-view inference leading to ambiguity or require accurate camera poses typically unavailable from sparse views. |
1) Inverts Zero123 to estimate relative camera pose by minimizing differences in denoised latent visual features. 2) Fine-tunes Zero123 with estimated poses and given views for object-specific novel view synthesis. 3) Integrates estimated poses and fine-tuned diffusion model with a differentiable renderer (e.g., NeRFs, Gaussian Splatting) for 3D reconstruction. |
Significantly outperforms state-of-the-art pose estimation methods with only two views.
Demonstrates superior novel view synthesis quality compared to Zero123 and 3D-based methods.
Consistently enhances the performance of existing single-view reconstruction methods, leading to more accurate 3D models. |
Pose estimation is slower than feed-forward methods due to optimization-based approach.
Lacks complete 3D consistency due to limitations of Zero123's 2D-based architecture. |
3d reconstruction, pose estimation, novel view synthesis, diffusion models, sparse view |
2312.17243
Report |
Unsupervised Universal Image Segmentation |
Dantong Niu, Xudong Wang, Xinyang Han, Long Lian, Roei Herzig, Trevor Darrell |
Several unsupervised image segmentation approaches have been proposed which
eliminate the need for dense manually-annotated segmentation masks; current
models separately handle either semantic segmentation (e.g., STEGO) or
class-agnostic instance segmentation (e.g., CutLER), but not both (i.e.,
panoptic segmentation). We propose an Unsupervised Universal Segmentation model
(U2Seg) adept at performing various image segmentation tasks -- instance,
semantic and panoptic -- using a novel unified framework. U2Seg generates
pseudo semantic labels for these segmentation tasks via leveraging
self-supervised models followed by clustering; each cluster represents
different semantic and/or instance membership of pixels. We then self-train the
model on these pseudo semantic labels, yielding substantial performance gains
over specialized methods tailored to each task: a +2.6 AP$^{\text{box}}$ boost
vs. CutLER in unsupervised instance segmentation on COCO and a +7.0 PixelAcc
increase (vs. STEGO) in unsupervised semantic segmentation on COCOStuff.
Moreover, our method sets up a new baseline for unsupervised panoptic
segmentation, which has not been previously explored. U2Seg is also a strong
pretrained model for few-shot segmentation, surpassing CutLER by +5.0
AP$^{\text{mask}}$ when trained on a low-data regime, e.g., only 1% COCO
labels. We hope our simple yet effective method can inspire more research on
unsupervised universal image segmentation. |
This paper introduces U2Seg, a novel unified framework for Unsupervised Universal image Segmentation capable of performing instance, semantic, and panoptic segmentation without human annotations. |
Existing unsupervised image segmentation methods only address either semantic or class-agnostic instance segmentation, but not both, limiting comprehensive scene understanding. |
U2Seg leverages self-supervised models and clustering to generate pseudo semantic labels for both instance and semantic segmentation. Then, it trains a universal segmentation model on these pseudo labels to perform various segmentation tasks. |
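The clustering step that turns self-supervised features into pseudo semantic labels can be sketched with a tiny k-means. The summary above does not name the clustering algorithm or the feature extractor, so k-means, the initialisation from the first `k` features, and the toy 2-D features are all stand-in assumptions for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pseudo_labels(features, k, iters=20):
    """Assign each feature vector (e.g. one per discovered mask) a cluster id.

    The cluster id plays the role of a pseudo semantic label.
    Centroids are initialised from the first k features (an assumption
    that keeps the sketch deterministic).
    """
    centroids = [list(f) for f in features[:k]]
    labels = [0] * len(features)
    for _ in range(iters):
        # Assignment step: nearest centroid for every feature.
        labels = [
            min(range(k), key=lambda c: squared_distance(f, centroids[c]))
            for f in features
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [f for f, lab in zip(features, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

The resulting labels would then supervise a segmentation model in the self-training stage; that stage is not shown here.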
U2Seg outperforms previous state-of-the-art methods specialized for individual tasks, achieving +2.6 AP^box improvement in instance segmentation on COCO and +7.0 PixelAcc increase in semantic segmentation on COCOStuff.
The method establishes a new baseline for unsupervised panoptic segmentation, previously unexplored.
U2Seg shows superior performance as a pretrained model for few-shot segmentation, outperforming CutLER by +5.0 AP^mask when trained on 1% COCO labels. |
The universal model shows slightly lower performance compared to task-specific models.
Future work focuses on improving model versatility to handle multiple tasks effectively with single training. |
unsupervised learning, image segmentation, instance segmentation, semantic segmentation, panoptic segmentation |
2312.17241
Report |
Compact Neural Graphics Primitives with Learned Hash Probing |
Towaki Takikawa, Thomas Müller, Merlin Nimier-David, Alex Evans, Sanja Fidler, Alec Jacobson, Alexander Keller |
Neural graphics primitives are faster and achieve higher quality when their
neural networks are augmented by spatial data structures that hold trainable
features arranged in a grid. However, existing feature grids either come with a
large memory footprint (dense or factorized grids, trees, and hash tables) or
slow performance (index learning and vector quantization). In this paper, we
show that a hash table with learned probes has neither disadvantage, resulting
in a favorable combination of size and speed. Inference is faster than unprobed
hash tables at equal quality while training is only 1.2-2.6x slower,
significantly outperforming prior index learning approaches. We arrive at this
formulation by casting all feature grids into a common framework: they each
correspond to a lookup function that indexes into a table of feature vectors.
In this framework, the lookup functions of existing data structures can be
combined by simple arithmetic combinations of their indices, resulting in
Pareto optimal compression and speed. |
This paper introduces a novel compression technique for neural graphics primitives, termed "compact neural graphics primitives," which leverages learned hash probing to achieve a favorable balance between compactness and speed. |
Existing spatial data structures used to enhance neural graphics primitives often compromise either memory efficiency or performance. This work aims to address this limitation by developing a technique that combines the strengths of hash tables and index learning. |
The authors propose a learned hash probing scheme where a spatial hash function determines the most significant bits of an index, while the remaining bits are learned via an auxiliary index codebook. This strategy enables efficient collision resolution and feature reuse. The method is trained using a straight-through estimator to handle the non-differentiable nature of the indexing process. |
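The index construction described above, with a spatial hash supplying the most significant bits and a learned codebook supplying the rest, can be sketched as follows. This is a lookup-only illustration under stated assumptions: the hash constants, the use of a second hash to address the index codebook, and all function names are choices made for this sketch, and training the codebook with a straight-through estimator is not shown.

```python
# Multipliers for the two spatial hashes; the specific constants are
# illustrative assumptions, not values taken from the paper.
PRIMES_BUCKET = (1, 2654435761, 805459861)
PRIMES_CODEBOOK = (73856093, 19349663, 83492791)


def spatial_hash(coords, table_size, primes):
    """XOR of per-axis products, reduced modulo the table size.

    coords are assumed to be non-negative integer grid coordinates.
    """
    h = 0
    for c, p in zip(coords, primes):
        h ^= c * p
    return h % table_size


def probed_index(coords, index_codebook, num_buckets, probe_range):
    """Combine a hashed bucket (high bits) with a learned probe (low bits)."""
    bucket = spatial_hash(coords, num_buckets, PRIMES_BUCKET)
    slot = spatial_hash(coords, len(index_codebook), PRIMES_CODEBOOK)
    probe = index_codebook[slot]  # learned value in [0, probe_range)
    return bucket * probe_range + probe


def lookup_feature(coords, feature_table, index_codebook, probe_range):
    """Fetch the feature vector that the probed index points to."""
    num_buckets = len(feature_table) // probe_range
    idx = probed_index(coords, index_codebook, num_buckets, probe_range)
    return feature_table[idx]
```

Because the probe only selects among `probe_range` neighbouring slots inside one bucket, lookups stay local in memory, which is consistent with the cache-utilization benefit reported in the results.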
Compact neural graphics primitives demonstrate faster inference than unprobed hash tables (Instant NGP) at comparable quality levels due to improved cache utilization.
The technique achieves competitive compression rates compared to state-of-the-art methods, including JPEG for images and masked wavelet representations for NeRFs, while maintaining random access capability and differentiability.
Experiments reveal that a small probing range is sufficient for effective compression, and the training overhead associated with learned probing is manageable (1.26x to 2.61x slower than Instant NGP). |
The approach currently relies on a straight-through estimator for training, which might not be optimal. Exploring alternative techniques, such as sparse or stochastic variants, could be beneficial.
While spatial hashing provides application agnosticism, utilizing data structures that better exploit spatial locality might enhance compression efficiency further. |
neural graphics primitives, compression, hash tables, index learning, deep learning |
2312.17240
Report |
LISA++: An Improved Baseline for Reasoning Segmentation with Large Language Model |
Senqiao Yang, Tianyuan Qu, Xin Lai, Zhuotao Tian, Bohao Peng, Shu Liu, Jiaya Jia |
While LISA effectively bridges the gap between segmentation and large
language models to enable reasoning segmentation, it has certain limitations:
it cannot distinguish different instances of the target region, and it is
constrained by pre-defined textual response formats. In this work, we introduce LISA++,
an update to the existing LISA model, focusing on improving core
functionalities while keeping the base architecture intact. The main
enhancements in LISA++ include: 1) Enhanced Segmentation: The instance
segmentation ability has been added, providing a more detailed scene analysis
along with the existing multi-region semantic segmentation. 2) More
Natural Conversation: Improved capability for multi-turn dialogue, with the
ability to incorporate segmentation results directly into text responses, i.e.,
Segmentation in Dialogue (SiD). These improvements are achieved by curating the
existing samples of generic segmentation datasets, aimed specifically at
enhancing the segmentation and conversational skills without structural change
and additional data sources. Comparative analysis with the original LISA model
shows significant advancements in these areas, positioning LISA++ as a notable
upgrade in visual understanding and interaction. LISA++'s adaptability and
improved features highlight the versatility of the mask-as-embedding paradigm
proposed by LISA, and the potential as a foundational model for diverse
applications. |
Introduces LISA++, an enhanced version of LISA for visual understanding and interaction, focusing on improving instance segmentation and natural conversation capabilities. |
Addresses limitations of existing multimodal models in providing detailed positional information and engaging in natural dialogue with segmentation results. |
Leverages existing segmentation datasets to reconstruct instruction-tuning data, enabling instance segmentation and Segmentation in Dialogue (SiD) without architectural changes. Extends the ReasonSeg benchmark to evaluate instance segmentation. |
LISA++ demonstrates significant improvements in instance segmentation compared to the original LISA.
LISA++ maintains comparable performance to LISA in semantic segmentation, indicating the generalizability of the framework.
LISA++ exhibits the ability to integrate segmentation results naturally within dialogue, enhancing its conversational capabilities. |
The performance on low-resolution images and small objects requires further investigation.
Future work includes extending LISA++ to more complex scenarios, such as video understanding and 3D scene analysis. |
instance segmentation, visual reasoning, multimodal learning, large language models, segmentation in dialogue |
2312.17232
Report |
Segment3D: Learning Fine-Grained Class-Agnostic 3D Segmentation without Manual Labels |
Rui Huang, Songyou Peng, Ayca Takmaz, Federico Tombari, Marc Pollefeys, Shiji Song, Gao Huang, Francis Engelmann |
Current 3D scene segmentation methods are heavily dependent on manually
annotated 3D training datasets. Such manual annotations are labor-intensive,
and often lack fine-grained details. Importantly, models trained on this data
typically struggle to recognize object classes beyond the annotated classes,
i.e., they do not generalize well to unseen domains and require additional
domain-specific annotations. In contrast, 2D foundation models demonstrate
strong generalization and impressive zero-shot abilities, inspiring us to
incorporate these characteristics from 2D models into 3D models. Therefore, we
explore the use of image segmentation foundation models to automatically
generate training labels for 3D segmentation. We propose Segment3D, a method
for class-agnostic 3D scene segmentation that produces high-quality 3D
segmentation masks. It improves over existing 3D segmentation models
(especially on fine-grained masks), and enables easily adding new training data
to further boost the segmentation performance -- all without the need for
manual training labels. |
Introduces Segment3D, a novel method for fine-grained, class-agnostic 3D point cloud segmentation that doesn't require manually annotated labels. |
Existing 3D segmentation methods depend heavily on manual labels which are costly and time-consuming to acquire, and often lack fine-grained details, limiting their generalizability. |
Leverages pre-trained 2D foundation models (SAM) for automatic mask generation. Employs a two-stage training approach: 1) Pre-training on partial RGB-D point clouds supervised by projected SAM masks, 2) Self-supervised fine-tuning on full 3D point clouds using high-confidence predictions from the pre-trained model. |
Segment3D achieves state-of-the-art performance on ScanNet++, surpassing existing methods, including fully supervised ones.
It exhibits superior performance in segmenting small objects and fine-grained details compared to methods trained on manual labels.
Demonstrates strong generalization ability, effectively segmenting unseen objects in both indoor and outdoor scenarios. |
Limited exploration of incorporating even larger and more diverse datasets during pre-training.
Further investigation into the impact of the number of queries on the model's performance is needed. |
3d point cloud segmentation, class-agnostic segmentation, foundation models, unsupervised learning, open-vocabulary scene understanding |
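The second Segment3D training stage summarized above (self-supervised fine-tuning on high-confidence predictions) can be illustrated with a small filtering sketch. The `Mask` structure, the function name, and the 0.5 threshold are hypothetical and are not taken from the paper; the sketch only shows the idea of keeping confident predictions as pseudo labels.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Mask:
    point_ids: List[int]  # indices of the 3D points covered by this mask
    confidence: float     # model confidence in [0, 1]


def select_pseudo_labels(masks: List[Mask], threshold: float = 0.5) -> List[Mask]:
    """Keep only masks whose confidence is at least `threshold` (assumed value)."""
    return [m for m in masks if m.confidence >= threshold]


# Predictions from a (hypothetical) pre-trained model on a full 3D scene.
predictions = [Mask([0, 1, 2], 0.9), Mask([3, 4], 0.3), Mask([5], 0.7)]
pseudo_labels = select_pseudo_labels(predictions, threshold=0.5)
```

The retained masks would then serve as training targets for the fine-tuning stage.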
2312.17225
Report |
4DGen: Grounded 4D Content Generation with Spatial-temporal Consistency |
Yuyang Yin, Dejia Xu, Zhangyang Wang, Yao Zhao, Yunchao Wei |
Aided by text-to-image and text-to-video diffusion models, existing 4D
content creation pipelines utilize score distillation sampling to optimize the
entire dynamic 3D scene. However, as these pipelines generate 4D content from
text or image inputs, they incur significant time and effort in prompt
engineering through trial and error. This work introduces 4DGen, a novel,
holistic framework for grounded 4D content creation that decomposes the 4D
generation task into multiple stages. We identify static 3D assets and
monocular video sequences as key components in constructing the 4D content. Our
pipeline facilitates conditional 4D generation, enabling users to specify
geometry (3D assets) and motion (monocular videos), thus offering superior
control over content creation. Furthermore, we construct our 4D representation
using dynamic 3D Gaussians, which permits efficient, high-resolution
supervision through rendering during training, thereby facilitating
high-quality 4D generation. Additionally, we employ spatial-temporal pseudo
labels on anchor frames, along with seamless consistency priors implemented
through 3D-aware score distillation sampling and smoothness regularizations.
Compared to existing baselines, our approach yields competitive results in
faithfully reconstructing input signals and realistically inferring renderings
from novel viewpoints and timesteps. Most importantly, our method supports
grounded generation, offering users enhanced control, a feature difficult to
achieve with previous methods. Project page:
https://vita-group.github.io/4DGen/ |
Introduces 4DGen, a novel pipeline for grounded 4D content creation allowing control over motion and appearance using monocular video as input. |
Addresses limitations of previous 4D generation methods like restricted motion capabilities, reliance on unreliable prompt engineering, and low-resolution outputs. |
Leverages deformable 3D Gaussians for 4D representation, employs spatial-temporal pseudo labels from a multi-view diffusion model, and ensures consistency via 3D-aware score distillation sampling and smoothness regularization. |
Outperforms baselines in video-to-4D tasks, demonstrating superior spatial and temporal consistency.
Enables faithful reconstruction of input signals and plausible novel view synthesis at arbitrary timesteps.
Supports image-to-4D and text-to-4D generation via integration with video diffusion models. |
Limited to single object generation due to the object-centric nature of the pre-trained diffusion prior.
Future work will focus on extending the framework to handle multi-object and scene-level generation. |
4d content creation, grounded generation, 3d gaussian splatting, score distillation sampling, diffusion models |
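The 4DGen summary mentions smoothness regularization without giving its form. As a minimal sketch, assuming a generic temporal penalty on the trajectory of a single 3D point (the paper's actual regularizer may differ), one could write:

```python
from typing import Sequence


def temporal_smoothness(trajectory: Sequence[Sequence[float]]) -> float:
    """Mean squared displacement between consecutive timesteps of one 3D point."""
    if len(trajectory) < 2:
        return 0.0
    total = 0.0
    for prev, cur in zip(trajectory, trajectory[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, cur))
    return total / (len(trajectory) - 1)


# A static point incurs no penalty; a point that jumps back and forth does.
static = [(0.0, 0.0, 0.0)] * 4
jittery = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
```

Adding such a term to the training loss discourages abrupt frame-to-frame motion.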
2312.17161
Report |
Restoration by Generation with Constrained Priors |
Zheng Ding, Xuaner Zhang, Zhuowen Tu, Zhihao Xia |
The inherent generative power of denoising diffusion models makes them
well-suited for image restoration tasks where the objective is to find the
optimal high-quality image within the generative space that closely resembles
the input image. We propose a method to adapt a pretrained diffusion model for
image restoration by simply adding noise to the input image to be restored and
then denoising it. Our method is based on the observation that the space of a
generative model needs to be constrained. We impose this constraint by
finetuning the generative model with a set of anchor images that capture the
characteristics of the input image. With the constrained space, we can then
leverage the sampling strategy used for generation to do image restoration. We
evaluate against previous methods and show superior performance on multiple
real-world restoration datasets in preserving identity and image quality. We
also demonstrate an important and practical application on personalized
restoration, where we use a personal album as the anchor images to constrain
the generative space. This approach allows us to produce results that
accurately preserve high-frequency details, which previous works are unable to
do. Project webpage: https://gen2res.github.io. |
This paper proposes a novel image restoration method that leverages the generative power of pre-trained diffusion models by adding noise to a degraded input image and then denoising it using the diffusion model, with the generative space constrained by a set of anchor images. |
This method addresses limitations of existing supervised restoration methods that rely on paired training data and struggle to generalize to real-world degradations. |
The method constrains the diffusion model's generative space by fine-tuning it with either a 'generative album' (generated from the input image with skip guidance) for single-image restoration or a 'personal album' (provided set of clean images of the same subject) for personalized restoration. |
The method achieves state-of-the-art results on standard blind face restoration benchmarks, outperforming supervised methods in FID and MUSIQ.
It exhibits strong generalization to real-world degradations like motion blur, even without explicit training on such data.
In personalized restoration, the method effectively leverages the personal album to preserve identity and recover high-frequency details, surpassing both single-image and exemplar-based approaches. |
Single-image restoration requires per-image fine-tuning, which is computationally expensive.
The method's effectiveness on general image restoration is yet to be explored, relying on the availability of high-quality pre-trained diffusion models for diverse image domains. |
image restoration, diffusion models, generative models, blind image restoration, personalized image restoration |
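The "add noise, then denoise" step in the restoration record above can be sketched for the noising half. This assumes the standard DDPM forward process, which the summary does not spell out; `add_noise` and the `alpha_bar_t` values are illustrative, and the denoising half would require a (fine-tuned) diffusion model that is not shown.

```python
import math
import random
from typing import List


def add_noise(x0: List[float], alpha_bar_t: float, rng: random.Random) -> List[float]:
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, 1)."""
    if not 0.0 <= alpha_bar_t <= 1.0:
        raise ValueError("alpha_bar_t must lie in [0, 1]")
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]


rng = random.Random(0)
degraded = [0.2, -0.5, 0.9]                             # toy stand-in for image pixels
unchanged = add_noise(degraded, alpha_bar_t=1.0, rng=rng)  # no noise injected
noised = add_noise(degraded, alpha_bar_t=0.5, rng=rng)     # partially noised
```

A smaller `alpha_bar_t` discards more of the degraded input, leaving more for the constrained generative prior to fill in.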
2312.17142
Report |
DreamGaussian4D: Generative 4D Gaussian Splatting |
Jiawei Ren, Liang Pan, Jiaxiang Tang, Chi Zhang, Ang Cao, Gang Zeng, Ziwei Liu |
Remarkable progress has been made in 4D content generation recently. However,
existing methods suffer from long optimization time, lack of motion
controllability, and a low level of detail. In this paper, we introduce
DreamGaussian4D, an efficient 4D generation framework that builds on 4D
Gaussian Splatting representation. Our key insight is that the explicit
modeling of spatial transformations in Gaussian Splatting makes it more
suitable for the 4D generation setting compared with implicit representations.
DreamGaussian4D reduces the optimization time from several hours to just a few
minutes, allows flexible control of the generated 3D motion, and produces
animated meshes that can be efficiently rendered in 3D engines. |
DreamGaussian4D is an efficient 4D generation framework based on 4D Gaussian Splatting, which reduces optimization time from hours to minutes while enabling controllable 3D motion. |
Existing 4D content generation methods suffer from long optimization times, lack of motion controllability, and low detail. |
The framework leverages a static 3D Gaussian Splatting model, optimized using an enhanced DreamGaussianHD method. A deformation network learns motion from a driving video, allowing for controllable dynamics. An optional video-to-video pipeline refines textures on exported animated meshes. |
Significantly faster optimization compared to existing methods (minutes instead of hours).
Controllable 3D motion generation by leveraging driving videos.
High-quality animated meshes with refined textures, suitable for real-world applications. |
Limited diversity in the generated shapes due to single-image input.
Reliance on external image-to-video models for driving video generation. |
4d content generation, gaussian splatting, motion control, image-to-4d, texture refinement |
2312.16886
Report |
MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices |
Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, Chunhua Shen |
We present MobileVLM, a competent multimodal vision language model (MMVLM)
targeted to run on mobile devices. It is an amalgamation of a myriad of
architectural designs and techniques that are mobile-oriented, comprising
a set of language models at the scale of 1.4B and 2.7B parameters, trained from
scratch, a multimodal vision model that is pre-trained in the CLIP fashion,
and cross-modality interaction via an efficient projector. We evaluate MobileVLM on
several typical VLM benchmarks. Our models demonstrate on-par performance
compared with a few much larger models. More importantly, we measure the
inference speed on both a Qualcomm Snapdragon 888 CPU and an NVIDIA Jetson Orin
GPU, and we obtain state-of-the-art performance of 21.5 tokens and 65.3 tokens
per second, respectively. Our code will be made available at:
https://github.com/Meituan-AutoML/MobileVLM. |
The paper introduces MobileVLM, a multimodal vision language model designed for efficient execution on mobile and IoT devices. |
Large multimodal models (LMMs) are resource-intensive and challenging to deploy on edge devices. MobileVLM addresses this by offering comparable performance to larger models while being optimized for mobile platforms. |
The model consists of a pre-trained CLIP ViT-L/14 visual encoder, a lightweight downsample projector (LDP) for aligning visual and textual features, and a tailored LLM (MobileLLaMA) based on a downscaled LLaMA architecture. It is trained using a two-stage approach involving pretraining and instruction tuning. |
MobileVLM achieves comparable performance to larger VLMs on benchmarks like GQA, POPE, and MMBench.
The efficient projector design in MobileVLM reduces visual tokens by 75% without compromising performance.
MobileVLM exhibits superior inference speed on Snapdragon 888 CPU and Jetson Orin GPU compared to similar-sized models, achieving up to 21.5 tokens/s and 65.3 tokens/s respectively. |
The performance of MobileVLM on tasks like ScienceQA and MME, which require extensive training data, shows room for improvement.
Future work includes exploring neural architecture search for optimizing the LLM component. |
vision language model, mobile deployment, efficient projector, multimodal learning, edge ai |
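The 75% visual-token reduction reported for MobileVLM's projector is what a stride-2 spatial downsampling of a square token grid produces. The sketch below uses plain 2x2 average pooling as a stand-in for the learned projector, and the 24x24 grid size is an assumption made for illustration; neither detail comes from the record above.

```python
from typing import List

Grid = List[List[List[float]]]  # [rows][cols][channels]


def downsample_2x2(grid: Grid) -> Grid:
    """Average each non-overlapping 2x2 block of tokens into a single token."""
    rows, cols = len(grid), len(grid[0])
    if rows % 2 or cols % 2:
        raise ValueError("grid height and width must be even")
    channels = len(grid[0][0])
    out: Grid = []
    for r in range(0, rows, 2):
        out_row = []
        for c in range(0, cols, 2):
            block = [grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1]]
            out_row.append([sum(tok[ch] for tok in block) / 4.0 for ch in range(channels)])
        out.append(out_row)
    return out


side = 24  # assumed token-grid side length
tokens = [[[float(r * side + c)] for c in range(side)] for r in range(side)]
pooled = downsample_2x2(tokens)
tokens_before = side * side
tokens_after = len(pooled) * len(pooled[0])
```

Halving both spatial dimensions leaves one quarter of the tokens, which matches the stated 75% reduction.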
2312.16862
Report |
TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones |
Zhengqing Yuan, Zhaoxu Li, Weiran Huang, Yanfang Ye, Lichao Sun |
In recent years, multimodal large language models (MLLMs) such as GPT-4V have
demonstrated remarkable advancements, excelling in a variety of vision-language
tasks. Despite their prowess, the closed-source nature and computational
demands of such models limit their accessibility and applicability. This study
introduces TinyGPT-V, a novel open-source MLLM, designed for efficient training
and inference across various vision-language tasks, including image captioning
(IC) and visual question answering (VQA). Leveraging a compact yet powerful
architecture, TinyGPT-V integrates the Phi-2 language model with pre-trained
vision encoders, utilizing a unique mapping module for visual and linguistic
information fusion. With a training regimen optimized for small backbones and
employing a diverse dataset amalgam, TinyGPT-V requires significantly lower
computational resources (24GB for training and as little as 8GB for inference)
without compromising on performance. Our experiments demonstrate that
TinyGPT-V, with its 2.8-billion-parameter language model, achieves comparable
results in VQA and image inference tasks to its larger counterparts while being
uniquely suited for deployment on resource-constrained devices through
innovative quantization techniques. This work not only paves the way for more
accessible and efficient MLLMs but also underscores the potential of smaller,
optimized models in bridging the gap between high performance and computational
efficiency in real-world applications. Additionally, this paper introduces a
new approach to multimodal large language models using smaller backbones. Our
code and training weights are available at
https://github.com/DLYuanGod/TinyGPT-V. |
This paper introduces TinyGPT-V, an open-source multimodal large language model (MLLM) designed for efficient training and inference across various vision-language tasks, despite having a significantly smaller size compared to existing models. |
Existing MLLMs, while powerful, often require substantial computational resources, limiting their accessibility and applicability. TinyGPT-V addresses this by achieving comparable performance with lower computational demands, making it suitable for resource-constrained devices. |
TinyGPT-V integrates the Phi-2 language model with pre-trained vision encoders, utilizing a unique mapping module for visual and linguistic information fusion. It's trained using a novel methodology optimized for small backbones and a diverse dataset. |
TinyGPT-V achieves competitive performance on various benchmarks like VQA and referring expression comprehension, comparable to models with much larger parameter sizes.
It requires only 24GB GPU memory for training and can be deployed on devices with as little as 8GB memory.
TinyGPT-V demonstrates superior efficiency in inference speed and memory occupancy compared to models like LLaVA and MiniGPT-4. |
TinyGPT-V's performance on certain benchmarks, while strong, still lags behind the absolute top performers, suggesting room for improvement.
The study primarily focuses on a limited set of vision-language tasks, leaving its capabilities in other areas unexplored. |
multimodal large language models, vision-language tasks, efficient training, resource-constrained devices, open-source |
2312.16837
Report |
DiffusionGAN3D: Boosting Text-guided 3D Generation and Domain Adaptation by Combining 3D GANs and Diffusion Priors |
Biwen Lei, Kai Yu, Mengyang Feng, Miaomiao Cui, Xuansong Xie |
Text-guided domain adaptation and generation of 3D-aware portraits find many
applications in various fields. However, due to the lack of training data and
the challenges in handling the high variety of geometry and appearance, the
existing methods for these tasks suffer from issues like inflexibility,
instability, and low fidelity. In this paper, we propose a novel framework
DiffusionGAN3D, which boosts text-guided 3D domain adaptation and generation by
combining 3D GANs and diffusion priors. Specifically, we integrate the
pre-trained 3D generative models (e.g., EG3D) and text-to-image diffusion
models. The former provides a strong foundation for stable and high-quality
avatar generation from text. And the diffusion models in turn offer powerful
priors and guide the 3D generator finetuning with informative direction to
achieve flexible and efficient text-guided domain adaptation. To enhance the
diversity in domain adaptation and the generation capability in text-to-avatar,
we introduce the relative distance loss and case-specific learnable triplane
respectively. Besides, we design a progressive texture refinement module to
improve the texture quality for both tasks above. Extensive experiments
demonstrate that the proposed framework achieves excellent results in both
domain adaptation and text-to-avatar tasks, outperforming existing methods in
terms of generation quality and efficiency. The project homepage is at
https://younglbw.github.io/DiffusionGAN3D-homepage/. |
This paper presents DiffusionGAN3D, a novel framework that boosts the performance of text-guided 3D generation and domain adaptation by combining 3D GANs and diffusion priors. |
This approach addresses the limitations of existing text-to-3D methods, which often suffer from low-quality results, instability, and poor texture details. |
DiffusionGAN3D employs a Score Distillation Sampling (SDS) strategy to guide the generation process of a 3D GAN, along with a progressive texture refinement mechanism to further enhance the quality of the generated 3D assets. |
DiffusionGAN3D demonstrates superior performance in both domain adaptation and text-to-avatar tasks, producing high-fidelity results with fine-grained textures and diverse geometry.
The framework exhibits strong generalization capabilities across various domains, including human heads, animals, and stylized avatars.
DiffusionGAN3D also enables 3D-aware local editing on both synthetic and real images while preserving details and identity. |
The performance of DiffusionGAN3D is reliant on the quality and capabilities of the base 3D generator used.
The method currently struggles with local editing tasks that involve significant deformation. |
text-to-3d, 3d generation, domain adaptation, diffusion models, generative adversarial networks (gans) |
2312.16812
Report |
Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis |
Zhan Li, Zhang Chen, Zhong Li, Yi Xu |
Novel view synthesis of dynamic scenes has been an intriguing yet challenging
problem. Despite recent advancements, simultaneously achieving high-resolution
photorealistic results, real-time rendering, and compact storage remains a
formidable task. To address these challenges, we propose Spacetime Gaussian
Feature Splatting as a novel dynamic scene representation, composed of three
pivotal components. First, we formulate expressive Spacetime Gaussians by
enhancing 3D Gaussians with temporal opacity and parametric motion/rotation.
This enables Spacetime Gaussians to capture static, dynamic, as well as
transient content within a scene. Second, we introduce splatted feature
rendering, which replaces spherical harmonics with neural features. These
features facilitate the modeling of view- and time-dependent appearance while
maintaining small size. Third, we leverage the guidance of training error and
coarse depth to sample new Gaussians in areas that are challenging to converge
with existing pipelines. Experiments on several established real-world datasets
demonstrate that our method achieves state-of-the-art rendering quality and
speed, while retaining compact storage. At 8K resolution, our lite-version
model can render at 60 FPS on an Nvidia RTX 4090 GPU. Our code is available at
https://github.com/oppo-us-research/SpacetimeGaussians. |
This paper introduces Spacetime Gaussian Feature Splatting, a novel dynamic scene representation for real-time, high-resolution dynamic view synthesis with compact model size. |
Existing methods struggle to simultaneously achieve high-resolution photorealistic results, real-time rendering, and compact storage for dynamic novel view synthesis. |
The method extends 3D Gaussians to 4D spacetime using temporal opacity and polynomial motion/rotation parameters. It replaces spherical harmonics with splatted neural features and a lightweight MLP for efficient view- and time-dependent radiance encoding. Guided sampling of Gaussians based on training error and coarse depth improves rendering quality in sparsely covered areas. |
Achieves state-of-the-art rendering quality and speed while maintaining compact model size on multiple datasets.
Outperforms baselines on PSNR, DSSIM, and LPIPS metrics.
Enables 8K resolution rendering at 60 FPS with the lite-version model on an NVIDIA RTX 4090 GPU. |
The method currently lacks on-the-fly training capability.
It currently focuses on multi-view video inputs; adapting to monocular settings is left for future work. |
dynamic view synthesis, neural rendering, spacetime gaussian, feature splatting, guided sampling |
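The Spacetime Gaussian record above describes temporal opacity and polynomial motion parameters. A minimal sketch of both follows, assuming a Gaussian-shaped temporal falloff around a per-Gaussian center time and a polynomial in the time offset for position. The function names, parameter names, and the exact functional forms are assumptions of this sketch, not the paper's confirmed formulation.

```python
import math
from typing import Sequence, Tuple


def temporal_opacity(t: float, base_opacity: float, center: float, scale: float) -> float:
    """Opacity that peaks at t == center and decays as |t - center| grows."""
    return base_opacity * math.exp(-scale * (t - center) ** 2)


def polynomial_position(
    t: float, center: float, coeffs: Sequence[Sequence[float]]
) -> Tuple[float, float, float]:
    """Position at time t; coeffs[k] is the 3D coefficient of (t - center) ** k."""
    dt = t - center
    x, y, z = (
        sum(c[axis] * dt ** k for k, c in enumerate(coeffs)) for axis in range(3)
    )
    return (x, y, z)


# A Gaussian centered at t = 0.5 that drifts along x and accelerates along y.
coeffs = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
peak = temporal_opacity(0.5, base_opacity=0.8, center=0.5, scale=4.0)
later = temporal_opacity(1.5, base_opacity=0.8, center=0.5, scale=4.0)
pos = polynomial_position(1.5, center=0.5, coeffs=coeffs)
```

A large `scale` yields a transient Gaussian that is visible only briefly, while a `scale` near zero behaves like static content.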
2312.16794
Report |
ZONE: Zero-Shot Instruction-Guided Local Editing |
Shanglin Li, Bohan Zeng, Yutang Feng, Sicheng Gao, Xuhui Liu, Jiaming Liu, Li Lin, Xu Tang, Yao Hu, Jianzhuang Liu, Baochang Zhang |
Recent advances in vision-language models like Stable Diffusion have shown
remarkable power in creative image synthesis and editing. However, most existing
text-to-image editing methods encounter two obstacles: First, the text prompt
needs to be carefully crafted to achieve good results, which is not intuitive
or user-friendly. Second, they are insensitive to local edits and can
irreversibly affect non-edited regions, leaving obvious editing traces. To
tackle these problems, we propose a Zero-shot instructiON-guided local image
Editing approach, termed ZONE. We first convert the editing intent from the
user-provided instruction (e.g., "make his tie blue") into specific image
editing regions through InstructPix2Pix. We then propose a Region-IoU scheme
for precise image layer extraction from an off-the-shelf segment model. We
further develop an edge smoother based on FFT for seamless blending between the
layer and the image. Our method allows for arbitrary manipulation of a specific
region with a single instruction while preserving the rest. Extensive
experiments demonstrate that our ZONE achieves remarkable local editing results
and user-friendliness, outperforming state-of-the-art methods. Code is
available at https://github.com/lsl001006/ZONE. |
ZONE is a zero-shot, instruction-guided approach for local image editing that enables users to modify specific regions of real or synthetic images using simple instructions while preserving non-edited areas. |
Existing text-to-image editing methods often require complex prompt engineering or struggle to confine edits locally, leading to undesired alterations in non-targeted image regions. ZONE addresses these limitations by enabling intuitive, localized edits with user-friendly instructions. |
ZONE leverages a pre-trained InstructPix2Pix model to identify and edit regions based on user instructions. It introduces a Region-IoU scheme for precise mask refinement using SAM and employs an FFT-based edge smoother for seamless blending of edited layers with the original image. |
ZONE achieves superior performance in local image editing compared to state-of-the-art methods, as demonstrated by quantitative metrics like L1, L2, LPIPS, CLIP-I, and CLIP-T.
It effectively preserves non-edited regions, avoiding distortions commonly found in other instruction-guided methods.
Human evaluations confirm ZONE's effectiveness, with users showing a strong preference for its editing results and a higher success rate in achieving desired edits. |
The editing capabilities of ZONE are limited by the capacity of the underlying instruction-guided diffusion models, which may not always perform optimally.
Localization can be challenging in complex scenes with multiple similar objects or very small objects, requiring further research to improve. |
image editing, local editing, instruction-guided editing, diffusion models, zero-shot learning |
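The mask-selection idea behind ZONE's Region-IoU scheme can be approximated with ordinary intersection-over-union between a coarse edit region and candidate segmentation masks. The sketch below uses plain IoU as a stand-in; the paper's exact Region-IoU definition may differ, and the pixel sets and function names are hypothetical.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]


def iou(a: Set[Pixel], b: Set[Pixel]) -> float:
    """Intersection-over-union of two pixel sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pick_mask(edit_region: Set[Pixel], candidates: List[Set[Pixel]]) -> int:
    """Return the index of the candidate mask with the highest IoU against the edit region."""
    if not candidates:
        raise ValueError("no candidate masks")
    return max(range(len(candidates)), key=lambda i: iou(edit_region, candidates[i]))


# Coarse region where the instruction-guided edit changed the image.
edit_region = {(0, 0), (0, 1), (1, 0)}
# Candidate masks, e.g. from an off-the-shelf segmentation model.
candidates = [{(5, 5)}, {(0, 0), (0, 1), (1, 0), (1, 1)}, {(0, 0)}]
best = pick_mask(edit_region, candidates)
```

The selected mask then defines the image layer that is blended back into the original image.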
2312.16720
Report |
Prompt Expansion for Adaptive Text-to-Image Generation |
Siddhartha Datta, Alexander Ku, Deepak Ramachandran, Peter Anderson |
Text-to-image generation models are powerful but difficult to use. Users
craft specific prompts to get better images, though the images can be
repetitive. This paper proposes a Prompt Expansion framework that helps users
generate high-quality, diverse images with less effort. The Prompt Expansion
model takes a text query as input and outputs a set of expanded text prompts
that are optimized such that, when passed to a text-to-image model, they generate a
wider variety of appealing images. We conduct a human evaluation study that
shows that images generated through Prompt Expansion are more aesthetically
pleasing and diverse than those generated by baseline methods. Overall, this
paper presents a novel and effective approach to improving the text-to-image
generation experience. |
This paper introduces Prompt Expansion, a framework that enhances text-to-image generation by expanding user queries into detailed prompts, improving image quality and diversity. |
Existing text-to-image models often produce repetitive outputs and necessitate elaborate prompt engineering. This framework addresses these limitations by promoting diverse and high-quality image generation with less user effort. |
The authors create a Prompt Expansion dataset by inverting aesthetically pleasing images to text prompts and then mapping them to high-level user queries. They train a text-to-text model on this dataset and fine-tune it using a downstream text-to-image model. |
Prompt Expansion generates more aesthetically pleasing and diverse images compared to straight-query generation, as evidenced by automatic metrics and human evaluation.
The fine-tuned model excels in aesthetics, demonstrating the significance of aligning the framework with the downstream text-to-image model.
Prompt Expansion maintains reasonable text-image alignment, ensuring the expanded prompts stay true to the user's original intent. |
The diversity improvements, while consistent, are relatively small in magnitude.
The model's performance relies heavily on the quality and diversity of the training dataset. |
text-to-image generation, prompt engineering, image diversity, image aesthetics, text-image alignment |
2312.16693
Report |
I2V-Adapter: A General Image-to-Video Adapter for Diffusion Models |
Xun Guo, Mingwu Zheng, Liang Hou, Yuan Gao, Yufan Deng, Pengfei Wan, Di Zhang, Yufan Liu, Weiming Hu, Zhengjun Zha, Haibin Huang, Chongyang Ma |
Text-guided image-to-video (I2V) generation aims to generate a coherent video
that preserves the identity of the input image and semantically aligns with the
input prompt. Existing methods typically augment pretrained text-to-video (T2V)
models by either concatenating the image with noised video frames channel-wise
before being fed into the model or injecting the image embedding produced by
pretrained image encoders in cross-attention modules. However, the former
approach often necessitates altering the fundamental weights of pretrained T2V
models, thus restricting the model's compatibility within the open-source
communities and disrupting the model's prior knowledge. Meanwhile, the latter
typically fails to preserve the identity of the input image. We present
I2V-Adapter to overcome such limitations. I2V-Adapter adeptly propagates the
unnoised input image to subsequent noised frames through a cross-frame
attention mechanism, maintaining the identity of the input image without any
changes to the pretrained T2V model. Notably, I2V-Adapter only introduces a few
trainable parameters, significantly alleviating the training cost while also
ensuring compatibility with existing community-driven personalized models and
control tools. Moreover, we propose a novel Frame Similarity Prior to balance
the motion amplitude and the stability of generated videos through two
adjustable control coefficients. Our experimental results demonstrate that
I2V-Adapter is capable of producing high-quality videos. This performance,
coupled with its agility and adaptability, represents a substantial advancement
in the field of I2V, particularly for personalized and controllable
applications. |
This paper proposes I2V-Adapter, a lightweight, plug-and-play adapter for image-to-video generation that efficiently adapts pretrained text-to-video diffusion models without altering their original weights. |
Existing methods for adapting text-to-video models for image-to-video tasks require significant modifications, leading to training instability and incompatibility with personalized models or control tools. |
I2V-Adapter leverages pretrained model knowledge by feeding the unnoised input image and noised frames in parallel. It employs a cross-frame attention mechanism to propagate identity information, preserving the first frame's identity. Additionally, a Frame Similarity Prior balances motion and stability in generated videos. |
I2V-Adapter generates high-quality videos with consistency between input images and subsequent frames while adhering to text prompts.
It outperforms existing methods in quantitative metrics, demonstrating superior image consistency, motion range, and motion accuracy.
The method's plug-and-play nature ensures compatibility with personalized T2I models and control tools like ControlNet. |
Limited to generating 16-frame, 512x512 videos due to constraints from pretrained base models and video data.
Future work aims to incorporate frame interpolation and super-resolution modules for longer, higher-resolution videos. |
image-to-video generation, diffusion models, adapter, cross-frame attention, frame similarity prior |
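The cross-frame attention described in the I2V-Adapter record can be sketched as attention in which the queries come from a noised frame while the keys and values come from the unnoised first frame. This is a single-head toy version without learned projections; how the adapter output is merged into the pretrained model is not shown, and all names are illustrative.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_frame_attention(
    queries: List[Vector], first_keys: List[Vector], first_values: List[Vector]
) -> List[Vector]:
    """Each query token attends over the first frame's key/value tokens."""
    scale = 1.0 / math.sqrt(len(first_keys[0]))
    dim = len(first_values[0])
    out = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in first_keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, first_values)) for d in range(dim)])
    return out


# With a single first-frame token, every query receives that token's value.
frame_queries = [[1.0, 0.0], [0.0, 1.0]]
attended = cross_frame_attention(frame_queries, [[0.5, 0.5]], [[2.0, 3.0]])
```

Because every later frame reads from the same first-frame tokens, identity information from the input image is propagated to all frames.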
2312.16649
Report |
Forgery-aware Adaptive Transformer for Generalizable Synthetic Image Detection |
Huan Liu, Zichang Tan, Chuangchuang Tan, Yunchao Wei, Yao Zhao, Jingdong Wang |
In this paper, we study the problem of generalizable synthetic image
detection, aiming to detect forgery images from diverse generative methods,
e.g., GANs and diffusion models. Cutting-edge solutions start to explore the
benefits of pre-trained models, and mainly follow the fixed paradigm of solely
training an attached classifier, e.g., combining frozen CLIP-ViT with a
learnable linear layer in UniFD. However, our analysis shows that such a fixed
paradigm is prone to yield detectors with insufficient learning regarding
forgery representations. We attribute the key challenge to the lack of forgery
adaptation, and present a novel forgery-aware adaptive transformer approach,
namely FatFormer. Based on the pre-trained vision-language spaces of CLIP,
FatFormer introduces two core designs for the adaption to build generalized
forgery representations. First, motivated by the fact that both image and
frequency analysis are essential for synthetic image detection, we develop a
forgery-aware adapter to adapt image features to discern and integrate local
forgery traces within image and frequency domains. Second, we find that
considering the contrastive objectives between adapted image features and text
prompt embeddings, a previously overlooked aspect, results in a nontrivial
generalization improvement. Accordingly, we introduce language-guided alignment
to supervise the forgery adaptation with image and text prompts in FatFormer.
Experiments show that, by coupling these two designs, our approach tuned on
4-class ProGAN data attains a remarkable detection performance, achieving an
average of 98% accuracy on unseen GANs, and surprisingly generalizes to unseen
diffusion models with 95% accuracy. |
This paper proposes FatFormer, a forgery-aware adaptive transformer, for generalizable synthetic image detection, aiming to effectively detect fake images generated by various methods like GANs and diffusion models. |
Existing methods relying on fixed pre-trained models with attached classifiers show limitations in learning robust forgery representations, resulting in poor generalization ability to unseen generation methods. |
FatFormer leverages a forgery-aware adapter (FAA) to extract and integrate forgery traces in both image and frequency domains. It also introduces language-guided alignment (LGA) that utilizes contrastive objectives between adapted image features and text prompts for supervising the learning of generalized forgery representations. |
FatFormer consistently outperforms state-of-the-art methods on detecting fake images from various GANs, achieving 98.4% accuracy and 99.7% AP.
It demonstrates remarkable generalization ability by effectively detecting images from unseen diffusion models with 95.0% accuracy and 98.8% AP.
Ablation studies validate the contributions of FAA, LGA, and their components in enhancing detection performance and generalizability. |
There is still room for improvement in detecting fake images generated by specific diffusion models like Guided.
Exploring better pre-training tasks specifically designed for synthetic image detection could further enhance FatFormer's performance. |
synthetic image detection, forgery detection, generative models, adaptive transformer, vision-language model |
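The language-guided alignment described in the FatFormer entry above can be illustrated with a generic contrastive objective. The following is a minimal sketch, not the authors' implementation: it assumes one adapted image feature, one text prompt embedding per class (for example "real" and "fake"), cosine similarity as the score, and a temperature of 0.07, all of which are illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def alignment_loss(image_feat, text_embs, label, temperature=0.07):
    """Cross-entropy over image-to-text similarities.

    image_feat: adapted image feature (hypothetical toy vector).
    text_embs:  one prompt embedding per class (hypothetical toy vectors).
    label:      index of the correct class prompt.
    temperature: assumed value, not taken from the paper.
    """
    logits = [cosine(image_feat, t) / temperature for t in text_embs]
    # Numerically stable log-sum-exp.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]

# Toy usage: the image feature points at prompt 0, so label 0 gives the lower loss.
prompts = [[1.0, 0.0], [0.0, 1.0]]
loss_match = alignment_loss([1.0, 0.0], prompts, label=0)
loss_mismatch = alignment_loss([1.0, 0.0], prompts, label=1)
```

Minimizing this loss pulls the image feature toward the prompt embedding of its own class and away from the other, which is the general mechanism a contrastive image-text objective relies on.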
2312.16486
Report |
PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion |
Guansong Lu, Yuanfan Guo, Jianhua Han, Minzhe Niu, Yihan Zeng, Songcen Xu, Zeyi Huang, Zhao Zhong, Wei Zhang, Hang Xu |
Current large-scale diffusion models represent a giant leap forward in
conditional image synthesis, capable of interpreting diverse cues like text,
human poses, and edges. However, their reliance on substantial computational
resources and extensive data collection remains a bottleneck. On the other
hand, the integration of existing diffusion models, each specialized for
different controls and operating in unique latent spaces, poses a challenge due
to incompatible image resolutions and latent space embedding structures,
hindering their joint use. Addressing these constraints, we present
"PanGu-Draw", a novel latent diffusion model designed for resource-efficient
text-to-image synthesis that adeptly accommodates multiple control signals. We
first propose a resource-efficient Time-Decoupling Training Strategy, which
splits the monolithic text-to-image model into structure and texture
generators. Each generator is trained using a regimen that maximizes data
utilization and computational efficiency, cutting data preparation by 48% and
reducing training resources by 51%. Secondly, we introduce "Coop-Diffusion", an
algorithm that enables the cooperative use of various pre-trained diffusion
models with different latent spaces and predefined resolutions within a unified
denoising process. This allows for multi-control image synthesis at arbitrary
resolutions without the necessity for additional data or retraining. Empirical
validations of PanGu-Draw show its exceptional prowess in text-to-image and
multi-control image generation, suggesting a promising direction for future
model training efficiencies and generation versatility. The largest 5B T2I
PanGu-Draw model is released on the Ascend platform. Project page:
https://pangu-draw.github.io |
Presents "PanGu-Draw", a novel latent diffusion model for resource-efficient text-to-image synthesis that supports multiple control signals. |
Addresses the high computational cost and data requirements of large-scale diffusion models, as well as the challenge of integrating existing models with different controls, resolutions, and latent spaces. |
Introduces two key innovations: (1) Time-Decoupling Training Strategy: Splits the model into structure and texture generators trained separately for efficiency. (2) Coop-Diffusion Algorithm: Enables cooperative use of pre-trained diffusion models with different latent spaces and resolutions. |
PanGu-Draw achieves state-of-the-art text-to-image generation quality, surpassing models like DALL-E 2 and SDXL on English benchmarks.
It excels in Chinese text-to-image generation, achieving superior scores across FID, IS, and CN-CLIP-score metrics.
Coop-Diffusion enables multi-control and multi-resolution image generation by effectively fusing different diffusion models without retraining. |
The paper primarily focuses on efficiency and quality, with limited exploration of novel control mechanisms.
Future work could investigate the generalization of Coop-Diffusion to an even wider range of pre-trained models. |
text-to-image synthesis, diffusion models, multi-control generation, multi-resolution synthesis, efficient training |
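The Time-Decoupling idea in the PanGu-Draw entry above, where separate structure and texture generators share one denoising process, can be pictured as a routing rule over timesteps. This is a minimal sketch under assumptions: the structure generator is taken to handle the noisier (larger) timesteps and the texture generator the remaining ones, and the switch point `t_switch` is a hypothetical parameter, not a value from the paper.

```python
def pick_generator(t, t_switch):
    """Return which generator denoises timestep t.

    Assumption: timesteps run from num_steps - 1 (most noise) down to 0,
    and the structure generator covers t >= t_switch.
    """
    return "structure" if t >= t_switch else "texture"

def sampling_schedule(num_steps, t_switch):
    """List (timestep, generator) pairs in sampling order, noisiest first."""
    return [(t, pick_generator(t, t_switch)) for t in range(num_steps - 1, -1, -1)]

# Toy usage with 4 steps and a hypothetical switch at t = 2.
schedule = sampling_schedule(num_steps=4, t_switch=2)
```

Because each generator only ever sees its own range of timesteps, the two can be trained with different data and resolutions, which is where the efficiency gains reported in the entry would come from.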
2312.16414
Report |
Bellman Optimal Stepsize Straightening of Flow-Matching Models |
Bao Nguyen, Binh Nguyen, Viet Anh Nguyen |
Flow matching is a powerful framework for generating high-quality samples in
various applications, especially image synthesis. However, the intensive
computational demands of these models, especially during the finetuning and
sampling processes, pose significant challenges for low-resource scenarios.
This paper introduces the Bellman Optimal Stepsize Straightening (BOSS) technique
for distilling flow-matching generative models: it aims specifically for a
few-step efficient image sampling while adhering to a computational budget
constraint. First, this technique involves a dynamic programming algorithm that
optimizes the stepsizes of the pretrained network. Then, it refines the
velocity network to match the optimal step sizes, aiming to straighten the
generation paths. Extensive experimental evaluations across image generation
tasks demonstrate the efficacy of BOSS in terms of both resource utilization
and image quality. Our results reveal that BOSS achieves substantial gains in
efficiency while maintaining competitive sample quality, effectively bridging
the gap between low-resource constraints and the demanding requirements of
flow-matching generative models. Our paper also fortifies the responsible
development of artificial intelligence, offering a more sustainable generative
model that reduces computational costs and environmental footprints. Our code
can be found at https://github.com/nguyenngocbaocmt02/BOSS. |
This paper introduces Bellman Optimal Stepsize Straightening (BOSS), a technique to distill flow-matching generative models for efficient image sampling under low-resource constraints. |
Flow matching models, while powerful, demand significant computational resources for finetuning and sampling, making them challenging to use in low-resource settings. BOSS aims to bridge this gap. |
BOSS uses a two-phase approach: 1) a dynamic programming algorithm finds optimal stepsizes for the pretrained model, and 2) the velocity network is retrained to match these optimal stepsizes, straightening the generation paths. |
Bellman optimal stepsizes substantially improve image quality (lower FID) compared to uniform stepsizes, especially for high-resolution datasets.
BOSS achieves comparable or better image quality than standard reflow techniques with significantly fewer retraining iterations (reduced computational cost).
Low-Rank Adaptation (LoRA) can be effectively used during the straightening process, achieving competitive results while finetuning only a small fraction of model parameters. |
The method requires additional training to determine optimal stepsizes.
Future work could explore extending BOSS to guided velocity networks and developing computationally cheaper algorithms for stepsize calculation. |
generative models, flow matching, image generation, efficient sampling, low-resource settings |
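The first phase of BOSS described above, a dynamic program that selects stepsizes under a fixed step budget, can be sketched as a shortest-path recursion over a fine time grid. This is an illustrative sketch rather than the released code: the grid, the `cost(i, j)` function (a stand-in for the error of jumping from grid index `i` to `j` in one solver step), and the toy costs in the usage lines are all assumptions.

```python
from math import inf

def optimal_stepsizes(n_grid, n_steps, cost):
    """Choose n_steps jumps covering grid indices 0..n_grid with minimal total cost.

    cost(i, j) is a hypothetical per-jump error for stepping from index i to j > i.
    Returns (path, total_cost), where path lists the visited grid indices.
    """
    # best[k][j]: minimal cost of reaching index j using exactly k jumps.
    best = [[inf] * (n_grid + 1) for _ in range(n_steps + 1)]
    prev = [[-1] * (n_grid + 1) for _ in range(n_steps + 1)]
    best[0][0] = 0.0
    for k in range(1, n_steps + 1):
        for j in range(1, n_grid + 1):
            for i in range(j):
                if best[k - 1][i] == inf:
                    continue
                candidate = best[k - 1][i] + cost(i, j)
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    prev[k][j] = i
    # Backtrack from the final index to recover the chosen jumps.
    path = [n_grid]
    j = n_grid
    for k in range(n_steps, 0, -1):
        j = prev[k][j]
        path.append(j)
    path.reverse()
    return path, best[n_steps][n_grid]

# Toy usage: a uniform quadratic cost yields uniform steps.
uniform_path, uniform_cost = optimal_stepsizes(8, 4, lambda i, j: (j - i) ** 2)

# Toy usage: a large weight at index 0 forces a short first step.
weights = [10, 1, 1, 1]
skewed_path, skewed_cost = optimal_stepsizes(4, 2, lambda i, j: weights[i] * (j - i) ** 2)
```

The recursion applies the Bellman principle: the best way to reach index `j` in `k` jumps extends the best way to reach some earlier index in `k - 1` jumps. In the second toy case the costly region near index 0 receives a shorter step, which is the non-uniform behavior the entry contrasts with uniform stepsizes.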
2312.16274
Report |
Towards Flexible, Scalable, and Adaptive Multi-Modal Conditioned Face Synthesis |
Jingjing Ren, Cheng Xu, Haoyu Chen, Xinran Qin, Lei Zhu |
Recent progress in multi-modal conditioned face synthesis has enabled the
creation of visually striking and accurately aligned facial images. Yet,
current methods still face issues with scalability, limited flexibility, and a
one-size-fits-all approach to control strength, not accounting for the
differing levels of conditional entropy, a measure of unpredictability in data
given some condition, across modalities. To address these challenges, we
introduce a novel uni-modal training approach with modal surrogates, coupled
with an entropy-aware modal-adaptive modulation, to support flexible, scalable,
and adaptive multi-modal conditioned face synthesis. Our uni-modal training
with modal surrogates leverages only uni-modal data: the surrogates decorate
each condition with modal-specific characteristics and serve as linkers for
inter-modal collaboration, so the network fully learns the control of each
modality in the face synthesis process as well as inter-modal collaboration.
The entropy-aware modal-adaptive modulation finely adjusts the diffusion noise
according to modal-specific characteristics and the given conditions, enabling
well-informed steps along the denoising trajectory and ultimately leading to
synthesis results of high fidelity and quality. Our framework improves
multi-modal face synthesis
under various conditions, surpassing current methods in image quality and
fidelity, as demonstrated by our thorough experimental results. |
This paper proposes a novel uni-modal training approach with modal surrogates and an entropy-aware modal-adaptive modulation mechanism for scalable, flexible, and adaptive multi-modal conditioned face synthesis. |
Existing methods suffer from poor scalability, limited flexibility in handling modal combinations, and a lack of adaptivity to the varying control strength required for different modalities. |
The method uses modal surrogates to decorate conditions with modal-specific characteristics and facilitate inter-modal collaboration. It also dynamically adjusts noise levels based on conditional entropy for each modality, ensuring effective utilization of information from all modalities. |
The proposed framework demonstrates superior performance in multi-modal face synthesis, outperforming existing methods in terms of image quality and condition alignment.
It supports a wide range of face synthesis applications, including diverse uni-modal synthesis and flexible combinations of multi-modal conditions.
The method achieves high flexibility and scalability by enabling the synthesis of facial images under various modal combinations within a single sampling process of a unified diffusion model. |
The method currently relies on pre-trained encoders for certain modalities like text and low-resolution images.
Further exploration is needed to extend the approach to even more modalities and higher-resolution image synthesis. |
multi-modal face synthesis, diffusion models, uni-modal training, modal surrogates, entropy-aware modulation |
2312.16272
Report |
SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation |
Yuxuan Zhang, Yiren Song, Jiaming Liu, Rui Wang, Jinpeng Yu, Hao Tang, Huaxia Li, Xu Tang, Yao Hu, Han Pan, Zhongliang Jing |
Recent advancements in subject-driven image generation have led to zero-shot
generation, yet precise selection and focus on crucial subject representations
remain challenging. Addressing this, we introduce the SSR-Encoder, a novel
architecture designed for selectively capturing any subject from single or
multiple reference images. It responds to various query modalities including
text and masks, without necessitating test-time fine-tuning. The SSR-Encoder
combines a Token-to-Patch Aligner that aligns query inputs with image patches
and a Detail-Preserving Subject Encoder for extracting and preserving fine
features of the subjects, thereby generating subject embeddings. These
embeddings, used in conjunction with original text embeddings, condition the
generation process. Characterized by its model generalizability and efficiency,
the SSR-Encoder adapts to a range of custom models and control modules.
Enhanced by the Embedding Consistency Regularization Loss for improved
training, our extensive experiments demonstrate its effectiveness in versatile
and high-quality image generation, indicating its broad applicability. Project
page: https://ssr-encoder.github.io |
This paper introduces SSR-Encoder, a novel finetuning-free method for selective subject-driven image generation using text or mask queries. |
Existing methods either lack the flexibility to selectively generate subjects from single or multiple images without test-time fine-tuning or fail to fully capture and leverage the detailed representation of subjects. |
The SSR-Encoder consists of a Token-to-Patch Aligner for precise query-subject alignment and a Detail-Preserving Subject Encoder for extracting multi-scale subject embeddings. An Embedding Consistency Regularization Loss enhances token-to-patch alignment during training. |
SSR-Encoder outperforms state-of-the-art finetuning-free methods in subject and image-text alignment, subject exclusivity, and image quality.
It demonstrates competitive performance even compared to finetuning-based methods.
Ablation studies validate the contribution of each component, showcasing improved expressiveness and precision in subject-driven generation. |
The fidelity of generated images can be affected by the uneven distribution of training data.
Future work includes addressing data distribution limitations and extending the approach to 3D generation. |
image generation, subject-driven generation, text-to-image, diffusion models, selective representation |
2312.16256
Report |
DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision |
Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, Xuanmao Li, Xingpeng Sun, Rohan Ashok, Aniruddha Mukherjee, Hao Kang, Xiangrui Kong, Gang Hua, Tianyi Zhang, Bedrich Benes, Aniket Bera |
We have witnessed significant progress in deep learning-based 3D vision,
ranging from neural radiance field (NeRF) based 3D representation learning to
applications in novel view synthesis (NVS). However, existing scene-level
datasets for deep learning-based 3D vision, limited to either synthetic
environments or a narrow selection of real-world scenes, are quite
insufficient. This insufficiency not only hinders a comprehensive benchmark of
existing methods but also caps what could be explored in deep learning-based 3D
analysis. To address this critical gap, we present DL3DV-10K, a large-scale
scene dataset, featuring 51.2 million frames from 10,510 videos captured from
65 types of point-of-interest (POI) locations, covering both bounded and
unbounded scenes, with different levels of reflection, transparency, and
lighting. We conducted a comprehensive benchmark of recent NVS methods on
DL3DV-10K, which revealed valuable insights for future research in NVS. In
addition, we have obtained encouraging results in a pilot study to learn
generalizable NeRF from DL3DV-10K, which manifests the necessity of a
large-scale scene-level dataset to forge a path toward a foundation model for
learning 3D representation. Our DL3DV-10K dataset, benchmark results, and
models will be publicly accessible at https://dl3dv-10k.github.io/DL3DV-10K/. |
This paper introduces DL3DV-10K, a large-scale, real-world multi-view scene dataset for novel view synthesis and 3D representation learning, containing 51.2 million 4K resolution frames across 65 point-of-interest categories with fine-grained annotations for scene complexity. |
Existing scene-level datasets are limited in scale and diversity, hindering comprehensive benchmarking of NVS methods and the development of generalizable 3D representation learning models. |
The dataset was created by capturing high-resolution videos of diverse real-world scenes using consumer mobile devices and drones, followed by a detailed annotation process for scene complexity. |
Zip-NeRF and 3DGS demonstrated the best overall performance on the benchmark, with Zip-NeRF excelling in most scenarios but consuming more memory.
Outdoor (unbounded) scenes and scenes with high-frequency details posed significant challenges for all evaluated NVS methods.
Pretraining a generalizable NeRF model on DL3DV-10K significantly improved performance compared to training from scratch or using a smaller dataset, highlighting the dataset's potential for learning universal scene priors. |
The presence of moving objects in some scenes, inherent to mobile phone video capture, presents challenges for static view synthesis.
Future work includes expanding the dataset with dynamic scenes and exploring the development of robust learning-based 3D models for dynamic NVS. |
novel view synthesis, 3d representation learning, dataset, benchmark, neural radiance fields |
2312.16218
Report |
Hyper-VolTran: Fast and Generalizable One-Shot Image to 3D Object Structure via HyperNetworks |
Christian Simon, Sen He, Juan-Manuel Perez-Rua, Mengmeng Xu, Amine Benhalloum, Tao Xiang |
Solving image-to-3D from a single view is an ill-posed problem, and current
neural reconstruction methods addressing it through diffusion models still rely
on scene-specific optimization, constraining their generalization capability.
To overcome the limitations of existing approaches regarding generalization and
consistency, we introduce a novel neural rendering technique. Our approach
employs the signed distance function as the surface representation and
incorporates generalizable priors through geometry-encoding volumes and
HyperNetworks. Specifically, our method builds neural encoding volumes from
generated multi-view inputs. We adjust the weights of the SDF network
conditioned on an input image at test-time to allow model adaptation to novel
scenes in a feed-forward manner via HyperNetworks. To mitigate artifacts
derived from the synthesized views, we propose the use of a volume transformer
module to improve the aggregation of image features instead of processing each
viewpoint separately. Through our proposed method, dubbed as Hyper-VolTran, we
avoid the bottleneck of scene-specific optimization and maintain consistency
across the images generated from multiple viewpoints. Our experiments show the
advantages of our proposed approach with consistent results and rapid
generation. |
This paper introduces Hyper-VolTran, a novel neural rendering technique for fast and generalizable single-view 3D reconstruction, employing HyperNetworks and a Volume Transformer. |
Single-view 3D reconstruction is ill-posed, and existing diffusion-based methods lack generalization due to scene-specific optimization. |
The method uses a diffusion model to synthesize multi-view images, a HyperNetwork to generate SDF network weights from input image embeddings, and a Volume Transformer (VolTran) for consistent feature aggregation across inconsistent views. |
Hyper-VolTran demonstrates superior generalization ability compared to existing methods in single image-to-3D reconstruction.
The method achieves fast 3D mesh generation in 45 seconds without per-scene optimization.
Ablation studies confirm the contribution of both the HyperNetwork and VolTran modules to the overall performance. |
The performance of Hyper-VolTran relies on the quality and consistency of the multi-view images generated by the diffusion model.
Future work could explore incorporating semantic information or alternative 3D representations for enhanced reconstruction accuracy. |
3d reconstruction, single-view reconstruction, neural rendering, hypernetworks, diffusion models |
2312.16204
Report |
Iterative Prompt Relabeling for diffusion model with RLDF |
Jiaxin Ge, Xinyan Chen, Tianjun Zhang, Shanghang Zhang |
Diffusion models have shown impressive performance in many domains, including
image generation, time series prediction, and reinforcement learning. They
demonstrate superior performance over traditional GAN- and transformer-based
methods. However, their capability to follow natural language instructions
(e.g., spatial relationships between objects, generating complex scenes) is
still unsatisfactory, and enhancing it has become an important research area.
Prior works adopt reinforcement learning to adjust the behavior of diffusion
models. However, RL methods not only require careful reward design and complex
hyperparameter tuning, but also fail to incorporate rich natural language
feedback. In this work, we propose
iterative prompt relabeling (IP-RLDF), a novel algorithm that aligns images to
text through iterative image sampling and prompt relabeling. IP-RLDF first
samples a batch of images conditioned on the text, then relabels the text
prompts of unmatched text-image pairs with classifier feedback. We conduct
thorough experiments on three different models, including SDv2, GLIGEN, and
SDXL, testing their capability to generate images following instructions. With
IP-RLDF, we improved up to 15.22% (absolute improvement) on the challenging
spatial relation VISOR benchmark, demonstrating superior performance compared
to previous RL methods. |
This paper proposes IP-RLDF, a novel algorithm to improve the spatial understanding and rendering capabilities of text-to-image diffusion models. |
Current diffusion models struggle to accurately interpret and execute complex instructions involving spatial relationships between objects. |
IP-RLDF uses an iterative process of: 1) Sampling images from a diffusion model. 2) Using an object detection model to analyze spatial relationships and relabel inaccurate text prompts. 3) Retraining the model with the augmented dataset, iteratively refining its spatial understanding. |
IP-RLDF achieves up to 15.22% absolute improvement in spatial accuracy on the VISOR benchmark.
The algorithm shows consistent improvement across different diffusion models (SDv2, GLIGEN, SDXL) and fine-tuning techniques.
Ablation studies confirm that each component (prompt relabeling, iterative training, detection-based reward) contributes to performance gains. |
The current implementation focuses on object count and spatial layouts; expanding to more complex language feedback is a potential area.
Balancing the trade-off between spatial accuracy and maintaining overall image fidelity (CLIP score) requires further investigation. |
diffusion models, text-to-image generation, spatial understanding, prompt relabeling, reinforcement learning |
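The relabeling step of IP-RLDF described above, in which detector output is used to rewrite the prompt of a mismatched image, can be sketched with bounding boxes. This is a minimal illustration, not the authors' pipeline: the box format `(x0, y0, x1, y1)`, the four relation phrases, the prompt template, and the rule of comparing box centers are all assumptions made for the example.

```python
def center(box):
    """Center of a box given as (x0, y0, x1, y1) in image coordinates."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def detected_relation(box_a, box_b):
    """Spatial relation of object A relative to object B (y grows downward)."""
    (ax, ay), (bx, by) = center(box_a), center(box_b)
    dx, dy = ax - bx, ay - by
    if abs(dx) >= abs(dy):
        return "to the left of" if dx < 0 else "to the right of"
    return "above" if dy < 0 else "below"

def relabel(objects, boxes):
    """Build a prompt that matches what the (hypothetical) detector found.

    objects: the two object names from the original prompt.
    boxes:   detector output mapping object name -> box.
    Returns None when either object was not detected.
    """
    a, b = objects
    if a not in boxes or b not in boxes:
        return None
    return f"a {a} {detected_relation(boxes[a], boxes[b])} a {b}"

# Toy usage: the prompt asked for "a dog to the right of a cat",
# but the sampled image places the dog on the left.
boxes = {"dog": (10, 10, 30, 30), "cat": (60, 12, 80, 32)}
new_prompt = relabel(("dog", "cat"), boxes)
```

The relabeled pair (image, new prompt) is a correct text-image example even though the original generation failed, so it can be added to the training set for the next iteration instead of being discarded.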
2312.16197
Report |
INFAMOUS-NeRF: ImproviNg FAce MOdeling Using Semantically-Aligned Hypernetworks with Neural Radiance Fields |
Andrew Hou, Feng Liu, Zhiyuan Ren, Michel Sarkis, Ning Bi, Yiying Tong, Xiaoming Liu |
We propose INFAMOUS-NeRF, an implicit morphable face model that introduces
hypernetworks to NeRF to improve the representation power in the presence of
many training subjects. At the same time, INFAMOUS-NeRF resolves the classic
hypernetwork tradeoff of representation power and editability by learning
semantically-aligned latent spaces despite the subject-specific models, all
without requiring a large pretrained model. INFAMOUS-NeRF further introduces a
novel constraint to improve NeRF rendering along the face boundary. Our
constraint can leverage photometric surface rendering and multi-view
supervision to guide surface color prediction and improve rendering near the
surface. Finally, we introduce a novel, loss-guided adaptive sampling method
for more effective NeRF training by reducing the sampling redundancy. We show
quantitatively and qualitatively that our method achieves higher representation
power than prior face modeling methods in both controlled and in-the-wild
settings. Code and models will be released upon publication. |
INFAMOUS-NeRF, an implicit morphable face model, leverages hypernetworks to learn subject-specific NeRF MLP weights, enhancing representation power while preserving editability through semantically-aligned latent spaces. |
Existing face models struggle to balance high-fidelity rendering with the capacity to represent diverse subjects and enable editing. INFAMOUS-NeRF addresses this by improving representation power and maintaining editability. |
The method employs a two-stage approach: 1) a NeRF model with hypernetworks to learn subject-specific MLPs and semantically aligned latent codes for shared attributes like expressions, and 2) a conditional DDPM for novel view refinement. It also introduces a photometric surface constraint for rendering accuracy at face boundaries and an adaptive sampling technique for efficient NeRF training. |
Achieves state-of-the-art novel view synthesis and 3DMM fitting on FaceScape, FFHQ, and CelebAHQ datasets, demonstrating superior representation power.
Successfully transfers expressions between subjects, proving semantic alignment of latent spaces despite using hypernetworks.
Demonstrates improved rendering quality, especially at face boundaries, thanks to the novel photometric surface constraint and adaptive sampling. |
Handling out-of-distribution expressions remains challenging, potentially requiring training data from more diverse datasets.
Initial latent code optimization for a new image is computationally expensive, necessitating exploration of faster mapping techniques. |
face modeling, neural radiance fields, hypernetworks, 3d morphable models, adaptive sampling |
2312.16171
Report |
Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4 |
Sondos Mahmoud Bsharat, Aidar Myrzakhan, Zhiqiang Shen |
This paper introduces 26 guiding principles designed to streamline the
process of querying and prompting large language models. Our goal is to
simplify the underlying concepts of formulating questions for various scales of
large language models, examine their abilities, and enhance user comprehension
of how models at different scales behave when fed different prompts. Extensive
experiments are conducted on LLaMA-1/2 (7B, 13B and 70B) and GPT-3.5/4 to
verify the effectiveness of the proposed principles on instruction and prompt
design. We hope that this work
can provide a better guide for researchers working on the prompting of large
language models. Project page is available at
https://github.com/VILA-Lab/ATLAS. |
This paper introduces 26 guiding principles for crafting effective prompts for large language models (LLMs). |
The goal is to simplify the process of prompting LLMs, enhance users' understanding of their behavior, and ultimately improve the quality of LLM responses. |
The authors conducted extensive experiments on various LLM scales (LLaMA-1/2, GPT-3.5/4) using the manually designed ATLAS benchmark to evaluate the effectiveness of the proposed principles. |
The principled prompts led to an average improvement of 57.7% in LLM response quality and 36.4% in accuracy on GPT-4.
Larger models exhibited greater performance gains from the principled prompts, exceeding 20% improvement when moving from LLaMA-2-7B to GPT-4.
The principles consistently improved the conciseness, factuality, and clarity of LLM responses across different scales. |
The effectiveness of the principles may be limited when applied to highly complex or specialized questions.
The evaluation was conducted on a limited set of questions and LLM architectures, potentially affecting the generalizability of the findings. |
large language models, prompt engineering, prompting principles, llm evaluation, atlas benchmark |
2312.16145
Report |
One-Dimensional Adapter to Rule Them All: Concepts, Diffusion Models and Erasing Applications |
Mengyao Lyu, Yuhong Yang, Haiwen Hong, Hui Chen, Xuan Jin, Yuan He, Hui Xue, Jungong Han, Guiguang Ding |
The prevalent use of commercial and open-source diffusion models (DMs) for
text-to-image generation prompts risk mitigation to prevent undesired
behaviors. Existing concept erasing methods in academia are all based on full
parameter or specification-based fine-tuning, from which we observe the
following issues: 1) Generation alteration towards erosion: Parameter drift
during target elimination causes alterations and potential deformations across
all generations, even eroding other concepts to varying degrees, which is more
evident when multiple concepts are erased; 2) Transfer inability & deployment
inefficiency: Previous model-specific erasure impedes the flexible combination
of concepts and the training-free transfer towards other models, resulting in
linear cost growth as the deployment scenarios increase. To achieve
non-invasive, precise, customizable, and transferable elimination, we ground
our erasing framework on one-dimensional adapters to erase multiple concepts
from most DMs at once across versatile erasing applications. The
concept-SemiPermeable structure is injected as a Membrane (SPM) into any DM to
learn targeted erasing, while the alteration and erosion phenomenon is
effectively mitigated via a novel Latent Anchoring fine-tuning strategy. Once
obtained, SPMs can be flexibly combined and plug-and-play for other DMs without
specific re-tuning, enabling timely and efficient adaptation to diverse
scenarios. During generation, our Facilitated Transport mechanism dynamically
regulates the permeability of each SPM to respond to different input prompts,
further minimizing the impact on other concepts. Quantitative and qualitative
results across ~40 concepts, 7 DMs and 4 erasing applications have demonstrated
the superior erasing of SPM. Our code and pre-tuned SPMs are available on the
project page https://lyumengyao.github.io/projects/spm. |
This paper introduces SPM, a one-dimensional adapter framework for erasing concepts from pre-trained diffusion models (DMs) in a precise, customizable, and transferable manner. |
Existing concept erasing methods often lead to undesirable generation alterations and concept erosion, especially when multiple concepts are erased. They also lack transferability across different DM architectures. |
The framework utilizes 1-dim SPMs injected into DMs to learn concept-specific semi-permeability. It employs Latent Anchoring during training to preserve non-target concepts and Facilitated Transport during inference to dynamically regulate SPM activation based on input prompts. |
SPMs successfully erase concrete objects, abstract styles, sexual content, and memorized images while minimizing impact on non-target generations.
The method effectively mitigates generation alterations and alleviates concept erosion, even with multiple concepts erased.
SPMs exhibit training-free transferability to other DMs, enabling efficient adaptation to diverse models and regulatory requirements. |
Challenges remain in precisely defining and erasing interconnected concepts with nuanced attributes.
Further research is needed to enhance the robustness of nudity removal, especially when transferred to community-trained DMs. |
concept erasing, diffusion models, generative safety, parameter-efficient fine-tuning, transfer learning |
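A one-dimensional adapter of the kind the SPM entry above describes can be pictured as a rank-1 update added to a frozen linear layer, scaled by a gate that depends on the prompt. This is a sketch under assumptions: the form `W x + gate * u (v . x)` is a generic rank-1 adapter, and the `gate` argument merely stands in for the Facilitated Transport mechanism, whose actual computation is not given in the entry.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, x) for row in W]

def adapted_forward(W, x, u, v, gate):
    """Frozen layer output plus a gated rank-1 correction.

    W:    frozen weight matrix of the base model.
    u, v: the two vectors of the one-dimensional adapter (hypothetical values).
    gate: scalar in [0, 1]; 0 leaves the base model untouched (assumed
          stand-in for prompt-dependent permeability).
    """
    base = matvec(W, x)
    scale = gate * dot(v, x)
    return [b + scale * ui for b, ui in zip(base, u)]

# Toy usage with an identity base layer.
W = [[1.0, 0.0], [0.0, 1.0]]
x = [1.0, 2.0]
u, v = [1.0, 0.0], [0.0, 1.0]
untouched = adapted_forward(W, x, u, v, gate=0.0)  # equals W x
erased = adapted_forward(W, x, u, v, gate=1.0)     # adapter fully active
```

Because the adapter is stored separately from `W`, several such pairs can be summed on top of the same frozen layer, and a gate near zero for unrelated prompts leaves other concepts unaffected.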
2312.16109
Report |
fMPI: Fast Novel View Synthesis in the Wild with Layered Scene Representations |
Jonas Kohler, Nicolas Griffiths Sanchez, Luca Cavalli, Catherine Herold, Albert Pumarola, Alberto Garcia Garcia, Ali Thabet |
In this study, we propose two novel input processing paradigms for novel view
synthesis (NVS) methods based on layered scene representations that
significantly improve their runtime without compromising quality. Our approach
identifies and mitigates the two most time-consuming aspects of traditional
pipelines: building and processing the so-called plane sweep volume (PSV),
which is a high-dimensional tensor of planar re-projections of the input camera
views. In particular, we propose processing this tensor in parallel groups for
improved compute efficiency as well as super-sampling adjacent input planes to
generate a denser, and hence more accurate, scene representation. The proposed
enhancements offer significant flexibility, allowing for a balance between
performance and speed, thus making substantial steps toward real-time
applications. Furthermore, they are very general in the sense that any
PSV-based method can make use of them, including methods that employ multiplane
images, multisphere images, and layered depth images. In a comprehensive set of
experiments, we demonstrate that our proposed paradigms enable the design of an
NVS method that achieves state-of-the-art on public benchmarks while being up
to 50x faster than existing state-of-the-art methods. It also beats the
current forerunner in terms of speed by over 3x, while achieving
significantly better rendering quality. |
This paper introduces two novel input processing methods for layered scene representation-based novel view synthesis (NVS) to significantly improve runtime performance without sacrificing quality. |
Existing NVS methods, while effective, suffer from high computational complexity, hindering their application in real-time scenarios like VR and immersive telepresence. |
The authors propose (1) Plane Grouping: splitting the computationally expensive plane sweep volume (PSV) into groups for parallel processing and (2) Plane Super-Sampling: enabling the network to leverage PSV redundancies and predict denser MPIs from a sparser input, reducing computation. |
The proposed 'fast MPI' method achieves state-of-the-art quality on public NVS benchmarks while being up to 50x faster than existing methods.
Plane Grouping shows superior performance compared to processing planes independently or jointly, enabling an optimal speed-performance trade-off.
Super-Sampling significantly reduces runtime by predicting denser MPIs from sparser PSVs without compromising quality. |
The method lacks temporal consistency, potentially leading to inconsistencies in video view synthesis.
Memory requirements for layered representations remain high, posing challenges for resource-constrained environments. |
novel view synthesis, multiplane images, layered scene representation, real-time rendering, computer vision |
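The two input-processing ideas in the fMPI report can be illustrated on plane positions alone. The sketch below is not the authors' implementation. It assumes the PSV planes are described by a list of disparities, that Plane Grouping amounts to chunking consecutive planes, and that Plane Super-Sampling places the extra output planes by linear interpolation in disparity. The function names `group_planes` and `super_sample_disparities` are hypothetical.

```python
from typing import List, Sequence


def group_planes(disparities: Sequence[float], group_size: int) -> List[List[float]]:
    """Chunk consecutive PSV planes into groups that could be processed in parallel."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return [list(disparities[i:i + group_size])
            for i in range(0, len(disparities), group_size)]


def super_sample_disparities(disparities: Sequence[float], factor: int) -> List[float]:
    """Return a denser plane list with `factor` output planes per input interval.

    Assumption: new planes are spaced linearly in disparity between neighbours.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    if len(disparities) < 2:
        return list(disparities)
    dense: List[float] = []
    for near, far in zip(disparities, disparities[1:]):
        for step in range(factor):
            dense.append(near + (far - near) * step / factor)
    dense.append(disparities[-1])  # keep the last input plane
    return dense
```

Under these assumptions, a sparse set of input planes can be grouped for parallel processing while the output MPI is still defined on the denser plane list.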
2312.16084
Report |
LangSplat: 3D Language Gaussian Splatting |
Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, Hanspeter Pfister |
Humans live in a 3D world and commonly use natural language to interact with
a 3D scene. Modeling a 3D language field to support open-ended language queries
in 3D has gained increasing attention recently. This paper introduces
LangSplat, which constructs a 3D language field that enables precise and
efficient open-vocabulary querying within 3D spaces. Unlike existing methods
that ground CLIP language embeddings in a NeRF model, LangSplat advances the
field by utilizing a collection of 3D Gaussians, each encoding language
features distilled from CLIP, to represent the language field. By employing a
tile-based splatting technique for rendering language features, we circumvent
the costly rendering process inherent in NeRF. Instead of directly learning
CLIP embeddings, LangSplat first trains a scene-wise language autoencoder and
then learns language features on the scene-specific latent space, thereby
alleviating substantial memory demands imposed by explicit modeling. Existing
methods struggle with imprecise and vague 3D language fields, which fail to
discern clear boundaries between objects. We delve into this issue and propose
to learn hierarchical semantics using SAM, thereby eliminating the need for
extensively querying the language field across various scales and the
regularization of DINO features. Extensive experimental results show that
LangSplat outperforms the previous state-of-the-art method LERF
by a large margin. Notably, LangSplat is extremely efficient, achieving a
$199\times$ speedup compared to LERF at a resolution of $1440 \times 1080$. We
strongly encourage readers to check out our video results at
https://langsplat.github.io/ |
This paper introduces LangSplat, a novel method for building 3D language fields using 3D Gaussian Splatting, enabling fast and accurate open-vocabulary querying in 3D scenes. |
Modeling 3D language fields allows for versatile interaction with 3D scenes using natural language, benefiting applications like robotics, autonomous driving, and AR/VR. |
LangSplat leverages 3D Gaussian Splatting for efficient rendering, incorporates a scene-specific language autoencoder for memory efficiency, and employs SAM for accurate and hierarchical semantic learning. |
LangSplat significantly outperforms previous state-of-the-art methods like LERF in open-vocabulary 3D object localization and semantic segmentation tasks.
The method exhibits remarkable speed improvements, achieving up to 199x faster query times compared to LERF.
LangSplat effectively learns a precise 3D language field, as demonstrated by its ability to accurately capture object boundaries and reduce noise in segmentation results. |
The current implementation relies on a pre-trained SAM model, which may limit its generalizability to unseen object categories.
Future work could explore incorporating temporal information for dynamic scene understanding and interaction. |
3d language field, open-vocabulary querying, 3d gaussian splatting, segment anything model (sam), scene understanding |
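The LangSplat report describes learning language features in a scene-specific latent space and querying them with open-vocabulary text. The following is a minimal sketch of that query path, not the paper's method. It assumes a linear decoder (`decoder_weights`) in place of the scene-wise autoencoder's decoder and a plain cosine similarity as the relevancy score. All names are hypothetical.

```python
import math
from typing import List, Sequence


def decode(latent: Sequence[float], decoder_weights: Sequence[Sequence[float]]) -> List[float]:
    """Map a low-dimensional latent back to feature space (linear decoder, an assumption)."""
    return [sum(w * z for w, z in zip(row, latent)) for row in decoder_weights]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / norm


def best_matching_pixel(pixel_latents: Sequence[Sequence[float]],
                        decoder_weights: Sequence[Sequence[float]],
                        text_embedding: Sequence[float]) -> int:
    """Index of the rendered pixel whose decoded feature best matches the text query."""
    scores = [cosine_similarity(decode(z, decoder_weights), text_embedding)
              for z in pixel_latents]
    return max(range(len(scores)), key=scores.__getitem__)
```

Storing only the short latent per Gaussian and decoding at query time is what keeps memory low in this sketch; the decoder itself would be trained per scene.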
2312.16047
Report |
2D-Guided 3D Gaussian Segmentation |
Kun Lan, Haoran Li, Haolin Shi, Wenjun Wu, Yong Liao, Lin Wang, Pengyuan Zhou |
Recently, 3D Gaussians, an explicit 3D representation method, have
demonstrated strong competitiveness with NeRF (Neural Radiance Fields) in terms
of expressing complex scenes and training duration. These advantages signal a
wide range of applications for 3D Gaussians in 3D understanding and editing.
Meanwhile, the segmentation of 3D Gaussians is still in its infancy. The
existing segmentation methods are not only cumbersome but also incapable of
segmenting multiple objects simultaneously in a short amount of time. In
response, this paper introduces a 3D Gaussian segmentation method implemented
with 2D segmentation as supervision. This approach uses input 2D segmentation
maps to guide the learning of the added 3D Gaussian semantic information, while
nearest neighbor clustering and statistical filtering refine the segmentation
results. Experiments show that our concise method achieves mIoU and mAcc on
multi-object segmentation comparable to those of previous
single-object segmentation methods. |
This paper introduces a novel 3D Gaussian segmentation method guided by 2D segmentation, enhancing efficiency and accuracy in multi-object segmentation within 3D scenes. |
Existing 3D Gaussian segmentation methods are either computationally intensive or incapable of segmenting multiple objects efficiently. This work addresses these limitations, aiming for a fast and accurate multi-object segmentation approach. |
The method leverages pre-trained 2D segmentation models to guide the learning of semantic information (object code) attached to 3D Gaussians. It then employs KNN clustering to refine semantic information and optionally uses statistical filtering to remove erroneously segmented Gaussians. |
The method achieves mean Intersection over Union (mIoU) and mean Accuracy (mAcc) comparable to previous single-object segmentation techniques while enabling multi-object segmentation.
It demonstrates superior detail preservation compared to NeRF-based segmentation methods due to the explicit representation of 3D Gaussians.
The approach is efficient, requiring less than two minutes for semantic information learning and 1-2 seconds for multi-object segmentation from a given viewpoint. |
The method's reliance on 2D segmentation maps might limit its performance in scenarios where 2D segmentation is challenging.
Future work can explore incorporating depth information to further enhance segmentation accuracy in complex scenes. |
3d gaussian, 3d segmentation, 2d segmentation guidance, knn clustering, statistical filtering |
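The two refinement steps named in the 2D-guided segmentation report, nearest-neighbour clustering and statistical filtering, can be sketched on Gaussian centres as follows. This is an illustrative version under stated assumptions, not the authors' code: labels are smoothed by a majority vote over the `k` nearest neighbours, and a Gaussian is treated as an outlier when its mean neighbour distance exceeds the global mean by more than `std_ratio` standard deviations. The brute-force search and all function names are hypothetical, and both functions assume `len(points) > k >= 1`.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def nearest_neighbours(points: Sequence[Point], i: int, k: int) -> List[int]:
    """Indices of the k points closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def knn_refine_labels(points: Sequence[Point], labels: Sequence[int], k: int) -> List[int]:
    """Replace each label with the majority label among its k nearest neighbours."""
    refined = []
    for i in range(len(points)):
        votes = Counter(labels[j] for j in nearest_neighbours(points, i, k))
        refined.append(votes.most_common(1)[0][0])
    return refined


def statistical_filter(points: Sequence[Point], k: int, std_ratio: float = 1.0) -> List[int]:
    """Indices of points whose mean k-NN distance is at most mean + std_ratio * std."""
    mean_dists = [
        sum(math.dist(points[i], points[j]) for j in nearest_neighbours(points, i, k)) / k
        for i in range(len(points))
    ]
    mu = sum(mean_dists) / len(mean_dists)
    sigma = math.sqrt(sum((d - mu) ** 2 for d in mean_dists) / len(mean_dists))
    return [i for i, d in enumerate(mean_dists) if d <= mu + std_ratio * sigma]
```

The brute-force neighbour search is quadratic in the number of Gaussians; a spatial index would be needed at realistic scene sizes.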
2312.15980
Report |
HarmonyView: Harmonizing Consistency and Diversity in One-Image-to-3D |
Sangmin Woo, Byeongjun Park, Hyojun Go, Jin-Young Kim, Changick Kim |
Recent progress in single-image 3D generation highlights the importance of
multi-view coherency, leveraging 3D priors from large-scale diffusion models
pretrained on Internet-scale images. However, the aspect of novel-view
diversity remains underexplored within the research landscape due to the
ambiguity in converting a 2D image into 3D content, where numerous potential
shapes can emerge. Here, we aim to address this research gap by simultaneously
addressing both consistency and diversity. Yet, striking a balance between
these two aspects poses a considerable challenge due to their inherent
trade-offs. This work introduces HarmonyView, a simple yet effective diffusion
sampling technique adept at decomposing two intricate aspects in single-image
3D generation: consistency and diversity. This approach paves the way for a
more nuanced exploration of the two critical dimensions within the sampling
process. Moreover, we propose a new evaluation metric based on CLIP image and
text encoders to comprehensively assess the diversity of the generated views,
which closely aligns with human evaluators' judgments. In experiments,
HarmonyView achieves a harmonious balance, demonstrating a win-win scenario in
both consistency and diversity. |
HarmonyView, a novel diffusion sampling technique for single-image 3D generation, balances multi-view consistency and novel-view diversity. |
Balancing consistency and diversity is crucial for high-quality 3D generation from single images, but existing methods struggle to optimize both aspects effectively. |
HarmonyView decomposes the diffusion sampling process using two implicit classifiers to guide visual consistency with the input view and diversity in novel views, achieving a harmonious balance. |
HarmonyView outperforms state-of-the-art methods in novel-view synthesis and 3D reconstruction tasks across quantitative metrics.
HarmonyView generates high-quality, coherent 3D meshes even for complex objects and scenes.
A newly proposed metric, CD score, effectively quantifies novel-view diversity and aligns well with human evaluator judgments. |
Completely eliminating the trade-off between consistency and diversity remains a challenge.
Expanding HarmonyView to handle multi-object scenes with complex backgrounds needs further research. |
3d generation, diffusion models, multi-view consistency, novel-view diversity, single-image 3d reconstruction |
2312.15905
Report |
Cross Initialization for Personalized Text-to-Image Generation |
Lianyu Pang, Jian Yin, Haoran Xie, Qiping Wang, Qing Li, Xudong Mao |
Recently, there has been a surge in face personalization techniques,
benefiting from the advanced capabilities of pretrained text-to-image diffusion
models. Among these, a notable method is Textual Inversion, which generates
personalized images by inverting given images into textual embeddings. However,
methods based on Textual Inversion still struggle with balancing the trade-off
between reconstruction quality and editability. In this study, we examine this
issue through the lens of initialization. Upon closely examining traditional
initialization methods, we identified a significant disparity between the
initial and learned embeddings in terms of both scale and orientation. The
scale of the learned embedding can be up to 100 times greater than that of the
initial embedding. Such a significant change in the embedding could increase
the risk of overfitting, thereby compromising the editability. Driven by this
observation, we introduce a novel initialization method, termed Cross
Initialization, that significantly narrows the gap between the initial and
learned embeddings. This method not only improves both reconstruction and
editability but also reduces the optimization steps from 5000 to 320.
Furthermore, we apply a regularization term to keep the learned embedding close
to the initial embedding. We show that when combined with Cross Initialization,
this regularization term can effectively improve editability. We provide
comprehensive empirical evidence to demonstrate the superior performance of our
method compared to the baseline methods. Notably, in our experiments, Cross
Initialization is the only method that successfully edits an individual's
facial expression. Additionally, a fast version of our method allows for
capturing an input image in roughly 26 seconds, while surpassing the baseline
methods in terms of both reconstruction and editability. Code will be made
publicly available. |
The paper proposes a new initialization method named "Cross Initialization" for personalized text-to-image generation using diffusion models, specifically addressing the overfitting issue observed in Textual Inversion. |
Textual Inversion, a popular method for personalizing text-to-image generation, often suffers from overfitting, limiting its ability to generate images that accurately reflect both the input concept and user prompt. This paper seeks to solve this issue by improving the initialization of the process. |
The method leverages the observation that learned textual embeddings tend to align with the output of the CLIP text encoder. Thus, it initializes the textual embedding with the output of the text encoder, fed with a mean embedding derived from a set of well-known names. Additionally, a regularization term is used to keep the learned embedding close to the initial embedding during optimization. |
Cross Initialization significantly reduces the optimization time compared to Textual Inversion (from 106 minutes to 6 minutes).
It demonstrates superior performance in both identity preservation and prompt similarity compared to baseline methods like DreamBooth, NeTI, and Celeb Basis.
A fast version of the method allows for learning a new concept in only 26 seconds while surpassing baselines in reconstruction and editability. |
The effectiveness of Cross Initialization for general concepts beyond human faces needs further investigation.
Future work will focus on exploring the applicability of the method to a broader range of concepts. |
text-to-image generation, diffusion models, textual inversion, personalization, cross initialization |
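The initialization and regularization described in the Cross Initialization report can be sketched in a few lines. This is a toy illustration under stated assumptions, not the released code: `text_encoder` is a hypothetical callable standing in for the CLIP text encoder, the embeddings are plain lists of floats, and the regularizer is assumed to be a weighted squared L2 distance to the initial embedding.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mean_embedding(embeddings: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of the token embeddings of a set of well-known names."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    count = len(embeddings)
    return [sum(column) / count for column in zip(*embeddings)]


def cross_initialize(name_embeddings: Sequence[Sequence[float]],
                     text_encoder: Callable[[Vector], Vector]) -> Vector:
    """Initialize with the encoder output for the mean embedding, not the mean itself."""
    return text_encoder(mean_embedding(name_embeddings))


def regularization(embedding: Sequence[float], initial: Sequence[float],
                   weight: float) -> float:
    """Penalty keeping the learned embedding close to its initial value (assumed L2)."""
    return weight * sum((a - b) ** 2 for a, b in zip(embedding, initial))
```

Because the starting point already lies near where learned embeddings tend to end up, the optimization has less distance to cover, which is consistent with the reduction in steps reported above.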
2312.15895
Report |
Semantic-aware SAM for Point-Prompted Instance Segmentation |
Zhaoyang Wei, Pengfei Chen, Xuehui Yu, Guorong Li, Jianbin Jiao, Zhenjun Han |
Single-point annotation in visual tasks, with the goal of minimizing
labelling costs, is becoming increasingly prominent in research. Recently,
visual foundation models, such as Segment Anything (SAM), have gained
widespread usage due to their robust zero-shot capabilities and exceptional
annotation performance. However, SAM's class-agnostic output and high
confidence in local segmentation introduce 'semantic ambiguity', posing a
challenge for precise category-specific segmentation. In this paper, we
introduce a cost-effective category-specific segmenter using SAM. To tackle
this challenge, we have devised a Semantic-Aware Instance Segmentation Network
(SAPNet) that integrates Multiple Instance Learning (MIL) with matching
capability and SAM with point prompts. SAPNet strategically selects the most
representative mask proposals generated by SAM to supervise segmentation, with
a specific focus on object category information. Moreover, we introduce the
Point Distance Guidance and Box Mining Strategy to mitigate inherent
challenges: 'group' and 'local' issues in weakly supervised segmentation. These
strategies serve to further enhance the overall segmentation performance. The
experimental results on Pascal VOC and COCO demonstrate the promising
performance of our proposed SAPNet, emphasizing its semantic matching
capabilities and its potential to advance point-prompted instance segmentation.
The code will be made publicly available. |
This paper introduces SAPNet, a novel end-to-end point-prompted instance segmentation framework that leverages the power of visual foundation models like Segment Anything (SAM) while overcoming their limitations in semantic understanding for precise category-specific segmentation. |
Instance segmentation often requires costly pixel-level annotations. This paper addresses this challenge by proposing a cost-effective category-specific segmenter that utilizes point annotations, significantly reducing annotation costs while maintaining competitive performance compared to fully-supervised methods. |
SAPNet integrates SAM with point prompts and a dual-branch selection mechanism to choose the most semantically representative mask proposals. It introduces Point Distance Guidance (PDG) and a Positive-Negative Proposals Generator (PNPG) to tackle semantic ambiguity and localization errors, further refined by a Box Mining Strategy (BMS). |
SAPNet achieves state-of-the-art performance in Point-Prompted Instance Segmentation (PPIS), significantly outperforming previous methods on COCO and VOC2012 benchmarks.
The proposed method effectively addresses the semantic ambiguity of SAM and the localization challenges in MIL-based selection, leading to high-quality segmentation results.
SAPNet exhibits strong performance even with limited annotation, bridging the gap between point-prompted and fully-supervised instance segmentation techniques. |
The performance of SAPNet might be further improved by exploring different visual backbones or integrating more advanced prompting techniques.
Future work could investigate the generalization ability of SAPNet on other complex datasets and real-world applications with more challenging scenarios. |
instance segmentation, point supervision, weakly supervised learning, visual foundation models, segment anything (sam) |
2312.15770
Report |
A Recipe for Scaling up Text-to-Video Generation with Text-free Videos |
Xiang Wang, Shiwei Zhang, Hangjie Yuan, Zhiwu Qing, Biao Gong, Yingya Zhang, Yujun Shen, Changxin Gao, Nong Sang |
Diffusion-based text-to-video generation has witnessed impressive progress in
the past year yet still falls behind text-to-image generation. One of the key
reasons is the limited scale of publicly available data (e.g., 10M video-text
pairs in WebVid10M vs. 5B image-text pairs in LAION), considering the high cost
of video captioning. Instead, it could be far easier to collect unlabeled clips
from video platforms like YouTube. Motivated by this, we propose a novel
text-to-video generation framework, termed TF-T2V, which can directly learn
from text-free videos. The rationale is to separate the process of text
decoding from that of temporal modeling. To this end, we employ a content
branch and a motion branch, which are jointly optimized with weights shared.
Following such a pipeline, we study the effect of doubling the scale of
the training set (i.e., video-only WebVid10M) with some randomly collected
text-free videos and are encouraged to observe the performance improvement (FID
from 9.67 to 8.19 and FVD from 484 to 441), demonstrating the scalability of
our approach. We also find that our model could enjoy a sustained performance
gain (FID from 8.19 to 7.64 and FVD from 441 to 366) after reintroducing some
text labels for training. Finally, we validate the effectiveness and
generalizability of our approach on both native text-to-video generation and
compositional video synthesis paradigms. Code and models will be publicly
available at https://tf-t2v.github.io/. |
This paper introduces TF-T2V, a novel text-to-video generation framework that can be trained directly on text-free videos by separating temporal and spatial modeling in video diffusion models. |
Current text-to-video generation methods are limited by the scarcity of large-scale video-text datasets. TF-T2V addresses this by leveraging readily available text-free videos, opening possibilities for improved scalability and applicability. |
TF-T2V utilizes a two-branch architecture: a content branch trained on image-text data for spatial appearance and a motion branch trained on text-free videos for temporal dynamics. The model jointly optimizes both branches and incorporates a temporal coherence loss to ensure smooth transitions between frames. |
TF-T2V achieves state-of-the-art performance on text-to-video generation benchmarks, outperforming methods trained on labeled video-text datasets.
Scaling the training set with additional text-free videos leads to consistent performance improvement, demonstrating the method's scalability.
TF-T2V integrates effectively into compositional video synthesis frameworks, enabling control over video generation using depth, sketch, and motion vectors. |
The scaling experiments are limited to doubling the dataset size due to computational constraints, leaving larger-scale scalability unexplored.
The paper primarily focuses on short video generation, with future work aimed at extending the method to long video sequences. |
text-to-video generation, video diffusion models, text-free video learning, compositional video synthesis, temporal coherence |
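The FID and FVD figures quoted in the TF-T2V abstract are easier to compare as relative reductions (lower is better for both metrics). The helper below is a small arithmetic sketch; the function name `relative_reduction` is hypothetical and the input numbers are taken directly from the abstract.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage by which a lower-is-better metric dropped from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# Doubling the training set with text-free videos.
fid_scaling = relative_reduction(9.67, 8.19)   # about 15.3 percent
fvd_scaling = relative_reduction(484, 441)     # about 8.9 percent

# Reintroducing some text labels for training.
fid_labels = relative_reduction(8.19, 7.64)    # about 6.7 percent
fvd_labels = relative_reduction(441, 366)      # about 17.0 percent
```

Read this way, adding text-free videos helps FID more than FVD, while reintroducing text labels helps FVD more than FID.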
2312.15736
Report |
Towards Real-World Blind Face Restoration with Generative Diffusion Prior |
Xiaoxu Chen, Jingfan Tan, Tao Wang, Kaihao Zhang, Wenhan Luo, Xiaochun Cao |
Blind face restoration is an important task in computer vision and has gained
significant attention due to its wide range of applications. Previous works mainly
exploit facial priors to restore face images and have demonstrated high-quality
results. However, generating faithful facial details remains a challenging
problem due to the limited prior knowledge obtained from finite data. In this
work, we delve into the potential of leveraging the pretrained Stable Diffusion
for blind face restoration. We propose BFRffusion, which is
designed to effectively extract features from low-quality face images and can
restore realistic and faithful facial details with the generative prior of the
pretrained Stable Diffusion. In addition, we build a privacy-preserving face
dataset called PFHQ with balanced attributes like race, gender, and age. This
dataset can serve as a viable alternative for training blind face restoration
networks, effectively addressing privacy and bias concerns usually associated
with the real face datasets. Through an extensive series of experiments, we
demonstrate that our BFRffusion achieves state-of-the-art performance on both
synthetic and real-world public testing datasets for blind face restoration and
that our PFHQ dataset is a viable resource for training blind face restoration
networks. The codes, pretrained models, and dataset are released at
https://github.com/chenxx89/BFRffusion. |
This paper proposes BFRffusion, a blind face restoration method leveraging the generative prior of pretrained Stable Diffusion, and introduces PFHQ, a privacy-preserving face dataset with balanced attributes. |
Blind face restoration is essential for various applications but faces challenges in generating faithful details and ethical concerns with real face datasets. |
BFRffusion utilizes a four-module architecture (SDRM, MFEM, TTPM, PDUM) to extract features from low-quality images and guide the restoration process with Stable Diffusion priors. PFHQ is constructed using ControlNet with face parsing maps for image generation and carefully selected for balanced attributes. |
BFRffusion achieves state-of-the-art performance on synthetic and real-world datasets for blind face restoration.
The proposed multi-scale feature extraction module and trainable time-aware prompt module effectively improve restoration quality and efficiency.
The PFHQ dataset yields performance comparable to real face datasets while addressing privacy and bias concerns. |
BFRffusion faces challenges in restoring severely degraded images and handling watermarks.
Future work includes developing low-cost training strategies and exploring more practical synthetic data methods. |
blind face restoration, diffusion models, generative prior, face dataset, privacy-preserving |
2312.15715
Report |
UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces |
Jiannan Wu, Yi Jiang, Bin Yan, Huchuan Lu, Zehuan Yuan, Ping Luo |
The reference-based object segmentation tasks, namely referring image
segmentation (RIS), few-shot image segmentation (FSS), referring video object
segmentation (RVOS), and video object segmentation (VOS), aim to segment a
specific object by utilizing either language or annotated masks as references.
Despite significant progress in each respective field, current methods are
task-specifically designed and developed in different directions, which hinders
the activation of multi-task capabilities for these tasks. In this work, we end
the current fragmented situation and propose UniRef++ to unify the four
reference-based object segmentation tasks with a single architecture. At the
heart of our approach is the proposed UniFusion module which performs
multiway-fusion for handling different tasks with respect to their specified
references. A unified Transformer architecture is then adopted for
achieving instance-level segmentation. With the unified designs, UniRef++ can
be jointly trained on a broad range of benchmarks and can flexibly complete
multiple tasks at run-time by specifying the corresponding references. We
evaluate our unified models on various benchmarks. Extensive experimental
results indicate that our proposed UniRef++ achieves state-of-the-art
performance on RIS and RVOS, and performs competitively on FSS and VOS with a
parameter-shared network. Moreover, we showcase that the proposed UniFusion
module could be easily incorporated into the current advanced foundation model
SAM and obtain satisfactory results with parameter-efficient finetuning. Codes
and models are available at https://github.com/FoundationVision/UniRef. |
UniRef++ is a unified model capable of performing four reference-based object segmentation tasks (RIS, FSS, RVOS, and VOS) with the same model weights. |
Current methods for these tasks are task-specific, requiring separate training and leading to redundant parameters. A unified model promotes synergy between tasks, reduces computational costs, and allows for flexible multi-task execution. |
UniRef++ leverages a UniFusion module to inject reference information (language or mask) into visual features. A unified Transformer architecture then performs instance-level segmentation. The model is trained jointly on datasets across all four tasks. |
Achieves state-of-the-art performance on RIS and RVOS.
Performs competitively on FSS and VOS with a single parameter-shared network.
Demonstrates efficiency for long-term video segmentation. |
Performance on FSS is slightly lower than that of specialized models due to data scale.
Future work includes exploring the combination of UniFusion with other foundation models. |
unified model, reference-based segmentation, referring image segmentation, few-shot segmentation, video object segmentation |
2312.15707
Report |
High-Fidelity Diffusion-based Image Editing |
Chen Hou, Guoqiang Wei, Zhibo Chen |
Diffusion models have attained remarkable success in the domains of image
generation and editing. It is widely recognized that employing more inversion
and denoising steps in diffusion models leads to improved image reconstruction
quality. However, the editing performance of diffusion models tends not to
improve even with increasing denoising steps. The deficiency in
editing could be attributed to the conditional Markovian property of the
editing process, where errors accumulate throughout denoising steps. To tackle
this challenge, we first propose an innovative framework where a rectifier
module is incorporated to modulate diffusion model weights with residual
features, thereby providing compensatory information to bridge the fidelity
gap. Furthermore, we introduce a novel learning paradigm aimed at minimizing
error propagation during the editing process, which trains the editing
procedure in a manner similar to denoising score-matching. Extensive
experiments demonstrate that our proposed framework and training strategy
achieve high-fidelity reconstruction and editing results across various levels
of denoising steps, while exhibiting exceptional performance in terms of both
quantitative metrics and qualitative assessments. Moreover, we explore our
model's generalization through several applications like image-to-image
translation and out-of-domain image editing. |
This paper proposes a novel method to enhance the fidelity of image reconstruction and editing in diffusion models by introducing a rectifier module and a new editing training paradigm. |
Existing diffusion-based editing methods suffer from distortion and low fidelity, particularly with increasing denoising steps, due to error accumulation. |
The method utilizes a hypernetwork-based rectifier to modulate diffusion model weights with residual features, bridging the fidelity gap. It also trains the editing process like denoising score matching, minimizing error propagation during editing. |
The proposed method achieves high-fidelity reconstruction and editing results across various levels of denoising steps.
The rectifier module proves beneficial for other diffusion-based tasks like image-to-image translation.
The method generalizes well to out-of-domain images without requiring fine-tuning. |
The paper mainly focuses on semantic editing, leaving exploration of other editing types for future work.
Some attributes remain challenging to edit due to their low frequency in training data. |
diffusion models, image editing, image reconstruction, fidelity enhancement, score matching |
2312.15681
Report |
Partial Fine-Tuning: A Successor to Full Fine-Tuning for Vision Transformers |
Peng Ye, Yongqi Huang, Chongjun Tu, Minglei Li, Tao Chen, Tong He, Wanli Ouyang |
Fine-tuning pre-trained foundation models has gained significant popularity
in various research fields. Existing methods for fine-tuning can be roughly
divided into two categories, namely Parameter-Efficient Fine-Tuning and
High-Performance Fine-Tuning. The former aims at improving efficiency, while
the latter focuses on enhancing performance. Beyond these methods, we
demonstrate that Partial Fine-Tuning can be an innovative and promising
direction capable of concurrently enhancing both efficiency and accuracy. We
first validate eight manually-defined partial fine-tuning strategies across
various datasets and vision transformer architectures, and find that some
partial fine-tuning strategies (e.g., FFN only or attention only) can achieve
better performance with fewer tuned parameters than full fine-tuning, and
selecting appropriate layers is critical to partial fine-tuning. Thus, we
propose a novel fine-tuned angle metric to guide the selection of appropriate
layers for partial fine-tuning, making it flexible to be adapted to various
scenarios for more practicable partial fine-tuning. Additionally, we show that
partial fine-tuning can serve as a new dimension for Model Soups, improving
both the model performance and generalization with fewer tuned parameters.
Comprehensive experiments on a wide range of datasets and models validate the
great potential of partial fine-tuning. |
This paper explores the potential of partial fine-tuning for improving both the performance and parameter efficiency of pre-trained models, introducing a novel approach called Fine-tuned Angle guided Partial Fine-Tuning (FAPFT). |
Fine-tuning large pre-trained models is computationally expensive. This paper explores how to improve efficiency and achieve better performance than full fine-tuning by selectively fine-tuning parts of the models. |
The paper explores the effectiveness of manually defined partial fine-tuning strategies and then proposes FAPFT, which uses a fine-tuned angle metric to quantify the impact of training on different model layers. FAPFT selects layers with large (challenging datasets) or small (easy datasets) fine-tuned angles for fine-tuning. |
Partial fine-tuning, particularly of specific functional layers (e.g., attention or FFN), can achieve comparable or even better performance than full fine-tuning with fewer parameters.
The position of the fine-tuned layers significantly impacts performance.
FAPFT, guided by the fine-tuned angle metric, outperforms other methods on various datasets (CIFAR-100, ImageNet-1K, FGVC) and architectures (ViT, Swin, ConvNeXt, AS-MLP), demonstrating both high accuracy and parameter efficiency. |
The current FAPFT requires fully fine-tuning the model for several epochs to compute the fine-tuned angle, incurring additional computational costs.
The paper mainly focuses on image classification tasks. |
partial fine-tuning, fine-tuned angle metric, parameter efficiency, model soups, vision transformers |
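The layer-selection step described in the partial fine-tuning report can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the fine-tuned angle of a layer is assumed to be the angle between its flattened pre-trained and fine-tuned weight vectors, and layers are then ranked by that angle. The names `fine_tuned_angle` and `select_layers` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def fine_tuned_angle(w_pre: Sequence[float], w_ft: Sequence[float]) -> float:
    """Angle in radians between flattened pre-trained and fine-tuned weights."""
    dot = sum(a * b for a, b in zip(w_pre, w_ft))
    norm = math.sqrt(sum(a * a for a in w_pre)) * math.sqrt(sum(b * b for b in w_ft))
    if norm == 0.0:
        raise ValueError("weights must be non-zero")
    cosine = max(-1.0, min(1.0, dot / norm))  # guard against rounding outside [-1, 1]
    return math.acos(cosine)


def select_layers(angles: Dict[str, float], k: int, largest: bool = True) -> List[str]:
    """Pick the k layers with the largest (or smallest) fine-tuned angles."""
    ranked = sorted(angles, key=angles.get, reverse=largest)
    return ranked[:k]
```

Following the summary above, `largest=True` would correspond to challenging datasets and `largest=False` to easy ones. Computing the angles still requires a short full fine-tuning run, which is the extra cost noted in the limitations.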
2312.15430
Report |
Make-A-Character: High Quality Text-to-3D Character Generation within Minutes |
Jianqiang Ren, Chao He, Lin Liu, Jiahao Chen, Yutong Wang, Yafei Song, Jianfang Li, Tangli Xue, Siqi Hu, Tao Chen, Kunkun Zheng, Jianjing Xiang, Liefeng Bo |
There is a growing demand for customized and expressive 3D characters with
the emergence of AI agents and Metaverse, but creating 3D characters using
traditional computer graphics tools is a complex and time-consuming task. To
address these challenges, we propose a user-friendly framework named
Make-A-Character (Mach) to create lifelike 3D avatars from text descriptions.
The framework leverages the power of large language and vision models for
textual intention understanding and intermediate image generation, followed by
a series of human-oriented visual perception and 3D generation modules. Our
system offers an intuitive approach for users to craft controllable, realistic,
fully-realized 3D characters that meet their expectations within 2 minutes,
while also enabling easy integration with existing CG pipelines for dynamic
expressiveness. For more information, please visit the project page at
https://human3daigc.github.io/MACH/. |
The paper introduces Mach, a novel text-to-3D character generation framework that leverages LLMs and diffusion models to create realistic, controllable, and animatable 3D avatars from text descriptions. |
The demand for personalized 3D characters is increasing with the rise of the Metaverse and AI agents. However, traditional 3D creation tools are complex and time-consuming. Mach aims to democratize 3D character creation by enabling users to easily generate high-quality avatars using simple text prompts. |
Mach utilizes an LLM (Qwen-14B) to extract facial attributes from the text prompt and generate visual clues. These clues guide Stable Diffusion with ControlNet to create a reference portrait image. Dense landmark detection, triplane-based geometry generation, differentiable rendering, and neural delighting techniques are used to create the final 3D avatar. |
Mach generates high-quality 3D avatars from text descriptions within 2 minutes.
The generated avatars are fully rigged and animatable, supporting various facial expressions.
The framework utilizes an explicit 3D representation, ensuring compatibility with existing CG pipelines. |
The current version primarily focuses on Asian ethnicities due to the training data of the SD model.
The generation of clothes, expressions, and motion from text prompts is still under development. |
text-to-3d, 3d avatar generation, large language models, diffusion models, character animation |
2312.15289
Report |
Wavelet Packet Power Spectrum Kullback-Leibler Divergence: A New Metric for Image Synthesis |
Lokesh Veeramacheneni, Moritz Wolter, Juergen Gall |
Current metrics for generative neural networks are biased towards low
frequencies, specific generators, objects from the ImageNet dataset, and value
texture more than shape. Many current quality metrics do not measure frequency
information directly. In response, we propose a new frequency band-based
quality metric, which opens a door into the frequency domain yet, at the same
time, preserves spatial aspects of the data. Our metric works well even if the
distributions we compare are far from ImageNet or have been produced by
differing generator architectures. We verify the quality of our metric by
sampling a broad selection of generative networks on a wide variety of data
sets. A user study ensures our metric aligns with human perception.
Furthermore, we show that frequency band guidance can improve the frequency
domain fidelity of a current generative network. |
This paper introduces Wavelet Packet Power Spectrum Kullback-Leibler Divergence (WPSKL), a new metric for assessing the quality of image synthesis in generative models. |
Existing metrics like FID and SSIM are biased towards specific datasets, sensitive to irrelevant details, and don't reliably reflect human perception, particularly in the frequency domain. |
The metric leverages the Wavelet Packet Transform (WPT) to capture spatial and frequency information. It computes the KL divergence between normalized wavelet power spectra of real and generated images. |
WPSKL shows better alignment with human perception compared to FID and SSIM in a user study.
Analysis reveals that generative models often struggle to accurately capture high-frequency details, particularly in image backgrounds.
Introducing a wavelet-based loss function during training can improve a model's fidelity in representing frequency information. |
The choice of wavelet function and decomposition level for WPT can impact the metric's results.
Further research is needed to explore WPSKL's applicability to other image generation tasks beyond unconditional synthesis. |
generative models, image synthesis, quality metrics, wavelet packet transform, frequency bias |
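The computation described in the WPSKL record above (a KL divergence between normalized wavelet packet power spectra) can be sketched in a simplified form. This is an illustration under strong assumptions, not the paper's implementation: it uses 1D signals instead of images, a Haar wavelet, and a one-directional KL divergence. All function names are hypothetical, and signal lengths are assumed to be divisible by `2 ** level`.

```python
import math

def haar_packets(signal, level):
    """Full Haar wavelet packet tree: every band is split again at each level."""
    bands = [list(signal)]
    for _ in range(level):
        split = []
        for band in bands:
            pairs = range(0, len(band) - 1, 2)
            low = [(band[i] + band[i + 1]) / math.sqrt(2) for i in pairs]
            high = [(band[i] - band[i + 1]) / math.sqrt(2) for i in pairs]
            split += [low, high]
        bands = split
    return bands

def packet_power(signals, level):
    """Mean power per packet over a set of signals, normalized to sum to 1."""
    power = [0.0] * (2 ** level)
    for s in signals:
        for k, band in enumerate(haar_packets(s, level)):
            power[k] += sum(c * c for c in band) / len(band)
    total = sum(power)  # assumed non-zero
    return [p / total for p in power]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) with a small epsilon to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Toy data: "real" signals are smooth, "generated" signals oscillate.
real = [[1.0] * 8]
generated = [[1.0, -1.0] * 4]
p_real = packet_power(real, level=2)
p_gen = packet_power(generated, level=2)
score = kl_divergence(p_real, p_gen)
```

A score of zero means the two sets distribute their power across frequency bands identically; larger values indicate a larger mismatch.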
2312.15238
Report |
NoPose-NeuS: Jointly Optimizing Camera Poses with Neural Implicit Surfaces for Multi-view Reconstruction |
Mohamed Shawky Sabae, Hoda Anis Baraka, Mayada Mansour Hadhoud |
Learning neural implicit surfaces from volume rendering has become popular
for multi-view reconstruction. Neural surface reconstruction approaches can
recover complex 3D geometry that is difficult for classical Multi-view Stereo
(MVS) approaches, such as non-Lambertian surfaces and thin structures. However,
one key assumption for these methods is knowing accurate camera parameters for
the input multi-view images, which are not always available. In this paper, we
present NoPose-NeuS, a neural implicit surface reconstruction method that
extends NeuS to jointly optimize camera poses with the geometry and color
networks. We encode the camera poses as a multi-layer perceptron (MLP) and
introduce two additional losses, which are multi-view feature consistency and
rendered depth losses, to constrain the learned geometry for better estimated
camera poses and scene surfaces. Extensive experiments on the DTU dataset show
that the proposed method can estimate relatively accurate camera poses, while
maintaining a high surface reconstruction quality with 0.89 mean Chamfer
distance. |
This paper presents NoPose-NeuS, a novel neural implicit surface reconstruction method that extends NeuS to jointly optimize camera poses with the geometry and color networks, enabling 3D reconstruction from multi-view images without assuming accurate camera parameters. |
Estimating accurate camera parameters is crucial but challenging for neural implicit surface reconstruction. Existing methods often assume known camera parameters, limiting their practicality. This work addresses this by enabling camera pose optimization directly within the reconstruction pipeline. |
The method utilizes an MLP for camera pose prediction from camera indices. It introduces multi-view feature consistency and rendered depth losses to refine pose estimation and improve surface reconstruction quality. |
Achieves high surface reconstruction quality comparable to state-of-the-art methods relying on known camera parameters.
Estimates camera poses with high relative accuracy, comparable to classical MVS pipelines.
Demonstrates robustness in handling complex geometries and achieves superior reconstruction quality compared to classical MVS methods. |
The method's performance is sensitive to camera pose initialization.
It assumes a bounded scene, limiting its applicability to unbounded scenarios. Future work could explore relaxing this assumption. |
neural implicit surface reconstruction, camera pose estimation, multi-view stereo, volume rendering, deep learning |
2312.15162
Report |
Cycle-Consistency Learning for Captioning and Grounding |
Ning Wang, Jiajun Deng, Mingbo Jia |
We show that visual grounding and image captioning, which act as two
mutually inverse processes, can be bridged together for collaborative training
by careful designs. By consolidating this idea, we introduce CyCo, a
cyclic-consistent learning framework to ameliorate the independent training
pipelines of visual grounding and image captioning. The proposed framework (1)
allows the semi-weakly supervised training of visual grounding; (2) improves
the performance of fully supervised visual grounding; (3) yields a general
captioning model that can describe arbitrary image regions. Extensive
experiments show that our fully supervised grounding model achieves
state-of-the-art performance, and the semi-weakly supervised one also exhibits
competitive performance compared to the fully supervised counterparts. Our
image captioning model has the capability to freely describe image regions and
meanwhile shows impressive performance on prevalent captioning benchmarks. |
This paper presents CyCo, a novel cyclic-consistent learning framework that bridges visual grounding and image captioning for collaborative training. |
This framework addresses limitations of current visual grounding and image captioning techniques by enabling semi-weakly supervised training, improving fully supervised performance, and allowing for region-specific image descriptions. |
CyCo utilizes a shared visual encoder and distinct Transformer blocks for each task. It employs two cyclic learning processes: grounding-to-captioning (box consistency) and captioning-to-grounding (caption consistency) to enforce mutual supervision. |
CyCo achieves state-of-the-art performance for fully supervised visual grounding.
The semi-weakly supervised CyCo shows competitive results compared to fully supervised counterparts.
CyCo enables region-specific image descriptions, surpassing traditional global captioning models. |
The work primarily uses ViT-B; exploring stronger backbones is left for future work.
Incorporating larger-scale, weakly-labeled datasets can further enhance performance. |
visual grounding, image captioning, cycle-consistency learning, vision-language pre-training, semi-weakly supervised learning |
2312.15043
Report |
GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection |
Haozhan Shen, Tiancheng Zhao, Mingwei Zhu, Jianwei Yin |
Visual grounding, a crucial vision-language task involving the understanding
of the visual context based on the query expression, necessitates the model to
capture the interactions between objects, as well as various spatial and
attribute information. However, annotated data for the visual grounding task is
limited because annotation is time-consuming and labor-intensive,
which constrains trained models from generalizing their
capability to a broader domain. To address this challenge, we propose
GroundVLP, a simple yet effective zero-shot method that harnesses visual
grounding ability from the existing models trained from image-text pairs and
pure object detection data, both of which are more conveniently obtainable and
offer a broader domain compared to visual grounding annotation data. GroundVLP
proposes a fusion mechanism that combines the heatmap from GradCAM and the
object proposals of open-vocabulary detectors. We demonstrate that the proposed
method significantly outperforms other zero-shot methods on RefCOCO/+/g
datasets, surpassing prior zero-shot state-of-the-art by approximately 28% on
the test split of RefCOCO and RefCOCO+. Furthermore, GroundVLP performs
comparably to or even better than some non-VLP-based supervised models on the
Flickr30k entities dataset. Our code is available at
https://github.com/om-ai-lab/GroundVLP. |
This paper proposes GroundVLP, a novel zero-shot method for visual grounding tasks (both Referring Expression Comprehension (REC) and phrase grounding) by leveraging the semantic understanding of Vision-Language Pre-training (VLP) models and the object detection capabilities of Open-Vocabulary object Detectors (OVD). |
Visual grounding datasets are limited due to their complex annotation process. GroundVLP addresses this challenge by leveraging readily available image-text pair and object detection data, eliminating the need for task-specific visual grounding annotations. |
GroundVLP utilizes GradCAM on a VLP model to generate a heatmap highlighting image regions relevant to the given expression. It then employs an OVD to detect candidate objects belonging to a predetermined category (either ground-truth or predicted). Finally, a weighted grade fusion mechanism combines the heatmap and object proposals to pinpoint the target object. |
GroundVLP significantly outperforms existing zero-shot methods on RefCOCO/+/g datasets for REC, surpassing previous state-of-the-art by approximately 28%.
It achieves comparable or even better performance than some non-VLP-based supervised models on the Flickr30k entities dataset for phrase grounding.
Ablation studies validate the effectiveness of each component in GroundVLP, highlighting the importance of the weighted grade fusion and visual word attention aggregation. |
The performance of GroundVLP can be affected by the inherent biases and noise present in the datasets used, especially when relying on predicted object categories.
GroundVLP may inherit potential biases from the foundational VLP and OVD models. |
visual grounding, zero-shot learning, vision-language pre-training, open-vocabulary object detection, gradcam |
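The fusion step in the GroundVLP record above can be illustrated with a minimal sketch. It assumes each candidate box is scored by summing the heatmap values inside it and dividing by the box area raised to a power `alpha`, so that large boxes are not favored simply for covering more pixels. The exact weighting, the default `alpha=0.5`, the box format, and the function names are assumptions for illustration only.

```python
def weighted_grade(heatmap, box, alpha=0.5):
    """Score a box by the heatmap mass it contains, discounted by its area.

    box is (x0, y0, x1, y1) in pixel indices, with x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = box
    mass = sum(heatmap[y][x] for y in range(y0, y1) for x in range(x0, x1))
    area = (x1 - x0) * (y1 - y0)
    return mass / area ** alpha

def ground(heatmap, proposals, alpha=0.5):
    """Return the proposal with the highest weighted grade."""
    return max(proposals, key=lambda box: weighted_grade(heatmap, box, alpha))

# Toy 4x4 heatmap with activation concentrated in the top-left 2x2 corner.
heatmap = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
proposals = [(0, 0, 2, 2), (0, 0, 4, 4), (2, 2, 4, 4)]
best = ground(heatmap, proposals)
```

With `alpha=0`, the score reduces to the raw heatmap mass and the full-image box ties with the tight box; a positive `alpha` breaks that tie in favor of the tighter box.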
2312.14988
Report |
Emage: Non-Autoregressive Text-to-Image Generation |
Zhangyin Feng, Runyi Hu, Liangxin Liu, Fan Zhang, Duyu Tang, Yong Dai, Xiaocheng Feng, Jiwei Li, Bing Qin, Shuming Shi |
Autoregressive and diffusion models drive the recent breakthroughs on
text-to-image generation. Despite their huge success in generating
highly realistic images, a common shortcoming of these models is their high
inference latency - autoregressive models run more than a thousand times
successively to produce image tokens and diffusion models convert Gaussian
noise into images with many hundreds of denoising steps. In this work, we
explore non-autoregressive text-to-image models that efficiently generate
hundreds of image tokens in parallel. We develop many model variations with
different learning and inference strategies, initialized text encoders, etc.
Compared with autoregressive baselines that need to run one thousand times,
our model only runs 16 times to generate images of competitive quality with an
order of magnitude lower inference latency. Our non-autoregressive model with
346M parameters generates a 256×256 image in about one second on
one V100 GPU. |
This paper presents Emage, a non-autoregressive model for text-to-image generation that significantly reduces inference latency compared to autoregressive and diffusion models. |
Existing text-to-image generation models, while producing high-quality images, suffer from high inference latency due to their autoregressive or iterative nature. Emage addresses this by generating image tokens in parallel. |
The authors explore several non-autoregressive model variations, including fully parallel and iterative approaches. They utilize techniques like mask prediction, iterative refinement, and a CLIP-initialized text encoder to generate image tokens efficiently. |
Fully non-autoregressive models struggle to converge during training due to the long sequence length of image tokens.
Iterative non-autoregressive models, particularly one that revises previous predictions and predicts new tokens simultaneously, achieve competitive image quality with significantly lower latency.
Emage (346M parameters) generates images in about one second on a V100 GPU, achieving an order of magnitude speedup compared to autoregressive baselines. |
The performance gap between CLIP and larger text encoders needs further investigation.
Generating high-quality human faces remains challenging and requires further model scaling and data improvements. |
text-to-image generation, non-autoregressive models, image generation, clip, vqgan |
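The iterative non-autoregressive decoding mentioned in the Emage record above can be illustrated with a generic mask-predict loop. This is a sketch of the general technique, not the paper's exact procedure: every step re-predicts all positions in parallel, keeps the most confident predictions, and re-masks the rest on a linearly decaying schedule. The `predict` callable stands in for a trained model and is hypothetical, as are the schedule and the `MASK` sentinel.

```python
MASK = -1  # sentinel for a not-yet-decided image token

def mask_predict(predict, length, steps):
    """Decode `length` tokens in `steps` parallel passes.

    predict(tokens) must return one (token, confidence) pair per position.
    """
    tokens = [MASK] * length
    for t in range(steps):
        guesses = predict(tokens)
        # Number of positions left masked after this pass (reaches 0 at the end).
        n_mask = int(length * (steps - t - 1) / steps)
        by_confidence = sorted(range(length), key=lambda i: guesses[i][1])
        remask = set(by_confidence[:n_mask])
        tokens = [MASK if i in remask else guesses[i][0] for i in range(length)]
    return tokens

# Stand-in "model": deterministic tokens, confidence grows with position.
calls = []

def dummy_predict(tokens):
    calls.append(list(tokens))
    n = len(tokens)
    return [(i % 3, (i + 1) / n) for i in range(n)]

decoded = mask_predict(dummy_predict, length=8, steps=4)
```

Because all positions are re-predicted on every pass, earlier predictions can be revised while new tokens are filled in, and the number of model calls equals `steps` rather than the sequence length.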
2312.14985
Report |
UniHuman: A Unified Model for Editing Human Images in the Wild |
Nannan Li, Qing Liu, Krishna Kumar Singh, Yilin Wang, Jianming Zhang, Bryan A. Plummer, Zhe Lin |
Human image editing includes tasks like changing a person's pose, their
clothing, or editing the image according to a text prompt. However, prior work
often tackles these tasks separately, overlooking the benefit of mutual
reinforcement from learning them jointly. In this paper, we propose UniHuman, a
unified model that addresses multiple facets of human image editing in
real-world settings. To enhance the model's generation quality and
generalization capacity, we leverage guidance from human visual encoders and
introduce a lightweight pose-warping module that can exploit different pose
representations, accommodating unseen textures and patterns. Furthermore, to
bridge the disparity between existing human editing benchmarks with real-world
data, we curated 400K high-quality human image-text pairs for training and
collected 2K human images for out-of-domain testing, both encompassing diverse
clothing styles, backgrounds, and age groups. Experiments on both in-domain and
out-of-domain test sets demonstrate that UniHuman outperforms task-specific
models by a significant margin. In user studies, UniHuman is preferred by the
users in an average of 77% of cases. Our project is available at
https://github.com/NannanLi999/UniHuman. |
This paper proposes UniHuman, a unified model addressing multiple human image editing tasks in real-world settings, such as reposing, virtual try-on, and text-guided manipulation, by leveraging synergies between these tasks. |
Existing methods often tackle these tasks in isolation, overlooking the benefits of learning them jointly and neglecting the adaptability to unseen human-in-the-wild cases. |
The UniHuman model employs human visual encoders for texture and style guidance and introduces a novel pose-warping module to ensure texture consistency across different tasks. It leverages both dense and sparse pose representations, making it robust to unseen textures. The authors also curated a large-scale dataset (LH-400K) with diverse human images to improve generalization. |
UniHuman significantly outperforms task-specific models on both in-domain and out-of-domain datasets, demonstrating its strong generalization capability.
The model effectively transfers textures and preserves clothing identities, even for complex patterns and challenging poses.
User studies confirm UniHuman's superiority, with users preferring its results in an average of 77% of cases. |
The performance depends on the accuracy of pose detectors and parsing models, which can be challenging for complex poses.
Future work will explore incorporating 3D information, such as depth and surface normal, to enhance accuracy and address limitations of existing methods. |
human image editing, virtual try-on, reposing, text-guided manipulation, pose warping |
2312.14923
Report |
Fast-NTK: Parameter-Efficient Unlearning for Large-Scale Models |
Guihong Li, Hsiang Hsu, Chun-Fu Chen, Radu Marculescu |
The rapid growth of machine learning has spurred legislative initiatives such
as "the Right to be Forgotten," allowing users to request data removal. In
response, "machine unlearning" proposes the selective removal of unwanted
data without the need for retraining from scratch. While the
Neural-Tangent-Kernel-based (NTK-based) unlearning method excels in
performance, it suffers from significant computational complexity, especially
for large-scale models and datasets. Our work introduces "Fast-NTK," a novel
NTK-based unlearning algorithm that significantly reduces the computational
complexity by incorporating parameter-efficient fine-tuning methods, such as
fine-tuning batch normalization layers in a CNN or visual prompts in a vision
transformer. Our experimental results demonstrate scalability to much larger
neural networks and datasets (e.g., 88M parameters; 5k images), surpassing the
limitations of previous full-model NTK-based approaches designed for smaller
cases (e.g., 8M parameters; 500 images). Notably, our approach maintains a
performance comparable to the traditional method of retraining on the retain
set alone. Fast-NTK can thus enable practical and scalable NTK-based
unlearning in deep neural networks. |
This paper introduces "Fast-NTK," a novel algorithm for machine unlearning in large-scale models that combines parameter-efficient fine-tuning methods with Neural-Tangent-Kernel-based unlearning. |
Existing NTK-based unlearning methods, while effective, struggle with high computational complexity, limiting their application to small-scale models and datasets. Fast-NTK addresses this limitation by significantly reducing the number of parameters involved in the unlearning process. |
Fast-NTK selectively fine-tunes and applies NTK-based unlearning to only a subset of crucial model parameters. For CNNs, it focuses on batch normalization layers, while for ViTs, it utilizes prompts. |
Fast-NTK exhibits performance comparable to retraining from scratch on the retain set, effectively removing the influence of forget samples.
The method significantly reduces the number of parameters involved in fine-tuning and unlearning, enabling its application to larger models and datasets.
Fast-NTK scales to vision transformers and larger datasets, unlike previous NTK-based approaches that were limited to smaller networks and datasets. |
The current implementation relies on exact NTK matrix computations, limiting its efficiency. Exploring approximate computation methods could further improve scalability.
The reliance on pre-trained models introduces risks, as these models may possess prior knowledge of classes to be unlearned, necessitating further investigation into the relationship between pre-training and unlearning. |
machine unlearning, neural tangent kernel, parameter-efficient fine-tuning, deep neural networks, privacy |
2312.14871
Report |
BrainVis: Exploring the Bridge between Brain and Visual Signals via Image Reconstruction |
Honghao Fu, Zhiqi Shen, Jing Jih Chin, Hao Wang |
Analyzing and reconstructing visual stimuli from brain signals effectively
advances understanding of the human visual system. However, the EEG signals are
complex and contain a considerable amount of noise. This leads to substantial limitations in
existing works of visual stimuli reconstruction from EEG, such as difficulties
in aligning EEG embeddings with the fine-grained semantic information and a
heavy reliance on additional large self-collected dataset for training. To
address these challenges, we propose a novel approach called BrainVis. Firstly,
we divide the EEG signals into various units and apply a self-supervised
approach on them to obtain EEG time-domain features, in an attempt to ease the
training difficulty. Additionally, we also propose to utilize the
frequency-domain features to enhance the EEG representations. Then, we
simultaneously align EEG time-frequency embeddings with the interpolation of
the coarse and fine-grained semantics in the CLIP space, to highlight the
primary visual components and reduce the cross-modal alignment difficulty.
Finally, we adopt the cascaded diffusion models to reconstruct images. Our
proposed BrainVis outperforms the state of the art in both semantic fidelity
reconstruction and generation quality. Notably, we reduce the training data
scale to 10% of the previous work. |
This paper presents BrainVis, a novel pipeline for reconstructing images from EEG signals that uses self-supervised learning for time-domain features, an LSTM for frequency-domain features, and cascaded diffusion models for image generation. |
Analyzing visual stimuli reconstruction from brain signals advances the understanding of the human visual system, but existing EEG-based methods have limitations in aligning EEG embeddings with semantic information and rely on large datasets. |
EEG signals are divided into units for self-supervised time-domain feature extraction, LSTM is used for frequency-domain feature extraction, and a cross-modal alignment network aligns EEG features with interpolated CLIP embeddings for image reconstruction using cascaded diffusion models. |
BrainVis outperforms state-of-the-art methods in semantic reconstruction and generation quality.
The method eliminates reliance on additional large-scale datasets.
Analysis suggests that visual information in EEG might prioritize fundamental object properties over specific categories. |
The study primarily focuses on semantic level reconstruction and not pixel-level accuracy.
Further research can explore decoding individual visual properties like color and shape from EEG. |
eeg, image reconstruction, brain-computer interface, deep learning, clip |
2312.14867
Report |
VIEScore: Towards Explainable Metrics for Conditional Image Synthesis Evaluation |
Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, Wenhu Chen |
In the rapidly advancing field of conditional image generation research,
challenges such as limited explainability lie in effectively evaluating the
performance and capabilities of various models. This paper introduces VIESCORE,
a Visual Instruction-guided Explainable metric for evaluating any conditional
image generation tasks. VIESCORE leverages general knowledge from Multimodal
Large Language Models (MLLMs) as the backbone and does not require training or
fine-tuning. We evaluate VIESCORE on seven prominent conditional image
generation tasks and find: (1) VIESCORE (GPT4-v) achieves a high Spearman correlation of
0.3 with human evaluations, while the human-to-human correlation is 0.45. (2)
VIESCORE (with open-source MLLM) is significantly weaker than GPT-4v in
evaluating synthetic images. (3) VIESCORE achieves a correlation on par with
human ratings in the generation tasks but struggles in editing tasks. With
these results, we believe VIESCORE shows its great potential to replace human
judges in evaluating image synthesis tasks. |
This paper introduces VIEScore, a Visual Instruction-guided Explainable metric for evaluating conditional image generation tasks using Multimodal Large Language Models (MLLMs) without training. |
Evaluating AI-synthesized images is challenging due to limitations of existing metrics and the subjectivity and scalability issues of human evaluation. VIEScore aims to address these gaps. |
VIEScore leverages MLLMs to evaluate images based on instructions and provide a rationale for their scores. The authors tested VIEScore across seven image synthesis tasks using the ImagenHub benchmark and compared its performance with human evaluations and existing automatic metrics. |
VIEScore (GPT-4v) achieves high correlation with human evaluations, outperforming other MLLMs and automatic metrics in most tasks.
Open-source MLLMs perform significantly weaker than GPT-4v in evaluating synthetic images.
MLLMs struggle to capture nuances in edited images, highlighting a challenge in evaluating image editing tasks. |
OpenAI's security and privacy policy limits evaluation of images resembling real persons.
Future work focuses on investigating distillation models to replicate human-like evaluation performance. |
image generation, image evaluation, multimodal large language models, explainable ai, viescore |
2312.14828
Report |
Plan, Posture and Go: Towards Open-World Text-to-Motion Generation |
Jinpeng Liu, Wenxun Dai, Chunyu Wang, Yiji Cheng, Yansong Tang, Xin Tong |
Conventional text-to-motion generation methods are usually trained on limited
text-motion pairs, making them hard to generalize to open-world scenarios. Some
works use the CLIP model to align the motion space and the text space, aiming
to enable motion generation from natural language motion descriptions. However,
they are still constrained to generate limited and unrealistic in-place
motions. To address these issues, we present a divide-and-conquer framework
named PRO-Motion, which consists of three modules: motion planner,
posture-diffuser and go-diffuser. The motion planner instructs Large Language
Models (LLMs) to generate a sequence of scripts describing the key postures in
the target motion. Differing from natural languages, the scripts can describe
all possible postures following very simple text templates. This significantly
reduces the complexity of posture-diffuser, which transforms a script to a
posture, paving the way for open-world generation. Finally, go-diffuser,
implemented as another diffusion model, estimates whole-body translations and
rotations for all postures, resulting in realistic motions. Experimental
results have shown the superiority of our method over its counterparts, and
demonstrated its capability of generating diverse and realistic motions from
complex open-world prompts such as "Experiencing a profound sense of joy". The
project page is available at https://moonsliu.github.io/Pro-Motion. |
This paper presents PRO-Motion, a novel framework for open-world text-to-motion generation, addressing the limitations of conventional methods that struggle to generalize beyond limited text-motion paired datasets. |
This work is important because it allows for the generation of diverse and realistic motions from open-world text prompts, a task previously challenging due to the limitations of existing datasets and models. |
The PRO-Motion framework utilizes a divide-and-conquer approach, consisting of three modules: 1) a motion planner that leverages Large Language Models (LLMs) to translate complex text descriptions into a sequence of posture scripts; 2) a posture-diffuser, a diffusion-based model, that generates key poses aligning with the scripts; and 3) a go-diffuser, another diffusion model, that predicts whole-body translations and rotations for smooth and realistic motion generation. |
PRO-Motion demonstrates superior performance compared to state-of-the-art methods in open-world text-to-motion generation, as evidenced by quantitative metrics such as R-precision, FID, and Multimodal Distance.
The posture-diffuser module effectively generates precise poses from localized body part descriptions, surpassing existing methods in preserving textual information and handling diverse motion descriptions.
The go-diffuser module successfully predicts spatial information (translation and rotation) from local pose sequences, outperforming baseline methods and achieving state-of-the-art results in Average Positional Error and Average Variance Error. |
The reliance on LLMs for motion planning introduces a dependency on their capabilities and potential biases, which might require further investigation.
The current implementation primarily focuses on generating motions with a fixed number of frames. Future work could explore extending it to handle variable-length motion sequences. |
text-to-motion, open-world generation, diffusion models, large language models, pose generation |
2312.14733
Report |
Harnessing Diffusion Models for Visual Perception with Meta Prompts |
Qiang Wan, Zilong Huang, Bingyi Kang, Jiashi Feng, Li Zhang |
The issue of generative pretraining for vision models has persisted as a
long-standing conundrum. At present, the text-to-image (T2I) diffusion model
demonstrates remarkable proficiency in generating high-definition images
matching textual inputs, a feat made possible through its pre-training on
large-scale image-text pairs. This leads to a natural inquiry: can diffusion
models be utilized to tackle visual perception tasks? In this paper, we propose
a simple yet effective scheme to harness a diffusion model for visual
perception tasks. Our key insight is to introduce learnable embeddings (meta
prompts) to the pre-trained diffusion models to extract proper features for
perception. The effect of meta prompts is two-fold. First, as a direct
replacement of the text embeddings in the T2I models, they can activate
task-relevant features during feature extraction. Second, they are used to
re-arrange the extracted features to ensure that the model focuses on the most
pertinent features for the task at hand. Additionally, we design a recurrent
refinement training strategy that fully leverages the property of diffusion
models, thereby yielding stronger visual features. Extensive experiments across
various benchmarks validate the effectiveness of our approach. Our approach
achieves new performance records in depth estimation tasks on NYU depth V2 and
KITTI, and in semantic segmentation task on CityScapes. Concurrently, the
proposed method attains results comparable to the current state-of-the-art in
semantic segmentation on ADE20K and pose estimation on COCO datasets, further
exemplifying its robustness and versatility. |
This paper presents a novel adaptation method for employing text-to-image diffusion models in visual perception tasks by introducing learnable embeddings, termed 'meta prompts,' for enhanced feature extraction and recurrent refinement training. |
Adapting powerful generative diffusion models for perception tasks is a promising direction, but existing methods struggle with complex prompt interfaces. This paper introduces a streamlined approach using meta prompts, eliminating the need for external text inputs or pre-trained text encoders. |
The method uses a pre-trained text-to-image diffusion model. An input image is encoded into latent space and fed to the model. Learnable meta prompts, instead of text embeddings, are used to activate task-relevant features through cross-attention. These prompts further rearrange multi-scale features. A recurrent refinement strategy with modulated timestep embeddings allows for iterative feature enhancement. Finally, a task-specific decoder generates the prediction. |
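One way to picture the feature rearrangement step is as a dot product between each meta prompt and the feature vector at every spatial location, giving one filtered map per prompt. The sketch below is illustrative only: the dot-product operator, the function name `rearrange_features`, and the toy shapes are assumptions, since the summary states only that the prompts rearrange multi-scale features.

```python
# Illustrative sketch; plain nested lists stand in for tensors.

def rearrange_features(features, meta_prompts):
    """Dot each meta prompt with the feature vector at every pixel.

    features:     H x W grid of C-dimensional vectors (nested lists)
    meta_prompts: N learnable C-dimensional vectors
    returns:      N maps of size H x W, one per meta prompt
    """
    maps = []
    for prompt in meta_prompts:
        prompt_map = [
            [sum(p * f for p, f in zip(prompt, pixel)) for pixel in row]
            for row in features
        ]
        maps.append(prompt_map)
    return maps


# Toy 2x2 feature map with C = 2 channels and N = 2 meta prompts.
features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [0.0, 0.0]],
]
meta_prompts = [[1.0, 0.0], [0.0, 2.0]]
maps = rearrange_features(features, meta_prompts)
```

Each output map responds most strongly where the features align with its prompt, which is one plausible reading of how the prompts focus the decoder on task-relevant content.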
Achieves state-of-the-art depth estimation performance on NYU Depth V2 and KITTI datasets.
Sets a new benchmark for semantic segmentation on the CityScapes dataset.
Achieves competitive results in semantic segmentation on ADE20K and pose estimation on COCO, showing robustness and versatility. |
The number of meta prompts needs to be optimized for each specific task.
Further research on extending the method to other visual perception tasks beyond those tested is warranted. |
diffusion models, visual perception, meta prompts, recurrent refinement, feature extraction |
2312.14611
Report |
Tuning-Free Inversion-Enhanced Control for Consistent Image Editing |
Xiaoyue Duan, Shuhao Cui, Guoliang Kang, Baochang Zhang, Zhengcong Fei, Mingyuan Fan, Junshi Huang |
Consistent editing of real images is a challenging task, as it requires
performing non-rigid edits (e.g., changing postures) to the main objects in the
input image without changing their identity or attributes. To guarantee
consistent attributes, some existing methods fine-tune the entire model or the
textual embedding for structural consistency, but they are time-consuming and
fail to perform non-rigid edits. Other works are tuning-free, but their
performances are weakened by the quality of Denoising Diffusion Implicit Model
(DDIM) reconstruction, which often fails in real-world scenarios. In this
paper, we present a novel approach called Tuning-free Inversion-enhanced
Control (TIC), which directly correlates features from the inversion process
with those from the sampling process to mitigate the inconsistency in DDIM
reconstruction. Specifically, our method effectively obtains inversion features
from the key and value features in the self-attention layers, and enhances the
sampling process by these inversion features, thus achieving accurate
reconstruction and content-consistent editing. To extend the applicability of
our method to general editing scenarios, we also propose a mask-guided
attention concatenation strategy that combines contents from both the inversion
and the naive DDIM editing processes. Experiments show that the proposed method
outperforms previous works in reconstruction and consistent editing, and
produces impressive results in various settings. |
Presents Tuning-free Inversion-enhanced Control (TIC) for consistent editing of real images by mitigating inconsistencies in DDIM reconstruction. |
Consistently editing real images while making non-rigid changes is challenging: fine-tuning-based methods are time-consuming and fail at non-rigid edits, while tuning-free methods are limited by the quality of DDIM reconstruction. |
The authors analyze the reconstruction error in DDIM and introduce TIC, which correlates features from the inversion process with those from the sampling process. A mask-guided attention concatenation strategy balances fidelity and editability, and the method integrates with controllable diffusion models. |
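The core idea of reusing inversion features can be sketched as self-attention in which the sampling branch supplies the queries while the keys and values come from the inversion branch. This is a minimal sketch under that assumption; the helper names are hypothetical and the mask-guided concatenation step is not modelled.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return out


def inversion_enhanced_attention(q_sample, k_inv, v_inv):
    """Sampling-branch queries attend to keys/values cached during inversion."""
    return attention(q_sample, k_inv, v_inv)
```

Because the output is a weighted average of inversion-branch values, the sampled features stay tied to the content of the original image.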
TIC achieves superior reconstruction quality compared to baselines, approaching VAE's upper bound.
Performs non-rigid edits (e.g., posture, expressions) while preserving content consistency in complex scenarios with multiple objects.
Integration with controllable diffusion models and mask-guided attention concatenation extends TIC to general editing, balancing fidelity and new content generation. |
TIC's enhancement strategy is applied from a specific timestep and layer, which might not be optimal for all cases.
Applying TIC to image and video generation remains to be explored. |
consistent image editing, ddim reconstruction, text-guided image editing, diffusion models, controllable image synthesis |
2312.14579
Report |
Environment-Specific People |
Mirela Ostrek, Soubhik Sanyal, Carol O'Sullivan, Michael J. Black, Justus Thies |
Despite significant progress in generative image synthesis and full-body
generation in particular, state-of-the-art methods are either
context-independent, overly reliant on text prompts, or bound to the curated
training datasets, such as fashion images with monotonous backgrounds. Here,
our goal is to generate people in clothing that is semantically appropriate for
a given scene. To this end, we present ESP, a novel method for context-aware
full-body generation, that enables photo-realistic inpainting of people into
existing "in-the-wild" photographs. ESP is conditioned on a 2D pose and
contextual cues that are extracted from the environment photograph and
integrated into the generation process. Our models are trained on a dataset
containing a set of in-the-wild photographs of people covering a wide range of
different environments. The method is analyzed quantitatively and
qualitatively, and we show that ESP outperforms state-of-the-art on the task of
contextual full-body generation. |
This paper presents ESP, a context-aware full-body generation method that inpaints people wearing semantically appropriate clothing into existing photographs. |
Existing methods are context-independent, rely heavily on text prompts, or are limited by curated datasets lacking realistic environment-clothing correlations. ESP addresses these limitations by enabling photorealistic inpainting of people whose attire matches the scene. |
ESP leverages a VAE to extract contextual cues from the environment, feeds these into a StyleGAN-based human parsing map (HPM) generator to predict clothing semantics, and uses an HPM translation module to guide a pre-trained Stable Diffusion model for seamless inpainting. |
ESP successfully generates environment-specific people whose clothing aligns with the input photograph's context.
Quantitative analysis shows that ESP outperforms state-of-the-art methods in terms of contextual appropriateness.
The HPM translation module effectively bridges the semantic gap between binary masks and complex human bodies, enabling high-quality inpainting. |
The current training dataset exhibits biases that need to be addressed through diversification.
Further research can explore finer-grained control over the generated clothing style, potentially incorporating textual prompts. |
image generation, inpainting, context-aware, full-body generation, human parsing maps |
2312.14494
Report |
Revisiting Few-Shot Object Detection with Vision-Language Models |
Anish Madan, Neehar Peri, Shu Kong, Deva Ramanan |
Few-shot object detection (FSOD) benchmarks have advanced techniques for
detecting new categories with limited annotations. Existing benchmarks
repurpose well-established datasets like COCO by partitioning categories into
base and novel classes for pre-training and fine-tuning respectively. However,
these benchmarks do not reflect how FSOD is deployed in practice. Rather than
only pre-training on a small number of base categories, we argue that it is
more practical to fine-tune a foundation model (e.g., a vision-language model
(VLM) pre-trained on web-scale data) for a target domain. Surprisingly, we find
that zero-shot inference from VLMs like GroundingDINO significantly outperforms
the state-of-the-art (48.3 vs. 33.1 AP) on COCO. However, such zero-shot models
can still be misaligned to target concepts of interest. For example, trailers
on the web may be different from trailers in the context of autonomous
vehicles. In this work, we propose Foundational FSOD, a new benchmark protocol
that evaluates detectors pre-trained on any external datasets and fine-tuned on
K-shots per target class. Further, we note that current FSOD benchmarks are
actually federated datasets containing exhaustive annotations for each category
on a subset of the data. We leverage this insight to propose simple strategies
for fine-tuning VLMs with federated losses. We demonstrate the effectiveness of
our approach on LVIS and nuImages, improving over prior work by 5.9 AP. Our
code is available at https://github.com/anishmadan23/foundational_fsod |
This paper proposes Foundational FSOD, a new benchmark for few-shot object detection using vision-language foundation models pre-trained on large-scale datasets. |
Existing FSOD benchmarks are unrealistic because they partition datasets into base and novel classes and do not reflect the use of foundation models in practice. |
The authors leverage the observation that FSOD benchmarks are actually federated datasets and propose simple fine-tuning strategies for VLMs using federated losses and pseudo-negative labels. |
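The intuition behind a federated loss can be sketched as a binary cross-entropy that only counts classes whose presence or absence has been verified for the image, so unannotated classes are not penalized as false positives. This is a simplified illustration, not the authors' implementation; the function name and the per-detection dictionary interface are assumptions.

```python
import math


def federated_bce(logits, positives, verified_negatives):
    """Binary cross-entropy over only the classes annotated for this image.

    logits:             dict mapping class name -> raw score for one detection
    positives:          classes known to be present
    verified_negatives: classes known to be absent
    Classes in neither set are unverified and contribute no loss.
    """
    loss = 0.0
    for cls, logit in logits.items():
        prob = 1.0 / (1.0 + math.exp(-logit))
        if cls in positives:
            loss += -math.log(prob)
        elif cls in verified_negatives:
            loss += -math.log(1.0 - prob)
    return loss


# "bus" is unverified for this image, so its confident score is ignored.
example_loss = federated_bce(
    {"car": 0.0, "trailer": 0.0, "bus": 5.0},
    positives={"car"},
    verified_negatives={"trailer"},
)
```

Pseudo-negative labels, as described in the summary, would extend the `verified_negatives` set with classes a model judges to be absent.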
Zero-shot inference with VLMs outperforms state-of-the-art FSOD methods on COCO.
Fine-tuning VLMs with federated losses and pseudo-negatives further improves performance on LVIS and nuImages.
Fine-tuning with pseudo-negatives approaches the oracle performance of using ground-truth negatives. |
Performance on rare categories is significantly lower than common categories, suggesting VLMs are pre-trained on imbalanced data.
The approach only uses class names as text features, and future work could explore richer textual descriptions for multi-modal alignment. |
few-shot object detection, vision-language models, federated datasets, concept alignment, pseudo-negative labels |
2312.14385
Report |
Generative AI Beyond LLMs: System Implications of Multi-Modal Generation |
Alicia Golden, Samuel Hsia, Fei Sun, Bilge Acun, Basil Hosmer, Yejin Lee, Zachary DeVito, Jeff Johnson, Gu-Yeon Wei, David Brooks, Carole-Jean Wu |
As the development of large-scale Generative AI models evolves beyond text
(1D) generation to include image (2D) and video (3D) generation, processing
spatial and temporal information presents unique challenges to quality,
performance, and efficiency. We present the first work towards understanding
this new system design space for multi-modal text-to-image (TTI) and
text-to-video (TTV) generation models. Current model architecture designs are
bifurcated into 2 categories: Diffusion- and Transformer-based models. Our
systematic performance characterization on a suite of eight representative
TTI/TTV models shows that after state-of-the-art optimization techniques such
as Flash Attention are applied, Convolution accounts for up to 44% of execution
time for Diffusion-based TTI models, while Linear layers consume up to 49% of
execution time for Transformer-based models. We additionally observe that
Diffusion-based TTI models resemble the Prefill stage of LLM inference, and
benefit from 1.1-2.5x greater speedup from Flash Attention than
Transformer-based TTI models that resemble the Decode phase. Since
optimizations designed for LLMs do not map directly onto TTI/TTV models, we
must conduct a thorough characterization of these workloads to gain insights
for new optimization opportunities. In doing so, we define sequence length in
the context of TTI/TTV models and observe sequence length can vary up to 4x in
Diffusion model inference. We additionally observe temporal aspects of TTV
workloads pose unique system bottlenecks, with Temporal Attention accounting
for over 60% of total Attention time. Overall, our in-depth system performance
characterization is a critical first step towards designing efficient and
deployable systems for emerging TTI/TTV workloads. |
This paper provides the first in-depth system characterization of multi-modal text-to-image (TTI) and text-to-video (TTV) generation models, highlighting their unique system properties and performance bottlenecks compared to traditional LLMs. |
As Generative AI evolves beyond text generation towards higher-dimensional data like images and videos, understanding the system implications of TTI/TTV models is crucial for designing efficient and deployable systems for these emerging workloads. This is especially important given their growing usage in industry-scale datacenters. |
The authors systematically characterized the performance of eight representative TTI/TTV models, including diffusion and transformer-based architectures, on NVIDIA A100 GPUs, using tools like PyTorch Profiler and NVIDIA Nsight Compute. They analyzed operator breakdowns, sequence length variations, and the impact of scaling image size and temporal dimensions on system performance. |
After applying Flash Attention, Convolution emerges as the main bottleneck for Diffusion-based TTI models, consuming up to 44% of execution time.
Sequence length in Diffusion models varies significantly during inference, unlike in LLMs, and scales quadratically with image size, which drives memory requirements up sharply (O(L^4)).
Temporal Attention in TTV models poses a unique bottleneck, consuming 2x the execution time of Spatial Attention despite requiring 9x fewer FLOPs, suggesting optimization opportunities. |
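The scaling claim about sequence length can be checked with simple arithmetic. The sketch below assumes a square image, reads L as the image side length, and uses a hypothetical downsampling factor of 8 before attention; none of these specifics are stated in the summary.

```python
def attention_cost(image_side, downsample=8):
    """Return (sequence_length, attention_matrix_entries) for a square image.

    image_side: image height = width in pixels
    downsample: assumed spatial reduction before attention (hypothetical value)
    """
    tokens_per_side = image_side // downsample
    seq_len = tokens_per_side ** 2   # quadratic in the image side length
    attn_entries = seq_len ** 2      # quadratic in the sequence length
    return seq_len, attn_entries


small = attention_cost(512)
large = attention_cost(1024)
```

Doubling the side length from 512 to 1024 multiplies the sequence length by 4 and the attention matrix size by 16, which is the fourth-power growth under these assumptions.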
The analysis focuses on a limited set of open-source TTI/TTV models.
Future work can explore optimization strategies tailored to the identified bottlenecks, such as efficient Convolution algorithms for Diffusion models and memory-efficient Temporal Attention mechanisms for TTV models. |
generative ai, multi-modal, diffusion model, transformer, sequence length, attention |
2312.14239
Report |
PlatoNeRF: 3D Reconstruction in Plato's Cave via Single-View Two-Bounce Lidar |
Tzofi Klinghoffer, Xiaoyu Xiang, Siddharth Somasundaram, Yuchen Fan, Christian Richardt, Ramesh Raskar, Rakesh Ranjan |
3D reconstruction from a single view is challenging because of the ambiguity
from monocular cues and lack of information about occluded regions. Neural
radiance fields (NeRF), while popular for view synthesis and 3D reconstruction,
are typically reliant on multi-view images. Existing methods for single-view 3D
reconstruction with NeRF rely on either data priors to hallucinate views of
occluded regions, which may not be physically accurate, or shadows observed by
RGB cameras, which are difficult to detect in ambient light and low albedo
backgrounds. We propose using time-of-flight data captured by a single-photon
avalanche diode to overcome these limitations. Our method models two-bounce
optical paths with NeRF, using lidar transient data for supervision. By
leveraging the advantages of both NeRF and two-bounce light measured by lidar,
we demonstrate that we can reconstruct visible and occluded geometry without
data priors or reliance on controlled ambient lighting or scene albedo. In
addition, we demonstrate improved generalization under practical constraints on
sensor spatial- and temporal-resolution. We believe our method is a promising
direction as single-photon lidars become ubiquitous on consumer devices, such
as phones, tablets, and headsets. |
PlatoNeRF reconstructs 3D scenes from a single viewpoint using time-of-flight data from a single-photon lidar, exploiting two-bounce light to infer geometry of both visible and occluded regions. |
Single-view 3D reconstruction with NeRF typically relies on data priors for hallucination or shadows from RGB images, both limited in accuracy. PlatoNeRF leverages physically-accurate lidar measurements for enhanced reconstruction. |
The method models two-bounce optical paths with NeRF, supervised by lidar transients. Primary rays determine depth and secondary rays determine shadowing. A combined distance and shadow loss function optimizes the NeRF model. |
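The timing of a two-bounce path follows from basic geometry: light travels from the laser to a first surface point, scatters to a second point, and returns to the sensor. The sketch below computes that time of flight; the function name and the co-located laser and sensor in the example are assumptions for illustration, and the NeRF and loss terms are not modelled.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def two_bounce_time(laser, first_hit, second_hit, sensor):
    """Time of flight along laser -> first_hit -> second_hit -> sensor."""
    path_length = (
        math.dist(laser, first_hit)
        + math.dist(first_hit, second_hit)
        + math.dist(second_hit, sensor)
    )
    return path_length / SPEED_OF_LIGHT


# Laser and sensor share the origin; the path is a 3-4-5 right triangle (12 m).
t = two_bounce_time((0, 0, 0), (3, 0, 0), (3, 4, 0), (0, 0, 0))
```

A secondary ray that is blocked between the two hit points produces no return at that time, which is how the absence of a signal can carry information about occluded geometry.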
PlatoNeRF outperforms baseline lidar and RGB-based methods in depth reconstruction accuracy on simulated scenes.
It demonstrates robustness to low spatial- and temporal-resolution, ambient light, and low albedo backgrounds, advantageous for real-world applications.
The method generalizes well to real-world lidar data, achieving competitive results with fewer artifacts than non-learning based methods. |
Current implementation only considers Lambertian reflectance, limiting its applicability to certain materials.
Reliance on vanilla NeRF architecture can lead to occasional floaters in reconstructed geometry. |
single-view 3d reconstruction, neural radiance fields (nerf), single-photon lidar, time-of-flight imaging, two-bounce light |
2312.14238
Report |
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks |
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, Jifeng Dai |
The exponential growth of large language models (LLMs) has opened up numerous
possibilities for multimodal AGI systems. However, the progress in vision and
vision-language foundation models, which are also critical elements of
multi-modal AGI, has not kept pace with LLMs. In this work, we design a
large-scale vision-language foundation model (InternVL), which scales up the
vision foundation model to 6 billion parameters and progressively aligns it
with the LLM, using web-scale image-text data from various sources. This model
can be broadly applied to and achieve state-of-the-art performance on 32
generic visual-linguistic benchmarks including visual perception tasks such as
image-level or pixel-level recognition, vision-language tasks such as zero-shot
image/video classification, zero-shot image/video-text retrieval, and link with
LLMs to create multi-modal dialogue systems. It has powerful visual
capabilities and can be a good alternative to the ViT-22B. We hope that our
research could contribute to the development of multi-modal large models. Code
and models are available at https://github.com/OpenGVLab/InternVL. |
InternVL, a large-scale vision-language foundation model, aligns a scaled-up vision encoder (6 billion parameters) with a large language model (LLM), achieving state-of-the-art results on 32 visual and visual-linguistic tasks. |
Existing vision-language models suffer from disparity in parameter scales between vision and language components, inconsistent representations, and inefficient connection methods, hindering their effectiveness in tasks requiring both vision and language understanding. |
InternVL utilizes a three-stage progressive image-text alignment strategy: 1) contrastive learning on large-scale noisy image-text data, 2) generative learning on fine-grained data, and 3) supervised fine-tuning on instruction data for multi-modal dialogue. |
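The first stage relies on contrastive learning over image-text pairs. A generic image-to-text InfoNCE loss, sketched below, illustrates the idea; this is a textbook formulation rather than InternVL's training code, and the temperature value is an assumption.

```python
import math


def info_nce(image_embs, text_embs, temperature=0.07):
    """Image-to-text InfoNCE loss; text i is the positive for image i."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    loss = 0.0
    for i, img in enumerate(imgs):
        logits = [sum(a * b for a, b in zip(img, t)) / temperature for t in txts]
        m = max(logits)
        # log-sum-exp, shifted by the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(imgs)


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Correctly paired embeddings give a loss near zero, while mismatched pairs are penalized heavily.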
InternVL achieves state-of-the-art performance in zero-shot image classification across various ImageNet variants and ObjectNet, demonstrating robust generalization across different domains.
It exhibits strong multilingual capabilities, outperforming previous methods on multilingual ImageNet-1K and image-text retrieval tasks.
InternVL seamlessly integrates with existing LLMs, enabling effective multi-modal dialogue capabilities with superior performance on benchmarks like MME and POPE. |
The study primarily focuses on public data sources; future work could explore the impact of incorporating private datasets.
While InternVL excels in many tasks, there's room for further investigation into more specialized visual-linguistic tasks requiring fine-grained understanding. |
vision-language foundation model, multi-modal dialogue, progressive alignment, zero-shot learning, large language models |
2312.14233
Report |
VCoder: Versatile Vision Encoders for Multimodal Large Language Models |
Jitesh Jain, Jianwei Yang, Humphrey Shi |
Humans possess the remarkable skill of Visual Perception, the ability to see
and understand the seen, helping them make sense of the visual world and, in
turn, reason. Multimodal Large Language Models (MLLM) have recently achieved
impressive performance on vision-language tasks ranging from visual
question-answering and image captioning to visual reasoning and image
generation. However, when prompted to identify or count (perceive) the entities
in a given image, existing MLLM systems fail. Working towards developing an
accurate MLLM system for perception and reasoning, we propose using Versatile
vision enCoders (VCoder) as perception eyes for Multimodal LLMs. We feed the
VCoder with perception modalities such as segmentation or depth maps, improving
the MLLM's perception abilities. Secondly, we leverage the images from COCO and
outputs from off-the-shelf vision perception models to create our COCO
Segmentation Text (COST) dataset for training and evaluating MLLMs on the
object perception task. Thirdly, we introduce metrics to assess the object
perception abilities in MLLMs on our COST dataset. Lastly, we provide extensive
experimental evidence proving the VCoder's improved object-level perception
skills over existing Multimodal LLMs, including GPT-4V. We open-source our
dataset, code, and models to promote research. Code is available at
https://github.com/SHI-Labs/VCoder |
This paper introduces Versatile vision enCoders (VCoder) that enhance object perception abilities in Multimodal Large Language Models (MLLMs) by incorporating control inputs like segmentation and depth maps. |
Existing MLLMs, while proficient in complex visual reasoning, often struggle with basic object perception tasks such as accurate object identification and counting. This work aims to bridge this gap by improving the foundational object-level perception skills of MLLMs. |
The authors propose (1) a new dataset named COCO Segmentation Text (COST) designed to train and evaluate MLLMs on object identification and counting tasks; (2) VCoder, an adapter module that processes additional perception modalities as control inputs and integrates them into the MLLM framework; (3) novel metrics - Count Score (CS), Hallucination Score (HS), and Depth Score (DS) - to quantitatively assess the object perception performance of MLLMs. |
VCoder-adapted LLaVA-1.5 outperforms existing open-source MLLMs and GPT-4V on the COST dataset, demonstrating significant improvement in object identification and counting.
Incorporating segmentation maps as control inputs considerably improves the MLLM's ability to perceive both salient and background objects.
VCoder's ability to leverage depth maps as control input leads to substantial enhancement in predicting the order of objects in an image. |
The COST dataset, while a significant contribution, is limited by the object categories present in the COCO dataset and would benefit from expansion to include a wider variety of objects with varying granularity.
The current evaluation metrics rely on one-to-one word matching for calculating scores, requiring manual mapping of synonyms. Exploring methods to overcome this limitation would be beneficial. |
multimodal large language models, object perception, vision-language models, object counting, hallucination |
2312.14232
Report |
Parrot Captions Teach CLIP to Spot Text |
Yiqi Lin, Conghui He, Alex Jinpeng Wang, Bin Wang, Weijia Li, Mike Zheng Shou |
Despite CLIP being the foundation model in numerous vision-language
applications, it suffers from a severe text spotting bias. Such bias
causes CLIP models to `Parrot' the visual text embedded within images while
disregarding the authentic visual semantics. We uncover that in the most
popular image-text dataset LAION-2B, the captions also densely parrot (spell)
the text embedded in images. Our analysis shows that around 50% of images are
embedded with visual text content, and around 30% of caption words appear in
this embedded visual content. Based on such observation, we thoroughly inspect
the different released versions of CLIP models and verify that the visual text
is the dominant factor in measuring the LAION-style image-text similarity for
these models. To examine whether these parrot captions shape the text spotting
bias, we train a series of CLIP models with LAION subsets curated by different
parrot-caption-oriented criteria. We show that training with parrot captions
easily shapes such bias but harms the expected visual-language representation
learning in CLIP models. This suggests that it is urgent to revisit either the
design of CLIP-like models or the existing image-text dataset curation pipeline
built on CLIP score filtering. |
This paper reveals a significant text-spotting bias in CLIP models, where they heavily rely on recognizing text within images instead of understanding true visual semantics. This bias is linked to the prevalence of "parrot captions" in datasets like LAION-2B, which simply describe the text present in the images. |
This finding is crucial as CLIP, being a foundational model in many vision-language applications, might be exhibiting skewed behaviors. This bias can lead to inaccurate interpretations of visual content and hinder the development of robust and fair vision-language models. |
The authors analyze LAION-2B, finding a high correlation between image text and captions. They then conduct experiments by training CLIP models on different subsets of LAION-2B, curated based on the presence and extent of "parrot captions". Additionally, they assess the impact of removing text from images on CLIP's performance. |
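The notion of a parrot caption can be made concrete as the fraction of caption words that also appear in the text spotted inside the image. The sketch below assumes whitespace tokenization and case-insensitive matching; the function name is hypothetical and the authors' actual matching procedure may differ.

```python
def parrot_ratio(caption, spotted_text):
    """Fraction of caption words that also occur in the spotted image text."""
    caption_words = caption.lower().split()
    spotted_words = set(spotted_text.lower().split())
    if not caption_words:
        return 0.0
    hits = sum(1 for word in caption_words if word in spotted_words)
    return hits / len(caption_words)


# Three of the four caption words are copied from the text in the image.
ratio = parrot_ratio("Sale 50% off shoes", "SALE 50% OFF")
```

Thresholding such a ratio is one simple way to build the kind of parrot-caption-oriented subsets the experiments describe.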
Around 50% of images in LAION-2B contain embedded text, with around 30% of caption words directly parroting this text.
Released CLIP models show a strong preference for image-text pairs containing parrot captions, achieving higher similarity scores for such pairs.
Training CLIP models on datasets with a high proportion of parrot captions results in a strong text-spotting bias, negatively impacting their performance on downstream tasks. |
The text spotting model used for analysis might not be perfect, potentially leading to inaccurate estimations of text presence and correlation.
Further investigation is needed to develop more robust and scalable data curation pipelines to mitigate the impact of parrot captions. |
clip, text spotting bias, parrot captions, vision-language models, dataset bias |
2312.14216
Report |
DreamDistribution: Prompt Distribution Learning for Text-to-Image Diffusion Models |
Brian Nlong Zhao, Yuhang Xiao, Jiashu Xu, Xinyang Jiang, Yifan Yang, Dongsheng Li, Laurent Itti, Vibhav Vineet, Yunhao Ge |
The popularization of Text-to-Image (T2I) diffusion models enables the
generation of high-quality images from text descriptions. However, generating
diverse customized images with reference visual attributes remains challenging.
This work focuses on personalizing T2I diffusion models at a more abstract
concept or category level, adapting commonalities from a set of reference
images while creating new instances with sufficient variations. We introduce a
solution that allows a pretrained T2I diffusion model to learn a set of soft
prompts, enabling the generation of novel images by sampling prompts from the
learned distribution. These prompts offer text-guided editing capabilities and
additional flexibility in controlling variation and mixing between multiple
distributions. We also show the adaptability of the learned prompt distribution
to other tasks, such as text-to-3D. Finally, we demonstrate the effectiveness of our
approach through quantitative analysis including automatic evaluation and human
assessment. Project website: https://briannlongzhao.github.io/DreamDistribution |
This paper proposes a method to personalize text-to-image generation using a set of user-provided reference images by learning a distribution of prompts. |
The proposed method enables diverse and personalized image generation while maintaining the text editability of pre-trained text-to-image diffusion models. |
The method learns a distribution of text embedding vectors from multiple learnable text prompts. These prompts are optimized to reconstruct user-provided reference images using a pre-trained diffusion model. |
The learned prompt distribution enables diverse image generation by sampling different prompts from it.
The method allows users to control the generation diversity by scaling the standard deviation of the learned prompt distribution.
Generated images using this method achieve high classification accuracy when used as synthetic training data, outperforming images generated from class names only. |
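Sampling from a prompt distribution and scaling its standard deviation can be sketched with a per-dimension Gaussian. The example below estimates the mean and standard deviation from a fixed set of embedding vectors and is a simplification: in the method summarized above the prompts are optimized against reference images, and the diagonal Gaussian and function names here are assumptions.

```python
import random
import statistics


def fit_prompt_distribution(prompt_embeddings):
    """Per-dimension mean and standard deviation over K prompt embeddings."""
    dims = list(zip(*prompt_embeddings))
    mean = [statistics.fmean(d) for d in dims]
    std = [statistics.pstdev(d) for d in dims]
    return mean, std


def sample_prompt(mean, std, scale=1.0, rng=random):
    """Draw mean + scale * std * eps with eps ~ N(0, 1) per dimension."""
    return [m + scale * s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]


mean, std = fit_prompt_distribution([[0.0, 2.0], [2.0, 2.0]])
collapsed = sample_prompt(mean, std, scale=0.0, rng=random.Random(0))
```

Setting `scale` to 0 collapses every sample onto the mean prompt, while larger values trade fidelity to the references for diversity.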
The number of learnable prompts is a hyperparameter that needs to be manually tuned.
Training the prompt distribution requires a significant amount of time and resources. |
text-to-image generation, personalization, diffusion models, prompt learning, synthetic data |
2312.14198
Report |
ZeroShape: Regression-based Zero-shot Shape Reconstruction |
Zixuan Huang, Stefan Stojanov, Anh Thai, Varun Jampani, James M. Rehg |
We study the problem of single-image zero-shot 3D shape reconstruction.
Recent works learn zero-shot shape reconstruction through generative modeling
of 3D assets, but these models are computationally expensive at train and
inference time. In contrast, the traditional approach to this problem is
regression-based, where deterministic models are trained to directly regress
the object shape. Such regression methods possess much higher computational
efficiency than generative methods. This raises a natural question: is
generative modeling necessary for high performance, or conversely, are
regression-based approaches still competitive? To answer this, we design a
strong regression-based model, called ZeroShape, based on the converging
findings in this field and a novel insight. We also curate a large real-world
evaluation benchmark, with objects from three different real-world 3D datasets.
This evaluation benchmark is more diverse and an order of magnitude larger than
what prior works use to quantitatively evaluate their models, aiming at
reducing the evaluation variance in our field. We show that ZeroShape not only
achieves superior performance over state-of-the-art methods, but also
demonstrates significantly higher computational and data efficiency. |
This paper presents ZeroShape, a regression-based method for zero-shot 3D shape reconstruction from single images, achieving state-of-the-art performance while being computationally efficient. |
Zero-shot 3D shape reconstruction is crucial for various applications like AR and robotics, and current generative approaches, while impressive, suffer from high computational costs. This work explores the effectiveness of a more efficient regression-based approach. |
ZeroShape leverages a novel architecture with three modules: a depth and camera estimator, a geometric unprojection unit, and a projection-guided shape reconstructor. It is trained on a large synthetic dataset with diverse camera poses and lighting conditions. Additionally, a large-scale real-world evaluation benchmark is created to rigorously assess the model's performance. |
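Geometric unprojection can be illustrated with the standard pinhole camera model, which lifts each pixel to a 3D point using its depth and the camera intrinsics. This is a generic sketch rather than ZeroShape's module; the function name and the toy intrinsics are assumptions.

```python
def unproject(depth, fx, fy, cx, cy):
    """Lift a depth map to camera-space 3D points with a pinhole model.

    depth:  H x W nested list of depths along the optical axis
    fx, fy: focal lengths in pixels
    cx, cy: principal point in pixels
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


points = unproject([[2.0, 2.0], [4.0, 4.0]], fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

The result depends on both the depth and the intrinsics, which is consistent with the finding that estimating them jointly matters for the visible surface.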
ZeroShape achieves state-of-the-art zero-shot performance on the proposed benchmark, outperforming existing methods including generative approaches.
The model demonstrates significantly higher computational efficiency compared to generative counterparts, making it more suitable for real-world applications.
Jointly learning depth and camera intrinsics for 3D visible surface estimation is crucial for achieving high accuracy. |
The model's performance with the full Objaverse dataset is yet to be explored due to computational constraints.
The current work does not consider object texture modeling, which could be a promising future direction. |
zero-shot learning, 3d shape reconstruction, single image reconstruction, generative models, computer vision |
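The geometric unprojection step named in the ZeroShape summary above can be illustrated with standard pinhole unprojection, which lifts a depth map to camera-space 3D points given camera intrinsics. This is a minimal sketch under stated assumptions: the function name `unproject_depth` and the parameters `fx`, `fy`, `cx`, `cy` are illustrative and not taken from the paper's code, and the summary does not say whether ZeroShape's unit uses exactly this form.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def unproject_depth(depth: List[List[float]],
                    fx: float, fy: float,
                    cx: float, cy: float) -> List[List[Point]]:
    """Lift an H x W depth map to camera-space 3D points (pinhole model).

    Pixel (u, v) with depth z maps to
    ((u - cx) * z / fx, (v - cy) * z / fy, z).
    All names here are hypothetical, chosen for illustration only.
    """
    points = []
    for v, row in enumerate(depth):      # v indexes image rows (y axis)
        out_row = []
        for u, z in enumerate(row):      # u indexes image columns (x axis)
            out_row.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
        points.append(out_row)
    return points


# Toy 2 x 2 depth map with an assumed focal length and principal point.
pts = unproject_depth([[2.0, 2.0], [4.0, 4.0]], fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

Because the 3D points depend on both the depth values and the intrinsics, an error in either one distorts the recovered surface, which is consistent with the summary's point that the two are learned jointly.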
2312.14140
Report |
HeadCraft: Modeling High-Detail Shape Variations for Animated 3DMMs |
Artem Sevastopolsky, Philip-William Grassal, Simon Giebenhain, ShahRukh Athar, Luisa Verdoliva, Matthias Niessner |
Current advances in human head modeling allow the generation of plausible-looking
3D head models via neural representations. Nevertheless, constructing complete
high-fidelity head models with explicitly controlled animation remains an
issue. Furthermore, completing the head geometry based on a partial
observation, e.g. coming from a depth sensor, while preserving details is often
problematic for the existing methods. We introduce a generative model for
detailed 3D head meshes on top of an articulated 3DMM which allows explicit
animation and high-detail preservation at the same time. Our method is trained
in two stages. First, we register a parametric head model with vertex
displacements to each mesh of the recently introduced NPHM dataset of accurate
3D head scans. The estimated displacements are baked into a hand-crafted UV
layout. Second, we train a StyleGAN model in order to generalize over the UV
maps of displacements. The decomposition of the parametric model and
high-quality vertex displacements allows us to animate the model and modify it
semantically. We demonstrate the results of unconditional generation and
fitting to the full or partial observation. The project page is available at
https://seva100.github.io/headcraft. |
A generative model for highly-detailed 3D human head meshes, built on top of an articulated 3D Morphable Model (3DMM) for explicit animation and detail preservation. |
Existing methods struggle to create high-fidelity, animatable head models, especially when reconstructing from partial observations like depth maps. This work combines the strengths of explicit parametric models and neural representations for detail. |
Two-stage registration of a subdivided FLAME template with vertex displacements to a dataset of 3D head scans. Displacements are baked into UV maps and used to train a StyleGAN2 model for generalization. |
The model generates diverse and highly-detailed heads, surpassing baselines in visual fidelity and quantitative metrics like FID, KID, IS, MMD, JSD, and COV.
The model can be effectively fitted to full or partial point clouds, enabling reconstruction from depth maps.
The use of an underlying 3DMM allows for realistic animation and semantic editing, such as hair transfer. |
The current model lacks an appearance component for color and relighting.
Future work could incorporate a physics-based hair movement model for more realistic animation. |
3d head modeling, generative models, 3d morphable models, stylegan, animation |
2312.14132
Report |
DUSt3R: Geometric 3D Vision Made Easy |
Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, Jerome Revaud |
Multi-view stereo reconstruction (MVS) in the wild requires first estimating
the camera parameters, e.g., intrinsic and extrinsic parameters. These are
usually tedious and cumbersome to obtain, yet they are mandatory to triangulate
corresponding pixels in 3D space, which is the core of all best performing MVS
algorithms. In this work, we take an opposite stance and introduce DUSt3R, a
radically novel paradigm for Dense and Unconstrained Stereo 3D Reconstruction
of arbitrary image collections, i.e. operating without prior information about
camera calibration or viewpoint poses. We cast the pairwise reconstruction
problem as a regression of pointmaps, relaxing the hard constraints of usual
projective camera models. We show that this formulation smoothly unifies the
monocular and binocular reconstruction cases. In the case where more than two
images are provided, we further propose a simple yet effective global alignment
strategy that expresses all pairwise pointmaps in a common reference frame. We
base our network architecture on standard Transformer encoders and decoders,
allowing us to leverage powerful pretrained models. Our formulation directly
provides a 3D model of the scene as well as depth information, but
interestingly, we can also seamlessly recover from it pixel matches and relative
and absolute camera poses. Exhaustive experiments on all these tasks showcase that the
proposed DUSt3R can unify various 3D vision tasks and set new SoTAs on
monocular/multi-view depth estimation as well as relative pose estimation. In
summary, DUSt3R makes many geometric 3D vision tasks easy. |
DUSt3R is a novel end-to-end deep learning approach for dense and unconstrained 3D reconstruction from uncalibrated and unposed image collections. It unifies monocular and multi-view stereo by regressing 3D pointmaps, simplifying the traditional reconstruction pipeline. |
Existing 3D reconstruction methods rely on complex pipelines with multiple independent steps, leading to error accumulation. They also struggle with uncalibrated images and often fail in challenging conditions like low scene views or non-Lambertian surfaces. This work aims to simplify the process and improve robustness. |
The core of DUSt3R is a transformer-based network trained to regress dense pointmaps from image pairs. These pointmaps encode scene geometry, pixel-to-point mapping, and viewpoint relations. A global alignment strategy extends the method to multiple views, aligning pairwise predictions in a common reference frame. |
DUSt3R achieves state-of-the-art results on monocular and multi-view depth estimation benchmarks without requiring ground-truth camera parameters.
It demonstrates superior performance on multi-view camera pose estimation compared to existing learning-based and structure-based methods.
The method produces accurate and consistent dense 3D reconstructions, even in challenging scenarios with uncalibrated images. |
While DUSt3R shows promising results for visual localization with known intrinsics, it faces limitations when intrinsics are unknown, particularly in outdoor scenes with sparse ground-truth.
The regression-based nature of DUSt3R might limit its accuracy in 3D reconstruction compared to methods leveraging explicit camera parameters and sub-pixel triangulation. |
3d reconstruction, uncalibrated images, pointmap regression, transformer networks, multi-view stereo |
2312.14091
Report |
HD-Painter: High-Resolution and Prompt-Faithful Text-Guided Image Inpainting with Diffusion Models |
Hayk Manukyan, Andranik Sargsyan, Barsegh Atanyan, Zhangyang Wang, Shant Navasardyan, Humphrey Shi |
Recent progress in text-guided image inpainting, based on the unprecedented
success of text-to-image diffusion models, has led to exceptionally realistic
and visually plausible results. However, there is still significant potential
for improvement in current text-to-image inpainting models, particularly in
better aligning the inpainted area with user prompts and performing
high-resolution inpainting. Therefore, we introduce HD-Painter, a training-free
approach that accurately follows prompts and coherently scales to
high-resolution image inpainting. To this end, we design the Prompt-Aware
Introverted Attention (PAIntA) layer, which enhances self-attention scores with
prompt information, resulting in better text-aligned generations. To further
improve prompt coherence, we introduce the Reweighting Attention Score Guidance
(RASG) mechanism, which seamlessly integrates a post-hoc sampling strategy into
the general form of DDIM to prevent out-of-distribution latent shifts. Moreover,
HD-Painter allows extension to larger scales by introducing a specialized
super-resolution technique customized for inpainting, enabling the completion
of missing regions in images of up to 2K resolution. Our experiments
demonstrate that HD-Painter surpasses existing state-of-the-art approaches
quantitatively and qualitatively across multiple metrics and a user study. Code
is publicly available at: https://github.com/Picsart-AI-Research/HD-Painter |
Introduces HD-Painter, a training-free approach for text-guided image inpainting that excels in prompt alignment and high-resolution generation. |
Addresses the limitations of existing methods in aligning inpainted content with user prompts, particularly at high resolutions. |
Combines two novel components: Prompt-Aware Introverted Attention (PAIntA) to enhance prompt relevance in self-attention and Reweighting Attention Score Guidance (RASG) for domain-preserving post-hoc guidance. Employs a specialized super-resolution technique for upscaling. |
Outperforms state-of-the-art methods quantitatively across CLIP score, accuracy, aesthetic score, and PickScore.
Demonstrates superior qualitative results, effectively addressing background and nearby object dominance issues.
Enables high-resolution (up to 2048x2048) inpainting with seamless integration of known region details. |
Inherits some quality limitations from the backbone inpainting model, occasionally leading to illogical appearances.
Future work can explore alternative upscaling techniques for further quality improvements in high-resolution inpainting. |
image inpainting, text-guided synthesis, diffusion models, prompt alignment, high-resolution |
2312.13980
Report |
Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning |
Desai Xie, Jiahao Li, Hao Tan, Xin Sun, Zhixin Shu, Yi Zhou, Sai Bi, Sören Pirk, Arie E. Kaufman |
Multi-view diffusion models, obtained by applying Supervised Finetuning (SFT)
to text-to-image diffusion models, have driven recent breakthroughs in
text-to-3D research. However, due to the limited size and quality of existing
3D datasets, they still suffer from multi-view inconsistencies and Neural
Radiance Field (NeRF) reconstruction artifacts. We argue that multi-view
diffusion models can benefit from further Reinforcement Learning Finetuning
(RLFT), which allows models to learn from the data generated by themselves and
improve beyond their dataset limitations during SFT. To this end, we introduce
Carve3D, an improved RLFT algorithm coupled with a novel Multi-view
Reconstruction Consistency (MRC) metric, to enhance the consistency of
multi-view diffusion models. To measure the MRC metric on a set of multi-view
images, we compare them with their corresponding NeRF renderings at the same
camera viewpoints. The resulting model, which we denote as Carve3DM,
demonstrates superior multi-view consistency and NeRF reconstruction quality
compared to existing models. Our results suggest that pairing SFT with Carve3D's RLFT
is essential for developing multi-view-consistent diffusion models, mirroring
the standard Large Language Model (LLM) alignment pipeline. Our code, training
and testing data, and video results are available at:
https://desaixie.github.io/carve-3d. |
Introduces Carve3D, an improved RLFT algorithm paired with a novel Multi-view Reconstruction Consistency (MRC) metric to enhance the consistency of multi-view diffusion models for text-to-3D generation. |
Existing multi-view diffusion models, primarily trained with SFT, suffer from inconsistencies across generated views, leading to artifacts in 3D reconstructions. RLFT offers a way to improve consistency without being limited by the size and quality of existing 3D datasets. |
Develops MRC metric that compares generated multi-view images with renderings from a NeRF reconstructed from those images, using LPIPS for image similarity and bounding box normalization. Employs an improved on-policy DDPO algorithm for RLFT with KL divergence regularization to maintain proximity to the base model. |
Carve3DM, trained with Carve3D, achieves superior multi-view consistency and NeRF reconstruction quality compared to baselines like Instant3D and MVDream.
Carve3DM preserves prompt alignment, diversity, and realistic details of the base model, avoiding the degradation seen in models with prolonged SFT.
User study confirms that Carve3DM generates significantly more 3D-consistent results while maintaining comparable prompt alignment. |
Reconstruction quality is limited by the accuracy of the Sparse View LRM used.
High computational cost of SDXL and DDPO limits further scaling up of data and batch size. |
text-to-3d, multi-view consistency, diffusion models, reinforcement learning finetuning, neural radiance fields |
2312.13964
Report |
PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models |
Yiming Zhang, Zhening Xing, Yanhong Zeng, Youqing Fang, Kai Chen |
Recent advancements in personalized text-to-image (T2I) models have
revolutionized content creation, empowering non-experts to generate stunning
images with unique styles. While promising, adding realistic motions to these
personalized images by text poses significant challenges in preserving distinct
styles and high-fidelity details and in achieving motion controllability by text. In
this paper, we present PIA, a Personalized Image Animator that excels in
aligning with condition images, achieving motion controllability by text, and
maintaining compatibility with various personalized T2I models without specific tuning.
To achieve these goals, PIA builds upon a base T2I model with well-trained
temporal alignment layers, allowing for the seamless transformation of any
personalized T2I model into an image animation model. A key component of PIA is
the introduction of the condition module, which utilizes the condition frame
and inter-frame affinity as input to transfer appearance information guided by
the affinity hint for individual frame synthesis in the latent space. This
design mitigates the challenges of appearance-related image alignment
and allows for a stronger focus on aligning with motion-related guidance. |
PIA, a personalized image animator that turns any personalized text-to-image model into an image animation model, allowing animation of stylized images while preserving their unique features. |
Existing methods struggle to animate personalized images while preserving their distinct styles, high-fidelity details, and achieving motion controllability via text prompts. |
PIA leverages a base T2I model, temporal alignment layers, and a novel condition module. The condition module takes the condition frame and inter-frame affinity as inputs, transferring appearance information to individual frames, thus improving alignment and allowing for better motion control. |
PIA demonstrates superior image alignment and motion controllability compared to state-of-the-art methods on the introduced AnimateBench benchmark.
PIA allows users to control motion magnitude by adjusting the inter-frame affinity.
PIA can even achieve style transfer effects when using models and input images from different domains. |
PIA may exhibit color discrepancies when applied to images with significantly different styles from the training data.
Color inconsistencies can occur if trigger words used in personalized image generation are absent from animation prompts. |
image animation, personalized text-to-image, text-to-video synthesis, motion controllability, style transfer |
2312.13913
Report |
Paint3D: Paint Anything 3D with Lighting-Less Texture Diffusion Models |
Xianfang Zeng, Xin Chen, Zhongqi Qi, Wen Liu, Zibo Zhao, Zhibin Wang, Bin Fu, Yong Liu, Gang Yu |
This paper presents Paint3D, a novel coarse-to-fine generative framework that
is capable of producing high-resolution, lighting-less, and diverse 2K UV
texture maps for untextured 3D meshes conditioned on text or image inputs. The
key challenge addressed is generating high-quality textures without embedded
illumination information, which allows the textures to be relit or
re-edited within modern graphics pipelines. To achieve this, our method first
leverages a pre-trained depth-aware 2D diffusion model to generate
view-conditional images and perform multi-view texture fusion, producing an
initial coarse texture map. However, as 2D models cannot fully represent 3D
shapes or disable lighting effects, the coarse texture map exhibits incomplete
areas and illumination artifacts. To resolve this, we train separate UV
Inpainting and UVHD diffusion models specialized for the shape-aware refinement
of incomplete areas and the removal of illumination artifacts. Through this
coarse-to-fine process, Paint3D can produce high-quality 2K UV textures that
maintain semantic consistency while being lighting-less, significantly
advancing the state-of-the-art in texturing 3D objects. |
The paper introduces Paint3D, a coarse-to-fine framework for generating high-quality, lighting-less 2K UV texture maps for 3D meshes using text or image prompts. |
Existing methods struggle to generate textures that are both high-quality and free from pre-illumination artifacts, limiting their compatibility with traditional rendering pipelines. |
The method utilizes a pre-trained 2D image diffusion model for initial multi-view texture generation and then refines the texture in UV space with specialized diffusion models for inpainting and high-definition enhancement. |
Paint3D outperforms state-of-the-art methods in text-to-texture and image-to-texture generation tasks on both qualitative and quantitative metrics.
The coarse-to-fine strategy effectively combines the strengths of large-scale image generation priors and lighting-less texture refinement.
The use of position maps in UV space guides the diffusion process to produce semantically consistent and visually appealing textures. |
The method can suffer from multi-face issues due to inconsistencies in multi-view images from the pre-trained 2D diffusion model.
Paint3D currently does not generate material maps and cannot manipulate 3D geometry. |
texture synthesis, 3d model, diffusion models, coarse-to-fine, uv mapping |
2312.13834
Report |
Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis |
Bichen Wu, Ching-Yao Chuang, Xiaoyan Wang, Yichen Jia, Kapil Krishnakumar, Tong Xiao, Feng Liang, Licheng Yu, Peter Vajda |
In this paper, we introduce Fairy, a minimalist yet robust adaptation of
image-editing diffusion models, enhancing them for video editing applications.
Our approach centers on the concept of anchor-based cross-frame attention, a
mechanism that implicitly propagates diffusion features across frames, ensuring
superior temporal coherence and high-fidelity synthesis. Fairy not only
addresses limitations of previous models, including memory and processing
speed, but also improves temporal consistency through a unique data augmentation
strategy. This strategy renders the model equivariant to affine transformations
in both source and target images. Remarkably efficient, Fairy generates
120-frame 512x384 videos (4-second duration at 30 FPS) in just 14 seconds,
outpacing prior works by at least 44x. A comprehensive user study, involving
1000 generated samples, confirms that our approach delivers superior quality,
decisively outperforming established methods. |
Fairy is a fast and robust video editing framework adapted from image diffusion models. It leverages anchor-based cross-frame attention for feature propagation, ensuring temporal consistency. |
Existing video editing methods struggle with temporal consistency, especially for complex videos with large motions. This work addresses these limitations, enabling high-quality, efficient video editing. |
The method uses a set of anchor frames to extract diffusion features. Cross-frame attention with these anchor features is applied to subsequent frames, ensuring consistency. The model is fine-tuned using an equivariant strategy with affine transformations for further consistency enhancement. |
Human evaluation of 1000 generated videos confirms superior quality over existing methods like Rerender, TokenFlow, and Gen-1.
Quantitative metrics demonstrate improved temporal consistency and frame-wise editing accuracy compared to baselines.
Fairy achieves significant speedup, being 53x faster than TokenFlow and 44x faster than Rerender when utilizing 8 GPUs. |
The model inherits limitations from the underlying image-editing model, such as difficulties with dynamic visual effects like lightning or flames.
Instructions involving camera motion, like zooming in or out, are not handled effectively. |
video editing, diffusion models, temporal consistency, cross-frame attention, equivariant finetuning |
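The anchor-based cross-frame attention described in the Fairy summary above can be sketched as ordinary scaled dot-product attention in which the queries come from the frame being edited while the keys and values come from anchor frames. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy vectors, and the choice of plain softmax attention are all hypothetical, and the summary does not specify how anchor features are cached or merged.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_frame_attention(queries: List[Vector],
                          anchor_keys: List[Vector],
                          anchor_values: List[Vector]) -> List[Vector]:
    """Each query token of the current frame attends to anchor-frame tokens.

    Hypothetical sketch: keys and values are taken only from anchor frames,
    so every edited frame reads from the same shared set of features.
    """
    scale = 1.0 / math.sqrt(len(anchor_keys[0]))
    value_dim = len(anchor_values[0])
    outputs = []
    for q in queries:
        # Similarity between this query and every anchor key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in anchor_keys]
        weights = softmax(scores)
        # Weighted sum of anchor values.
        outputs.append([sum(w * v[j] for w, v in zip(weights, anchor_values))
                        for j in range(value_dim)])
    return outputs


# Toy example: two anchor tokens and two query tokens from another frame.
anchor_k = [[1.0, 0.0], [0.0, 1.0]]
anchor_v = [[10.0, 0.0], [0.0, 10.0]]
frame_out = cross_frame_attention([[1.0, 1.0], [8.0, 0.0]], anchor_k, anchor_v)
```

Since all frames draw their values from the same anchors, the outputs for different frames are built from a common feature pool, which is one way to read the summary's claim about temporal consistency.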
2312.13789
Report |
TinySAM: Pushing the Envelope for Efficient Segment Anything Model |
Han Shu, Wenshuo Li, Yehui Tang, Yiman Zhang, Yihao Chen, Houqiang Li, Yunhe Wang, Xinghao Chen |
Recently, the segment anything model (SAM) has shown powerful segmentation
capability and has drawn great attention in the computer vision field. Numerous
follow-up works have developed various applications based on the pretrained SAM
and achieved impressive performance on downstream vision tasks.
However, SAM has a heavy architecture and requires massive
computational capacity, which hinders the further application of SAM on
computation-constrained edge devices. To this end, in this paper we propose a
framework to obtain a tiny segment anything model (TinySAM) while maintaining
the strong zero-shot performance. We first propose a full-stage knowledge
distillation method with hard prompt sampling and hard mask weighting strategy
to distill a lightweight student model. We also adapt the post-training
quantization to the promptable segmentation task and further reduce the
computational cost. Moreover, a hierarchical segmenting everything strategy is
proposed to accelerate the everything inference by $2\times$ with almost no
performance degradation. With all these proposed methods, our TinySAM leads to
orders of magnitude computational reduction and pushes the envelope for
efficient segment anything task. Extensive experiments on various zero-shot
transfer tasks demonstrate the significantly advantageous performance of our
TinySAM against counterpart methods. Pre-trained models and code are available
at https://github.com/xinghaochen/TinySAM and
https://gitee.com/mindspore/models/tree/master/research/cv/TinySAM. |
This paper presents TinySAM, a highly efficient framework for segmenting anything, which significantly reduces computational cost while maintaining strong zero-shot segmentation capabilities. |
The existing Segment Anything Model (SAM), though powerful, has a heavy architecture and high computational demands, hindering its deployment on resource-constrained devices. |
The TinySAM framework employs three key techniques: 1) Hard Mining Full-Stage Knowledge Distillation to train a lightweight image encoder with guidance from the original SAM. 2) Post-Training Quantization adapted for promptable segmentation to further reduce computational complexity. 3) Hierarchical Segmenting Everything strategy to accelerate inference by reducing redundant computations. |
TinySAM achieves superior performance compared to other efficient SAM variants, exhibiting a 4% AP improvement over FastSAM with only 12.2% of the FLOPs and 25% of the latency.
The proposed model outperforms MobileSAM in zero-shot instance segmentation tasks on COCO and LVIS datasets, demonstrating higher accuracy with the same computational cost.
The hierarchical everything inference strategy reduces inference time by approximately 50% while maintaining comparable results to the original points grid strategy. |
The performance of TinySAM with quantization, while significantly more efficient, still lags behind the full-precision model.
The hierarchical everything inference strategy relies on pre-defined thresholds and may require adjustments for different datasets or applications. |
segment anything model, knowledge distillation, model quantization, efficient inference, zero-shot segmentation |
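To make the post-training quantization step in the TinySAM summary above more concrete, here is a minimal sketch of generic symmetric per-tensor int8 quantization. It is not the paper's scheme, which the summary says is adapted to promptable segmentation; the function names and the toy weights are assumptions for illustration only.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integer codes."""
    return [v * scale for v in q]


# Toy weight tensor (hypothetical values).
weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

The reconstruction error of each weight is bounded by about half a quantization step (`scale / 2`), which gives an intuition for the accuracy gap between quantized and full-precision models noted in the limitations.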
2312.13770
Report |
3D Points Splatting for Real-Time Dynamic Hand Reconstruction |
Zheheng Jiang, Hossein Rahmani, Sue Black, Bryan M. Williams |
We present 3D Points Splatting Hand Reconstruction (3D-PSHR), a real-time and
photo-realistic hand reconstruction approach. We propose a self-adaptive
canonical points upsampling strategy to achieve high-resolution hand geometry
representation. This is followed by a self-adaptive deformation that deforms
the hand from the canonical space to the target pose, adapting to dynamic
changes in the canonical points, which, in contrast to the common practice of
subdividing the MANO model, offers greater flexibility and results in improved
geometry fitting. To model texture, we disentangle the appearance color into
the intrinsic albedo and pose-aware shading, which are learned through a
Context-Attention module. Moreover, our approach allows the geometric and the
appearance models to be trained simultaneously in an end-to-end manner. We
demonstrate that our method is capable of producing animatable, photorealistic
and relightable hand reconstructions using multiple datasets, including
monocular videos captured with handheld smartphones and large-scale multi-view
videos featuring various hand poses. We also demonstrate that our approach
achieves real-time rendering speeds while simultaneously maintaining superior
performance compared to existing state-of-the-art methods. |
This supplementary material provides further details on the 3D points splatting method for real-time dynamic hand reconstruction presented in the main paper, including context-attention modules, ablation studies, training algorithm, and additional comparisons with state-of-the-art methods. |
This approach addresses limitations in existing hand reconstruction methods by introducing a novel 3D point splatting technique that enables efficient and accurate reconstruction of dynamic hand poses from monocular images. |
The method utilizes a differentiable renderer and a learned canonical representation of the hand. It optimizes a set of canonical points, their corresponding colors, and shading parameters to reconstruct the hand's 3D shape and appearance. |
The proposed method outperforms state-of-the-art methods in terms of both geometry and appearance reconstruction quality on the Hand Appearance Dataset.
Ablation studies demonstrate the effectiveness of different components of the proposed method, such as the context-attention modules, loss functions, and training algorithm.
The method achieves real-time performance while maintaining high reconstruction accuracy. |
The method relies on a pre-defined hand template (MANO model), which may limit its ability to generalize to hands with significant shape variations.
Future work could explore incorporating hand shape parameters into the learning process to improve generalization. |
hand reconstruction, 3d point splatting, differentiable rendering, canonical representation, real-time |
2312.13735
Report |
DECO: Query-Based End-to-End Object Detection with ConvNets |
Xinghao Chen, Siwei Li, Yijing Yang, Yunhe Wang |
Detection Transformer (DETR) and its variants have shown great potential for
accurate object detection in recent years. The mechanism of object query
enables the DETR family to directly obtain a fixed number of object predictions and
streamlines the detection pipeline. Meanwhile, recent studies also reveal that
with proper architecture design, convolutional networks (ConvNets), e.g., ConvNeXt,
can achieve performance competitive with transformers. To this end, in this
paper we explore whether we could build a query-based end-to-end object
detection framework with ConvNets instead of sophisticated transformer
architecture. The proposed framework, i.e., Detection ConvNet (DECO), is
composed of a backbone and convolutional encoder-decoder architecture. We
carefully design the DECO encoder and propose a novel mechanism for our DECO
decoder to perform interaction between object queries and image features via
convolutional layers. We compare the proposed DECO against prior detectors on
the challenging COCO benchmark. Despite its simplicity, our DECO achieves
competitive performance in terms of detection accuracy and running speed.
Specifically, with the ResNet-50 and ConvNeXt-Tiny backbones, DECO obtains
$38.6\%$ and $40.8\%$ AP on the COCO val set with $35$ and $28$ FPS,
respectively, and outperforms the DETR model. Combined with an advanced
multi-scale feature module, our DECO+ achieves $47.8\%$ AP with $34$ FPS. We
hope the proposed DECO brings another perspective for designing object
detection frameworks. |
This paper introduces DECO, a novel end-to-end object detection framework built solely on convolutional neural networks (CNNs) while adopting the query-based prediction mechanism of DETR (Detection Transformer). |
The motivation stems from recent studies demonstrating the competitiveness of well-designed ConvNets against transformers in various vision tasks. This, coupled with the potential benefits of query-based detection, like eliminating the need for Non-Maximum Suppression (NMS), led to the exploration of a CNN-based alternative to DETR. |
DECO comprises a CNN backbone, an encoder built upon ConvNeXt blocks, and a novel decoder designed for object query and image feature interaction. The decoder leverages depthwise and 1x1 convolutions, along with upsampling and pooling operations, to facilitate this interaction, deviating from the attention-based mechanism employed in DETR. |
DECO achieves competitive performance on the COCO benchmark in terms of accuracy and speed, surpassing DETR in both aspects.
With ResNet-50 and ConvNeXt-Tiny backbones, DECO attains 38.6% and 40.8% AP on COCO validation set at 35 and 28 FPS, respectively.
The enhanced version, DECO+, incorporating multi-scale features, further boosts the performance to 47.8% AP with 34 FPS. |
One limitation is the absence of specialized techniques like deformable attention or denoising training tailored for CNN-based architectures.
Future work could focus on incorporating these strategies into DECO to potentially further enhance its performance. |
object detection, convolutional neural networks, end-to-end detection, query-based detection, detr |
2312.13729
Report |
Gaussian Splatting with NeRF-based Color and Opacity |
Dawid Malarz, Weronika Smolak, Jacek Tabor, Sławomir Tadeja, Przemysław Spurek |
Neural Radiance Fields (NeRFs) have demonstrated the remarkable potential of
neural networks to capture the intricacies of 3D objects. By encoding the shape
and color information within neural network weights, NeRFs excel at producing
strikingly sharp novel views of 3D objects. Recently, numerous generalizations
of NeRFs utilizing generative models have emerged, expanding their versatility.
In contrast, Gaussian Splatting (GS) offers a similar render quality with
faster training and inference as it does not need neural networks to work. We
encode information about the 3D objects in the set of Gaussian distributions
that can be rendered in 3D similarly to classical meshes. Unfortunately, GS is
difficult to condition since it usually requires around a hundred thousand
Gaussian components. To mitigate the caveats of both models, we propose a
hybrid model Viewing Direction Gaussian Splatting (VDGS) that uses GS
representation of the 3D object's shape and NeRF-based encoding of color and
opacity. Our model uses Gaussian distributions with trainable positions (i.e.,
means of the Gaussians), shape (i.e., covariance of the Gaussians), color and opacity, and
a neural network that takes the parameters of a Gaussian and the viewing direction to
produce changes in color and opacity. Consequently, our model better describes
shadows, light reflections, and transparency of 3D objects. |
This paper proposes Viewing Direction Gaussian Splatting (VDGS), a hybrid neural rendering method that combines Gaussian Splatting (GS) with NeRF-based color and opacity encoding. |
VDGS aims to combine the speed of GS with the view-dependent effects of NeRF, leading to faster training and inference than NeRF while better modeling shadows, reflections, and transparency in 3D objects. |
VDGS utilizes GS to represent the 3D object's shape and a NeRF-based neural network to predict view-dependent changes in the color and opacity of the Gaussian components. |
VDGS achieves better quantitative results than both GS and neural rendering methods on the NeRF Synthetic dataset.
VDGS effectively models light reflections and shadows on the Tanks and Temples dataset, outperforming both GS and NeRF in most cases.
VDGS demonstrates comparable performance to NeRF and GS on the Shiny Blender dataset, showcasing its ability to handle reflective surfaces. |
The paper acknowledges a slightly longer training and inference time compared to pure GS due to the added neural network.
Future work could explore alternative ways to combine color and opacity updates or investigate the impact of pre-training the GS component. |
neural rendering, gaussian splatting, nerf, 3d object representation, view synthesis |
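The VDGS idea above (a small network that takes Gaussian parameters and a viewing direction and outputs changes in color and opacity) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the helper names `make_weights`, `tiny_mlp`, and `view_dependent_update` are hypothetical, the weights are random rather than trained, and the choice to add the color change and to scale opacity by a sigmoid is one possible way to combine the updates, not necessarily the one used in the paper.

```python
import math
import random


def make_weights(in_dim, hidden, out_dim, seed=0):
    """Random stand-in weights; a trained model would learn these."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(hidden)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(out_dim)]
    return w1, w2


def tiny_mlp(x, w1, w2):
    """Two-layer perceptron with a ReLU hidden layer and no biases."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, row))) for row in w1]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w2]


def view_dependent_update(color, opacity, gaussian_params, view_dir, w1, w2):
    """Return (color, opacity) of one Gaussian adjusted for one viewing direction.

    Assumption: the color change is added (then clamped to [0, 1]) and the
    opacity is multiplied by a sigmoid of the predicted opacity change.
    """
    out = tiny_mlp(list(gaussian_params) + list(view_dir), w1, w2)
    d_color, d_opacity = out[:3], out[3]
    new_color = [min(1.0, max(0.0, c + dc)) for c, dc in zip(color, d_color)]
    scale = 1.0 / (1.0 + math.exp(-d_opacity))  # lies strictly between 0 and 1
    return new_color, opacity * scale


if __name__ == "__main__":
    params = [0.1] * 7            # hypothetical per-Gaussian feature vector
    view = [0.0, 0.0, 1.0]        # unit viewing direction
    w1, w2 = make_weights(in_dim=len(params) + 3, hidden=8, out_dim=4)
    print(view_dependent_update([0.5, 0.5, 0.5], 0.8, params, view, w1, w2))
```

Because the base color and opacity stay attached to each Gaussian, the network only has to model the view-dependent residual, which is consistent with the modest extra training and inference cost noted in the limitations.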
2312.13691
Report |
DreamTuner: Single Image is Enough for Subject-Driven Generation |
Miao Hua, Jiawei Liu, Fei Ding, Wei Liu, Jie Wu, Qian He |
Diffusion-based models have demonstrated impressive capabilities for
text-to-image generation and are expected to support personalized applications of
subject-driven generation, which require the generation of customized concepts
with one or a few reference images. However, existing methods based on
fine-tuning fail to balance the trade-off between subject learning and the
maintenance of the generation capabilities of pretrained models. Moreover,
other methods that utilize additional image encoders tend to lose important
details of the subject due to encoding compression. To address these
challenges, we propose DreamTuner, a novel method that injects reference
information from coarse to fine to achieve subject-driven image generation more
effectively. DreamTuner introduces a subject-encoder for coarse subject
identity preservation, where the compressed general subject features are
introduced through an attention layer before visual-text cross-attention. We
then modify the self-attention layers within pretrained text-to-image models to
self-subject-attention layers to refine the details of the target subject. The
generated image queries detailed features from both the reference image and
itself in self-subject-attention. It is worth emphasizing that
self-subject-attention is an effective, elegant, and training-free method for
maintaining the detailed features of customized subjects and can serve as a
plug-and-play solution during inference. Finally, with additional
subject-driven fine-tuning, DreamTuner achieves remarkable performance in
subject-driven image generation, which can be controlled by a text or other
conditions such as pose. For further details, please visit the project page at
https://dreamtuner-diffusion.github.io/. |
Proposes DreamTuner, a subject-driven image generation method that uses a single reference image to generate new images of a specific subject in different scenes guided by text or pose. |
Personalized image generation with customized subjects is in high demand for various applications, but existing methods struggle to balance subject identity preservation and model controllability. |
Combines a subject-encoder for coarse identity preservation, self-subject-attention for fine identity details, and a subject-driven fine-tuning stage to optimize the model for a specific subject. |
Achieves high-fidelity subject-driven image generation with a single reference image.
Outperforms existing methods in terms of subject fidelity and prompt consistency.
Demonstrates strong capability in generating images with detailed subject features while adapting to different text prompts and poses. |
Training the subject-encoder and fine-tuning the model require additional computational resources.
Extending the method to handle multiple subjects in a single image requires further exploration. |
image generation, diffusion models, subject-driven generation, self-attention, text-to-image |
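The self-subject-attention step described for DreamTuner, in which the generated image queries features from both the reference image and itself, can be sketched as ordinary scaled dot-product attention whose keys and values are the concatenation of the two token sets. This is a simplified illustration under stated assumptions: the function names are hypothetical, identity projections replace the learned query, key, and value projections, and any reweighting or masking of the reference tokens is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_subject_attention(gen_feats, ref_feats):
    """Each generated-image token attends over generated + reference tokens.

    gen_feats, ref_feats: lists of equal-length feature vectors.
    Returns (outputs, attention_weights).
    """
    dim = len(gen_feats[0])
    keys = gen_feats + ref_feats   # assumption: identity key projection
    values = keys                  # assumption: identity value projection
    outputs, weights = [], []
    for q in gen_feats:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wi * v[j] for wi, v in zip(w, values)) for j in range(dim)])
    return outputs, weights


if __name__ == "__main__":
    gen = [[1.0, 0.0], [0.0, 1.0]]   # two tokens of the image being generated
    ref = [[1.0, 1.0]]               # one token from the reference image
    out, attn = self_subject_attention(gen, ref)
    print(out)
    print(attn)
```

Since only the key and value sets change, a layer of this shape can reuse the pretrained self-attention weights, which is consistent with the abstract's description of the mechanism as training-free and plug-and-play at inference.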
2312.13663
Report |
Free-Editor: Zero-shot Text-driven 3D Scene Editing |
Nazmul Karim, Umar Khalid, Hasan Iqbal, Jing Hua, Chen Chen |
Text-to-Image (T2I) diffusion models have gained popularity recently due to
their multipurpose and easy-to-use nature, e.g. image and video generation as
well as editing. However, training a diffusion model specifically for 3D scene
editing is not straightforward due to the lack of large-scale datasets. To
date, editing 3D scenes requires either re-training the model to adapt to
various 3D edited scenes or designing specific methods for each special editing
type. Furthermore, state-of-the-art (SOTA) methods require multiple
synchronized edited images from the same scene to facilitate the scene editing.
Due to the current limitations of T2I models, it is very challenging to apply
consistent editing effects to multiple images, i.e. multi-view inconsistency in
editing. This in turn compromises the desired 3D scene editing performance if
these images are used. In our work, we propose a novel training-free 3D scene
editing technique, Free-Editor, which allows users to edit 3D scenes without
further re-training the model during test time. Our proposed method
successfully avoids the multi-view style inconsistency issue in SOTA methods
with the help of a "single-view editing" scheme. Specifically, we show that
editing a particular 3D scene can be performed by only modifying a single view.
To this end, we introduce an Edit Transformer that enforces intra-view
consistency and inter-view style transfer by utilizing self- and
cross-attention, respectively. Since it is no longer required to re-train the
model and edit every view in a scene, the editing time, as well as memory
resources, are reduced significantly, e.g., the runtime being $\sim 20\times$
faster than SOTA. We have conducted extensive experiments on a wide
range of benchmark datasets and achieve diverse editing capabilities with our
proposed technique. |
Proposes Free-Editor, a zero-shot text-guided 3D scene editing technique that synthesizes novel views based on a text description while maintaining 3D consistency without retraining. |
Existing 3D scene editing methods using T2I diffusion models require retraining for each scene or editing type, leading to computational overhead and limitations in practical applications. |
Leverages a generalized NeRF model and introduces an Edit Transformer to transfer style information from a single edited starting view to target views using cross-attention. Employs multi-view consistency and self-view robust losses for spatial smoothness and color consistency. |
Achieves state-of-the-art performance in text-driven 3D scene editing, accurately reflecting text descriptions in novel views.
Demonstrates superior efficiency compared to previous methods, achieving significantly faster editing time and constant space complexity due to the zero-shot nature.
Maintains 3D consistency and preserves details of the original scene while effectively implementing edits. |
Relies on successful 2D pre-editing of the starting view, where failures can adversely affect 3D editing outcomes.
Addressing multi-view inconsistency in edited images requires careful consideration, potentially through trial and error or a view-filtering system. |
3d scene editing, text-guided image synthesis, neural radiance fields (nerf), diffusion models, zero-shot learning |
2312.13578
Report |
DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for Single Image Talking Face Generation |
Chenxu Zhang, Chao Wang, Jianfeng Zhang, Hongyi Xu, Guoxian Song, You Xie, Linjie Luo, Yapeng Tian, Xiaohu Guo, Jiashi Feng |
The generation of emotional talking faces from a single portrait image
remains a significant challenge. The simultaneous achievement of expressive
emotional talking and accurate lip-sync is particularly difficult, as
expressiveness is often compromised for the accuracy of lip-sync. As widely
adopted by many prior works, the LSTM network often fails to capture the
subtleties and variations of emotional expressions. To address these
challenges, we introduce DREAM-Talk, a two-stage diffusion-based audio-driven
framework, tailored for generating diverse expressions and accurate lip-sync
concurrently. In the first stage, we propose EmoDiff, a novel diffusion module
that generates diverse, highly dynamic emotional expressions and head poses in
accordance with the audio and the referenced emotion style. Given the strong
correlation between lip motion and audio, we then refine the dynamics with
enhanced lip-sync accuracy using audio features and emotion style. To this end,
we deploy a video-to-video rendering module to transfer the expressions and lip
motions from our proxy 3D avatar to an arbitrary portrait. Both quantitatively
and qualitatively, DREAM-Talk outperforms state-of-the-art methods in terms of
expressiveness, lip-sync accuracy and perceptual quality. |
DREAM-Talk, a novel two-stage diffusion-based framework, generates photorealistic, lip-synchronized talking face videos with high-quality emotional expressions from a single portrait image, audio, and emotion style example. |
Existing methods struggle to simultaneously achieve expressive emotional talking and accurate lip-sync, often compromising expressiveness for lip-sync accuracy. |
DREAM-Talk uses a two-stage pipeline: 1) **EmoDiff Module**: An emotion-conditioned diffusion model generates dynamic emotional expressions and head poses from audio and emotion style. 2) **Lip Refinement**: A lip-sync refinement network enhances lip-sync accuracy using audio and emotion style while preserving emotional expressiveness. Finally, a video-to-video rendering module transfers expressions and lip motions to an arbitrary portrait. |
Outperforms state-of-the-art methods in expressiveness, lip-sync accuracy, and perceptual quality.
Effectively captures high-frequency facial details and subtle variations in emotional expressions.
Demonstrates superior performance in both quantitative metrics and subjective user studies. |
Relies on the accuracy of pre-trained emotion recognition models.
Limited generalization ability to unseen emotional expressions or speaking styles. |
talking face generation, emotional expression, lip sync, diffusion models, deep learning |
2312.13528
Report |
DyBluRF: Dynamic Deblurring Neural Radiance Fields for Blurry Monocular Video |
Minh-Quan Viet Bui, Jongmin Park, Jihyong Oh, Munchurl Kim |
Neural Radiance Fields (NeRF), initially developed for static scenes, have
inspired many video novel view synthesis techniques. However, the challenge for
video view synthesis arises from motion blur, a consequence of object or camera
movement during exposure, which hinders the precise synthesis of sharp
spatio-temporal views. In response, we propose a novel dynamic deblurring NeRF
framework for blurry monocular video, called DyBluRF, consisting of a Base Ray
Initialization (BRI) stage and a Motion Decomposition-based Deblurring (MDD)
stage. DyBluRF is the first method to handle novel view synthesis for
blurry monocular video with a novel two-stage framework. In the BRI stage, we
coarsely reconstruct dynamic 3D scenes and jointly initialize the base ray,
which is further used to predict latent sharp rays, using the inaccurate camera
pose information from the given blurry frames. In the MDD stage, we introduce a
novel Incremental Latent Sharp-rays Prediction (ILSP) approach for the blurry
monocular video frames by decomposing the latent sharp rays into global camera
motion and local object motion components. We further propose two loss
functions for effective geometry regularization and decomposition of static and
dynamic scene components without any mask supervision. Experiments show that
DyBluRF outperforms the SOTA methods both qualitatively and quantitatively. |
DyBluRF, a novel dynamic deblurring Neural Radiance Field (NeRF) framework, synthesizes sharp, novel spatio-temporal views from blurry monocular videos. |
Existing video view synthesis methods struggle with motion blur in casually captured monocular videos, hindering the generation of sharp, temporally consistent novel views. |
DyBluRF employs a two-stage approach: (1) Base Ray Initialization (BRI) coarsely reconstructs the 3D scene and initializes base rays from imprecise camera poses; (2) Motion Decomposition-based Deblurring (MDD) refines the rays by considering global camera and local object motion, simulating the blur process during training. |
DyBluRF significantly outperforms state-of-the-art methods in novel view synthesis from blurry monocular videos, both qualitatively and quantitatively.
The proposed Unsupervised Staticness Maximization and Local Geometry Variance Distillation losses enable robust decomposition of static and dynamic scene components and accurate geometry reconstruction, respectively.
DyBluRF demonstrates robustness against varying degrees of blurriness, maintaining consistent performance across different dataset capture qualities. |
DyBluRF's performance is limited by the diversity of training and validation views in the dataset, leading to overfitting in scenarios with significantly different lighting conditions.
Future work includes integrating Gaussian Splatting-based shading networks for improved training and rendering efficiency. |
deblurring nerf, dynamic nerf, video view synthesis, motion blur, monocular video |
2312.13324
Report |
ShowRoom3D: Text to High-Quality 3D Room Generation Using 3D Priors |
Weijia Mao, Yan-Pei Cao, Jia-Wei Liu, Zhongcong Xu, Mike Zheng Shou |
We introduce ShowRoom3D, a three-stage approach for generating high-quality
3D room-scale scenes from texts. Previous methods using 2D diffusion priors to
optimize neural radiance fields for generating room-scale scenes have shown
unsatisfactory quality. This is primarily attributed to the limitations of 2D
priors lacking 3D awareness and constraints in the training methodology. In
this paper, we utilize a 3D diffusion prior, MVDiffusion, to optimize the 3D
room-scale scene. Our contributions are in two aspects. Firstly, we propose a
progressive view selection process to optimize NeRF. This involves dividing the
training process into three stages, gradually expanding the camera sampling
scope. Secondly, we propose a pose transformation method in the second stage,
which ensures that MVDiffusion provides accurate view guidance. As a result,
ShowRoom3D enables the generation of rooms with improved structural integrity,
enhanced clarity from any view, reduced content repetition, and higher
consistency across different perspectives. Extensive experiments demonstrate
that our method outperforms state-of-the-art approaches by a
large margin in a user study. |
Introduces ShowRoom3D, a novel three-stage pipeline utilizing a 3D diffusion prior (MVDiffusion) to optimize NeRF for generating high-quality 3D room-scale scenes from text prompts. |
Generating high-quality 3D room-scale scenes is crucial for various industries, including VR/AR and the Metaverse. Existing methods face challenges such as the Janus problem, unreasonable room structures, and style inconsistencies. |
Combines MVDiffusion and NeRF using a progressive view selection approach: (1) Generates a panoramic view to determine room structure and geometry. (2) Improves geometry and layout by adding training views from various positions facing outward. Introduces pose transformation for accurate view guidance. (3) Freely positions the camera and applies rotations to fine-tune the NeRF model for rendering from any position and rotation. |
Generates rooms with improved structural integrity and clarity from any view.
Reduces content repetition and enhances consistency across different perspectives.
Significantly outperforms state-of-the-art approaches in user studies, demonstrating superior overall quality, text alignment, and consistency. |
Generated results exhibit oversaturation despite employing techniques to mitigate the issue.
The three-stage training process is time-consuming. |
3d scene generation, text-to-3d, diffusion models, neural radiance fields (nerf), score distillation sampling (sds) |
2312.13314
Report |
Unlocking Pre-trained Image Backbones for Semantic Image Synthesis |
Tariq Berrada, Jakob Verbeek, Camille Couprie, Karteek Alahari |
Semantic image synthesis, i.e., generating images from user-provided semantic
label maps, is an important conditional image generation task as it allows
control over both the content and the spatial layout of generated images.
Although diffusion models have pushed the state of the art in generative image
modeling, the iterative nature of their inference process makes them
computationally demanding. Other approaches such as GANs are more efficient as
they only need a single feed-forward pass for generation, but the image quality
tends to suffer on large and diverse datasets. In this work, we propose a new
class of GAN discriminators for semantic image synthesis that generates highly
realistic images by exploiting feature backbone networks pre-trained for tasks
such as image classification. We also introduce a new generator architecture
with better context modeling and using cross-attention to inject noise into
latent variables, leading to more diverse generated images. Our model, which we
dub DP-SIMS, achieves state-of-the-art results in terms of image quality and
consistency with the input label maps on ADE-20K, COCO-Stuff, and Cityscapes,
surpassing recent diffusion models while requiring two orders of magnitude less
compute for inference. |
This paper introduces a novel GAN-based semantic image synthesis model that leverages pre-trained image backbones as encoders in the discriminator, enhancing image quality and consistency with input segmentation masks. |
Semantic image synthesis is crucial for tasks demanding precise control over object location and boundaries, such as photo editing and data augmentation, where existing methods struggle with quality, consistency, or speed. |
The proposed method uses a UNet-like discriminator with a pre-trained backbone encoder and a trainable decoder. It introduces a novel generator architecture with cross-attention for noise injection and employs a contrastive loss and diversity constraint during training. |
Achieves state-of-the-art performance in FID and mIoU across the COCO-Stuff, ADE-20K, and Cityscapes datasets.
Outperforms recent diffusion models in quality and consistency while being significantly faster in inference (two orders of magnitude).
Demonstrates the effectiveness of pre-trained backbones, feature conditioning, and novel architectural modifications through extensive ablations. |
Transformer-based backbones, like Swin, exhibit instability during training and require further investigation.
Exploration of larger encoder architectures for handling more complex datasets could be beneficial. |
semantic image synthesis, generative adversarial networks (gans), pre-trained backbones, contrastive learning, image generation |
2312.13308
Report |
SWAGS: Sampling Windows Adaptively for Dynamic 3D Gaussian Splatting |
Richard Shaw, Jifei Song, Arthur Moreau, Michal Nazarczuk, Sibi Catley-Chandar, Helisa Dhamo, Eduardo Perez-Pellitero |
Novel view synthesis has shown rapid progress recently, with methods capable
of producing ever more photo-realistic results. 3D Gaussian Splatting has
emerged as a particularly promising method, producing high-quality renderings
of static scenes and enabling interactive viewing at real-time frame rates.
However, it is currently limited to static scenes only. In this work, we extend
3D Gaussian Splatting to reconstruct dynamic scenes. We model the dynamics of a
scene using a tuneable MLP, which learns the deformation field from a canonical
space to a set of 3D Gaussians per frame. To disentangle the static and dynamic
parts of the scene, we learn a tuneable parameter for each Gaussian, which
weighs the respective MLP parameters to focus attention on the dynamic parts.
This improves the model's ability to capture dynamics in scenes with an
imbalance of static to dynamic regions. To handle scenes of arbitrary length
whilst maintaining high rendering quality, we introduce an adaptive window
sampling strategy to partition the sequence into windows based on the amount of
movement in the sequence. We train a separate dynamic Gaussian Splatting model
for each window, allowing the canonical representation to change, thus enabling
the reconstruction of scenes with significant geometric or topological changes.
Temporal consistency is enforced using a fine-tuning step with a self-supervising
consistency loss on randomly sampled novel views. As a result, our method
produces high-quality renderings of general dynamic scenes with competitive
quantitative performance, which can be viewed in real-time with our dynamic
interactive viewer. |
This paper introduces a novel method for high-quality, real-time novel view synthesis of dynamic scenes using an extension of 3D Gaussian Splatting. |
Existing methods struggle with long sequences, complex motions, and often lack temporal consistency. This work addresses these issues to achieve realistic and efficient dynamic scene rendering. |
The method employs adaptive window sampling based on motion, learns per-window canonical representations and deformation fields using tuneable MLPs, and ensures temporal consistency through a fine-tuning step with a self-supervised consistency loss. |
The method achieves state-of-the-art PSNR and SSIM performance on the Neural 3D Video dataset.
It enables real-time interactive viewing of dynamic scenes.
The adaptive window sampling and temporal consistency fine-tuning effectively handle complex motions and long sequences without distracting flickering. |
The method relies on pre-computed camera parameters and may be sensitive to their accuracy.
Future work could explore joint optimization of camera poses and scene representation. |
novel view synthesis, dynamic scene reconstruction, 3d gaussian splatting, temporal consistency, neural rendering |
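The adaptive window sampling described for SWAGS partitions a sequence into windows according to how much movement it contains, so that high-motion segments get shorter windows. A minimal greedy sketch of that idea follows. The function name `adaptive_windows`, the per-frame motion scores, the threshold, and the maximum window length are all hypothetical; how motion is actually measured is not stated in the summary above.

```python
def adaptive_windows(motion, threshold, max_len):
    """Split frame indices into consecutive windows.

    A window is closed once its accumulated motion reaches `threshold`
    or once it contains `max_len` frames, whichever happens first.
    Returns a list of inclusive (start, end) frame index pairs.
    """
    windows, start, acc = [], 0, 0
    for i, m in enumerate(motion):
        acc += m
        if acc >= threshold or (i - start + 1) >= max_len:
            windows.append((start, i))
            start, acc = i + 1, 0
    if start < len(motion):
        # Remaining frames that never hit either limit form the last window.
        windows.append((start, len(motion) - 1))
    return windows


if __name__ == "__main__":
    # Hypothetical per-frame motion scores: calm, then fast movement, then calm.
    motion = [1, 1, 1, 1, 6, 7, 8, 1, 1]
    print(adaptive_windows(motion, threshold=10, max_len=5))
```

Each resulting window would then get its own dynamic model, which is why the summary pairs this step with a consistency fine-tuning pass across window boundaries.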
2312.13307
Report |
Not All Steps are Equal: Efficient Generation with Progressive Diffusion Models |
Wenhao Li, Xiu Su, Shan You, Tao Huang, Fei Wang, Chen Qian, Chang Xu |
Diffusion models have demonstrated remarkable efficacy in various generative
tasks with the predictive prowess of the denoising model. Currently, these models
employ a uniform denoising approach across all timesteps. However, the inherent
variations in noisy latents at each timestep lead to conflicts during training,
constraining the potential of diffusion models. To address this challenge, we
propose a novel two-stage training strategy termed Step-Adaptive Training. In
the initial stage, a base denoising model is trained to encompass all
timesteps. Subsequently, we partition the timesteps into distinct groups,
fine-tuning the model within each group to achieve specialized denoising
capabilities. Recognizing that the difficulties of predicting noise at
different timesteps vary, we introduce a diverse model size requirement. We
dynamically adjust the model size for each timestep by estimating task
difficulty based on its signal-to-noise ratio before fine-tuning. This
adjustment is facilitated by a proxy-based structural importance assessment
mechanism, enabling precise and efficient pruning of the base denoising model.
Our experiments validate the effectiveness of the proposed training strategy,
demonstrating an improvement in the FID score on CIFAR10 by over 0.3 while
utilizing only 80% of the computational resources. This innovative approach
not only enhances model performance but also significantly reduces
computational costs, opening new avenues for the development and application of
diffusion models. |
This paper introduces a novel two-stage training strategy called Step-Adaptive Training for diffusion models, aiming to address the limitations of uniform denoising across timesteps. |
The conventional uniform denoising approach in diffusion models leads to training conflicts and inefficient resource allocation due to varying noise levels across timesteps, limiting their efficiency and performance. |
The approach involves initially training a base denoising model across all timesteps. Subsequently, timesteps are partitioned into groups, and the model is fine-tuned within each group with a specific FLOPs budget determined by the signal-to-noise ratio. This process is facilitated by a GPT-4 proxy-based pruning method for efficient model size adjustment. |
The Step-Adaptive Training Strategy improves the FID score on CIFAR10 by over 0.3 while using only 80% of the computational resources.
The two-stage training approach, in comparison to single-stage training, demonstrates significant improvement in convergence speed and performance.
The proposed GPT-4 proxy-based pruning method outperforms other pruning algorithms for diffusion models, leading to smaller models with competitive performance. |
The paper mainly focuses on image generation tasks, and further exploration is needed for other applications of diffusion models.
While the GPT-4 proxy shows promising results, investigating alternative pruning methods specifically tailored for diffusion models could be beneficial.
Future work includes exploring the application of Step-Adaptive Training to diverse diffusion model architectures and datasets. |
diffusion models, denoising, model pruning, step-adaptive training, gpt-4 proxy |
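Step-Adaptive Training, as summarized above, partitions timesteps into groups and sizes the model for each group from its signal-to-noise ratio. The sketch below only covers the part that can be stated with confidence: computing SNR(t) = alpha_bar_t / (1 - alpha_bar_t) for a DDPM-style linear beta schedule and splitting the timesteps into contiguous groups with a mean log-SNR each. The schedule endpoints, the equal-sized grouping, and the function names are assumptions, and the mapping from SNR to a FLOPs budget is deliberately left out because it is not specified here.

```python
import math


def snr_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-timestep SNR for a linear beta schedule (assumed endpoints)."""
    snrs, alpha_bar = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        alpha_bar *= 1.0 - beta           # cumulative product of (1 - beta)
        snrs.append(alpha_bar / (1.0 - alpha_bar))
    return snrs


def partition_timesteps(snrs, num_groups):
    """Split timesteps into contiguous, near-equal groups.

    Returns a list of (start, end_exclusive, mean_log_snr) tuples.
    """
    n = len(snrs)
    groups = []
    for g in range(num_groups):
        start, end = g * n // num_groups, (g + 1) * n // num_groups
        mean_log = sum(math.log(s) for s in snrs[start:end]) / (end - start)
        groups.append((start, end, mean_log))
    return groups


if __name__ == "__main__":
    for start, end, mean_log in partition_timesteps(snr_schedule(1000), 4):
        print(f"timesteps [{start}, {end}): mean log-SNR = {mean_log:.2f}")
```

A per-group statistic like the mean log-SNR is the kind of quantity a budget rule could consume, with each group's pruned model then fine-tuned only on its own timesteps.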
2312.13299
Report |
Compact 3D Scene Representation via Self-Organizing Gaussian Grids |
Wieland Morgenstern, Florian Barthel, Anna Hilsmann, Peter Eisert |
3D Gaussian Splatting has recently emerged as a highly promising technique
for modeling of static 3D scenes. In contrast to Neural Radiance Fields, it
utilizes efficient rasterization allowing for very fast rendering at
high quality. However, the storage size is significantly higher, which hinders
practical deployment, e.g., on resource-constrained devices. In this paper, we
introduce a compact scene representation organizing the parameters of 3D
Gaussian Splatting (3DGS) into a 2D grid with local homogeneity, ensuring a
drastic reduction in storage requirements without compromising visual quality
during rendering. Central to our idea is the explicit exploitation of
perceptual redundancies present in natural scenes. In essence, the inherent
nature of a scene allows for numerous permutations of Gaussian parameters to
equivalently represent it. To this end, we propose a novel highly parallel
algorithm that regularly arranges the high-dimensional Gaussian parameters into
a 2D grid while preserving their neighborhood structure. During training, we
further enforce local smoothness between the sorted parameters in the grid. The
uncompressed Gaussians use the same structure as 3DGS, ensuring a seamless
integration with established renderers. Our method achieves a reduction factor
of 17x to 42x in size for complex scenes with no increase in training time,
marking a substantial leap forward in the domain of 3D scene distribution and
consumption. Additional information can be found on our project page:
https://fraunhoferhhi.github.io/Self-Organizing-Gaussians/ |
This paper introduces a novel method for compact 3D scene representation using self-organizing Gaussian grids, significantly reducing the storage requirements of 3D Gaussian Splatting (3DGS) without compromising rendering quality. |
3DGS offers high-quality rendering at fast speeds, but its large storage size hinders practical deployment on devices with limited resources. This work addresses this limitation, making 3DGS more practical for various applications. |
The method employs a novel parallel sorting algorithm (PLAS) to arrange 3DGS parameters into a 2D grid, ensuring local homogeneity. A smoothness loss is integrated into the training process to encourage compressible parameter arrangements, and off-the-shelf compression methods are used for storage. |
Achieves a 17x to 42x reduction in storage size compared to vanilla 3DGS.
Maintains high visual quality comparable to original 3DGS rendering.
Sorting and compression do not increase training time compared to 3DGS. |
Current implementation relies on high-dimensional spherical harmonics, which could be improved for further compression.
Future work includes extending the method to 4D scenes with temporal dependencies. |
3d scene representation, 3d gaussian splatting, compression, self-organizing maps, neural rendering |
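The smoothness objective described for the self-organizing Gaussian grid rewards arrangements in which neighboring grid cells hold similar parameters, since such grids compress well. The sketch below swaps in a plain neighbor-difference penalty to show the effect; it is not claimed to be the loss used in the paper, and the function name `smoothness_loss` is hypothetical.

```python
def smoothness_loss(grid):
    """Mean squared difference between adjacent cells of a 2D grid.

    grid[r][c] is the parameter vector stored at row r, column c.
    Only right and bottom neighbors are visited, so each pair counts once.
    """
    rows, cols = len(grid), len(grid[0])
    total, count = 0.0, 0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                total += sum((a - b) ** 2 for a, b in zip(grid[r][c], grid[r][c + 1]))
                count += 1
            if r + 1 < rows:
                total += sum((a - b) ** 2 for a, b in zip(grid[r][c], grid[r + 1][c]))
                count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    ordered = [[[0.0], [1.0]], [[2.0], [3.0]]]    # similar values sit together
    scrambled = [[[0.0], [3.0]], [[2.0], [1.0]]]  # same values, worse layout
    print(smoothness_loss(ordered), smoothness_loss(scrambled))
```

The two grids hold the same set of values, so the difference in loss comes purely from the arrangement, which is the redundancy the sorting step is meant to exploit.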
2312.13286
Report |
Generative Multimodal Models are In-Context Learners |
Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, Xinlong Wang |
The human ability to easily solve multimodal tasks in context (i.e., with
only a few demonstrations or simple instructions) is what current multimodal
systems have largely struggled to imitate. In this work, we demonstrate that
the task-agnostic in-context learning capabilities of large multimodal models
can be significantly enhanced by effective scaling-up. We introduce Emu2, a
generative multimodal model with 37 billion parameters, trained on large-scale
multimodal sequences with a unified autoregressive objective. Emu2 exhibits
strong multimodal in-context learning abilities, even emerging to solve tasks
that require on-the-fly reasoning, such as visual prompting and object-grounded
generation. The model sets a new record on multiple multimodal understanding
tasks in few-shot settings. When instruction-tuned to follow specific
instructions, Emu2 further achieves new state-of-the-art on challenging tasks
such as question answering benchmarks for large multimodal models and
open-ended subject-driven generation. These achievements demonstrate that Emu2
can serve as a base model and general-purpose interface for a wide range of
multimodal tasks. Code and models are publicly available to facilitate future
research. |
This paper introduces Emu2, a 37B parameter generative multimodal model demonstrating significant advancements in in-context learning for multimodal tasks. |
The work is crucial as it addresses the limitations of current multimodal systems in replicating human-like in-context learning abilities for diverse and complex tasks. |
Emu2 is trained on a massive dataset of multimodal sequences (text, image-text, video-text) with a unified autoregressive objective to predict the next multimodal element. The model architecture includes a visual encoder, multimodal modeling, and visual decoder for image and video generation. |
Emu2 achieves state-of-the-art performance on various few-shot multimodal understanding tasks, including visual question answering.
It demonstrates strong in-context learning capabilities, excelling in tasks like visual prompting and object-grounded generation.
With instruction tuning, Emu2 excels in controllable visual generation, accepting text, location, and image inputs for context-aware image synthesis. |
The in-context learning capability of Emu2 can be limited in complex situations, such as counting objects in crowded scenes.
There's a performance gap between Emu2 and specialized multimodal systems, particularly in question-answering tasks. |
multimodal learning, in-context learning, generative models, large language models, visual generation |
2312.13285
Report |
UniSDF: Unifying Neural Representations for High-Fidelity 3D Reconstruction of Complex Scenes with Reflections |
Fangjinhua Wang, Marie-Julie Rakotosaona, Michael Niemeyer, Richard Szeliski, Marc Pollefeys, Federico Tombari |
Neural 3D scene representations have shown great potential for 3D
reconstruction from 2D images. However, reconstructing real-world captures of
complex scenes still remains a challenge. Existing generic 3D reconstruction
methods often struggle to represent fine geometric details and do not
adequately model reflective surfaces of large-scale scenes. Techniques that
explicitly focus on reflective surfaces can model complex and detailed
reflections by exploiting better reflection parameterizations. However, we
observe that these methods are often not robust in real unbounded scenarios
where non-reflective as well as reflective components are present. In this
work, we propose UniSDF, a general purpose 3D reconstruction method that can
reconstruct large complex scenes with reflections. We investigate both
view-based as well as reflection-based color prediction parameterization
techniques and find that explicitly blending these representations in 3D space
enables reconstruction of surfaces that are more geometrically accurate,
especially for reflective surfaces. We further combine this representation with
a multi-resolution grid backbone that is trained in a coarse-to-fine manner,
enabling faster reconstructions than prior methods. Extensive experiments on
object-level datasets DTU, Shiny Blender as well as unbounded datasets Mip-NeRF
360 and Ref-NeRF real demonstrate that our method is able to robustly
reconstruct complex large-scale scenes with fine details and reflective
surfaces. Please see our project page at
https://fangjinhuawang.github.io/UniSDF. |
Introduces UniSDF, a novel algorithm that combines camera-view and reflected-view radiance fields, enabling robust and accurate 3D reconstruction of complex scenes with reflections. |
Existing methods struggle to balance accurate geometry and reflections, especially in complex real-world scenes. UniSDF addresses this by leveraging the strengths of different radiance field parameterizations. |
The method uses a hash grid backbone for fast training and combines two radiance fields: one parameterized by camera view direction and the other by reflected view direction. A learned weight field blends these in 3D space. A coarse-to-fine training strategy is employed to enhance reconstruction quality. |
Achieves state-of-the-art reconstruction quality on DTU, outperforming Neuralangelo and PermutoSDF.
Demonstrates high-fidelity reconstruction on Shiny Blender dataset, surpassing BakedSDF in capturing reflective surfaces accurately.
Reconstructs complex unbounded scenes with fine details and reflections, outperforming baselines like BakedSDF on Mip-NeRF 360 and Ref-NeRF real datasets. |
The method's performance on highly specular and less specular reflections is not explicitly analyzed.
Further exploration of the optimization challenges related to the diffuse component in the reflected view radiance field, particularly with high-frequency iNGP representations, is warranted. |
3d reconstruction, neural radiance fields, reflections, hash grids, signed distance functions |
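The blending step described in the UniSDF entry above can be illustrated with a short sketch. This is not the authors' implementation: the function names `reflect` and `blend_colors` are hypothetical, and both the reflection formula (the usual mirror reflection of a view direction about the surface normal) and the convention that the weight multiplies the reflected-view color are assumptions. The entry states only that a learned weight field blends the two radiance fields in 3D space.

```python
def reflect(view_dir, normal):
    """Mirror a unit view direction about a unit surface normal.

    Assumption: view_dir points from the surface towards the camera,
    so the result is 2 (v . n) n - v.
    """
    d = sum(v * n for v, n in zip(view_dir, normal))
    return tuple(2.0 * d * n - v for v, n in zip(view_dir, normal))


def blend_colors(c_cam, c_ref, w):
    """Convex blend of two RGB predictions at one 3D point.

    c_cam: color from the field conditioned on the camera view direction.
    c_ref: color from the field conditioned on the reflected direction.
    w:     blending weight in [0, 1]; here a plain number standing in
           for the output of a learned weight field (assumption).
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return tuple(w * r + (1.0 - w) * c for c, r in zip(c_cam, c_ref))
```

In this sketch a weight near 1 lets the reflected-view prediction dominate and a weight near 0 falls back to the camera-view prediction.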
2312.13271
Report |
Repaint123: Fast and High-quality One Image to 3D Generation with Progressive Controllable 2D Repainting |
Junwu Zhang, Zhenyu Tang, Yatian Pang, Xinhua Cheng, Peng Jin, Yida Wei, Munan Ning, Li Yuan |
Recent one image to 3D generation methods commonly adopt Score Distillation
Sampling (SDS). Despite the impressive results, there are multiple deficiencies
including multi-view inconsistency, over-saturated and over-smoothed textures,
as well as the slow generation speed. To address these deficiencies, we present
Repaint123 to alleviate multi-view bias as well as texture degradation and
speed up the generation process. The core idea is to combine the powerful image
generation capability of the 2D diffusion model and the texture alignment
ability of the repainting strategy for generating high-quality multi-view
images with consistency. We further propose visibility-aware adaptive
repainting strength for overlap regions to enhance the generated image quality
in the repainting process. The generated high-quality and multi-view consistent
images enable the use of simple Mean Square Error (MSE) loss for fast 3D
content generation. We conduct extensive experiments and show that our method
has a superior ability to generate high-quality 3D content with multi-view
consistency and fine textures in 2 minutes from scratch. Our project page is
available at https://pku-yuangroup.github.io/repaint123/. |
Presents Repaint123, a novel method that generates high-quality, multi-view consistent 3D content from a single image in approximately 2 minutes. |
Addresses limitations of existing single image to 3D generation methods, which suffer from multi-view inconsistency, over-saturated and smoothed textures, and slow generation speed. |
Employs a two-stage optimization strategy: a coarse stage using Gaussian Splatting and a refining stage that leverages a 2D controllable diffusion model with a progressive, controllable repainting scheme for texture refinement. |
Achieves superior multi-view consistency compared to existing methods, as evidenced by CLIP-similarity and contextual distance metrics.
Generates high-quality textures, addressing the over-smoothing issue common in other methods.
Significantly faster (around 2 minutes) than NeRF-based methods while maintaining high quality. |
Reliance on Gaussian Splatting, which is still under development and may exhibit geometry artifacts.
Potential for further improvement in reference-view reconstruction quality compared to some NeRF-based approaches. |
3d generation, image-to-3d, gaussian splatting, diffusion models, controllable image synthesis |
2312.13253
Report |
Conditional Image Generation with Pretrained Generative Model |
Rajesh Shrestha, Bowen Xie |
In recent years, diffusion models have gained popularity for their ability to
generate higher-quality images in comparison to GAN models. However, like any
other large generative models, these models require a huge amount of data,
computational resources, and meticulous tuning for successful training. This
poses a significant challenge, rendering it infeasible for most individuals. As
a result, the research community has devised methods to leverage pre-trained
unconditional diffusion models with additional guidance for the purpose of
conditional image generation. These methods enable conditional image
generation on diverse inputs and, most importantly, circumvent the need for
training the diffusion model. In this paper, our objective is to reduce the
time required and computational overhead introduced by the addition of guidance
in diffusion models -- while maintaining comparable image quality. We propose a
set of methods based on our empirical analysis, demonstrating a reduction in
computation time by approximately threefold. |
This paper proposes methods to reduce the computational overhead of guided image generation using pretrained diffusion models, specifically focusing on the Universal Guidance method. |
Guided diffusion models allow for controlled image generation but introduce significant computational overhead, making them time-consuming. Reducing this overhead is crucial for broader application. |
The paper analyzes the impact of hyperparameters (self-recurrence steps and backward guidance steps) and the necessity of guidance at different diffusion steps. It also explores a model-based approach to approximate the guidance process. |
Reducing self-recurrence steps to 5 and backward guidance steps to 10 significantly reduces computation time without significant quality loss.
Guidance is more critical in the initial stages of the diffusion process, allowing for its deactivation in later stages without substantial impact.
A model-based approach to replace the iterative guidance process shows potential but requires further refinement. |
The experiments were conducted on a limited dataset, requiring further validation on a larger scale.
The model-based approximation needs further development and a larger, more diverse dataset for training. |
diffusion models, image generation, guided diffusion, computational efficiency, clip |
2312.13150
Report |
Splatter Image: Ultra-Fast Single-View 3D Reconstruction |
Stanislaw Szymanowicz, Christian Rupprecht, Andrea Vedaldi |
We introduce Splatter Image, an ultra-efficient approach for monocular 3D object
reconstruction. Splatter Image is based on Gaussian Splatting, which allows
fast and high-quality reconstruction of 3D scenes from multiple images. We
apply Gaussian Splatting to monocular reconstruction by learning a neural
network that, at test time, performs reconstruction in a feed-forward manner,
at 38 FPS. Our main innovation is the surprisingly straightforward design of
this network, which, using 2D operators, maps the input image to one 3D
Gaussian per pixel. The resulting set of Gaussians thus has the form of an image,
the Splatter Image. We further extend the method to take several images as input
via cross-view attention. Owing to the speed of the renderer (588 FPS), we use
a single GPU for training while generating entire images at each iteration to
optimize perceptual metrics like LPIPS. On several synthetic, real,
multi-category and large-scale benchmark datasets, we achieve better results in
terms of PSNR, LPIPS, and other metrics while training and evaluating much
faster than prior works. Code, models, demo and more results are available at
https://szymanowiczs.github.io/splatter-image. |
Introduces Splatter Image, an ultra-efficient method for single- and few-view 3D object reconstruction using Gaussian Splatting. |
Addresses limitations of existing methods in terms of speed, efficiency, and reconstruction quality, particularly for single-view reconstruction. |
Predicts a 'Splatter Image' where each pixel represents parameters of a 3D Gaussian, using a U-Net architecture. Employs Gaussian Splatting for fast and high-quality rendering. Extends to multi-view by registering and fusing Gaussian mixtures from different views. |
Achieves state-of-the-art reconstruction quality on ShapeNet, CO3D, and Google Scanned Objects datasets, outperforming or being comparable to much slower methods.
Significantly faster than previous methods in both training and inference, enabling single-GPU training for standard benchmarks and competing with methods 50x more expensive in training.
Demonstrates the ability to reconstruct full 360° 3D objects from single views. |
Assumes a fixed number of views during training, potentially limiting generalisation to arbitrary viewpoints.
Relies on relative camera poses, which may be challenging to obtain accurately in some real-world scenarios. |
3d reconstruction, gaussian splatting, single-view reconstruction, few-shot learning, computer vision |
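The "one 3D Gaussian per pixel" idea in the Splatter Image entry above can be sketched as a reshaping step. The channel layout below (opacity, mean, scale, color, ten channels in total) and the names `Gaussian`, `CHANNELS`, and `splatter_image_to_gaussians` are assumptions made for illustration. The entry does not specify the parameterization, and a real model would also need fields such as rotation.

```python
from dataclasses import dataclass

# Hypothetical channel layout per pixel: 1 opacity + 3 mean + 3 scale + 3 color.
CHANNELS = 10


@dataclass
class Gaussian:
    opacity: float
    mean: tuple   # 3D position
    scale: tuple  # per-axis extent
    color: tuple  # RGB


def splatter_image_to_gaussians(image):
    """Flatten an H x W x CHANNELS nested list into H * W Gaussians.

    Each pixel of the predicted image holds the parameters of exactly
    one Gaussian, so the output length equals the pixel count.
    """
    gaussians = []
    for row in image:
        for px in row:
            if len(px) != CHANNELS:
                raise ValueError(f"expected {CHANNELS} channels, got {len(px)}")
            gaussians.append(
                Gaussian(px[0], tuple(px[1:4]), tuple(px[4:7]), tuple(px[7:10]))
            )
    return gaussians
```

Because the representation is an ordinary image, the predictor can be built from 2D operators, as the abstract notes.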
2312.12661
Report |
Misalign, Contrast then Distill: Rethinking Misalignments in Language-Image Pretraining |
Bumsoo Kim, Yeonsik Jo, Jinhyung Kim, Seung Hwan Kim |
Contrastive Language-Image Pretraining has emerged as a prominent approach
for training vision and text encoders with uncurated image-text pairs from the
web. To enhance data-efficiency, recent efforts have introduced additional
supervision terms that involve random-augmented views of the image. However,
since the image augmentation process is unaware of its text counterpart, this
procedure could cause various degrees of image-text misalignments during
training. Prior methods either disregarded this discrepancy or introduced
external models to mitigate the impact of misalignments during training. In
contrast, we propose a novel metric learning approach that capitalizes on these
misalignments as an additional training source, which we term "Misalign,
Contrast then Distill (MCD)". Unlike previous methods that treat augmented
images and their text counterparts as simple positive pairs, MCD predicts the
continuous scales of misalignment caused by the augmentation. Our extensive
experimental results show that our proposed MCD achieves state-of-the-art
transferability in multiple classification and retrieval downstream datasets. |
This paper introduces MCD (Misalign, Contrast then Distill), a novel training framework for Contrastive Language-Image Pretraining that leverages the varying levels of misalignment between randomly augmented images and their text descriptions. |
Random image augmentations in Contrastive Language-Image Pretraining can cause misalignments between the image and its corresponding text description, leading to performance degradation if not addressed properly. |
MCD utilizes a teacher-student network where the student learns from the continuous distance between the image--text and augmented image--text of the teacher model with a log-ratio loss. |
MCD achieves state-of-the-art transferability in multiple classification and retrieval downstream datasets.
MCD outperforms previous methods without relying on external models or additional parameters for inference.
The proposed distillation strategies, addressing misalignments in positive pairs, negative pairs, and noisy pairs, all contribute positively to the final performance. |
The paper mainly focuses on image augmentations as the source of misalignments and could further explore other sources.
Future work could extend MCD frameworks to other modalities beyond vision and language. |
contrastive learning, language-image pretraining, multi-modal learning, knowledge distillation, misalignment |
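The log-ratio distillation mentioned in the MCD entry above can be illustrated with a small sketch. The exact loss in the paper is not given in this entry, so the form below is an assumption: the student is penalized when the ratio of its two image-text distances (original versus augmented image) differs from the teacher's ratio. The function name and arguments are hypothetical.

```python
import math


def log_ratio_loss(student_d_orig, student_d_aug, teacher_d_orig, teacher_d_aug):
    """Squared difference between student and teacher log distance ratios.

    *_d_orig: distance between the original image and its text.
    *_d_aug:  distance between the augmented image and the same text.
    All distances must be strictly positive.
    """
    for d in (student_d_orig, student_d_aug, teacher_d_orig, teacher_d_aug):
        if d <= 0.0:
            raise ValueError("distances must be positive")
    student_ratio = math.log(student_d_aug / student_d_orig)
    teacher_ratio = math.log(teacher_d_aug / teacher_d_orig)
    return (student_ratio - teacher_ratio) ** 2
```

Under this assumed form the loss vanishes whenever the student reproduces the teacher's relative degree of misalignment, regardless of the absolute scale of the distances.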
2312.12540
Report |
Fixed-point Inversion for Text-to-image diffusion models |
Barak Meiri, Dvir Samuel, Nir Darshan, Gal Chechik, Shai Avidan, Rami Ben-Ari |
Text-guided diffusion models offer powerful new ways to generate and
manipulate images. Several applications of these models, including image
editing, interpolation, and semantic augmentation, require diffusion inversion.
This is the process of finding a noise seed that can be used to generate a
given image. Current techniques for inverting a given image can be slow or
inaccurate. The technical challenge for inverting the diffusion process arises
from an implicit equation over the latent that cannot be solved in closed form.
Previous approaches proposed to solve this issue by approximation or various
learning schemes. Here, we formulate the problem as a fixed-point equation
problem and solve it using fixed-point iterations, a well-studied approach in
numerical analysis. We further identify a source of inconsistency that
significantly hurts the inversion of real images encoded to the latent space.
We show how to correct it by applying a prompt-aware adjustment of the
encoding. Our solution, Fixed-point inversion, is much faster than previous
techniques like EDICT and Null-text, with similar inversion quality. It can be
combined with any pretrained diffusion model and requires no model training,
prompt tuning, or additional parameters. In a series of experiments, we find
that Fixed-point inversion shows improved results in several downstream tasks:
image editing, image interpolation, and generation of rare objects. |
This paper introduces Fixed-Point Inversion (FPI), a novel, fast, and accurate method for inverting real images in text-guided diffusion models. |
Diffusion inversion is crucial for many applications, including image editing and rare concept generation. Existing methods are either slow or inaccurate. |
FPI leverages fixed-point iterations to efficiently solve the implicit equation governing the diffusion process. It also introduces a prompt-aware adjustment to improve consistency. |
FPI achieves comparable or better reconstruction quality than state-of-the-art methods while being significantly faster.
FPI demonstrates superior performance in image editing, preserving image structure and adhering to target prompts.
FPI improves the generation of rare concepts by providing more accurate and efficient seed initialization for methods like SeedSelect. |
Theoretical convergence guarantees for FPI are not fully established.
Exploring more sophisticated implicit function solvers may further enhance FPI's performance. |
diffusion models, image inversion, image editing, rare concept generation, fixed-point iteration |
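The fixed-point formulation in the entry above can be sketched on a scalar toy problem. Inverting one deterministic DDIM step means solving an equation in which the unknown latent appears both on the left-hand side and inside the noise predictor, which is what makes it implicit. The sketch assumes the standard DDIM update with cumulative alpha values `a_t` and `a_prev`. The names `fixed_point`, `ddim_step`, and `invert_step` are hypothetical, and the noise predictor passed in is a stand-in for a real network.

```python
import math


def fixed_point(g, x0, max_iters=50, tol=1e-10):
    """Iterate x <- g(x) until successive iterates differ by less than tol."""
    x = x0
    for _ in range(max_iters):
        x_next = g(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x


def ddim_step(z_t, eps, a_t, a_prev):
    """Deterministic DDIM update z_t -> z_{t-1} for a given noise estimate."""
    r = math.sqrt(a_prev / a_t)
    return r * z_t + (math.sqrt(1.0 - a_prev) - r * math.sqrt(1.0 - a_t)) * eps


def invert_step(z_prev, eps_fn, a_t, a_prev, max_iters=50):
    """Recover z_t from z_{t-1} by solving z_t = c1 * z_prev + c2 * eps_fn(z_t).

    The equation is implicit because eps_fn is evaluated at the unknown z_t.
    """
    c1 = math.sqrt(a_t / a_prev)
    c2 = math.sqrt(1.0 - a_t) - c1 * math.sqrt(1.0 - a_prev)
    return fixed_point(lambda z: c1 * z_prev + c2 * eps_fn(z), z_prev, max_iters)
```

A toy predictor such as `lambda z: 0.5 * math.tanh(z)` makes the iteration a contraction, so it converges quickly; as the limitations above note, convergence with a real network is not fully guaranteed.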
2312.12491
Report |
StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation |
Akio Kodaira, Chenfeng Xu, Toshiki Hazama, Takanori Yoshimoto, Kohei Ohno, Shogo Mitsuhori, Soichi Sugano, Hanying Cho, Zhijian Liu, Kurt Keutzer |
We introduce StreamDiffusion, a real-time diffusion pipeline designed for
interactive image generation. Existing diffusion models are adept at creating
images from text or image prompts, yet they often fall short in real-time
interaction. This limitation becomes particularly evident in scenarios
involving continuous input, such as Metaverse, live video streaming, and
broadcasting, where high throughput is imperative. To address this, we present
a novel approach that transforms the original sequential denoising into the
batching denoising process. Stream Batch eliminates the conventional
wait-and-interact approach and enables fluid and high throughput streams. To
handle the frequency disparity between data input and model throughput, we
design a novel input-output queue for parallelizing the streaming process.
Moreover, the existing diffusion pipeline uses classifier-free guidance (CFG),
which requires additional U-Net computation. To mitigate the redundant
computations, we propose a novel residual classifier-free guidance (RCFG)
algorithm that reduces the number of negative conditional denoising steps to
only one or even zero. Besides, we introduce a stochastic similarity
filter (SSF) to optimize power consumption. Our Stream Batch achieves around
1.5x speedup compared to the sequential denoising method at different denoising
levels. The proposed RCFG leads to speeds up to 2.05x higher than the
conventional CFG. Combining the proposed strategies and existing mature
acceleration tools makes the image-to-image generation achieve up to 91.07 fps
on one RTX4090, improving the throughput of AutoPipeline developed by Diffusers
by over 59.56x. Furthermore, our proposed StreamDiffusion also significantly
reduces the energy consumption by 2.39x on one RTX3060 and 1.99x on one
RTX4090, respectively. |
Introduces StreamDiffusion, a real-time diffusion pipeline for interactive image generation, prioritizing high throughput and energy efficiency. |
Existing diffusion models lack real-time interactivity needed for applications like Metaverse, video games, and live streaming. |
Employs stream batch denoising, residual classifier-free guidance (RCFG), an input-output queue, stochastic similarity filtering, pre-computation, and a tiny autoencoder. |
Achieves up to 91.07 fps on an RTX 4090 GPU, outperforming Diffusers AutoPipeline by up to 59.6x.
RCFG achieves up to 2.05x speedup compared to conventional classifier-free guidance.
Stochastic similarity filtering significantly reduces GPU power usage (up to 2.39x on an RTX 3060 GPU). |
Fixed input dimensions and batch sizes limit flexibility, requiring new engine builds for different configurations.
Further exploration of more sophisticated similarity metrics for the stochastic similarity filter. |
diffusion models, real-time image generation, interactive ai, high throughput, energy efficiency |
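The change from sequential to batched denoising in the StreamDiffusion entry above can be sketched as a small scheduling simulation. The staggered schedule below is one reading of "Stream Batch" and is an assumption, as are the names `stream_batch` and `denoise_batch`. Each tick admits one new frame and advances every in-flight frame by one denoising step in a single batched call.

```python
from collections import deque


def stream_batch(frames, num_steps, denoise_batch):
    """Denoise a stream of frames with one batched model call per tick.

    denoise_batch(latents, steps_done) stands in for a batched U-Net call
    and must return one updated latent per input latent.
    Returns (finished latents in input order, number of batched calls).
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    pending = deque(frames)
    in_flight = deque()  # items are [latent, steps_done]
    outputs, calls = [], 0
    while pending or in_flight:
        if pending:
            in_flight.append([pending.popleft(), 0])
        latents = [item[0] for item in in_flight]
        steps = [item[1] for item in in_flight]
        updated = denoise_batch(latents, steps)
        calls += 1
        for item, new_latent in zip(in_flight, updated):
            item[0] = new_latent
            item[1] += 1
        # The oldest frame always finishes first, so pop from the left.
        while in_flight and in_flight[0][1] == num_steps:
            outputs.append(in_flight.popleft()[0])
    return outputs, calls
```

For F frames and n denoising steps this schedule issues F + n - 1 batched calls instead of F * n single-frame calls; whether that translates into a wall-clock speedup depends on how well the hardware handles the larger batch.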
2312.12490
Report |
InstructVideo: Instructing Video Diffusion Models with Human Feedback |
Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng, Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, Dong Ni |
Diffusion models have emerged as the de facto paradigm for video generation.
However, their reliance on web-scale data of varied quality often yields
results that are visually unappealing and misaligned with the textual prompts.
To tackle this problem, we propose InstructVideo to instruct text-to-video
diffusion models with human feedback by reward fine-tuning. InstructVideo has
two key ingredients: 1) To ameliorate the cost of reward fine-tuning induced by
generating through the full DDIM sampling chain, we recast reward fine-tuning
as editing. By leveraging the diffusion process to corrupt a sampled video,
InstructVideo requires only partial inference of the DDIM sampling chain,
reducing fine-tuning cost while improving fine-tuning efficiency. 2) To
mitigate the absence of a dedicated video reward model for human preferences,
we repurpose established image reward models, e.g., HPSv2. To this end, we
propose Segmental Video Reward, a mechanism to provide reward signals based on
segmental sparse sampling, and Temporally Attenuated Reward, a method that
mitigates temporal modeling degradation during fine-tuning. Extensive
experiments, both qualitative and quantitative, validate the practicality and
efficacy of using image reward models in InstructVideo, significantly enhancing
the visual quality of generated videos without compromising generalization
capabilities. Code and models will be made publicly available. |
This paper introduces InstructVideo, a novel approach to enhance text-to-video diffusion models by leveraging human feedback through reward fine-tuning. |
Existing text-to-video diffusion models often generate videos with subpar visual quality and misalignment with the textual prompts, primarily due to reliance on large-scale web data of inconsistent quality. Aligning model outputs with human preferences is crucial for improving video quality and prompt adherence. |
InstructVideo recasts the reward fine-tuning process as an editing task, reducing computational burden by utilizing partial inference of the DDIM sampling chain. It further introduces Segmental Video Reward (SegVR) and Temporally Attenuated Reward (TAR) to effectively utilize image reward models for evaluating video quality. |
InstructVideo significantly improves the visual quality of generated videos compared to the base model, demonstrating clearer structures, more appealing colors, finer details, and improved text-to-video alignment.
InstructVideo outperforms other reward fine-tuning methods in terms of both efficiency and effectiveness while exhibiting strong generalization ability to unseen text prompts.
The study validates that the quality of fine-tuning data does not limit the potential quality of the fine-tuned results, suggesting InstructVideo can generate videos exceeding the quality of the data it was trained on. |
While the use of image reward models proves effective, developing specialized video reward models could further enhance performance by capturing human preferences in a more holistic manner.
Future work could explore strategies to mitigate the risk of over-optimization, a common challenge in reward fine-tuning. |
text-to-video generation, diffusion models, reward fine-tuning, human preferences, video quality |
2312.12487
Report |
Adaptive Guidance: Training-free Acceleration of Conditional Diffusion Models |
Angela Castillo, Jonas Kohler, Juan C. Pérez, Juan Pablo Pérez, Albert Pumarola, Bernard Ghanem, Pablo Arbeláez, Ali Thabet |
This paper presents a comprehensive study on the role of Classifier-Free
Guidance (CFG) in text-conditioned diffusion models from the perspective of
inference efficiency. In particular, we relax the default choice of applying
CFG in all diffusion steps and instead search for efficient guidance policies.
We formulate the discovery of such policies in the differentiable Neural
Architecture Search framework. Our findings suggest that the denoising steps
proposed by CFG become increasingly aligned with simple conditional steps,
which renders the extra neural network evaluation of CFG redundant, especially
in the second half of the denoising process. Building upon this insight, we
propose "Adaptive Guidance" (AG), an efficient variant of CFG, that adaptively
omits network evaluations when the denoising process displays convergence. Our
experiments demonstrate that AG preserves CFG's image quality while reducing
computation by 25%. Thus, AG constitutes a plug-and-play alternative to
Guidance Distillation, achieving 50% of the speed-ups of the latter while being
training-free and retaining the capacity to handle negative prompts. Finally,
we uncover further redundancies of CFG in the first half of the diffusion
process, showing that entire neural function evaluations can be replaced by
simple affine transformations of past score estimates. This method, termed
LinearAG, offers even cheaper inference at the cost of deviating from the
baseline model. Our findings provide insights into the efficiency of the
conditional denoising process that contribute to more practical and swift
deployment of text-conditioned diffusion models. |
This paper introduces "Adaptive Guidance" (AG), an efficient variant of Classifier-Free Guidance (CFG) for text-conditioned diffusion models that reduces computational cost without sacrificing image quality. |
CFG, while effective for enhancing sample quality in text-to-image generation, doubles the number of function evaluations (NFEs), making it computationally expensive. |
The authors use Neural Architecture Search (NAS) to discover efficient guidance policies for diffusion models. They identify that CFG's denoising steps become redundant in the later stages as conditional and unconditional steps converge. AG leverages this finding by adaptively switching from CFG to cheaper conditional updates when the similarity between these steps is high. |
AG preserves CFG's image quality while reducing computation by 25%.
AG is a plug-and-play, training-free alternative to Guidance Distillation, achieving 50% of its speed-ups while handling negative prompts.
The study uncovers further CFG redundancies, showing potential for replacing NFEs with affine transformations of past score estimates. |
The linear approximation method (LinearAG), while promising for further runtime reduction, requires extensive evaluation as it deviates from replicating the baseline model.
Future work could explore extending the approach to multimodal conditioning beyond text and image. |
diffusion models, text-to-image generation, classifier-free guidance, neural architecture search, efficient inference |
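The adaptive switch described in the Adaptive Guidance entry above can be sketched as follows. The use of cosine similarity, the threshold value, and the names `cfg`, `cosine_similarity`, and `adaptive_guidance` are assumptions for illustration. The entry says only that AG switches from CFG to conditional-only updates once the conditional and unconditional steps are sufficiently similar. The two predictor callables stand in for network evaluations.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0.0 and nb > 0.0 else 0.0


def cfg(eps_cond, eps_uncond, scale):
    """Standard classifier-free guidance combination."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def adaptive_guidance(eps_cond_fn, eps_uncond_fn, num_steps, scale=7.5, threshold=0.99):
    """Apply CFG until the two predictions align, then use conditional only.

    Returns (noise estimates per step, number of function evaluations).
    """
    outputs, nfe, use_cfg = [], 0, True
    for t in range(num_steps):
        c = eps_cond_fn(t)
        nfe += 1
        if use_cfg:
            u = eps_uncond_fn(t)
            nfe += 1
            outputs.append(cfg(c, u, scale))
            if cosine_similarity(c, u) >= threshold:
                use_cfg = False  # later steps skip the unconditional call
        else:
            outputs.append(c)
    return outputs, nfe
```

Full CFG costs two evaluations per step, so every step taken after the switch saves one evaluation.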
2312.12483
Report |
SCoTTi: Save Computation at Training Time with an adaptive framework |
Ziyu Lin, Enzo Tartaglione, Van-Tam Nguyen |
On-device training is an emerging approach in machine learning where models
are trained on edge devices, aiming to enhance privacy protection and real-time
performance. However, edge devices typically possess restricted computational
power and resources, making it challenging to perform computationally intensive
model training tasks. Consequently, reducing resource consumption during
training has become a pressing concern in this field. To this end, we propose
SCoTTi (Save Computation at Training Time), an adaptive framework that
addresses the aforementioned challenge. It leverages an optimizable threshold
parameter to effectively reduce the number of neuron updates during training
which corresponds to a decrease in memory and computation footprint. Our
proposed approach demonstrates superior performance compared to the
state-of-the-art methods regarding computational resource savings on various
commonly employed benchmarks and popular architectures, including ResNets,
MobileNet, and Swin-T. |
Introduces SCoTTi, an adaptive framework that reduces the computational cost of on-device training by selectively updating neurons based on their learning progress. |
On-device training is challenging due to limited computational resources of edge devices, making resource efficiency crucial. |
SCoTTi combines an ultimate optimizer for dynamic learning rate adjustment and the concept of neuron velocity to identify neurons requiring updates. It introduces a learnable threshold to determine neuron equilibrium and dynamically adjusts it during training. |
SCoTTi achieves significant FLOPs reduction across various datasets and architectures.
The method maintains or even slightly improves accuracy compared to traditional training methods.
SCoTTi effectively mitigates overfitting by gradually increasing the neuron update threshold during training. |
The performance of SCoTTi can degrade if the average FLOPs value falls below a certain threshold.
Further research is needed to explore the adaptability of SCoTTi to other domains and hardware platforms. |
on-device training, resource efficiency, adaptive training, neuron velocity, flops reduction |
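The selection rule in the SCoTTi entry above can be sketched in a few lines. The definition of neuron velocity used here (mean absolute change of a neuron's outputs between two checkpoints on the same inputs) is an assumption, and the threshold is a plain number rather than the learnable parameter the entry describes. The function names are hypothetical.

```python
def neuron_velocity(prev_out, curr_out):
    """Mean absolute change of one neuron's outputs on a fixed set of inputs."""
    if len(prev_out) != len(curr_out) or not curr_out:
        raise ValueError("output lists must be non-empty and of equal length")
    return sum(abs(c - p) for p, c in zip(prev_out, curr_out)) / len(curr_out)


def select_neurons_to_update(prev_outputs, curr_outputs, threshold):
    """Return the neurons whose velocity is still at or above the threshold.

    prev_outputs / curr_outputs map a neuron name to its outputs at the
    previous and current checkpoint. Neurons below the threshold are
    treated as being at equilibrium and are skipped during training.
    """
    return sorted(
        name
        for name in curr_outputs
        if neuron_velocity(prev_outputs[name], curr_outputs[name]) >= threshold
    )
```

In this sketch, raising the threshold skips more neurons and therefore saves more computation per update.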
2312.12468
Report |
MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers |
Haoyu Ma, Shahin Mahdizadehaghdam, Bichen Wu, Zhipeng Fan, Yuchao Gu, Wenliang Zhao, Lior Shapira, Xiaohui Xie |
Recent advances in generative AI have significantly enhanced image and video
editing, particularly in the context of text prompt control. State-of-the-art
approaches predominantly rely on diffusion models to accomplish these tasks.
However, the computational demands of diffusion-based methods are substantial,
often necessitating large-scale paired datasets for training, and therefore
challenging the deployment in real applications. To address these issues, this
paper breaks down the text-based video editing task into two stages. First, we
leverage a pre-trained text-to-image diffusion model to simultaneously edit
a few keyframes in a zero-shot way. Second, we introduce an efficient model
called MaskINT, which is built on non-autoregressive masked generative
transformers and specializes in frame interpolation between the edited
keyframes, using the structural guidance from intermediate frames. Experimental
results suggest that our MaskINT achieves comparable performance with
diffusion-based methodologies, while significantly improving the inference time.
This research offers a practical solution for text-based video editing and
showcases the potential of non-autoregressive masked generative transformers in
this domain. |
Introduces MaskINT, a two-stage text-based video editing framework that combines keyframe editing with structure-aware frame interpolation using non-autoregressive masked generative transformers. |
Addresses limitations of diffusion-based methods for video editing, such as high computational cost and the need for large paired text-video datasets. |
Uses a pre-trained text-to-image model to edit keyframes and a novel structure-aware frame interpolation module based on non-autoregressive transformers to generate intermediate frames. |
Achieves comparable quality to diffusion-based methods in terms of temporal consistency and adherence to text prompts.
Significantly faster than diffusion-based methods, with a 5-7 times improvement in inference time.
Demonstrates the potential of non-autoregressive masked generative transformers for efficient video editing. |
Limited to structure-preserving edits and struggles with new objects appearing in intermediate frames.
Performance relies heavily on the accuracy of the keyframe editing model and structure detector. |
video editing, text-to-video, generative transformers, frame interpolation, non-autoregressive generation |
2312.12433
Report |
TAO-Amodal: A Benchmark for Tracking Any Object Amodally |
Cheng-Yen Hsieh, Kaihua Chen, Achal Dave, Tarasha Khurana, Deva Ramanan |
Amodal perception, the ability to comprehend complete object structures from
partial visibility, is a fundamental skill, even for infants. Its significance
extends to applications like autonomous driving, where a clear understanding of
heavily occluded objects is essential. However, modern detection and tracking
algorithms often overlook this critical capability, perhaps due to the
prevalence of modal annotations in most benchmarks. To address the
scarcity of amodal benchmarks, we introduce TAO-Amodal, featuring 833 diverse
categories in thousands of video sequences. Our dataset includes
amodal and modal bounding boxes for visible and partially or fully
occluded objects, including those that are partially out of the camera frame.
We investigate the current lay of the land in both amodal tracking and
detection by benchmarking state-of-the-art modal trackers and amodal
segmentation methods. We find that existing methods, even when adapted for
amodal tracking, struggle to detect and track objects under heavy occlusion. To
mitigate this, we explore simple finetuning schemes that can increase the
amodal tracking and detection metrics of occluded objects by 2.1% and 3.3%. |
Introduces TAO-Amodal, a large-scale benchmark for amodal tracking of diverse objects in videos, featuring 833 categories and amodal bounding box annotations for occluded objects. |
Amodal perception, the ability to perceive the full extent of objects even when occluded, is crucial for applications like autonomous driving but overlooked in current benchmarks that focus on modal perception. |
Annotates 17,000 object tracks with amodal bounding boxes, leveraging the existing TAO dataset for modal annotations, and proposes evaluation metrics for amodal tracking. |
Existing trackers and amodal segmentation methods struggle with heavy occlusion and out-of-frame scenarios.
Fine-tuning modal trackers on TAO-Amodal improves amodal tracking performance.
Amodal expander, a lightweight module for predicting amodal boxes, shows promising results, especially when combined with data augmentation. |
Limited size of the amodal training set.
Exploiting temporal information for amodal tracking requires further exploration. |
amodal perception, object tracking, benchmarking, occlusion reasoning, computer vision |
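To make the modal/amodal distinction in the TAO-Amodal entry concrete, the following minimal Python sketch compares an amodal box that extends past the image border with the part of it that stays inside the frame. It is an illustration only and is not taken from the paper or its toolkit: the `(x1, y1, x2, y2)` box format, the helper names `clip_to_frame` and `iou`, and the example numbers are all assumptions.

```python
def clip_to_frame(box, width, height):
    """Clip an (x1, y1, x2, y2) box to the image frame."""
    x1, y1, x2, y2 = box
    return (max(0.0, x1), max(0.0, y1), min(float(width), x2), min(float(height), y2))


def area(box):
    """Area of a box; degenerate (empty) boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou(a, b):
    """Intersection over union of two boxes."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical 100x100 frame; the object extends 50 px past the left edge.
amodal_box = (-50.0, 20.0, 50.0, 80.0)
in_frame_box = clip_to_frame(amodal_box, 100, 100)
overlap = iou(amodal_box, in_frame_box)
```

In this toy case half of the object lies outside the frame, so a detector that only predicts the in-frame box reaches an IoU of 0.5 against the amodal ground truth, even though it localizes the visible part perfectly.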
2312.12423
Report |
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model |
Shraman Pramanick, Guangxing Han, Rui Hou, Sayan Nag, Ser-Nam Lim, Nicolas Ballas, Qifan Wang, Rama Chellappa, Amjad Almahairi |
The ability of large language models (LLMs) to process visual inputs has
given rise to general-purpose vision systems, unifying various vision-language
(VL) tasks by instruction tuning. However, due to the enormous diversity in
input-output formats in the vision domain, existing general-purpose models fail
to successfully integrate segmentation and multi-image inputs with coarse-level
tasks into a single framework. In this work, we introduce VistaLLM, a powerful
visual system that addresses coarse- and fine-grained VL tasks over single and
multiple input images using a unified framework. VistaLLM utilizes an
instruction-guided image tokenizer that filters global embeddings using task
descriptions to extract compressed and refined features from numerous images.
Moreover, VistaLLM employs a gradient-aware adaptive sampling technique to
represent binary segmentation masks as sequences, significantly improving over
previously used uniform sampling. To bolster the desired capability of
VistaLLM, we curate CoinIt, a comprehensive coarse-to-fine instruction tuning
dataset with 6.8M samples. We also address the lack of multi-image grounding
datasets by introducing a novel task, AttCoSeg (Attribute-level
Co-Segmentation), which boosts the model's reasoning and grounding capability
over multiple input images. Extensive experiments on a wide range of V- and VL
tasks demonstrate the effectiveness of VistaLLM by achieving consistent
state-of-the-art performance over strong baselines across all downstream tasks.
Our project page can be found at https://shramanpramanick.github.io/VistaLLM/. |
VistaLLM, a general-purpose vision model, seamlessly integrates coarse- and fine-grained vision-language reasoning and grounding tasks over single and multiple input images, including segmentation tasks which previous general-purpose models could not handle. |
Unifying diverse vision-language tasks into a single framework reduces computational overhead associated with task-specific fine-tuning and improves performance by sharing feature representations. |
VistaLLM uses an instruction-guided image tokenizer to refine and compress global image embeddings, a gradient-aware adaptive sampling technique to represent segmentation masks as sequences, and a Vicuna LLM decoder to process image and language features and generate outputs. The model is trained on a large-scale coarse-to-fine instruction-tuning dataset (CoinIt) containing 6.8M samples and a new multi-image grounding dataset (AttCoSeg). |
VistaLLM achieves state-of-the-art performance across 15 vision-language benchmarks, surpassing specialist systems in many tasks.
The proposed adaptive sampling technique for segmentation masks improves mIoU scores by 3-4 points compared to uniform sampling.
The instruction-guided image tokenizer significantly enhances performance in tasks involving multiple images, such as NLVR and CoSeg. |
VistaLLM struggles to accurately ground tiny or obscured objects in cluttered environments, requiring further improvement in image feature robustness.
VistaLLM may generate harmful or unsafe outputs similar to other LLMs, necessitating active research in mitigating such risks. |
vision-language, general-purpose vision model, instruction tuning, segmentation, multi-image reasoning |
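The VistaLLM entry contrasts adaptive sampling of mask contours with uniform sampling. The sketch below illustrates one plausible reading of that contrast; it is not the paper's algorithm. It assumes that "gradient-aware" means keeping contour points where the boundary direction changes most, and the function names (`turning_angle`, `uniform_sample`, `adaptive_sample`, `rect_contour`) are hypothetical.

```python
import math


def turning_angle(prev_pt, pt, next_pt):
    """Absolute change of direction (radians) of the contour at `pt`."""
    a1 = math.atan2(pt[1] - prev_pt[1], pt[0] - prev_pt[0])
    a2 = math.atan2(next_pt[1] - pt[1], next_pt[0] - pt[0])
    d = abs(a2 - a1)
    return min(d, 2 * math.pi - d)


def uniform_sample(contour, k):
    """Keep k points at a (near) constant stride along the contour."""
    n = len(contour)
    return [contour[(i * n) // k] for i in range(k)]


def adaptive_sample(contour, k):
    """Keep the k points where the closed contour bends the most, in contour order."""
    n = len(contour)
    scores = [turning_angle(contour[i - 1], contour[i], contour[(i + 1) % n]) for i in range(n)]
    keep = sorted(sorted(range(n), key=lambda i: -scores[i])[:k])
    return [contour[i] for i in keep]


def rect_contour(w, h):
    """Dense closed contour of a w x h rectangle, one point per unit step."""
    top = [(x, 0) for x in range(w)]
    right = [(w, y) for y in range(h)]
    bottom = [(x, h) for x in range(w, 0, -1)]
    left = [(0, y) for y in range(h, 0, -1)]
    return top + right + bottom + left


contour = rect_contour(12, 4)
corners = adaptive_sample(contour, 4)
strided = uniform_sample(contour, 4)
```

With a budget of four points, the adaptive variant recovers the four corners of the rectangle, whereas the fixed stride lands on two edge midpoints and misses two corners, so the polygon rebuilt from the strided points no longer matches the mask.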
2312.12419
Report |
Scene-Conditional 3D Object Stylization and Composition |
Jinghao Zhou, Tomas Jakab, Philip Torr, Christian Rupprecht |
Recently, 3D generative models have made impressive progress, enabling the
generation of almost arbitrary 3D assets from text or image inputs. However,
these approaches generate objects in isolation without any consideration for
the scene where they will eventually be placed. In this paper, we propose a
framework that allows for the stylization of an existing 3D asset to fit into a
given 2D scene, and additionally produce a photorealistic composition as if the
asset was placed within the environment. This not only opens up a new level of
control for object stylization, for example, the same assets can be stylized to
reflect changes in the environment, such as summer to winter or fantasy versus
futuristic settings, but also makes the object-scene composition more
controllable. We achieve this by combining modeling and optimizing the object's
texture and environmental lighting through differentiable ray tracing with
image priors from pre-trained text-to-image diffusion models. We demonstrate
that our method is applicable to a wide variety of indoor and outdoor scenes
and arbitrary objects. |
This paper introduces a novel framework that adapts a 3D object's appearance to a given 2D scene, enabling photorealistic composition. |
This addresses the challenge of seamlessly integrating 3D objects into existing scenes, a task crucial for applications like video games and media production. |
The framework leverages differentiable ray tracing and image priors from pre-trained text-to-image diffusion models to optimize object texture and environmental lighting. |
The method achieves realistic adaptation of object appearance to diverse environments, including lighting and shadow effects.
The framework effectively preserves the object's original identity and structural details during the adaptation process.
The proposed light capturing apparatus, inspired by real-world techniques, enables accurate estimation of scene lighting from a single image. |
The reliance on differentiable rendering can be computationally demanding.
The current method assumes a single dominant light source for outdoor scenes, which might not always hold. |
3d object stylization, scene composition, diffusion models, differentiable rendering, light estimation |
2312.12416
Report |
Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models |
Shweta Mahajan, Tanzila Rahman, Kwang Moo Yi, Leonid Sigal |
The quality of the prompts provided to text-to-image diffusion models
determines how faithful the generated content is to the user's intent, often
requiring 'prompt engineering'. To harness visual concepts from target images
without prompt engineering, current approaches largely rely on embedding
inversion by optimizing and then mapping them to pseudo-tokens. However,
working with such high-dimensional vector representations is challenging
because they lack semantics and interpretability, and only allow simple vector
operations when using them. Instead, this work focuses on inverting the
diffusion model to obtain interpretable language prompts directly. The
challenge of doing this lies in the fact that the resulting optimization
problem is fundamentally discrete and the space of prompts is exponentially
large; this makes using standard optimization techniques, such as stochastic
gradient descent, difficult. To this end, we utilize a delayed projection
scheme to optimize for prompts representative of the vocabulary space in the
model. Further, we leverage the findings that different timesteps of the
diffusion process cater to different levels of detail in an image. The later,
noisy, timesteps of the forward diffusion process correspond to the semantic
information, and therefore, prompt inversion in this range provides tokens
representative of the image semantics. We show that our approach can identify
semantically interpretable and meaningful prompts for a target image which can
be used to synthesize diverse images with similar content. We further
illustrate the application of the optimized prompts in evolutionary image
generation and concept removal. |
This paper presents PH2P, a novel method for inverting text-to-image diffusion models to generate interpretable language prompts directly from images, surpassing the limitations of embedding inversion techniques. |
Existing methods for generating image prompts rely on prompt engineering or embedding inversion, which lack interpretability and limit prompt manipulation. Direct prompt inversion enables semantic understanding and flexible image editing. |
PH2P utilizes a delayed projection scheme and leverages the sensitivity of later diffusion timesteps to semantic information. This enables optimization for discrete text tokens within the model's vocabulary using the L-BFGS algorithm. |
PH2P generates semantically meaningful prompts for accurate and diverse image synthesis, outperforming baselines in CLIP similarity and LPIPS metrics.
The generated prompts exhibit high contextual similarity to ground-truth captions, as measured by BertScore, indicating human-interpretable prompt generation.
The method enables applications like evolutionary multi-concept image synthesis and concept removal through negative image prompting. |
The current study primarily focuses on single-image prompt inversion.
Future work will investigate efficient strategies for prompt optimization in multi-image settings. |
diffusion models, prompt engineering, image generation, text-to-image synthesis, prompt inversion |
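The PH2P entry describes optimizing in a continuous space and then projecting onto discrete tokens from the model's vocabulary. The sketch below shows only that projection step, as a nearest-neighbor lookup. It is an assumption-laden illustration: the toy 2-D embedding table, the Euclidean distance, and the names `nearest_token` and `project_prompt` are hypothetical, and neither the delayed projection schedule nor the L-BFGS optimization is shown.

```python
import math


def nearest_token(vec, vocab):
    """Return the vocabulary token whose embedding is closest to `vec` (Euclidean)."""
    return min(vocab, key=lambda tok: math.dist(vec, vocab[tok]))


def project_prompt(soft_prompt, vocab):
    """Map each continuous embedding to its nearest discrete token."""
    return [nearest_token(vec, vocab) for vec in soft_prompt]


# Toy 2-D "embedding table"; real text encoders use hundreds of dimensions.
vocab = {"cat": (1.0, 0.0), "dog": (0.0, 1.0), "beach": (-1.0, 0.0)}
soft_prompt = [(0.9, 0.2), (-0.7, 0.1)]  # continuous vectors after optimization
tokens = project_prompt(soft_prompt, vocab)
```

Because the output is a list of real vocabulary tokens rather than free-floating vectors, the result can be read, edited, and recombined as ordinary text, which is the interpretability argument the entry makes against embedding inversion.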
2312.12359
Report |
CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation |
Monika Wysoczańska, Oriane Siméoni, Michaël Ramamonjisoa, Andrei Bursuc, Tomasz Trzciński, Patrick Pérez |
The popular CLIP model displays impressive zero-shot capabilities thanks to
its seamless interaction with arbitrary text prompts. However, its lack of
spatial awareness makes it unsuitable for dense computer vision tasks, e.g.,
semantic segmentation, without an additional fine-tuning step that often uses
annotations and can potentially suppress its original open-vocabulary
properties. Meanwhile, self-supervised representation methods have demonstrated
good localization properties without human-made annotations nor explicit
supervision. In this work, we take the best of both worlds and propose an
open-vocabulary semantic segmentation method, which does not require any
annotations. We propose to locally improve dense MaskCLIP features, which are
computed with a simple modification of CLIP's last pooling layer, by
integrating localization priors extracted from self-supervised features. By
doing so, we greatly improve the performance of MaskCLIP and produce smooth
outputs. Moreover, we show that the used self-supervised feature properties can
directly be learnt from CLIP features. Our method CLIP-DINOiser needs only a
single forward pass of CLIP and two light convolutional layers at inference, no
extra supervision nor extra memory and reaches state-of-the-art results on
challenging and fine-grained benchmarks such as COCO, Pascal Context,
Cityscapes and ADE20k. The code to reproduce our results is available at
https://github.com/wysoczanska/clip_dinoiser. |
This paper introduces a novel open-vocabulary semantic segmentation method that enhances the dense features of MaskCLIP using localization cues derived from self-supervised learning (SSL) models, achieving state-of-the-art performance without requiring annotations or retraining CLIP. |
CLIP, despite its zero-shot capabilities, lacks spatial awareness for dense tasks like segmentation. Existing solutions often compromise CLIP's open-vocabulary nature. This work addresses this gap by integrating the localization strengths of SSL with the open-vocabulary nature of CLIP. |
The method refines MaskCLIP features by leveraging patch correlations from self-supervised DINO features. This is achieved via a lightweight convolutional layer trained to predict DINO-like correlations directly from CLIP features. It further incorporates a background filtering mechanism by learning objectness information, also inspired by DINO, to refine background predictions. |
The method surpasses previous state-of-the-art techniques on benchmarks like COCO, Pascal Context, Cityscapes, and ADE20k.
It demonstrates CLIP's inherent capacity for localization, effectively learned using simple convolutional layers.
The approach operates efficiently with a single forward pass of CLIP and minimal additional computation. |
The method's performance is inherently limited by CLIP's ability to differentiate between classes.
Future work may explore adaptive feature correlation granularity and address ambiguities in textual queries for further improvement. |
open-vocabulary semantic segmentation, clip, self-supervised learning, dino, localization |
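The CLIP-DINOiser entry describes refining dense MaskCLIP features with patch correlations. A minimal sketch of correlation-guided pooling is given below under stated assumptions: the guide features are passed in directly (in the paper the correlations are predicted from CLIP features by a light convolutional layer), and the threshold value and the names `cosine` and `guided_pooling` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def guided_pooling(clip_feats, guide_feats, threshold=0.5):
    """Replace each patch's CLIP feature by an affinity-weighted average over patches.

    Affinities are cosine similarities between guide features; similarities at or
    below `threshold` are zeroed so that unrelated patches do not mix.
    """
    dim = len(clip_feats[0])
    pooled = []
    for g_i in guide_feats:
        sims = [cosine(g_i, g_j) for g_j in guide_feats]
        weights = [s if s > threshold else 0.0 for s in sims]
        total = sum(weights)  # > 0 because each patch has similarity 1 with itself
        pooled.append(tuple(
            sum(w * f[d] for w, f in zip(weights, clip_feats)) / total
            for d in range(dim)
        ))
    return pooled


# Patches 0 and 1 belong to the same object according to the guide features,
# but patch 1 carries a noisy CLIP feature; patch 2 is a different region.
guide = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
clip = [(1.0, 0.0), (0.0, 1.0), (0.0, 1.0)]
pooled = guided_pooling(clip, guide)
```

Patches 0 and 1 end up with the same averaged feature, while patch 2 is left untouched, which is the smoothing-within-objects behaviour the entry attributes to the localization prior.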
2312.12337
Report |
pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction |
David Charatan, Sizhe Li, Andrea Tagliasacchi, Vincent Sitzmann |
We introduce pixelSplat, a feed-forward model that learns to reconstruct 3D
radiance fields parameterized by 3D Gaussian primitives from pairs of images.
Our model features real-time and memory-efficient rendering for scalable
training as well as fast 3D reconstruction at inference time. To overcome local
minima inherent to sparse and locally supported representations, we predict a
dense probability distribution over 3D and sample Gaussian means from that
probability distribution. We make this sampling operation differentiable via a
reparameterization trick, allowing us to back-propagate gradients through the
Gaussian splatting representation. We benchmark our method on wide-baseline
novel view synthesis on the real-world RealEstate10k and ACID datasets, where
we outperform state-of-the-art light field transformers and accelerate
rendering by 2.5 orders of magnitude while reconstructing an interpretable and
editable 3D radiance field. |
pixelSplat is a novel view synthesis model that reconstructs a 3D radiance field represented by 3D Gaussian primitives from image pairs, enabling real-time rendering and scalable training. |
Existing differentiable rendering methods are computationally expensive, while light field transformers lack interpretable 3D structure. pixelSplat addresses these limitations by combining the efficiency of Gaussian splatting with generalizable view synthesis. |
pixelSplat utilizes a two-view image encoder with epipolar attention to resolve scale ambiguity in real-world datasets. It predicts a dense probability distribution over 3D space and samples Gaussian means from it, using a reparameterization trick to maintain differentiability. |
Outperforms state-of-the-art light field transformers on RealEstate10k and ACID datasets.
Achieves 2.5 orders of magnitude faster rendering compared to baselines.
Produces an interpretable and editable 3D radiance field. |
Gaussian fusion and de-duplication from different views is not addressed.
Limited to in-distribution view synthesis and doesn't model unseen regions. |
novel view synthesis, 3d gaussian splatting, epipolar transformer, differentiable rendering, scale ambiguity |
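The pixelSplat entry mentions sampling Gaussian means from a predicted probability distribution while keeping the operation differentiable. The sketch below shows the forward pass of one way to do this, based on our reading of the idea: a depth bucket is sampled along a ray, and the Gaussian's opacity is set to the probability of the sampled bucket. Plain Python has no autodiff, so the gradient path is only described in comments; the bucket layout, the function name `sample_gaussian_along_ray`, and all numbers are assumptions.

```python
import random


def sample_gaussian_along_ray(origin, direction, probs, near, far, rng):
    """Sample one Gaussian mean on a ray from a discrete depth distribution.

    Returns (mean, opacity, bucket). The opacity is set to the probability of the
    sampled bucket, so in an autodiff framework a rendering loss that changes the
    opacity would also send a gradient to `probs`, even though the sampling step
    itself is discrete.
    """
    num_buckets = len(probs)
    bucket = rng.choices(range(num_buckets), weights=probs, k=1)[0]
    width = (far - near) / num_buckets
    depth = near + (bucket + 0.5) * width  # bucket centre; a learned offset could refine it
    mean = tuple(o + depth * d for o, d in zip(origin, direction))
    opacity = probs[bucket]
    return mean, opacity, bucket


rng = random.Random(0)
probs = [0.1, 0.7, 0.2]  # hypothetical per-pixel depth distribution
mean, opacity, bucket = sample_gaussian_along_ray(
    origin=(0.0, 0.0, 0.0), direction=(0.0, 0.0, 1.0),
    probs=probs, near=1.0, far=4.0, rng=rng,
)
```

The design point is that a Gaussian placed at an unlikely depth is rendered nearly transparent, so wrong samples do little damage and correct samples raise the probability of their bucket during training.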
2312.12198
Report |
Mask Grounding for Referring Image Segmentation |
Yong Xien Chng, Henry Zheng, Yizeng Han, Xuchong Qiu, Gao Huang |
Referring Image Segmentation (RIS) is a challenging task that requires an
algorithm to segment objects referred by free-form language expressions.
Despite significant progress in recent years, most state-of-the-art (SOTA)
methods still suffer from a considerable language-image modality gap at the pixel
and word level. These methods generally 1) rely on sentence-level language
features for language-image alignment and 2) lack explicit training supervision
for fine-grained visual grounding. Consequently, they exhibit weak object-level
correspondence between visual and language features. Without well-grounded
features, prior methods struggle to understand complex expressions that require
strong reasoning over relationships among multiple objects, especially when
dealing with rarely used or ambiguous clauses. To tackle this challenge, we
introduce a novel Mask Grounding auxiliary task that significantly improves
visual grounding within language features, by explicitly teaching the model to
learn fine-grained correspondence between masked textual tokens and their
matching visual objects. Mask Grounding can be directly used on prior RIS
methods and consistently brings improvements. Furthermore, to holistically
address the modality gap, we also design a cross-modal alignment loss and an
accompanying alignment module. These additions work synergistically with Mask
Grounding. With all these techniques, our comprehensive approach culminates in
MagNet (Mask-grounded Network), an architecture that significantly outperforms
prior art on three key benchmarks (RefCOCO, RefCOCO+ and G-Ref), demonstrating
our method's effectiveness in addressing current limitations of RIS algorithms.
Our code and pre-trained weights will be released. |
This paper introduces Mask Grounding, a novel auxiliary task that enhances Referring Image Segmentation (RIS) by improving fine-grained visual grounding in language features. |
Current RIS methods often struggle with complex referring expressions requiring detailed visual grounding due to the language-image modality gap. |
Mask Grounding trains the model to predict masked textual tokens using visual, linguistic, and segmentation information, encouraging fine-grained visual-textual correspondence. Additionally, a cross-modal alignment module and loss are introduced to further bridge the modality gap. |
MagNet, incorporating Mask Grounding, achieves state-of-the-art performance on RefCOCO, RefCOCO+, and G-Ref benchmarks.
Mask Grounding significantly improves language-image alignment compared to baseline methods.
The universality of Mask Grounding is demonstrated by its successful integration into other RIS methods like LAVT, ReLA, and CRIS, consistently boosting their performance. |
The impact of different masking strategies on the effectiveness of Mask Grounding warrants further investigation.
Future work will explore the application of Mask Grounding to other multimodal dense prediction tasks beyond RIS. |
referring image segmentation, visual grounding, masked language modeling, multimodal learning, computer vision |
2312.12030
Report |
Towards Accurate Guided Diffusion Sampling through Symplectic Adjoint Method |
Jiachun Pan, Hanshu Yan, Jun Hao Liew, Jiashi Feng, Vincent Y. F. Tan |
Training-free guided sampling in diffusion models leverages off-the-shelf
pre-trained networks, such as an aesthetic evaluation model, to guide the
generation process. Current training-free guided sampling algorithms obtain the
guidance energy function based on a one-step estimate of the clean image.
However, since the off-the-shelf pre-trained networks are trained on clean
images, the one-step estimation procedure of the clean image may be inaccurate,
especially in the early stages of the generation process in diffusion models.
This causes the guidance in the early time steps to be inaccurate. To overcome
this problem, we propose Symplectic Adjoint Guidance (SAG), which calculates
the gradient guidance in two inner stages. Firstly, SAG estimates the clean
image via $n$ function calls, where $n$ serves as a flexible hyperparameter
that can be tailored to meet specific image quality requirements. Secondly, SAG
uses the symplectic adjoint method to obtain the gradients accurately and
efficiently in terms of the memory requirements. Extensive experiments
demonstrate that SAG generates images of higher quality compared to the
baselines in both guided image and video generation tasks. |
This paper proposes Symplectic Adjoint Guidance (SAG), a training-free method for guided diffusion models that improves generation quality by using a multiple-step estimate of the clean image and a memory-efficient symplectic adjoint method for gradient backpropagation. |
Existing training-free guided sampling methods for diffusion models often produce inaccurate guidance due to the misalignment between the final generated image and its one-step denoised approximation, especially in early sampling stages. This leads to lower quality in generated images. |
SAG calculates gradient guidance in two stages: 1) it estimates the clean image using *n* denoising steps for higher accuracy, and 2) it employs the symplectic adjoint method to accurately and efficiently backpropagate gradients through these *n* steps. |
SAG generates images with higher quality compared to baseline methods like FreeDOM and Universal Guidance in style-guided image generation, as measured by style loss and CLIP score.
SAG achieves superior aesthetic improvement compared to Stable Diffusion, FreeDOM, and DOODL, as evidenced by higher aesthetic scores from LAION, PickScore, and HPSv2.
In personalized image generation, SAG outperforms DreamBooth, FreeDOM, and DOODL in object guidance, achieving higher CLIP image similarity, and demonstrates better face-ID matching and lower FID scores compared to FreeDOM. |
There is a trade-off between computation cost and generation quality depending on the number of estimation steps *n*.
The paper primarily focuses on image generation and video stylization, leaving exploration of other guidance tasks for future work. |
guided diffusion models, training-free guidance, symplectic adjoint method, image generation, video stylization |
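For context, the "one-step estimate of the clean image" that the SAG entry identifies as the source of inaccurate guidance is commonly written as follows. The notation is the standard DDPM one and is assumed here rather than quoted from the paper: $x_t$ is the noisy sample at step $t$, $\bar{\alpha}_t$ is the cumulative noise-schedule coefficient, and $\epsilon_\theta$ is the pre-trained noise predictor.

```latex
\hat{x}_{0}(x_t) = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}
```

Early in sampling $\bar{\alpha}_t$ is close to zero, so $x_t$ is almost pure noise and $\hat{x}_0$ is a blurry guess that off-the-shelf networks trained on clean images evaluate poorly. SAG replaces this single evaluation with an estimate obtained from $n$ denoising steps.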
2312.11894
Report |
3D-LFM: Lifting Foundation Model |
Mosam Dabhi, Laszlo A. Jeni, Simon Lucey |
The lifting of 3D structure and camera from 2D landmarks is at the
cornerstone of the entire discipline of computer vision. Traditional methods
have been confined to specific rigid objects, such as those in
Perspective-n-Point (PnP) problems, but deep learning has expanded our
capability to reconstruct a wide range of object classes (e.g. C3DPO and PAUL)
with resilience to noise, occlusions, and perspective distortions. All these
techniques, however, have been limited by the fundamental need to establish
correspondences across the 3D training data -- significantly limiting their
utility to applications where one has an abundance of "in-correspondence" 3D
data. Our approach harnesses the inherent permutation equivariance of
transformers to manage a varying number of points per 3D data instance,
withstands occlusions, and generalizes to unseen categories. We demonstrate
state of the art performance across 2D-3D lifting task benchmarks. Since our
approach can be trained across such a broad class of structures we refer to it
simply as a 3D Lifting Foundation Model (3D-LFM) -- the first of its kind. |
This paper presents a novel 3D Lifting Foundation Model (3D-LFM) capable of lifting 2D landmarks to 3D structures across 30+ categories using a single unified model, trained without object-specific information. |
Existing 2D-3D lifting methods are limited by the need for correspondences across 3D training data and object-specific knowledge, hindering their generalizability. 3D-LFM addresses these limitations by leveraging permutation equivariance in transformers and Procrustean alignment, enabling unified learning across diverse object categories. |
3D-LFM utilizes a graph-based transformer architecture with tokenized positional encoding (TPE) to handle varying numbers of landmarks. It employs Procrustean alignment to focus on deformable aspects within a canonical frame and a hybrid attention mechanism for efficient feature aggregation. |
3D-LFM achieves state-of-the-art performance on benchmarks like H3WB, outperforming specialized methods for human body, face, and hand categories.
It exhibits strong generalization by handling unseen object categories and rig configurations, evidenced by successful reconstructions on Acinoset, PASCAL3D+, and Panoptic Studio datasets.
Ablation studies confirm the efficacy of TPE in handling data imbalance and rig transfer, while Procrustean alignment and hybrid attention enhance performance and convergence speed. |
The model can face challenges when extreme perspective distortions cause misinterpretations of 2D keypoint configurations.
Future work involves incorporating visual features and temporal dynamics to enhance depth perception and object category differentiation under challenging real-world scenarios.
Exploring cross-category knowledge transfer |
3d lifting, foundation models, transformers, permutation equivariance, procrustean alignment |
2312.11841
Report |
MixRT: Mixed Neural Representations For Real-Time NeRF Rendering |
Chaojian Li, Bichen Wu, Peter Vajda, Yingyan Lin |
Neural Radiance Field (NeRF) has emerged as a leading technique for novel
view synthesis, owing to its impressive photorealistic reconstruction and
rendering capability. Nevertheless, achieving real-time NeRF rendering in
large-scale scenes has presented challenges, often leading to the adoption of
either intricate baked mesh representations with a substantial number of
triangles or resource-intensive ray marching in baked representations. We
challenge these conventions, observing that high-quality geometry, represented
by meshes with substantial triangles, is not necessary for achieving
photorealistic rendering quality. Consequently, we propose MixRT, a novel NeRF
representation that includes a low-quality mesh, a view-dependent displacement
map, and a compressed NeRF model. This design effectively harnesses the
capabilities of existing graphics hardware, thus enabling real-time NeRF
rendering on edge devices. Leveraging a highly-optimized WebGL-based rendering
framework, our proposed MixRT attains real-time rendering speeds on edge
devices (over 30 FPS at a resolution of 1280 x 720 on a MacBook M1 Pro laptop),
better rendering quality (0.2 PSNR higher in indoor scenes of the Unbounded-360
datasets), and a smaller storage size (less than 80% compared to
state-of-the-art methods). |
This paper introduces MixRT, a novel NeRF representation for real-time rendering on edge devices that combines a low-quality mesh, a view-dependent displacement map, and a compressed NeRF model (Instant-NGP). |
Real-time NeRF rendering in large-scale scenes is challenging, with existing methods relying on intricate baked meshes or resource-intensive ray marching. |
Leverages a low-quality mesh for coarse geometry, a view-dependent displacement map for refined intersection points, and a compressed Instant-NGP model for color. Employs a highly-optimized WebGL-based rendering framework. |
Achieves real-time rendering speeds on edge devices (over 30 FPS at 1280x720 on MacBook M1 Pro).
Delivers high rendering quality, outperforming state-of-the-art methods (0.2 PSNR higher in Unbounded-360 indoor scenes).
Reduces storage size compared to state-of-the-art methods (less than 80% of existing methods). |
Rendering quality in complex outdoor scenes can be further improved.
Limited to rasterization-based rendering methods, potentially impacting quality in specific scenarios. |
neural radiance field, nerf, real-time rendering, webgl, edge devices |
2312.11774
Report |
Text-Image Conditioned Diffusion for Consistent Text-to-3D Generation |
Yuze He, Yushi Bai, Matthieu Lin, Jenny Sheng, Yubin Hu, Qi Wang, Yu-Hui Wen, Yong-Jin Liu |
By lifting the pre-trained 2D diffusion models into Neural Radiance Fields
(NeRFs), text-to-3D generation methods have made great progress. Many
state-of-the-art approaches usually apply score distillation sampling (SDS) to
optimize the NeRF representations, which supervises the NeRF optimization with
pre-trained text-conditioned 2D diffusion models such as Imagen. However, the
supervision signal provided by such pre-trained diffusion models only depends
on text prompts and does not constrain the multi-view consistency. To inject
the cross-view consistency into diffusion priors, some recent works finetune
the 2D diffusion model with multi-view data, but still lack fine-grained view
coherence. To tackle this challenge, we incorporate multi-view image conditions
into the supervision signal of NeRF optimization, which explicitly enforces
fine-grained view consistency. With such stronger supervision, our proposed
text-to-3D method effectively mitigates the generation of floaters (due to
excessive densities) and completely empty spaces (due to insufficient
densities). Our quantitative evaluations on the T$^3$Bench dataset demonstrate
that our method achieves state-of-the-art performance over existing text-to-3D
methods. We will make the code publicly available. |
This paper proposes Text-Image Conditioned Diffusion (TICD), a novel text-to-3D generation method that leverages both text-conditioned and image-conditioned diffusion models to improve view consistency and geometric fidelity in generated 3D models. |
Existing text-to-3D methods often produce inconsistent multi-view images and struggle with generating accurate object densities, leading to artifacts like floaters or empty spaces. This method aims to address these limitations by incorporating fine-grained view consistency during the generation process. |
TICD uses two diffusion models during NeRF optimization: a text-conditioned multi-view model for coarse consistency and an image-conditioned novel view model for fine-grained view consistency. The method first renders reference views from sampled camera poses and uses them as conditions for the image-guided diffusion model. Both models contribute to score distillation, guiding the NeRF to generate consistent and accurate 3D models. |
TICD achieves state-of-the-art performance on the T$^3$Bench dataset, outperforming existing text-to-3D methods in terms of quality and text alignment.
The inclusion of the image-conditioned diffusion module significantly improves the generation quality and reduces artifacts like density collapse and color inconsistency.
Quantitative and qualitative results demonstrate that TICD generates 3D content with higher fidelity, clearer geometry, and improved consistency compared to previous approaches. |
The method relies on two separate diffusion models, which increases the number of parameters and computational cost.
Future work could explore designing a single diffusion model capable of handling both text-conditioned multi-view and image-conditioned novel view generation. |
text-to-3d generation, aigc, diffusion models, neural radiance fields (nerfs), multi-view consistency |
2312.11595
Report |
TIP: Text-Driven Image Processing with Semantic and Restoration Instructions |
Chenyang Qi, Zhengzhong Tu, Keren Ye, Mauricio Delbracio, Peyman Milanfar, Qifeng Chen, Hossein Talebi |
Text-driven diffusion models have become increasingly popular for various
image editing tasks, including inpainting, stylization, and object replacement.
However, it still remains an open research problem to adopt this
language-vision paradigm for more fine-level image processing tasks, such as
denoising, super-resolution, deblurring, and compression artifact removal. In
this paper, we develop TIP, a Text-driven Image Processing framework that
leverages natural language as a user-friendly interface to control the image
restoration process. We consider the capacity of text information in two
dimensions. First, we use content-related prompts to enhance the semantic
alignment, effectively alleviating identity ambiguity in the restoration
outcomes. Second, our approach is the first framework that supports fine-level
instruction through language-based quantitative specification of the
restoration strength, without the need for explicit task-specific design. In
addition, we introduce a novel fusion mechanism that augments the existing
ControlNet architecture by learning to rescale the generative prior, thereby
achieving better restoration fidelity. Our extensive experiments demonstrate
the superior restoration performance of TIP compared to the state of the art,
alongside offering the flexibility of text-based control over the restoration
effects. |
TIP, a text-driven image processing framework, uses natural language instructions for semantic and quantitative control over image restoration. |
Existing restoration methods struggle with semantic ambiguities in degraded images and lack flexible control over restoration strength. |
TIP decouples semantic and restoration prompts, leveraging a ControlNet adaptor trained on a synthetic dataset with paired text instructions. It introduces a modulation fusion layer for adaptive feature alignment. |
TIP outperforms existing image restoration methods both quantitatively and qualitatively.
Semantic prompts in TIP allow controlling the identity of objects in restored images.
Restoration prompts enable users to adjust the type and strength of restoration effects using natural language. |
The current implementation primarily focuses on four common degradation types.
Future work includes exploring more complex compositions of degradations. |
image restoration, text-guided image editing, diffusion models, controlnet, semantic image processing |
2312.11535
Report |
Customize-It-3D: High-Quality 3D Creation from A Single Image Using Subject-Specific Knowledge Prior |
Nan Huang, Ting Zhang, Yuhui Yuan, Dong Chen, Shanghang Zhang |
In this paper, we present a novel two-stage approach that fully utilizes the
information provided by the reference image to establish a customized knowledge
prior for image-to-3D generation. While previous approaches primarily rely on a
general diffusion prior, which struggles to yield consistent results with the
reference image, we propose a subject-specific and multi-modal diffusion model.
This model not only aids NeRF optimization by considering the shading mode for
improved geometry but also enhances texture from the coarse results to achieve
superior refinement. Both aspects contribute to faithfully aligning the 3D
content with the subject. Extensive experiments showcase the superiority of our
method, Customize-It-3D, outperforming previous works by a substantial margin.
It produces faithful 360-degree reconstructions with impressive visual quality,
making it well-suited for various applications, including text-to-3D creation. |
Presents Customize-It-3D, a novel two-stage approach for image-to-3D generation that utilizes a subject-specific and multi-modal diffusion model to enhance the personalization of 3D content creation. |
Existing image-to-3D generation methods often produce results that are inconsistent with the reference image, which compromises the fidelity of the reconstructed 3D objects. |
The method uses a two-stage coarse-to-fine framework. The coarse stage optimizes a NeRF using a subject-specific diffusion model for novel view synthesis and shading-aware guidance. The refine stage transforms the NeRF into a point cloud, enhancing texture realism through a subject-specific T2I model and a deferred rendering scheme. |
Significantly outperforms previous state-of-the-art methods in image-to-3D generation.
Produces faithful 360-degree reconstructions with impressive visual quality and 3D consistency.
Demonstrates versatility in handling general objects and enables applications like text-to-3D creation. |
Reliance on pretrained models for depth and normal estimation can impact overall generation quality.
Inherent geometry ambiguity from using generative priors can lead to issues like the Janus problem or over-flat geometry. |
image-to-3d generation, neural radiance fields (nerf), diffusion models, subject-specific knowledge prior, multi-modal learning |
2312.11473
Report |
Synthetic Shifts to Initial Seed Vector Exposes the Brittle Nature of Latent-Based Diffusion Models |
Mao Po-Yuan, Shashank Kotyan, Tham Yik Foong, Danilo Vasconcellos Vargas |
Recent advances in Conditional Diffusion Models have led to substantial
capabilities in various domains. However, understanding the impact of
variations in the initial seed vector remains an underexplored area of concern.
Particularly, latent-based diffusion models display inconsistencies in image
generation under standard conditions when initialized with suboptimal initial
seed vectors. To understand the impact of the initial seed vector on generated
samples, we propose a reliability evaluation framework that evaluates the
generated samples of a diffusion model when the initial seed vector is
subjected to various synthetic shifts. Our results indicate that slight
manipulations to the initial seed vector of the state-of-the-art Stable
Diffusion (Rombach et al., 2022) can lead to significant disturbances in the
generated samples, consequently creating images without the effect of
conditioning variables. In contrast, GLIDE (Nichol et al., 2022) stands out in
generating reliable samples even when the initial seed vector is transformed.
Thus, our study sheds light on the importance of the selection and the impact
of the initial seed vector in the latent-based diffusion model. |
This paper introduces a framework for systematically evaluating the robustness of diffusion models, focusing on their ability to handle variations in initial noise. |
Understanding the robustness of diffusion models is crucial as they are increasingly used for content creation and synthetic data generation. This study investigates why models like Stable Diffusion might exhibit inconsistent performance under slight variations in initial input noise, impacting their reliability. |
The authors apply five different noise perturbation techniques (uniform mean shift, random mean shift, standard deviation shift, mixed shift, and pixel arrangement shift) to the initial noise vector of different diffusion models (Stable Diffusion versions, Glide). They then evaluate the effect of these perturbations on image generation quality using metrics like top-1 and top-5 accuracy on ImageNet-100, as well as CLIP score. |
Stable Diffusion models show significant performance degradation with increasing noise perturbation, highlighting their sensitivity to the initial noise vector.
Glide demonstrates significantly higher robustness to noise perturbations compared to Stable Diffusion models, maintaining consistent performance across different noise levels.
The paper identifies the fixed variance in Stable Diffusion's denoising process and its tendency to amplify prediction errors as potential reasons for its reduced robustness compared to Glide. |
The study primarily focuses on image generation and might not directly generalize to other diffusion model applications.
Future work could explore the development of more robust diffusion models by incorporating the findings of this study, such as investigating alternative denoising processes and training strategies. |
diffusion models, robustness, stable diffusion, glide, image generation |
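The five perturbations named in the report above (uniform mean shift, random mean shift, standard deviation shift, mixed shift, and pixel arrangement shift) can be sketched on a flat seed vector. This is a minimal illustration, not the authors' code: the exact definitions, the function names, and the parameters `delta` and `scale` are assumptions, and the arrangement shift is read here as a plain permutation of the elements.

```python
import random
import statistics

def sample_seed(n, rng):
    """Draw an initial seed vector of n i.i.d. standard normal values."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def uniform_mean_shift(z, delta):
    """Add the same offset delta to every element."""
    return [v + delta for v in z]

def random_mean_shift(z, delta, rng):
    """Add an independent offset drawn from U(-delta, delta) to each element."""
    return [v + rng.uniform(-delta, delta) for v in z]

def std_shift(z, scale):
    """Rescale the spread around the empirical mean by `scale`."""
    mu = statistics.fmean(z)
    return [mu + scale * (v - mu) for v in z]

def mixed_shift(z, delta, scale):
    """Apply a standard deviation shift followed by a uniform mean shift."""
    return uniform_mean_shift(std_shift(z, scale), delta)

def arrangement_shift(z, rng):
    """Permute element positions; the multiset of values is unchanged."""
    out = list(z)
    rng.shuffle(out)
    return out

rng = random.Random(0)
z = sample_seed(4096, rng)
shifted = mixed_shift(z, delta=0.5, scale=1.5)
# The empirical mean moves by delta and the spread is multiplied by scale.
print(round(statistics.fmean(shifted) - statistics.fmean(z), 3))
print(round(statistics.pstdev(shifted) / statistics.pstdev(z), 3))
```

In an evaluation like the one described, each shifted vector would replace the sampler's initial noise, and the generated images would then be scored (for example with classifier accuracy or CLIP score) as the shift magnitude grows.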
2312.11461
Report |
GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning |
Ye Yuan, Xueting Li, Yangyi Huang, Shalini De Mello, Koki Nagano, Jan Kautz, Umar Iqbal |
Gaussian splatting has emerged as a powerful 3D representation that harnesses
the advantages of both explicit (mesh) and implicit (NeRF) 3D representations.
In this paper, we seek to leverage Gaussian splatting to generate realistic
animatable avatars from textual descriptions, addressing the limitations (e.g.,
flexibility and efficiency) imposed by mesh or NeRF-based representations.
However, a naive application of Gaussian splatting cannot generate high-quality
animatable avatars and suffers from learning instability; it also cannot
capture fine avatar geometries and often leads to degenerate body parts. To
tackle these problems, we first propose a primitive-based 3D Gaussian
representation where Gaussians are defined inside pose-driven primitives to
facilitate animation. Second, to stabilize and amortize the learning of
millions of Gaussians, we propose to use neural implicit fields to predict the
Gaussian attributes (e.g., colors). Finally, to capture fine avatar geometries
and extract detailed meshes, we propose a novel SDF-based implicit mesh
learning approach for 3D Gaussians that regularizes the underlying geometries
and extracts highly detailed textured meshes. Our proposed method, GAvatar,
enables the large-scale generation of diverse animatable avatars using only
text prompts. GAvatar significantly surpasses existing methods in terms of both
appearance and geometry quality, and achieves extremely fast rendering (100
fps) at 1K resolution. |
GAvatar is a novel approach for generating animatable avatars from text, built on a primitive-based implicit Gaussian representation and an SDF-based implicit mesh learning approach for 3D Gaussians. |
Existing methods for text-to-3D avatar generation struggle to balance fine-grained geometry, efficient rendering, and animation capabilities. GAvatar addresses these limitations. |
GAvatar represents avatars with pose-driven primitives, each containing 3D Gaussians. Neural implicit fields predict Gaussian attributes (color, opacity, etc.) for stable training with SDS loss. An SDF-based approach regularizes geometry and enables mesh extraction. |
Generates high-quality, animatable avatars with fine geometry details, surpassing existing methods.
Achieves fast rendering speed (100 fps at 1K resolution) due to the use of 3D Gaussians.
Enables high-quality textured mesh extraction from the learned 3D Gaussian avatar. |
Occasional color oversaturation in generated avatars, similar to other SDS-based methods.
Potential misalignment between geometry and appearance, requiring further exploration of consistent supervision techniques. |
text-to-3d, avatar generation, gaussian splatting, implicit mesh learning, animatable avatars |
2312.11459
Report |
VolumeDiffusion: Flexible Text-to-3D Generation with Efficient Volumetric Encoder |
Zhicong Tang, Shuyang Gu, Chunyu Wang, Ting Zhang, Jianmin Bao, Dong Chen, Baining Guo |
This paper introduces a pioneering 3D volumetric encoder designed for
text-to-3D generation. To scale up the training data for the diffusion model, a
lightweight network is developed to efficiently acquire feature volumes from
multi-view images. The 3D volumes are then trained on a diffusion model for
text-to-3D generation using a 3D U-Net. This research further addresses the
challenges of inaccurate object captions and high-dimensional feature volumes.
The proposed model, trained on the public Objaverse dataset, demonstrates
promising outcomes in producing diverse and recognizable samples from text
prompts. Notably, it empowers finer control over object part characteristics
through textual cues, fostering model creativity by seamlessly combining
multiple concepts within a single object. This research significantly
contributes to the progress of 3D generation by introducing an efficient,
flexible, and scalable representation methodology. Code is available at
https://github.com/checkcrab/VolumeDiffusion. |
This paper introduces VolumeDiffusion, a text-to-3D generation method built on a novel 3D volumetric representation and a lightweight encoder that efficiently acquires feature volumes from multi-view images. |
Scaling up training data for text-to-3D generation is crucial, and this method addresses limitations of previous representations by being efficient, flexible, and enabling fine-grained text control. |
The method uses a two-stage approach: 1) a lightweight encoder converts multi-view images to feature volumes, and 2) a 3D U-Net diffusion model learns the distribution of these volumes conditioned on text prompts. |
The lightweight encoder efficiently generates high-quality 3D volumes, processing 30 objects per second on a single GPU.
The diffusion model, trained on a subset of the Objaverse dataset, generates diverse and recognizable 3D objects from text prompts.
Compared to methods like Shap·E, VolumeDiffusion exhibits superior control over object part characteristics through textual cues. |
The model exhibits a bias towards generating white objects due to the prevalence of texture-less objects in the training dataset.
Generated objects often have over-smooth surfaces, potentially limited by the spatial resolution of feature volumes. |
text-to-3d generation, 3d volumetric representation, diffusion model, multi-view images, feature volume |
2312.11458
Report |
GauFRe: Gaussian Deformation Fields for Real-time Dynamic Novel View Synthesis |
Yiqing Liang, Numair Khan, Zhengqin Li, Thu Nguyen-Phuoc, Douglas Lanman, James Tompkin, Lei Xiao |
We propose a method for dynamic scene reconstruction using deformable 3D
Gaussians that is tailored for monocular video. Building upon the efficiency of
Gaussian splatting, our approach extends the representation to accommodate
dynamic elements via a deformable set of Gaussians residing in a canonical
space, and a time-dependent deformation field defined by a multi-layer
perceptron (MLP). Moreover, under the assumption that most natural scenes have
large regions that remain static, we allow the MLP to focus its
representational power by additionally including a static Gaussian point cloud.
The concatenated dynamic and static point clouds form the input for the
Gaussian Splatting rasterizer, enabling real-time rendering. The differentiable
pipeline is optimized end-to-end with a self-supervised rendering loss. Our
method achieves results that are comparable to state-of-the-art dynamic neural
radiance field methods while allowing much faster optimization and rendering.
Project website: https://lynl7130.github.io/gaufre/index.html |
This paper introduces GauFRe, a novel method for dynamic scene reconstruction from monocular videos using deformable 3D Gaussians and Gaussian splatting. |
Existing methods struggle to balance high-quality reconstruction with fast optimization and rendering, especially for dynamic scenes in monocular videos. |
GauFRe uses a deformation field parameterized by an MLP to deform canonical Gaussians, representing dynamic scene parts. A separate set of static Gaussians captures quasi-static regions. The model is optimized end-to-end with a self-supervised rendering loss. |
GauFRe achieves comparable or superior reconstruction quality to state-of-the-art dynamic neural radiance field methods.
The method allows for much faster optimization (around 20 minutes) compared to hours for some methods.
GauFRe enables real-time rendering for novel view synthesis. |
Modeling scenes with large or irregular motions is challenging due to the single MLP used for the entire deformation field.
The dynamic/static separation is sensitive to the quality of the initial structure-from-motion point cloud for real-world scenes. |
dynamic scene reconstruction, monocular video, deformable gaussians, gaussian splatting, novel view synthesis |
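The dynamic/static split described for GauFRe can be illustrated with Gaussian centres only. In this sketch a hand-written sinusoid stands in for the learned deformation MLP, and the function names are hypothetical; a real model would also handle the other Gaussian attributes and pass the result to a splatting rasterizer, which is omitted here.

```python
import math

def deformation_field(position, t):
    """Stand-in for the deformation MLP: a time-dependent offset for one centre.
    The sinusoid is purely illustrative; the real mapping is learned."""
    x, y, z = position
    return (0.1 * math.sin(t + x), 0.0, 0.1 * math.cos(t + z))

def deform(canonical, t, field=deformation_field):
    """Move each canonical Gaussian centre by the offset predicted at time t."""
    deformed = []
    for p in canonical:
        offset = field(p, t)
        deformed.append(tuple(pi + di for pi, di in zip(p, offset)))
    return deformed

def assemble_scene(canonical_dynamic, static, t):
    """Concatenate deformed dynamic centres with untouched static centres."""
    return deform(canonical_dynamic, t) + list(static)

canonical = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
static = [(5.0, 5.0, 5.0)]
scene = assemble_scene(canonical, static, t=0.5)
print(len(scene))   # number of centres handed to the rasterizer
print(scene[-1])    # the static centre is passed through unchanged
```

Keeping the static set outside the deformation step means the field only has to model the parts of the scene that actually move.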
2312.11417
Report |
PolyDiff: Generating 3D Polygonal Meshes with Diffusion Models |
Antonio Alliegro, Yawar Siddiqui, Tatiana Tommasi, Matthias Nießner |
We introduce PolyDiff, the first diffusion-based approach capable of directly
generating realistic and diverse 3D polygonal meshes. In contrast to methods
that use alternate 3D shape representations (e.g. implicit representations),
our approach is a discrete denoising diffusion probabilistic model that
operates natively on the polygonal mesh data structure. This enables learning
of both the geometric properties of vertices and the topological
characteristics of faces. Specifically, we treat meshes as quantized triangle
soups, progressively corrupted with categorical noise in the forward diffusion
phase. In the reverse diffusion phase, a transformer-based denoising network is
trained to revert the noising process, restoring the original mesh structure.
At inference, new meshes can be generated by applying this denoising network
iteratively, starting with a completely noisy triangle soup. Consequently, our
model is capable of producing high-quality 3D polygonal meshes, ready for
integration into downstream 3D workflows. Our extensive experimental analysis
shows that PolyDiff achieves a significant advantage (avg. FID and JSD
improvement of 18.2 and 5.8 respectively) over current state-of-the-art
methods. |
PolyDiff is the first diffusion-based generative model that operates directly on polygonal meshes, representing them as quantized triangle soups to learn both geometric and topological characteristics. |
Generating high-fidelity 3D shapes, often as polygonal meshes, is crucial for various applications, but existing methods struggle to capture mesh characteristics due to their reliance on alternate 3D representations. |
PolyDiff employs a discrete denoising diffusion model that corrupts quantized triangle soups with categorical noise and then learns to reverse this process using a transformer-based denoising network. |
PolyDiff outperforms state-of-the-art methods in unconditional mesh generation, achieving significant gains in FID and JSD metrics.
The method generates more coherent and cohesive 3D shapes compared to techniques relying on autoencoders or autoregressive models.
Analysis confirms that the discrete diffusion approach is better suited for mesh generation than continuous Gaussian noise. |
Extending PolyDiff to generate scene-level meshes instead of single objects is a potential future direction.
The sampling speed of PolyDiff could be improved by exploring better sampling techniques and diffusion model formulations. |
3d mesh generation, diffusion models, deep learning, generative models, computer vision |
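The idea of treating a mesh as a quantized triangle soup and corrupting it with categorical noise can be sketched as follows. The bin count, coordinate range, uniform-resampling noise, and function names are assumptions chosen for illustration; they are not taken from PolyDiff's implementation, and the learned reverse (denoising) step is not shown.

```python
import random

NUM_BINS = 128  # assumed quantization resolution

def quantize(coord, lo=-1.0, hi=1.0, bins=NUM_BINS):
    """Map a continuous coordinate in [lo, hi] to an integer bin index."""
    t = (coord - lo) / (hi - lo)
    return min(bins - 1, max(0, int(t * bins)))

def to_triangle_soup(triangles):
    """Flatten triangles [((x, y, z), (x, y, z), (x, y, z)), ...] into tokens.
    Each triangle contributes 9 discrete tokens; vertices are not shared."""
    return [quantize(c) for tri in triangles for vertex in tri for c in vertex]

def corrupt(tokens, keep_prob, rng, bins=NUM_BINS):
    """Forward-step sketch: keep each token with probability keep_prob,
    otherwise resample it uniformly over all bins (categorical noise)."""
    return [tok if rng.random() < keep_prob else rng.randrange(bins)
            for tok in tokens]

soup = to_triangle_soup([((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))])
noisy = corrupt(soup, keep_prob=0.5, rng=random.Random(0))
print(soup)
print(noisy)
```

With `keep_prob` near 1 the soup is almost intact, and with `keep_prob` at 0 every token is resampled, which corresponds to the fully noisy triangle soup that generation starts from.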
2312.11396
Report |
MAG-Edit: Localized Image Editing in Complex Scenarios via Mask-Based Attention-Adjusted Guidance |
Qi Mao, Lan Chen, Yuchao Gu, Zhen Fang, Mike Zheng Shou |
Recent diffusion-based image editing approaches have exhibited impressive
editing capabilities in images with simple compositions. However, localized
editing in complex scenarios has not been well-studied in the literature,
despite its growing real-world demands. Existing mask-based inpainting methods
fall short of retaining the underlying structure within the edit region.
Meanwhile, mask-free attention-based methods often exhibit editing leakage and
misalignment in more complex compositions. In this work, we develop MAG-Edit, a
training-free, inference-stage optimization method, which enables localized
image editing in complex scenarios. In particular, MAG-Edit optimizes the noise
latent feature in diffusion models by maximizing two mask-based cross-attention
constraints of the edit token, which in turn gradually enhances the local
alignment with the desired prompt. Extensive quantitative and qualitative
experiments demonstrate the effectiveness of our method in achieving both text
alignment and structure preservation for localized editing within complex
scenarios. |
MAG-Edit is a training-free method for localized image editing in complex, multi-object scenes that works by optimizing the noise latent features of a diffusion model. |
Existing mask-based methods struggle to maintain structural integrity within edited regions, while mask-free methods suffer from editing leakage and misalignment. |
MAG-Edit optimizes noise latent features by maximizing two mask-based cross-attention constraints of the edit token, enhancing local alignment with the desired prompt. |
MAG-Edit effectively balances text alignment and structure preservation in complex scenes.
Quantitative evaluations demonstrate significant improvements in text alignment within localized regions.
User studies confirm the superiority of MAG-Edit in text alignment, structure preservation, and overall editing quality. |
Long inference time due to the optimization process.
Limitations in editing scenarios requiring significant pose changes due to reliance on maintaining structure through cross-attention maps. |
image editing, diffusion models, cross-attention, localized editing, complex scenes |
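The report above says MAG-Edit maximizes two mask-based cross-attention constraints for the edit token but does not define them. One plausible reading, sketched here as an assumption, uses two ratios: how much of the edit token's attention lies inside the mask, and how much of the attention inside the mask goes to the edit token. The function names and the toy attention map are hypothetical.

```python
def spatial_ratio(attn, mask, token):
    """Share of `token`'s total attention mass that falls inside the mask.
    attn[p][k] is the attention of pixel p to token k; mask[p] is 0 or 1."""
    inside = sum(row[token] for row, m in zip(attn, mask) if m)
    total = sum(row[token] for row in attn)
    return inside / total

def token_ratio(attn, mask, token):
    """Share of the attention inside the mask that goes to `token`
    rather than to the other prompt tokens."""
    inside_token = sum(row[token] for row, m in zip(attn, mask) if m)
    inside_all = sum(sum(row) for row, m in zip(attn, mask) if m)
    return inside_token / inside_all

# Toy map: 4 pixels, 2 prompt tokens; the edit token is index 1.
attn = [[0.9, 0.1], [0.8, 0.2], [0.5, 0.5], [0.3, 0.7]]
mask = [0, 0, 1, 1]
print(spatial_ratio(attn, mask, token=1))
print(token_ratio(attn, mask, token=1))
```

In an inference-stage optimization of this kind, the noise latent would be updated by gradient ascent on such ratios at each denoising step; that requires a differentiable attention map and is not shown here.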
2312.11392
Report |
SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing |
Zeyinzi Jiang, Chaojie Mao, Yulin Pan, Zhen Han, Jingfeng Zhang |
Image diffusion models have been utilized in various tasks, such as
text-to-image generation and controllable image synthesis. Recent research has
introduced tuning methods that make subtle adjustments to the original models,
yielding promising results in specific adaptations of foundational generative
diffusion models. Rather than modifying the main backbone of the diffusion
model, we delve into the role of skip connection in U-Net and reveal that
hierarchical features aggregating long-distance information across encoder and
decoder make a significant impact on the content and quality of image
generation. Based on the observation, we propose an efficient generative tuning
framework, dubbed SCEdit, which integrates and edits Skip Connection using a
lightweight tuning module named SC-Tuner. Furthermore, the proposed framework
allows for straightforward extension to controllable image synthesis by
injecting different conditions with Controllable SC-Tuner, simplifying and
unifying the network design for multi-condition inputs. Our SCEdit
substantially reduces training parameters, memory usage, and computational
expense due to its lightweight tuners, with backward propagation only passing
to the decoder blocks. Extensive experiments conducted on text-to-image
generation and controllable image synthesis tasks demonstrate the superiority
of our method in terms of efficiency and performance. Project page:
https://scedit.github.io/ |
This paper proposes SCEdit, a framework that edits the skip connections of a U-Net to enable efficient fine-tuning and controllable synthesis with image diffusion models. |
Fine-tuning large diffusion models is resource-intensive. This work introduces an efficient alternative, SCEdit, that achieves comparable results with reduced computational cost and improved controllability. |
SCEdit introduces lightweight tuning modules (SC-Tuner & CSC-Tuner) that edit the latent features within skip connections of a pre-trained U-Net, allowing for efficient adaptation without modifying the main backbone. |
SCEdit outperforms existing text-to-image tuning methods on COCO2017 in FID score and visual quality while using significantly fewer parameters and memory.
For controllable synthesis, SCEdit achieves strong results with various conditions (edges, depth, segmentation, etc.) using only 7.9% of ControlNet's parameters and 30% less memory.
SCEdit supports composable generation by combining multiple conditions and demonstrates generalization ability, enabling tasks like sketch-to-image and controlled outpainting. |
The performance depends on the pre-trained model due to the frozen backbone.
Potential misuse of high-risk data during training could lead to harmful outputs. |
image generation, diffusion models, efficient tuning, controllable synthesis, skip connections |
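SC-Tuner is described above only as a lightweight module that edits skip-connection features while the backbone stays frozen. A minimal sketch, assuming an adapter-style residual form (down-projection, GELU, up-projection, added back to the input) with a zero-initialized up-projection so that an untrained tuner leaves the skip feature unchanged. The class name, sizes, and initialization are illustrative assumptions, not the authors' implementation.

```python
import math

def matvec(weights, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def gelu(v):
    """Exact GELU activation."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

class SkipTuner:
    """Hypothetical residual tuner applied to one skip-connection feature."""

    def __init__(self, dim, hidden):
        # Small fixed down-projection weights keep the sketch deterministic.
        self.w_down = [[0.01 * (i + j + 1) for j in range(dim)]
                       for i in range(hidden)]
        # Zero up-projection: the tuner starts as an identity mapping.
        self.w_up = [[0.0] * hidden for _ in range(dim)]

    def __call__(self, x):
        h = [gelu(v) for v in matvec(self.w_down, x)]
        delta = matvec(self.w_up, h)
        return [xi + di for xi, di in zip(x, delta)]

tuner = SkipTuner(dim=4, hidden=2)
skip_feature = [1.0, 2.0, 3.0, 4.0]
print(tuner(skip_feature))   # unchanged before any training
tuner.w_up[0][0] = 1.0       # pretend training has updated one weight
print(tuner(skip_feature))   # the first channel is now edited
```

Only `w_down` and `w_up` would be trainable in such a setup, which is what keeps the parameter count small relative to tuning the backbone.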
2312.11370
Report |
G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model |
Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, Lingpeng Kong |
Large language models (LLMs) have shown remarkable proficiency in human-level
reasoning and generation capabilities, which encourages extensive research on
their application in mathematical problem solving. However, current work has
been largely focused on text-based mathematical problems, with limited
investigation in problems involving geometric information. Addressing this gap,
we aim to enable LLMs to solve geometric problems by understanding image input.
We first analyze the limitations of current Multimodal Large Language Models
(MLLMs) in this area: they struggle to accurately comprehend basic geometric
elements and their relationships. To overcome these challenges, we take
advantage of the unique characteristics of geometric problems (such as unique
geometric logical form, and geometric scalability) and the capacity of the
textual LLMs to build an enriched multimodal geometry dataset based on existing
data. The augmented dataset, Geo170K, contains more than 170K geometric
image-caption and question-answer pairs. Utilizing our constructed Geo170K
dataset, we develop G-LLaVA, which demonstrates exceptional performance in
solving geometric problems, significantly outperforming GPT-4-V on the
MathVista benchmark with only 7B parameters. |
This paper introduces a novel method for enhancing Multimodal Large Language Models (MLLMs) to solve geometric problems, addressing the limitations of current models in comprehending and reasoning about geometric information. |
Existing MLLMs often struggle to understand geometric elements and their relationships, hindering their ability to solve geometric problems effectively. This paper aims to bridge this gap by improving the models' geometric reasoning capabilities. |
The authors propose a two-phase approach: (1) **Geometric Cross-Modal Alignment**: Using existing datasets, they generate image captions and contrastive question-answer pairs, focusing on basic geometric elements. (2) **Geometric Instruction Tuning**: Utilizing text-only LLMs like ChatGPT, they enrich existing datasets by generating new problem variations, such as equation solving, value scaling, re-formulating conditions, and sentence paraphrasing. |
The resulting model, G-LLaVA, significantly outperforms existing MLLMs on the MathVista benchmark, even surpassing GPT-4-V with only 7B parameters.
G-LLaVA also demonstrates superior performance compared to traditional in-domain models on the GeoQA benchmark.
The effectiveness of the proposed cross-modal alignment and instruction tuning strategies is validated through ablation studies. |
The study primarily focuses on geometric problems, and further research is needed to assess its generalizability to other domains of mathematical reasoning.
Future work can explore incorporating more sophisticated geometric reasoning techniques and expanding the dataset to encompass a wider range of geometric concepts and problem types. |
multimodal large language models, geometric reasoning, data augmentation, instruction tuning, mathematical problem solving |
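Of the augmentation strategies listed for G-LLaVA, value scaling is the easiest to illustrate: if every length in a problem is multiplied by the same factor, a length-valued answer scales by that factor too. The report says the authors generate such variations with a text-only LLM; the regex-based helper below is a hypothetical stand-in, and it is only valid when every number in the question is a length (angles, counts, and areas would need different handling).

```python
import math
import re

def scale_values(question, answer, factor):
    """Multiply every number in the question and the numeric answer by `factor`.
    Assumes all numbers are lengths, so the answer scales linearly."""
    def repl(match):
        return format(float(match.group()) * factor, "g")
    scaled_question = re.sub(r"\d+(?:\.\d+)?", repl, question)
    return scaled_question, answer * factor

question = "In triangle ABC, angle B is a right angle, AB = 3 and BC = 4. Find AC."
new_question, new_answer = scale_values(question, 5, 2)
print(new_question)
print(new_answer)
# Sanity check: the scaled answer still satisfies the Pythagorean theorem.
print(math.isclose(math.hypot(6, 8), new_answer))
```

Each scaled copy shares the diagram of the original problem, so one image can back several question-answer pairs.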
2312.11360
Report |
Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering |
Kim Youwang, Tae-Hyun Oh, Gerard Pons-Moll |
We present Paint-it, a text-driven high-fidelity texture map synthesis method
for 3D meshes via neural re-parameterized texture optimization. Paint-it
synthesizes texture maps from a text description by
synthesis-through-optimization, exploiting the Score-Distillation Sampling
(SDS). We observe that directly applying SDS yields undesirable texture quality
due to its noisy gradients. We reveal the importance of texture
parameterization when using SDS. Specifically, we propose Deep Convolutional
Physically-Based Rendering (DC-PBR) parameterization, which re-parameterizes
the physically-based rendering (PBR) texture maps with randomly initialized
convolution-based neural kernels, instead of a standard pixel-based
parameterization. We show that DC-PBR inherently schedules the optimization
curriculum according to texture frequency and naturally filters out the noisy
signals from SDS. In experiments, Paint-it obtains remarkable quality PBR
texture maps within 15 min., given only a text description. We demonstrate the
generalizability and practicality of Paint-it by synthesizing high-quality
texture maps for large-scale mesh datasets and showing test-time applications
such as relighting and material control using a popular graphics engine.
Project page: https://kim-youwang.github.io/paint-it |
Paint-it: Text-driven high-fidelity PBR texture map synthesis for 3D meshes via neural re-parameterized texture optimization. |
Existing methods for generating textured 3D assets from text often produce low-quality results or rely on computationally expensive techniques. This paper addresses these limitations by directly synthesizing high-fidelity, physically-based texture maps on existing 3D models, facilitating practical use in graphics engines and pipelines. |
The method uses Score-Distillation Sampling (SDS) to guide the optimization of a Deep Convolutional Physically-Based Rendering (DC-PBR) model, which represents texture maps as randomly initialized U-Net convolutional kernels. This approach inherently schedules the optimization curriculum according to texture frequency and filters out noisy signals from SDS, leading to higher-quality results. |
Paint-it generates high-quality PBR texture maps for various 3D meshes, including objects, humans, and animals, demonstrating its generalizability.
The method produces superior texture maps compared to existing methods, as evidenced by qualitative comparisons and quantitative metrics like FID and user study scores.
The synthesized PBR texture maps are compatible with popular graphics engines and enable practical applications like relighting and material control. |
The optimization process can be time-consuming, taking 15-30 minutes per mesh.
Future work includes exploring faster optimization techniques and building a large-scale PBR texture map dataset for training feed-forward generative models. |
text-driven synthesis, pbr texture maps, 3d mesh texturing, score-distillation sampling, deep convolutional re-parameterization |
2312.11232
Report |
Self-Supervised Learning for Image Super-Resolution and Deblurring |
Jérémy Scanvic, Mike Davies, Patrice Abry, Julián Tachella |
Self-supervised methods have recently proved to be nearly as effective as
supervised methods in various imaging inverse problems, paving the way for
learning-based methods in scientific and medical imaging applications where
ground truth data is hard or expensive to obtain. This is the case in magnetic
resonance imaging and computed tomography. These methods critically rely on
invariance to translations and/or rotations of the image distribution to learn
from incomplete measurement data alone. However, existing approaches fail to
obtain competitive performances in the problems of image super-resolution and
deblurring, which play a key role in most imaging systems. In this work, we
show that invariance to translations and rotations is insufficient to learn
from measurements that only contain low-frequency information. Instead, we
propose a new self-supervised approach that leverages the fact that many image
distributions are approximately scale-invariant, and that enables recovering
high-frequency information lost in the measurement process. We demonstrate
through a series of experiments on real datasets that the proposed method
outperforms other self-supervised approaches, and obtains performances on par
with fully supervised learning. |
This paper introduces a novel self-supervised learning approach for image super-resolution and deblurring leveraging the approximate scale-invariance of many image distributions. |
Super-resolution and deblurring are crucial for various imaging systems, but existing self-supervised methods struggle when high-frequency information is lost in the measurement process. |
The method trains a deep neural network using a new loss function combining the SURE loss and a novel equivariant loss based on downscaling transformations. A key aspect is stopping the gradient during downscaling to enhance performance. |
The approach outperforms other self-supervised methods for both image deblurring and super-resolution.
It achieves performance on par with fully supervised learning methods.
Stopping the gradient during downscaling is shown to significantly improve reconstruction performance. |
Theoretical analysis of necessary and sufficient conditions for learning from measurements alone with scaling transformations (semi-group) is missing.
Exploring different downscaling implementations and training strategies could further improve performance. |
image deblurring, image super-resolution, self-supervised learning, scale-invariance, equivariant imaging |
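The scale-equivariant part of the loss described in this report can be sketched as follows. This is a minimal illustration on 1D signals in plain Python, not the authors' implementation: the operators `blur` and `downscale`, the function name `scale_equivariant_loss`, and the factor of 2 are all assumptions chosen for brevity, and the SURE term of the full loss is omitted. In an autograd framework, the downscaled reconstruction would be detached (stop-gradient) at the marked line.

```python
from typing import Callable, List

Signal = List[float]


def blur(x: Signal) -> Signal:
    """Toy forward operator A: 3-tap moving average with edge replication."""
    n = len(x)
    return [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3.0 for i in range(n)]


def downscale(x: Signal, factor: int = 2) -> Signal:
    """Toy scale transform: average non-overlapping blocks of `factor` samples."""
    return [sum(x[i:i + factor]) / factor for i in range(0, len(x) - factor + 1, factor)]


def scale_equivariant_loss(
    f: Callable[[Signal], Signal],
    y: Signal,
    forward: Callable[[Signal], Signal] = blur,
    factor: int = 2,
) -> float:
    """Mean squared error between a downscaled reconstruction and its re-reconstruction."""
    x1 = f(y)                    # reconstruct from the real measurement
    x2 = downscale(x1, factor)   # virtual ground truth; stop the gradient here
    y2 = forward(x2)             # simulate a measurement of the downscaled image
    x3 = f(y2)                   # reconstruct again from the simulated measurement
    return sum((a - b) ** 2 for a, b in zip(x3, x2)) / len(x2)
```

With an identity "network" the loss is zero on a constant signal (blurring changes nothing) and positive on a signal with high-frequency content, which is the information the loss pushes the network to recover.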
2312.10998
Report |
ID-Blau: Image Deblurring by Implicit Diffusion-based reBLurring AUgmentation |
Jia-Hao Wu, Fu-Jen Tsai, Yan-Tsung Peng, Chung-Chi Tsai, Chia-Wen Lin, Yen-Yu Lin |
Image deblurring aims to remove undesired blurs from an image captured in a
dynamic scene. Much research has been dedicated to improving deblurring
performance through model architectural designs. However, there is little work
on data augmentation for image deblurring. Since continuous motion causes
blurred artifacts during image exposure, we aspire to develop a groundbreaking
blur augmentation method to generate diverse blurred images by simulating
motion trajectories in a continuous space. This paper proposes Implicit
Diffusion-based reBLurring AUgmentation (ID-Blau), utilizing a sharp image
paired with a controllable blur condition map to produce a corresponding
blurred image. We parameterize the blur patterns of a blurred image with their
orientations and magnitudes as a pixel-wise blur condition map to simulate
motion trajectories and implicitly represent them in a continuous space. By
sampling diverse blur conditions, ID-Blau can generate various blurred images
unseen in the training set. Experimental results demonstrate that ID-Blau can
produce realistic blurred images for training and thus significantly improve
performance for state-of-the-art deblurring models. The source code is
available at https://github.com/plusgood-steven/ID-Blau. |
This paper proposes Implicit Diffusion-based reBLurring AUgmentation (ID-Blau), a novel data augmentation strategy for image deblurring. ID-Blau generates realistic blurred images by simulating motion trajectories in a continuous space, using a sharp image and a controllable blur condition map. |
Effective data augmentation is crucial for improving the performance of image deblurring models, but existing methods are limited in their ability to generate diverse and controllable blurred images. |
The authors model blur conditions in a continuous space, representing blur orientations and magnitudes. They then use a diffusion model conditioned on sharp images and blur condition maps to generate realistic blurred images. |
ID-Blau significantly improves the performance of four state-of-the-art deblurring models (MIMO-UNet+, Restormer, Stripformer, and FFTformer) on GoPro, HIDE, and RealBlur datasets.
ID-Blau generates more realistic blurred images compared to the GoPro dataset by simulating continuous motion trajectories.
The use of a diffusion process in ID-Blau leads to additional performance gains compared to directly training a reblurring model without diffusion. |
The authors mainly evaluate ID-Blau on synthetic datasets (GoPro and HIDE) and one real-world dataset (RealBlur). Further evaluation on more diverse real-world blurry images is needed.
ID-Blau requires additional training complexity compared to not using augmentation. Exploring more efficient ways to generate augmented data is an interesting direction. |
image deblurring, data augmentation, diffusion models, blur simulation, continuous blur condition |
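The pixel-wise blur condition map mentioned in this report (an orientation and a magnitude per pixel) can be illustrated with a short sketch. The encoding below, a unit direction plus a magnitude normalized by a hypothetical `max_magnitude`, is an assumption made for illustration; the report does not specify the exact parameterization, and the function names are not taken from the ID-Blau code.

```python
import math
from typing import List, Tuple

Condition = Tuple[float, float, float]  # (unit dx, unit dy, normalized magnitude)


def blur_condition(dx: float, dy: float, max_magnitude: float) -> Condition:
    """Encode one pixel's motion vector as orientation plus normalized magnitude."""
    mag = math.hypot(dx, dy)
    if mag == 0.0:
        return (0.0, 0.0, 0.0)  # no motion, hence no blur at this pixel
    return (dx / mag, dy / mag, min(mag / max_magnitude, 1.0))


def condition_map(flow: List[List[Tuple[float, float]]], max_magnitude: float) -> List[List[Condition]]:
    """Build a pixel-wise condition map from a 2D grid of motion vectors."""
    return [[blur_condition(dx, dy, max_magnitude) for dx, dy in row] for row in flow]


def rescale_magnitude(cmap: List[List[Condition]], scale: float) -> List[List[Condition]]:
    """Sample a new blur condition by scaling magnitudes while keeping orientations."""
    return [[(ux, uy, min(m * scale, 1.0)) for ux, uy, m in row] for row in cmap]
```

Because the map lives in a continuous space, new conditions such as those produced by `rescale_magnitude` need not correspond to any blurred image in the training set.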
2312.10945
Report |
LaViP:Language-Grounded Visual Prompts |
Nilakshan Kunananthaseelan, Jing Zhang, Mehrtash Harandi |
We introduce a language-grounded visual prompting method to adapt the visual
encoder of vision-language models for downstream tasks. By capitalizing on
language integration, we devise a parameter-efficient strategy to adjust the
input of the visual encoder, eliminating the need to modify or add to the
model's parameters. Due to this design choice, our algorithm can operate even
in black-box scenarios, showcasing adaptability in situations where access to
the model's parameters is constrained. We will empirically demonstrate that,
compared to prior art, grounding visual prompts with language enhances both the
accuracy and speed of adaptation. Moreover, our algorithm excels in
base-to-novel class generalization, overcoming limitations of visual prompting
and exhibiting the capacity to generalize beyond seen classes. We thoroughly
assess and evaluate our method across a variety of image recognition datasets,
such as EuroSAT, UCF101, DTD, and CLEVR, spanning different learning
situations, including few-shot learning, base-to-novel class generalization,
and transfer learning. |
This paper introduces LaViP, a language-grounded visual prompting method to adapt the visual encoder of vision-language models for downstream tasks. |
Existing visual prompting techniques suffer from limitations such as unimodality in learning prompts and an inability to generalize beyond seen classes. LaViP addresses these limitations by leveraging the multimodal nature of vision-language models. |
LaViP generates input-dependent visual prompts through low-rank matrix decomposition, incorporating both language and image information. It uses a Kronecker product to efficiently embed novel class knowledge for base-to-novel generalization. |
LaViP outperforms previous methods in few-shot learning by a significant margin, achieving up to 11.84% improvement over CLIP Zero-Shot.
In base-to-novel generalization, LaViP shows competitive performance, achieving an absolute gain of 2.64% compared to CoOp and CoCoOp.
LaViP demonstrates strong performance across diverse datasets and consistently outperforms existing visual prompting methods. |
LaViP's performance is sensitive to the chosen prompt template and may struggle with low-resolution images or datasets with limited semantic variation.
Future work could explore learning multimodal prompts with mutual synergy between visual and textual information. |
visual prompting, vision-language models, model reprogramming, few-shot learning, base-to-novel generalization |
2312.10899
Report |
MagicScroll: Nontypical Aspect-Ratio Image Generation for Visual Storytelling via Multi-Layered Semantic-Aware Denoising |
Bingyuan Wang, Hengyu Meng, Zeyu Cai, Lanjiong Li, Yue Ma, Qifeng Chen, Zeyu Wang |
Visual storytelling often uses nontypical aspect-ratio images like scroll
paintings, comic strips, and panoramas to create an expressive and compelling
narrative. While generative AI has achieved great success and shown the
potential to reshape the creative industry, it remains a challenge to generate
coherent and engaging content with arbitrary size and controllable style,
concept, and layout, all of which are essential for visual storytelling. To
overcome the shortcomings of previous methods including repetitive content,
style inconsistency, and lack of controllability, we propose MagicScroll, a
multi-layered, progressive diffusion-based image generation framework with a
novel semantic-aware denoising process. The model enables fine-grained control
over the generated image on object, scene, and background levels with text,
image, and layout conditions. We also establish the first benchmark for
nontypical aspect-ratio image generation for visual storytelling including
mediums like paintings, comics, and cinematic panoramas, with customized
metrics for systematic evaluation. Through comparative and ablation studies,
MagicScroll showcases promising results in aligning with the narrative text,
improving visual coherence, and engaging the audience. We plan to release the
code and benchmark in the hope of a better collaboration between AI researchers
and creative practitioners involving visual storytelling. |
This paper presents MagicScroll, a multi-layered, progressive diffusion-based image generation framework for creating coherent and engaging nontypical aspect-ratio images for visual storytelling, featuring semantic-aware denoising and multi-level control over style, content, and layout. |
Existing methods struggle to generate coherent and engaging visual content for storytelling, especially with arbitrary size and controllable style, concept, and layout, which are crucial for conveying narrative and emotion in mediums like scroll paintings, comic strips, and panoramas. |
The framework leverages GPT-based layout prediction, semantic-aware denoising, and text/image-based style control modules. It utilizes predicted object/scene masks, reference images, and style concepts to guide the generation process, ensuring coherence and alignment with the narrative text. |
Outperforms existing methods in generating nontypical aspect-ratio images with higher content richness and fidelity to input text.
Demonstrates superior performance in visual coherence and user engagement based on both quantitative metrics and subjective user ratings.
Provides fine-grained control over the generated images at object, scene, and background levels, enabling diverse visual storytelling scenarios including painting, comic, and panorama styles. |
Exploration of tokenizers and encoders specifically designed for ultra-long texts to improve story processing.
Integration of additional conditional controls at various stages of the generation process and incorporation of pre-trained modules for enhanced controllability over visual effects. |
visual storytelling, image generation, diffusion models, layout control, semantic-aware denoising |
2312.10835
Report |
Your Student is Better Than Expected: Adaptive Teacher-Student Collaboration for Text-Conditional Diffusion Models |
Nikita Starodubcev, Artem Fedorov, Artem Babenko, Dmitry Baranchuk |
Knowledge distillation methods have recently been shown to be a promising
direction to speed up the synthesis of large-scale diffusion models by requiring
only a few inference steps. While several powerful distillation methods were
recently proposed, the overall quality of student samples is typically lower
compared to the teacher ones, which hinders their practical usage. In this
work, we investigate the relative quality of samples produced by the teacher
text-to-image diffusion model and its distilled student version. As our main
empirical finding, we discover that a noticeable portion of student samples
exhibit superior fidelity compared to the teacher ones, despite the
"approximate" nature of the student. Based on this finding, we propose an
adaptive collaboration between student and teacher diffusion models for
effective text-to-image synthesis. Specifically, the distilled model produces
the initial sample, and then an oracle decides whether it needs further
improvements with a slow teacher model. Extensive experiments demonstrate that
the designed pipeline surpasses state-of-the-art text-to-image alternatives for
various inference budgets in terms of human preference. Furthermore, the
proposed approach can be naturally used in popular applications such as
text-guided image editing and controllable generation. |
This paper finds that distilled text-to-image models can outperform teacher models on a significant number of samples, and proposes an adaptive collaboration method between student and teacher diffusion models for cost-effective and high-quality text-to-image synthesis. |
Large-scale diffusion models excel in text-conditional image generation but suffer from high inference costs. Distillation methods offer faster inference but often at a quality loss. This paper explores a new direction of student-teacher collaboration to leverage the advantages of both. |
The paper analyzes the performance of distilled models, revealing their strengths. It then proposes a three-step adaptive approach: 1) student generates an initial image, 2) an oracle (ImageReward estimator with a cut-off threshold) determines if improvement is needed, 3) if so, the teacher either refines the student sample or regenerates a new one. |
Distilled text-to-image models can generate superior samples compared to teacher models for a noticeable portion of prompts, especially on challenging cases.
The proposed adaptive collaborative approach surpasses baselines (including teacher models and other distillation methods) in terms of both human preference and automatic metrics (FID, CLIP score, ImageReward).
The method effectively improves the quality and efficiency of text-guided image editing and controllable generation tasks. |
The performance of the approach relies on the accuracy of the sample quality estimator and the effectiveness of the no-reference decision-making procedure.
The paper primarily focuses on consistency distillation and evaluates limited distilled models. Exploring other distillation techniques and student models is left for future work. |
text-to-image generation, diffusion models, knowledge distillation, adaptive collaboration, image quality assessment |
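The three-step adaptive procedure described in this report can be sketched as a small routing function. All callables here (`student`, `teacher_refine`, `teacher_regenerate`, `score`) are hypothetical placeholders supplied by the caller; the report names an ImageReward-based estimator with a cut-off threshold as the oracle, but no specific API is assumed.

```python
from typing import Callable, Tuple, TypeVar

Image = TypeVar("Image")


def adaptive_generate(
    prompt: str,
    student: Callable[[str], Image],
    teacher_refine: Callable[[str, Image], Image],
    teacher_regenerate: Callable[[str], Image],
    score: Callable[[str, Image], float],
    threshold: float,
    refine: bool = True,
) -> Tuple[Image, str]:
    """Return the final sample and a label saying which path produced it."""
    sample = student(prompt)                  # step 1: fast distilled model
    if score(prompt, sample) >= threshold:    # step 2: oracle accepts the sample
        return sample, "student"
    if refine:                                # step 3a: teacher improves the student sample
        return teacher_refine(prompt, sample), "teacher-refined"
    return teacher_regenerate(prompt), "teacher-regenerated"  # step 3b: start over
```

The slow teacher is invoked only when the oracle rejects the student sample, so the average cost depends on how often the threshold is met.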
2312.10763
Report |
M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts |
Mingsheng Li, Xin Chen, Chi Zhang, Sijin Chen, Hongyuan Zhu, Fukun Yin, Gang Yu, Tao Chen |
Recently, 3D understanding has become popular to facilitate autonomous agents
to perform further decision-making. However, existing 3D datasets and methods
are often limited to specific tasks. On the other hand, recent progress in
Large Language Models (LLMs) and Multimodal Language Models (MLMs) has
demonstrated exceptional general language and imagery tasking performance.
Therefore, it is interesting to unlock MLMs' potential to be 3D generalists for
wider tasks. However, current MLM research has been less focused on 3D tasks
due to a lack of large-scale 3D instruction-following datasets. In this work,
we introduce a comprehensive 3D instruction-following dataset called M3DBench,
which possesses the following characteristics: 1) It supports general
multimodal instructions interleaved with text, images, 3D objects, and other
visual prompts. 2) It unifies diverse 3D tasks at both region and scene levels,
covering a variety of fundamental abilities in real-world 3D environments. 3)
It is a large-scale 3D instruction-following dataset with over 320k
instruction-response pairs. Furthermore, we establish a new benchmark for
assessing the performance of large models in understanding multi-modal 3D
prompts. Extensive experiments demonstrate the effectiveness of our dataset and
baseline, supporting general 3D-centric tasks, which can inspire future
research. |
This paper introduces M3DBench, a large-scale multi-modal 3D instruction-following dataset for developing general-purpose assistants in 3D environments. |
Existing 3D datasets often focus on specific tasks, limiting the development of general-purpose 3D assistants. This dataset aims to bridge this gap and unlock the potential of Multi-modal Language Models (MLMs) in the 3D domain. |
M3DBench leverages existing 3D datasets and uses LLMs to generate multi-modal instructions interleaved with text, coordinates, images, and 3D objects. The dataset covers diverse 3D tasks, including object detection, visual grounding, dense captioning, question answering, dialogue, planning, and navigation. |
M3DBench contains over 320k instruction-response pairs, including over 138k multi-modal instructions.
A simple baseline model trained on M3DBench demonstrates the effectiveness of the dataset in enabling MLMs to understand 3D scenes and follow instructions.
The authors establish a benchmark for evaluating the performance of MLMs on various 3D tasks, including scene understanding, reasoning, and planning. |
The performance of baseline models on certain tasks, such as detailed description and object localization, is suboptimal, indicating room for improvement in 3D MLM development.
Future work can explore more sophisticated model architectures and training strategies to further enhance the capabilities of 3D MLMs. |
3d vision, multi-modal learning, instruction following, large language models, dataset |
2312.10665
Report |
Silkie: Preference Distillation for Large Visual Language Models |
Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, Lingpeng Kong |
This paper explores preference distillation for large vision language models
(LVLMs), improving their ability to generate helpful and faithful responses
anchoring the visual context. We first build a vision-language feedback
(VLFeedback) dataset utilizing AI annotation. Specifically, responses are
generated by models sampled from 12 LVLMs, conditioned on multi-modal
instructions sourced from various datasets. We adopt GPT-4V to assess the
generated outputs regarding helpfulness, visual faithfulness, and ethical
considerations. Furthermore, the preference supervision is distilled into
Qwen-VL-Chat through the direct preference optimization (DPO) method. The
resulting model, Silkie, achieves 6.9% and 9.5% relative improvement on the MME
benchmark regarding the perception and cognition capabilities, respectively.
Silkie also demonstrates reduced hallucination by setting a new
state-of-the-art score of 3.02 on the MMHal-Bench benchmark. Further analysis
shows that DPO with our VLFeedback dataset mainly boosts the fine-grained
perception and complex cognition abilities of LVLMs, leading to more
comprehensive improvements compared to human-annotated preference datasets. |
This paper introduces Silkie, a large vision language model (LVLM) enhanced with preference distillation to generate more helpful and faithful responses grounded in visual context. |
Open-sourced LVLMs often exhibit misalignment issues, generating misleading content or biased responses. This work aims to improve LVLMs' reliability by aligning them with human preferences. |
The authors construct VLFeedback, a large-scale multi-modal preference dataset annotated by GPT-4V. This dataset covers 80k multi-modal instructions and responses from 12 LVLMs, evaluated on helpfulness, visual faithfulness, and ethical considerations. Silkie is then trained using direct preference optimization (DPO) on this dataset, distilling the preferences into the model. |
Silkie achieves significant improvements on the MME benchmark, demonstrating 6.9% and 9.5% relative gains in perception and cognition tasks, respectively.
The model shows reduced hallucination, achieving a new state-of-the-art score of 3.02 on the MMHal-Bench benchmark.
Analysis reveals that VLFeedback particularly benefits fine-grained perception tasks (e.g., OCR) and complex cognition tasks (e.g., code reasoning). |
The VLFeedback dataset currently lacks sufficient supervision for safety alignment, potentially requiring the incorporation of red-teaming techniques in future work.
The work focuses on a limited range of LVLMs and instruction datasets. Future iterations could incorporate newer models and diverse datasets for broader evaluation. |
vision language models, preference distillation, ai feedback, hallucination reduction, multi-modal alignment |
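The direct preference optimization (DPO) objective used in this report can be written for a single preference pair as below. This is the standard DPO loss, shown on scalar log-probabilities for clarity; the default `beta=0.1` is an illustrative assumption and not a value taken from the report.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # How much more the policy favors the chosen response than the reference does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    z = beta * margin
    # -log(sigmoid(z)) computed as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy matches the reference the loss equals log 2, and it decreases as the policy shifts probability toward the response the annotator preferred.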
2312.10656
Report |
VidToMe: Video Token Merging for Zero-Shot Video Editing |
Xirui Li, Chao Ma, Xiaokang Yang, Ming-Hsuan Yang |
Diffusion models have made significant advances in generating high-quality
images, but their application to video generation has remained challenging due
to the complexity of temporal motion. Zero-shot video editing offers a solution
by utilizing pre-trained image diffusion models to translate source videos into
new ones. Nevertheless, existing methods struggle to maintain strict temporal
consistency and efficient memory consumption. In this work, we propose a novel
approach to enhance temporal consistency in generated videos by merging
self-attention tokens across frames. By aligning and compressing temporally
redundant tokens across frames, our method improves temporal coherence and
reduces memory consumption in self-attention computations. The merging strategy
matches and aligns tokens according to the temporal correspondence between
frames, facilitating natural temporal consistency in generated video frames. To
manage the complexity of video processing, we divide videos into chunks and
develop intra-chunk local token merging and inter-chunk global token merging,
ensuring both short-term video continuity and long-term content consistency.
Our video editing approach seamlessly extends the advancements in image editing
to video editing, rendering favorable results in temporal consistency over
state-of-the-art methods. |
This paper presents VidToMe, a novel approach for zero-shot video editing that enhances temporal consistency by merging self-attention tokens across video frames. |
Existing video editing methods struggle to maintain strict temporal consistency and efficient memory consumption due to the complexity of temporal motion in videos. |
VidToMe merges similar tokens across video frames in the self-attention module of a pre-trained text-to-image diffusion model. It employs local token merging within short chunks and global token merging across chunks to ensure both short-term and long-term video consistency. This method reduces redundant computations and enforces consistent feature extraction across frames. |
VidToMe significantly improves temporal consistency in generated videos compared to state-of-the-art methods, as demonstrated by qualitative and quantitative evaluations.
The method reduces memory consumption in self-attention computations, making it more efficient.
VidToMe seamlessly integrates with existing image editing techniques, allowing for versatile video editing applications. |
The editing quality heavily relies on the performance of the chosen image editing method.
The similarity-based token matching, while generally effective, has room for improvement to prevent the incorrect merging of visually similar objects. |
video editing, diffusion models, temporal consistency, self-attention, token merging |
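The similarity-based token matching and merging described in this report can be illustrated with a simplified sketch between two frames. This is an assumption-laden toy version in plain Python: tokens are short lists of floats, matching uses cosine similarity, merging is a plain average, and `dst` is assumed non-empty. The actual method operates inside the self-attention layers of a diffusion model and also unmerges tokens afterwards, which is not shown here.

```python
import math
from typing import List, Tuple

Token = List[float]


def cosine(a: Token, b: Token) -> float:
    """Cosine similarity between two token vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_tokens(src: List[Token], dst: List[Token], r: int) -> Tuple[List[Token], List[Token]]:
    """Merge the r source tokens that best match a destination token into it.

    Returns (merged destination tokens, source tokens left unmerged).
    """
    matches = []
    for i, s in enumerate(src):
        j, sim = max(((j, cosine(s, d)) for j, d in enumerate(dst)), key=lambda t: t[1])
        matches.append((sim, i, j))
    matches.sort(reverse=True)  # most similar pairs first

    groups = {j: [d] for j, d in enumerate(dst)}
    merged_src = set()
    for _, i, j in matches[:r]:
        groups[j].append(src[i])
        merged_src.add(i)

    merged_dst = [[sum(c) / len(groups[j]) for c in zip(*groups[j])] for j in range(len(dst))]
    kept_src = [s for i, s in enumerate(src) if i not in merged_src]
    return merged_dst, kept_src
```

Merging `r` tokens shrinks the joint token set by `r`, which is where the memory saving in self-attention comes from; sharing one merged token across frames is what encourages consistent features.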
2312.10457
Report |
Semantic-Aware Autoregressive Image Modeling for Visual Representation Learning |
Kaiyou Song, Shan Zhang, Tong Wang |
The development of autoregressive modeling (AM) in computer vision lags
behind natural language processing (NLP) in self-supervised pre-training. This
is mainly caused by the challenge that images are not sequential signals and
lack a natural order when applying autoregressive modeling. In this study,
inspired by human beings' way of grasping an image, i.e., focusing on the main
object first, we present a semantic-aware autoregressive image modeling
(SemAIM) method to tackle this challenge. The key insight of SemAIM is to
autoregressively model images from the semantic patches to the less semantic
patches. To this end, we first calculate a semantic-aware permutation of
patches according to their feature similarities and then perform the
autoregression procedure based on the permutation. In addition, considering
that the raw pixels of patches are low-level signals and are not ideal
prediction targets for learning high-level semantic representation, we also
explore utilizing the patch features as the prediction targets. Extensive
experiments are conducted on a broad range of downstream tasks, including image
classification, object detection, and instance/semantic segmentation, to
evaluate the performance of SemAIM. The results demonstrate SemAIM achieves
state-of-the-art performance compared with other self-supervised methods.
Specifically, with ViT-B, SemAIM achieves 84.1% top-1 accuracy for fine-tuning
on ImageNet, 51.3% AP and 45.4% AP for object detection and instance
segmentation on COCO, which outperforms the vanilla MAE by 0.5%, 1.0%, and
0.5%, respectively. |
This paper introduces SemAIM, a semantic-aware autoregressive image modeling method that predicts image patches in a semantically meaningful order (from most to least semantic) derived from patch feature similarities, aiming to mimic human visual understanding. |
Autoregressive modeling in vision lags behind NLP due to the lack of a natural order for images. This paper addresses this by incorporating semantic understanding into the prediction order, making it more consistent with human perception and improving representation learning. |
SemAIM calculates a semantic-aware permutation of image patches based on their feature similarities. A parallel encoder-decoder architecture then performs autoregressive modeling, predicting patches in the determined order. The method also explores using pre-trained features as prediction targets instead of raw pixels for learning richer semantic representations. |
SemAIM significantly outperforms autoregressive methods using raster or stochastic orders, highlighting the importance of semantic-aware prediction.
Using pre-trained features (DINO, CLIP) as prediction targets leads to better performance than predicting raw RGB values, indicating the benefit of learning from high-level representations.
SemAIM achieves state-of-the-art results on ImageNet classification, COCO object detection/segmentation, and ADE20k semantic segmentation, showcasing its strong representation learning capabilities. |
The current implementation considers only one 'center' patch for permutation generation, which may not be optimal for images with multiple salient objects.
Future work can explore calculating multiple center patches or developing more sophisticated strategies for handling multi-object scenes. |
autoregressive image modeling, self-supervised learning, vision transformer, semantic representation learning, computer vision |
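The semantic-aware permutation described in this report, which orders patches by feature similarity to a single center patch, can be sketched as follows. How the center patch is chosen is not detailed in the report, so the `center` index is taken as a given input here; cosine similarity and the function names are likewise illustrative assumptions.

```python
import math
from typing import List

Feature = List[float]


def cosine(a: Feature, b: Feature) -> float:
    """Cosine similarity between two patch features (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def semantic_permutation(features: List[Feature], center: int) -> List[int]:
    """Return patch indices ordered from most to least similar to the center patch."""
    sims = [cosine(f, features[center]) for f in features]
    # sorted() is stable, so patches with equal similarity keep raster order.
    return sorted(range(len(features)), key=lambda i: -sims[i])
```

The autoregressive model would then predict patches in this order instead of raster order, starting from the patches most related to the main object.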
2312.10240
Report |
Rich Human Feedback for Text-to-Image Generation |
Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, Jiao Sun, Jordi Pont-Tuset, Sarah Young, Feng Yang, Junjie Ke, Krishnamurthy Dj Dvijotham, Katie Collins, Yiwen Luo, Yang Li, Kai J Kohlhoff, Deepak Ramachandran, Vidhya Navalpakkam |
Recent Text-to-Image (T2I) generation models such as Stable Diffusion and
Imagen have made significant progress in generating high-resolution images
based on text descriptions. However, many generated images still suffer from
issues such as artifacts/implausibility, misalignment with text descriptions,
and low aesthetic quality. Inspired by the success of Reinforcement Learning
with Human Feedback (RLHF) for large language models, prior works collected
human-provided scores as feedback on generated images and trained a reward
model to improve the T2I generation. In this paper, we enrich the feedback
signal by (i) marking image regions that are implausible or misaligned with the
text, and (ii) annotating which words in the text prompt are misrepresented or
missing on the image. We collect such rich human feedback on 18K generated
images (RichHF-18K) and train a multimodal transformer to predict the rich
feedback automatically. We show that the predicted rich human feedback can be
leveraged to improve image generation, for example, by selecting high-quality
training data to finetune and improve the generative models, or by creating
masks with predicted heatmaps to inpaint the problematic regions. Notably, the
improvements generalize to models (Muse) beyond those used to generate the
images on which human feedback data were collected (Stable Diffusion variants).
The RichHF-18K data set will be released in our GitHub repository:
https://github.com/google-research/google-research/tree/master/richhf_18k. |
This paper introduces RichHF-18K, the first rich human feedback dataset for image generation, containing fine-grained scores, implausibility/misalignment regions, and misaligned keywords on 18K generated images. |
Current T2I evaluation metrics lack interpretability and actionable insights. This work aims to provide a more detailed and explainable understanding of image quality beyond single-score metrics. |
The authors collect rich human feedback (scores, marked regions, misaligned keywords) on 18K images. Then, they train a multimodal transformer model, RAHF, to automatically predict this rich feedback. |
RAHF effectively predicts human annotations for scores, implausibility/misalignment regions, and keywords.
Using RAHF scores for finetuning or as guidance improves image generation quality in terms of plausibility and aesthetics.
RAHF generalizes well to different generative models (e.g., it improves Muse even though the feedback data were collected on Stable Diffusion variants). |
Misalignment heatmap prediction is less accurate than implausibility heatmaps, possibly due to annotation noise.
Future work includes collecting more diverse data beyond Stable Diffusion and exploring more ways to leverage rich feedback for T2I model improvement. |
text-to-image generation, human feedback, image quality assessment, multimodal learning, explainable ai |
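Two uses of predicted feedback mentioned in this report, building inpainting masks from heatmaps and selecting high-quality training data by score, can be sketched in a few lines. The threshold values and function names are hypothetical; the report does not state how heatmaps are binarized or which score cut-off is used.

```python
from typing import List, Tuple


def heatmap_to_mask(heatmap: List[List[float]], threshold: float = 0.5) -> List[List[int]]:
    """Binarize a predicted implausibility heatmap: 1 marks pixels to inpaint."""
    return [[1 if v >= threshold else 0 for v in row] for row in heatmap]


def select_for_finetuning(samples: List[Tuple[str, float]], min_score: float) -> List[str]:
    """Keep the IDs of generated images whose predicted score is high enough."""
    return [sample_id for sample_id, score in samples if score >= min_score]
```

The resulting mask would be handed to an inpainting model together with the original image and prompt, while the selected IDs define the subset used to fine-tune the generator.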
2312.10144
Report |
Data-Efficient Multimodal Fusion on a Single GPU |
Noël Vouitsis, Zhaoyan Liu, Satya Krishna Gorti, Valentin Villecroze, Jesse C. Cresswell, Guangwei Yu, Gabriel Loaiza-Ganem, Maksims Volkovs |
The goal of multimodal alignment is to learn a single latent space that is
shared between multimodal inputs. The most powerful models in this space have
been trained using massive datasets of paired inputs and large-scale
computational resources, making them prohibitively expensive to train in many
practical scenarios. We surmise that existing unimodal encoders pre-trained on
large amounts of unimodal data should provide an effective bootstrap to create
multimodal models from unimodal ones at much lower costs. We therefore propose
FuseMix, a multimodal augmentation scheme that operates on the latent spaces of
arbitrary pre-trained unimodal encoders. Using FuseMix for multimodal
alignment, we achieve competitive performance -- and in certain cases
outperform state-of-the-art methods -- in both image-text and audio-text
retrieval, with orders of magnitude less compute and data: for example, we
outperform CLIP on the Flickr30K text-to-image retrieval task with $\sim \!
600\times$ fewer GPU days and $\sim \! 80\times$ fewer image-text pairs.
Additionally, we show how our method can be applied to convert pre-trained
text-to-image generative models into audio-to-image ones. Code is available at:
https://github.com/layer6ai-labs/fusemix. |
This paper introduces FuseMix, a computationally and data-efficient multimodal augmentation scheme for aligning the latent spaces of pre-trained unimodal encoders. |
Existing multimodal alignment models are computationally expensive and require massive paired datasets, limiting their practical application. |
FuseMix leverages pre-trained unimodal encoders and performs mixup on their latent spaces with a shared mixing coefficient, followed by training lightweight adapters to align the augmented latents. |
FuseMix achieves competitive multimodal alignment, outperforming some state-of-the-art methods in image-text and audio-text retrieval tasks.
FuseMix requires significantly less compute and data compared to methods like CLIP.
Dataset quality and diversity are crucial for good performance, especially in low-data regimes. |
Limited by the semantic information learned by the pre-trained unimodal encoders.
Future work could explore fine-tuning unimodal encoders during fusion. |
multimodal fusion, multimodal alignment, data augmentation, contrastive learning, mixup |
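The latent-space mixup with a shared mixing coefficient described in this report can be sketched as below. Latents are plain lists of floats, and drawing the coefficient from a Beta(alpha, alpha) distribution follows the usual mixup convention; that choice, the default `alpha`, and the function names are assumptions rather than details from the report.

```python
import random
from typing import List, Tuple

Latent = List[float]


def sample_lambda(alpha: float = 1.0) -> float:
    """Draw a mixing coefficient in [0, 1] from Beta(alpha, alpha)."""
    return random.betavariate(alpha, alpha)


def fusemix(
    za_i: Latent, zb_i: Latent,   # latents of pair i from modality A and modality B
    za_j: Latent, zb_j: Latent,   # latents of pair j from modality A and modality B
    lam: float,
) -> Tuple[Latent, Latent]:
    """Mix two paired samples with the same coefficient in both modalities."""
    def mix(u: Latent, v: Latent) -> Latent:
        return [lam * x + (1.0 - lam) * y for x, y in zip(u, v)]

    return mix(za_i, za_j), mix(zb_i, zb_j)
```

Using the same `lam` in both modalities keeps the mixed latents a valid pair, so they can be fed to the lightweight adapters and aligned like any other paired example.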
2312.10136
Report |
Gradient-based Parameter Selection for Efficient Fine-Tuning |
Zhi Zhang, Qizhe Zhang, Zijun Gao, Renrui Zhang, Ekaterina Shutova, Shiji Zhou, Shanghang Zhang |
With the growing size of pre-trained models, full fine-tuning and storing all
the parameters for various downstream tasks is costly and infeasible. In this
paper, we propose a new parameter-efficient fine-tuning method, Gradient-based
Parameter Selection (GPS), demonstrating that only tuning a few selected
parameters from the pre-trained model while keeping the remainder of the model
frozen can generate similar or better performance compared with the full model
fine-tuning method. Different from the existing popular and state-of-the-art
parameter-efficient fine-tuning approaches, our method does not introduce any
additional parameters and computational costs during both the training and
inference stages. Another advantage is the model-agnostic and non-destructive
property, which eliminates the need for any other design specific to a
particular model. Compared with full fine-tuning, GPS achieves 3.33%
(91.78% vs. 88.45%, FGVC) and 9.61% (73.1% vs. 65.57%, VTAB) improvement in
accuracy while tuning only 0.36% of the pre-trained model's parameters on average
over 24 image classification tasks; it also demonstrates a significant
improvement of 17% and 16.8% in mDice and mIoU, respectively, on medical image
segmentation task. Moreover, GPS achieves state-of-the-art performance compared
with existing PEFT methods. |
This paper introduces GPS, a novel parameter-efficient fine-tuning method that selects and tunes a small subset of parameters from a pre-trained model based on gradient values, achieving comparable or superior performance to full fine-tuning. |
Full fine-tuning large pre-trained models for various downstream tasks is computationally expensive and infeasible. PEFT methods aim to address this by tuning only a minimal set of parameters while maintaining or improving performance. |
GPS calculates the gradient of a loss function (SCL) with respect to the model parameters and selects the top-K connections with the highest gradient value for each neuron. During fine-tuning, only the selected parameters are updated using a binary mask. |
GPS outperforms previous PEFT methods and full fine-tuning on FGVC and VTAB benchmarks, using only 0.36% of parameters on average.
The method is model-agnostic, achieving consistent improvements across ViT, Swin Transformer, and ConvNeXt architectures.
GPS demonstrates strong data efficiency, achieving good performance even with limited training data (few-shot learning). |
The method doesn't fully exploit potential parameter sharing across similar downstream tasks.
Reliance on a pre-trained model raises concerns about potential biases if the upstream model was trained on biased or harmful data. |
parameter-efficient fine-tuning, gradient-based parameter selection, sub-network training, vision transformer, few-shot learning |
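The selection-and-masking step summarized in this entry can be sketched as below. It is a simplified illustration, not the authors' code: the function names are hypothetical, weights are plain nested lists with one row per neuron, ranking by absolute gradient value is an assumption, and a plain SGD update stands in for whatever optimizer is actually used.

```python
def select_top_k_mask(grad, k=1):
    """Build a 0/1 mask keeping, for each neuron (row), the k connections
    with the largest absolute gradient (absolute value is an assumption)."""
    mask = []
    for row in grad:
        ranked = sorted(range(len(row)), key=lambda j: abs(row[j]), reverse=True)
        chosen = set(ranked[:k])
        mask.append([1 if j in chosen else 0 for j in range(len(row))])
    return mask


def masked_sgd_step(weights, grad, mask, lr=0.1):
    """Update only the selected connections; masked-out weights stay frozen."""
    return [
        [w - lr * g * m for w, g, m in zip(w_row, g_row, m_row)]
        for w_row, g_row, m_row in zip(weights, grad, mask)
    ]


# Toy layer with 2 neurons and 3 incoming connections each.
weights = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
grad = [[0.1, -0.9, 0.3], [0.5, 0.2, -0.4]]

mask = select_top_k_mask(grad, k=1)
new_weights = masked_sgd_step(weights, grad, mask)
```

Because the mask only gates the update, no extra parameters are added to the model, which matches the claim in the abstract that the method introduces no additional parameters.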
2312.10120
Report |
MVHuman: Tailoring 2D Diffusion with Multi-view Sampling For Realistic 3D Human Generation |
Suyi Jiang, Haimin Luo, Haoran Jiang, Ziyu Wang, Jingyi Yu, Lan Xu |
Recent months have witnessed rapid progress in 3D generation based on
diffusion models. Most advances require fine-tuning existing 2D Stable
Diffusion models into multi-view settings or tedious distilling operations and hence
fall short of 3D human generation due to the lack of diverse 3D human datasets.
We present an alternative scheme named MVHuman to generate human radiance
fields from text guidance, with consistent multi-view images directly sampled
from pre-trained Stable Diffusion models without any fine-tuning or distilling. Our
core is a multi-view sampling strategy to tailor the denoising processes of the
pre-trained network for generating consistent multi-view images. It encompasses
view-consistent conditioning, replacing the original noises with
"consistency-guided noises", optimizing latent codes, as well as utilizing
cross-view attention layers. With the multi-view images through the sampling
process, we adopt geometry refinement and 3D radiance field generation followed
by a subsequent neural blending scheme for free-view rendering. Extensive
experiments demonstrate the efficacy of our method, as well as its superiority
to state-of-the-art 3D human generation methods. |
Presents MVHuman, a novel scheme for generating human radiance fields from text guidance using pre-trained 2D diffusion models without fine-tuning or distillation. |
Addresses limitations of existing 3D human generation methods, such as the reliance on scarce 3D datasets, inefficient optimization, and the presence of artifacts. |
Employs a multi-view sampling strategy with a pre-trained Stable Diffusion model, including view-consistent conditioning, consistency-guided noise for denoising, optimization of latent codes, and cross-view attention. |
Generates high-quality human assets with consistent multi-view images directly from text prompts.
Outperforms state-of-the-art 3D human generation methods in qualitative and user study evaluations.
Enables seamless integration of text-based editing and style transfer with LoRA models from 2D to 3D. |
Relies on the accuracy of initial mesh and SMPL-X alignment.
Limited ability to articulate specific details solely from textual descriptions. |
3d human generation, text-to-3d, diffusion models, multi-view consistency, neural radiance fields |
2312.10113
Report |
Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation |
Qin Guo, Tianwei Lin |
Recently, diffusion-based methods, like InstructPix2Pix (IP2P), have achieved
effective instruction-based image editing, requiring only natural language
instructions from the user. However, these methods often inadvertently alter
unintended areas and struggle with multi-instruction editing, resulting in
compromised outcomes. To address these issues, we introduce the Focus on Your
Instruction (FoI), a method designed to ensure precise and harmonious editing
across multiple instructions without extra training or test-time optimization.
In the FoI, we primarily emphasize two aspects: (1) precisely extracting
regions of interest for each instruction and (2) guiding the denoising process
to concentrate within these regions of interest. For the first objective, we
identify the implicit grounding capability of IP2P from the cross-attention
between instruction and image, then develop an effective mask extraction
method. For the second objective, we introduce a cross attention modulation
module for rough isolation of target editing regions and unrelated regions.
Additionally, we introduce a mask-guided disentangle sampling strategy to
further ensure clear region isolation. Experimental results demonstrate that
FoI surpasses existing methods in both quantitative and qualitative
evaluations, especially excelling in the multi-instruction editing task. |
Introduces FoI, a method leveraging the implicit grounding ability of InstructPix2Pix for precise and harmonious multi-instruction image editing without extra training or test-time optimization. |
Addresses limitations of existing text-guided image editing methods in accurately targeting editing areas, especially for multi-instruction edits, to achieve desired results without unintended modifications. |
Utilizes IP2P's grounding ability to extract masks for areas of interest, introduces cross-condition attention modulation to focus instructions within their masks, and proposes a mask-guided disentangle sampling strategy to isolate editing regions. |
Outperforms state-of-the-art methods in qualitative and quantitative evaluations, particularly in multi-instruction editing tasks.
Achieves superior results in CLIP image similarity, DINOv2 image similarity, and PickScore, demonstrating fidelity to both original and edited images.
Demonstrates robustness in balancing image preservation and instruction execution without requiring precise tuning of guidance scales. |
Limited ultra-fine editing ability due to the resolution of cross-attention maps.
Effectiveness is dependent on the capabilities of the pretrained IP2P model. |
image editing, diffusion models, text-guided image manipulation, multi-instruction editing, attention mechanisms |
2312.10111
Report |
Plasticine3D: Non-rigid 3D editting with text guidance |
Yige Chen, Ang Chen, Siyuan Chen, Ran Yi |
With the help of Score Distillation Sampling (SDS) and the rapid development
of various trainable 3D representations, Text-to-Image (T2I) diffusion models
have been applied to 3D generation tasks and achieved considerable results.
There are also some attempts toward the task of editing 3D objects leveraging
this Text-to-3D pipeline. However, most methods currently focus on adding
additional geometries, overwriting textures, or both, and few of them can
perform non-rigid transformation of 3D objects. Those that can perform
non-rigid editing, on the other hand, suffer from low resolution, lack of
fidelity, and poor flexibility. To address these issues, we present
Plasticine3D, a general, high-fidelity, photo-realistic and controllable
non-rigid editing pipeline. First, our work divides the editing process into
a geometry editing stage and a texture editing stage to achieve more detailed
and photo-realistic results. Second, to perform non-rigid
transformation with controllable results while maintaining fidelity to the
original 3D models, we propose a multi-view-embedding (MVE)
optimization strategy to ensure that the diffusion model learns the overall
features of the original object, and an embedding-fusion (EF) step to control the
degree of editing by adjusting the value of the fusing rate. We also design a
geometry processing step before optimizing on the base geometry to cope with
the different needs of various editing tasks. Furthermore, to fully leverage the
geometric prior from the original 3D object, we provide an optional replacement
for score distillation sampling named score projection sampling (SPS), which
enables us to directly perform optimization from the original 3D mesh in most
common median-scale non-rigid editing scenarios. We demonstrate the effectiveness of
our method on both the non-rigid 3D editing task and general 3D editing task. |
Plasticine3D, a novel semantic-driven, photo-realistic, controllable non-rigid 3D editing pipeline that divides the editing process into geometry and appearance stages for detailed results. |
Addresses limitations in existing 3D editing methods that struggle with non-rigid transformations, especially in preserving original details and offering control over the degree of editing. |
Utilizes a two-stage geometry-appearance pipeline with multi-view embedding optimization (MVE), embedding fusion (EF), geometry processing, and score projection sampling (SPS) to achieve controllable and high-fidelity non-rigid transformations. |
Embedding fusion enables control over the degree of editing by interpolating between optimized and target embeddings.
Score projection sampling (SPS) enhances median-scale non-rigid editing by leveraging the original geometry as a starting point and guiding the transformation towards the target prompt.
Outperforms baseline methods in qualitative and quantitative comparisons, demonstrating superior performance in non-rigid editing tasks while preserving details. |
Janus problem (e.g., two-headed horse) occurs in some global and median-scale transformations.
Fine-tuning the diffusion model is computationally expensive and time-consuming. |
3d editing, non-rigid transformation, diffusion models, score distillation sampling, semantic-driven editing |
2312.10103
Report |
GSVA: Generalized Segmentation via Multimodal Large Language Models |
Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, Gao Huang |
Generalized Referring Expression Segmentation (GRES) extends the scope of
classic RES to refer to multiple objects in one expression or identify the
empty targets absent in the image. GRES poses challenges in modeling the
complex spatial relationships of the instances in the image and identifying
non-existing referents. Multimodal Large Language Models (MLLMs) have recently
shown tremendous progress in these complicated vision-language tasks.
Connecting Large Language Models (LLMs) and vision models, MLLMs are proficient
in understanding contexts with visual inputs. Among them, LISA, as a
representative, adopts a special [SEG] token to prompt a segmentation mask
decoder, e.g., SAM, to enable MLLMs in the RES task. However, existing
solutions to GRES remain unsatisfactory since current segmentation MLLMs cannot
correctly handle the cases where users might reference multiple subjects in a
singular prompt or provide descriptions incongruent with any image target. In
this paper, we propose Generalized Segmentation Vision Assistant (GSVA) to
address this gap. Specifically, GSVA reuses the [SEG] token to prompt the
segmentation model towards supporting multiple mask references simultaneously
and innovatively learns to generate a [REJ] token to reject the null targets
explicitly. Experiments validate GSVA's efficacy in resolving the GRES issue,
marking a notable enhancement and setting a new record on the GRES benchmark
gRefCOCO dataset. GSVA also proves effective across various classic referring
segmentation and comprehension tasks. |
This paper introduces GSVA, a multimodal large language model that enhances referring expression segmentation by addressing multiple-target and empty-target scenarios in GRES. |
Existing referring expression segmentation models struggle to segment multiple objects from a single instruction or handle descriptions that don't match any object in the image. This limits their practicality in real-world applications like embodied AI. |
GSVA leverages the power of MLLMs and introduces two key designs: (1) learning to predict multiple [SEG] tokens to segment multiple targets, and (2) employing [REJ] tokens to reject descriptions of objects absent in the image. |
GSVA achieves state-of-the-art performance on the GRES benchmark gRefCOCO dataset.
GSVA demonstrates strong performance on classic referring segmentation tasks (RefCOCO, RefCOCO+, RefCOCOg) and comprehension tasks.
Ablation studies confirm the importance of multiple [SEG] tokens and the [REJ] token for GSVA's performance. |
The model might misperceive small or unclear objects, leading to incorrect [REJ] predictions.
Using higher-resolution vision encoders could further enhance the model's accuracy. |
referring expression segmentation, generalized referring expression segmentation, multimodal large language models, empty target rejection, multiple target segmentation |
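The role of the two special tokens summarized in this entry can be illustrated with a small dispatch routine. This is only a sketch under assumptions: the `(phrase, token, embedding)` tuple layout, the function name `resolve_targets`, and the toy `decode_mask` callable are hypothetical and do not reflect the model's real output format or the SAM decoder interface.

```python
def resolve_targets(predictions, decode_mask):
    """Map each referred phrase to a mask, or to None when it is rejected.

    predictions: list of (phrase, token, embedding) tuples, where token is
    "[SEG]" (prompt the mask decoder) or "[REJ]" (target absent from image).
    decode_mask: callable turning a [SEG] embedding into a mask.
    """
    results = {}
    for phrase, token, embedding in predictions:
        if token == "[SEG]":
            results[phrase] = decode_mask(embedding)
        elif token == "[REJ]":
            results[phrase] = None  # explicit empty target, no decoder call
        else:
            raise ValueError(f"unexpected token: {token}")
    return results


# Toy example: one target present in the image, one absent.
predictions = [
    ("the dog", "[SEG]", [0.2, 0.9]),
    ("the unicorn", "[REJ]", None),
]
masks = resolve_targets(predictions, decode_mask=lambda emb: [v > 0.5 for v in emb])
```

Handling several `[SEG]` entries in one pass corresponds to the multiple-target case, while a `[REJ]` entry corresponds to the empty-target case described in the abstract.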
2312.10034
Report |
SlimmeRF: Slimmable Radiance Fields |
Shiran Yuan, Hao Zhao |
Neural Radiance Field (NeRF) and its variants have recently emerged as
successful methods for novel view synthesis and 3D scene reconstruction.
However, most current NeRF models either achieve high accuracy using large
model sizes, or achieve high memory-efficiency by trading off accuracy. This
limits the applicable scope of any single model, since high-accuracy models
might not fit in low-memory devices, and memory-efficient models might not
satisfy high-quality requirements. To this end, we present SlimmeRF, a model
that allows for instant test-time trade-offs between model size and accuracy
through slimming, thus making the model simultaneously suitable for scenarios
with different computing budgets. We achieve this through a newly proposed
algorithm named Tensorial Rank Incrementation (TRaIn) which increases the rank
of the model's tensorial representation gradually during training. We also
observe that our model allows for more effective trade-offs in sparse-view
scenarios, at times even achieving higher accuracy after being slimmed. We
credit this to the fact that erroneous information such as floaters tends to be
stored in components corresponding to higher ranks. Our implementation is
available at https://github.com/Shiran-Yuan/SlimmeRF. |
This paper proposes SlimmeRF, a novel method to reduce the number of parameters in neural radiance fields (NeRFs) using a slimmable tensorial representation. |
Reducing the size of NeRF models is crucial for their application in resource-constrained environments. Existing compression methods often compromise model performance. This paper addresses the need for compact and accurate NeRFs. |
SlimmeRF utilizes a tensorial representation for the appearance grid in NeRF and introduces the Tensorial Rank Incrementation (TRaIn) algorithm. This algorithm gradually increases the rank of the tensorial representation during training, so that lower-rank components are trained first, which improves slimmability. |
SlimmeRF significantly reduces the number of parameters in NeRFs while matching or even exceeding the performance of baselines.
The method demonstrates strong performance on benchmark datasets, including Synthetic NeRF, Tanks & Temples, and LLFF.
The paper provides a theoretical analysis to explain the slimmability achieved by the TRaIn algorithm. |
The paper notes limitations in controlling the degree of slimming for specific applications.
Future work could explore extending the TRaIn algorithm to other components of NeRF, such as the density grid. |
neural radiance fields, nerf, model compression, tensorial representation, view synthesis |
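Test-time slimming of a low-rank representation, as summarized in this entry, can be illustrated with a toy example. The sketch below is an assumption-laden simplification: it uses a 2D grid written as a sum of rank-1 outer products rather than the 3D tensorial factors a radiance field would use, and the function names are hypothetical. It only shows that dropping the highest-rank components shrinks the parameter count without any retraining.

```python
def reconstruct(components, rank):
    """Sum the first `rank` rank-1 components (u, v) into a dense 2D grid."""
    rows, cols = len(components[0][0]), len(components[0][1])
    grid = [[0.0] * cols for _ in range(rows)]
    for u, v in components[:rank]:
        for i in range(rows):
            for j in range(cols):
                grid[i][j] += u[i] * v[j]
    return grid


def parameter_count(components, rank):
    """Number of stored values when only the first `rank` components are kept."""
    return sum(len(u) + len(v) for u, v in components[:rank])


# Two components ordered from low rank (trained first) to high rank.
components = [
    ([1.0, 2.0], [1.0, 0.0, 1.0]),
    ([0.5, 0.0], [0.0, 1.0, 0.0]),
]

full = reconstruct(components, rank=2)   # full model
slim = reconstruct(components, rank=1)   # slimmed model, last component dropped
```

Here the slimmed grid differs from the full one only in the entry contributed by the second component, while storing half as many values.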
2312.10032
Report |
Osprey: Pixel Understanding with Visual Instruction Tuning |
Yuqian Yuan, Wentong Li, Jian Liu, Dongqi Tang, Xinjie Luo, Chi Qin, Lei Zhang, Jianke Zhu |
Multimodal large language models (MLLMs) have recently achieved impressive
general-purpose vision-language capabilities through visual instruction tuning.
However, current MLLMs primarily focus on image-level or box-level
understanding, falling short in achieving fine-grained vision-language
alignment at pixel level. Besides, the lack of mask-based instruction data
limits their advancements. In this paper, we propose Osprey, a mask-text
instruction tuning approach, to extend MLLMs by incorporating fine-grained mask
regions into language instruction, aiming at achieving pixel-wise visual
understanding. To achieve this goal, we first meticulously curate a mask-based
region-text dataset with 724K samples, and then design a vision-language model
by injecting pixel-level representation into LLM. Specifically, Osprey adopts a
convolutional CLIP backbone as the vision encoder and employs a mask-aware
visual extractor to extract precise visual mask features from high resolution
input. Experimental results demonstrate Osprey's superiority in various region
understanding tasks, showcasing its new capability for pixel-level instruction
tuning. In particular, Osprey can be integrated with Segment Anything Model
(SAM) seamlessly to obtain multi-granularity semantics. The source code,
dataset and demo can be found at https://github.com/CircleRadon/Osprey. |
Presents Osprey, a novel approach that integrates pixel-level mask region references into language instructions, enhancing Multimodal Large Language Models (MLLMs) for fine-grained visual understanding. |
Existing MLLMs struggle with fine-grained visual understanding tasks due to reliance on image-level or box-level understanding, lacking pixel-level alignment between vision and language. |
Introduces a mask-aware visual extractor to capture precise mask features, employs a convolutional CLIP backbone for high-resolution input, and curates a large-scale mask-based region-text dataset (Osprey-724K) for instruction tuning. |
Osprey significantly outperforms previous methods on open-vocabulary segmentation, achieving 50.64% PQ, 29.17% AP, and 49.78% mIoU on Cityscapes.
Achieves state-of-the-art results on referring object classification, obtaining 65.24% SS and 38.19% S-IoU on LVIS, and 73.06% SS and 52.72% S-IoU on PACO.
Demonstrates superior performance in referring description and reasoning tasks on Ferret-Bench and exhibits strong performance on object hallucination benchmark POPE. |
Computational cost increases significantly with larger input image sizes.
Further research on effectively incorporating multi-modal information from various sources is needed. |
multimodal large language models, fine-grained visual understanding, mask-based instruction tuning, region-based image understanding, open-vocabulary segmentation |
2312.09767
Report |
DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models |
Yifeng Ma, Shiwei Zhang, Jiayu Wang, Xiang Wang, Yingya Zhang, Zhidong Deng |
Diffusion models have shown remarkable success in a variety of downstream
generative tasks, yet remain under-explored in the important and challenging
expressive talking head generation. In this work, we propose the DreamTalk
framework to fill this gap, employing meticulous design to unlock the
potential of diffusion models in generating expressive talking heads.
Specifically, DreamTalk consists of three crucial components: a denoising
network, a style-aware lip expert, and a style predictor. The diffusion-based
denoising network is able to consistently synthesize high-quality audio-driven
face motions across diverse expressions. To enhance the expressiveness and
accuracy of lip motions, we introduce a style-aware lip expert that can guide
lip-sync while being mindful of the speaking styles. To eliminate the need for
expression reference video or text, an extra diffusion-based style predictor is
utilized to predict the target expression directly from the audio. By this
means, DreamTalk can harness powerful diffusion models to generate expressive
faces effectively and reduce the reliance on expensive style references.
Experimental results demonstrate that DreamTalk is capable of generating
photo-realistic talking faces with diverse speaking styles and achieving
accurate lip motions, surpassing existing state-of-the-art counterparts. |
DreamTalk, a novel framework leveraging diffusion models for generating expressive talking heads with diverse speaking styles and minimal reliance on style references. |
Existing methods struggle to generate high-quality talking heads with diverse and accurate expressions, often relying on laborious style references like videos or text. |
DreamTalk comprises a denoising network, a style-aware lip expert, and a style predictor. The denoising network generates facial motions conditioned on audio and a style reference. The lip expert ensures lip-sync accuracy across styles. The style predictor infers speaking styles directly from audio and the portrait, eliminating the need for reference videos. |
Outperforms state-of-the-art methods in quantitative and qualitative evaluations on datasets like MEAD, HDTF, and VoxCeleb2, demonstrating superior lip-sync accuracy, visual quality, and style consistency.
Exhibits strong generalization capabilities, effectively handling out-of-domain portraits, multilingual speech, noisy audio, and songs.
Enables versatile speaking style manipulation through techniques like classifier-free guidance scaling and style code interpolation. |
Occasional artifacts, particularly around the mouth area during intense expressions, requiring further refinement of teeth generation and exploration of emotion-specific renderers.
Lacks temporal awareness of speaking style variations, potentially leading to unnatural expressions at speech boundaries. Future work could focus on dynamically predicting style evolution. |
talking head generation, diffusion models, expressive synthesis, lip sync, style prediction |
2312.09641
Report |
Ins-HOI: Instance Aware Human-Object Interactions Recovery |
Jiajun Zhang, Yuxiang Zhang, Hongwen Zhang, Xiao Zhou, Boyao Zhou, Ruizhi Shao, Zonghai Hu, Yebin Liu |
Accurately modeling detailed interactions between human/hand and object is an
appealing yet challenging task. Current multi-view capture systems are only
capable of reconstructing multiple subjects into a single, unified mesh, which
fails to model the states of each instance individually during interactions. To
address this, previous methods use template-based representations to track
human/hand and object. However, the quality of the reconstructions is limited
by the descriptive capabilities of the templates, so these methods
inherently struggle with geometry details, pressing deformations and invisible
contact surfaces. In this work, we propose an end-to-end Instance-aware
Human-Object Interactions recovery (Ins-HOI) framework by introducing an
instance-level occupancy field representation. However, the real-captured data
is presented as a holistic mesh, unable to provide instance-level supervision.
To address this, we further propose a complementary training strategy that
leverages synthetic data to introduce instance-level shape priors, enabling the
disentanglement of occupancy fields for different instances. Specifically,
synthetic data, created by randomly combining individual scans of humans/hands
and objects, guides the network to learn a coarse prior of instances.
Meanwhile, real-captured data helps in learning the overall geometry and
restricting interpenetration in contact areas. As demonstrated in experiments,
our method Ins-HOI supports instance-level reconstruction and provides
reasonable and realistic invisible contact surfaces even in cases of extremely
close interaction. To facilitate the research of this task, we collect a
large-scale, high-fidelity 3D scan dataset, including 5.2k high-quality scans
with real-world human-chair and hand-object interactions. The code and data
will be public for research purposes. |
This paper proposes Ins-HOI, an end-to-end framework for instance-level reconstruction of human/hand-object interactions from sparse-view RGB inputs, modeling intricate geometry and invisible contact surfaces using implicit surface representations. |
Existing methods for HOI reconstruction often rely on template-based representations, limiting their ability to capture fine-grained geometry and soft deformations caused by contact. This work aims to overcome these limitations by leveraging implicit surface representations and introducing a novel complementary training strategy. |
Ins-HOI utilizes an instance-level occupancy field to represent human/hand and object separately. It leverages both real-scanned data and synthetic data with instance-level ground truth for complementary training, ensuring both individual shape completeness and overall reconstruction reasonableness. The intersection between predicted instances is penalized during training to ensure plausible contact surfaces. |
Ins-HOI achieves comparable or superior performance to state-of-the-art methods like PIFu and NeuS2 on holistic reconstruction while uniquely supporting instance-level reconstruction.
The method effectively reconstructs invisible contact surfaces with plausible soft deformations, even for challenging cases of close interaction, as demonstrated by low intersection volumes between reconstructed instances.
Experiments on unseen object types show that Ins-HOI can generalize well with a small amount of synthetic data fine-tuning. |
While Ins-HOI produces reasonable contact surface reconstructions, capturing the precise deformations remains a challenge.
The method currently requires fine-tuning for novel object types. |
human-object interaction, hand-object interaction, 3d reconstruction, implicit surface representation, complementary training |
2312.09608
Report |
Faster Diffusion: Rethinking the Role of UNet Encoder in Diffusion Models |
Senmao Li, Taihang Hu, Fahad Shahbaz Khan, Linxuan Li, Shiqi Yang, Yaxing Wang, Ming-Ming Cheng, Jian Yang |
One of the key components within diffusion models is the UNet for noise
prediction. While several works have explored basic properties of the UNet
decoder, its encoder largely remains unexplored. In this work, we conduct the
first comprehensive study of the UNet encoder. We empirically analyze the
encoder features and provide insights into important questions regarding their
changes during the inference process. In particular, we find that encoder features
change gently, whereas the decoder features exhibit substantial variations
across different time-steps. This finding inspired us to omit the encoder at
certain adjacent time-steps and cyclically reuse the encoder features from
previous time-steps for the decoder. Building on this observation, we
introduce a simple yet effective encoder propagation scheme to accelerate
diffusion sampling for a diverse set of tasks. Thanks to our
propagation scheme, we are able to run the decoder in parallel at certain
adjacent time-steps. Additionally, we introduce a prior noise injection method
to improve the texture details in the generated image. Besides the standard
text-to-image task, we also validate our approach on other tasks:
text-to-video, personalized generation and reference-guided generation. Without
utilizing any knowledge distillation technique, our approach accelerates
sampling for the Stable Diffusion (SD) and DeepFloyd-IF models by 41% and
24% respectively, while maintaining high-quality generation performance. Our
code is available at
https://github.com/hutaiHang/Faster-Diffusion. |
This paper presents encoder propagation, a novel method for accelerating diffusion model sampling without knowledge distillation. |
Diffusion model sampling is computationally expensive due to iterative denoising. This work addresses this issue by reusing encoder features to improve efficiency. |
The paper empirically analyzes UNet features and finds that encoder features change minimally across time-steps. This observation leads to the proposed encoder propagation scheme, which reuses encoder features from previous time-steps, enabling parallel decoding and significant speedup. |
Encoder propagation accelerates Stable Diffusion sampling by 41% and DeepFloyd-IF by 24% while maintaining high generation quality.
The method is compatible with existing acceleration techniques like DPM-Solver and ToMe.
Qualitative and quantitative evaluations on tasks like text-to-video generation, personalized image generation, and reference-guided generation demonstrate the effectiveness of the proposed approach. |
The method faces challenges in maintaining quality when using very few sampling steps (e.g., 5).
Future work could explore adapting the technique for even faster generation with limited sampling steps. |
diffusion models, image generation, sampling acceleration, encoder propagation, parallel decoding |
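The idea of reusing encoder features across time-steps, as summarized in this entry, can be sketched with a toy sampling loop. Everything here is an illustrative assumption: the function name, the choice of key time-steps, and the scalar stand-ins for the UNet encoder, decoder, and sampler update are hypothetical and are not taken from the released code.

```python
def sample_with_encoder_propagation(x, timesteps, key_steps, encoder, decoder, update):
    """Run the encoder only at key time-steps and reuse its cached features elsewhere.

    Returns the final sample and the number of encoder calls made.
    """
    cached_features = None
    encoder_calls = 0
    for t in timesteps:
        if cached_features is None or t in key_steps:
            cached_features = encoder(x, t)  # fresh encoder pass at a key step
            encoder_calls += 1
        noise_pred = decoder(cached_features, t)  # decoder runs at every step
        x = update(x, noise_pred, t)
    return x, encoder_calls


# Scalar toy stand-ins for the real networks and sampler (assumptions).
def toy_encoder(x, t):
    return 0.5 * x


def toy_decoder(features, t):
    return features * (t / 10.0)


def toy_update(x, noise_pred, t):
    return x - 0.1 * noise_pred


timesteps = list(range(10, 0, -1))
_, calls_full = sample_with_encoder_propagation(
    1.0, timesteps, set(timesteps), toy_encoder, toy_decoder, toy_update)
_, calls_fast = sample_with_encoder_propagation(
    1.0, timesteps, {10, 7, 4, 1}, toy_encoder, toy_decoder, toy_update)
```

In this sketch the decoder call at a non-key step depends only on the cached features and the time-step, which is what would allow those decoder passes to be batched in parallel; the loop above keeps them sequential for clarity.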
2312.09579
Report |
MobileSAMv2: Faster Segment Anything to Everything |
Chaoning Zhang, Dongshen Han, Sheng Zheng, Jinwoo Choi, Tae-Ho Kim, Choong Seon Hong |
Segment anything model (SAM) addresses two practical yet challenging
segmentation tasks: segment anything (SegAny), which utilizes a
certain point to predict the mask for a single object of interest, and
segment everything (SegEvery), which predicts the masks for all
objects on the image. What makes SegAny slow for SAM is its heavyweight image
encoder, which has been addressed by MobileSAM via decoupled knowledge
distillation. The efficiency bottleneck of SegEvery with SAM, however, lies in
its mask decoder because it needs to first generate numerous masks with
redundant grid-search prompts and then perform filtering to obtain the final
valid masks. We propose to improve its efficiency by directly generating the
final masks with only valid prompts, which can be obtained through object
discovery. Our proposed approach not only helps reduce the total time on the
mask decoder by at least 16 times but also achieves superior performance.
Specifically, our approach yields an average performance boost of 3.6% (42.5%
vs. 38.9%) for zero-shot object proposal on the LVIS dataset with
the mask AR@K metric. Qualitative results show that our approach generates
fine-grained masks while avoiding over-segmenting things. This project
targeting faster SegEvery than the original SAM is termed MobileSAMv2 to
differentiate from MobileSAM which targets faster SegAny. Moreover, we
demonstrate that our new prompt sampling is also compatible with the distilled
image encoders in MobileSAM, contributing to a unified framework for efficient
SegAny and SegEvery. The code is available at the same link as the MobileSAM
project: https://github.com/ChaoningZhang/MobileSAM |
This paper introduces MobileSAMv2, an efficient approach for segmenting everything (SegEvery) in an image, addressing the efficiency bottleneck of the original SAM's mask decoder for this task. |
The original SAM's SegEvery, while effective, is computationally expensive, particularly in the mask decoding stage, hindering its practical use. |
MobileSAMv2 replaces SAM's grid-search prompt sampling with an object-aware prompt sampling strategy using YOLOv8 for object detection. This reduces the number of prompts, thereby speeding up mask decoding. |
MobileSAMv2 reduces the total time spent in the mask decoder for SegEvery by at least 16 times compared to SAM.
It achieves comparable and even superior performance to SAM on the LVIS dataset for zero-shot object proposal.
MobileSAMv2 effectively addresses the over-segmentation issue observed in SAM due to its object-aware prompt sampling. |
The current implementation relies on object discovery for prompt sampling, which could be further optimized for efficiency.
Future work could explore more powerful distilled image encoders to further reduce computation time without significantly sacrificing performance. |
image segmentation, segment anything model (sam), object detection, prompt engineering, efficiency |
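The prompt-reduction idea in the MobileSAMv2 entry above can be illustrated with a minimal sketch. It assumes a hypothetical 32x32 point grid and a hypothetical list of detector boxes; neither figure comes from the paper, and `grid_point_prompts` and `box_prompts` are illustrative names, not part of any SAM or MobileSAM API.

```python
def grid_point_prompts(width, height, points_per_side):
    """Grid-search sampling: one point prompt at the center of each grid cell."""
    step_x = width / points_per_side
    step_y = height / points_per_side
    return [
        ((i + 0.5) * step_x, (j + 0.5) * step_y)
        for j in range(points_per_side)
        for i in range(points_per_side)
    ]


def box_prompts(detections, min_score=0.5):
    """Object-aware sampling: keep one box prompt per confident detection."""
    return [box for box, score in detections if score >= min_score]


# Hypothetical detector output: ((x0, y0, x1, y1), confidence).
detections = [
    ((10, 20, 200, 220), 0.92),
    ((250, 40, 400, 300), 0.81),
    ((5, 5, 30, 30), 0.12),  # low confidence, filtered out
]

grid = grid_point_prompts(width=1024, height=1024, points_per_side=32)
boxes = box_prompts(detections)
print(len(grid), "grid prompts vs", len(boxes), "object-aware prompts")
```

Every prompt costs one pass through the mask decoder, so fewer prompts means less decoder time. The actual speedup depends on the detector and the image content.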
2312.09305
Report |
Stable Score Distillation for High-Quality 3D Generation |
Boshi Tang, Jianan Wang, Zhiyong Wu, Lei Zhang |
Although Score Distillation Sampling (SDS) has exhibited remarkable
performance in conditional 3D content generation, a comprehensive understanding
of its formulation is still lacking, hindering the development of 3D
generation. In this work, we decompose SDS as a combination of three functional
components, namely mode-seeking, mode-disengaging and variance-reducing terms,
analyzing the properties of each. We show that problems such as over-smoothness
and implausibility result from the intrinsic deficiency of the first two terms
and propose a more advanced variance-reducing term than that introduced by SDS.
Based on the analysis, we propose a simple yet effective approach named Stable
Score Distillation (SSD) which strategically orchestrates each term for
high-quality 3D generation and can be readily incorporated into various 3D
generation frameworks and 3D representations. Extensive experiments validate
the efficacy of our approach, demonstrating its ability to generate
high-fidelity 3D content without succumbing to issues such as over-smoothness. |
This paper presents Stable Score Distillation (SSD), a novel method for high-quality 3D content generation that leverages a comprehensive understanding of Score Distillation Sampling (SDS). The core contribution is decomposing the SDS estimator into three functional components: mode-disengaging, mode-seeking, and variance-reducing terms, and proposing a strategy to orchestrate them for improved 3D generation. |
Despite the success of SDS in conditional 3D content generation, a thorough understanding of its formulation was lacking, hindering further development in the field. This paper addresses this gap by dissecting and analyzing SDS, paving the way for more effective and stable 3D generation techniques. |
The paper analyzes the mathematical and numerical properties of each SDS component, identifying their limitations and strengths under different timestep regimes. It then proposes SSD, which strategically combines these components, leveraging the variance-reduced mode-seeking term for plausibility at low timesteps and the mode-disengaging term for trap escaping at high timesteps. |
SSD successfully mitigates over-smoothness and implausibility issues prevalent in SDS-based 3D generation.
The paper provides theoretical explanations for common observations and practices in 3D generation, such as the use of large CFG scales.
Extensive experiments demonstrate SSD's efficacy and compatibility with various 3D generation frameworks and representations, achieving superior results compared to state-of-the-art methods. |
The paper primarily focuses on single-object generation. Extending the analysis and method to more complex scenes with multiple objects presents an interesting future direction.
While SSD effectively reduces over-smoothing, further investigation into alternative strategies for transient mode avoidance could lead to additional improvements. |
3d generation, score distillation sampling, diffusion models, text-to-3d, generative modeling |
2312.09256
Report |
LIME: Localized Image Editing via Attention Regularization in Diffusion Models |
Enis Simsar, Alessio Tonioni, Yongqin Xian, Thomas Hofmann, Federico Tombari |
Diffusion models (DMs) have gained prominence due to their ability to
generate high-quality, varied images, with recent advancements in text-to-image
generation. The research focus is now shifting towards the controllability of
DMs. A significant challenge within this domain is localized editing, where
specific areas of an image are modified without affecting the rest of the
content. This paper introduces LIME for localized image editing in diffusion
models that do not require user-specified regions of interest (RoI) or
additional text input. Our method employs features from pre-trained methods and
a simple clustering technique to obtain precise semantic segmentation maps.
Then, by leveraging cross-attention maps, it refines these segments for
localized edits. Finally, we propose a novel cross-attention regularization
technique that penalizes unrelated cross-attention scores in the RoI during the
denoising steps, ensuring localized edits. Our approach, without re-training
and fine-tuning, consistently improves the performance of existing methods in
various editing benchmarks. |
This paper introduces LIME, a localized image editing technique for diffusion models that leverages pre-trained InstructPix2Pix without requiring user-specified regions of interest or additional text input. |
Localized image editing in diffusion models is a significant challenge due to the intertwined nature of image representations, where changes intended for one area can unintentionally affect others. Existing methods often rely on additional user input, such as masking the target area or providing extra text information, which adds complexity and doesn't guarantee seamless editing. |
LIME uses features from pre-trained InstructPix2Pix and a simple clustering technique to obtain precise semantic segmentation maps. It then leverages cross-attention maps to refine these segments for localized edits. Finally, it employs a novel cross-attention regularization technique that penalizes unrelated cross-attention scores in the region of interest during denoising steps, ensuring localized edits. |
LIME consistently improves the performance of existing methods in various editing benchmarks.
LIME effectively implements localized edits while preserving the overall scene context, outperforming state-of-the-art models, including their fine-tuned versions on manually annotated datasets.
LIME achieves significant improvements on metrics measuring structure and background preservation, indicating precise edits according to instructions while avoiding unintended changes to unaffected regions. |
LIME may alter the scene's style, particularly in color, due to base model entanglement, though it still significantly improves edits compared to InstructPix2Pix.
Prompt content can impact edit quality, as all tokens except start-of-text, stop words, and padding affect the region of interest during editing, leading to feature mixing. |
image editing, diffusion models, localized editing, attention regularization, semantic segmentation |
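The cross-attention regularization described in the LIME entry above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: scores are plain nested lists, the penalty magnitude is arbitrary, and the "unrelated" tokens are taken to be start-of-text, stop words, and padding, as the limitations note suggests.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def regularize_cross_attention(scores, roi_mask, unrelated_tokens, penalty=1e4):
    """Subtract a large penalty from unrelated-token scores inside the RoI.

    scores: one row per pixel, each row holding one score per text token.
    roi_mask: booleans, True where the pixel lies inside the RoI.
    unrelated_tokens: token indices to suppress inside the RoI.
    """
    out = []
    for row, in_roi in zip(scores, roi_mask):
        if in_roi:
            row = [
                s - penalty if t in unrelated_tokens else s
                for t, s in enumerate(row)
            ]
        out.append(softmax(row))
    return out


# Hypothetical tokens: 0 = start-of-text, 1 = "make", 2 = "the", 3 = "hat", 4 = padding.
scores = [[2.0, 1.0, 0.5, 1.5, 0.1], [2.0, 1.0, 0.5, 1.5, 0.1]]
roi_mask = [True, False]  # only the first pixel is inside the RoI
attn = regularize_cross_attention(scores, roi_mask, unrelated_tokens={0, 2, 4})
print(attn[0])  # attention mass inside the RoI goes to the instruction tokens
print(attn[1])  # pixels outside the RoI keep their ordinary softmax
```

Because the penalty is applied before the softmax, the suppressed tokens receive close to zero attention inside the RoI while the remaining weights still sum to one.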
2312.09252
Report |
FineControlNet: Fine-level Text Control for Image Generation with Spatially Aligned Text Control Injection |
Hongsuk Choi, Isaac Kasahara, Selim Engin, Moritz Graule, Nikhil Chavan-Dafle, Volkan Isler |
Recently introduced ControlNet has the ability to steer the text-driven image
generation process with geometric input such as human 2D pose, or edge
features. While ControlNet provides control over the geometric form of the
instances in the generated image, it lacks the capability to dictate the visual
appearance of each instance. We present FineControlNet to provide fine control
over each instance's appearance while maintaining the precise pose control
capability. Specifically, we develop and demonstrate FineControlNet with
geometric control via human pose images and appearance control via
instance-level text prompts. The spatial alignment of instance-specific text
prompts and 2D poses in latent space enables the fine control capabilities of
FineControlNet. We evaluate the performance of FineControlNet with rigorous
comparison against state-of-the-art pose-conditioned text-to-image diffusion
models. FineControlNet achieves superior performance in generating images that
follow the user-provided instance-specific text prompts and poses compared with
existing methods. Project webpage:
https://samsunglabs.github.io/FineControlNet-project-page |
FineControlNet allows users to control the appearance and pose of individual instances in a scene, enhancing text-to-image generation with fine-grained control over multiple objects. |
Existing methods often struggle to generate images with distinct appearances for different instances, leading to visual feature blending or ignoring specific descriptions. FineControlNet addresses this limitation by providing fine-grained control over each instance's appearance while maintaining accurate pose control. |
FineControlNet spatially aligns instance-level text prompts with corresponding 2D poses in the latent space during the reverse diffusion process. It separates and composes different conditions, leveraging pretrained Stable Diffusion and ControlNet, to generate images conditioned on both text and poses. |
FineControlNet demonstrates superior performance in generating images that accurately reflect user-provided instance-specific text prompts and poses.
Quantitative analysis shows FineControlNet achieves competitive image quality (FID) and pose control accuracy (AP) compared to state-of-the-art baselines.
FineControlNet excels in CLIP Identity Observance (CIO) metrics, indicating a higher degree of text-image consistency and distinct identity generation for each instance. |
FineControlNet may exhibit limitations in handling challenging poses, generating realistic human faces, and ensuring physically plausible scene compositions.
Future work could explore enhancing generalization capabilities for extreme variations in instance count, scale, and proximity. |
text-to-image generation, fine-grained control, instance-level conditioning, diffusion models, controlnet |
2312.09249
Report |
ZeroRF: Fast Sparse View 360° Reconstruction with Zero Pretraining |
Ruoxi Shi, Xinyue Wei, Cheng Wang, Hao Su |
We present ZeroRF, a novel per-scene optimization method addressing the
challenge of sparse view 360° reconstruction in neural field
representations. Current breakthroughs like Neural Radiance Fields (NeRF) have
demonstrated high-fidelity image synthesis but struggle with sparse input
views. Existing methods, such as Generalizable NeRFs and per-scene optimization
approaches, face limitations in data dependency, computational cost, and
generalization across diverse scenarios. To overcome these challenges, we
propose ZeroRF, whose key idea is to integrate a tailored Deep Image Prior into
a factorized NeRF representation. Unlike traditional methods, ZeroRF
parametrizes feature grids with a neural network generator, enabling efficient
sparse view 360° reconstruction without any pretraining or additional
regularization. Extensive experiments showcase ZeroRF's versatility and
superiority in terms of both quality and speed, achieving state-of-the-art
results on benchmark datasets. ZeroRF's significance extends to applications in
3D content generation and editing. Project page:
https://sarahweiii.github.io/zerorf/ |
ZeroRF is a novel per-scene optimization method that integrates a tailored Deep Image Prior into a factorized NeRF representation for fast and high-quality sparse view 360° reconstruction. |
Existing methods for sparse-view 360° reconstruction struggle with data dependency, high computational cost, and limited generalization across diverse scenarios. They often fail to produce accurate and visually pleasing results due to noisy and distorted features obtained from limited input views. |
ZeroRF parametrizes feature grids of factorized NeRF representations with a randomly-initialized deep neural network generator. It employs a plain MSE rendering loss and does not require any pretraining or additional regularization. The method leverages the deep prior captured within the generator network's architecture to produce clean and well-structured features even with sparse input views. |
ZeroRF achieves state-of-the-art results on sparse view benchmarks, outperforming previous methods in terms of PSNR, SSIM, and LPIPS.
ZeroRF is significantly faster than existing per-scene optimization approaches, reconstructing objects in as little as 30 seconds for common resolutions in 3D generation tasks.
ZeroRF is robust and generalizes well across diverse scenarios, demonstrated by its high-quality reconstructions from both synthetic and real-world datasets. |
ZeroRF might magnify the limitations of underlying factorized NeRF representations, such as axis-aligned artifacts in TensoRF.
Extending ZeroRF to unbounded scenes requires further investigation, as the non-linear contraction of space in grid representations for unbounded scenes leads to distorted features. |
neural radiance fields, nerf, sparse view reconstruction, deep image prior, 3d reconstruction |
2312.09246
Report |
SHAP-EDITOR: Instruction-guided Latent 3D Editing in Seconds |
Minghao Chen, Junyu Xie, Iro Laina, Andrea Vedaldi |
We propose a novel feed-forward 3D editing framework called Shap-Editor.
Prior research on editing 3D objects primarily concentrated on editing
individual objects by leveraging off-the-shelf 2D image editing networks. This
is achieved via a process called distillation, which transfers knowledge from
the 2D network to 3D assets. Distillation necessitates at least tens of minutes
per asset to attain satisfactory editing results, and is thus not very
practical. In contrast, we ask whether 3D editing can be carried out directly
by a feed-forward network, eschewing test-time optimisation. In particular, we
hypothesise that editing can be greatly simplified by first encoding 3D objects
in a suitable latent space. We validate this hypothesis by building upon the
latent space of Shap-E. We demonstrate that direct 3D editing in this space is
possible and efficient by building a feed-forward editor network that only
requires approximately one second per edit. Our experiments show that
Shap-Editor generalises well to both in-distribution and out-of-distribution 3D
assets with different prompts, exhibiting comparable performance with methods
that carry out test-time optimisation for each edited instance. |
This paper introduces Shap-Editor, a novel feed-forward 3D editing framework that performs semantic edits on 3D objects in latent space based on natural language instructions. |
Existing 3D editing methods rely on time-consuming test-time optimisation, making them impractical for interactive applications. Shap-Editor addresses this limitation by enabling near-instantaneous editing within a learned latent space. |
Shap-Editor leverages the latent space of a pre-trained 3D auto-encoder (Shap-E) and distills knowledge from multiple 2D image editors using a score distillation sampling loss. It learns a latent editor function that maps a source 3D object's latent code to an edited latent code based on the input instruction. |
Shap-Editor achieves superior editing results compared to state-of-the-art optimisation-based methods while reducing inference time from minutes to seconds.
The learned latent editor exhibits good generalisation capabilities, effectively editing unseen 3D objects and handling compositions of multiple edits.
The latent space demonstrates partial linearity, enabling control over the strength of the applied edit through simple arithmetic operations. |
The quality of Shap-Editor is limited by the expressiveness of the underlying pre-trained 3D auto-encoder and 2D image editors.
Although Shap-Editor can learn from multiple instructions, achieving a fully open-ended 3D editor remains an open challenge. |
3d editing, latent space, score distillation sampling, text-guided editing, feed-forward network |
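The Shap-Editor entry above notes that partial linearity of the latent space lets simple arithmetic control edit strength. One plausible reading is linear interpolation between the source latent and the edited latent. The sketch below assumes that form; the function name and the toy vectors are hypothetical and do not come from the paper.

```python
def interpolate_edit(source_latent, edited_latent, strength):
    """Move from the source latent toward the edited latent by `strength`.

    strength = 0.0 returns the source, 1.0 returns the full edit, and
    values in between give a weaker edit.
    """
    return [s + strength * (e - s) for s, e in zip(source_latent, edited_latent)]


source = [0.0, 1.0, 2.0]   # toy stand-in for a latent code
edited = [1.0, 1.0, 0.0]   # toy stand-in for the editor's output
half = interpolate_edit(source, edited, 0.5)
print(half)
```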
2312.09242
Report |
Text2Immersion: Generative Immersive Scene with 3D Gaussians |
Hao Ouyang, Kathryn Heal, Stephen Lombardi, Tiancheng Sun |
We introduce Text2Immersion, an elegant method for producing high-quality 3D
immersive scenes from text prompts. Our proposed pipeline initiates by
progressively generating a Gaussian cloud using pre-trained 2D diffusion and
depth estimation models. This is followed by a refining stage on the Gaussian
cloud, interpolating and refining it to enhance the details of the generated
scene. Distinct from prevalent methods that focus on single object or indoor
scenes, or employ zoom-out trajectories, our approach generates diverse scenes
with various objects, even extending to the creation of imaginary scenes.
Consequently, Text2Immersion can have wide-ranging implications for various
applications such as virtual reality, game development, and automated content
creation. Extensive evaluations demonstrate that our system surpasses other
methods in rendering quality and diversity, further progressing towards
text-driven 3D scene generation. We will make the source code publicly
accessible at the project page. |
Presents Text2Immersion, a method for generating high-quality 3D immersive scenes from text prompts using 3D Gaussians. |
Addresses limitations in existing text-to-3D methods that struggle with scene generation, limited diversity, and slow rendering speeds. |
Two-stage pipeline: 1) Initialization of 3D Gaussian cloud from anchor views using diffusion models and depth estimation. 2) Refinement of the Gaussian cloud via inpainting and super-resolution using additional generated views. |
Generates high-fidelity, diverse, and immersive 3D scenes from text prompts.
Outperforms existing methods in terms of rendering quality, diversity, and alignment with text prompts.
Achieves real-time rendering speeds (180 FPS on a 3070 laptop GPU). |
Reliance on monocular depth estimation can lead to visual artifacts if estimations are inaccurate.
Inpainting new objects during refinement may cause ghosting effects. |
text-to-3d, 3d scene generation, 3d gaussian splatting, diffusion models, immersive environments |
2312.09237
Report |
Pixel Aligned Language Models |
Jiarui Xu, Xingyi Zhou, Shen Yan, Xiuye Gu, Anurag Arnab, Chen Sun, Xiaolong Wang, Cordelia Schmid |
Large language models have achieved great success in recent years, as have
their variants in vision. Existing vision-language models can describe images
in natural languages, answer visual-related questions, or perform complex
reasoning about the image. However, it is still unclear how localization tasks,
such as word grounding or referring localization, can be performed using large
language models. In this work, we aim to develop a vision-language model that
can take locations, for example, a set of points or boxes, as either inputs or
outputs. When taking locations as inputs, the model performs
location-conditioned captioning, which generates captions for the indicated
object or region. When generating locations as outputs, our model regresses
pixel coordinates for each output word generated by the language model, and
thus performs dense word grounding. Our model is pre-trained on the Localized
Narrative dataset, which contains pixel-word-aligned captioning from human
attention. We show our model can be applied to various location-aware
vision-language tasks, including referring localization, location-conditioned
captioning, and dense object captioning, achieving state-of-the-art performance
on RefCOCO and Visual Genome. Project page: https://jerryxu.net/PixelLLM . |
Introduces PixelLLM, a vision-language model that generates captions and aligns each word to a pixel location, enabling localization capabilities within LLMs. |
Addresses the lack of fine-grained localization abilities in existing vision-language models, allowing for spatial understanding and reasoning in LLMs. |
Leverages a novel architecture with a prompt feature extractor to condition image features on location prompts and adds a parallel MLP layer to the language model for per-token location regression. Trains on the Localized Narrative dataset with synchronized caption-location annotations. |
Achieves state-of-the-art performance on RefCOCO referring localization and segmentation.
Outperforms previous methods on dense object captioning and location-conditioned captioning tasks.
Demonstrates the effectiveness of the per-token localization formulation, especially with dense supervision from the Localized Narrative dataset. |
Limited evaluation on the less explored task of controlled trace generation.
Reliance on the quality and noise levels within the Localized Narrative dataset's mouse trace annotations. |
vision-language models, localization, referring expression comprehension, dense captioning, large language models |
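The per-token location regression in the PixelLLM entry above can be pictured as a small MLP applied independently to each token's feature vector, producing one (x, y) pair per word. The sketch below is a toy version under that assumption: the feature size, hidden size, random initialisation, and all names are hypothetical, and no training is shown.

```python
import random


def linear(x, weight, bias):
    """Apply y = W x + b to a single feature vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_linear(rng, in_dim, out_dim):
    """Small random weights and zero biases for one linear layer."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def location_head(token_features, params):
    """Map each token feature to a 2D coordinate with a two-layer MLP."""
    (w1, b1), (w2, b2) = params
    return [linear(relu(linear(f, w1, b1)), w2, b2) for f in token_features]


rng = random.Random(0)
feature_dim, hidden_dim, num_tokens = 8, 16, 4
params = (init_linear(rng, feature_dim, hidden_dim), init_linear(rng, hidden_dim, 2))
# Stand-in for the language model's per-token features.
token_features = [[rng.uniform(-1, 1) for _ in range(feature_dim)] for _ in range(num_tokens)]
coords = location_head(token_features, params)
print(len(coords), "tokens, each with", len(coords[0]), "coordinates")
```

Since the head runs in parallel with the usual vocabulary projection, each generated word gets a location at no extra decoding step.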
2312.09228
Report |
3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting |
Zhiyin Qian, Shaofei Wang, Marko Mihajlovic, Andreas Geiger, Siyu Tang |
We introduce an approach that creates animatable human avatars from monocular
videos using 3D Gaussian Splatting (3DGS). Existing methods based on neural
radiance fields (NeRFs) achieve high-quality novel-view/novel-pose image
synthesis but often require days of training, and are extremely slow at
inference time. Recently, the community has explored fast grid structures for
efficient training of clothed avatars. Albeit being extremely fast at training,
these methods can barely achieve an interactive rendering frame rate with
around 15 FPS. In this paper, we use 3D Gaussian Splatting and learn a
non-rigid deformation network to reconstruct animatable clothed human avatars
that can be trained within 30 minutes and rendered at real-time frame rates
(50+ FPS). Given the explicit nature of our representation, we further
introduce as-isometric-as-possible regularizations on both the Gaussian mean
vectors and the covariance matrices, enhancing the generalization of our model
on highly articulated unseen poses. Experimental results show that our method
achieves comparable and even better performance compared to state-of-the-art
approaches on animatable avatar creation from a monocular input, while being
400x and 250x faster in training and inference, respectively. |
This paper presents 3DGS-Avatar, a novel method for creating animatable human avatars from monocular videos using 3D Gaussian Splatting (3DGS). |
Existing NeRF-based methods for avatar creation are computationally expensive and slow in training and inference, making them impractical for real-time applications. This paper aims to address this limitation by leveraging the efficiency of 3DGS. |
The proposed method leverages 3DGS and learns a non-rigid deformation network to reconstruct animatable clothed human avatars. It decomposes human deformation into non-rigid (pose-dependent cloth deformation) and rigid (skeleton-controlled) components. The approach uses a small MLP for color decoding, accounting for local deformations and dynamic lighting. As-isometric-as-possible regularizations are applied to Gaussian mean vectors and covariance matrices to enhance generalization to unseen poses. |
The method achieves comparable or better performance than state-of-the-art approaches in animatable avatar creation from monocular inputs.
It achieves significantly faster training (400x) and inference (250x) speeds compared to the most competitive baseline (HumanNeRF).
The approach effectively generalizes to unseen poses and preserves finer details compared to other methods. |
The training time, while significantly improved, still doesn't match the fastest grid-based methods.
The method may produce blurry results in areas with high-frequency textures. |
3d gaussian splatting, animatable avatars, monocular reconstruction, neural rendering, real-time rendering |
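The as-isometric-as-possible regularization mentioned in the 3DGS-Avatar entry above can be sketched for the Gaussian means as a penalty on changes in distance between neighbouring points. This is a simplified illustration, not the paper's loss: the neighbour count, the L1 form, and the function names are assumptions, and the covariance term is omitted.

```python
import math


def dist(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: dist(points[i], points[j]))
    return others[:k]


def isometric_loss(canonical, deformed, k=2):
    """Sum of absolute distance changes between canonical-space neighbours."""
    total = 0.0
    for i in range(len(canonical)):
        for j in k_nearest(canonical, i, k):
            total += abs(dist(canonical[i], canonical[j]) - dist(deformed[i], deformed[j]))
    return total


canonical = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
translated = [(x + 2.0, y + 3.0, z + 1.0) for x, y, z in canonical]  # rigid motion
stretched = [(2.0 * x, y, z) for x, y, z in canonical]               # non-isometric

print(isometric_loss(canonical, translated))  # rigid motion preserves distances
print(isometric_loss(canonical, stretched))   # stretching is penalised
```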
2312.09222
Report |
Mosaic-SDF for 3D Generative Models |
Lior Yariv, Omri Puny, Natalia Neverova, Oran Gafni, Yaron Lipman |
Current diffusion or flow-based generative models for 3D shapes divide into
two categories: distilling pre-trained 2D image diffusion models, and training directly on
3D shapes. When training a diffusion or flow model on 3D shapes, a crucial
design choice is the shape representation. An effective shape representation
needs to adhere to three design principles: it should allow an efficient
conversion of large 3D datasets to the representation form; it should provide a
good tradeoff of approximation power versus number of parameters; and it should
have a simple tensorial form that is compatible with existing powerful neural
architectures. While standard 3D shape representations such as volumetric grids
and point clouds do not adhere to all these principles simultaneously, we
advocate in this paper a new representation that does. We introduce Mosaic-SDF
(M-SDF): a simple 3D shape representation that approximates the Signed Distance
Function (SDF) of a given shape by using a set of local grids spread near the
shape's boundary. The M-SDF representation is fast to compute for each shape
individually making it readily parallelizable; it is parameter efficient as it
only covers the space around the shape's boundary; and it has a simple matrix
form, compatible with Transformer-based architectures. We demonstrate the
efficacy of the M-SDF representation by using it to train a 3D generative flow
model including class-conditioned generation with the 3D Warehouse dataset, and
text-to-3D generation using a dataset of about 600k caption-shape pairs. |
This paper introduces Mosaic-SDF (M-SDF), a novel 3D shape representation for training generative models, which approximates the Signed Distance Function (SDF) using a set of local grids near the shape's boundary. |
An effective 3D shape representation for generative models should be efficiently computable for large datasets, parameter efficient, and compatible with modern neural architectures. Existing representations often lack one or more of these properties. |
M-SDF represents a shape as a set of local grids, each with a center, scale, and grid values sampled from the shape's SDF. A permutation-equivariant Flow Matching model is then trained on this representation using two datasets: ShapeNetCore-V2 and a dataset of shapes with text descriptions. |
M-SDF achieves superior surface approximation per parameter budget compared to Implicit Neural Representations (INRs) while requiring significantly less computation time.
M-SDF outperforms or achieves comparable results to state-of-the-art methods in class-conditional 3D shape generation on ShapeNetCore-V2, as measured by various metrics including Fréchet PointNet++ Distance (FPD), Coverage (COV), and 1-Nearest Neighbor Accuracy (1-NNA).
Qualitative results demonstrate that M-SDF generates high-fidelity shapes with sharper details compared to baselines, which often produce overly smooth results. |
The current M-SDF representation only encodes the SDF and lacks texture or color information.
The simple linear layer connecting local grids to the transformer could be replaced with more sophisticated architectures like convolutional layers or autoencoders. |
3d shape representation, generative models, signed distance function, flow matching, mosaic-sdf |
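The matrix form described in the Mosaic-SDF entry above can be illustrated with a toy construction: each row stores one local grid as its center, its scale, and the SDF values sampled on a small cube around the center. The sketch below uses a unit sphere and hand-picked centers; the grid resolution, the scale, and all function names are assumptions, and the paper's own procedure for placing and fitting grids is not reproduced.

```python
import math


def sphere_sdf(p, radius=1.0):
    """Signed distance from point p to a sphere centered at the origin."""
    return math.sqrt(sum(c * c for c in p)) - radius


def local_grid_values(sdf, center, scale, k):
    """Sample the SDF on a k x k x k grid spanning [-scale, scale] around center."""
    offsets = [scale * (2 * i / (k - 1) - 1) for i in range(k)]
    return [
        sdf((center[0] + dx, center[1] + dy, center[2] + dz))
        for dx in offsets
        for dy in offsets
        for dz in offsets
    ]


def build_msdf(sdf, centers, scale, k):
    """One row per local grid: [cx, cy, cz, scale, k^3 SDF values]."""
    return [list(c) + [scale] + local_grid_values(sdf, c, scale, k) for c in centers]


# Six hand-picked centers on the surface of the unit sphere.
centers = [
    (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0),
    (0.0, 1.0, 0.0), (0.0, -1.0, 0.0),
    (0.0, 0.0, 1.0), (0.0, 0.0, -1.0),
]
msdf = build_msdf(sphere_sdf, centers, scale=0.2, k=3)
print(len(msdf), "rows of length", len(msdf[0]))
```

The rows form an unordered set, which is why the entry pairs the representation with a permutation-equivariant model.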
2312.09158
Report |
General Object Foundation Model for Images and Videos at Scale |
Junfeng Wu, Yi Jiang, Qihao Liu, Zehuan Yuan, Xiang Bai, Song Bai |
We present GLEE in this work, an object-level foundation model for locating
and identifying objects in images and videos. Through a unified framework, GLEE
accomplishes detection, segmentation, tracking, grounding, and identification
of arbitrary objects in the open world scenario for various object perception
tasks. Adopting a cohesive learning strategy, GLEE acquires knowledge from
diverse data sources with varying supervision levels to formulate general
object representations, excelling in zero-shot transfer to new data and tasks.
Specifically, we employ an image encoder, text encoder, and visual prompter to
handle multi-modal inputs, enabling it to simultaneously solve various
object-centric downstream tasks while maintaining state-of-the-art performance.
Demonstrated through extensive training on over five million images from
diverse benchmarks, GLEE exhibits remarkable versatility and improved
generalization performance, efficiently tackling downstream tasks without the
need for task-specific adaptation. By integrating large volumes of
automatically labeled data, we further enhance its zero-shot generalization
capabilities. Additionally, GLEE is capable of being integrated into Large
Language Models, serving as a foundational model to provide universal
object-level information for multi-modal tasks. We hope that the versatility
and universality of our method will mark a significant step in the development
of efficient visual foundation models for AGI systems. The model and code will
be released at https://glee-vision.github.io . |
GLEE is a novel object-level foundation model for locating and identifying objects in images and videos, achieving detection, segmentation, tracking, grounding, and identification in an open-world setting. |
Existing visual foundation models often focus on global image-level understanding, lacking the crucial ability to locate and identify individual objects. GLEE addresses this limitation by providing general and accurate object-level information. |
GLEE utilizes a unified framework with an image encoder, text encoder, visual prompter, and object decoder. This enables multi-modal input handling and simultaneous solving of various object-centric tasks. Trained on over five million images with multi-granularity joint supervision, it excels in zero-shot transfer to new data and tasks. |
GLEE achieves state-of-the-art performance on various object-level image tasks, including detection, referring expression comprehension, and open-world detection.
It exhibits remarkable zero-shot generalization capabilities in large-vocabulary open-world video tracking tasks, surpassing existing models.
Integrating automatically annotated data (SA1B, GRIT) enhances GLEE's zero-shot generalization and allows scaling to 10 million training images. |
While GLEE excels in zero-shot transfer, it might benefit from fine-tuning for tasks heavily reliant on temporal consistency, like OVIS.
Further improvements can be achieved by incorporating a larger and more diverse set of captioned data for enhanced text comprehension. |
foundation models, object detection, instance segmentation, object tracking, zero-shot learning |
2312.09147
Report |
Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers |
Zi-Xin Zou, Zhipeng Yu, Yuan-Chen Guo, Yangguang Li, Ding Liang, Yan-Pei Cao, Song-Hai Zhang |
Recent advancements in 3D reconstruction from single images have been driven
by the evolution of generative models. Prominent among these are methods based
on Score Distillation Sampling (SDS) and the adaptation of diffusion models in
the 3D domain. Despite their progress, these techniques often face limitations
due to slow optimization or rendering processes, leading to extensive training
and optimization times. In this paper, we introduce a novel approach for
single-view reconstruction that efficiently generates a 3D model from a single
image via feed-forward inference. Our method utilizes two transformer-based
networks, namely a point decoder and a triplane decoder, to reconstruct 3D
objects using a hybrid Triplane-Gaussian intermediate representation. This
hybrid representation strikes a balance, achieving a faster rendering speed
compared to implicit representations while simultaneously delivering superior
rendering quality than explicit representations. The point decoder is designed
for generating point clouds from single images, offering an explicit
representation which is then utilized by the triplane decoder to query Gaussian
features for each point. This design choice addresses the challenges associated
with directly regressing explicit 3D Gaussian attributes characterized by their
non-structural nature. Subsequently, the 3D Gaussians are decoded by an MLP to
enable rapid rendering through splatting. Both decoders are built upon a
scalable, transformer-based architecture and have been efficiently trained on
large-scale 3D datasets. The evaluations conducted on both synthetic datasets
and real-world images demonstrate that our method not only achieves higher
quality but also ensures a faster runtime in comparison to previous
state-of-the-art techniques. Please see our project page at
https://zouzx.github.io/TriplaneGaussian/. |
Introduces TriplaneGaussian, a novel approach for fast 3D object reconstruction from single-view images using a hybrid Triplane-Gaussian representation. |
Addresses limitations of existing methods that suffer from slow optimization or rendering processes, hindering fast 3D content creation. |
Employs two transformer-based networks: a point decoder to generate a coarse point cloud and a triplane decoder to output implicit triplane features. Leverages projection-aware conditioning and geometry-aware encoding for improved reconstruction and novel view synthesis. |
Achieves higher quality geometry reconstruction than Point-E, Shap-E, and One-2-3-45 on the GSO dataset.
Outperforms Zero-1-2-3 and One-2-3-45 in novel view synthesis, demonstrating higher consistency and detail.
Significantly faster in both reconstruction and rendering compared to baseline methods due to its feed-forward architecture and efficient rasterization. |
Rendering quality is dependent on the accuracy of the initial point cloud prediction.
Backside rendering tends to be blurry due to the non-probabilistic nature of the model. |
3d reconstruction, single-view reconstruction, gaussian splatting, transformer, novel view synthesis |
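Illustrative sketch for the TriplaneGaussian entry above: one plausible way for a point from the predicted point cloud to look up features on three axis-aligned planes is bilinear sampling on each plane followed by concatenation. The function names, the plane keys, the [0, 1] coordinate range, and the choice to concatenate are assumptions made for illustration; the summary does not specify them, and the MLP that decodes Gaussian attributes is omitted.

```python
import math


def bilinear(plane, u, v):
    """Bilinearly sample a feature vector from a 2D feature grid.

    plane[row][col] is a list of floats. u (columns) and v (rows) are
    normalized coordinates in [0, 1]; values outside are clamped.
    """
    rows, cols = len(plane), len(plane[0])
    x = min(max(u, 0.0), 1.0) * (cols - 1)
    y = min(max(v, 0.0), 1.0) * (rows - 1)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, cols - 1), min(y0 + 1, rows - 1)
    fx, fy = x - x0, y - y0
    out = []
    for c in range(len(plane[0][0])):
        top = plane[y0][x0][c] * (1 - fx) + plane[y0][x1][c] * fx
        bottom = plane[y1][x0][c] * (1 - fx) + plane[y1][x1][c] * fx
        out.append(top * (1 - fy) + bottom * fy)
    return out


def query_triplane(planes, point):
    """Concatenate features sampled from the xy, xz and yz planes.

    `planes` is a dict with hypothetical keys "xy", "xz", "yz";
    `point` is an (x, y, z) tuple with each coordinate in [0, 1].
    """
    x, y, z = point
    return (bilinear(planes["xy"], x, y)
            + bilinear(planes["xz"], x, z)
            + bilinear(planes["yz"], y, z))


# Tiny 2x2 single-channel grid reused for all three planes.
grid = [[[0.0], [1.0]], [[2.0], [3.0]]]
features = query_triplane({"xy": grid, "xz": grid, "yz": grid}, (0.5, 0.5, 0.0))
```

Each point ends up with one feature vector whose length is three times the per-plane channel count, which is what a per-point decoder would consume.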
2312.09138
Report |
Living Scenes: Multi-object Relocalization and Reconstruction in Changing 3D Environments |
Liyuan Zhu, Shengyu Huang, Konrad Schindler, Iro Armeni |
Research into dynamic 3D scene understanding has primarily focused on
short-term change tracking from dense observations, while little attention has
been paid to long-term changes with sparse observations. We address this gap
with MoRE, a novel approach for multi-object relocalization and reconstruction
in evolving environments. We view these environments as "living scenes" and
consider the problem of transforming scans taken at different points in time
into a 3D reconstruction of the object instances, whose accuracy and
completeness increase over time. At the core of our method lies an
SE(3)-equivariant representation in a single encoder-decoder network, trained
on synthetic data. This representation enables us to seamlessly tackle instance
matching, registration, and reconstruction. We also introduce a joint
optimization algorithm that facilitates the accumulation of point clouds
originating from the same instance across multiple scans taken at different
points in time. We validate our method on synthetic and real-world data and
demonstrate state-of-the-art performance in both end-to-end performance and
individual subtasks. |
Introduces MoRE, a novel method for multi-object relocalization and reconstruction in evolving 3D environments over long time spans and from sparse observations. |
Addresses the gap in research focusing on long-term changes with sparse observations in dynamic 3D scene understanding. This is important for applications that benefit from an integrated understanding of the environment accumulated over time. |
Uses a single encoder-decoder network with an SE(3)-equivariant representation to tackle instance matching, registration, and reconstruction. Introduces a joint optimization algorithm to refine registration and reconstruction, accumulating point clouds from different scans for improved accuracy and completeness. |
Achieves state-of-the-art performance in multi-object relocalization and reconstruction on synthetic and real-world datasets.
Demonstrates the effectiveness of the joint optimization algorithm in improving geometric accuracy and completeness over time.
Shows robustness to noisy and incomplete instance segmentation masks. |
Test-time optimizations prevent real-time end-to-end execution.
Faces challenges with multiple identical, similar, or symmetric shapes in the scene. |
3d scene understanding, multi-object relocalization, point cloud registration, 3d reconstruction, se(3)-equivariant networks |
2312.09128
Report |
Tokenize Anything via Prompting |
Ting Pan, Lulu Tang, Xinlong Wang, Shiguang Shan |
We present a unified, promptable model capable of simultaneously segmenting,
recognizing, and captioning anything. Unlike SAM, we aim to build a versatile
region representation in the wild via visual prompting. To achieve this, we
train a generalizable model with massive segmentation masks, e.g., SA-1B masks,
and semantic priors from a pre-trained CLIP model with 5 billion parameters.
Specifically, we construct a promptable image decoder by adding a semantic
token to each mask token. The semantic token is responsible for learning the
semantic priors in a predefined concept space. Through joint optimization of
segmentation on mask tokens and concept prediction on semantic tokens, our
model exhibits strong regional recognition and localization capabilities. For
example, an additional 38M-parameter causal text decoder trained from scratch
sets a new record with a CIDEr score of 150.7 on the Visual Genome region
captioning task. We believe this model can be a versatile region-level image
tokenizer, capable of encoding general-purpose region context for a broad range
of perception tasks. Code and models are available at
https://github.com/baaivision/tokenize-anything. |
This paper introduces TAP, a unified and promptable model that simultaneously performs segmentation, recognition, and captioning of any given region in an image. |
This is important because it moves towards a single, versatile vision model capable of diverse perception tasks with strong zero-shot generalization. |
The authors achieve this by combining the segmentation capabilities of SAM with the semantic understanding of CLIP. They pre-train TAP on a new dataset, SemanticSA-1B, which integrates web-scale semantics from LAION-2B into the segmentation masks of SA-1B. This allows TAP to learn both pixel-level localization and region-level semantic understanding. |
TAP exhibits strong zero-shot instance classification performance, achieving 59.0 AP on LVIS.
TAP achieves competitive zero-shot segmentation performance compared to SAM, indicating that the added semantic understanding does not compromise its segmentation abilities.
TAP sets a new record on the Visual Genome region captioning task with a CIDEr score of 150.7, demonstrating its capability to understand and generate language descriptions of visual regions. |
The model is currently limited by the human-curated label space used during training, falling short of true open-world learning.
The text decoder is fine-tuned on a limited region captioning dataset, potentially restricting its scalability and capacity for complex visual-language understanding. |
vision foundation model, promptable segmentation, region recognition, image captioning, zero-shot learning |
2312.09109
Report |
VideoLCM: Video Latent Consistency Model |
Xiang Wang, Shiwei Zhang, Han Zhang, Yu Liu, Yingya Zhang, Changxin Gao, Nong Sang |
Consistency models have demonstrated powerful capability in efficient image
generation and allowed synthesis within a few sampling steps, alleviating the
high computational cost in diffusion models. However, the consistency model in
the more challenging and resource-consuming video generation is still less
explored. In this report, we present the VideoLCM framework to fill this gap,
which leverages the concept of consistency models from image generation to
efficiently synthesize videos with minimal steps while maintaining high
quality. VideoLCM builds upon existing latent video diffusion models and
incorporates consistency distillation techniques for training the latent
consistency model. Experimental results reveal the effectiveness of our
VideoLCM in terms of computational efficiency, fidelity and temporal
consistency. Notably, VideoLCM achieves high-fidelity and smooth video
synthesis with only four sampling steps, showcasing the potential for real-time
synthesis. We hope that VideoLCM can serve as a simple yet effective baseline
for subsequent research. The source code and models will be publicly available. |
Introduces VideoLCM, a framework extending latent consistency models to video generation for efficient, high-quality synthesis with minimal steps. |
Addresses the high computational cost and numerous sampling steps required by diffusion models for video generation, hindering real-time applications. |
Leverages pretrained latent video diffusion models and consistency distillation to train a video latent consistency model. Employs DDIM as the ODE solver and incorporates classifier-free guidance during distillation. |
Achieves high-fidelity video synthesis with only 4-6 sampling steps, significantly reducing computational cost compared to ~50 steps in previous methods.
Demonstrates effectiveness for both text-to-video generation and compositional video synthesis tasks (e.g., depth-to-video, sketch-to-video).
Exhibits improved time efficiency, particularly for high-resolution videos, compared to baseline diffusion models. |
Relies on a strong teacher diffusion model for distillation, potentially limiting performance when training data is unavailable or from different domains.
While significantly reducing steps, real-time video generation remains unachieved, motivating further research on faster algorithms without sacrificing quality. |
video generation, consistency model, diffusion model, text-to-video, compositional video synthesis |
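Illustrative sketch for the VideoLCM entry above: the summary names two standard ingredients, classifier-free guidance and a DDIM solver step. The snippet below shows the textbook form of both on plain Python lists. It is not the authors' distillation loop; the function names and the use of cumulative `alpha_bar` values as arguments are assumptions made for illustration.

```python
import math


def cfg_noise(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: push the unconditional noise estimate
    toward the conditional one by a guidance scale."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0).

    First recover the predicted clean sample x0 from x_t and the noise
    estimate, then re-noise it to the previous (less noisy) level.
    """
    sqrt_a_t = math.sqrt(alpha_bar_t)
    sqrt_one_minus_a_t = math.sqrt(1.0 - alpha_bar_t)
    x0_pred = [(x - sqrt_one_minus_a_t * e) / sqrt_a_t for x, e in zip(x_t, eps)]
    sqrt_a_prev = math.sqrt(alpha_bar_prev)
    sqrt_one_minus_a_prev = math.sqrt(1.0 - alpha_bar_prev)
    return [sqrt_a_prev * x0 + sqrt_one_minus_a_prev * e
            for x0, e in zip(x0_pred, eps)]


# Sanity example: noise a known x0, then step back with the true noise.
x0 = [0.5, -1.0]
true_eps = [0.2, 0.3]
a_t = 0.25
x_t = [math.sqrt(a_t) * a + math.sqrt(1.0 - a_t) * e for a, e in zip(x0, true_eps)]
recovered = ddim_step(x_t, true_eps, a_t, 1.0)  # alpha_bar_prev = 1 means no noise
```

With a perfect noise estimate, a single step to the noise-free level returns the original sample, which is the intuition behind collapsing many solver steps into a few.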
2312.09069
Report |
PI3D: Efficient Text-to-3D Generation with Pseudo-Image Diffusion |
Ying-Tian Liu, Yuan-Chen Guo, Guan Luo, Heyi Sun, Wei Yin, Song-Hai Zhang |
Diffusion models trained on large-scale text-image datasets have demonstrated
a strong capability of controllable high-quality image generation from
arbitrary text prompts. However, the generation quality and generalization
ability of 3D diffusion models is hindered by the scarcity of high-quality and
large-scale 3D datasets. In this paper, we present PI3D, a framework that fully
leverages the pre-trained text-to-image diffusion models' ability to generate
high-quality 3D shapes from text prompts in minutes. The core idea is to
connect the 2D and 3D domains by representing a 3D shape as a set of Pseudo RGB
Images. We fine-tune an existing text-to-image diffusion model to produce such
pseudo-images using a small number of text-3D pairs. Surprisingly, we find that
it can already generate meaningful and consistent 3D shapes given complex text
descriptions. We further take the generated shapes as the starting point for a
lightweight iterative refinement using score distillation sampling to achieve
high-quality generation under a low budget. PI3D generates a single 3D shape
from text in only 3 minutes and the quality is validated to outperform existing
3D generative models by a large margin. |
Presents PI3D, a framework that leverages pre-trained text-to-image diffusion models to generate high-quality 3D shapes from text prompts. |
Addresses the challenge of limited availability of high-quality, large-scale 3D datasets, which hinders the development of robust 3D generative models. |
Represents 3D shapes as sets of pseudo RGB images derived from triplane representations. Fine-tunes a pre-trained text-to-image diffusion model on these pseudo-images and real images for enhanced generalization. Utilizes score distillation sampling for lightweight refinement of generated 3D shapes. |
Generates 3D shapes from text prompts within minutes.
Exhibits superior visual quality, 3D consistency, and generation speed compared to existing text-to-3D methods.
Demonstrates improved generalization ability by leveraging knowledge from both 2D and 3D data. |
Triplane fitting during training incurs linear cost with dataset size.
Representational capacity limited by feature dimensions, posing challenges for detailed 3D generation. |
text-to-3d, diffusion models, triplane representation, score distillation sampling, 3d generative models |
2312.08892
Report |
VaLID: Variable-Length Input Diffusion for Novel View Synthesis |
Shijie Li, Farhad G. Zanjani, Haitam Ben Yahia, Yuki M. Asano, Juergen Gall, Amirhossein Habibian |
Novel View Synthesis (NVS), which tries to produce a realistic image at the
target view given source view images and their corresponding poses, is a
fundamental problem in 3D Vision. As this task is heavily under-constrained,
some recent work, like Zero123, tries to solve this problem with generative
modeling, specifically using pre-trained diffusion models. Although this
strategy generalizes well to new scenes, compared to neural radiance
field-based methods, it offers low levels of flexibility. For example, it can
only accept a single-view image as input, despite realistic applications often
offering multiple input images. This is because the source-view images and
corresponding poses are processed separately and injected into the model at
different stages. Thus it is not trivial to generalize the model into
multi-view source images, once they are available. To solve this issue, we try
to process each pose image pair separately and then fuse them as a unified
visual representation which will be injected into the model to guide image
synthesis at the target-views. However, inconsistency and computation costs
increase as the number of input source-view images increases. To solve these
issues, the Multi-view Cross Former module is proposed which maps
variable-length input data to fixed-size output data. A two-stage training
strategy is introduced to further improve the efficiency during training time.
Qualitative and quantitative evaluation over multiple datasets demonstrates the
effectiveness of the proposed method against previous approaches. The code will
be released upon acceptance. |
VaLID, a diffusion-based novel view synthesis model, is proposed, enabling variable-length multi-view image fusion in both training and inference. |
Existing diffusion-based NVS methods are limited to single-view input, hindering flexibility in real-world applications where multiple source images are often available. |
VaLID employs an appearance-pose-entanglement conditioning strategy with a Multi-view Cross Former module to process variable-length input tokens into a fixed-size representation for efficient and consistent novel view generation. A two-stage training strategy enhances efficiency. |
VaLID surpasses previous state-of-the-art methods quantitatively and qualitatively on GSO and RTMV datasets, even with single-view input.
Performance improves with increasing input source images, showcasing the model's ability to leverage multi-view information.
The token sampling strategy in training and inference improves efficiency without significant performance loss. |
The fixed number of learnable tokens in Multi-view Cross Former may limit information collection when input tokens are excessive.
Future work could explore incorporating geometric priors or text prompts for enhanced performance. |
novel view synthesis, diffusion models, multi-view fusion, vision transformer, cross attention |
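Illustrative sketch for the VaLID entry above: a fixed set of learnable query tokens attending over a variable number of input tokens always yields a fixed-size output, which is the property the summary attributes to the Multi-view Cross Former. This is a generic single-head cross-attention without learned projections, written under that assumption; it is not the authors' module, and all names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, tokens):
    """Scaled dot-product cross-attention.

    queries: M vectors (the fixed, learnable tokens in a real model).
    tokens:  N vectors (N varies with the number of source views).
    Returns M vectors, independent of N. Keys and values are both the
    raw tokens here to keep the sketch short.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        logits = [scale * sum(qi * ti for qi, ti in zip(q, t)) for t in tokens]
        weights = softmax(logits)
        outputs.append([sum(w * t[d] for w, t in zip(weights, tokens))
                        for d in range(dim)])
    return outputs


queries = [[1.0, 0.0], [0.0, 1.0]]                 # M = 2 query tokens
one_view = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]   # N = 3 input tokens
three_views = one_view * 3                         # N = 9 input tokens
out_small = cross_attend(queries, one_view)
out_large = cross_attend(queries, three_views)
```

Both calls return two vectors of dimension two, so downstream layers see the same shape whether one or several source views are supplied.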
2312.08889
Report |
SEEAvatar: Photorealistic Text-to-3D Avatar Generation with Constrained Geometry and Appearance |
Yuanyou Xu, Zongxin Yang, Yi Yang |
Powered by large-scale text-to-image generation models, text-to-3D avatar
generation has made promising progress. However, most methods fail to produce
photorealistic results, limited by imprecise geometry and low-quality
appearance. Towards more practical avatar generation, we present SEEAvatar, a
method for generating photorealistic 3D avatars from text with SElf-Evolving
constraints for decoupled geometry and appearance. For geometry, we propose to
constrain the optimized avatar in a decent global shape with a template avatar.
The template avatar is initialized with human prior and can be updated by the
optimized avatar periodically as an evolving template, which enables more
flexible shape generation. Besides, the geometry is also constrained by the
static human prior in local parts like face and hands to maintain the delicate
structures. For appearance generation, we use a diffusion model enhanced by
prompt engineering to guide a physically based rendering pipeline to generate
realistic textures. A lightness constraint is applied to the albedo texture
to suppress incorrect lighting effects. Experiments show that our method
outperforms previous methods on both global and local geometry and appearance
quality by a large margin. Since our method can produce high-quality meshes and
textures, such assets can be directly applied in a classic graphics pipeline for
realistic rendering under any lighting condition. Project page at:
https://yoxu515.github.io/SEEAvatar/. |
Presents SEEAvatar, a method for generating photorealistic 3D avatars from text descriptions using self-evolving constraints for decoupled geometry and appearance. |
Existing methods struggle to create photorealistic 3D avatars from text due to limitations in generating precise geometry and high-quality appearance, hindering applications in VR, gaming, and film. |
Leverages DMTet for shape representation, guided by a 2D diffusion model with self-evolving constraints from a template avatar. Employs a physically based rendering pipeline with diffusion model guidance and lightness constraints for realistic texture generation. |
Outperforms previous methods in generating high-quality avatars with accurate global shapes, fine local structures, and detailed textures.
Generates decoupled meshes and textures, enabling integration with classic graphics pipelines for rendering and editing.
Demonstrates flexibility in editing avatar geometry and appearance through text prompts. |
Struggles to represent highly detailed structures like hair strands, loose clothing, and complex accessories.
Appearance generation, while improved, still exhibits limitations in accurate roughness values and occasional lighting artifacts. |
text-to-3d, avatar generation, photorealistic rendering, diffusion models, self-evolving constraints |
2312.08887
Report |
SpeedUpNet: A Plug-and-Play Hyper-Network for Accelerating Text-to-Image Diffusion Models |
Weilong Chai, DanDan Zheng, Jiajiong Cao, Zhiquan Chen, Changbao Wang, Chenguang Ma |
Text-to-image diffusion models (SD) exhibit significant advancements while
requiring extensive computational resources. Though many acceleration methods
have been proposed, they suffer from generation quality degradation or extra
training cost generalizing to new fine-tuned models. To address these
limitations, we propose a novel and universal Stable-Diffusion (SD)
acceleration module called SpeedUpNet (SUN). SUN can be directly plugged into
various fine-tuned SD models without extra training. This technique utilizes
cross-attention layers to learn the relative offsets in the generated image
results between negative and positive prompts, achieving classifier-free
guidance distillation with controllable negative prompts, and introduces a
Multi-Step Consistency (MSC) loss to ensure a harmonious balance between
reducing inference steps and maintaining consistency in the generated output.
Consequently, SUN significantly reduces the number of inference steps to just 4
steps and eliminates the need for classifier-free guidance. It leads to an
overall speedup of more than 10 times for SD models compared to the
state-of-the-art 25-step DPM-solver++, and offers two extra advantages: (1)
classifier-free guidance distillation with controllable negative prompts and
(2) seamless integration into various fine-tuned Stable-Diffusion models
without training. The effectiveness of the SUN has been verified through
extensive experimentation. Project Page:
https://williechai.github.io/speedup-plugin-for-stable-diffusions.github.io |
The paper proposes SpeedUpNet (SUN), a universal Stable-Diffusion acceleration module that reduces inference steps to 4 while maintaining quality and diversity in generated images. |
Existing diffusion model acceleration techniques often degrade generation quality or require retraining for new models, limiting their practical use. SUN aims to address these limitations. |
SUN utilizes a teacher-student distillation framework with a SUN adapter containing cross-attention modules. It learns relative offsets between negative and positive prompts and introduces a Multi-Step Consistency (MSC) loss to maintain output consistency. |
SUN achieves a speedup of more than 10x over the 25-step DPM-solver++.
SUN seamlessly integrates into various fine-tuned SD models without retraining.
SUN demonstrates controllable classifier-free guidance distillation by learning from various negative prompts. |
The paper primarily focuses on Stable-Diffusion models and may require adaptation for other architectures.
Further research could explore extending SUN's efficiency and applicability to a wider range of generative tasks. |
diffusion models, text-to-image generation, model acceleration, classifier-free guidance, knowledge distillation |
2312.08885
Report |
SceneWiz3D: Towards Text-guided 3D Scene Composition |
Qihang Zhang, Chaoyang Wang, Aliaksandr Siarohin, Peiye Zhuang, Yinghao Xu, Ceyuan Yang, Dahua Lin, Bolei Zhou, Sergey Tulyakov, Hsin-Ying Lee |
We are witnessing significant breakthroughs in the technology for generating
3D objects from text. Existing approaches either leverage large text-to-image
models to optimize a 3D representation or train 3D generators on object-centric
datasets. Generating entire scenes, however, remains very challenging as a
scene contains multiple 3D objects, diverse and scattered. In this work, we
introduce SceneWiz3D, a novel approach to synthesize high-fidelity 3D scenes
from text. We marry the locality of objects with globality of scenes by
introducing a hybrid 3D representation: explicit for objects and implicit for
scenes. Remarkably, an object, being represented explicitly, can be either
generated from text using conventional text-to-3D approaches, or provided by
users. To configure the layout of the scene and automatically place objects, we
apply the Particle Swarm Optimization technique during the optimization
process. Furthermore, it is difficult for certain parts of the scene (e.g.,
corners, occlusion) to receive multi-view supervision, leading to inferior
geometry. We incorporate an RGBD panorama diffusion model to mitigate it,
resulting in high-quality geometry. Extensive evaluation supports that our
approach achieves superior quality over previous approaches, enabling the
generation of detailed and view-consistent 3D scenes. |
Introduces SceneWiz3D, a novel approach to synthesize high-fidelity 3D scenes from text using a hybrid explicit-implicit representation and Particle Swarm Optimization. |
Generating entire 3D scenes from text is crucial for immersive experiences but challenging due to the complexity of multiple objects and layouts. |
Combines DMTets (explicit) for objects of interest and NeRF (implicit) for the environment; uses PSO to optimize object placement based on CLIP similarity; leverages RGBD panorama diffusion for enhanced geometry. |
Achieves state-of-the-art performance in text-to-3D scene generation.
Successfully synthesizes detailed and view-consistent scenes across various styles and layouts.
Effectively mitigates the multi-face (Janus) problem often found in scene generation. |
Shares limitations with SDS-based methods like long optimization time and potential color saturation.
Scene configuration optimization is limited by CLIP's capabilities for fine-grained manipulation. |
text-to-3d, scene synthesis, hybrid representation, particle swarm optimization, rgbd panorama diffusion |
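Illustrative sketch for the SceneWiz3D entry above: Particle Swarm Optimization searches for a configuration that maximizes a score. In the summary the score is a CLIP similarity over object placements; here a toy score (negative squared distance to a hypothetical target position) stands in for it so the example runs without any model. The function name, hyperparameters, and bounds are assumptions, not values from the paper.

```python
import random


def pso_maximize(score, dim, bounds, n_particles=30, n_iters=200, seed=0,
                 inertia=0.729, c_personal=1.49445, c_global=1.49445):
    """Basic global-best PSO that maximizes `score` inside box bounds."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]              # per-particle best position
    best_val = [score(p) for p in pos]          # per-particle best score
    g = max(range(n_particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g][:], best_val[g]  # swarm-wide best

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (g_pos[d] - pos[i][d]))
                # Move and clamp to the allowed box.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = score(pos[i])
            if val > best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val > g_val:
                    g_val, g_pos = val, pos[i][:]
    return g_pos, g_val


TARGET = (1.0, -2.0)  # hypothetical "ideal" object placement on the floor plane


def toy_score(p):
    """Stand-in for a similarity score: higher is better, peak at TARGET."""
    return -sum((a - b) ** 2 for a, b in zip(p, TARGET))


placement, value = pso_maximize(toy_score, dim=2, bounds=(-5.0, 5.0))
```

PSO needs only score evaluations, not gradients, which fits a setting where the objective comes from rendering a scene and scoring it with an external model.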
2312.08883
Report |
EditGuard: Versatile Image Watermarking for Tamper Localization and Copyright Protection |
Xuanyu Zhang, Runyi Li, Jiwen Yu, Youmin Xu, Weiqi Li, Jian Zhang |
In the era where AI-generated content (AIGC) models can produce stunning and
lifelike images, the lingering shadow of unauthorized reproductions and
malicious tampering poses imminent threats to copyright integrity and
information security. Current image watermarking methods, while widely accepted
for safeguarding visual content, can only protect copyright and ensure
traceability. They fall short in localizing increasingly realistic image
tampering, potentially leading to trust crises, privacy violations, and legal
disputes. To solve this challenge, we propose an innovative proactive forensics
framework EditGuard, to unify copyright protection and tamper-agnostic
localization, especially for AIGC-based editing methods. It can offer a
meticulous embedding of imperceptible watermarks and precise decoding of
tampered areas and copyright information. Leveraging our observed fragility and
locality of image-into-image steganography, the realization of EditGuard can be
converted into a united image-bit steganography issue, thus completely
decoupling the training process from the tampering types. Extensive experiments
demonstrate that our EditGuard balances the tamper localization accuracy,
copyright recovery precision, and generalizability to various AIGC-based
tampering methods, especially for image forgery that is difficult for the naked
eye to detect. The project page is available at
https://xuanyuzhang21.github.io/project/editguard/. |
EditGuard, a proactive forensics framework, is presented to unify copyright protection and tamper-agnostic localization, particularly for AIGC-based editing methods. |
The rise of AIGC models necessitates robust tools to combat unauthorized reproduction and malicious tampering, protecting copyright integrity and information security. |
EditGuard embeds dual invisible watermarks (localization and copyright) into images. Leveraging the fragility and locality of I2I steganography, tamper localization is converted into a united image-bit steganography issue, decoupling training from specific tampering types. |
EditGuard achieves over 95% localization precision and nearly 100% copyright accuracy, outperforming state-of-the-art methods.
It demonstrates superior generalizability to various AIGC-based tampering methods, including those producing visually imperceptible forgeries.
The framework shows robustness to common image degradations like noise and compression. |
Future work includes optimizing localization watermark selection and extending EditGuard to other modalities like video and 3D scenes.
Exploring end-to-end optimization for learning optimal localization watermarks and reducing information capacity for enhanced robustness are key areas of interest. |
proactive forensics, tamper localization, copyright protection, image watermarking, ai-generated content (aigc) |
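Illustrative sketch for the EditGuard entry above: if a fragile, spatially local watermark is embedded and later decoded, regions where the decoded watermark no longer matches the known reference can be flagged as tampered. The snippet shows only that final comparison on small grayscale grids. The function name and the threshold value are assumptions; the embedding and decoding networks are not modeled.

```python
def tamper_mask(reference, recovered, threshold=0.2):
    """Return a binary mask marking pixels where the recovered
    localization watermark deviates from the reference watermark.

    reference, recovered: 2D lists of floats in [0, 1] with equal shape.
    threshold: hypothetical tolerance for benign decoding error.
    """
    return [[1 if abs(a - b) > threshold else 0
             for a, b in zip(ref_row, rec_row)]
            for ref_row, rec_row in zip(reference, recovered)]


# Reference watermark: a flat mid-gray 4x4 image.
reference = [[0.5] * 4 for _ in range(4)]

# Recovered watermark: small decoding noise everywhere, but a 2x2 block
# in the lower-right corner was destroyed by an edit.
recovered = [[0.52, 0.48, 0.50, 0.51],
             [0.49, 0.50, 0.53, 0.47],
             [0.50, 0.51, 0.95, 0.05],
             [0.48, 0.52, 0.10, 0.90]]

mask = tamper_mask(reference, recovered)
```

Because the comparison depends only on the watermark and not on how the image was edited, the same check applies regardless of the tampering method.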
2312.08882
Report |
Neural Video Fields Editing |
Shuzhou Yang, Chong Mou, Jiwen Yu, Yuhan Wang, Xiandong Meng, Jian Zhang |
Diffusion models have revolutionized text-driven video editing. However,
applying these methods to real-world editing encounters two significant
challenges: (1) the rapid increase in GPU memory demand as the number of frames
grows, and (2) the inter-frame inconsistency in edited videos. To this end, we
propose NVEdit, a novel text-driven video editing framework designed to
mitigate memory overhead and improve consistent editing for real-world long
videos. Specifically, we construct a neural video field, powered by tri-plane
and sparse grid, to enable encoding long videos with hundreds of frames in a
memory-efficient manner. Next, we update the video field through off-the-shelf
Text-to-Image (T2I) models to impart text-driven editing effects. A progressive
optimization strategy is developed to preserve original temporal priors.
Importantly, both the neural video field and T2I model are adaptable and
replaceable, thus inspiring future research. Experiments demonstrate the
ability of our approach to edit hundreds of frames with impressive inter-frame
consistency. Our project is available at: https://nvedit.github.io/. |
Presents NVEdit, a memory-efficient text-driven video editing framework that combines a neural video field with off-the-shelf text-to-image (T2I) models such as Instruct-Pix2Pix+ (IP2P+, an enhanced version of Instruct-Pix2Pix). |
Addresses challenges in existing text-driven video editing methods related to GPU memory constraints and inter-frame inconsistency, particularly for long videos. |
Employs a two-stage process: 1) **Video Fitting:** Constructs a Neural Video Field (NVF) to capture temporal and content priors of a video efficiently. 2) **Field Editing:** Updates the NVF using a T2I model (primarily IP2P+) to impart text-driven edits while preserving temporal consistency through progressive optimization. |
Achieves state-of-the-art temporal consistency and editing accuracy compared to existing methods.
Demonstrates efficient memory usage, enabling editing of long videos with hundreds of frames.
Showcases versatility by supporting various editing tasks like shape modification, scene changes, style transfer, and frame interpolation. |
Temporal priors might still be affected during the field editing stage.
Editing long videos can be time-consuming due to the increased number of frames requiring iterative optimization. |
video editing, neural video field, text-to-image, instruct-pix2pix, temporal consistency |
2312.08880
Report |
GenDet: Towards Good Generalizations for AI-Generated Image Detection |
Mingjian Zhu, Hanting Chen, Mouxiao Huang, Wei Li, Hailin Hu, Jie Hu, Yunhe Wang |
The misuse of AI imagery can have harmful societal effects, prompting the
creation of detectors to combat issues like the spread of fake news. Existing
methods can effectively detect images generated by seen generators, but it is
challenging to detect those generated by unseen generators. They do not
concentrate on amplifying the output discrepancy when detectors process real
versus fake images. This results in a close output distribution of real and
fake samples, increasing classification difficulty in detecting unseen
generators. This paper addresses the unseen-generator detection problem by
considering this task from the perspective of anomaly detection and proposes an
adversarial teacher-student discrepancy-aware framework. Our method encourages
smaller output discrepancies between the student and the teacher models for
real images while aiming for larger discrepancies for fake images. We employ
adversarial learning to train a feature augmenter, which promotes smaller
discrepancies between teacher and student networks when the inputs are fake
images. Our method has achieved state-of-the-art on public benchmarks, and the
visualization results show that a large output discrepancy is maintained when
faced with various types of generators. |
This paper proposes GenDet, an adversarial teacher-student discrepancy-aware framework for detecting AI-generated images, particularly those from unseen generators. |
Detecting AI-generated images is crucial to combat misinformation and harmful societal effects, especially as these images become increasingly realistic. |
GenDet uses a teacher-student framework with discrepancy-aware learning to differentiate real and fake images. It also employs a feature augmenter trained via adversarial learning to enhance generalization to unseen generators. |
GenDet outperforms state-of-the-art methods on UniversalFakeDetect and GenImage datasets, showing significant improvements in average accuracy and mAP.
The framework effectively handles degraded image classification tasks like low resolution and compression.
Cross-dataset evaluation demonstrates GenDet's strong generalization ability even with large domain gaps. |
The method relies on a pre-trained feature extractor (CLIP), potentially limiting its applicability to domains not well-represented in CLIP's training data.
Further research can explore alternative feature augmentation techniques and architectures to enhance robustness. |
ai-generated image detection, anomaly detection, teacher-student learning, adversarial learning, generalization |
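The discrepancy-aware objective in the GenDet report above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the mean-squared discrepancy, the hinge with `margin`, the `threshold` decision rule, and all function names are assumptions chosen for clarity, and the adversarially trained feature augmenter is left out.

```python
def discrepancy(teacher_out, student_out):
    """Mean squared difference between teacher and student outputs."""
    return sum((t - s) ** 2 for t, s in zip(teacher_out, student_out)) / len(teacher_out)


def discrepancy_loss(teacher_out, student_out, is_real, margin=1.0):
    """Hypothetical loss: small discrepancy for real inputs, large for fake ones.

    Real images: minimise the discrepancy directly.
    Fake images: hinge penalty until the discrepancy exceeds `margin`
    (the margin form is an assumption, not taken from the paper).
    """
    d = discrepancy(teacher_out, student_out)
    return d if is_real else max(0.0, margin - d)


def looks_fake(teacher_out, student_out, threshold=0.5):
    """Anomaly-style decision: flag inputs whose discrepancy is large."""
    return discrepancy(teacher_out, student_out) > threshold
```

Because the decision depends only on how far the student drifts from the teacher, an image from an unseen generator can still be flagged as long as it produces a large discrepancy.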
2312.08874
Report |
Agent Attention: On the Integration of Softmax and Linear Attention |
Dongchen Han, Tianzhu Ye, Yizeng Han, Zhuofan Xia, Shiji Song, Gao Huang |
The attention module is the key component in Transformers. While the global
attention mechanism offers high expressiveness, its excessive computational
cost restricts its applicability in various scenarios. In this paper, we
propose a novel attention paradigm, Agent Attention, to strike a favorable
balance between computational efficiency and representation power.
Specifically, the Agent Attention, denoted as a quadruple $(Q, A, K, V)$,
introduces an additional set of agent tokens $A$ into the conventional
attention module. The agent tokens first act as the agent for the query tokens
$Q$ to aggregate information from $K$ and $V$, and then broadcast the
information back to $Q$. Given the number of agent tokens can be designed to be
much smaller than the number of query tokens, the agent attention is
significantly more efficient than the widely adopted Softmax attention, while
preserving global context modelling capability. Interestingly, we show that the
proposed agent attention is equivalent to a generalized form of linear
attention. Therefore, agent attention seamlessly integrates the powerful
Softmax attention and the highly efficient linear attention. Extensive
experiments demonstrate the effectiveness of agent attention with various
vision Transformers and across diverse vision tasks, including image
classification, object detection, semantic segmentation and image generation.
Notably, agent attention has shown remarkable performance in high-resolution
scenarios, owing to its linear attention nature. For instance, when applied to
Stable Diffusion, our agent attention accelerates generation and substantially
enhances image generation quality without any additional training. Code is
available at https://github.com/LeapLabTHU/Agent-Attention. |
This paper proposes Agent Attention, a novel attention mechanism for Vision Transformers that balances computational efficiency and representation power by introducing agent tokens to aggregate and broadcast global information. |
The widely used Softmax attention in Transformers incurs high computational cost, limiting its applicability in vision tasks, while existing efficient attention mechanisms often compromise long-range modeling capabilities. |
Agent Attention introduces a set of agent tokens to the conventional attention triplet (Q, K, V), forming a quadruplet (Q, A, K, V). Agent tokens first aggregate information from keys and values and then broadcast it to query tokens, effectively integrating Softmax and linear attention. |
Agent Attention significantly improves performance across various vision tasks, including image classification, object detection, and semantic segmentation.
The method excels in high-resolution scenarios, demonstrating the advantage of its global receptive field.
Applied to Stable Diffusion, Agent Attention accelerates generation and enhances image quality without requiring additional training. |
The paper primarily explores pooling for obtaining agent tokens, leaving room for investigating more advanced techniques.
Future work includes exploring the application of Agent Attention to video modeling and multi-modal foundation models. |
vision transformer, attention mechanism, agent attention, linear attention, image generation |
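The two-step aggregate-then-broadcast scheme in the Agent Attention report above can be sketched with plain Python lists. This is a simplified illustration under stated assumptions: agent tokens are obtained by average-pooling contiguous groups of queries (the report mentions pooling, but the grouping here is hypothetical), and extras in the released code such as bias terms are omitted. Helper names are illustrative.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def attention(q, k, v):
    """Standard scaled dot-product Softmax attention."""
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)


def mean_pool(tokens, n_agents):
    """Assumed pooling: average contiguous groups (needs n_agents <= len(tokens);
    leftover tokens beyond the last full group are ignored)."""
    size = len(tokens) // n_agents
    groups = [tokens[i * size:(i + 1) * size] for i in range(n_agents)]
    return [[sum(col) / len(g) for col in zip(*g)] for g in groups]


def agent_attention(q, k, v, n_agents):
    a = mean_pool(q, n_agents)            # agent tokens A, shape (n, d)
    agent_values = attention(a, k, v)     # step 1: agents aggregate from K and V
    return attention(q, a, agent_values)  # step 2: broadcast back to queries
```

With N queries, M keys, and n agents, the two score matrices have sizes n x M and N x n instead of a single N x M matrix, which is where the savings come from when n is much smaller than N.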
2312.08873
Report |
Diffusion Cocktail: Fused Generation from Diffusion Models |
Haoming Liu, Yuanhe Guo, Shengjie Wang, Hongyi Wen |
Diffusion models excel at generating high-quality images and are easy to
extend, making them extremely popular among active users who have created an
extensive collection of diffusion models with various styles by fine-tuning
base models such as Stable Diffusion. Recent work has focused on uncovering
semantic and visual information encoded in various components of a diffusion
model, enabling better generation quality and more fine-grained control.
However, those methods target improving a single model and overlook the vastly
available collection of fine-tuned diffusion models. In this work, we study the
combinations of diffusion models. We propose Diffusion Cocktail (Ditail), a
training-free method that can accurately transfer content information between
two diffusion models. This allows us to perform diverse generations using a set
of diffusion models, resulting in novel images that are unlikely to be obtained
by a single model alone. We also explore utilizing Ditail for style transfer,
with the target style set by a diffusion model instead of an image. Ditail
offers a more detailed manipulation of the diffusion generation, thereby
enabling the vast community to integrate various styles and contents seamlessly
and generate any content of any style. |
Presents Diffusion Cocktail (Ditail), a training-free method for transferring content information between two diffusion models (DMs) for novel image generation and style transfer. |
Addresses the limitation of existing methods that focus on improving single DMs and overlook the vast collection of fine-tuned DMs, enabling diverse image generation by leveraging existing DM resources. |
Injects latent representations from a source DM into specific layers of a target DM during the diffusion process, enabling style transfer with a target style defined by a DM. |
Achieves high-quality style transfer between DMs, generating novel images by combining content and style information from different models.
Enables style transfer of real images with the target style defined by a DM, allowing users to leverage diverse styles from fine-tuned DMs.
Offers fine-grained control over the generation process through parameters like guidance strength and regional injection masks. |
The effect of the negative prompt guidance parameter (beta) is case-dependent and may not always yield significant results.
Editing prompts that change the number of objects often lead to unsatisfactory results due to strong structure emphasis. |
diffusion models, style transfer, image generation, content injection, deep learning |
2312.08872
Report |
The Lottery Ticket Hypothesis in Denoising: Towards Semantic-Driven Initialization |
Jiafeng Mao, Xueting Wang, Kiyoharu Aizawa |
Text-to-image diffusion models allow users control over the content of
generated images. Still, text-to-image generation occasionally leads to
generation failure requiring users to generate dozens of images under the same
text prompt before they obtain a satisfying result. We formulate the lottery
ticket hypothesis in denoising: randomly initialized Gaussian noise images
contain special pixel blocks (winning tickets) that naturally tend to be
denoised into specific content independently. The generation failure in
standard text-to-image synthesis is caused by the gap between optimal and
actual spatial distribution of winning tickets in initial noisy images. To this
end, we implement semantic-driven initial image construction creating initial
noise from known winning tickets for each concept mentioned in the prompt. We
conduct a series of experiments that verify the properties of winning tickets
and demonstrate their generalizability across images and prompts. Our results
show that aggregating winning tickets into the initial noise image effectively
induces the model to generate the specified object at the corresponding
location. |
This paper discovers and verifies the "Lottery Ticket Hypothesis in Denoising," revealing that specific pixel blocks within the initial noise images of diffusion models have inherent predispositions towards generating specific content, and introduces a semantic-driven initial image construction method leveraging these "winning tickets." |
This discovery provides new insights into the denoising process in text-to-image diffusion models and offers a novel approach to enhance control over generated imagery, addressing the limitation of existing methods that primarily focus on refining the generation process rather than manipulating the initial noise. |
The authors leverage the cross-attention mechanism in diffusion models to identify "winning tickets" - pixel blocks exhibiting high cross-attention values for specific concepts. They construct a collection of these winning tickets and utilize them to create semantically-driven initial images, guiding the model to generate specific content at desired locations. |
Diffusion models demonstrate tolerance for non-Gaussian initial images constructed using winning tickets, producing high-quality images with effective content control.
Winning tickets exhibit versatility across different prompts and images, maintaining their generative tendencies even when transferred between them.
Combining semantic-driven initialization with existing layout-to-image synthesis methods significantly enhances their control effectiveness. |
The winning ticket selection method based solely on category names may lead to unintended generation biases (e.g., color).
The constructed initial images may deviate significantly from the normal distribution, potentially compromising the quality of generated images, especially when controlling large-sized objects. |
diffusion models, text-to-image generation, lottery ticket hypothesis, semantic-driven initialization, cross-attention mechanism |
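The semantic-driven initialization in the lottery-ticket report above amounts to pasting known winning-ticket blocks into an otherwise random noise image. A minimal sketch, assuming a single-channel 2D grid, a hypothetical `ticket_bank` mapping each concept to a stored block, and hypothetical `placements` giving the top-left corner for each concept (real latents have several channels, and the paper selects tickets via cross-attention, which is not shown):

```python
import random


def build_initial_noise(height, width, placements, ticket_bank, seed=0):
    """Return a noise grid with winning-ticket blocks pasted at chosen spots.

    placements: {concept: (top, left)} target corner for each concept.
    ticket_bank: {concept: 2D list} block previously found to denoise
                 into that concept (how it was found is outside this sketch).
    """
    rng = random.Random(seed)
    noise = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]
    for concept, (top, left) in placements.items():
        for r, row in enumerate(ticket_bank[concept]):
            for c, value in enumerate(row):
                noise[top + r][left + c] = value  # overwrite with ticket values
    return noise
```

The pasted result is no longer strictly Gaussian, which matches the limitation noted in the report: large pasted regions can push the initial image away from the distribution the model expects.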
2312.08870
Report |
Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens |
Fan Ma, Xiaojie Jin, Heng Wang, Yuchen Xian, Jiashi Feng, Yi Yang |
Recent advances in large video-language models have displayed promising
outcomes in video comprehension. Current approaches straightforwardly convert
video into language tokens and employ large language models for multi-modal
tasks. However, this method often leads to the generation of irrelevant
content, commonly known as "hallucination", as the length of the text increases
and the impact of the video diminishes. To address this problem, we propose
Vista-LLaMA, a novel framework that maintains a consistent distance between
all visual tokens and any language tokens, irrespective of the generated text
length. Vista-LLaMA omits relative position encoding when determining attention
weights between visual and text tokens, retaining the position encoding for
text and text tokens. This amplifies the effect of visual tokens on text
generation, especially when the relative distance is longer between visual and
text tokens. The proposed attention mechanism significantly reduces the chance
of producing irrelevant text related to the video content. Furthermore, we
present a sequential visual projector that projects the current video frame
into tokens of language space with the assistance of the previous frame. This
approach not only captures the temporal relationship within the video, but also
allows fewer visual tokens to encompass the entire video. Our approach
significantly outperforms various previous methods (e.g., Video-ChatGPT,
MovieChat) on four challenging open-ended video question answering benchmarks.
We reach an accuracy of 60.7 on the zero-shot NExT-QA and 60.5 on the zero-shot
MSRVTT-QA, setting a new state-of-the-art performance. This project is
available at https://jinxxian.github.io/Vista-LLaMA. |
This paper introduces Vista-LLaMA, a novel video-language model that enhances video understanding and temporal modeling within large language models (LLMs) for reliable video narration. |
Existing methods for video comprehension often generate irrelevant content ('hallucination') as text length increases due to diminishing visual impact and lack of explicit temporal modeling. |
Vista-LLaMA leverages two key innovations: 1) Equal Distance to Visual Tokens (EDVT) attention to maintain consistent influence of visual information on text generation, and 2) a sequential visual projector to encode temporal relationships between video frames. |
Vista-LLaMA outperforms previous state-of-the-art methods on four challenging open-ended video question answering benchmarks, including achieving new state-of-the-art performance on zero-shot NExT-QA and MSRVTT-QA.
EDVT attention significantly improves accuracy across various question types, demonstrating its ability to enhance multi-modal understanding in LLMs.
The sequential visual projector effectively encodes temporal context, leading to improved performance compared to other visual projectors. |
The evaluation relies on GPT-3.5 for assessment, which may introduce inaccuracies compared to the more expensive GPT-4.
The study focuses on fine-tuning rather than pre-training, potentially limiting the full exploration of EDVT-Attention's capabilities for video comprehension and other multi-modal tasks. |
video understanding, large language models, video question answering, multi-modal learning, hallucination reduction |
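The equal-distance idea in the Vista-LLaMA report above can be illustrated with a single pre-softmax attention score. The sketch assumes the relative position encoding is a rotary embedding (as in LLaMA-style models); the function names and the `key_is_visual` flag are hypothetical, and batching, heads, and masking are omitted.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def edvt_score(q, q_pos, k, k_pos, key_is_visual):
    """Pre-softmax attention score for a text query token.

    Text keys: both sides are rotated, so the score depends on the
    relative distance q_pos - k_pos.
    Visual keys: no rotation, so the score ignores position entirely.
    """
    if key_is_visual:
        raw = dot(q, k)
    else:
        raw = dot(rope(q, q_pos), rope(k, k_pos))
    return raw / math.sqrt(len(q))
```

A text token at position 5 and one at position 500 therefore see a given visual token with the same score, so the visual signal does not fade as the generated text grows.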
2312.08825
Report |
Guided Diffusion from Self-Supervised Diffusion Features |
Vincent Tao Hu, Yunlu Chen, Mathilde Caron, Yuki M. Asano, Cees G. M. Snoek, Bjorn Ommer |
Guidance serves as a key concept in diffusion models, yet its effectiveness
is often limited by the need for extra data annotation or classifier
pretraining. That is why guidance was harnessed from self-supervised learning
backbones, like DINO. However, recent studies have revealed that the feature
representation derived from the diffusion model itself is discriminative for
numerous downstream tasks as well, which prompts us to propose a framework to
extract guidance from, and specifically for, diffusion models. Our research has
yielded several significant contributions. Firstly, the guidance signals from
diffusion models are on par with those from class-conditioned diffusion models.
Secondly, feature regularization, when based on the Sinkhorn-Knopp algorithm,
can further enhance feature discriminability in comparison to unconditional
diffusion models. Thirdly, we have constructed an online training approach that
can concurrently derive guidance from diffusion models for diffusion models.
Lastly, we have extended the application of diffusion models along the constant
velocity path of ODE to achieve a more favorable balance between sampling steps
and fidelity. The performance of our methods has been outstanding,
outperforming related baseline comparisons in large-resolution datasets, such
as ImageNet256, ImageNet256-100 and LSUN-Churches. Our code will be released. |
This paper proposes a novel framework to extract guidance signals directly from diffusion models themselves, eliminating the need for external data annotations or self-supervised learning backbones. |
Current methods for guiding diffusion models towards higher fidelity outputs rely heavily on labor-intensive data annotation or the use of external pretrained models, which limits their practical applicability. |
The authors introduce two approaches: 1) offline guidance extracts guidance from pretrained diffusion model features using k-means clustering, 2) online guidance utilizes an online optimal-transport-based algorithm (Sinkhorn-Knopp) to jointly learn the diffusion model and the guidance signals. |
Guidance signals derived directly from diffusion models are on par with those from class-conditioned diffusion models.
Online Sinkhorn-Knopp clustering significantly enhances feature discriminability compared to unconditional diffusion models.
The proposed framework achieves a favorable balance between sampling speed and fidelity by leveraging the constant velocity path of ODEs. |
A performance gap still exists compared to class-conditioned diffusion models, which use extensive data annotations.
Exploring other backbone architectures like DiT for potential improvements in future work. |
diffusion models, self-guidance, image generation, sinkhorn-knopp algorithm, optimal transport |
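The Sinkhorn-Knopp step named in the report above can be sketched as the generic balanced-assignment routine used in clustering-based self-supervised learning. This is an illustration of the algorithm in general, not the paper's exact variant; `eps`, `n_iters`, and the function name are assumptions.

```python
import math


def sinkhorn_knopp(scores, eps=0.05, n_iters=50):
    """Turn an (n samples x k clusters) score matrix into soft assignments
    whose rows sum to 1 and whose clusters receive roughly equal mass."""
    n, k = len(scores), len(scores[0])
    top = max(max(row) for row in scores)  # subtract the max for stability
    q = [[math.exp((s - top) / eps) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: every cluster gets total mass 1/k.
        col_sums = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / (col_sums[j] * k) for j in range(k)] for i in range(n)]
        # Row step: every sample gets total mass 1/n.
        row_sums = [sum(row) for row in q]
        q = [[x / (rs * n) for x in row] for row, rs in zip(q, row_sums)]
    return [[x * n for x in row] for row in q]  # rescale so rows sum to 1
```

The equal-mass constraint on clusters is what prevents the trivial solution in which every sample collapses onto a single pseudo-label.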
2312.08768
Report |
Local Conditional Controlling for Text-to-Image Diffusion Models |
Yibo Zhao, Liang Peng, Yang Yang, Zekai Luo, Hengjia Li, Yao Chen, Wei Zhao, Qinglin Lu, Boxi Wu, Wei Liu |
Diffusion models have exhibited impressive prowess in the text-to-image task.
Recent methods add image-level controls, e.g., edge and depth maps, to
manipulate the generation process together with text prompts to obtain desired
images. This controlling process is globally operated on the entire image,
which limits the flexibility of control regions. In this paper, we introduce a
new simple yet practical task setting: local control. It focuses on controlling
specific local areas according to user-defined image conditions, while the
remaining areas are conditioned only by the original text prompt. This allows the
users to flexibly control the image generation in a fine-grained way. However,
it is non-trivial to achieve this goal. The naive manner of directly adding
local conditions may lead to the local control dominance problem. To mitigate
this problem, we propose a training-free method that leverages the updates of
noised latents and parameters in the cross-attention map during the denoising
process to promote concept generation in non-control areas. Moreover, we use
feature mask constraints to mitigate the degradation of synthesized image
quality caused by information differences inside and outside the local control
area. Extensive experiments demonstrate that our method can synthesize
high-quality images faithful to the prompt under local control conditions. Code is
available at https://github.com/YibooZhao/Local-Control. |
This paper introduces 'local control', a new paradigm for controllable image synthesis using diffusion models, where users can control specific regions of an image using image conditions while the remaining image adheres to a text prompt. |
Existing controllable image generation methods mainly focus on global, image-level control, lacking the flexibility for fine-grained local manipulations desired by users. |
The paper proposes a training-free method that integrates with existing control models like ControlNet. It leverages the cross-attention maps during the denoising process to: 1) identify and regenerate objects ignored due to local control dominance, 2) focus token responses to refine object distinction, and 3) apply feature mask constraints to mitigate image degradation caused by information discrepancies. |
The proposed method successfully synthesizes high-quality images that align with both local image conditions and text prompts.
Extensive experiments on COCO and a custom dataset demonstrate superior performance over existing controllable methods like ControlNet and T2I-Adapter, achieving better FID, CLIP Score, and CLIP T2T similarity.
Ablation studies validate the contribution of each proposed component: object regeneration, focused token response, and feature mask constraint, showcasing their effectiveness in handling local control dominance and improving image quality. |
The method might encounter challenges in maintaining semantic consistency and object scaling between the locally controlled region and the rest of the image.
Future work could explore establishing a more comprehensive connection between these regions to enhance visual coherence. |
image synthesis, diffusion models, controllable generation, local control, cross-attention |
2312.08754
Report |
UniDream: Unifying Diffusion Priors for Relightable Text-to-3D Generation |
Zexiang Liu, Yangguang Li, Youtian Lin, Xin Yu, Sida Peng, Yan-Pei Cao, Xiaojuan Qi, Xiaoshui Huang, Ding Liang, Wanli Ouyang |
Recent advancements in text-to-3D generation technology have significantly
advanced the conversion of textual descriptions into imaginative 3D objects
with sound geometry and fine textures. Despite these developments, a
prevalent limitation arises from the use of RGB data in diffusion or
reconstruction models, which often results in models with inherent lighting and
shadow effects that detract from their realism, thereby limiting their
usability in applications that demand accurate relighting capabilities. To
bridge this gap, we present UniDream, a text-to-3D generation framework that
incorporates unified diffusion priors. Our approach consists of three main
components: (1) a dual-phase training process to get albedo-normal aligned
multi-view diffusion and reconstruction models, (2) a progressive generation
procedure for geometry and albedo-textures based on Score Distillation Sampling
(SDS) using the trained reconstruction and diffusion models, and (3) an
innovative application of SDS for finalizing PBR generation while keeping a
fixed albedo based on Stable Diffusion model. Extensive evaluations demonstrate
that UniDream surpasses existing methods in generating 3D objects with clearer
albedo textures, smoother surfaces, enhanced realism, and superior relighting
capabilities. |
Presents UniDream, a novel text-to-3D generation framework that generates relightable 3D objects from text descriptions by incorporating unified diffusion priors and disentangling illumination from textures. |
Existing text-to-3D methods lack relighting capabilities due to inherent lighting and shadows baked into object textures, limiting realism and usability in applications demanding accurate lighting control. |
UniDream utilizes a three-stage pipeline: 1) An albedo-normal aligned multi-view diffusion model (AN-MVM) generates consistent multi-view images. 2) A transformer-based reconstruction model (TRM) provides a 3D coarse model from albedo images. 3) Score Distillation Sampling (SDS) refines the model, and Stable Diffusion generates PBR materials while keeping albedo fixed. |
Realistic Materials: UniDream accurately generates PBR materials that approximate real-world textures and can be relit in various lighting conditions.
Complete Geometry: UniDream excels at generating comprehensive geometric details, leading to more complete 3D objects.
Stable Generation: UniDream demonstrates greater effectiveness in generating 3D objects due to the 3D prior and normal supervision. |
Limited semantic and material generalization due to training data size.
The rendering pipeline can be upgraded to incorporate path tracing for enhanced realism. |
text-to-3d generation, relightable 3d objects, diffusion models, physically-based rendering, score distillation sampling |
2312.08746
Report |
DreamDrone |
Hanyang Kong, Dongze Lian, Michael Bi Mi, Xinchao Wang |
We introduce DreamDrone, an innovative method for generating unbounded
flythrough scenes from textual prompts. Central to our method is a novel
feature-correspondence-guidance diffusion process, which utilizes the strong
correspondence of intermediate features in the diffusion model. Leveraging this
guidance strategy, we further propose an advanced technique for editing the
intermediate latent code, enabling the generation of subsequent novel views
with geometric consistency. Extensive experiments reveal that DreamDrone
significantly surpasses existing methods, delivering highly authentic scene
generation with exceptional visual quality. This approach marks a significant
step in zero-shot perpetual view generation from textual prompts, enabling the
creation of diverse scenes, including natural landscapes like oases and caves,
as well as complex urban settings such as Lego-style street views. Our code is
publicly available. |
Introduces DreamDrone, a zero-shot, training-free method for generating unbounded flythrough scenes from textual prompts. |
Addresses limitations of existing perpetual view generation methods that struggle with forward camera movement, outdoor scenes, and text-based scene creation. |
Leverages pre-trained text-to-image diffusion models and depth estimation models with a novel feature-correspondence-guidance diffusion process and latent code editing for geometry-consistent novel view generation. |
Significantly outperforms existing methods in terms of visual quality and CLIP score, indicating strong text-scene alignment.
Demonstrates versatility in generating diverse scenes, including natural landscapes, imaginative scenarios, and complex urban settings.
Maintains high fidelity and detail in generated scenes, even over extended sequences of frames, unlike competing approaches. |
Correspondence of high-frequency details between adjacent frames can be further improved.
Reliance on accurate depth estimation can impact performance in scenes with unique styles. |
perpetual view generation, text-to-scene synthesis, diffusion models, zero-shot learning, ai-generated content |
2312.08744
Report |
GOEnFusion: Gradient Origin Encodings for 3D Forward Diffusion Models |
Animesh Karnewar, Andrea Vedaldi, Niloy J. Mitra, David Novotny |
The recently introduced Forward-Diffusion method allows training a 3D
diffusion model using only 2D images for supervision. However, it does not
easily generalise to different 3D representations and requires a
computationally expensive auto-regressive sampling process to generate the
underlying 3D scenes. In this paper, we propose GOEn: Gradient Origin Encoding
(pronounced "gone"). GOEn can encode input images into any type of 3D
representation without the need to use a pre-trained image feature extractor.
It can also handle single, multiple or no source view(s) alike, by design, and
tries to maximise the information transfer from the views to the encodings. Our
proposed GOEnFusion model pairs GOEn encodings with a realisation of the
Forward-Diffusion model which addresses the limitations of the vanilla
Forward-Diffusion realisation. We evaluate how much information the GOEn
mechanism transfers to the encoded representations, and how well it captures
the prior distribution over the underlying 3D scenes, through the lens of a
partial AutoEncoder. Lastly, the efficacy of the GOEnFusion model is evaluated
on the recently proposed OmniObject3D dataset while comparing to the
state-of-the-art Forward and non-Forward-Diffusion models and other 3D
generative models. |
This paper proposes GOEn (Gradient Origin Encoding), a novel encoding mechanism to encode source views into arbitrary 3D representations, and GOEnFusion, an improved realization of the Forward-Diffusion model for 3D generation and reconstruction. |
The paper aims to address the limitations of existing 3D generative models, particularly in handling different 3D representations and requiring computationally expensive sampling processes. |
GOEn encodes information from source views into 3D representations by computing the gradient of the log-likelihood of the observations under a differentiable forward operation. GOEnFusion integrates GOEn with a denoising network, enabling efficient generation and reconstruction of 3D scenes. |
GOEn effectively transfers information from source views to different 3D representations, achieving promising results in partial autoencoding experiments.
GOEnFusion outperforms the vanilla Forward-Diffusion model in 3D generation on the OmniObject3D dataset, demonstrating improved quality and efficiency.
The GOEn mechanism shows strong potential for 3D reconstruction, achieving competitive results in a regression-based setting. |
The application of GOEnFusion to various 3D representations is limited by their compatibility with existing denoising network architectures.
Exploring the use of GOEn in other stochastic inverse problems beyond 3D vision is a potential area for future research. |
3d generation, 3d reconstruction, forward-diffusion models, gradient origin networks, neural radiance fields |
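The encoding step in the GOEn report above, taking the gradient of the observation log-likelihood under a differentiable forward operation, can be sketched on a toy problem. Several things here are assumptions for illustration: the gradient is evaluated at an all-zero (origin) representation, the likelihood is Gaussian, the forward operator is a small fixed linear map standing in for a renderer, and central finite differences stand in for automatic differentiation.

```python
def log_likelihood(z, forward, observation):
    """Gaussian log-likelihood (up to a constant) of `observation` given z."""
    pred = forward(z)
    return -0.5 * sum((p - o) ** 2 for p, o in zip(pred, observation))


def gradient_origin_encoding(forward, observation, dim, h=1e-5):
    """Encode `observation` as the log-likelihood gradient at z = 0."""
    encoding = []
    for i in range(dim):
        plus = [0.0] * dim
        minus = [0.0] * dim
        plus[i], minus[i] = h, -h
        diff = (log_likelihood(plus, forward, observation)
                - log_likelihood(minus, forward, observation))
        encoding.append(diff / (2 * h))
    return encoding


# Toy differentiable forward operator (hypothetical linear "renderer").
A = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]


def render(z):
    return [sum(a * x for a, x in zip(row, z)) for row in A]
```

For a linear operator the result equals the transpose of `A` applied to the observation, so the encoding back-projects the views into the representation without any learned image feature extractor.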
2312.08568
Report |
NViST: In the Wild New View Synthesis from a Single Image with Transformers |
Wonbong Jang, Lourdes Agapito |
We propose NViST, a transformer-based model for efficient and generalizable
novel-view synthesis from a single image for real-world scenes. In contrast to
many methods that are trained on synthetic data, object-centred scenarios, or
in a category-specific manner, NViST is trained on MVImgNet, a large-scale
dataset of casually-captured real-world videos of hundreds of object categories
with diverse backgrounds. NViST transforms image inputs directly into a
radiance field, conditioned on camera parameters via adaptive layer
normalisation. In practice, NViST exploits fine-tuned masked autoencoder (MAE)
features and translates them to 3D output tokens via cross-attention, while
addressing occlusions with self-attention. To move away from object-centred
datasets and enable full scene synthesis, NViST adopts a 6-DOF camera pose
model and only requires relative pose, dropping the need for canonicalization
of the training data, which removes a substantial barrier to it being used on
casually captured datasets. We show results on unseen objects and categories
from MVImgNet and even generalization to casual phone captures. We conduct
qualitative and quantitative evaluations on MVImgNet and ShapeNet to show that
our model represents a step forward towards enabling true in-the-wild
generalizable novel-view synthesis from a single image. Project webpage:
https://wbjang.github.io/nvist_webpage. |
The paper introduces NViST, a transformer-based model for novel view synthesis from single in-the-wild images, trained on the large-scale MVImgNet dataset. |
Generalizing NeRF-based models to real-world scenes is challenging due to scale ambiguities, scene misalignments, and diverse backgrounds. This work aims to address these challenges by leveraging a large-scale, diverse dataset and a novel transformer architecture. |
NViST uses a fine-tuned MAE as an encoder and a novel transformer decoder that maps features to a vector-matrix radiance field representation. It uses cross-attention for feature mapping, self-attention for occlusion reasoning, and adaptive layer normalization for conditioning on camera parameters. Notably, it only requires relative camera poses, allowing it to learn from casually captured datasets. |
NViST demonstrates high-quality novel view synthesis on challenging real-world scenes from MVImgNet.
The model generalizes well to unseen object categories and out-of-distribution phone-captured scenes.
Quantitative comparisons on MVImgNet and ShapeNet-SRN show competitive performance against state-of-the-art methods. |
Limited training resources led to downsampling images and potentially impacted sharpness.
The absence of GAN losses or SDS might have contributed to some loss of detail. |
novel view synthesis, transformers, neural radiance fields, single image 3d reconstruction, real-world scenes |
2312.08563
Report |
Efficient-NeRF2NeRF: Streamlining Text-Driven 3D Editing with Multiview Correspondence-Enhanced Diffusion Models |
Liangchen Song, Liangliang Cao, Jiatao Gu, Yifan Jiang, Junsong Yuan, Hao Tang |
The advancement of text-driven 3D content editing has benefited from
progress in 2D generative diffusion models. However, a major obstacle
hindering the widespread adoption of 3D content editing is its time-intensive
processing. This challenge arises from the iterative and refining steps
required to achieve consistent 3D outputs from 2D image-based generative
models. Recent state-of-the-art methods typically require optimization time
ranging from tens of minutes to several hours to edit a 3D scene using a single
GPU. In this work, we propose that by incorporating correspondence
regularization into diffusion models, the process of 3D editing can be
significantly accelerated. This approach is inspired by the notion that the
samples estimated during diffusion should remain multiview-consistent
throughout the generation process. By leveraging this multiview consistency, we can
edit 3D content at a much faster speed. In most scenarios, our proposed
technique brings a 10$\times$ speed-up compared to the baseline method and
completes the editing of a 3D scene in 2 minutes with comparable quality. |
This paper introduces a novel framework for efficiently editing NeRF models using text-based instructions, achieving a 10x speedup compared to previous methods by leveraging multiview consistency. |
Existing text-driven 3D content editing techniques are computationally expensive and time-consuming, limiting their practical applications. This work addresses this challenge by significantly accelerating the editing process. |
The proposed method regularizes the diffusion denoising process to maintain multiview consistency across generated images. This eliminates the need for iterative dataset updates and enables direct editing of the NeRF representation using a style matching loss. |
The approach achieves a 10x speedup compared to the baseline Instruct-NeRF2NeRF, editing a 3D scene in just 2 minutes.
Experiments demonstrate faster convergence and comparable editing quality to state-of-the-art methods.
The method can be integrated with Instruct-NeRF2NeRF to further enhance editing quality and speed. |
The final editing quality, while comparable, may not surpass existing methods in every scenario.
Future work could explore alternative regularization techniques and loss functions to further improve editing fidelity and generalization. |
nerf, 3d content editing, diffusion models, multiview consistency, text-driven editing |
2312.08372
Report |
SAM-guided Graph Cut for 3D Instance Segmentation |
Haoyu Guo, He Zhu, Sida Peng, Yuang Wang, Yujun Shen, Ruizhen Hu, Xiaowei Zhou |
This paper addresses the challenge of 3D instance segmentation by
simultaneously leveraging 3D geometric and multi-view image information. Many
previous works have applied deep learning techniques to 3D point clouds for
instance segmentation. However, these methods often fail to generalize to
various types of scenes due to the scarcity and low diversity of labeled 3D
point cloud data. Some recent works have attempted to lift 2D instance
segmentations to 3D within a bottom-up framework. The inconsistency in 2D
instance segmentations among views can substantially degrade the performance of
3D segmentation. In this work, we introduce a novel 3D-to-2D query framework to
effectively exploit 2D segmentation models for 3D instance segmentation.
Specifically, we pre-segment the scene into several superpoints in 3D,
formulating the task into a graph cut problem. The superpoint graph is
constructed based on 2D segmentation models, where node features are obtained
from multi-view image features and edge weights are computed based on
multi-view segmentation results, enabling better generalization. To
process the graph, we train a graph neural network using pseudo 3D labels from
2D segmentation models. Experimental results on the ScanNet, ScanNet++ and
KITTI-360 datasets demonstrate that our method achieves robust segmentation
performance and can generalize across different types of scenes. Our project
page is available at https://zju3dv.github.io/sam_graph. |
This paper proposes a novel 3D instance segmentation method that leverages 2D segmentation cues from the Segment Anything Model (SAM) within a 3D-to-2D query framework. |
Existing 3D instance segmentation methods suffer from the scarcity of labeled 3D data, while 2D-to-3D lifting methods often fail due to inconsistencies in multi-view 2D segmentation. |
The method pre-segments the 3D scene into superpoints and constructs a graph. SAM is then used to annotate graph edges with affinity scores and nodes with aggregated image features. Finally, a graph neural network trained with pseudo labels from 2D segmentation refines the graph for 3D instance segmentation. |
The method achieves state-of-the-art results on the ScanNet dataset.
It exhibits excellent generalization ability, effectively segmenting scenes from the ScanNet++ and KITTI-360 datasets without fine-tuning.
Ablation studies demonstrate the effectiveness of SAM guidance, pseudo label training, and the graph neural network. |
The method's reliance on both 3D geometry and multi-view images limits its application scenarios.
Segmentation accuracy is constrained by the initial superpoint pre-segmentation, which could be improved by incorporating semantic information. |
3d instance segmentation, segment anything model (sam), 3d-to-2d query, graph neural network, pseudo labels |
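The edge weights in the SAM-guided graph cut report can be pictured with a small sketch. It assumes a simple voting rule that is not taken from the paper: the affinity between two superpoints is the fraction of views, among those where both are visible, in which they project into the same 2D mask. The function name `edge_affinity` and the input layout are hypothetical.

```python
def edge_affinity(mask_ids_a, mask_ids_b):
    """Fraction of co-visible views in which two superpoints share a 2D mask.

    mask_ids_a[v] is the ID of the 2D mask that superpoint A projects into
    in view v, or None if A is not visible in that view (same for B).
    """
    co_visible = [
        (a, b) for a, b in zip(mask_ids_a, mask_ids_b)
        if a is not None and b is not None
    ]
    if not co_visible:
        return 0.0  # never seen together: no evidence that they belong together
    agreeing = sum(1 for a, b in co_visible if a == b)
    return agreeing / len(co_visible)

# Toy usage: four views; superpoint A is hidden in the third view.
affinity = edge_affinity([1, 1, None, 2], [1, 3, 4, 2])
no_overlap = edge_affinity([None, 5], [7, None])
```

Averaging over views is one way to soften the per-view inconsistency of 2D segmentations that the abstract identifies as a weakness of bottom-up lifting.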
2312.08366
Report |
See, Say, and Segment: Teaching LMMs to Overcome False Premises |
Tsung-Han Wu, Giscard Biamby, David Chan, Lisa Dunlap, Ritwik Gupta, Xudong Wang, Joseph E. Gonzalez, Trevor Darrell |
Current open-source Large Multimodal Models (LMMs) excel at tasks such as
open-vocabulary language grounding and segmentation but can suffer under false
premises when queries imply the existence of something that is not actually
present in the image. We observe that existing methods that fine-tune an LMM to
segment images significantly degrade their ability to reliably determine
("see") if an object is present and to interact naturally with humans ("say"),
a form of catastrophic forgetting. In this work, we propose a cascading and
joint training approach for LMMs to solve this task, avoiding catastrophic
forgetting of previous skills. Our resulting model can "see" by detecting
whether objects are present in an image, "say" by telling the user if they are
not, proposing alternative queries or correcting semantic errors in the query,
and finally "segment" by outputting the mask of the desired objects if they
exist. Additionally, we introduce a novel False Premise Correction benchmark
dataset, an extension of existing RefCOCO(+/g) referring segmentation datasets
(which we call FP-RefCOCO(+/g)). The results show that our method not only
detects false premises up to 55% better than existing approaches, but under
false premise conditions produces relative cIoU improvements of more than 31%
over baselines, and produces natural language feedback judged helpful up to 67%
of the time. |
This paper introduces a novel False Premise Correction task for Large Multimodal Models (LMMs) and a new dataset, FP-RefCOCO(+/g), to address the issue of LMMs hallucinating segmentations for non-existent objects. |
Existing LMMs for referring segmentation often fail to handle queries involving non-existent objects, hindering their ability to interact naturally with humans in real-world scenarios. |
The authors propose two methods: a cascading approach combining separate LMMs for object detection and segmentation, and SESAME, a unified LMM jointly trained on a combined dataset to perform 'see', 'say', and 'segment' functions. |
SESAME outperforms baselines in detecting false premise queries by up to 55%.
SESAME provides helpful natural language feedback in response to false premise queries, judged helpful up to 67% of the time.
SESAME achieves superior segmentation accuracy (cIoU) compared to baselines, with relative improvements of more than 31% under false premise conditions. |
The model's ability to detect false premises still has room for improvement.
The model sometimes generates hallucinated corrected premises that are still factually incorrect. |
large multimodal models, referring segmentation, false premise detection, reasoning segmentation, human-computer interaction |
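The "see, say, segment" behaviour summarized in this report amounts to a simple control flow, sketched below. The three callables are hypothetical stand-ins for model components, and the toy image (a dictionary from object name to mask) exists only to make the example runnable; nothing here reflects the actual SESAME interface.

```python
def see_say_segment(image, query, see, say, segment):
    """Check whether the queried object exists, then either segment it or respond."""
    if see(image, query):
        # Premise holds: return the mask for the requested object.
        return {"present": True, "message": None, "mask": segment(image, query)}
    # False premise: return no mask, only natural-language feedback.
    return {"present": False, "message": say(image, query), "mask": None}

# Toy stand-ins (assumptions for illustration only).
def toy_see(image, query):
    return query in image

def toy_say(image, query):
    options = ", ".join(sorted(image))
    return f"There is no {query} in the image. Objects present: {options}."

def toy_segment(image, query):
    return image[query]

toy_image = {"dog": [[0, 1], [1, 1]], "ball": [[1, 0], [0, 0]]}
hit = see_say_segment(toy_image, "dog", toy_see, toy_say, toy_segment)
miss = see_say_segment(toy_image, "cat", toy_see, toy_say, toy_segment)
```

Returning no mask on a false premise is the key difference from a segmenter that always outputs something for any query.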
2312.08338
Report |
Global Latent Neural Rendering |
Thomas Tanay, Matteo Maggioni |
A recent trend among generalizable novel view synthesis methods is to learn a
rendering operator acting over single camera rays. This approach is promising
because it removes the need for explicit volumetric rendering, but it
effectively treats target images as collections of independent pixels. Here, we
propose to learn a global rendering operator acting over all camera rays
jointly. We show that the right representation to enable such rendering is a
5-dimensional plane sweep volume consisting of the projection of the input
images on a set of planes facing the target camera. Based on this
understanding, we introduce our Convolutional Global Latent Renderer (ConvGLR),
an efficient convolutional architecture that performs the rendering operation
globally in a low-resolution latent space. Experiments on various datasets
under sparse and generalizable setups show that our approach consistently
outperforms existing methods by significant margins. |
This paper introduces global latent neural rendering, a novel view synthesis approach that learns a generalizable light field model directly from plane sweep volumes. |
This method addresses limitations of previous generalizable novel view synthesis approaches that rely on single-ray rendering, enabling more efficient and accurate rendering by processing all camera rays jointly. |
The method utilizes a convolutional neural network called ConvGLR (Convolutional Global Latent Renderer). ConvGLR operates on plane sweep volumes (PSVs), exploiting their inherent encoding of epipolar geometry to perform global rendering in a low-resolution latent space. |
ConvGLR consistently outperforms existing methods in sparse and generalizable novel view synthesis tasks across the DTU, Real Forward-Facing, and Spaces datasets.
The method exhibits significant improvements in rendering quality, particularly in challenging scenarios with limited input views.
ConvGLR surpasses the performance of the winning entry and organizing team's baseline in the recent ICCV 2023 view synthesis challenge on the ILSH dataset. |
Further optimization of ConvGLR's architecture and training strategies may yield additional performance gains.
Exploration of scene-adaptive depth plane sampling could improve rendering accuracy in complex scenes. |
novel view synthesis, plane sweep volume, global latent rendering, convolutional neural network, epipolar geometry |
2312.08168
Report |
Chat-3D v2: Bridging 3D Scene and Large Language Models with Object Identifiers |
Haifeng Huang, Zehan Wang, Rongjie Huang, Luping Liu, Xize Cheng, Yang Zhao, Tao Jin, Zhou Zhao |
Recent research has demonstrated the significant potential of Large Language
Models (LLMs) in handling challenging tasks within 3D scenes. However, current
models are constrained to addressing object-centric tasks, where each
question-answer pair focuses solely on an individual object. In real-world
applications, users may pose queries involving multiple objects or expect
answers that precisely reference various objects. We introduce the use of
object identifiers to freely reference objects during a conversation. While
this solution appears straightforward, it presents two main challenges: 1) How
to establish a reliable one-to-one correspondence between each object and its
identifier? 2) How to incorporate complex spatial relationships among dozens of
objects into the embedding space of the LLM? To address these challenges, we
propose a two-stage alignment method, which involves learning an
attribute-aware token and a relation-aware token for each object. These tokens
capture the object's attributes and spatial relationships with surrounding
objects in the 3D scene. Once the alignment is established, we can fine-tune
our model on various downstream tasks using instruction tuning. Experiments
conducted on traditional datasets like ScanQA, ScanRefer, and Nr3D/Sr3D
showcase the effectiveness of our proposed method. Additionally, we create a 3D
scene captioning dataset annotated with rich object identifiers, with the
assistance of GPT-4. This dataset aims to further explore the capability of
object identifiers in effective object referencing and precise scene
understanding. |
This paper introduces a novel approach to 3D scene understanding using large language models (LLMs) by incorporating unique object identifiers for explicit object referencing. |
Existing 3D scene understanding models are limited to object-centric tasks, struggling with complex queries involving multiple objects and precise referencing. This work enables LLMs to understand and reference objects within a 3D scene more effectively. |
The authors propose a two-stage alignment method: object-level alignment learns attribute-aware tokens by mapping 3D object features to the LLM's embedding space, and scene-level alignment incorporates spatial relationships using a relation module to generate relation-aware tokens. |
The model outperforms previous 3D LLMs and achieves comparable results to supervised baselines on 3D question answering and visual grounding tasks.
The introduction of object identifiers enables the model to reference specific objects unambiguously, improving performance and user experience.
The authors create an identifier-rich scene captioning dataset with GPT-4 assistance, further demonstrating the model's capability for comprehensive scene understanding. |
The limited availability of 3D-language data poses challenges for optimal alignment between 3D and language spaces, impacting the model's ability to understand less frequent object classes.
Future work can explore more data-efficient architectures, training schemes, and data scaling techniques to further enhance the model's 3D scene understanding capabilities. |
3d scene understanding, large language models, object identifiers, multi-modal learning, 3d visual grounding |
2312.08128
Report |
Clockwork Diffusion: Efficient Generation With Model-Step Distillation |
Amirhossein Habibian, Amir Ghodrati, Noor Fathima, Guillaume Sautiere, Risheek Garrepalli, Fatih Porikli, Jens Petersen |
This work aims to improve the efficiency of text-to-image diffusion models.
While diffusion models use computationally expensive UNet-based denoising
operations in every generation step, we identify that not all operations are
equally relevant for the final output quality. In particular, we observe that
UNet layers operating on high-res feature maps are relatively sensitive to
small perturbations. In contrast, low-res feature maps influence the semantic
layout of the final image and can often be perturbed with no noticeable change
in the output. Based on this observation, we propose Clockwork Diffusion, a
method that periodically reuses computation from preceding denoising steps to
approximate low-res feature maps at one or more subsequent steps. For multiple
baselines, and for both text-to-image generation and image editing, we
demonstrate that Clockwork leads to comparable or improved perceptual scores
with drastically reduced computational complexity. As an example, for Stable
Diffusion v1.5 with 8 DPM++ steps we save 32% of FLOPs with negligible FID and
CLIP change. |
Clockwork Diffusion accelerates text-to-image diffusion models by reusing low-resolution feature maps from preceding denoising steps. |
Diffusion models are computationally expensive, and this work makes them faster by identifying and exploiting redundancy in the generation process. |
The authors replace lower-resolution parts of the diffusion UNet with lightweight adaptors, conditioned on previous features and other inputs. They alternate between approximated and full UNet passes during sampling. |
Clockwork Diffusion reduces FLOPs by up to 38% while maintaining comparable image quality on MS-COCO.
The method is complementary to other optimization techniques and can be applied on top of distilled models.
Clockwork Diffusion is also effective for text-guided image editing, leading to significant speedups for methods like Plug-and-Play. |
Clockwork Diffusion is currently trained for a fixed operating point and scheduler.
The method's effectiveness on non-UNet architectures is unknown. |
diffusion models, text-to-image generation, image editing, model distillation, efficient inference |
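The alternation between full and approximated UNet passes described in the Clockwork Diffusion report can be sketched as a sampling loop with a cache. This is a toy under stated assumptions: `clock` is a hypothetical name for the reuse period, the two pass functions are scalar stand-ins rather than real networks, and the cache is refreshed only on full passes, which simplifies whatever the paper actually does.

```python
def clockwork_sample(x, num_steps, clock, full_pass, approx_pass):
    """Run a full pass every `clock` steps and a cheap approximation otherwise."""
    cached_low_res = None
    trace = []
    for step in range(num_steps):
        if cached_low_res is None or step % clock == 0:
            # Full pass: recompute and cache the low-resolution features.
            x, cached_low_res = full_pass(x, step)
            trace.append("full")
        else:
            # Approximate pass: reuse features cached by the last full pass.
            x = approx_pass(x, cached_low_res, step)
            trace.append("approx")
    return x, trace

# Scalar stand-ins for the expensive and cheap denoising passes (assumptions).
def toy_full_pass(x, step):
    low_res = 0.5 * x
    return x - 0.1 * low_res, low_res

def toy_approx_pass(x, cached_low_res, step):
    return x - 0.1 * cached_low_res

final, trace = clockwork_sample(1.0, 8, 2, toy_full_pass, toy_approx_pass)
```

With 8 steps and a period of 2, half of the steps skip the expensive low-resolution computation; the real savings depend on how much of the network the adaptor replaces.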
2312.08071
Report |
Novel View Synthesis with View-Dependent Effects from a Single Image |
Juan Luis Gonzalez Bello, Munchurl Kim |
In this paper, we are the first to incorporate view-dependent effects into
single-image-based novel view synthesis (NVS). To this end, we propose to
exploit the camera motion priors in NVS to model view-dependent appearance or
effects (VDE) as the negative disparity in the scene. By recognizing that
specularities "follow" the camera motion, we infuse VDEs into the input images
by aggregating input pixel colors along the negative depth region of the
epipolar lines. We also propose a 'relaxed volumetric rendering' approximation
that allows computing the densities in a single pass, improving efficiency for
NVS from single images. Our method can learn single-image NVS from image
sequences only, which is a completely self-supervised learning method, for the
first time requiring neither depth nor camera pose annotations. We present
extensive experimental results and show that our proposed method can learn NVS
with VDEs, outperforming the SOTA single-view NVS methods on the RealEstate10k
and MannequinChallenge datasets. |
This paper presents NVSVDE-Net, the first single-view novel view synthesis (NVS) method that models view-dependent effects (VDEs) like reflections from a single image. |
Existing single-view NVS methods struggle to model VDEs, limiting their realism. NVSVDE-Net addresses this gap by leveraging camera motion priors to synthesize VDEs. |
The method introduces a 'relaxed volumetric rendering' approximation for efficient novel view generation and a novel approach to synthesize VDEs by leveraging negative disparities in the scene induced by target camera motion. Additionally, it utilizes a self-supervised training scheme based on image sequences, eliminating the need for depth or pose annotations. |
NVSVDE-Net outperforms state-of-the-art single-view NVS methods on the RealEstate10k and MannequinChallenge datasets by a large margin in terms of PSNR and other quality metrics.
The method successfully generates plausible VDEs and depth maps from single images, enhancing realism.
The 'relaxed volumetric rendering' approximation, coupled with a sampler module, allows for fast and efficient rendering of high-quality novel views. |
The current architecture is limited in rendering high-frequency VDEs, focusing primarily on glossy reflections.
Rendering novel views with large baselines (significantly exceeding training disparities) remains a challenge due to large occlusions and limited context for the sampler module. |
novel view synthesis, view-dependent effects, single-image rendering, relaxed volumetric rendering, self-supervised learning |
2312.08048
Report |
Compositional Inversion for Stable Diffusion Models |
Xulu Zhang, Xiao-Yong Wei, Jinlin Wu, Tianyi Zhang, Zhaoxiang Zhang, Zhen Lei, Qing Li |
Inversion methods, such as Textual Inversion, generate personalized images by
incorporating concepts of interest provided by user images. However, existing
methods often suffer from overfitting issues, where the dominant presence of
inverted concepts leads to the absence of other desired concepts. This stems from
the fact that during inversion, the irrelevant semantics in the user images are
also encoded, forcing the inverted concepts to occupy locations far from the
core distribution in the embedding space. To address this issue, we propose a
method that guides the inversion process towards the core distribution for
compositional embeddings. Additionally, we introduce a spatial regularization
approach to balance the attention on the concepts being composed. Our method is
designed as a post-training approach and can be seamlessly integrated with
other inversion methods. Experimental results demonstrate the effectiveness of
our proposed approach in mitigating the overfitting problem and generating more
diverse and balanced compositions of concepts in the synthesized images. The
source code is available at
https://github.com/zhangxulu1996/Compositional-Inversion. |
This paper proposes a novel compositional inversion approach for text-to-image synthesis, aiming to address the overfitting issue in existing inversion methods and enable more balanced compositions of concepts in generated images. |
Existing inversion methods often lead to the dominance of inverted concepts in generated images, suppressing the presence of other desired concepts. This limits the diversity and controllability of image synthesis, particularly when composing user-specific concepts with general ones. |
The proposed approach consists of two components: (1) Semantic Inversion: Guides the embedding search towards the core distribution of concepts by utilizing anchor concepts as attractors, improving coherence with other concepts. (2) Spatial Inversion: Employs an MLP to recover coherent locations of composed concepts and regularizes attention maps to avoid the dominance of inverted concepts during image generation. |
The proposed method achieves significant improvements over state-of-the-art methods in terms of text-alignment and concept likelihood, indicating enhanced compositionality and presence of desired concepts in generated images.
The augmented Textual Inversion with the proposed method achieves comparable performance to fine-tuning based methods (Custom Diffusion, DreamBooth) without modifying network parameters.
User study confirms the effectiveness of the method in generating high-quality compositions, while revealing that rigid objects are generally easier to compose than non-rigid objects. |
The semantic inversion, while improving semantic completeness, may sometimes lead to the generation of low-probability scenes.
Future work includes exploring the integration of visual features in location recovery and investigating the potential of the proposed method for multi-concept compositions. |
text-to-image synthesis, textual inversion, compositionality, concept overfitting, spatial regularization |
2312.08019
Report |
AdapEdit: Spatio-Temporal Guided Adaptive Editing Algorithm for Text-Based Continuity-Sensitive Image Editing |
Zhiyuan Ma, Guoli Jia, Bowen Zhou |
With the great success of text-conditioned diffusion models in creative
text-to-image generation, various text-driven image editing approaches have
attracted the attention of many researchers. However, previous works mainly
focus on discreteness-sensitive instructions such as adding, removing or
replacing specific objects, background elements or global styles (i.e., hard
editing), while generally ignoring subject-binding but semantically
fine-changing continuity-sensitive instructions such as actions, poses or
adjectives, and so on (i.e., soft editing), which hampers generative AI from
generating user-customized visual contents. To mitigate this predicament, we
propose a spatio-temporal guided adaptive editing algorithm AdapEdit, which
realizes adaptive image editing by introducing a soft-attention strategy to
dynamically vary the guiding degree from the editing conditions to visual
pixels from both temporal and spatial perspectives. Note that our approach has a
significant advantage in preserving model priors and does not require model
training, fine-tuning, extra data, or optimization. We present our results over
a wide variety of raw images and editing instructions, demonstrating
competitive performance and showing it significantly outperforms the previous
approaches. |
This paper proposes AdapEdit, a spatio-temporal guided adaptive editing algorithm for complex continuity-sensitive image editing tasks, enhancing soft editing capabilities in text-guided image editing. |
Existing text-based image editing methods struggle with complex, subject-binding instructions (soft editing) like actions or adjectives, limiting user customization in image generation. |
AdapEdit uses a soft-attention strategy with two modules: 1) Flexible Word-Level Temporal (FWT) adjustment assigns guidance scales to words for temporal editing. 2) Dynamic Pixel-Level Spatial (DPS) weighting integrates edited features into the original image for spatial editing. |
AdapEdit effectively performs soft editing tasks (e.g., changing postures, adjusting object counts) while preserving original image details.
Quantitative evaluation shows AdapEdit achieves higher CLIP score and CLIP directional similarity compared to baselines, indicating better semantic consistency.
Ablation studies confirm the effectiveness of FWT and DPS modules in achieving continuity-sensitive editing. |
The performance of AdapEdit is sensitive to hyperparameter selection, requiring careful tuning for optimal results.
Future work includes exploring more advanced attention mechanisms and extending AdapEdit to other generative models. |
image editing, diffusion models, soft editing, text-guided image generation, spatio-temporal attention |
2312.07661
Report |
CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor |
Shuyang Sun, Runjia Li, Philip Torr, Xiuye Gu, Siyang Li |
Existing open-vocabulary image segmentation methods require a fine-tuning
step on mask labels and/or image-text datasets. Mask labels are
labor-intensive, which limits the number of categories in segmentation
datasets. Consequently, the vocabulary capacity of pre-trained VLMs is severely
reduced after fine-tuning. However, without fine-tuning, VLMs trained under
weak image-text supervision tend to make suboptimal mask predictions. To
alleviate these issues, we introduce a novel recurrent framework that
progressively filters out irrelevant texts and enhances mask quality without
training efforts. The recurrent unit is a two-stage segmenter built upon a
frozen VLM. Thus, our model retains the VLM's broad vocabulary space and equips
it with segmentation ability. Experiments show that our method outperforms not
only the training-free counterparts, but also those fine-tuned with millions of
data samples, and sets the new state-of-the-art records for both zero-shot
semantic and referring segmentation. Concretely, we improve the current record
by 28.8, 16.0, and 6.9 mIoU on Pascal VOC, COCO Object, and Pascal Context. |
This paper proposes CLIP-as-RNN (CAR), a novel recurrent framework for open-vocabulary image segmentation that leverages a frozen pre-trained vision-language model (VLM) without requiring fine-tuning. |
Existing open-vocabulary segmentation methods are limited by their reliance on fine-tuning, leading to reduced vocabulary capacity or suboptimal mask predictions. CAR addresses these limitations by preserving the VLM's broad vocabulary and enhancing mask quality without training. |
CAR employs a recurrent architecture with a two-stage segmenter. The segmenter iteratively refines mask proposals and filters irrelevant text queries by assessing the alignment between visual and textual representations. This process continues until a stable state is achieved. |
CAR significantly outperforms previous zero-shot open-vocabulary semantic segmentation methods, achieving state-of-the-art results on Pascal VOC, COCO Object, and Pascal Context.
The method demonstrates strong performance in referring image segmentation, surpassing previous state-of-the-art on RefCOCO, RefCOCO+, and RefCOCOg.
CAR establishes a strong baseline for zero-shot referring video segmentation on Ref-DAVIS 2017. |
The performance of CAR is limited by the capabilities of the pre-trained VLM.
Future work includes incorporating additional trainable modules and exploring integration with other VLMs. |
open-vocabulary segmentation, vision-language models, zero-shot learning, referring segmentation, recurrent neural networks |
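The recurrent filtering described in the CLIP as RNN report can be sketched as a fixed-point loop over the set of text queries. The scoring rule, threshold, and names below are assumptions for illustration; in the paper the scores come from a frozen VLM and a two-stage segmenter, neither of which is modelled here.

```python
def recurrent_filter(queries, score_fn, threshold, max_iters=10):
    """Repeatedly drop low-scoring text queries until the kept set stops changing."""
    current = list(queries)
    for _ in range(max_iters):
        scores = score_fn(current)  # scores depend on the queries still in play
        kept = [q for q in current if scores[q] >= threshold]
        if kept == current:
            break  # stable state reached
        current = kept
    return current

# Toy scorer (assumption): raw relevance normalized over the remaining queries,
# so removing distractors raises the scores of the queries that survive.
RAW_RELEVANCE = {"cat": 0.6, "dog": 0.3, "unicorn": 0.1}

def toy_score_fn(current):
    total = sum(RAW_RELEVANCE[q] for q in current)
    return {q: RAW_RELEVANCE[q] / total for q in current}

kept_loose = recurrent_filter(["cat", "dog", "unicorn"], toy_score_fn, threshold=0.25)
kept_strict = recurrent_filter(["cat", "dog", "unicorn"], toy_score_fn, threshold=0.35)
```

The loop stops once a pass removes nothing, which corresponds to the stable state mentioned in the method summary.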
2312.07541
Report |
SMERF: Streamable Memory Efficient Radiance Fields for Real-Time Large-Scene Exploration |
Daniel Duckworth, Peter Hedman, Christian Reiser, Peter Zhizhin, Jean-François Thibert, Mario Lučić, Richard Szeliski, Jonathan T. Barron |
Recent techniques for real-time view synthesis have rapidly advanced in
fidelity and speed, and modern methods are capable of rendering
near-photorealistic scenes at interactive frame rates. At the same time, a
tension has arisen between explicit scene representations amenable to
rasterization and neural fields built on ray marching, with state-of-the-art
instances of the latter surpassing the former in quality while being
prohibitively expensive for real-time applications. In this work, we introduce
SMERF, a view synthesis approach that achieves state-of-the-art accuracy among
real-time methods on large scenes with footprints up to 300 m$^2$ at a
volumetric resolution of 3.5 mm$^3$. Our method is built upon two primary
contributions: a hierarchical model partitioning scheme, which increases model
capacity while constraining compute and memory consumption, and a distillation
training strategy that simultaneously yields high fidelity and internal
consistency. Our approach enables full six degrees of freedom (6DOF) navigation
within a web browser and renders in real-time on commodity smartphones and
laptops. Extensive experiments show that our method exceeds the current
state-of-the-art in real-time novel view synthesis by 0.78 dB on standard
benchmarks and 1.78 dB on large scenes, renders frames three orders of
magnitude faster than state-of-the-art radiance field models, and achieves
real-time performance across a wide variety of commodity devices, including
smartphones. We encourage readers to explore these models interactively at our
project website: https://smerf-3d.github.io. |
This paper introduces SMERF, a streamable, memory-efficient radiance field representation for real-time view synthesis of large scenes. The method renders in real-time on a variety of devices, including smartphones, while exceeding the quality of existing real-time methods. |
Existing real-time view synthesis techniques struggle to balance quality, speed, and representation size. This work aims to achieve high-fidelity rendering of large scenes in real-time on commodity hardware. |
A hierarchical model architecture composed of MERF-like submodels is built, leveraging coordinate space partitioning, deferred appearance network partitioning, and feature gating. The model is trained via a novel distillation strategy using a high-fidelity ZipNeRF teacher. |
SMERF achieves state-of-the-art accuracy among real-time methods, surpassing the previous best by 0.78 dB on standard benchmarks and 1.78 dB on large scenes.
The method renders frames three orders of magnitude faster than state-of-the-art radiance field models like ZipNeRF.
SMERF achieves real-time performance across a wide variety of commodity devices, including smartphones. |
The model has high storage cost leading to increased loading times and network usage.
Training costs are high, requiring significant GPU resources and time. |
neural radiance fields, volumetric representation, image synthesis, real-time rendering, distillation |
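The coordinate space partitioning mentioned in the SMERF report can be illustrated by mapping a camera position to the submodel responsible for it. The sketch assumes a uniform 2D grid over the scene footprint with clamping at the borders; the report does not specify the partitioning rule, and `submodel_index` is a hypothetical name.

```python
def submodel_index(camera_xy, scene_min, scene_max, grid=(3, 3)):
    """Return the grid cell (and hence the submodel) containing the camera."""
    index = []
    for p, lo, hi, cells in zip(camera_xy, scene_min, scene_max, grid):
        t = (p - lo) / (hi - lo)        # normalized position along this axis
        cell = int(t * cells)           # truncate to a cell number
        index.append(min(cells - 1, max(0, cell)))  # clamp cameras outside the footprint
    return tuple(index)

# Toy usage: a 20 m x 20 m footprint split into 2 x 2 submodels.
center = submodel_index((0.0, 0.0), (-10.0, -10.0), (10.0, 10.0), grid=(2, 2))
corner = submodel_index((-10.0, -10.0), (-10.0, -10.0), (10.0, 10.0), grid=(2, 2))
outside = submodel_index((25.0, -30.0), (-10.0, -10.0), (10.0, 10.0), grid=(2, 2))
```

Only the submodel selected by the current camera position needs to be resident at any moment, which is one way a partitioned model can grow in capacity while keeping per-frame memory bounded.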
2312.07539
Report |
HeadArtist: Text-conditioned 3D Head Generation with Self Score Distillation |
Hongyu Liu, Xuan Wang, Ziyu Wan, Yujun Shen, Yibing Song, Jing Liao, Qifeng Chen |
This work presents HeadArtist for 3D head generation from text descriptions.
With a landmark-guided ControlNet serving as the generative prior, we come up
with an efficient pipeline that optimizes a parameterized 3D head model under
the supervision of the prior distillation itself. We call such a process self
score distillation (SSD). In detail, given a sampled camera pose, we first
render an image and its corresponding landmarks from the head model, and add
a particular level of noise to the image. The noisy image, landmarks, and
text condition are then fed into the frozen ControlNet twice for noise
prediction. Two different classifier-free guidance (CFG) weights are applied
during these two predictions, and the prediction difference offers a direction
on how the rendered image can better match the text of interest. Experimental
results suggest that our approach delivers high-quality 3D head sculptures with
adequate geometry and photorealistic appearance, significantly outperforming
state-of-the-art methods. We also show that the same pipeline supports
editing the generated heads, including both geometry deformation and appearance
change. |
HeadArtist: a novel pipeline for generating and editing 3D heads from text descriptions, leveraging self-score distillation (SSD) within a landmark-guided ControlNet framework. |
3D head avatars are crucial for various applications, but existing methods struggle with limitations like over-saturation, over-smoothing, and multi-face Janus artifacts. |
HeadArtist disentangles geometry and texture generation. It employs a landmark-guided ControlNet with SSD to optimize a parameterized 3D head model, minimizing the score difference between predicted noise distributions representing generated and target heads. |
Generates high-quality 3D heads with intricate geometry and photorealistic textures, outperforming state-of-the-art methods.
Effectively addresses issues like multi-face Janus artifacts and over-saturation common in previous methods.
Enables 3D head editing, manipulating geometry and texture while preserving character identity. |
Currently cannot achieve photorealism on par with 3D reconstruction or GAN-based methods.
Struggles with generating complex characters, particularly those from Japanese animation, due to limitations of the FLAME initialization and the diffusion model. |
3d head generation, text-guided synthesis, self score distillation, controlnet, 3d head editing |
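The two-pass guidance step described in the abstract above can be illustrated with a toy sketch. This is not the authors' implementation: `cfg_noise`, `ssd_direction`, and the example weights are hypothetical, and the sketch assumes the standard classifier-free guidance combination and that both passes share the same conditional and unconditional predictions, which the source does not specify.

```python
def cfg_noise(eps_uncond, eps_cond, weight):
    """Standard classifier-free guidance: eps_u + w * (eps_c - eps_u)."""
    return [u + weight * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def ssd_direction(eps_uncond, eps_cond, w_high, w_low):
    """Difference between two guided predictions from the same frozen model."""
    high = cfg_noise(eps_uncond, eps_cond, w_high)
    low = cfg_noise(eps_uncond, eps_cond, w_low)
    return [h - l for h, l in zip(high, low)]


# Toy noise predictions for a 3-pixel "image" (hypothetical values).
eps_uncond = [0.0, 1.0, -1.0]
eps_cond = [1.0, 1.0, 0.0]
direction = ssd_direction(eps_uncond, eps_cond, w_high=4.0, w_low=2.0)
```

Under these assumptions the difference reduces to `(w_high - w_low) * (eps_cond - eps_uncond)`, so the update direction points from the unconditional prediction toward the text-conditioned one.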
2312.07537
Report |
FreeInit: Bridging Initialization Gap in Video Diffusion Models |
Tianxing Wu, Chenyang Si, Yuming Jiang, Ziqi Huang, Ziwei Liu |
Though diffusion-based video generation has witnessed rapid progress, the
inference results of existing models still exhibit unsatisfactory temporal
consistency and unnatural dynamics. In this paper, we delve deep into the noise
initialization of video diffusion models, and discover an implicit
training-inference gap that contributes to the unsatisfactory inference quality.
Our key findings are: 1) the spatial-temporal frequency distribution of the
initial latent at inference is intrinsically different from that for training,
and 2) the denoising process is significantly influenced by the low-frequency
components of the initial noise. Motivated by these observations, we propose a
concise yet effective inference sampling strategy, FreeInit, which
significantly improves temporal consistency of videos generated by diffusion
models. Through iteratively refining the spatial-temporal low-frequency
components of the initial latent during inference, FreeInit is able to
compensate for the initialization gap between training and inference, thus
effectively improving the subject appearance and temporal consistency of
generation results. Extensive experiments demonstrate that FreeInit
consistently enhances the generation results of various text-to-video
generation models without additional training. |
This paper identifies an implicit training-inference gap in video diffusion models' noise initialization and proposes FreeInit, an iterative method refining the initial latent's low-frequency component during inference to enhance temporal consistency in generated videos. |
Existing video diffusion models suffer from poor temporal consistency and unnatural dynamics in generated videos due to a discrepancy between training and inference noise initialization. |
FreeInit iteratively refines initial noise by combining low-frequency components of generated noisy latents with high-frequency components of random Gaussian noise, bridging the gap between training and inference. |
FreeInit significantly improves temporal consistency across various text-to-video models, as measured by the DINO metric.
Qualitative analysis shows enhanced subject appearance and reduced temporal artifacts in generated videos.
Ablation studies confirm the importance of noise reinitialization and appropriate filter selection for optimal performance. |
FreeInit increases inference time, potentially mitigated by coarse-to-fine sampling strategies.
Small, fast-moving objects may be distorted due to emphasis on low-frequency consistency. |
video generation, diffusion models, temporal consistency, noise initialization, frequency domain analysis |
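The frequency mixing at the heart of the method description above can be sketched on a 1D toy signal. This is a simplified illustration, not the FreeInit code: the 1D signal stands in for the spatio-temporal latent, the ideal (hard cutoff) low-pass filter is an assumption, and `mix_frequencies` and `cutoff` are hypothetical names. Only one refinement step is shown.

```python
import cmath
import random


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def mix_frequencies(noisy_latent, fresh_noise, cutoff):
    """Take bins with |frequency| <= cutoff from noisy_latent, the rest from fresh_noise."""
    n = len(noisy_latent)
    latent_f, noise_f = dft(noisy_latent), dft(fresh_noise)
    mixed = []
    for k in range(n):
        freq = min(k, n - k)  # distance from the DC bin, covers negative frequencies
        mixed.append(latent_f[k] if freq <= cutoff else noise_f[k])
    return idft(mixed)


random.seed(0)
latent = [1.0] * 8                               # toy latent with only a DC component
noise = [random.gauss(0.0, 1.0) for _ in range(8)]
refined = mix_frequencies(latent, noise, cutoff=0)
```

With `cutoff=0` only the mean of the latent survives, while all higher-frequency detail is resampled from fresh Gaussian noise.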
2312.07536
Report |
FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition |
Sicheng Mo, Fangzhou Mu, Kuan Heng Lin, Yanli Liu, Bochen Guan, Yin Li, Bolei Zhou |
Recent approaches such as ControlNet offer users fine-grained spatial control
over text-to-image (T2I) diffusion models. However, auxiliary modules have to
be trained for each type of spatial condition, model architecture, and
checkpoint, putting them at odds with the diverse intents and preferences a
human designer would like to convey to the AI models during the content
creation process. In this work, we present FreeControl, a training-free
approach for controllable T2I generation that supports multiple conditions,
architectures, and checkpoints simultaneously. FreeControl designs structure
guidance to facilitate the structure alignment with a guidance image, and
appearance guidance to enable the appearance sharing between images generated
using the same seed. Extensive qualitative and quantitative experiments
demonstrate the superior performance of FreeControl across a variety of
pre-trained T2I models. In particular, FreeControl facilitates convenient
training-free control over many different architectures and checkpoints, handles
challenging input conditions on which most existing training-free
methods fail, and achieves competitive synthesis quality with training-based
approaches. |
FreeControl, a training-free method for controllable text-to-image (T2I) generation that supports multiple conditions, architectures, and checkpoints simultaneously. |
Existing methods for controlling pre-trained T2I diffusion models require training an auxiliary module for each type of spatial condition, model architecture, and checkpoint, leading to high training cost, poor scalability and limited control signals. |
FreeControl designs structure guidance to facilitate the structure alignment with a guidance image by modeling the subspace of features in T2I models, and appearance guidance to enable the appearance sharing between images generated using the same seed. |
FreeControl supports a wide array of control conditions including challenging ones like 2D projections of point clouds and meshes, model architectures such as SD 1.5, 2.1, SD-XL 1.0, and customized checkpoints.
FreeControl demonstrates superior results compared to previous training-free methods and achieves competitive performance with prior training-based approaches.
FreeControl can be readily adapted for text-guided image-to-image translation. |
FreeControl relies on DDIM inversion for feature extraction and gradient computation, leading to increased inference time.
FreeControl relies on a low-resolution encoding of the guidance image, sometimes failing to recognize inputs with missing structure or accurately locate fine details. |
text-to-image generation, controllable generation, diffusion models, training-free methods, image-to-image translation |
2312.07533
Report |
VILA: On Pre-training for Visual Language Models |
Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, Song Han |
Visual language models (VLMs) have progressed rapidly with the recent success of
large language models. There have been growing efforts on visual instruction
tuning to extend the LLM with visual inputs, but the field lacks an in-depth study of the
visual language pre-training process, where the model learns to perform joint
modeling on both modalities. In this work, we examine the design options for
VLM pre-training by augmenting LLM towards VLM through step-by-step
controllable comparisons. We introduce three main findings: (1) freezing LLMs
during pre-training can achieve decent zero-shot performance, but lacks
in-context learning capability, which requires unfreezing the LLM; (2)
interleaved pre-training data is beneficial whereas image-text pairs alone are
not optimal; (3) re-blending text-only instruction data to image-text data
during instruction fine-tuning not only remedies the degradation of text-only
tasks, but also boosts VLM task accuracy. With an enhanced pre-training recipe
we build VILA, a Visual Language model family that consistently outperforms the
state-of-the-art models, e.g., LLaVA-1.5, across main benchmarks without bells
and whistles. Multi-modal pre-training also helps unveil appealing properties
of VILA, including multi-image reasoning, enhanced in-context learning, and
better world knowledge. |
This paper investigates and presents an enhanced pre-training recipe for auto-regressive Visual Language Models (VLMs), aiming to augment Large Language Models (LLMs) for improved visual understanding and reasoning. |
Existing VLM research primarily focuses on instruction tuning, neglecting the crucial visual language pre-training stage. This paper addresses this gap by exploring design choices for effective VLM pre-training, which is vital for modality alignment and inheriting beneficial LLM properties like in-context learning. |
The authors conduct controlled experiments, ablating design choices related to LLM training, visual language corpus selection (interleaved vs. image-text pairs), and data blending during pre-training and instruction tuning. They analyze the impact of these choices on downstream task performance (VQA, captioning, text-only tasks) and provide insights into embedding alignment. |
Updating LLMs during pre-training is crucial for enabling in-context learning capabilities in VLMs, leading to improved performance on few-shot tasks.
Interleaved image-text corpora are superior to image-text pairs for pre-training, preserving text-only capabilities of LLMs and facilitating visual in-context learning.
Joint instruction tuning with both visual and text data remedies the degradation of text-only tasks while boosting VLM task accuracy. |
The study is limited by computational resources, preventing exploration of billion-scale pre-training data.
Future work includes scaling up the pre-training corpus, optimizing training throughput, and investigating token compression techniques for visual inputs. |
visual language model, vlm pre-training, multi-modal learning, in-context learning, large language models |
2312.07532
Report |
Interfacing Foundation Models' Embeddings |
Xueyan Zou, Linjie Li, Jianfeng Wang, Jianwei Yang, Mingyu Ding, Zhengyuan Yang, Feng Li, Hao Zhang, Shilong Liu, Arul Aravinthan, Yong Jae Lee, Lijuan Wang |
We present FIND, a generalized interface for aligning foundation models'
embeddings. As shown in the teaser figure, a lightweight transformer interface
without tuning any foundation model weights is enough for a unified image
(segmentation) and dataset-level (retrieval) understanding. The proposed
interface has the following favorable attributes: (1) Generalizable. It applies
to various tasks spanning retrieval, segmentation, etc., under the
same architecture and weights. (2) Prototypable. Different tasks are able to be
implemented through prototyping attention masks and embedding types. (3)
Extendable. The proposed interface is adaptable to new tasks and new models.
(4) Interleavable. With the benefit of multi-task multi-modal training, the
proposed interface creates an interleaved shared embedding space. In light of
the interleaved embedding space, we introduce FIND-Bench, which adds
new training and evaluation annotations to the COCO dataset for interleaved
segmentation and retrieval. Our approach achieves state-of-the-art performance
on FIND-Bench and competitive performance on standard retrieval and
segmentation settings. The training, evaluation, and demo code as well as the
dataset have been released at https://github.com/UX-Decoder/FIND. |
This paper presents FIND, a generalized interface for aligning foundation model embeddings across modalities (vision and language) and granularities (pixel to image). |
Training individual foundation models is costly and their full potential is limited by fixed output modalities and task objectives. FIND offers a more efficient and flexible approach by interfacing existing models. |
FIND leverages a lightweight transformer interface with frozen pre-trained foundation models. It employs task-adaptive prototyping through configurable attention masks and embedding types to align vision and language embeddings. |
FIND achieves state-of-the-art performance on the proposed FIND-Bench for interleaved image retrieval and segmentation.
It exhibits competitive performance on standard benchmarks for generic, interactive, and grounded segmentation, as well as image-text retrieval.
FIND demonstrates strong generalization capability, effectively handling out-of-domain images and complex language descriptions. |
The current implementation requires training with a fixed resolution across all tasks, potentially limiting performance on certain tasks like image-text retrieval.
Future work includes incorporating novel foundation models, exploring more cross-modal tasks, extending to longer contexts, and enabling more flexible object query granularities. |
foundation models, multi-modal learning, image segmentation, image retrieval, interleaved understanding |
2312.07509
Report |
PEEKABOO: Interactive Video Generation via Masked-Diffusion |
Yash Jain, Anshul Nasery, Vibhav Vineet, Harkirat Behl |
Modern video generation models like Sora have achieved remarkable success in
producing high-quality videos. However, a significant limitation is their
inability to offer interactive control to users, a feature that promises to
open up unprecedented applications and creativity. In this work, we introduce
the first solution to equip diffusion-based video generation models with
spatio-temporal control. We present Peekaboo, a novel masked attention module,
which seamlessly integrates with current video generation models offering
control without the need for additional training or inference overhead. To
facilitate future research, we also introduce a comprehensive benchmark for
interactive video generation. This benchmark offers a standardized framework
for the community to assess the efficacy of emerging interactive video
generation models. Our extensive qualitative and quantitative assessments
reveal that Peekaboo achieves up to a 3.8x improvement in mIoU over baseline
models, all while maintaining the same latency. Code and benchmark are
available on the webpage. |
This paper presents Peekaboo, a training-free method to add spatio-temporal control to off-the-shelf diffusion-based video generation models, allowing users to control object size, location, and trajectory using masks. |
Interactive control in video generation is crucial for user creativity and various applications like education and entertainment, but existing models lack this feature or require expensive retraining. |
Peekaboo introduces a masked attention module integrated into existing video generation models, focusing spatial, cross, and temporal attention on local contexts defined by user-provided masks without retraining or significant inference overhead. |
Peekaboo achieves up to 3.8x improvement in mIoU over baseline models, demonstrating superior spatial control.
It maintains high video generation quality, even surpassing baselines in some cases, as shown by FVD scores.
The method is versatile, working with different T2V models (ZeroScope, ModelScope) and applicable to text-to-image models. |
The performance depends on the base model's capabilities and can inherit its biases.
Mismatch between input masks and text prompts, like contradicting motion directions, can lead to failures. |
video generation, interactive control, diffusion models, spatio-temporal control, zero-training |
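The idea of restricting attention to a user-provided mask, as in the method description above, can be sketched for a single query. This is a generic masked-attention illustration, not the Peekaboo module: `masked_attention` is a hypothetical helper, and the sketch assumes the mask simply removes disallowed key positions before the softmax.

```python
import math


def masked_attention(query, keys, values, mask):
    """Softmax attention for one query, restricted to positions where mask is True."""
    if not any(mask):
        raise ValueError("mask must allow at least one key position")
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale if keep else -math.inf
              for key, keep in zip(keys, mask)]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum((w / total) * v[d] for w, v in zip(weights, values))
            for d in range(dim)]


# Three key/value tokens; only the first lies inside the hypothetical user mask.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0], [2.0], [3.0]]
inside_only = masked_attention(query, keys, values, mask=[True, False, False])
```

Masked positions receive a score of negative infinity and therefore zero weight, so the output depends only on tokens inside the mask.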
2312.07504
Report |
COLMAP-Free 3D Gaussian Splatting |
Yang Fu, Sifei Liu, Amey Kulkarni, Jan Kautz, Alexei A. Efros, Xiaolong Wang |
While neural rendering has led to impressive advances in scene reconstruction
and novel view synthesis, it relies heavily on accurately pre-computed camera
poses. To relax this constraint, multiple efforts have been made to train
Neural Radiance Fields (NeRFs) without pre-processed camera poses. However, the
implicit representations of NeRFs make it harder to optimize the 3D
structure and camera poses at the same time. On the other hand, the recently
proposed 3D Gaussian Splatting provides new opportunities given its explicit
point cloud representations. This paper leverages both the explicit geometric
representation and the continuity of the input video stream to perform novel
view synthesis without any SfM preprocessing. We process the input frames in a
sequential manner and progressively grow the set of 3D Gaussians by taking one
input frame at a time, without the need to pre-compute the camera poses. Our
method significantly improves over previous approaches in view synthesis and
camera pose estimation under large motion changes. Our project page is
https://oasisyang.github.io/colmap-free-3dgs |
This paper presents COLMAP-Free 3D Gaussian Splatting (CF-3DGS), a novel method for performing novel view synthesis without relying on pre-computed camera poses from SfM algorithms like COLMAP. |
Current neural rendering methods heavily depend on accurate camera poses, which are time-consuming to obtain and prone to errors. CF-3DGS addresses this limitation by jointly optimizing camera poses and scene reconstruction, enabling more flexible and robust view synthesis. |
CF-3DGS leverages the temporal continuity of videos and the explicit representation of 3D Gaussian Splatting. It processes frames sequentially, using a local 3DGS to estimate relative poses between nearby frames and a global 3DGS to progressively build and refine the scene representation. |
CF-3DGS achieves state-of-the-art novel view synthesis quality on Tanks and Temples and CO3D datasets, outperforming previous pose-unknown methods.
It demonstrates robust camera pose estimation, especially for challenging scenes with large camera motion like 360° videos in CO3D.
The method is efficient, achieving fast training and inference speeds thanks to the advantages of Gaussian Splatting. |
The sequential optimization limits its application to ordered image sequences.
Future work could explore extensions for unordered image collections. |
novel view synthesis, 3d gaussian splatting, pose estimation, neural rendering, sfm-free |
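The sequential processing described above relies on relative poses between nearby frames. One way to picture how such estimates could be accumulated into per-frame global poses is the sketch below. It is an illustration under assumptions, not the CF-3DGS code: the composition convention (global pose of frame k equals global pose of frame k-1 times the relative transform) and the helper names are hypothetical, and the relative poses are given rather than estimated.

```python
def matmul4(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def translation(tx, ty, tz):
    """Homogeneous 4x4 transform with identity rotation."""
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


def chain_poses(relative_poses):
    """Accumulate frame-to-frame transforms into one global pose per frame."""
    poses = [translation(0.0, 0.0, 0.0)]  # the first frame defines the origin
    for rel in relative_poses:
        poses.append(matmul4(poses[-1], rel))
    return poses


# Hypothetical relative motions between consecutive frames.
steps = [translation(1.0, 0.0, 0.0), translation(1.0, 0.0, 0.5)]
global_poses = chain_poses(steps)
```

Because each global pose depends on the previous one, the frames must arrive in order, which matches the limitation to ordered image sequences noted in the record above.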
2312.07409
Report |
DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing |
Kaiwen Zhang, Yifan Zhou, Xudong Xu, Xingang Pan, Bo Dai |
Diffusion models have achieved remarkable image generation quality surpassing
previous generative models. However, a notable limitation of diffusion models,
in comparison to GANs, is their difficulty in smoothly interpolating between
two image samples, due to their highly unstructured latent space. Such a smooth
interpolation is intriguing as it naturally serves as a solution for the image
morphing task with many applications. In this work, we present DiffMorpher, the
first approach enabling smooth and natural image interpolation using diffusion
models. Our key idea is to capture the semantics of the two images by fitting
two LoRAs to them respectively, and interpolate between both the LoRA
parameters and the latent noises to ensure a smooth semantic transition, where
correspondence automatically emerges without the need for annotation. In
addition, we propose an attention interpolation and injection technique and a
new sampling schedule to further enhance the smoothness between consecutive
images. Extensive experiments demonstrate that DiffMorpher achieves starkly
better image morphing effects than previous methods across a variety of object
categories, bridging a critical functional gap that distinguished diffusion
models from GANs. |
This paper introduces DiffMorpher, a novel approach that enables smooth and natural image interpolation using pre-trained diffusion models, effectively bridging a key functional gap between diffusion models and GANs in image morphing. |
Diffusion models excel in image generation but struggle with smooth interpolation between images, a task where GANs have traditionally excelled. This work addresses this limitation, opening new possibilities for diffusion models in applications requiring smooth image transitions, such as animations and image editing. |
DiffMorpher leverages LoRAs to capture the semantics of two input images, interpolating between their LoRA parameters and latent noises. It also employs attention interpolation and replacement for smooth texture transitions, AdaIN adjustment for color and brightness consistency, and a new sampling schedule for uniform content transition speed. |
DiffMorpher significantly outperforms previous image morphing methods, including GAN-based techniques, in terms of image fidelity, semantic consistency, and transition smoothness.
Quantitative evaluation on the newly introduced MorphBench dataset confirms the superiority of DiffMorpher in achieving smooth and natural image morphing.
A user study further substantiates the effectiveness of DiffMorpher, showing a clear preference for its results over those from baseline methods. |
The need to train a LoRA for each input image adds computational overhead.
DiffMorpher may struggle with morphing images that lack clear correspondence. |
image morphing, diffusion models, lora, attention control, image interpolation |
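Interpolating both the LoRA parameters and the latent noises, as described above, can be sketched with two small helpers. This is an illustration, not the DiffMorpher code: the source does not say which interpolation is used for each quantity, so linear interpolation for the LoRA parameters and spherical interpolation for the noise are assumptions, and the flat lists stand in for real tensors.

```python
import math


def lerp(a, b, alpha):
    """Linear interpolation between two parameter vectors."""
    return [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]


def slerp(a, b, alpha):
    """Spherical interpolation between two noise vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos_theta = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    if theta < 1e-6:  # nearly parallel vectors: fall back to linear interpolation
        return lerp(a, b, alpha)
    sin_theta = math.sin(theta)
    w_a = math.sin((1.0 - alpha) * theta) / sin_theta
    w_b = math.sin(alpha * theta) / sin_theta
    return [w_a * x + w_b * y for x, y in zip(a, b)]


# Hypothetical LoRA parameters and latent noises for the two input images.
lora_mid = lerp([0.0, 2.0], [2.0, 4.0], alpha=0.5)
noise_mid = slerp([1.0, 0.0], [0.0, 1.0], alpha=0.5)
```

Spherical interpolation keeps the norm of the intermediate noise close to that of the endpoints, which plain linear interpolation of Gaussian noise does not.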
2312.07315
Report |
NVS-Adapter: Plug-and-Play Novel View Synthesis from a Single Image |
Yoonwoo Jeong, Jinwoo Lee, Chiheon Kim, Minsu Cho, Doyup Lee |
Transfer learning of large-scale Text-to-Image (T2I) models has recently
shown impressive potential for Novel View Synthesis (NVS) of diverse objects
from a single image. While previous methods typically train large models on
multi-view datasets for NVS, fine-tuning all parameters of T2I models not
only demands a high cost but also reduces the generalization capacity of T2I
models in generating diverse images in a new domain. In this study, we propose
an effective method, dubbed NVS-Adapter, which is a plug-and-play module for a
T2I model, to synthesize novel multi-views of visual objects while fully
exploiting the generalization capacity of T2I models. NVS-Adapter consists of
two main components: view-consistency cross-attention learns the visual
correspondences to align the local details of view features, and global
semantic conditioning aligns the semantic structure of generated views with the
reference view. Experimental results demonstrate that the NVS-Adapter can
effectively synthesize geometrically consistent multi-views and also achieve
high performance on benchmarks without full fine-tuning of T2I models. The code
and data are publicly available at
https://postech-cvlab.github.io/nvsadapter/. |
Proposes NVS-Adapter, a plug-and-play module for Text-to-Image (T2I) models, to synthesize novel multi-views of objects from a single image while preserving the T2I model's ability to generate diverse images. |
Fine-tuning large T2I models for Novel View Synthesis (NVS) is costly and can reduce their generalization ability in new domains. NVS-Adapter addresses this by adapting T2I models for NVS without full fine-tuning. |
NVS-Adapter, integrated into a pretrained T2I model, uses two main components: (1) View-consistency cross-attention: learns visual correspondences between views to align local details. (2) Global semantic conditioning: aligns the semantic structure of generated views with the reference view. |
Synthesizes geometrically consistent multi-views from a single image.
Achieves competitive performance on Objaverse and Google Scanned Objects datasets without full fine-tuning.
Demonstrates compatibility with other plug-and-play modules like ControlNets, enhancing NVS performance further. |
Limited capacity in handling a large number of target views simultaneously.
Reliance on Score Distillation Sampling (SDS) for 3D reconstruction, which can be computationally expensive. |
novel view synthesis, text-to-image, transfer learning, diffusion models, cross-attention |
2312.07231
Report |
Fast Training of Diffusion Transformer with Extreme Masking for 3D Point Clouds Generation |
Shentong Mo, Enze Xie, Yue Wu, Junsong Chen, Matthias Nießner, Zhenguo Li |
Diffusion Transformers have recently shown remarkable effectiveness in
generating high-quality 3D point clouds. However, training voxel-based
diffusion models for high-resolution 3D voxels remains prohibitively expensive
due to the cubic complexity of attention operators, which arises from the
additional dimension of voxels. Motivated by the inherent redundancy of 3D
compared to 2D, we propose FastDiT-3D, a novel masked diffusion transformer
tailored for efficient 3D point cloud generation, which greatly reduces
training costs. Specifically, we draw inspiration from masked autoencoders to
dynamically operate the denoising process on masked voxelized point clouds. We
also propose a novel voxel-aware masking strategy to adaptively aggregate
background/foreground information from voxelized point clouds. Our method
achieves state-of-the-art performance with an extreme masking ratio of nearly
99%. Moreover, to improve multi-category 3D generation, we introduce
Mixture-of-Experts (MoE) into the 3D diffusion model. Each category can learn a
distinct diffusion path with different experts, relieving gradient conflict.
Experimental results on the ShapeNet dataset demonstrate that our method
achieves state-of-the-art high-fidelity and diverse 3D point cloud generation
performance. Our FastDiT-3D improves 1-Nearest Neighbor Accuracy and Coverage
metrics when generating 128-resolution voxel point clouds, using only 6.5% of
the original training cost. |
Presents FastDiT-3D, a fast diffusion transformer for efficient 3D point cloud generation that performs denoising on masked voxelized point clouds, achieving state-of-the-art performance at a significantly reduced training cost. |
Addresses the prohibitively expensive training of voxel-based diffusion models for high-resolution 3D point clouds due to the cubic complexity of attention operators. |
Employs a novel foreground-background aware masking strategy for efficient encoding and integrates Mixture-of-Experts (MoE) layers within Transformer blocks for multi-category adaptation. |
Achieves state-of-the-art performance in generating high-fidelity and diverse 3D point clouds across categories on the ShapeNet dataset.
Significantly reduces training costs to 6.5% of the original cost for 128-resolution voxel point cloud generation.
Demonstrates the effectiveness of voxel-aware masking, 3D window attention, and MoE for efficient and high-quality 3D point cloud generation. |
Exploration of explicit text control for 3D shape generation is left for future work.
Scaling FastDiT-3D to large-scale text-3D datasets for text-to-3D generation is a potential future direction. |
3d point cloud generation, diffusion models, transformers, masked modeling, mixture of experts |
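The foreground-background aware masking mentioned above can be pictured as sampling visible tokens separately for occupied and empty voxels. The sketch below is an assumption-laden illustration, not the FastDiT-3D code: the per-region ratios, the toy grid, and the `sample_visible_tokens` helper are all hypothetical.

```python
import random


def sample_visible_tokens(occupancy, fg_mask_ratio, bg_mask_ratio, rng):
    """Return sorted indices of tokens left unmasked, sampled per region."""
    foreground = [i for i, occupied in enumerate(occupancy) if occupied]
    background = [i for i, occupied in enumerate(occupancy) if not occupied]
    keep_fg = round(len(foreground) * (1.0 - fg_mask_ratio))
    keep_bg = round(len(background) * (1.0 - bg_mask_ratio))
    visible = rng.sample(foreground, keep_fg) + rng.sample(background, keep_bg)
    return sorted(visible)


rng = random.Random(0)
# Toy grid of 1000 voxel tokens, 100 of which are occupied (hypothetical).
occupancy = [i < 100 for i in range(1000)]
visible = sample_visible_tokens(occupancy, fg_mask_ratio=0.95,
                                bg_mask_ratio=0.99, rng=rng)
overall_mask_ratio = 1.0 - len(visible) / len(occupancy)
```

Only the visible tokens would be passed to the encoder, which is where the training savings of extreme masking come from. With these hypothetical ratios the overall masking ratio lands just under 99%.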
2312.07133
Report |
Text2AC-Zero: Consistent Synthesis of Animated Characters using 2D Diffusion |
Abdelrahman Eldesokey, Peter Wonka |
We propose a zero-shot approach for consistent Text-to-Animated-Characters
synthesis based on pre-trained Text-to-Image (T2I) diffusion models. Existing
Text-to-Video (T2V) methods are expensive to train and require large-scale
video datasets to produce diverse characters and motions. At the same time,
their zero-shot alternatives fail to produce temporally consistent videos. We
strive to bridge this gap, and we introduce a zero-shot approach that produces
temporally consistent videos of animated characters and requires no training or
fine-tuning. We leverage existing text-based motion diffusion models to
generate diverse motions that we utilize to guide a T2I model. To achieve
temporal consistency, we introduce the Spatial Latent Alignment module that
exploits cross-frame dense correspondences that we compute to align the latents
of the video frames. Furthermore, we propose Pixel-Wise Guidance to steer the
diffusion process in a direction that minimizes visual discrepancies. Our
proposed approach generates temporally consistent videos with diverse motions
and styles, outperforming existing zero-shot T2V approaches in terms of
pixel-wise consistency and user preference. |
This paper introduces a zero-shot approach for generating temporally consistent videos of animated characters using pre-trained Text-to-Image (T2I) diffusion models and text-based motion diffusion models. |
Existing Text-to-Video (T2V) methods are computationally expensive to train, require large-scale video datasets, and their zero-shot alternatives fail to produce temporally consistent videos. |
The proposed approach leverages text-based motion diffusion models to generate motion sequences, which are then used to guide a pre-trained T2I model. The authors introduce a Spatial Latent Alignment module to align latent codes between video frames based on cross-frame dense correspondences and a Pixel-Wise Guidance strategy to refine details and further enhance temporal consistency. |
The proposed approach outperforms existing zero-shot T2V approaches in terms of pixel-wise consistency as measured by the introduced Human Mean Squared Error metric.
User studies show a strong preference for videos generated by the proposed method compared to baselines.
The approach allows for control over character motion and style, enabling the generation of videos for scenarios that trained T2V models struggle with. |
The method relies on the accuracy of ControlNet with depth conditioning and can inherit its limitations.
The Pixel-Wise Guidance module, while effective, is computationally demanding in terms of GPU memory usage. |
text-to-video synthesis, diffusion models, zero-shot learning, temporal consistency, animated characters |
2312.07063
Report |
Template Free Reconstruction of Human-object Interaction with Procedural Interaction Generation |
Xianghui Xie, Bharat Lal Bhatnagar, Jan Eric Lenssen, Gerard Pons-Moll |
Reconstructing human-object interaction in 3D from a single RGB image is a
challenging task and existing data driven methods do not generalize beyond the
objects present in the carefully curated 3D interaction datasets. Capturing
large-scale real data to learn strong interaction and 3D shape priors is very
expensive due to the combinatorial nature of human-object interactions. In this
paper, we propose ProciGen (Procedural interaction Generation), a method to
procedurally generate datasets with both plausible interactions and diverse
object variation. We generate 1M+ human-object interaction pairs in 3D and
leverage this large-scale data to train our HDM (Hierarchical Diffusion Model),
a novel method to reconstruct interacting human and unseen objects, without any
templates. Our HDM is an image-conditioned diffusion model that learns both
realistic interaction and highly accurate human and object shapes. Experiments
show that our HDM trained with ProciGen significantly outperforms prior methods
that require template meshes and that our dataset allows training methods with
strong generalization ability to unseen object instances. Our code and data are
released. |
This paper proposes ProciGen, a procedural interaction generation method, and HDM, a hierarchical diffusion model, to reconstruct human-object interactions in 3D from a single RGB image without object templates. |
Existing data-driven methods struggle to generalize beyond curated datasets due to the vast number of possible object shapes and interaction variations. Capturing real data at scale is expensive, creating a need for scalable synthetic data generation. |
ProciGen establishes dense correspondences between objects of the same category to transfer contact points from captured interactions to new object instances. It then jointly optimizes human and object poses to ensure plausible interactions. HDM uses a two-stage diffusion process, first jointly reconstructing human and object point clouds with segmentation labels, then refining them with separate diffusion models incorporating cross-attention to preserve interaction context. |
ProciGen generates a dataset of over 1 million interaction images with 21k+ objects paired with 3D ground truth.
HDM trained with ProciGen outperforms template-based methods like CHORE and template-free methods like PC2 on BEHAVE and InterCap datasets.
Models trained on ProciGen demonstrate strong generalization to unseen objects, even generalizing to in-the-wild images from the COCO dataset. |
The diversity of interaction poses in ProciGen is limited by the seed poses from existing datasets.
HDM struggles to reconstruct accurate human shapes when large portions of the body are occluded. |
human-object interaction, 3d reconstruction, diffusion models, synthetic data generation, template-free |
2312.06971
Report |
CCM: Adding Conditional Controls to Text-to-Image Consistency Models |
Jie Xiao, Kai Zhu, Han Zhang, Zhiheng Liu, Yujun Shen, Yu Liu, Xueyang Fu, Zheng-Jun Zha |
Consistency Models (CMs) have shown promise in creating visual content
efficiently and with high quality. However, the way to add new conditional
controls to the pretrained CMs has not been explored. In this technical report,
we consider alternative strategies for adding ControlNet-like conditional
control to CMs and present three significant findings. 1) ControlNet trained
for diffusion models (DMs) can be directly applied to CMs for high-level
semantic controls but struggles with low-level detail and realism control. 2)
CMs serve as an independent class of generative models, based on which
ControlNet can be trained from scratch using Consistency Training proposed by
Song et al. 3) A lightweight adapter can be jointly optimized under multiple
conditions through Consistency Training, allowing for the swift transfer of
DMs-based ControlNet to CMs. We study these three solutions across various
conditional controls, including edge, depth, human pose, low-resolution image
and masked image with text-to-image latent consistency models. |
This paper explores and compares different strategies for adding ControlNet-like conditional control to Consistency Models (CMs) for image generation. |
CMs are efficient for image generation, but how to effectively add new conditional controls to pretrained CMs remained unexplored. |
The paper investigates three solutions: 1) Directly applying ControlNet trained on diffusion models (DMs) to CMs. 2) Training ControlNet from scratch using consistency training on CMs. 3) Using consistency training to optimize a lightweight adapter for transferring DMs-based ControlNet to CMs. |
Directly applied DM ControlNet can transfer high-level semantic control to CMs but struggles with low-level details and realism.
ControlNet can be successfully trained from scratch on CMs using consistency training, achieving better conditional generation.
A lightweight adapter trained with consistency training can effectively bridge the gap between DMs and CMs, improving the transferability of ControlNet. |
The study primarily focuses on visual quality without quantitative comparisons.
Future work could explore more sophisticated adapter architectures or training strategies for improved transfer learning. |
consistency models, controlnet, image generation, conditional image synthesis, transfer learning |
2312.06947
Report |
MaTe3D: Mask-guided Text-based 3D-aware Portrait Editing |
Kangneng Zhou, Daiheng Gao, Xuan Wang, Jie Zhang, Peng Zhang, Xusen Sun, Longhao Zhang, Shiqi Yang, Bang Zhang, Liefeng Bo, Yaxing Wang, Ming-Ming Cheng |
3D-aware portrait editing has a wide range of applications in multiple
fields. However, current approaches are limited in that they can only perform
mask-guided or text-based editing. Even by fusing the two procedures into a
model, the editing quality and stability cannot be ensured. To address this
limitation, we propose MaTe3D: mask-guided text-based 3D-aware
portrait editing. In this framework, first, we introduce a new SDF-based 3D
generator which learns local and global representations with proposed SDF and
density consistency losses. This enhances mask-based editing in local areas;
second, we present a novel distillation strategy: Conditional Distillation on
Geometry and Texture (CDGT). Compared to existing distillation strategies, it
mitigates visual ambiguity and avoids mismatch between texture and geometry,
thereby producing stable texture and convincing geometry while editing.
Additionally, we create the CatMask-HQ dataset, a large-scale high-resolution
cat face annotation for exploration of model generalization and expansion. We
perform extensive experiments on both the FFHQ and CatMask-HQ datasets to
demonstrate the editing quality and stability of the proposed method. Our
method faithfully generates a 3D-aware edited face image based on a modified
mask and a text prompt. Our code and models will be publicly released. |
Proposes MaTe3D, a novel framework for mask-guided text-based 3D-aware portrait editing, enabling high-quality and stable manipulation of portraits using both masks and text prompts. |
Addresses limitations of existing 3D portrait editing methods that struggle to effectively combine mask-guided and text-based manipulation in a single model, often resulting in unstable texture or unconvincing geometry. |
Introduces a new SDF-based 3D generator with SDF and density consistency losses for accurate local and global representation learning. Develops Conditional Distillation on Geometry and Texture (CDGT) to iteratively refine masks and combine gradients from images and normal maps, ensuring stable texture and convincing geometry during editing. |
Achieves high-fidelity 3D portrait editing with accurate masks and text-driven modifications, outperforming existing methods in qualitative and quantitative comparisons.
Demonstrates superior geometry reconstruction quality compared to IDE-3D, as evidenced by significantly lower Chamfer-L1 distances and higher normal consistency scores.
Enables applications like real portrait editing, out-of-domain editing (e.g., adding animal textures to faces), and face swapping with celebrities. |
Image quality may slightly deteriorate when prioritizing geometry learning through SDFs.
Editing process is more time-consuming than baseline methods due to the iterative optimization strategy. |
3d portrait editing, mask-guided editing, text-guided editing, diffusion models, score distillation sampling |
2312.06742
Report |
Honeybee: Locality-enhanced Projector for Multimodal LLM |
Junbum Cha, Wooyoung Kang, Jonghwan Mun, Byungseok Roh |
In Multimodal Large Language Models (MLLMs), a visual projector plays a
crucial role in bridging pre-trained vision encoders with LLMs, enabling
profound visual understanding while harnessing the LLMs' robust capabilities.
Despite the importance of the visual projector, it has been relatively less
explored. In this study, we first identify two essential projector properties:
(i) flexibility in managing the number of visual tokens, crucial for MLLMs'
overall efficiency, and (ii) preservation of local context from visual
features, vital for spatial understanding. Based on these findings, we propose
a novel projector design that is both flexible and locality-enhanced,
effectively satisfying the two desirable properties. Additionally, we present
comprehensive strategies to effectively utilize multiple and multifaceted
instruction datasets. Through extensive experiments, we examine the impact of
individual design choices. Finally, our proposed MLLM, Honeybee, remarkably
outperforms previous state-of-the-art methods across various benchmarks,
including MME, MMBench, SEED-Bench, and LLaVA-Bench, achieving significantly
higher efficiency. Code and models are available at
https://github.com/kakaobrain/honeybee. |
This paper proposes Honeybee, a Multimodal Large Language Model (MLLM) that features a novel locality-enhanced projector. This projector aims to bridge the gap between pre-trained vision encoders and LLMs, enhancing visual understanding and efficiency. |
Existing MLLMs often struggle with balancing efficiency and the preservation of local visual context. This work addresses these limitations to improve performance in tasks like spatial understanding. |
The authors introduce two types of locality-enhanced projectors: Convolutional Abstractor (C-Abstractor) and Deformable attention-based Abstractor (D-Abstractor). They also perform extensive experiments to investigate optimal strategies for utilizing and combining diverse instruction datasets. |
Locality-enhanced projectors demonstrate superior performance in spatial understanding tasks compared to traditional linear projectors and abstractors.
Honeybee achieves state-of-the-art results on several MLLM benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench.
The study highlights the importance of dataset diversity, balanced training, and fine-grained template selection in visual instruction tuning. |
The impact of further architectural variations in projectors beyond the explored designs remains to be investigated.
Exploring advanced applications of techniques like LoRA for more efficient LLM training could be beneficial. |
multimodal large language models, visual instruction tuning, locality-enhanced projector, spatial understanding, instruction dataset utilization |
2312.06739
Report |
SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models |
Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, Ying Shan |
Current instruction-based editing methods, such as InstructPix2Pix, often
fail to produce satisfactory results in complex scenarios due to their
dependence on the simple CLIP text encoder in diffusion models. To rectify
this, this paper introduces SmartEdit, a novel approach to instruction-based
image editing that leverages Multimodal Large Language Models (MLLMs) to
enhance their understanding and reasoning capabilities. However, direct
integration of these elements still faces challenges in situations requiring
complex reasoning. To mitigate this, we propose a Bidirectional Interaction
Module that enables comprehensive bidirectional information interactions
between the input image and the MLLM output. During training, we initially
incorporate perception data to boost the perception and understanding
capabilities of diffusion models. Subsequently, we demonstrate that a small
amount of complex instruction editing data can effectively stimulate
SmartEdit's editing capabilities for more complex instructions. We further
construct a new evaluation dataset, Reason-Edit, specifically tailored for
complex instruction-based image editing. Both quantitative and qualitative
results on this evaluation dataset indicate that our SmartEdit surpasses
previous methods, paving the way for the practical application of complex
instruction-based image editing. |
SmartEdit is an instruction-based image editing model that leverages Multimodal Large Language Models (MLLMs) to enhance understanding and reasoning in complex editing scenarios. |
Existing methods struggle with complex instructions that involve multiple objects, specific attributes, or require world knowledge. SmartEdit addresses this limitation to improve the practicality of instruction-based editing. |
SmartEdit integrates an MLLM (LLaVA) with a diffusion model, using a novel Bidirectional Interaction Module (BIM) for enhanced image-text feature interaction. It is trained on a dataset combining editing data, segmentation data, and synthesized complex editing pairs. |
SmartEdit outperforms previous methods in complex understanding and reasoning scenarios, as shown on the newly collected Reason-Edit dataset.
The BIM module proves crucial for enabling effective bidirectional information interaction.
Joint training with diverse datasets, including synthetic complex editing data, significantly improves performance. |
Evaluation metrics like CLIP Score and PSNR/SSIM/LPIPS may not perfectly align with human perception of editing quality.
Data synthesis for complex scenarios can be challenging and might benefit from further exploration of automatic generation methods. |
image editing, instruction-based editing, multimodal large language models, diffusion models, reasoning |
2312.06731
Report |
Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator |
Henry Hengyuan Zhao, Pan Zhou, Mike Zheng Shou |
Multimodal Large Language Models (MLLMs) demonstrate exceptional
problem-solving capabilities, but there is limited research focusing on their
ability to generate data by converting unlabeled images into visual instruction
tuning data. To this end, this paper is the first to explore the potential of
empowering MLLM to generate data rather than prompting GPT-4. We introduce
Genixer, a holistic data generation pipeline consisting of four key steps: (i)
instruction data collection, (ii) instruction template design, (iii) empowering
MLLMs, and (iv) data generation and filtering. Additionally, we outline two
modes of data generation: task-agnostic and task-specific, enabling
controllable output. We demonstrate that training LLaVA1.5 with a synthetic
VQA-like dataset enhances performance on 10 out of 12 multimodal benchmarks.
Additionally, the grounding MLLM Shikra, when trained with a REC-like synthetic
dataset, shows improvements on 7 out of 8 REC datasets. Through experiments and
synthetic data analysis, our findings are: (1) current MLLMs can serve as
robust data generators without assistance from GPT-4V; (2) MLLMs trained with
task-specific datasets can surpass GPT-4V in generating complex instruction
tuning data; (3) synthetic datasets enhance performance across various
multimodal benchmarks and help mitigate model hallucinations. The data, code,
and models can be found at https://github.com/zhaohengyuan1/Genixer. |
This paper introduces Genixer, an automatic data generation pipeline to produce high-quality instruction tuning data from unlabeled images using Multimodal Large Language Models (MLLMs). |
Current methods for creating visual instruction data for MLLMs are limited by either image diversity or the cost and capabilities of prompting GPT-4. |
Genixer consists of: (i) instruction data collection from various VL tasks, (ii) two-level instruction template design for task-specific/agnostic generation, (iii) empowering MLLMs (LLaVA1.5 and Shikra) for data generation, and (iv) automatic data filtering pipelines (Fuyu/CLIP-driven). |
MLLMs trained with Genixer can generate high-quality visual instruction tuning data comparable to GPT-4V without extra cost.
MLLMs trained with Genixer outperform GPT-4V in generating complex instruction data for tasks like REC.
Synthetic datasets from Genixer improve MLLM performance on various benchmarks and mitigate model hallucinations. |
The study is limited by computational constraints for testing larger LLM scales (e.g., 13B or 34B).
Evaluating complex and open-ended data types like Referential Dialogue remains a challenge. |
multimodal large language model, instruction tuning, data generation, synthetic data, visual question answering |
2312.06725
Report |
EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion |
Zehuan Huang, Hao Wen, Junting Dong, Yaohui Wang, Yangguang Li, Xinyuan Chen, Yan-Pei Cao, Ding Liang, Yu Qiao, Bo Dai, Lu Sheng |
Generating multiview images from a single view facilitates the rapid
generation of a 3D mesh conditioned on a single image. Recent methods that
introduce 3D global representation into diffusion models have shown the
potential to generate consistent multiviews, but they have reduced generation
speed and face challenges in maintaining generalizability and quality. To
address this issue, we propose EpiDiff, a localized interactive multiview
diffusion model. At the core of the proposed approach is to insert a
lightweight epipolar attention block into the frozen diffusion model,
leveraging epipolar constraints to enable cross-view interaction among feature
maps of neighboring views. The newly initialized 3D modeling module preserves
the original feature distribution of the diffusion model, exhibiting
compatibility with a variety of base diffusion models. Experiments show that
EpiDiff generates 16 multiview images in just 12 seconds, and it surpasses
previous methods in quality evaluation metrics, including PSNR, SSIM and LPIPS.
Additionally, EpiDiff can generate a more diverse distribution of views,
improving the reconstruction quality from generated multiviews. Please see our
project page at https://huanngzh.github.io/EpiDiff/. |
Presents EpiDiff, a localized interactive multiview diffusion model that efficiently generates multi-view-consistent, high-quality images from a single view. |
Generating multiview images from a single view is crucial for rapid 3D mesh generation but existing methods are slow or struggle to maintain quality and generalizability. |
A lightweight epipolar attention block is inserted into a frozen diffusion model (Zero123) to enable cross-view interaction among neighboring views using epipolar constraints. |
Generates 16 multi-view images in 12 seconds.
Outperforms previous methods in PSNR, SSIM and LPIPS.
Generates more diverse views, leading to better 3D reconstructions. |
Less effective for views far from the input view due to base model limitations.
Two-step process (synthesis then reconstruction) could be unified. |
multi-view synthesis, diffusion models, epipolar geometry, 3d reconstruction, single-view reconstruction |
2312.06713
Report |
TeTriRF: Temporal Tri-Plane Radiance Fields for Efficient Free-Viewpoint Video |
Minye Wu, Zehao Wang, Georgios Kouros, Tinne Tuytelaars |
Neural Radiance Fields (NeRF) revolutionize the realm of visual media by
providing photorealistic Free-Viewpoint Video (FVV) experiences, offering
viewers unparalleled immersion and interactivity. However, the technology's
significant storage requirements and the computational complexity involved in
generation and rendering currently limit its broader application. To close this
gap, this paper presents Temporal Tri-Plane Radiance Fields (TeTriRF), a novel
technology that significantly reduces the storage size for Free-Viewpoint Video
(FVV) while maintaining low-cost generation and rendering. TeTriRF introduces a
hybrid representation with tri-planes and voxel grids to support scaling up to
long-duration sequences and scenes with complex motions or rapid changes. We
propose a group training scheme tailored to achieving high training efficiency
and yielding temporally consistent, low-entropy scene representations.
Leveraging these properties of the representations, we introduce a compression
pipeline with off-the-shelf video codecs, achieving an order of magnitude less
storage size compared to the state-of-the-art. Our experiments demonstrate that
TeTriRF can achieve competitive quality with a higher compression rate. |
Presents TeTriRF, a novel FVV modeling approach using Temporal Tri-Plane Radiance Fields for efficient generation and rendering with compact storage. |
Addresses limitations of existing NeRF-based FVV techniques that suffer from large storage requirements and high computational complexity, hindering their application in long-duration sequences and complex scenes. |
Introduces a hybrid representation (tri-planes and voxel grids) and a grouped multi-frame training scheme with intra- and inter-group regularization for temporally consistent and low-entropy representations. It leverages off-the-shelf video codecs (HEVC) for efficient compression. |
Achieves competitive rendering quality with significantly reduced storage (10-100 KB/frame) compared to state-of-the-art methods.
Demonstrates superior time efficiency in both training and rendering, enabling real-time playback.
Successfully handles long sequence FVV, effectively capturing intricate details in dynamic scenes with complex motions. |
Slight quality drop observed in some cases due to limitations of the utilized dataset.
Future work includes exploring alternative video codecs and optimizing rendering for real-time performance on diverse devices using GLSL shaders. |
neural radiance fields, free-viewpoint video, data compression, hybrid representation, video encoding |
2312.06712
Report |
Separate-and-Enhance: Compositional Finetuning for Text2Image Diffusion Models |
Zhipeng Bao, Yijun Li, Krishna Kumar Singh, Yu-Xiong Wang, Martial Hebert |
Despite recent significant strides achieved by diffusion-based Text-to-Image
(T2I) models, current systems are still less capable of ensuring decent
compositional generation aligned with text prompts, particularly for the
multi-object generation. This work illuminates the fundamental reasons for such
misalignment, pinpointing issues related to low attention activation scores and
mask overlaps. While previous research efforts have individually tackled these
issues, we assert that a holistic approach is paramount. Thus, we propose two
novel objectives, the Separate loss and the Enhance loss, that reduce object
mask overlaps and maximize attention scores, respectively. Our method diverges
from conventional test-time-adaptation techniques, focusing on finetuning
critical parameters, which enhances scalability and generalizability.
Comprehensive evaluations demonstrate the superior performance of our model in
terms of image realism, text-image alignment, and adaptability, notably
outperforming prominent baselines. Ultimately, this research paves the way for
T2I diffusion models with enhanced compositional capacities and broader
applicability. |
This paper introduces Separate-and-Enhance, a compositional finetuning strategy for diffusion-based Text-to-Image (T2I) models to address the issue of compositional misalignment in image generation. |
Existing T2I models struggle to generate images with multiple objects accurately, often exhibiting misalignment between the generated image and the text prompt. This work aims to enhance the compositional capacity of these models, improving their ability to generate images with multiple objects that accurately reflect the input text. |
The authors propose two novel objectives: 1) Separate loss, which minimizes the overlap between attention masks of different objects, and 2) Enhance loss, which maximizes the attention activation scores for each object. They selectively finetune specific parameters, primarily the Key mapping functions in the cross-attention modules of the diffusion model, to optimize these objectives. |
The proposed Separate-and-Enhance method achieves superior text-image alignment and image realism compared to existing state-of-the-art T2I models.
The method demonstrates scalability and effectiveness when trained on a large collection of concepts.
The finetuned model exhibits strong generalization ability, effectively generating images from prompts containing unseen concept combinations. |
The model exhibits limitations in discerning the meaning of polysemous words.
Future work could explore incorporating a more robust language model and implementing a more diverse training process to address the polysemy challenge. |
text-to-image synthesis, diffusion models, compositional generation, attention mechanisms, fine-tuning |
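The two objectives in the Separate-and-Enhance record above can be illustrated on toy attention maps. This is only a minimal sketch: the record does not give the exact formulas, so the product-sum overlap term and the one-minus-peak activation term below are assumptions chosen for illustration, and the names `separate_loss` and `enhance_loss` are hypothetical.

```python
from itertools import combinations


def separate_loss(maps):
    """Assumed overlap penalty: for every pair of objects, sum the
    elementwise product of their flattened attention maps, then
    average over pairs. Disjoint maps give 0."""
    pairs = list(combinations(maps, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        total += sum(x * y for x, y in zip(a, b))
    return total / len(pairs)


def enhance_loss(maps):
    """Assumed activation term: 1 minus the peak attention value of
    each object, averaged over objects. Minimizing it pushes each
    object's strongest activation towards 1."""
    return sum(1.0 - max(m) for m in maps) / len(maps)


if __name__ == "__main__":
    # Two objects on a flattened 2x2 attention grid.
    disjoint = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
    overlapping = [[0.5, 0.5, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]
    print(separate_loss(disjoint), enhance_loss(disjoint))
    print(separate_loss(overlapping), enhance_loss(overlapping))
```

In this toy setting, well-separated and strongly activated maps drive both terms to zero, while overlapping, weakly activated maps are penalized by both.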
2312.06709
Report |
AM-RADIO: Agglomerative Vision Foundation Model -- Reduce All Domains Into One |
Mike Ranzinger, Greg Heinrich, Jan Kautz, Pavlo Molchanov |
A handful of visual foundation models (VFMs) have recently emerged as the
backbones for numerous downstream tasks. VFMs like CLIP, DINOv2, SAM are
trained with distinct objectives, exhibiting unique characteristics for various
downstream tasks. We find that despite their conceptual differences, these
models can be effectively merged into a unified model through multi-teacher
distillation. We name this approach AM-RADIO (Agglomerative Model -- Reduce All
Domains Into One). This integrative approach not only surpasses the performance
of individual teacher models but also amalgamates their distinctive features,
such as zero-shot vision-language comprehension, detailed pixel-level
understanding, and open vocabulary segmentation capabilities. In pursuit of the
most hardware-efficient backbone, we evaluated numerous architectures in our
multi-teacher distillation pipeline using the same training recipe. This led to
the development of a novel architecture (E-RADIO) that exceeds the performance
of its predecessors and is at least 7x faster than the teacher models. Our
comprehensive benchmarking process covers downstream tasks including ImageNet
classification, ADE20k semantic segmentation, COCO object detection and
LLaVa-1.5 framework.
Code: https://github.com/NVlabs/RADIO |
The paper introduces AM-RADIO, a multi-teacher distillation framework for training a single vision foundation model from scratch using multiple pretrained VFMs (CLIP, DINOv2, SAM) as teachers, resulting in a model that combines their strengths and often surpasses them. |
Existing VFMs excel in specific domains (e.g., zero-shot learning, dense tasks) but lack comprehensive capabilities. AM-RADIO addresses this by creating a unified model that inherits and surpasses the strengths of individual teacher models. |
AM-RADIO distills knowledge from multiple teacher VFMs by matching student and teacher feature representations using adaptor heads and a combination of cosine similarity and smooth L1 loss. It addresses challenges like input resolution mismatch and efficient training. |
AM-RADIO models outperform teacher models on various benchmarks, including ImageNet classification, semantic segmentation, and visual question answering.
The framework allows for flexibility in student architecture, leading to the development of E-RADIO, a novel efficient architecture that achieves high throughput without sacrificing accuracy.
The study highlights the importance of full feature distillation for dense tasks and the complementary strengths of different teacher models. |
The partitioned training scheme for different teacher objectives might lead to latent resolution-dependent modes in the student model.
Future work includes exploring more sophisticated loss balancing techniques and student adaptor head architectures. |
knowledge distillation, multi-teacher distillation, vision foundation models, efficient architectures, visual question answering |
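The feature-matching objective mentioned in the AM-RADIO record above (a combination of cosine similarity and smooth L1 loss) can be sketched for a single pair of feature vectors. This is an illustrative sketch, not the paper's implementation: the mixing weight `alpha`, the threshold `beta`, and all function names are assumptions.

```python
import math


def cosine_distance(student, teacher):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(s * t for s, t in zip(student, teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    return 1.0 - dot / (norm_s * norm_t)


def smooth_l1(student, teacher, beta=1.0):
    """Mean smooth L1 (Huber-style) loss: quadratic for small
    differences, linear beyond the threshold beta."""
    total = 0.0
    for s, t in zip(student, teacher):
        d = abs(s - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(student)


def feature_matching_loss(student, teacher, alpha=0.9, beta=1.0):
    """Weighted sum of the two terms; alpha is a hypothetical weight."""
    return (alpha * cosine_distance(student, teacher)
            + (1.0 - alpha) * smooth_l1(student, teacher, beta))


if __name__ == "__main__":
    teacher = [1.0, 2.0, 3.0]
    print(feature_matching_loss([1.0, 2.0, 3.0], teacher))  # close to 0
    print(feature_matching_loss([3.0, 2.0, 1.0], teacher))  # larger
```

With several teachers, one such loss per teacher (each computed on the output of its own adaptor head) would be summed; how the terms are balanced is left open here.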
2312.06708
Report |
Neutral Editing Framework for Diffusion-based Video Editing |
Sunjae Yoon, Gwanhyeong Koo, Ji Woo Hong, Chang D. Yoo |
Text-conditioned image editing has succeeded in various types of editing
based on a diffusion framework. Unfortunately, this success did not carry over
to video, which continues to be challenging. Existing video editing systems
are still limited to rigid-type editing such as style transfer and object
overlay. To this end, this paper proposes the Neutral Editing (NeuEdit) framework
to enable complex non-rigid editing by changing the motion of a person/object
in a video, which has never been attempted before. NeuEdit introduces a concept
of 'neutralization' that enhances a tuning-editing process of diffusion-based
editing systems in a model-agnostic manner by leveraging input video and text
without any other auxiliary aids (e.g., visual masks, video captions).
Extensive experiments on numerous videos demonstrate adaptability and
effectiveness of the NeuEdit framework. The website of our work is available
here: https://neuedit.github.io |
Presents Neutral Editing (NeuEdit), a framework for complex non-rigid video editing (e.g., changing object motion) using text prompts. |
Existing video editing methods struggle with non-rigid edits, often limited to rigid transformations like style transfer and object overlay. |
Introduces 'neutralization', reducing irrelevant content influence during model tuning and editing. Utilizes text and video analysis to identify and disentangle editing factors, generating 'neutral prompts' and 'neutral videos'. |
Significantly improves textual alignment with target prompts, enabling edits like changing a person's pose or an object's motion.
Maintains higher fidelity to unedited regions compared to existing methods.
Demonstrates consistent performance across different video editing models and datasets. |
Editing can be biased, unintentionally changing scene context related to the desired attribute.
Editing moving objects to become still is challenging due to temporal consistency constraints in video diffusion models. |
video editing, diffusion models, text-guided synthesis, non-rigid transformation, content disentanglement |
2312.06706
Report |
UNeR3D: Versatile and Scalable 3D RGB Point Cloud Generation from 2D Images in Unsupervised Reconstruction |
Hongbin Lin, Juangui Xu, Qingfeng Xu, Zhengyu Hu, Handing Xu, Yunzhi Chen, Yongjun Hu, Zhenguo Nie |
In the realm of 3D reconstruction from 2D images, a persisting challenge is
to achieve high-precision reconstructions devoid of 3D Ground Truth data
reliance. We present UNeR3D, a pioneering unsupervised methodology that sets a
new standard for generating detailed 3D reconstructions solely from 2D views.
Our model significantly cuts down the training costs tied to supervised
approaches and introduces RGB coloration to 3D point clouds, enriching the
visual experience. Employing an inverse distance weighting technique for color
rendering, UNeR3D ensures seamless color transitions, enhancing visual
fidelity. Our model's flexible architecture supports training with any number
of views, and uniquely, it is not constrained by the number of views used
during training when performing reconstructions. It can infer with an arbitrary
count of views during inference, offering unparalleled versatility.
Additionally, the model's continuous spatial input domain allows the generation
of point clouds at any desired resolution, empowering the creation of
high-resolution 3D RGB point clouds. We solidify the reconstruction process
with a novel multi-view geometric loss and color loss, demonstrating that our
model excels with single-view inputs and beyond, thus reshaping the paradigm of
unsupervised learning in 3D vision. Our contributions signal a substantial leap
forward in 3D vision, offering new horizons for content creation across diverse
applications. Code is available at https://github.com/HongbinLin3589/UNeR3D. |
UNeR3D: An unsupervised learning methodology for generating detailed 3D reconstructions (RGB point clouds) solely from 2D views. |
Addresses the limitations of supervised 3D reconstruction methods that heavily rely on costly and time-consuming 3D Ground Truth data. |
Combines neural radiance fields with a knn-based inverse distance weighting scheme, employing ResNet34 for feature extraction and a specialized MLP for point processing. Leverages multi-view geometric and color losses for training, allowing for single or multi-view reconstruction. |
Achieves high-fidelity 3D reconstructions without 3D ground truth data during training.
Introduces RGB color attributes to point clouds within the NeRF framework.
Enables flexible reconstruction from a variable number of input views, including single-view reconstruction. |
Generated 2D views may exhibit artifacts due to the model's generalizability.
Distinguishing foreground and background elements in complex scenes can be challenging. |
3d reconstruction, unsupervised learning, neural radiance fields, point cloud generation, inverse distance weighting |
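The kNN-based inverse distance weighting mentioned in the UNeR3D report can be illustrated with a minimal sketch. The function name `idw_color`, the neighbor count `k`, the exponent `power`, and the stabilizer `eps` are all assumptions made for illustration; the report does not specify them, and this is not the authors' implementation.

```python
import math

def idw_color(query, points, colors, k=3, power=2.0, eps=1e-8):
    """Blend the RGB colors of the k nearest points, weighted by inverse distance.

    Hypothetical helper: k, power, and eps are illustrative choices.
    """
    # Euclidean distance from the query to every known point.
    dists = [math.dist(query, p) for p in points]
    # Indices of the k nearest neighbors.
    nearest = sorted(range(len(points)), key=lambda i: dists[i])[:k]
    # Closer neighbors get larger weights; eps avoids division by zero.
    weights = [1.0 / (dists[i] ** power + eps) for i in nearest]
    total = sum(weights)
    # Weighted average per color channel.
    return tuple(
        sum(w * colors[i][c] for w, i in zip(weights, nearest)) / total
        for c in range(3)
    )

points = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
colors = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 0.0)]
midpoint_color = idw_color((1.0, 0.0, 0.0), points, colors, k=2)
```

Because the weights vary smoothly with distance, a query halfway between a red and a blue point receives an even mix of the two, which is the kind of seamless color transition the abstract describes.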
2312.06704
Report |
SIFU: Side-view Conditioned Implicit Function for Real-world Usable Clothed Human Reconstruction |
Zechuan Zhang, Zongxin Yang, Yi Yang |
Creating high-quality 3D models of clothed humans from single images for
real-world applications is crucial. Despite recent advancements, accurately
reconstructing humans in complex poses or with loose clothing from in-the-wild
images, along with predicting textures for unseen areas, remains a significant
challenge. A key limitation of previous methods is their insufficient prior
guidance in transitioning from 2D to 3D and in texture prediction. In response,
we introduce SIFU (Side-view Conditioned Implicit Function for Real-world
Usable Clothed Human Reconstruction), a novel approach combining a Side-view
Decoupling Transformer with a 3D Consistent Texture Refinement pipeline. SIFU
employs a cross-attention mechanism within the transformer, using SMPL-X
normals as queries to effectively decouple side-view features in the process of
mapping 2D features to 3D. This method not only improves the precision of the
3D models but also their robustness, especially when SMPL-X estimates are not
perfect. Our texture refinement process leverages a text-to-image diffusion-based
prior to generate realistic and consistent textures for invisible views.
Through extensive experiments, SIFU surpasses SOTA methods in both geometry and
texture reconstruction, showcasing enhanced robustness in complex scenarios and
achieving unprecedented Chamfer and P2S measurements. Our approach extends to
practical applications such as 3D printing and scene building, demonstrating
its broad utility in real-world scenarios. Project page
https://river-zhang.github.io/SIFU-projectpage/ . |
This paper proposes SIFU, a novel method using a Side-view Conditioned Implicit Function with a 3D Consistent Texture Refinement pipeline for high-quality reconstruction of clothed humans from single images, suitable for real-world applications like 3D printing and scene creation. |
Creating realistic 3D human models from single images is crucial for various applications. However, existing methods struggle with complex poses, loose clothing, and texture prediction for unseen areas. |
SIFU uses a side-view decoupling transformer guided by SMPL-X normals to extract precise 3D features. A 3D Consistent Texture Refinement process then uses text-to-image diffusion models and consistent editing for detailed, consistent textures. |
SIFU outperforms SOTA methods in geometry and texture quality, achieving Chamfer and P2S distances of 0.6 cm on THuman2.0.
Shows improved robustness in geometry reconstruction, even with inaccurate SMPL-X estimations.
Effectively handles complex poses and loose clothing, producing realistic and consistent textures. |
Reconstruction accuracy can be affected by inaccuracies in SMPL-X estimation.
The method may struggle with clothing significantly separated from the body. Future work can explore diffusion models for both shape and texture, and refine reconstruction of specific body parts. |
3d human reconstruction, implicit function, texture refinement, diffusion models, single-image reconstruction |
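The Chamfer distance cited in the SIFU report measures how far two point sets are from each other. The sketch below uses one common convention (the mean of the two one-way average nearest-neighbor distances); this convention is an assumption, since the report does not state which variant the authors use, and the function name `chamfer_distance` is hypothetical.

```python
import math

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two point sets.

    Assumed convention: average of the two one-way mean
    nearest-neighbor Euclidean distances.
    """
    def one_way(src, dst):
        # For each source point, distance to its nearest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (one_way(points_a, points_b) + one_way(points_b, points_a))

reconstruction = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0)]
distance = chamfer_distance(reconstruction, ground_truth)
```

This brute-force version is quadratic in the number of points; evaluation on dense scans would normally use a spatial index instead.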
2312.06703
Report |
OpenSD: Unified Open-Vocabulary Segmentation and Detection |
Shuai Li, Minghan Li, Pengfei Wang, Lei Zhang |
Recently, a few open-vocabulary methods have been proposed by employing a
unified architecture to tackle generic segmentation and detection tasks.
However, their performance still lags behind the task-specific models due to
the conflict between different tasks, and their open-vocabulary capability is
limited due to the inadequate use of CLIP. To address these challenges, we
present a universal transformer-based framework, abbreviated as OpenSD, which
utilizes the same architecture and network parameters to handle open-vocabulary
segmentation and detection tasks. First, we introduce a decoder decoupled
learning strategy to alleviate the semantic conflict between thing and stuff
categories so that each individual task can be learned more effectively under
the same framework. Second, to better leverage CLIP for end-to-end segmentation
and detection, we propose dual classifiers to handle the in-vocabulary domain
and out-of-vocabulary domain, respectively. The text encoder is further trained
to be region-aware for both thing and stuff categories through decoupled prompt
learning, enabling them to filter out duplicated and low-quality predictions,
which is important to end-to-end segmentation and detection. Extensive
experiments are conducted on multiple datasets under various circumstances. The
results demonstrate that OpenSD outperforms state-of-the-art open-vocabulary
segmentation and detection methods in both closed- and open-vocabulary
settings. Code is available at https://github.com/strongwolf/OpenSD |
This paper introduces OpenSD, a unified transformer-based framework for open-vocabulary segmentation and detection, employing the same architecture and parameters for both tasks. |
Existing open-vocabulary segmentation and detection methods often lag behind task-specific models and underutilize CLIP. OpenSD aims to overcome these limitations by offering a single potent framework for these tasks. |
OpenSD uses a two-stage pipeline: it first generates object masks and boxes, then classifies them with dual classifiers, one for the in-vocabulary domain and one for the out-of-vocabulary domain. Decoupled decoder learning and region-aware prompt learning for the dual classifiers further improve performance. |
OpenSD outperforms state-of-the-art open-vocabulary segmentation and detection methods in both closed and open-vocabulary settings.
The decoupled decoder learning strategy effectively mitigates conflicts between different tasks, leading to performance gains.
The region-aware dual classifiers, particularly the out-of-vocabulary classifier leveraging CLIP, significantly enhance performance in open-vocabulary settings. |
The model currently relies on ensembling in-vocabulary and out-of-vocabulary classifiers, which could be streamlined.
Future work can explore extending OpenSD to other vision tasks like image captioning or visual question answering. |
open-vocabulary, segmentation, detection, transformer, clip |
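The OpenSD report notes that the model ensembles its in-vocabulary and out-of-vocabulary classifiers but does not say how. One plausible scheme, shown here purely as an assumption, is a weighted geometric mean that trusts the out-of-vocabulary classifier more for categories unseen during training. The function name `ensemble_scores` and the weights `alpha` and `beta` are hypothetical.

```python
def ensemble_scores(p_in, p_out, seen, alpha=0.3, beta=0.7):
    """Fuse per-category probabilities from two classifiers.

    Assumed scheme (not confirmed by the report): a geometric mean whose
    weight on the out-of-vocabulary classifier is alpha for categories seen
    in training and beta for unseen ones.
    """
    fused = []
    for prob_in, prob_out, is_seen in zip(p_in, p_out, seen):
        weight_out = alpha if is_seen else beta
        fused.append(prob_in ** (1.0 - weight_out) * prob_out ** weight_out)
    # Renormalize so the fused scores form a distribution again.
    total = sum(fused)
    return [score / total for score in fused]

p_in = [0.9, 0.1]       # in-vocabulary classifier output
p_out = [0.2, 0.8]      # out-of-vocabulary (CLIP-based) classifier output
seen = [True, False]    # whether each category appeared in training
fused = ensemble_scores(p_in, p_out, seen)
```

Setting both weights to 0 recovers the in-vocabulary classifier alone, and setting both to 1 recovers the out-of-vocabulary classifier alone, so the weights interpolate between the two.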
2312.06680
Report |
Perceptual Similarity guidance and text guidance optimization for Editing Real Images using Guided Diffusion Models |
Ruichen Zhang |
When using a diffusion model for image editing, there are times when the
modified image can differ greatly from the source. To address this, we apply a
dual-guidance approach to maintain high fidelity to the original in areas that
are not altered. First, we employ text-guided optimization, using text
embeddings to direct latent space and classifier-free guidance. Second, we use
perceptual similarity guidance, optimizing latent vectors with posterior
sampling via Tweedie's formula during the reverse process. This method ensures
the realistic rendering of both the edited elements and the preservation of the
unedited parts of the original image. |
This paper introduces a novel dual-guidance approach for enhancing real image editing using diffusion models, combining text guidance optimization and perceptual similarity guidance to maintain fidelity to the original image in unaltered areas. |
Existing image editing methods with diffusion models often struggle to balance incorporating edits suggested by the new text prompt while preserving the structure and details of the original image. This dual-guidance approach aims to address this limitation. |
The method leverages text embeddings for text-guided optimization, employing classifier-free guidance to steer latent space manipulation. Additionally, it utilizes perceptual similarity guidance with posterior sampling via Tweedie's formula during the reverse diffusion process, ensuring realistic rendering and preserving unedited parts of the image. |
The combined Perceptual Similarity and text optimization approach demonstrates superior CLIPScore, indicating improved alignment between edited images and the new text prompt.
While all methods maintain PSNR scores around 20, suggesting visually perceptible differences, Perceptual Similarity + text optimization excels at preserving original image details, as corroborated by user evaluations.
Although LPIPS values remain comparable across methods, indicating minor overall image differences, variations in detail preservation are evident. |
The current image domain guidance relies on whole-image comparisons, potentially leading to distortions in extensively edited images. Future work might involve more localized comparisons to address this.
Limitations stemming from Stable Diffusion and Prompt-to-Prompt editing, particularly inaccurate text-image alignment, present further avenues for refinement. |
image editing, diffusion models, text-guided image editing, perceptual similarity, classifier-free guidance |
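The two ingredients named in this report, classifier-free guidance and posterior sampling via Tweedie's formula, can be sketched as follows. The sketch assumes a DDPM-style variance-preserving forward process, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, which the report does not state explicitly; the function names and the flat-list representation of latents are illustrative only.

```python
import math

def cfg_noise(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    noise prediction toward the text-conditioned one."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

def tweedie_x0(x_t, eps, alpha_bar_t):
    """Tweedie-style estimate of the clean sample from a noisy latent.

    Assumes x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
    """
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [(x - noise * e) / signal for x, e in zip(x_t, eps)]

# Toy check: noise a known x_0, then recover it from the true noise.
x0 = [0.5, -1.0]
eps = [0.1, 0.2]
alpha_bar = 0.64
x_t = [math.sqrt(alpha_bar) * a + math.sqrt(1 - alpha_bar) * e
       for a, e in zip(x0, eps)]
x0_hat = tweedie_x0(x_t, eps, alpha_bar)
```

A perceptual similarity guidance step would compare an estimate like `x0_hat` against the source image and nudge the latent accordingly; that comparison is omitted here because the report does not give its exact form.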
2312.06663
Report |
CAD: Photorealistic 3D Generation via Adversarial Distillation |
Ziyu Wan, Despoina Paschalidou, Ian Huang, Hongyu Liu, Bokui Shen, Xiaoyu Xiang, Jing Liao, Leonidas Guibas |
The increased demand for 3D data in AR/VR, robotics, and gaming applications
has given rise to powerful generative pipelines capable of synthesizing high-quality
3D objects. Most of these models rely on the Score Distillation Sampling (SDS)
algorithm to optimize a 3D representation such that the rendered image
maintains a high likelihood as evaluated by a pre-trained diffusion model.
However, finding a correct mode in the high-dimensional distribution produced
by the diffusion model is challenging and often leads to issues such as
over-saturation, over-smoothing, and Janus-like artifacts. In this paper, we
propose a novel learning paradigm for 3D synthesis that utilizes pre-trained
diffusion models. Instead of focusing on mode-seeking, our method directly
models the distribution discrepancy between multi-view renderings and diffusion
priors in an adversarial manner, which unlocks the generation of high-fidelity
and photorealistic 3D content, conditioned on a single image and prompt.
Moreover, by harnessing the latent space of GANs and expressive diffusion model
priors, our method facilitates a wide variety of 3D applications including
single-view reconstruction, high diversity generation and continuous 3D
interpolation in the open domain. The experiments demonstrate the superiority
of our pipeline compared to previous works in terms of generation quality and
diversity. |
This paper introduces Consistent Adversarial Distillation (CAD), a novel method for generating high-quality, photorealistic 3D objects from a single image and text prompt by leveraging pre-trained diffusion models. |
Existing 3D generation methods based on score distillation often suffer from limitations like over-saturation, over-smoothing, and limited diversity. CAD aims to overcome these issues by directly modeling the distribution of a pre-trained diffusion model. |
CAD employs a 3D-aware GAN to learn the conditional distribution of a pre-trained diffusion model. It introduces adversarial distillation to minimize the distribution gap between multi-view renderings and diffusion priors. It also proposes strategies for sampling diverse multi-view images from the diffusion model and refining them to enhance quality. |
CAD outperforms baseline methods in terms of photorealism and diversity, as demonstrated by quantitative evaluation using CLIP similarity scores and qualitative comparisons.
The importance of pose pruning and distribution refinement for improving generation quality is highlighted through ablation studies.
The paper shows that directly modeling the 3D distribution using a GAN leads to superior results compared to single-mode fitting methods. |
The optimization speed is limited by the computational cost of volumetric rendering, suggesting the exploration of more efficient rendering techniques as future work.
The paper focuses on single-condition generation, and exploring joint training with multiple conditions could further enhance diversity. |
3d generation, diffusion models, generative adversarial networks, single-view reconstruction, adversarial distillation |
2312.06662
Report |
Photorealistic Video Generation with Diffusion Models |
Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, José Lezama |
We present W.A.L.T, a transformer-based approach for photorealistic video
generation via diffusion modeling. Our approach has two key design decisions.
First, we use a causal encoder to jointly compress images and videos within a
unified latent space, enabling training and generation across modalities.
Second, for memory and training efficiency, we use a window attention
architecture tailored for joint spatial and spatiotemporal generative modeling.
Taken together, these design decisions enable us to achieve state-of-the-art
performance on established video (UCF-101 and Kinetics-600) and image
(ImageNet) generation benchmarks without using classifier-free guidance.
Finally, we also train a cascade of three models for the task of text-to-video
generation consisting of a base latent video diffusion model, and two video
super-resolution diffusion models to generate videos of $512 \times 896$
resolution at $8$ frames per second. |
The paper introduces WALT (Window Attention Latent Transformer), a novel transformer-based framework for efficient latent video diffusion models. |
Current video diffusion models struggle with the high computational demands of video processing. WALT leverages a unified latent space for images and videos, enabling efficient training and generation by incorporating causal encoding and windowed attention in a transformer architecture. |
WALT has two main stages: (1) A causal 3D CNN encoder-decoder maps images and videos into a shared latent space. (2) A transformer model with alternating spatial and spatiotemporal window attention learns to generate images and videos in this space. The model utilizes AdaLN-LoRA for efficient conditioning, self-conditioning, and a cascaded approach for high-resolution video generation. |
WALT achieves state-of-the-art results on video generation benchmarks UCF-101 and Kinetics-600, and image generation benchmark ImageNet, without relying on classifier-free guidance.
Joint training on image and video data is shown to be crucial for high-quality text-to-video generation.
Ablation studies demonstrate the importance of smaller patch sizes, local window attention, self-conditioning, AdaLN-LoRA, and a zero terminal SNR noise schedule. |
The Inception Score for text-to-video generation, while competitive, is slightly lower than PYoCo, potentially due to the use of less powerful text embeddings.
Scaling the model beyond 3B parameters is expected to further improve performance. |
video generation, diffusion models, transformers, latent space, window attention |
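To make the efficiency argument for window attention concrete, the sketch below partitions the tokens of a (T, H, W) latent grid into non-overlapping windows and counts query-key pairs. The grid size and the two window shapes (a per-frame spatial window and a window spanning all frames) are assumptions chosen for illustration, not values from the report.

```python
def window_partition(shape, window):
    """Group token coordinates of a (T, H, W) grid into non-overlapping windows."""
    frames, height, width = shape
    win_t, win_h, win_w = window
    assert frames % win_t == 0 and height % win_h == 0 and width % win_w == 0
    windows = {}
    for t in range(frames):
        for y in range(height):
            for x in range(width):
                # Tokens sharing the same window index attend to each other.
                key = (t // win_t, y // win_h, x // win_w)
                windows.setdefault(key, []).append((t, y, x))
    return list(windows.values())

def attention_pairs(windows):
    """Number of query-key pairs when attention is restricted to each window."""
    return sum(len(tokens) ** 2 for tokens in windows)

grid = (4, 4, 4)                                    # hypothetical latent grid
spatial = window_partition(grid, (1, 2, 2))         # within a single frame
spatiotemporal = window_partition(grid, (4, 2, 2))  # across all frames
full = window_partition(grid, (4, 4, 4))            # global attention

costs = (attention_pairs(spatial),
         attention_pairs(spatiotemporal),
         attention_pairs(full))
```

For this toy grid, the two windowed layouts cost 256 and 1024 pairs against 4096 for global attention, which illustrates why alternating the two window types is cheaper than attending over the whole latent.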
2312.06661
Report |
UpFusion: Novel View Diffusion from Unposed Sparse View Observations |
Bharath Raj Nagoor Kani, Hsin-Ying Lee, Sergey Tulyakov, Shubham Tulsiani |
We propose UpFusion, a system that can perform novel view synthesis and infer
3D representations for an object given a sparse set of reference images without
corresponding pose information. Current sparse-view 3D inference methods
typically rely on camera poses to geometrically aggregate information from
input views, but are not robust in-the-wild when such information is
unavailable/inaccurate. In contrast, UpFusion sidesteps this requirement by
learning to implicitly leverage the available images as context in a
conditional generative model for synthesizing novel views. We incorporate two
complementary forms of conditioning into diffusion models for leveraging the
input views: a) via inferring query-view aligned features using a scene-level
transformer, b) via intermediate attentional layers that can directly observe
the input image tokens. We show that this mechanism allows generating
high-fidelity novel views while improving the synthesis quality given
additional (unposed) images. We evaluate our approach on the Co3Dv2 and Google
Scanned Objects datasets and demonstrate the benefits of our method over
pose-reliant sparse-view methods as well as single-view methods that cannot
leverage additional views. Finally, we also show that our learned model can
generalize beyond the training categories and even allow reconstruction from
self-captured images of generic objects in-the-wild. |
Presents UpFusion, a system for 3D inference and novel view synthesis from sparse, unposed images, leveraging a conditional diffusion model conditioned on UpSRT features. |
Addresses limitations of existing sparse-view 3D methods that rely on accurate camera poses, which are often unavailable in real-world scenarios. |
Combines UpSRT (unposed scene representation transformer) for query-view aligned features with a conditional diffusion model for novel view synthesis. It further optimizes a 3D representation using score-based distillation. |
Outperforms pose-dependent methods relying on predicted camera poses (e.g., SparseFusion with RelPose++).
Achieves better novel view synthesis than UpSRT and single-view methods, especially when leveraging additional unposed images.
Demonstrates generalization beyond training categories, including on self-captured images. |
Generated views may not always be precisely consistent with the input images.
Scaling of performance with additional views is not as strong as in pose-aware methods. |
3d reconstruction, novel view synthesis, diffusion models, unposed images, sparse view |
2312.06660
Report |
EdgeSAM: Prompt-In-the-Loop Distillation for On-Device Deployment of SAM |
Chong Zhou, Xiangtai Li, Chen Change Loy, Bo Dai |
This paper presents EdgeSAM, an accelerated variant of the Segment Anything
Model (SAM), optimized for efficient execution on edge devices with minimal
compromise in performance. Our approach involves distilling the original
ViT-based SAM image encoder into a purely CNN-based architecture, better suited
for edge devices. We carefully benchmark various distillation strategies and
demonstrate that task-agnostic encoder distillation fails to capture the full
knowledge embodied in SAM. To overcome this bottleneck, we include both the
prompt encoder and mask decoder in the distillation process, with box and point
prompts in the loop, so that the distilled model can accurately capture the
intricate dynamics between user input and mask generation. To mitigate dataset
bias issues stemming from point prompt distillation, we incorporate a
lightweight module within the encoder. EdgeSAM achieves a 40-fold speed
increase compared to the original SAM, and it also outperforms MobileSAM, being
14 times as fast when deployed on edge devices while enhancing the mIoUs on
COCO and LVIS by 2.3 and 3.2 respectively. It is also the first SAM variant
that can run at over 30 FPS on an iPhone 14. Code and models are available at
https://github.com/chongzhou96/EdgeSAM. |
This paper proposes EdgeSAM, an accelerated version of the Segment Anything Model (SAM), optimized for efficient execution on edge devices while retaining comparable performance. |
Deploying SAM on edge devices like smartphones is challenging due to its large computational requirements, hindering real-time interactive segmentation. |
The authors distill the knowledge from SAM's ViT-based encoder into a CNN-based architecture, employ a novel 'prompt-in-the-loop' distillation strategy to capture the interaction between user input and mask generation, and introduce a module to adapt to granularity priors of specific datasets. |
EdgeSAM achieves a 40-fold speed increase compared to the original SAM and is 1.6 times faster than MobileSAM on an NVIDIA 2080 Ti GPU.
On an iPhone 14, EdgeSAM achieves an encoding speed of 14ms per image, making it 14 times faster than MobileSAM on the same platform.
EdgeSAM maintains comparable accuracy to SAM in box-prompt performance and outperforms MobileSAM in point-prompt performance across various datasets. |
Limitations in model capacity and training exclusively with ground-truth boxes might lead to performance discrepancies.
Future work could explore quantization, pruning, and on-device optimization to further improve performance. |
interactive segmentation, edge computing, model compression, knowledge distillation, segment anything model (sam) |
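The EdgeSAM report says that point prompts are kept "in the loop" during distillation but does not describe how new prompts are chosen. One plausible scheme, shown here strictly as an assumption, is to sample a corrective point wherever the student mask disagrees with the teacher mask. The function name `sample_correction_point` and the label convention are hypothetical.

```python
import random

def sample_correction_point(teacher_mask, student_mask, rng=random):
    """Sample a point prompt from the region where student and teacher disagree.

    Assumed scheme (not confirmed by the report). Returns ((row, col), label),
    where label 1 marks foreground the student missed and label 0 marks
    background the student wrongly included. Returns None if the masks agree.
    """
    disagreements = []
    for r, (teacher_row, student_row) in enumerate(zip(teacher_mask, student_mask)):
        for c, (t, s) in enumerate(zip(teacher_row, student_row)):
            if t != s:
                disagreements.append(((r, c), 1 if t else 0))
    if not disagreements:
        return None
    return rng.choice(disagreements)

teacher = [[1, 1],
           [0, 0]]
student = [[1, 0],
           [0, 1]]
prompt = sample_correction_point(teacher, student, random.Random(0))
```

In a distillation loop, the sampled point would be appended to the prompt set and both models would decode again, so the student is supervised on how the teacher responds to corrective input.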
2312.06655
Report |
Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior |
Fangfu Liu, Diankun Wu, Yi Wei, Yongming Rao, Yueqi Duan |
Recently, 3D content creation from text prompts has demonstrated remarkable
progress by utilizing 2D and 3D diffusion models. While 3D diffusion models
ensure great multi-view consistency, their ability to generate high-quality and
diverse 3D assets is hindered by the limited 3D data. In contrast, 2D diffusion
models find a distillation approach that achieves excellent generalization and
rich details without any 3D data. However, 2D lifting methods suffer from
inherent view-agnostic ambiguity thereby leading to serious multi-face Janus
issues, where text prompts fail to provide sufficient guidance to learn
coherent 3D results. Instead of retraining a costly viewpoint-aware model, we
study how to fully exploit easily accessible coarse 3D knowledge to enhance the
prompts and guide 2D lifting optimization for refinement. In this paper, we
propose Sherpa3D, a new text-to-3D framework that achieves high-fidelity,
generalizability, and geometric consistency simultaneously. Specifically, we
design a pair of guiding strategies derived from the coarse 3D prior generated
by the 3D diffusion model: a structural guidance for geometric fidelity and a
semantic guidance for 3D coherence. Employing the two types of guidance, the 2D
diffusion model enriches the 3D content with diversified and high-quality
results. Extensive experiments show the superiority of our Sherpa3D over the
state-of-the-art text-to-3D methods in terms of quality and 3D consistency. |
Presents Sherpa3D, a novel text-to-3D generation framework that leverages readily available 3D diffusion models to guide 2D lifting optimization, resulting in high-fidelity, diverse, and geometrically consistent 3D assets. |
Addresses limitations of existing text-to-3D methods that struggle to balance generalizability, high fidelity, and geometric consistency due to reliance on limited 3D data or costly viewpoint-aware models. |
Utilizes a coarse 3D prior generated by a 3D diffusion model to guide the optimization process of a 2D diffusion model. Introduces two guiding strategies: structural guidance, which leverages normal maps for geometric fidelity, and semantic guidance, which uses high-level features for 3D coherence. Employs a step annealing technique to balance the influence of 3D guidance during optimization. |
Generates high-fidelity 3D assets with compelling texture quality and multi-view consistency, outperforming existing methods.
Exhibits strong generalization ability across diverse text prompts, effectively mitigating multi-face Janus problems.
Achieves high efficiency, generating production-ready 3D models from text prompts within 25 minutes on a single GPU. |
Current generation quality is limited by the chosen 3D and 2D diffusion backbones; future work could address this with larger, more sophisticated models such as SDXL and DeepFloyd.
Future work will explore extending the framework to more creative text-to-4D generation. |
text-to-3d generation, 3d diffusion models, score distillation sampling, multi-view consistency, geometric fidelity |
2312.06644
Report |
AnyHome: Open-Vocabulary Generation of Structured and Textured 3D Homes |
Rao Fu, Zehao Wen, Zichen Liu, Srinath Sridhar |
Inspired by cognitive theories, we introduce AnyHome, a framework that
translates any text into well-structured and textured indoor scenes at a
house-scale. By prompting Large Language Models (LLMs) with designed templates,
our approach converts provided textual narratives into amodal structured
representations. These representations guarantee consistent and realistic
spatial layouts by directing the synthesis of a geometry mesh within defined
constraints. A Score Distillation Sampling process is then employed to refine
the geometry, followed by an egocentric inpainting process that adds lifelike
textures to it. AnyHome stands out with its editability, customizability,
diversity, and realism. The structured representations for scenes allow for
extensive editing at varying levels of granularity. Capable of interpreting
texts ranging from simple labels to detailed narratives, AnyHome generates
detailed geometries and textures that outperform existing methods in both
quantitative and qualitative measures. |
Introduces AnyHome, a framework that converts text descriptions into structured 3D house models with realistic textures, leveraging LLMs and egocentric inpainting. |
Addresses limitations in existing text-to-3D methods that struggle with robust structure, open-vocabulary furniture/objects, realistic textures, and house-scale generation. |
1. Uses LLMs to convert text into structured representations (floorplans, room layouts). 2. Employs graph-based representations and placement rules for coherent layouts. 3. Uses SDS for refining object placement. 4. Applies egocentric inpainting for realistic textures. |
Generates house-scale scenes with diverse structures and realistic textures from open-vocabulary text.
Allows detailed scene editing through text at various levels (room type, layout, object appearance).
Outperforms baselines in layout quality (OOB rate) and text-scene alignment (Caption-sim, CLIP-sim). |
LLMs' limited understanding of 3D space can lead to illogical object placement.
Maintaining texture consistency in multi-view inpainting remains a challenge. |
text-to-3d, 3d scene generation, large language models, egocentric vision, score distillation sampling |
2312.06642
Report |
CorresNeRF: Image Correspondence Priors for Neural Radiance Fields |
Yixing Lao, Xiaogang Xu, Zhipeng Cai, Xihui Liu, Hengshuang Zhao |
Neural Radiance Fields (NeRFs) have achieved impressive results in novel view
synthesis and surface reconstruction tasks. However, their performance suffers
under challenging scenarios with sparse input views. We present CorresNeRF, a
novel method that leverages image correspondence priors computed by
off-the-shelf methods to supervise NeRF training. We design adaptive processes
for augmentation and filtering to generate dense and high-quality
correspondences. The correspondences are then used to regularize NeRF training
via the correspondence pixel reprojection and depth loss terms. We evaluate our
method on novel view synthesis and surface reconstruction tasks with
density-based and SDF-based NeRF models on different datasets. Our method
outperforms previous methods in both photometric and geometric metrics. We show
that this simple yet effective technique of using correspondence priors can be
applied as a plug-and-play module across different NeRF variants. The project
page is at https://yxlao.github.io/corres-nerf. |
This paper presents CorresNeRF, a method that leverages image correspondence priors for improved neural radiance field training with sparse input views. |
Training NeRFs with sparse input views is challenging but important for real-world applications where dense view capture is costly. |
The method involves an automatic augmentation and outlier filtering process for dense, high-quality correspondence generation. These correspondences are then used to regularize NeRF training via pixel reprojection and depth loss terms. |
CorresNeRF outperforms previous methods in novel view synthesis, achieving significantly better photometric metrics and depth prediction on the LLFF dataset.
It also excels in surface reconstruction, leading to more accurate surfaces and improved rendering quality compared to baseline SDF-based methods on the DTU dataset.
Ablation studies confirm the efficacy of the correspondence augmentation, filtering, and loss terms, demonstrating robustness to noisy correspondences. |
Current correspondence generation methods may not be robust for extreme cases like unreasonable camera positions or specific textures.
Future work will explore leveraging NeRF for correspondence learning to address these limitations. |
neural radiance fields, nerf, image correspondence, sparse view synthesis, surface reconstruction |
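The pixel reprojection term in the CorresNeRF report can be sketched with a pinhole camera: lift a pixel from view 1 using its rendered depth, move it into view 2, project it, and compare with the matched pixel. The shared intrinsics (`f`, `cx`, `cy`), the relative pose (`R_12`, `t_12`), the use of z-depth, and the plain Euclidean pixel error are assumptions; the report does not give the exact loss.

```python
import math

def backproject(pixel, depth, f, cx, cy):
    """Lift a pixel with z-depth to a 3D point in the camera frame."""
    u, v = pixel
    return ((u - cx) * depth / f, (v - cy) * depth / f, depth)

def transform(R, t, p):
    """Apply a rigid transform R @ p + t using plain lists."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))

def project(point, f, cx, cy):
    """Project a 3D camera-frame point to pixel coordinates."""
    x, y, z = point
    return (f * x / z + cx, f * y / z + cy)

def reprojection_error(p1, p2, depth1, R_12, t_12, f, cx, cy):
    """Pixel distance between p2 and the reprojection of p1 into view 2."""
    point_cam1 = backproject(p1, depth1, f, cx, cy)
    point_cam2 = transform(R_12, t_12, point_cam1)
    return math.dist(project(point_cam2, f, cx, cy), p2)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shift = (-0.5, 0.0, 0.0)  # hypothetical relative translation between views
good = reprojection_error((60.0, 50.0), (35.0, 50.0), 2.0, identity, shift, 100.0, 50.0, 50.0)
bad = reprojection_error((60.0, 50.0), (35.0, 50.0), 4.0, identity, shift, 100.0, 50.0, 50.0)
```

With the correct depth the reprojected pixel lands on its match and the error vanishes; a wrong depth leaves a residual, which is the signal such a loss uses to regularize geometry under sparse views.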
2312.06640
Report |
Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution |
Shangchen Zhou, Peiqing Yang, Jianyi Wang, Yihang Luo, Chen Change Loy |
Text-based diffusion models have exhibited remarkable success in generation
and editing, showing great promise for enhancing visual content with their
generative prior. However, applying these models to video super-resolution
remains challenging due to the high demands for output fidelity and temporal
consistency, which is complicated by the inherent randomness in diffusion
models. Our study introduces Upscale-A-Video, a text-guided latent diffusion
framework for video upscaling. This framework ensures temporal coherence
through two key mechanisms: locally, it integrates temporal layers into U-Net
and VAE-Decoder, maintaining consistency within short sequences; globally,
without training, a flow-guided recurrent latent propagation module is
introduced to enhance overall video stability by propagating and fusing latents
across the entire sequence. Thanks to the diffusion paradigm, our model also
offers greater flexibility by allowing text prompts to guide texture creation
and adjustable noise levels to balance restoration and generation, enabling a
trade-off between fidelity and quality. Extensive experiments show that
Upscale-A-Video surpasses existing methods in both synthetic and real-world
benchmarks, as well as in AI-generated videos, showcasing impressive visual
realism and temporal consistency. |
This paper introduces Upscale-A-Video, a text-guided latent diffusion framework for video upscaling that leverages pretrained image diffusion models to enhance the quality of low-quality videos. |
Existing VSR methods struggle to generate realistic textures and details, especially in real-world scenarios with complex and unknown degradations. This work explores the potential of diffusion models for VSR to produce temporally consistent videos with realistic details. |
Upscale-A-Video employs a local-global temporal consistency strategy. Locally, it integrates temporal layers into U-Net and VAE-Decoder for short-sequence consistency. Globally, a flow-guided recurrent latent propagation module enhances stability across long sequences. The model also incorporates text prompts for guiding texture creation and allows adjustable noise levels to balance restoration and generation. |
Upscale-A-Video outperforms existing methods in both synthetic and real-world benchmarks, as well as on AI-generated videos, showing improved detail generation and artifact removal.
The model effectively leverages text prompts to guide texture creation, leading to more realistic and high-quality details.
Adjustable noise levels allow users to control the trade-off between restoration fidelity and detail generation, enabling versatility in different scenarios. |
The model currently relies on pretrained image diffusion models, and exploring joint training on image and video data could further enhance performance.
Investigating more efficient and accurate flow estimation methods for the latent propagation module could further improve temporal consistency. |
video super-resolution, diffusion models, temporal consistency, text-guided generation, real-world vsr |
2312.06573
Report |
ControlNet-XS: Designing an Efficient and Effective Architecture for Controlling Text-to-Image Diffusion Models |
Denis Zavadski, Johann-Friedrich Feiden, Carsten Rother |
The field of image synthesis has made tremendous strides forward in recent
years. Besides defining the desired output image with text prompts, an
intuitive approach is to additionally use spatial guidance in the form of an image,
such as a depth map. For this, a recent and highly popular approach is to use a
controlling network, such as ControlNet, in combination with a pre-trained
image generation model, such as Stable Diffusion. When evaluating the design of
existing controlling networks, we observe that they all suffer from the same
problem of a delay in information flowing between the generation and
controlling process. This, in turn, means that the controlling network must
have generative capabilities. In this work we propose a new controlling
architecture, called ControlNet-XS, which does not suffer from this problem,
and hence can focus on the given task of learning to control. In contrast to
ControlNet, our model needs only a fraction of parameters, and hence is about
twice as fast during inference and training time. Furthermore, the generated
images are of higher quality and the control is of higher fidelity. All code
and pre-trained models will be made publicly available. |
This paper introduces ControlNet-XS, an efficient and effective architecture for controlling text-to-image diffusion models, addressing the issue of delayed information flow in previous control models. |
Controlling large-scale text-to-image diffusion models with spatial guidance, like depth maps or sketches, is crucial for users to achieve their desired image output. |
The paper proposes a novel architecture that eliminates the delay in information flow between the generative and controlling processes by enabling direct communication between their encoders. This allows for a significantly smaller control network trained from scratch without inheriting weights from the generative model. |
ControlNet-XS, despite its smaller size, outperforms state-of-the-art methods like ControlNet and T2I-Adapter in terms of image quality and control fidelity.
The architecture effectively controls large diffusion models, demonstrated by its application to Stable Diffusion XL with a control network significantly smaller in size.
Analysis reveals that control effectiveness varies across different U-Net blocks, with encoder blocks being more critical, and large control models can introduce unwanted biases in the generated images. |
Controlling networks can introduce biases in the generative model, even with a smaller size, demanding further research to minimize these biases.
Future work can explore a better understanding of the generative model's mechanisms to design even more effective and application-specific control tools. |
text-to-image synthesis, diffusion models, controllable image generation, spatial guidance, controlnet |
2312.06439
Report |
DreamControl: Control-Based Text-to-3D Generation with 3D Self-Prior |
Tianyu Huang, Yihan Zeng, Zhilu Zhang, Wan Xu, Hang Xu, Songcen Xu, Rynson W. H. Lau, Wangmeng Zuo |
3D generation has attracted great attention in recent years. With the success of
text-to-image diffusion models, the 2D-lifting technique becomes a promising
route to controllable 3D generation. However, these methods tend to present
inconsistent geometry, which is also known as the Janus problem. We observe
that the problem is caused mainly by two aspects, i.e., viewpoint bias in 2D
diffusion models and overfitting of the optimization objective. To address it,
we propose a two-stage 2D-lifting framework, namely DreamControl, which
optimizes coarse NeRF scenes as 3D self-prior and then generates fine-grained
objects with control-based score distillation. Specifically, adaptive viewpoint
sampling and boundary integrity metric are proposed to ensure the consistency
of generated priors. The priors are then regarded as input conditions to
maintain reasonable geometries, in which conditional LoRA and weighted score
are further proposed to optimize detailed textures. DreamControl can generate
high-quality 3D content in terms of both geometry consistency and texture
fidelity. Moreover, our control-based optimization guidance is applicable to
more downstream tasks, including user-guided generation and 3D animation. The
project page is available at https://github.com/tyhuang0428/DreamControl. |
The paper proposes DreamControl, a two-stage 2D-lifting framework for text-to-3D generation that addresses the Janus problem (inconsistent geometry) by leveraging a coarse NeRF representation as a 3D self-prior. |
Existing 2D-lifting methods for 3D generation often produce inconsistent geometry due to viewpoint bias in 2D diffusion models and overfitting during optimization. |
DreamControl first generates a coarse NeRF shape as a 3D self-prior using adaptive viewpoint sampling and a boundary integrity metric to minimize inconsistencies. Then, it utilizes control-based score distillation with a conditional LoRA and weighted score to generate detailed textures while maintaining the prior's geometry. |
DreamControl generates high-quality 3D content with improved geometry consistency and texture fidelity compared to previous methods.
The proposed control-based guidance is applicable to other tasks like user-guided generation and 3D animation.
Quantitative results demonstrate DreamControl's superiority in generating consistent geometries and preserving texture details. |
The method may fail when 3D priors look similar from different viewpoints.
Future work could explore ways to further enhance the diversity of generated content. |
text-to-3d generation, 3d self-prior, janus problem, control-based score distillation, nerf |
2312.06285
Report |
Compensation Sampling for Improved Convergence in Diffusion Models |
Hui Lu, Albert Ali Salah, Ronald Poppe |
Diffusion models achieve remarkable quality in image generation, but at a
cost. Iterative denoising requires many time steps to produce high fidelity
images. We argue that the denoising process is crucially limited by an
accumulation of the reconstruction error due to an initial inaccurate
reconstruction of the target data. This leads to lower quality outputs, and
slower convergence. To address this issue, we propose compensation sampling to
guide the generation towards the target domain. We introduce a compensation
term, implemented as a U-Net, which adds negligible computation overhead during
training and, optionally, inference. Our approach is flexible and we
demonstrate its application in unconditional generation, face inpainting, and
face de-occlusion using benchmark datasets CIFAR-10, CelebA, CelebA-HQ,
FFHQ-256, and FSG. Our approach consistently yields state-of-the-art results in
terms of image quality, while accelerating the denoising process to converge
during training by up to an order of magnitude. |
This paper proposes "compensation sampling" (CS), a novel sampling method to address the error accumulation issue during the training process of diffusion models, which leads to faster convergence and higher-quality image generation. |
Diffusion models, while achieving impressive results in image generation, often suffer from slow training and inference due to the iterative nature of the denoising process, resulting in the accumulation of reconstruction errors. This paper aims to alleviate this limitation and improve the efficiency of diffusion models. |
The proposed compensation sampling algorithm introduces a learned compensation term, implemented as a lightweight U-Net model, to guide the reconstruction towards the clean data distribution. This term counteracts the accumulation of errors during training. The approach is evaluated on various image generation tasks including unconditional generation, face inpainting, and face de-occlusion. |
Compensation sampling significantly accelerates the training convergence of diffusion models, up to an order of magnitude faster than traditional methods.
The generated images using compensation sampling consistently exhibit higher quality, achieving state-of-the-art results on benchmark datasets like CIFAR-10, CelebA, and FFHQ, outperforming existing diffusion and GAN-based methods.
The compensation term's computational overhead during training and inference is negligible. |
The study primarily focuses on image generation tasks. Further investigation is needed to explore the applicability of compensation sampling in other domains such as audio or video generation.
While the compensation module is lightweight, its impact on memory footprint during training, particularly for high-resolution images, needs further analysis and optimization. |
diffusion models, image generation, compensation sampling, deep learning, computer vision |
2312.06205
Report |
The Journey, Not the Destination: How Data Guides Diffusion Models |
Kristian Georgiev, Joshua Vendrow, Hadi Salman, Sung Min Park, Aleksander Madry |
Diffusion models trained on large datasets can synthesize photo-realistic
images of remarkable quality and diversity. However, attributing these images
back to the training data (that is, identifying specific training examples which
caused an image to be generated) remains a challenge. In this paper, we propose
a framework that: (i) provides a formal notion of data attribution in the
context of diffusion models, and (ii) allows us to counterfactually validate
such attributions. Then, we provide a method for computing these attributions
efficiently. Finally, we apply our method to find (and evaluate) such
attributions for denoising diffusion probabilistic models trained on CIFAR-10
and latent diffusion models trained on MS COCO. We provide code at
https://github.com/MadryLab/journey-TRAK . |
This paper introduces a framework for attributing images synthesized by diffusion models back to their training data by identifying influential training examples at each step of the diffusion process. |
Attributing generated images to training data is crucial for understanding model behavior, detecting memorization and bias, and addressing privacy concerns. This is particularly important for diffusion models, which are increasingly used in various machine learning applications. |
The authors propose a step-by-step attribution method that analyzes the evolution of the conditional distribution of generated images over the diffusion process. They utilize the TRAK method to efficiently estimate the influence of training examples on the model output at each step. |
The method identifies positively influential training examples that resemble the generated image throughout the diffusion process and negatively influential examples that differ in specific attributes.
Attribution scores are shown to be counterfactually predictive, meaning they can accurately predict the impact of removing training examples on the generated images.
The framework enables feature-level attribution by localizing the analysis to specific patches in the generated image. |
The current method relies on proxies for certain quantities in the diffusion process, and finding more accurate approximations could further improve the attributions.
Scaling the framework to larger diffusion models and datasets, while feasible, presents a computational challenge. |
data attribution, diffusion models, generative models, counterfactual analysis, feature attribution |
2312.06198
Report |
Optimized View and Geometry Distillation from Multi-view Diffuser |
Youjia Zhang, Zikai Song, Junqing Yu, Yawei Luo, Wei Yang |
Generating multi-view images from a single input view using image-conditioned
diffusion models is a recent advancement and has shown considerable potential.
However, issues such as the lack of consistency in synthesized views and
over-smoothing in extracted geometry persist. Previous methods integrate
multi-view consistency modules or impose additional supervision to enhance view
consistency while compromising on the flexibility of camera positioning and
limiting the versatility of view synthesis. In this study, we consider the
radiance field optimized during geometry extraction as a more rigid consistency
prior, compared to volume and ray aggregation used in previous works. We
further identify and rectify a critical bias in the traditional radiance field
optimization process through score distillation from a multi-view diffuser. We
introduce an Unbiased Score Distillation (USD) that utilizes unconditioned
noises from a 2D diffusion model, greatly refining the radiance field fidelity.
We leverage the rendered views from the optimized radiance field as the basis
and develop a two-step specialization process of a 2D diffusion model, which is
adept at conducting object-specific denoising and generating high-quality
multi-view images. Finally, we recover faithful geometry and texture directly
from the refined multi-view images. Empirical evaluations demonstrate that our
optimized geometry and view distillation technique generates comparable results
to the state-of-the-art models trained on extensive datasets, all while
maintaining freedom in camera positioning. Please see our project page at
https://youjiazhang.github.io/USD/. |
This paper introduces an optimized view and geometry distillation technique from a multi-view diffusion model, addressing issues like view inconsistency and geometry over-smoothing in previous methods. |
Generating consistent multi-view images and high-quality 3D models from a single image is a challenging task with broad applications in various fields. |
The proposed method uses an Unbiased Score Distillation (USD) to rectify bias in the multi-view diffuser and leverages the optimized radiance field as a consistency prior. A two-step DreamBooth specialization process further refines a 2D diffusion model for generating high-quality multi-view images. Finally, NeuS recovers the geometry and texture from the refined images. |
The USD method significantly improves the quality of the extracted radiance field compared to traditional SDS/SJC methods.
The proposed approach generates multi-view images and geometries comparable to state-of-the-art models trained on large datasets, while maintaining flexibility in camera positioning.
The method effectively addresses the limitations of previous approaches, particularly in terms of view consistency and geometry detail. |
The underlying causes of the bias issue in the Zero-1-to-3 model are not fully understood.
Future work will focus on a theoretical analysis of the bias and explore applications of USD in other domains. |
multi-view diffusion model, 3d reconstruction, score distillation, dreambooth, nerf |
2312.06158
Report |
Adaptive Feature Selection for No-Reference Image Quality Assessment using Contrastive Mitigating Semantic Noise Sensitivity |
Xudong Li, Timin Gao, Xiawu Zheng, Runze Hu, Jingyuan Zheng, Yunhang Shen, Ke Li, Yutao Liu, Pingyang Dai, Yan Zhang, Rongrong Ji |
The current state-of-the-art No-Reference Image Quality Assessment (NR-IQA)
methods typically use feature extraction in upstream backbone networks, which
assumes that all extracted features are relevant. However, we argue that not
all features are beneficial, and some may even be harmful, necessitating
careful selection. Empirically, we find that many image pairs with small
feature spatial distances can have vastly different quality scores. To address
this issue, we propose a Quality-Aware Feature Matching IQA metric (QFM-IQM)
that employs contrastive learning to remove harmful features from the upstream
task. Specifically, our approach enhances the ability of neural networks to
distinguish semantic noise by comparing image pairs with similar semantic
features but varying quality scores and adaptively adjusting the upstream
task's features by introducing disturbance. Furthermore, we utilize a
distillation framework to expand the dataset and improve the model's
generalization ability. Our approach achieves superior performance to the
state-of-the-art NR-IQA methods on 8 standard NR-IQA datasets, achieving PLCC
values of 0.932 (vs. 0.908 in TID2013) and 0.913 (vs. 0.894 in LIVEC). |
The paper proposes QFM-IQM, a novel No-Reference Image Quality Assessment (NR-IQA) method using contrastive learning to filter irrelevant features and a distillation framework for better generalization. |
Existing NR-IQA methods struggle to differentiate between images with similar semantic content but different quality scores due to feature aliasing. |
QFM-IQM employs 1) Semantic Noise Matching (SNM) to pair images with similar semantics but different quality, 2) Quality Consistency Contrastive (QCC) module for robustness to semantic noise, and 3) Distilled Label Extension (DLE) to expand the dataset using pseudo-labels for improved generalization. |
QFM-IQM outperforms 15 state-of-the-art NR-IQA methods on 8 benchmark datasets.
Cross-dataset validation experiments show QFM-IQM's superior generalization capability.
Ablation study confirms the effectiveness of QCC and DLE in improving performance. |
The impact of varying the number of matched features (K) requires further investigation.
Exploring different knowledge distillation strategies might further improve performance. |
image quality assessment, no-reference iqa, contrastive learning, knowledge distillation, feature selection |
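The PLCC values quoted in the QFM-IQM abstract above are Pearson linear correlation coefficients between predicted quality scores and subjective scores. The following is a minimal sketch of that metric only, not of QFM-IQM itself; the function name and the toy inputs are illustrative assumptions. IQA evaluations often fit a nonlinear mapping to the predictions before computing PLCC, and that step is omitted here.

```python
import math


def plcc(predicted, subjective):
    """Pearson linear correlation between two equal-length score lists."""
    n = len(predicted)
    if n != len(subjective) or n < 2:
        raise ValueError("need two score lists of equal length >= 2")
    mean_p = sum(predicted) / n
    mean_s = sum(subjective) / n
    # Covariance and variances, all without the 1/n factor (it cancels).
    cov = sum((p - mean_p) * (s - mean_s) for p, s in zip(predicted, subjective))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_s = sum((s - mean_s) ** 2 for s in subjective)
    if var_p == 0 or var_s == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_p * var_s)
```

A value close to 1 means the predictions track the subjective scores linearly; the metric is insensitive to the scale and offset of the predictions.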
2312.06116
Report |
Stellar: Systematic Evaluation of Human-Centric Personalized Text-to-Image Methods |
Panos Achlioptas, Alexandros Benetatos, Iordanis Fostiropoulos, Dimitris Skourtis |
In this work, we systematically study the problem of personalized
text-to-image generation, where the output image is expected to portray
information about specific human subjects. E.g., generating images of oneself
appearing at imaginative places, interacting with various items, or engaging in
fictional activities. To this end, we focus on text-to-image systems that input
a single image of an individual to ground the generation process along with
text describing the desired visual context. Our first contribution is to fill
the literature gap by curating high-quality, appropriate data for this task.
Namely, we introduce a standardized dataset (Stellar) that contains
personalized prompts coupled with images of individuals that is an order of
magnitude larger than existing relevant datasets and where rich semantic
ground-truth annotations are readily available. Having established Stellar to
promote cross-systems fine-grained comparisons further, we introduce a rigorous
ensemble of specialized metrics that highlight and disentangle fundamental
properties such systems should obey. Besides being intuitive, our new metrics
correlate significantly more strongly with human judgment than currently used
metrics on this task. Last but not least, drawing inspiration from the recent
works of ELITE and SDXL, we derive a simple yet efficient, personalized
text-to-image baseline that does not require test-time fine-tuning for each
subject and which sets quantitatively and in human trials a new SoTA. For more
information, please visit our project's website:
https://stellar-gen-ai.github.io/. |
This paper introduces Stellar, a large-scale dataset for personalized text-to-image generation, proposes novel evaluation metrics specifically designed for such systems, and presents StellarNet, a simple yet effective baseline model that sets a new state-of-the-art. |
Personalized text-to-image generation, while promising, lacks standardized data and specialized evaluation metrics, hindering progress in the field. |
The authors curate Stellar, a dataset of 20,000 imaginative prompts paired with 400 unique human identities. They introduce five new metrics focusing on identity preservation, attribute accuracy, stability across different images of the same subject, and object/relation faithfulness. They develop StellarNet, leveraging SDXL and dynamic textual inversion to generate personalized images. |
The proposed metrics demonstrate significantly stronger correlation with human judgment compared to existing metrics.
StellarNet outperforms other state-of-the-art personalized text-to-image generation methods both quantitatively and in human evaluation.
The authors highlight the potential for misuse of personalized image generation and urge responsible use and content moderation. |
StellarNet, while effective, inherits potential biases present in the underlying SDXL model.
Future work could focus on mitigating biases and developing more robust content moderation techniques for personalized image generation. |
text-to-image generation, personalized ai, dataset, evaluation metrics, ethical considerations |
2312.06109
Report |
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models |
Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, Xiangyu Zhang |
Modern Large Vision-Language Models (LVLMs) share the same vision vocabulary,
CLIP, which can cover most common vision tasks. However, for some special
vision tasks that need dense and fine-grained vision perception, e.g.,
document-level OCR or chart understanding, especially in non-English scenarios,
the CLIP-style vocabulary may tokenize the vision knowledge inefficiently
and even suffer from an out-of-vocabulary problem. Accordingly, we propose
Vary, an efficient and effective method to scale up the vision vocabulary of
LVLMs. The procedure of Vary is naturally divided into two phases: the
generation and integration of a new vision vocabulary. In the first phase, we
devise a vocabulary network along with a tiny decoder-only transformer to
produce the desired vocabulary via autoregression. In the second, we scale up the
vanilla vision vocabulary by merging the new one with the original one (CLIP),
enabling the LVLMs to quickly garner new features. Compared to the popular
BLIP-2, MiniGPT4, and LLaVA, Vary can maintain its vanilla capabilities while
gaining stronger fine-grained perception and understanding ability.
Specifically, Vary is competent in new document parsing features (OCR or
markdown conversion) while achieving 78.2% ANLS in DocVQA and 36.2% in MMVet.
Our code will be publicly available on the homepage. |
Proposes Vary, a method for scaling up the vision vocabulary of Large Vision-Language Models (LVLMs) to improve performance on tasks requiring dense and fine-grained vision perception, such as document OCR and chart understanding. |
Existing LVLMs often rely on a CLIP-based vision vocabulary, which may not be efficient or effective for specialized vision tasks, particularly in non-English scenarios. |
Vary generates a new vision vocabulary using a vocabulary network and a tiny decoder-only transformer trained on document and chart images. It then integrates this new vocabulary with the original CLIP vocabulary in the LVLM. |
Vary-base achieves comparable performance to specialized document parsing models on English document OCR and outperforms them on markdown conversion.
Vary-base demonstrates significant improvements on downstream VQA tasks, achieving strong results on DocVQA and ChartQA.
Vary-base maintains competitive general performance compared to other LVLMs on the MMVet benchmark. |
The paper acknowledges that the current method for scaling up the visual vocabulary can be further improved.
Future work will explore applying Vary to other fine-grained vision tasks beyond document and chart understanding. |
vision-language models, vocabulary expansion, document ocr, chart understanding, fine-grained vision perception |
2312.06059
Report |
CONFORM: Contrast is All You Need For High-Fidelity Text-to-Image Diffusion Models |
Tuna Han Salih Meral, Enis Simsar, Federico Tombari, Pinar Yanardag |
Images produced by text-to-image diffusion models might not always faithfully
represent the semantic intent of the provided text prompt, where the model
might overlook or entirely fail to produce certain objects. Existing solutions
often require custom-tailored functions for each of these problems, leading
to sub-optimal results, especially for complex prompts. Our work introduces a
novel perspective by tackling this challenge in a contrastive context. Our
approach intuitively promotes the segregation of objects in attention maps
while also maintaining that pairs of related attributes are kept close to each
other. We conduct extensive experiments across a wide variety of scenarios,
each involving unique combinations of objects, attributes, and scenes. These
experiments effectively showcase the versatility, efficiency, and flexibility
of our method in working with both latent and pixel-based diffusion models,
including Stable Diffusion and Imagen. Moreover, we publicly share our source
code to facilitate further research. |
This paper introduces CONFORM, a training-free method using a contrastive objective and test-time optimization to improve the fidelity of pre-trained text-to-image diffusion models. |
Existing methods for improving the fidelity of text-to-image models often rely on tailored solutions for specific problems, leading to sub-optimal performance. |
CONFORM leverages attention maps of object and attribute tokens as features. It treats attributes of a specific object as positive pairs and contrasts them against other attributes and objects. The method utilizes InfoNCE loss with a contrastive objective applied to cross-attention maps from multiple timesteps during the diffusion process. |
CONFORM effectively addresses missing objects, attribute binding errors, and object miscounting in generated images.
Quantitative evaluation using CLIP similarity, BLIP captioning similarity, and TIFA scores demonstrates CONFORM's superiority over existing methods.
User studies confirm that CONFORM generates images that better align with text prompts compared to other state-of-the-art approaches. |
The method may struggle to generate successful images when the initial attention map significantly excludes key objects.
In some cases, CONFORM's refinement process might lead to object separation in the generated image, although the text prompt accuracy is improved. |
text-to-image synthesis, diffusion models, contrastive learning, attention maps, image fidelity |
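The CONFORM report above describes an InfoNCE loss over attention maps, where an object token and its attributes form positive pairs and all other tokens act as negatives. The sketch below shows a generic InfoNCE computation under that grouping. It is an illustration only and not the authors' implementation: the flattened-map inputs, the `groups` labels, the cosine similarity, and the temperature value are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors (flattened attention maps)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(features, groups, temperature=0.5):
    """Mean InfoNCE loss over all positive pairs.

    features: one flattened attention map per token (hypothetical input).
    groups:   one label per token; tokens sharing a label are positives.
    """
    n = len(features)
    losses = []
    for i in range(n):
        # Denominator: similarity of the anchor to every other token.
        denom = sum(
            math.exp(cosine(features[i], features[k]) / temperature)
            for k in range(n)
            if k != i
        )
        for j in range(n):
            if j == i or groups[j] != groups[i]:
                continue
            pos = math.exp(cosine(features[i], features[j]) / temperature)
            losses.append(-math.log(pos / denom))
    if not losses:
        raise ValueError("no positive pairs: every group has a single token")
    return sum(losses) / len(losses)
```

The loss is small when maps within a group overlap and maps from different groups do not, which matches the stated goal of separating objects while keeping attributes bound to their object. In a test-time optimization setting, a loss like this would be differentiated with respect to the latent, which requires an autodiff framework rather than plain Python.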
2312.06038
Report |
Correcting Diffusion Generation through Resampling |
Yujian Liu, Yang Zhang, Tommi Jaakkola, Shiyu Chang |
Despite diffusion models' superior capabilities in modeling complex
distributions, there are still non-trivial distributional discrepancies between
generated and ground-truth images, which have resulted in several notable
problems in image generation, including missing object errors in text-to-image
generation and low image quality. Existing methods that attempt to address
these problems mostly do not address their fundamental cause,
which is the distributional discrepancies, and hence achieve
sub-optimal results. In this paper, we propose a particle filtering framework
that can effectively address both problems by explicitly reducing the
distributional discrepancies. Specifically, our method relies on a set of
external guidance, including a small set of real images and a pre-trained
object detector, to gauge the distribution gap, and then designs the resampling
weight accordingly to correct the gap. Experiments show that our methods can
effectively correct missing object errors and improve image quality in various
image generation tasks. Notably, our method outperforms the existing strongest
baseline by 5% in object occurrence and 1.0 in FID on MS-COCO. Our code is
publicly available at
https://github.com/UCSB-NLP-Chang/diffusion_resampling.git. |
This paper introduces a particle filtering framework for diffusion models, addressing missing object errors and low image quality by minimizing distributional discrepancies between generated and real images. |
Existing methods often fail to address the root cause of these errors: the distributional gap between generated and ground-truth images. |
The proposed framework uses external guidance (real images and/or an object detector) to compute resampling weights during the diffusion process, guiding generated samples closer to the desired distribution. |
Outperforms baselines in object occurrence and image quality on text-to-image generation benchmarks like MS-COCO.
Achieves state-of-the-art FID scores on ImageNet-64 for class-conditioned generation.
Demonstrates the effectiveness of particle filtering and resampling strategies in aligning generated distributions with ground-truth. |
The reliance on external guidance (object detectors, real images) can limit generalizability.
Further exploration is needed to optimize computational efficiency and reduce dependence on large numbers of function evaluations. |
diffusion models, particle filtering, text-to-image generation, image quality, object detection |
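The resampling step at the core of the particle filtering framework described above can be illustrated with plain multinomial resampling: particles with larger weights are duplicated and particles with small weights tend to be dropped. This is a generic sketch, not the paper's weight design. How the weights are derived from real images or an object detector is not shown, and the function names are assumptions.

```python
import random


def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return [w / total for w in weights]


def resample(particles, weights, rng):
    """Draw len(particles) particles with replacement, proportional to weight.

    particles: current samples, e.g. partially denoised images (hypothetical).
    weights:   one non-negative score per particle, e.g. a detector's
               confidence that the prompted objects are present (hypothetical).
    rng:       a random.Random instance, passed in for reproducibility.
    """
    probs = normalize(weights)
    return rng.choices(particles, weights=probs, k=len(particles))
```

In a diffusion sampler, a step like this would be interleaved with denoising steps, so that computation concentrates on trajectories the guidance judges closer to the target distribution.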
2312.05915
Report |
Diffusion for Natural Image Matting |
Yihan Hu, Yiheng Lin, Wei Wang, Yao Zhao, Yunchao Wei, Humphrey Shi |
We aim to leverage diffusion to address the challenging image matting task.
However, the presence of high computational overhead and the inconsistency of
noise sampling between the training and inference processes pose significant
obstacles to achieving this goal. In this paper, we present DiffMatte, a
solution designed to effectively overcome these challenges. First, DiffMatte
decouples the decoder from the intricately coupled matting network design,
involving only one lightweight decoder in the iterations of the diffusion
process. With such a strategy, DiffMatte mitigates the growth of computational
overhead as the number of samples increases. Second, we employ a self-aligned
training strategy with uniform time intervals, ensuring a consistent noise
sampling between training and inference across the entire time domain. Our
DiffMatte is designed with flexibility in mind and can seamlessly integrate
into various modern matting architectures. Extensive experimental results
demonstrate that DiffMatte not only reaches the state-of-the-art level on the
Composition-1k test set, surpassing the best methods in the past by 5% and 15%
in the SAD metric and MSE metric respectively, but also shows stronger
generalization ability in other benchmarks. |
This paper proposes DiffMatte, a novel approach that leverages diffusion models for natural image matting tasks, achieving state-of-the-art performance and strong generalization ability. |
Existing deep learning-based matting solutions struggle to capture both high-level context and low-level texture information, leading to inaccuracies. Diffusion models, known for their ability to model complex data distributions and generate realistic textures, have not been effectively applied to matting due to high computational overhead and noise sampling inconsistencies. |
DiffMatte introduces two key innovations: 1) It decouples the image encoder and decoder, utilizing a lightweight decoder for iterative refinement during the diffusion process, significantly reducing computational overhead. 2) It employs a self-aligned training strategy with uniform time intervals, ensuring consistent noise sampling between training and inference, mitigating performance decay caused by data discrepancy. |
DiffMatte achieves state-of-the-art performance on the Composition-1k benchmark, surpassing previous best methods by a significant margin.
The method demonstrates strong generalization ability, outperforming previous methods on Distinctions-646 and Semantic Image Matting test sets.
DiffMatte's iterative refinement process allows for continuous improvement of predictions with increasing sampling steps, enhancing the quality of the generated alpha mattes. |
The model's performance reaches a plateau with increasing sample steps, limited by the accuracy of the matting model's predictions.
Future work can explore incorporating artificial correction methods during inference to achieve interactive matting, expanding its applications in image editing. |
image matting, diffusion models, deep learning, computer vision, iterative refinement |
2312.05889
Report |
SuperPrimitive: Scene Reconstruction at a Primitive Level |
Kirill Mazur, Gwangbin Bae, Andrew J. Davison |
Joint camera pose and dense geometry estimation from a set of images or a
monocular video remains a challenging problem due to its computational
complexity and inherent visual ambiguities. Most dense incremental
reconstruction systems operate directly on image pixels and solve for their 3D
positions using multi-view geometry cues. Such pixel-level approaches suffer
from ambiguities or violations of multi-view consistency (e.g. caused by
textureless or specular surfaces).
We address this issue with a new image representation which we call a
SuperPrimitive. SuperPrimitives are obtained by splitting images into
semantically correlated local regions and enhancing them with estimated surface
normal directions, both of which are predicted by state-of-the-art single image
neural networks. This provides a local geometry estimate per SuperPrimitive,
while their relative positions are adjusted based on multi-view observations.
We demonstrate the versatility of our new representation by addressing three
3D reconstruction tasks: depth completion, few-view structure from motion, and
monocular dense visual odometry. |
This paper introduces SuperPrimitives, a novel image representation for dense 3D reconstruction that combines local geometric priors from single-image neural networks with multi-view optimization. |
Dense incremental reconstruction from monocular images or videos is challenging due to computational complexity and visual ambiguities. Existing methods often struggle to balance reliable initial geometry estimates with multi-view consistency. |
SuperPrimitives are constructed by segmenting an image into semantically correlated regions and enhancing them with estimated surface normals. These primitives' scales are then optimized jointly with camera poses using multi-view photometric consistency. |
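As a simplified illustration of resolving a primitive's unknown scale, the sketch below fits one scale per segment in closed form against a few sparse depth samples, as one might in a depth-completion setting. This is an assumption made for illustration only: the joint optimization with camera poses via multi-view photometric consistency is not reproduced, and `fit_scale` and the sample values are hypothetical.

```python
def fit_scale(relative_depths, observed_depths):
    """Least-squares scale s minimizing sum((s * r - d) ** 2) over samples."""
    numerator = sum(r * d for r, d in zip(relative_depths, observed_depths))
    denominator = sum(r * r for r in relative_depths)
    if denominator == 0:
        raise ValueError("relative depths must not all be zero")
    return numerator / denominator


# Hypothetical segment: up-to-scale depths and sparse metric observations.
relative = [1.0, 2.0, 4.0]
observed = [0.5, 1.0, 2.0]
scale = fit_scale(relative, observed)
metric_depths = [scale * r for r in relative]
```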
The method achieves state-of-the-art results in zero-shot depth completion on the VOID benchmark.
It outperforms competitors in few-view structure from motion on the ScanNet dataset, even without global priors.
The approach enables a simple yet effective monocular visual odometry system that surpasses previous methods on the challenging TUM RGB-D dataset. |
The method assumes geometric continuity within each predicted image segment, which may not always hold.
The current implementation does not explicitly handle occlusions during primitive alignment, which could be improved in future work. |
3d reconstruction, superprimitives, depth completion, structure from motion, visual odometry |
2312.05849
Report |
InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models |
Jiun Tian Hoe, Xudong Jiang, Chee Seng Chan, Yap-Peng Tan, Weipeng Hu |
Large-scale text-to-image (T2I) diffusion models have showcased incredible
capabilities in generating coherent images based on textual descriptions,
enabling vast applications in content generation. While recent advancements
have introduced control over factors such as object localization, posture, and
image contours, a crucial gap remains in our ability to control the
interactions between objects in the generated content. Controlling
interactions well in generated images could yield meaningful applications, such as
creating realistic scenes with interacting characters. In this work, we study
the problems of conditioning T2I diffusion models with Human-Object Interaction
(HOI) information, consisting of a triplet label (person, action, object) and
corresponding bounding boxes. We propose a pluggable interaction control model,
called InteractDiffusion, that extends existing pre-trained T2I diffusion models
so that they can be better conditioned on interactions. Specifically, we
tokenize the HOI information and learn their relationships via interaction
embeddings. A conditioning self-attention layer is trained to map HOI tokens to
visual tokens, thereby conditioning the visual tokens better in existing T2I
diffusion models. Our model attains the ability to control the interaction and
location on existing T2I diffusion models, which outperforms existing baselines
by a large margin in HOI detection score, as well as fidelity in FID and KID.
Project page: https://jiuntian.github.io/interactdiffusion. |
This work proposes InteractDiffusion, a pluggable interaction control model, which enhances pre-trained text-to-image diffusion models with Human-Object Interaction (HOI) control, improving the generation of images with specific interactions between objects. |
Current text-to-image diffusion models struggle to accurately depict interactions between objects, a crucial aspect of realistic image generation. Controlling interactions enables diverse applications in e-commerce, gaming, and interactive storytelling. |
The paper introduces InteractDiffusion, a module consisting of: (1) Interaction Tokenizer (InToken) to transform HOI information into meaningful tokens; (2) Interaction Embedding (InBedding) to capture intricate relationships between interacting objects; (3) Interaction Transformer (InFormer) to integrate HOI tokens into the visual tokens of the diffusion model. |
InteractDiffusion renders more accurate and coherent interactions between objects compared to existing methods, aligning better with provided instructions.
The model maintains high image generation quality, even with added parameters for interaction control, showing comparable or even slightly better FID and KID scores.
Quantitative evaluation using HOI Detection Score shows significant improvement over baselines, demonstrating effective control over interactions in generated images. |
Generated interactions, while improved, may still lack finer details compared to real images, as evidenced by performance differences with larger HOI detectors.
Existing large pre-trained models lack comprehensive understanding of interactions, potentially hindering the full potential of interaction control. |
text-to-image generation, diffusion models, human-object interaction, interaction control, image synthesis |
2312.05760
Report |
RepViT-SAM: Towards Real-Time Segmenting Anything |
Ao Wang, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding |
Segment Anything Model (SAM) has shown impressive zero-shot transfer
performance for various computer vision tasks recently. However, its heavy
computation costs remain daunting for practical applications. MobileSAM
proposes to replace the heavyweight image encoder in SAM with TinyViT by
employing distillation, which results in a significant reduction in
computational requirements. However, its deployment on resource-constrained
mobile devices still encounters challenges due to the substantial memory and
computational overhead caused by self-attention mechanisms. Recently, RepViT
achieves the state-of-the-art performance and latency trade-off on mobile
devices by incorporating efficient architectural designs of ViTs into CNNs.
Here, to achieve real-time segmenting anything on mobile devices, following
MobileSAM, we replace the heavyweight image encoder in SAM with RepViT model,
ending up with the RepViT-SAM model. Extensive experiments show that RepViT-SAM
can enjoy significantly better zero-shot transfer capability than MobileSAM,
along with nearly 10× faster inference speed. The code and models are
available at https://github.com/THU-MIG/RepViT. |
This paper introduces RepViT-SAM, a model that replaces the heavy image encoder in Segment Anything Model (SAM) with RepViT to enable real-time performance on mobile devices. |
SAM, despite its impressive zero-shot transfer capabilities for various computer vision tasks, suffers from high computational costs. This limits its practical applications, especially on resource-constrained devices. |
The authors replace the ViT-H image encoder in SAM with RepViT-M2.3. They train RepViT-SAM using a decoupled distillation strategy, directly distilling the image encoder from the ViT-H in the original SAM with an MSE loss. |
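The MSE distillation objective mentioned above can be sketched as follows. Flat Python lists stand in for the image embeddings of the student (RepViT) and teacher (ViT-H) encoders; in practice these would be tensors in a deep learning framework. The function name and the example values are hypothetical.

```python
def mse_distillation_loss(student_embedding, teacher_embedding):
    """Mean squared error between student and teacher embeddings."""
    if len(student_embedding) != len(teacher_embedding):
        raise ValueError("embeddings must have the same length")
    if not student_embedding:
        raise ValueError("embeddings must not be empty")
    squared_errors = [
        (s - t) ** 2 for s, t in zip(student_embedding, teacher_embedding)
    ]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical embeddings: only the last element differs (by 2.0).
loss = mse_distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Because only the encoder output is matched, the mask decoder does not need to be run during this kind of training, which is what makes the distillation "decoupled".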
RepViT-SAM achieves significantly better zero-shot transfer capability than MobileSAM across various tasks like edge detection, instance segmentation, and video object segmentation.
RepViT-SAM exhibits a near 10x faster inference speed compared to MobileSAM, making it suitable for real-time applications on mobile devices.
RepViT-SAM demonstrates strong performance even surpassing the original SAM in specific downstream tasks like anomaly detection. |
The decoupled distillation strategy, while enabling efficient transfer of macro-level visual features, may limit the model's performance on tasks requiring fine-grained details.
Further exploration of different RepViT variants and distillation strategies could potentially yield even better performance and efficiency trade-offs. |
segment anything model, repvit, mobile vision, zero-shot learning, model compression |
2312.05695
Report |
The Counterattack of CNNs in Self-Supervised Learning: Larger Kernel Size might be All You Need |
Tianjin Huang, Tianlong Chen, Zhangyang Wang, Shiwei Liu |
Vision Transformers have been rapidly rising in computer vision thanks to
their outstanding scaling trends, and gradually replacing convolutional neural
networks (CNNs). Recent works on self-supervised learning (SSL) introduce
siamese pre-training tasks, on which Transformer backbones continue to
demonstrate ever stronger results than CNNs. People come to believe that
Transformers or self-attention modules are inherently more suitable than CNNs
in the context of SSL. However, it is noteworthy that most if not all prior
arts of SSL with CNNs chose the standard ResNets as their backbones, whose
architecture effectiveness is known to already lag behind advanced Vision
Transformers. Therefore, it remains unclear whether the self-attention
operation is crucial for the recent advances in SSL, or whether CNNs can deliver
the same excellence with more advanced designs. Can we close the SSL
performance gap between Transformers and CNNs? To answer these intriguing
questions, we apply self-supervised pre-training to the recently proposed,
stronger larger-kernel CNN architecture and conduct an apples-to-apples comparison
with Transformers, in their SSL performance. Our results show that we are able
to build pure CNN SSL architectures that perform on par with or better than the
best SSL-trained Transformers, by just scaling up convolutional kernel sizes
besides other small tweaks. Impressively, when transferring to the downstream
tasks MS COCO detection and segmentation, our SSL pre-trained CNN
model (trained in 100 epochs) achieves the same good performance as the
300-epoch pre-trained Transformer counterpart. We hope this work can help to
better understand what is essential (or not) for self-supervised learning
backbones. |
This paper investigates whether the recent success of large-kernel CNNs in supervised learning can be translated to self-supervised learning (SSL), challenging the notion that Transformers are inherently superior for SSL. |
The dominance of Transformers in SSL has led to a belief that self-attention is crucial, neglecting the potential of advanced CNN architectures. |
The authors adapt the ConvNeXt architecture for SSL by adding BatchNorm layers after depthwise convolutions and scaling up kernel sizes. They compare this modified CNN, dubbed BC-SSL, with state-of-the-art SSL-trained Transformers (ViT and Swin) using the DINO framework. |
BC-SSL achieves comparable or better performance than Swin Transformers on ImageNet classification with linear probe and k-NN evaluation, while having faster inference throughput.
BC-SSL exhibits significant performance gains on downstream tasks like object detection and segmentation on MS COCO, outperforming Swin Transformers trained for the same number of epochs.
BC-SSL demonstrates increasing robustness to distribution shifts with larger kernel sizes, surpassing both ResNet and Swin Transformer in robustness benchmarks. |
The benefits of increasing kernel size in BC-SSL seem to saturate at 9x9; exploring larger kernels with techniques like structure re-parameterization or sparsity is left for future work.
The study focuses on ConvNeXt; exploring other large-kernel CNN architectures like RepLKNet and SLaK in SSL could be beneficial. |
self-supervised learning, convolutional neural networks, vision transformers, large kernels, robustness |
2312.05664
Report |
CoGS: Controllable Gaussian Splatting |
Heng Yu, Joel Julin, Zoltán Á. Milacski, Koichiro Niinuma, László A. Jeni |
Capturing and re-animating the 3D structure of articulated objects presents
significant barriers. On one hand, methods requiring extensively calibrated
multi-view setups are prohibitively complex and resource-intensive, limiting
their practical applicability. On the other hand, while single-camera Neural
Radiance Fields (NeRFs) offer a more streamlined approach, they have excessive
training and rendering costs. 3D Gaussian Splatting would be a suitable
alternative but for two reasons. Firstly, existing methods for 3D dynamic
Gaussians require synchronized multi-view cameras, and secondly, they lack
controllability in dynamic scenarios. We present CoGS, a method for
Controllable Gaussian Splatting, that enables the direct manipulation of scene
elements, offering real-time control of dynamic scenes without the prerequisite
of pre-computing control signals. We evaluated CoGS using both synthetic and
real-world datasets that include dynamic objects that differ in degree of
difficulty. In our evaluations, CoGS consistently outperformed existing dynamic
and controllable neural representations in terms of visual fidelity. |
Presents CoGS, a method for Controllable Gaussian Splatting that allows direct manipulation of dynamic scenes captured by a monocular camera without pre-computed control signals. |
Addresses limitations of NeRFs (computational cost, implicit representation) by using explicit 3D Gaussian representations for efficient rendering and manipulation of dynamic scenes. |
Extends 3D Gaussian Splatting to dynamic scenes by learning deformation fields for Gaussian parameters and introduces control by: 1) generating 3D masks, 2) extracting control signals from Gaussian trajectories, and 3) re-aligning these signals for manipulation. |
Outperforms existing dynamic and controllable neural representations in visual fidelity on synthetic and real-world datasets.
Successfully manipulates various dynamic scenes including faces, toy cars, and animated objects.
Demonstrates real-time control capabilities without relying on pre-defined control signals. |
Faces challenges with highly reflective objects and large-scale non-rigid motion.
Current control signal extraction using PCA may not generalize to highly complex movements. |
gaussian splatting, dynamic scene representation, controllable animation, monocular vision, 3d reconstruction |
2312.05616
Report |
Iterative Token Evaluation and Refinement for Real-World Super-Resolution |
Chaofeng Chen, Shangchen Zhou, Liang Liao, Haoning Wu, Wenxiu Sun, Qiong Yan, Weisi Lin |
Real-world image super-resolution (RWSR) is a long-standing problem as
low-quality (LQ) images often have complex and unidentified degradations.
Existing methods such as Generative Adversarial Networks (GANs) or continuous
diffusion models present their own issues: GANs are difficult to
train, while continuous diffusion models require numerous inference steps. In
this paper, we propose an Iterative Token Evaluation and Refinement (ITER)
framework for RWSR, which utilizes a discrete diffusion model operating in the
discrete token representation space, i.e., indexes of features extracted from a
VQGAN codebook pre-trained with high-quality (HQ) images. We show that ITER is
easier to train than GANs and more efficient than continuous diffusion models.
Specifically, we divide RWSR into two sub-tasks, i.e., distortion removal and
texture generation. Distortion removal involves simple HQ token prediction with
LQ images, while texture generation uses a discrete diffusion model to
iteratively refine the distortion removal output with a token refinement
network. In particular, we propose to include a token evaluation network in the
discrete diffusion process. It learns to evaluate which tokens are good
restorations and helps to improve the iterative refinement results. Moreover,
the evaluation network can first check the status of the distortion removal output
and then adaptively select total refinement steps needed, thereby maintaining a
good balance between distortion removal and texture generation. Extensive
experimental results show that ITER is easy to train and performs well within
just 8 iterative steps. Our codes will be available publicly. |
This paper proposes ITER, an Iterative Token Evaluation and Refinement framework for Real-World Super-Resolution (RWSR) that operates in the discrete token representation space. |
RWSR is challenging due to complex, unidentified degradations in low-quality images. Existing GAN-based methods are difficult to train, while diffusion models require many inference steps. ITER addresses these limitations by being easier to train than GANs and more efficient than diffusion models. |
The method divides RWSR into distortion removal and texture generation. It uses a pre-trained VQGAN codebook, a distortion removal encoder, and a conditioned discrete diffusion model with a novel token evaluation block for iterative refinement. |
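The evaluate-then-refine loop can be sketched as below. In the paper the evaluator and refiner are learned networks; here they are hypothetical toy functions passed in as arguments, and the threshold, the token values, and the name `iterative_refine` are assumptions for illustration. The default of 8 steps mirrors the step count quoted in the abstract.

```python
def iterative_refine(tokens, evaluate, refine, threshold=0.5, max_steps=8):
    """Refine only the tokens the evaluator scores below `threshold`.

    Stops early once every token is accepted, so the number of steps
    adapts to the quality of the initial restoration.
    """
    tokens = list(tokens)
    steps_used = 0
    for _ in range(max_steps):
        scores = evaluate(tokens)
        rejected = [i for i, s in enumerate(scores) if s < threshold]
        if not rejected:
            break
        tokens = refine(tokens, rejected)
        steps_used += 1
    return tokens, steps_used


# Toy stand-ins: a token is "good" when it equals a known target index.
TARGET = [3, 5, 5]


def toy_evaluate(tokens):
    return [1.0 if tok == tgt else 0.0 for tok, tgt in zip(tokens, TARGET)]


def toy_refine(tokens, rejected):
    out = list(tokens)
    for i in rejected:
        # Move the rejected token one index closer to its target.
        out[i] += 1 if out[i] < TARGET[i] else -1
    return out


refined, steps = iterative_refine([3, 7, 5], toy_evaluate, toy_refine)
```

Raising or lowering `threshold` changes how many tokens get regenerated, which corresponds loosely to the user control over texture strength mentioned in the findings.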
ITER achieves state-of-the-art performance on real-world benchmarks, outperforming both GAN and diffusion-based methods.
The iterative refinement with token evaluation effectively generates realistic textures and avoids local propagation problems.
The adaptive inference strategy balances distortion removal and texture generation based on initial restoration quality and allows user control over texture strength through a threshold. |
The performance of ITER is limited by the reconstruction quality of the pre-trained VQGAN.
Future work could explore faster architectures for the token evaluation and refinement networks to further improve inference speed. |
image super-resolution, real-world super-resolution, discrete diffusion model, token evaluation, vqgan |
2312.05541
Report |
DPoser: Diffusion Model as Robust 3D Human Pose Prior |
Junzhe Lu, Jing Lin, Hongkun Dou, Ailing Zeng, Yue Deng, Yulun Zhang, Haoqian Wang |
This work aims to construct a robust human pose prior. However, this remains
a persistent challenge due to biomechanical constraints and diverse human
movements. Traditional priors like VAEs and NDFs often exhibit shortcomings in
realism and generalization, notably with unseen noisy poses. To address these
issues, we introduce DPoser, a robust and versatile human pose prior built upon
diffusion models. DPoser regards various pose-centric tasks as inverse problems
and employs variational diffusion sampling for efficient solving. Accordingly,
designed with optimization frameworks, DPoser seamlessly benefits human mesh
recovery, pose generation, pose completion, and motion denoising tasks.
Furthermore, due to the disparity between the articulated poses and structured
images, we propose truncated timestep scheduling to enhance the effectiveness
of DPoser. Our approach demonstrates considerable enhancements over common
uniform scheduling used in image domains, boasting improvements of 5.4%, 17.2%,
and 3.8% across human mesh recovery, pose completion, and motion denoising,
respectively. Comprehensive experiments demonstrate the superiority of DPoser
over existing state-of-the-art pose priors across multiple tasks. |
Presents DPoser, a novel human pose prior built upon diffusion models that achieves state-of-the-art performance across diverse pose-related tasks. |
Existing human pose priors struggle with realism and generalization, especially for unseen or noisy poses, limiting their effectiveness in real-world applications. |
DPoser leverages variational diffusion sampling to integrate diffusion prior within optimization frameworks for tasks like human mesh recovery, pose generation, pose completion, and motion denoising. It also introduces a truncated timestep scheduling strategy tailored for the characteristics of pose data. |
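One way to picture truncated timestep scheduling is a schedule that only visits a sub-range of the diffusion time axis instead of sweeping all of it. The sketch below assumes a linear decrease from `t_max` to `t_min` over the optimization iterations; the function name, the linear shape, and the default range are all hypothetical, since the summary does not state the actual values.

```python
def truncated_schedule(num_iters, t_max=0.2, t_min=0.05):
    """Linearly decreasing timesteps confined to [t_min, t_max]."""
    if num_iters < 1:
        raise ValueError("num_iters must be at least 1")
    if not 0.0 <= t_min <= t_max <= 1.0:
        raise ValueError("require 0 <= t_min <= t_max <= 1")
    if num_iters == 1:
        return [t_max]
    step = (t_max - t_min) / (num_iters - 1)
    return [t_max - i * step for i in range(num_iters)]


# Hypothetical range chosen so the values are easy to read.
schedule = truncated_schedule(4, t_max=0.5, t_min=0.125)
```

A uniform schedule, by contrast, would cover the whole interval up to 1.0, spending iterations at noise levels where a low-dimensional pose may already be almost entirely destroyed.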
DPoser generates more realistic and diverse human poses compared to previous state-of-the-art methods.
In human mesh recovery, DPoser outperforms existing priors even when fitting from scratch.
For pose completion, DPoser consistently shows superior performance under various occlusion scenarios, effectively reconstructing plausible full 3D poses from partial observations. |
DPoser, relying on variational inference, might exhibit mode-seeking behavior, limiting solution diversity.
Future work could explore particle-based variational inference or other advanced diffusion-based solvers to enhance solution diversity and handle more complex inverse problems with unknown parameters in the measurement operator. |
human pose prior, diffusion models, variational diffusion sampling, truncated timestep scheduling, human mesh recovery |
2312.05525
Report |
You Only Learn One Query: Learning Unified Human Query for Single-Stage Multi-Person Multi-Task Human-Centric Perception |
Sheng Jin, Shuhuai Li, Tong Li, Wentao Liu, Chen Qian, Ping Luo |
Human-centric perception (e.g. pedestrian detection, segmentation, pose
estimation, and attribute analysis) is a long-standing problem for computer
vision. This paper introduces a unified and versatile framework (HQNet) for
single-stage multi-person multi-task human-centric perception (HCP). Our
approach centers on learning a unified human query representation, denoted as
Human Query, which captures intricate instance-level features for individual
persons and disentangles complex multi-person scenarios. Although different HCP
tasks have been well-studied individually, single-stage multi-task learning of
HCP tasks has not been fully exploited in the literature due to the absence of
a comprehensive benchmark dataset. To address this gap, we propose the
COCO-UniHuman benchmark dataset to enable model development and comprehensive
evaluation. Experimental results demonstrate the proposed method's
state-of-the-art performance among multi-task HCP models and its competitive
performance compared to task-specific HCP models. Moreover, our experiments
underscore Human Query's adaptability to new HCP tasks, thus demonstrating its
robust generalization capability. Codes and data will be publicly accessible. |
This paper introduces HQNet, a unified and versatile single-stage framework for multi-person multi-task human-centric perception (HCP) that centers on learning a unified human query representation (Human Query). |
Single-stage multi-task learning of HCP tasks has not been fully exploited due to the absence of a comprehensive benchmark dataset, hindering the development of algorithms that treat various HCP tasks as a unified problem. |
HQNet uses a backbone network, Transformer encoder and decoder, and task-specific heads. It leverages HumanQuery-Instance Matching and Gender-aided human Model Selection to exploit interactions between HCP tasks. |
HQNet achieves state-of-the-art performance on the COCO-UniHuman benchmark for various HCP tasks, including detection, segmentation, pose estimation, and attribute recognition.
The learned Human Query exhibits strong transferability to novel HCP tasks such as face detection and multi-object tracking.
Co-learning multiple HCP tasks within the unified framework leads to improved overall performance due to inter-task synergy. |
The framework is currently limited to RGB images and could be extended to video or multi-modal data.
Future work can explore more comprehensive multi-task HCP scenarios. |
human-centric perception, unified vision model, multi-task learning, query-based learning, coco-unihuman dataset |
2312.05482
Report |
BARET : Balanced Attention based Real image Editing driven by Target-text Inversion |
Yuming Qiao, Fanyi Wang, Jingwen Su, Yanhao Zhang, Yunjie Yu, Siyu Wu, Guo-Jun Qi |
Image editing approaches with diffusion models have been rapidly developed,
yet their applicability is subject to requirements such as specific editing
types (e.g., foreground or background object editing, style transfer), multiple
conditions (e.g., mask, sketch, caption), and time-consuming fine-tuning of
diffusion models. To alleviate these limitations and realize efficient
real image editing, we propose a novel editing technique that only requires an
input image and target text for various editing types, including non-rigid edits,
without fine-tuning the diffusion model. Our method contains three novelties: (I)
Target-text Inversion Schedule (TTIS) is designed to fine-tune the input target
text embedding to achieve fast image reconstruction without an image caption and
to accelerate convergence. (II) Progressive Transition Scheme applies
progressive linear interpolation between the target text embedding and its
fine-tuned version to generate a transition embedding for maintaining non-rigid
editing capability. (III) Balanced Attention Module (BAM) balances the tradeoff
between textual description and image semantics. By combining the
self-attention map from the reconstruction process and the cross-attention map from
the transition process, the guidance of target text embeddings in the diffusion process
is optimized. In order to demonstrate editing capability, effectiveness and
efficiency of the proposed BARET, we have conducted extensive qualitative and
quantitative experiments. Moreover, results derived from user study and
ablation study further prove the superiority over other methods. |
This paper introduces BARET, a text-based real image editing technique that uses only an input image and target text for various edits, including non-rigid transformations, without requiring fine-tuning of the diffusion model. |
Existing methods have limitations like requiring specific editing types, multiple input conditions, or time-consuming fine-tuning. This work aims to overcome these limitations for efficient and versatile real image editing. |
BARET consists of three components: 1) Target-text Inversion Schedule (TTIS) for efficient image reconstruction by fine-tuning target text embeddings. 2) Progressive Transition Scheme to enhance non-rigid editing by progressively interpolating between target text and fine-tuned embeddings. 3) Balanced Attention Module (BAM) to balance original image features and non-rigid changes by leveraging self-attention and cross-attention maps. |
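The progressive interpolation in component 2 can be sketched with plain lists standing in for embeddings. The direction of the sweep (from the fine-tuned embedding toward the target text embedding) and the evenly spaced weights are assumptions, as the summary does not specify them; `transition_embeddings` is a hypothetical name.

```python
def transition_embeddings(target, tuned, num_steps):
    """Linearly interpolate from `tuned` (w = 0) to `target` (w = 1)."""
    if len(target) != len(tuned):
        raise ValueError("embeddings must have the same length")
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    sequence = []
    for k in range(num_steps):
        w = k / (num_steps - 1)  # interpolation weight for this step
        sequence.append([(1 - w) * f + w * t for t, f in zip(target, tuned)])
    return sequence


# Hypothetical two-dimensional embeddings.
steps = transition_embeddings(target=[1.0, 0.0], tuned=[0.0, 1.0], num_steps=3)
```

Intermediate weights keep part of the reconstruction-friendly fine-tuned embedding while moving toward the edit described by the target text.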
BARET outperforms existing methods in terms of text alignment, image fidelity, and efficiency, especially for complex non-rigid edits.
User study confirms BARET's superiority in visual quality, achieving higher scores compared to baseline methods.
BARET demonstrates fast convergence (16s for reconstruction) compared to methods requiring diffusion model fine-tuning (10-20 minutes). |
The effectiveness of BARET for complex compositions or edits requiring high-level semantic understanding needs further investigation.
Exploring automatic optimization of interpolation parameters for different editing tasks could enhance usability. |
image editing, diffusion models, text-guided editing, non-rigid transformation, attention mechanisms |
2312.05476
Report |
Exploring the Naturalness of AI-Generated Images |
Zijian Chen, Wei Sun, Haoning Wu, Zicheng Zhang, Jun Jia, Zhongpeng Ji, Fengyu Sun, Shangling Jui, Xiongkuo Min, Guangtao Zhai, Wenjun Zhang |
The proliferation of Artificial Intelligence-Generated Images (AGIs) has
greatly expanded the Image Naturalness Assessment (INA) problem. Different from
early definitions that mainly focus on tone-mapped images with limited
distortions (e.g., exposure, contrast, and color reproduction), INA on
AI-generated images is especially challenging as it has more diverse contents
and could be affected by factors from multiple perspectives, including
low-level technical distortions and high-level rationality distortions. In this
paper, we take the first step to benchmark and assess the visual naturalness of
AI-generated images. First, we construct the AI-Generated Image Naturalness
(AGIN) database by conducting a large-scale subjective study to collect human
opinions on the overall naturalness as well as perceptions from technical and
rationality perspectives. AGIN verifies that naturalness is universally and
disparately affected by technical and rationality distortions. Second, we
propose the Joint Objective Image Naturalness evaluaTor (JOINT) to
automatically predict the naturalness of AGIs in a way that aligns with human ratings.
Specifically, JOINT imitates human reasoning in naturalness evaluation by
jointly learning both technical and rationality features. We demonstrate that
JOINT significantly outperforms baselines for providing more subjectively
consistent results on naturalness assessment. |
This paper introduces AGIN, the first database for AI-generated image naturalness assessment, and proposes JOINT, an objective naturalness evaluator that jointly learns technical and rationality features. |
Evaluating the naturalness of AI-generated images is crucial as they become increasingly prevalent, and traditional IQA methods fall short in addressing the diverse contents and rationality factors involved. |
AGIN was constructed through a large-scale subjective study collecting human opinions on technical and rationality perspectives, and JOINT uses a two-branch architecture mimicking human naturalness reasoning. |
AGIN reveals that naturalness is disparately affected by both technical and rationality distortions.
The impact of factors within these two perspectives on naturalness varies significantly.
JOINT significantly outperforms baselines, demonstrating the effectiveness of joint learning in image naturalness evaluation. |
The database is limited to five generative tasks, and future work can expand to more tasks and modalities.
The current objective model, while effective, can be further improved by exploring more sophisticated architectures and training strategies. |
ai-generated images, image naturalness assessment, database, subjective evaluation, deep learning |
2312.05390
Report |
NoiseCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions in Diffusion Models |
Yusuf Dalva, Pinar Yanardag |
Generative models have been very popular in recent years for their image
generation capabilities. GAN-based models are highly regarded for their
disentangled latent space, which is a key feature contributing to their success
in controlled image editing. On the other hand, diffusion models have emerged
as powerful tools for generating high-quality images. However, the latent space
of diffusion models is not as thoroughly explored or understood. Existing
methods that aim to explore the latent space of diffusion models usually rely
on text prompts to pinpoint specific semantics. However, this approach may be
restrictive in areas such as art, fashion, or specialized fields like medicine,
where suitable text prompts might not be available or easy to conceive, thus
limiting the scope of existing work. In this paper, we propose an unsupervised
method to discover latent semantics in text-to-image diffusion models without
relying on text prompts. Our method takes a small set of unlabeled images from
specific domains, such as faces or cats, and a pre-trained diffusion model, and
discovers diverse semantics in an unsupervised fashion using a contrastive
learning objective. Moreover, the learned directions can be applied
simultaneously, either within the same domain (such as various types of facial
edits) or across different domains (such as applying cat and face edits within
the same image) without interfering with each other. Our extensive experiments
show that our method achieves highly disentangled edits, outperforming existing
approaches in both diffusion-based and GAN-based latent space editing methods. |
This paper introduces NoiseCLR, an unsupervised contrastive learning method for discovering interpretable semantic directions within the latent space of pre-trained text-to-image diffusion models like Stable Diffusion. |
This is important because it allows for disentangled image editing in diffusion models without relying on text prompts, which can be limiting in domains where suitable prompts are difficult to formulate. |
NoiseCLR leverages a contrastive learning objective to learn latent directions by encouraging similarity between edits made by the same direction while repelling edits from different directions. It operates directly on the noise estimations of the diffusion model. |
NoiseCLR successfully discovers a variety of disentangled directions across different domains (faces, cats, cars, artwork) using a single diffusion model.
The method allows for intra-domain and cross-domain editing, enabling the combination of multiple edits within and across different semantic categories.
Evaluations demonstrate that NoiseCLR outperforms existing diffusion-based image editing methods and achieves competitive results with GAN-based ones, both qualitatively and quantitatively. |
The manipulation capabilities of NoiseCLR are inherently limited by the biases present in the datasets used to train the underlying diffusion model and its associated language model.
Similar to other image synthesis tools, there are ethical concerns regarding the potential misuse of NoiseCLR for malicious purposes, such as generating deepfakes. |
diffusion models, image editing, latent space exploration, contrastive learning, unsupervised learning |
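The contrastive objective summarized in the NoiseCLR entry above (edits made by the same direction attract, edits from different directions repel) can be illustrated with a minimal sketch. It assumes each edit is already reduced to a feature vector, for example a difference of noise estimates; the function names, the cosine similarity, and the temperature value are illustrative assumptions, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def direction_contrastive_loss(edits, temperature=0.5):
    """InfoNCE-style loss over edit features (hypothetical helper).

    edits[i][k] is the feature vector of the edit applied to image i
    by direction k. Positives share a direction across different
    images; negatives are edits made by any other direction.
    Needs at least two images.
    """
    n_images, n_dirs = len(edits), len(edits[0])
    total, count = 0.0, 0
    for i in range(n_images):
        for k in range(n_dirs):
            anchor = edits[i][k]
            # Sum over every edit produced by a different direction.
            negatives = sum(
                math.exp(cosine(anchor, edits[m][q]) / temperature)
                for m in range(n_images)
                for q in range(n_dirs)
                if q != k
            )
            for j in range(n_images):
                if j == i:
                    continue
                positive = math.exp(cosine(anchor, edits[j][k]) / temperature)
                total += -math.log(positive / (positive + negatives))
                count += 1
    return total / count
```

The loss is low when a direction produces the same kind of edit on every image and high when directions produce interchangeable edits.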
2312.05295
Report |
Disentangled Clothed Avatar Generation from Text Descriptions |
Jionghao Wang, Yuan Liu, Zhiyang Dou, Zhengming Yu, Yongqing Liang, Xin Li, Wenping Wang, Rong Xie, Li Song |
In this paper, we introduce a novel text-to-avatar generation method that
separately generates the human body and the clothes and allows high-quality
animation on the generated avatar. While recent advancements in text-to-avatar
generation have yielded diverse human avatars from text prompts, these methods
typically combine all elements (clothes, hair, and body) into a single 3D
representation. Such an entangled approach poses challenges for downstream
tasks like editing or animation. To overcome these limitations, we propose a
novel disentangled 3D avatar representation named Sequentially Offset-SMPL
(SO-SMPL), building upon the SMPL model. SO-SMPL represents the human body and
clothes with two separate meshes, but associates them with offsets to ensure
the physical alignment between the body and the clothes. Then, we design a
Score Distillation Sampling (SDS)-based distillation framework to generate the
proposed SO-SMPL representation from text prompts. In comparison with existing
text-to-avatar methods, our approach not only achieves higher texture and
geometry quality and better semantic alignment with text prompts, but also
significantly improves the visual quality of character animation, virtual
try-on, and avatar editing. Our project page is at
https://shanemankiw.github.io/SO-SMPL/. |
This paper presents SO-SMPL, a novel method for generating disentangled 3D human avatars with clothes from text prompts. |
Disentangling clothes and body in 3D avatar generation allows for more realistic animation, easier editing, and virtual try-on applications. |
SO-SMPL represents human body and clothes as separate meshes with offsets to ensure alignment, leveraging score distillation sampling and a two-stage optimization process for generation. |
Achieves higher texture and geometry quality compared to previous text-to-avatar methods.
Generates separate clothes meshes that can be fitted to different body shapes.
Enables realistic animation by simulating clothes and body motions separately. |
Limited to clothing types compatible with the SMPL-X topology, excluding items like skirts and dresses.
Generated clothes lack sewing patterns and physical properties. |
3d avatar generation, text-to-3d, disentangled representation, score distillation sampling, virtual try-on |
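The idea in the SO-SMPL entry above, a clothes mesh tied to the body mesh through offsets, can be sketched as follows. This is a simplified illustration that assumes per-vertex 3D offsets and a boolean coverage mask; `clothed_vertices` and its arguments are hypothetical names, and the actual SO-SMPL parameterization may differ.

```python
def clothed_vertices(body_vertices, offsets, clothes_mask):
    """Derive clothes vertices from body vertices plus offsets.

    body_vertices: list of (x, y, z) tuples for the body mesh.
    offsets:       list of (dx, dy, dz) tuples, one per body vertex.
    clothes_mask:  list of bools, True where the garment covers the body.
    Only covered vertices contribute to the clothes mesh.
    """
    clothes = []
    for vertex, offset, covered in zip(body_vertices, offsets, clothes_mask):
        if covered:
            clothes.append(tuple(v + d for v, d in zip(vertex, offset)))
    return clothes
```

Because the garment is defined relative to the body, changing the body shape moves the clothes with it, which is consistent with the entry's claim that clothes can be fitted to different body shapes.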
2312.05288
Report |
MotionCrafter: One-Shot Motion Customization of Diffusion Models |
Yuxin Zhang, Fan Tang, Nisha Huang, Haibin Huang, Chongyang Ma, Weiming Dong, Changsheng Xu |
The essence of a video lies in its dynamic motions, including character
actions, object movements, and camera movements. While text-to-video generative
diffusion models have recently advanced in creating diverse contents,
controlling specific motions through text prompts remains a significant
challenge. A primary issue is the coupling of appearance and motion, often
leading to overfitting on appearance. To tackle this challenge, we introduce
MotionCrafter, a novel one-shot instance-guided motion customization method.
MotionCrafter employs a parallel spatial-temporal architecture that injects the
reference motion into the temporal component of the base model, while the
spatial module is independently adjusted for character or style control. To
enhance the disentanglement of motion and appearance, we propose an innovative
dual-branch motion disentanglement approach, comprising a motion
disentanglement loss and an appearance prior enhancement strategy. During
training, a frozen base model provides appearance normalization, effectively
separating appearance from motion and thereby preserving diversity.
Comprehensive quantitative and qualitative experiments, along with user
preference tests, demonstrate that MotionCrafter can successfully integrate
dynamic motions while preserving the coherence and quality of the base model
with a wide range of appearance generation capabilities. Project page:
https://zyxelsa.github.io/homepage-motioncrafter. Codes are available at
https://github.com/zyxElsa/MotionCrafter. |
Introduces MotionCrafter, a one-shot instance-guided method for customizing dynamic motions in text-to-video generation. |
Addresses the challenge of controlling specific motions in generated videos, which current text-to-video models struggle with, particularly in decoupling motion from appearance. |
Employs a parallel spatial-temporal architecture to fine-tune pre-trained text-to-video models, separating appearance and motion learning. Introduces a dual-branch motion disentanglement approach using a frozen base model as an appearance prior and a motion disentanglement loss to separate motion from the reference video's appearance. |
Successfully integrates dynamic motions from reference videos into generated videos with different appearances based on text prompts.
Outperforms state-of-the-art methods in qualitative and quantitative evaluations, demonstrating superior motion fidelity and appearance diversity.
Receives higher user preference scores in a user study, particularly for motion accuracy and visual quality. |
Limited ability to maintain coherence for complex actions spanning many frames.
Struggles to capture detailed dynamics in group actions due to inherent motion complexity and limitations in current text-to-video models. |
text-to-video generation, motion customization, video editing, diffusion models, motion disentanglement |
2312.05284
Report |
SlimSAM: 0.1% Data Makes Segment Anything Slim |
Zigeng Chen, Gongfan Fang, Xinyin Ma, Xinchao Wang |
Current approaches for compressing the Segment Anything Model (SAM) yield
commendable results, yet necessitate extensive data to train a new network from
scratch. Employing conventional pruning techniques can remarkably reduce data
requirements but would suffer from a degradation in performance. To address
this challenging trade-off, we introduce SlimSAM, a novel data-efficient SAM
compression method that achieves superior performance with far less
training data. The essence of SlimSAM is encapsulated in the alternate slimming
framework which effectively enhances knowledge inheritance under severely
limited training data availability and exceptional pruning ratios. Diverging
from prior techniques, our framework progressively compresses the model by
alternately pruning and distilling distinct, decoupled sub-structures.
Disturbed Taylor pruning is also proposed to address the misalignment between
the pruning objective and training target, thereby boosting the
post-distillation after pruning. SlimSAM yields significant performance
improvements while demanding over 10 times less training data than any other
existing compression method. Even when compared to the original SAM, SlimSAM
approaches its performance while reducing parameter counts to merely 1.4%
(9.1M), MACs to 0.8% (23G), and requiring only 0.1% (10k) of the SAM training
data. The code is available at http://github.com/czg1225/SlimSAM. |
Introduces SlimSAM, a data-efficient compression method for the Segment Anything Model (SAM) that achieves high performance with minimal training data by reusing pre-trained weights and employing a novel, modernized pruning-distillation procedure. |
SAM's large size and computational demands make it unsuitable for resource-constrained devices, hindering its wider application. Existing compression methods require extensive data to train from scratch or suffer performance degradation with conventional pruning. |
SlimSAM leverages an alternate slimming framework, alternately pruning and distilling decoupled embedding and bottleneck sub-structures. It also introduces disturbed Taylor pruning, a label-free importance estimation method aligning pruning objectives with distillation targets. |
SlimSAM approaches the performance of the original SAM-H with 1.4% of the parameters and 0.8% of the MACs, using only 0.1% of the training data.
It outperforms other SAM compression techniques in terms of performance, efficiency, and training data requirements.
The method consistently surpasses other structural pruning methods, especially at high pruning ratios. |
While mitigating the need for large training datasets, more data can further enhance performance, particularly at higher pruning rates.
The effectiveness of global pruning in bottleneck compression is highly dependent on the chosen importance normalization method. |
segment anything, model compression, data-efficient, model pruning, knowledge distillation |
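As background for the SlimSAM entry above, the sketch below shows plain first-order Taylor importance for structured pruning: each channel is scored by the magnitude of the sum of weight times gradient, and the lowest-scoring channels are removed. It does not implement the disturbed variant, which the entry describes only as label-free and aligned with the distillation target. Function names and the toy numbers are illustrative assumptions.

```python
def taylor_importance(weights, grads):
    """Score each channel by |sum(w * g)| over its parameters.

    weights[c] and grads[c] hold the parameters of channel c and the
    gradients of some loss with respect to them.
    """
    return [
        abs(sum(w * g for w, g in zip(channel_w, channel_g)))
        for channel_w, channel_g in zip(weights, grads)
    ]


def channels_to_keep(scores, prune_ratio):
    """Return the indices that survive after pruning the lowest scores."""
    n_prune = int(len(scores) * prune_ratio)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    pruned = set(order[:n_prune])
    return [i for i in range(len(scores)) if i not in pruned]
```

In an alternating scheme like the one the entry describes, a scoring and pruning step of this kind would be followed by a distillation step before the next sub-structure is pruned.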
2312.05283
Report |
Nuvo: Neural UV Mapping for Unruly 3D Representations |
Pratul P. Srinivasan, Stephan J. Garbin, Dor Verbin, Jonathan T. Barron, Ben Mildenhall |
Existing UV mapping algorithms are designed to operate on well-behaved
meshes, instead of the geometry representations produced by state-of-the-art 3D
reconstruction and generation techniques. As such, applying these methods to
the volume densities recovered by neural radiance fields and related techniques
(or meshes triangulated from such fields) results in texture atlases that are
too fragmented to be useful for tasks such as view synthesis or appearance
editing. We present a UV mapping method designed to operate on geometry
produced by 3D reconstruction and generation techniques. Instead of computing a
mapping defined on a mesh's vertices, our method Nuvo uses a neural field to
represent a continuous UV mapping, and optimizes it to be a valid and
well-behaved mapping for just the set of visible points, i.e. only points that
affect the scene's appearance. We show that our model is robust to the
challenges posed by ill-behaved geometry, and that it produces editable UV
mappings that can represent detailed appearance. |
This paper presents a novel UV mapping method called Nuvo, specifically designed to handle the complex geometry produced by modern 3D reconstruction and generation techniques like NeRF. |
Existing UV mapping algorithms often generate highly fragmented texture atlases when applied to the non-smooth and intricate geometry generated by these techniques, making them unsuitable for tasks such as view synthesis and appearance editing. |
Nuvo utilizes neural fields to represent a continuous UV mapping. It optimizes this mapping by minimizing a set of losses, encouraging bijectivity, low distortion, and meaningful chart assignment solely for the visible points in the scene. |
Nuvo effectively represents detailed surface appearance, achieving comparable or superior view synthesis results compared to directly optimizing appearance on mesh vertices.
Nuvo generates UV mappings that are competitive with state-of-the-art methods on standard meshes and significantly outperforms all baselines on challenging geometry extracted from NeRF reconstructions.
The generated UV maps are suitable for integration into standard graphics pipelines, as demonstrated by baking the optimized coordinates onto meshes with minimal performance loss. |
While Nuvo's point sampling approach provides flexibility, it poses challenges in guaranteeing global bijectivity and distortion minimization.
The current Nuvo lacks interactive features, limiting user control over aspects like cut placement and region-specific distortion minimization. |
uv mapping, neural radiance fields (nerf), 3d reconstruction, appearance editing, texture atlas |
2312.05251
Report |
Reconstructing Hands in 3D with Transformers |
Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, Jitendra Malik |
We present an approach that can reconstruct hands in 3D from monocular input.
Our approach for Hand Mesh Recovery, HaMeR, follows a fully transformer-based
architecture and can analyze hands with significantly increased accuracy and
robustness compared to previous work. The key to HaMeR's success lies in
scaling up both the data used for training and the capacity of the deep network
for hand reconstruction. For training data, we combine multiple datasets that
contain 2D or 3D hand annotations. For the deep model, we use a large scale
Vision Transformer architecture. Our final model consistently outperforms the
previous baselines on popular 3D hand pose benchmarks. To further evaluate the
effect of our design in non-controlled settings, we annotate existing
in-the-wild datasets with 2D hand keypoint annotations. On this newly collected
dataset of annotations, HInt, we demonstrate significant improvements over
existing baselines. We make our code, data and models available on the project
website: https://geopavlakos.github.io/hamer/. |
The paper proposes HaMeR, a fully transformer-based approach for 3D hand mesh recovery from monocular images or videos, achieving improved accuracy and robustness by leveraging a large vision transformer model and extensive training data. |
Accurately reconstructing 3D hand meshes from monocular input is crucial for various applications like robotics, action recognition, and sign language understanding. This paper addresses the need for more robust and accurate hand mesh recovery models, particularly in challenging in-the-wild scenarios. |
HaMeR utilizes a vision transformer (ViT) architecture pre-trained on large-scale image data and fine-tuned on a combination of existing datasets with 2D or 3D hand annotations, resulting in 2.7M training examples. The model regresses MANO hand model parameters and camera parameters, supervised by 2D and 3D losses, along with adversarial losses to promote natural hand poses. |
HaMeR achieves state-of-the-art results on standard 3D hand pose benchmarks (FreiHAND and HO3Dv2), outperforming previous methods in most metrics.
Evaluation on the newly introduced 'Hand Interactions in the wild' (HInt) dataset, comprising challenging in-the-wild images annotated with 2D keypoints and occlusion labels, shows significant improvements over baselines (2-3x better PCK@0.05).
Ablation studies confirm the importance of both large-scale training data and the high-capacity ViT architecture for HaMeR's performance. |
Limited evaluation on temporal aspects of hand motion, as HaMeR is a single-frame approach.
Future work includes extending the approach to handle hand-object interaction more explicitly and exploring the use of temporal information for video-based reconstruction. |
3d hand mesh reconstruction, monocular vision, vision transformer, hand pose estimation, in-the-wild datasets |
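The PCK@0.05 figure quoted in the HaMeR entry above is a standard 2D keypoint metric: the fraction of predicted keypoints that fall within a threshold distance of the ground truth. The sketch below assumes the threshold is 0.05 times the hand bounding-box size; that normalization and the function name are assumptions, and the exact protocol used for HInt may differ.

```python
import math


def pck(pred, gt, box_size, threshold=0.05):
    """Percentage of Correct Keypoints as a fraction in [0, 1].

    pred, gt:  lists of (x, y) keypoints in pixels, in matching order.
    box_size:  reference length in pixels (assumed: hand box size).
    A keypoint counts as correct if its error is at most
    threshold * box_size.
    """
    limit = threshold * box_size
    hits = sum(1 for p, g in zip(pred, gt) if math.dist(p, g) <= limit)
    return hits / len(gt)
```

For a 100-pixel box, the 0.05 threshold allows at most 5 pixels of error per keypoint, which is why the metric is sensitive to fine localization.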
2312.05210
Report |
IntrinsicAvatar: Physically Based Inverse Rendering of Dynamic Humans from Monocular Videos via Explicit Ray Tracing |
Shaofei Wang, Božidar Antić, Andreas Geiger, Siyu Tang |
We present IntrinsicAvatar, a novel approach to recovering the intrinsic
properties of clothed human avatars including geometry, albedo, material, and
environment lighting from only monocular videos. Recent advancements in
human-based neural rendering have enabled high-quality geometry and appearance
reconstruction of clothed humans from just monocular videos. However, these
methods bake intrinsic properties such as albedo, material, and environment
lighting into a single entangled neural representation. On the other hand, only
a handful of works tackle the problem of estimating geometry and disentangled
appearance properties of clothed humans from monocular videos. They usually
achieve limited quality and disentanglement due to approximations of secondary
shading effects via learned MLPs. In this work, we propose to model secondary
shading effects explicitly via Monte-Carlo ray tracing. We model the rendering
process of clothed humans as a volumetric scattering process, and combine ray
tracing with body articulation. Our approach can recover high-quality geometry,
albedo, material, and lighting properties of clothed humans from a single
monocular video, without requiring supervised pre-training using ground truth
materials. Furthermore, since we explicitly model the volumetric scattering
process and ray tracing, our model naturally generalizes to novel poses,
enabling animation of the reconstructed avatar in novel lighting conditions. |
Presents IntrinsicAvatar, a novel approach to recover intrinsic properties of clothed human avatars (geometry, albedo, material, environment lighting) from monocular videos using volumetric scattering and Monte-Carlo ray tracing. |
Existing methods for reconstructing clothed humans from monocular videos entangle intrinsic properties in a single neural representation, limiting editing capabilities and relighting under novel conditions. This work aims to disentangle these properties. |
The method models clothed humans as articulated neural radiance fields, using iNGP with SDF for geometry and separate MLPs for radiance, albedo, and material. It employs volumetric scattering with Monte-Carlo ray tracing in canonical space for physically based inverse rendering, enabling relighting for unseen poses. |
Achieves high-quality reconstruction of clothed human avatars with disentangled intrinsic properties from monocular videos.
Significantly outperforms the state-of-the-art method (Relighting 4D) both qualitatively and quantitatively.
Enables realistic rendering of the learned avatars under novel lighting conditions and poses. |
Does not consider pose-dependent non-rigid motion, limiting applicability to more dynamic scenarios.
Relatively slow inference time (around 20 seconds per image). |
inverse rendering, neural radiance fields, human reconstruction, volumetric scattering, monte-carlo ray tracing |
2312.05208
Report |
ControlRoom3D: Room Generation using Semantic Proxy Rooms |
Jonas Schult, Sam Tsai, Lukas Höllein, Bichen Wu, Jialiang Wang, Chih-Yao Ma, Kunpeng Li, Xiaofang Wang, Felix Wimbauer, Zijian He, Peizhao Zhang, Bastian Leibe, Peter Vajda, Ji Hou |
Manually creating 3D environments for AR/VR applications is a complex process
requiring expert knowledge in 3D modeling software. Pioneering works facilitate
this process by generating room meshes conditioned on textual style
descriptions. Yet, many of these automatically generated 3D meshes do not
adhere to typical room layouts, compromising their plausibility, e.g., by
placing several beds in one bedroom. To address these challenges, we present
ControlRoom3D, a novel method to generate high-quality room meshes. Central to
our approach is a user-defined 3D semantic proxy room that outlines a rough
room layout based on semantic bounding boxes and a textual description of the
overall room style. Our key insight is that when rendered to 2D, this 3D
representation provides valuable geometric and semantic information to control
powerful 2D models to generate 3D consistent textures and geometry that aligns
well with the proxy room. Backed up by an extensive study including
quantitative metrics and qualitative user evaluations, our method generates
diverse and globally plausible 3D room meshes, thus empowering users to design
3D rooms effortlessly without specialized knowledge. |
ControlRoom3D is a novel method that generates diverse and globally plausible 3D room meshes from user-defined 3D semantic layouts and text prompts describing the desired room style. |
Manually creating 3D environments for AR/VR is difficult and requires expert knowledge. Existing methods often result in implausible layouts. ControlRoom3D addresses these limitations by leveraging user-defined layouts to guide the generation process, leading to higher quality and more plausible 3D rooms. |
ControlRoom3D utilizes a 3D semantic proxy room defined by bounding boxes and text prompts. It employs: (1) Guided Panorama Generation for style consistency, (2) Geometry Alignment to match predicted depth with the proxy room, (3) Mesh Cleaning to remove low-quality regions, and (4) Mesh Completion to fill in missing areas with new content. |
Significantly outperforms baseline methods in generating plausible and user-preferred 3D room layouts.
Fine-tuning adapters on a dataset with rendered 3D bounding boxes significantly improves object generation within defined layouts.
The Geometry Alignment module is crucial for ensuring generated objects align correctly with the proxy room. |
Limited to generating indoor room-scale environments.
Relies on a pre-defined set of semantic classes for the proxy room. |
3d scene generation, text-to-3d, semantic scene understanding, layout-aware generation, generative ai |
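The Geometry Alignment step in the ControlRoom3D entry above matches predicted depth to the proxy room. One common way to align a monocular depth prediction to reference depth is a closed-form least-squares fit of a scale and a shift, sketched below. This illustrates that general technique under the assumption that it applies here; it is not the paper's confirmed procedure, and `fit_scale_shift` is a hypothetical name.

```python
def fit_scale_shift(pred_depth, proxy_depth):
    """Least-squares scale and shift mapping pred_depth to proxy_depth.

    Minimizes sum((scale * p + shift - t) ** 2) over paired samples.
    pred_depth must not be constant, otherwise the scale is undefined.
    """
    n = len(pred_depth)
    mean_p = sum(pred_depth) / n
    mean_t = sum(proxy_depth) / n
    var_p = sum((p - mean_p) ** 2 for p in pred_depth)
    cov = sum(
        (p - mean_p) * (t - mean_t) for p, t in zip(pred_depth, proxy_depth)
    )
    scale = cov / var_p
    shift = mean_t - scale * mean_p
    return scale, shift


def align_depth(pred_depth, scale, shift):
    """Apply the fitted scale and shift to every predicted depth value."""
    return [scale * p + shift for p in pred_depth]
```

In practice the fit would be computed only over pixels where the rendered proxy room provides depth, then applied to the whole prediction.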
2312.05133
Report |
GIR: 3D Gaussian Inverse Rendering for Relightable Scene Factorization |
Yahao Shi, Yanmin Wu, Chenming Wu, Xing Liu, Chen Zhao, Haocheng Feng, Jingtuo Liu, Liangjun Zhang, Jian Zhang, Bin Zhou, Errui Ding, Jingdong Wang |
This paper presents GIR, a 3D Gaussian Inverse Rendering method for
relightable scene factorization. Compared to existing methods leveraging
discrete meshes or neural implicit fields for inverse rendering, our method
utilizes 3D Gaussians to estimate the material properties, illumination, and
geometry of an object from multi-view images. Our study is motivated by the
evidence showing that 3D Gaussian is a more promising backbone than neural
fields in terms of performance, versatility, and efficiency. In this paper, we
aim to answer the question: "How can 3D Gaussian be applied to improve the
performance of inverse rendering?" To address the complexity of estimating
normals based on discrete and often inhomogeneously distributed 3D Gaussian
representations, we propose an efficient self-regularization method that
facilitates the modeling of surface normals without the need for additional
supervision. To reconstruct indirect illumination, we propose an approach that
simulates ray tracing. Extensive experiments demonstrate our proposed GIR's
superior performance over existing methods across multiple tasks on a variety
of widely used datasets in inverse rendering. This substantiates its efficacy
and broad applicability, highlighting its potential as an influential tool in
relighting and reconstruction. Project page: https://3dgir.github.io |
This paper introduces GIR, a novel inverse rendering framework based on 3D Gaussian Splatting (3DGS) that estimates material properties, geometry, and illumination from multi-view images in high fidelity. |
Inverse rendering is a fundamental computer vision problem with applications in various fields such as scene understanding, image manipulation, AR/VR, etc. The proposed method, GIR, leverages the strengths of 3DGS, a promising alternative to NeRFs, for high-performance inverse rendering. |
The paper proposes a novel inverse rendering framework leveraging 3D Gaussian Splatting. The method introduces a self-regularization method for accurate surface normal estimation and an approximate ray tracing approach for efficient indirect illumination reconstruction. The framework jointly optimizes for geometry, materials, and illumination using a combination of MAE, DSSIM, and smoothing losses. |
GIR achieves high-fidelity reconstruction of normal maps, specular and diffuse components, roughness and metallic properties, indirect illumination, and environmental maps.
The proposed self-regularization method for normal estimation in 3DGS proves effective without needing additional supervision.
Extensive experiments demonstrate GIR's superior performance over existing state-of-the-art methods on various benchmark datasets for relighting and novel view synthesis tasks. |
The indirect illumination reconstruction uses an approximate method, which can be further improved.
Future work can explore extending GIR for dynamic scene modeling and content generation leveraging the versatility of 3DGS. |
inverse rendering, 3d gaussian splatting, relightable scene factorization, normal estimation, indirect illumination |
2312.05107
Report |
DreaMoving: A Human Video Generation Framework based on Diffusion Models |
Mengyang Feng, Jinlin Liu, Kai Yu, Yuan Yao, Zheng Hui, Xiefan Guo, Xianhui Lin, Haolan Xue, Chen Shi, Xiaowen Li, Aojie Li, Xiaoyang Kang, Biwen Lei, Miaomiao Cui, Peiran Ren, Xuansong Xie |
In this paper, we present DreaMoving, a diffusion-based controllable video
generation framework to produce high-quality customized human videos.
Specifically, given target identity and posture sequences, DreaMoving can
generate a video of the target identity moving or dancing anywhere driven by
the posture sequences. To this end, we propose a Video ControlNet for
motion-controlling and a Content Guider for identity preserving. The proposed
model is easy to use and can be adapted to most stylized diffusion models to
generate diverse results. The project page is available at
https://dreamoving.github.io/dreamoving |
Presents DreaMoving, a diffusion-based controllable video generation framework that produces high-quality customized human videos based on target identity and posture sequences. |
Addresses challenges in human-centric video generation, particularly in character dance, where existing text-to-video models struggle with intraframe consistency, length, diversity, personalization, and controllability. |
Utilizes a Video ControlNet for motion control, a Content Guider for identity preservation, and incorporates motion blocks for temporal consistency. Employs a multi-stage training process including long-frame pretraining, Video ControlNet training, and expression fine-tuning. |
Generates high-quality, consistent videos with controlled motion based on input pose or depth sequences.
Enables content control through text prompts for background and image prompts for precise human appearance guidance.
Demonstrates generalization ability by generating videos in the style of unseen stylized images. |
Relies on accurate pose/depth estimation for optimal control.
Further exploration of diverse motion control mechanisms beyond pose and depth. |
video generation, diffusion models, controllable generation, human-centric content, motion control |
2312.05039
Report |
SmartMask: Context Aware High-Fidelity Mask Generation for Fine-grained Object Insertion and Layout Control |
Jaskirat Singh, Jianming Zhang, Qing Liu, Cameron Smith, Zhe Lin, Liang Zheng |
The field of generative image inpainting and object insertion has made
significant progress with the recent advent of latent diffusion models.
Utilizing a precise object mask can greatly enhance these applications.
However, due to the challenges users encounter in creating high-fidelity masks,
there is a tendency for these methods to rely on more coarse masks (e.g.,
bounding box) for these applications. This results in limited control and
compromised background content preservation. To overcome these limitations, we
introduce SmartMask, which allows any novice user to create detailed masks for
precise object insertion. Combined with a ControlNet-Inpaint model, our
experiments demonstrate that SmartMask achieves superior object insertion
quality, preserving the background content more effectively than previous
methods. Notably, unlike prior works, the proposed approach can also be used
even without user-mask guidance, which allows it to perform mask-free object
insertion at diverse positions and scales. Furthermore, we find that when used
iteratively with a novel instruction-tuning based planning model, SmartMask can
be used to design detailed layouts from scratch. As compared with user-scribble
based layout design, we observe that SmartMask allows for better quality
outputs with layout-to-image generation methods. Project page is available at
https://smartmask-gen.github.io |
Introduces SmartMask, a context-aware diffusion model for generating fine-grained object masks for precise insertion and layout control. |
Addresses limitations of coarse-mask inpainting methods which often modify background content and offer limited control over object placement and scale. |
Leverages semantic amodal segmentation data to train a diffusion model for predicting object masks, enabling mask-free or user-guided (bounding box, scribbles) object insertion. Employs an instruction-tuning based planning model for iterative layout design. |
Achieves better background preservation compared to state-of-the-art inpainting methods.
Allows for mask-free object insertion at diverse positions and scales.
Facilitates fine-grained layout design from scratch, enabling higher quality layout-to-image generation. |
Reliance on semantic layouts for mask prediction may limit depth context.
Training data size for SmartMask is smaller compared to typical inpainting models, potentially limiting generalizability to out-of-distribution objects. |
image inpainting, object insertion, layout generation, diffusion models, semantic segmentation |
2312.05038
Report |
Prompt-In-Prompt Learning for Universal Image Restoration |
Zilong Li, Yiming Lei, Chenglong Ma, Junping Zhang, Hongming Shan |
Image restoration, which aims to retrieve and enhance degraded images, is
fundamental across a wide range of applications. While conventional deep
learning approaches have notably improved the image quality across various
tasks, they still suffer from (i) the high storage cost needed for various
task-specific models and (ii) the lack of interactivity and flexibility,
hindering their wider application. Drawing inspiration from the pronounced
success of prompts in both linguistic and visual domains, we propose novel
Prompt-In-Prompt learning for universal image restoration, named PIP. First, we
present two novel prompts, a degradation-aware prompt to encode high-level
degradation knowledge and a basic restoration prompt to provide essential
low-level information. Second, we devise a novel prompt-to-prompt interaction
module to fuse these two prompts into a universal restoration prompt. Third, we
introduce a selective prompt-to-feature interaction module to modulate the
degradation-related feature. By doing so, the resultant PIP works as a
plug-and-play module to enhance existing restoration models for universal image
restoration. Extensive experimental results demonstrate the superior
performance of PIP on multiple restoration tasks, including image denoising,
deraining, dehazing, deblurring, and low-light enhancement. Remarkably, PIP is
interpretable, flexible, efficient, and easy-to-use, showing promising
potential for real-world applications. The code is available at
https://github.com/longzilicart/pip_universal. |
Proposes Prompt-in-Prompt (PIP) learning, a novel plug-and-play module that enhances existing image restoration backbones for universal image restoration by integrating high-level and low-level degradation knowledge via prompts. |
Addresses the limitations of conventional deep learning approaches for image restoration, such as high storage cost for task-specific models and lack of interactivity, by enabling a single model to handle multiple degradation types effectively. |
Learns two types of prompts: degradation-aware prompts (high-level) and basic restoration prompts (low-level). These prompts are fused through a prompt-to-prompt interaction module. A selective prompt-to-feature interaction module then modulates degradation-related features based on the fused prompts. |
Outperforms state-of-the-art universal image restoration methods on benchmark datasets across denoising, deraining, dehazing, deblurring, and low-light enhancement.
Demonstrates the effectiveness of decoupled degradation-aware prompts in improving restoration performance.
Shows efficiency by achieving significant performance gains with only a slight increase in parameters and FLOPs compared to the baseline models. |
Slight computational overhead in training and inference compared to baseline models.
Limited improvement in model generalization to unknown degradation types. |
image restoration, prompt learning, universal models, deep learning, computer vision |
2312.04966
Report |
Customizing Motion in Text-to-Video Diffusion Models |
Joanna Materzynska, Josef Sivic, Eli Shechtman, Antonio Torralba, Richard Zhang, Bryan Russell |
We introduce an approach for augmenting text-to-video generation models with
customized motions, extending their capabilities beyond the motions depicted in
the original training data. By leveraging a few video samples demonstrating
specific movements as input, our method learns and generalizes the input motion
patterns for diverse, text-specified scenarios. Our contributions are
threefold. First, to achieve our results, we finetune an existing text-to-video
model to learn a novel mapping from the depicted motion in the input
examples to a new unique token. To avoid overfitting to the new custom motion,
we introduce an approach for regularization over videos. Second, by leveraging
the motion priors in a pretrained model, our method can produce novel videos
featuring multiple people doing the custom motion, and can invoke the motion in
combination with other motions. Furthermore, our approach extends to the
multimodal customization of motion and appearance of individualized subjects,
enabling the generation of videos featuring unique characters and distinct
motions. Third, to validate our method, we introduce an approach for
quantitatively evaluating the learned custom motion and perform a systematic
ablation study. We show that our method significantly outperforms prior
appearance-based customization approaches when extended to the motion
customization task. |
This paper presents a method for customizing text-to-video diffusion models by incorporating new motions from a small set of exemplar videos. |
Current text-to-video models are limited to motions present in their training data. This work enables these models to generate videos with user-defined motions, broadening their applicability. |
The method involves fine-tuning a pre-trained text-to-video model's spatial and temporal layers, using a novel video regularization technique and a sampling strategy that emphasizes motion patterns. This allows associating a unique text token with the new motion. |
The approach successfully customizes models with diverse motions like dancing, gestures, and camera movements.
Quantitative evaluation shows significant improvement in motion accuracy compared to adapting image customization methods.
The method generalizes well, enabling the generation of customized motions with new subjects, multiple people, varying timings, and in combination with other motions. |
The model occasionally overfits to the appearance of training videos, leading to memorization.
The reliance on a pre-trained action recognition model for evaluation limits the scope to existing gesture datasets. Future work could explore alternative evaluation metrics. |
text-to-video generation, motion customization, diffusion models, video regularization, motion accuracy |
2312.04965
Report |
Inversion-Free Image Editing with Natural Language |
Sihan Xu, Yidong Huang, Jiayi Pan, Ziqiao Ma, Joyce Chai |
Despite recent advances in inversion-based editing, text-guided image
manipulation remains challenging for diffusion models. The primary bottlenecks
include 1) the time-consuming nature of the inversion process; 2) the struggle
to balance consistency with accuracy; 3) the lack of compatibility with
efficient consistency sampling methods used in consistency models. To address
the above issues, we start by asking ourselves if the inversion process can be
eliminated for editing. We show that when the initial sample is known, a
special variance schedule reduces the denoising step to the same form as the
multi-step consistency sampling. We name this Denoising Diffusion Consistent
Model (DDCM), and note that it implies a virtual inversion strategy without
explicit inversion in sampling. We further unify the attention control
mechanisms in a tuning-free framework for text-guided editing. Combining them,
we present inversion-free editing (InfEdit), which allows for consistent and
faithful editing for both rigid and non-rigid semantic changes, catering to
intricate modifications without compromising on the image's integrity and
explicit inversion. Through extensive experiments, InfEdit shows strong
performance in various editing tasks and also maintains a seamless workflow
(less than 3 seconds on one single A40), demonstrating the potential for
real-time applications. Project Page: https://sled-group.github.io/InfEdit/ |
This paper proposes InfEdit, an inversion-free editing framework for consistent and faithful text-guided image manipulation in diffusion models. |
Text-guided image editing in diffusion models is challenging due to the limitations of inversion-based methods, including lengthy processes, trade-offs between consistency and accuracy, and incompatibility with efficient consistency sampling. |
The authors introduce the Denoising Diffusion Consistent Model (DDCM) that eliminates explicit inversion. They further propose Unified Attention Control (UAC), combining cross-attention and mutual self-attention for both rigid and non-rigid editing. |
InfEdit achieves competitive or superior performance to inversion-based methods while being significantly more efficient.
Unified Attention Control (UAC) further improves InfEdit's performance in editing quality, consistency, and efficiency.
InfEdit demonstrates compatibility with Latent Consistency Models (LCMs) for even faster and higher-quality image editing. |
The paper acknowledges potential ethical concerns regarding copyright infringement and deceptive misuse.
Future work could explore mitigating inherent biases in pre-trained models used by InfEdit. |
image editing, diffusion models, attention mechanisms, consistency models, text-guided image manipulation |
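A worked sketch of the DDCM observation described in the abstract above, written in standard DDIM notation. The symbols $\alpha_t$ (cumulative signal level), $\sigma_t$ (free variance parameter), $\epsilon_\theta$ (noise predictor), and $\hat{z}_0$ (predicted clean latent) are assumptions made for this illustration and are not taken from the summary; the paper's own notation and derivation may differ.

```latex
% General DDIM update with a free variance parameter \sigma_t
z_{t-1} = \sqrt{\alpha_{t-1}}\,\hat{z}_0
        + \sqrt{1-\alpha_{t-1}-\sigma_t^2}\;\epsilon_\theta(z_t, t)
        + \sigma_t\,\epsilon_t,
\qquad
\hat{z}_0 = \frac{z_t - \sqrt{1-\alpha_t}\,\epsilon_\theta(z_t, t)}{\sqrt{\alpha_t}}

% Choosing \sigma_t = \sqrt{1-\alpha_{t-1}} makes the middle term vanish
z_{t-1} = \sqrt{\alpha_{t-1}}\,\hat{z}_0 + \sqrt{1-\alpha_{t-1}}\;\epsilon_t,
\qquad \epsilon_t \sim \mathcal{N}(0, I)
```

With this variance choice each step predicts a clean latent and re-noises it with fresh Gaussian noise, which has the same shape as multi-step consistency sampling. This is plausibly the sense in which a known initial sample removes the need for an explicit inversion pass, though the precise construction should be checked against the paper.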
2312.04963
Report |
Text-to-3D Generation with Bidirectional Diffusion using both 2D and 3D priors |
Lihe Ding, Shaocong Dong, Zhanpeng Huang, Zibin Wang, Yiyuan Zhang, Kaixiong Gong, Dan Xu, Tianfan Xue |
Most 3D generation research focuses on up-projecting 2D foundation models
into the 3D space, either by minimizing 2D Score Distillation Sampling (SDS)
loss or fine-tuning on multi-view datasets. Without explicit 3D priors, these
methods often lead to geometric anomalies and multi-view inconsistency.
Recently, researchers have attempted to improve the genuineness of 3D objects
by directly training on 3D datasets, albeit at the cost of low-quality texture
generation due to the limited texture diversity in 3D datasets. To harness the
advantages of both approaches, we propose Bidirectional Diffusion (BiDiff), a
unified framework that incorporates both a 3D and a 2D diffusion process, to
preserve both 3D fidelity and 2D texture richness, respectively. Moreover, as a
simple combination may yield inconsistent generation results, we further bridge
them with novel bidirectional guidance. In addition, our method can be used as
an initialization of optimization-based models to further improve the quality
of the 3D model and the efficiency of optimization, reducing the generation process
from 3.4 hours to 20 minutes. Experimental results have shown that our model
achieves high-quality, diverse, and scalable 3D generation. Project website:
https://bidiff.github.io/. |
This paper proposes BiDiff, a novel bidirectional diffusion model for high-quality text-to-3D generation. It integrates pretrained 2D and 3D diffusion models within a unified framework with bidirectional guidance for joint 2D-3D feature learning. |
Existing text-to-3D methods struggle to achieve both high-quality texture (often present in 2D-based methods) and 3D consistency (often present in 3D-based methods). This paper aims to bridge this gap and enable efficient and controllable 3D generation. |
BiDiff utilizes a hybrid representation (SDF for 3D and multi-view images for 2D) with mutually transformable capabilities. It employs a 3D diffusion model (guided by denoised 2D images) and a 2D multi-view diffusion model (guided by rendered 3D images). Additionally, it uses outputs as initialization for optimization-based methods to further enhance quality. |
Achieves high-quality, diverse, and scalable 3D generation with separate control over geometry and texture.
Generates more diverse and text-aligned 3D objects compared to pure optimization methods, while being significantly faster (40 seconds vs. hours).
Acts as a strong initialization for optimization-based methods, improving both speed and quality, and reducing geometric errors. |
The resolution of the generated 3D model during the diffusion process is limited.
The controllability of fine-grained geometry is limited and requires further exploration. |
text-to-3d generation, diffusion models, bidirectional guidance, 3d consistency, texture control |
2312.04884
Report |
UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models |
Yiming Zhao, Zhouhui Lian |
Text-to-Image (T2I) generation methods based on diffusion models have garnered
significant attention in the last few years. Although these image synthesis
methods produce visually appealing results, they frequently exhibit spelling
errors when rendering text within the generated images. Such errors manifest as
missing, incorrect or extraneous characters, thereby severely constraining the
performance of text image generation based on diffusion models. To address the
aforementioned issue, this paper proposes a novel approach for text image
generation, utilizing a pre-trained diffusion model (i.e., Stable Diffusion
[27]). Our approach involves the design and training of a light-weight
character-level text encoder, which replaces the original CLIP encoder and
provides more robust text embeddings as conditional guidance. Then, we
fine-tune the diffusion model using a large-scale dataset, incorporating local
attention control under the supervision of character-level segmentation maps.
Finally, by employing an inference stage refinement process, we achieve a
notably high sequence accuracy when synthesizing text in arbitrarily given
images. Both qualitative and quantitative results demonstrate the superiority
of our method to the state of the art. Furthermore, we showcase several
potential applications of the proposed UDiffText, including text-centric image
synthesis, scene text editing, etc. Code and model will be available at
https://github.com/ZYM-PKU/UDiffText . |
This paper proposes UDiffText, a novel diffusion model-based method for synthesizing accurate and harmonious text within both synthetic and real-world images, addressing the text rendering challenges (e.g., spelling errors) faced by existing T2I models. |
Existing T2I generation models, while producing visually appealing results, often struggle with rendering accurate text within images, hindering their application in text-centric image synthesis and editing tasks. |
The proposed UDiffText leverages a light-weight character-level text encoder to derive robust text embeddings and fine-tunes a pre-trained diffusion model using a combination of denoising score matching, local attention loss based on character-level segmentation maps, and a scene text recognition loss. It further incorporates a refinement process during inference to enhance text accuracy. |
UDiffText achieves superior text rendering accuracy and visual context coherency compared to previous state-of-the-art methods, as demonstrated by both qualitative and quantitative evaluations.
The use of a character-level text encoder and local attention control significantly improves the model's ability to attend to and accurately render individual characters.
The proposed method exhibits promising potential in various applications, including text-centric image synthesis, scene text editing, and enhancing the text accuracy of existing T2I models. |
The model's reliance on visual context may limit its performance in backgrounds with minimal texture or patterns.
Current implementation effectively handles text sequences up to 12 characters, requiring further development for longer texts (e.g., paragraph generation).
Future work will explore improving controllability and diversity and extending the method to other text-related image synthesis tasks. |
text-to-image generation, diffusion models, scene text editing, character-level text encoder, local attention |
2312.04875
Report |
MVDD: Multi-View Depth Diffusion Models |
Zhen Wang, Qiangeng Xu, Feitong Tan, Menglei Chai, Shichen Liu, Rohit Pandey, Sean Fanello, Achuta Kadambi, Yinda Zhang |
Denoising diffusion models have demonstrated outstanding results in 2D image
generation, yet it remains a challenge to replicate its success in 3D shape
generation. In this paper, we propose leveraging multi-view depth, which
represents complex 3D shapes in a 2D data format that is easy to denoise. We
pair this representation with a diffusion model, MVDD, that is capable of
generating high-quality dense point clouds with 20K+ points with fine-grained
details. To enforce 3D consistency in multi-view depth, we introduce an
epipolar line segment attention that conditions the denoising step for a view
on its neighboring views. Additionally, a depth fusion module is incorporated
into diffusion steps to further ensure the alignment of depth maps. When
augmented with surface reconstruction, MVDD can also produce high-quality 3D
meshes. Furthermore, MVDD stands out in other tasks such as depth completion,
and can serve as a 3D prior, significantly boosting many downstream tasks, such
as GAN inversion. State-of-the-art results from extensive experiments
demonstrate MVDD's excellent ability in 3D shape generation, depth completion,
and its potential as a 3D prior for downstream tasks. |
This paper presents MVDD, a novel diffusion model for 3D shape generation that utilizes a multi-view depth representation. |
This approach addresses the limitations of existing 3D shape generation methods that struggle with scalability, fine-grained detail, and versatility by leveraging the strengths of diffusion models and the multi-view depth representation. |
MVDD enforces cross-view consistency using a novel epipolar "line segment" attention mechanism and a depth fusion module during the denoising process. This allows for the generation of high-resolution, consistent depth maps that can be fused into dense point clouds and further reconstructed into high-quality meshes. |
MVDD achieves state-of-the-art results on standard 3D shape generation benchmarks, outperforming existing methods in both quality and diversity of generated shapes.
MVDD effectively performs depth completion, demonstrating its ability to leverage learned 3D information to complete missing data.
MVDD serves as an effective 3D prior for downstream tasks like 3D GAN inversion, improving reconstruction quality and preventing geometric collapse. |
The current implementation of MVDD assumes a fixed number of views for the multi-view depth representation.
Exploring alternative depth fusion techniques and their integration into the diffusion process could further enhance the model's performance. |
3d shape generation, denoising diffusion models, multi-view depth, epipolar geometry, depth fusion |
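The MVDD entry above describes fusing multi-view depth maps into dense point clouds. A minimal sketch of that last step, assuming a pinhole camera model, is given below. The function names, the intrinsics tuple `(fx, fy, cx, cy)`, and the 4x4 `cam_to_world` pose convention are hypothetical choices for illustration; this is plain back-projection and not the paper's depth fusion module.

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy, cam_to_world=None):
    """Lift a depth map (H, W) to 3D points using an assumed pinhole model."""
    h, w = depth.shape
    # v indexes rows (image y), u indexes columns (image x)
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    valid = depth > 0  # treat non-positive depth as missing
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    pts = np.stack([x, y, z], axis=1)
    if cam_to_world is not None:
        # Apply rotation and translation of the assumed 4x4 camera-to-world pose
        pts = pts @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]
    return pts

def fuse_views(depths, intrinsics, poses):
    """Concatenate back-projected points from several views into one cloud."""
    clouds = [
        backproject_depth(d, *k, cam_to_world=p)
        for d, k, p in zip(depths, intrinsics, poses)
    ]
    return np.concatenate(clouds, axis=0)
```

Calling `fuse_views` with a list of depth maps, one intrinsics tuple per view, and one pose per view returns a single `(N, 3)` array; pixels with missing depth are simply skipped.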
2312.04820
Report |
Learn to Optimize Denoising Scores for 3D Generation: A Unified and Improved Diffusion Prior on NeRF and 3D Gaussian Splatting |
Xiaofeng Yang, Yiwen Chen, Cheng Chen, Chi Zhang, Yi Xu, Xulei Yang, Fayao Liu, Guosheng Lin |
We propose a unified framework aimed at enhancing the diffusion priors for 3D
generation tasks. Despite the critical importance of these tasks, existing
methodologies often struggle to generate high-caliber results. We begin by
examining the inherent limitations in previous diffusion priors. We identify a
divergence between the diffusion priors and the training procedures of
diffusion models that substantially impairs the quality of 3D generation. To
address this issue, we propose a novel, unified framework that iteratively
optimizes both the 3D model and the diffusion prior. Leveraging the different
learnable parameters of the diffusion prior, our approach offers multiple
configurations, affording various trade-offs between performance and
implementation complexity. Notably, our experimental results demonstrate that
our method markedly surpasses existing techniques, establishing new
state-of-the-art in the realm of text-to-3D generation. Furthermore, our
approach exhibits impressive performance on both NeRF and the newly introduced
3D Gaussian Splatting backbones. Additionally, our framework yields insightful
contributions to the understanding of recent score distillation methods, such
as the VSD and DDS loss. |
This paper introduces LODS, a unified framework to improve diffusion priors for 3D generation by iteratively optimizing the 3D model and diffusion prior, aligning them closer to the original diffusion model's score. |
Existing diffusion priors for 3D generation struggle to produce high-quality results due to a divergence between the priors and diffusion model training procedures. |
LODS extends the SDS loss with learnable parameters (null embedding or low-rank model parameters) and iteratively optimizes the 3D model and these parameters, bridging the gap between training and inference of diffusion models. |
LODS significantly improves 3D generation quality over previous methods, achieving state-of-the-art performance on the T3Bench benchmark.
LODS successfully mitigates issues like over-saturation and 'floating objects' observed in previous methods.
The method demonstrates strong performance on both NeRF and 3D Gaussian Splatting backbones, offering flexibility and efficiency. |
The current focus is primarily on improving texture details, with limitations in enhancing the geometry of generated 3D models.
Future work can explore integrating LODS with geometry-aware diffusion models to further improve geometric quality. |
3d generation, diffusion models, diffusion priors, text-to-3d, nerf |
2312.04806
Report |
RL Dreams: Policy Gradient Optimization for Score Distillation based 3D Generation |
Aradhya N. Mathur, Phu Pham, Aniket Bera, Ojaswa Sharma |
3D generation has rapidly accelerated in the past decade owing to the
progress in the field of generative modeling. Score Distillation Sampling (SDS)
based rendering has improved 3D asset generation to a great extent. Further,
the recent work of Denoising Diffusion Policy Optimization (DDPO) demonstrates
that the diffusion process is compatible with policy gradient methods and has
been demonstrated to improve the 2D diffusion models using an aesthetic scoring
function. We first show that this aesthetic scorer acts as a strong guide for a
variety of SDS-based methods and demonstrate its effectiveness in text-to-3D
synthesis. Further, we leverage the DDPO approach to improve the quality of the
3D rendering obtained from 2D diffusion models. Our approach, DDPO3D, employs
the policy gradient method in tandem with aesthetic scoring. To the best of our
knowledge, this is the first method that extends policy gradient methods to 3D
score-based rendering and shows improvement across SDS-based methods such as
DreamGaussian, which are currently driving research in text-to-3D synthesis.
Our approach is compatible with score distillation-based methods, which would
facilitate the integration of diverse reward functions into the generative
process. Our project page can be accessed via https://ddpo3d.github.io. |
The paper introduces DDPO3D, a novel framework that integrates Denoising Diffusion Policy Optimization (DDPO) with score distillation sampling (SDS) methods for improved 3D generation, enhancing visual quality and allowing for non-differentiable reward functions. |
Existing text-to-3D generation methods often struggle with visual fidelity and lack the flexibility to incorporate diverse reward signals beyond traditional losses. This work addresses these limitations, pushing the boundaries of high-quality, reward-driven 3D generation. |
DDPO3D treats the 3D generation process as a Markov Decision Process (MDP) within the SDS framework. By leveraging a pre-trained 2D diffusion model and an aesthetic scoring function, it guides the optimization of 3D representations (NeRFs or Gaussian Splats) through policy gradients, maximizing both image quality and adherence to reward functions. |
DDPO3D demonstrably improves the quality of 3D objects generated from text prompts, evidenced by higher CLIP scores and visual fidelity improvements.
Integrating an aesthetic scoring function with SDS methods significantly enhances the visual quality and details of the generated 3D assets.
The framework proves compatible with various SDS-based 3D generation techniques, including DreamGaussian, DreamFusion, and GSGen, highlighting its adaptability. |
The inclusion of policy gradients introduces a trade-off between generation quality and runtime, demanding further exploration for optimization.
The paper primarily relies on CLIP and aesthetic scores for evaluation due to the lack of standardized metrics for text-to-3D generation, suggesting a need for better evaluation strategies in the field. |
3d generation, text-to-3d, diffusion models, score distillation sampling, reinforcement learning |
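The DDPO3D entry above notes that policy gradients allow non-differentiable reward functions. The toy sketch below illustrates only that general idea with a score-function (REINFORCE) estimator on a one-dimensional Gaussian policy. The step reward, the hyperparameters, and the function names are all hypothetical; this is not the DDPO3D objective and involves no diffusion model or 3D representation.

```python
import numpy as np

def step_reward(actions, threshold=1.0):
    """A non-differentiable reward: 1 if the action exceeds the threshold, else 0."""
    return (actions > threshold).astype(float)

def reinforce(theta=0.0, sigma=1.0, lr=0.5, steps=200, batch=256, seed=0):
    """Move the mean of a Gaussian policy toward high reward using REINFORCE."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        actions = theta + sigma * rng.standard_normal(batch)
        rewards = step_reward(actions)
        # Subtract the batch mean as a simple variance-reducing baseline
        advantages = rewards - rewards.mean()
        # Gradient of log N(a; theta, sigma^2) with respect to theta
        grad_logp = (actions - theta) / sigma**2
        theta = theta + lr * np.mean(advantages * grad_logp)
    return theta
```

The reward is never differentiated; only the log-probability of the sampled actions is. That is the property that lets an arbitrary scorer (for example an aesthetic predictor) act as the training signal.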
2312.04655
Report |
ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations |
Maitreya Patel, Changhoon Kim, Sheng Cheng, Chitta Baral, Yezhou Yang |
Text-to-image (T2I) diffusion models, notably the unCLIP models (e.g.,
DALL-E-2), achieve state-of-the-art (SOTA) performance on various compositional
T2I benchmarks, at the cost of significant computational resources. The unCLIP
stack comprises a T2I prior and a diffusion image decoder. The T2I prior model
alone adds a billion parameters compared to the Latent Diffusion Models, which
increases the computational and high-quality data requirements. We introduce
ECLIPSE, a novel contrastive learning method that is both parameter and
data-efficient. ECLIPSE leverages pre-trained vision-language models (e.g.,
CLIP) to distill the knowledge into the prior model. We demonstrate that the
ECLIPSE trained prior, with only 3.3% of the parameters and trained on a mere
2.8% of the data, surpasses the baseline T2I priors with an average of 71.6%
preference score under resource-limited setting. It also attains performance on
par with SOTA big models, achieving an average of 63.36% preference score in
terms of the ability to follow the text compositions. Extensive experiments on
two unCLIP diffusion image decoders, Karlo and Kandinsky, affirm that ECLIPSE
priors consistently deliver high performance while significantly reducing
resource dependency. |
This work introduces ECLIPSE, a novel contrastive learning method for training text-to-image priors in unCLIP models that is both parameter and data-efficient. |
Existing text-to-image (T2I) diffusion models, particularly unCLIP models, while achieving state-of-the-art performance, demand significant computational resources due to their large prior models and extensive training data requirements. |
ECLIPSE leverages pre-trained vision-language models like CLIP to distill knowledge into compact non-diffusion prior models using a contrastive learning objective function. |
ECLIPSE priors, with only 3.3% of parameters, outperform baseline priors and achieve comparable performance to SOTA models using only 2.8% of the training data.
Empirical analysis shows that traditional diffusion priors, while benefiting from larger datasets, are resource-intensive and negatively impacted by increased prior steps and noise injection.
The choice of pre-trained vision-language model significantly influences performance, with better models leading to superior results. |
The aesthetic quality of generated images can be further improved, potentially through refined training data selection.
Future work could explore integrating ECLIPSE with existing knowledge distillation and model compression techniques for enhanced efficiency. |
text-to-image generation, diffusion models, contrastive learning, parameter efficiency, data efficiency |
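The ECLIPSE entry above describes distilling a vision-language model into a compact prior with a contrastive objective. A minimal sketch of a symmetric, CLIP-style contrastive loss follows, assuming the prior outputs one embedding per caption and a frozen model supplies the matching target embedding. The function names, the temperature value, and the exact pairing are assumptions; the summary does not specify the loss, and the paper may combine it with other terms.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_prior_loss(pred_emb, target_emb, temperature=0.07):
    """Symmetric InfoNCE loss: row i of pred_emb should match row i of target_emb."""
    p = l2_normalize(pred_emb)
    t = l2_normalize(target_emb)
    logits = p @ t.T / temperature  # (B, B) cosine similarities, scaled
    idx = np.arange(len(p))
    loss_p2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2p = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_p2t + loss_t2p)
```

The loss is small when each predicted embedding is closest to its own target within the batch and large when the pairing is scrambled, so other items in the batch serve as negatives.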
2312.04567
Report |
Scaling Laws of Synthetic Images for Model Training ... for Now |
Lijie Fan, Kaifeng Chen, Dilip Krishnan, Dina Katabi, Phillip Isola, Yonglong Tian |
Recent significant advances in text-to-image models unlock the possibility of
training vision systems using synthetic images, potentially overcoming the
difficulty of collecting curated data at scale. It is unclear, however, how
these models behave at scale, as more synthetic data is added to the training
set. In this paper we study the scaling laws of synthetic images generated by
state of the art text-to-image models, for the training of supervised models:
image classifiers with label supervision, and CLIP with language supervision.
We identify several factors, including text prompts, classifier-free guidance
scale, and types of text-to-image models, that significantly affect scaling
behavior. After tuning these factors, we observe that synthetic images
demonstrate a scaling trend similar to, but slightly less effective than, real
images in CLIP training, while they significantly underperform in scaling when
training supervised image classifiers. Our analysis indicates that the main
reason for this underperformance is the inability of off-the-shelf
text-to-image models to generate certain concepts, a limitation that
significantly impairs the training of image classifiers. Our findings also
suggest that scaling synthetic data can be particularly effective in scenarios
such as: (1) when there is a limited supply of real images for a supervised
problem (e.g., fewer than 0.5 million images in ImageNet), (2) when the
evaluation dataset diverges significantly from the training data, indicating
the out-of-distribution scenario, or (3) when synthetic data is used in
conjunction with real images, as demonstrated in the training of CLIP models. |
This paper studies the scaling laws of synthetic data for training supervised vision models, particularly examining the impact of different factors like text prompts, classifier-free guidance scale, and text-to-image models. |
Understanding the scaling behavior of synthetic data is crucial for exploring the potential of using synthetic images to train vision models, potentially overcoming limitations of real data collection. |
The paper conducts empirical studies on supervised image classifiers and CLIP models, analyzing the scaling trends of synthetic images generated by Stable Diffusion, Imagen, and Muse in comparison to real images. It investigates the impact of different generation configurations on scaling ability, focusing on performance at the 1.3M-image scale (ImageNet size) and on scaling behavior. |
Synthetic data exhibits power-law scaling in both supervised and CLIP training, but with lower efficiency compared to real data.
Tuning prompt design, classifier-free guidance, and text-to-image model choice significantly improves the scaling ability of synthetic data.
While generally underperforming real data for supervised classifiers, synthetic data can be more effective in cases of limited real data, out-of-distribution scenarios, and in combination with real data for CLIP training. |
Scaling behavior beyond 4M images is limited by model capacity, requiring larger models.
Current text-to-image models struggle to generate certain concepts accurately, limiting the scaling potential of synthetic data. |
synthetic data, scaling laws, text-to-image models, supervised learning, clip |
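The scaling-laws entry above reports power-law scaling with dataset size. A minimal sketch of how such a trend can be fit is shown below, assuming the functional form loss ≈ a · n^(−b), which becomes a straight line in log-log space. The form, the function names, and any numbers passed in are illustrative assumptions; no values from the paper are reproduced here.

```python
import numpy as np

def fit_power_law(n_samples, losses):
    """Fit loss ≈ a * n**(-b) by linear regression in log-log space."""
    log_n = np.log(np.asarray(n_samples, dtype=float))
    log_l = np.log(np.asarray(losses, dtype=float))
    slope, intercept = np.polyfit(log_n, log_l, deg=1)
    return np.exp(intercept), -slope  # (a, b)

def predict_loss(n, a, b):
    """Evaluate the fitted power law at dataset size n."""
    return a * np.asarray(n, dtype=float) ** (-b)
```

Comparing the fitted exponent `b` for synthetic and real training sets is one simple way to express "similar trend, lower efficiency": a smaller `b` means each doubling of data buys less improvement.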
2312.04566
Report |
Gen2Det: Generate to Detect |
Saksham Suri, Fanyi Xiao, Animesh Sinha, Sean Chang Culatana, Raghuraman Krishnamoorthi, Chenchen Zhu, Abhinav Shrivastava |
Recently, diffusion models have shown improvement in synthetic image quality
as well as better control in generation. We motivate and present Gen2Det, a
simple modular pipeline to create synthetic training data for object detection
for free by leveraging state-of-the-art grounded image generation methods.
Unlike existing works, which generate individual object instances and require
identifying the foreground and then pasting it onto other images, we simplify to
directly generating scene-centric images. In addition to the synthetic data,
Gen2Det also proposes a suite of techniques to best utilize the generated data,
including image-level filtering, instance-level filtering, and a better training
recipe to account for imperfections in the generation. Using Gen2Det, we show
healthy improvements on object detection and segmentation tasks under various
settings and agnostic to detection methods. In the long-tailed detection
setting on LVIS, Gen2Det improves the performance on rare categories by a large
margin while also significantly improving the performance on other categories,
e.g. we see an improvement of 2.13 Box AP and 1.84 Mask AP over just training
on real data on LVIS with Mask R-CNN. In the low-data regime setting on COCO,
Gen2Det consistently improves both Box and Mask AP by 2.27 and 1.85 points. In
the most general detection setting, Gen2Det still demonstrates robust
performance gains, e.g. it improves the Box and Mask AP on COCO by 0.45 and
0.32 points. |
Introduces Gen2Det, a modular pipeline for creating synthetic training data for object detection by leveraging grounded image generation methods. |
Aims to improve object detection and segmentation performance, particularly for rare categories and in low-data regimes, by generating more realistic and contextually relevant training data. |
Utilizes grounded inpainting diffusion models to generate scene-centric images with new object instances. Employs image-level and instance-level filtering to remove low-quality generations. Introduces a sampling strategy for mixing real and synthetic data and modifies the loss function to accommodate filtered instances. |
Achieves substantial improvements in object detection and segmentation on LVIS and COCO datasets, particularly for rare categories.
Demonstrates consistent gains in low-data settings, highlighting the benefits of synthetic data augmentation in data-constrained scenarios.
Improves both box and mask AP despite not using segmentation masks for synthetic data, indicating the generation of semantically rich synthetic images. |
Current generation models may still lack diversity, limiting the potential benefits of scaling up synthetic data.
Exploration of improved filtering techniques or generation models could further enhance the quality of synthetic data and potentially lead to even larger performance gains. |
object detection, synthetic data, diffusion models, grounded inpainting, long-tailed learning |
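The two filtering stages described for Gen2Det can be illustrated with a minimal sketch. The scores, thresholds, and the `Instance` type below are hypothetical placeholders, not the paper's actual scoring functions; the sketch only shows the control flow of dropping whole images first and then flagging individual instances so that a loss could ignore them.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    label: str
    score: float  # hypothetical per-instance quality score in [0, 1]


def filter_synthetic(images, image_thresh=0.5, instance_thresh=0.3):
    """images: list of (image_score, [Instance, ...]) for generated images.

    Returns a list of (instances, keep_flags) for images that pass the
    image-level filter. Thresholds are illustrative assumptions.
    """
    kept = []
    for image_score, instances in images:
        # Image-level filtering: discard the whole generation if it scores poorly.
        if image_score < image_thresh:
            continue
        # Instance-level filtering: flag instances rather than deleting them,
        # so a training loss could skip the flagged ones.
        flags = [inst.score >= instance_thresh for inst in instances]
        kept.append((instances, flags))
    return kept
```

Returning flags instead of a pruned list is one possible way to let the loss account for filtered instances; the paper's exact mechanism may differ.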
2312.04565
Report |
MuRF: Multi-Baseline Radiance Fields |
Haofei Xu, Anpei Chen, Yuedong Chen, Christos Sakaridis, Yulun Zhang, Marc Pollefeys, Andreas Geiger, Fisher Yu |
We present Multi-Baseline Radiance Fields (MuRF), a general feed-forward
approach to solving sparse view synthesis under multiple different baseline
settings (small and large baselines, and different number of input views). To
render a target novel view, we discretize the 3D space into planes parallel to
the target image plane, and accordingly construct a target view frustum volume.
Such a target volume representation is spatially aligned with the target view,
which effectively aggregates relevant information from the input views for
high-quality rendering. It also facilitates subsequent radiance field
regression with a convolutional network thanks to its axis-aligned nature. The
3D context modeled by the convolutional network enables our method to synthesize
sharper scene structures than prior works. Our MuRF achieves state-of-the-art
performance across multiple different baseline settings and diverse scenarios
ranging from simple objects (DTU) to complex indoor and outdoor scenes
(RealEstate10K and LLFF). We also show promising zero-shot generalization
abilities on the Mip-NeRF 360 dataset, demonstrating the general applicability
of MuRF. |
This paper introduces Multi-Baseline Radiance Fields (MuRF), a feed-forward neural radiance field model for novel view synthesis that effectively handles both small and large baseline camera settings. |
Existing methods for sparse view synthesis often specialize in either small or large baselines, limiting their general applicability. MuRF addresses this by providing a unified solution that excels in both scenarios. |
MuRF constructs a target view frustum volume, spatially aligned with the target view, to effectively aggregate multi-view information. This volume, along with multi-view features and their cosine similarities, is processed by a 3D context-aware convolutional decoder to reconstruct the radiance field. |
MuRF achieves state-of-the-art performance on DTU and RealEstate10K datasets, outperforming specialized small and large baseline methods, respectively.
It exhibits promising zero-shot generalization abilities on the Mip-NeRF 360 dataset, indicating its robustness to unseen data.
Ablation studies validate the importance of the target view frustum volume and the 3D context-aware decoder. |
The model currently assumes known camera parameters and static scenes.
Performance could be further enhanced with larger and more diverse scene-level datasets. |
novel view synthesis, neural radiance fields, multi-baseline, sparse view synthesis, computer vision |
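As a rough illustration of the target view frustum volume in MuRF, the sketch below back-projects every target pixel onto a set of depth planes parallel to the target image plane. It assumes a pinhole camera with intrinsics `fx, fy, cx, cy` and linearly spaced depths; both choices, and all function names, are assumptions made for the example rather than details taken from the paper.

```python
def frustum_points(width, height, fx, fy, cx, cy, near, far, num_planes):
    """Return volume[d][v][u] = (x, y, z) in the target camera frame.

    Each depth plane is parallel to the target image plane, so the volume
    is axis-aligned with the target view.
    """
    if num_planes < 2:
        raise ValueError("need at least two depth planes")
    # Linear depth spacing is an assumption; other spacings are possible.
    depths = [near + (far - near) * i / (num_planes - 1) for i in range(num_planes)]
    volume = []
    for z in depths:
        plane = []
        for v in range(height):
            row = []
            for u in range(width):
                # Pinhole back-projection of pixel (u, v) to depth z.
                x = (u - cx) / fx * z
                y = (v - cy) / fy * z
                row.append((x, y, z))
            plane.append(row)
        volume.append(plane)
    return volume
```

Each 3D point could then be projected into the input views to gather features, which is the aggregation step the report describes.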
2312.04564
Report |
EAGLES: Efficient Accelerated 3D Gaussians with Lightweight EncodingS |
Sharath Girish, Kamal Gupta, Abhinav Shrivastava |
Recently, 3D Gaussian splatting (3D-GS) has gained popularity in novel-view
scene synthesis. It addresses the challenges of lengthy training times and slow
rendering speeds associated with Neural Radiance Fields (NeRFs). Through rapid,
differentiable rasterization of 3D Gaussians, 3D-GS achieves real-time
rendering and accelerated training. It does, however, demand substantial memory
resources for both training and storage, as it requires millions of Gaussians
in their point cloud representation for each scene. We present a technique
utilizing quantized embeddings to significantly reduce per-point memory storage
requirements and a coarse-to-fine training strategy for a faster and more
stable optimization of the Gaussian point clouds. Our approach develops a
pruning stage which results in scene representations with fewer Gaussians,
leading to faster training times and rendering speeds for real-time rendering
of high resolution scenes. We reduce storage memory by more than an order of
magnitude all while preserving the reconstruction quality. We validate the
effectiveness of our approach on a variety of datasets and scenes, preserving
the visual quality while consuming 10-20x less memory and achieving faster
training/inference speed. Project page and code are available at
https://efficientgaussian.github.io |
This paper introduces EAGLES, a novel approach to compress 3D Gaussian point cloud representations for novel view synthesis, leading to significant reductions in storage and runtime memory while maintaining high reconstruction quality. |
3D Gaussian Splatting (3D-GS), while advantageous over NeRFs for novel view synthesis, suffers from high memory usage due to the millions of Gaussians needed to represent scenes. This work aims to address this limitation. |
The authors employ quantized embeddings to compress color and rotation attributes, quantize opacity for optimization improvement, use a coarse-to-fine training strategy, and implement influence pruning to eliminate redundant Gaussians. |
EAGLES achieves comparable or better reconstruction quality than 3D-GS while reducing storage size by 10-20 times.
The method accelerates both training and rendering, achieving higher FPS and lower training times across datasets.
EAGLES significantly reduces GPU memory consumption during both training and rendering compared to 3D-GS. |
Further exploration of more complex decoders and quantization techniques could yield additional compression.
Investigating the integration of meta-learning approaches for compressing scenes from auxiliary datasets is a potential future direction. |
3d gaussian splatting, novel view synthesis, point cloud compression, quantization, progressive training |
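To make the idea of quantized per-point attributes in EAGLES concrete, here is a minimal sketch of uniform rounding quantization. This is a generic scheme chosen for illustration and is an assumption; the paper's learned embeddings and decoder are not reproduced here, and `step` is a hypothetical parameter.

```python
def quantize(values, step):
    """Map float attributes to integer codes with a uniform step size."""
    return [round(v / step) for v in values]


def dequantize(codes, step):
    """Recover approximate float attributes from integer codes."""
    return [c * step for c in codes]
```

Storing small integer codes instead of 32-bit floats is what reduces per-point storage; the reconstruction error of this particular scheme is bounded by half the step size.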
2312.04561
Report |
GenDeF: Learning Generative Deformation Field for Video Generation |
Wen Wang, Kecheng Zheng, Qiuyu Wang, Hao Chen, Zifan Shi, Ceyuan Yang, Yujun Shen, Chunhua Shen |
We offer a new perspective on approaching the task of video generation.
Instead of directly synthesizing a sequence of frames, we propose to render a
video by warping one static image with a generative deformation field (GenDeF).
Such a pipeline enjoys three appealing advantages. First, we can sufficiently
reuse a well-trained image generator to synthesize the static image (also
called canonical image), alleviating the difficulty in producing a video and
thereby resulting in better visual quality. Second, we can easily convert a
deformation field to optical flows, making it possible to apply explicit
structural regularizations for motion modeling, leading to temporally
consistent results. Third, the disentanglement between content and motion
allows users to process a synthesized video through processing its
corresponding static image without any tuning, facilitating many applications
like video editing, keypoint tracking, and video segmentation. Both qualitative
and quantitative results on three common video generation benchmarks
demonstrate the superiority of our GenDeF method. |
This paper presents GenDeF, a novel video generation approach that decomposes videos into a content-rich canonical image and a motion-encoding deformation field, enabling high-quality video synthesis by warping the canonical image. |
This method addresses challenges in video generation related to high dimensionality, motion complexity, and temporal consistency, while also facilitating downstream video processing applications. |
GenDeF utilizes a GAN framework with a canonical image branch and a deformation field branch, both conditioned on input latent codes. The canonical image captures shared content, while the deformation field, conditioned on the canonical image features, encodes temporal motion. The model is trained with adversarial losses and a structural temporal smoothness constraint. |
GenDeF achieves state-of-the-art results on standard video generation benchmarks, demonstrating superior performance in temporal consistency and individual frame quality.
The explicit content-motion decomposition allows for generating multiple plausible videos with varied motions from a single canonical image.
The method facilitates downstream applications such as consistent video editing, point tracking, and video segmentation by leveraging the interpretable canonical image and deformation field. |
The fixed-resolution representation of the canonical image and deformation field limits the model's ability to handle arbitrary resolutions and extremely long videos.
The current approach relies on generated data, limiting its direct applicability to real-world videos where canonical images and deformation fields are not readily available. |
video generation, generative adversarial networks, deformation fields, canonical image, video editing |
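The core rendering step in GenDeF, warping a canonical image with a deformation field, can be sketched with plain bilinear sampling. The data layout below (nested lists of grayscale values, and `deform[y][x]` holding the canonical-image coordinates to sample) is an assumption made for the example, not the paper's implementation.

```python
def bilinear(img, x, y):
    """Sample a grayscale image (list of rows) at continuous coordinates."""
    h, w = len(img), len(img[0])
    # Clamp coordinates to the image border.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = img[y0][x0] * (1 - ax) + img[y0][x1] * ax
    bottom = img[y1][x0] * (1 - ax) + img[y1][x1] * ax
    return top * (1 - ay) + bottom * ay


def warp(canonical, deform):
    """deform[y][x] = (sx, sy): where frame pixel (x, y) samples the canonical image."""
    return [
        [bilinear(canonical, sx, sy) for (sx, sy) in row]
        for row in deform
    ]
```

Under this layout, subtracting each pixel's own coordinates from its sampling location gives a displacement field, which is one way a deformation field can be read as optical flow.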
2312.04560
Report |
NeRFiller: Completing Scenes via Generative 3D Inpainting |
Ethan Weber, Aleksander Hołyński, Varun Jampani, Saurabh Saxena, Noah Snavely, Abhishek Kar, Angjoo Kanazawa |
We propose NeRFiller, an approach that completes missing portions of a 3D
capture via generative 3D inpainting using off-the-shelf 2D visual generative
models. Often parts of a captured 3D scene or object are missing due to mesh
reconstruction failures or a lack of observations (e.g., contact regions, such
as the bottom of objects, or hard-to-reach areas). We approach this challenging
3D inpainting problem by leveraging a 2D inpainting diffusion model. We
identify a surprising behavior of these models, where they generate more 3D
consistent inpaints when images form a 2×2 grid, and show how to
generalize this behavior to more than four images. We then present an iterative
framework to distill these inpainted regions into a single consistent 3D scene.
In contrast to related works, we focus on completing scenes rather than
deleting foreground objects, and our approach does not require tight 2D object
masks or text. We compare our approach to relevant baselines adapted to our
setting on a variety of scenes, where NeRFiller creates the most 3D consistent
and plausible scene completions. Our project page is at
https://ethanweber.me/nerfiller. |
Presents NeRFiller, a method for completing missing parts of 3D scenes using off-the-shelf 2D inpainting diffusion models. |
Addresses the challenge of incomplete 3D captures by enabling 3D-aware and multi-view consistent scene completion. |
Identifies and leverages a 'Grid Prior' in diffusion models, generalizing it to multiple views with 'Joint Multi-View Inpainting' and iteratively distilling inpainted regions into a 3D scene representation. |
Joint Multi-View Inpainting improves 3D consistency compared to individual image inpainting.
NeRFiller generates more plausible and consistent 3D completions than adapted object-removal baselines.
The method allows for reference-guided inpainting for controlled scene completion. |
Limited resolution and potential blur in generated content.
Challenges in applying the method to casually captured scenes with large masked regions and specific mask patterns. |
3d inpainting, scene completion, diffusion models, neural radiance fields, multi-view consistency |
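The 2×2 grid arrangement that NeRFiller exploits can be illustrated by tiling four equally sized images into one larger image and splitting the result back afterwards. The helper names and the nested-list image layout are assumptions for this sketch; the inpainting model call that would sit between the two steps is omitted.

```python
def tile_2x2(images):
    """Tile four same-sized images (lists of rows) into one 2x2 grid image."""
    a, b, c, d = images
    top = [row_a + row_b for row_a, row_b in zip(a, b)]
    bottom = [row_c + row_d for row_c, row_d in zip(c, d)]
    return top + bottom


def split_2x2(grid):
    """Inverse of tile_2x2: recover the four images from the grid."""
    h, w = len(grid) // 2, len(grid[0]) // 2
    return [
        [row[:w] for row in grid[:h]],
        [row[w:] for row in grid[:h]],
        [row[:w] for row in grid[h:]],
        [row[w:] for row in grid[h:]],
    ]
```

In use, the grid (and a matching grid of masks) would be passed to a 2D inpainting model as a single image before being split back into per-view results.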
2312.04558
Report |
MonoGaussianAvatar: Monocular Gaussian Point-based Head Avatar |
Yufan Chen, Lizhen Wang, Qijing Li, Hongjiang Xiao, Shengping Zhang, Hongxun Yao, Yebin Liu |
The ability to animate photo-realistic head avatars reconstructed from
monocular portrait video sequences represents a crucial step in bridging the
gap between the virtual and real worlds. Recent advancements in head avatar
techniques, including explicit 3D morphable meshes (3DMM), point clouds, and
neural implicit representations, have been exploited for this ongoing research.
However, 3DMM-based methods are constrained by their fixed topologies,
point-based approaches suffer from a heavy training burden due to the extensive
quantity of points involved, and neural implicit methods suffer from limitations in
deformation flexibility and rendering efficiency. In response to these
challenges, we propose MonoGaussianAvatar (Monocular Gaussian Point-based Head
Avatar), a novel approach that harnesses 3D Gaussian point representation
coupled with a Gaussian deformation field to learn explicit head avatars from
monocular portrait videos. We define our head avatars with Gaussian points
characterized by adaptable shapes, enabling flexible topology. These points
exhibit movement with a Gaussian deformation field in alignment with the target
pose and expression of a person, facilitating efficient deformation.
Additionally, the Gaussian points have controllable shape, size, color, and
opacity combined with Gaussian splatting, allowing for efficient training and
rendering. Experiments demonstrate the superior performance of our method,
which achieves state-of-the-art results compared with previous methods. |
This paper introduces MonoGaussianAvatar, a novel approach for creating dynamic 3D head avatars from monocular portrait videos using 3D Gaussian points and a Gaussian deformation field. |
Existing methods for creating 3D head avatars suffer from limitations in topology, training burden, deformation flexibility, and rendering efficiency. This new method aims to address these challenges. |
The method uses 3D Gaussian points to represent facial features and learns a deformation field to animate these points according to target poses and expressions. It employs a two-stage initialization strategy for Gaussians and a novel point insertion/deletion approach for efficient training. |
MonoGaussianAvatar outperforms state-of-the-art methods in terms of structural similarity, image similarity, and Peak Signal-to-Noise Ratio.
The method accurately captures fine details like teeth and hair, surpasses mesh-based approaches in modeling thin hair strands, and avoids holes during significant head movements.
The introduced Gaussian deformation field proves crucial for preserving the structure of accessories and preventing blurring in novel poses. |
The current method lacks the ability to model reflections on eyeglass lenses, which presents an area for future research.
The method's reliance on 3DMM priors limits its ability to handle extreme expressions that deviate from these priors. |
3d head avatar, gaussian splatting, monocular reconstruction, facial animation, point-based representation |
2312.04557
Report |
GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation |
Shoufa Chen, Mengmeng Xu, Jiawei Ren, Yuren Cong, Sen He, Yanping Xie, Animesh Sinha, Ping Luo, Tao Xiang, Juan-Manuel Perez-Rua |
In this study, we explore Transformer-based diffusion models for image and
video generation. Despite the dominance of Transformer architectures in various
fields due to their flexibility and scalability, the visual generative domain
primarily utilizes CNN-based U-Net architectures, particularly in
diffusion-based models. We introduce GenTron, a family of Generative models
employing Transformer-based diffusion, to address this gap. Our initial step
was to adapt Diffusion Transformers (DiTs) from class to text conditioning, a
process involving thorough empirical exploration of the conditioning mechanism.
We then scale GenTron from approximately 900M to over 3B parameters, observing
significant improvements in visual quality. Furthermore, we extend GenTron to
text-to-video generation, incorporating novel motion-free guidance to enhance
video quality. In human evaluations against SDXL, GenTron achieves a 51.1% win
rate in visual quality (with a 19.8% draw rate), and a 42.3% win rate in text
alignment (with a 42.9% draw rate). GenTron also excels in the T2I-CompBench,
underscoring its strengths in compositional generation. We believe this work
will provide meaningful insights and serve as a valuable reference for future
research. |
The paper introduces GenTron, a family of generative Transformer-based diffusion models for high-quality text-to-image and video generation. |
This work bridges the gap between the dominant architectures used in visual generation (CNNs) and those used in NLP and visual perception (Transformers). |
The authors adapt Diffusion Transformers (DiTs) from class-conditional to text-conditional image generation, explore different conditioning mechanisms, scale the model up to 3B parameters, and extend it to video generation using a novel motion-free guidance technique. |
Cross-attention outperforms adaptive layer norm (adaLN) for text conditioning in GenTron.
Scaling GenTron to 3B parameters significantly improves visual quality.
Motion-free guidance improves the quality of generated videos and allows for the integration of image-text data during training. |
The performance of video generation, while promising, still lags behind state-of-the-art image generation.
Future work could focus on developing more efficient training methods for large-scale Transformer-based diffusion models. |
text-to-image generation, text-to-video generation, diffusion models, transformers, motion-free guidance |
2312.04551
Report |
Free3D: Consistent Novel View Synthesis without 3D Representation |
Chuanxia Zheng, Andrea Vedaldi |
We introduce Free3D, a simple, accurate method for monocular open-set novel
view synthesis (NVS). Similar to Zero-1-to-3, we start from a pre-trained 2D
image generator for generalization, and fine-tune it for NVS. Compared to other
works that took a similar approach, we obtain significant improvements without
resorting to an explicit 3D representation, which is slow and memory-consuming,
and without training an additional network for 3D reconstruction. Our key
contribution is to improve the way the target camera pose is encoded in the
network, which we do by introducing a new ray conditioning normalization (RCN)
layer. The latter injects pose information in the underlying 2D image generator
by telling each pixel its viewing direction. We further improve multi-view
consistency by using light-weight multi-view attention layers and by sharing
generation noise between the different views. We train Free3D on the Objaverse
dataset and demonstrate excellent generalization to new categories in new
datasets, including OmniObject3D and GSO. The project page is available at
https://chuanxiaz.com/free3d/. |
This paper presents Free3D, a novel approach for single-view, open-set novel view synthesis that improves pose accuracy and multi-view consistency without relying on explicit 3D representations. |
Existing open-set novel view synthesis methods often struggle with accurate camera pose control and consistent multi-view generation, especially without relying on computationally expensive 3D representations. |
The methodology leverages a pre-trained 2D image generator enhanced by a novel Ray Conditioning Normalization (RCN) layer for accurate pose encoding. Multi-view consistency is achieved through a pseudo-3D cross-view attention module and multi-view noise sharing during image generation. |
Free3D surpasses state-of-the-art models in pose accuracy and view consistency on the Objaverse dataset, even those trained on larger datasets or employing explicit 3D representations.
The method demonstrates strong generalization ability by achieving superior results on unseen datasets like OmniObject3D and GSO, outperforming competing methods without fine-tuning.
Ablation studies confirm the effectiveness of RCN, multi-view attention, and noise sharing in enhancing pose accuracy and multi-view consistency. |
The method relies on a pre-trained 2D generator, which, while enabling generalization, might limit its capacity to model complex 3D structures not well-represented in the training data.
Future work could explore the integration of more advanced attention mechanisms or training strategies to further enhance multi-view consistency and detail preservation. |
novel view synthesis, open-set learning, generative models, ray conditioning, multi-view consistency |
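Free3D's ray conditioning is described as telling each pixel its viewing direction. A minimal sketch of computing such per-pixel directions is given below, assuming a pinhole camera and expressing rays in the camera frame; the exact ray encoding and how it enters the normalization layer are not specified in this report, so everything here is an illustrative assumption.

```python
import math


def ray_directions(width, height, fx, fy, cx, cy):
    """Return rays[v][u] = unit viewing direction of pixel (u, v), camera frame."""
    rays = []
    for v in range(height):
        row = []
        for u in range(width):
            # Unnormalized direction through the pixel for a pinhole camera.
            d = ((u - cx) / fx, (v - cy) / fy, 1.0)
            norm = math.sqrt(sum(c * c for c in d))
            row.append(tuple(c / norm for c in d))
        rays.append(row)
    return rays
```

To condition on a target pose, these directions would additionally be rotated by the target camera's orientation, a step left out of the sketch.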
2312.04539
Report |
Auto-Vocabulary Semantic Segmentation |
Osman Ülger, Maksymilian Kulicki, Yuki Asano, Martin R. Oswald |
Open-ended image understanding tasks have gained significant attention from the
research community, particularly with the emergence of Vision-Language Models.
Open-Vocabulary Segmentation (OVS) methods are capable of performing semantic
segmentation without relying on a fixed vocabulary, and in some cases, they
operate without the need for training or fine-tuning. However, OVS methods
typically require users to specify the vocabulary based on the task or dataset
at hand. In this paper, we introduce Auto-Vocabulary Semantic
Segmentation (AVS), advancing open-ended image understanding by eliminating
the necessity to predefine object categories for segmentation. Our approach
presents a framework that autonomously identifies relevant class names
using enhanced BLIP embeddings, which are utilized for segmentation afterwards.
Given that open-ended object category predictions cannot be directly compared
with a fixed ground truth, we develop a Large Language Model-based
Auto-Vocabulary Evaluator (LAVE) to efficiently evaluate the automatically
generated class names and their corresponding segments. Our method sets new
benchmarks on datasets such as PASCAL VOC and Context, ADE20K, and Cityscapes
for AVS and showcases competitive performance to OVS methods that require
specified class names. |
This paper introduces Auto-Vocabulary Semantic Segmentation (AVS), a novel task aiming to segment images and assign classes without predefined categories, user input, or additional training data, unlike traditional or open-vocabulary methods. |
This work pushes the boundaries of open-ended image understanding by enabling a system to autonomously determine relevant object categories for segmentation, similar to human perception. |
The proposed AVS framework utilizes BLIP-Cluster-Caption (BCC) to generate class names by clustering BLIP embeddings, enhancing them for semantic accuracy, and captioning each cluster. These generated nouns then guide a pre-trained open-vocabulary segmentation model (X-Decoder) for pixel-level prediction. A novel evaluation metric, LAVE, leverages an LLM to map generated categories to dataset annotations for performance assessment. |
The AVS framework shows competitive performance against open-vocabulary methods on PASCAL VOC, ADE20K, and Cityscapes, despite not having access to predefined categories.
BCC effectively identifies and segments out-of-vocabulary classes, demonstrating comprehension beyond fixed datasets.
The LLM-based evaluator, LAVE, successfully bridges the gap between open-ended predictions and fixed ground truth annotations. |
Performance on datasets with a high number of instances, like Cityscapes, is notably lower than some open-vocabulary methods, suggesting room for improvement in handling complex scenes.
Occasional misclassifications occur due to the model struggling to differentiate between semantically similar classes (e.g., hyponyms and hypernyms), highlighting the need for further refinement in semantic reasoning. |
auto-vocabulary semantic segmentation, open-vocabulary semantic segmentation, semantic segmentation, vision-language models, image captioning |
2312.04534
Report |
PICTURE: PhotorealistIC virtual Try-on from UnconstRained dEsigns |
Shuliang Ning, Duomin Wang, Yipeng Qin, Zirong Jin, Baoyuan Wang, Xiaoguang Han |
In this paper, we propose a novel virtual try-on from unconstrained designs
(ucVTON) task to enable photorealistic synthesis of personalized composite
clothing on input human images. Unlike prior art constrained by specific input
types, our method allows flexible specification of style (text or image) and
texture (full garment, cropped sections, or texture patches) conditions. To
address the entanglement challenge when using full garment images as
conditions, we develop a two-stage pipeline with explicit disentanglement of
style and texture. In the first stage, we generate a human parsing map
reflecting the desired style conditioned on the input. In the second stage, we
composite textures onto the parsing map areas based on the texture input. To
represent complex and non-stationary textures, which have never been achieved in
previous fashion editing works, we are the first to propose extracting hierarchical and
balanced CLIP features and applying position encoding in VTON. Experiments
demonstrate superior synthesis quality and personalization enabled by our
method. The flexible control over style and texture mixing brings virtual
try-on to a new level of user experience for online shopping and fashion
design. |
This paper presents ucVTON, a novel virtual try-on method that allows users to synthesize personalized clothing with flexible style (text/image) and texture (full garment, cropped sections, or texture patches) conditions. |
Existing virtual try-on methods are limited in the types of inputs they allow, hindering users from mixing and matching style and texture elements from different garments. |
A two-stage pipeline is proposed, disentangling style and texture. Stage 1 generates a parsing map reflecting the desired style. Stage 2 composites textures onto the parsing map based on the texture input. The method also introduces hierarchical and balanced CLIP features with position encoding to handle complex, non-stationary textures. |
The method achieves significantly higher style prediction accuracy compared to prior art.
It outperforms state-of-the-art methods in texture quality, particularly when using full garment images as texture input.
User studies confirm that ucVTON is preferred for its fidelity in style and texture, and overall image quality. |
The model currently lacks control over garment shape and fit.
Future work will explore incorporating user control over these aspects to further enhance personalization. |
virtual try-on, fashion editing, diffusion models, style and texture disentanglement, clip features |
2312.04524
Report |
RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models |
Ozgur Kara, Bariscan Kurtkaya, Hidir Yesiltepe, James M. Rehg, Pinar Yanardag |
Recent advancements in diffusion-based models have demonstrated significant
success in generating images from text. However, video editing models have not
yet reached the same level of visual quality and user control. To address this,
we introduce RAVE, a zero-shot video editing method that leverages pre-trained
text-to-image diffusion models without additional training. RAVE takes an input
video and a text prompt to produce high-quality videos while preserving the
original motion and semantic structure. It employs a novel noise shuffling
strategy, leveraging spatio-temporal interactions between frames, to produce
temporally consistent videos faster than existing methods. It is also efficient
in terms of memory requirements, allowing it to handle longer videos. RAVE is
capable of a wide range of edits, from local attribute modifications to shape
transformations. In order to demonstrate the versatility of RAVE, we create a
comprehensive video evaluation dataset ranging from object-focused scenes to
complex human activities like dancing and typing, and dynamic scenes featuring
swimming fish and boats. Our qualitative and quantitative experiments highlight
the effectiveness of RAVE in diverse video editing scenarios compared to
existing methods. Our code, dataset and videos can be found at
https://rave-video.github.io. |
RAVE is a zero-shot video editing method that leverages pre-trained text-to-image diffusion models for style, attribute, and shape editing in videos, while preserving motion and structure. |
Existing video editing methods lack the visual quality, user control, and efficiency of their image editing counterparts. RAVE addresses this gap by enabling diverse video edits using pre-trained models without the need for extensive training. |
RAVE introduces a novel noise shuffling strategy within a grid-based video editing framework. This strategy leverages spatio-temporal interactions during the diffusion process to enhance temporal consistency across video frames, even for longer videos. |
RAVE demonstrates superior temporal consistency and textual alignment compared to baseline methods, as evidenced by both quantitative and qualitative evaluations.
The noise shuffling strategy in RAVE proves effective in maintaining consistency across multiple grids, overcoming limitations of conventional attention mechanisms for longer videos.
RAVE exhibits efficiency in terms of runtime, achieving edits approximately 25% faster than the closest competitor, making it suitable for real-time video editing applications. |
RAVE faces limitations in maintaining consistent shape transformations for extreme shape edits in longer videos.
Fine details can exhibit flickering in the edited videos, especially in cases requiring high-frequency edits, due to the absence of explicit pixel-level deflickering techniques. |
video editing, diffusion models, text-guided editing, temporal consistency, zero-shot learning |
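The noise shuffling idea in RAVE, regrouping frames into grids in a fresh random order at each denoising step, can be sketched as follows. The function names, the grid size, and the use of Python's `random` module are assumptions for illustration; the diffusion step applied to each grid is omitted.

```python
import random


def shuffled_grids(num_frames, grid_size, rng):
    """Randomly permute frame indices and partition them into grids."""
    order = list(range(num_frames))
    rng.shuffle(order)
    return [order[i:i + grid_size] for i in range(0, num_frames, grid_size)]


def grouping_schedule(num_frames, grid_size, num_steps, seed=0):
    """One random grouping of frames into grids per denoising step."""
    rng = random.Random(seed)
    return [shuffled_grids(num_frames, grid_size, rng) for _ in range(num_steps)]
```

Because each step pairs a frame with different neighbors, information can propagate between frames that never share a grid in any single step, which is the intuition behind the consistency gains the report describes.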
2312.04483
Report |
Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation |
Zhiwu Qing, Shiwei Zhang, Jiayu Wang, Xiang Wang, Yujie Wei, Yingya Zhang, Changxin Gao, Nong Sang |
Despite diffusion models having shown powerful abilities to generate
photorealistic images, generating videos that are realistic and diverse still
remains in its infancy. One of the key reasons is that current methods
intertwine spatial content and temporal dynamics together, leading to a notably
increased complexity of text-to-video generation (T2V). In this work, we
propose HiGen, a diffusion model-based method that improves performance by
decoupling the spatial and temporal factors of videos from two perspectives,
i.e., structure level and content level. At the structure level, we decompose
the T2V task into two steps, including spatial reasoning and temporal
reasoning, using a unified denoiser. Specifically, we generate spatially
coherent priors using text during spatial reasoning and then generate
temporally coherent motions from these priors during temporal reasoning. At the
content level, we extract two subtle cues from the content of the input video
that can express motion and appearance changes, respectively. These two cues
then guide the model's training for generating videos, enabling flexible
content variations and enhancing temporal stability. Through the decoupled
paradigm, HiGen can effectively reduce the complexity of this task and generate
realistic videos with semantic accuracy and motion stability. Extensive
experiments demonstrate the superior performance of HiGen over the
state-of-the-art T2V methods. |
Proposes HiGen-T2V, a diffusion model-based text-to-video generation method that decouples spatial and temporal factors to improve realism and diversity. |
Current T2V methods struggle to jointly generate realistic spatial content and diverse temporal dynamics due to the complexity of video data. |
Decouples video generation at two levels: 1) Structure level: separates text-to-video generation into spatial reasoning (generates spatially coherent priors from text) and temporal reasoning (generates temporally coherent motions from priors). 2) Content level: extracts motion and appearance cues from input videos to guide model training and enhance stability and diversity. |
Achieves superior performance compared to state-of-the-art T2V methods on MSR-VTT dataset.
Demonstrates improved spatial quality and temporal stability through ablation studies.
Allows flexible control over generated videos by manipulating motion and appearance factors. |
Object detail generation lags behind image synthesis models due to computational and data limitations.
Modeling human and animal actions realistically, especially with substantial motion, remains challenging. |
text-to-video generation, diffusion models, spatio-temporal decoupling, motion and appearance analysis, deep learning |
2312.04461
Report |
PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding |
Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming-Ming Cheng, Ying Shan |
Recent advances in text-to-image generation have made remarkable progress in
synthesizing realistic human photos conditioned on given text prompts. However,
existing personalized generation methods cannot simultaneously satisfy the
requirements of high efficiency, promising identity (ID) fidelity, and flexible
text controllability. In this work, we introduce PhotoMaker, an efficient
personalized text-to-image generation method, which mainly encodes an arbitrary
number of input ID images into a stacked ID embedding for preserving ID
information. Such an embedding, serving as a unified ID representation, can not
only encapsulate the characteristics of the same input ID comprehensively, but
also accommodate the characteristics of different IDs for subsequent
integration. This paves the way for more intriguing and practically valuable
applications. Besides, to drive the training of our PhotoMaker, we propose an
ID-oriented data construction pipeline to assemble the training data. Trained on
the dataset constructed through the proposed pipeline,
PhotoMaker demonstrates better ID preservation ability than test-time
fine-tuning based methods, yet provides significant speed improvements,
high-quality generation results, strong generalization capabilities, and a wide
range of applications. Our project page is available at
https://photo-maker.github.io/ |
This paper proposes PhotoMaker, an efficient personalized text-to-image generation method that encodes multiple input ID images into a stacked ID embedding to generate high-quality, customizable human photos with high ID fidelity. |
Existing personalized generation methods struggle to simultaneously achieve high efficiency, promising ID fidelity, and flexible text controllability. PhotoMaker aims to address these limitations. |
PhotoMaker uses a stacked ID embedding created by concatenating embeddings of multiple input ID images. This embedding is integrated with text embedding to guide a diffusion model (SDXL) for image generation. An ID-oriented data construction pipeline is also proposed to train PhotoMaker. |
PhotoMaker demonstrates high ID fidelity and generation quality comparable to DreamBooth while being significantly faster.
The method offers flexibility in controlling ID attributes like age and gender by simply modifying class words in text prompts.
PhotoMaker enables novel applications like identity mixing, bringing persons from artworks/old photos to reality, and stylization while preserving ID characteristics. |
PhotoMaker currently focuses on generating a single person and does not support multi-person ID control.
The method may inherit biases and limitations from its training data and from the SDXL base model. |
text-to-image generation, personalized image synthesis, diffusion models, identity preservation, image editing |
2312.04433
Report |
DreamVideo: Composing Your Dream Videos with Customized Subject and Motion |
Yujie Wei, Shiwei Zhang, Zhiwu Qing, Hangjie Yuan, Zhiheng Liu, Yu Liu, Yingya Zhang, Jingren Zhou, Hongming Shan |
Customized generation using diffusion models has made impressive progress in
image generation, but remains unsatisfactory in the challenging video
generation task, as it requires the controllability of both subjects and
motions. To that end, we present DreamVideo, a novel approach to generating
personalized videos from a few static images of the desired subject and a few
videos of target motion. DreamVideo decouples this task into two stages,
subject learning and motion learning, by leveraging a pre-trained video
diffusion model. The subject learning aims to accurately capture the fine
appearance of the subject from provided images, which is achieved by combining
textual inversion and fine-tuning of our carefully designed identity adapter.
In motion learning, we architect a motion adapter and fine-tune it on the given
videos to effectively model the target motion pattern. Combining these two
lightweight and efficient adapters allows for flexible customization of any
subject with any motion. Extensive experimental results demonstrate the
superior performance of our DreamVideo over the state-of-the-art methods for
customized video generation. Our project page is at
https://dreamvideo-t2v.github.io. |
Presents DreamVideo, a novel approach that generates personalized videos by customizing both subject identity and motion patterns using a pre-trained video diffusion model and two lightweight adapters. |
Customized video generation is challenging as it requires controllability of both subjects and motions, which is not well addressed by existing methods. |
DreamVideo decouples the task into subject learning and motion learning. Subject learning captures appearance details from static images via textual inversion and a fine-tuned identity adapter. Motion learning models motion patterns from videos using a motion adapter with appearance guidance. |
DreamVideo outperforms state-of-the-art methods in qualitative and quantitative comparisons, including AnimateDiff, ModelScopeT2V, and LoRA fine-tuning.
The method effectively combines customized subjects and motions under various contexts, preserving both identity and motion fidelity.
DreamVideo exhibits strong performance in individual subject and motion customization, surpassing alternatives like Textual Inversion, Dreamix, and Tune-A-Video. |
DreamVideo currently doesn't support customizing multiple subjects with multiple motions.
The approach may struggle with fine-grained single video motion, achieving similar patterns instead of frame-by-frame correspondence. |
video generation, diffusion models, customization, subject customization, motion customization |
2312.04429
Report |
Approximate Caching for Efficiently Serving Diffusion Models |
Shubham Agarwal, Subrata Mitra, Sarthak Chakraborty, Srikrishna Karanam, Koyel Mukherjee, Shiv Saini |
Text-to-image generation using diffusion models has seen explosive popularity
owing to their ability to produce high-quality images that adhere to text
prompts. However, production-grade diffusion model serving is a
resource-intensive task that not only requires expensive high-end GPUs but also
incurs considerable latency. In this paper, we introduce a technique called
approximate-caching that can reduce such iterative denoising steps for an image
generation based on a prompt by reusing intermediate noise states created
during a prior image generation for similar prompts. Based on this idea, we
present an end-to-end text-to-image system, Nirvana, that uses
approximate caching with a novel cache management policy, Least Computationally
Beneficial and Frequently Used (LCBFU) to provide % GPU compute savings, 19.8%
end-to-end latency reduction and 19% dollar savings, on average, on two real
production workloads. We further present an extensive characterization of real
production text-to-image prompts from the perspective of caching, popularity
and reuse of intermediate states in a large production environment. |
Introduces Nirvana, a system using approximate caching to reduce compute cost and latency in text-to-image generation with diffusion models by reusing intermediate noise states from prior prompts. |
Diffusion models for text-to-image generation, while producing high-quality images, are computationally expensive and have high latency, hindering interactive user experiences and increasing costs. |
Leverages approximate caching by storing intermediate noise states from previous image generations. Employs a novel cache management policy (LCBFU) to prioritize states offering the most compute savings. Uses a match predictor to avoid unnecessary cache searches, further reducing latency. |
Nirvana achieves up to 50% reduction in GPU usage and end-to-end latency while maintaining image quality comparable to vanilla diffusion models.
Reduces overall cost, latency, and compute requirements by ~20% on average, improving system throughput by 27%.
User study (N=60) shows 79% preference for Nirvana-generated images, significantly higher than retrieval-based methods and close to vanilla diffusion model quality. |
Image diversity might decrease over time if the system encounters a high volume of very similar prompts.
Reliance on embedding similarity for prompt matching can be challenging for long, complex prompts. |
text-to-image generation, diffusion models, approximate caching, latency reduction, cost optimization |
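The approximate-caching idea in the record above can be illustrated with a small sketch. This is not the Nirvana implementation: the class name `ApproxCache`, the similarity thresholds, and the mapping from similarity to the number of skipped steps are all hypothetical values chosen for illustration. The source only states that intermediate noise states from similar earlier prompts are reused so that some denoising steps can be skipped.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class ApproxCache:
    """Toy cache mapping a prompt embedding to {step K: intermediate state}."""

    # Hypothetical thresholds: a closer match allows skipping more steps.
    THRESHOLDS = [(0.95, 25), (0.85, 15), (0.75, 5)]

    def __init__(self):
        self.entries = []  # list of (embedding, {k: state})

    def put(self, embedding, states_by_step):
        self.entries.append((list(embedding), dict(states_by_step)))

    def lookup(self, embedding):
        """Return (k, state) to resume from, or (0, None) on a miss."""
        best_sim, best_states = 0.0, None
        for emb, states in self.entries:
            sim = cosine(embedding, emb)
            if sim > best_sim:
                best_sim, best_states = sim, states
        if best_states is None:
            return 0, None
        # Pick the largest K whose similarity requirement is met.
        for min_sim, k in self.THRESHOLDS:
            if best_sim >= min_sim and k in best_states:
                return k, best_states[k]
        return 0, None


def steps_to_run(total_steps, k):
    """Denoising steps still required after resuming from step k."""
    return total_steps - k
```

On a hit, generation would resume from the returned state and run only the remaining steps; on a miss it falls back to the full schedule. A real system would also need an eviction policy (the record names LCBFU), which this sketch omits.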
2312.04424
Report |
Cascade-Zero123: One Image to Highly Consistent 3D with Self-Prompted Nearby Views |
Yabo Chen, Jiemin Fang, Yuyang Huang, Taoran Yi, Xiaopeng Zhang, Lingxi Xie, Xinggang Wang, Wenrui Dai, Hongkai Xiong, Qi Tian |
Synthesizing multi-view 3D from one single image is a significant and
challenging task. For this goal, Zero-1-to-3 methods aim to extend a 2D latent
diffusion model to the 3D scope. These approaches generate the target-view
image with a single-view source image and the camera pose as condition
information. However, the one-to-one manner adopted in Zero-1-to-3 incurs
challenges for building geometric and visual consistency across views,
especially for complex objects. We propose a cascade generation framework
constructed with two Zero-1-to-3 models, named Cascade-Zero123, to tackle this
issue, which progressively extracts 3D information from the source image.
Specifically, a self-prompting mechanism is designed to generate several nearby
views at first. These views are then fed into the second-stage model along with
the source image as generation conditions. With self-prompted multiple views as
the supplementary information, our Cascade-Zero123 generates more
consistent novel-view images than Zero-1-to-3. The improvement is significant for
various complex and challenging scenes involving insects, humans, transparent
objects, and multiple stacked objects. The project page is at
https://cascadezero123.github.io/. |
This paper introduces Cascade-Zero123, a novel cascade framework based on Zero-1-to-3 models, designed to improve the geometric and visual consistency of novel view synthesis from a single image. |
Existing single image to 3D methods, particularly Zero-1-to-3 approaches, struggle to maintain consistency across views, especially for complex objects and scenes with large pose variations. Cascade-Zero123 addresses this limitation by progressively extracting 3D information. |
Cascade-Zero123 utilizes two cascaded Zero-1-to-3 models. The first generates several nearby views from the input image. These, along with the input, are fed to the second model, which leverages cross-attention to synthesize the final target view with enhanced consistency. |
Significantly improves geometric and visual consistency in novel view synthesis, especially for complex scenes (e.g., stacked objects, insects) compared to Zero-1-to-3.
Achieves better visual quality compared to methods like SyncDreamer while maintaining consistency.
Demonstrates superior performance on Objaverse and RealFusion15 datasets based on metrics like PSNR, SSIM, LPIPS, and CLIP-score. |
Limited ability to handle cases with heavy occlusion due to reliance on 2D information.
Performance can degrade with high elevation angles in input images due to Zero-1-to-3's sensitivity. |
novel view synthesis, single image to 3d, latent diffusion model, zero-1-to-3, view consistency |
2312.04410
Report |
Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models |
Jiayi Guo, Xingqian Xu, Yifan Pu, Zanlin Ni, Chaofei Wang, Manushree Vasu, Shiji Song, Gao Huang, Humphrey Shi |
Recently, diffusion models have made remarkable progress in text-to-image
(T2I) generation, synthesizing images with high fidelity and diverse contents.
Despite this advancement, latent space smoothness within diffusion models
remains largely unexplored. Smooth latent spaces ensure that a perturbation on
an input latent corresponds to a steady change in the output image. This
property proves beneficial in downstream tasks, including image interpolation,
inversion, and editing. In this work, we expose the non-smoothness of diffusion
latent spaces by observing noticeable visual fluctuations resulting from minor
latent variations. To tackle this issue, we propose Smooth Diffusion, a new
category of diffusion models that can be simultaneously high-performing and
smooth. Specifically, we introduce Step-wise Variation Regularization to
enforce that the ratio between the variation of an arbitrary input latent and
that of the output image is constant at any diffusion training step. In
addition, we devise an interpolation standard deviation (ISTD) metric to
effectively assess the latent space smoothness of a diffusion model. Extensive
quantitative and qualitative experiments demonstrate that Smooth Diffusion
stands out as a more desirable solution not only in T2I generation but also
across various downstream tasks. Smooth Diffusion is implemented as a
plug-and-play Smooth-LoRA to work with various community models. Code is
available at https://github.com/SHI-Labs/Smooth-Diffusion. |
The paper proposes "Smooth Diffusion," a novel category of diffusion models enhancing latent space smoothness without sacrificing performance. |
Many downstream tasks like image interpolation, inversion, and editing benefit from a smooth latent space where minor input changes correspond to steady output changes. |
The authors introduce "Step-wise Variation Regularization" to enforce a consistent ratio between input latent variations and output image variations during training. |
Smooth Diffusion significantly improves the continuity of transitions in image interpolation.
It reduces reconstruction errors in image inversion compared to baselines like Stable Diffusion.
Smooth Diffusion better preserves unedited image content during text-based and drag-based editing. |
Fully fine-tuned models are prone to collapse under the proposed regularization, suggesting a need for careful design.
The impact of the regularization strength requires fine-tuning based on specific tasks and datasets. |
diffusion models, latent space, image interpolation, image inversion, image editing |
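A rough sketch of how an interpolation-based smoothness measure might be computed for the record above. It assumes the measure is the standard deviation of distances between adjacent outputs along a latent interpolation path; the source only names the ISTD metric and does not define it, so this formulation, the function names, and the use of plain vectors in place of decoded images are all assumptions.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def interpolate(z0, z1, n):
    """n evenly spaced linear mixes from z0 to z1 (n >= 2)."""
    ts = [i / (n - 1) for i in range(n)]
    return [[(1 - t) * a + t * b for a, b in zip(z0, z1)] for t in ts]


def interpolation_std(outputs):
    """Std of distances between adjacent outputs; lower means smoother."""
    dists = [l2(outputs[i], outputs[i + 1]) for i in range(len(outputs) - 1)]
    mean = sum(dists) / len(dists)
    return math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
```

In practice each interpolated latent would be decoded by the diffusion model before distances are measured. A path whose outputs change by equal amounts at every step scores zero, while a path with one abrupt jump scores high.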
2312.04302
Report |
Prompt Highlighter: Interactive Control for Multi-Modal LLMs |
Yuechen Zhang, Shengju Qian, Bohao Peng, Shu Liu, Jiaya Jia |
This study targets a critical aspect of multi-modal LLMs' (LLMs & VLMs)
inference: explicit controllable text generation. Multi-modal LLMs empower
multi-modality understanding with the capability of semantic generation yet
bring less explainability and heavier reliance on prompt contents due to their
autoregressive generative nature. While manipulating prompt formats could
improve outputs, designing specific and precise prompts per task can be
challenging and ineffective. To tackle this issue, we introduce a novel
inference method, Prompt Highlighter, which enables users to highlight specific
prompt spans to interactively control the focus during generation. Motivated by
the classifier-free diffusion guidance, we form regular and unconditional
context pairs based on highlighted tokens, demonstrating that the
autoregressive generation in models can be guided in a classifier-free way.
Notably, we find that, during inference, guiding the models with highlighted
tokens through the attention weights leads to more desired outputs. Our
approach is compatible with current LLMs and VLMs, achieving impressive
customized generation results without training. Experiments confirm its
effectiveness in focusing on input contexts and generating reliable content.
Without tuning on LLaVA-v1.5, our method secured 70.7 in the MMBench test and
1552.5 in MME-perception. The code is available at:
https://github.com/dvlab-research/Prompt-Highlighter/ |
This paper introduces Prompt Highlighter, a novel inference method for multi-modal LLMs that enables users to highlight specific prompt spans to interactively control the focus during text generation. |
Multi-modal LLMs are powerful but lack explainability and heavily rely on prompt engineering, which can be challenging and ineffective. Prompt Highlighter addresses this by offering more intuitive and fine-grained control over generation. |
Inspired by classifier-free diffusion guidance, the method constructs regular and unconditional context pairs based on highlighted tokens. It leverages attention mechanisms to guide the model's focus towards highlighted parts, enabling customized generation. |
Prompt Highlighter enables fine-grained control over generation, allowing users to highlight specific parts of text and images to influence output.
The method is effective in mitigating hallucinations and improving the reliability of generated content, as evidenced by quantitative evaluations on benchmarks like MMBench and MME.
User studies confirm that a significant majority of users find Prompt Highlighter beneficial and prefer its outputs over traditional inference methods. |
The approach introduces additional computational overhead due to the extra decoding branch, although the impact is marginal.
The quality of generated content is contingent on the capabilities of the base model. Poorly trained models may exhibit limitations in accurately emphasizing or de-emphasizing highlighted sections. |
multi-modal llms, controllable text generation, prompt highlighting, user interaction, classifier-free guidance |
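The classifier-free guidance idea mentioned in the record above can be sketched for next-token logits. This is a minimal illustration using the standard classifier-free guidance combination, not the Prompt Highlighter code: the function names and the `scale` parameter are hypothetical, and the way the method builds the unconditional branch from highlighted tokens (and its attention-based guidance) is not reproduced here.

```python
import math


def guided_logits(cond, uncond, scale=1.5):
    """Push logits away from the unconditional branch toward the conditional one.

    cond:   logits computed with the highlighted context intact
    uncond: logits computed with the highlighted tokens suppressed
    scale:  guidance strength; 1.0 returns the conditional logits unchanged
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With `scale` above 1.0, tokens that the highlighted context makes more likely gain probability relative to ordinary decoding, which matches the record's description of steering focus without any training.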
2312.04086
Report |
MTVG : Multi-text Video Generation with Text-to-Video Models |
Gyeongrok Oh, Jaehwan Jeong, Sieun Kim, Wonmin Byeon, Jinkyu Kim, Sungwoong Kim, Hyeokmin Kwon, Sangpil Kim |
Recently, video generation has attracted massive attention and yielded
noticeable outcomes. Concerning the characteristics of video, multi-text
conditioning incorporating sequential events is necessary for next-step video
generation. In this work, we propose a novel multi-text video generation (MTVG)
by directly utilizing a pre-trained diffusion-based text-to-video (T2V)
generation model without additional fine-tuning. To generate consecutive video
segments, visual consistency generated by distinct prompts is necessary with
diverse variations, such as motion and content-related transitions. Our
proposed MTVG includes Dynamic Noise and Last Frame Aware Inversion which
reinitialize the noise latent to preserve visual coherence between videos of
different prompts and prevent repetitive motion or contents. Furthermore, we
present Structure Guiding Sampling to maintain the global appearance across the
frames in a single video clip, where we leverage iterative latent updates
across the preceding frame. Additionally, our Prompt Generator allows for
arbitrary format of text conditions consisting of diverse events. As a result,
our extensive experiments, including diverse transitions of descriptions,
demonstrate that our proposed methods show superior generated outputs in terms
of semantically coherent and temporally seamless video. Video examples are
available on our project page: https://kuai-lab.github.io/mtvg-page. |
This paper presents MTVG, a novel pipeline for generating videos from multiple text prompts, leveraging pre-trained text-to-video models without requiring further training. |
Existing text-to-video generation methods often struggle to create coherent and dynamic videos from a sequence of prompts, limiting their ability to portray complex narratives. |
MTVG employs two key techniques: (1) Last Frame-Aware Latent Initialization, which preserves visual consistency across transitions by incorporating elements of the preceding video clip, and (2) Structure-Guided Sampling, which enhances temporal coherence within each video segment. |
MTVG generates more semantically coherent and temporally seamless videos compared to existing zero-shot video generation methods.
Quantitative results using CLIP-Text and CLIP-Image metrics demonstrate superior performance over baseline models.
Human evaluation confirms that MTVG produces more natural and visually appealing videos, reflecting a strong alignment with given prompts. |
The quality of generated videos can be influenced by the inherent limitations of the pre-trained text-to-video model.
Further exploration of prompt engineering and fine-tuning strategies could potentially enhance the overall performance. |
video generation, multi-text conditioning, diffusion models, zero-shot learning, temporal coherence |
2312.04005
Report |
KOALA: Self-Attention Matters in Knowledge Distillation of Latent Diffusion Models for Memory-Efficient and Fast Image Synthesis |
Youngwan Lee, Kwanyong Park, Yoorhim Cho, Yong-Ju Lee, Sung Ju Hwang |
Stable diffusion is the mainstay of the text-to-image (T2I) synthesis in the
community due to its generation performance and open-source nature. Recently,
Stable Diffusion XL (SDXL), the successor of stable diffusion, has received a
lot of attention due to its significant performance improvements with a higher
resolution of 1024x1024 and a larger model. However, its increased computation
cost and model size require higher-end hardware (e.g., a GPU with more VRAM) for
end-users, incurring higher costs of operation. To address this problem, in
this work, we propose an efficient latent diffusion model for text-to-image
synthesis obtained by distilling the knowledge of SDXL. To this end, we first
perform an in-depth analysis of the denoising U-Net in SDXL, which is the main
bottleneck of the model, and then design a more efficient U-Net based on the
analysis. Secondly, we explore how to effectively distill the generation
capability of SDXL into an efficient U-Net and eventually identify four
essential factors, the core of which is that self-attention is the most
important part. With our efficient U-Net and self-attention-based knowledge
distillation strategy, we build our efficient T2I models, called KOALA-1B &
-700M, while reducing the model size by up to 54% and 69%, respectively, relative to the original SDXL
model. In particular, the KOALA-700M is more than twice as fast as SDXL while
still retaining a decent generation quality. We hope that due to its balanced
speed-performance tradeoff, our KOALA models can serve as a cost-effective
alternative to SDXL in resource-constrained environments. |
KOALA, an efficient text-to-image synthesis model distilled from SDXL, achieves a better speed-performance trade-off for resource-constrained environments. |
SDXL, while achieving state-of-the-art image generation quality, requires high-end hardware due to its large model size and computational cost, limiting accessibility. |
The authors design efficient U-Net architectures by analyzing SDXL's U-Net and propose a knowledge distillation strategy focusing on self-attention features. |
KOALA reduces the model size by up to 69% and inference time by 60% compared to SDXL.
KOALA consistently outperforms BK-SDM in both visual aesthetics (HPSv2) and image-text alignment (T2I-CompBench).
KOALA-700M achieves better performance than SDM-v2.0 while having a similar model size and inference speed, and can operate on an 8GB GPU. |
KOALA shows limitations in rendering legible text and handling complex prompts with multiple attributes, potentially due to the training dataset.
Future work includes exploring the integration of machine-generated detailed captions for improved text-alignment. |
text-to-image synthesis, stable diffusion, knowledge distillation, model compression, self-attention |
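Feature-level knowledge distillation of the kind described in the record above can be sketched as a weighted sum of an output-matching term and a feature-matching term. This is a generic illustration under stated assumptions, not KOALA's training loss: the function names, the use of mean squared error, the weights `w_out` and `w_feat`, and the representation of self-attention features as flat lists are all hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(student_out, teacher_out,
                      student_feats, teacher_feats,
                      w_out=1.0, w_feat=1.0):
    """Output-level plus feature-level distillation.

    student_feats / teacher_feats: one vector per matched layer, e.g.
    features taken at self-attention layers of the student and teacher U-Nets.
    """
    out_term = mse(student_out, teacher_out)
    feat_term = sum(mse(s, t) for s, t in zip(student_feats, teacher_feats))
    return w_out * out_term + w_feat * feat_term
```

Choosing which layers supply the matched features is the design decision the record highlights; here that choice is left to the caller through the two feature lists.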
2312.03913
Report |
Controllable Human-Object Interaction Synthesis |
Jiaman Li, Alexander Clegg, Roozbeh Mottaghi, Jiajun Wu, Xavier Puig, C. Karen Liu |
Synthesizing semantic-aware, long-horizon, human-object interaction is
critical to simulate realistic human behaviors. In this work, we address the
challenging problem of generating synchronized object motion and human motion
guided by language descriptions in 3D scenes. We propose Controllable
Human-Object Interaction Synthesis (CHOIS), an approach that generates object
motion and human motion simultaneously using a conditional diffusion model
given a language description, initial object and human states, and sparse
object waypoints. While language descriptions inform style and intent,
waypoints ground the motion in the scene and can be effectively extracted using
high-level planning methods. Naively applying a diffusion model fails to
predict object motion aligned with the input waypoints and cannot ensure the
realism of interactions that require precise hand-object contact and
appropriate contact grounded by the floor. To overcome these problems, we
introduce an object geometry loss as additional supervision to improve the
matching between generated object motion and input object waypoints. In
addition, we design guidance terms to enforce contact constraints during the
sampling process of the trained diffusion model. |
Presents CHOIS, a novel approach for synthesizing synchronized object and human motion in 3D scenes guided by language descriptions and sparse object waypoints using a conditional diffusion model. |
Synthesizing realistic and semantically aware human-object interactions is crucial for various applications like computer graphics and robotics. Previous methods struggled with larger, diverse objects and synthesizing both human and object motion from initial states. |
A conditional diffusion model generates synchronized object and human motion, conditioned on language, object geometry, initial states, and waypoints. An object geometry loss improves object motion accuracy. Guidance terms enforce contact constraints during sampling, enhancing realism. |
CHOIS successfully generates synchronized object and human motion aligning with language descriptions and object waypoints.
The method generalizes to novel objects, demonstrating robustness beyond seen datasets.
Human perceptual studies confirm that CHOIS outperforms baselines in terms of text consistency and interaction quality. |
The model does not explicitly handle articulated objects.
Waypoint extraction currently relies on heuristics and could be improved with learned approaches. |
human-object interaction, motion synthesis, diffusion models, 3d scenes, language guidance |
2312.03884
Report |
WonderJourney: Going from Anywhere to Everywhere |
Hong-Xing Yu, Haoyi Duan, Junhwa Hur, Kyle Sargent, Michael Rubinstein, William T. Freeman, Forrester Cole, Deqing Sun, Noah Snavely, Jiajun Wu, Charles Herrmann |
We introduce WonderJourney, a modularized framework for perpetual 3D scene
generation. Unlike prior work on view generation that focuses on a single type
of scene, we start at any user-provided location (by a text description or an
image) and generate a journey through a long sequence of diverse yet coherently
connected 3D scenes. We leverage an LLM to generate textual descriptions of the
scenes in this journey, a text-driven point cloud generation pipeline to make a
compelling and coherent sequence of 3D scenes, and a large VLM to verify the
generated scenes. We show compelling, diverse visual results across various
scene types and styles, forming imaginary "wonderjourneys". Project website:
https://kovenyu.com/WonderJourney/ |
Introduces WonderJourney, a modular framework for generating a sequence of diverse and coherent 3D scenes from a text description or an image, simulating a journey through an imaginary world. |
Addresses the limitations of prior perpetual view generation methods that focus on single scene types or domains, enabling more creative and varied visual storytelling. |
Leverages an LLM for scene description generation, a text-driven visual module for creating coherent 3D scenes from those descriptions, and a VLM for validating the generated scenes. |
Generates compelling and diverse visual results across various scene types and styles.
Shows significant user preference over baseline methods in terms of diversity, visual quality, scene complexity, and overall interest.
Demonstrates the ability to generate long and controlled journeys using user-provided descriptions like poems or story abstracts. |
Reliance on pretrained models may inherit their biases and limitations.
The generation process can sometimes produce undesirable artifacts like photo borders or out-of-focus objects, requiring additional validation and regeneration. |
3d scene generation, text-to-3d, perpetual view generation, large language models, vision-language models |
2312.03869
Report |
Inpaint3D: 3D Scene Content Generation using 2D Inpainting Diffusion |
Kira Prabhu, Jane Wu, Lynn Tsai, Peter Hedman, Dan B Goldman, Ben Poole, Michael Broxton |
This paper presents a novel approach to inpainting 3D regions of a scene,
given masked multi-view images, by distilling a 2D diffusion model into a
learned 3D scene representation (e.g. a NeRF). Unlike 3D generative methods
that explicitly condition the diffusion model on camera pose or multi-view
information, our diffusion model is conditioned only on a single masked 2D
image. Nevertheless, we show that this 2D diffusion model can still serve as a
generative prior in a 3D multi-view reconstruction problem where we optimize a
NeRF using a combination of score distillation sampling and NeRF reconstruction
losses. Predicted depth is used as additional supervision to encourage accurate
geometry. We compare our approach to 3D inpainting methods that focus on object
removal. Because our method can generate content to fill any 3D masked region,
we additionally demonstrate 3D object completion, 3D object replacement, and 3D
scene completion. |
This paper introduces a novel method for 3D inpainting of scenes from multi-view images by leveraging a pre-trained 2D inpainting diffusion model as a generative prior for a learned 3D scene representation (NeRF). |
This approach addresses the limitations of existing 3D inpainting methods that either struggle with 3D consistency or require computationally expensive training of 3D-aware diffusion models. |
The method employs a joint optimization framework that combines score distillation sampling (SDS) with traditional NeRF reconstruction losses. This allows the model to leverage the 2D inpainting diffusion model for generating content in masked regions while maintaining consistency with the unmasked regions of the input images. |
The method generates realistic and 3D-consistent inpainted content for various mask types, including sphere masks, object masks, scribble masks, and outpainting masks.
Quantitative evaluation on the SPIn-NeRF dataset shows that the proposed method outperforms SPIn-NeRF in terms of SSIM and LPIPS, demonstrating improved 3D consistency.
The use of a patch-based depth regularizer significantly improves the overall depth map quality and 3D consistency of the inpainted results. |
The randomness inherent in SDS can lead to high variance in generated content, sometimes lacking high-frequency detail.
Future work will explore alternative diffusion prior distillation methods and incorporate additional 3D priors to enhance the level of detail in inpainted results. |
3d inpainting, diffusion models, nerf, score distillation sampling, multi-view reconstruction |
2312.03816
Report |
AVID: Any-Length Video Inpainting with Diffusion Model |
Zhixing Zhang, Bichen Wu, Xiaoyan Wang, Yaqiao Luo, Luxin Zhang, Yinan Zhao, Peter Vajda, Dimitris Metaxas, Licheng Yu |
Recent advances in diffusion models have successfully enabled text-guided
image inpainting. While it seems straightforward to extend such editing
capability into the video domain, there have been fewer works regarding
text-guided video inpainting. Given a video, a masked region at its initial
frame, and an editing prompt, it requires a model to do infilling at each frame
following the editing guidance while keeping the out-of-mask region intact.
There are three main challenges in text-guided video inpainting: ($i$) temporal
consistency of the edited video, ($ii$) supporting different inpainting types
at different structural fidelity levels, and ($iii$) dealing with variable
video length. To address these challenges, we introduce Any-Length Video
Inpainting with Diffusion Model, dubbed as AVID. At its core, our model is
equipped with effective motion modules and adjustable structure guidance, for
fixed-length video inpainting. Building on top of that, we propose a novel
Temporal MultiDiffusion sampling pipeline with a middle-frame attention
guidance mechanism, facilitating the generation of videos with any desired
duration. Our comprehensive experiments show our model can robustly deal with
various inpainting types at different video duration ranges, with high quality.
More visualization results are made publicly available at
https://zhang-zx.github.io/AVID/ . |
This paper introduces AVID, a novel framework for text-guided video inpainting that handles variable video lengths and diverse editing types while maintaining temporal consistency. |
Text-guided video inpainting is a challenging task due to the need for temporal consistency, support for various editing types and structural fidelity levels, and handling variable video lengths. This work addresses these challenges to enable flexible video editing with text. |
AVID integrates motion modules into a text-to-image inpainting diffusion model, incorporates a structure guidance module adaptable to different inpainting tasks, and employs a Temporal MultiDiffusion sampling pipeline with middle-frame attention guidance for variable video length handling. |
AVID effectively performs diverse inpainting tasks, including object swapping, re-texturing, and uncropping, while maintaining high visual quality and temporal consistency.
The proposed Temporal MultiDiffusion pipeline enables seamless inpainting in videos longer than the model's training duration.
Quantitative and qualitative comparisons demonstrate AVID's superiority over existing video editing methods in terms of background preservation, text-video alignment, and temporal consistency. |
The performance of AVID is limited by the capabilities of the underlying text-to-video model, particularly in handling complex actions.
Future work includes exploring learnable structure guidance scales controlled by editing prompts and addressing discontinuity issues in videos with reappearing objects. |
video inpainting, diffusion models, text-guided editing, temporal consistency, motion modules |
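The any-length sampling idea in the AVID record above can be illustrated with a small sketch. It assumes that Temporal MultiDiffusion fuses overlapping fixed-length windows by averaging their per-frame predictions, as in the original MultiDiffusion formulation; the record itself does not spell out the fusion rule. Each frame is a single float standing in for a latent tensor, and `window_starts`, `fused_denoise_step`, and `denoise_window` are hypothetical names, not the authors' API. Middle-frame attention guidance is not modeled.

```python
from typing import Callable, List


def window_starts(num_frames: int, window: int, stride: int) -> List[int]:
    """Start indices of overlapping windows that together cover every frame."""
    if num_frames <= window:
        return [0]
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] + window < num_frames:
        # Add a final window flush with the end so no trailing frame is missed.
        starts.append(num_frames - window)
    return starts


def fused_denoise_step(
    latents: List[float],
    window: int,
    stride: int,
    denoise_window: Callable[[List[float]], List[float]],
) -> List[float]:
    """One sampling step: denoise each window, then average per-frame outputs.

    `denoise_window` is a placeholder for a fixed-length video model (assumption).
    """
    n = len(latents)
    totals = [0.0] * n
    counts = [0] * n
    for start in window_starts(n, window, stride):
        segment = latents[start:start + window]
        for offset, value in enumerate(denoise_window(segment)):
            totals[start + offset] += value
            counts[start + offset] += 1
    # Every frame is covered by at least one window, so counts are nonzero.
    return [t / c for t, c in zip(totals, counts)]


if __name__ == "__main__":
    # Toy "denoiser" that halves every latent value in its window.
    video = [float(i) for i in range(11)]
    print(window_starts(len(video), window=4, stride=2))
    print(fused_denoise_step(video, 4, 2, lambda seg: [v * 0.5 for v in seg]))
```

Averaging keeps frames shared by two windows consistent with both, which is one simple way a model trained on short clips can be applied to longer videos.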
2312.03806
Report |
XCube ($\mathcal{X}^3$): Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies |
Xuanchi Ren, Jiahui Huang, Xiaohui Zeng, Ken Museth, Sanja Fidler, Francis Williams |
We present $\mathcal{X}^3$ (pronounced XCube), a novel generative model for
high-resolution sparse 3D voxel grids with arbitrary attributes. Our model can
generate millions of voxels with a finest effective resolution of up to
$1024^3$ in a feed-forward fashion without time-consuming test-time
optimization. To achieve this, we employ a hierarchical voxel latent diffusion
model which generates progressively higher resolution grids in a coarse-to-fine
manner using a custom framework built on the highly efficient VDB data
structure. Apart from generating high-resolution objects, we demonstrate the
effectiveness of XCube on large outdoor scenes at scales of 100m$\times$100m
with a voxel size as small as 10cm. We observe clear qualitative and
quantitative improvements over past approaches. In addition to unconditional
generation, we show that our model can be used to solve a variety of tasks such
as user-guided editing, scene completion from a single scan, and text-to-3D.
More results and details can be found at
https://research.nvidia.com/labs/toronto-ai/xcube/. |
Introduces XCube ($\mathcal{X}^3$), a novel generative model for producing high-resolution sparse 3D voxel grids with attributes such as signed distances, normals, and semantics. |
Addresses limitations of current 3D generative models in scaling to large outdoor scenes and high resolutions, aiming to unlock new possibilities for 3D content generation. |
Employs a hierarchical voxel latent diffusion model that generates progressively higher-resolution grids in a coarse-to-fine manner, using a custom framework built on the efficient VDB data structure. |
Achieves state-of-the-art results on object generation benchmarks like ShapeNet and Objaverse, outperforming methods using point clouds, triplanes, and dense voxels.
Demonstrates scalability by generating high-quality outdoor scenes from the Waymo and Karton City datasets at effective resolutions up to $1024^3$ with fine details.
Enables user-guided editing, scene completion from single scans, and text-to-3D generation, highlighting the model's versatility. |
Text-to-3D capability limited by the scale of existing 3D datasets compared to massive image datasets.
Future work includes exploring image-conditioning and leveraging the learned 3D prior for downstream tasks like reconstruction and perception. |
3d generation, voxel diffusion models, sparse representations, hierarchical modeling, large-scale scenes |
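The coarse-to-fine sparse hierarchy in the XCube record above can be sketched as follows. This is a minimal illustration that assumes a factor-of-two subdivision per level; the actual upsampling factors and the VDB-based implementation may differ. The `keep` predicate is a hypothetical stand-in for the occupancy that the diffusion model would predict at the finer level. The last lines check the scene-scale arithmetic quoted in the abstract.

```python
from typing import Callable, Iterable, Set, Tuple

Voxel = Tuple[int, int, int]


def refine(coarse: Iterable[Voxel], keep: Callable[[Voxel], bool]) -> Set[Voxel]:
    """Split each active coarse voxel into 8 children and keep the occupied ones.

    Only children of active voxels are ever visited, so the work scales with
    the number of occupied voxels rather than with the dense grid size.
    """
    fine: Set[Voxel] = set()
    for (x, y, z) in coarse:
        for dx in (0, 1):
            for dy in (0, 1):
                for dz in (0, 1):
                    child = (2 * x + dx, 2 * y + dy, 2 * z + dz)
                    if keep(child):  # hypothetical occupancy decision
                        fine.add(child)
    return fine


# Scene-scale arithmetic from the abstract: 100 m extent at 10 cm voxels.
voxels_per_side = round(100.0 / 0.10)  # 1000 voxels, which fits in a 1024^3 grid
dense_voxel_count = 1024 ** 3          # size of the equivalent dense grid

if __name__ == "__main__":
    level0 = {(0, 0, 0)}
    # Toy rule: keep only the children lying on the z = 0 plane.
    level1 = refine(level0, lambda v: v[2] == 0)
    print(sorted(level1))
    print(voxels_per_side, dense_voxel_count)
```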
2312.03795
Report |
AnimatableDreamer: Text-Guided Non-rigid 3D Model Generation and Reconstruction with Canonical Score Distillation |
Xinzhou Wang, Yikai Wang, Junliang Ye, Zhengyi Wang, Fuchun Sun, Pengkun Liu, Ling Wang, Kai Sun, Xintong Wang, Bin He |
Advances in 3D generation have facilitated sequential 3D model generation
(a.k.a 4D generation), yet its application for animatable objects with large
motion remains scarce. Our work proposes AnimatableDreamer, a text-to-4D
generation framework capable of generating diverse categories of non-rigid
objects on skeletons extracted from a monocular video. At its core,
AnimatableDreamer is equipped with our novel optimization design dubbed
Canonical Score Distillation (CSD), which lifts 2D diffusion for temporal
consistent 4D generation. CSD, designed from a score gradient perspective,
generates a canonical model with warp-robustness across different
articulations. Notably, it also enhances the authenticity of bones and skinning
by integrating inductive priors from a diffusion model. Furthermore, with
multi-view distillation, CSD infers invisible regions, thereby improving the
fidelity of monocular non-rigid reconstruction. Extensive experiments
demonstrate the capability of our method in generating high-flexibility
text-guided 3D models from the monocular video, while also showing improved
reconstruction performance over existing non-rigid reconstruction methods. |
AnimatableDreamer, a novel framework, is presented that leverages text prompts and monocular videos to generate and reconstruct animatable 3D models of generic categories with non-rigid deformations. |
Existing methods for generating deformable 3D objects struggle with large motions and often lack diversity or rely heavily on multi-view data. AnimatableDreamer addresses these limitations by using a novel optimization design called Canonical Score Distillation (CSD). |
AnimatableDreamer operates in two stages: 1) Skeleton Extraction: Extracts skeletons, skinning, and motions from monocular videos using CSD to refine unseen regions. 2) Skeleton-Based Generation: Generates a new canonical model guided by the extracted skeleton, bones, and text prompt, ensuring time consistency and warping robustness through CSD. |
Generates high-quality, animatable 3D models with text prompts from a template video, demonstrating time consistency and morphological plausibility.
CSD enhances the generation and reconstruction of non-rigid 3D models, ensuring morphological plausibility after warping and improving reconstruction quality in unseen regions.
Outperforms existing methods in monocular non-rigid object reconstruction, especially with limited viewpoints and large motion, as shown by quantitative and qualitative comparisons. |
Requires large VRAM due to high-resolution rendering for CSD training.
Simultaneous feeding of four images to MVDream poses a computational burden. |
4d generation, diffusion model, non-rigid reconstruction, canonical score distillation, skeleton-based generation |
2312.03793
Report |
AnimateZero: Video Diffusion Models are Zero-Shot Image Animators |
Jiwen Yu, Xiaodong Cun, Chenyang Qi, Yong Zhang, Xintao Wang, Ying Shan, Jian Zhang |
Large-scale text-to-video (T2V) diffusion models have made great progress in
recent years in terms of visual quality, motion and temporal consistency.
However, the generation process is still a black box, where all attributes
(e.g., appearance, motion) are learned and generated jointly without precise
control ability other than rough text descriptions. Inspired by image animation
which decouples the video as one specific appearance with the corresponding
motion, we propose AnimateZero to unveil the pre-trained text-to-video
diffusion model, i.e., AnimateDiff, and provide more precise appearance and
motion control abilities for it. For appearance control, we borrow intermediate
latents and their features from the text-to-image (T2I) generation for ensuring
the generated first frame is equal to the given generated image. For temporal
control, we replace the global temporal attention of the original T2V model
with our proposed positional-corrected window attention to ensure other frames
align with the first frame well. Empowered by the proposed methods, AnimateZero
can successfully control the generation process without further training. As a
zero-shot image animator for given images, AnimateZero also enables multiple
new applications, including interactive video generation and real image
animation. The detailed experiments demonstrate the effectiveness of the
proposed method in both T2V and related applications. |
This paper presents AnimateZero, a zero-shot method for controllable video generation and image animation, by modifying the architecture of pre-trained text-to-video diffusion models. |
Existing text-to-video diffusion models lack precise control over appearance and motion, limiting their ability for step-by-step video generation from specific images. |
AnimateZero decouples appearance and motion control. It inserts intermediate latents from text-to-image generation to control the first frame appearance and utilizes a positional-corrected window attention mechanism to ensure temporal consistency across frames. |
AnimateZero generates videos that better match the text prompt and the original text-to-image domain compared to baselines.
It achieves comparable or superior quality to state-of-the-art image-to-video tools.
AnimateZero demonstrates potential for various applications, including controllable video generation, image animation, frame interpolation and looped video generation. |
AnimateZero's motion generation is limited by the motion prior of the base video diffusion model.
Domain gap issues can arise when animating real images due to style, resolution, and potential degradations. |
text-to-video generation, image animation, diffusion models, controllable generation, zero-shot learning |
2312.03771
Report |
DreamInpainter: Text-Guided Subject-Driven Image Inpainting with Diffusion Models |
Shaoan Xie, Yang Zhao, Zhisheng Xiao, Kelvin C. K. Chan, Yandong Li, Yanwu Xu, Kun Zhang, Tingbo Hou |
This study introduces Text-Guided Subject-Driven Image Inpainting, a novel
task that combines text and exemplar images for image inpainting. While both
text and exemplar images have been used independently in previous efforts,
their combined utilization remains unexplored. Simultaneously accommodating
both conditions poses a significant challenge due to the inherent balance
required between editability and subject fidelity. To tackle this challenge, we
propose a two-step approach DreamInpainter. First, we compute dense subject
features to ensure accurate subject replication. Then, we employ a
discriminative token selection module to eliminate redundant subject details,
preserving the subject's identity while allowing changes according to other
conditions such as mask shape and text prompts. Additionally, we introduce a
decoupling regularization technique to enhance text control in the presence of
exemplar images. Our extensive experiments demonstrate the superior performance
of our method in terms of visual quality, identity preservation, and text
control, showcasing its effectiveness in the context of text-guided
subject-driven image inpainting. |
This paper introduces the task of Text-Guided Subject-Driven Image Inpainting, aiming to combine the advantages of text-conditioned and exemplar-based inpainting for enhanced control and creativity. |
This task addresses the limitations of current inpainting techniques that struggle to balance identity preservation with editability guided by both text prompts and exemplar images. |
The authors propose DreamInpainter, a two-step approach. First, dense subject features are extracted from an exemplar image using the UNet encoder of a pre-trained diffusion model. Then, a discriminative token selection module filters these features, preserving key identity information while allowing for edits based on text prompts and mask shapes. A decoupling regularization technique is also introduced to enhance text control in the presence of exemplar images. |
DreamInpainter effectively preserves subject identity while allowing flexible text-guided edits like attribute changes, shape modifications, and style transfers.
The method outperforms strong baselines in terms of identity preservation and text alignment, as shown by quantitative metrics like R-FID, F-CLIP, and F-DINO.
The importance of both the token selection module and the decoupling regularization is demonstrated through ablation studies, highlighting their role in preventing copy-paste artifacts and enhancing text control. |
DreamInpainter may struggle to preserve intricate details when dealing with complex reference objects due to the fixed number of selected tokens.
Future work could explore adaptive token selection based on object complexity to further enhance detail preservation without sacrificing editability. |
image inpainting, text-to-image generation, diffusion models, subject-driven generation, token selection |
2312.03763
Report |
Gaussian3Diff: 3D Gaussian Diffusion for 3D Full Head Synthesis and Editing |
Yushi Lan, Feitong Tan, Di Qiu, Qiangeng Xu, Kyle Genova, Zeng Huang, Sean Fanello, Rohit Pandey, Thomas Funkhouser, Chen Change Loy, Yinda Zhang |
We present a novel framework for generating photorealistic 3D human heads and
subsequently manipulating and reposing them with remarkable flexibility. The
proposed approach leverages an implicit function representation of 3D human
heads, employing 3D Gaussians anchored on a parametric face model. To enhance
representational capabilities and encode spatial information, we embed a
lightweight tri-plane payload within each Gaussian rather than directly storing
color and opacity. Additionally, we parameterize the Gaussians in a 2D UV space
via a 3DMM, enabling effective utilization of the diffusion model for 3D head
avatar generation. Our method facilitates the creation of diverse and realistic
3D human heads with fine-grained editing over facial features and expressions.
Extensive experiments demonstrate the effectiveness of our method. |
This paper proposes Gaussian3Diff, a novel framework for generating and manipulating photorealistic 3D human heads, enabling high-level and fine-grained control over facial shape, texture, and expression. |
Existing methods for 3D-aware portrait generation and editing often lack flexibility, particularly in local feature editing, and struggle with disentangling shape and texture. Gaussian3Diff addresses these limitations by introducing a novel representation and leveraging diffusion models. |
The method represents 3D heads using 3D Gaussians anchored to a 3D Morphable Model (3DMM), with each Gaussian containing a tri-plane payload to encode local appearance. It employs an analysis-by-synthesis approach, reconstructing a large dataset of 3D heads while learning a shared latent space via an auto-decoder. A 2D diffusion model is then trained on this latent space for generating and editing. |
The method achieves high-quality 3D reconstruction with intrinsic support for 3DMM-driven animation, outperforming existing methods on expression editing benchmarks.
It demonstrates superior editing capabilities, including inter-subject attribute transfer, local region-based editing, and 3D in-painting, while maintaining high fidelity and view consistency.
The use of a shared latent space and UV space parameterization enables disentanglement of shape and texture, facilitating smooth interpolation and manipulation of facial features. |
The model currently exhibits bias inherited from the training dataset, such as a tendency for generated females to smile more often.
While the method excels at multi-view reconstruction, single-image inversion remains challenging and an area for future work. |
3d head generation, diffusion models, 3d gaussian representation, facial editing, 3dmm |
2312.03701
Report |
Return of Unconditional Generation: A Self-supervised Representation Generation Method |
Tianhong Li, Dina Katabi, Kaiming He |
Unconditional generation -- the problem of modeling data distribution without
relying on human-annotated labels -- is a long-standing and fundamental
challenge in generative models, creating a potential of learning from
large-scale unlabeled data. In the literature, the generation quality of an
unconditional method has been much worse than that of its conditional
counterpart. This gap can be attributed to the lack of semantic information
provided by labels. In this work, we show that one can close this gap by
generating semantic representations in the representation space produced by a
self-supervised encoder. These representations can be used to condition the
image generator. This framework, called Representation-Conditioned Generation
(RCG), provides an effective solution to the unconditional generation problem
without using labels. Through comprehensive experiments, we observe that RCG
significantly improves unconditional generation quality: e.g., it achieves a
new state-of-the-art FID of 2.15 on ImageNet 256x256, largely reducing the
previous best of 5.91 by a relative 64%. Our unconditional results are situated
in the same tier as the leading class-conditional ones. We hope these
encouraging observations will attract the community's attention to the
fundamental problem of unconditional generation. Code is available at
https://github.com/LTH14/rcg. |
This paper introduces Representation-Conditioned Generation (RCG), a novel framework for unconditional image generation that leverages self-supervised representations to improve generation quality. |
Unconditional generation, which aims to learn data distributions without human-annotated labels, often lags behind conditional methods. RCG bridges this gap by utilizing the rich semantic information embedded within self-supervised representations. |
RCG uses a pre-trained self-supervised encoder to map images to a representation space. A lightweight representation diffusion model is trained to generate representations. Finally, an image generator (e.g., ADM, DiT, MAGE) generates images conditioned on these representations. |
RCG significantly improves unconditional generation quality across different image generators (LDM, ADM, DiT, MAGE) and datasets (ImageNet, CIFAR-10, iNaturalist).
On ImageNet 256x256, RCG achieves a state-of-the-art FID of 2.15 for unconditional generation, rivaling leading class-conditional methods.
The representation space learned by RCG exhibits semantic smoothness, enabling controlled image manipulation via representation interpolation. |
RCG's performance depends on the quality of the pre-trained self-supervised encoder.
Future work includes exploring pre-training of RCG components on larger unlabeled datasets to improve generalization and downstream task adaptation. |
unconditional image generation, self-supervised representation learning, diffusion models, generative models, representation learning |
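The "relative 64%" figure quoted in the RCG abstract above follows directly from the two FID values it reports (5.91 for the previous best, 2.15 for RCG). The short check below only reproduces that arithmetic; `relative_reduction` is a helper name introduced here for illustration.

```python
def relative_reduction(previous: float, new: float) -> float:
    """Fraction by which `new` improves on `previous` (lower FID is better)."""
    return (previous - new) / previous


reduction = relative_reduction(previous=5.91, new=2.15)

if __name__ == "__main__":
    print(f"{reduction:.1%}")  # prints 63.6%
```

Rounded to the nearest percent, this gives the 64% relative reduction stated in the abstract.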
2312.03700
Report |
OneLLM: One Framework to Align All Modalities with Language |
Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, Xiangyu Yue |
Multimodal large language models (MLLMs) have gained significant attention
due to their strong multimodal understanding capability. However, existing
works rely heavily on modality-specific encoders, which usually differ in
architecture and are limited to common modalities. In this paper, we present
OneLLM, an MLLM that aligns eight modalities to language using a unified
framework. We achieve this through a unified multimodal encoder and a
progressive multimodal alignment pipeline. In detail, we first train an image
projection module to connect a vision encoder with LLM. Then, we build a
universal projection module (UPM) by mixing multiple image projection modules
and dynamic routing. Finally, we progressively align more modalities to LLM
with the UPM. To fully leverage the potential of OneLLM in following
instructions, we also curated a comprehensive multimodal instruction dataset,
including 2M items from image, audio, video, point cloud, depth/normal map, IMU
and fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks,
encompassing tasks such as multimodal captioning, question answering and
reasoning, where it delivers excellent performance. Code, data, model and
online demo are available at https://github.com/csuhan/OneLLM |
Proposes OneLLM, an MLLM aligning 8 modalities to language using a unified framework with a universal encoder and progressive multimodal alignment. |
Existing MLLMs rely on modality-specific encoders, limiting scalability and expansion to diverse modalities. |
Trains a vision LLM for initialization, then progressively aligns other modalities using a universal encoder (a pretrained CLIP-ViT) and a universal projection module (a mixture of projection experts with dynamic routing), and finally fine-tunes on a curated multimodal instruction dataset. |
Outperforms existing MLLMs and specialized models on 25 multimodal benchmarks, including captioning, question answering, and reasoning tasks.
Demonstrates strong zero-shot capabilities on tasks like audio question answering and depth/normal map scene classification.
Shows effectiveness of joint training for data-scarce modalities and benefits of image-text pretraining for multimodal alignment. |
Limited by the availability of large-scale, high-quality datasets for modalities beyond images.
Future work includes collecting high-quality datasets and designing new encoders for fine-grained multimodal understanding. |
multimodal learning, large language models, vision-language models, multimodal alignment, unified framework |
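The universal projection module (UPM) described in the OneLLM record above mixes several projection experts through dynamic routing. The sketch below assumes soft routing, that is, a softmax over router logits followed by a weighted sum of expert outputs. The record does not state the exact routing rule, so this is one plausible reading. The router logits are passed in directly, whereas in a real model a learned router would compute them from the input tokens. All names are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of router logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def universal_projection(
    token: Vector,
    experts: Sequence[Callable[[Vector], Vector]],
    router_logits: Sequence[float],
) -> Vector:
    """Weighted sum of expert projections of a single token (soft routing)."""
    weights = softmax(router_logits)
    outputs = [expert(token) for expert in experts]
    dim = len(outputs[0])
    return [sum(w * out[i] for w, out in zip(weights, outputs)) for i in range(dim)]


if __name__ == "__main__":
    # Two toy experts standing in for projection modules (assumption).
    def identity(v: Vector) -> Vector:
        return list(v)

    def doubled(v: Vector) -> Vector:
        return [2.0 * x for x in v]

    print(universal_projection([2.0, 4.0], [identity, doubled], [0.0, 0.0]))
```

With equal logits the two experts contribute equally; skewing the logits toward one expert moves the output toward that expert's projection, so one module can be shared across modalities while still specializing per input.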
2312.03641
Report |
MotionCtrl: A Unified and Flexible Motion Controller for Video Generation |
Zhouxia Wang, Ziyang Yuan, Xintao Wang, Tianshui Chen, Menghan Xia, Ping Luo, Ying Shan |
Motions in a video primarily consist of camera motion, induced by camera
movement, and object motion, resulting from object movement. Accurate control
of both camera and object motion is essential for video generation. However,
existing works either mainly focus on one type of motion or do not clearly
distinguish between the two, limiting their control capabilities and diversity.
Therefore, this paper presents MotionCtrl, a unified and flexible motion
controller for video generation designed to effectively and independently
control camera and object motion. The architecture and training strategy of
MotionCtrl are carefully devised, taking into account the inherent properties
of camera motion, object motion, and imperfect training data. Compared to
previous methods, MotionCtrl offers three main advantages: 1) It effectively
and independently controls camera motion and object motion, enabling more
fine-grained motion control and facilitating flexible and diverse combinations
of both types of motion. 2) Its motion conditions are determined by camera
poses and trajectories, which are appearance-free and minimally impact the
appearance or shape of objects in generated videos. 3) It is a relatively
generalizable model that can adapt to a wide array of camera poses and
trajectories once trained. Extensive qualitative and quantitative experiments
have been conducted to demonstrate the superiority of MotionCtrl over existing
methods. |
This paper presents MotionCtrl, a unified and flexible motion controller for video generation that controls camera motion and object motion effectively and independently. |
Accurate control of both camera and object motion is crucial for video generation, but existing methods often lack independent control or clear distinction between the two. |
MotionCtrl introduces two modules: the Camera Motion Control Module (CMCM), which fuses camera poses into LVDM's temporal transformers to control global motion, and the Object Motion Control Module (OMCM), which spatially incorporates object trajectories into LVDM's convolutional layers. The modules are trained on augmented datasets: RealEstate10K with added captions for CMCM, and WebVid with synthesized object trajectories for OMCM. |
Independently controls camera and object motion, enabling fine-grained adjustments and diverse combinations.
Uses camera poses and trajectories as motion conditions, avoiding unnatural appearance artifacts in generated videos.
Generalizes to a wide range of camera movements and object trajectories without fine-tuning for each specific motion. |
Reliance on separate datasets for camera and object motion training due to the lack of a comprehensive dataset.
Further improvements in object trajectory synthesis for more realistic and complex object motion control. |
video generation, motion control, camera motion, object motion, text-to-video |
2312.03628
Report |
Boosting Segment Anything Model Towards Open-Vocabulary Learning |
Xumeng Han, Longhui Wei, Xuehui Yu, Zhiyang Dou, Xin He, Kuiran Wang, Zhenjun Han, Qi Tian |
The recent Segment Anything Model (SAM) has emerged as a new paradigmatic
vision foundation model, showcasing potent zero-shot generalization and
flexible prompting. Despite SAM finding applications and adaptations in various
domains, its primary limitation lies in the inability to grasp object
semantics. In this paper, we present Sambor to seamlessly integrate SAM with
the open-vocabulary object detector in an end-to-end framework. While retaining
all the remarkable capabilities inherent to SAM, we enhance it with the
capacity to detect arbitrary objects based on human inputs like category names
or reference expressions. To accomplish this, we introduce a novel SideFormer
module that extracts SAM features to facilitate zero-shot object localization
and inject comprehensive semantic information for open-vocabulary recognition.
In addition, we devise an open-set region proposal network (Open-set RPN),
enabling the detector to acquire the open-set proposals generated by SAM.
Sambor demonstrates superior zero-shot performance across benchmarks, including
COCO and LVIS, proving highly competitive against previous SoTA methods. We
aspire for this work to serve as a meaningful endeavor in endowing SAM to
recognize diverse object categories and advancing open-vocabulary learning with
the support of vision foundation models. |
This paper proposes Sambor, an end-to-end open-vocabulary object detection framework that integrates the Segment Anything Model (SAM) and enables it to detect arbitrary objects based on human inputs such as category names or phrases. |
While SAM excels in zero-shot segmentation, it lacks the semantic understanding for object recognition. Sambor addresses this limitation, enhancing SAM's capabilities and advancing open-vocabulary learning. |
Sambor utilizes a novel SideFormer module to extract SAM features for zero-shot object localization and inject semantic information from CLIP for recognition. Additionally, it employs an Open-set RPN to generate region proposals from SAM's output. |
Sambor achieves state-of-the-art zero-shot performance on COCO and LVIS benchmarks.
The proposed SideFormer module effectively combines SAM and CLIP features, enhancing both object localization and recognition.
The Open-set RPN significantly improves proposal quality, further boosting detection performance. |
Sambor's performance can be further enhanced by scaling up training with larger image-text datasets and by incorporating few-shot learning capabilities.
Exploring the integration of more interactive operations for gradual improvement is left for future work. |
open-vocabulary object detection, segment anything model (sam), vision foundation models, zero-shot learning, open-set recognition |
2312.03626
Report |
TokenCompose: Grounding Diffusion with Token-level Supervision |
Zirui Wang, Zhizhou Sha, Zheng Ding, Yilin Wang, Zhuowen Tu |
We present TokenCompose, a Latent Diffusion Model for text-to-image
generation that achieves enhanced consistency between user-specified text
prompts and model-generated images. Despite its tremendous success, the
standard denoising process in the Latent Diffusion Model takes text prompts as
conditions only, absent explicit constraint for the consistency between the
text prompts and the image contents, leading to unsatisfactory results for
composing multiple object categories. TokenCompose aims to improve
multi-category instance composition by introducing the token-wise consistency
terms between the image content and object segmentation maps in the finetuning
stage. TokenCompose can be applied directly to the existing training pipeline
of text-conditioned diffusion models without extra human labeling information.
By finetuning Stable Diffusion, the model exhibits significant improvements in
multi-category instance composition and enhanced photorealism for its generated
images. |
This paper presents TokenCompose, a Latent Diffusion Model that improves consistency between text prompts and generated images, particularly when composing multiple object categories, by adding token-wise consistency terms during fine-tuning. |
Standard Latent Diffusion Models lack explicit constraints for text-image consistency, leading to unsatisfactory compositions, especially for multiple object categories. |
Leverages pretrained vision models (Grounded SAM, Grounding DINO) to generate segmentation maps for noun tokens in training captions, then jointly optimizes the diffusion model with denoising and token-image grounding objectives. |
Achieves state-of-the-art performance on multi-category instance composition benchmarks (VISOR, MultiGen).
Exhibits enhanced photorealism as measured by FID scores on COCO and Flickr30K Entities.
Maintains efficient inference speed comparable to standard text-conditioned diffusion models. |
Currently focuses on noun tokens, leaving room for incorporating other parts of speech (adjectives, verbs) as training objectives.
Exploration of different grounding objectives and architectures for further improvement. |
text-to-image generation, latent diffusion models, compositionality, image understanding, multi-category instance composition |
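The token-image grounding objective mentioned in the TokenCompose record above can be illustrated with a small sketch. It assumes the loss penalizes cross-attention mass that falls outside a noun token's segmentation mask, using the squared shortfall of the in-mask fraction. The record does not give the formula, so treat this as one plausible form rather than the authors' exact objective. `token_grounding_loss` and its arguments are hypothetical names.

```python
from typing import List


def token_grounding_loss(attention: List[List[float]], mask: List[List[int]]) -> float:
    """Squared shortfall of attention mass inside the mask for one noun token.

    `attention` is a 2D cross-attention map for the token and `mask` is a
    binary segmentation map of the same shape (1 inside the object).
    """
    total = sum(sum(row) for row in attention)
    if total == 0:
        return 1.0  # no attention anywhere counts as fully ungrounded
    inside = sum(
        a
        for a_row, m_row in zip(attention, mask)
        for a, m in zip(a_row, m_row)
        if m
    )
    return (1.0 - inside / total) ** 2


if __name__ == "__main__":
    attn = [[1.0, 1.0], [0.0, 0.0]]
    seg = [[1, 0], [0, 0]]
    # Half of the attention mass lies inside the mask.
    print(token_grounding_loss(attn, seg))
```

In training, a term of this kind would be averaged over noun tokens and added to the usual denoising loss, so no extra input is needed at inference time.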
2312.03611
Report |
DreamComposer: Controllable 3D Object Generation via Multi-View Conditions |
Yunhan Yang, Yukun Huang, Xiaoyang Wu, Yuan-Chen Guo, Song-Hai Zhang, Hengshuang Zhao, Tong He, Xihui Liu |
Utilizing pre-trained 2D large-scale generative models, recent works are
capable of generating high-quality novel views from a single in-the-wild image.
However, due to the lack of information from multiple views, these works
encounter difficulties in generating controllable novel views. In this paper,
we present DreamComposer, a flexible and scalable framework that can enhance
existing view-aware diffusion models by injecting multi-view conditions.
Specifically, DreamComposer first uses a view-aware 3D lifting module to obtain
3D representations of an object from multiple views. Then, it renders the
latent features of the target view from 3D representations with the multi-view
feature fusion module. Finally, the target view features extracted from
multi-view inputs are injected into a pre-trained diffusion model. Experiments
show that DreamComposer is compatible with state-of-the-art diffusion models
for zero-shot novel view synthesis, further enhancing them to generate
high-fidelity novel view images with multi-view conditions, ready for
controllable 3D object reconstruction and various other applications. |
DreamComposer is a flexible and scalable framework that enhances existing view-aware diffusion models for controllable novel view synthesis by injecting multi-view conditions. |
Existing methods for novel view synthesis struggle to generate controllable novel views due to the lack of information from multiple views. |
DreamComposer uses a three-stage approach: 1) target-aware 3D lifting to obtain 3D representations from multi-view inputs, 2) multi-view feature fusion to render and fuse 3D features into target-view 2D features, 3) target-view feature injection to incorporate the fused features into a pre-trained diffusion model. |
DreamComposer enables controllable novel view synthesis by conditioning on multiple input views.
It improves generation accuracy at unseen viewpoints compared to single-view methods.
DreamComposer is compatible with existing state-of-the-art models like Zero-1-to-3 and SyncDreamer, enhancing their controllability and fidelity. |
Preserving fine-grained textures from non-main view input images remains challenging.
Angular deviations between multi-view input images can affect generation quality. |
novel view synthesis, diffusion models, multi-view conditioning, 3d object generation, controllable image synthesis |
2312.03594
Report |
A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting |
Junhao Zhuang, Yanhong Zeng, Wenran Liu, Chun Yuan, Kai Chen |
Achieving high-quality versatile image inpainting, where user-specified
regions are filled with plausible content according to user intent, presents a
significant challenge. Existing methods face difficulties in simultaneously
addressing context-aware image inpainting and text-guided object inpainting due
to the distinct optimal training strategies required. To overcome this
challenge, we introduce PowerPaint, the first high-quality and versatile
inpainting model that excels in both tasks. First, we introduce learnable task
prompts along with tailored fine-tuning strategies to guide the model's focus
on different inpainting targets explicitly. This enables PowerPaint to
accomplish various inpainting tasks by utilizing different task prompts,
resulting in state-of-the-art performance. Second, we demonstrate the
versatility of the task prompt in PowerPaint by showcasing its effectiveness as
a negative prompt for object removal. Additionally, we leverage prompt
interpolation techniques to enable controllable shape-guided object inpainting.
Finally, we extensively evaluate PowerPaint on various inpainting benchmarks to
demonstrate its superior performance for versatile image inpainting. We release
our codes and models on our project page: https://powerpaint.github.io/. |
This paper presents PowerPaint, a versatile image inpainting model that excels in both text-guided object inpainting and context-aware image inpainting through the use of learnable task prompts. |
Existing image inpainting methods struggle to effectively handle both context-aware and text-guided inpainting due to their conflicting optimal training strategies. PowerPaint addresses this challenge, offering a unified solution for versatile high-quality inpainting. |
PowerPaint introduces three learnable task prompts (P_obj, P_ctxt, P_shape) and fine-tunes a text-to-image model (Stable Diffusion) with different strategies for each task. It leverages classifier-free guidance sampling with task prompts and enables controllable shape-guided inpainting through prompt interpolation. |
PowerPaint achieves state-of-the-art performance on various inpainting benchmarks for both text-guided object inpainting and context-aware image inpainting.
The learned task prompts effectively function as negative prompts, enhancing object removal capabilities in crowded scenes.
Prompt interpolation facilitates controllable shape-guided object inpainting, balancing object shape and textual description adherence. |
The synthesis quality is limited by the underlying text-to-image model.
Achieving precise shape control for small objects remains challenging due to sparse representation during training. |
image inpainting, text-guided synthesis, context-aware inpainting, task prompts, diffusion models |
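The prompt interpolation mentioned in the PowerPaint entry above can be pictured as a weighted blend of two learned task-prompt embeddings. The sketch below is only an illustration under stated assumptions: it assumes a plain linear blend, it assumes the two prompts being mixed are the shape and context prompts, and the function name `interpolate_prompts` and the toy three-dimensional vectors are hypothetical. It is not the authors' implementation.

```python
def interpolate_prompts(p_shape, p_ctxt, alpha):
    """Linearly blend two task-prompt embeddings (hypothetical helper).

    alpha = 1.0 returns p_shape unchanged, alpha = 0.0 returns p_ctxt.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(p_shape) != len(p_ctxt):
        raise ValueError("embeddings must have the same length")
    return [alpha * s + (1.0 - alpha) * c for s, c in zip(p_shape, p_ctxt)]


# Toy embeddings; real task prompts would be learned vectors.
p_shape = [1.0, 0.0, 2.0]
p_ctxt = [0.0, 1.0, 0.0]
blended = interpolate_prompts(p_shape, p_ctxt, 0.25)
```

A larger `alpha` would weight the shape prompt more heavily, which matches the entry's description of trading off shape adherence against the textual description.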
2312.03587
Report |
Language-Informed Visual Concept Learning |
Sharon Lee, Yunzhi Zhang, Shangzhe Wu, Jiajun Wu |
Our understanding of the visual world is centered around various concept
axes, characterizing different aspects of visual entities. While different
concept axes can be easily specified by language, e.g. color, the exact visual
nuances along each axis often exceed the limitations of linguistic
articulations, e.g. a particular style of painting. In this work, our goal is
to learn a language-informed visual concept representation, by simply
distilling large pre-trained vision-language models. Specifically, we train a
set of concept encoders to encode the information pertinent to a set of
language-informed concept axes, with an objective of reproducing the input
image through a pre-trained Text-to-Image (T2I) model. To encourage better
disentanglement of different concept encoders, we anchor the concept embeddings
to a set of text embeddings obtained from a pre-trained Visual Question
Answering (VQA) model. At inference time, the model extracts concept embeddings
along various axes from new test images, which can be remixed to generate
images with novel compositions of visual concepts. With a lightweight test-time
finetuning procedure, it can also generalize to novel concepts unseen at
training. |
This paper proposes a framework for learning disentangled and compositional visual concepts grounded in language by distilling knowledge from pre-trained text-to-image and visual question answering models. |
Learning such representations is crucial for enabling flexible manipulation and generation of images with desired combinations of visual concepts. |
The framework trains a set of concept encoders to extract concept embeddings from images, guided by two objectives: 1) reconstructing the input image through a pre-trained T2I model given axis-informed text prompts and 2) aligning the concept embeddings with corresponding text embeddings from a pre-trained VQA model. |
The learned concept encoders can extract disentangled concept embeddings from images, enabling the generation of images with novel compositions of concepts via remixing.
The framework allows for generalization to unseen concepts through a lightweight test-time finetuning procedure.
Qualitative and quantitative experiments demonstrate the effectiveness of the proposed method in visual concept editing compared to text-based prompting baselines. |
The current model requires concept axes to be pre-defined, limiting the generality of the concept space it can capture.
Training separate encoders for each concept axis does not fully exploit the potential hierarchical structure among them. |
visual concept learning, text-to-image generation, visual question answering, concept disentanglement, image generation |
2312.03584
Report |
Context Diffusion: In-Context Aware Image Generation |
Ivona Najdenkoska, Animesh Sinha, Abhimanyu Dubey, Dhruv Mahajan, Vignesh Ramanathan, Filip Radenovic |
We propose Context Diffusion, a diffusion-based framework that enables image
generation models to learn from visual examples presented in context. Recent
work tackles such in-context learning for image generation, where a query image
is provided alongside context examples and text prompts. However, the quality
and fidelity of the generated images deteriorate when the prompt is not
present, demonstrating that these models are unable to truly learn from the
visual context. To address this, we propose a novel framework that separates
the encoding of the visual context from the preservation of the structure of the query
images. This results in the ability to learn from the visual context and text
prompts, but also from either one of them. Furthermore, we enable our model to
handle few-shot settings, to effectively address diverse in-context learning
scenarios. Our experiments and user study demonstrate that Context Diffusion
excels in both in-domain and out-of-domain tasks, resulting in an overall
enhancement in image quality and fidelity compared to counterpart models. |
This paper introduces Context Diffusion, a diffusion-based image generation model that learns from visual context examples, alongside text prompts and query images, effectively separating structure preservation (query image) from style and detail infusion (context images). |
Existing in-context image generation models struggle to effectively utilize visual context without strong reliance on text prompts, limiting their flexibility and generalization to unseen tasks. |
The model encodes visual context separately from the query image, injecting it alongside text embeddings into the cross-attention layers of a diffusion model, enabling learning from either or both conditioning signals. Trained on diverse image-to-map and map-to-image tasks, it supports single and multiple context image inputs. |
Context Diffusion demonstrates superior fidelity to visual context even without text prompts, outperforming prior art in both in-domain and out-of-domain tasks.
The model effectively generalizes to unseen tasks like sketch-to-image and image editing, showcasing true in-context learning capability.
Using multiple context images further enhances image quality, particularly in the absence of text prompts, highlighting the benefit of few-shot learning. |
The current design assumes alignment between text prompts and context images; future work could explore complementary information between them.
Generating images with fine-grained details, especially for local edits, remains challenging and presents an area for improvement. |
image generation, in-context learning, diffusion models, few-shot learning, controllable generation |
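The Context Diffusion entry above describes injecting visual-context tokens alongside text embeddings into cross-attention, so that either signal may be absent. The following is a toy sketch of that idea, assuming single-head attention where keys and values are the same conditioning tokens; `build_conditioning` and `cross_attend` are hypothetical names, and nothing here reflects the paper's actual architecture or dimensions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def build_conditioning(text_tokens=None, context_tokens=None):
    """Concatenate whichever conditioning signals are present."""
    cond = list(text_tokens or []) + list(context_tokens or [])
    if not cond:
        raise ValueError("need at least one conditioning signal")
    return cond


def cross_attend(query, cond_tokens):
    """Attend from one query vector over the conditioning tokens.

    Simplification: keys and values are the tokens themselves.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, tok)) / scale for tok in cond_tokens]
    weights = softmax(scores)
    return [sum(w * tok[i] for w, tok in zip(weights, cond_tokens))
            for i in range(len(query))]


# Visual context only, no text prompt.
cond = build_conditioning(text_tokens=None, context_tokens=[[1.0, 0.0]])
out = cross_attend([1.0, 0.0], cond)
```

Because the two signals share one token sequence, dropping the text prompt only shortens the sequence, and the attention step itself is unchanged.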
2312.03517
Report |
FRDiff : Feature Reuse for Universal Training-free Acceleration of Diffusion Models |
Junhyuk So, Jungwon Lee, Eunhyeok Park |
The substantial computational costs of diffusion models, especially due to
the repeated denoising steps necessary for high-quality image generation,
present a major obstacle to their widespread adoption. While several studies
have attempted to address this issue by reducing the number of score function
evaluations (NFE) using advanced ODE solvers without fine-tuning, the decreased
number of denoising iterations misses the opportunity to update fine details,
resulting in noticeable quality degradation. In our work, we introduce an
advanced acceleration technique that leverages the temporal redundancy inherent
in diffusion models. Reusing feature maps with high temporal similarity opens
up a new opportunity to save computation resources without compromising output
quality. To realize the practical benefits of this intuition, we conduct an
extensive analysis and propose a novel method, FRDiff. FRDiff is designed to
harness the advantages of both reduced NFE and feature reuse, achieving a
Pareto frontier that balances fidelity and latency trade-offs in various
generative tasks. |
This paper introduces FRDiff, a zero-shot diffusion model acceleration technique that leverages feature reuse (FR) based on temporal redundancy in iterative generation, achieving up to 1.76x speedup without noticeable quality loss. |
Diffusion models, while powerful, suffer from high computational cost due to numerous denoising steps, hindering wider adoption. FRDiff addresses this by reducing redundant computations. |
FRDiff reuses similar feature maps from adjacent timesteps, combines reduced NFE with FR via score mixing for optimal quality-latency trade-off, and employs Auto-FR for automatic tuning. |
FRDiff achieves up to 1.76x acceleration without noticeable quality degradation compared to baseline DDIM.
Quantitative analysis shows superior FID scores and speed compared to DDIM with reduced NFE, demonstrating better Pareto fronts.
FRDiff is successfully applied to various tasks like super-resolution, image inpainting, and text-to-video generation, showcasing versatility. |
Applicability may be limited when score function evaluation time steps are not continuous (e.g., DPM-Solver++).
Further investigation is needed for methods with non-consecutive score function evaluations. |
diffusion models, model acceleration, feature reuse, zero-shot learning, image generation |
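The feature-reuse idea in the FRDiff entry above can be illustrated with a toy denoising loop that recomputes an expensive block only every few steps and reuses the cached result in between. This is a minimal sketch under stated assumptions: the fixed `reuse_interval`, the function `run_with_feature_reuse`, and the scalar toy model are all hypothetical, and the sketch omits the score mixing and Auto-FR components that the entry mentions.

```python
def run_with_feature_reuse(x, num_steps, reuse_interval, expensive_block, cheap_update):
    """Toy loop: recompute costly features every `reuse_interval` steps, else reuse the cache."""
    if reuse_interval < 1:
        raise ValueError("reuse_interval must be >= 1")
    cached = None
    expensive_calls = 0
    for step in range(num_steps):
        if step % reuse_interval == 0:
            # Refresh the cached feature map with a full evaluation.
            cached = expensive_block(x, step)
            expensive_calls += 1
        # Cheap part of the step always runs, using fresh or reused features.
        x = cheap_update(x, cached)
    return x, expensive_calls


def toy_block(x, step):
    """Stand-in for an expensive network block."""
    return 0.1 * x


def toy_update(x, features):
    """Stand-in for the inexpensive remainder of a denoising step."""
    return x - features


_, calls_full = run_with_feature_reuse(1.0, 10, 1, toy_block, toy_update)
_, calls_reuse = run_with_feature_reuse(1.0, 10, 2, toy_block, toy_update)
```

With `reuse_interval=2` the expensive block runs on half of the steps. The speedup actually realised depends on how much of the per-step cost that block represents.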
2312.03461
Report |
HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting |
Yuheng Jiang, Zhehao Shen, Penghao Wang, Zhuo Su, Yu Hong, Yingliang Zhang, Jingyi Yu, Lan Xu |
We have recently seen tremendous progress in photo-real human modeling and
rendering. Yet, efficiently rendering realistic human performance and
integrating it into the rasterization pipeline remains challenging. In this
paper, we present HiFi4G, an explicit and compact Gaussian-based approach for
high-fidelity human performance rendering from dense footage. Our core
intuition is to marry the 3D Gaussian representation with non-rigid tracking,
achieving a compact and compression-friendly representation. We first propose a
dual-graph mechanism to obtain motion priors, with a coarse deformation graph
for effective initialization and a fine-grained Gaussian graph to enforce
subsequent constraints. Then, we utilize a 4D Gaussian optimization scheme with
adaptive spatial-temporal regularizers to effectively balance the non-rigid
prior and Gaussian updating. We also present a companion compression scheme
with residual compensation for immersive experiences on various platforms. It
achieves a substantial compression rate of approximately 25 times, with less
than 2MB of storage per frame. Extensive experiments demonstrate the
effectiveness of our approach, which significantly outperforms existing
approaches in terms of optimization speed, rendering quality, and storage
overhead. |
This paper presents HiFi4G, an explicit and compact Gaussian-based approach for high-fidelity 4D human performance rendering from dense footage. |
Efficiently rendering realistic human performance and integrating it into the rasterization pipeline remains challenging. Existing methods suffer from limitations such as vulnerability to occlusions, lack of texture, blurriness, high storage costs, or the inability to handle large motions. |
HiFi4G leverages a dual-graph mechanism with a coarse deformation graph for motion priors and a fine-grained Gaussian graph for constraints. It employs a 4D Gaussian optimization scheme with adaptive spatial-temporal regularizers and introduces a compression scheme with residual compensation. |
HiFi4G outperforms existing methods in optimization speed, rendering quality, and storage overhead.
The dual-graph mechanism and regularization designs effectively recover spatial-temporally consistent 4D Gaussians.
The compression scheme achieves a 25x compression rate, requiring less than 2MB per frame. |
HiFi4G heavily relies on segmentation, which can be challenging in scenes with human-object interactions.
The Gaussian optimization process, although efficient, still requires several minutes and presents a bottleneck for future acceleration. |
4d human performance rendering, gaussian splatting, non-rigid tracking, compact representation, immersive experiences |
2312.03459
Report |
F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis |
Sitong Su, Jianzhi Liu, Lianli Gao, Jingkuan Song |
Recently Text-to-Video (T2V) synthesis has undergone a breakthrough by
training transformers or diffusion models on large-scale datasets.
Nevertheless, inferring such large models incurs huge costs. Previous inference
acceleration works either require costly retraining or are model-specific. To
address this issue, instead of retraining we explore the inference process of
two mainstream T2V models using transformers and diffusion models. The
exploration reveals the redundancy in temporal attention modules of both
models, which are commonly utilized to establish temporal relations among
frames. Consequently, we propose a training-free and generalized pruning
strategy called F3-Pruning to prune redundant temporal attention
weights. Specifically, when aggregate temporal attention values are ranked below
a certain ratio, corresponding weights will be pruned. Extensive experiments on
three datasets using a classic transformer-based model CogVideo and a typical
diffusion-based model Tune-A-Video verify the effectiveness of F3-Pruning in
inference acceleration, quality assurance and broad applicability. |
This paper introduces F$^3$-Pruning, a training-free, generalized pruning strategy for accelerating text-to-video inference by pruning redundant temporal attention weights. |
Existing inference acceleration methods for text-to-video models are either computationally expensive (require retraining) or model-specific. This paper proposes a method that is both efficient and generalizable. |
The authors analyze the inference process of transformer and diffusion-based text-to-video models and identify redundancy in temporal attention modules. Based on this, they propose F$^3$-Pruning which prunes temporal attention weights based on the aggregate attention score, effectively removing redundant connections. |
F$^3$-Pruning speeds up inference by up to 1.35x on the UCF-101 dataset using CogVideo.
The method also improves video quality, as shown by a 22% improvement in FVD metric on UCF-101 using CogVideo.
F$^3$-Pruning demonstrates generalization by effectively accelerating and improving quality on both transformer-based (CogVideo) and diffusion-based (Tune-A-Video) models. |
The paper primarily focuses on temporal attention pruning; exploring other modules for pruning could be a future direction.
Investigating the impact of different pruning ratios on various text-to-video models can further enhance the method's adaptability. |
text-to-video synthesis, inference acceleration, pruning, temporal attention, generative models |
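The pruning rule in the F3-Pruning entry above (prune weights whose aggregate temporal attention value ranks below a certain ratio) can be sketched as a rank-based mask. The example is illustrative only: `prune_mask`, the way scores are aggregated, and the toy values are assumptions, not the paper's code.

```python
def prune_mask(aggregate_scores, prune_ratio):
    """Return keep flags; the lowest-ranked `prune_ratio` fraction is pruned."""
    if not 0.0 <= prune_ratio <= 1.0:
        raise ValueError("prune_ratio must lie in [0, 1]")
    n = len(aggregate_scores)
    num_pruned = int(n * prune_ratio)
    # Indices sorted from lowest to highest aggregate attention.
    order = sorted(range(n), key=lambda i: aggregate_scores[i])
    pruned = set(order[:num_pruned])
    return [i not in pruned for i in range(n)]


# Hypothetical aggregate attention values for four temporal connections.
scores = [0.9, 0.1, 0.5, 0.05]
keep = prune_mask(scores, 0.5)
```

Since the mask depends only on attention values observed at inference time, no retraining is involved, which is the property the entry emphasises.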
2312.03431
Report |
Gaussian-Flow: 4D Reconstruction with Dynamic 3D Gaussian Particle |
Youtian Lin, Zuozhuo Dai, Siyu Zhu, Yao Yao |
We introduce Gaussian-Flow, a novel point-based approach for fast dynamic
scene reconstruction and real-time rendering from both multi-view and monocular
videos. In contrast to the prevalent NeRF-based approaches hampered by slow
training and rendering speeds, our approach harnesses recent advancements in
point-based 3D Gaussian Splatting (3DGS). Specifically, a novel Dual-Domain
Deformation Model (DDDM) is proposed to explicitly model attribute deformations
of each Gaussian point, where the time-dependent residual of each attribute is
captured by a polynomial fitting in the time domain, and a Fourier series
fitting in the frequency domain. The proposed DDDM is capable of modeling
complex scene deformations across long video footage, eliminating the need for
training separate 3DGS for each frame or introducing an additional implicit
neural field to model 3D dynamics. Moreover, the explicit deformation modeling
for discretized Gaussian points ensures ultra-fast training and rendering of a
4D scene, which is comparable to the original 3DGS designed for static 3D
reconstruction. Our proposed approach showcases a substantial efficiency
improvement, achieving a $5\times$ faster training speed compared to the
per-frame 3DGS modeling. In addition, quantitative results demonstrate that the
proposed Gaussian-Flow significantly outperforms previous leading methods in
novel view rendering quality. Project page:
https://nju-3dv.github.io/projects/Gaussian-Flow |
This paper introduces Gaussian-Flow, a point-based differentiable rendering approach for dynamic 3D scene reconstruction using a novel Dual-Domain Deformation Model (DDDM) applied to 3D Gaussian Splatting. |
The method achieves state-of-the-art training speed, rendering FPS, and novel view synthesis quality for 4D scene reconstruction by efficiently modeling deformations of each Gaussian point without relying on computationally expensive neural networks. |
Gaussian-Flow models a 4D scene as deformable 3D Gaussian points and uses DDDM to capture time-dependent attribute residuals (position, rotation, radiance) with polynomial fitting in the time domain and Fourier series fitting in the frequency domain. It employs adaptive timestamp scaling and regularizations for robust optimization. |
Achieves 5x faster training speed compared to per-frame 3DGS modeling.
Significantly outperforms prior methods in novel view rendering quality on both multi-view and monocular datasets (HyperNeRF, Plenoptic Video).
Demonstrates real-time rendering capabilities with high fidelity. |
Challenges remain in preserving high-fidelity thin structures.
Future work could explore more refined deformation models and regularization techniques to enhance detail preservation. |
dynamic scene reconstruction, 4d scene representation, differentiable rendering, 3d gaussian splatting, real-time rendering |
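The Dual-Domain Deformation Model in the Gaussian-Flow entry above combines a polynomial term in time with a Fourier series term. A minimal sketch for a single scalar attribute follows. The exact parameterisation (polynomial starting at degree 1, integer frequencies, time normalised to [0, 1]) is an assumption made for illustration, and `dddm_residual` and `attribute_at` are hypothetical names.

```python
import math


def dddm_residual(t, poly_coeffs, fourier_coeffs):
    """Time-dependent residual: polynomial part plus Fourier-series part.

    poly_coeffs[n] multiplies t**(n + 1); fourier_coeffs[l] is a
    (cos_coeff, sin_coeff) pair for frequency l + 1.
    """
    poly = sum(a * t ** (n + 1) for n, a in enumerate(poly_coeffs))
    fourier = sum(
        c_cos * math.cos(2 * math.pi * (l + 1) * t)
        + c_sin * math.sin(2 * math.pi * (l + 1) * t)
        for l, (c_cos, c_sin) in enumerate(fourier_coeffs)
    )
    return poly + fourier


def attribute_at(base_value, t, poly_coeffs, fourier_coeffs):
    """Attribute of one Gaussian at time t: static base plus the residual."""
    return base_value + dddm_residual(t, poly_coeffs, fourier_coeffs)


# Toy coefficients; in practice they would be optimised per Gaussian and attribute.
value = attribute_at(1.0, 0.0, [1.0, 2.0], [(0.5, 0.3)])
```

Evaluating such a closed-form expression per Gaussian requires no neural-network query, consistent with the entry's point about avoiding an additional implicit neural field.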
2312.03203
Report |
Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields |
Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Zehao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, Achuta Kadambi |
3D scene representations have gained immense popularity in recent years.
Methods that use Neural Radiance Fields are versatile for traditional tasks
such as novel view synthesis. In recent times, some work has emerged that aims
to extend the functionality of NeRF beyond view synthesis, for semantically
aware tasks such as editing and segmentation using 3D feature field
distillation from 2D foundation models. However, these methods have two major
limitations: (a) they are limited by the rendering speed of NeRF pipelines, and
(b) implicitly represented feature fields suffer from continuity artifacts
reducing feature quality. Recently, 3D Gaussian Splatting has shown
state-of-the-art performance on real-time radiance field rendering. In this
work, we go one step further: in addition to radiance field rendering, we
enable 3D Gaussian splatting on arbitrary-dimension semantic features via 2D
foundation model distillation. This translation is not straightforward: naively
incorporating feature fields in the 3DGS framework encounters significant
challenges, notably the disparities in spatial resolution and channel
consistency between RGB images and feature maps. We propose architectural and
training changes to efficiently avert this problem. Our proposed method is
general, and our experiments showcase novel view semantic segmentation,
language-guided editing and segment anything through learning feature fields
from state-of-the-art 2D foundation models such as SAM and CLIP-LSeg. Across
experiments, our distillation method is able to provide comparable or better
results, while being significantly faster to both train and render.
Additionally, to the best of our knowledge, we are the first method to enable
point and bounding-box prompting for radiance field manipulation, by leveraging
the SAM model. Project website at: https://feature-3dgs.github.io/ |
This paper introduces Feature 3DGS, a novel method for distilling high-dimensional semantic features from 2D foundation models (like SAM and CLIP-LSeg) into 3D Gaussian Splatting, enabling tasks like semantic segmentation, language-guided editing, and promptable instance segmentation. |
Existing NeRF-based methods for 3D feature distillation are limited by slow rendering speeds and potential interference between radiance and feature fields. Feature 3DGS overcomes these limitations by leveraging the speed and explicit representation of 3D Gaussian Splatting. |
The method uses a parallel N-dimensional Gaussian rasterizer to render both RGB images and semantic feature maps. A lightweight convolutional decoder (speed-up module) upsamples low-dimensional features, improving efficiency. Promptable scene manipulation is achieved by querying the distilled 3D feature field. |
Feature 3DGS achieves up to 2.7x faster feature field distillation and rendering compared to NeRF-based methods.
It shows up to 23% improvement in mIoU for semantic segmentation tasks on the Replica dataset.
The method enables novel view semantic segmentation, language-guided editing, and promptable segmentation from any viewpoint. |
The performance of Feature 3DGS is limited by the quality of the teacher network and the student feature's access to ground truth features.
The adaptation of the 3DGS pipeline can introduce noise and affect the optimal performance, particularly in complex scenes with tiny objects. |
3d gaussian splatting, feature distillation, semantic segmentation, language-guided editing, promptable segmentation |
2312.03160
Report |
HybridNeRF: Efficient Neural Rendering via Adaptive Volumetric Surfaces |
Haithem Turki, Vasu Agrawal, Samuel Rota Bulò, Lorenzo Porzi, Peter Kontschieder, Deva Ramanan, Michael Zollhöfer, Christian Richardt |
Neural radiance fields provide state-of-the-art view synthesis quality but
tend to be slow to render. One reason is that they make use of volume
rendering, thus requiring many samples (and model queries) per ray at render
time. Although this representation is flexible and easy to optimize, most
real-world objects can be modeled more efficiently with surfaces instead of
volumes, requiring far fewer samples per ray. This observation has spurred
considerable progress in surface representations such as signed distance
functions, but these may struggle to model semi-opaque and thin structures. We
propose a method, HybridNeRF, that leverages the strengths of both
representations by rendering most objects as surfaces while modeling the
(typically) small fraction of challenging regions volumetrically. We evaluate
HybridNeRF against the challenging Eyeful Tower dataset along with other
commonly used view synthesis datasets. When comparing to state-of-the-art
baselines, including recent rasterization-based approaches, we improve error
rates by 15-30% while achieving real-time framerates (at least 36 FPS) for
virtual-reality resolutions (2Kx2K). |
This paper proposes HybridNeRF, a novel hybrid surface-volume neural rendering technique that integrates the strengths of surface and volumetric rendering to accelerate novel view synthesis for complex scenes. |
Achieving real-time rendering of high-fidelity scenes is crucial for immersive applications like AR and VR, but existing methods often struggle to balance speed and quality. |
HybridNeRF leverages a spatially adaptive surfaceness field to represent most of the scene efficiently as a surface, while selectively employing volumetric rendering for challenging regions like thin structures or transparent objects. The method also introduces a distance-adjusted Eikonal regularization to accurately model complex backgrounds without a separate background model, and implements render-time optimizations such as hardware texture interpolation and sphere tracing to further boost performance. |
HybridNeRF achieves state-of-the-art quality on the challenging Eyeful Tower dataset, surpassing baselines in fidelity while maintaining real-time frame rates (at least 36 FPS) at VR resolutions (2K×2K).
The approach demonstrates comparable performance to the best real-time and offline methods on the MipNeRF-360 dataset.
On ScanNet++, HybridNeRF outperforms other real-time techniques and achieves near-identical quality to a high-fidelity but computationally expensive baseline while rendering over 400 times faster. |
The use of dense 3D grids and triplanes in HybridNeRF leads to higher memory consumption compared to hash table-based approaches.
Training time, although faster than the original NeRF, is comparatively slower than some recent methods like iNGP and 3D Gaussian splatting. |
neural rendering, novel view synthesis, surface rendering, volume rendering, real-time rendering |
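Sphere tracing, listed among the render-time optimizations in the HybridNeRF entry above, is a standard way to intersect a ray with a signed distance function by stepping forward by the current distance value. The sketch below shows the generic algorithm on an analytic unit sphere. The function names, step limit, and tolerances are assumptions, and it does not model HybridNeRF's learned fields or its volumetric fallback.

```python
import math


def sphere_sdf(point, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance from `point` to a sphere (negative inside)."""
    return math.dist(point, center) - radius


def sphere_trace(origin, direction, sdf, max_steps=64, eps=1e-4, t_max=100.0):
    """March along a unit-length `direction`; return the hit distance or None."""
    t = 0.0
    for _ in range(max_steps):
        point = tuple(o + t * d for o, d in zip(origin, direction))
        dist = sdf(point)
        if dist < eps:
            return t  # Close enough to the surface.
        t += dist  # Safe step: no surface lies within `dist` of the point.
        if t > t_max:
            break
    return None


# A ray aimed at the sphere and a ray that passes well above it.
hit = sphere_trace((0.0, 0.0, -3.0), (0.0, 0.0, 1.0), sphere_sdf)
miss = sphere_trace((0.0, 5.0, -3.0), (0.0, 0.0, 1.0), sphere_sdf)
```

Each step is as large as the distance field allows, so rays through empty space need only a few queries. This fits the entry's description of surface-like regions needing far fewer samples per ray than volume rendering.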
2312.03079
Report |
LooseControl: Lifting ControlNet for Generalized Depth Conditioning |
Shariq Farooq Bhat, Niloy J. Mitra, Peter Wonka |
We present LooseControl to allow generalized depth conditioning for
diffusion-based image generation. ControlNet, the SOTA for depth-conditioned
image generation, produces remarkable results but relies on having access to
detailed depth maps for guidance. Creating such exact depth maps, in many
scenarios, is challenging. This paper introduces a generalized version of depth
conditioning that enables many new content-creation workflows. Specifically, we
allow (C1) scene boundary control for loosely specifying scenes with only
boundary conditions, and (C2) 3D box control for specifying layout locations of
the target objects rather than the exact shape and appearance of the objects.
Using LooseControl, along with text guidance, users can create complex
environments (e.g., rooms, street views, etc.) by specifying only scene
boundaries and locations of primary objects. Further, we provide two editing
mechanisms to refine the results: (E1) 3D box editing enables the user to
refine images by changing, adding, or removing boxes while freezing the style
of the image. This yields minimal changes apart from changes induced by the
edited boxes. (E2) Attribute editing proposes possible editing directions to
change one particular aspect of the scene, such as the overall object density
or a particular object. Extensive tests and comparisons with baselines
demonstrate the generality of our method. We believe that LooseControl can
become an important design tool for easily creating complex environments and be
extended to other forms of guidance channels. Code and more information are
available at https://shariqfarooq123.github.io/loose-control/ . |
This paper introduces LooseControl, a novel framework that enables generalized depth conditioning for diffusion-based image generation, allowing for more flexible and creative control over the image generation process. |
Existing methods like ControlNet, while powerful, rely on precise depth maps for guidance, which can be challenging to create. LooseControl addresses this limitation by allowing for more abstract and user-friendly depth specifications, broadening the creative possibilities for users. |
The authors introduce two forms of generalized depth control: Scene boundary control, which uses scene boundaries as an upper depth limit, and 3D box control, which uses approximate 3D bounding boxes to guide object placement. They achieve this by training a modified ControlNet model on synthetically generated data that represents these generalized depth conditions. |
LooseControl generates more realistic and creative images compared to baseline methods, especially when using abstract depth guidance.
The framework introduces two novel editing mechanisms: 3D box editing for manipulating object placement while preserving scene style, and attribute editing for exploring variations in object attributes.
A user study showed a strong preference (over 95%) for LooseControl-generated images compared to those generated using traditional depth conditioning methods. |
While LooseControl effectively controls primary objects, achieving fine-grained control over secondary objects remains a challenge.
Similar to ControlNet, providing too many constraints as input can limit the diversity of generated results. |
image generation, diffusion models, depth conditioning, controllable image synthesis, generative ai |
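The "upper depth limit" reading of scene boundary control in the LooseControl record above can be illustrated with a small check. This is only a sketch under assumptions: depth maps are plain nested lists of per-pixel depths, and the function name `satisfies_boundary` is hypothetical, not part of the LooseControl code.

```python
def satisfies_boundary(depth, boundary):
    """Return True if every pixel depth is at or in front of the boundary depth.

    depth, boundary: equally sized nested lists (rows of per-pixel depths).
    The boundary acts as an upper limit: scene content may lie anywhere
    in front of it, but never behind it.
    """
    return all(
        d <= b
        for depth_row, boundary_row in zip(depth, boundary)
        for d, b in zip(depth_row, boundary_row)
    )


# Hypothetical 2x2 example: a room whose back wall is 5 units away.
boundary = [[5.0, 5.0], [5.0, 5.0]]
scene_ok = [[1.0, 2.0], [3.0, 5.0]]   # all content in front of or on the wall
scene_bad = [[6.0, 2.0], [3.0, 4.0]]  # one pixel lies behind the wall
```

Any scene whose depth stays within the limit is compatible with the same boundary map, which is why the condition is looser than an exact depth map.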
2312.03048
Report |
DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control |
Yuru Jia, Lukas Hoyer, Shengyu Huang, Tianfu Wang, Luc Van Gool, Konrad Schindler, Anton Obukhov |
Large, pretrained latent diffusion models (LDMs) have demonstrated an
extraordinary ability to generate creative content, specialize to user data
through few-shot fine-tuning, and condition their output on other modalities,
such as semantic maps. However, are they usable as large-scale data generators,
e.g., to improve tasks in the perception stack, like semantic segmentation? We
investigate this question in the context of autonomous driving, and answer it
with a resounding "yes". We propose an efficient data generation pipeline
termed DGInStyle. First, we examine the problem of specializing a pretrained
LDM to semantically-controlled generation within a narrow domain. Second, we
propose a Style Swap technique to endow the rich generative prior with the
learned semantic control. Third, we design a Multi-resolution Latent Fusion
technique to overcome the bias of LDMs towards dominant objects. Using
DGInStyle, we generate a diverse dataset of street scenes, train a
domain-agnostic semantic segmentation model on it, and evaluate the model on
multiple popular autonomous driving datasets. Our approach consistently
increases the performance of several domain generalization methods compared to
the previous state-of-the-art methods. Source code and dataset are available at
https://dginstyle.github.io. |
This paper introduces DGInStyle, a data generation pipeline for improving domain generalization in semantic segmentation using pretrained latent diffusion models (LDMs). |
Domain generalization is crucial for deploying deep learning models in real-world scenarios with domain shifts. This paper addresses this by leveraging the rich priors encoded in pretrained LDMs. |
DGInStyle combines three key techniques: (1) Style Swap for preserving style diversity by decoupling semantic control from the source domain style, (2) Style Prompting for enriching style variations with text prompts, and (3) Multi-resolution Latent Fusion (MRLF) for generating high-fidelity images with precise semantic layouts, especially for small objects. |
DGInStyle significantly improves the performance of various domain generalization methods across different network architectures (CNNs and Transformers).
It leads to substantial improvements in class-wise IoU, particularly for small and challenging classes like poles, traffic lights, and traffic signs.
The effectiveness of each component (Style Swap, Style Prompting, MRLF) is validated through ablation studies. |
The reliance on existing segmentation masks from the source domain limits the diversity of generated scenes.
The computational cost of generating high-resolution images with MRLF can be a bottleneck for large-scale dataset generation. |
domain generalization, semantic segmentation, latent diffusion models, data augmentation, generative models |
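The DGInStyle record above reports gains in class-wise IoU. As a reminder of how that metric is usually computed, here is a minimal sketch assuming predictions and labels are flattened lists of integer class ids; the function names are illustrative and not taken from the DGInStyle code.

```python
def classwise_iou(pred, target, num_classes):
    """Return one IoU value per class (None if the class appears in neither input)."""
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and label are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or label is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        ious.append(inter / union if union else None)
    return ious


def mean_iou(ious):
    """Average the IoU over the classes that are present; None if there are none."""
    present = [v for v in ious if v is not None]
    return sum(present) / len(present) if present else None


# Hypothetical five-pixel example with three classes.
pred = [0, 0, 1, 1, 2]
target = [0, 1, 1, 1, 2]
ious = classwise_iou(pred, target, num_classes=3)
```

Small classes such as poles or traffic signs cover few pixels, so a handful of misclassified pixels moves their IoU much more than it moves the IoU of large classes.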
2312.03047
Report |
MagicStick: Controllable Video Editing via Control Handle Transformations |
Yue Ma, Xiaodong Cun, Yingqing He, Chenyang Qi, Xintao Wang, Ying Shan, Xiu Li, Qifeng Chen |
Text-based video editing has recently attracted considerable interest in
changing the style or replacing the objects with a similar structure. Beyond
this, we demonstrate that properties such as shape, size, location, motion,
etc., can also be edited in videos. Our key insight is that keyframe
transformations of a specific internal feature (e.g., edge maps of objects or
human pose) can easily propagate to other frames to provide generation
guidance. We thus propose MagicStick, a controllable video editing method that
edits the video properties by utilizing the transformation on the extracted
internal control signals. In detail, to keep the appearance, we inflate both
the pretrained image diffusion model and ControlNet to the temporal dimension
and train low-rank adaptation (LoRA) layers to fit the specific scenes. Then, in
editing, we adopt an inversion-and-editing framework. Unlike prior work, a
fine-tuned ControlNet is introduced in both inversion and generation for
attention guidance, with the proposed attention remix between the spatial
attention maps of inversion and editing. Although succinct, our method is the
first to show the ability of video property editing from a pre-trained
text-to-image model. We present experiments on numerous examples within our
unified framework. We also compare with shape-aware text-based editing and
handcrafted motion video generation, demonstrating superior temporal consistency
and editing capability over previous works. The code and models will be made publicly
available. |
This paper proposes MagicStick, a novel framework for controllable video editing that modifies video properties (e.g., shape, size, location, motion) by leveraging keyframe transformations on extracted internal control signals (like object edges or human pose). |
Many straightforward video edits, like resizing objects or changing their position over time, remain challenging for existing methods. This work addresses this gap by enabling controllable video editing of various properties while maintaining temporal consistency and appearance fidelity. |
The method uses a pre-trained image diffusion model and ControlNet, adapting them to the temporal dimension. It employs a controllable video customization step to maintain appearance consistency. During editing, it uses an inversion and editing framework with a novel attention remix module guided by transformed control signals. |
MagicStick successfully edits object size, position, and human motion in videos while preserving appearance and temporal consistency.
The proposed method outperforms baselines like Shape-aware Video Editing and VideoComposer in terms of temporal consistency and editing quality, as shown qualitatively and quantitatively.
Ablation studies confirm the importance of individual components like LoRA tuning, token embedding, temporal modules, and the Attention ReMix module. |
The method struggles to edit object motion along trajectories significantly different from the source video.
Future work could explore the application of this framework to more powerful pre-trained video diffusion models. |
video editing, controllable generation, diffusion models, controlnet, attention mechanisms |
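The MagicStick record above mentions training low-rank adaptation (LoRA) layers. The sketch below shows the usual LoRA weight update, W' = W + (alpha / r) * B A, on plain nested lists. It follows the common LoRA formulation rather than the MagicStick implementation; the helper names and the `alpha` scaling parameter are assumptions for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A.

    w: frozen base weight, shape (out, in)
    a: trainable down-projection, shape (r, in)
    b: trainable up-projection, shape (out, r)
    """
    r = len(a)  # rank of the update
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Hypothetical rank-1 update of a 2x2 identity weight.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[3.0, 4.0]]    # shape (1, 2)
b = [[1.0], [2.0]]  # shape (2, 1)
adapted = lora_weight(w, a, b, alpha=1.0)
```

Only `a` and `b` are trained, so fitting a specific scene touches r * (in + out) parameters per layer instead of in * out.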
2312.03045
Report |
Customization Assistant for Text-to-image Generation |
Yufan Zhou, Ruiyi Zhang, Jiuxiang Gu, Tong Sun |
Customizing pre-trained text-to-image generation models has attracted massive
research interest recently, due to its huge potential in real-world
applications. Although existing methods are able to generate creative content
for a novel concept contained in a single user-input image, their capabilities
are still far from perfect. Specifically, most existing methods require
fine-tuning the generative model on testing images. Some existing methods do
not require fine-tuning, but their performance is unsatisfactory.
Furthermore, the interaction between users and models is still limited to
directive and descriptive prompts such as instructions and captions. In this
work, we build a customization assistant based on a pre-trained large language
model and diffusion model, which can not only perform customized generation in
a tuning-free manner, but also enable more user-friendly interactions: users
can chat with the assistant and input either ambiguous text or clear
instructions. Specifically, we propose a new framework consisting of a new model
design and a novel training strategy. The resulting assistant can perform
customized generation in 2-5 seconds without any test-time fine-tuning.
Extensive experiments are conducted, and competitive results have been obtained
across different domains, illustrating the effectiveness of the proposed
method. |
This paper introduces CAFE, a Customization Assistant For text-to-imagE generation that utilizes large language models (LLMs) to enable tuning-free, user-friendly image customization. |
Existing methods for customizing pre-trained text-to-image models are either inefficient, require fine-tuning, or lack user-friendliness. CAFE addresses these limitations by offering fast, tuning-free customization and handling ambiguous user input. |
CAFE leverages a multi-modal large language model (MLLM) to infer user intent from text and image input. It generates tailored image embeddings and textual explanations. A novel self-improvement via distillation (SID) strategy trains the model on automatically generated high-quality data, eliminating costly human filtering. |
CAFE generates customized images in 2-5 seconds without test-time fine-tuning.
It handles both declarative and interrogative sentences, enabling more natural user interactions.
Quantitative evaluations demonstrate competitive performance against state-of-the-art methods in both object and human image domains. |
The model's performance relies heavily on the quality and diversity of training data.
Future work could explore incorporating user feedback to further improve the model's customization ability. |
text-to-image generation, customization, large language models, tuning-free, image editing |
2312.03029
Report |
Gaussian Head Avatar: Ultra High-fidelity Head Avatar via Dynamic Gaussians |
Yuelang Xu, Benwang Chen, Zhe Li, Hongwen Zhang, Lizhen Wang, Zerong Zheng, Yebin Liu |
Creating high-fidelity 3D head avatars has always been a research hotspot,
but there remains a great challenge under lightweight sparse view setups. In
this paper, we propose Gaussian Head Avatar represented by controllable 3D
Gaussians for high-fidelity head avatar modeling. We optimize the neutral 3D
Gaussians and a fully learned MLP-based deformation field to capture complex
expressions. The two parts benefit each other, so our method can model
fine-grained dynamic details while ensuring expression accuracy. Furthermore,
we devise a well-designed geometry-guided initialization strategy based on
implicit SDF and Deep Marching Tetrahedra for the stability and convergence of
the training procedure. Experiments show our approach outperforms other
state-of-the-art sparse-view methods, achieving ultra high-fidelity rendering
quality at 2K resolution even under exaggerated expressions. |
This paper proposes Gaussian Head Avatar, a novel representation for reconstructing high-fidelity 3D head avatars from sparse views using controllable 3D Gaussians. |
Existing methods struggle to synthesize high-fidelity images with pixel-level details, especially at 2K resolution and under exaggerated expressions. This method aims to overcome these limitations. |
The method employs a fully learned deformation field on 3D Gaussians to model complex expressions and introduces a geometry-guided initialization strategy using SDF and DMTet for robust convergence. |
Achieves superior image quality with fine-grained dynamic details at 2K resolution, outperforming state-of-the-art methods on self-reenactment tasks.
Demonstrates accurate expression transfer and can effectively model exaggerated expressions not well-captured by traditional methods.
Shows strong 3D consistency, enabling high-quality novel view synthesis from limited input views. |
Blurring occurs in areas that lack robust tracking, such as the inside of the mouth or long hair.
Future work could address this by integrating advanced tracking techniques for these challenging regions. |
3d head avatar, gaussian splatting, deformation field, sparse view reconstruction, high-fidelity rendering |
2312.03026
Report |
Uni3DL: Unified Model for 3D and Language Understanding |
Xiang Li, Jian Ding, Zhaoyang Chen, Mohamed Elhoseiny |
In this work, we present Uni3DL, a unified model for 3D and Language
understanding. Distinct from existing unified vision-language models in 3D
which are limited in task variety and predominantly dependent on projected
multi-view images, Uni3DL operates directly on point clouds. This approach
significantly expands the range of supported tasks in 3D, encompassing both
vision and vision-language tasks in 3D. At the core of Uni3DL, a query
transformer is designed to learn task-agnostic semantic and mask outputs by
attending to 3D visual features, and a task router is employed to selectively
generate task-specific outputs required for diverse tasks. With a unified
architecture, our Uni3DL model enjoys seamless task decomposition and
substantial parameter sharing across tasks. Uni3DL has been rigorously
evaluated across diverse 3D vision-language understanding tasks, including
semantic segmentation, object detection, instance segmentation, visual
grounding, 3D captioning, and text-3D cross-modal retrieval. It demonstrates
performance on par with or surpassing state-of-the-art (SOTA) task-specific
models. We hope our benchmark and Uni3DL model will serve as a solid step to
ease future research in unified models in the realm of 3D and language
understanding. Project page: https://uni3dl.github.io. |
This paper introduces Uni3DL, a unified model for 3D and language understanding that operates directly on raw point clouds, departing from traditional multi-view image projection methods. |
Existing 3D vision-language models rely heavily on projected 2D images, limiting their ability to process 3D geometric information effectively. Uni3DL aims to address this by directly learning from raw point cloud data. |
Uni3DL employs a transformer-based architecture with a novel cross-modal attention mechanism to learn joint representations of 3D point clouds and text. It's pre-trained on large-scale 3D-language datasets (ScanNet, ScanRefer, Cap3D Objaverse) and fine-tuned for various downstream tasks like segmentation, captioning, and retrieval. |
Uni3DL achieves state-of-the-art results on ScanNet for 3D instance segmentation, even surpassing methods using additional segment labels.
It demonstrates competitive performance in zero-shot 3D classification on ModelNet40 and ModelNet10, particularly excelling in top-5 accuracy.
The model exhibits strong capabilities in text-guided 3D segmentation and cross-modal retrieval tasks. |
Uni3DL currently doesn't leverage the strengths of pre-trained 2D foundation models like CLIP, which limits its ability to benefit from rich 2D image representations.
Future work will explore a hybrid approach, combining point-based learning with insights and features from 2D foundation models to further enhance 3D language understanding. |
3d vision, language understanding, point cloud processing, cross-modal learning, transformers |
2312.03015
Report |
PartSLIP++: Enhancing Low-Shot 3D Part Segmentation via Multi-View Instance Segmentation and Maximum Likelihood Estimation |
Yuchen Zhou, Jiayuan Gu, Xuanlin Li, Minghua Liu, Yunhao Fang, Hao Su |
Open-world 3D part segmentation is pivotal in diverse applications such as
robotics and AR/VR. Traditional supervised methods often grapple with limited
3D data availability and struggle to generalize to unseen object categories.
PartSLIP, a recent advancement, has made significant strides in zero- and
few-shot 3D part segmentation. This is achieved by harnessing the capabilities
of the 2D open-vocabulary detection module, GLIP, and introducing a heuristic
method for converting and lifting multi-view 2D bounding box predictions into
3D segmentation masks. In this paper, we introduce PartSLIP++, an enhanced
version designed to overcome the limitations of its predecessor. Our approach
incorporates two major improvements. First, we utilize a pre-trained 2D
segmentation model, SAM, to produce pixel-wise 2D segmentations, yielding more
precise and accurate annotations than the 2D bounding boxes used in PartSLIP.
Second, PartSLIP++ replaces the heuristic 3D conversion process with an
innovative modified Expectation-Maximization algorithm. This algorithm
conceptualizes 3D instance segmentation as unobserved latent variables, and
then iteratively refines them through an alternating process of 2D-3D matching
and optimization with gradient descent. Through extensive evaluations, we show
that PartSLIP++ demonstrates better performance over PartSLIP in both low-shot
3D semantic and instance-based object part segmentation tasks. Code released at
https://github.com/zyc00/PartSLIP2. |
PartSLIP++ improves upon PartSLIP for few-shot 3D part segmentation by using SAM for pixel-wise 2D segmentation and a modified EM algorithm for lifting 2D to 3D. |
Open-world 3D part segmentation is crucial for applications like robotics and AR/VR, but supervised methods suffer from limited 3D data and struggle to generalize. |
PartSLIP++ uses SAM to refine 2D bounding boxes from GLIP into segmentation masks. It then employs a modified EM algorithm to iteratively match and optimize 3D instance labels with projected 2D masks. |
PartSLIP++ outperforms PartSLIP in low-shot 3D semantic and instance segmentation on the PartNetE dataset.
Ablation studies confirm the effectiveness of using SAM, the EM algorithm, and post-processing.
PartSLIP++ enables applications like semi-automatic part annotation and 3D instance proposal generation. |
The reliance on pre-trained 2D models might limit performance on highly specialized or unseen objects.
Further exploration of different 2D-3D matching and optimization techniques within the EM algorithm is possible. |
3d part segmentation, few-shot learning, open-world segmentation, segment anything model (sam), expectation-maximization (em) |
2312.03011
Report |
InstructBooth: Instruction-following Personalized Text-to-Image Generation |
Daewon Chae, Nokyung Park, Jinkyu Kim, Kimin Lee |
Personalizing text-to-image models using a limited set of images for a
specific object has been explored in subject-specific image generation.
However, existing methods often face challenges in aligning with text prompts
due to overfitting to the limited training images. In this work, we introduce
InstructBooth, a novel method designed to enhance image-text alignment in
personalized text-to-image models without sacrificing the personalization
ability. Our approach first personalizes text-to-image models with a small
number of subject-specific images using a unique identifier. After
personalization, we fine-tune personalized text-to-image models using
reinforcement learning to maximize a reward that quantifies image-text
alignment. Additionally, we propose complementary techniques to increase the
synergy between these two processes. Our method demonstrates superior
image-text alignment compared to existing baselines, while maintaining high
personalization ability. In human evaluations, InstructBooth outperforms these
baselines when all factors are considered together. Our project page is at
https://sites.google.com/view/instructbooth. |
This paper introduces InstructBooth, a novel method for personalized text-to-image generation that enhances image-text alignment without sacrificing personalization ability. |
Existing personalized text-to-image generation methods often struggle to balance subject fidelity with the ability to accurately reflect new contexts and actions from text prompts. |
InstructBooth first personalizes a text-to-image model using a unique identifier and reference images. Then, it leverages reinforcement learning to fine-tune the model, maximizing a reward based on image-text alignment. |
InstructBooth generates personalized images with high text fidelity, outperforming existing methods in aligning generated images with given prompts.
The method maintains high subject fidelity, ensuring generated images resemble the user-provided subject.
Human evaluations demonstrate a strong preference for InstructBooth outputs over existing methods, highlighting its ability to generate personalized images that are both accurate and visually appealing. |
The current subject fidelity metric used in evaluation primarily focuses on appearance and might not be ideal for evaluating personalized images with diverse poses and actions.
The research highlights the need for improved metrics and techniques to evaluate subject fidelity more comprehensively in personalized image generation. |
text-to-image generation, personalization, reinforcement learning, image-text alignment, subject fidelity |
2312.02981
Report |
ReconFusion: 3D Reconstruction with Diffusion Priors |
Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P. Srinivasan, Dor Verbin, Jonathan T. Barron, Ben Poole, Aleksander Holynski |
3D reconstruction methods such as Neural Radiance Fields (NeRFs) excel at
rendering photorealistic novel views of complex scenes. However, recovering a
high-quality NeRF typically requires tens to hundreds of input images,
resulting in a time-consuming capture process. We present ReconFusion to
reconstruct real-world scenes using only a few photos. Our approach leverages a
diffusion prior for novel view synthesis, trained on synthetic and multiview
datasets, which regularizes a NeRF-based 3D reconstruction pipeline at novel
camera poses beyond those captured by the set of input images. Our method
synthesizes realistic geometry and texture in underconstrained regions while
preserving the appearance of observed regions. We perform an extensive
evaluation across various real-world datasets, including forward-facing and
360-degree scenes, demonstrating significant performance improvements over
previous few-view NeRF reconstruction approaches. |
This paper proposes a novel method to enhance 3D scene reconstruction from a limited number of posed images, leveraging a diffusion model trained for novel view synthesis as a prior to regularize a NeRF-based 3D reconstruction pipeline. |
Reconstructing high-quality 3D scenes typically demands dense image captures (tens to hundreds), which is time-consuming and limits accessibility. This method addresses this challenge by significantly reducing the number of input images required. |
The approach involves training a diffusion model on a mixture of real and synthetic multiview datasets to generate plausible novel views. This model, conditioned on input images and poses, is integrated into a NeRF reconstruction pipeline, guiding it to produce realistic renderings even from sparsely sampled viewpoints. |
The method outperforms existing few-view NeRF reconstruction approaches, demonstrating significant quality improvements in both geometry and appearance, particularly in under-observed regions.
It effectively reduces artifacts common in few-view reconstructions, such as "floaters" and inaccurate geometry.
The diffusion prior proves to be a robust regularizer, enhancing reconstruction quality across a range of capture settings, including both forward-facing and 360-degree scenes. |
The reliance on the heavyweight diffusion model introduces computational costs, slowing down the reconstruction process.
The current method shows limited 3D outpainting capabilities compared to the 2D hallucinations possible with the image model. |
3d reconstruction, neural radiance fields (nerf), few-shot learning, diffusion models, novel view synthesis |
2312.02980
Report |
GPT4Point: A Unified Framework for Point-Language Understanding and Generation |
Zhangyang Qi, Ye Fang, Zeyi Sun, Xiaoyang Wu, Tong Wu, Jiaqi Wang, Dahua Lin, Hengshuang Zhao |
Multimodal Large Language Models (MLLMs) have excelled in 2D image-text
comprehension and image generation, but their understanding of the 3D world is
notably deficient, limiting progress in 3D language understanding and
generation. To solve this problem, we introduce GPT4Point, an innovative
groundbreaking point-language multimodal model designed specifically for
unified 3D object understanding and generation within the MLLM framework.
As a powerful 3D MLLM, GPT4Point can seamlessly execute a variety of point-text
reference tasks such as point-cloud captioning and Q&A. Additionally, GPT4Point
is equipped with advanced capabilities for controllable 3D generation: it can
obtain high-quality results from low-quality point-text features while
maintaining the geometric shapes and colors. To support the expansive needs of 3D
object-text pairs, we develop Pyramid-XL, a point-language dataset annotation
engine. It constructs a large-scale database of over 1M objects with varied text
granularity levels from the Objaverse-XL dataset, essential for training
GPT4Point. A comprehensive benchmark has been proposed to evaluate 3D
point-language understanding capabilities. In extensive evaluations, GPT4Point
has demonstrated superior performance in understanding and generation. |
This paper introduces GPT4Point, a unified framework for 3D object understanding and generation using point clouds and language. |
The work addresses the limitations of existing MLLMs in understanding and generating 3D objects, aiming for comprehensive 3D world interpretation. |
The method follows a two-stage approach: (1) point-text feature alignment using a BERT-based Point-QFormer, and (2) an LLM branch for text inference plus a diffusion branch for controlled 3D generation conditioned on point-text features. |
Outperforms VLMs and PointLLM in 3D object recognition tasks like zero-shot classification and point-text retrieval.
Achieves superior performance in 3D object text inference tasks, including captioning and question answering.
Enables controllable text-to-3D generation by leveraging low-quality point cloud features and text descriptions, enhancing generation quality and controllability. |
Limited exploration of multi-object scene understanding and interaction.
Reliance on Point-E for generation, potentially limiting generation quality and diversity. |
3d vision, multimodal learning, large language models, point cloud processing, text-to-3d generation |
2312.02974
Report |
Describing Differences in Image Sets with Natural Language |
Lisa Dunlap, Yuhui Zhang, Xiaohan Wang, Ruiqi Zhong, Trevor Darrell, Jacob Steinhardt, Joseph E. Gonzalez, Serena Yeung-Levy |
How do two sets of images differ? Discerning set-level differences is crucial
for understanding model behaviors and analyzing datasets, yet manually sifting
through thousands of images is impractical. To aid in this discovery process,
we explore the task of automatically describing the differences between two
$\textbf{sets}$ of images, which we term Set Difference Captioning. This task
takes in image sets $D_A$ and $D_B$, and outputs a description that is more
often true on $D_A$ than $D_B$. We outline a two-stage approach that first
proposes candidate difference descriptions from image sets and then re-ranks
the candidates by checking how well they can differentiate the two sets. We
introduce VisDiff, which first captions the images and prompts a language model
to propose candidate descriptions, then re-ranks these descriptions using CLIP.
To evaluate VisDiff, we collect VisDiffBench, a dataset with 187 paired image
sets with ground truth difference descriptions. We apply VisDiff to various
domains, such as comparing datasets (e.g., ImageNet vs. ImageNetV2), comparing
classification models (e.g., zero-shot CLIP vs. supervised ResNet), summarizing
model failure modes (supervised ResNet), characterizing differences between
generative models (e.g., StableDiffusionV1 and V2), and discovering what makes
images memorable. Using VisDiff, we are able to find interesting and previously
unknown differences in datasets and models, demonstrating its utility in
revealing nuanced insights. |
This paper explores **Set Difference Captioning (SDC)**, a task where the goal is to generate natural language descriptions that capture the salient differences between two sets of images. |
SDC is important for understanding model behaviors, analyzing datasets (especially for distribution shifts), and gaining insights into human cognition, all in a scalable and interpretable way. |
The paper proposes a two-stage **proposer-ranker** framework. The proposer generates candidate difference descriptions based on small subsets of images. The ranker then evaluates and ranks these descriptions by checking their validity across the full image sets. |
A novel SDC benchmark, **VisDiffBench**, is created with 187 paired image sets and ground truth difference descriptions.
The best approach, **VisDiff**, leverages a caption-based proposer with GPT-4 and a feature-based ranker with CLIP, achieving high accuracy on VisDiffBench.
VisDiff reveals interesting and sometimes previously unknown insights when applied to comparing datasets (ImageNet vs. ImageNetV2), model behaviors (CLIP vs ResNet), and analyzing human memory (LaMem dataset). |
The current method relies heavily on large pre-trained models, inheriting their potential biases and limitations.
VisDiffBench, while extensive, could be expanded to include more diverse and subtle differences beyond objects and styles. |
set difference captioning, image understanding, dataset analysis, model interpretation, vision and language |
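The ranking stage of the proposer-ranker framework in the VisDiff record above can be sketched as follows. The sketch assumes each candidate description already has one similarity score per image in each set (for instance from an image-text model such as CLIP) and ranks candidates by AUROC, the probability that a random image from D_A scores higher than a random image from D_B. Using AUROC as the separation measure and the names `auroc` and `rank_candidates` are assumptions for illustration, not details confirmed by the summary.

```python
def auroc(scores_a, scores_b):
    """Probability that a score from set A exceeds a score from set B (ties count half)."""
    wins = 0.0
    for sa in scores_a:
        for sb in scores_b:
            if sa > sb:
                wins += 1.0
            elif sa == sb:
                wins += 0.5
    return wins / (len(scores_a) * len(scores_b))


def rank_candidates(candidate_scores):
    """Sort descriptions by how well their scores separate set A from set B.

    candidate_scores: dict mapping description -> (scores_a, scores_b).
    Returns a list of (description, auroc) pairs, best first.
    """
    ranked = [(desc, auroc(a, b)) for desc, (a, b) in candidate_scores.items()]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Hypothetical scores: "snowy" separates the sets, "outdoor" does not.
candidates = {
    "outdoor": ([0.5, 0.6], [0.5, 0.6]),
    "snowy": ([0.9, 0.8], [0.1, 0.2]),
}
ranking = rank_candidates(candidates)
```

A value near 1.0 means the description is true far more often on D_A than on D_B, while a value near 0.5 means it does not distinguish the two sets.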
2312.02970
Report |
Alchemist: Parametric Control of Material Properties with Diffusion Models |
Prafull Sharma, Varun Jampani, Yuanzhen Li, Xuhui Jia, Dmitry Lagun, Fredo Durand, William T. Freeman, Mark Matthews |
We propose a method to control material attributes of objects like roughness,
metallic, albedo, and transparency in real images. Our method capitalizes on
the generative prior of text-to-image models known for photorealism, employing
a scalar value and instructions to alter low-level material properties.
Addressing the lack of datasets with controlled material attributes, we
generated an object-centric synthetic dataset with physically-based materials.
Fine-tuning a modified pre-trained text-to-image model on this synthetic
dataset enables us to edit material properties in real-world images while
preserving all other attributes. We show the potential application of our model
to material-edited NeRFs. |
This paper introduces a method leveraging pre-trained text-to-image diffusion models for parametric control of material properties (roughness, metallic, albedo, transparency) in real images. |
Achieving fine-grained control over object material properties in images has broad applications in image editing, advertising, and forensics. |
The authors generate a synthetic dataset with controlled material attributes and fine-tune a pre-trained text-to-image diffusion model using relative attribute strength as an input. |
The model generalizes to real images despite training on synthetic data.
It allows for smooth edits of material properties controlled by a single scalar value.
The method can be extended to material editing in neural radiance fields. |
The model may produce minimal perceptual changes for certain attributes (roughness, metallic).
Occasionally, physically unrealistic transparency edits may occur. |
material editing, diffusion models, image editing, generative models, synthetic data |
2312.02963
Report |
MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures |
Zhangyang Xiong, Chenghong Li, Kenkun Liu, Hongjie Liao, Jianqiao Hu, Junyi Zhu, Shuliang Ning, Lingteng Qiu, Chongjie Wang, Shijie Wang, Shuguang Cui, Xiaoguang Han |
In this era, the success of large language models and text-to-image models
can be attributed to the driving force of large-scale datasets. However, in the
realm of 3D vision, while remarkable progress has been made with models trained
on large-scale synthetic and real-captured object data like Objaverse and
MVImgNet, a similar level of progress has not been observed in the domain of
human-centric tasks partially due to the lack of a large-scale human dataset.
Existing datasets of high-fidelity 3D human capture continue to be mid-sized
due to the significant challenges in acquiring large-scale high-quality 3D
human data. To bridge this gap, we present MVHumanNet, a dataset that comprises
multi-view human action sequences of 4,500 human identities. The primary focus
of our work is on collecting human data that features a large number of diverse
identities and everyday clothing using a multi-view human capture system, which
facilitates easily scalable data collection. Our dataset contains 9,000 daily
outfits, 60,000 motion sequences and 645 million frames with extensive
annotations, including human masks, camera parameters, 2D and 3D keypoints,
SMPL/SMPLX parameters, and corresponding textual descriptions. To explore the
potential of MVHumanNet in various 2D and 3D visual tasks, we conducted pilot
studies on view-consistent action recognition, human NeRF reconstruction,
text-driven view-unconstrained human image generation, as well as 2D
view-unconstrained human image and 3D avatar generation. Extensive experiments
demonstrate the performance improvements and effective applications enabled by
the scale provided by MVHumanNet. As the current largest-scale 3D human
dataset, we hope that the release of MVHumanNet data with annotations will
foster further innovations in the domain of 3D human-centric tasks at scale. |
This paper introduces MVHumanNet, the largest multi-view human capture dataset to date, containing 4,500 identities, 9,000 outfits, and 645 million frames with annotations. |
A large-scale, diverse human dataset is crucial for advancing 3D human-centric tasks in computer vision, similar to the impact of large datasets on language and 2D image models. |
The authors built a multi-view capture system and collected data from 4,500 individuals performing various actions in everyday clothing. They annotated the data with action labels, camera parameters, masks, skeletons, and SMPL parameters. |
View-consistent action recognition accuracy improves significantly with more viewpoints.
NeRF reconstruction for humans shows enhanced generalization ability when trained on larger scales of MVHumanNet data.
MVHumanNet enables the development of high-quality text-driven human image generation and 3D human avatar generative models. |
Current experiments used only a subset (62%) of the full dataset due to hardware limitations.
Existing generalizable NeRF methods, designed for limited data, could be redesigned to better leverage the full potential of MVHumanNet. |
3d human capture, multi-view dataset, nerf reconstruction, text-driven generation, human generative model |
2312.02949
Report |
LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models |
Hao Zhang, Hongyang Li, Feng Li, Tianhe Ren, Xueyan Zou, Shilong Liu, Shijia Huang, Jianfeng Gao, Lei Zhang, Chunyuan Li, Jianwei Yang |
With the recent significant advancements in large multi-modal models (LMMs),
the importance of their grounding capability in visual chat is increasingly
recognized. Despite recent efforts to enable LMMs to support grounding, their
capabilities for grounding and chat are usually separate, and their chat
performance drops dramatically when asked to ground. The problem is the lack of
a dataset for grounded visual chat (GVC). Existing grounding datasets only
contain short captions. To address this issue, we have created GVC data that
allows for the combination of grounding and chat capabilities. To better
evaluate the GVC capabilities, we have introduced a benchmark called
Grounding-Bench. Additionally, we have proposed a model design that can support
GVC and various types of visual prompts by connecting segmentation models with
language models. Experimental results demonstrate that our model outperforms
other LMMs on Grounding-Bench. Furthermore, our model achieves competitive
performance on classic grounding benchmarks like RefCOCO/+/g and Flickr30K
Entities. Our code will be released at
https://github.com/UX-Decoder/LLaVA-Grounding . |
This paper introduces LLaVA-Grounding, an AI assistant capable of both visual chat and grounding, by creating a new grounded visual chat dataset, proposing a new model architecture, and establishing Grounding-Bench as a benchmark for evaluating grounded visual chat performance. |
Existing large multimodal models (LMMs) struggle to effectively perform grounded visual chat due to the scarcity of grounded visual chat data and suboptimal model designs. This work aims to address these challenges and advance the development of grounded visual chat for LMMs. |
The authors create a high-quality Grounded Visual Chat (GVC) dataset using human-labeled object detection data and GPT-4 for matching noun phrases to instances. They propose LLaVA-Grounding, an end-to-end model that connects an LMM with a grounding model to handle grounding tasks. They also introduce Grounding-Bench, a benchmark for evaluating grounded visual chat performance, including chat and grounding aspects. |
LLaVA-Grounding outperforms other open-source LMMs in both chat and grounding tasks on Grounding-Bench.
LLaVA-Grounding achieves competitive results on classic grounding benchmarks like RefCOCO/+/g and Flickr30K Entities.
LLaVA-Grounding effectively supports various types of visual prompts, including marks, clicks, and boxes. |
LLaVA-Grounding has limitations in terms of semantic scope, as the training data is limited.
Future work could focus on extending the dataset and data labeling methods to open-vocabulary settings. |
visual grounding, visual chat, large multimodal models, benchmarking, visual prompts |
2312.02936
Report |
Drag-A-Video: Non-rigid Video Editing with Point-based Interaction |
Yao Teng, Enze Xie, Yue Wu, Haoyu Han, Zhenguo Li, Xihui Liu |
Video editing is a challenging task that requires manipulating videos on both
the spatial and temporal dimensions. Existing methods for video editing mainly
focus on changing the appearance or style of the objects in the video, while
keeping their structures unchanged. However, there is no existing method that
allows users to interactively "drag" any points of instances on the first
frame to precisely reach the target points with other frames consistently
deformed. In this paper, we propose a new diffusion-based method for
interactive point-based video manipulation, called Drag-A-Video. Our method
allows users to click pairs of handle points and target points as well as masks
on the first frame of an input video. Then, our method transforms the inputs
into point sets and propagates these sets across frames. To precisely modify
the contents of the video, we employ a new video-level motion supervision to
update the features of the video and introduce the latent offsets to achieve
this update at multiple denoising timesteps. We propose a temporal-consistent
point tracking module to coordinate the movement of the points in the handle
point sets. We demonstrate the effectiveness and flexibility of our method on
various videos. The website of our work is available here:
https://drag-a-video.github.io/. |
Introduces Drag-A-Video, the first point-based interactive non-rigid video editing system allowing users to drag points on the first frame to deform subsequent frames consistently. |
Existing video editing methods struggle with precise and fine-grained control over object structure and motion, particularly for non-rigid deformations. |
Employs a three-step process: 1) point set propagation of handle points, target points, and masks across frames, 2) latent optimization with video-level motion supervision to update diffusion latents across multiple timesteps, and 3) temporal-consistent point tracking to update handle point locations. |
Drag-A-Video enables dragging video content by manipulating handle points toward target points, effectively deforming object structures.
User study confirms Drag-A-Video surpasses the baseline in frame quality, temporal consistency, and handle point movement accuracy.
Ablation studies validate the importance of point sets, multi-timestep manipulation, and temporal consistency modules for robust and coherent video editing. |
2D point propagation can be impacted by occlusion and lacks depth information, limiting its effectiveness in complex scenes.
The framework's sensitivity to user input, particularly mask coordination with handle points, requires further investigation to enhance usability. |
video editing, diffusion models, point-based manipulation, non-rigid deformation, temporal consistency |
2312.02928
Report |
LivePhoto: Real Image Animation with Text-guided Motion Control |
Xi Chen, Zhiheng Liu, Mengting Chen, Yutong Feng, Yu Liu, Yujun Shen, Hengshuang Zhao |
Despite the recent progress in text-to-video generation, existing studies
usually overlook the issue that only spatial contents but not temporal motions
in synthesized videos are under the control of text. Towards such a challenge,
this work presents a practical system, named LivePhoto, which allows users to
animate an image of their interest with text descriptions. We first establish a
strong baseline that helps a well-learned text-to-image generator (i.e., Stable
Diffusion) take an image as a further input. We then equip the improved
generator with a motion module for temporal modeling and propose a carefully
designed training pipeline to better link texts and motions. In particular,
considering the facts that (1) text can only describe motions roughly (e.g.,
regardless of the moving speed) and (2) text may include both content and
motion descriptions, we introduce a motion intensity estimation module as well
as a text re-weighting module to reduce the ambiguity of text-to-motion
mapping. Empirical evidence suggests that our approach is capable of well
decoding motion-related textual instructions into videos, such as actions,
camera movements, or even conjuring new contents from thin air (e.g., pouring
water into an empty glass). Interestingly, thanks to the proposed intensity
learning mechanism, our system offers users an additional control signal (i.e.,
the motion intensity) besides text for video customization. |
This paper introduces LivePhoto, a novel text-driven image animation framework enabling users to animate real images using text descriptions, controlling actions, camera movements, and even generating new content. |
This work addresses the limitation of existing text-to-video generation methods that lack control over temporal motions, aiming to allow for flexible and user-friendly video customization through textual instructions. |
The authors build upon Stable Diffusion, enhancing it with (1) image content guidance for identity preservation, (2) motion intensity estimation for controlling motion speed and range, and (3) text re-weighting for prioritizing motion descriptions over potentially conflicting content descriptions. |
LivePhoto effectively animates real images from diverse domains, demonstrating strong adherence to textual instructions for motion control.
The introduction of motion intensity as a parameter allows users to fine-tune the speed and range of generated motions.
Text re-weighting successfully mitigates the influence of content descriptions within text prompts, preventing conflicts with the reference image and enhancing motion control. |
The current implementation is limited by the resolution of SD 1.5 (256x256).
Future work can explore higher resolutions and more powerful models like SD-XL to further improve performance. |
image animation, text-to-video generation, motion control, stable diffusion, content guidance |
2312.02919
Report |
Fine-grained Controllable Video Generation via Object Appearance and Context |
Hsin-Ping Huang, Yu-Chuan Su, Deqing Sun, Lu Jiang, Xuhui Jia, Yukun Zhu, Ming-Hsuan Yang |
Text-to-video generation has shown promising results. However, by taking only
natural languages as input, users often face difficulties in providing detailed
information to precisely control the model's output. In this work, we propose
fine-grained controllable video generation (FACTOR) to achieve detailed
control. Specifically, FACTOR aims to control objects' appearances and context,
including their location and category, in conjunction with the text prompt. To
achieve detailed control, we propose a unified framework to jointly inject
control signals into the existing text-to-video model. Our model consists of a
joint encoder and adaptive cross-attention layers. By optimizing the encoder
and the inserted layer, we adapt the model to generate videos that are aligned
with both text prompts and fine-grained control. Compared to existing methods
relying on dense control signals such as edge maps, we provide a more intuitive
and user-friendly interface to allow object-level fine-grained control. Our
method achieves controllability of object appearances without finetuning, which
reduces the per-subject optimization efforts for the users. Extensive
experiments on standard benchmark datasets and user-provided inputs validate
that our model obtains a 70% improvement in controllability metrics over
competitive baselines. |
This paper presents FACTOR, a framework for fine-grained controllable video generation that allows users to control object appearance and context (location, category) using intuitive inputs like hand-drawn trajectories and reference images. |
Current text-to-video generation models lack detailed controllability, often requiring dense control signals or per-subject finetuning. This work provides a more user-friendly and efficient approach for customized video generation. |
FACTOR adapts a pretrained text-to-video model by incorporating a joint encoder for text prompts and control signals, and adaptive cross-attention layers to inject fine-grained control into the generation process. The model is trained by freezing the pretrained weights and updating only the newly added layers. |
FACTOR achieves a 70% improvement in controllability metrics over baselines, demonstrating effective control over object trajectories and appearances.
The model exhibits the ability to generate complex videos with object-object and subject-object interactions, despite not being explicitly trained for this purpose.
User studies confirm the model's superior performance in visual quality, text alignment, and adherence to user-specified trajectories and appearances. |
The current implementation uses a single reference image for appearance control, potentially limiting the range of motion for live subjects. Exploring data augmentation techniques could alleviate this limitation.
The model may underperform when text prompts and control signals are misaligned. Future work could investigate strategies for better handling such inconsistencies. |
video generation, controllable generation, text-to-video, object appearance, trajectory control |
2312.02918
Report |
Multimodal Prompt Perceiver: Empower Adaptiveness, Generalizability and Fidelity for All-in-One Image Restoration |
Yuang Ai, Huaibo Huang, Xiaoqiang Zhou, Jiexiang Wang, Ran He |
Despite substantial progress, all-in-one image restoration (IR) grapples with
persistent challenges in handling intricate real-world degradations. This paper
introduces MPerceiver: a novel multimodal prompt learning approach that
harnesses Stable Diffusion (SD) priors to enhance adaptiveness,
generalizability and fidelity for all-in-one image restoration. Specifically,
we develop a dual-branch module to master two types of SD prompts: textual for
holistic representation and visual for multiscale detail representation. Both
prompts are dynamically adjusted by degradation predictions from the CLIP image
encoder, enabling adaptive responses to diverse unknown degradations. Moreover,
a plug-in detail refinement module improves restoration fidelity via direct
encoder-to-decoder information transformation. To assess our method, MPerceiver
is trained on 9 tasks for all-in-one IR and outperforms state-of-the-art
task-specific methods across most tasks. Post multitask pre-training,
MPerceiver attains a generalized representation in low-level vision, exhibiting
remarkable zero-shot and few-shot capabilities in unseen tasks. Extensive
experiments on 16 IR tasks underscore the superiority of MPerceiver in terms of
adaptiveness, generalizability and fidelity. |
This paper introduces MPerceiver, a multimodal prompt learning approach leveraging Stable Diffusion priors for enhanced adaptiveness, generalizability, and fidelity in all-in-one image restoration. |
Despite substantial progress in all-in-one image restoration, handling intricate real-world degradations remains a challenge, highlighting the need for more adaptive and generalizable solutions. |
MPerceiver uses a dual-branch module to learn textual and visual prompts dynamically adjusted by degradation predictions. It also utilizes a detail refinement module for enhanced fidelity. |
MPerceiver outperforms state-of-the-art task-specific methods on most tasks.
MPerceiver effectively handles challenging mixed degradations, common in real-world scenarios.
Pre-trained MPerceiver exhibits remarkable zero-shot and few-shot capabilities in unseen tasks, demonstrating strong generalization. |
MPerceiver currently focuses on single-image restoration tasks.
Future work will explore the potential of the proposed approach in video restoration. |
image restoration, stable diffusion, prompt learning, multimodal learning, low-level vision |
2312.02902
Report |
HeadGaS: Real-Time Animatable Head Avatars via 3D Gaussian Splatting |
Helisa Dhamo, Yinyu Nie, Arthur Moreau, Jifei Song, Richard Shaw, Yiren Zhou, Eduardo Pérez-Pellitero |
3D head animation has seen major quality and runtime improvements over the
last few years, particularly empowered by the advances in differentiable
rendering and neural radiance fields. Real-time rendering is a highly desirable
goal for real-world applications. We propose HeadGaS, the first model to use 3D
Gaussian Splats (3DGS) for 3D head reconstruction and animation. In this paper
we introduce a hybrid model that extends the explicit representation from 3DGS
with a base of learnable latent features, which can be linearly blended with
low-dimensional parameters from parametric head models to obtain
expression-dependent final color and opacity values. We demonstrate that
HeadGaS delivers state-of-the-art results in real-time inference frame rates,
which surpasses baselines by up to ~2 dB, while accelerating rendering speed by
over 10x. |
This paper proposes HeadGaS, the first model to use 3D Gaussian Splats (3DGS) for real-time 3D head reconstruction and animation. |
Real-time rendering of animatable 3D heads is essential for various applications like AR/VR and teleconferencing. Existing methods struggle to achieve both high realism and real-time performance. |
The method enhances 3DGS with a base of learnable latent features within each Gaussian. These features are blended using expression parameters from parametric head models to obtain expression-dependent color and opacity values. The model is trained on monocular videos with tracked head poses and expression weights. |
HeadGaS achieves state-of-the-art results on public datasets, outperforming baselines in visual quality by up to 2 dB (PSNR).
It significantly surpasses baselines in rendering speed, achieving real-time performance of over 100fps.
The method enables realistic novel view synthesis and cross-subject expression transfer. |
Performance depends on the accuracy of pre-computed head poses and expression weights.
Limited generalization to unseen expressions or viewpoints that are significantly different from the training data. |
3d head animation, 3d gaussian splatting, real-time rendering, differentiable rendering, neural radiance fields |
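A minimal sketch of the feature blending described in the HeadGaS entry above: each Gaussian carries a basis of latent feature vectors, which is linearly blended with expression weights and then decoded to color and opacity. The function names, the toy dimensions, and the single linear layer with a sigmoid used as the decoder are assumptions for illustration, not the authors' implementation.

```python
import math

def blend_features(basis, expression):
    """Linearly blend one Gaussian's latent feature basis with expression weights.

    basis: list of B feature vectors (each of length D), one per expression dimension.
    expression: list of B scalar weights, e.g. from a parametric head model.
    Returns a length-D blended feature vector.
    """
    if len(basis) != len(expression):
        raise ValueError("basis and expression must have the same length")
    dim = len(basis[0])
    return [sum(w * vec[d] for w, vec in zip(expression, basis)) for d in range(dim)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode(feature, weights, bias):
    """Hypothetical decoder: one linear layer plus sigmoid giving (rgb, opacity)."""
    outs = [sigmoid(sum(w * f for w, f in zip(row, feature)) + b)
            for row, b in zip(weights, bias)]
    return outs[:3], outs[3]

# Toy example: 2 expression dimensions, 3-dimensional latent features.
basis = [[1.0, 0.0, 2.0], [0.0, 1.0, -2.0]]
expression = [0.5, 0.5]
feature = blend_features(basis, expression)

# Assumed decoder parameters: three color rows and one opacity row.
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
bias = [0.0, 0.0, 0.0, -1.0]
color, opacity = decode(feature, weights, bias)
```

Because the blend is linear in the expression weights, changing the expression only rescales how much each basis vector contributes, which keeps the per-frame cost small.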
2312.02896
Report |
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models |
Rizhao Cai, Zirui Song, Dayan Guan, Zhenhao Chen, Xing Luo, Chenyu Yi, Alex Kot |
Large Multimodal Models (LMMs) such as GPT-4V and LLaVA have shown remarkable
capabilities in visual reasoning with common image styles. However, their
robustness against diverse style shifts, crucial for practical applications,
remains largely unexplored. In this paper, we propose a new benchmark,
BenchLMM, to assess the robustness of LMMs against three different styles:
artistic image style, imaging sensor style, and application style, where each
style has five sub-styles. Utilizing BenchLMM, we comprehensively evaluate
state-of-the-art LMMs and reveal: 1) LMMs generally suffer performance
degradation when working with other styles; 2) an LMM that performs better than
another model in the common style is not guaranteed to perform better in
other styles; 3) LMMs' reasoning capability can be enhanced by prompting LMMs
to predict the style first, based on which we propose a versatile and
training-free method for improving LMMs; 4) An intelligent LMM is expected to
interpret the causes of its errors when facing stylistic variations. We hope
that our benchmark and analysis can shed new light on developing more
intelligent and versatile LMMs. |
This paper introduces BenchLMM, a benchmark designed to evaluate the robustness of Large Multimodal Models (LMMs) against various visual style shifts. |
Existing LMM benchmarks primarily use common image styles, limiting the understanding of LMM performance across diverse artistic, sensor, and application-specific styles, crucial for real-world applications. |
BenchLMM leverages existing datasets with re-labeling for VQA across three style categories: artistic (Cartoon, Sketch, etc.), sensor (Infrared, X-ray, etc.), and application-specific (remote sensing, autonomous driving, etc.). The authors evaluate several state-of-the-art LMMs, including GPT-4V, on BenchLMM and propose a Style Prompt Enhancement (SPE) method. |
LMMs exhibit significant performance degradation when presented with images outside common styles.
Superior performance on common-style images doesn't guarantee similar performance on other styles, highlighting the need for comprehensive evaluation.
The proposed SPE method, prompting LMMs to predict image style before answering questions, shows consistent improvement across styles without fine-tuning. |
The study primarily focuses on accuracy, neglecting other aspects like computational efficiency and bias detection in LMMs.
Future work could explore fine-tuning LMMs on diverse styles and incorporate human feedback for error analysis and improvement. |
large multimodal models, visual reasoning, benchmarking, style transfer, domain adaptation |
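A minimal sketch of the Style Prompt Enhancement idea from the BenchLMM entry above: ask the model for the image style first, then answer the question with that style stated in the prompt. The `ask` callable, the prompt wording, and the stub model are hypothetical stand-ins for a real LMM call; the paper's exact prompts are not reproduced here.

```python
from typing import Callable

def style_prompt_enhanced_answer(ask: Callable[[str], str], question: str) -> str:
    """Two-stage query: predict the image style, then answer conditioned on it.

    ask: hypothetical function that sends a text prompt (with the image already
         attached) to an LMM and returns its text reply.
    """
    style = ask("What is the style of this image? Answer briefly.").strip()
    prompt = f"The image is in the following style: {style}. {question}"
    return ask(prompt)

# Stub model used only to make the sketch runnable without any real LMM.
calls = []

def fake_lmm(prompt: str) -> str:
    calls.append(prompt)
    if prompt.startswith("What is the style"):
        return " sketch "
    return "a cat"

answer = style_prompt_enhanced_answer(fake_lmm, "What animal is shown?")
```

No weights are updated in this flow, which matches the training-free nature of the method as described.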
2312.02772
Report |
FG-MDM: Towards Zero-Shot Human Motion Generation via Fine-Grained Descriptions |
Xu Shi, Wei Yao, Chuanchen Luo, Junran Peng, Hongwen Zhang, Yunlian Sun |
Recently, significant progress has been made in text-based motion generation,
enabling the generation of diverse and high-quality human motions that conform
to textual descriptions. However, generating motions beyond the distribution of
original datasets remains challenging, i.e., zero-shot generation. By adopting
a divide-and-conquer strategy, we propose a new framework named Fine-Grained
Human Motion Diffusion Model (FG-MDM) for zero-shot human motion generation.
Specifically, we first parse previous vague textual annotations into
fine-grained descriptions of different body parts by leveraging a large
language model. We then use these fine-grained descriptions to guide a
transformer-based diffusion model, which further adopts a design of part
tokens. FG-MDM can generate human motions beyond the scope of original datasets
owing to descriptions that are closer to motion essence. Our experimental
results demonstrate the superiority of FG-MDM over previous methods in
zero-shot settings. We will release our fine-grained textual annotations for
HumanML3D and KIT. |
The paper introduces FG-MDM, a novel framework for zero-shot human motion generation that leverages fine-grained descriptions of body parts to guide a diffusion model. |
Generating human motions beyond the distribution of existing datasets is challenging due to limited dataset size and diversity. Existing methods struggle to generalize to unseen motions. |
The authors use ChatGPT to paraphrase vague textual descriptions into detailed descriptions of individual body parts. These fine-grained descriptions, along with global text embeddings, guide a transformer-based diffusion model that uses part tokens for each body part. |
FG-MDM outperforms state-of-the-art methods in zero-shot motion generation on HuMMan and Kungfu datasets.
Qualitative results demonstrate FG-MDM's ability to generate motions consistent with fine-grained textual descriptions, including unseen and stylized motions.
A user study confirms the superior quality and text-matching capabilities of motions generated by FG-MDM. |
The quality of fine-grained text annotations can be further improved.
Better methods for incorporating fine-grained information into the diffusion model remain to be explored. |
human motion generation, zero-shot learning, diffusion models, large language models, fine-grained descriptions |
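A minimal sketch of the annotation step in the FG-MDM entry above, where a language model rewrites a coarse caption into per-body-part descriptions. The body-part list, the prompt wording, and the `part: description` reply format are assumptions for illustration; the canned reply stands in for a real LLM response.

```python
# Hypothetical body-part vocabulary; the paper's actual part set may differ.
BODY_PARTS = ["head", "arms", "torso", "legs"]

def build_prompt(caption: str) -> str:
    """Build an instruction asking for one description line per body part."""
    parts = ", ".join(BODY_PARTS)
    return (
        f"Describe the motion '{caption}' separately for each of these "
        f"body parts: {parts}. Use one line per part in the form "
        "'part: description'."
    )

def parse_reply(reply: str) -> dict:
    """Parse 'part: description' lines, ignoring unknown parts and malformed lines."""
    result = {}
    for line in reply.splitlines():
        name, sep, text = line.partition(":")
        key = name.strip().lower()
        if sep and key in BODY_PARTS:
            result[key] = text.strip()
    return result

prompt = build_prompt("a person waves hello")

# Canned reply standing in for the LLM output.
reply = "Head: faces forward\nArms: right arm raises and waves\nTail: ignored\nno colon here"
fine_grained = parse_reply(reply)
```

The parsed dictionary could then be encoded per part and passed to the motion model alongside the global caption embedding.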
2312.02703
Report |
MyPortrait: Morphable Prior-Guided Personalized Portrait Generation |
Bo Ding, Zhenfeng Fan, Shuang Yang, Shihong Xia |
Generating realistic talking faces is an interesting and long-standing topic
in the field of computer vision. Although significant progress has been made,
it is still challenging to generate high-quality dynamic faces with
personalized details. This is mainly due to the inability of the general model
to represent personalized details and the generalization problem to unseen
controllable parameters. In this work, we propose MyPortrait, a simple,
general, and flexible framework for neural portrait generation. We incorporate
personalized prior in a monocular video and morphable prior in 3D face
morphable space for generating personalized details under novel controllable
parameters. Our proposed framework supports both video-driven and audio-driven
face animation given a monocular video of a single person. Distinguished by
whether the test data is sent to training or not, our method provides a
real-time online version and a high-quality offline version. Comprehensive
experiments in various metrics demonstrate the superior performance of our
method over the state-of-the-art methods. The code will be publicly available. |
Presents MyPortrait, a novel prior-guided framework for neural portrait generation that leverages personalized prior from a monocular video and morphable prior from 3D face morphable space to generate high-quality dynamic faces with personalized details. |
Addresses the challenge of generating high-quality dynamic faces with personalized details due to limitations in representing these details and generalizing to unseen controllable parameters. |
Employs a two-stage training strategy: (1) Reconstruction Training on a monocular video to learn personalized prior. (2) Scalable Training incorporating morphable prior from auxiliary data to extend the face parameter space and improve generalization to novel parameters. |
Achieves superior performance in self-reenactment experiments, evidenced by lower L1 distance, LPIPS, and FID compared to state-of-the-art methods.
Outperforms existing methods in cross-reenactment experiments, demonstrating improved CSIM and FID, particularly in the offline version where driven data is included in training.
Shows promising results in audio-driven reenactment, indicating the validity of the morphable prior in enhancing results for this application. |
Limited to monocular videos with fixed backgrounds due to the reduction of 3D to 2D scenes, potentially addressed by incorporating face segmentation methods.
Performance reliant on the accuracy of face parameters extracted by face trackers, with potential for improvement as face tracking technology advances. |
neural portrait generation, talking face generation, personalized prior, morphable prior, 3d face morphable model |
2312.02663
Report |
FaceStudio: Put Your Face Everywhere in Seconds |
Yuxuan Yan, Chi Zhang, Rui Wang, Yichao Zhou, Gege Zhang, Pei Cheng, Gang Yu, Bin Fu |
This study investigates identity-preserving image synthesis, an intriguing
task in image generation that seeks to maintain a subject's identity while
adding a personalized, stylistic touch. Traditional methods, such as Textual
Inversion and DreamBooth, have made strides in custom image creation, but they
come with significant drawbacks. These include the need for extensive resources
and time for fine-tuning, as well as the requirement for multiple reference
images. To overcome these challenges, our research introduces a novel approach
to identity-preserving synthesis, with a particular focus on human images. Our
model leverages a direct feed-forward mechanism, circumventing the need for
intensive fine-tuning, thereby facilitating quick and efficient image
generation. Central to our innovation is a hybrid guidance framework, which
combines stylized images, facial images, and textual prompts to guide the image
generation process. This unique combination enables our model to produce a
variety of applications, such as artistic portraits and identity-blended
images. Our experimental results, including both qualitative and quantitative
evaluations, demonstrate the superiority of our method over existing baseline
models and previous works, particularly in its remarkable efficiency and
ability to preserve the subject's identity with high fidelity. |
This paper introduces a novel, tuning-free framework for identity-preserving image synthesis, focusing on human images. This framework leverages a hybrid guidance module that integrates stylized images, facial images, and textual prompts to guide image generation while preserving identity. |
Existing text-to-image diffusion models face challenges in capturing nuanced details, like human facial features, relying solely on textual descriptions. Existing methods for identity-preserving synthesis often require resource-intensive fine-tuning and multiple reference images. |
The method uses a direct feed-forward mechanism, eliminating the need for fine-tuning. It employs a hybrid guidance module combining textual prompts, style images, and identity images to guide the image generation process of a latent diffusion model. For images with multiple identities, a multi-identity cross-attention mechanism maps guidance details to specific human segments. |
The proposed model outperforms existing methods in both qualitative and quantitative evaluations, particularly in efficiency and identity preservation.
The model effectively synthesizes images with large pose changes while maintaining identity, showcasing its robustness.
The multi-identity cross-attention mechanism enables the generation of multi-human images with distinct identities, surpassing baselines using vanilla cross-attention. |
The model is specifically tailored for human images, limiting its application to other subjects like animals or objects.
The ability to generate realistic human images raises concerns regarding intellectual property and potential misuse in creating offensive content. |
image synthesis, identity preservation, diffusion models, hybrid guidance, multi-identity generation |
2312.02625
Report |
Diffusion Noise Feature: Accurate and Fast Generated Image Detection |
Yichi Zhang, Xiaogang Xu |
Generative models have reached an advanced stage where they can produce
remarkably realistic images. However, this remarkable generative capability
also introduces the risk of disseminating false or misleading information.
Notably, existing image detectors for generated images encounter challenges
such as low accuracy and limited generalization. This paper seeks to address
this issue by seeking a representation with strong generalization capabilities
to enhance the detection of generated images. Our investigation has revealed
that real and generated images display distinct latent Gaussian representations
when subjected to an inverse diffusion process within a pre-trained diffusion
model. Exploiting this disparity, we can amplify subtle artifacts in generated
images. Building upon this insight, we introduce a novel image representation
known as Diffusion Noise Feature (DNF). DNF is extracted from the estimated
noise generated during the inverse diffusion process. A simple classifier,
e.g., ResNet50, trained on DNF achieves high accuracy, robustness, and
generalization capabilities for detecting generated images (even when the
corresponding generator is built with datasets/structures that are not seen
during the classifier's training). We conducted experiments using four training
datasets and five testsets, achieving state-of-the-art detection performance. |
This paper proposes Diffusion Noise Feature (DNF), a novel image representation for detecting generated images by leveraging the distinct latent Gaussian representations of real and generated images during the inverse diffusion process in a pre-trained diffusion model. |
Existing generated image detectors face limitations in accuracy and generalization, particularly with the increasing realism of images from state-of-the-art generative models. This necessitates a novel representation with enhanced generalization capabilities to distinguish real and generated images effectively. |
DNF is extracted by inputting an image into a pre-trained diffusion model, executing the inverse diffusion process, and collecting the estimated noise generated at each step. A fusion strategy, determined experimentally, combines these noise estimations to obtain the final DNF representation. |
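The extraction loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_noise_predictor` is a hypothetical stand-in for the pre-trained diffusion model, the update rule is a deterministic DDIM-style inversion step, the image is a flat list of floats, and `fuse_mean` is only one possible fusion strategy (the source says the actual strategy was chosen experimentally and does not name it here).

```python
import math

def toy_noise_predictor(x, t):
    # Hypothetical stand-in for a pre-trained epsilon-prediction network.
    return [0.1 * v for v in x]

def collect_inversion_noise(x0, alpha_bars, predictor=toy_noise_predictor):
    """Run DDIM-style inversion and return the noise estimated at every step."""
    x = list(x0)
    noises = []
    for t in range(len(alpha_bars) - 1):
        a_t, a_next = alpha_bars[t], alpha_bars[t + 1]
        eps = predictor(x, t)
        noises.append(eps)
        # Predict the clean sample, then move one step toward higher noise.
        x0_pred = [(xi - math.sqrt(1 - a_t) * e) / math.sqrt(a_t)
                   for xi, e in zip(x, eps)]
        x = [math.sqrt(a_next) * p + math.sqrt(1 - a_next) * e
             for p, e in zip(x0_pred, eps)]
    return noises

def fuse_mean(noises):
    # Assumed fusion strategy: element-wise mean over all inversion steps.
    n = len(noises)
    return [sum(col) / n for col in zip(*noises)]

# Cumulative alpha products, decreasing as noise increases (illustrative values).
alpha_bars = [0.999, 0.9, 0.7, 0.4, 0.1]
noises = collect_inversion_noise([0.5, -0.25, 1.0], alpha_bars)
dnf = fuse_mean(noises)
```

The fused list `dnf` has the same shape as the input, so in a real pipeline it could be fed to an ordinary image classifier in place of the pixels.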
The DNF classifier achieved state-of-the-art detection performance, significantly outperforming existing methods with a perfect 100% accuracy and precision on DiffusionForensics.
DNF exhibited exceptional robustness against common image perturbations like Gaussian blur and JPEG compression, maintaining over 99.2% accuracy.
The DNF classifier demonstrated strong cross-dataset and cross-generator generalization capabilities, accurately detecting images from unseen datasets and generators, including those based on different generative principles (e.g., Diffusion Models vs. GANs). |
The effectiveness of different fusion strategies for combining the estimated noise sequence needs further investigation to optimize DNF computation for diverse scenarios and challenging detection tasks.
Future research will focus on developing novel detection models specifically tailored for DNF to address the evolving capabilities of emerging generative models like Stable Diffusion v3 and Sora. |
generated image detection, diffusion models, feature engineering, generalization capability, robustness |
2312.02617
Report |
DreaMo: Articulated 3D Reconstruction From A Single Casual Video |
Tao Tu, Ming-Feng Li, Chieh Hubert Lin, Yen-Chi Cheng, Min Sun, Ming-Hsuan Yang |
Articulated 3D reconstruction has valuable applications in various domains,
yet it remains costly and demands intensive work from domain experts. Recent
advancements in template-free learning methods show promising results with
monocular videos. Nevertheless, these approaches necessitate a comprehensive
coverage of all viewpoints of the subject in the input video, thus limiting
their applicability to casually captured videos from online sources. In this
work, we study articulated 3D shape reconstruction from a single and casually
captured internet video, where the subject's view coverage is incomplete. We
propose DreaMo that jointly performs shape reconstruction while solving the
challenging low-coverage regions with view-conditioned diffusion prior and
several tailored regularizations. In addition, we introduce a skeleton
generation strategy to create human-interpretable skeletons from the learned
neural bones and skinning weights. We conduct our study on a self-collected
internet video collection characterized by incomplete view coverage. DreaMo
shows promising quality in novel-view rendering, detailed articulated shape
reconstruction, and skeleton generation. Extensive qualitative and quantitative
studies validate the efficacy of each proposed component, and show existing
methods are unable to solve correct geometry due to the incomplete view
coverage. |
This paper introduces DreaMo, a novel template-free framework designed to reconstruct articulated 3D models from single, casually captured videos with incomplete view coverage. |
Reconstructing 3D models from casual videos, which often lack comprehensive viewpoint coverage, is crucial for various applications but challenging for existing methods. |
DreaMo utilizes a neural implicit function to learn a rest-pose 3D model and employs a view-conditioned diffusion model to hallucinate plausible geometry in unseen or low-coverage regions. It further introduces regularization techniques to refine neural bone placement and enhance reconstruction quality. |
DreaMo outperforms state-of-the-art methods in reconstructing detailed 3D shapes with plausible textures from videos with limited viewpoints.
The proposed regularization schemes are shown to effectively improve the placement of neural bones, leading to more intuitive skeletons and fewer geometric artifacts.
DreaMo supports user control, enabling the manipulation of reconstructed models into novel poses by adjusting the generated skeletons. |
DreaMo, as a structure-from-motion method, requires a certain level of camera baseline and struggles with videos lacking sufficient viewpoint diversity.
The hallucination of bones and articulations in entirely unseen regions remains a challenge, as the model relies on observing real-world motions to learn these features. |
3d reconstruction, articulated shape reconstruction, diffusion models, view synthesis, skeleton generation |
2312.02548
Report |
GeNIe: Generative Hard Negative Images Through Diffusion |
Soroush Abbasi Koohpayegani, Anuj Singh, K L Navaneet, Hadi Jamali-Rad, Hamed Pirsiavash |
Data augmentation is crucial in training deep models, preventing them from
overfitting to limited data. Recent advances in generative AI, e.g., diffusion
models, have enabled more sophisticated augmentation techniques that produce
data resembling natural images. We introduce GeNIe, a novel augmentation method
which leverages a latent diffusion model conditioned on a text prompt to merge
contrasting data points (an image from the source category and a text prompt
from the target category) to generate challenging samples. To achieve this,
inspired by recent diffusion based image editing techniques, we limit the
number of diffusion iterations to ensure the generated image retains low-level
and background features from the source image while representing the target
category, resulting in a hard negative sample for the source category. We
further enhance the proposed approach by finding the appropriate noise level
adaptively for each image (coined as GeNIe-Ada) leading to further performance
improvement. Our extensive experiments, in both few-shot and long-tail
distribution settings, demonstrate the effectiveness of our novel augmentation
method and its superior performance over the prior art. Our code is available
here: https://github.com/UCDvision/GeNIe |
GeNIe is a novel data augmentation method that leverages a text-prompted latent diffusion model to generate challenging (hard negative) samples. It achieves this by merging contrasting data points: an image from the source category and a text prompt from the target category. |
GeNIe addresses challenges in training deep models with limited data, particularly in few-shot and long-tailed learning scenarios where model generalization and robustness are crucial. It is also helpful in mitigating the effect of spurious correlations in datasets. |
GeNIe employs a two-step process: (1) It partially adds noise to the latent representation of a source image. (2) It leverages a text-conditioned diffusion model, prompted with the target category, to generate a new image that semantically aligns with the target category while preserving low-level features from the source image. An adaptive noise level selection strategy (GeNIe-Ada) is further proposed to automatically determine the optimal noise level for each image. |
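A rough sketch of the partial-noising step, under stated assumptions: the latent is a flat list of floats, `noise_ratio` is a hypothetical parameter in [0, 1] that controls how far into the noise schedule the source latent is pushed, and the text-conditioned denoiser itself is omitted. Only the bookkeeping is shown, namely which denoising steps would run and how the latent is partially noised.

```python
import math
import random

def steps_to_run(total_steps, noise_ratio):
    """Denoising step indices (noisiest first) for a partially noised input."""
    if not 0.0 <= noise_ratio <= 1.0:
        raise ValueError("noise_ratio must lie in [0, 1]")
    start = int(round(total_steps * noise_ratio))
    return list(range(start - 1, -1, -1))

def add_partial_noise(latent, alpha_bar, rng):
    """Standard forward-diffusion mix of the source latent and Gaussian noise."""
    return [math.sqrt(alpha_bar) * z + math.sqrt(1.0 - alpha_bar) * rng.gauss(0.0, 1.0)
            for z in latent]

rng = random.Random(0)
source_latent = [0.2, -1.0, 0.7]
steps = steps_to_run(total_steps=50, noise_ratio=0.8)
noisy_latent = add_partial_noise(source_latent, alpha_bar=0.3, rng=rng)
```

A smaller `noise_ratio` leaves fewer denoising steps and so keeps more of the source image; a larger one gives the target-category prompt more room to change the content.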
GeNIe consistently improves the performance of few-shot image classification on mini-ImageNet and tiered-ImageNet, surpassing other state-of-the-art methods and data augmentation techniques.
In long-tailed classification on ImageNet-LT, GeNIe leads to a significant performance boost, particularly for categories with limited samples, demonstrating its effectiveness in addressing data imbalance.
For fine-grained few-shot classification, GeNIe consistently outperforms other text-based augmentation methods across various datasets like CUB200, Cars196, Food101, and FGVC-Aircraft. |
The augmentation process in GeNIe is slower than traditional methods due to the time required for the diffusion process, making it less suitable for online augmentation settings.
GeNIe might face challenges with datasets where images significantly deviate from the generative model's training distribution or with unfamiliar category names, requiring potential fine-tuning of the model. |
data augmentation, diffusion models, few-shot learning, long-tailed classification, hard negative mining |
2312.02503
Report |
SAVE: Protagonist Diversification with Structure Agnostic Video Editing |
Yeji Song, Wonsik Shin, Junsoo Lee, Jeesoo Kim, Nojun Kwak |
Driven by the rapid progress in text-to-image (T2I) generation models,
text-to-video (T2V) generation has experienced a significant advance as well.
Accordingly, tasks such as modifying the object or changing the style in a
video have become possible. However, previous works usually work well on trivial
and consistent shapes, and easily collapse on a difficult target that has a
largely different body shape from the original one. In this paper, we spot the
bias problem in the existing video editing method that restricts the range of
choices for the new protagonist and attempt to address this issue using the
conventional image-level personalization method. We adopt motion
personalization that isolates the motion from a single source video and then
modifies the protagonist accordingly. To deal with the natural discrepancy
between image and video, we propose a motion word with an inflated textual
embedding to properly represent the motion in a source video. We also regulate
the motion word to attend to proper motion-related areas by introducing a novel
pseudo optical flow, efficiently computed from the pre-calculated attention
maps. Finally, we decouple the motion from the appearance of the source video
with an additional pseudo word. Extensive experiments demonstrate the editing
capability of our method, taking a step toward more diverse and extensive video
editing. |
This paper introduces SAVE, a novel single-shot video editing method that enables protagonist diversification while preserving the motion of the original subject, even with substantial changes in body structure. |
Existing video editing methods struggle to maintain motion fidelity when replacing protagonists with objects of significantly different shapes, limiting their flexibility and diversity. |
SAVE employs a motion personalization approach with a new motion word (S_mot) that captures the specific motion in a source video. This word utilizes expanded text embeddings with temporal information and is trained with a motion-aware cross-attention loss based on a novel pseudo optical flow. Additionally, pre-registration of the protagonist's appearance (S_pro) disentangles motion from appearance during training. |
SAVE successfully edits protagonists with diverse structures while maintaining the original motion, outperforming existing methods in qualitative comparisons.
Quantitative evaluation shows SAVE achieves superior performance in text alignment, frame consistency, and user preference.
Ablation studies confirm the contribution of each proposed component, including expanded text embeddings, cross-attention regularization, and pre-registration of the protagonist. |
The current method is limited to single protagonist motions and struggles with multiple protagonists.
Future work will focus on expanding to broader motion types, including background and camera movements. |
video editing, motion personalization, text-to-video generation, diffusion models, protagonist diversification |
2312.02432
Report |
Orthogonal Adaptation for Modular Customization of Diffusion Models |
Ryan Po, Guandao Yang, Kfir Aberman, Gordon Wetzstein |
Customization techniques for text-to-image models have paved the way for a
wide range of previously unattainable applications, enabling the generation of
specific concepts across diverse contexts and styles. While existing methods
facilitate high-fidelity customization for individual concepts or a limited,
pre-defined set of them, they fall short of achieving scalability, where a
single model can seamlessly render countless concepts. In this paper, we
address a new problem called Modular Customization, with the goal of
efficiently merging customized models that were fine-tuned independently for
individual concepts. This allows the merged model to jointly synthesize
concepts in one image without compromising fidelity or incurring any additional
computational costs.
To address this problem, we introduce Orthogonal Adaptation, a method
designed to encourage the customized models, which do not have access to each
other during fine-tuning, to have orthogonal residual weights. This ensures
that during inference time, the customized models can be summed with minimal
interference.
Our proposed method is both simple and versatile, applicable to nearly all
optimizable weights in the model architecture. Through an extensive set of
quantitative and qualitative evaluations, our method consistently outperforms
relevant baselines in terms of efficiency and identity preservation,
demonstrating a significant leap toward scalable customization of diffusion
models. |
This paper introduces Orthogonal Adaptation, a novel method for modular customization of text-to-image diffusion models, enabling efficient merging of independently fine-tuned models for multi-concept image synthesis. |
Existing methods struggle with scalability in multi-concept customization, exhibiting degradation in concept quality when merged or requiring computationally expensive joint training. |
Orthogonal Adaptation encourages orthogonal residual weights during independent concept fine-tuning, minimizing interference and preserving identity during merging through simple summation. |
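To see why orthogonal residuals can be summed with little interference, consider the toy example below. It assumes a LoRA-style factorization in which each concept's residual is `A_i @ B_i` and the down-projections `B_i` have mutually orthogonal rows; this factorization and every matrix value are illustrative assumptions, not details taken from the summary above.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def add(a, b):
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(a, b)]

# Down-projections with mutually orthogonal rows (assumed, fixed per concept).
B1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
B2 = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
# Up-projections, standing in for independently fine-tuned weights.
A1 = [[0.5, -1.0], [2.0, 0.3], [0.0, 1.5]]
A2 = [[1.2, 0.7], [-0.4, 0.9], [3.0, -2.0]]

dW1 = matmul(A1, B1)        # residual for concept 1, shape 3x4
dW2 = matmul(A2, B2)        # residual for concept 2, shape 3x4
merged = add(dW1, dW2)      # merging is plain summation

# An input lying in concept 1's subspace is untouched by concept 2's residual.
x1 = [[1.0], [2.0], [0.0], [0.0]]
out_merged = matmul(merged, x1)
out_single = matmul(dW1, x1)
cross = matmul(dW1, transpose(dW2))  # all zeros when the residuals are orthogonal
```

Here `out_merged` matches `out_single`, which is the behavior one would hope for when merging: each concept's response is preserved after summation.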
Orthogonal Adaptation maintains high fidelity in single-concept generations from merged models, outperforming baselines in identity preservation.
It enables efficient merging of multiple concepts, exhibiting superior identity alignment compared to baselines, even with a large number of merged concepts.
Quantitative evaluations demonstrate superior image and identity alignment scores while maintaining comparable text alignment to state-of-the-art methods. |
Generating images with complex compositions and interactions between multiple custom concepts remains challenging.
The method currently requires modification of the fine-tuning process and cannot be applied post-hoc to existing fine-tuned models. |
diffusion models, text-to-image synthesis, model customization, orthogonal adaptation, multi-concept generation |
2312.02420
Report |
Towards Granularity-adjusted Pixel-level Semantic Annotation |
Rohit Kundu, Sudipta Paul, Rohit Lal, Amit K. Roy-Chowdhury |
Recent advancements in computer vision predominantly rely on learning-based
systems, leveraging annotations as the driving force to develop specialized
models. However, annotating pixel-level information, particularly in semantic
segmentation, presents a challenging and labor-intensive task, prompting the
need for autonomous processes. In this work, we propose GranSAM which
distinguishes itself by providing semantic segmentation at the user-defined
granularity level on unlabeled data without the need for any manual
supervision, offering a unique contribution in the realm of semantic mask
annotation methods. Specifically, we propose an approach to enable the Segment
Anything Model (SAM) with semantic recognition capability to generate
pixel-level annotations for images without any manual supervision. For this, we
accumulate semantic information from synthetic images generated by the Stable
Diffusion model or web crawled images and employ this data to learn a mapping
function between SAM mask embeddings and object class labels. As a result, SAM,
enabled with granularity-adjusted mask recognition, can be used for pixel-level
semantic annotation purposes. We conducted experiments on the PASCAL VOC 2012
and COCO-80 datasets and observed a +17.95% and +5.17% increase in mIoU,
respectively, compared to existing state-of-the-art methods when evaluated
under our problem setting. |
This paper introduces GranSAM, a novel semantic segmentation based annotation framework that generates pixel-level annotations and semantic masks without requiring any manually labeled images or human interaction. |
Annotating pixel-level information for semantic segmentation is labor-intensive and expensive. GranSAM offers an autonomous solution, enhancing efficiency and reducing costs in developing specialized models for computer vision. |
The framework utilizes the Segment Anything Model (SAM) for region distinction and leverages synthetic images (generated via Stable Diffusion) or web crawled images to guide SAM's semantic understanding. A classifier head is trained on SAM's mask embeddings using a weakly-supervised multiple instance learning setup with uncertainty distillation to map masks to user-defined object classes. |
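The weakly-supervised multiple instance learning setup can be illustrated with a small sketch. It assumes max-pooling over per-mask scores, which is one common MIL formulation and not necessarily the one used by GranSAM; the logits are made-up numbers standing in for a classifier head applied to SAM mask embeddings, and uncertainty distillation is left out.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bag_probability(instance_logits):
    """Max-pooling MIL: the image (bag) score is that of its most confident mask."""
    return max(sigmoid(z) for z in instance_logits)

def bag_bce(instance_logits, label):
    """Binary cross-entropy between the bag score and the image-level label."""
    p = min(max(bag_probability(instance_logits), 1e-7), 1.0 - 1e-7)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

# Hypothetical per-mask logits for one class in a single-object image.
mask_logits = [-2.0, 0.0, 3.0]
p_bag = bag_probability(mask_logits)
loss_if_present = bag_bce(mask_logits, 1)
loss_if_absent = bag_bce(mask_logits, 0)
```

Only the image-level label is needed for the loss, which is what lets the classifier head learn per-mask labels without pixel-level supervision.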
GranSAM achieves competitive performance compared to existing unsupervised semantic segmentation methods on PASCAL VOC and COCO-80 datasets despite being trained on a small set of synthetic or web crawled single-object images.
The framework demonstrates superior performance over state-of-the-art unsupervised methods when tested on unseen data distributions, highlighting its generalization capabilities.
Uncertainty distillation during training significantly improves the model's discriminative ability, especially on the challenging COCO-80 dataset. |
The performance of GranSAM on COCO-80, while exceeding baselines, highlights the challenges posed by complex datasets with diverse object classes and scenes.
Further research can explore the incorporation of techniques like few-shot learning to further enhance the model's ability to generalize from limited training data and refine segmentation accuracy. |
semantic segmentation, automatic annotation, segment anything model, unsupervised learning, weakly supervised learning |
2312.02362
Report |
PointNeRF++: A multi-scale, point-based Neural Radiance Field |
Weiwei Sun, Eduard Trulls, Yang-Che Tseng, Sneha Sambandam, Gopal Sharma, Andrea Tagliasacchi, Kwang Moo Yi |
Point clouds offer an attractive source of information to complement images
in neural scene representations, especially when few images are available.
Neural rendering methods based on point clouds do exist, but they do not
perform well when the point cloud quality is low -- e.g., sparse or incomplete,
which is often the case with real-world data. We overcome these problems with a
simple representation that aggregates point clouds at multiple scale levels
with sparse voxel grids at different resolutions. To deal with point cloud
sparsity, we average across multiple scale levels -- but only among those that
are valid, i.e., that have enough neighboring points in proximity to the ray of
a pixel. To help model areas without points, we add a global voxel at the
coarsest scale, thus unifying "classical" and point-based NeRF formulations.
We validate our method on the NeRF Synthetic, ScanNet, and KITTI-360 datasets,
outperforming the state of the art, with a significant gap compared to other
NeRF-based methods, especially on more challenging scenes. |
This paper introduces PointNeRF++, a novel multi-scale, point-based neural radiance field representation for improved novel view synthesis, especially in challenging scenarios with sparse or incomplete point clouds. |
Existing point cloud-based neural rendering methods struggle with low-quality, sparse, or incomplete point clouds often encountered in real-world data. |
The method aggregates point clouds at multiple scale levels with sparse voxel grids, averaging features only across valid scales with sufficient neighboring points. It also incorporates a global voxel at the coarsest scale to model areas without points, unifying classic and point-based NeRF formulations. A tri-plane representation is utilized for coarser scales to effectively cover larger support regions. |
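The validity-aware averaging can be sketched as below. The threshold `min_neighbors`, the feature values, and the choice to treat the global voxel as an always-valid extra level in the average are assumptions made for illustration; the summary does not specify these details.

```python
def aggregate_scales(features, neighbor_counts, min_neighbors, global_feature):
    """Average per-scale features over valid scales plus the global voxel.

    A scale is valid when enough points lie near the query; the global voxel
    is treated as always valid (assumption), so empty regions still get a feature.
    """
    valid = [f for f, n in zip(features, neighbor_counts) if n >= min_neighbors]
    valid.append(list(global_feature))
    return [sum(col) / len(valid) for col in zip(*valid)]

# Three scales, fine to coarse; the finest has no nearby points.
per_scale = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]]
counts = [0, 4, 9]
mixed = aggregate_scales(per_scale, counts, min_neighbors=3, global_feature=[1.0, 1.0])
empty = aggregate_scales(per_scale, [0, 0, 0], min_neighbors=3, global_feature=[1.0, 1.0])
```

In a region with no points at any scale, the result falls back to the global feature alone, which is how this sketch mirrors the claim that classic and point-based formulations are unified.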
Significantly outperforms state-of-the-art methods on the NeRF Synthetic, ScanNet, and KITTI-360 datasets.
Demonstrates superior performance, especially in handling sparse or incomplete point clouds, compared to PointNeRF and other baselines.
Shows the effectiveness of multi-scale representation and the global voxel in capturing scene details and filling in gaps in point clouds. |
Computational efficiency is bottlenecked by the classic NeRF backbone.
Future work includes exploring the combination of the multi-scale strategy with computationally efficient methods like 3D Gaussian Splatting. |
neural radiance fields, point clouds, multi-scale representation, novel view synthesis, 3d scene reconstruction |
2312.02319
Report |
Kernel Diffusion: An Alternate Approach to Blind Deconvolution |
Yash Sanghvi, Yiheng Chi, Stanley H. Chan |
Blind deconvolution problems are severely ill-posed because neither the
underlying signal nor the forward operator is known exactly.
Conventionally, these problems are solved by alternating between estimation of
the image and kernel while keeping the other fixed. In this paper, we show that
this framework is flawed because of its tendency to get trapped in local minima
and, instead, suggest the use of a kernel estimation strategy with a non-blind
solver. This framework is employed by a diffusion method which is trained to
sample the blur kernel from the conditional distribution with guidance from a
pre-trained non-blind solver. The proposed diffusion method leads to
state-of-the-art results on both synthetic and real blur datasets. |
This paper introduces Kernel-Diff, a novel diffusion-based blind deconvolution method that prioritizes kernel estimation over the conventional alternating minimization approach. |
Alternating minimization for blind deconvolution is prone to local minima. Directly estimating the kernel using a marginalization approach is more robust but computationally challenging. Kernel-Diff addresses this challenge using a diffusion model guided by a non-blind solver. |
Kernel-Diff employs a diffusion model trained to sample blur kernels from the conditional distribution p(k|y), effectively approximating the marginalization of the image space. A differentiable non-blind solver guides the diffusion process, minimizing the reblurring loss and ensuring a plausible kernel estimate. |
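The reblurring loss that guides the kernel estimate can be illustrated in one dimension. This is a simplified sketch under several assumptions: signals are 1D lists, blur is a zero-padded sliding-window correlation (identical to convolution for the symmetric kernels used here), and the image estimate is passed in directly instead of coming from a non-blind solver.

```python
def blur_same(signal, kernel):
    """Zero-padded, same-length sliding-window blur with an odd-length kernel."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, k in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(signal):
                acc += k * signal[idx]
        out.append(acc)
    return out

def reblur_loss(blurred, kernel, image_estimate):
    """Squared error between the observation and the re-blurred image estimate."""
    reblurred = blur_same(image_estimate, kernel)
    return sum((b - r) ** 2 for b, r in zip(blurred, reblurred))

sharp = [0.0, 0.0, 1.0, 0.0, 0.0, 2.0, 0.0]
true_kernel = [0.25, 0.5, 0.25]
observed = blur_same(sharp, true_kernel)

loss_true = reblur_loss(observed, true_kernel, sharp)
loss_wrong = reblur_loss(observed, [0.0, 1.0, 0.0], sharp)
```

A kernel candidate that explains the observation drives this loss toward zero, while a mismatched kernel (here, the identity) leaves a residual.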
Kernel-Diff achieves state-of-the-art performance on both synthetic (BSD100) and real (RealBlur-50) blur datasets, outperforming existing methods in PSNR, SSIM, LPIPS and FID.
Ablation study demonstrates the crucial role of the non-blind solver guidance in achieving superior performance.
Analysis of the reblurring loss during diffusion confirms that the proposed kernel estimation strategy converges to a better local minimum compared to alternating minimization methods. |
The current implementation assumes spatially invariant blur, limiting its applicability to more general scenarios.
Future work can explore better approximations of the image space marginalization or incorporate robustness to kernel inaccuracies in the non-blind solver. |
blind deconvolution, diffusion models, kernel estimation, non-blind solver, image restoration |
2312.02284
Report |
PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation |
Zhenyu Li, Shariq Farooq Bhat, Peter Wonka |
Single image depth estimation is a foundational task in computer vision and
generative modeling. However, prevailing depth estimation models grapple with
accommodating the increasing resolutions commonplace in today's consumer
cameras and devices. Existing high-resolution strategies show promise, but they
often face limitations, ranging from error propagation to the loss of
high-frequency details. We present PatchFusion, a novel tile-based framework
with three key components to improve the current state of the art: (1) A
patch-wise fusion network that fuses a globally-consistent coarse prediction
with finer, inconsistent tiled predictions via high-level feature guidance, (2)
A Global-to-Local (G2L) module that adds vital context to the fusion network,
discarding the need for patch selection heuristics, and (3) A Consistency-Aware
Training (CAT) and Inference (CAI) approach, emphasizing patch overlap
consistency and thereby eradicating the necessity for post-processing.
Experiments on UnrealStereo4K, MVS-Synth, and Middlebury 2014 demonstrate that
our framework can generate high-resolution depth maps with intricate details.
PatchFusion is independent of the base model for depth estimation. Notably, our
framework built on top of SOTA ZoeDepth brings improvements for a total of
17.3% and 29.4% in terms of the root mean squared error (RMSE) on
UnrealStereo4K and MVS-Synth, respectively. |
This paper introduces PatchFusion, a novel tile-based framework for high-resolution monocular metric depth estimation that surpasses input resolution limitations of existing depth estimation models. |
Existing depth estimation models struggle with high-resolution images common in modern devices. PatchFusion addresses this by enabling the use of pre-trained models on high-resolution inputs without sacrificing accuracy or efficiency. |
PatchFusion uses three steps: (1) global scale-aware coarse depth estimation, (2) local fine-depth estimation on image patches, (3) fusion of coarse and fine predictions using a guided fusion network with a global-to-local module. It also employs consistency-aware training and inference for patch coherence. |
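The idea of enforcing consistency where patches overlap can be sketched with a simple penalty. This is an illustrative 1D version with hypothetical depth values; the actual consistency-aware training and inference procedure may differ in where and how the constraint is applied.

```python
def overlap_consistency(pred_a, start_a, pred_b, start_b):
    """Mean squared difference between two patch predictions on their overlap.

    Each patch is a 1D list of depths; start_* is its offset in the full image.
    Returns 0.0 when the patches do not overlap.
    """
    lo = max(start_a, start_b)
    hi = min(start_a + len(pred_a), start_b + len(pred_b))
    if hi <= lo:
        return 0.0
    diffs = [(pred_a[i - start_a] - pred_b[i - start_b]) ** 2 for i in range(lo, hi)]
    return sum(diffs) / len(diffs)

# Patch A covers positions 0..5 and patch B covers 4..9, so they share 4 and 5.
patch_a = [1.0, 1.0, 1.0, 1.0, 2.0, 3.0]
patch_b = [2.0, 5.0, 0.0, 0.0, 0.0, 0.0]
penalty = overlap_consistency(patch_a, 0, patch_b, 4)
no_overlap = overlap_consistency(patch_a, 0, patch_b, 10)
```

Adding such a term to the training objective penalizes neighboring tiles that disagree on shared pixels, which is one way to avoid seams without a separate blending step.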
PatchFusion outperforms previous state-of-the-art methods on UnrealStereo4K and MVS-Synth datasets, showing significant improvements in RMSE, REL, and boundary delineation.
The framework generalizes well to real-world images, as demonstrated on the Middlebury 2014 dataset in a zero-shot transfer setting.
Ablation studies confirm the contribution of each component, particularly the effectiveness of the guided fusion network with the G2L module and consistency-aware training and inference. |
The computational efficiency of the framework can be further improved, especially when using a large number of randomly selected patches.
The performance in real-world settings could benefit from the availability of large, high-resolution, real-world depth datasets for training. |
depth estimation, high-resolution, tile-based, monocular, deep learning |
2312.02256
Report |
EMDM: Efficient Motion Diffusion Model for Fast and High-Quality Motion Generation |
Wenyang Zhou, Zhiyang Dou, Zeyu Cao, Zhouyingcheng Liao, Jingbo Wang, Wenjia Wang, Yuan Liu, Taku Komura, Wenping Wang, Lingjie Liu |
We introduce Efficient Motion Diffusion Model (EMDM) for fast and
high-quality human motion generation. Current state-of-the-art generative
diffusion models have produced impressive results but struggle to achieve fast
generation without sacrificing quality. On the one hand, previous works, like
motion latent diffusion, conduct diffusion within a latent space for
efficiency, but learning such a latent space can be a non-trivial effort. On
the other hand, accelerating generation by naively increasing the sampling step
size, e.g., DDIM, often leads to quality degradation as it fails to approximate
the complex denoising distribution. To address these issues, we propose EMDM,
which captures the complex distribution during multiple sampling steps in the
diffusion model, allowing for much fewer sampling steps and significant
acceleration in generation. This is achieved by a conditional denoising
diffusion GAN to capture multimodal data distributions among arbitrary (and
potentially larger) step sizes conditioned on control signals, enabling
fewer-step motion sampling with high fidelity and diversity. To minimize
undesired motion artifacts, geometric losses are imposed during network
learning. As a result, EMDM achieves real-time motion generation and
significantly improves the efficiency of motion diffusion models compared to
existing methods while achieving high-quality motion generation. Our code will
be publicly available upon publication. |
This paper introduces EMDM, an Efficient Motion Diffusion Model for real-time, high-quality human motion generation, addressing the speed-quality trade-off in existing diffusion-based methods. |
Current motion diffusion models struggle to achieve fast generation without compromising quality, limiting their real-world applicability. |
EMDM utilizes a conditional denoising diffusion GAN to model complex motion distributions over larger sampling step sizes. This allows for fewer denoising steps during generation, significantly improving speed. Additionally, geometric losses are incorporated during training to enhance motion quality. |
EMDM achieves real-time motion generation with competitive or superior quality compared to state-of-the-art methods.
The model demonstrates significant speed improvements, particularly in text-to-motion tasks where it outperforms existing approaches.
Ablation studies validate the contribution of key design choices like sampling step size and geometric loss weighting. |
The lack of physics-based considerations in the motion generation process may lead to artifacts like floating or ground penetration.
Future work could explore incorporating physical constraints and expanding input modalities beyond text, such as visual or audio inputs. |
text-to-motion, motion generation, diffusion model, gan, efficient motion synthesis |
2312.02253
Report |
Diversify, Don't Fine-Tune: Scaling Up Visual Recognition Training with Synthetic Images |
Zhuoran Yu, Chenchen Zhu, Sean Culatana, Raghuraman Krishnamoorthi, Fanyi Xiao, Yong Jae Lee |
Recent advances in generative deep learning have enabled the creation of
high-quality synthetic images in text-to-image generation. Prior work shows
that fine-tuning a pretrained diffusion model on ImageNet and generating
synthetic training images from the finetuned model can enhance an ImageNet
classifier's performance. However, performance degrades as synthetic images
outnumber real ones. In this paper, we explore whether generative fine-tuning
is essential for this improvement and whether it is possible to further scale
up training using more synthetic data. We present a new framework leveraging
off-the-shelf generative models to generate synthetic training images,
addressing multiple challenges: class name ambiguity, lack of diversity in
naive prompts, and domain shifts. Specifically, we leverage large language
models (LLMs) and CLIP to resolve class name ambiguity. To diversify images, we
propose contextualized diversification (CD) and stylized diversification (SD)
methods, also prompted by LLMs. Finally, to mitigate domain shifts, we leverage
domain adaptation techniques with auxiliary batch normalization for synthetic
images. Our framework consistently enhances recognition model performance with
more synthetic data, up to 6x the original ImageNet size, showcasing the
potential of synthetic data for improved recognition models and strong
out-of-domain generalization. |
This paper proposes a new framework that leverages off-the-shelf generative models to generate synthetic training images, improving recognition model performance on large-scale datasets without the need for generative fine-tuning. |
Fine-tuning generative models for each dataset is resource-intensive and performance degrades when synthetic images outnumber real ones. This work explores the potential of using readily available generative models to overcome these limitations. |
The framework addresses challenges like class name ambiguity, lack of diversity in images, and domain shifts. It uses LLMs and CLIP to resolve ambiguity, introduces contextual and style diversification in prompts, and employs domain adaptation techniques with auxiliary batch normalization. |
The framework consistently improves recognition accuracy, outperforming methods using fine-tuned generative models, especially as synthetic data scales up.
Models trained with synthetic data show strong out-of-domain generalization, achieving significant accuracy improvements on ImageNet variations.
The method is effective in low-data and long-tail settings, demonstrating its potential to reduce annotation efforts. |
Training large vision transformer models with more synthetic data is computationally expensive.
Further research is needed to optimize synthetic data generation for specific downstream tasks. |
synthetic data, image classification, diffusion models, domain adaptation, large language models |
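The auxiliary batch normalization idea mentioned in the summary above can be illustrated with a minimal sketch that keeps separate running statistics for real and synthetic inputs. Everything here is an assumption for illustration: the class name `DomainBatchNorm`, the scalar inputs, the momentum value, and the omission of learnable scale and shift are not taken from the paper.

```python
import math


class DomainBatchNorm:
    """Normalizes a batch of scalars with statistics kept separately per domain.

    Hypothetical sketch: "real" uses the main statistics and "synthetic"
    uses the auxiliary ones, so synthetic images do not shift the
    statistics applied to real images.
    """

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        # One [running_mean, running_var] pair per domain.
        self.stats = {"real": [0.0, 1.0], "synthetic": [0.0, 1.0]}

    def forward(self, batch, domain, training=True):
        if training:
            # Use batch statistics and update only this domain's running stats.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            running = self.stats[domain]
            running[0] = (1 - self.momentum) * running[0] + self.momentum * mean
            running[1] = (1 - self.momentum) * running[1] + self.momentum * var
        else:
            # At evaluation time, reuse the stored statistics of the domain.
            mean, var = self.stats[domain]
        return [(x - mean) / math.sqrt(var + self.eps) for x in batch]
```

Calling `forward(batch, "real")` and `forward(batch, "synthetic")` on the same layer updates two independent sets of statistics, which is the property that matters when the two image sources have different distributions.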
2312.02238
Report |
X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model |
Lingmin Ran, Xiaodong Cun, Jia-Wei Liu, Rui Zhao, Song Zijie, Xintao Wang, Jussi Keppo, Mike Zheng Shou |
We introduce X-Adapter, a universal upgrader to enable the pretrained
plug-and-play modules (e.g., ControlNet, LoRA) to work directly with the
upgraded text-to-image diffusion model (e.g., SDXL) without further retraining.
We achieve this goal by training an additional network to control the frozen
upgraded model with the new text-image data pairs. In detail, X-Adapter keeps a
frozen copy of the old model to preserve the connectors of different plugins.
Additionally, X-Adapter adds trainable mapping layers that bridge the decoders
from models of different versions for feature remapping. The remapped features
will be used as guidance for the upgraded model. To enhance the guidance
ability of X-Adapter, we employ a null-text training strategy for the upgraded
model. After training, we also introduce a two-stage denoising strategy to
align the initial latents of X-Adapter and the upgraded model. Thanks to our
strategies, X-Adapter demonstrates universal compatibility with various plugins
and also enables plugins of different versions to work together, thereby
expanding the functionalities of the diffusion community. To verify the
effectiveness of the proposed method, we conduct extensive experiments and the
results show that X-Adapter may facilitate wider application in the upgraded
foundational diffusion model. |
X-Adapter is a universal adapter that upgrades pretrained plug-and-play modules for text-to-image diffusion models, enabling their use with newer models without retraining. |
The rapid development of plugins for diffusion models is often hindered by the emergence of newer models. X-Adapter solves this incompatibility, saving time and resources while enhancing plugin capabilities. |
X-Adapter freezes a copy of the old model and adds trainable mapping layers between its decoder and the upgraded model's decoder for feature remapping. This allows direct use of old plugins on the newer model, guided by the remapped features. A two-stage denoising strategy aligns the latent spaces of the models during inference. |
X-Adapter demonstrates universal compatibility with various plugins, including ControlNet and LoRA.
It improves the performance of old plugins by leveraging the enhanced capabilities of upgraded models, as shown in quantitative and qualitative comparisons.
X-Adapter enables plugin remixing, allowing plugins from different model versions to work together. |
X-Adapter may not fully preserve identity consistency for plugins like IP-Adapter that generate personalized concepts.
Future work includes extending X-Adapter to improve concept customization capabilities. |
diffusion models, plug-and-play modules, model upgrading, parameter-efficient transfer learning, text-to-image generation |
2312.02228
Report |
PixelLM: Pixel Reasoning with Large Multimodal Model |
Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, Xiaojie Jin |
While large multimodal models (LMMs) have achieved remarkable progress,
generating pixel-level masks for image reasoning tasks involving multiple
open-world targets remains a challenge. To bridge this gap, we introduce
PixelLM, an effective and efficient LMM for pixel-level reasoning and
understanding. Central to PixelLM is a novel, lightweight pixel decoder and a
comprehensive segmentation codebook. The decoder efficiently produces masks
from the hidden embeddings of the codebook tokens, which encode detailed
target-relevant information. With this design, PixelLM harmonizes with the
structure of popular LMMs and avoids the need for additional costly
segmentation models. Furthermore, we propose a target refinement loss to
enhance the model's ability to differentiate between multiple targets, leading
to substantially improved mask quality. To advance research in this area, we
construct MUSE, a high-quality multi-target reasoning segmentation benchmark.
PixelLM excels across various pixel-level image reasoning and understanding
tasks, outperforming well-established methods in multiple benchmarks, including
MUSE, single- and multi-referring segmentation. Comprehensive ablations confirm
the efficacy of each proposed component. All code, models, and datasets will be
publicly available. |
This paper introduces PixelLM, an efficient and effective large multimodal model (LMM) for pixel-level reasoning and understanding, capable of handling tasks with multiple open-world targets and diverse reasoning complexities. |
Existing LMMs primarily generate textual descriptions and struggle with pixel-level responses like object masks, limiting their applications in tasks like image editing and robotics. |
PixelLM introduces a novel pixel decoder and a segmentation codebook. The codebook encodes target-relevant information at different visual scales, and the decoder generates masks based on these embeddings and image features. A target refinement loss further enhances the differentiation between multiple targets. |
PixelLM achieves state-of-the-art performance on multi-target reasoning segmentation, outperforming baselines including adapted LISA and SEEM.
It demonstrates superior results on multi-referring segmentation benchmarks, surpassing LISA and its augmented variant.
PixelLM also shows competitive performance on single-target referring segmentation (refCOCO series) despite not being specifically designed for this task. |
The model's performance might be further improved by incorporating object relationships into the data generation process.
Future work could investigate integrating external knowledge bases to enhance reasoning capabilities for complex scenarios. |
large multimodal models, pixel-level reasoning, segmentation, image understanding, codebook |
2312.02221
Report |
Slice3D: Multi-Slice, Occlusion-Revealing, Single View 3D Reconstruction |
Yizhi Wang, Wallace Lira, Wenqi Wang, Ali Mahdavi-Amiri, Hao Zhang |
We introduce multi-slice reasoning, a new notion for single-view 3D
reconstruction which challenges the current and prevailing belief that
multi-view synthesis is the most natural conduit between single-view and 3D.
Our key observation is that object slicing is more advantageous than altering
views to reveal occluded structures. Specifically, slicing is more
occlusion-revealing since it can peel through any occluders without
obstruction. In the limit, i.e., with infinitely many slices, it is guaranteed
to unveil all hidden object parts. We realize our idea by developing Slice3D, a
novel method for single-view 3D reconstruction which first predicts multi-slice
images from a single RGB image and then integrates the slices into a 3D model
using a coordinate-based transformer network for signed distance prediction.
The slice images can be regressed or generated, both through a U-Net based
network. For the former, we inject a learnable slice indicator code to
designate each decoded image into a spatial slice location, while the slice
generator is a denoising diffusion model operating on the entirety of slice
images stacked on the input channels. We conduct extensive evaluation against
state-of-the-art alternatives to demonstrate superiority of our method,
especially in recovering complex and severely occluded shape structures, amid
ambiguities. All Slice3D results were produced by networks trained on a single
Nvidia A40 GPU, with an inference time less than 20 seconds. |
This paper introduces Slice3D, a novel single-view 3D reconstruction method that predicts multi-slice images to reveal occluded parts before reconstructing a 3D model. |
Addresses the fundamental challenge of single-view 3D reconstruction: faithfully reconstructing occluded parts from a single view, which prevailing multi-view synthesis methods struggle with. |
Predicts multi-slice images from a single RGB image using either a regressive U-Net with slice indicator codes or a generative denoising diffusion model, then integrates the slices into a 3D model using a coordinate-based transformer for signed distance prediction. |
Outperforms SOTA methods, including those using diffusion + NeRF, in recovering complex and severely occluded shapes.
Demonstrates superior generalization ability across diverse object categories on Objaverse.
Offers faster inference times compared to NeRF-based methods while achieving better reconstruction quality. |
Current implementation uses a fixed number of slices per direction, potentially limiting detail recovery.
Slicing is mainly applicable to digital 3D models, limiting its applicability to real-world scenarios compared to multi-view methods. |
single-view 3d reconstruction, occlusion-revealing, multi-slice representation, denoising diffusion model, transformer network |
2312.02218
Report |
WavePlanes: A compact Wavelet representation for Dynamic Neural Radiance Fields |
Adrian Azzarelli, Nantheera Anantrasirichai, David R Bull |
Dynamic Neural Radiance Fields (Dynamic NeRF) enhance NeRF technology to
model moving scenes. However, they are resource intensive and challenging to
compress. To address these issues, this paper presents WavePlanes, a fast and
more compact explicit model. We propose a multi-scale space and space-time
feature plane representation using N-level 2-D wavelet coefficients. The
inverse discrete wavelet transform reconstructs feature signals at varying
detail, which are linearly decoded to approximate the color and density of
volumes in a 4-D grid. Exploiting the sparsity of wavelet coefficients, we
compress the model using a Hash Map containing only non-zero coefficients and
their locations on each plane. Compared to the state-of-the-art (SotA)
plane-based models, WavePlanes is up to 15x smaller while being less resource
demanding and competitive in performance and training time. Compared to other
small SotA models WavePlanes preserves details better without requiring custom
CUDA code or high performance computing resources. Our code is available at:
https://github.com/azzarelli/waveplanes/ |
Presents WavePlanes, a novel dynamic Neural Radiance Field (NeRF) representation and compression method that utilizes wavelets to reduce computation and enable efficient compression. |
Dynamic NeRFs, while promising for modeling moving scenes, are resource intensive and difficult to compress. Existing solutions are either slow, large, or struggle with temporal detail. |
Decomposes the 4D scene into six 2D grids. Employs a multi-scale space-time feature plane representation using N-level 2D wavelet coefficients. Introduces a Zero-Agreement Masked (ZAM) fusion scheme for improved signal localization. Leverages wavelet sparsity for compression using a Hash Map. |
Achieves up to 15x compression compared to state-of-the-art plane-based models.
Maintains competitive performance and training time despite being more compact.
Preserves details better than other small dynamic NeRF models, particularly in regions of high frequency and occlusion. |
Limited in modeling objects outside the predefined bounding box.
Modeling fast motion with a fixed temporal resolution can introduce noise. |
neural radiance fields, nerf, wavelets, compression, dynamic scenes |
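The compression idea in the summary above (store only non-zero wavelet coefficients and their locations, then reconstruct with the inverse transform) can be sketched in one dimension. This is a simplified illustration under stated assumptions: the paper describes N-level 2-D wavelet coefficients on feature planes, whereas this sketch uses a 1-D averaging Haar transform, and all function names are hypothetical.

```python
def haar_forward(signal, levels):
    """Multi-level 1-D Haar transform (averaging variant).

    len(signal) must be divisible by 2**levels. Returns the coarsest
    approximation and a list of detail bands, finest band first.
    """
    approx = list(signal)
    details = []
    for _ in range(levels):
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        details.append([(a - b) / 2 for a, b in pairs])
        approx = [(a + b) / 2 for a, b in pairs]
    return approx, details


def haar_inverse(approx, details):
    """Inverse of haar_forward: rebuilds the signal from coarse to fine."""
    signal = list(approx)
    for band in reversed(details):
        rebuilt = []
        for s, d in zip(signal, band):
            rebuilt.extend([s + d, s - d])
        signal = rebuilt
    return signal


def to_sparse(details, threshold=0.0):
    """Keep only coefficients above the threshold, keyed by (level, index)."""
    return {
        (level, i): c
        for level, band in enumerate(details)
        for i, c in enumerate(band)
        if abs(c) > threshold
    }


def from_sparse(sparse, band_sizes):
    """Expand the sparse map back into dense detail bands filled with zeros."""
    details = [[0.0] * n for n in band_sizes]
    for (level, i), c in sparse.items():
        details[level][i] = c
    return details
```

For a piecewise-constant signal most detail coefficients are exactly zero, so the dictionary holds far fewer entries than the dense bands while `haar_inverse` still recovers the signal.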
2312.02216
Report |
DragVideo: Interactive Drag-style Video Editing |
Yufan Deng, Ruida Wang, Yuhao Zhang, Yu-Wing Tai, Chi-Keung Tang |
Video generation models have shown their superior ability to generate
photo-realistic video. However, how to accurately control (or edit) the video
remains a formidable challenge. The main issues are: 1) how to perform direct
and accurate user control in editing; 2) how to execute edits such as changing
shape, expression, and layout without unsightly distortion and artifacts to the
edited content; and 3) how to maintain spatio-temporal consistency of video
after editing. To address the above issues, we propose DragVideo, a general
drag-style video editing framework. Inspired by DragGAN, DragVideo addresses
issues 1) and 2) by proposing the drag-style video latent optimization method
which gives desired control by updating noisy video latent according to drag
instructions through video-level drag objective function. We amend issue 3) by
integrating the video diffusion model with sample-specific LoRA and Mutual
Self-Attention in DragVideo to ensure the edited result is spatio-temporally
consistent. We also present a series of testing examples for drag-style video
editing and conduct extensive experiments across a wide array of challenging
editing tasks, such as motion and skeleton editing, showing that DragVideo
can edit video intuitively and faithfully to the user's intention, with
nearly unnoticeable distortion and artifacts, while maintaining spatio-temporal
consistency. Whereas traditional prompt-based video editing fails at the
first two and directly applying image drag editing fails at the last,
DragVideo handles all three, underscoring its versatility and generality. GitHub link:
https://github.com/RickySkywalker/DragVideo-Official. |
This paper introduces DragVideo, the first end-to-end framework for drag-style video editing that enables accurate and intuitive video editing while maintaining spatio-temporal consistency. |
Existing video editing methods either struggle with accurate and artifact-free editing (e.g., prompt-based methods) or fail to maintain temporal consistency when directly applying image-based drag editing across frames. |
DragVideo uses a video diffusion model, sample-specific LoRA, and Mutual Self-Attention. It optimizes a noisy video latent based on user-provided drag instructions (points and masks), which are propagated across frames. |
DragVideo achieves high-quality, drag-based editing on real-world videos with accurate and artifact-free results.
It effectively addresses the temporal inconsistency issues present in frame-by-frame drag editing approaches.
Quantitative evaluations and user studies confirm DragVideo's superior performance in temporal consistency and editing effectiveness compared to baselines. |
Some edited outputs still exhibit blurriness and spatial inconsistency, suggesting a need for further optimization in visual quality.
The framework currently has high computational costs, necessitating improvements in computational efficiency. |
video editing, drag-style editing, video diffusion models, temporal consistency, user-guided editing |
2312.02214
Report |
FlashAvatar: High-fidelity Head Avatar with Efficient Gaussian Embedding |
Jun Xiang, Xuan Gao, Yudong Guo, Juyong Zhang |
We propose FlashAvatar, a novel and lightweight 3D animatable avatar
representation that could reconstruct a digital avatar from a short monocular
video sequence in minutes and render high-fidelity photo-realistic images at
300FPS on a consumer-grade GPU. To achieve this, we maintain a uniform 3D
Gaussian field embedded in the surface of a parametric face model and learn
extra spatial offset to model non-surface regions and subtle facial details.
While full use of geometric priors can capture high-frequency facial details
and preserve exaggerated expressions, proper initialization can help reduce the
number of Gaussians, thus enabling super-fast rendering speed. Extensive
experimental results demonstrate that FlashAvatar outperforms existing works
regarding visual quality and personalized details and is almost an order of
magnitude faster in rendering speed. Project page:
https://ustc3dv.github.io/FlashAvatar/ |
FlashAvatar is a novel, lightweight, and animatable 3D avatar representation that can reconstruct high-fidelity digital avatars from short monocular videos. |
It addresses the limitations of existing methods, such as 3DMMs' inability to model complex features and NeRF's slow rendering speeds, paving the way for real-time interactive digital human applications. |
It leverages a mesh-embedded Gaussian field initialized on a parametric face model (FLAME), learns spatial offsets for non-surface details, and employs efficient UV sampling for optimal Gaussian distribution. |
Achieves photo-realistic rendering quality with fine details and subtle expressions.
Outperforms previous methods in terms of visual quality and preserves personalized details.
Enables super-fast rendering at 300FPS on consumer-grade GPUs due to a low Gaussian count (10K level). |
Performance is contingent on accurate FLAME mesh tracking, with potential for detail loss or misalignment due to tracking errors.
Currently relies on tracked expression codes for animation, limiting its ability to model complex hair dynamics. |
digital avatar, 3d gaussian splatting, facial reenactment, real-time rendering, 3d reconstruction |
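The mesh-embedded Gaussian field described above can be pictured as placing each Gaussian at a point on a mesh triangle and then adding an offset. The sketch below is an assumption-laden illustration: the barycentric parameterization, the function name `gaussian_center`, and treating the offset as a given vector (rather than something a network predicts) are not taken from the paper.

```python
def gaussian_center(triangle, bary, offset):
    """Position of one Gaussian attached to a mesh triangle.

    triangle: three (x, y, z) vertices of the current (possibly deformed) mesh.
    bary:     three barycentric weights summing to 1 (fixed per Gaussian).
    offset:   an (x, y, z) displacement that lets the Gaussian leave the surface.
    """
    return tuple(
        sum(w * vertex[axis] for w, vertex in zip(bary, triangle)) + offset[axis]
        for axis in range(3)
    )
```

Because the barycentric weights are fixed, moving the triangle's vertices (for example when the face model changes expression) moves the Gaussian with the surface, while the offset accounts for detail that the mesh itself does not capture.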
2312.02201
Report |
ImageDream: Image-Prompt Multi-view Diffusion for 3D Generation |
Peng Wang, Yichun Shi |
We introduce "ImageDream," an innovative image-prompt, multi-view diffusion
model for 3D object generation. ImageDream stands out for its ability to
produce 3D models of higher quality compared to existing state-of-the-art,
image-conditioned methods. Our approach utilizes a canonical camera
coordination for the objects in images, improving visual geometry accuracy. The
model is designed with various levels of control at each block inside the
diffusion model based on the input image, where global control shapes the
overall object layout and local control fine-tunes the image details. The
effectiveness of ImageDream is demonstrated through extensive evaluations using
a standard prompt list. For more information, visit our project page at
https://Image-Dream.github.io. |
ImageDream: a novel Image-Prompt Multi-view diffusion model for high-quality 3D object generation from a single image, surpassing previous SoTA in geometry and texture quality. |
Images provide richer visual information for 3D generation than text, leading to more accurate and detailed 3D models. |
ImageDream uses a canonical camera coordination for consistent geometry and a multi-level image-prompt controller for granular control over object layout and appearance. It leverages a multi-view diffusion network and score distillation sampling for 3D model creation. |
Significantly outperforms other SoTA baselines in user studies based on geometry quality and similarity to the image prompt.
Successfully addresses geometric inaccuracies present in previous methods.
Maintains high image quality in both diffusion and post-3D fusion stages according to quantitative metrics like QIS and CLIP scores. |
Struggles with capturing fine details when image constraints are overly stringent, resulting in potential blurriness.
Requires a better estimation of image intrinsic and extrinsic properties for optimal performance. |
3d object generation, image-prompt, multi-view diffusion, canonical camera coordination, score distillation sampling |
2312.02197
Report |
Test-Time Degradation Adaption for Open-Set Image Restoration |
Yuanbiao Gou, Haiyu Zhao, Boyun Li, Xinyan Xiao, Xi Peng |
In contrast to close-set scenarios that restore images from a predefined set
of degradations, open-set image restoration aims to handle the unknown
degradations that were unforeseen during the pretraining phase, which is
less-touched as far as we know. In this work, we explicitly study this
challenging problem and reveal its essence, i.e., the unidentified distribution
shifts between test and training data. Recently, test-time adaptation has emerged
as a fundamental method to address such inherent disparities. Inspired by this,
we propose a test-time degradation adaption framework for open-set image
restoration, which involves three components, i.e., i) a pre-trained and
degradation-agnostic diffusion model to generate clean images, ii) a test-time
degradation adapter that adapts to the unknown degradations based on the input image
during the testing phase, and iii) adapter-guided image restoration, which guides
the model through the adapter to produce the corresponding clean image. Through
experiments on multiple degradations absent from the training data, we show
that our method achieves comparable or even better performance than
task-specific methods. |
This paper introduces the problem of open-set image restoration (OIR), where the task is to restore clean images from degradations not present in the training data, and proposes a Test-time degradation Adaption framework (TAO) to address it. |
Most existing image restoration methods operate under a close-set scenario, limiting their applicability to real-world situations where diverse and unforeseen degradations are common. OIR aims to tackle this limitation by enabling models to handle unknown degradations. |
TAO leverages a pre-trained diffusion model and incorporates two novel components: 1) a Test-time Degradation Adapter (TDA) that aligns the model to the unknown degradation during testing and 2) an Adapter-guided Image Restoration (AIR) module that dynamically adjusts supervision strategies throughout the denoising process. |
TAO achieves comparable or better performance than task-specific zero-shot methods on image dehazing, low-light enhancement, and denoising.
The TDA effectively aligns the generated image domain to that of the degraded input.
AIR significantly improves restoration quality by dynamically adjusting guidance strategies throughout the denoising process. |
The current method relies on heuristics for dividing the denoising process into stages for AIR.
Exploration of alternative adapter architectures and more principled stage division strategies could further enhance performance. |
image restoration, open-set learning, test-time adaptation, diffusion models, domain adaptation |
2312.02190
Report |
Diffusion Handles: Enabling 3D Edits for Diffusion Models by Lifting Activations to 3D |
Karran Pandey, Paul Guerrero, Matheus Gadelha, Yannick Hold-Geoffroy, Karan Singh, Niloy Mitra |
Diffusion Handles is a novel approach to enabling 3D object edits on
diffusion images. We accomplish these edits using existing pre-trained
diffusion models, and 2D image depth estimation, without any fine-tuning or 3D
object retrieval. The edited results remain plausible, photo-real, and preserve
object identity. Diffusion Handles address a critically missing facet of
generative image based creative design, and significantly advance the
state-of-the-art in generative image editing. Our key insight is to lift
diffusion activations for an object to 3D using a proxy depth, 3D-transform the
depth and associated activations, and project them back to image space. The
diffusion process applied to the manipulated activations with identity control,
produces plausible edited images showing complex 3D occlusion and lighting
effects. We evaluate Diffusion Handles: quantitatively, on a large synthetic
data benchmark; and qualitatively by a user study, showing our output to be
more plausible, and better than prior art at both 3D editing and identity
control. Project Webpage: https://diffusionhandles.github.io/ |
Introduces Diffusion Handles, a method for 3D-aware object editing in diffusion-generated images using estimated depth maps and diffusion activations. |
Addresses limitations of existing diffusion-based image editing techniques by enabling plausible 3D object transformations while preserving object identity. |
Lifts diffusion activations to 3D using depth maps, applies 3D transformations, projects back to image space, and guides the diffusion process with the edited activations. |
Outperforms baselines in terms of plausibility, identity preservation, and edit adherence as measured by user study.
Demonstrates robustness to depth map inaccuracies and artifacts.
Shows consistent performance on a synthetic benchmark of randomly generated scenes and edits. |
Large edits revealing depth estimation errors can lead to low-quality outputs.
Identity preservation, though improved, can be further enhanced. |
image editing, diffusion models, 3d-aware editing, generative models, depth estimation |
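The lift, transform, and project-back pipeline summarized above can be sketched with a pinhole camera on a tiny feature grid. This is an illustration under assumptions, not the paper's implementation: the pinhole model, nearest-pixel splatting, the z-buffer for occlusion, and all function names are hypothetical choices.

```python
def unproject(u, v, depth, focal, cx, cy):
    """Lift pixel (u, v) with the given depth to a 3D point (pinhole camera)."""
    return ((u - cx) * depth / focal, (v - cy) * depth / focal, depth)


def project(point, focal, cx, cy):
    """Project a 3D point back to continuous pixel coordinates."""
    x, y, z = point
    return (focal * x / z + cx, focal * y / z + cy)


def warp_features(features, depth, transform, focal, cx, cy):
    """Move per-pixel features according to a 3D transform of their lifted points.

    features, depth: 2D lists of the same shape (rows indexed by v, columns by u).
    transform:       function mapping an (x, y, z) point to a new (x, y, z) point.
    Target pixels that receive no feature stay None.
    """
    height, width = len(depth), len(depth[0])
    warped = [[None] * width for _ in range(height)]
    zbuf = [[float("inf")] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            point = transform(unproject(u, v, depth[v][u], focal, cx, cy))
            if point[2] <= 0:
                continue  # Behind the camera after the edit.
            pu, pv = project(point, focal, cx, cy)
            iu, iv = round(pu), round(pv)
            inside = 0 <= iu < width and 0 <= iv < height
            # Keep the nearest point when several land on the same pixel.
            if inside and point[2] < zbuf[iv][iu]:
                zbuf[iv][iu] = point[2]
                warped[iv][iu] = features[v][u]
    return warped
```

With constant depth, a sideways translation in 3D becomes a uniform pixel shift, and the `None` entries mark disoccluded regions that have no source feature and would have to be filled in by the subsequent generation step.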
2312.02189
Report |
StableDreamer: Taming Noisy Score Distillation Sampling for Text-to-3D |
Pengsheng Guo, Hans Hao, Adam Caccavale, Zhongzheng Ren, Edward Zhang, Qi Shan, Aditya Sankar, Alexander G. Schwing, Alex Colburn, Fangchang Ma |
In the realm of text-to-3D generation, utilizing 2D diffusion models through
score distillation sampling (SDS) frequently leads to issues such as blurred
appearances and multi-faced geometry, primarily due to the intrinsically noisy
nature of the SDS loss. Our analysis identifies the core of these challenges as
the interaction among noise levels in the 2D diffusion process, the
architecture of the diffusion network, and the 3D model representation. To
overcome these limitations, we present StableDreamer, a methodology
incorporating three advances. First, inspired by InstructNeRF2NeRF, we
formalize the equivalence of the SDS generative prior and a simple supervised
L2 reconstruction loss. This finding provides a novel tool to debug SDS, which
we use to show the impact of time-annealing noise levels on reducing
multi-faced geometries. Second, our analysis shows that while image-space
diffusion contributes to geometric precision, latent-space diffusion is crucial
for vivid color rendition. Based on this observation, StableDreamer introduces
a two-stage training strategy that effectively combines these aspects,
resulting in high-fidelity 3D models. Third, we adopt an anisotropic 3D
Gaussian representation, replacing Neural Radiance Fields (NeRFs), to enhance
the overall quality, reduce memory usage during training, accelerate
rendering speeds, and better capture semi-transparent objects. StableDreamer
reduces multi-face geometries, generates fine details, and converges stably. |
StableDreamer is a text-to-3D generation framework that addresses blurry appearances and multi-face geometry problems common in existing score distillation sampling (SDS) methods. |
Existing text-to-3D approaches, particularly those using SDS, struggle with issues like blurry appearance, oversimplified geometry, multi-face artifacts, and slow optimization and rendering. |
The paper introduces three key advances: 1) reinterpretation of SDS loss as a supervised reconstruction task enabling noise annealing and training visualization, 2) a two-stage training approach utilizing both image-space and latent-space diffusion models for enhanced geometry and color quality, and 3) integration of 3D Gaussian Splatting (3DGS) representation with specialized initialization and density control for improved detail. |
Time-annealing of noise levels in SDS significantly reduces multi-face artifacts.
Two-stage training with image-space diffusion followed by latent-space diffusion results in both geometric accuracy and vivid, detailed appearances.
Integrating 3DGS with proposed regularization techniques leads to high-fidelity models with fast rendering speeds (over 30 FPS). |
The method still encounters failure cases with certain prompts where the 2D diffusion model struggles.
Future work could explore techniques to address the remaining failure cases and improve the overall robustness of the system. |
text-to-3d generation, score distillation sampling, 3d gaussian splatting, diffusion models, multi-view consistency |
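The time-annealing of noise levels mentioned above can be sketched as a sampling range whose upper bound shrinks as optimization progresses. The linear schedule, the function names, and every numeric bound below are hypothetical values chosen for illustration, not settings reported in the paper.

```python
import random


def annealed_timestep_range(step, total_steps, t_min=0.02,
                            t_max_start=0.98, t_max_end=0.5):
    """Return the (low, high) range of normalized diffusion times for this step.

    The upper bound decays linearly from t_max_start to t_max_end, so late
    iterations only see lower noise levels. All defaults are assumptions.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    t_max = t_max_start + (t_max_end - t_max_start) * progress
    return t_min, t_max


def sample_timestep(step, total_steps, rng=random):
    """Draw one normalized diffusion time uniformly from the annealed range."""
    low, high = annealed_timestep_range(step, total_steps)
    return rng.uniform(low, high)
```

Early iterations can still draw high noise levels, which tends to shape coarse structure, while later iterations are restricted to lower noise levels that refine detail.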
2312.02157
Report |
Mesh-Guided Neural Implicit Field Editing |
Can Wang, Mingming He, Menglei Chai, Dongdong Chen, Jing Liao |
Neural implicit fields have emerged as a powerful 3D representation for
reconstructing and rendering photo-realistic views, yet they possess limited
editability. Conversely, explicit 3D representations, such as polygonal meshes,
offer ease of editing but may not be as suitable for rendering high-quality
novel views. To harness the strengths of both representations, we propose a new
approach that employs a mesh as a guiding mechanism in editing the neural
radiance field. We first introduce a differentiable method using marching
tetrahedra for polygonal mesh extraction from the neural implicit field and
then design a differentiable color extractor to assign colors obtained from the
volume renderings to this extracted mesh. This differentiable colored mesh
allows gradient back-propagation from the explicit mesh to the implicit fields,
empowering users to easily manipulate the geometry and color of neural implicit
fields. To enhance user control from coarse-grained to fine-grained levels, we
introduce an octree-based structure into its optimization. This structure
prioritizes the edited regions and the surface part, making our method achieve
fine-grained edits to the neural implicit field and accommodate various user
modifications, including object additions, component removals, specific area
deformations, and adjustments to local and global colors. Through extensive
experiments involving diverse scenes and editing operations, we have
demonstrated the capabilities and effectiveness of our method. Our project page
is: https://cassiepython.github.io/MNeuEdit/ |
This paper introduces a novel mesh-guided editing method for neural implicit fields that enables users to edit the geometry and color of neural implicit fields with the ease of manipulating explicit 3D meshes. |
Editing neural implicit fields, while offering high-fidelity rendering, is challenging due to their implicit representation. This work aims to bridge the gap by using the intuitive editing capabilities of explicit 3D meshes to guide modifications in implicit fields, making the process user-friendly and compatible with existing 3D modeling workflows. |
The method leverages differentiable marching tetrahedra for mesh extraction from neural implicit fields and introduces a differentiable color extractor to assign colors to the mesh vertices. An octree-based structure is incorporated for optimization, allowing for fine-grained edits. The framework supports a two-step process: first, optimizing the density field for geometry editing, and then the color function for color modifications. |
The proposed method enables extensive editing capabilities, including object addition, component removal, deformation, and precise color editing, surpassing previous methods in flexibility.
The octree-based optimization significantly reduces computational demands while allowing for fine-grained editing of geometry and color, addressing the limitations of using dense grids.
The method outperforms existing approaches in achieving fine-grained and consistent editing results, especially in challenging scenarios like complex textures and uneven surfaces. |
The method currently lacks direct support for editing scene shading and lighting, requiring users to bake these features into vertex colors.
Editing highly intricate structures that do not produce high-quality meshes, such as human hair, remains challenging due to the reliance on a reliable underlying mesh structure. |
neural implicit fields, 3d mesh editing, differentiable rendering, octree-based optimization, geometry and color editing |
2312.02155
Report |
GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis |
Shunyuan Zheng, Boyao Zhou, Ruizhi Shao, Boning Liu, Shengping Zhang, Liqiang Nie, Yebin Liu |
We present a new approach, termed GPS-Gaussian, for synthesizing novel views
of a character in a real-time manner. The proposed method enables 2K-resolution
rendering under a sparse-view camera setting. Unlike the original Gaussian
Splatting or neural implicit rendering methods that necessitate per-subject
optimizations, we introduce Gaussian parameter maps defined on the source views
and directly regress Gaussian Splatting properties for instant novel view
synthesis without any fine-tuning or optimization. To this end, we train our
Gaussian parameter regression module on a large amount of human scan data,
jointly with a depth estimation module to lift 2D parameter maps to 3D space.
The proposed framework is fully differentiable and experiments on several
datasets demonstrate that our method outperforms state-of-the-art methods while
achieving a superior rendering speed. |
Presents GPS-Gaussian, a novel method for real-time synthesis of high-fidelity novel views of human characters from sparse multi-view RGB inputs, utilizing a generalizable 3D Gaussian Splatting approach. |
Addresses the limitations of existing human NVS methods, which are either computationally expensive (e.g., NeRF-based) or lack generalizability (e.g., per-subject optimization in 3D Gaussian Splatting), hindering real-time applications. |
Introduces pixel-wise Gaussian parameter maps on 2D image planes, jointly learns an iterative depth estimation module and a Gaussian parameter regression module, and leverages differentiable rendering for end-to-end training. |
Achieves real-time performance exceeding 25 FPS for 2K resolution rendering on a single GPU.
Outperforms state-of-the-art methods in terms of rendering quality, particularly in handling occlusions and thin structures.
Demonstrates strong generalization ability, enabling instant rendering of unseen characters without requiring per-subject optimization. |
Requires accurate foreground matting as a preprocessing step, limiting its application to general scenes.
Relies on ground truth depth for supervision during training, posing challenges for data acquisition. |
novel view synthesis, human performance rendering, 3d gaussian splatting, depth estimation, real-time rendering |
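The depth-based lifting in the GPS-Gaussian entry above (2D parameter maps lifted to 3D space using estimated depth) can be illustrated with a standard pinhole unprojection. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical, the camera is assumed distortion-free, and points are returned in camera space only.

```python
def unproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with a depth value to a 3D point in camera space.

    Assumes an ideal pinhole camera (no distortion); all names are
    illustrative and not taken from the paper's code.
    """
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def unproject_depth_map(depth_map, fx, fy, cx, cy):
    """Lift every pixel of a row-major depth map to a list of 3D points."""
    points = []
    for v, row in enumerate(depth_map):
        for u, d in enumerate(row):
            points.append(unproject_pixel(u, v, d, fx, fy, cx, cy))
    return points
```

In a pixel-wise formulation like the one described, each lifted point would plausibly serve as the position of one Gaussian, with the remaining properties read from the parameter map at the same pixel.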
2312.02153
Report |
Aligning and Prompting Everything All at Once for Universal Visual Perception |
Yunhang Shen, Chaoyou Fu, Peixian Chen, Mengdan Zhang, Ke Li, Xing Sun, Yunsheng Wu, Shaohui Lin, Rongrong Ji |
Vision foundation models have been explored recently to build general-purpose
vision systems. However, predominant paradigms, driven by casting
instance-level tasks as an object-word alignment, bring heavy cross-modality
interaction, which is not effective in prompting object detection and visual
grounding. Another line of work that focuses on pixel-level tasks often
encounters a large annotation gap of things and stuff, and suffers from mutual
interference between foreground-object and background-class segmentation. In
stark contrast to the prevailing methods, we present APE, a universal visual
perception model for aligning and prompting everything all at once in an image
to perform diverse tasks, i.e., detection, segmentation, and grounding, as an
instance-level sentence-object matching paradigm. Specifically, APE advances
the convergence of detection and grounding by reformulating language-guided
grounding as open-vocabulary detection, which efficiently scales up model
prompting to thousands of category vocabularies and region descriptions while
maintaining the effectiveness of cross-modality fusion. To bridge the
granularity gap of different pixel-level tasks, APE equalizes semantic and
panoptic segmentation to proxy instance learning by considering any isolated
regions as individual instances. APE aligns vision and language representation
on broad data with natural and challenging characteristics all at once without
task-specific fine-tuning. The extensive experiments on over 160 datasets
demonstrate that, with only one set of weights, APE outperforms (or is on par
with) the state-of-the-art models, proving that an effective yet universal
perception for anything aligning and prompting is indeed feasible. Codes and
trained models are released at https://github.com/shenyunhang/APE. |
This paper introduces APE, a novel vision foundation model trained on diverse datasets for multiple tasks, including open-vocabulary object detection, various image segmentation types (semantic, instance, panoptic), and visual grounding. |
Existing VFMs face limitations such as heavy cross-modality interaction leading to inefficiency in prompting, large annotation gaps between things and stuff, and mutual interference between foreground and background segmentation. APE addresses these limitations. |
APE leverages an instance-level region-sentence matching paradigm. It utilizes compact sentence representations for efficient vision-language interaction, equalizes semantic and panoptic segmentation to a proxy instance learning objective, and aligns vision and language representations on broad data without task-specific fine-tuning. |
APE achieves state-of-the-art or competitive performance on over 160 datasets across all tasks with a single set of weights, demonstrating strong generalization ability.
The model effectively handles large-scale text prompts for querying thousands of categories and sentences in a single forward pass.
APE effectively unifies the learning of thing and stuff categories, addressing the granularity discrepancy in previous methods. |
The current implementation of APE relies on instance-level annotations, leading to a disadvantage in panoptic segmentation evaluation due to potentially overlapping segments.
Future work could explore leveraging stronger language models and pre-training methods. |
vision foundation models, open-vocabulary object detection, image segmentation, visual grounding, vision-language models |
2312.02150
Report |
Readout Guidance: Learning Control from Diffusion Features |
Grace Luo, Trevor Darrell, Oliver Wang, Dan B Goldman, Aleksander Holynski |
We present Readout Guidance, a method for controlling text-to-image diffusion
models with learned signals. Readout Guidance uses readout heads, lightweight
networks trained to extract signals from the features of a pre-trained, frozen
diffusion model at every timestep. These readouts can encode single-image
properties, such as pose, depth, and edges; or higher-order properties that
relate multiple images, such as correspondence and appearance similarity.
Furthermore, by comparing the readout estimates to a user-defined target, and
back-propagating the gradient through the readout head, these estimates can be
used to guide the sampling process. Compared to prior methods for conditional
generation, Readout Guidance requires significantly fewer added parameters and
training samples, and offers a convenient and simple recipe for reproducing
different forms of conditional control under a single framework, with a single
architecture and sampling procedure. We showcase these benefits in the
applications of drag-based manipulation, identity-consistent generation, and
spatially aligned control. Project page: https://readout-guidance.github.io. |
The paper proposes Readout Guidance, a method for controlling text-to-image diffusion models by training lightweight readout heads on diffusion features to extract and guide image generation towards desired properties. |
This method provides flexible user control over diffusion models without expensive finetuning, requiring significantly fewer training samples and time compared to existing techniques. |
Readout heads are trained on top of frozen diffusion models to extract single-image properties like pose and depth, or relative properties between images like appearance similarity. These learned readouts then guide the sampling process towards user-defined targets or references. |
Readout Guidance achieves state-of-the-art performance on drag-based image manipulation, outperforming previous methods that require per-example finetuning.
The method enables identity-consistent generation, preserving the appearance of a subject from a reference image without subject-specific training.
Readout Guidance is effective for spatially aligned controls like pose, depth, and edge guidance, achieving competitive performance with significantly less training data and fewer parameters than finetuned models. |
Readout Guidance requires additional memory and runtime during sampling due to gradient computations.
The method may sometimes produce unrealistic or cartoonish imagery while satisfying readout constraints, requiring careful tuning of guidance strength. |
diffusion models, image generation, conditional image synthesis, sampling-time guidance, image manipulation |
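The guidance idea in the Readout Guidance entry above (compare a readout to a target, then back-propagate the gradient to steer sampling) can be sketched in a toy setting. This is a deliberately simplified illustration and not the paper's procedure: the readout head is assumed to be a single linear map, the "features" stand in for whatever the sampler updates, and the names `readout`, `guidance_step`, and `step_size` are hypothetical.

```python
def readout(weights, features):
    """Hypothetical linear readout head: a dot product over features."""
    return sum(w * f for w, f in zip(weights, features))


def guidance_step(weights, features, target, step_size):
    """One gradient step that moves `features` toward the target readout.

    The loss is (readout - target)^2, so its gradient with respect to the
    features is 2 * (readout - target) * weights.
    """
    error = readout(weights, features) - target
    grad = [2.0 * error * w for w in weights]
    return [f - step_size * g for f, g in zip(features, grad)]


weights = [0.5, -1.0, 2.0]
features = [1.0, 1.0, 1.0]
target = 3.0
for _ in range(20):
    features = guidance_step(weights, features, target, step_size=0.05)
print(round(readout(weights, features), 3))
```

The step size plays the role of a guidance strength: larger values satisfy the readout constraint faster but push the sample further from where it would otherwise go, which is consistent with the tuning caveat noted in the limitations.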
2312.02149
Report |
Generative Powers of Ten |
Xiaojuan Wang, Janne Kontkanen, Brian Curless, Steve Seitz, Ira Kemelmacher, Ben Mildenhall, Pratul Srinivasan, Dor Verbin, Aleksander Holynski |
We present a method that uses a text-to-image model to generate consistent
content across multiple image scales, enabling extreme semantic zooms into a
scene, e.g., ranging from a wide-angle landscape view of a forest to a macro
shot of an insect sitting on one of the tree branches. We achieve this through
a joint multi-scale diffusion sampling approach that encourages consistency
across different scales while preserving the integrity of each individual
sampling process. Since each generated scale is guided by a different text
prompt, our method enables deeper levels of zoom than traditional
super-resolution methods that may struggle to create new contextual structure
at vastly different scales. We compare our method qualitatively with
alternative techniques in image super-resolution and outpainting, and show that
our method is most effective at generating consistent multi-scale content. |
This paper introduces a method for generating consistent content across multiple image scales using a text-to-image diffusion model, enabling extreme semantic zooms into a scene described by a series of text prompts at varying zoom levels. |
This method addresses the limitations of traditional super-resolution methods, which struggle to create new contextual structure at vastly different scales, by leveraging semantic information from text prompts. |
The method employs a joint multi-scale diffusion sampling approach, utilizing a zoom stack representation and multi-resolution blending to ensure consistency across different scales while preserving the integrity of individual sampling processes. |
The method successfully generates consistent and high-quality zoom sequences for various zoom factors and scenes.
It outperforms baseline methods like diffusion-based outpainting and super-resolution models in terms of consistency and image quality.
The method allows for user control by incorporating real images or editing text prompts to guide the generation process. |
Identifying appropriate text prompts that align with specific scales and the text-to-image model's training data remains challenging.
Future work could explore optimizing geometric transformations or text embeddings for better alignment between zoom levels and prompts. |
text-to-image generation, semantic zoom, multi-scale representation, diffusion models, joint sampling |
2312.02147
Report |
Rejuvenating image-GPT as Strong Visual Representation Learners |
Sucheng Ren, Zeyu Wang, Hongru Zhu, Junfei Xiao, Alan Yuille, Cihang Xie |
This paper enhances image-GPT (iGPT), one of the pioneering works that
introduce autoregressive pretraining to predict next pixels for visual
representation learning. Two simple yet essential changes are made. First, we
shift the prediction target from raw pixels to semantic tokens, enabling a
higher-level understanding of visual content. Second, we supplement the
autoregressive modeling by instructing the model to predict not only the next
tokens but also the visible tokens. This pipeline is particularly effective
when semantic tokens are encoded by discriminatively trained models, such as
CLIP. We introduce this novel approach as D-iGPT. Extensive experiments
showcase that D-iGPT excels as a strong learner of visual representations: A
notable achievement of D-iGPT is its compelling performance on the ImageNet-1K
dataset -- by training on publicly available datasets, D-iGPT achieves 89.5%
top-1 accuracy with a vanilla ViT-Large model. This model also shows strong
generalization on the downstream task and robustness on out-of-distribution
samples. Code is available at
https://github.com/OliverRensu/D-iGPT. |
This paper introduces D-iGPT, an enhanced version of the iGPT model that predicts semantic tokens instead of raw pixels for visual representation learning, achieving strong performance on ImageNet using publicly available data. |
The work addresses the limitations of existing self-supervised learning methods in computer vision, particularly the under-exploration of autoregressive pretraining for high-quality visual representation learning at scale. |
D-iGPT modifies iGPT by 1) predicting semantic tokens obtained from a discriminatively trained model like CLIP, and 2) adding supervision for visible tokens to improve training. |
D-iGPT achieves 89.5% top-1 accuracy on ImageNet-1K using publicly available datasets, exceeding previous state-of-the-art methods.
D-iGPT demonstrates strong generalization by surpassing MAE counterparts on semantic segmentation using ADE20K.
D-iGPT exhibits superior robustness compared to existing methods across various out-of-domain ImageNet datasets. |
The reliance on a separate model for generating semantic tokens introduces potential limitations depending on the chosen model's performance.
Further exploration is needed to fully understand the impact of scaling D-iGPT to even larger datasets and model sizes. |
self-supervised learning, autoregressive pretraining, vision transformer, image classification, semantic segmentation |
2312.02145
Report |
Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation |
Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, Konrad Schindler |
Monocular depth estimation is a fundamental computer vision task. Recovering
3D depth from a single image is geometrically ill-posed and requires scene
understanding, so it is not surprising that the rise of deep learning has led
to a breakthrough. The impressive progress of monocular depth estimators has
mirrored the growth in model capacity, from relatively modest CNNs to large
Transformer architectures. Still, monocular depth estimators tend to struggle
when presented with images with unfamiliar content and layout, since their
knowledge of the visual world is restricted by the data seen during training,
and challenged by zero-shot generalization to new domains. This motivates us to
explore whether the extensive priors captured in recent generative diffusion
models can enable better, more generalizable depth estimation. We introduce
Marigold, a method for affine-invariant monocular depth estimation that is
derived from Stable Diffusion and retains its rich prior knowledge. The
estimator can be fine-tuned in a couple of days on a single GPU using only
synthetic training data. It delivers state-of-the-art performance across a wide
range of datasets, including over 20% performance gains in specific cases.
Project page: https://marigoldmonodepth.github.io. |
This paper introduces Marigold, a resource-efficient fine-tuning protocol that transforms a pre-trained Latent Diffusion Model (Stable Diffusion v2) into an image-conditioned depth estimator, enabling state-of-the-art affine-invariant monocular depth estimation. |
Existing monocular depth estimators often struggle with unfamiliar content, highlighting the need for methods that leverage broader visual priors and generalize well. Diffusion models, trained on massive image datasets, offer such priors, making them promising candidates for this task. |
The authors freeze the pre-trained Stable Diffusion VAE and fine-tune its U-Net to estimate depth. The input image and depth map are encoded into a latent space, concatenated, and fed into the U-Net. Fine-tuning relies solely on synthetic data (Hypersim, Virtual KITTI) and an annealed multi-resolution noise schedule for faster convergence. A test-time ensembling scheme aggregates multiple predictions to boost performance. |
Marigold achieves state-of-the-art results on multiple zero-shot benchmarks (NYUv2, KITTI, ETH3D, ScanNet, DIODE), surpassing existing methods in most cases despite being trained solely on synthetic data.
Training with multi-resolution noise and annealing consistently improves performance and reduces prediction variance, as demonstrated by ablation studies.
The proposed test-time ensembling significantly enhances accuracy, with the most substantial gains observed when aggregating up to 10 predictions. |
The inference speed of Marigold is slower than feed-forward methods due to its iterative nature.
The generative nature of diffusion models can lead to inconsistent depth predictions for similar input images, even with test-time ensembling. |
monocular depth estimation, diffusion models, stable diffusion, fine-tuning, zero-shot generalization |
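The test-time ensembling mentioned in the Marigold entry above aggregates several affine-invariant predictions of the same image. A minimal sketch of one way to do this follows, assuming each prediction is known only up to a positive scale and a shift: predictions are aligned by min-max normalization and combined with a pixel-wise median. The paper's actual alignment procedure is not reproduced here, and the function names are hypothetical.

```python
from statistics import median


def normalize(depth):
    """Min-max normalize a flat list of depth values to [0, 1]."""
    lo, hi = min(depth), max(depth)
    if hi == lo:
        return [0.0 for _ in depth]
    return [(d - lo) / (hi - lo) for d in depth]


def ensemble(predictions):
    """Align predictions by normalization, then take the pixel-wise median."""
    aligned = [normalize(p) for p in predictions]
    return [median(pixel_values) for pixel_values in zip(*aligned)]
```

The median is used rather than the mean so that a single outlying sample has limited influence on the aggregated depth map.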
2312.02139
Report |
DiffiT: Diffusion Vision Transformers for Image Generation |
Ali Hatamizadeh, Jiaming Song, Guilin Liu, Jan Kautz, Arash Vahdat |
Diffusion models with their powerful expressivity and high sample quality
have achieved State-Of-The-Art (SOTA) performance in the generative domain. The
pioneering Vision Transformer (ViT) has also demonstrated strong modeling
capabilities and scalability, especially for recognition tasks. In this paper,
we study the effectiveness of ViTs in diffusion-based generative learning and
propose a new model denoted as Diffusion Vision Transformers (DiffiT).
Specifically, we propose a methodology for fine-grained control of the denoising
process and introduce the Time-dependent Multihead Self-Attention (TMSA)
mechanism. DiffiT is surprisingly effective in generating high-fidelity images
with significantly better parameter efficiency. We also propose latent and
image space DiffiT models and show SOTA performance on a variety of
class-conditional and unconditional synthesis tasks at different resolutions.
The Latent DiffiT model achieves a new SOTA FID score of 1.73 on the ImageNet-256
dataset while having 19.85% and 16.88% fewer parameters than other
Transformer-based diffusion models such as MDT and DiT, respectively. Code:
https://github.com/NVlabs/DiffiT |
Introduced DiffiT, a novel ViT-based diffusion model for latent and image space generation, featuring Time-dependent Multihead Self-Attention (TMSA) for efficient spatial-temporal dependency learning. |
Combines the strengths of ViTs (long-range dependency modeling, scalability) and diffusion models (high sample quality) for enhanced image generation. |
Leveraged TMSA within a U-Net architecture, enabling dynamic adaptation of attention across denoising stages by integrating temporal information into queries, keys, and values. |
Achieved a SOTA FID score of 1.73 on ImageNet-256 with fewer parameters than competing Transformer-based models such as MDT and DiT.
Demonstrated SOTA image generation performance on CIFAR-10 and FFHQ-64 datasets.
Showed TMSA's effectiveness through ablation studies, highlighting its importance for superior generation quality and parameter efficiency. |
Exploration of TMSA's potential beyond image generation tasks, such as image-to-image translation.
Investigation of alternative time embedding techniques for potential performance enhancement. |
image generation, diffusion models, vision transformers, time-dependent self-attention, generative modeling |
2312.02135
Report |
Fast View Synthesis of Casual Videos |
Yao-Chih Lee, Zhoutong Zhang, Kevin Blackburn-Matzen, Simon Niklaus, Jianming Zhang, Jia-Bin Huang, Feng Liu |
Novel view synthesis from an in-the-wild video is difficult due to challenges
like scene dynamics and lack of parallax. While existing methods have shown
promising results with implicit neural radiance fields, they are slow to train
and render. This paper revisits explicit video representations to synthesize
high-quality novel views from a monocular video efficiently. We treat static
and dynamic video content separately. Specifically, we build a global static
scene model using an extended plane-based scene representation to synthesize
temporally coherent novel video. Our plane-based scene representation is
augmented with spherical harmonics and displacement maps to capture
view-dependent effects and model non-planar complex surface geometry. We opt to
represent the dynamic content as per-frame point clouds for efficiency. While
such representations are inconsistency-prone, minor temporal inconsistencies
are perceptually masked due to motion. We develop a method to quickly estimate
such a hybrid video representation and render novel views in real time. Our
experiments show that our method can render high-quality novel views from an
in-the-wild video with comparable quality to state-of-the-art methods while
being 100x faster in training and enabling real-time rendering. |
This paper presents an efficient novel view synthesis method for casual monocular videos using a hybrid explicit representation, achieving quality comparable to NeRF-based methods while training 100x faster and rendering in real time. |
Existing NeRF-based methods for dynamic novel view synthesis are computationally expensive and slow to train and render, making them impractical for real-time applications. |
The method separates static and dynamic content, using a global plane-based representation with spherical harmonics and displacement maps for the static background and per-frame point clouds for dynamic objects. It jointly optimizes the representation from monocular videos using a set of carefully designed loss functions. |
The method achieves comparable rendering quality to state-of-the-art NeRF-based methods on the NVIDIA and DAVIS datasets.
It significantly outperforms previous approaches in terms of training and rendering speed, achieving real-time rendering at 27 FPS.
Ablation studies demonstrate the effectiveness of view-dependent plane textures and temporal neighbor blending for improving synthesis quality. |
The method's performance depends on the accuracy of preprocessed video depth and pose estimation.
It may struggle to separate objects with subtle motion from the static background, highlighting an area for future improvement. |
novel view synthesis, dynamic scenes, explicit representation, real-time rendering, monocular video |
2312.02134
Report |
GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians |
Liangxiao Hu, Hongwen Zhang, Yuxiang Zhang, Boyao Zhou, Boning Liu, Shengping Zhang, Liqiang Nie |
We present GaussianAvatar, an efficient approach to creating realistic human
avatars with dynamic 3D appearances from a single video. We start by
introducing animatable 3D Gaussians to explicitly represent humans in various
poses and clothing styles. Such an explicit and animatable representation can
fuse 3D appearances more efficiently and consistently from 2D observations. Our
representation is further augmented with dynamic properties to support
pose-dependent appearance modeling, where a dynamic appearance network along
with an optimizable feature tensor is designed to learn the
motion-to-appearance mapping. Moreover, by leveraging the differentiable motion
condition, our method enables a joint optimization of motions and appearances
during avatar modeling, which helps to tackle the long-standing issue of
inaccurate motion estimation in monocular settings. The efficacy of
GaussianAvatar is validated on both a public dataset and our collected
dataset, demonstrating its superior performance in terms of appearance quality
and rendering efficiency. |
Presents GaussianAvatar, a novel method for reconstructing realistic human avatars with dynamic 3D appearances from a single video using animatable 3D Gaussians. |
Creating personalized avatars from monocular videos is challenging due to the inherent ambiguity and complexities in capturing dynamic human appearance, including wrinkles and cloth deformations. |
Introduces animatable 3D Gaussians with pose-dependent properties and a dynamic appearance network to model motion-to-appearance mapping. Jointly optimizes motion and appearance during training to refine inaccurate motion estimations from monocular input. |
Outperforms previous methods in reconstruction quality on People-Snapshot, NeuMan, and DynVideo datasets.
Demonstrates robustness to initial motion estimations and effectively corrects misalignments in motion capture results.
Enables realistic animation with challenging poses while maintaining real-time rendering speeds. |
May generate artifacts due to inaccurate foreground segmentation.
Faces challenges in accurately modeling loose outfits, such as dresses, due to limitations in skinning weights derived from the SMPL model. |
human avatar reconstruction, animatable 3d gaussians, dynamic appearance modeling, motion optimization, single-view reconstruction |
2312.02133
Report |
Style Aligned Image Generation via Shared Attention |
Amir Hertz, Andrey Voynov, Shlomi Fruchter, Daniel Cohen-Or |
Large-scale Text-to-Image (T2I) models have rapidly gained prominence across
creative fields, generating visually compelling outputs from textual prompts.
However, controlling these models to ensure consistent style remains
challenging, with existing methods necessitating fine-tuning and manual
intervention to disentangle content and style. In this paper, we introduce
StyleAligned, a novel technique designed to establish style alignment among a
series of generated images. By employing minimal 'attention sharing' during the
diffusion process, our method maintains style consistency across images within
T2I models. This approach allows for the creation of style-consistent images
using a reference style through a straightforward inversion operation. Our
method's evaluation across diverse styles and text prompts demonstrates
high-quality synthesis and fidelity, underscoring its efficacy in achieving
consistent style across various inputs. |
Introduces "StyleAligned," a technique for consistent style interpretation across a set of images generated by text-to-image models, using minimal attention sharing during the diffusion process. |
Addresses the challenge of maintaining consistent style in AI-generated image sets, which is crucial for applications requiring a unified aesthetic. |
Employs minimal 'attention sharing' with AdaIN modulation during the diffusion process, where target images attend to a reference image's features, enabling style alignment without optimization or fine-tuning. |
Achieves significantly higher style consistency scores compared to standard text-to-image generation.
Exhibits less content leakage and generates more diverse sets compared to full attention sharing.
Outperforms personalization-based methods like DreamBooth and StyleDrop in terms of style consistency and adherence to text prompts. |
Achieving finer control over shape and appearance similarity among generated images.
Overcoming limitations of diffusion inversion for more robust style transfer from input images. |
text-to-image synthesis, style consistency, diffusion models, attention mechanisms, generative ai |
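The AdaIN modulation named in the StyleAligned entry above refers to adaptive instance normalization: target features are normalized with their own statistics and then rescaled to the statistics of a reference. The sketch below shows only that generic operation on flat feature lists. Where and how the paper applies it inside the attention layers is not reproduced, and the function name and `eps` default are assumptions.

```python
from statistics import fmean, pstdev


def adain(target, reference, eps=1e-5):
    """Adaptive instance normalization on flat feature lists.

    Normalizes `target` to zero mean and unit standard deviation, then
    rescales it to the mean and standard deviation of `reference`.
    """
    t_mean, t_std = fmean(target), pstdev(target)
    r_mean, r_std = fmean(reference), pstdev(reference)
    return [(x - t_mean) / (t_std + eps) * r_std + r_mean for x in target]
```

After this step the target features share the reference's first- and second-order statistics while keeping their own relative ordering, which is the usual motivation for using AdaIN to transfer style without copying content.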
2312.02126
Report |
SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM |
Nikhil Keetha, Jay Karhade, Krishna Murthy Jatavallabhula, Gengshan Yang, Sebastian Scherer, Deva Ramanan, Jonathon Luiten |
Dense simultaneous localization and mapping (SLAM) is crucial for robotics
and augmented reality applications. However, current methods are often hampered
by the non-volumetric or implicit way they represent a scene. This work
introduces SplaTAM, an approach that, for the first time, leverages explicit
volumetric representations, i.e., 3D Gaussians, to enable high-fidelity
reconstruction from a single unposed RGB-D camera, surpassing the capabilities
of existing methods. SplaTAM employs a simple online tracking and mapping
system tailored to the underlying Gaussian representation. It utilizes a
silhouette mask to elegantly capture the presence of scene density. This
combination enables several benefits over prior representations, including fast
rendering and dense optimization, quickly determining if areas have been
previously mapped, and structured map expansion by adding more Gaussians.
Extensive experiments show that SplaTAM achieves up to 2x superior performance
in camera pose estimation, map construction, and novel-view synthesis over
existing methods, paving the way for more immersive high-fidelity SLAM
applications. |
SplaTAM is the first dense RGB-D SLAM solution to use 3D Gaussian Splatting for high-fidelity online camera tracking and scene reconstruction. |
Existing dense SLAM methods suffer from limitations. Explicit representations struggle with novel view synthesis while implicit ones are computationally expensive and hard to edit. SplaTAM leverages the advantages of explicit volumetric representations to address these issues. |
SplaTAM represents the scene as a collection of 3D Gaussians. It utilizes differentiable rendering via Gaussian splatting to estimate camera poses and update the Gaussian map. This process involves camera tracking, Gaussian densification, and map updating. |
Achieves state-of-the-art camera pose estimation accuracy on multiple datasets, especially excelling in scenarios with large camera motion.
Demonstrates high-fidelity novel view synthesis performance, comparable to methods using ground truth poses.
Exhibits fast runtime comparable to methods using significantly fewer pixels for optimization due to the efficiency of Gaussian splatting. |
Shows sensitivity to motion blur, large depth noise, and aggressive rotation.
Future work includes addressing these sensitivities, scaling to large-scale scenes, and removing dependencies on known camera intrinsics and dense depth. |
slam, 3d gaussian splatting, novel view synthesis, differentiable rendering, dense reconstruction |
2312.02116
Report |
GIVT: Generative Infinite-Vocabulary Transformers |
Michael Tschannen, Cian Eastwood, Fabian Mentzer |
We introduce generative infinite-vocabulary transformers (GIVT) which
generate vector sequences with real-valued entries, instead of discrete tokens
from a finite vocabulary. To this end, we propose two surprisingly simple
modifications to decoder-only transformers: 1) at the input, we replace the
finite-vocabulary lookup table with a linear projection of the input vectors;
and 2) at the output, we replace the logits prediction (usually mapped to a
categorical distribution) with the parameters of a multivariate Gaussian
mixture model. Inspired by the image-generation paradigm of VQ-GAN and MaskGIT,
where transformers are used to model the discrete latent sequences of a VQ-VAE,
we use GIVT to model the unquantized real-valued latent sequences of a
$\beta$-VAE. In class-conditional image generation GIVT outperforms VQ-GAN (and
improved variants thereof) as well as MaskGIT, and achieves performance
competitive with recent latent diffusion models. Finally, we obtain strong
results outside of image generation when applying GIVT to panoptic segmentation
and depth estimation with a VAE variant of the UViM framework. |
Introduces Generative Infinite-Vocabulary Transformers (GIVT), which generate sequences of real-valued vectors instead of discrete tokens, eliminating the need for quantization in visual data generation. |
Overcomes limitations of VQ-VAE based image generation, such as low codebook usage and large embedding matrices, while offering better quality and representation learning. |
Modifies decoder-only transformers by replacing embedding lookup with linear projection and outputting parameters of a Gaussian Mixture Model (GMM). Trained on latent sequences from a β-VAE. |
Outperforms VQGAN, MaskGIT, and some diffusion models in class-conditional image generation.
Achieves strong representation learning capabilities, on par with or exceeding VQ-based models.
Demonstrates competitive performance in panoptic segmentation and depth estimation with UViM framework. |
End-to-end training of VAE and GIVT poses challenges.
Exploration of GIVT applications beyond image generation, such as audio and time-series modeling. |
generative models, transformers, image generation, quantization-free, infinite vocabulary |
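As a rough illustration of the output-side modification described in the GIVT entry above, the sketch below evaluates the negative log-likelihood of one real-valued target vector under a Gaussian mixture whose parameters a transformer head would predict. The diagonal covariance, the function name `gmm_nll`, and the argument shapes are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def gmm_nll(x, logits, means, log_scales):
    """Negative log-likelihood of x under a diagonal-covariance Gaussian mixture.

    x:          (d,)   real-valued target vector
    logits:     (k,)   unnormalized mixture weights
    means:      (k, d) component means
    log_scales: (k, d) log standard deviations per component and dimension
    """
    # Log-softmax over the mixture logits.
    log_w = logits - np.logaddexp.reduce(logits)
    # Standardized residuals per component and dimension.
    z = (x - means) / np.exp(log_scales)
    # Log-density of x under each diagonal Gaussian component.
    log_comp = -0.5 * np.sum(z**2 + 2.0 * log_scales + np.log(2.0 * np.pi), axis=1)
    # Log-sum-exp over components gives the mixture log-likelihood.
    return -np.logaddexp.reduce(log_w + log_comp)
```

Minimizing this quantity over predicted `logits`, `means`, and `log_scales` plays the role that cross-entropy over a categorical distribution plays for discrete tokens.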
2312.02109
Report |
ArtAdapter: Text-to-Image Style Transfer using Multi-Level Style Encoder and Explicit Adaptation |
Dar-Yen Chen, Hamish Tennent, Ching-Wen Hsu |
This work introduces ArtAdapter, a transformative text-to-image (T2I) style
transfer framework that transcends traditional limitations of color,
brushstrokes, and object shape, capturing high-level style elements such as
composition and distinctive artistic expression. The integration of a
multi-level style encoder with our proposed explicit adaptation mechanism
enables ArtAdapter to achieve unprecedented fidelity in style transfer,
ensuring close alignment with textual descriptions. Additionally, the
incorporation of an Auxiliary Content Adapter (ACA) effectively separates
content from style, alleviating the borrowing of content from style references.
Moreover, our novel fast finetuning approach could further enhance zero-shot
style representation while mitigating the risk of overfitting. Comprehensive
evaluations confirm that ArtAdapter surpasses current state-of-the-art methods. |
Introduces ArtAdapter, a novel text-to-image (T2I) style transfer framework that captures both low- and high-level artistic features, from textures to composition. |
Existing T2I style transfer methods struggle to capture high-level artistic style, often being limited to color and texture, or borrowing unwanted content from style references. |
Uses a multi-level style encoder, an explicit adaptation mechanism within the diffusion model's cross-attention layers, and an auxiliary content adapter (ACA) during training. |
Faithfully captures diverse artistic styles without compromising textual semantics.
Enables flexible style mixing from different references and across different style levels.
Outperforms state-of-the-art methods in both single- and multi-reference style transfer based on quantitative metrics and user study. |
High-level style embeddings can inadvertently incorporate lower-level elements during style mixing, requiring improved disentanglement.
Future work can explore broader applications of ArtAdapter beyond style transfer, such as incorporating structural controls. |
text-to-image, style transfer, diffusion models, style mixing, deep learning |
2312.02103
Report |
Learning Pseudo-Labeler beyond Noun Concepts for Open-Vocabulary Object Detection |
Sunghun Kang, Junbum Cha, Jonghwan Mun, Byungseok Roh, Chang D. Yoo |
Open-vocabulary object detection (OVOD) has recently gained significant
attention as a crucial step toward achieving human-like visual intelligence.
Existing OVOD methods extend target vocabulary from pre-defined categories to
open-world by transferring knowledge of arbitrary concepts from vision-language
pre-training models to the detectors. While previous methods have shown
remarkable successes, they suffer from indirect supervision or limited
transferable concepts. In this paper, we propose a simple yet effective method
to directly learn region-text alignment for arbitrary concepts. Specifically,
the proposed method aims to learn arbitrary image-to-text mapping for
pseudo-labeling of arbitrary concepts, named Pseudo-Labeling for Arbitrary
Concepts (PLAC). The proposed method shows competitive performance on the
standard OVOD benchmark for noun concepts and a large improvement on referring
expression comprehension benchmark for arbitrary concepts. |
This paper introduces PLAC, a pseudo-labeling method for open-vocabulary object detection (OVOD) that learns arbitrary image-to-text mapping to generate pseudo-labels, enabling the detection of concepts beyond simple nouns. |
Existing OVOD methods struggle to effectively transfer knowledge from open-world classifiers to detectors, often relying on indirect supervision or limiting themselves to noun-based concepts. PLAC overcomes these limitations by directly learning region-text alignment for arbitrary concepts, broadening the scope of detectable objects. |
PLAC leverages a module trained on image-text pairs to map CLIP image embeddings to corresponding text embeddings. These embeddings serve as pseudo-labels for training an OVOD model (Deformable DETR) with a two-stage matching strategy to handle the uncertainty of pseudo-labels. |
PLAC achieves competitive performance on the LVIS benchmark for OVOD, demonstrating its ability to effectively learn noun concepts.
PLAC significantly outperforms previous state-of-the-art methods on the RefCOCOg referring expression comprehension benchmark, highlighting its capability to detect objects based on arbitrary concepts, including colors and specific object attributes.
Ablation studies confirm the effectiveness of the proposed pseudo-labeling method, loss functions, and the two-stage matching strategy. |
The performance of PLAC on LVIS, while competitive, reveals that current benchmarks might not be ideal for evaluating the full potential of OVOD methods designed for arbitrary concept detection.
Future work could explore incorporating additional modalities or knowledge sources to further enhance the richness and accuracy of the pseudo-labels. |
open-vocabulary object detection, pseudo-labeling, region-text alignment, vision-language pre-training, referring expression comprehension |
2312.02087
Report |
VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence |
Yuchao Gu, Yipin Zhou, Bichen Wu, Licheng Yu, Jia-Wei Liu, Rui Zhao, Jay Zhangjie Wu, David Junhao Zhang, Mike Zheng Shou, Kevin Tang |
Current diffusion-based video editing primarily focuses on
structure-preserved editing by utilizing various dense correspondences to
ensure temporal consistency and motion alignment. However, these approaches are
often ineffective when the target edit involves a shape change. To embark on
video editing with shape change, we explore customized video subject swapping
in this work, where we aim to replace the main subject in a source video with a
target subject having a distinct identity and potentially different shape. In
contrast to previous methods that rely on dense correspondences, we introduce
the VideoSwap framework that exploits semantic point correspondences, inspired
by our observation that only a small number of semantic points are necessary to
align the subject's motion trajectory and modify its shape. We also introduce
various user-point interactions (e.g., removing points and dragging points) to
address various semantic point correspondence. Extensive experiments
demonstrate state-of-the-art video subject swapping results across a variety of
real-world videos. |
This paper introduces VideoSwap, a novel framework for customized video subject swapping that leverages semantic point correspondences to align motion trajectories while enabling significant shape changes in the swapped subject. |
Existing diffusion-based video editing methods, often relying on dense correspondences, struggle with shape changes in the edited subject. VideoSwap addresses this limitation by utilizing sparse semantic point correspondences, allowing for flexible shape manipulation while preserving motion fidelity. |
VideoSwap extracts semantic point trajectories and embeddings from the source video. It then registers these points on the source video to guide the diffusion model during the editing process. The framework supports user interaction through point removal or dragging to handle various semantic correspondences between the source and target subjects. A layered neural atlas aids in propagating point displacements consistently across frames. |
VideoSwap demonstrates superior performance in customized video subject swapping compared to state-of-the-art video editing methods, as evidenced by both qualitative and quantitative evaluations.
The use of semantic point correspondence enables VideoSwap to achieve significant shape changes in the target subject while maintaining accurate alignment with the source subject's motion.
Ablation studies validate the contribution of key components, such as the use of DIFT embeddings for semantic point representation and the incorporation of a point patch loss and a semantic-enhanced schedule during training. |
The performance of VideoSwap relies on accurate point tracking, which can be challenged by self-occlusion and significant view changes in the video.
The current implementation of VideoSwap incurs noticeable computational cost, limiting its practicality for real-time interactive editing. Future work may explore neural field acceleration and diffusion model distillation to address this limitation. |
video editing, diffusion models, semantic point correspondence, shape change, motion trajectory alignment |
2312.02069
Report |
GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians |
Shenhan Qian, Tobias Kirschstein, Liam Schoneveld, Davide Davoli, Simon Giebenhain, Matthias Nießner |
We introduce GaussianAvatars, a new method to create photorealistic head
avatars that are fully controllable in terms of expression, pose, and
viewpoint. The core idea is a dynamic 3D representation based on 3D Gaussian
splats that are rigged to a parametric morphable face model. This combination
facilitates photorealistic rendering while allowing for precise animation
control via the underlying parametric model, e.g., through expression transfer
from a driving sequence or by manually changing the morphable model parameters.
We parameterize each splat by a local coordinate frame of a triangle and
optimize for explicit displacement offset to obtain a more accurate geometric
representation. During avatar reconstruction, we jointly optimize for the
morphable model parameters and Gaussian splat parameters in an end-to-end
fashion. We demonstrate the animation capabilities of our photorealistic avatar
in several challenging scenarios. For instance, we show reenactments from a
driving video, where our method outperforms existing works by a significant
margin. |
Introduces GaussianAvatars, a method for creating animatable and photorealistic head avatars by rigging 3D Gaussian splats to a parametric mesh (FLAME). |
Creating animatable avatars with photorealistic quality and controllability is crucial for various applications in gaming, VR/AR, etc. |
Binds 3D Gaussian splats to a FLAME mesh, enabling the Gaussians to move dynamically with the mesh. A binding inheritance strategy supports adding/removing Gaussians while maintaining controllability. Regularization ensures accurate animation without artifacts. |
Outperforms state-of-the-art methods in novel-view synthesis and self-reenactment, achieving higher PSNR and SSIM values.
Exhibits superior visual quality, capturing fine details like wrinkles and hair strands, especially during animation.
Demonstrates better generalization to novel expressions and poses compared to other methods. |
Lacks explicit control over regions not modeled by FLAME, such as hair or accessories.
Relighting the avatar is currently not feasible. |
avatar creation, 3d gaussian splatting, parametric face model, novel view synthesis, facial reenactment |
2312.01987
Report |
Bootstrapping SparseFormers from Vision Foundation Models |
Ziteng Gao, Zhan Tong, Kevin Qinghong Lin, Joya Chen, Mike Zheng Shou |
The recently proposed SparseFormer architecture provides an alternative
approach to visual understanding by utilizing a significantly lower number of
visual tokens via adjusting RoIs, greatly reducing computational costs while
still achieving promising performance. However, training SparseFormers from
scratch is still expensive, and scaling up the number of parameters can be
challenging. In this paper, we propose to bootstrap SparseFormers from
ViT-based vision foundation models in a simple and efficient way. Since the
majority of SparseFormer blocks are the standard transformer ones, we can
inherit weights from large-scale pre-trained vision transformers and freeze
them as much as possible. Therefore, we only need to train the
SparseFormer-specific lightweight focusing transformer to adjust token RoIs and
fine-tune a few early pre-trained blocks to align the final token
representation. In such a way, we can bootstrap SparseFormer architectures from
various large-scale pre-trained models (e.g., IN-21K pre-trained AugRegs or
CLIPs) using a rather smaller amount of training samples (e.g., IN-1K) and
without labels or captions within just a few hours. As a result, the
bootstrapped unimodal SparseFormer (from AugReg-ViT-L/16-384) can reach 84.9%
accuracy on IN-1K with only 49 tokens, and the multimodal SparseFormer from
CLIPs also demonstrates notable zero-shot performance with highly reduced
computational cost without seeing any caption during the bootstrapping
procedure. In addition, CLIP-bootstrapped SparseFormers, which align the output
space with language without seeing a word, can serve as efficient vision
encoders in multimodal large language models. Code and models are available at
https://github.com/showlab/sparseformer |
This paper presents a method to bootstrap SparseFormers from pre-trained vision foundation models (like AugReg and CLIP) by inheriting weights and aligning final representations using a smaller dataset. |
Training SparseFormers from scratch is expensive and scaling them is challenging. This work addresses these limitations by leveraging pre-trained models for faster and more efficient training. |
The method inherits weights from the pre-trained model into the SparseFormer's cortex transformer blocks. Then, it aligns the final representation of the SparseFormer with that of the pre-trained model using a cosine loss on a smaller dataset like ImageNet-1K. |
Bootstrapped SparseFormers achieve comparable accuracy to pre-trained models with significantly fewer tokens (e.g., 84.9% top-1 accuracy on ImageNet-1K with only 49 tokens).
The method is applicable to both unimodal (classification) and multimodal (CLIP) models.
Bootstrapped SparseFormers can serve as efficient backbones for downstream tasks like semantic segmentation and as vision encoders in multimodal large language models. |
The bootstrapping method assumes a transformer architecture for the vision foundation model, limiting its applicability to non-transformer models.
The method requires access to pre-trained weights, which may not be available for some large-scale proprietary models. |
sparseformers, vision foundation models, model bootstrapping, efficient vision transformers, multimodal learning |
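The alignment step described in the SparseFormer entry above can be sketched as a cosine loss between the final embeddings of the bootstrapped model and those of the frozen pre-trained model. The `1 - cosine similarity` form, the function name `cosine_alignment_loss`, and the `(batch, dim)` layout are assumptions for illustration; the entry only states that a cosine loss is used.

```python
import numpy as np

def cosine_alignment_loss(student, teacher, eps=1e-8):
    """Mean (1 - cosine similarity) between paired embeddings.

    student: (batch, dim) final representations from the model being trained
    teacher: (batch, dim) final representations from the frozen pre-trained model
    """
    # L2-normalize each row; eps guards against division by zero.
    s = student / (np.linalg.norm(student, axis=1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=1, keepdims=True) + eps)
    # The row-wise dot product of unit vectors is the cosine similarity.
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))
```

Because the target is the teacher's own embedding, a loss of this kind needs neither labels nor captions, which matches the label-free bootstrapping the entry describes.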
2312.01985
Report |
UniGS: Unified Representation for Image Generation and Segmentation |
Lu Qi, Lehan Yang, Weidong Guo, Yu Xu, Bo Du, Varun Jampani, Ming-Hsuan Yang |
This paper introduces a novel unified representation of diffusion models for
image generation and segmentation. Specifically, we use a colormap to represent
entity-level masks, addressing the challenge of varying entity numbers while
aligning the representation closely with the image RGB domain. Two novel
modules, including the location-aware color palette and progressive dichotomy
module, are proposed to support our mask representation. On the one hand, a
location-aware palette guarantees the colors' consistency to entities'
locations. On the other hand, the progressive dichotomy module can efficiently
decode the synthesized colormap to high-quality entity-level masks in a
depth-first binary search without knowing the cluster numbers. To tackle the
issue of lacking large-scale segmentation training data, we employ an
inpainting pipeline and then improve the flexibility of diffusion models across
various tasks, including inpainting, image synthesis, referring segmentation,
and entity segmentation. Comprehensive experiments validate the efficiency of
our approach, demonstrating comparable segmentation mask quality to
state-of-the-art and adaptability to multiple tasks. The code will be released
at https://github.com/qqlu/Entity . |
This paper presents UniGS, a novel unified diffusion model for simultaneous image generation and entity-level segmentation using a colormap representation for masks. |
A unified representation for both generation and segmentation can refine image generation, enhance coherence between synthesized entities and their masks, and enable a single model to perform various dense prediction tasks. |
UniGS employs a UNet architecture with dual branches for image and mask generation. It introduces a location-aware color palette for consistent entity representation and a progressive dichotomy module for efficient colormap decoding to masks. The model is trained using an inpainting pipeline to address limited segmentation data. |
UniGS achieves comparable segmentation quality to state-of-the-art methods without using standard segmentation losses.
It demonstrates strong performance in multi-class multi-region inpainting, image synthesis, referring segmentation, and entity segmentation.
The model exhibits the ability to generate realistic shadows, even without explicit shadow supervision. |
There is still a performance gap between UniGS and state-of-the-art entity segmentation models.
Future work includes exploring multi-task training for all tasks within a single model. |
diffusion models, image generation, semantic segmentation, unified representation, inpainting |
2312.01841
Report |
VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior |
Xusen Sun, Longhao Zhang, Hao Zhu, Peng Zhang, Bang Zhang, Xinya Ji, Kangneng Zhou, Daiheng Gao, Liefeng Bo, Xun Cao |
Audio-driven talking head generation has drawn much attention in recent
years, and many efforts have been made in lip-sync, expressive facial
expressions, natural head pose generation, and high video quality. However, no
model has yet led or tied on all these metrics due to the one-to-many mapping
between audio and motion. In this paper, we propose VividTalk, a two-stage
generic framework that supports generating high-visual quality talking head
videos with all the above properties. Specifically, in the first stage, we map
the audio to mesh by learning two motions, including non-rigid expression
motion and rigid head motion. For expression motion, both blendshape and vertex
are adopted as the intermediate representation to maximize the representation
ability of the model. For natural head motion, a novel learnable head pose
codebook with a two-phase training mechanism is proposed. In the second stage,
we propose a dual-branch motion-VAE and a generator to transform the meshes
into dense motion and synthesize high-quality video frame-by-frame. Extensive
experiments show that the proposed VividTalk can generate high-visual quality
talking head videos with lip-sync and realism enhanced by a large margin, and
outperforms previous state-of-the-art works in objective and subjective
comparisons. |
Proposes VividTalk, a two-stage framework for generating high-quality talking head videos with expressive facial expressions and natural head poses from audio and a single reference image. |
Existing methods struggle to simultaneously achieve lip-sync, expressive facial expressions, natural head poses, and high video quality due to the one-to-many mapping between audio and motion. |
The framework consists of: 1) **Audio-To-Mesh Generation**: maps audio to non-rigid facial expressions (using both blendshapes and vertex offsets) and rigid head poses (using a learnable head pose codebook) to generate driven meshes. 2) **Mesh-To-Video Generation**: transforms driven meshes into 2D dense motion using a dual-branch motion-VAE and synthesizes the final video frame-by-frame. |
Outperforms state-of-the-art methods in both objective and subjective evaluations for video quality, identity preservation, and head pose diversity.
Generates accurate lip-synchronized and expressive facial motions.
Demonstrates the effectiveness of using both blendshapes and vertex offsets, a learnable head pose codebook, and a dual-branch motion-VAE. |
Reliance on 3DMM for facial modeling, which may limit the representation of certain facial features.
Further research on improving the temporal consistency and smoothness of generated head poses. |
talking head generation, audio-driven animation, deep learning, computer vision, 3d morphable model |
2312.01790
Report |
Exploring Multi-Modal Fusion for Image Manipulation Detection and Localization |
Konstantinos Triaridis, Vasileios Mezaris |
Recent image manipulation localization and detection techniques usually
leverage forensic artifacts and traces that are produced by a noise-sensitive
filter, such as SRM and Bayar convolution. In this paper, we showcase that
different filters commonly used in such approaches excel at unveiling different
types of manipulations and provide complementary forensic traces. Thus, we
explore ways of merging the outputs of such filters and aim to leverage the
complementary nature of the artifacts produced to perform image manipulation
localization and detection (IMLD). We propose two distinct methods: one that
produces independent features from each forensic filter and then fuses them
(this is referred to as late fusion) and one that performs early mixing of
different modal outputs and produces early combined features (this is referred
to as early fusion). We demonstrate that both approaches achieve competitive
performance for both image manipulation localization and detection,
outperforming state-of-the-art models across several datasets. |
This paper proposes two novel multi-modal fusion approaches for enhancing image manipulation detection and localization by leveraging complementary forensic artifacts from different filters (NoisePrint++, SRM, Bayar convolution). |
Image manipulation detection and localization are crucial for combating disinformation and fostering trust in digital media, especially with the advancement of sophisticated image editing tools. |
The study uses a dual-branch encoder-decoder architecture as a baseline and extends it with two fusion paradigms: 1) **Late Fusion:** Features from each forensic filter are extracted independently and then concatenated. Shared weights in the RGB branch mitigate overfitting. 2) **Early Fusion:** Features from different modalities are mixed early on using convolutional blocks before being fed into the encoder, promoting smoother feature integration. |
Both early and late fusion methods achieve state-of-the-art performance on five benchmark datasets for image manipulation localization (using pixel-level F1) and detection (using AUC and balanced accuracy).
The study reveals that different forensic filters exhibit complementary strengths in detecting specific manipulation types, such as NoisePrint++ excelling in post-processing manipulations and SRM/Bayar in copy-move or diffusion-based manipulations.
Both proposed fusion approaches demonstrate robustness against image degradations like Gaussian blurring and JPEG compression. |
The late fusion model's performance on detection tasks (bAcc) suggests potential for improvement through further regularization.
Future work will explore the models' limitations against adversarial attacks. |
image forensics, image manipulation detection, image manipulation localization, multimodal fusion, deep learning |
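To make the difference between the two fusion paradigms in the entry above concrete, the following toy sketch contrasts them on random arrays. The stand-in encoder (a per-pixel linear map followed by ReLU), the channel counts, and all function names are hypothetical and chosen only for illustration; they are not the architecture used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, w):
    """Toy stand-in encoder: per-pixel linear map over channels, then ReLU.

    x: (H, W, C_in) input map, w: (C_in, C_out) weights.
    """
    return np.maximum(x @ w, 0.0)

def late_fusion(filter_outputs, weights):
    """Encode each forensic filter output independently, then concatenate features."""
    feats = [encode(x, w) for x, w in zip(filter_outputs, weights)]
    return np.concatenate(feats, axis=-1)

def early_fusion(filter_outputs, mix_w, enc_w):
    """Mix the stacked filter outputs first, then run a single encoder."""
    stacked = np.concatenate(filter_outputs, axis=-1)
    mixed = np.maximum(stacked @ mix_w, 0.0)  # behaves like a 1x1 convolution
    return encode(mixed, enc_w)

# Three hypothetical filter outputs with 3 channels each on an 8x8 image.
outputs = [rng.normal(size=(8, 8, 3)) for _ in range(3)]
late = late_fusion(outputs, [rng.normal(size=(3, 16)) for _ in range(3)])
early = early_fusion(outputs, rng.normal(size=(9, 3)), rng.normal(size=(3, 16)))
```

In this sketch late fusion yields a feature map whose channel count grows with the number of filters, while early fusion keeps a single encoder and a fixed feature width.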
2312.01771
Report |
IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks |
Jiarui Xu, Yossi Gandelsman, Amir Bar, Jianwei Yang, Jianfeng Gao, Trevor Darrell, Xiaolong Wang |
In-context learning allows adapting a model to new tasks given a task
description at test time. In this paper, we present IMProv - a generative model
that is able to in-context learn visual tasks from multimodal prompts. Given a
textual description of a visual task (e.g. "Left: input image, Right:
foreground segmentation"), a few input-output visual examples, or both, the
model in-context learns to solve it for a new test input. We train a masked
generative transformer on a new dataset of figures from computer vision papers
and their associated captions, together with a captioned large-scale image-text
dataset. During inference time, we prompt the model with text and/or image task
example(s) and have the model inpaint the corresponding output. We show that
training our model with text conditioning and scaling the dataset size improves
in-context learning for computer vision tasks by over +10% AP for Foreground
Segmentation, over +5% gains in AP for Single Object Detection, and almost
20% lower LPIPS in Colorization. Our empirical results suggest that vision and
language prompts are complementary and it is advantageous to use both to
achieve better in-context learning performance. Project page is available at
https://jerryxu.net/IMProv . |
Presents IMProv, a generative model for in-context learning of visual tasks from multimodal prompts (text and image), enabling it to solve new tasks given a textual description, visual examples, or both. |
Addresses the limitations of vision-only in-context learning in computer vision by incorporating the complementary strengths of language and vision for clearer task instruction and ambiguity reduction. |
Trains a masked generative transformer on a new dataset of captioned computer vision paper figures (S2CV) combined with LAION-400M. At inference, the model receives visual and/or textual prompts and inpaints the output for the given task. |
Training with text conditioning and a larger dataset significantly improves in-context learning performance on various vision tasks.
IMProv achieves superior results compared to vision-only approaches, e.g., +10% AP on foreground segmentation.
Demonstrates a trade-off between visual and textual prompt quality - high-quality prompts of one type can compensate for lower-quality prompts of the other. |
Limited to generating pixel-based outputs, restricting its applicability to tasks representable in the pixel space.
Future work involves investigating the impact of incorporating more diverse unstructured data sources on in-context learning capabilities. |
in-context learning, multimodal learning, computer vision, image inpainting, generative models |
2312.01711
Report |
Regressor-Segmenter Mutual Prompt Learning for Crowd Counting |
Mingyue Guo, Li Yuan, Zhaoyi Yan, Binghui Chen, Yaowei Wang, Qixiang Ye |
Crowd counting has achieved significant progress by training regressors to
predict instance positions. In heavily crowded scenarios, however, regressors
are challenged by uncontrollable annotation variance, which causes density map
bias and context information inaccuracy. In this study, we propose mutual
prompt learning (mPrompt), which leverages a regressor and a segmenter as
guidance for each other, solving bias and inaccuracy caused by annotation
variance while distinguishing foreground from background. Specifically, mPrompt
leverages point annotations to tune the segmenter and predict pseudo head masks
in a way of point prompt learning. It then uses the predicted segmentation
masks, which serve as spatial constraint, to rectify biased point annotations
as context prompt learning. mPrompt defines a way of mutual information
maximization from prompt learning, mitigating the impact of annotation variance
while improving model accuracy. Experiments show that mPrompt significantly
reduces the Mean Absolute Error (MAE), demonstrating the potential to be a general
framework for downstream vision tasks. |
This paper proposes a mutual prompt learning (mPrompt) framework for crowd counting that leverages a regressor and a segmenter to guide each other, mitigating the impact of annotation variance. |
Point annotations in crowded scenes suffer from variance, leading to density map bias and inaccurate context information in crowd counting models. |
mPrompt utilizes point annotations to train a segmenter, predicting pseudo head masks as point prompts. These masks then act as context prompts, refining the regressor's predictions. This mutual learning process optimizes both branches. |
mPrompt significantly reduces the Mean Absolute Error (MAE) on four benchmark datasets.
Visualization analysis demonstrates mPrompt's capability to generate more accurate density maps.
Ablation studies validate the effectiveness of each component in mPrompt. |
The current method relies on pre-trained segmenters; exploring end-to-end training without relying on box annotations is a potential future direction.
Extending mPrompt to other downstream vision tasks with scarce or noisy labels, such as object detection and visual tracking, is promising. |
crowd counting, prompt learning, annotation variance, segmentation, density map regression |
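For reference, the MAE metric cited in the crowd counting entry above is conventionally computed over per-image counts, with each predicted count obtained by summing the predicted density map. The sketch below assumes that convention; the function name `count_mae` and the input layout are illustrative and not taken from the paper.

```python
import numpy as np

def count_mae(pred_density_maps, gt_counts):
    """Mean absolute error between predicted and ground-truth crowd counts.

    pred_density_maps: iterable of 2D arrays; each is assumed to sum to the predicted count
    gt_counts:         iterable of ground-truth counts, one per image
    """
    # Integrate each density map to obtain a scalar count per image.
    pred_counts = np.array([float(np.sum(d)) for d in pred_density_maps])
    gt = np.asarray(gt_counts, dtype=float)
    # Average the absolute count differences over all images.
    return float(np.mean(np.abs(pred_counts - gt)))
```

For example, density maps summing to 2 and 4 against ground-truth counts of 3 and 4 give absolute errors of 1 and 0, so the MAE is 0.5.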
2312.01663
Report |
Customize your NeRF: Adaptive Source Driven 3D Scene Editing via Local-Global Iterative Training |
Runze He, Shaofei Huang, Xuecheng Nie, Tianrui Hui, Luoqi Liu, Jiao Dai, Jizhong Han, Guanbin Li, Si Liu |
In this paper, we target the adaptive source driven 3D scene editing task by
proposing a CustomNeRF model that unifies a text description or a reference
image as the editing prompt. However, obtaining desired editing results
conformed with the editing prompt is nontrivial since there exist two
significant challenges, including accurate editing of only foreground regions
and multi-view consistency given a single-view reference image. To tackle the
first challenge, we propose a Local-Global Iterative Editing (LGIE) training
scheme that alternates between foreground region editing and full-image
editing, aimed at foreground-only manipulation while preserving the background.
For the second challenge, we also design a class-guided regularization that
exploits class priors within the generation model to alleviate the
inconsistency problem among different views in image-driven editing. Extensive
experiments show that our CustomNeRF produces precise editing results under
various real scenes for both text- and image-driven settings. |
Introduces CustomNeRF, a unified framework for adaptive source-driven 3D scene editing using text descriptions or reference images as prompts. |
Existing 3D scene editing methods lack the flexibility to perform specific edits based on user-provided reference images while preserving the background accurately. |
The authors propose a novel framework with (1) a foreground-aware NeRF for identifying editable regions, (2) a subject-aware T2I model for embedding reference image subjects into hybrid prompts, and (3) a Local-Global Iterative Editing (LGIE) training scheme for editing foregrounds while preserving backgrounds and a class-guided regularization for view consistency in image-driven editing. |
CustomNeRF produces precise and view-consistent editing results in both text- and image-driven settings, outperforming baseline methods.
The LGIE training scheme effectively edits foreground regions while preserving background content.
Class-guided regularization mitigates the Janus problem in image-driven editing, improving cross-view consistency. |
The method's reliance on Custom Diffusion for transferring subject appearance may result in inconsistencies if Custom Diffusion fails to replicate reference images perfectly.
Currently limited to text and image prompts, future work could explore incorporating other editing sources like audio or sketches. |
3d scene editing, neural radiance fields (nerf), text-to-image generation, image-driven editing, view consistency |
2312.01629
Report |
CLAMP: Contrastive LAnguage Model Prompt-tuning |
Piotr Teterwak, Ximeng Sun, Bryan A. Plummer, Kate Saenko, Ser-Nam Lim |
Large language models (LLMs) have emerged as powerful general-purpose
interfaces for many machine learning problems. Recent work has adapted LLMs to
generative visual tasks like image captioning, visual question answering, and
visual chat, using a relatively small amount of instruction-tuning data. In
this paper, we explore whether modern LLMs can also be adapted to classifying
an image into a set of categories. First, we evaluate multimodal LLMs that are
tuned for generative tasks on zero-shot image classification and find that
their performance is far below that of specialized models like CLIP. We then
propose an approach for light fine-tuning of LLMs using the same contrastive
image-caption matching objective as CLIP. Our results show that LLMs can,
indeed, achieve good image classification performance when adapted this way.
Our approach beats state-of-the-art mLLMs by 13% and slightly outperforms
contrastive learning with a custom text model, while also retaining the LLM's
generative abilities. LLM initialization appears to particularly help
classification in domains under-represented in the visual pre-training data. |
The paper introduces CLAMP (Contrastive LAnguage Model Prompt-tuning), a technique to enhance the zero-shot image classification abilities of multimodal Large Language Models (mLLMs). |
Existing mLLMs excel at generative visual tasks like captioning but struggle with discriminative tasks such as zero-shot image classification. This is a significant limitation as classification is fundamental for a foundation model. |
CLAMP adapts an LLM by replacing the text encoder of a contrastive vision-language model with the LLM and fine-tunes it using a contrastive image-caption objective. It leverages techniques like Read-Only Prompts, Output Attention Pooling, and LoRA (Low Rank Adaptation) for efficient fine-tuning. |
CLAMP significantly outperforms state-of-the-art mLLMs on zero-shot image classification by 13%, approaching the performance of CLIP.
The LLM initialization in CLAMP proves particularly beneficial for classification in domains under-represented in the visual pre-training data.
CLAMP retains the LLM's generative capabilities, showing promise for universal models. |
One limitation is that CLAMP's current implementation requires separate adapters for discriminative and generative tasks.
Future work could explore combining these adapters into a single set for a more unified model. |
multimodal llms, zero-shot classification, contrastive learning, prompt-tuning, parameter-efficient fine-tuning |
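The contrastive image-caption objective mentioned in the CLAMP record can be illustrated with a minimal sketch. It assumes a symmetric, CLIP-style softmax cross-entropy over cosine similarities. The function names, the toy embeddings, and the temperature of 0.07 are illustrative assumptions, not details taken from the paper.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    # Hypothetical helper: the i-th image and i-th caption form the positive pair;
    # every other pairing in the batch acts as a negative.
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text direction: each row should peak on its diagonal entry.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: the same, taken over columns.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)

# Toy embeddings standing in for the vision-encoder and LLM outputs.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_captions = [[1.0, 0.0], [0.0, 1.0]]
swapped_captions = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = contrastive_loss(images, aligned_captions)
loss_swapped = contrastive_loss(images, swapped_captions)
```

In this toy setting the aligned batch yields a much smaller loss than the swapped one. That gap is the signal such an objective would use to pull matching image and caption embeddings together.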
2312.01623
Report |
Universal Segmentation at Arbitrary Granularity with Language Instruction |
Yong Liu, Cairong Zhang, Yitong Wang, Jiahao Wang, Yujiu Yang, Yansong Tang |
This paper aims to achieve universal segmentation at arbitrary semantic
levels. Despite significant progress in recent years, specialist segmentation
approaches are limited to specific tasks and data distributions. Retraining a
new model to adapt to new scenarios or settings is expensive in both
computation and time, which raises the demand for a versatile and universal
segmentation model that can cater to various granularities. Although some
attempts have been made for unifying different segmentation tasks or
generalization to various scenarios, limitations in the definition of paradigms
and input-output spaces make it difficult for them to achieve accurate
understanding of content at arbitrary granularity. To this end, we present
UniLSeg, a universal segmentation model that can perform segmentation at any
semantic level with the guidance of language instructions. For training
UniLSeg, we reorganize a group of tasks from their original diverse distributions
into a unified data format, where images paired with texts describing the
segmentation targets are the input and the corresponding masks are the output.
Combined with an automatic annotation engine that exploits large amounts of
unlabeled data, UniLSeg achieves
excellent performance on various tasks and settings, surpassing both specialist
and unified segmentation models. |
Proposes UniLSeg, a universal segmentation model that uses language instructions to segment images at any semantic level. |
Existing segmentation models are often task-specific and struggle to adapt to diverse scenarios and granularities. UniLSeg addresses this by using flexible language prompts for universal segmentation. |
UniLSeg employs a two-stream decoding structure for visual-linguistic interaction, enabling segmentation at various levels. It's trained on a unified dataset of images, masks, and captions, incorporating data from various segmentation tasks. An automatic annotation engine generates pseudo-labels for unlabeled data, enhancing training. |
Outperforms state-of-the-art methods in referring image segmentation, achieving 79.27% vs 73.41% IoU on G-Ref.
Achieves state-of-the-art performance in salient object detection, surpassing previous methods on ECSSD, SOD, and PASCAL-S.
Demonstrates strong performance in semantic segmentation, surpassing previous unified models in the in-vocabulary setting and achieving competitive results in the open-vocabulary setting. |
Current implementation processes videos frame-by-frame, lacking temporal understanding for video segmentation.
Performance on semantic segmentation, while exceeding other unified models, remains slightly lower than specialized models. |
universal segmentation, language-guided vision, visual-linguistic interaction, multi-task learning, automatic annotation |
2312.01597
Report |
SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference |
Feng Wang, Jieru Mei, Alan Yuille |
Recent advances in contrastive language-image pretraining (CLIP) have
demonstrated strong capabilities in zero-shot classification by aligning visual
representations with target text embeddings at the image level. However, in
dense prediction tasks, CLIP often struggles to localize visual features within
an image and fails to give accurate pixel-level predictions, which prevents it
from functioning as a generalized visual foundation model. In this work, we aim
to enhance CLIP's potential for semantic segmentation with minimal
modifications to its pretrained models. By rethinking self-attention, we
surprisingly find that CLIP can adapt to dense prediction tasks by simply
introducing a novel Correlative Self-Attention (CSA) mechanism. Specifically,
we replace the traditional self-attention block in the last layer of CLIP's
vision encoder with our CSA module and reuse its pretrained projection matrices of query,
key, and value, leading to a training-free adaptation approach for CLIP's
zero-shot semantic segmentation. Extensive experiments show the advantage of
CSA: we obtain a 38.2% average zero-shot mIoU across eight semantic
segmentation benchmarks highlighted in this paper, significantly outperforming
the existing SoTA's 33.9% and the vanilla CLIP's 14.1%. |
The paper proposes SCLIP, a segmentation-adapted CLIP model for zero-shot semantic segmentation. It leverages a novel Correlative Self-Attention (CSA) mechanism to improve CLIP's dense prediction capabilities without requiring fine-tuning. |
Vanilla CLIP struggles with localizing visual features in images, making it unsuitable for semantic segmentation. SCLIP addresses this issue and enhances CLIP's potential as a general-purpose visual foundation model. |
SCLIP introduces the CSA module, which replaces the original self-attention block in the last layer of CLIP's vision encoder. CSA computes attention scores based on pairwise correlations between local visual tokens, promoting spatial covariance and enabling accurate localization. |
SCLIP achieves state-of-the-art zero-shot semantic segmentation results, significantly outperforming baselines like MaskCLIP and TCL on eight benchmarks.
The CSA module is robust and insensitive to projection matrix parameters, allowing for training-free adaptation of pretrained CLIP models.
SCLIP demonstrates the effectiveness of incorporating semantic correlations between local features for improved visual reasoning in dense prediction tasks. |
The paper primarily focuses on adapting CLIP's vision transformer encoder and does not explore modifications to the language encoder.
Future work could explore alternative architectural choices within the CSA module or investigate its effectiveness in other dense prediction tasks beyond semantic segmentation. |
semantic segmentation, zero-shot learning, vision-language models, clip, self-attention |
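To make the idea of correlation-based attention in the SCLIP record concrete, here is a minimal sketch under one simplifying assumption: attention scores come from the similarity of each projected token with every other projected token (a single projection compared with itself), rather than from a separate query-key product. The helper names, the identity projections, and the scale are hypothetical. The paper's actual CSA formulation may differ.

```python
import math

def matmul(a, b):
    # Plain list-of-lists matrix product: (n x d) times (d x m).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    # Row-wise, numerically stable softmax.
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(x - mx) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def correlative_attention(tokens, w_proj, w_value, scale=1.0):
    # Assumption: scores are pairwise correlations of one projection with itself,
    # so the pre-softmax score matrix is symmetric.
    proj = matmul(tokens, w_proj)
    proj_t = [list(col) for col in zip(*proj)]
    scores = matmul(proj, proj_t)
    attn = softmax_rows([[s / scale for s in row] for row in scores])
    values = matmul(tokens, w_value)
    return attn, matmul(attn, values)

# Three local tokens; tokens 0 and 2 share the same feature direction.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
attn, out = correlative_attention(tokens, identity, identity)
```

With these toy inputs, token 0 attends to itself and to the similar token 2 equally, and less to the dissimilar token 1. This is the kind of locality-preserving behaviour the record attributes to CSA.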
2312.01531
Report |
SANeRF-HQ: Segment Anything for NeRF in High Quality |
Yichen Liu, Benran Hu, Chi-Keung Tang, Yu-Wing Tai |
Recently, the Segment Anything Model (SAM) has showcased remarkable
capabilities of zero-shot segmentation, while NeRF (Neural Radiance Fields) has
gained popularity as a method for various 3D problems beyond novel view
synthesis. Though there exist initial attempts to incorporate these two methods
into 3D segmentation, they face the challenge of accurately and consistently
segmenting objects in complex scenarios. In this paper, we introduce the
Segment Anything for NeRF in High Quality (SANeRF-HQ) to achieve high-quality
3D segmentation of any target object in a given scene. SANeRF-HQ utilizes SAM
for open-world object segmentation guided by user-supplied prompts, while
leveraging NeRF to aggregate information from different viewpoints. To overcome
the aforementioned challenges, we employ the density field and RGB similarity to
enhance the accuracy of segmentation boundaries during the aggregation.
With an emphasis on segmentation accuracy, we evaluate our method on multiple NeRF
datasets where high-quality ground-truths are available or manually annotated.
SANeRF-HQ shows a significant quality improvement over state-of-the-art methods
in NeRF object segmentation, provides higher flexibility for object
localization, and enables more consistent object segmentation across multiple
views. Results and code are available at the project site:
https://lyclyc52.github.io/SANeRF-HQ/. |
The paper introduces SANeRF-HQ, a novel framework that combines Segment Anything Model (SAM) and Neural Radiance Fields (NeRF) to achieve high-quality 3D segmentation in complex scenes. |
Existing methods for 3D segmentation in NeRF struggle with accuracy, consistency across views, and generalization to open-world scenarios. SANeRF-HQ addresses these limitations by leveraging the power of SAM and the multi-view aggregation capabilities of NeRF. |
SANeRF-HQ consists of a feature container (cache or distilled feature field), a mask decoder, and a mask aggregator. It encodes images into SAM features, propagates user prompts to generate 2D masks, and aggregates these masks in 3D using an object field. A Ray-Pair RGB loss further improves boundary accuracy. |
SANeRF-HQ quantitatively outperforms state-of-the-art methods like SA3D and ISRF on multiple NeRF datasets.
The method generates consistent 3D segmentations across different viewpoints.
The use of a Ray-Pair RGB loss leads to more accurate segmentation boundaries, especially in challenging cases. |
The performance of SANeRF-HQ relies on the quality of the pre-trained NeRF model and may be affected by scene complexity.
The Ray-Pair RGB loss might not be universally applicable, particularly when dealing with objects that share similar colors and textures. Future work could focus on enhancing its robustness. |
3d segmentation, neural radiance fields, segment anything model, multi-view consistency, zero-shot segmentation |
2312.01409
Report |
Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models |
Shengqu Cai, Duygu Ceylan, Matheus Gadelha, Chun-Hao Paul Huang, Tuanfeng Yang Wang, Gordon Wetzstein |
Traditional 3D content creation tools empower users to bring their
imagination to life by giving them direct control over a scene's geometry,
appearance, motion, and camera path. Creating computer-generated videos,
however, is a tedious manual process, which can be automated by emerging
text-to-video diffusion models. Despite great promise, video diffusion models
are difficult to control, hindering users' creativity rather
than amplifying it. To address this challenge, we present a novel approach that
combines the controllability of dynamic 3D meshes with the expressivity and
editability of emerging diffusion models. For this purpose, our approach takes
an animated, low-fidelity rendered mesh as input and injects the ground truth
correspondence information obtained from the dynamic mesh into various stages
of a pre-trained text-to-image generation model to output high-quality and
temporally consistent frames. We demonstrate our approach on various examples
where motion can be obtained by animating rigged assets or changing the camera
path. |
This paper presents Generative Rendering, a novel framework that combines the controllability of 3D modeling with the expressiveness of text-to-image (T2I) diffusion models for generating stylized animations. |
Existing text-to-video models lack fine-grained control over scene layout and motion, while traditional 3D workflows are time-consuming and require expertise. This work bridges this gap by enabling controllable video generation using pre-trained T2I models. |
The method leverages 4D spatio-temporal correspondences from animated 3D meshes to guide the image generation process. Key innovations include UV-space noise initialization for temporal consistency and correspondence-aware blending of self-attention features for consistent appearance synthesis. |
Generative Rendering demonstrates superior frame consistency and prompt fidelity compared to adapted baselines.
The method supports camera and object rotations, physical simulations, and character animations, showcasing its versatility.
The proposed UV-space feature injection and noise initialization significantly improve temporal consistency in generated animations. |
The method's reliance on multi-step diffusion inference limits real-time animation capabilities.
Handling large environmental changes and dramatic perspective shifts remains challenging due to limitations in feature correspondence. |
video generation, diffusion models, 3d animation, text-to-image synthesis, generative ai |
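The UV-space noise initialization named in the Generative Rendering record can be sketched as follows, assuming that noise is sampled once in texture (UV) space and then looked up per frame through the mesh's UV correspondences. The function names, the nearest-neighbour lookup, and the grid sizes are illustrative assumptions. Background pixels not covered by the mesh are omitted here.

```python
import random

def make_uv_noise(height, width, seed=0):
    # Sample one Gaussian noise value per texel, shared by all frames.
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]

def frame_noise(uv_noise, uv_coords):
    # uv_coords[y][x] holds the (u, v) in [0, 1] of the surface point seen at
    # pixel (x, y); nearest-neighbour lookup is an assumed simplification.
    h, w = len(uv_noise), len(uv_noise[0])
    out = []
    for row in uv_coords:
        out_row = []
        for (u, v) in row:
            tx = min(int(u * w), w - 1)
            ty = min(int(v * h), h - 1)
            out_row.append(uv_noise[ty][tx])
        out.append(out_row)
    return out

uv_noise = make_uv_noise(8, 8)
# The same two surface points appear at swapped pixel positions in two frames.
frame_a_uv = [[(0.1, 0.1), (0.6, 0.6)]]
frame_b_uv = [[(0.6, 0.6), (0.1, 0.1)]]
noise_a = frame_noise(uv_noise, frame_a_uv)
noise_b = frame_noise(uv_noise, frame_b_uv)
```

Because both frames read from the same texture-space noise, a surface point keeps its initial noise value wherever it moves on screen. This is one plausible way such an initialization could encourage temporal consistency.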
2312.01381
Report |
Language-driven All-in-one Adverse Weather Removal |
Hao Yang, Liyuan Pan, Yan Yang, Wei Liang |
All-in-one (AiO) frameworks restore various adverse weather degradations with
a single set of networks jointly. To handle various weather conditions, an AiO
framework is expected to adaptively learn weather-specific knowledge for
different degradations and shared knowledge for common patterns. However,
existing methods: 1) rely on extra supervision signals, which are usually
unknown in real-world applications; 2) employ fixed network structures, which
restrict the diversity of weather-specific knowledge. In this paper, we propose
a Language-driven Restoration framework (LDR) to alleviate the aforementioned
issues. First, we leverage the power of pre-trained vision-language (PVL)
models to enrich the diversity of weather-specific knowledge by reasoning about
the occurrence, type, and severity of degradation, generating description-based
degradation priors. Then, with the guidance of degradation prior, we sparsely
select restoration experts from a candidate list dynamically based on a
Mixture-of-Experts (MoE) structure. This enables us to adaptively learn the
weather-specific and shared knowledge to handle various weather conditions
(e.g., unknown or mixed weather). Experiments on extensive restoration
scenarios show our superior performance (see Fig. 1). The source code will be
made available. |
This paper proposes LDR, a Language-driven Restoration framework for removing various adverse weather conditions in an all-in-one solution. |
Existing methods struggle to handle diverse weather conditions, often relying on extra supervision signals or fixed network structures. LDR overcomes these limitations by leveraging pre-trained vision-language models. |
LDR uses a pre-trained vision-language model to generate degradation priors, which are then used to dynamically select restoration experts from a candidate list based on a Mixture-of-Experts structure. The selected experts are applied pixel-wise to restore weather-specific features. |
LDR significantly outperforms existing general and all-in-one methods on benchmark datasets.
The method effectively handles images with varying degradation severity, outperforming baselines in heavily degraded cases.
LDR generalizes well to unseen weather conditions, successfully restoring images degraded by haze even though it was trained only on rain, snow, and raindrop degradations. |
The reliance on pre-trained vision-language models introduces a dependence on the quality and reasoning capabilities of those models.
Future work could explore extending LDR to handle a wider range of image degradations beyond adverse weather conditions. |
image restoration, adverse weather removal, vision-language models, mixture-of-experts, degradation prior |
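The prior-guided sparse expert selection described in the LDR record can be illustrated with a small sketch. It assumes a top-k gating rule, a linear gate driven by the degradation prior, and scalar toy experts. All of these, along with the names used, are assumptions for illustration rather than the paper's design.

```python
import math

def top_k_gate(scores, k):
    # Keep the k highest-scoring experts and renormalise their weights with a softmax.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    mx = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - mx) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def restore_pixel(feature, prior, gate_weights, experts, k=2):
    # Hypothetical per-pixel routing: the degradation prior produces one score
    # per expert; only the selected experts are evaluated.
    scores = [sum(w * p for w, p in zip(row, prior)) for row in gate_weights]
    gates = top_k_gate(scores, k)
    restored = sum(g * experts[i](feature) for i, g in gates.items())
    return restored, gates

# Toy experts acting on a scalar feature (stand-ins for restoration networks).
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
prior = [2.0, 1.0]  # stand-in for a description-based degradation prior
restored, gates = restore_pixel(1.0, prior, gate_weights, experts)
```

Only two of the three toy experts run for this pixel, and a different prior would route the same feature to a different subset. The sparsity keeps per-pixel cost bounded as the candidate list grows.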
2312.01305
Report |
ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models |
Jeong-gi Kwak, Erqun Dong, Yuhe Jin, Hanseok Ko, Shweta Mahajan, Kwang Moo Yi |
Generating novel views of an object from a single image is a challenging
task. It requires an understanding of the underlying 3D structure of the object
from an image and rendering high-quality, spatially consistent new views. While
recent methods for view synthesis based on diffusion have shown great progress,
achieving consistency among various view estimates and at the same time abiding
by the desired camera pose remains a critical problem yet to be solved. In this
work, we demonstrate a strikingly simple method, where we utilize a pre-trained
video diffusion model to solve this problem. Our key idea is that synthesizing
a novel view could be reformulated as synthesizing a video of a camera going
around the object of interest -- a scanning video -- which then allows us to
leverage the powerful priors that a video diffusion model would have learned.
Thus, to perform novel-view synthesis, we create a smooth camera trajectory to
the target view that we wish to render, and denoise using both a
view-conditioned diffusion model and a video diffusion model. By doing so, we
obtain a highly consistent novel view synthesis, outperforming the state of the
art. |
This paper proposes a novel method for single-image novel view synthesis that leverages pre-trained video diffusion models to improve the consistency of generated views without requiring any additional training or fine-tuning. |
Existing diffusion-based novel view synthesis methods often struggle with maintaining consistency in object pose and content across different generated views. This method addresses this limitation by using video diffusion models as a prior to ensure smoother transitions and greater fidelity to the input image. |
The method synthesizes a sequence of views along a smooth camera trajectory from the input image to the target view. It then leverages a pre-trained view-conditioned diffusion model and a video diffusion model jointly during the denoising process to generate the final novel view. |
The method significantly improves consistency in object pose and content across generated views compared to existing 2D novel view synthesis techniques.
It outperforms state-of-the-art methods on standard image quality metrics such as PSNR, SSIM, and LPIPS.
A novel optical flow-based metric demonstrates the superior performance of the method in generating spatially consistent novel views. |
The method currently lacks an explicit 3D model and can exhibit inconsistencies when generating views from very different angles.
Future work will explore incorporating explicit 3D pipelines and leveraging the method for high-resolution and editable novel view rendering. |
novel view synthesis, diffusion models, video diffusion models, single image view synthesis, 3d consistency |
2312.01280
Report |
Brain Decodes Deep Nets |
Huzheng Yang, James Gee, Jianbo Shi |
We developed a tool for visualizing and analyzing large pre-trained vision
models by mapping them onto the brain, thus exposing their hidden inner workings. Our
innovation arises from a surprising usage of brain encoding: predicting brain
fMRI measurements in response to images. We report two findings. First,
explicit mapping between the brain and deep-network features across dimensions
of space, layers, scales, and channels is crucial. This mapping method,
FactorTopy, is plug-and-play for any deep-network; with it, one can paint a
picture of the network onto the brain (literally!). Second, our visualization
shows how different training methods matter: they lead to remarkable
differences in hierarchical organization and scaling behavior, growing with
more data or network capacity. It also provides insight into fine-tuning: how
pre-trained models change when adapting to small datasets. We found that brain-like,
hierarchically organized networks suffer less from catastrophic forgetting after
fine-tuning. |
A novel visualization tool, FactorTopy, is introduced, using brain encoding models to map deep network features onto the brain, exposing their internal workings. |
This tool allows for analyzing how different training objectives and model scales affect the hierarchical organization of deep networks, ultimately impacting their performance and generalization capabilities. |
FactorTopy employs a factorized feature selection approach across space, layer, scale, and channel dimensions, constrained by brain topology for robust network-to-brain mapping. This mapping is then visualized by coloring the brain based on the dominant layer selected for each voxel. |
Training objectives matter: CLIP aligns hierarchically with the brain, while supervised models such as ImageNet-trained classifiers and SAM show bottom-up and top-down structures.
Scaling networks: CLIP's brain alignment improves with increasing size and data, while other models exhibit a decrease.
Fine-tuning: CLIP maintains its hierarchical structure and suffers less catastrophic forgetting compared to models like DINOv2 and SAM. |
Reliance on high-quality brain-encoding data, which is currently limited.
Potential for limited applicability to network designs drastically different from the brain's structure. |
brain encoding, deep network visualization, hierarchical organization, fine-tuning, catastrophic forgetting |
2312.01255
Report |
Meta ControlNet: Enhancing Task Adaptation via Meta Learning |
Junjie Yang, Jinze Zhao, Peihao Wang, Zhangyang Wang, Yingbin Liang |
Diffusion-based image synthesis has attracted extensive attention recently.
In particular, ControlNet, which uses image-based prompts, exhibits powerful
capability in image tasks such as Canny edge detection and generates images
well aligned with these prompts. However, vanilla ControlNet generally requires
extensive training of around 5000 steps to achieve a desirable control for a
single task. Recent context-learning approaches have improved its adaptability,
but mainly for edge-based tasks, and rely on paired examples. Thus, two
important open issues are yet to be addressed to reach the full potential of
ControlNet: (i) zero-shot control for certain tasks and (ii) faster adaptation
for non-edge-based tasks. In this paper, we introduce a novel Meta ControlNet
method, which adopts the task-agnostic meta learning technique and features a
new layer freezing design. Meta ControlNet significantly reduces learning steps
to attain control ability from 5000 to 1000. Further, Meta ControlNet exhibits
direct zero-shot adaptability in edge-based tasks without any finetuning, and
achieves control within only 100 finetuning steps in more complex non-edge
tasks such as Human Pose, outperforming all existing methods. The code is
available at https://github.com/JunjieYang97/Meta-ControlNet. |
This paper introduces Meta ControlNet, a novel approach for fast and adaptable image synthesis by learning a generalizable ControlNet initialization using meta-learning and a novel layer freezing design. |
Vanilla ControlNet requires extensive training for task-specific control, and existing adaptations struggle with zero-shot learning and fast adaptation for non-edge-based tasks. This work addresses these limitations. |
The paper proposes Meta ControlNet, which leverages a FO-MAML framework with various image conditions as meta-tasks and freezes specific encoder and middle blocks during training to enable rapid adaptation. |
Meta ControlNet achieves control ability in 1000 steps, significantly faster than vanilla ControlNet's 5000 steps.
The method demonstrates zero-shot adaptation for edge-based tasks like Canny edge detection.
Meta ControlNet adapts quickly to challenging non-edge tasks, controlling human pose in 100 steps and human pose mapping in 200 steps. |
The model exhibits minor errors in distinguishing between humans and animals in tasks like human pose mapping.
Future work can explore improving the adaptation speed disparity between tasks aligned with Stable Diffusion's strengths and those requiring learning new representations. |
image synthesis, controlnet, meta-learning, zero-shot learning, few-shot learning |
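The first-order MAML (FO-MAML) idea referenced in the Meta ControlNet record can be sketched on a toy problem. Each "task" here is a scalar quadratic loss with its own target. The learning rates and the reuse of one loss for both inner and outer steps are simplifying assumptions, and nothing below reflects the paper's actual training setup or its layer freezing design.

```python
def task_grad(theta, target):
    # Gradient of the toy task loss (theta - target)^2.
    return 2.0 * (theta - target)

def fomaml_step(theta, task_targets, inner_lr=0.1, outer_lr=0.1, inner_steps=1):
    # One meta-update over a batch of tasks.
    meta_grad = 0.0
    for target in task_targets:
        adapted = theta
        for _ in range(inner_steps):
            # Inner loop: adapt a copy of the shared initialization to this task.
            adapted -= inner_lr * task_grad(adapted, target)
        # First-order approximation: use the gradient at the adapted parameters
        # directly, ignoring second-order terms through the inner loop.
        meta_grad += task_grad(adapted, target)
    return theta - outer_lr * meta_grad / len(task_targets)

def meta_train(theta, task_targets, steps=200):
    # Repeat meta-updates to learn an initialization shared across tasks.
    for _ in range(steps):
        theta = fomaml_step(theta, task_targets)
    return theta

targets = [-1.0, 1.0, 3.0]  # hypothetical meta-tasks
theta_meta = meta_train(5.0, targets)
```

On this toy problem the learned initialization settles at the point from which every task is reachable in a few gradient steps. This is the intuition behind learning a generalizable initialization for fast adaptation.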
2312.01196
Report |
Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction |
Devikalyan Das, Christopher Wewer, Raza Yunus, Eddy Ilg, Jan Eric Lenssen |
Reconstructing dynamic objects from monocular videos is a severely
underconstrained and challenging problem, and recent work has approached it in
various directions. However, owing to the ill-posed nature of this problem,
there has been no solution that can provide consistent, high-quality novel
views from camera positions that are significantly different from the training
views. In this work, we introduce Neural Parametric Gaussians (NPGs) to take on
this challenge by imposing a two-stage approach: first, we fit a low-rank
neural deformation model, which then is used as regularization for non-rigid
reconstruction in the second stage. The first stage learns the object's
deformations such that it preserves consistency in novel views. The second
stage obtains high reconstruction quality by optimizing 3D Gaussians that are
driven by the coarse model. To this end, we introduce a local 3D Gaussian
representation, where temporally shared Gaussians are anchored in and deformed
by local oriented volumes. The resulting combined model can be rendered as
radiance fields, resulting in high-quality photo-realistic reconstructions of
the non-rigidly deforming objects. We demonstrate that NPGs achieve superior
results compared to previous works, especially in challenging scenarios with
few multi-view cues. |
Presents Neural Parametric Gaussians (NPGs), a two-stage approach for high-quality, non-rigid object reconstruction from monocular videos. |
Monocular non-rigid reconstruction is highly underconstrained, and existing methods struggle to produce temporally consistent results, especially for novel views. |
Stage 1 learns a coarse point model with low-rank deformation for temporal regularization. Stage 2 optimizes 3D Gaussians within local volumes defined by the point model, capturing fine details. |
Achieves state-of-the-art novel view synthesis on the D-NeRF dataset.
Significantly outperforms previous methods on the challenging Unbiased4D dataset with limited multi-view cues.
Provides temporally consistent reconstructions with high-frequency details even for complex object motions. |
Performance depends on the complexity of sequences (e.g., camera motion, speed, and extent of deformations).
May struggle with scenes where the template initialization fails (e.g., collapses to a flat surface). |
non-rigid reconstruction, novel view synthesis, monocular video, 3d gaussians, neural parametric models |
2312.01129
Report |
ControlDreamer: Stylized 3D Generation with Multi-View ControlNet |
Yeongtak Oh, Jooyoung Choi, Yongsung Kim, Minjun Park, Chaehun Shin, Sungroh Yoon |
Recent advancements in text-to-3D generation have significantly contributed
to the automation and democratization of 3D content creation. Building upon
these developments, we aim to address the limitations of current methods in
generating 3D models with creative geometry and styles. We introduce multi-view
ControlNet, a novel depth-aware multi-view diffusion model trained on generated
datasets from a carefully curated text corpus. Our multi-view ControlNet is
then integrated into our two-stage pipeline, ControlDreamer, enabling
text-guided generation of stylized 3D models. Additionally, we present a
comprehensive benchmark for 3D style editing, encompassing a broad range of
subjects, including objects, animals, and characters, to further facilitate
research on diverse 3D generation. Our comparative analysis reveals that this
new pipeline outperforms existing text-to-3D methods as evidenced by human
evaluations and CLIP score metrics. |
Introduces ControlDreamer, a two-stage text-to-3D generation pipeline that uses a novel depth-aware multi-view diffusion model (MV-ControlNet) to create stylized 3D models. |
Addresses limitations of current text-to-3D methods in generating creative geometry and styles by separating geometry and style generation stages and leveraging depth information. |
Combines NeRF generation with DMTet mesh refinement, trained with a novel MV-ControlNet that leverages depth maps for aligning style with generated geometry. Trained on a dataset generated from a curated 100K text corpus. |
Outperforms existing two-stage pipelines in generating styles on 3D models, as evidenced by CLIP score metrics and human assessments.
MV-ControlNet effectively aligns diverse geometries and styles, leading to high-quality 3D model generation.
Depth-aware MV-ControlNet surpasses normal- and edge-aware variants in rendering detailed textures and geometries. |
Errors in pre-trained depth estimators can cause artifacts.
Limited to 256x256 resolution due to MVDream's training resolution. |
text-to-3d, two-stage pipeline, multi-view diffusion model, controlnet, 3d style editing |
2312.01068
Report |
DPHMs: Diffusion Parametric Head Models for Depth-based Tracking |
Jiapeng Tang, Angela Dai, Yinyu Nie, Lev Markhasin, Justus Thies, Matthias Niessner |
We introduce Diffusion Parametric Head Models (DPHMs), a generative model
that enables robust volumetric head reconstruction and tracking from monocular
depth sequences. While recent volumetric head models, such as NPHMs, can now
excel in representing high-fidelity head geometries, tracking and
reconstructing heads from real-world single-view depth sequences remains very
challenging, as the fitting to partial and noisy observations is
underconstrained. To tackle these challenges, we propose a latent
diffusion-based prior to regularize volumetric head reconstruction and
tracking. This prior-based regularizer effectively constrains the identity and
expression codes to lie on the underlying latent manifold which represents
plausible head shapes. To evaluate the effectiveness of the diffusion-based
prior, we collect a dataset of monocular Kinect sequences consisting of various
complex facial expression motions and rapid transitions. We compare our method
to state-of-the-art tracking methods and demonstrate improved head identity
reconstruction as well as robust expression tracking. |
Introduces Diffusion Parametric Head Models (DPHMs), a novel generative model for robust volumetric head reconstruction and tracking from monocular depth sequences, by incorporating diffusion-based priors into neural parametric head models (NPHMs). |
Addresses the challenge of reconstructing and tracking heads from noisy and partial depth data, which often leads to overfitting and unrealistic results with traditional NPHMs. |
Leverages a latent diffusion model to learn the distribution of identity and expression parameters in NPHMs, enabling effective regularization of these parameters during fitting to real-world depth sequences. |
Achieves more accurate head identity reconstruction compared to state-of-the-art methods, especially in capturing fine-grained hair geometries.
Demonstrates robust and coherent facial expression tracking, even for complex and rapid transitions, by constraining latent optimization within plausible head shape manifolds.
Outperforms existing methods in quantitative evaluations on a new challenging benchmark dataset (DPHM-Kinect) and a multi-view video dataset (NeRSemble). |
Current implementation has slower inference due to test-time optimization of neural parametric models.
Future work will focus on real-time head tracking solutions and incorporating RGB images for enhanced hair reconstruction. |
head reconstruction, facial tracking, diffusion models, depth sensors, 3d avatars |
2312.01027
Report |
LDM-ISP: Enhancing Neural ISP for Low Light with Latent Diffusion Models |
Qiang Wen, Yazhou Xing, Zhefan Rao, Qifeng Chen |
Enhancing a low-light noisy RAW image into a well-exposed and clean sRGB
image is a significant challenge for modern digital cameras. Prior approaches
have difficulties in recovering fine-grained details and true colors of the
scene under extremely low-light environments due to near-to-zero SNR.
Meanwhile, diffusion models have shown significant progress towards general
domain image generation. In this paper, we propose to leverage the pre-trained
latent diffusion model to perform the neural ISP for enhancing extremely
low-light images. Specifically, to tailor the pre-trained latent diffusion
model to operate on the RAW domain, we train a set of lightweight taming
modules to inject the RAW information into the diffusion denoising process via
modulating the intermediate features of UNet. We further observe different
roles of UNet denoising and decoder reconstruction in the latent diffusion
model, which inspires us to decompose the low-light image enhancement task into
latent-space low-frequency content generation and decoding-phase high-frequency
detail maintenance. Through extensive experiments on representative datasets,
we demonstrate our simple design not only achieves state-of-the-art performance
in quantitative evaluations but also shows significant superiority in visual
comparisons over strong baselines, which highlight the effectiveness of
powerful generative priors for neural ISP under extremely low-light
environments. The project page is available at
https://csqiangwen.github.io/projects/ldm-isp/ |
This paper presents LDM-ISP, a novel method leveraging a pre-trained latent diffusion model and taming modules to enhance neural ISP for low-light image enhancement. |
Existing low-light enhancement methods struggle to recover fine-grained details and true colors, especially in extremely low-light conditions due to limited training data. This work explores the potential of powerful generative priors from pre-trained diffusion models to address these limitations. |
LDM-ISP inserts trainable taming modules into a frozen pre-trained latent diffusion model (Stable Diffusion). It employs 2D discrete wavelet transforms to decompose the input RAW image into low- and high-frequency sub-bands. The low-frequency sub-band guides the UNet for content generation, while the high-frequency sub-bands are used to maintain details during the decoding phase. |
LDM-ISP achieves state-of-the-art performance on three benchmark datasets (SID-Sony, ELD-Sony, LRD) quantitatively and qualitatively.
The method effectively recovers structural information and enhances details, even in extremely dark and noisy regions.
Taming the decoder with high-frequency information is crucial for accurate color correction and detail preservation. |
The inference speed of LDM-ISP is limited by the DDIM sampling process in the diffusion model.
Exploring the combination of text prompts with the proposed method for flexible low-light image editing is a promising research direction. |
low-light image enhancement, neural image signal processing, latent diffusion model, generative priors, taming modules |
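The frequency split described in the LDM-ISP method summary can be illustrated with a one-level 2D wavelet transform. This is a minimal sketch, not the authors' code: the choice of the Haar wavelet, the function name `haar_dwt2`, and the sub-band key names are assumptions made for illustration, and the input is assumed to have even height and width.

```python
def haar_dwt2(img):
    """One-level orthonormal 2D Haar transform of a list-of-lists image.

    Returns a dict with one low-frequency sub-band ("low") and three
    high-frequency sub-bands. Assumes even height and width.
    """
    bands = {"low": [], "detail_cols": [], "detail_rows": [], "detail_diag": []}
    for i in range(0, len(img), 2):
        rows = {name: [] for name in bands}
        for j in range(0, len(img[0]), 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            rows["low"].append((a + b + c + d) / 2)          # local average
            rows["detail_cols"].append((a - b + c - d) / 2)  # change across columns
            rows["detail_rows"].append((a + b - c - d) / 2)  # change across rows
            rows["detail_diag"].append((a - b - c + d) / 2)  # diagonal change
        for name in bands:
            bands[name].append(rows[name])
    return bands


# A flat image carries no high-frequency content.
flat_bands = haar_dwt2([[3.0] * 4 for _ in range(4)])

# Alternating columns put all the variation into the column-detail band.
stripe_bands = haar_dwt2([[0.0, 8.0, 0.0, 8.0] for _ in range(4)])
```

In this sketch the `low` band would play the role of the content signal and the three detail bands the role of the high-frequency signal that the summary says is used during decoding.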
2312.01026
Report |
Token Fusion: Bridging the Gap between Token Pruning and Token Merging |
Minchul Kim, Shangqian Gao, Yen-Chang Hsu, Yilin Shen, Hongxia Jin |
Vision Transformers (ViTs) have emerged as powerful backbones in computer
vision, outperforming many traditional CNNs. However, their computational
overhead, largely attributed to the self-attention mechanism, makes deployment
on resource-constrained edge devices challenging. Multiple solutions rely on
token pruning or token merging. In this paper, we introduce "Token Fusion"
(ToFu), a method that amalgamates the benefits of both token pruning and token
merging. Token pruning proves advantageous when the model exhibits sensitivity
to input interpolations, while token merging is effective when the model
manifests close to linear responses to inputs. We combine these insights to propose a new
scheme called Token Fusion. Moreover, we tackle the limitations of average
merging, which doesn't preserve the intrinsic feature norm, resulting in
distributional shifts. To mitigate this, we introduce MLERP merging, a variant
of the SLERP technique, tailored to merge multiple tokens while maintaining the
norm distribution. ToFu is versatile, applicable to ViTs with or without
additional training. Our empirical evaluations indicate that ToFu establishes
new benchmarks in both classification and image generation tasks concerning
computational efficiency and model accuracy. |
Introduced "Token Fusion" (ToFu), a method that combines the advantages of token pruning and token merging to accelerate Vision Transformers (ViTs) while preserving accuracy. |
ViTs are powerful but computationally expensive, hindering deployment on resource-constrained devices. ToFu addresses this by reducing computational overhead without significant accuracy loss. |
ToFu dynamically switches between pruning and merging based on the model's sensitivity to input interpolations at different depths. It also introduces MLERP merging, a variant of SLERP, to preserve feature norm distribution during merging. |
ToFu achieves state-of-the-art speed and accuracy trade-offs on ImageNet classification compared to existing token reduction methods.
MLERP merging outperforms average merging in both accuracy and speed.
ToFu demonstrates effectiveness in image generation tasks, improving efficiency while maintaining image quality in Stable Diffusion. |
The selection of the merging strategy switching point (d) currently relies on a hyperparameter search.
The theoretical properties of MLERP merging and its impact on model optimization need further investigation. |
vision transformer, token pruning, token merging, model compression, efficient inference |
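The norm shrinkage that the ToFu abstract attributes to average merging is easy to reproduce. The sketch below compares a plain average with a simple norm-preserving merge. It is not the paper's MLERP formula: rescaling the averaged direction to the mean input norm is an assumption chosen for illustration, and the function names are hypothetical.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def average_merge(tokens):
    """Element-wise mean of the tokens; the result can have a smaller norm."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def norm_preserving_merge(tokens):
    """Keep the averaged direction, but rescale it to the mean input norm.

    Illustrative stand-in for a norm-preserving merge, not the exact MLERP rule.
    """
    mean = average_merge(tokens)
    mean_norm = norm(mean)
    if mean_norm == 0.0:
        return mean  # direction undefined; nothing to rescale
    target = sum(norm(t) for t in tokens) / len(tokens)
    return [x * target / mean_norm for x in mean]


tokens = [[1.0, 0.0], [0.0, 1.0]]       # two unit-norm tokens
averaged = average_merge(tokens)        # norm shrinks below 1
rescaled = norm_preserving_merge(tokens)  # norm restored to 1
```

For two orthogonal unit tokens, the average has norm of about 0.707, which is the kind of distributional shift the abstract describes; the rescaled merge keeps the norm at 1.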
2312.00971
Report |
Consistent Mesh Diffusion |
Julian Knodt, Xifeng Gao |
Given a 3D mesh with a UV parameterization, we introduce a novel approach to
generating textures from text prompts. While prior work uses optimization from
Text-to-Image Diffusion models to generate textures and geometry, this is slow
and requires significant compute resources. Alternatively, there are
projection-based approaches that use the same Text-to-Image models to paint
images onto a mesh, but these lack consistency at different viewing angles. We
propose a method that uses a single Depth-to-Image diffusion network and
generates a single consistent texture when rendered on the 3D surface by first
unifying the diffusion paths of multiple 2D images, and hoisting that to 3D with
MultiDiffusion~\cite{multidiffusion}. We demonstrate our approach on a dataset
containing 30 meshes, taking approximately 5 minutes per mesh. To evaluate the
quality of our approach, we use CLIP-score~\cite{clipscore} and Frechet
Inception Distance (FID)~\cite{frechet} on the renderings, and show our
improvement over prior work. |
This paper presents a novel method for generating consistent textures on 3D meshes from text prompts using a single Depth-to-Image diffusion network. |
Generating high-quality 3D models with textures is crucial for various applications like games and shopping apps. Existing methods are either computationally expensive or produce inconsistent textures across different views. |
The proposed method leverages the concept of MultiDiffusion and extends it to 3D mesh texturing. It utilizes a spherical harmonic latent texture map to render the mesh in latent space, enabling joint denoising of multiple views from a single diffusion pass. To further enhance consistency, it incorporates GAN inversion in latent space and weights pixel importance based on surface normals. |
The method generates consistent textures with fewer seams and artifacts compared to previous approaches like TEXTure and Text2Tex.
Quantitative evaluation using CLIP-Score and FID shows comparable or superior performance to baseline methods.
The approach is computationally efficient, taking approximately 5 minutes per mesh on a single NVIDIA GeForce RTX 3090. |
The method may still suffer from the multi-Janus problem where multiple faces are generated from different views.
The reliance on text prompts for texture generation can be imprecise and ambiguous, leading to inconsistent results. Exploring image-based guidance could address this limitation. |
mesh texturing, text-to-3d, diffusion models, multi-view consistency, gan inversion |
2312.00944
Report |
Enhancing Diffusion Models with 3D Perspective Geometry Constraints |
Rishi Upadhyay, Howard Zhang, Yunhao Ba, Ethan Yang, Blake Gella, Sicheng Jiang, Alex Wong, Achuta Kadambi |
While perspective is a well-studied topic in art, it is generally taken for
granted in images. However, for the recent wave of high-quality image synthesis
methods such as latent diffusion models, perspective accuracy is not an
explicit requirement. Since these methods are capable of outputting a wide
gamut of possible images, it is difficult for these synthesized images to
adhere to the principles of linear perspective. We introduce a novel geometric
constraint in the training process of generative models to enforce perspective
accuracy. We show that outputs of models trained with this constraint both
appear more realistic and improve performance of downstream models trained on
generated images. Subjective human trials show that images generated with
latent diffusion models trained with our constraint are preferred over images
from the Stable Diffusion V2 model 70% of the time. SOTA monocular depth
estimation models such as DPT and PixelFormer, fine-tuned on our images,
outperform the original models trained on real images by up to 7.03% in RMSE
and 19.3% in SqRel on the KITTI test set for zero-shot transfer. |
The paper introduces a novel geometric constraint during the training process of latent diffusion models to improve the perspective accuracy of generated images. |
Current diffusion models often generate images that violate the principles of linear perspective, limiting their realism and usefulness for downstream tasks like depth estimation. |
The authors add a new loss term to the diffusion model training process. This term encourages the gradient field of the generated image to align with its expected vanishing points, calculated from ground truth data. |
Images generated with the proposed constraint appear more realistic and better preserve straight lines compared to the baseline Stable Diffusion V2 model.
Human subjective tests show a strong preference (around 70%) for images generated by the enhanced model over the baseline model.
Fine-tuning SOTA monocular depth estimation models on images generated by the enhanced model improves their performance on real-world datasets (KITTI, DIODE) compared to models trained on baseline images or even real images. |
The method requires a dataset with ground truth vanishing points for training, limiting its applicability to scenes with strong vanishing lines.
Generating large synthetic datasets using diffusion models remains computationally expensive. |
diffusion models, perspective constraints, depth estimation, image generation, synthetic data |
2312.00878
Report |
Grounding Everything: Emerging Localization Properties in Vision-Language Transformers |
Walid Bousselham, Felix Petersen, Vittorio Ferrari, Hilde Kuehne |
Vision-language foundation models have shown remarkable performance in
various zero-shot settings such as image retrieval, classification, or
captioning. But so far, those models seem to fall behind when it comes to
zero-shot localization of referential expressions and objects in images. As a
result, they need to be fine-tuned for this task. In this paper, we show that
pretrained vision-language (VL) models allow for zero-shot open-vocabulary
object localization without any fine-tuning. To leverage those capabilities, we
propose a Grounding Everything Module (GEM) that generalizes the idea of
value-value attention introduced by CLIPSurgery to a self-self attention path.
We show that the concept of self-self attention corresponds to clustering, thus
enforcing groups of tokens arising from the same object to be similar while
preserving the alignment with the language space. To further guide the group
formation, we propose a set of regularizations that allows the model to finally
generalize across datasets and backbones. We evaluate the proposed GEM
framework on various benchmark tasks and datasets for semantic segmentation. It
shows that GEM not only outperforms other training-free open-vocabulary
localization methods, but also achieves state-of-the-art results on the
recently proposed OpenImagesV7 large-scale segmentation benchmark. |
This paper proposes GEM (Grounding Everything Module), a training-free method for open-vocabulary object localization using pre-trained vision-language models. |
Existing vision-language models often struggle with zero-shot localization tasks, requiring fine-tuning. This method leverages the inherent localization capabilities of these models without additional training. |
GEM employs a self-self attention mechanism inspired by CLIPSurgery, combining it with L2 normalization, adaptive temperature, iterative refinement, and qkv-ensemble to improve visual feature grouping and alignment with text embeddings. |
GEM outperforms other training-free methods and achieves competitive results against fine-tuned models on zero-shot semantic segmentation tasks.
It achieves state-of-the-art results on the OpenImagesV7 dataset for zero-shot point prediction, demonstrating its effectiveness in large-scale open vocabulary settings.
Analysis reveals that GEM enhances both visual distinctiveness (grouping of similar features) and vision-language alignment. |
The number of iterations in the self-self attention mechanism can impact performance depending on the number of classes in the dataset.
Failure cases highlight potential limitations in the text encoder, suggesting future research directions. |
vision-language models, zero-shot learning, object localization, semantic segmentation, self-attention |
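The GEM abstract's claim that self-self attention behaves like clustering can be seen in a toy example. The sketch below is a simplification under stated assumptions: it uses the L2-normalized tokens themselves as queries, keys, and values (no learned projections), a fixed temperature instead of GEM's adaptive one, and no qkv-ensemble. The function names are hypothetical and all tokens are assumed to be non-zero.

```python
import math


def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cosine(u, v):
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u, v))


def self_self_attention(tokens, temperature=0.1, iterations=1):
    """Each token attends to all tokens using the same vectors as query and key."""
    out = [list(t) for t in tokens]
    for _ in range(iterations):
        unit = [l2_normalize(t) for t in out]
        new_out = []
        for q in unit:
            logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in unit]
            top = max(logits)  # subtract the max for a numerically stable softmax
            weights = [math.exp(l - top) for l in logits]
            total = sum(weights)
            new_out.append([
                sum(w * k[d] for w, k in zip(weights, unit)) / total
                for d in range(len(q))
            ])
        out = new_out
    return out


# Two loose groups: tokens 0 and 1 point roughly along x, tokens 2 and 3 along y.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
grouped = self_self_attention(tokens)
```

After one pass, tokens within a group are pulled toward each other while the two groups stay well separated, which is the grouping effect the abstract refers to.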
2312.00869
Report |
Segment and Caption Anything |
Xiaoke Huang, Jianfeng Wang, Yansong Tang, Zheng Zhang, Han Hu, Jiwen Lu, Lijuan Wang, Zicheng Liu |
We propose a method to efficiently equip the Segment Anything Model (SAM)
with the ability to generate regional captions. SAM presents strong
generalizability to segment anything but falls short on semantic understanding.
By introducing a lightweight query-based feature mixer, we align the
region-specific features with the embedding space of language models for later
caption generation. As the number of trainable parameters is small (typically
in the order of tens of millions), it costs less computation, less memory
usage, and less communication bandwidth, resulting in both fast and scalable
training. To address the scarcity problem of regional caption data, we propose
to first pre-train our model on object detection and segmentation tasks. We
call this step weak supervision pretraining since the pre-training data only
contains category names instead of full-sentence descriptions. The weak
supervision pretraining allows us to leverage many publicly available object
detection and segmentation datasets. We conduct extensive experiments to
demonstrate the superiority of our method and validate each design choice. This
work serves as a stepping stone towards scaling up regional captioning data and
sheds light on exploring efficient ways to augment SAM with regional semantics.
The project page, along with the associated code, can be accessed via
https://xk-huang.github.io/segment-caption-anything/. |
This paper proposes an efficient method to augment the Segment Anything Model (SAM) with regional captioning capabilities by introducing a lightweight query-based feature mixer. |
SAM exhibits strong generalizability for segmenting anything but lacks semantic understanding. This work aims to bridge this gap by enabling SAM to generate regional captions. |
The method employs a lightweight hybrid feature mixer that aligns region-specific features with the embedding space of a frozen language model. Weak supervision pre-training is used to leverage existing object detection and segmentation datasets. |
The method achieves state-of-the-art performance on the Visual Genome benchmark.
Weak supervision pre-training using large-scale datasets significantly improves performance.
A larger pre-trained language model generally leads to better captioning results. |
The model might face challenges in predicting correct attributes (e.g., color) and distinguishing visually similar concepts.
Future work includes exploring larger-scale weak supervision datasets and self-training for improved generalizability. |
regional captioning, segment anything model, weak supervision, vision-language models, interactive segmentation |
2312.00863
Report |
EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything |
Yunyang Xiong, Bala Varadarajan, Lemeng Wu, Xiaoyu Xiang, Fanyi Xiao, Chenchen Zhu, Xiaoliang Dai, Dilin Wang, Fei Sun, Forrest Iandola, Raghuraman Krishnamoorthi, Vikas Chandra |
Segment Anything Model (SAM) has emerged as a powerful tool for numerous
vision applications. A key component that drives the impressive performance for
zero-shot transfer and high versatility is a super large Transformer model
trained on the extensive high-quality SA-1B dataset. While beneficial, the huge
computation cost of the SAM model has limited its use in wider real-world
applications. To address this limitation, we propose EfficientSAMs,
light-weight SAM models that exhibit decent performance with largely reduced
complexity. Our idea is based on leveraging masked image pretraining, SAMI,
which learns to reconstruct features from the SAM image encoder for effective
visual representation learning. Further, we take SAMI-pretrained light-weight
image encoders and a mask decoder to build EfficientSAMs, and finetune the models
on SA-1B for the segment anything task. We perform evaluations on multiple vision
tasks including image classification, object detection, instance segmentation,
and semantic object detection, and find that our proposed pretraining method,
SAMI, consistently outperforms other masked image pretraining methods. On
segment anything tasks such as zero-shot instance segmentation, our
EfficientSAMs with SAMI-pretrained lightweight image encoders perform favorably
with a significant gain (e.g., ~4 AP on COCO/LVIS) over other fast SAM models. |
This paper proposes EfficientSAMs, lightweight Segment Anything Model (SAM) variants that achieve comparable performance to SAM with significantly reduced complexity, enhancing real-world applicability. |
The high computational cost of SAM, particularly the image encoder, limits its practical deployment in real-time applications. EfficientSAMs address this limitation. |
The authors introduce SAM-leveraged masked image pretraining (SAMI), training lightweight ViT image encoders to reconstruct features from the SAM encoder. EfficientSAMs integrate these pretrained encoders with the SAM decoder and fine-tune them on the SA-1B dataset. |
SAMI consistently outperforms other masked image pretraining methods in transfer learning settings on image classification, object detection, instance segmentation, and semantic segmentation tasks.
EfficientSAMs achieve state-of-the-art quality-efficiency trade-offs, demonstrating superior performance (e.g., ~4 AP improvement on COCO/LVIS) compared to other fast SAM models like MobileSAM and FastSAM.
EfficientSAMs significantly reduce inference time (~20x) and parameter size (~20x) compared to SAM while maintaining competitive performance. |
The paper primarily focuses on efficiency improvements for the image encoder, leaving room for future exploration in optimizing the decoder for further computational gains.
Although EfficientSAM shows promising results in salient instance segmentation, further research is needed to refine and thoroughly evaluate its performance. |
segment anything model, efficient deep learning, masked image pretraining, vision transformers, instance segmentation |
2312.00860
Report |
Segment Any 3D Gaussians |
Jiazhong Cen, Jiemin Fang, Chen Yang, Lingxi Xie, Xiaopeng Zhang, Wei Shen, Qi Tian |
Interactive 3D segmentation in radiance fields is an appealing task given its
importance in 3D scene understanding and manipulation. However, existing
methods face challenges in either achieving fine-grained, multi-granularity
segmentation or contending with substantial computational overhead, inhibiting
real-time interaction. In this paper, we introduce Segment Any 3D GAussians
(SAGA), a novel 3D interactive segmentation approach that seamlessly blends a
2D segmentation foundation model with 3D Gaussian Splatting (3DGS), a recent
breakthrough of radiance fields. SAGA efficiently embeds multi-granularity 2D
segmentation results generated by the segmentation foundation model into 3D
Gaussian point features through well-designed contrastive training. Evaluation
on existing benchmarks demonstrates that SAGA can achieve competitive
performance with state-of-the-art methods. Moreover, SAGA achieves
multi-granularity segmentation and accommodates various prompts, including
points, scribbles, and 2D masks. Notably, SAGA can finish the 3D segmentation
within milliseconds, achieving nearly 1000x acceleration compared to previous
SOTA. The project page is at https://jumpat.github.io/SAGA. |
SAGA is a novel interactive 3D segmentation method that achieves millisecond-level segmentation by distilling knowledge from the Segment Anything Model (SAM) into 3D Gaussians. |
Existing methods for interactive 3D segmentation in radiance fields are either computationally expensive or lack fine-grained segmentation capabilities. |
SAGA trains low-dimensional features for 3D Gaussians using a combination of SAM-guidance loss and correspondence loss to enable efficient and accurate segmentation from various prompts like points, scribbles, and masks. |
SAGA achieves competitive segmentation performance with previous state-of-the-art methods while being significantly faster.
SAGA supports various prompt types, including points, scribbles, masks, bounding boxes, and text.
SAGA is particularly well-suited for scenes with multiple objects requiring segmentation. |
SAGA's performance depends on the quality of the 3D Gaussian reconstruction.
The semantic-agnostic nature of the post-processing step can lead to false positives. |
3d segmentation, radiance fields, 3d gaussian splatting, interactive segmentation, segment anything model |
2312.00853
Report |
Motion-Guided Latent Diffusion for Temporally Consistent Real-world Video Super-resolution |
Xi Yang, Chenhang He, Jianqi Ma, Lei Zhang |
Real-world low-resolution (LR) videos have diverse and complex degradations,
imposing great challenges on video super-resolution (VSR) algorithms to
reproduce their high-resolution (HR) counterparts with high quality. Recently,
diffusion models have shown compelling performance in generating realistic
details for image restoration tasks. However, the diffusion process has
randomness, making it hard to control the contents of restored images. This
issue becomes more serious when applying diffusion models to VSR tasks because
temporal consistency is crucial to the perceptual quality of videos. In this
paper, we propose an effective real-world VSR algorithm by leveraging the
strength of pre-trained latent diffusion models. To ensure the content
consistency among adjacent frames, we exploit the temporal dynamics in LR
videos to guide the diffusion process by optimizing the latent sampling path
with a motion-guided loss, ensuring that the generated HR video maintains a
coherent and continuous visual flow. To further mitigate the discontinuity of
generated details, we insert a temporal module into the decoder and fine-tune it
with an innovative sequence-oriented loss. The proposed motion-guided latent
diffusion (MGLD) based VSR algorithm achieves significantly better perceptual
quality than state-of-the-art methods on real-world VSR benchmark datasets, validating
the effectiveness of the proposed model design and training strategies. |
This paper proposes Motion-Guided Latent Diffusion (MGLD), a novel video super-resolution (VSR) algorithm that leverages the generative power of pre-trained latent diffusion models to enhance the quality of real-world low-resolution videos. |
Real-world low-resolution videos present diverse and complex degradations, posing significant challenges for VSR algorithms. Existing methods often struggle to balance detail reproduction with artifact suppression. This work explores the use of latent diffusion models, which have shown impressive results in image restoration, to address these challenges in VSR. |
The proposed MGLD incorporates temporal dynamics into the VSR process through two key innovations: 1) a motion-guided diffusion sampling process that uses optical flow information from LR videos to ensure temporal consistency in the generated HR frames, and 2) a temporal-aware sequence decoder fine-tuned with a novel sequence-oriented loss to enhance the continuity and smoothness of generated details. |
MGLD outperforms state-of-the-art real-world VSR methods on benchmark datasets, exhibiting superior perceptual quality in terms of detail realism, texture richness, and artifact reduction.
Quantitative evaluation using full-reference metrics (LPIPS, DISTS) on synthetic datasets and no-reference metrics (NIQE, BRISQUE, MUSIQ) on real-world datasets demonstrates the superior performance of MGLD.
Ablation studies confirm the effectiveness of the proposed motion-guided sampling and temporal-aware decoding strategies, highlighting their synergistic contributions to the overall VSR performance. |
The computational complexity of MGLD is higher compared to non-diffusion based VSR methods, primarily due to the iterative nature of the diffusion process. Future work will investigate model distillation and efficient sampling techniques to enhance the inference speed.
While the average warping error (WE) is commonly used to evaluate temporal consistency in VSR, it might not fully capture human perception. Future research will explore more sophisticated metrics to assess the temporal smoothness of generated videos. |
video super-resolution, real-world vsr, latent diffusion models, motion-guided sampling, temporal consistency |
2312.00852
Report |
Beyond First-Order Tweedie: Solving Inverse Problems using Latent Diffusion |
Litu Rout, Yujia Chen, Abhishek Kumar, Constantine Caramanis, Sanjay Shakkottai, Wen-Sheng Chu |
Sampling from the posterior distribution poses a major computational
challenge in solving inverse problems using latent diffusion models. Common
methods rely on Tweedie's first-order moments, which are known to induce a
quality-limiting bias. Existing second-order approximations are impractical due
to prohibitive computational costs, making standard reverse diffusion processes
intractable for posterior sampling. This paper introduces Second-order Tweedie
sampler from Surrogate Loss (STSL), a novel sampler that offers efficiency
comparable to first-order Tweedie with a tractable reverse process using
second-order approximation. Our theoretical results reveal that the
second-order approximation is lower bounded by our surrogate loss that only
requires $O(1)$ compute using the trace of the Hessian, and by the lower bound
we derive a new drift term to make the reverse process tractable. Our method
surpasses SoTA solvers PSLD and P2L, achieving 4X and 8X reduction in neural
function evaluations, respectively, while notably enhancing sampling quality on
FFHQ, ImageNet, and COCO benchmarks. In addition, we show STSL extends to
text-guided image editing and addresses residual distortions present from
corrupted images in leading text-guided image editing methods. To the best of
our knowledge, this is the first work to offer an efficient second-order
approximation in solving inverse problems using latent diffusion and editing
real-world images with corruptions. |
This paper introduces STSL, an efficient second-order Tweedie sampler for posterior sampling in latent diffusion models, improving image inversion and text-guided editing, especially for corrupted images. |
Existing first-order Tweedie samplers in diffusion-based inverse problem solvers suffer from bias, while second-order approximations are computationally expensive. This hinders high-fidelity reconstruction and editing, especially for real-world corrupted images. |
STSL leverages a novel surrogate loss function based on a tractable second-order Tweedie approximation. It uses Hutchinson's estimator to efficiently compute the trace of the Hessian, requiring only the readily available first-order score from diffusion models. This enables an efficient alternative reverse diffusion process for superior posterior sampling. |
STSL achieves 4x and 8x reduction in neural function evaluations compared to state-of-the-art solvers PSLD and P2L, respectively, while enhancing sampling quality.
It outperforms existing methods in image inversion tasks, including denoising, inpainting, super-resolution, and deblurring, on FFHQ, ImageNet, and COCO benchmarks.
STSL effectively extends to text-guided image editing, outperforming NTI in handling real-world corrupted images by enabling faithful edits and content preservation. |
The current implementation could be further optimized by utilizing a more sophisticated measurement operator as in P2L.
Incorporating prompt-tuning into the pipeline could potentially improve text-guided editing. |
latent diffusion models, image inversion, image editing, second-order tweedie sampler, posterior sampling |
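The STSL method summary mentions Hutchinson's estimator for the trace of the Hessian. The sketch below shows the generic estimator only, not the STSL sampler: it estimates tr(A) as the average of v^T A v over random sign vectors v, using nothing but matrix-vector products. The small explicit matrices, `make_matvec`, and the sample counts are assumptions for illustration; in a diffusion setting the product would come from the model rather than from a stored matrix.

```python
import random


def hutchinson_trace(matvec, dim, num_samples=1000, seed=0):
    """Estimate the trace of a linear operator from matrix-vector products.

    Uses Rademacher probes (entries of +1 or -1), for which E[v^T A v] = tr(A).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        v = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        av = matvec(v)
        total += sum(vi * avi for vi, avi in zip(v, av))
    return total / num_samples


def make_matvec(matrix):
    """Wrap an explicit matrix as a matvec callable (illustration only)."""
    return lambda v: [sum(a * x for a, x in zip(row, v)) for row in matrix]


diag = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 5.0]]
sym = [[2.0, 0.5, 0.5], [0.5, 3.0, 0.5], [0.5, 0.5, 5.0]]

# Exact for diagonal matrices, because every probe entry squares to 1.
diag_estimate = hutchinson_trace(make_matvec(diag), dim=3, num_samples=10)

# Off-diagonal entries add zero-mean noise that averages out with more probes.
sym_estimate = hutchinson_trace(make_matvec(sym), dim=3, num_samples=4000)
```

The estimator never needs the matrix itself, only the ability to apply it to a vector, which is what makes it attractive when the Hessian is too large to form.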
2312.00845
Report |
VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models |
Hyeonho Jeong, Geon Yeong Park, Jong Chul Ye |
Text-to-video diffusion models have advanced video generation significantly.
However, customizing these models to generate videos with tailored motions
presents a substantial challenge. Specifically, they encounter hurdles in (a)
accurately reproducing motion from a target video, and (b) creating diverse
visual variations. For example, straightforward extensions of static image
customization methods to video often lead to intricate entanglements of
appearance and motion data. To tackle this, here we present the Video Motion
Customization (VMC) framework, a novel one-shot tuning approach crafted to
adapt temporal attention layers within video diffusion models. Our approach
introduces a novel motion distillation objective using residual vectors between
consecutive frames as a motion reference. The diffusion process then preserves
low-frequency motion trajectories while mitigating high-frequency
motion-unrelated noise in image space. We validate our method against
state-of-the-art video generative models across diverse real-world motions and
contexts. Our codes, data and the project demo can be found at
https://video-motion-customization.github.io |
This document outlines a style guide for submitting papers to the IEEE Computer Society Press, emphasizing blind review practices, formatting guidelines, and referencing conventions. |
Standardized formatting ensures clear communication of research, fair peer review, and professional publication quality for IEEE Computer Society Press. |
The paper provides detailed instructions and examples for various aspects of manuscript preparation, including language, length, margins, type style, headings, figures, tables, references, and color use. |
The guide clarifies blind review procedures, emphasizing the importance of removing identifying information while maintaining scientific rigor.
Specific instructions are given for formatting equations, cross-references, and citations, ensuring consistency and ease of navigation for readers.
Authors are strongly encouraged to prioritize clear and concise writing, avoiding unnecessary jargon and ensuring the paper's accessibility to a broad audience. |
The guide primarily focuses on technical aspects of formatting, potentially leaving room for addressing ethical considerations in research and publication.
While the guide emphasizes blind review, it could benefit from further discussion on handling conflicts of interest and promoting diversity in citations. |
style guide, ieee, manuscript preparation, blind review, academic publishing |
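The VMC abstract describes a motion distillation objective that uses residual vectors between consecutive frames as a motion reference. A minimal sketch of that idea follows. The cosine-based alignment loss and the function names are assumptions chosen for illustration, and the frames are random stand-ins for flattened frame features.

```python
import numpy as np

def frame_residuals(frames):
    """frames: (F, D) array of F flattened frames. Returns (F-1, D) residuals."""
    frames = np.asarray(frames, dtype=np.float64)
    return frames[1:] - frames[:-1]

def motion_alignment_loss(pred_frames, ref_frames, eps=1e-8):
    """Mean (1 - cosine similarity) between predicted and reference residuals."""
    pred = frame_residuals(pred_frames)
    ref = frame_residuals(ref_frames)
    dots = (pred * ref).sum(axis=1)
    norms = np.linalg.norm(pred, axis=1) * np.linalg.norm(ref, axis=1)
    return float((1.0 - dots / (norms + eps)).mean())

rng = np.random.default_rng(0)
ref = rng.normal(size=(4, 6))
# A constant offset changes every frame equally but leaves residuals unchanged.
same_motion = motion_alignment_loss(ref + 3.0, ref)
opposite_motion = motion_alignment_loss(-ref, ref)
```

Because residuals cancel any offset shared by all frames, a loss of this kind depends on frame-to-frame change rather than on static appearance.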
2312.00833
Report |
Lasagna: Layered Score Distillation for Disentangled Object Relighting |
Dina Bashkirova, Arijit Ray, Rupayan Mallick, Sarah Adel Bargal, Jianming Zhang, Ranjay Krishna, Kate Saenko |
Professional artists, photographers, and other visual content creators use
object relighting to establish their photo's desired effect. Unfortunately,
manual tools that allow relighting have a steep learning curve and are
difficult to master. Although generative editing methods now enable some forms
of image editing, relighting is still beyond today's capabilities; existing
methods struggle to keep other aspects of the image -- colors, shapes, and
textures -- consistent after the edit. We propose Lasagna, a method that
enables intuitive text-guided relighting control. Lasagna learns a lighting
prior by using score distillation sampling to distill the prior of a diffusion
model, which has been finetuned on synthetic relighting data. To train Lasagna,
we curate a new synthetic dataset ReLiT, which contains 3D object assets re-lit
from multiple light source locations. Despite training on synthetic images,
quantitative results show that Lasagna relights real-world images while
preserving other aspects of the input image, outperforming state-of-the-art
text-guided image editing methods. Lasagna enables realistic and controlled
results on natural images and digital art pieces and is preferred by humans
over other methods in over 91% of cases. Finally, we demonstrate the
versatility of our learning objective by extending it to allow colorization,
another form of image editing. |
This paper introduces Lasagna, a novel method for text-guided object relighting in images that leverages a diffusion model prior. |
Relighting objects in images is a crucial aspect of visual content creation, but existing tools are often difficult to use or lack generalizability. Lasagna aims to provide an intuitive and realistic solution for text-guided relighting. |
Lasagna employs a layered score distillation sampling approach to learn a lighting prior from a diffusion model fine-tuned on a synthetic relighting dataset called ReLiT. It predicts separate editing layers for shading and lighting, which are then composed with the input image to achieve the desired relighting effect while preserving other image aspects. |
Lasagna outperforms state-of-the-art text-guided image editing methods in terms of realistic and controlled relighting, as evidenced by human evaluation.
The method generalizes well to various image domains, including natural photos and digital art, despite being trained on synthetic data.
Lasagna's layered editing framework can be extended to other image editing tasks, as demonstrated with a proof-of-concept for sketch colorization. |
Lasagna may struggle with highly abstract input images.
The method can sometimes introduce over-exposure artifacts in the background, which could be addressed in future work with techniques like foreground masking. |
image editing, relighting, diffusion models, score distillation sampling, text-guided synthesis |
2312.00785
Report |
Sequential Modeling Enables Scalable Learning for Large Vision Models |
Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, Alexei A Efros |
We introduce a novel sequential modeling approach which enables learning a
Large Vision Model (LVM) without making use of any linguistic data. To do this,
we define a common format, "visual sentences", in which we can represent raw
images and videos as well as annotated data sources such as semantic
segmentations and depth reconstructions without needing any meta-knowledge
beyond the pixels. Once this wide variety of visual data (comprising 420
billion tokens) is represented as sequences, the model can be trained to
minimize a cross-entropy loss for next token prediction. By training across
various scales of model architecture and data diversity, we provide empirical
evidence that our models scale effectively. Many different vision tasks can be
solved by designing suitable visual prompts at test time. |
This paper introduces a novel sequential modeling approach to learning a Large Vision Model (LVM) without linguistic data. |
The goal is to create a foundation for LVMs that can scale with large datasets and address various vision tasks through prompting, similar to LLMs in NLP. |
The methodology involves: (1) Representing diverse visual data, including raw images/videos and annotations, as unified "visual sentences" - sequences of images. (2) Training a large transformer architecture to predict the next token in these visual sentences, using a learned tokenizer to convert images into discrete tokens. |
The model demonstrates scaling behavior with increasing model size and data size.
Various vision tasks can be solved by designing suitable prompts, showcasing the potential for in-context learning.
The model benefits significantly from the diversity and volume of unsupervised data used during training. |
Limitations in computational resources restricted the exploration of various aspects, like the impact of different datasets.
The model's size, despite being large, is still relatively small compared to LLMs, leaving room for further exploration in generalization capabilities. |
large vision model, visual prompting, sequential modeling, vision-only training, unsupervised learning |
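The next-token objective in step (2) of the LVM methodology can be sketched with NumPy. This is a toy illustration rather than the authors' training code: `logits` stands in for the output of a hypothetical transformer, and the vocabulary size and token ids are invented.

```python
import numpy as np

def next_token_cross_entropy(logits, tokens):
    """logits: (T-1, V) scores for positions 1..T-1; tokens: length-T id sequence."""
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(tokens)[1:]  # each position predicts the following token
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

# A "visual sentence" of 5 discrete tokens from a made-up vocabulary of size 8.
tokens = [3, 1, 4, 1, 5]
uniform_logits = np.zeros((len(tokens) - 1, 8))
loss = next_token_cross_entropy(uniform_logits, tokens)
```

With uniform logits the loss equals the log of the vocabulary size, which is a convenient sanity check before training.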
2312.00784
Report |
ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts |
Mu Cai, Haotian Liu, Dennis Park, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Yong Jae Lee |
While existing large vision-language multimodal models focus on whole image
understanding, there is a prominent gap in achieving region-specific
comprehension. Current approaches that use textual coordinates or spatial
encodings often fail to provide a user-friendly interface for visual prompting.
To address this challenge, we introduce a novel multimodal model capable of
decoding arbitrary visual prompts. This allows users to intuitively mark images
and interact with the model using natural cues like a "red bounding box" or
"pointed arrow". Our simple design directly overlays visual markers onto the
RGB image, eliminating the need for complex region encodings, yet achieves
state-of-the-art performance on region-understanding tasks like Visual7W,
PointQA, and Visual Commonsense Reasoning benchmark. Furthermore, we present
ViP-Bench, a comprehensive benchmark to assess the capability of models in
understanding visual prompts across multiple dimensions, enabling future
research in this domain. Code, data, and model are publicly available. |
The paper introduces ViP-LLaVA, a novel multimodal model designed for intuitive interaction with images using natural language and arbitrary visual prompts like arrows, boxes, or scribbles. |
Existing large vision-language models predominantly focus on whole-image understanding and lack the capacity to process region-specific information effectively, limiting their ability to understand user intent in complex scenes. |
ViP-LLaVA leverages CLIP's ability to encode both images and superimposed visual markers. By overlaying these prompts directly onto the image, the model learns to associate visual cues with specific regions, enhancing region-specific comprehension. |
ViP-LLaVA achieves state-of-the-art results on region understanding tasks, surpassing models specifically designed for region-based reasoning on benchmarks like Visual7W and PointQA.
The model demonstrates strong generalization abilities, accurately interpreting user-drawn visual prompts at test time, even with variations in thickness or marker type.
A new benchmark, ViP-Bench, is introduced to comprehensively evaluate multimodal models' region understanding capabilities under various visual prompts, covering aspects like recognition, OCR, knowledge, math, relationship reasoning, and language generation. |
Current LMMs, including ViP-LLaVA, still lag behind GPT-4V in tasks demanding strong language reasoning, particularly OCR, math, and language generation, indicating an area for future research.
While ViP-LLaVA effectively leverages visual prompts for region understanding, exploring other region representation methods, such as combining visual prompts with textual coordinates, could further enhance performance. |
multimodal learning, visual prompting, region understanding, vision-language models, benchmarking |
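The ViP-LLaVA abstract states that visual markers are overlaid directly onto the RGB image. The sketch below draws a box outline into a NumPy image array to show what such an overlay might look like. The function name, the alpha-blending step, and the thickness parameter are illustrative assumptions, not the released implementation.

```python
import numpy as np

def overlay_box(image, box, color=(255, 0, 0), thickness=2, alpha=1.0):
    """Blend a rectangular outline into an (H, W, 3) uint8 image.

    box is (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = box
    out = image.astype(np.float64)  # astype returns a copy, input is untouched
    mask = np.zeros(image.shape[:2], dtype=bool)
    mask[y0:y1, x0:x1] = True
    # Clear the interior so only a border of the given thickness remains.
    mask[y0 + thickness:y1 - thickness, x0 + thickness:x1 - thickness] = False
    out[mask] = (1.0 - alpha) * out[mask] + alpha * np.asarray(color, dtype=np.float64)
    return out.round().astype(np.uint8)

image = np.zeros((32, 32, 3), dtype=np.uint8)
marked = overlay_box(image, box=(4, 4, 20, 20))
```

The marked image would then be passed to the vision encoder like any other image, which is what removes the need for a separate region encoding.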
2312.00778
Report |
MorpheuS: Neural Dynamic 360° Surface Reconstruction from Monocular RGB-D Video |
Hengyi Wang, Jingwen Wang, Lourdes Agapito |
Neural rendering has demonstrated remarkable success in dynamic scene
reconstruction. Thanks to the expressiveness of neural representations, prior
works can accurately capture the motion and achieve high-fidelity
reconstruction of the target object. Despite this, real-world video scenarios
often feature large unobserved regions where neural representations struggle to
achieve realistic completion. To tackle this challenge, we introduce MorpheuS,
a framework for dynamic 360° surface reconstruction from a casually
captured RGB-D video. Our approach models the target scene as a canonical field
that encodes its geometry and appearance, in conjunction with a deformation
field that warps points from the current frame to the canonical space. We
leverage a view-dependent diffusion prior and distill knowledge from it to
achieve realistic completion of unobserved regions. Experimental results on
various real-world and synthetic datasets show that our method can achieve
high-fidelity 360° surface reconstruction of a deformable object from a
monocular RGB-D video. |
MorpheuS: a novel framework for dynamic 360° surface reconstruction from casual monocular RGB-D videos, achieving photo-realistic completion of unobserved regions via diffusion priors. |
Existing dynamic scene reconstruction methods struggle to achieve realistic completion of unobserved regions, limiting their applications. |
MorpheuS represents the scene with a hyper-dimensional canonical field and a deformation field. It leverages a view-dependent diffusion prior and distills knowledge from it through Score Distillation Sampling (SDS) to complete unobserved geometry and appearance. |
MorpheuS achieves high-fidelity 360° surface reconstruction with accurate motion and geometry.
The use of diffusion priors leads to photo-realistic completion of unobserved regions, outperforming previous methods.
Canonical space regularization and temporal view-dependent SDS contribute to robust and accurate reconstruction. |
MorpheuS may fail in challenging scenarios like incomplete views, motion blur, or complex articulation due to limitations of the diffusion prior.
The method currently lacks motion priors, hindering reconstruction in self-occluded regions. |
dynamic scene reconstruction, diffusion priors, neural implicit representations, 360° reconstruction, rgb-d video |
2312.00777
Report |
VideoBooth: Diffusion-based Video Generation with Image Prompts |
Yuming Jiang, Tianxing Wu, Shuai Yang, Chenyang Si, Dahua Lin, Yu Qiao, Chen Change Loy, Ziwei Liu |
Text-driven video generation has witnessed rapid progress. However, merely using
text prompts is not enough to depict the desired subject appearance that
accurately aligns with users' intents, especially for customized content
creation. In this paper, we study the task of video generation with image
prompts, which provide more accurate and direct content control beyond the text
prompts. Specifically, we propose a feed-forward framework VideoBooth, with two
dedicated designs: 1) We propose to embed image prompts in a coarse-to-fine
manner. Coarse visual embeddings from image encoder provide high-level
encodings of image prompts, while fine visual embeddings from the proposed
attention injection module provide multi-scale and detailed encoding of image
prompts. These two complementary embeddings can faithfully capture the desired
appearance. 2) In the attention injection module at fine level, multi-scale
image prompts are fed into different cross-frame attention layers as additional
keys and values. This extra spatial information refines the details in the
first frame and then it is propagated to the remaining frames, which maintains
temporal consistency. Extensive experiments demonstrate that VideoBooth
achieves state-of-the-art performance in generating customized high-quality
videos with subjects specified in image prompts. Notably, VideoBooth is a
generalizable framework where a single model works for a wide range of image
prompts with feed-forward pass. |
Introduces VideoBooth, a novel feed-forward framework for generating videos using both image and text prompts, enabling customized content creation with accurate subject appearance control. |
Addresses the limitations of text-only prompts in video generation, which struggle to accurately depict specific subject appearances, particularly for customized content. |
Employs a coarse-to-fine visual embedding strategy: 1) a pretrained CLIP image encoder extracts coarse visual embeddings from image prompts, inserted into text embeddings; 2) an attention injection module refines details by incorporating multi-scale image prompt representations into the cross-frame attention of a pretrained text-to-video diffusion model. |
VideoBooth achieves state-of-the-art performance in generating customized videos with subject appearance faithful to the image prompts.
Quantitative evaluation using CLIP-Image and DINO metrics demonstrates superior image alignment compared to baseline methods.
User study confirms VideoBooth's superiority in image alignment, text alignment, and overall quality. |
The model was trained on a dataset with watermarks, requiring an additional module to remove them.
Future work includes expanding the dataset and enhancing the model's ability to handle complex motions and diverse object appearances. |
video generation, image prompts, text-to-video, diffusion models, customized content creation |
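The attention injection described for VideoBooth feeds image-prompt features into attention layers as additional keys and values. A minimal single-head sketch of that mechanism is shown below. The shapes, names, and the absence of learned projections are simplifying assumptions, and the inputs are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_injected_prompt(q, k, v, k_img, v_img):
    """q: (Tq, d); k, v: (Tk, d); k_img, v_img: (Ti, d) image-prompt features."""
    k_all = np.concatenate([k, k_img], axis=0)  # extra keys from the image prompt
    v_all = np.concatenate([v, v_img], axis=0)  # extra values from the image prompt
    weights = softmax(q @ k_all.T / np.sqrt(q.shape[-1]))
    return weights @ v_all, weights

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(5, 16)) for _ in range(3))
k_img, v_img = rng.normal(size=(3, 16)), rng.normal(size=(3, 16))
out, weights = attention_with_injected_prompt(q, k, v, k_img, v_img)
```

Each query can now attend to the image-prompt tokens as well as the frame tokens, so prompt details can flow into the generated frame without changing the output shape.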
2312.00739
Report |
Adversarial Score Distillation: When score distillation meets GAN |
Min Wei, Jingkai Zhou, Junyao Sun, Xuesong Zhang |
Existing score distillation methods are sensitive to classifier-free guidance
(CFG) scale, which manifests as over-smoothness or instability at small CFG scales
and over-saturation at large ones. To explain and analyze these issues, we
revisit the derivation of Score Distillation Sampling (SDS) and decipher
existing score distillation with the Wasserstein Generative Adversarial Network
(WGAN) paradigm. With the WGAN paradigm, we find that existing score
distillation either employs a fixed sub-optimal discriminator or conducts
incomplete discriminator optimization, resulting in the scale-sensitive issue.
We propose the Adversarial Score Distillation (ASD), which maintains an
optimizable discriminator and updates it using the complete optimization
objective. Experiments show that the proposed ASD performs favorably in 2D
distillation and text-to-3D tasks against existing methods. Furthermore, to
explore the generalization ability of our WGAN paradigm, we extend ASD to the
image editing task, which achieves competitive results. The project page and
code are at https://github.com/2y7c3/ASD. |
This paper unveils the connection between score distillation and GANs, proposing Adversarial Score Distillation (ASD) to address the limitations of existing score distillation methods. |
Existing score distillation methods, like SDS, are sensitive to classifier-free guidance (CFG) scale, resulting in over-smoothing or instability at small scales and over-saturation at large scales. This paper aims to analyze and rectify these issues. |
The authors revisit the derivation of SDS and establish its connection to Wasserstein GAN (WGAN). They identify that existing methods either employ a fixed sub-optimal discriminator or conduct incomplete optimization. ASD, leveraging the WGAN paradigm, maintains an optimizable discriminator and updates it using the complete WGAN discriminator loss. |
ASD demonstrates superior performance in quality, stability, and diversity compared to SDS and VSD in both 2D distillation and text-to-3D tasks.
The paper provides a theoretical analysis of VSD and CSD under the WGAN paradigm.
ASD's application is extended to image editing, showcasing competitive results and highlighting the generalization ability of the proposed paradigm. |
ASD, while exhibiting strong performance, still suffers from speed limitations similar to VSD.
Further exploration of dynamic gamma values in the discriminator loss function is suggested for potential improvement. |
score distillation, generative adversarial networks, text-to-3d synthesis, image editing, wasserstein gan |
2312.00732
Report |
Gaussian Grouping: Segment and Edit Anything in 3D Scenes |
Mingqiao Ye, Martin Danelljan, Fisher Yu, Lei Ke |
Gaussian Splatting has recently achieved high-quality, real-time novel-view
synthesis of 3D scenes. However, it concentrates solely on appearance and
geometry modeling and lacks fine-grained object-level
scene understanding. To address this issue, we propose Gaussian Grouping, which
extends Gaussian Splatting to jointly reconstruct and segment anything in
open-world 3D scenes. We augment each Gaussian with a compact Identity
Encoding, allowing the Gaussians to be grouped according to their object
instance or stuff membership in the 3D scene. Instead of resorting to expensive
3D labels, we supervise the Identity Encodings during the differentiable
rendering by leveraging the 2D mask predictions by SAM, along with introduced
3D spatial consistency regularization. Compared to the implicit NeRF
representation, we show that the discrete and grouped 3D Gaussians can
reconstruct, segment and edit anything in 3D with high visual quality, fine
granularity and efficiency. Based on Gaussian Grouping, we further propose a
local Gaussian Editing scheme, which shows efficacy in versatile scene editing
applications, including 3D object removal, inpainting, colorization and scene
recomposition. Our code and models will be at
https://github.com/lkeab/gaussian-grouping. |
Presents Gaussian Grouping, an extension of 3D Gaussian Splatting for joint reconstruction and segmentation of anything in open-world 3D scenes. |
Addresses the limitations of existing 3D scene understanding methods that rely on expensive 3D labels or struggle with fine-grained segmentation in open-world settings. |
Augments each Gaussian with a learnable Identity Encoding, supervised by 2D mask predictions from SAM and a 3D spatial consistency regularization, enabling grouping of Gaussians into object instances or stuff. |
Achieves high-quality reconstruction comparable to original Gaussian Splatting.
Significantly outperforms existing open-vocabulary 3D segmentation methods on LERF-Mask dataset.
Enables efficient and versatile scene editing applications, including object removal, inpainting, colorization, and scene recomposition. |
Currently limited to static 3D scenes due to the lack of dynamic modeling.
Fully unsupervised 3D Gaussian grouping without 2D mask supervision remains unexplored and is left for future work. |
3d scene understanding, gaussian splatting, open-world segmentation, scene editing, segment anything model (sam) |
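Gaussian Grouping supervises per-Gaussian Identity Encodings through differentiable rendering. The sketch below shows front-to-back alpha compositing applied to identity vectors for one pixel, assuming the encodings are blended in the same way as color in splatting. The function name and the toy inputs are hypothetical.

```python
import numpy as np

def composite_identity(encodings, alphas):
    """Blend (K, D) identity encodings, sorted front to back, with (K,) opacities."""
    encodings = np.asarray(encodings, dtype=np.float64)
    alphas = np.asarray(alphas, dtype=np.float64)
    # Transmittance reaching Gaussian i: product of (1 - alpha) over Gaussians in front.
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = alphas * transmittance
    return weights @ encodings

# Two Gaussians with one-hot identities, each half transparent.
pixel_identity = composite_identity([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
```

The rendered per-pixel vector can then be compared against a 2D mask label, which is how 2D supervision can reach the 3D encodings.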
2312.00674
Report |
LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models |
Ying Nie, Wei He, Kai Han, Yehui Tang, Tianyu Guo, Fanyi Du, Yunhe Wang |
Vision-language pre-training like CLIP has shown promising performance on
various downstream tasks such as zero-shot image classification and image-text
retrieval. Most existing CLIP-like works adopt relatively large
image encoders like ResNet50 and ViT, while the lightweight counterparts are
rarely discussed. In this paper, we propose a multi-level interaction paradigm
for training lightweight CLIP models. Firstly, to mitigate the problem that
some image-text pairs are not in strict one-to-one correspondence, we improve
the conventional global instance-level alignment objective by softening the
label of negative samples progressively. Secondly, a relaxed bipartite matching
based token-level alignment objective is introduced for finer-grained alignment
between image patches and textual words. Moreover, based on the observation
that the accuracy of CLIP model does not increase correspondingly as the
parameters of text encoder increase, an extra objective of masked language
modeling (MLM) is leveraged for maximizing the potential of the shortened text
encoder. In practice, an auxiliary fusion module injecting unmasked image
embedding into masked text embedding at different network stages is proposed
for enhancing the MLM. Extensive experiments show that without introducing
additional computational cost during inference, the proposed method achieves a
higher performance on multiple downstream tasks. |
This paper proposes LightCLIP, a multi-level interaction paradigm for training lightweight CLIP models that achieves higher performance on downstream tasks without introducing additional computational cost during inference. |
Existing CLIP models are difficult to deploy on edge devices due to their large parameter size, and directly adapting existing training methods to lightweight models leads to sub-optimal results. |
The authors propose: (1) a progressive softening of labels for the global instance-level alignment objective to account for noisy image-text pairs; (2) a relaxed bipartite matching based token-level alignment objective for finer-grained alignment between image patches and textual words; and (3) a masked language modeling (MLM) objective enhanced by fusing unmasked image embedding into masked text embedding at different network stages to maximize the potential of a shortened text encoder. |
LightCLIP outperforms CLIP, SLIP, and DeCLIP on zero-shot ImageNet classification with various lightweight image encoders.
LightCLIP consistently achieves higher average zero-shot accuracy on 10 small datasets compared to other methods.
LightCLIP shows significant improvements in zero-shot image-text retrieval on Flickr30K and MS-COCO, especially in image-to-text top-1 hit accuracy. |
The paper primarily focuses on YFCC15M-V2 dataset for pre-training, limiting the exploration of performance with larger datasets.
Future work could explore alternative lightweight architectures and fusion strategies for both image and text encoders. |
vision-language pre-training, lightweight model, clip, zero-shot learning, image-text retrieval |
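The first LightCLIP component softens the labels of negative samples progressively during training. The sketch below illustrates one way this could work. The linear schedule, the `max_softness` value, and the uniform spread over negatives are assumptions made for illustration; the source does not specify them.

```python
import numpy as np

def softened_targets(batch_size, step, total_steps, max_softness=0.2):
    """Target matrix whose off-diagonal (negative) mass grows linearly with step."""
    softness = max_softness * min(step / total_steps, 1.0)
    targets = np.full((batch_size, batch_size), softness / (batch_size - 1))
    np.fill_diagonal(targets, 1.0 - softness)
    return targets

def soft_alignment_loss(similarity, targets):
    """Cross-entropy between row-wise softmax of similarity and soft targets."""
    shifted = similarity - similarity.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(targets * log_probs).sum(axis=1).mean())

similarity = np.eye(4) * 5.0  # toy image-text similarity matrix
hard = softened_targets(4, step=0, total_steps=100)
soft = softened_targets(4, step=100, total_steps=100)
loss_hard = soft_alignment_loss(similarity, hard)
loss_soft = soft_alignment_loss(similarity, soft)
```

At step 0 the targets reduce to the usual one-hot contrastive labels, so a schedule of this kind starts from the conventional objective and relaxes it over time.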
2312.00596
Report |
BCN: Batch Channel Normalization for Image Classification |
Afifa Khaled, Chao Li, Jia Ning, Kun He |
Normalization techniques have been widely used in the field of deep learning
because they enable higher learning rates and make training less sensitive
to initialization. However, the effectiveness of popular normalization
techniques is typically limited to specific areas. Unlike the standard Batch
Normalization (BN) and Layer Normalization (LN), where BN computes the mean and
variance along the (N,H,W) dimensions and LN computes the mean and variance
along the (C,H,W) dimensions (N, C, H and W are the batch, channel, spatial
height and width dimension, respectively), this paper presents a novel
normalization technique called Batch Channel Normalization (BCN). To exploit
both the channel and batch dependence and adaptively combine the advantages
of BN and LN based on specific datasets or tasks, BCN separately normalizes
inputs along the (N, H, W) and (C, H, W) axes, then combines the normalized
outputs based on adaptive parameters. As a basic block, BCN can be easily
integrated into existing models for various applications in the field of
computer vision. Empirical results show that the proposed technique can be
seamlessly applied to various versions of CNN or Vision Transformer
architecture. The code is publicly available at
https://github.com/AfifaKhaled/BatchChannel-Normalization |
This paper introduces Batch Channel Normalization (BCN), a novel normalization technique for deep learning that combines the strengths of Batch Normalization (BN) and Layer Normalization (LN). |
Existing normalization techniques like BN and LN have limitations, with BN requiring large batch sizes and LN not performing well on convolutional layers. BCN aims to overcome these limitations by exploiting both channel and batch dependencies. |
BCN normalizes inputs separately along the (N, H, W) and (C, H, W) axes, then combines these normalized outputs using adaptive parameters. This allows BCN to leverage the advantages of both BN and LN. |
BCN consistently outperforms BN, LN, and other normalization techniques in image classification tasks on CIFAR-10/100, SVHN, and ImageNet datasets.
BCN improves the performance of self-supervised learning methods like BYOL.
BCN shows consistent improvements when applied to Vision Transformer (ViT) models. |
Future work includes an ablation study on directly computing average and variance along (N, C, H, W) axes.
Further investigation of BCN's effectiveness across a wider range of CNN architectures and applications is planned. |
batch normalization, layer normalization, deep learning, normalization techniques, computer vision |
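The BCN description (normalize along (N, H, W) and along (C, H, W), then combine with adaptive parameters) can be sketched in NumPy as follows. The scalar mixing weight `mix` and the scalar `gamma` and `beta` are simplifying assumptions. In a real layer these would be learned, and the exact combination rule may differ from this convex blend.

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Zero-mean, unit-variance normalization over the given axes."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_channel_norm(x, mix=0.5, gamma=1.0, beta=0.0):
    """x: (N, C, H, W). Blend BN-style and LN-style normalized outputs."""
    x_batch = normalize(x, axes=(0, 2, 3))    # statistics over (N, H, W), as in BN
    x_channel = normalize(x, axes=(1, 2, 3))  # statistics over (C, H, W), as in LN
    combined = mix * x_batch + (1.0 - mix) * x_channel
    return gamma * combined + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 3, 5, 5))
y = batch_channel_norm(x, mix=0.5)
```

Setting `mix` to 1 or 0 recovers the pure batch-style or layer-style branch, which makes the blend easy to inspect.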
2312.00588
Report |
LucidDreaming: Controllable Object-Centric 3D Generation |
Zhaoning Wang, Ming Li, Chen Chen |
With the recent development of generative models, Text-to-3D generations have
also seen significant growth. Nonetheless, achieving precise control over 3D
generation continues to be an arduous task, as using text to control often
leads to missing objects and imprecise locations. Contemporary strategies for
enhancing controllability in 3D generation often entail the introduction of
additional parameters, such as customized diffusion models. This often makes it
hard to adapt to different diffusion models or to create distinct
objects.
In this paper, we present LucidDreaming as an effective pipeline capable of
fine-grained control over 3D generation. It requires only minimal input of 3D
bounding boxes, which can be deduced from a simple text prompt using a Large
Language Model. Specifically, we propose clipped ray sampling to separately
render and optimize objects with user specifications. We also introduce
object-centric density blob bias, fostering the separation of generated
objects. With individual rendering and optimizing of objects, our method excels
not only in controlled content generation from scratch but also within the
pre-trained NeRF scenes. In such scenarios, existing generative approaches
often disrupt the integrity of the original scene, and current editing methods
struggle to synthesize new content in empty spaces. We show that our method
exhibits remarkable adaptability across a spectrum of mainstream Score
Distillation Sampling-based 3D generation frameworks, and achieves superior
alignment of 3D content when compared to baseline approaches. We also provide a
dataset of prompts with 3D bounding boxes, benchmarking 3D spatial
controllability. |
This paper introduces LucidDreaming, a plug-and-play pipeline enhancing controllability in 3D generation using bounding boxes or text prompts. |
Existing text-to-3D generation methods struggle with fine-grained control, often resulting in missing objects or inaccurate placements. While controllable methods exist, they rely on customized diffusion models and lack adaptability. |
The paper proposes: (1) Clipped ray sampling for individual object rendering and optimization within bounding boxes. (2) Object-centric density bias initialization to accurately position initial density within bounding boxes. (3) Integration of a Large Language Model to generate bounding boxes and object descriptions from text prompts. |
LucidDreaming demonstrates superior control over object placement and number compared to baseline methods.
The method adapts to various SDS-based 3D generation frameworks (DreamFusion, Magic3D, ProlificDreamer).
It allows controlled object generation within pre-trained NeRF scenes, unlike methods focused on modifying existing objects. |
Current implementation struggles with object interactions, relying on separate rendering.
Training time increases linearly with the number of objects, posing challenges for complex scenes. |
3d generation, controllability, text-to-3d, nerf, score distillation sampling |
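Clipped ray sampling, as summarized for LucidDreaming, restricts rendering to a user-specified bounding box. One plausible building block is a ray and axis-aligned box intersection followed by sampling only inside the hit interval. The sketch below uses the standard slab method. The function names and the uniform sampling are illustrative assumptions, not the paper's code.

```python
import numpy as np

def clip_ray_to_box(origin, direction, box_min, box_max, near=0.0, far=float("inf")):
    """Return (t_enter, t_exit) where the ray is inside the box, or None on a miss."""
    t_enter, t_exit = near, far
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if abs(d) < 1e-12:  # ray parallel to this pair of slab planes
            if o < lo or o > hi:
                return None
            continue
        t_a, t_b = (lo - o) / d, (hi - o) / d
        t_enter = max(t_enter, min(t_a, t_b))
        t_exit = min(t_exit, max(t_a, t_b))
    return (t_enter, t_exit) if t_enter < t_exit else None

def sample_points_in_box(origin, direction, box_min, box_max, num_samples=8):
    """Sample points along the ray, restricted to the part inside the box."""
    hit = clip_ray_to_box(origin, direction, box_min, box_max)
    if hit is None:
        return np.empty((0, 3))
    t = np.linspace(hit[0], hit[1], num_samples)
    return np.asarray(origin, dtype=np.float64) + t[:, None] * np.asarray(direction, dtype=np.float64)

box_min, box_max = (-1.0, -1.0, -1.0), (1.0, 1.0, 1.0)
hit = clip_ray_to_box((0.0, 0.0, -5.0), (0.0, 0.0, 1.0), box_min, box_max)
points = sample_points_in_box((0.0, 0.0, -5.0), (0.0, 0.0, 1.0), box_min, box_max)
miss = clip_ray_to_box((5.0, 0.0, -5.0), (0.0, 0.0, 1.0), box_min, box_max)
```

Rays that miss the box contribute no samples, so each object can be rendered and optimized from its own box without touching the rest of the scene.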
2312.00583
Report |
MD-Splatting: Learning Metric Deformation from 4D Gaussians in Highly Deformable Scenes |
Bardienus P. Duisterhof, Zhao Mandi, Yunchao Yao, Jia-Wei Liu, Mike Zheng Shou, Shuran Song, Jeffrey Ichnowski |
Accurate 3D tracking in highly deformable scenes with occlusions and shadows
can facilitate new applications in robotics, augmented reality, and generative
AI. However, tracking under these conditions is extremely challenging due to
the ambiguity that arises with large deformations, shadows, and occlusions. We
introduce MD-Splatting, an approach for simultaneous 3D tracking and novel view
synthesis, using video captures of a dynamic scene from various camera poses.
MD-Splatting builds on recent advances in Gaussian splatting, a method that
learns the properties of a large number of Gaussians for state-of-the-art and
fast novel view synthesis. MD-Splatting learns a deformation function to
project a set of Gaussians with non-metric, thus canonical, properties into
metric space. The deformation function uses a neural-voxel encoding and a
multilayer perceptron (MLP) to infer Gaussian position, rotation, and a shadow
scalar. We enforce physics-inspired regularization terms based on local
rigidity, conservation of momentum, and isometry, which leads to trajectories
with smaller trajectory errors. MD-Splatting achieves high-quality 3D tracking
on highly deformable scenes with shadows and occlusions. Compared to the
state of the art, we improve 3D tracking by an average of 23.9%, while
simultaneously achieving high-quality novel view synthesis. With sufficient
texture such as in scene 6, MD-Splatting achieves a median tracking error of
3.39 mm on a cloth of 1 x 1 meters in size. Project website:
https://md-splatting.github.io/. |
MD-Splatting is a novel method for simultaneous 3D tracking and novel view synthesis in highly deformable scenes, using video captures from various camera poses. It leverages Gaussian splatting and learns a deformation function to map canonical Gaussians into metric space for tracking and rendering. |
Accurate 3D tracking in deformable scenes is crucial for applications in robotics, AR, and AI, but it is challenging due to ambiguities caused by deformations, shadows, and occlusions. |
MD-Splatting learns a deformation function using a neural-voxel encoding and an MLP to infer Gaussian position, rotation, and a shadow scalar. It also enforces physics-inspired regularization terms for plausible deformations. |
MD-Splatting achieves state-of-the-art 3D tracking on deformable scenes, improving accuracy by 16.7% compared to previous methods.
It achieves high-quality novel view reconstruction with an average PSNR of 39.1.
The method exhibits robustness in textured environments and shows promising results even with lower time resolution. |
The method currently relies on a multi-camera setup, limiting its applicability in some real-world scenarios.
The work primarily focuses on scenes with a single cloth object; expanding to more complex environments with diverse soft objects is an area for future exploration. |
3d tracking, novel view synthesis, deformable objects, gaussian splatting, neural rendering |
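The isometry regularizer mentioned in the MD-Splatting entry above can be illustrated with a small sketch. It assumes the term penalizes changes in the distance between neighboring Gaussians relative to a reference frame; the function name `isometry_loss`, the neighbor pairs, and the absolute-difference penalty are illustrative assumptions, not the authors' implementation.

```python
import math


def isometry_loss(ref_points, cur_points, neighbors):
    """Mean absolute change in neighbor distances between two frames.

    ref_points, cur_points: lists of (x, y, z) Gaussian centers at the
    reference time and at the current time, in the same order.
    neighbors: list of (i, j) index pairs treated as local neighbors
    (hypothetical; e.g. taken from a k-nearest-neighbor search).
    """
    if not neighbors:
        return 0.0
    total = 0.0
    for i, j in neighbors:
        d_ref = math.dist(ref_points[i], ref_points[j])
        d_cur = math.dist(cur_points[i], cur_points[j])
        total += abs(d_cur - d_ref)
    return total / len(neighbors)


# Three Gaussian centers and all pairs among them.
ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
pairs = [(0, 1), (0, 2), (1, 2)]

shifted = [(x + 0.5, y, z) for x, y, z in ref]    # rigid translation
stretched = [(2.0 * x, y, z) for x, y, z in ref]  # stretch along x

loss_rigid = isometry_loss(ref, shifted, pairs)
loss_stretch = isometry_loss(ref, stretched, pairs)
```

A rigid motion leaves all pairwise distances unchanged, so `loss_rigid` is zero, while the stretch changes two of the three distances and yields a positive `loss_stretch`.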
2312.00451
Report |
FSGS: Real-Time Few-shot View Synthesis using Gaussian Splatting |
Zehao Zhu, Zhiwen Fan, Yifan Jiang, Zhangyang Wang |
Novel view synthesis from limited observations remains an important and
persistent task. However, high efficiency in existing NeRF-based few-shot view
synthesis is often compromised to obtain an accurate 3D representation. To
address this challenge, we propose a few-shot view synthesis framework based on
3D Gaussian Splatting that enables real-time and photo-realistic view synthesis
with as few as three training views. The proposed method, dubbed FSGS, handles
the extremely sparse initialized SfM points with a thoughtfully designed
Gaussian Unpooling process. Our method iteratively distributes new Gaussians
around the most representative locations, subsequently infilling local details
in vacant areas. We also integrate a large-scale pre-trained monocular depth
estimator within the Gaussians optimization process, leveraging online
augmented views to guide the geometric optimization towards an optimal
solution. Starting from sparse points observed from limited input viewpoints,
our FSGS can accurately grow into unseen regions, comprehensively covering the
scene and boosting the rendering quality of novel views. Overall, FSGS achieves
state-of-the-art performance in both accuracy and rendering efficiency across
diverse datasets, including LLFF, Mip-NeRF360, and Blender. Project website:
https://zehaozhu.github.io/FSGS/. |
FSGS, a novel point-based framework for few-shot view synthesis, leveraging Proximity-guided Gaussian Unpooling and monocular depth priors. |
Addresses the low efficiency and inaccurate 3D representations of existing NeRF-based few-shot view synthesis methods. |
Employs Proximity-guided Gaussian Unpooling to densify 3D Gaussians for scene coverage and integrates monocular depth priors, enhanced by pseudo view generation, for optimal Gaussian optimization. |
Achieves state-of-the-art rendering quality on LLFF, Mip-NeRF360, and Blender datasets.
Enables real-time rendering speed (200+ FPS) suitable for practical applications.
Significantly outperforms NeRF-based methods in rendering accuracy and speed, particularly in few-shot scenarios with limited training views. |
Reliance on accurate SfM for initialization, potentially limiting performance in challenging scenarios.
Exploration of alternative depth priors beyond monocular depth estimators to further improve accuracy. |
novel view synthesis, few-shot learning, 3d gaussian splatting, monocular depth estimation, real-time rendering |
2312.00330
Report |
StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter |
Gongye Liu, Menghan Xia, Yong Zhang, Haoxin Chen, Jinbo Xing, Xintao Wang, Yujiu Yang, Ying Shan |
Text-to-video (T2V) models have shown remarkable capabilities in generating
diverse videos. However, they struggle to produce user-desired stylized videos
due to (i) text's inherent clumsiness in expressing specific styles and (ii)
the generally degraded style fidelity. To address these challenges, we
introduce StyleCrafter, a generic method that enhances pre-trained T2V models
with a style control adapter, enabling video generation in any style by
providing a reference image. Considering the scarcity of stylized video
datasets, we propose to first train a style control adapter using style-rich
image datasets, then transfer the learned stylization ability to video
generation through a tailor-made finetuning paradigm. To promote content-style
disentanglement, we remove style descriptions from the text prompt and extract
style information solely from the reference image using a decoupling learning
strategy. Additionally, we design a scale-adaptive fusion module to balance the
influences of text-based content features and image-based style features, which
helps generalization across various text and style combinations. StyleCrafter
efficiently generates high-quality stylized videos that align with the content
of the texts and resemble the style of the reference images. Experiments
demonstrate that our approach is more flexible and efficient than existing
competitors. |
The paper introduces StyleCrafter, a novel method that enables pre-trained text-to-video (T2V) models to generate stylized videos using a single reference image. |
Existing T2V models struggle to produce stylized videos due to the difficulty of expressing specific styles through text prompts and the lack of large-scale stylized video datasets. |
The authors propose a two-stage training pipeline: 1) train a style adapter on a stylized image dataset to extract style features, 2) adapt a pre-trained T2V model by fine-tuning its temporal blocks with the style adapter incorporated. |
StyleCrafter generates high-quality stylized videos that are both text-aligned and style-conformant.
The method outperforms existing single-reference and even some multi-reference based stylized video generation methods.
Ablation studies validate the effectiveness of the proposed style adapter architecture, training scheme, and adaptive style-content fusion module. |
The model may not generate satisfactory results when the reference image inadequately represents the target style or the style is extremely uncommon.
The reliance on pre-trained T2V models limits the quality of generated results in certain aspects, e.g., generating high-fidelity faces. |
text-to-video generation, stylized video generation, style adapter, content-style disentanglement, diffusion models |
2312.00210
Report |
DREAM: Diffusion Rectification and Estimation-Adaptive Models |
Jinxin Zhou, Tianyu Ding, Tianyi Chen, Jiachen Jiang, Ilya Zharkov, Zhihui Zhu, Luming Liang |
We present DREAM, a novel training framework representing Diffusion
Rectification and Estimation Adaptive Models, requiring minimal code changes
(just three lines) yet significantly enhancing the alignment of training with
sampling in diffusion models. DREAM features two components: diffusion
rectification, which adjusts training to reflect the sampling process, and
estimation adaptation, which balances perception against distortion. When
applied to image super-resolution (SR), DREAM adeptly navigates the tradeoff
between minimizing distortion and preserving high image quality. Experiments
demonstrate DREAM's superiority over standard diffusion-based SR methods,
showing a $2$ to $3\times $ faster training convergence and a $10$ to
$20\times$ reduction in sampling steps to achieve comparable results. We hope
DREAM will inspire a rethinking of diffusion model training paradigms. |
This paper presents DREAM, a novel training framework for diffusion models that effectively reduces the training-sampling discrepancy in conditional image generation tasks, such as super-resolution. |
Training diffusion models, especially for conditional generation tasks, suffers from a discrepancy between training and sampling processes, hindering their performance. DREAM addresses this issue with minimal code changes, leading to enhanced image quality, faster training, and improved sampling efficiency. |
DREAM consists of two main components: 1) Diffusion Rectification: adjusts training to reflect the sampling process by utilizing the model's own predictions for error estimation and rectification. 2) Estimation Adaptation: balances the benefits of standard diffusion and diffusion rectification by adaptively incorporating ground-truth information during training. |
DREAM significantly enhances both distortion and perception metrics across various diffusion-based super-resolution models and datasets.
It achieves a 2-3 times faster training convergence and a 10-20 times reduction in sampling steps compared to standard diffusion training, yielding superior or comparable results.
DREAM demonstrates superior robustness and generalization ability, achieving state-of-the-art out-of-distribution (OOD) super-resolution performance across diverse datasets and scales. |
While DREAM shows promising results, it primarily focuses on super-resolution tasks in this work.
Further exploration of advanced network architectures and loss functions, such as incorporating GAN loss, could potentially lead to further enhancements in image quality. |
diffusion models, super-resolution, training-sampling discrepancy, generative models, image generation |
2312.00206
Report |
SparseGS: Real-Time 360° Sparse View Synthesis using Gaussian Splatting |
Haolin Xiong, Sairisheek Muttukuru, Rishi Upadhyay, Pradyumna Chari, Achuta Kadambi |
The problem of novel view synthesis has grown significantly in popularity
recently with the introduction of Neural Radiance Fields (NeRFs) and other
implicit scene representation methods. A recent advance, 3D Gaussian Splatting
(3DGS), leverages an explicit representation to achieve real-time rendering
with high-quality results. However, 3DGS still requires an abundance of
training views to generate a coherent scene representation. In few-shot
settings, similar to NeRF, 3DGS tends to overfit to training views, causing
background collapse and excessive floaters, especially as the number of
training views is reduced. We propose a method to enable training coherent
3DGS-based radiance fields of 360-degree scenes from sparse training views. We
integrate depth priors with generative and explicit constraints to reduce
background collapse, remove floaters, and enhance consistency from unseen
viewpoints. Experiments show that our method outperforms base 3DGS by 6.4% in
LPIPS and by 12.2% in PSNR, and NeRF-based methods by at least 17.6% in LPIPS
on the MipNeRF-360 dataset with substantially less training and inference cost. |
This paper presents SparseGS, a novel method for real-time 360° sparse view synthesis that leverages 3D Gaussian Splatting (3DGS) and incorporates depth priors, diffusion constraints, and a novel floater pruning technique. |
Existing view synthesis techniques like NeRFs and 3DGS often struggle in few-shot scenarios, leading to artifacts like floaters and background collapse, particularly in challenging 360° unbounded scenes. |
SparseGS integrates depth priors using a patch-based depth correlation loss based on a novel softmax depth rendering technique. It utilizes a score distillation sampling loss from a pre-trained diffusion model for refinement and employs image re-projection for data augmentation. A key innovation is an explicit, adaptive operator that directly prunes unwanted "floater" artifacts from the 3D Gaussian representation. |
SparseGS outperforms base 3DGS by 6.4% in LPIPS and 12.2% in PSNR on the MipNeRF-360 dataset.
It surpasses NeRF-based methods by at least 17.6% in LPIPS.
SparseGS enables real-time inference (100+ FPS) while maintaining high visual quality. |
SparseGS is highly reliant on the accuracy and detail of the initial point cloud provided by COLMAP, which can be problematic in sparse view settings where initial point clouds are small.
Future work could explore point cloud densification techniques as data augmentation to address this limitation. |
novel view synthesis, 3d gaussian splatting, few-shot learning, depth priors, floater pruning |
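The patch-based depth correlation loss named in the SparseGS entry above can be sketched as follows. The sketch assumes the loss is one minus the Pearson correlation between a rendered depth patch and the matching patch of a depth prior, averaged over non-overlapping patches; this choice makes the loss insensitive to the unknown scale and shift of a monocular depth estimate. All function names and the patch layout are illustrative assumptions.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equally long sequences (0.0 if degenerate)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom > 0 else 0.0


def patch_depth_correlation_loss(rendered, prior, patch):
    """Average (1 - Pearson) over non-overlapping patch x patch windows.

    rendered, prior: 2D lists of depth values with identical shape.
    patch: side length of each square patch.
    """
    height, width = len(rendered), len(rendered[0])
    losses = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            rows = range(top, top + patch)
            cols = range(left, left + patch)
            r = [rendered[y][x] for y in rows for x in cols]
            p = [prior[y][x] for y in rows for x in cols]
            losses.append(1.0 - pearson(r, p))
    return sum(losses) / len(losses) if losses else 0.0


rendered = [[float(x + 4 * y) for x in range(4)] for y in range(4)]
affine_prior = [[2.0 * d + 3.0 for d in row] for row in rendered]  # scaled and shifted
inverted_prior = [[-d for d in row] for row in rendered]           # ordering flipped

loss_affine = patch_depth_correlation_loss(rendered, affine_prior, patch=2)
loss_inverted = patch_depth_correlation_loss(rendered, inverted_prior, patch=2)
```

A prior that differs from the rendered depth only by scale and shift gives a loss near zero, while a prior with reversed depth ordering gives a loss near two.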
2312.00195
Report |
Raising the Bar of AI-generated Image Detection with CLIP |
Davide Cozzolino, Giovanni Poggi, Riccardo Corvi, Matthias Nießner, Luisa Verdoliva |
The aim of this work is to explore the potential of pre-trained
vision-language models (VLMs) for universal detection of AI-generated images.
We develop a lightweight detection strategy based on CLIP features and study
its performance in a wide variety of challenging scenarios. We find that,
contrary to previous beliefs, it is neither necessary nor convenient to use a
large domain-specific dataset for training. On the contrary, by using only a
handful of example images from a single generative model, a CLIP-based detector
exhibits surprising generalization ability and high robustness across different
architectures, including recent commercial tools such as Dalle-3, Midjourney
v5, and Firefly. We match the state-of-the-art (SoTA) on in-distribution data
and significantly improve upon it in terms of generalization to
out-of-distribution data (+6% AUC) and robustness to impaired/laundered data
(+13%). Our project is available at
https://grip-unina.github.io/ClipBased-SyntheticImageDetection/ |
This paper presents a lightweight AI-generated image detection method using CLIP features, demonstrating superior generalization ability and robustness across diverse generators, outperforming state-of-the-art methods. |
Detecting AI-generated images is crucial for combating disinformation and ensuring media authenticity, especially with the proliferation of advanced image synthesis tools. |
The method extracts CLIP features from real/fake image pairs with shared textual descriptions and trains a linear SVM classifier on them. It analyzes the impact of reference set size, content, and CLIP pre-training. |
CLIP features achieve excellent generalization, requiring only a handful of examples for effective detection.
Performance is influenced by the diversity of the reference set and benefits from large-scale pre-training.
The method demonstrates strong robustness to image perturbations, surpassing existing methods, particularly on challenging commercial AI-generated images. |
The method's reliance on semantic features might be vulnerable to adversarial attacks targeting these aspects.
Future work includes exploring few-shot adaptation for real-world scenarios and improving interpretability. |
ai-generated image detection, clip, generalization, robustness, image forensics |
2312.00116
Report |
S2ST: Image-to-Image Translation in the Seed Space of Latent Diffusion |
Or Greenberg, Eran Kishon, Dani Lischinski |
Image-to-image translation (I2IT) refers to the process of transforming
images from a source domain to a target domain while maintaining a fundamental
connection in terms of image content. In the past few years, remarkable
advancements in I2IT were achieved by Generative Adversarial Networks (GANs),
which nevertheless struggle with translations requiring high precision.
Recently, Diffusion Models have established themselves as the engine of choice
for image generation. In this paper we introduce S2ST, a novel framework
designed to accomplish global I2IT in complex photorealistic images, such as
day-to-night or clear-to-rain translations of automotive scenes. S2ST operates
within the seed space of a Latent Diffusion Model, thereby leveraging the
powerful image priors learned by the latter. We show that S2ST surpasses
state-of-the-art GAN-based I2IT methods, as well as diffusion-based approaches,
for complex automotive scenes, improving fidelity while respecting the target
domain's appearance across a variety of domains. Notably, S2ST obviates the
necessity for training domain-specific translation networks. |
Introduces S2ST, a novel diffusion-based unpaired image-to-image translation (I2IT) method for complex photorealistic images (e.g., automotive scenes), operating within the seed space of a Latent Diffusion Model (LDM). |
Addresses limitations of GAN-based I2IT methods in handling complex scene translations with high content fidelity, leveraging the power of pre-trained diffusion models for realistic and detailed image generation. |
Employs a two-step process: 1) Seed Translation optimizes the initial seed obtained by inverting the source image to match the target domain while preserving structure. 2) Trajectory Optimization refines the DDIM sampling trajectory to further enhance structural similarity between source and generated images. |
Outperforms state-of-the-art GAN-based methods in terms of target domain appearance and realism (measured by KID and SSIM) for day-to-night translations on BDD100k.
Demonstrates superior performance in human evaluation for achieving target domain appearance while preserving source image content.
Enables multi-domain translation using the same model, unlike GAN-based methods requiring separate training for each domain pair. |
High computational cost due to backpropagation through the entire sampling process.
Lack of explicit cycle-consistency mechanism found in GANs, potentially limiting content preservation despite efforts through seed optimization and trajectory refinement. |
image-to-image translation, diffusion models, seed space, trajectory optimization, automotive scenes |
2312.00109
Report |
Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering |
Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, Bo Dai |
Neural rendering methods have significantly advanced photo-realistic 3D scene
rendering in various academic and industrial applications. The recent 3D
Gaussian Splatting method has achieved the state-of-the-art rendering quality
and speed combining the benefits of both primitive-based representations and
volumetric representations. However, it often leads to heavily redundant
Gaussians that try to fit every training view, neglecting the underlying scene
geometry. Consequently, the resulting model becomes less robust to significant
view changes, texture-less areas and lighting effects. We introduce Scaffold-GS,
which uses anchor points to distribute local 3D Gaussians, and predicts their
attributes on-the-fly based on viewing direction and distance within the view
frustum. Anchor growing and pruning strategies are developed based on the
importance of neural Gaussians to reliably improve the scene coverage. We show
that our method effectively reduces redundant Gaussians while delivering
high-quality rendering. We also demonstrate an enhanced capability to
accommodate scenes with varying levels-of-detail and view-dependent
observations, without sacrificing the rendering speed. |
This paper introduces Scaffold-GS, a novel 3D scene representation method for view-adaptive rendering using anchor points to guide neural 3D Gaussian distribution. |
Existing 3D Gaussian Splatting methods suffer from redundant Gaussians and lack of robustness to view changes. Scaffold-GS improves rendering quality and efficiency by leveraging scene structure and view-dependent neural Gaussians. |
The method initializes anchor points from SfM point clouds and dynamically predicts neural Gaussian attributes from anchor features and viewing information. It refines anchor points via growing and pruning based on neural Gaussian feedback. |
Scaffold-GS achieves comparable or better rendering quality than state-of-the-art methods like 3D-GS.
It requires significantly less storage space while maintaining real-time rendering speed.
The learned anchor features exhibit semantic clustering, indicating potential for scene understanding tasks. |
Performance heavily relies on the quality of initial SfM point clouds.
The current filtering strategy by opacity may mask important neural Gaussians. |
neural rendering, 3d gaussian splatting, view-adaptive rendering, scene representation, anchor points |
2312.00094
Report |
Fast ODE-based Sampling for Diffusion Models in Around 5 Steps |
Zhenyu Zhou, Defang Chen, Can Wang, Chun Chen |
Sampling from diffusion models can be treated as solving the corresponding
ordinary differential equations (ODEs), with the aim of obtaining an accurate
solution with as small a number of function evaluations (NFE) as possible.
Recently, various fast samplers utilizing higher-order ODE solvers have emerged
and achieved better performance than the initial first-order one. However,
these numerical methods inherently result in certain approximation errors,
which significantly degrade sample quality with extremely small NFE (e.g.,
around 5). In contrast, based on the geometric observation that each sampling
trajectory almost lies in a two-dimensional subspace embedded in the ambient
space, we propose Approximate MEan-Direction Solver (AMED-Solver) that
eliminates truncation errors by directly learning the mean direction for fast
diffusion sampling. Besides, our method can be easily used as a plugin to
further improve existing ODE-based samplers. Extensive experiments on image
synthesis with the resolution ranging from 32 to 512 demonstrate the
effectiveness of our method. With only 5 NFE, we achieve 6.61 FID on CIFAR-10,
10.74 FID on ImageNet 64$\times$64, and 13.20 FID on LSUN Bedroom. Our code is
available at https://github.com/zju-pi/diff-sampler. |
This paper introduces AMED-Solver, a novel single-step ODE solver for diffusion models that minimizes discretization errors by predicting mean directions in each sampling step. |
Existing fast diffusion samplers suffer from significant sample quality degradation when using very few function evaluations (NFE), especially single-step solvers. AMED-Solver addresses this limitation, enabling high-quality generation in around 5 NFE. |
The method leverages the observation that sampling trajectories lie approximately in a 2D subspace. It then trains a shallow neural network (AMED predictor) to predict intermediate time steps and scaling factors that minimize the distance between student and teacher sampling trajectories. |
AMED-Solver outperforms other single-step ODE solvers and achieves comparable or superior results to multi-step solvers in many cases.
The AMED-Plugin, a generalization of AMED-Solver, consistently improves the performance of existing fast ODE solvers across various datasets.
The method achieves state-of-the-art results among solver-based methods in around 5 NFE, demonstrating significant FID improvements on CIFAR-10, ImageNet 64x64, and LSUN Bedroom. |
Fast ODE solvers, including AMED, show high sensitivity to time schedules, especially with limited NFE. Future work could explore adaptive time schedules based on sampling trajectory geometry.
The paper primarily focuses on image generation. Exploring AMED's applicability to other diffusion model applications like image editing and restoration could be interesting. |
diffusion models, ode solvers, fast sampling, image generation, knowledge distillation |
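The mean-direction idea in the AMED-Solver entry above can be illustrated on a toy problem. The sketch below uses a scalar ODE in place of the diffusion ODE, an exact solution in place of the teacher trajectory, and a grid search in place of the learned AMED predictor; the update rule (an Euler step to an intermediate time, then a full step along the direction evaluated there) is an assumption based on the entry's description, and every name is hypothetical.

```python
import math


def f(x, t):
    """Toy right-hand side dx/dt = -x, standing in for the diffusion ODE."""
    return -x


def euler_step(x, t0, t1):
    """Plain first-order step from t0 to t1."""
    return x + (t1 - t0) * f(x, t0)


def mean_direction_step(x, t0, t1, r, c=1.0):
    """Step from t0 to t1 along the direction at an intermediate time.

    r in [0, 1] places the intermediate time between t0 and t1;
    c rescales the direction (kept at 1.0 in this sketch).
    """
    s = t0 + r * (t1 - t0)
    x_s = x + (s - t0) * f(x, t0)          # Euler step to the intermediate time
    return x + c * (t1 - t0) * f(x_s, s)   # full step along that direction


def fit_intermediate_time(x, t0, t1, target, grid=101):
    """Grid search for r; a stand-in for training a small predictor network."""
    best_r, best_err = 0.0, float("inf")
    for k in range(grid):
        r = k / (grid - 1)
        err = abs(mean_direction_step(x, t0, t1, r) - target)
        if err < best_err:
            best_r, best_err = r, err
    return best_r, best_err


x0, t_start, t_end = 1.0, 0.0, 1.0
teacher = x0 * math.exp(-(t_end - t_start))  # exact solution plays the teacher
euler_err = abs(euler_step(x0, t_start, t_end) - teacher)
best_r, tuned_err = fit_intermediate_time(x0, t_start, t_end, teacher)
```

With two function evaluations per step, the tuned intermediate time brings the single-step error far below that of the Euler step on this toy problem, which mirrors the motivation for learning the direction rather than fixing it.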
2312.00093
Report |
GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs |
Gege Gao, Weiyang Liu, Anpei Chen, Andreas Geiger, Bernhard Schölkopf |
As pretrained text-to-image diffusion models become increasingly powerful,
recent efforts have been made to distill knowledge from these text-to-image
pretrained models for optimizing a text-guided 3D model. Most of the existing
methods generate a holistic 3D model from a plain text input. This can be
problematic when the text describes a complex scene with multiple objects,
because the vectorized text embeddings are inherently unable to capture a
complex description with multiple entities and relationships. Holistic 3D
modeling of the entire scene further prevents accurate grounding of text
entities and concepts. To address this limitation, we propose GraphDreamer, a
novel framework to generate compositional 3D scenes from scene graphs, where
objects are represented as nodes and their interactions as edges. By exploiting
node and edge information in scene graphs, our method makes better use of the
pretrained text-to-image diffusion model and is able to fully disentangle
different objects without image-level supervision. To facilitate modeling of
object-wise relationships, we use signed distance fields as representation and
impose a constraint to avoid inter-penetration of objects. To avoid manual
scene graph creation, we design a text prompt for ChatGPT to generate scene
graphs based on text inputs. We conduct both qualitative and quantitative
experiments to validate the effectiveness of GraphDreamer in generating
high-fidelity compositional 3D scenes with disentangled object entities. |
GraphDreamer is a novel framework that leverages scene graphs to generate compositional 3D scenes, effectively disentangling objects and their relationships from text descriptions. |
Existing text-to-3D methods struggle with complex scenes involving multiple objects and their interactions, suffering from attribute confusion and guidance collapse. GraphDreamer overcomes these limitations by utilizing the structured representation of scene graphs. |
GraphDreamer decomposes scene graphs into object and relationship descriptions. It employs identity-aware positional encoders to represent individual object fields and a shared SDF network for geometry. By rendering objects and their combinations individually and globally, GraphDreamer utilizes SDS loss for optimization. |
GraphDreamer effectively disentangles objects in 3D scenes, as evidenced by the CLIP score analysis of individual object renderings.
It outperforms state-of-the-art text-to-3D methods like Magic3D and MVDream in generating multi-object scenes, achieving higher CLIP scores and better visual fidelity.
Ablation studies confirm that the use of scene graphs significantly improves performance, highlighting their importance for accurate guidance. |
Individual object generation quality remains constrained by the limitations of SDS optimization.
Object decomposition may fail in cases of significant semantic dominance of one object over another. |
3d scene generation, text-to-3d, scene graphs, score distillation sampling, compositional 3d modeling |
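The decomposition of a scene graph into object and relationship descriptions, as described in the GraphDreamer entry above, can be sketched with a small data structure. The prompt templates, the `Edge` class, and the example scene are illustrative assumptions rather than the authors' format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Directed relationship between two named objects in the scene graph."""
    subject: str
    relation: str
    obj: str


def decompose(nodes, edges):
    """Split a scene graph into per-object and per-relationship prompts.

    nodes: dict mapping an object name to its attribute phrase.
    edges: list of Edge instances connecting object names.
    """
    object_prompts = {
        name: f"{attributes} {name}".strip() for name, attributes in nodes.items()
    }
    relation_prompts = [
        f"{object_prompts[e.subject]} {e.relation} {object_prompts[e.obj]}"
        for e in edges
    ]
    return object_prompts, relation_prompts


nodes = {"wizard": "a wise old", "desk": "a wooden"}
edges = [Edge("wizard", "standing in front of", "desk")]
object_prompts, relation_prompts = decompose(nodes, edges)
```

Each object prompt could then guide the rendering of one object on its own, while each relationship prompt guides a joint rendering of the two objects it connects.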
2312.00085
Report |
X-Dreamer: Creating High-quality 3D Content by Bridging the Domain Gap Between Text-to-2D and Text-to-3D Generation |
Yiwei Ma, Yijun Fan, Jiayi Ji, Haowei Wang, Xiaoshuai Sun, Guannan Jiang, Annan Shu, Rongrong Ji |
In recent times, automatic text-to-3D content creation has made significant
progress, driven by the development of pretrained 2D diffusion models. Existing
text-to-3D methods typically optimize the 3D representation to ensure that the
rendered image aligns well with the given text, as evaluated by the pretrained
2D diffusion model. Nevertheless, a substantial domain gap exists between 2D
images and 3D assets, primarily attributed to variations in camera-related
attributes and the exclusive presence of foreground objects. Consequently,
employing 2D diffusion models directly for optimizing 3D representations may
lead to suboptimal outcomes. To address this issue, we present X-Dreamer, a
novel approach for high-quality text-to-3D content creation that effectively
bridges the gap between text-to-2D and text-to-3D synthesis. The key components
of X-Dreamer are two innovative designs: Camera-Guided Low-Rank Adaptation
(CG-LoRA) and Attention-Mask Alignment (AMA) Loss. CG-LoRA dynamically
incorporates camera information into the pretrained diffusion models by
employing camera-dependent generation for trainable parameters. This
integration enhances the alignment between the generated 3D assets and the
camera's perspective. AMA loss guides the attention map of the pretrained
diffusion model using the binary mask of the 3D object, prioritizing the
creation of the foreground object. This module ensures that the model focuses
on generating accurate and detailed foreground objects. Extensive evaluations
demonstrate the effectiveness of our proposed method compared to existing
text-to-3D approaches. Our project webpage:
https://xmu-xiaoma666.github.io/Projects/X-Dreamer/ . |
This paper introduces X-Dreamer, a novel framework for high-quality text-to-3D content creation that bridges the gap between text-to-2D and text-to-3D generation by incorporating camera information and prioritizing foreground object generation. |
Existing text-to-3D methods face challenges due to the domain gap between 2D images and 3D assets, especially in handling camera parameters and focusing on foreground objects. |
X-Dreamer utilizes two innovative designs: 1) CG-LoRA dynamically integrates camera information into pretrained diffusion models for better alignment. 2) AMA loss guides the attention map to prioritize foreground object generation by aligning it with the rendered 3D object mask. |
X-Dreamer generates high-quality, photorealistic 3D assets from text prompts, starting from either an ellipsoid or a coarse-grained mesh.
X-Dreamer outperforms SOTA methods like DreamFusion, Magic3D, and Fantasia3D in realism and achieves comparable results to ProlificDreamer with significantly less optimization time.
Ablation studies demonstrate the significant contributions of CG-LoRA and AMA loss in enhancing geometry, appearance, and overall quality of generated 3D objects. |
X-Dreamer currently cannot generate multiple separate objects from a single text prompt, sometimes merging their properties.
Future work could explore multi-object generation and address other limitations. |
text-to-3d synthesis, diffusion models, camera-aware generation, foreground object prioritization, 3d content creation |
2312.00081
Report |
Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding |
Wujian Peng, Sicheng Xie, Zuyao You, Shiyi Lan, Zuxuan Wu |
Vision language models (VLMs) have demonstrated remarkable performance across
various downstream tasks. However, understanding fine-grained visual-linguistic
concepts, such as attributes and inter-object relationships, remains a
significant challenge. While several benchmarks aim to evaluate VLMs in finer
granularity, their primary focus remains on the linguistic aspect, neglecting
the visual dimension. Here, we highlight the importance of evaluating VLMs from
both a textual and visual perspective. We introduce a progressive pipeline to
synthesize images that vary in a specific attribute while ensuring consistency
in all other aspects. Utilizing this data engine, we carefully design a
benchmark, SPEC, to diagnose the comprehension of object size, position,
existence, and count. Subsequently, we conduct a thorough evaluation of four
leading VLMs on SPEC. Surprisingly, their performance is close to random guess,
revealing significant limitations. With this in mind, we propose a simple yet
effective approach to optimize VLMs in fine-grained understanding, achieving
significant improvements on SPEC without compromising the zero-shot
performance. Results on two additional fine-grained benchmarks also show
consistent improvements, further validating the transferability of our
approach. Code and data are available at https://github.com/wjpoom/SPEC. |
This paper introduces SPEC, a new benchmark for evaluating the fine-grained visual-linguistic comprehension of Vision Language Models (VLMs) concerning object size, position, existence, and count, and proposes a simple yet effective training method to enhance VLMs' understanding in these aspects. |
Existing VLMs show limitations in understanding fine-grained visual-linguistic concepts, highlighting a need for benchmarks like SPEC that go beyond evaluating object recognition and focus on compositional reasoning. |
The authors develop a progressive pipeline to synthesize images with controlled variations in specific attributes while maintaining consistency in other aspects. They use this pipeline to construct SPEC and evaluate four leading VLMs, revealing their shortcomings. A novel training method incorporating hard negative examples is then proposed and applied to CLIP to boost its fine-grained understanding. |
Even state-of-the-art VLMs perform close to random chance on SPEC, indicating significant limitations in fine-grained comprehension.
The proposed training method significantly improves CLIP's performance on SPEC, boosting both image-to-text and text-to-image matching accuracy.
The improvements obtained through the proposed method generalize to other fine-grained benchmarks such as ARO and EqBen, showcasing its ability to enhance transferable fine-grained understanding. |
The current study evaluates only four specific attributes; future work could explore more diverse visual-linguistic concepts.
The benchmark is constructed using synthetic images, which may not fully encompass the complexity of real-world images. Future work should consider incorporating real-world images for evaluation. |
vision language models, fine-grained understanding, benchmarking, image synthesis, compositional reasoning |
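The SPEC report above says that training with hard negative examples improves CLIP's fine-grained matching, but it does not spell out the loss. The following is a minimal sketch, assuming a standard softmax cross-entropy over caption candidates in which a hard negative is simply an extra candidate. The function name `image_to_text_loss`, the temperature value, and the toy embeddings are illustrative assumptions, not the authors' implementation.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def image_to_text_loss(image_emb, caption_embs, positive_index, temperature=0.07):
    """Softmax cross-entropy of one image against candidate captions.

    Hard negatives are simply extra rows in `caption_embs`.
    """
    img = _normalize(image_emb)
    logits = [_dot(img, _normalize(c)) / temperature for c in caption_embs]
    top = max(logits)  # subtract the max for numerical stability
    log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum_exp - logits[positive_index]


# Toy embeddings, made up for illustration only.
image = [1.0, 0.2, 0.0]
positive = [0.9, 0.3, 0.0]        # e.g. "a dog to the left of a cat"
easy_negative = [0.0, 0.0, 1.0]   # an unrelated caption
hard_negative = [0.9, 0.1, 0.1]   # e.g. "a dog to the right of a cat"

loss_without = image_to_text_loss(image, [positive, easy_negative], 0)
loss_with = image_to_text_loss(image, [positive, easy_negative, hard_negative], 0)
```

In this toy setup the easy negative contributes almost nothing to the loss, while the hard negative, whose embedding lies close to the positive caption, dominates it. That is the intuition behind mining captions that differ from the positive in a single attribute.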
2312.00079
Report |
HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models |
Zhonghao Wang, Wei Wei, Yang Zhao, Zhisheng Xiao, Mark Hasegawa-Johnson, Humphrey Shi, Tingbo Hou |
This paper explores advancements in high-fidelity personalized image
generation through the utilization of pre-trained text-to-image diffusion
models. While previous approaches have made significant strides in generating
versatile scenes based on text descriptions and a few input images, challenges
persist in maintaining the subject fidelity within the generated images. In
this work, we introduce an innovative algorithm named HiFi Tuner to enhance the
appearance preservation of objects during personalized image generation. Our
proposed method employs a parameter-efficient fine-tuning framework, comprising
a denoising process and a pivotal inversion process. Key enhancements include
the utilization of mask guidance, a novel parameter regularization technique,
and the incorporation of step-wise subject representations to elevate the
sample fidelity. Additionally, we propose a reference-guided generation
approach that leverages the pivotal inversion of a reference image to mitigate
unwanted subject variations and artifacts. We further extend our method to a
novel image editing task: substituting the subject in an image through textual
manipulations. Experimental evaluations conducted on the DreamBooth dataset
using the Stable Diffusion model showcase promising results. Fine-tuning solely
on textual embeddings improves CLIP-T score by 3.6 points and improves DINO
score by 9.6 points over Textual Inversion. When fine-tuning all parameters,
HiFi Tuner improves CLIP-T score by 1.2 points and improves DINO score by 1.2
points over DreamBooth, establishing a new state of the art. |
This paper introduces HiFi Tuner, a novel parameter-efficient fine-tuning framework for personalized image generation using pre-trained text-to-image diffusion models, enhancing subject fidelity while preserving scene coverage. |
Existing methods struggle to balance sample quality with parameter efficiency, scene flexibility, and accurate preservation of subject appearance in personalized image generation. This work addresses these limitations to enhance the fidelity of personalized images. |
The proposed HiFi Tuner utilizes a denoising process with mask guidance, parameter regularization, and step-wise subject representations. It also employs a reference-guided generation approach leveraging pivotal inversion of a reference image to maintain subject details. |
Fine-tuning solely textual embeddings with HiFi Tuner improves CLIP-T score by 3.6 points and DINO score by 9.6 points compared to Textual Inversion.
Fine-tuning all parameters with HiFi Tuner surpasses DreamBooth by 1.2 points in both CLIP-T and DINO scores.
The method is extended to a novel image editing task, successfully substituting subjects in images through textual manipulations. |
The reference-guided generation is only applied to rigid objects due to the limited appearance variations in the dataset.
Future work could explore applying HiFi Tuner to more complex scenes with multiple interacting objects. |
image generation, diffusion models, personalized image synthesis, text-to-image generation, fine-tuning |
2312.00065
Report |
Unsupervised Keypoints from Pretrained Diffusion Models |
Eric Hedlin, Gopal Sharma, Shweta Mahajan, Xingzhe He, Hossam Isack, Abhishek Kar, Helge Rhodin, Andrea Tagliasacchi, Kwang Moo Yi |
Unsupervised learning of keypoints and landmarks has seen significant
progress with the help of modern neural network architectures, but performance
is yet to match the supervised counterpart, making their practicability
questionable. We leverage the emergent knowledge within text-to-image diffusion
models, towards more robust unsupervised keypoints. Our core idea is to find
text embeddings that would cause the generative model to consistently attend to
compact regions in images (i.e. keypoints). To do so, we simply optimize the
text embedding such that the cross-attention maps within the denoising network
are localized as Gaussians with small standard deviations. We validate our
performance on multiple datasets: the CelebA, CUB-200-2011, Tai-Chi-HD,
DeepFashion, and Human3.6m datasets. We achieve significantly improved
accuracy, sometimes even outperforming supervised ones, particularly for data
that is non-aligned and less curated. Our code is publicly available and can be
found through our project page: https://ubc-vision.github.io/StableKeypoints/ |
This paper introduces a novel unsupervised keypoint learning method that leverages the knowledge embedded within pre-trained text-to-image diffusion models, specifically targeting cross-attention maps to identify semantically meaningful locations in images. |
Unsupervised keypoint detection methods currently lag behind their supervised counterparts in performance, especially on non-aligned, in-the-wild datasets. This work aims to bridge this gap by utilizing the power of large pre-trained generative models. |
The proposed method optimizes text embeddings (tokens) to enforce localized responses in the cross-attention maps of a diffusion model. This is achieved by minimizing the difference between attention maps and Gaussian distributions centered at their maxima, while also enforcing equivariance to geometric transformations. The final keypoints are then extracted as the maxima of these localized attention maps. |
The method achieves state-of-the-art results on several benchmark datasets, including CelebA, CUB-200-2011, Tai-Chi-HD, DeepFashion, and Human3.6m, particularly excelling in unaligned and less curated settings.
The approach demonstrates strong generalization capability, effectively transferring learned keypoints to unseen datasets and even across different object categories.
The method highlights the potential of leveraging large pre-trained diffusion models for downstream vision tasks without requiring fine-tuning. |
While demonstrating strong performance in unaligned cases, the method's performance in heavily pre-processed and aligned settings could be further investigated and potentially improved.
Future work could explore the impact of different diffusion models and architectures on the quality of learned keypoints. |
unsupervised learning, keypoint detection, diffusion models, cross-attention, generalization |
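The localization objective described in the report above (attention maps pushed toward Gaussians centered at their maxima, with keypoints read off as the maxima) can be sketched as follows. This is a simplified illustration: the mean-squared-error form, the peak normalization, and the default `sigma` are assumptions, and the equivariance term mentioned in the report is omitted.

```python
import math


def argmax_2d(attn):
    """Return (row, col) of the largest entry; this is the extracted keypoint."""
    best_y, best_x = 0, 0
    for y, row in enumerate(attn):
        for x, value in enumerate(row):
            if value > attn[best_y][best_x]:
                best_y, best_x = y, x
    return best_y, best_x


def gaussian_target(height, width, center, sigma):
    """Unnormalized 2D Gaussian with peak value 1 at `center`."""
    cy, cx = center
    return [
        [math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def localization_loss(attn, sigma=1.0):
    """Mean squared difference between a peak-normalized attention map and a
    Gaussian centered at the map's own maximum (assumes a positive peak)."""
    height, width = len(attn), len(attn[0])
    cy, cx = argmax_2d(attn)
    peak = attn[cy][cx]
    target = gaussian_target(height, width, (cy, cx), sigma)
    total = sum(
        (attn[y][x] / peak - target[y][x]) ** 2
        for y in range(height)
        for x in range(width)
    )
    return total / (height * width)
```

A map that is already a tight Gaussian gives a loss of zero, while a flat map is penalized heavily. In the actual method the quantity being optimized is the text embedding that produces the map, which this sketch does not model.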
2312.00063
Report |
MoMask: Generative Masked Modeling of 3D Human Motions |
Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, Li Cheng |
We introduce MoMask, a novel masked modeling framework for text-driven 3D
human motion generation. In MoMask, a hierarchical quantization scheme is
employed to represent human motion as multi-layer discrete motion tokens with
high-fidelity details. Starting at the base layer, with a sequence of motion
tokens obtained by vector quantization, the residual tokens of increasing
orders are derived and stored at the subsequent layers of the hierarchy. This
is consequently followed by two distinct bidirectional transformers. For the
base-layer motion tokens, a Masked Transformer is designated to predict
randomly masked motion tokens conditioned on text input at training stage.
During generation (i.e. inference) stage, starting from an empty sequence, our
Masked Transformer iteratively fills up the missing tokens. Subsequently, a
Residual Transformer learns to progressively predict the next-layer tokens
based on the results from current layer. Extensive experiments demonstrate that
MoMask outperforms the state-of-the-art methods on the text-to-motion generation
task, with an FID of 0.045 (vs e.g. 0.141 of T2M-GPT) on the HumanML3D dataset,
and 0.228 (vs 0.514) on KIT-ML, respectively. MoMask can also be seamlessly
applied in related tasks without further model fine-tuning, such as text-guided
temporal inpainting. |
Introduces MoMask, a generative masked modeling framework for text-driven 3D human motion generation using a hierarchical quantization scheme and bidirectional transformers. |
Addresses limitations of existing text-to-motion methods by improving motion quality, capturing subtle language nuances, and enabling efficient bidirectional decoding. |
Employs residual vector quantization (RVQ) to represent motion as multi-layer discrete tokens. Utilizes a Masked Transformer to predict base-layer tokens conditioned on text, and a Residual Transformer to progressively predict residual tokens. |
Achieves state-of-the-art performance on HumanML3D and KIT-ML datasets with FID scores of 0.045 and 0.228, respectively.
Generates motions with higher quality and better understanding of subtle language concepts compared to baselines.
Demonstrates effectiveness in text-guided temporal inpainting. |
Motion diversity is limited relative to fidelity and faithfulness to the text.
Requires target motion length as input. |
text-to-motion generation, generative masked modeling, residual vector quantization, 3d human motion synthesis, motion inpainting |
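The residual vector quantization (RVQ) scheme mentioned in the MoMask report can be illustrated with a small sketch: the base layer quantizes the input vector, and each later layer quantizes whatever residual the previous layers left behind. The codebooks below are hand-written toy values; in MoMask they are learned and applied to motion features, which this example does not attempt to reproduce.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec layer by layer; return one token per layer and the final residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # The next layer only sees what this layer failed to explain.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens, residual


def rvq_decode(tokens, codebooks):
    """Sum the selected entries of the first len(tokens) layers."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy codebooks: a coarse base layer followed by two finer residual layers.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    [[0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]],
    [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]],
]
tokens, residual = rvq_encode([1.6, 0.5], codebooks)
base_only = rvq_decode(tokens[:1], codebooks)  # reconstruction from the base layer
full = rvq_decode(tokens, codebooks)           # reconstruction from all layers
```

Decoding with only the base-layer token gives a coarse reconstruction, and each additional residual layer refines it. This mirrors the split described in the report, where one transformer predicts base-layer tokens and a second predicts the residual layers.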
2311.18837
Report |
VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models |
Zhen Xing, Qi Dai, Zihao Zhang, Hui Zhang, Han Hu, Zuxuan Wu, Yu-Gang Jiang |
Diffusion models have achieved significant success in image and video
generation. This motivates a growing interest in video editing tasks, where
videos are edited according to provided text descriptions. However, most
existing approaches only focus on video editing for short clips and rely on
time-consuming tuning or inference. We are the first to propose Video
Instruction Diffusion (VIDiff), a unified foundation model designed for a wide
range of video tasks. These tasks encompass both understanding tasks (such as
language-guided video object segmentation) and generative tasks (video editing
and enhancement). Our model can edit and translate the desired results within
seconds based on user instructions. Moreover, we design an iterative
auto-regressive method to ensure consistency in editing and enhancing long
videos. We provide convincing generative results for diverse input videos and
written instructions, both qualitatively and quantitatively. More examples can
be found at our website https://ChenHsing.github.io/VIDiff. |
Introduces VIDiff, a unified diffusion-based framework for various video translation tasks guided by multimodal instructions. |
Existing video editing and understanding models often lack a unified approach, require time-consuming tuning, and struggle with long video consistency. |
A pre-trained T2I diffusion model is adapted using a multi-stage training process, incorporating temporal attention and a multimodal condition injection mechanism for image and text instructions. An iterative auto-regressive method ensures long video consistency. |
VIDiff achieves state-of-the-art performance on video editing benchmarks, outperforming methods requiring per-video tuning.
The model excels in video enhancement tasks like deblurring, dehazing, and in-painting, surpassing existing instruction-guided techniques.
The iterative generation method effectively maintains temporal consistency in long video translations. |
The performance on certain tasks is limited by the VAE encoder used in the LDM.
Future work includes exploring integration with large language models for more complex video understanding tasks. |
video editing, video enhancement, diffusion models, multimodal learning, instruction-guided video translation |
2311.18836
Report |
ChatPose: Chatting about 3D Human Pose |
Yao Feng, Jing Lin, Sai Kumar Dwivedi, Yu Sun, Priyanka Patel, Michael J. Black |
We introduce ChatPose, a framework employing Large Language Models (LLMs) to
understand and reason about 3D human poses from images or textual descriptions.
Our work is motivated by the human ability to intuitively understand postures
from a single image or a brief description, a process that intertwines image
interpretation, world knowledge, and an understanding of body language.
Traditional human pose estimation and generation methods often operate in
isolation, lacking semantic understanding and reasoning abilities. ChatPose
addresses these limitations by embedding SMPL poses as distinct signal tokens
within a multimodal LLM, enabling the direct generation of 3D body poses from
both textual and visual inputs. Leveraging the powerful capabilities of
multimodal LLMs, ChatPose unifies classical 3D human pose and generation tasks
while offering user interactions. Additionally, ChatPose empowers LLMs to apply
their extensive world knowledge in reasoning about human poses, leading to two
advanced tasks: speculative pose generation and reasoning about pose
estimation. These tasks involve reasoning about humans to generate 3D poses
from subtle text queries, possibly accompanied by images. We establish
benchmarks for these tasks, moving beyond traditional 3D pose generation and
estimation methods. Our results show that ChatPose outperforms existing
multimodal LLMs and task-specific methods on these newly proposed tasks.
Furthermore, ChatPose's ability to understand and generate 3D human poses based
on complex reasoning opens new directions in human pose analysis. |
ChatPose is a framework that enables Large Language Models (LLMs) to understand and reason about 3D human poses from images or textual descriptions, bridging the gap between traditional pose estimation/generation methods and LLMs' general reasoning abilities. |
Existing human pose estimation and generation methods lack semantic understanding and reasoning, operating in isolation. ChatPose leverages LLMs' world knowledge and reasoning capabilities to overcome these limitations, unifying pose analysis tasks and enabling novel applications. |
ChatPose embeds SMPL poses as tokens within a multimodal LLM. It's trained on image-to-SMPL and text-to-SMPL data, allowing it to generate 3D poses from textual and visual inputs. The LLM's reasoning abilities are further utilized for two new tasks: Speculative Pose Generation (SPG) and Reasoning-based Pose Estimation (RPE). |
ChatPose outperforms other multimodal LLMs on pose generation and estimation tasks.
It demonstrates zero-shot capability in reasoning about human poses within multi-turn dialogues.
The framework excels in handling complex scenarios requiring reasoning, such as SPG and RPE, surpassing traditional methods. |
The accuracy of 3D pose estimation from images is not yet on par with specialized regressors, highlighting the need for larger, higher quality datasets relating language to pose.
Freezing the vision encoder during training poses a limitation, potentially addressed by more powerful backbones or whole-model fine-tuning. |
human pose estimation, pose generation, large language models, multimodal learning, reasoning |
2311.18835
Report |
InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation |
Rongyao Fang, Shilin Yan, Zhaoyang Huang, Jingqiu Zhou, Hao Tian, Jifeng Dai, Hongsheng Li |
Empowering models to dynamically accomplish tasks specified through natural
language instructions represents a promising path toward more capable and
general artificial intelligence. In this work, we introduce InstructSeq, an
instruction-conditioned multi-modal modeling framework that unifies diverse
vision tasks through flexible natural language control and handling of both
visual and textual data. InstructSeq employs a multimodal transformer
architecture encompassing visual, language, and sequential modeling. We utilize
a visual encoder to extract image features and a text encoder to encode
instructions. An autoregressive transformer fuses the representations and
generates sequential task outputs. By training with LLM-generated natural
language instructions, InstructSeq acquires a strong comprehension of free-form
instructions for specifying visual tasks. This provides an intuitive interface
for directing capabilities using flexible natural instructions. Without any
task-specific tuning, InstructSeq achieves compelling performance on semantic
segmentation, referring expression segmentation/comprehension, and image
captioning. The flexible control and multi-task unification empower the model
with more human-like versatility and generalizability for computer vision. The
code will be released soon at https://github.com/rongyaofang/InstructSeq. |
Introduces InstructSeq, an instruction-conditioned multi-modal model that unifies diverse vision tasks through flexible natural language instructions, handling both visual and textual data. |
Addresses limitations of existing multi-modal models that rely on fixed instruction templates and lack flexibility in handling various vision tasks requiring different output types. |
Employs a multi-modal transformer architecture with a visual encoder, a frozen instruction encoder, and an autoregressive transformer to generate visual or textual outputs based on the input instruction. Utilizes an LLM to generate natural language instructions for training, enabling the model to comprehend and respond to diverse phrasings. |
Achieves competitive performance on semantic segmentation, referring expression segmentation/comprehension, and image captioning without task-specific tuning.
Demonstrates superior generalization ability to novel instructions compared to models trained on fixed templates.
Provides confidence estimates for predictions through sampling-based token generation, enabling the identification of uncertain areas in outputs. |
Mixing textual and dense visual outputs during training might slightly degrade performance on specific tasks like referring segmentation.
Computational constraints limit exploring larger model sizes and more diverse datasets. |
multi-modal learning, natural language instructions, vision generalist model, sequence generation, computer vision |
2311.18834
Report |
ART⋅V: Auto-Regressive Text-to-Video Generation with Diffusion Models |
Wenming Weng, Ruoyu Feng, Yanhui Wang, Qi Dai, Chunyu Wang, Dacheng Yin, Zhiyuan Zhao, Kai Qiu, Jianmin Bao, Yuhui Yuan, Chong Luo, Yueyi Zhang, Zhiwei Xiong |
We present ART⋅V, an efficient framework for
auto-regressive video generation with diffusion models. Unlike existing methods
that generate entire videos in one shot, ART⋅V generates a
single frame at a time, conditioned on the previous ones. The framework offers
three distinct advantages. First, it only learns simple continual motions
between adjacent frames, therefore avoiding modeling complex long-range motions
that require huge training data. Second, it preserves the high-fidelity
generation ability of the pre-trained image diffusion models by making only
minimal network modifications. Third, it can generate arbitrarily long videos
conditioned on a variety of prompts such as text, image or their combinations,
making it highly versatile and flexible. To combat the common drifting issue in
AR models, we propose a masked diffusion model which implicitly learns which
information can be drawn from reference images rather than network predictions,
in order to reduce the risk of generating inconsistent appearances that cause
drifting. Moreover, we further enhance generation coherence by conditioning it
on the initial frame, which typically contains minimal noise. This is
particularly useful for long video generation. When trained for only two weeks
on four GPUs, ART⋅V can already generate videos with natural
motions, rich details and a high level of aesthetic quality. Besides, it
enables various appealing applications, e.g., composing a long video from
multiple text prompts. |
This paper introduces ART⋅V, a novel auto-regressive framework using diffusion models for generating videos from text and/or image prompts. |
Existing text-to-video generation methods struggle to create realistic, long-range motions due to the limitations of one-shot generation and training data size. ART⋅V addresses these challenges by generating frames sequentially and focusing on short, continuous motions. |
ART⋅V employs a pre-trained image diffusion model with minimal modifications. Key techniques include: 1) Masked Diffusion Model (MDM) to mitigate drifting by leveraging information from reference frames. 2) Noise Augmentation to bridge the gap between training and testing. 3) Anchored Conditioning on the initial frame to enhance long-term coherence. |
ART⋅V generates videos with natural motion, rich detail, and high aesthetic quality despite limited training resources.
It outperforms existing methods in zero-shot video generation benchmarks (UCF-101, MSR-VTT), achieving state-of-the-art results when conditioned on ground truth images.
The auto-regressive approach allows for generating arbitrarily long videos from multiple text prompts with seamless transitions. |
Training on higher resolution and quality datasets is expected to further improve visual fidelity.
Exploring advanced temporal modeling techniques within the auto-regressive framework could enhance long-range motion quality. |
text-to-video generation, diffusion models, auto-regressive models, motion generation, video synthesis |
2311.18832
Report |
Exploiting Diffusion Prior for Generalizable Dense Prediction |
Hsin-Ying Lee, Hung-Yu Tseng, Hsin-Ying Lee, Ming-Hsuan Yang |
Contents generated by recent advanced Text-to-Image (T2I) diffusion models
are sometimes too imaginative for existing off-the-shelf dense predictors to
estimate due to the immitigable domain gap. We introduce DMP, a pipeline
utilizing pre-trained T2I models as a prior for dense prediction tasks. To
address the misalignment between deterministic prediction tasks and stochastic
T2I models, we reformulate the diffusion process through a sequence of
interpolations, establishing a deterministic mapping between input RGB images
and output prediction distributions. To preserve generalizability, we use
low-rank adaptation to fine-tune pre-trained models. Extensive experiments
across five tasks, including 3D property estimation, semantic segmentation, and
intrinsic image decomposition, showcase the efficacy of the proposed method.
Despite limited-domain training data, the approach yields faithful estimations
for arbitrary images, surpassing existing state-of-the-art algorithms. |
The paper proposes DMP, a novel approach leveraging pre-trained text-to-image (T2I) diffusion models as priors for generalizable dense prediction tasks. |
Existing dense prediction models struggle with the domain gap between real-world and T2I-generated images, limiting their application on imaginative content. This work aims to bridge this gap and enable faithful estimations on arbitrary images. |
DMP introduces a deterministic image-to-prediction diffusion process, reformulating the stochastic T2I generation into a series of interpolations. This ensures deterministic mapping between input RGB and output predictions. Additionally, it employs low-rank adaptation to fine-tune pre-trained models on limited-domain data while preserving generalizability. |
DMP achieves superior accuracy compared to previous image-to-image translation and diffusion-based methods on tasks like depth, normal, and segmentation estimation.
Despite training on a small dataset of labeled bedroom images, DMP exhibits remarkable generalization, providing plausible predictions even on out-of-domain and arbitrary images.
The proposed deterministic diffusion process is shown to be crucial for achieving accurate and faithful estimations, outperforming alternative parameterizations and single-step prediction approaches. |
The performance on real-world multi-class semantic segmentation tasks remains limited due to challenges in encoding many classes in the image space.
Future work includes exploring the application of real-world datasets with text descriptions generated by image captioning models for potentially better performance. |
dense prediction, diffusion models, text-to-image generation, generalizability, low-rank adaptation |
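The DMP report above relies on low-rank adaptation to fine-tune a pre-trained model while preserving generalizability. As a minimal sketch of that idea, assuming the common formulation in which a frozen weight W is augmented by a trainable low-rank product scaled by alpha / rank, a single linear layer can be written as below. The class name `LoRALinear`, the initial values of `A`, and the use of one layer in isolation are illustrative assumptions; the report does not say which layers of the diffusion model are adapted.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A."""

    def __init__(self, weight, rank, alpha=1.0):
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen pre-trained weight, shape (d_out, d_in)
        # Small deterministic values stand in for the usual random init of A.
        self.A = [[0.01 * ((i + j) % 3 - 1) for j in range(d_in)] for i in range(rank)]
        # B starts at zero, so the adapted layer initially equals the frozen one.
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def num_trainable(self):
        """Only A and B would receive gradients; W stays fixed."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])
```

With rank 1 on a 4x4 layer, the adapter holds 8 trainable values against 16 frozen ones. The gap widens quickly with layer size, which is why this kind of adaptation is a natural fit for training on limited-domain data.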
2311.18830
Report |
MotionEditor: Editing Video Motion via Content-Aware Diffusion |
Shuyuan Tu, Qi Dai, Zhi-Qi Cheng, Han Hu, Xintong Han, Zuxuan Wu, Yu-Gang Jiang |
Existing diffusion-based video editing models have made remarkable advances in
editing attributes of a source video over time but struggle to manipulate the
motion information while preserving the original protagonist's appearance and
background. To address this, we propose MotionEditor, a diffusion model for
video motion editing. MotionEditor incorporates a novel content-aware motion
adapter into ControlNet to capture temporal motion correspondence. While
ControlNet enables direct generation based on skeleton poses, it encounters
challenges when modifying the source motion in the inverted noise due to
contradictory signals between the noise (source) and the condition (reference).
Our adapter complements ControlNet by involving source content to transfer
adapted control signals seamlessly. Further, we build up a two-branch
architecture (a reconstruction branch and an editing branch) with a
high-fidelity attention injection mechanism facilitating branch interaction.
This mechanism enables the editing branch to query the key and value from the
reconstruction branch in a decoupled manner, making the editing branch retain
the original background and protagonist appearance. We also propose a skeleton
alignment algorithm to address the discrepancies in pose size and position.
Experiments demonstrate the promising motion editing ability of MotionEditor,
both qualitatively and quantitatively. |
The paper proposes MotionEditor, a novel diffusion model designed for video motion editing, which transfers motion from a reference video to a source video while preserving the original protagonist's appearance and background. |
Existing diffusion-based video editing models primarily focus on texture editing and struggle to manipulate motion information effectively while preserving the original protagonist and background. |
MotionEditor incorporates a content-aware motion adapter into ControlNet for temporal motion correspondence and a two-branch architecture (reconstruction and editing branches) with a high-fidelity attention injection mechanism to facilitate branch interaction and preserve source appearance. It also employs a skeleton alignment algorithm to address pose discrepancies between source and reference. |
MotionEditor demonstrates superior performance in motion editing compared to previous video editing and human motion transfer methods, both qualitatively and quantitatively.
The proposed content-aware motion adapter enhances motion control and temporal consistency, while the high-fidelity attention injection mechanism preserves background details and protagonist appearance.
Ablation studies confirm the importance of core components, such as the motion adapter, attention injection, and skeleton alignment, for achieving high-quality motion editing. |
MotionEditor might fail in cases where foreground and background latents are confused, resulting in artifacts.
Future work can explore explicit decoupling of foreground and background before denoising and develop a learnable mixture adapter for more natural blending. |
video motion editing, diffusion models, content-aware motion adapter, high-fidelity attention injection, skeleton alignment |
2311.18829
Report |
MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation |
Yanhui Wang, Jianmin Bao, Wenming Weng, Ruoyu Feng, Dacheng Yin, Tao Yang, Jingxu Zhang, Qi Dai, Zhiyuan Zhao, Chunyu Wang, Kai Qiu, Yuhui Yuan, Chuanxin Tang, Xiaoyan Sun, Chong Luo, Baining Guo |
We present MicroCinema, a straightforward yet effective framework for
high-quality and coherent text-to-video generation. Unlike existing approaches
that align text prompts with video directly, MicroCinema introduces a
Divide-and-Conquer strategy which divides the text-to-video into a two-stage
process: text-to-image generation and image&text-to-video generation. This
strategy offers two significant advantages. a) It allows us to take full
advantage of the recent advances in text-to-image models, such as Stable
Diffusion, Midjourney, and DALLE, to generate photorealistic and highly
detailed images. b) Leveraging the generated image, the model can allocate less
focus to fine-grained appearance details, prioritizing the efficient learning
of motion dynamics. To implement this strategy effectively, we introduce two
core designs. First, we propose the Appearance Injection Network, enhancing the
preservation of the appearance of the given image. Second, we introduce the
Appearance Noise Prior, a novel mechanism aimed at maintaining the capabilities
of pre-trained 2D diffusion models. These design elements empower MicroCinema
to generate high-quality videos with precise motion, guided by the provided
text prompts. Extensive experiments demonstrate the superiority of the proposed
framework. Concretely, MicroCinema achieves SOTA zero-shot FVD of 342.86 on
UCF-101 and 377.40 on MSR-VTT. See
https://wangyanhui666.github.io/MicroCinema.github.io/ for video samples. |
This paper proposes MicroCinema, a two-stage text-to-video generation framework that leverages the strengths of existing text-to-image models for enhanced quality and coherence. |
Current text-to-video generation models struggle with appearance and temporal coherence, especially when trained directly from text-video pairs. This framework addresses these limitations by separating appearance and motion modeling. |
MicroCinema generates a key frame from text using an off-the-shelf text-to-image model. This key frame, along with the text prompt, guides a novel image&text-to-video model, featuring an Appearance Injection Network and an Appearance Noise Prior, to generate coherent videos. |
MicroCinema achieves state-of-the-art zero-shot FVD of 342.86 on UCF-101 and 377.40 on MSR-VTT using only the WebVid-10M dataset for training.
The proposed Appearance Injection Network and Appearance Noise Prior significantly improve appearance preservation and motion modeling.
The framework allows flexible integration of different text-to-image models and exhibits strong controllability through text prompts. |
The model's performance on small objects, particularly faces, is limited by the reconstruction capabilities of the VAE used.
Future work includes exploring joint spatial-temporal super-resolution for further quality enhancements. |
text-to-video generation, diffusion models, appearance modeling, motion modeling, divide-and-conquer |
2311.18827
Report |
Motion-Conditioned Image Animation for Video Editing |
Wilson Yan, Andrew Brown, Pieter Abbeel, Rohit Girdhar, Samaneh Azadi |
We introduce MoCA, a Motion-Conditioned Image Animation approach for video
editing. It leverages a simple decomposition of the video editing problem into
image editing followed by motion-conditioned image animation. Furthermore,
given the lack of robust evaluation datasets for video editing, we introduce a
new benchmark that measures edit capability across a wide variety of tasks,
such as object replacement, background changes, style changes, and motion
edits. We present a comprehensive human evaluation of the latest video editing
methods along with MoCA, on our proposed benchmark. MoCA establishes a new
state-of-the-art, demonstrating greater human preference win-rate, and
outperforming notable recent approaches including Dreamix (63%), MasaCtrl
(75%), and Tune-A-Video (72%), with especially significant improvements for
motion edits. |
Introduces MoCA, a motion-conditioned image animation approach for text-driven video editing that outperforms existing methods, and a new dataset focused on motion editing for comprehensive benchmarking. |
Addresses the limitations of current video editing methods that struggle with motion editing or specialize in a narrow range of edits. |
Decomposes video editing into image editing and motion-conditioned image animation. Leverages pre-trained image editing models and a video generation model trained with text, first frame, and optical flow conditioning. |
MoCA outperforms state-of-the-art video editing models across various edit types based on human evaluation.
Motion conditioning is crucial for preserving original video motion during spatial edits.
Existing automatic metrics for video editing show low correlation with human judgment, especially for motion-based edits. |
Reliance on video extrapolation limits fidelity in preserving aspects introduced after the first frame.
Need for better automatic evaluation metrics aligned with human perception for video editing. |
video editing, video generation, motion conditioning, text-driven editing, diffusion models |
2311.18823
Report |
Initializing Models with Larger Ones |
Zhiqiu Xu, Yanjie Chen, Kirill Vishniakov, Yida Yin, Zhiqiang Shen, Trevor Darrell, Lingjie Liu, Zhuang Liu |
Weight initialization plays an important role in neural network training.
Widely used initialization methods are proposed and evaluated for networks that
are trained from scratch. However, the growing number of pretrained models now
offers new opportunities for tackling this classical problem of weight
initialization. In this work, we introduce weight selection, a method for
initializing smaller models by selecting a subset of weights from a pretrained
larger model. This enables the transfer of knowledge from pretrained weights to
smaller models. Our experiments demonstrate that weight selection can
significantly enhance the performance of small models and reduce their training
time. Notably, it can also be used together with knowledge distillation. Weight
selection offers a new approach to leverage the power of pretrained models in
resource-constrained settings, and we hope it can be a useful tool for training
small models in the large-model era. Code is available at
https://github.com/OscarXZQ/weight-selection. |
This paper introduces "weight selection," a method for initializing smaller neural networks by selecting a subset of weights from pretrained larger models within the same family. |
This approach enables the transfer of knowledge from pretrained models to smaller ones, which is particularly beneficial in resource-constrained settings where large models are impractical. |
Weight selection involves three steps: 1) selecting corresponding layers from the teacher model, 2) mapping components between student and teacher layers, and 3) selecting elements from the teacher's weight tensors to initialize the student model. |
Weight selection significantly improves test accuracy across various image classification datasets, especially for smaller datasets.
Weight selection substantially reduces training time compared to random initialization, achieving the same performance with fewer epochs.
Weight selection is compatible with knowledge distillation and can be combined to further enhance performance. |
The effectiveness of weight selection may be limited by the availability of pretrained models within the same family and of a suitable size.
Future work can explore different strategies for selecting weights, potentially incorporating importance or relevance metrics. |
weight initialization, transfer learning, knowledge distillation, model compression, pretrained models |
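The element-selection step summarized above can be illustrated with a minimal sketch. It covers only the last of the three steps, for a single 2-D weight matrix stored as nested lists, and it assumes evenly spaced indices along each dimension; the summary does not say which elements are chosen, so `uniform_indices` and `select_weights` are hypothetical helpers, not the authors' code.

```python
def uniform_indices(teacher_dim, student_dim):
    """Return student_dim evenly spaced indices into range(teacher_dim).

    Assumption: even spacing is one plausible selection rule; the summary
    above does not specify which elements are picked.
    """
    if student_dim > teacher_dim:
        raise ValueError("student dimension must not exceed teacher dimension")
    return [(i * teacher_dim) // student_dim for i in range(student_dim)]


def select_weights(teacher_weight, student_shape):
    """Build a student weight matrix from a subset of teacher entries."""
    rows, cols = student_shape
    row_idx = uniform_indices(len(teacher_weight), rows)
    col_idx = uniform_indices(len(teacher_weight[0]), cols)
    return [[teacher_weight[r][c] for c in col_idx] for r in row_idx]


# Toy teacher layer (4 x 6) shrunk to a student layer (2 x 3).
teacher = [[r * 10 + c for c in range(6)] for r in range(4)]
student = select_weights(teacher, (2, 3))
print(student)
```

In a real model the same selection would be repeated for every mapped layer pair, with the student then trained as usual from this initialization.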
2311.18822
Report |
ElasticDiffusion: Training-free Arbitrary Size Image Generation through Global-Local Content Separation |
Moayed Haji-Ali, Guha Balakrishnan, Vicente Ordonez |
Diffusion models have revolutionized image generation in recent years, yet
they are still limited to a few sizes and aspect ratios. We propose
ElasticDiffusion, a novel training-free decoding method that enables pretrained
text-to-image diffusion models to generate images with various sizes.
ElasticDiffusion attempts to decouple the generation trajectory of a pretrained
model into local and global signals. The local signal controls low-level pixel
information and can be estimated on local patches, while the global signal is
used to maintain overall structural consistency and is estimated with a
reference image. We test our method on CelebA-HQ (faces) and LAION-COCO
(objects/indoor/outdoor scenes). Our experiments and qualitative results show
superior image coherence quality across aspect ratios compared to
MultiDiffusion and the standard decoding strategy of Stable Diffusion. Project
page: https://elasticdiffusion.github.io/ |
Introduces ElasticDiffusion, a training-free decoding method enabling pretrained text-to-image diffusion models to generate images at arbitrary sizes. |
Addresses the limitation of existing diffusion models that are typically trained on a few image sizes and struggle to maintain quality at different resolutions or aspect ratios. |
Decouples the generation trajectory into local (pixel-level details) and global signals (structural consistency) by leveraging insights from classifier-free guidance. Local signals are estimated on patches, while global signals are derived from a reference image. |
Generates coherent images at various resolutions, outperforming baselines like Stable Diffusion and MultiDiffusion.
Achieves comparable FID and CLIP scores to SDXL at 1024x1024 resolution using a smaller base model (Stable Diffusion 1.4).
Effectively handles diverse aspect ratios, surpassing baselines in maintaining image coherence and alignment with input prompts. |
Potential for artifact generation due to inaccuracies in estimating global/local signals.
Limited effectiveness in generating images at significantly extended sizes (beyond 4x the training resolution). |
diffusion models, image generation, arbitrary size, classifier-free guidance, resolution independence |
2311.18815
Report |
IMMA: Immunizing text-to-image Models against Malicious Adaptation |
Amber Yijia Zheng, Raymond A. Yeh |
Advancements in text-to-image models and fine-tuning methods have led to the
increasing risk of malicious adaptation, i.e., fine-tuning to generate harmful
unauthorized content. Recent works, e.g., Glaze or MIST, have developed
data-poisoning techniques which protect the data against adaptation methods. In
this work, we consider an alternative paradigm for protection. We propose to
"immunize" the model by learning model parameters that are difficult for the
adaptation methods when fine-tuning malicious content; in short IMMA. Empirical
results show IMMA's effectiveness against malicious adaptations, including
mimicking the artistic style and learning of inappropriate/unauthorized
content, over three adaptation methods: LoRA, Textual-Inversion, and
DreamBooth. |
The paper introduces IMMA, a novel model immunization technique designed to safeguard text-to-image models from malicious adaptation, preventing the generation of harmful or unauthorized content. |
The rise of open-source text-to-image models and fine-tuning methods necessitates protection against misuse, such as generating harmful content or infringing on artists' rights. Existing data poisoning methods place the burden on content creators. IMMA addresses this by immunizing the model itself. |
IMMA utilizes a bi-level optimization program. It learns a set of model parameters that lead to poor performance when adapted for malicious purposes, effectively acting as a poor model initialization for malicious adaptation. |
IMMA successfully inhibits re-learning of erased concepts, demonstrated by quantitative metrics and user studies.
IMMA effectively prevents adaptation towards personalized/unique concepts while preserving the model's adaptability for benign concepts.
IMMA exhibits resilience against JPEG compression, surpassing data poisoning methods like MIST in this aspect. |
Immunizing against certain target concepts may negatively impact the model's performance on other concepts.
Future research could explore methods to mitigate the potential negative impact on other concepts during immunization. |
model immunization, text-to-image generation, malicious adaptation, diffusion models, ai safety |
2311.18775
Report |
CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation |
Zineng Tang, Ziyi Yang, Mahmoud Khademi, Yang Liu, Chenguang Zhu, Mohit Bansal |
We present CoDi-2, a versatile and interactive Multimodal Large Language
Model (MLLM) that can follow complex multimodal interleaved instructions,
conduct in-context learning (ICL), reason, chat, edit, etc., in an any-to-any
input-output modality paradigm. By aligning modalities with language for both
encoding and generation, CoDi-2 empowers Large Language Models (LLMs) to not
only understand complex modality-interleaved instructions and in-context
examples, but also autoregressively generate grounded and coherent multimodal
outputs in the continuous feature space. To train CoDi-2, we build a
large-scale generation dataset encompassing in-context multimodal instructions
across text, vision, and audio. CoDi-2 demonstrates a wide range of zero-shot
capabilities for multimodal generation, such as in-context learning, reasoning,
and compositionality of any-to-any modality generation through multi-round
interactive conversation. CoDi-2 surpasses previous domain-specific models on
tasks such as subject-driven image generation, vision transformation, and audio
editing. CoDi-2 signifies a substantial breakthrough in developing a
comprehensive multimodal foundation model adept at interpreting in-context
language-vision-audio interleaved instructions and producing multimodal
outputs. |
This paper presents CoDi-2, a versatile Multimodal Large Language Model (MLLM) that excels in following complex interleaved multimodal instructions, performing in-context learning (ICL), reasoning, engaging in conversations, and editing content in an any-to-any input-output modality paradigm. |
Current multimodal generative models struggle with zero-shot fine-grained control, are limited to single-round user interactions, and often handle only one or two input modalities. CoDi-2 addresses these limitations by enabling sophisticated multimodal generation, multi-round interactions, and understanding modality-interleaved inputs. |
CoDi-2 leverages a Large Language Model (LLM) as its core, enhanced with multimodal encoders and decoders. This architecture enables it to process text, image, and audio inputs aligned in the language space, facilitating in-context learning and reasoning. The model is trained on a diverse dataset incorporating existing multimodal resources and novel text-only datasets adapted for multimodal in-context learning. |
CoDi-2 achieves competitive zero-shot performance in subject-driven image generation, demonstrating its ability to generalize to unseen tasks.
The model excels in audio manipulation tasks, surpassing previous methods in adding, dropping, and replacing audio elements.
CoDi-2 demonstrates strong capabilities in various in-context multimodal generation tasks, including style adaptation, image composition, and exemplar-based editing. |
The training datasets, while diverse, might not cover all potential real-world applications, such as visual concept learning, even though the model shows promising results in this area.
The model's performance might be further improved by exploring techniques to enhance its ability to learn and apply visual concepts. |
multimodal generation, large language models, in-context learning, multimodal reasoning, multimodal interaction |
2311.18765
Report |
MLLMs-Augmented Visual-Language Representation Learning |
Yanqing Liu, Kai Wang, Wenqi Shao, Ping Luo, Yu Qiao, Mike Zheng Shou, Kaipeng Zhang, Yang You |
Visual-language pre-training has achieved remarkable success in many
multi-modal tasks, largely attributed to the availability of large-scale
image-text datasets. In this work, we demonstrate that Multi-modal Large
Language Models (MLLMs) can enhance visual-language representation learning by
establishing richer image-text associations for image-text datasets. Our
approach is simple, utilizing MLLMs to extend multiple diverse captions for
each image. To prevent the bias introduced by MLLMs' hallucinations and
monotonous language styles, we propose "text shearing" to maintain the quality
and availability of extended captions. In image-text retrieval, without
introducing additional training cost, our method consistently obtains 5.6 ~
35.0 and 16.8 ~ 46.1 improvement on Recall@1 under the fine-tuning and
zero-shot settings, respectively. Notably, we obtain zero-shot results that are
comparable to fine-tuning on target datasets, which encourages more exploration
of the versatile use of MLLMs. |
This paper proposes to leverage Multi-modal Large Language Models (MLLMs) to enhance visual-language representation learning. |
Large-scale image-text datasets are crucial for visual-language pre-training, but simply removing mismatched pairs reduces data size and negatively impacts performance. This method improves the quality and diversity of image-text pairs without reducing the dataset size. |
Multiple MLLMs are used to generate diverse captions for each image, then "text shearing" is applied to truncate captions to the average length of original captions. This process maintains caption quality and reduces MLLMs' hallucinations. |
The method consistently improves zero-shot and fine-tuned image-text retrieval performance on MSCOCO and Flickr30K datasets by significant margins.
Zero-shot CLIP with the proposed method outperforms vanilla CLIP fine-tuned on target datasets for image-text retrieval.
The method consistently improves performance on various downstream tasks, including image classification, visual question answering, visual reasoning, image captioning, and video-language tasks. |
Noise from unreliable MLLMs' outputs limits performance.
Future work could explore using more powerful MLLMs and larger datasets. |
visual-language pre-training, multi-modal large language models, image-text retrieval, image captioning, data augmentation |
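The "text shearing" step described above (truncating MLLM-generated captions to the average length of the original captions) can be sketched as follows. This is a minimal illustration that assumes whitespace-separated words as the unit of length; the tokenizer actually used is not stated in the summary, and `text_shear` is a hypothetical name.

```python
def text_shear(original_captions, extended_captions):
    """Truncate each extended caption to the average original caption length.

    Assumption: length is counted in whitespace-separated words, and the
    average is rounded down to an integer.
    """
    if not original_captions:
        raise ValueError("need at least one original caption")
    total_words = sum(len(c.split()) for c in original_captions)
    avg_len = total_words // len(original_captions)
    return [" ".join(c.split()[:avg_len]) for c in extended_captions]


originals = ["a dog on grass", "two cats sleeping on a sofa"]  # 4 and 6 words
extended = ["a brown dog is running across the green grass in a park"]
sheared = text_shear(originals, extended)
print(sheared)
```

Keeping only the leading words reflects the idea that the tail of a long generated caption is where hallucinated detail is more likely to accumulate; that rationale is an interpretation of the summary, not a quoted claim.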
2311.18763
Report |
Continual Diffusion with STAMINA: STack-And-Mask INcremental Adapters |
James Seale Smith, Yen-Chang Hsu, Zsolt Kira, Yilin Shen, Hongxia Jin |
Recent work has demonstrated a remarkable ability to customize text-to-image
diffusion models to multiple, fine-grained concepts in a sequential (i.e.,
continual) manner while only providing a few example images for each concept.
This setting is known as continual diffusion. Here, we ask the question: Can we
scale these methods to longer concept sequences without forgetting? Although
prior work mitigates the forgetting of previously learned concepts, we show
that its capacity to learn new tasks reaches saturation over longer sequences.
We address this challenge by introducing a novel method, STack-And-Mask
INcremental Adapters (STAMINA), which is composed of low-ranked
attention-masked adapters and customized MLP tokens. STAMINA is designed to
enhance the robust fine-tuning properties of LoRA for sequential concept
learning via learnable hard-attention masks parameterized with low rank MLPs,
enabling precise, scalable learning via sparse adaptation. Notably, all
introduced trainable parameters can be folded back into the model after
training, inducing no additional inference parameter costs. We show that
STAMINA outperforms the prior SOTA for the setting of text-to-image continual
customization on a 50-concept benchmark composed of landmarks and human faces,
with no stored replay data. Additionally, we extended our method to the setting
of continual learning for image classification, demonstrating that our gains
also translate to state-of-the-art performance in this standard benchmark. |
This paper presents STAMINA (STack-And-Mask INcremental Adapters), a novel method to improve continual learning in text-to-image diffusion models by enhancing low-rank adaptations with attention masking and learnable MLP tokens. |
Continual diffusion, the ability to sequentially customize models with new concepts without forgetting previous ones, is crucial for personalized applications but existing methods struggle to scale to longer concept sequences. |
STAMINA combines low-rank adapters with learnable hard-attention masks (parameterized with low-rank MLPs and Gumbel softmax) and replaces custom token embeddings with learnable MLPs. All introduced parameters can be folded back into the model after training, inducing no additional inference cost. |
STAMINA outperforms the previous state-of-the-art (C-LoRA) in continual text-to-image customization on a 50-concept benchmark composed of landmarks and human faces.
The method requires significantly fewer training steps compared to C-LoRA.
STAMINA also achieves state-of-the-art performance when applied to continual learning for image classification on a standard 20-task benchmark. |
The generation of multiple concepts in a single image still has a high failure rate and requires improvement.
Ethical considerations regarding the generation of personal images (e.g., faces) and potential bias in generated images need careful attention. |
continual learning, text-to-image generation, diffusion models, sparse adaptation, attention masking |
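The claim above that all introduced parameters can be folded back into the model can be made concrete with a small sketch. It assumes the hard mask multiplies the low-rank update elementwise, so the folded weight is `W + mask * (down @ up)`; the summary does not give the exact formulation, and `matmul` and `fold_adapter` are hypothetical helpers.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def fold_adapter(weight, down, up, mask):
    """Merge a masked low-rank update into the base weight.

    Assumption: the binary mask gates individual entries of the low-rank
    product down @ up before it is added to the frozen weight.
    """
    delta = matmul(down, up)
    return [
        [weight[i][j] + mask[i][j] * delta[i][j] for j in range(len(weight[0]))]
        for i in range(len(weight))
    ]


weight = [[1, 0], [0, 1]]   # frozen base weight
down = [[1], [2]]           # rank-1 factors of the adapter
up = [[3, 4]]
mask = [[1, 0], [0, 1]]     # hard (binary) mask

folded = fold_adapter(weight, down, up, mask)
output = matmul([[1, 1]], folded)  # inference uses only the folded matrix
print(folded, output)
```

After folding, inference touches a single matrix of the original shape, which is why no extra inference-time parameters remain.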
2311.18729
Report |
Learning One-Shot 4D Head Avatar Synthesis using Synthetic Data |
Yu Deng, Duomin Wang, Xiaohang Ren, Xingyu Chen, Baoyuan Wang |
Existing one-shot 4D head synthesis methods usually learn from monocular
videos with the aid of 3DMM reconstruction, yet the latter is equally
challenging, which restricts them from reasonable 4D head synthesis. We present
a method to learn one-shot 4D head synthesis via large-scale synthetic data.
The key is to first learn a part-wise 4D generative model from monocular images
via adversarial learning, to synthesize multi-view images of diverse identities
and full motions as training data; then leverage a transformer-based animatable
triplane reconstructor to learn 4D head reconstruction using the synthetic
data. A novel learning strategy is enforced to enhance the generalizability to
real images by disentangling the learning process of 3D reconstruction and
reenactment. Experiments demonstrate our superiority over the prior art. |
This paper introduces a novel method for one-shot 4D head avatar synthesis from a single image, leveraging a large-scale synthetic dataset generated by a novel 4D generative head model. |
Existing methods depend on 3DMM reconstruction from monocular videos, which limits their performance due to the inherent challenges of 3DMM estimation. This new method aims to overcome these limitations and achieve higher-fidelity 4D head synthesis. |
The method consists of two main components: 1) GenHead: a 4D generative model trained on monocular images to synthesize multi-view head images with diverse identities, full motion control, and background separation. 2) A one-shot 4D head synthesis model trained on the synthetic data, employing a transformer-based encoder-decoder architecture and a disentangled learning strategy to enhance generalizability to real images. |
Achieves high-fidelity 4D head reconstruction with reasonable geometry and complete motion control from single images.
Outperforms previous state-of-the-art methods in terms of visual quality, identity preservation, and pose accuracy, particularly under large pose variations.
Demonstrates the efficacy of using synthetic data for learning complex tasks like one-shot 4D head synthesis and opens up new possibilities for scalable head avatar creation. |
Limitations include difficulty handling complex accessories and makeup, potential for texture flickering, and challenges with extreme profile views.
Future work involves improving the handling of high-frequency details, addressing artifacts under specific expressions, and exploring ways to incorporate real data and 3D priors for enhanced realism. |
4d head avatar synthesis, one-shot learning, synthetic data, generative adversarial networks, neural rendering |
2311.18654
Report |
Detailed Human-Centric Text Description-Driven Large Scene Synthesis |
Gwanghyun Kim, Dong Un Kang, Hoigi Seo, Hayeon Kim, Se Young Chun |
Text-driven large scene image synthesis has made significant progress with
diffusion models, but controlling it is challenging. While using additional
spatial controls with corresponding texts has improved the controllability of
large scene synthesis, it is still challenging to faithfully reflect detailed
text descriptions without user-provided controls. Here, we propose
DetText2Scene, a novel text-driven large-scale image synthesis method with high
faithfulness, controllability, and naturalness in a global context for the
detailed human-centric text description. Our DetText2Scene consists of 1)
hierarchical keypoint-box layout generation from the detailed description by
leveraging a large language model (LLM), 2) a view-wise conditioned joint diffusion
process to synthesize a large scene from the given detailed text with
LLM-generated grounded keypoint-box layout and 3) pixel perturbation-based
pyramidal interpolation to progressively refine the large scene for global
coherence. Our DetText2Scene significantly outperforms prior arts in
text-to-large scene synthesis qualitatively and quantitatively, demonstrating
strong faithfulness with detailed descriptions, superior controllability, and
excellent naturalness in a global context. |
Proposes DetText2Scene, a novel method for text-driven large-scale image synthesis that generates highly controllable and natural images faithfully reflecting detailed human-centric text descriptions. |
Existing text-to-large-scene generation methods struggle to faithfully and controllably generate images from detailed descriptions, often lacking global coherence. |
Leverages a hierarchical approach with three stages: 1) Generates a keypoint-box layout from text using a fine-tuned large language model (LLM). 2) Synthesizes a large scene using a view-wise conditioned joint diffusion process guided by the layout and text. 3) Improves global coherence through pixel perturbation-based pyramidal interpolation. |
DetText2Scene outperforms prior arts in text-to-large scene synthesis both qualitatively and quantitatively.
It demonstrates strong faithfulness to detailed descriptions, superior controllability over the number and attributes of generated instances, and excellent naturalness in global context.
User studies confirm significant preference for DetText2Scene over existing methods regarding faithfulness, controllability, and naturalness. |
The LLM's understanding of visual context, especially 3D information, can be limited.
The quality of generated images is contingent on the capabilities of the underlying text-to-image diffusion model (Stable Diffusion 1.5 in this case). |
text-to-image synthesis, large scene generation, diffusion models, large language models, layout generation |
2311.18651
Report |
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning |
Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, Tao Chen |
Recent advances in Large Multimodal Models (LMMs) have enabled
various applications in human-machine interaction. However, developing LMMs
that can comprehend, reason, and plan in complex and diverse 3D environments
remains a challenging topic, especially considering the demand for
understanding permutation-invariant point cloud 3D representations of the 3D
scene. Existing works seek help from multi-view images, and project 2D features
to 3D space as 3D scene representations. This, however, leads to huge
computational overhead and performance degradation. In this paper, we present
LL3DA, a Large Language 3D Assistant that takes point clouds as direct input and
responds to both textual instructions and visual prompts. This helps LMMs better
comprehend human interactions and further helps to remove the ambiguities in
cluttered 3D scenes. Experiments show that LL3DA achieves remarkable results,
and surpasses various 3D vision-language models on both 3D Dense Captioning and
3D Question Answering. |
This paper introduces LL3DA, a Large Language 3D Assistant capable of understanding, reasoning, and planning in complex 3D environments by responding to textual instructions and visual prompts. |
Developing models that can comprehend and interact with 3D environments using natural language is crucial for advancements in fields like autonomous driving and embodied AI. |
LL3DA leverages a multi-modal transformer (Interactor3D) to process 3D scene data, textual instructions, and visual prompts, generating a fixed-length representation. This representation is then used as a prefix to a frozen pre-trained Large Language Model (LLM) for response generation. |
LL3DA achieves state-of-the-art results on 3D Dense Captioning benchmarks ScanRefer and Nr3D.
It also outperforms previous methods in 3D Question Answering on the ScanQA dataset.
The addition of visual prompts, like user clicks, significantly improves LL3DA's performance by removing ambiguities in complex scenes. |
The generalist performance of LL3DA on Nr3D is limited due to not differentiating between Nr3D and ScanRefer datasets during training.
Future work could focus on collecting higher-quality and more diverse 3D vision and language annotations to enhance the model's reasoning and planning abilities. |
3d vision and language, large language models, instruction following, visual prompts, point cloud understanding |
2311.18610
Report |
DiffCAD: Weakly-Supervised Probabilistic CAD Model Retrieval and Alignment from an RGB Image |
Daoyi Gao, Dávid Rozenberszki, Stefan Leutenegger, Angela Dai |
Perceiving 3D structures from RGB images based on CAD model primitives can
enable an effective, efficient 3D object-based representation of scenes.
However, current approaches rely on supervision from expensive annotations of
CAD models associated with real images, and encounter challenges due to the
inherent ambiguities in the task -- both the depth-scale ambiguity of monocular
perception and the inexact matches of CAD database models to real
observations. We thus propose DiffCAD, the first weakly-supervised
probabilistic approach to CAD retrieval and alignment from an RGB image. We
formulate this as a conditional generative task, leveraging diffusion to learn
implicit probabilistic models capturing the shape, pose, and scale of CAD
objects in an image. This enables multi-hypothesis generation of different
plausible CAD reconstructions, requiring only a few hypotheses to characterize
ambiguities in depth/scale and inexact shape matches. Our approach is trained
only on synthetic data, leveraging monocular depth and mask estimates to enable
robust zero-shot adaptation to various real target domains. Despite being
trained solely on synthetic data, our multi-hypothesis approach can even
surpass the supervised state-of-the-art on the Scan2CAD dataset by 5.9% with 8
hypotheses. |
This paper introduces DiffCAD, the first weakly-supervised probabilistic approach for retrieving and aligning CAD models to a single RGB image, addressing inherent ambiguities in depth, scale, and shape matching. |
Current methods for CAD-based 3D scene understanding rely on expensive real-image annotations and struggle with depth-scale ambiguities and inexact CAD matches. DiffCAD overcomes these limitations by learning probabilistic distributions for plausible reconstructions. |
The method uses diffusion models to capture distributions of scene scale, object pose (via Normalized Object Coordinates), and object shape (latent codes). Trained solely on synthetic data with depth and mask estimates, it generalizes to real images through a multi-hypothesis sampling scheme. |
DiffCAD outperforms state-of-the-art supervised methods on the Scan2CAD dataset, achieving 5.9% higher accuracy with only 8 hypotheses.
The learned probabilistic models effectively capture ambiguities, with performance increasing as more hypotheses are considered.
The method generalizes to unseen real-world datasets like ARKit, demonstrating robustness despite synthetic training data. |
The reliance on CAD models limits the reconstruction of objects without close database matches, suggesting future work in CAD model deformation.
The current approach doesn't explicitly model object relations, potentially hindering performance in highly structured scenes. Integrating scene context is a promising direction. |
3d vision, cad model retrieval, diffusion models, weakly-supervised learning, single-view reconstruction |
2311.18608
Report |
Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing |
Hyelin Nam, Gihyun Kwon, Geon Yeong Park, Jong Chul Ye |
With the remarkable advent of text-to-image diffusion models, image editing
methods have become more diverse and continue to evolve. A promising recent
approach in this realm is Delta Denoising Score (DDS) - an image editing
technique based on the Score Distillation Sampling (SDS) framework that leverages
the rich generative prior of text-to-image diffusion models. However, relying
solely on the difference between scoring functions is insufficient for
preserving specific structural elements from the original image, a crucial
aspect of image editing. To address this, here we present an embarrassingly
simple yet very powerful modification of DDS, called Contrastive Denoising
Score (CDS), for latent diffusion models (LDM). Inspired by the similarities
and differences between DDS and the contrastive learning for unpaired
image-to-image translation (CUT), we introduce a straightforward approach using
CUT loss within the DDS framework. Rather than employing auxiliary networks as
in the original CUT approach, we leverage the intermediate features of LDM,
specifically those from the self-attention layers, which possess rich spatial
information. Our approach enables zero-shot image-to-image translation and
neural radiance field (NeRF) editing, achieving structural correspondence
between the input and output while maintaining content controllability.
Qualitative results and comparisons demonstrate the effectiveness of our
proposed method. Project page: https://hyelinnam.github.io/CDS/ |
This paper introduces Contrastive Denoising Score (CDS), a novel method for text-driven image editing that integrates Contrastive Unpaired Translation (CUT) loss into the Delta Denoising Score (DDS) framework. |
Existing image editing techniques based on diffusion models often struggle to balance semantic changes guided by text prompts with preserving the structural integrity of the source image. CDS addresses this limitation by enhancing DDS with structural consistency. |
CDS leverages intermediate latent representations from the self-attention layers of a pre-trained Latent Diffusion Model (LDM) to compute the CUT loss. This eliminates the need for training a separate encoder network, enabling zero-shot image editing. |
CDS successfully translates source images, achieving a better balance between content transformation aligned with the target text prompt and maintaining the structural details of the source image compared to existing methods.
Quantitative evaluations including CLIP accuracy, DINO-ViT structure distance, and LPIPS distance demonstrate that CDS outperforms previous state-of-the-art methods.
The method is applicable to various domains beyond image editing, including Neural Radiance Fields (NeRF), showcasing its versatility and potential for broader applications. |
Failure cases can arise from unfavorable random patch selections or when the source object has unconventional poses.
Future work includes exploring techniques to mitigate these limitations and further enhance the robustness of CDS. |
image editing, text-guided synthesis, diffusion models, contrastive learning, score distillation |
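The CUT loss mentioned in the CDS record above is a patch-wise contrastive (PatchNCE-style) loss. The following is a minimal NumPy sketch of that idea, not the authors' implementation: it assumes the self-attention features have already been extracted and flattened to `(N, D)` arrays, and the function name, the temperature `tau`, and the number of sampled patches are hypothetical choices made for illustration.

```python
import numpy as np

def patch_nce_loss(src_feats, out_feats, num_patches=64, tau=0.07, seed=0):
    """PatchNCE-style loss between two (N, D) arrays of per-location features.

    For each sampled location, the output feature is the query, the source
    feature at the same location is the positive, and the source features at
    the other sampled locations are negatives.
    """
    rng = np.random.default_rng(seed)
    n = src_feats.shape[0]
    # Randomly sample patch locations (hypothetical sampling scheme).
    idx = rng.choice(n, size=min(num_patches, n), replace=False)
    # L2-normalize so that dot products are cosine similarities.
    q = out_feats[idx] / np.linalg.norm(out_feats[idx], axis=1, keepdims=True)
    k = src_feats[idx] / np.linalg.norm(src_feats[idx], axis=1, keepdims=True)
    logits = q @ k.T / tau  # (P, P); diagonal entries are the positive pairs
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

The loss is small when each output location stays closest to the same location in the source features. Because the sketch samples patch locations at random, its value depends on the draw, which is consistent with the record's note that unfavorable random patch selections can cause failures.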
2311.18561
Report |
Periodic Vibration Gaussian: Dynamic Urban Scene Reconstruction and Real-time Rendering |
Yurui Chen, Chun Gu, Junzhe Jiang, Xiatian Zhu, Li Zhang |
Modeling dynamic, large-scale urban scenes is challenging due to their highly
intricate geometric structures and unconstrained dynamics in both space and
time. Prior methods often employ high-level architectural priors, separating
static and dynamic elements, resulting in suboptimal capture of their
synergistic interactions. To address this challenge, we present a unified
representation model, called Periodic Vibration Gaussian (PVG). PVG builds upon
the efficient 3D Gaussian splatting technique, originally designed for static
scene representation, by introducing periodic vibration-based temporal
dynamics. This innovation enables PVG to elegantly and uniformly represent the
characteristics of various objects and elements in dynamic urban scenes. To
enhance temporally coherent and large scene representation learning with sparse
training data, we introduce a novel temporal smoothing mechanism and a
position-aware adaptive control strategy, respectively. Extensive experiments on
the Waymo Open Dataset and KITTI benchmarks demonstrate that PVG surpasses
state-of-the-art alternatives in both reconstruction and novel view synthesis
for both dynamic and static scenes. Notably, PVG achieves this without relying
on manually labeled object bounding boxes or expensive optical flow estimation.
Moreover, PVG exhibits a 900-fold acceleration in rendering over the best
alternative. |
This paper proposes Periodic Vibration Gaussian (PVG), a novel unified representation model for dynamic urban scene reconstruction and real-time rendering. |
Modeling dynamic urban scenes is challenging due to their complex geometry and unconstrained dynamics. Existing methods struggle to capture synergistic interactions between static and dynamic elements or suffer from low efficiency. |
PVG extends 3D Gaussian Splatting by introducing periodic vibration for temporal dynamics. It also incorporates a temporal smoothing mechanism for coherence and a position-aware adaptive control strategy for large scenes. |
PVG outperforms state-of-the-art methods in novel view synthesis on the Waymo Open Dataset and KITTI benchmarks.
It achieves superior efficiency, with up to 900-fold rendering speedup compared to competitors.
PVG effectively captures both static and dynamic elements without manual annotations or pre-trained models. |
PVG's geometric representation accuracy is limited by its highly adaptable design.
Future work includes improving geometric accuracy and further enhancing its ability to depict complex urban scenes. |
dynamic scene reconstruction, novel view synthesis, 3d gaussian splatting, periodic vibration gaussian, autonomous driving |
2311.18512
Report |
Revisiting Proposal-based Object Detection |
Aritra Bhowmik, Martin R. Oswald, Pascal Mettes, Cees G. M. Snoek |
This paper revisits the pipeline for detecting objects in images with
proposals. For any object detector, the obtained box proposals or queries need
to be classified and regressed towards ground truth boxes. The common solution
for the final predictions is to directly maximize the overlap between each
proposal and the ground truth box, followed by a winner-takes-all ranking or
non-maximum suppression. In this work, we propose a simple yet effective
alternative. For proposal regression, we solve a simpler problem where we
regress to the area of intersection between proposal and ground truth. In this
way, each proposal only specifies which part contains the object, avoiding a
blind inpainting problem where proposals need to be regressed beyond their
visual scope. In turn, we replace the winner-takes-all strategy and obtain the
final prediction by taking the union over the regressed intersections of a
proposal group surrounding an object. Our revisited approach comes with minimal
changes to the detection pipeline and can be plugged into any existing method.
We show that our approach directly improves canonical object detection and
instance segmentation architectures, highlighting the utility of
intersection-based regression and grouping. |
This paper proposes a revisited object detection pipeline that decomposes the proposal-to-ground truth regression and proposal-candidate selection into intersection and union problems, leading to more accurate and robust detection. |
The traditional object detection pipeline suffers from ill-posed regression targets and discards valuable information from multiple proposals. This paper addresses these issues to improve object localization accuracy. |
The authors introduce Intersection-based Regression, where proposals regress only to the intersection with the ground truth, and Intersection-based Grouping, where the union of regressed intersections from multiple proposals forms the final detection. |
The proposed method consistently outperforms baseline detectors like Faster R-CNN, Mask R-CNN, and YOLOv3 on COCO and PASCAL VOC datasets.
The approach demonstrates significant improvements in handling high IoU thresholds, indicating more accurate localization.
An oracle experiment shows that the method's performance scales with improved classification accuracy, highlighting its potential for future detectors. |
The method is limited in crowded scenes, where multiple instances can be merged into a single prediction.
Future work involves developing advanced grouping strategies to address the challenges posed by crowded scenes. |
object detection, intersection-based regression, intersection-based grouping, proposal combination, deep learning |
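The intersection-based regression and grouping described in the record above can be illustrated with plain box arithmetic. This is a minimal sketch under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, the "union" of a group is taken to be the smallest enclosing axis-aligned box, and the ideal regression targets stand in for what a trained detector would predict. The helper names and example coordinates are hypothetical.

```python
def intersection(a, b):
    """Intersection of two boxes (x1, y1, x2, y2), or None if they do not overlap."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return (x1, y1, x2, y2) if x1 < x2 and y1 < y2 else None

def union_of_boxes(boxes):
    """Smallest axis-aligned box enclosing all boxes (assumed meaning of 'union')."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

ground_truth = (10, 10, 50, 50)
proposals = [(0, 0, 30, 30), (25, 5, 60, 40), (5, 30, 45, 60)]

# Regression target per proposal: only the part of the object it actually sees.
targets = [intersection(p, ground_truth) for p in proposals]

# Grouping: combine the regressed intersections of the proposal group.
prediction = union_of_boxes([t for t in targets if t is not None])
print(prediction)
```

In this example no single proposal covers the whole object, yet the union of the three intersections recovers the ground-truth box, which is the motivation for replacing winner-takes-all selection with grouping.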
2311.18482
Report |
Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding |
Jin-Chuan Shi, Miao Wang, Hao-Bin Duan, Shao-Hua Guan |
Open-vocabulary querying in 3D space is challenging but essential for scene
understanding tasks such as object localization and segmentation.
Language-embedded scene representations have made progress by incorporating
language features into 3D spaces. However, their efficacy heavily depends on
neural networks that are resource-intensive in training and rendering. Although
recent 3D Gaussians offer efficient and high-quality novel view synthesis,
directly embedding language features in them leads to prohibitive memory usage
and decreased performance. In this work, we introduce Language Embedded 3D
Gaussians, a novel scene representation for open-vocabulary query tasks.
Instead of embedding high-dimensional raw semantic features on 3D Gaussians, we
propose a dedicated quantization scheme that drastically alleviates the memory
requirement, and a novel embedding procedure that achieves smoother yet
highly accurate queries, countering the multi-view feature inconsistencies and the
high-frequency inductive bias in point-based representations. Our comprehensive
experiments show that our representation achieves the best visual quality and
language querying accuracy across current language-embedded representations,
while maintaining real-time rendering frame rates on a single desktop GPU. |
This paper introduces Language Embedded 3D Gaussians, a novel scene representation framework for open-vocabulary query tasks in 3D scenes, achieving high precision and efficiency. |
Open-vocabulary querying in 3D space is crucial for scene understanding, enabling tasks like object localization and segmentation. Existing methods struggle to balance efficiency and accuracy in embedding language features into 3D representations. |
The method quantizes dense language features from CLIP and DINO into a compact feature space, significantly reducing memory requirements. These features are embedded into 3D Gaussians, and a novel mechanism utilizing learned uncertainty values smooths semantic features spatially to address visual inconsistencies across viewpoints. |
Achieves state-of-the-art visual quality in novel view synthesis, surpassing NeRF-based and 3D Gaussian baselines.
Demonstrates superior accuracy in open-vocabulary querying tasks compared to existing language-embedded 3D representations.
Maintains real-time rendering frame rates on a single desktop GPU due to efficient quantization and compact representation. |
Detecting highly reflective or translucent objects remains challenging due to limitations in current visual-language models.
Fine-grained object geometry at high resolutions using CLIP-derived semantics needs improvement. |
3d scene understanding, open-vocabulary query, language embedding, 3d gaussians, novel view synthesis |
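To make the memory argument in the record above concrete, the sketch below shows generic nearest-neighbor codebook quantization: each high-dimensional feature is replaced by a one-byte index into a small codebook. This is a standard technique used here for illustration only; the paper's actual quantization scheme, codebook size, and feature dimensions are not given in this listing, so all names and sizes are assumptions.

```python
import numpy as np

def quantize(features, codebook):
    """Map each (D,) feature to the index of its nearest codebook entry."""
    # Squared Euclidean distances between every feature and every code: (N, K).
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1).astype(np.uint8)  # one byte each; assumes K <= 256

def dequantize(indices, codebook):
    """Recover an approximate feature for each stored index."""
    return codebook[indices]

def bytes_saved(num_features, dim, num_codes, bytes_per_float=4):
    """Raw float storage minus (indices + codebook) storage, in bytes."""
    raw = num_features * dim * bytes_per_float
    compact = num_features * 1 + num_codes * dim * bytes_per_float
    return raw - compact
```

With these hypothetical sizes, storing one million 512-dimensional float features directly takes about 2 GB, while one-byte indices plus a 256-entry codebook take roughly 1.5 MB.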
2311.18448
Report |
HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video |
Zicong Fan, Maria Parelli, Maria Eleni Kadoglou, Muhammed Kocabas, Xu Chen, Michael J. Black, Otmar Hilliges |
Since humans interact with diverse objects every day, the holistic 3D capture
of these interactions is important to understand and model human behaviour.
However, most existing methods for hand-object reconstruction from RGB either
assume pre-scanned object templates or heavily rely on limited 3D hand-object
data, restricting their ability to scale and generalize to more unconstrained
interaction settings. To this end, we introduce HOLD -- the first
category-agnostic method that reconstructs an articulated hand and object
jointly from a monocular interaction video. We develop a compositional
articulated implicit model that can reconstruct disentangled 3D hand and object
from 2D images. We further incorporate hand-object constraints to improve
hand-object poses and consequently the reconstruction quality. Our method does
not rely on 3D hand-object annotations while outperforming fully-supervised
baselines in both in-the-lab and challenging in-the-wild settings. Moreover, we
qualitatively show its robustness in reconstructing from in-the-wild videos.
Code: https://github.com/zc-alexfan/hold |
This paper introduces HOLD, a novel category-agnostic method for reconstructing articulated hand and object 3D surfaces jointly from a single interaction video. |
Understanding human behavior requires capturing 3D hand-object interactions, but existing methods are limited by pre-scanned templates or limited training data, hindering scalability and generalization. |
HOLD initializes hand and object poses using off-the-shelf estimators and structure-from-motion, respectively. It then uses a compositional implicit neural network trained with a multi-class segmentation loss, eikonal loss, sparsity loss, and an SDF loss for shape regularization. Hand-object interaction constraints further refine pose estimates, leading to more accurate reconstructions. |
HOLD significantly outperforms state-of-the-art methods in hand pose and object reconstruction accuracy, generalizing to unseen object categories.
Jointly modeling hand and object improves object reconstruction compared to modeling objects in isolation.
Pose refinement with interaction constraints significantly enhances the accuracy of object and hand poses, resulting in better object reconstructions. |
Reconstruction of thin or textureless objects is limited by detector-based SfM.
Reliance on raw RGB data for supervision may hinder the reconstruction of less visible object regions. |
hand-object reconstruction, 3d reconstruction, monocular video, neural implicit representation, interaction constraints |
2311.18435
Report |
Layered Rendering Diffusion Model for Zero-Shot Guided Image Synthesis |
Zipeng Qi, Guoxi Huang, Zebin Huang, Qin Guo, Jinwen Chen, Junyu Han, Jian Wang, Gang Zhang, Lufei Liu, Errui Ding, Jingdong Wang |
This paper introduces innovative solutions to enhance spatial controllability
in diffusion models reliant on text queries. We present two key innovations:
Vision Guidance and the Layered Rendering Diffusion (LRDiff) framework. Vision
Guidance, a spatial layout condition, acts as a clue in the perturbed
distribution, greatly narrowing down the search space, to focus on the image
sampling process adhering to the spatial layout condition. The LRDiff framework
constructs an image-rendering process with multiple layers, each of which
applies the vision guidance to instructively estimate the denoising direction
for a single object. Such a layered rendering strategy effectively prevents
issues like unintended conceptual blending or mismatches, while allowing for
more coherent and contextually accurate image synthesis. The proposed method
provides a more efficient and accurate means of synthesising images that align
with specific spatial and contextual requirements. We demonstrate through our
experiments that our method provides better results than existing techniques
both quantitatively and qualitatively. We apply our method to three practical
applications: bounding box-to-image, semantic mask-to-image and image editing. |
This paper introduces LRDiff, a zero-shot diffusion-based framework for layout-guided image synthesis using vision guidance. |
Existing T2I models struggle with spatial controllability. Fine-tuning methods are costly, while attention-based methods suffer from blending and mismatch issues. |
LRDiff uses vision guidance, a spatial condition added to the input, to guide denoising. It employs a layered rendering approach, estimating object denoising separately before combining them with global context. |
LRDiff achieves superior spatial controllability compared to previous zero-shot methods, as demonstrated by higher AP and IoU scores.
It effectively mitigates object blending, a common issue in attention-manipulation based approaches.
The method is robust to various scene descriptions and allows for image editing by inserting or replacing objects. |
There's a trade-off between image fidelity and spatial alignment depending on the denoising period.
Generating small objects with high fidelity remains a challenge. |
text-to-image synthesis, diffusion models, layout guidance, vision guidance, layered rendering |
2311.18387
Report |
On Exact Inversion of DPM-Solvers |
Seongmin Hong, Kyeonghyun Lee, Suh Yoon Jeon, Hyewon Bae, Se Young Chun |
Diffusion probabilistic models (DPMs) are a key component in modern
generative models. DPM-solvers have significantly reduced latency and enhanced
quality, but they make it challenging to find the exact inverse
(i.e., finding the initial noise from the given image). Here we investigate the
exact inversions for DPM-solvers and propose algorithms to perform them when
samples are generated by the first-order as well as higher-order DPM-solvers.
For each explicit denoising step in DPM-solvers, we formulated the inversions
using implicit methods such as gradient descent or forward step method to
ensure the robustness to large classifier-free guidance unlike the prior
approach using fixed-point iteration. Experimental results demonstrated that
our proposed exact inversion methods significantly reduced the error of both
image and noise reconstructions, greatly enhanced the ability to distinguish
invisible watermarks, and consistently prevented unintended background changes
during image editing. Project page:
https://smhongok.github.io/inv-dpm.html |
This paper proposes exact inversion methods for finding the initial noise of images generated by various Diffusion Probabilistic Models (DPMs), including high-order DPM solvers. |
Exact inversion is crucial for applications like image editing, style transfer, model attacks, watermark detection, and image restoration, enabling broader applications with DPMs. |
The authors propose using the backward Euler method for exact inversion of DDIM (a first-order DPM-solver). For high-order DPM-solvers, they introduce backward Euler with approximate high-order terms. To ensure robustness with large classifier-free guidance, they employ gradient descent or the forward step method. |
The proposed methods significantly reduce reconstruction errors compared to the naïve DDIM inversion for both images and noise, in both pixel-space DPM and LDM.
They enable accurate reconstruction of noise-space watermarks, even improving the detection and enabling watermark classification.
The methods substantially improve background-preserving image editing without requiring the original latent vectors. |
The proposed method has a significantly larger computational time compared to naïve DDIM inversion.
It assumes prior knowledge of the prompt used in LDMs, leaving joint estimation of prompt and initial noise for future work. |
diffusion probabilistic models, generative models, image inversion, image editing, watermark detection |
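The difference between naive inversion and the implicit (backward-Euler-style) inversion described in the record above can be seen on a one-dimensional toy problem. The sketch below is illustrative only: `eps_model` is a hypothetical stand-in for a trained noise predictor, and the step coefficients `a` and `b` are made-up values rather than a real noise schedule. Given the output of one explicit denoising step, it solves for the input by gradient descent on the squared residual, one of the two solvers the record names.

```python
import math

def eps_model(x):
    """Toy stand-in for a noise predictor (hypothetical, not a trained network)."""
    return math.tanh(x)

def denoise_step(x_t, a=1.02, b=-0.1):
    """One explicit DDIM-like step x_prev = a * x_t + b * eps(x_t); a, b are illustrative."""
    return a * x_t + b * eps_model(x_t)

def naive_inversion(x_prev, a=1.02, b=-0.1):
    """Approximates eps(x_t) by eps(x_prev), so the recovered x_t is only approximate."""
    return (x_prev - b * eps_model(x_prev)) / a

def implicit_inversion(x_prev, a=1.02, b=-0.1, lr=0.5, steps=200):
    """Solve denoise_step(x) = x_prev for x by gradient descent on 0.5 * residual**2."""
    x = x_prev  # initialize at the known output of the step
    for _ in range(steps):
        residual = denoise_step(x, a, b) - x_prev
        slope = a + b * (1.0 - math.tanh(x) ** 2)  # derivative of denoise_step at x
        x -= lr * residual * slope
    return x

x_true = 0.8
x_prev = denoise_step(x_true)
print(abs(naive_inversion(x_prev) - x_true), abs(implicit_inversion(x_prev) - x_true))
```

The implicit solve recovers the input to within floating-point tolerance at the cost of many evaluations of the noise predictor per step, which matches the limitation noted in the record about larger computational time.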
2311.18297
Report |
TrustMark: Universal Watermarking for Arbitrary Resolution Images |
Tu Bui, Shruti Agarwal, John Collomosse |
Imperceptible digital watermarking is important in copyright protection,
misinformation prevention, and responsible generative AI. We propose TrustMark
- a GAN-based watermarking method with a novel design in architecture and
spatio-spectra losses to balance the trade-off between watermarked image
quality and watermark recovery accuracy. Our model is trained with
robustness in mind, withstanding various in- and out-place perturbations on the
encoded image. Additionally, we introduce TrustMark-RM - a watermark remover
method useful for re-watermarking. Our methods achieve state-of-the-art performance
on 3 benchmarks comprising arbitrary resolution images. |
TrustMark, a novel GAN-based watermarking method for arbitrary resolution images, balances imperceptibility and watermark recovery while achieving robustness against perturbations. |
Addresses challenges in misinformation prevention, copyright protection, and responsible generative AI by enabling robust and imperceptible embedding of identifiers (e.g., provenance data) within images. |
Combines a novel architecture with 1x1 convolutional post-processing layers, focal frequency loss, extensive noise simulation during training, a resolution scaling method, and a watermark removal network (TrustMark-RM) for re-watermarking. |
Achieves state-of-the-art imperceptibility and watermark recovery performance on three benchmarks (CLIC, DIV2K, MetFace).
Demonstrates robustness against various noise sources, severity levels, and adversarial attacks.
Enables effective watermark removal and re-watermarking while preserving image quality. |
Performance slightly degrades on highly cluttered images.
Watermark removal effectiveness weakens with repeated re-watermarking due to accumulated noise. |
watermarking, deep learning, image processing, content provenance, robustness |
2311.18288
Report |
CosAvatar: Consistent and Animatable Portrait Video Tuning with Text Prompt |
Haiyao Xiao, Chenglai Zhong, Xuan Gao, Yudong Guo, Juyong Zhang |
Recently, text-guided digital portrait editing has attracted increasing
attention. However, existing methods still struggle to maintain consistency
across time, expression, and view or require specific data prerequisites. To
solve these challenging problems, we propose CosAvatar, a high-quality and
user-friendly framework for portrait tuning. With only monocular video and text
instructions as input, we can produce animatable portraits with both temporal
and 3D consistency. Different from methods that directly edit in the 2D domain,
we employ a dynamic NeRF-based 3D portrait representation to model both the
head and torso. We alternate between editing the dataset of video frames and
updating the underlying 3D portrait until the edited frames reach 3D
consistency. Additionally, we integrate the semantic portrait priors to enhance
the edited results, allowing precise modifications in specified semantic areas.
Extensive results demonstrate that our proposed method can not only accurately
edit portrait styles or local attributes based on text instructions but also
support expressive animation driven by a source video. |
CosAvatar, a novel text-driven portrait editing framework using monocular dynamic NeRF to enable global style and local attribute editing with strong temporal and 3D consistency, while supporting animation. |
Existing methods struggle to maintain consistency across time, expression, and view during portrait editing, especially for dynamic sequences, limiting flexibility and generalizability. |
The method reconstructs a dynamic NeRF-based 3D portrait, separately modeling head and torso motion. It then leverages a structure-preserving image-conditioned diffusion model (InstructPix2Pix) to iteratively edit rendered frames, alternating between dataset updates and NeRF refinement. Semantic segmentation priors guide local attribute edits. |
Generates high-fidelity stylized portrait videos with strong temporal and 3D consistency from text instructions.
Allows precise control over local semantic region edits, like hair or clothing style.
Enables animation of the edited portrait driven by expressions and poses from reference videos. |
Fine-detail editing is limited due to InstructPix2Pix's limitations in object isolation and spatial reasoning.
Lacks explicit geometric proxies for body parts, potentially impacting accuracy in scenarios with significant body movement. |
portrait editing, text-driven editing, dynamic nerf, diffusion models, semantic segmentation |
2311.18266
Report |
Prompt-Based Exemplar Super-Compression and Regeneration for Class-Incremental Learning |
Ruxiao Duan, Yaoyao Liu, Jieneng Chen, Adam Kortylewski, Alan Yuille |
Replay-based methods in class-incremental learning (CIL) have attained
remarkable success, as replaying the exemplars of old classes can significantly
mitigate catastrophic forgetting. Despite their effectiveness, the inherent
memory restrictions of CIL result in saving a limited number of exemplars with
poor diversity, leading to data imbalance and overfitting issues. In this
paper, we introduce a novel exemplar super-compression and regeneration method,
ESCORT, which substantially increases the quantity and enhances the diversity
of exemplars. Rather than storing past images, we compress images into visual
and textual prompts, e.g., edge maps and class tags, and save the prompts
instead, reducing the memory usage of each exemplar to 1/24 of the original
size. In subsequent learning phases, diverse high-resolution exemplars are
generated from the prompts by a pre-trained diffusion model, e.g., ControlNet.
To minimize the domain gap between generated exemplars and real images, we
propose partial compression and diffusion-based data augmentation, allowing us
to utilize an off-the-shelf diffusion model without fine-tuning it on the
target dataset. Therefore, the same diffusion model can be downloaded whenever
it is needed, incurring no memory consumption. Comprehensive experiments
demonstrate that our method significantly improves model performance across
multiple CIL benchmarks, e.g., 5.0 percentage points higher than the previous
state-of-the-art on the 10-phase Caltech-256 dataset. |
This paper introduces ESCORT, an exemplar super-compression and regeneration method using prompts, to enhance the quantity and diversity of exemplars in replay-based class-incremental learning (CIL). |
Replay-based CIL methods are limited by the small quantity and poor diversity of stored exemplars, leading to data imbalance and overfitting. |
ESCORT compresses past images into visual (edge maps) and textual (class tags) prompts for storage. Diverse, high-resolution exemplars are then regenerated using a pre-trained diffusion model (ControlNet) from these prompts during subsequent CIL phases. Partial compression and diffusion-based data augmentation mitigate the domain gap between generated and real images. |
ESCORT significantly improves model performance, achieving state-of-the-art results on three image classification benchmarks (Caltech-256, Food-101, Places-100).
The method consistently improves accuracy across various memory budgets, demonstrating its effectiveness in memory-constrained scenarios.
Ablation studies confirm the benefits of partial compression and diffusion-based data augmentation in improving exemplar quality and diversity. |
The current implementation relies on a pre-selected diffusion model (ControlNet). Adapting to other generative models might require further exploration.
The study primarily focuses on image classification tasks. Exploring ESCORT's applicability to other CIL domains, like object detection or semantic segmentation, is a promising future direction. |
class-incremental learning, exemplar compression, diffusion models, data augmentation, catastrophic forgetting |
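The 1/24 compression figure in the ESCORT record above translates directly into how many exemplars fit in a fixed replay memory. The short sketch below works through that arithmetic; the function name and the example budget of 20 images are hypothetical, and the calculation ignores partial compression (where some exemplars would be kept as full images).

```python
from fractions import Fraction

def exemplars_per_budget(budget_in_images, compression_ratio=Fraction(1, 24)):
    """Number of compressed exemplars that fit in a budget measured in full images."""
    return int(budget_in_images / compression_ratio)

# A memory budget worth 20 full images holds 24 times as many prompt-based exemplars.
print(exemplars_per_budget(20))
```

Using an exact `Fraction` avoids floating-point rounding when the ratio is not a power of two.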
2311.18257
Report |
Diffusion Models Without Attention |
Jing Nathan Yan, Jiatao Gu, Alexander M. Rush |
In recent advancements in high-fidelity image generation, Denoising Diffusion
Probabilistic Models (DDPMs) have emerged as a key player. However, their
application at high resolutions presents significant computational challenges.
Current methods, such as patchifying, expedite processes in UNet and
Transformer architectures but at the expense of representational capacity.
Addressing this, we introduce the Diffusion State Space Model (DiffuSSM), an
architecture that supplants attention mechanisms with a more scalable state
space model backbone. This approach effectively handles higher resolutions
without resorting to global compression, thus preserving detailed image
representation throughout the diffusion process. Our focus on FLOP-efficient
architectures in diffusion training marks a significant step forward.
Comprehensive evaluations on both ImageNet and LSUN datasets at two resolutions
demonstrate that DiffuSSMs are on par with or even outperform existing diffusion
models with attention modules in FID and Inception Score metrics while
significantly reducing total FLOP usage. |
Introduces Diffusion State Space Model (DiffuSSM), an attention-free diffusion architecture that replaces attention with a more scalable state space model backbone for high-resolution image generation. |
Addresses computational challenges of existing diffusion models at high resolutions, particularly the quadratic complexity of attention mechanisms, without compromising representational capacity by avoiding patchification or multi-scale resolution compression. |
Employs a gated state space model (SSM) backbone to process finer-grained image representations without global compression and incorporates an hourglass architecture in MLP layers to enhance efficiency. |
Achieves comparable or superior FID and Inception Score results to existing diffusion models on ImageNet and LSUN datasets at various resolutions.
Demonstrates significantly reduced total FLOP usage compared to attention-based models like DiT.
Shows improved robustness in spatial reconstruction and visual quality by avoiding patchification in qualitative analysis. |
Focuses primarily on (un)conditional image generation, leaving text-to-image approaches for future exploration.
Does not incorporate recent advancements like masked image training, which could potentially further improve performance. |
diffusion models, image generation, state space models, attention mechanism, high-resolution |
2311.18248
Report |
mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model |
Anwen Hu, Yaya Shi, Haiyang Xu, Jiabo Ye, Qinghao Ye, Ming Yan, Chenliang Li, Qi Qian, Ji Zhang, Fei Huang |
Recently, the strong text creation ability of Large Language Models (LLMs) has
given rise to many tools for assisting paper reading or even writing. However,
the weak diagram analysis abilities of LLMs or Multimodal LLMs greatly limit
their application scenarios, especially for scientific academic paper writing.
In this work, towards a more versatile copilot for academic paper writing, we
mainly focus on strengthening the multi-modal diagram analysis ability of
Multimodal LLMs. By parsing LaTeX source files of high-quality papers, we
carefully build a multi-modal diagram understanding dataset M-Paper. By
aligning diagrams in the paper with related paragraphs, we construct
professional diagram analysis samples for training and evaluation. M-Paper is
the first dataset to support joint comprehension of multiple scientific
diagrams, including figures and tables in the format of images or LaTeX code.
In addition, to better align the copilot with the user's intention, we introduce
the 'outline' as the control signal, which can be given directly by the user
or revised from an auto-generated one. Comprehensive experiments with a
state-of-the-art Multimodal LLM demonstrate that training on our dataset yields
stronger scientific diagram understanding performance, including diagram
captioning, diagram analysis, and outline recommendation. The dataset, code,
and model are available at
https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/PaperOwl. |
This paper introduces PaperOwl, a new model fine-tuned for scientific diagram analysis in academic papers. It leverages outlines as control signals to align generated analyses with user intent and preceding text. |
Existing LLMs and MLLMs struggle with the complex diagram analysis required for academic writing, limiting their use as writing copilots. |
The authors built M-Paper, a dataset with aligned diagrams, captions, paragraph analyses, and outlines extracted from high-quality computer science papers. They then fine-tuned a pre-trained MLLM on this dataset, incorporating techniques like image cropping and outline-guided generation. |
PaperOwl significantly outperforms state-of-the-art MLLMs on diagram captioning, analysis, and outline recommendation tasks.
Using outlines as control signals improves analysis quality and aligns it with user intent.
Incorporating preceding text as context further enhances analysis coherence and accuracy. |
The cropping module for high-resolution images poses challenges for balancing multimodal information in diagram analysis.
The model may sometimes prioritize following the outline over providing detailed insights from diagrams. |
multimodal learning, diagram understanding, academic paper writing, large language models, computer vision |
2311.18243
Report |
DKiS: Decay weight invertible image steganography with private key |
Hang Yang, Yitian Xu, Xuhua Liu |
Image steganography, defined as the practice of concealing information within
another image, traditionally encounters security challenges when its methods
become publicly known or are under attack. To address this, we introduce a novel private
key-based image steganography technique. This approach
ensures the security of the hidden information, as access requires a
corresponding private key, regardless of the public knowledge of the
steganography method. Experimental evidence has been presented, demonstrating
the effectiveness of our method and showcasing its real-world applicability.
Furthermore, a critical challenge in the invertible image steganography process
has been identified by us: the transfer of non-essential, or 'garbage',
information from the secret to the host pipeline. To tackle this issue, the
decay weight has been introduced to control the information transfer,
effectively filtering out irrelevant data and enhancing the performance of
image steganography. The code for this technique is publicly accessible at
https://github.com/yanghangAI/DKiS, and a practical demonstration can be found
at http://yanghang.site/hidekey. |
This paper introduces DKiS, a novel private key-based image steganography technique, which ensures the security of hidden information even if the method is publicly known. |
Traditional image steganography methods face security risks when the method is known or attacked. DKiS addresses this by incorporating preset private keys, enhancing security. |
DKiS uses invertible neural networks with a private key integrated into the encoding process. A decay weight is introduced to control information transfer, optimizing performance. |
DKiS significantly outperforms previous private key-based methods in image hiding quality.
The private key effectively prevents unauthorized embedding and extraction attacks, as shown by attack simulation.
DKiS demonstrates consistent performance across diverse datasets including COCO, ImageNet, and PubLayNet. |
The current implementation of DKiS is limited to hiding images within images. Exploring other data types for hiding is a potential future direction.
While DKiS has shown robustness, investigating its resilience against a wider range of steganalysis attacks is crucial for future work. |
image steganography, deep learning, private key, security, invertible neural network |
2311.18208
Report |
SMaRt: Improving GANs with Score Matching Regularity |
Mengfei Xia, Yujun Shen, Ceyuan Yang, Ran Yi, Wenping Wang, Yong-jin Liu |
Generative adversarial networks (GANs) usually struggle in learning from
highly diverse data, whose underlying manifold is complex. In this work, we
revisit the mathematical foundations of GANs, and theoretically reveal that the
native adversarial loss for GAN training is insufficient to fix the problem of
subsets with positive Lebesgue measure of the generated data manifold lying out
of the real data manifold. Instead, we find that score matching serves as a
promising solution to this issue thanks to its capability of persistently
pushing the generated data points towards the real data manifold. We thereby
propose to improve the optimization of GANs with score matching regularity
(SMaRt). As for empirical evidence, we first design a toy example to
show that training GANs with the aid of a ground-truth score function can help
reproduce the real data distribution more accurately, and then confirm that our
approach can consistently boost the synthesis performance of various
state-of-the-art GANs on real-world datasets with pre-trained diffusion models
acting as the approximate score function. For instance, when training Aurora on
the ImageNet 64x64 dataset, we manage to improve FID from 8.87 to 7.11, on par
with the performance of one-step consistency model. The source code will be
made public. |
This paper proposes SMaRt, a novel method using score matching as a regularity term during GAN training to address the gradient vanishing issue and enhance image generation quality. |
Gradient vanishing in GANs, often caused by generated data manifolds not fully aligning with real data manifolds, limits GAN performance and downstream applications. SMaRt aims to solve this by guiding generated samples towards the real data manifold. |
SMaRt leverages pre-trained diffusion models and incorporates a score matching loss term into the GAN objective function. This loss encourages the generator to produce samples that are more consistent with the real data distribution. |
SMaRt consistently improves the performance of various state-of-the-art GANs on diverse datasets, including CIFAR10, LSUN Bedroom, and ImageNet.
SMaRt effectively addresses the gradient vanishing issue by persistently providing gradients to push generated samples towards the real data manifold.
The paper demonstrates the effectiveness of SMaRt on a toy example with discrete data distribution, showcasing its ability to handle discreteness better than previous methods. |
The optimal choice of hyperparameters, such as the timestep interval and loss weight for score matching, is currently determined empirically and requires further investigation.
Despite employing a lazy strategy, the additional score matching computation in SMaRt slightly increases training time compared to baseline GANs. |
generative adversarial networks (gans), score matching, diffusion models, gradient vanishing, image generation |
2311.18159
Report |
Compact3D: Compressing Gaussian Splat Radiance Field Models with Vector Quantization |
KL Navaneet, Kossar Pourahmadi Meibodi, Soroush Abbasi Koohpayegani, Hamed Pirsiavash |
3D Gaussian Splatting is a new method for modeling and rendering 3D radiance
fields that achieves much faster learning and rendering time compared to SOTA
NeRF methods. However, it comes with a drawback in the much larger storage
demand compared to NeRF methods since it needs to store the parameters for
several 3D Gaussians. We notice that many Gaussians may share similar
parameters, so we introduce a simple vector quantization method based on
the k-means algorithm to quantize the Gaussian parameters. Then, we store the small
codebook along with the index of the code for each Gaussian. Moreover, we
compress the indices further by sorting them and using a method similar to
run-length encoding. We do extensive experiments on standard benchmarks as well
as a new benchmark which is an order of magnitude larger than the standard
benchmarks. We show that our simple yet effective method can reduce the storage
cost for the original 3D Gaussian Splatting method by a factor of almost
$20\times$ with a very small drop in the quality of rendered images. |
This paper introduces a vector quantization method based on the k-means algorithm to compress 3D Gaussian Splatting (3DGS) models for efficient storage and rendering of 3D radiance fields. |
3DGS offers fast learning and rendering compared to NeRF methods, but requires significantly more storage. This work addresses this limitation, making 3DGS practical for applications with storage constraints, such as edge devices. |
The method quantizes the parameters of 3D Gaussians by grouping similar parameters and clustering them independently using k-means. It stores a codebook and indices for each Gaussian, enabling significant storage reduction. Further compression is achieved by sorting Gaussians and employing run-length encoding. |
The compressed model (CompGS) reduces storage size by a factor of nearly 20x compared to the original 3DGS while maintaining comparable quality to state-of-the-art NeRF approaches.
CompGS preserves the real-time rendering capabilities of 3DGS.
Experiments on a newly introduced large-scale benchmark (ARKit) demonstrate the effectiveness of CompGS in compressing large indoor scenes for potential VR applications. |
The compression method introduces computational overhead during training due to the k-means clustering.
Future work could explore faster k-means implementations or alternative quantization techniques to further reduce training time. |
3d gaussian splatting, radiance fields, vector quantization, model compression, novel view synthesis |
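The quantize-then-encode idea in the Compact3D entry above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`nearest`, `kmeans`, `run_length_encode`, `run_length_decode`), the toy 2-D parameter vectors, and the fixed iteration count are all hypothetical choices made for the example.

```python
import random


def nearest(vec, codebook):
    """Index of the code closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(vec, codebook[c])),
    )


def kmeans(vectors, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns (codebook, per-vector code index)."""
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        assign = [nearest(v, codebook) for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:  # keep the old code if a cluster is empty
                codebook[c] = [sum(col) / len(members) for col in zip(*members)]
    return codebook, [nearest(v, codebook) for v in vectors]


def run_length_encode(sorted_indices):
    """Encode a sorted list of code indices as [index, count] runs."""
    runs = []
    for idx in sorted_indices:
        if runs and runs[-1][0] == idx:
            runs[-1][1] += 1
        else:
            runs.append([idx, 1])
    return runs


def run_length_decode(runs):
    """Inverse of run_length_encode."""
    return [idx for idx, count in runs for _ in range(count)]


# Toy "Gaussian parameters": two tight groups of similar vectors (assumed data).
params = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
codebook, indices = kmeans(params, k=2)
runs = run_length_encode(sorted(indices))
```

Only the small codebook and the run-length-encoded indices would need to be stored; each Gaussian's parameters are then looked up from the codebook at load time. Sorting here is applied to the index list alone, whereas a real pipeline would have to reorder the Gaussians themselves consistently.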
2311.18068
Report |
ALSTER: A Local Spatio-Temporal Expert for Online 3D Semantic Reconstruction |
Silvan Weder, Francis Engelmann, Johannes L. Schönberger, Akihito Seki, Marc Pollefeys, Martin R. Oswald |
We propose an online 3D semantic segmentation method that incrementally
reconstructs a 3D semantic map from a stream of RGB-D frames. Unlike offline
methods, ours is directly applicable to scenarios with real-time constraints,
such as robotics or mixed reality. To overcome the inherent challenges of
online methods, we make two main contributions. First, to effectively extract
information from the input RGB-D video stream, we jointly estimate geometry and
semantic labels per frame in 3D. A key focus of our approach is to reason about
semantic entities both in the 2D input and the local 3D domain to leverage
differences in spatial context and network architectures. Our method predicts
2D features using an off-the-shelf segmentation network. The extracted 2D
features are refined by a lightweight 3D network to enable reasoning about the
local 3D structure. Second, to efficiently deal with an infinite stream of
input RGB-D frames, a subsequent network serves as a temporal expert predicting
the incremental scene updates by leveraging 2D, 3D, and past information in a
learned manner. These updates are then integrated into a global scene
representation. Using these main contributions, our method can enable scenarios
with real-time constraints and can scale to arbitrary scene sizes by processing
and updating the scene only in a local region defined by the new measurement.
Our experiments demonstrate improved results compared to existing online
methods that purely operate in local regions and show that complementary
sources of information can boost the performance. We provide a thorough
ablation study on the benefits of different architectural as well as
algorithmic design decisions. Our method yields competitive results on the
popular ScanNet benchmark and SceneNN dataset. |
This paper proposes an online 3D semantic segmentation method that incrementally reconstructs a semantically enriched 3D map from a stream of RGB-D frames. |
Online 3D semantic reconstruction is crucial for real-time applications like robotics and mixed reality, where agents need to interact with their environment without prior knowledge. |
The method uses a three-stage pipeline: a 2D encoder extracts features from RGB-D images, a 3D encoder incorporates geometric information, and a novel spatio-temporal expert network fuses 2D, 3D, and past scene information to update a learned 3D scene representation. |
The method achieves state-of-the-art results among online local reconstruction methods on the ScanNet and SceneNN benchmarks.
The proposed temporal expert network effectively combines 2D and 3D information, leading to improved semantic segmentation compared to using either source alone.
The approach is memory and computationally efficient, making it suitable for real-time applications on devices with limited resources. |
The 2D encoder is identified as the main bottleneck for real-time performance.
Future work could explore alternative 2D encoders or optimization strategies to improve runtime speed further. |
3d semantic segmentation, online reconstruction, rgb-d vision, spatio-temporal expert network, learned scene representation |
2311.17977
Report |
GaussianShader: 3D Gaussian Splatting with Shading Functions for Reflective Surfaces |
Yingwenqi Jiang, Jiadong Tu, Yuan Liu, Xifeng Gao, Xiaoxiao Long, Wenping Wang, Yuexin Ma |
The advent of neural 3D Gaussians has recently brought about a revolution in
the field of neural rendering, facilitating the generation of high-quality
renderings at real-time speeds. However, the explicit and discrete
representation encounters challenges when applied to scenes featuring
reflective surfaces. In this paper, we present GaussianShader, a novel method
that applies a simplified shading function on 3D Gaussians to enhance the
neural rendering in scenes with reflective surfaces while preserving the
training and rendering efficiency. The main challenge in applying the shading
function lies in the accurate normal estimation on discrete 3D Gaussians.
Specifically, we propose a novel normal estimation framework based on the
shortest axis directions of 3D Gaussians, with a carefully designed loss that
enforces consistency between the normals and the geometry of the Gaussian
spheres. Experiments show that GaussianShader strikes a commendable balance
between efficiency and visual quality. Our method surpasses Gaussian Splatting
in PSNR on specular object datasets, exhibiting an improvement of 1.57dB. When
compared to prior works handling reflective surfaces, such as Ref-NeRF, our
optimization time is significantly accelerated (23h vs. 0.58h). Please click on
our project website to see more results. |
GaussianShader enhances neural rendering in scenes with reflective surfaces using a simplified shading function on 3D Gaussians, improving visual quality while preserving training and rendering efficiency. |
Existing neural rendering methods struggle with reflective surfaces: NeRF methods are computationally expensive, and while 3D Gaussian Splatting is efficient, it lacks explicit appearance modeling, hindering realism in scenes with reflective surfaces. |
The method incorporates a simplified shading function considering diffuse colors and direct reflections, with a residual term for complex reflections. It introduces a novel normal estimation framework based on the shortest axis direction of 3D Gaussians and a normal-geometry consistency loss to ensure accurate normal estimation on discrete Gaussian spheres. |
GaussianShader surpasses Gaussian Splatting in PSNR on specular object datasets by 1.57dB.
It achieves comparable visual quality to methods like Ref-NeRF and ENVIDR while significantly reducing optimization time (0.58h vs. 23h for Ref-NeRF).
The method maintains real-time rendering capabilities, making it suitable for interactive applications. |
The method's performance on highly complex lighting scenarios with intricate indirect illumination might be limited.
Future work could explore incorporating more sophisticated BRDF models for increased realism. |
neural rendering, 3d gaussian splatting, reflective surfaces, normal estimation, shading function |
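The shortest-axis idea behind the normal estimation in the GaussianShader entry above can be illustrated as follows. This is a hedged sketch only: the function name `shortest_axis_normal`, the argument layout, and the rule for flipping the normal toward the camera are assumptions for illustration, and the sketch omits any learned refinement as well as the normal-geometry consistency loss.

```python
def shortest_axis_normal(scales, rotation, to_camera):
    """Estimate a normal for one 3D Gaussian from its shortest axis.

    scales:    (sx, sy, sz) axis lengths of the Gaussian.
    rotation:  3x3 rotation matrix as nested lists; column i is the
               world-space direction of local axis i.
    to_camera: vector from the Gaussian centre toward the camera.
    """
    # A flat Gaussian hugging a surface is thinnest along the surface normal.
    i = min(range(3), key=lambda a: scales[a])
    normal = [rotation[r][i] for r in range(3)]
    # The axis direction is sign-ambiguous; pick the side facing the camera.
    facing = sum(n * v for n, v in zip(normal, to_camera))
    if facing < 0:
        normal = [-n for n in normal]
    return normal


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
n = shortest_axis_normal((1.0, 0.1, 2.0), identity, (0.0, -1.0, 0.0))
```

In this toy case the Gaussian is thinnest along its local y axis, so the estimate is the y direction, flipped to point toward the camera.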
2311.17971
Report |
GeoDream: Disentangling 2D and Geometric Priors for High-Fidelity and Consistent 3D Generation |
Baorui Ma, Haoge Deng, Junsheng Zhou, Yu-Shen Liu, Tiejun Huang, Xinlong Wang |
Text-to-3D generation by distilling pretrained large-scale text-to-image
diffusion models has shown great promise but still suffers from inconsistent 3D
geometric structures (Janus problems) and severe artifacts. The aforementioned
problems mainly stem from 2D diffusion models lacking 3D awareness during the
lifting. In this work, we present GeoDream, a novel method that incorporates
explicit generalized 3D priors with 2D diffusion priors to enhance the
capability of obtaining unambiguous 3D consistent geometric structures without
sacrificing diversity or fidelity. Specifically, we first utilize a multi-view
diffusion model to generate posed images and then construct cost volume from
the predicted image, which serves as native 3D geometric priors, ensuring
spatial consistency in 3D space. Subsequently, we further propose to harness 3D
geometric priors to unlock the great potential of 3D awareness in 2D diffusion
priors via a disentangled design. Notably, disentangling 2D and 3D priors
allows us to refine 3D geometric priors further. We justify that the refined 3D
geometric priors aid in the 3D-aware capability of 2D diffusion priors, which
in turn provides superior guidance for the refinement of 3D geometric priors.
Our numerical and visual comparisons demonstrate that GeoDream generates more
3D consistent textured meshes with high-resolution realistic renderings (i.e.,
1024 $\times$ 1024) and adheres more closely to semantic coherence. |
This paper presents GeoDream, a novel text-to-3D generation method that incorporates explicit 3D priors with 2D diffusion priors to improve 3D consistency and reduce artifacts, especially for asymmetric structures (Janus problems). |
Current text-to-3D generation methods based on large-scale text-to-image diffusion models struggle to generate 3D consistent structures, particularly for asymmetric shapes, due to the lack of 3D awareness in 2D diffusion models. |
GeoDream utilizes a multi-view diffusion model to generate posed images and constructs a cost volume representing 3D geometric priors. This volume, combined with a disentangled design for incorporating 2D diffusion priors, refines the 3D geometry and texture, resulting in high-fidelity textured meshes. |
GeoDream generates 3D consistent textured meshes with higher resolution (1024x1024) and realism than previous methods.
The method demonstrates superior semantic coherence, as measured by a newly proposed Uni3D_score metric.
GeoDream adapts to various multi-view diffusion models and benefits from a critical viewpoint sampling strategy for robust cost volume construction. |
The training process of GeoDream is relatively time-consuming, despite optimizations.
Future work includes exploring larger batch sizes and multi-GPU training for faster generation. |
text-to-3d generation, diffusion models, 3d priors, janus problem, cost volume |
2311.17963
Report |
M$^{2}$Chat: Empowering VLM for Multimodal LLM Interleaved Text-Image Generation |
Xiaowei Chi, Rongyu Zhang, Zhengkai Jiang, Yijiang Liu, Yatian Wang, Xingqun Qi, Wenhan Luo, Peng Gao, Shanghang Zhang, Qifeng Liu, Yike Guo |
While current LLM chatbots like GPT-4V bridge the gap between human
instructions and visual representations to enable text-image generations, they
still lack efficient alignment methods for high-fidelity performance on
multiple downstream tasks. In this paper, we propose M$^{2}$Chat, a
novel unified multimodal LLM framework for generating interleaved text-image
conversation across various scenarios. Specifically, we propose an
M$^{3}$Adapter that efficiently integrates granular low-level visual
information and high-level semantic features from multi-modality prompts. Upon
the well-aligned fused feature, M$^{3}$Adapter tailors a learnable gating
strategy to balance the model creativity and consistency across various tasks
adaptively. Moreover, to further enhance the effectiveness of M$^{3}$Adapter
while preserving the coherence of semantic context comprehension, we introduce
a two-stage M$^{3}$FT fine-tuning strategy. This strategy optimizes disjoint
groups of parameters for image-text alignment and visual-instruction
respectively. Extensive experiments demonstrate our M$^{2}$Chat surpasses
state-of-the-art counterparts across diverse benchmarks, showcasing its prowess
in interleaving generation, storytelling, and multimodal dialogue systems. The
demo and code are available at
https://mattie-e.github.io/M2Chat.github.io. |
This paper proposes M²Chat, a novel unified multimodal large language model (LLM) framework that generates interleaved text-image conversations across various scenarios using an M³Adapter. |
Current LLM chatbots lack efficient alignment methods for high-fidelity performance on multiple downstream tasks requiring both creative and consistent text-image generation. |
M²Chat integrates Stable Diffusion XL with LLaMA-AdapterV2. It employs an M³Adapter to integrate visual and semantic features from multimodal prompts and a two-stage M³FT fine-tuning strategy to optimize image-text alignment and visual instruction. |
M²Chat outperforms state-of-the-art models in interleaved generation tasks, showcasing superior quality and semantic consistency.
The M³Adapter effectively aligns visual and semantic features, enhancing text-image congruence and image fidelity.
The two-stage M³FT strategy significantly improves generative quality by optimizing for alignment and instruction following. |
The model's performance relies heavily on the quality and diversity of the training data.
Future work can explore incorporating more modalities, such as audio, to create richer and more engaging conversational experiences. |
multimodal generation, large language models, text-to-image synthesis, interleaved generation, multimodal dialogue |
2311.17957
Report |
HandRefiner: Refining Malformed Hands in Generated Images by Diffusion-based Conditional Inpainting |
Wenquan Lu, Yufei Xu, Jing Zhang, Chaoyue Wang, Dacheng Tao |
Diffusion models have achieved remarkable success in generating realistic
images but struggle to generate accurate human hands, often producing incorrect
finger counts or irregular shapes. This difficulty arises from the complex task
of learning the physical structure and pose of hands from training images,
which involves extensive deformations and occlusions. For correct hand
generation, our paper introduces a lightweight post-processing solution called
HandRefiner. HandRefiner employs a conditional inpainting approach
to rectify malformed hands while leaving other parts of the image untouched. We
leverage the hand mesh reconstruction model that consistently adheres to the
correct number of fingers and hand shape, while also being capable of fitting
the desired hand pose in the generated image. Given a generated failed image
due to malformed hands, we utilize ControlNet modules to re-inject such correct
hand information. Additionally, we uncover a phase transition phenomenon within
ControlNet as we vary the control strength. It enables us to take advantage of
more readily available synthetic data without suffering from the domain gap
between realistic and synthetic hands. Experiments demonstrate that HandRefiner
can significantly improve the generation quality quantitatively and
qualitatively. The code is available at
https://github.com/wenquanlu/HandRefiner . |
This paper introduces HandRefiner, a lightweight post-processing solution to rectify malformed hands in images generated by diffusion models, without needing to retrain the models. |
Diffusion models struggle to generate realistic human hands due to their complex structure and occlusion variations, leading to incorrect finger counts and irregular shapes. |
HandRefiner uses a hand mesh reconstruction model to estimate hand depth maps from generated images. These maps are then used as guidance within a ControlNet-based inpainting pipeline to reconstruct hand regions. |
HandRefiner significantly improves hand generation quality, as evidenced by improved FID/KID scores and user study results.
The paper identifies a phase transition phenomenon within ControlNet when varying control strength, allowing for effective use of synthetic data during training.
Fine-tuning with synthetic data, incorporating negative prompts, and using an inpainting loss all contribute to improved performance. |
HandRefiner currently faces challenges in generating interacting hands due to mesh reconstruction difficulties and training data limitations.
Future work could explore expanding HandRefiner's compatibility with larger diffusion models and addressing limitations related to generating small hands and interacting hands. |
diffusion models, hand generation, image inpainting, controlnet, synthetic data |
2311.17953
Report |
Rethinking Image Editing Detection in the Era of Generative AI Revolution |
Zhihao Sun, Haipeng Fang, Xinying Zhao, Danding Wang, Juan Cao |
The accelerated advancement of generative AI significantly enhances the
viability and effectiveness of generative regional editing methods. This
evolution renders image manipulation more accessible, thereby intensifying
the risk of altering the conveyed information within original images and even
propagating misinformation. Consequently, there exists a critical demand for
robust methods capable of detecting the edited images. However, the lack of
a comprehensive dataset containing images edited with abundant and advanced
generative regional editing methods poses a substantial obstacle to the
advancement of corresponding detection methods.
We endeavor to fill the vacancy by constructing the GRE dataset, a
large-scale generative regional editing dataset with the following advantages:
1) Collection of real-world original images, focusing on two frequently edited
scenarios. 2) Integration of a logical and simulated editing pipeline,
leveraging multiple large models in various modalities. 3) Inclusion of various
editing approaches with distinct architectures. 4) Provision of comprehensive
analysis tasks. We perform comprehensive experiments with the three proposed tasks:
edited image classification, edited method attribution and edited region
localization, providing analysis of distinct editing methods and evaluation of
detection methods in related fields. We expect that the GRE dataset can promote
further research and exploration in the field of generative region editing
detection. |
The paper introduces GRE, a large-scale dataset for detecting and analyzing generative regional editing in images. |
Generative AI advancements increase the risk of malicious image manipulation. Existing datasets lack comprehensiveness for detecting edits from advanced generative methods, hindering detection development. |
The authors collect real-world images and build a simulated editing pipeline using large models (e.g., ChatGPT, Stable Diffusion) to generate logically coherent edited images with various methods (GAN-based, diffusion-based, black-box). They provide annotations for classification, attribution, and localization tasks. |
Models trained on seen editing methods show good generalization on unseen ones for classification but struggle with generalization on localization.
Attribution of GAN-based methods is easier than diffusion-based methods.
Generative editing creates less perceptible edits, making detection challenging and highlighting GRE dataset's value. |
The paper focuses on known generative methods, leaving room for exploring detection against unknown methods.
Future work includes incorporating new editing methods and large models into the pipeline. |
generative ai, image manipulation detection, dataset, regional editing, deep learning |
2311.17946
Report |
DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback |
Jiao Sun, Deqing Fu, Yushi Hu, Su Wang, Royi Rassin, Da-Cheng Juan, Dana Alon, Charles Herrmann, Sjoerd van Steenkiste, Ranjay Krishna, Cyrus Rashtchian |
Despite their wide-spread success, Text-to-Image models (T2I) still struggle
to produce images that are both aesthetically pleasing and faithful to the
user's input text. We introduce DreamSync, a model-agnostic training algorithm
by design that improves T2I models to be faithful to the text input. DreamSync
builds off a recent insight from TIFA's evaluation framework -- that large
vision-language models (VLMs) can effectively identify the fine-grained
discrepancies between generated images and the text inputs. DreamSync uses this
insight to train T2I models without any labeled data; it improves T2I models
using its own generations. First, it prompts the model to generate several
candidate images for a given input text. Then, it uses two VLMs to select the
best generation: a Visual Question Answering model that measures the alignment
of generated images to the text, and another that measures the generation's
aesthetic quality. After selection, we use LoRA to iteratively finetune the T2I
model to guide its generation towards the selected best generations. DreamSync
does not need any additional human annotation, model architecture changes, or
reinforcement learning. Despite its simplicity, DreamSync improves both the
semantic alignment and aesthetic appeal of two diffusion-based T2I models,
evidenced by multiple benchmarks (+1.7% on TIFA, +2.9% on DSG1K, +3.4% on VILA
aesthetic) and human evaluation. |
This paper introduces DreamSync, a model-agnostic training algorithm that enhances the faithfulness and aesthetic quality of text-to-image generation models. |
Existing text-to-image models struggle to produce images that are both aesthetically pleasing and faithful to the user's input text. This framework addresses these challenges in a model-agnostic way without human feedback. |
DreamSync uses vision-language models (VLMs) to evaluate and select the best generated images for fine-tuning the text-to-image generation model. It iteratively refines the model by generating multiple candidate images, having VLMs select the best based on faithfulness and aesthetics, and then fine-tuning on the selected images using LoRA. |
DreamSync significantly improves both the semantic alignment and aesthetic quality of two diffusion-based T2I models, SDXL and SD v1.4, as evidenced by multiple benchmarks (+1.7% on TIFA, +2.9% on DSG1K, +3.4% on VILA aesthetic).
DreamSync outperforms existing state-of-the-art alignment methods on both TIFA and DSG benchmarks while maintaining high visual appeal, as measured by VILA and human evaluation.
Human evaluation on SDXL shows that DreamSync consistently improves image alignment across all categories in the DSG benchmark. |
The performance of DreamSync is limited by the pre-trained model it starts with, as certain complex compositions or attributes might not be adequately addressed.
Occasional decline in texture details and shadows is observed in some generated images after applying DreamSync, indicating room for further quality improvement. |
text-to-image generation, image faithfulness, vision-language models, model-agnostic training, iterative bootstrapping |
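The generate, score, and select loop summarized in the entry above can be sketched as below. This is an illustrative outline under stated assumptions rather than the published procedure: `generate_fn`, `faithfulness_fn`, and `aesthetic_fn` are hypothetical stand-ins for the text-to-image model and the two VLM scorers, and the threshold-then-argmax selection rule is one plausible way to combine the two scores. The LoRA fine-tuning step on the resulting pairs is not shown.

```python
def select_best(candidates, faithfulness_fn, aesthetic_fn,
                min_faithfulness=0.0, min_aesthetic=0.0):
    """Return the most faithful candidate that clears both thresholds, or None."""
    scored = [(faithfulness_fn(c), aesthetic_fn(c), c) for c in candidates]
    kept = [s for s in scored
            if s[0] >= min_faithfulness and s[1] >= min_aesthetic]
    if not kept:
        return None
    # Rank by faithfulness first; use the aesthetic score to break ties.
    best = max(kept, key=lambda s: (s[0], s[1]))
    return best[2]


def build_finetuning_set(prompts, generate_fn, faithfulness_fn, aesthetic_fn,
                         num_candidates=4, **thresholds):
    """Collect (prompt, best image) pairs to fine-tune on in the next round."""
    pairs = []
    for prompt in prompts:
        candidates = [generate_fn(prompt, i) for i in range(num_candidates)]
        best = select_best(
            candidates,
            lambda img: faithfulness_fn(prompt, img),
            aesthetic_fn,
            **thresholds,
        )
        if best is not None:  # prompts with no acceptable image are skipped
            pairs.append((prompt, best))
    return pairs


# Dummy stand-ins so the sketch runs without any model (assumed for illustration).
def fake_generate(prompt, i):
    return (prompt, i)

def fake_faithfulness(prompt, img):
    return img[1] / 4

def fake_aesthetic(img):
    return 1.0 - img[1] / 4

pairs = build_finetuning_set(
    ["a cat"], fake_generate, fake_faithfulness, fake_aesthetic,
    min_faithfulness=0.5, min_aesthetic=0.5,
)
```

Repeating this loop with the fine-tuned model as the new generator gives the iterative bootstrapping behaviour described in the entry.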
2311.17944
Report |
LALM: Long-Term Action Anticipation with Language Models |
Sanghwan Kim, Daoji Huang, Yongqin Xian, Otmar Hilliges, Luc Van Gool, Xi Wang |
Understanding human activity is a crucial yet intricate task in egocentric
vision, a field that focuses on capturing visual perspectives from the camera
wearer's viewpoint. While traditional methods heavily rely on representation
learning trained on extensive video data, there exists a significant
limitation: obtaining effective video representations proves challenging due to
the inherent complexity and variability in human activities. Furthermore,
exclusive dependence on video-based learning may constrain a model's capability
to generalize across long-tail classes and out-of-distribution scenarios.
In this study, we introduce a novel approach for long-term action
anticipation using language models (LALM), adept at addressing the complex
challenges of long-term activity understanding without the need for extensive
training. Our method incorporates an action recognition model to track previous
action sequences and a vision-language model to articulate relevant
environmental details. By leveraging the context provided by these past events,
we devise a prompting strategy for action anticipation using large language
models (LLMs). Moreover, we implement Maximal Marginal Relevance for example
selection to facilitate in-context learning of the LLMs. Our experimental
results demonstrate that LALM surpasses the state-of-the-art methods in the
task of long-term action anticipation on the Ego4D benchmark. We further
validate LALM on two additional benchmarks, affirming its capacity for
generalization across intricate activities with different sets of taxonomies.
These are achieved without specific fine-tuning. |
This paper introduces a novel framework, LALM, for long-term action anticipation in egocentric videos leveraging pre-trained vision-language and large language models (LLMs). |
Understanding human activity from an egocentric viewpoint is crucial for applications like user-assistance systems and patient monitoring. Existing methods struggle with the complexity and variability of human actions and often lack generalizability. LALM addresses these challenges by leveraging the power of LLMs. |
LALM utilizes an action recognition model to track past actions and a vision-language model to describe the visual context. This information is then used to construct prompts for an LLM, which predicts future actions. The LLM leverages in-context learning with exemplars selected using Maximal Marginal Relevance (MMR) for improved generalization. |
LALM surpasses state-of-the-art methods on the Ego4D benchmark for long-term action anticipation.
The method demonstrates strong generalization capabilities, achieving competitive results on EK-55 and EGTEA datasets without fine-tuning.
Ablation studies highlight the importance of accurate action recognition, effective image captioning, and appropriate prompt design for optimal performance. |
The reliance on accurate past action descriptions poses a limitation, as errors in action recognition propagate to the prediction stage.
Future work can explore incorporating temporal information and advanced LLM prompting techniques, such as chain-of-thought prompting, for enhanced performance. |
egocentric vision, action anticipation, large language models, vision-language models, in-context learning |
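The example selection step in the entry above relies on Maximal Marginal Relevance. A minimal sketch of greedy MMR follows, assuming that the query and the candidate examples are already embedded as vectors and that cosine similarity is the similarity measure; the function names, the `lam` trade-off parameter, and the embedding choice are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between each row of a and each row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def mmr_select(query, candidates, k, lam=0.5):
    """Greedy MMR: return k candidate indices balancing relevance and diversity.

    query:      (d,) embedding of the current prompt context (assumed given)
    candidates: (n, d) embeddings of the pool of in-context examples
    lam:        1.0 means pure relevance, 0.0 means pure diversity
    """
    relevance = cosine_sim(query[None, :], candidates)[0]  # (n,)
    pairwise = cosine_sim(candidates, candidates)           # (n, n)
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        if selected:
            # Redundancy: highest similarity to anything already chosen.
            redundancy = pairwise[np.ix_(remaining, selected)].max(axis=1)
        else:
            redundancy = np.zeros(len(remaining))
        scores = lam * relevance[remaining] - (1.0 - lam) * redundancy
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a small `lam`, a near-duplicate of an already selected example is skipped in favor of a less similar one, which is the behavior that makes MMR useful for assembling varied in-context exemplars.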
2311.17937
Report |
Unlocking Spatial Comprehension in Text-to-Image Diffusion Models |
Mohammad Mahdi Derakhshani, Menglin Xia, Harkirat Behl, Cees G. M. Snoek, Victor Rühle |
We propose CompFuser, an image generation pipeline that enhances spatial
comprehension and attribute assignment in text-to-image generative models. Our
pipeline enables the interpretation of instructions defining spatial
relationships between objects in a scene, such as "An image of a gray cat on
the left of an orange dog", and generates corresponding images. This is
especially important in order to provide more control to the user. CompFuser
overcomes the limitation of existing text-to-image diffusion models by decoding
the generation of multiple objects into iterative steps: first generating a
single object and then editing the image by placing additional objects in their
designated positions. To create training data for spatial comprehension and
attribute assignment, we introduce a synthetic data generation process that
leverages a frozen large language model and a frozen layout-based diffusion
model for object placement. We compare our approach to strong baselines and
show that our model outperforms state-of-the-art image generation models in
spatial comprehension and attribute assignment, despite being 3x to 5x smaller
in parameters. |
This paper introduces CompFuser, an image generation pipeline enhancing spatial comprehension and attribute assignment in text-to-image models by iteratively adding objects to a scene based on their relative positions. |
Existing text-to-image models struggle with accurately representing spatial relationships between multiple objects, limiting user control over image generation. |
The pipeline uses a large language model (LLM) to decode instructions into multiple generation steps, starting with a single object and iteratively adding others. A synthetic dataset, created using an LLM and layout-based diffusion model, trains the model to understand spatial relationships. |
CompFuser outperforms state-of-the-art models in spatial comprehension and attribute assignment.
The model achieves significantly higher accuracy in placing objects according to textual instructions.
Despite being 3x to 5x smaller in parameters, CompFuser demonstrates superior performance compared to larger counterparts. |
The model is currently limited to two objects and left/right relationships.
The LLM used for layout generation lacks a deep understanding of geometry, potentially limiting the complexity of generated scenes. |
text-to-image generation, spatial reasoning, attribute assignment, image editing, diffusion models |
2311.17921
Report |
Do text-free diffusion models learn discriminative visual representations? |
Soumik Mukhopadhyay, Matthew Gwilliam, Yosuke Yamaguchi, Vatsal Agarwal, Namitha Padmanabhan, Archana Swaminathan, Tianyi Zhou, Abhinav Shrivastava |
While many unsupervised learning models focus on one family of tasks, either
generative or discriminative, we explore the possibility of a unified
representation learner: a model which addresses both families of tasks
simultaneously. We identify diffusion models, a state-of-the-art method for
generative tasks, as a prime candidate. Such models involve training a U-Net to
iteratively predict and remove noise, and the resulting model can synthesize
high-fidelity, diverse, novel images. We find that the intermediate feature
maps of the U-Net are diverse, discriminative feature representations. We
propose a novel attention mechanism for pooling feature maps and further
leverage this mechanism as DifFormer, a transformer feature fusion of features
from different diffusion U-Net blocks and noise steps. We also develop DifFeed,
a novel feedback mechanism tailored to diffusion. We find that diffusion models
are better than GANs, and, with our fusion and feedback mechanisms, can compete
with state-of-the-art unsupervised image representation learning methods for
discriminative tasks - image classification with full and semi-supervision,
transfer for fine-grained classification, object detection and segmentation,
and semantic segmentation. Our project website
(https://mgwillia.github.io/diffssl/) and code
(https://github.com/soumik-kanad/diffssl) are available publicly. |
This paper demonstrates that diffusion models, known for generative tasks, can also learn discriminative visual representations suitable for recognition tasks, making them promising candidates for unified self-supervised representation learning. |
Unified representation learning is important because it allows a single model to be used for various downstream tasks like image recognition, reconstruction, and synthesis, eliminating the need for training separate models for each task. |
The authors analyze diffusion model embeddings, propose an Attention head for feature pooling, introduce DifFormer (a transformer-based feature fusion method combining features from different diffusion U-Net blocks and noise steps), and develop DifFeed (a feedback mechanism for diffusion features). |
Diffusion models outperform GANs in image classification and generation.
The discriminative power of diffusion features is distributed across network blocks, noise time steps, and feature resolutions, requiring intelligent fusion strategies.
Proposed methods (Attention head, DifFormer, DifFeed) significantly improve diffusion model performance on ImageNet classification, semi-supervised learning, fine-grained visual classification, and semantic segmentation. |
While competitive, the proposed methods' performance on semantic segmentation doesn't surpass state-of-the-art methods like MAE (ViT-L).
Exploration of diffusion features for object detection and instance segmentation is limited due to training costs. |
diffusion models, self-supervised learning, unified representation learning, image classification, semantic segmentation |
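The entry above mentions an attention head for pooling U-Net feature maps. The sketch below shows one generic way to pool a feature map with a single attention query; the single-query design, the projection matrices `w_k` and `w_v`, and all names are assumptions for illustration and are not taken from the paper's architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(feature_map, query, w_k, w_v):
    """Pool a (C, H, W) feature map into one (D,) vector with one attention query.

    query: (D,) pooling query (would be learned; assumed given here)
    w_k:   (C, D) key projection
    w_v:   (C, D) value projection
    """
    c, h, w = feature_map.shape
    tokens = feature_map.reshape(c, h * w).T            # (H*W, C), one token per location
    keys = tokens @ w_k                                  # (H*W, D)
    values = tokens @ w_v                                # (H*W, D)
    scores = keys @ query / np.sqrt(query.shape[0])      # (H*W,)
    weights = softmax(scores)                            # attention over spatial locations
    return weights @ values                              # (D,)
```

With an all-zero query the attention weights are uniform and the result reduces to global average pooling of the projected values, so average pooling is a special case of this head.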
2311.17917
Report |
AvatarStudio: High-fidelity and Animatable 3D Avatar Creation from Text |
Jianfeng Zhang, Xuanmeng Zhang, Huichao Zhang, Jun Hao Liew, Chenxu Zhang, Yi Yang, Jiashi Feng |
We study the problem of creating high-fidelity and animatable 3D avatars from
only textual descriptions. Existing text-to-avatar methods are either limited
to static avatars which cannot be animated or struggle to generate animatable
avatars with promising quality and precise pose control. To address these
limitations, we propose AvatarStudio, a coarse-to-fine generative model that
generates explicit textured 3D meshes for animatable human avatars.
Specifically, AvatarStudio begins with a low-resolution NeRF-based
representation for coarse generation, followed by incorporating SMPL-guided
articulation into the explicit mesh representation to support avatar animation
and high resolution rendering. To ensure view consistency and pose
controllability of the resulting avatars, we introduce a 2D diffusion model
conditioned on DensePose for Score Distillation Sampling supervision. By
effectively leveraging the synergy between the articulated mesh representation
and the DensePose-conditional diffusion model, AvatarStudio can create
high-quality avatars from text that are ready for animation, significantly
outperforming previous methods. Moreover, it is competent for many
applications, e.g., multimodal avatar animations and style-guided avatar
creation. For more results, please refer to our project page:
http://jeff95.me/projects/avatarstudio.html |
This paper proposes AvatarStudio, a coarse-to-fine generative model that creates high-fidelity, animatable 3D avatars from text descriptions. |
Existing text-to-avatar methods are limited to static avatars or struggle to produce high-quality animatable avatars with accurate pose control. This limits practical applications that require realistic and controllable digital humans. |
The method uses a two-stage approach: 1) a low-resolution NeRF representation for coarse generation and 2) optimization of a SMPL-guided articulated textured mesh for high-resolution rendering. It leverages a DensePose-conditioned ControlNet for Score Distillation Sampling (SDS) to ensure view consistency and pose accuracy. |
AvatarStudio generates significantly higher-quality avatars with finer details compared to previous state-of-the-art methods.
It supports multimodal animation, allowing users to control avatar motion through videos or text descriptions.
By incorporating an adapter, it enables the creation of avatars with unique artistic styles guided by reference images. |
Currently, AvatarStudio does not support fine-grained facial expressions.
The avatar generation process could be made more efficient as it currently takes around 2.5 hours. |
3d avatar generation, text-to-3d, animatable avatars, densepose guidance, score distillation sampling |
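As a reminder of what Score Distillation Sampling optimizes in the entry above, the gradient in its standard form (as introduced with DreamFusion) can be written as below. Here $\theta$ denotes the avatar parameters, $x = g(\theta)$ a rendered image, $y$ the text prompt, $\hat{\epsilon}_\phi$ the frozen diffusion model's noise prediction, and $w(t)$ a timestep weighting. Writing the DensePose condition as an extra argument $c$ is an assumption of this sketch, not notation taken from the paper.

```latex
x_t = \alpha_t x + \sigma_t \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y, c, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right]
```

The gradient is backpropagated only through the renderer $g$, not through the diffusion model.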
2311.17907
Report |
CG3D: Compositional Generation for Text-to-3D via Gaussian Splatting |
Alexander Vilesov, Pradyumna Chari, Achuta Kadambi |
With the onset of diffusion-based generative models and their ability to
generate text-conditioned images, content generation has received a massive
invigoration. Recently, these models have been shown to provide useful guidance
for the generation of 3D graphics assets. However, existing work in
text-conditioned 3D generation faces fundamental constraints: (i) inability to
generate detailed, multi-object scenes, (ii) inability to textually control
multi-object configurations, and (iii) physically realistic scene composition.
In this work, we propose CG3D, a method for compositionally generating scalable
3D assets that resolves these constraints. We find that explicit Gaussian
radiance fields, parameterized to allow for compositions of objects, possess
the capability to enable semantically and physically consistent scenes. By
utilizing a guidance framework built around this explicit representation, we
show state of the art results, capable of even exceeding the guiding diffusion
model in terms of object combinations and physics accuracy. |
Proposes CG3D, a method for generating scalable and composable 3D scenes from text prompts using explicit Gaussian radiance fields. |
Addresses limitations of existing text-to-3D methods in generating detailed multi-object scenes with controllable configurations and physically realistic compositions. |
Decomposes scene generation into object generation and interaction parameter estimation, leveraging Gaussian splatting and score distillation sampling from a pre-trained image diffusion model. |
Achieves zero-shot compositional generation of diverse scenes with plausible object poses and scales.
Enables physically realistic compositions through gravity and contact constraints.
Allows for efficient scene editing and object manipulation. |
Assumes rigid-body interactions, limiting its ability to model object deformations.
Performance relies heavily on the quality of guidance from the pre-trained diffusion model, potentially leading to failures in cases of weak guidance or ambiguities. |
text-to-3d, compositional generation, gaussian splatting, score distillation sampling, 3d scene synthesis |
2311.17902
Report |
Language-conditioned Detection Transformer |
Jang Hyun Cho, Philipp Krähenbühl |
We present a new open-vocabulary detection framework. Our framework uses both
image-level labels and detailed detection annotations when available. Our
framework proceeds in three steps. We first train a language-conditioned object
detector on fully-supervised detection data. This detector gets to see the
presence or absence of ground truth classes during training, and conditions
prediction on the set of present classes. We use this detector to pseudo-label
images with image-level labels. Our detector provides much more accurate
pseudo-labels than prior approaches with its conditioning mechanism. Finally,
we train an unconditioned open-vocabulary detector on the pseudo-annotated
images. The resulting detector, named DECOLA, shows strong zero-shot
performance on the open-vocabulary LVIS benchmark as well as on direct zero-shot
transfer benchmarks on LVIS, COCO, Object365, and OpenImages. DECOLA
outperforms prior art by 17.1 AP-rare and 9.4 mAP on the zero-shot LVIS
benchmark. DECOLA achieves state-of-the-art results across various model sizes,
architectures, and datasets by only training on open-sourced data and
academic-scale computing. Code is available at
https://github.com/janghyuncho/DECOLA. |
This paper introduces DECOLA, a transformer-based object detector that conditions its predictions on language embeddings of object categories, enhancing open-vocabulary detection performance. |
Current open-vocabulary detectors struggle to generalize to unseen object categories due to their reliance on fixed vocabularies and limited language integration. DECOLA addresses this by adapting its inner workings to any arbitrary set of concepts represented in language. |
DECOLA utilizes a two-phase approach: 1) Training a language-conditioned object detector on fully-supervised data to generate accurate pseudo-labels for weakly-labeled images. 2) Training an unconditioned open-vocabulary detector on the combined dataset of human-annotated and pseudo-annotated images. |
DECOLA achieves state-of-the-art results on the open-vocabulary LVIS benchmark, outperforming prior art by 17.1 AP-rare and 9.4 mAP on zero-shot LVIS.
The language-conditioning mechanism leads to high-quality pseudo-labels, effectively expanding the training data and improving generalization to unseen categories.
DECOLA demonstrates strong direct zero-shot transfer performance on various benchmarks, including LVIS, COCO, Object365, and OpenImages. |
The performance improvement from ImageNet-21K pretraining is less significant for Deformable DETR compared to CenterNet2.
Future work includes exploring co-training language-conditioning and multi-class prediction in a single phase. |
open-vocabulary detection, language-conditioned detection, self-training, pseudo-labeling, zero-shot transfer |
2311.17891
Report |
Pose Anything: A Graph-Based Approach for Category-Agnostic Pose Estimation |
Or Hirschorn, Shai Avidan |
Traditional 2D pose estimation models are limited by their category-specific
design, making them suitable only for predefined object categories. This
restriction becomes particularly challenging when dealing with novel objects
due to the lack of relevant training data.
To address this limitation, category-agnostic pose estimation (CAPE) was
introduced. CAPE aims to enable keypoint localization for arbitrary object
categories using a single model, requiring minimal support images with
annotated keypoints. This approach not only enables object pose generation
based on arbitrary keypoint definitions but also significantly reduces the
associated costs, paving the way for versatile and adaptable pose estimation
applications.
We present a novel approach to CAPE that leverages the inherent geometrical
relations between keypoints through a newly designed Graph Transformer Decoder.
By capturing and incorporating this crucial structural information, our method
enhances the accuracy of keypoint localization, marking a significant departure
from conventional CAPE techniques that treat keypoints as isolated entities.
We validate our approach on the MP-100 benchmark, a comprehensive dataset
comprising over 20,000 images spanning more than 100 categories. Our method
outperforms the prior state-of-the-art by substantial margins, achieving
remarkable improvements of 2.16% and 1.82% under 1-shot and 5-shot settings,
respectively. Furthermore, our method's end-to-end training demonstrates both
scalability and efficiency compared to previous CAPE approaches. |
This paper introduces a novel category-agnostic pose estimation method that leverages geometrical relationships between keypoints using a Graph Transformer Decoder, improving accuracy in keypoint localization for arbitrary object categories. |
Category-agnostic pose estimation allows a single model to predict keypoints for various object categories, even those unseen during training, which is crucial for real-world applications with novel objects. |
The method employs a Graph Transformer Decoder within a DETR-like architecture. It leverages a pre-trained SwinV2 backbone and removes keypoint order dependency. The decoder uses a graph convolutional network to capture relationships between keypoints, enhancing localization accuracy. |
The method outperforms the prior state of the art on the MP-100 benchmark, improving PCK accuracy by 2.16% and 1.82% under the 1-shot and 5-shot settings, respectively.
The model demonstrates robustness and generalization by effectively handling out-of-distribution images, including cartoons and AI-generated images.
Ablation studies confirm the importance of the graph structure, with performance dropping significantly when random graph connections are used. |
The method's reliance on accurate skeleton definitions might limit its applicability to objects with highly complex or variable structures.
Future work can explore extending this approach to 3D pose estimation. |
pose estimation, category-agnostic, graph neural networks, transformer networks, computer vision |
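The entry above describes a decoder that propagates information along the skeleton graph connecting keypoints. A minimal sketch of one graph-convolution step over keypoint features follows, using the common symmetric-normalized adjacency with self-loops (in the style of Kipf and Welling); this is a swapped-in generic layer, and the function names and the normalization are assumptions rather than the paper's exact design.

```python
import numpy as np

def normalized_adjacency(edges, num_nodes):
    """Symmetric-normalized adjacency D^{-1/2} (A + I) D^{-1/2} for an undirected skeleton."""
    a = np.eye(num_nodes)                 # self-loops keep each keypoint's own feature
    for i, j in edges:
        a[i, j] = 1.0
        a[j, i] = 1.0
    deg = a.sum(axis=1)                   # always >= 1 because of the self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def graph_conv(features, adj_norm, weight):
    """One graph-convolution step: mix neighbor features, project, apply ReLU.

    features: (K, C) one feature vector per keypoint
    adj_norm: (K, K) output of normalized_adjacency
    weight:   (C, C_out) projection (would be learned; assumed given here)
    """
    return np.maximum(adj_norm @ features @ weight, 0.0)
```

For a three-keypoint chain such as `edges = [(0, 1), (1, 2)]`, keypoints 0 and 2 exchange no information in a single step, which illustrates why the choice of skeleton edges matters for what the decoder can share between keypoints.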
2311.17874
Report |
FisherRF: Active View Selection and Uncertainty Quantification for Radiance Fields using Fisher Information |
Wen Jiang, Boshu Lei, Kostas Daniilidis |
This study addresses the challenging problem of active view selection and
uncertainty quantification within the domain of Radiance Fields. Neural
Radiance Fields (NeRF) have greatly advanced image rendering and
reconstruction, but the limited availability of 2D images poses uncertainties
stemming from occlusions, depth ambiguities, and imaging errors. Efficiently
selecting informative views becomes crucial, and quantifying NeRF model
uncertainty presents intricate challenges. Existing approaches either depend on
model architecture or are based on assumptions regarding density distributions
that are not generally applicable. By leveraging Fisher Information, we
efficiently quantify observed information within Radiance Fields without ground
truth data. This can be used for the next best view selection and pixel-wise
uncertainty quantification. Our method overcomes existing limitations on model
architecture and effectiveness, achieving state-of-the-art results in both view
selection and uncertainty quantification, demonstrating its potential to
advance the field of Radiance Fields. Our method with the 3D Gaussian Splatting
backend could perform view selections at 70 fps. |
Presents FisherRF, a novel method for active view selection and uncertainty quantification in Radiance Fields, leveraging Fisher information. |
Efficiently selecting informative views is crucial for NeRF models when only limited 2D images are available due to occlusions, depth ambiguities, and imaging errors. |
Leverages Fisher information to quantify observed information in Radiance Fields and uses it to select the next best view with the highest information gain. Employs approximations and exploits sparsity for efficient computation. |
FisherRF achieves state-of-the-art results in active view selection, outperforming previous methods and random baselines on Blender and Mip-NeRF360 datasets.
Method enables effective batch active view selection, crucial for real-world applications like view planning.
Demonstrates strong performance in pixel-wise uncertainty quantification, achieving superior results compared to state-of-the-art methods on the Light Field dataset. |
Limited to static scenes in confined scenarios.
Future work includes extending the method to large-scale and dynamically changing Radiance Fields. |
radiance fields, active view selection, uncertainty quantification, fisher information, volumetric rendering |
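The view selection idea in the entry above can be illustrated with a toy computation. The sketch assumes a unit-variance Gaussian likelihood on pixel values, so that a view's Fisher information is J^T J for its rendering Jacobian J, and it keeps only the diagonal; the Jacobians are taken as given inputs, and the function names and the `reg` damping term are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def diag_fisher(jacobian):
    """Diagonal of J^T J for a (num_pixels, num_params) rendering Jacobian."""
    return (jacobian ** 2).sum(axis=0)

def select_next_view(train_jacobians, candidate_jacobians, reg=1e-6):
    """Pick the candidate view with the largest expected information gain.

    Gain for a candidate is tr(H_cand @ inv(H_train)) under a diagonal
    approximation, where H_train accumulates the views already observed
    and reg keeps unobserved parameters from causing division by zero.
    """
    h_train = sum(diag_fisher(j) for j in train_jacobians) + reg
    gains = [float((diag_fisher(j) / h_train).sum()) for j in candidate_jacobians]
    return int(np.argmax(gains)), gains
```

A candidate that constrains parameters the training views have barely observed gets a large gain, while a candidate that repeats an existing view gets a gain close to its parameter overlap. No ground-truth image of the candidate view is needed, since only Jacobians enter the score.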
2311.17857
Report |
Gaussian Shell Maps for Efficient 3D Human Generation |
Rameen Abdal, Wang Yifan, Zifan Shi, Yinghao Xu, Ryan Po, Zhengfei Kuang, Qifeng Chen, Dit-Yan Yeung, Gordon Wetzstein |
Efficient generation of 3D digital humans is important in several industries,
including virtual reality, social media, and cinematic production. 3D
generative adversarial networks (GANs) have demonstrated state-of-the-art
(SOTA) quality and diversity for generated assets. Current 3D GAN
architectures, however, typically rely on volume representations, which are
slow to render, thereby hampering the GAN training and requiring
multi-view-inconsistent 2D upsamplers. Here, we introduce Gaussian Shell Maps
(GSMs) as a framework that connects SOTA generator network architectures with
emerging 3D Gaussian rendering primitives using an articulable
multi-shell-based scaffold. In this setting, a CNN generates a 3D texture stack with
features that are mapped to the shells. The latter represent inflated and
deflated versions of a template surface of a digital human in a canonical body
pose. Instead of rasterizing the shells directly, we sample 3D Gaussians on the
shells whose attributes are encoded in the texture features. These Gaussians
are efficiently and differentiably rendered. The ability to articulate the
shells is important during GAN training and, at inference time, to deform a
body into arbitrary user-defined poses. Our efficient rendering scheme bypasses
the need for view-inconsistent upsamplers and achieves high-quality multi-view
consistent renderings at a native resolution of $512 \times 512$ pixels. We
demonstrate that GSMs successfully generate 3D humans when trained on
single-view datasets, including SHHQ and DeepFashion. |
This paper introduces Gaussian Shell Maps (GSMs), an efficient 3D GAN framework that combines CNN-based generators with 3D Gaussian rendering primitives for high-quality, real-time 3D human generation. |
Efficient 3D human generation is crucial for various industries. Existing methods either suffer from slow volume rendering or limited expressiveness. GSMs offer a solution by combining the efficiency of CNNs with the expressiveness of 3D Gaussians. |
GSMs anchor 3D Gaussians to "shells" derived from the SMPL human body template. A CNN generates texture maps encoding Gaussian parameters, enabling efficient rendering via Gaussian splatting. Articulation is achieved by deforming the shells. |
GSMs generate diverse and high-resolution (512x512) 3D humans with realistic clothing and accessories.
The method achieves state-of-the-art rendering speed (125 FPS) without requiring upsampling, eliminating aliasing artifacts.
GSMs outperform competing methods in pose control accuracy while achieving comparable visual quality and diversity. |
The method relies on a parametric deformation model, limiting its ability to handle complex dynamics of hair and loose clothing.
Extracting accurate geometry and normals from the irregular and sparse Gaussians is not straightforward.
Future work includes exploring surface splatting for better geometry extraction and incorporating multi-view data for enhanced realism. |
3d human generation, generative adversarial networks, 3d gaussians, shell maps, real-time rendering |
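The shells in the entry above are described as inflated and deflated versions of a template surface. One simple way to build such shells is to offset the template vertices along their normals, sketched below; the normal-offset construction, the function name, and the offset values are assumptions for illustration and may differ from how the paper constructs its shells.

```python
import numpy as np

def make_shells(vertices, normals, offsets):
    """Build shells by moving template vertices along their unit normals.

    vertices: (N, 3) template surface vertices in the canonical pose
    normals:  (N, 3) per-vertex normals (normalized inside the function)
    offsets:  signed distances; negative deflates, positive inflates
    Returns an array of shape (len(offsets), N, 3).
    """
    normals = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    offsets = np.asarray(offsets, dtype=float)
    return vertices[None, :, :] + offsets[:, None, None] * normals[None, :, :]
```

Because every shell shares the template's vertex indexing, a deformation applied to the template (for example a pose change) can be applied to all shells in the same way.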
2311.17754
Report |
Cinematic Behavior Transfer via NeRF-based Differentiable Filming |
Xuekun Jiang, Anyi Rao, Jingbo Wang, Dahua Lin, Bo Dai |
In the evolving landscape of digital media and video production, the precise
manipulation and reproduction of visual elements like camera movements and
character actions are highly desired. Existing SLAM methods face limitations in
dynamic scenes and human pose estimation often focuses on 2D projections,
neglecting 3D statuses. To address these issues, we first introduce a reverse
filming behavior estimation technique. It optimizes camera trajectories by
leveraging NeRF as a differentiable renderer and refining SMPL tracks. We then
introduce a cinematic transfer pipeline that is able to transfer various shot
types to a new 2D video or a 3D virtual environment. The incorporation of 3D
engine workflow enables superior rendering and control abilities, which also
achieves a higher rating in the user study. |
This paper presents a reverse filming behavior estimation technique for transferring cinematic behavior from movie shots to new 2D/3D content using NeRF and SMPL models, enabling artists to reuse camera trajectories and character movements through a 3D engine-based workflow. |
Precise manipulation and reproduction of camera movements and character actions in video production is crucial for maintaining continuity, style, and mood. Existing methods often struggle with complex dynamic scenes and decoupling human and camera motions. |
The method predicts SMPL tracks and optimizes camera trajectories using NeRF as a differentiable renderer with image-level matching supervision. It refines character movements and applies them to new 2D videos or 3D virtual environments via a 3D engine workflow. |
The approach accurately extracts character movements and camera trajectories from various movie shots, enabling 2D and 3D cinematic transfers.
It outperforms state-of-the-art methods in frame composition restoration and camera pose estimation across different shot types.
User studies confirm the effectiveness and user satisfaction with the generated results, especially in terms of camera movement accuracy and character pose refinement. |
The method depends on SLAM for initial camera trajectory estimation, limiting its performance in highly dynamic scenes.
It focuses primarily on shots with prominent human subjects, requiring adaptation for scenes centered around environments or objects. |
cinematic behavior transfer, camera trajectory optimization, character motion estimation, nerf, smpl |
2311.17737
Report |
GenZI: Zero-Shot 3D Human-Scene Interaction Generation |
Lei Li, Angela Dai |
Can we synthesize 3D humans interacting with scenes without learning from any
3D human-scene interaction data? We propose GenZI, the first zero-shot approach
to generating 3D human-scene interactions. Key to GenZI is our distillation of
interaction priors from large vision-language models (VLMs), which have learned
a rich semantic space of 2D human-scene compositions. Given a natural language
description and a coarse point location of the desired interaction in a 3D
scene, we first leverage VLMs to imagine plausible 2D human interactions
inpainted into multiple rendered views of the scene. We then formulate a robust
iterative optimization to synthesize the pose and shape of a 3D human model in
the scene, guided by consistency with the 2D interaction hypotheses. In
contrast to existing learning-based approaches, GenZI circumvents the
conventional need for captured 3D interaction data, and allows for flexible
control of the 3D interaction synthesis with easy-to-use text prompts.
Extensive experiments show that our zero-shot approach has high flexibility and
generality, making it applicable to diverse scene types, including both indoor
and outdoor environments. |
The paper introduces GenZI, a zero-shot approach for generating 3D human-scene interactions (HSI) from text prompts, eliminating the need for 3D interaction training data. |
Existing HSI synthesis methods rely on large-scale 3D interaction datasets, which are costly and difficult to acquire, limiting their generalizability. This work explores zero-shot 3D HSI generation by leveraging powerful 2D vision-language models (VLMs). |
GenZI employs VLMs to imagine plausible 2D human interactions in multiple rendered scene views using a dynamic masking scheme for automated human inpainting. It then optimizes a 3D human model's pose and shape to ensure consistency with the inferred 2D interactions through a robust, iterative process. |
Perceptual studies show a strong preference for GenZI's generations over baselines.
GenZI achieves the highest semantic consistency scores, indicating better alignment between generated interactions and text prompts.
The method demonstrates strong generalization to diverse indoor and outdoor scenes, unlike data-driven baselines limited to specific interaction types. |
The quality of GenZI depends on the inpainting ability of the VLM, which can be limited by the model's capacity and biases.
Inference speed is constrained by the iterative nature of diffusion models used for inpainting. |
3d human-scene interaction, zero-shot learning, vision-language models, latent diffusion models, 3d human pose estimation |
2311.17717
Report |
Receler: Reliable Concept Erasing of Text-to-Image Diffusion Models via Lightweight Erasers |
Chi-Pin Huang, Kai-Po Chang, Chung-Ting Tsai, Yung-Hsuan Lai, Fu-En Yang, Yu-Chiang Frank Wang |
Concept erasure in text-to-image diffusion models aims to disable pre-trained
diffusion models from generating images related to a target concept. To perform
reliable concept erasure, the properties of robustness and locality are
desirable. The former prevents the model from producing images associated with
the target concept for any paraphrased or learned prompts, while the latter
preserves its ability in generating images with non-target concepts. In this
paper, we propose Reliable Concept Erasing via Lightweight Erasers (Receler).
It learns a lightweight Eraser to perform concept erasing while satisfying the
above desirable properties by proposed concept-localized regularization and
adversarial prompt learning schemes. Comprehensive experiments with various
concepts verify the superiority of Receler over previous methods. Our code will
be available upon acceptance. |
This paper proposes Receler, a method for reliably erasing concepts from text-to-image diffusion models using lightweight erasers while maintaining locality and robustness to paraphrased prompts. |
Concept erasure is crucial for mitigating the risks of generating NSFW or copyright-infringing content from pre-trained text-to-image diffusion models. |
Receler introduces a lightweight eraser that learns to remove target concepts from cross-attention layer outputs. It leverages concept-localized regularization for locality and adversarial prompt learning for robustness. |
Receler outperforms state-of-the-art methods in erasing objects and inappropriate content, demonstrating superior robustness and locality.
It effectively defends against learned attack prompts, showcasing enhanced reliability.
Receler allows for compositional concept erasure by combining outputs from separately trained erasers. |
The paper primarily focuses on erasing single concepts, leaving multi-concept erasure for future exploration.
The impact of different pre-trained diffusion models on erasure effectiveness requires further investigation. |
concept erasing, diffusion models, parameter-efficient fine-tuning, adversarial prompt learning, text-to-image generation |
2311.17707
Report |
SAMPro3D: Locating SAM Prompts in 3D for Zero-Shot Scene Segmentation |
Mutian Xu, Xingyilang Yin, Lingteng Qiu, Yang Liu, Xin Tong, Xiaoguang Han |
We introduce SAMPro3D for zero-shot 3D indoor scene segmentation. Given the
3D point cloud and multiple posed 2D frames of 3D scenes, our approach segments
3D scenes by applying the pretrained Segment Anything Model (SAM) to 2D frames.
Our key idea involves locating 3D points in scenes as natural 3D prompts to
align their projected pixel prompts across frames, ensuring frame-consistency
in both pixel prompts and their SAM-predicted masks. Moreover, we suggest
filtering out low-quality 3D prompts based on feedback from all 2D frames, for
enhancing segmentation quality. We also propose to consolidate different 3D
prompts if they are segmenting the same object, bringing a more comprehensive
segmentation. Notably, our method does not require any additional training on
domain-specific data, enabling us to preserve the zero-shot power of SAM.
Extensive qualitative and quantitative results show that our method
consistently achieves higher quality and more diverse segmentation than
previous zero-shot or fully supervised approaches, and in many cases even
surpasses human-level annotations. The project page can be accessed at
https://mutianxu.github.io/sampro3d/. |
This paper introduces SAMPro3D, a novel framework for zero-shot 3D indoor scene segmentation using the pretrained Segment Anything Model (SAM) applied to posed 2D frames of the scene. |
Existing methods for 3D scene segmentation lack zero-shot capability or require domain-specific training, limiting their generalizability to new scenes. SAMPro3D leverages the zero-shot power of SAM for direct application to 3D scenes. |
SAMPro3D locates 3D points as prompts to align corresponding pixel prompts and predicted masks across different frames, ensuring frame-consistency. It then filters low-quality prompts and consolidates those segmenting the same object. |
SAMPro3D consistently achieves higher quality and more diverse segmentation than previous zero-shot or fully supervised approaches on ScanNet200.
Qualitative results showcase superior segmentation across various scenes and objects, often surpassing human annotations in diversity.
User studies confirm the effectiveness of SAMPro3D in terms of both segmentation accuracy and diversity, exceeding even human-level performance. |
The segmentation performance of SAMPro3D is inherently limited by the capabilities of SAM.
Future work could explore real-time harmonization of 3D scene segmentation and reconstruction using Mobile-SAM and parallel processing. |
3d scene segmentation, zero-shot learning, segment anything model (sam), prompt engineering, indoor scene understanding |
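The prompt-alignment step in the SAMPro3D entry above (a 3D point is projected into each posed frame to obtain a pixel prompt) can be illustrated with a minimal pinhole-camera sketch. This is not the authors' code: the function names, the 3x4 `world_to_cam` pose layout, and the intrinsics tuple `(fx, fy, cx, cy)` are assumptions made for illustration, and occlusion checks (for example against a depth map) are deliberately left out.

```python
# Hypothetical sketch: project one 3D prompt into posed frames to get pixel prompts.
# All names and the pose/intrinsics layout are illustrative assumptions.

def project_point(point_world, world_to_cam, fx, fy, cx, cy, width, height):
    """Return (u, v) pixel coordinates, or None if the point falls outside the frame.

    world_to_cam is a 3x4 row-major matrix [R | t] mapping world to camera coordinates.
    """
    x, y, z = point_world
    cam = [row[0] * x + row[1] * y + row[2] * z + row[3] for row in world_to_cam]
    if cam[2] <= 0:  # point is behind the camera
        return None
    u = fx * cam[0] / cam[2] + cx
    v = fy * cam[1] / cam[2] + cy
    if 0 <= u < width and 0 <= v < height:
        return (u, v)
    return None


def pixel_prompts(point_world, frames):
    """Collect the pixel prompt of one 3D prompt in every frame that sees it.

    frames maps frame_id -> (world_to_cam, (fx, fy, cx, cy), (width, height)).
    """
    prompts = {}
    for frame_id, (world_to_cam, intrinsics, size) in frames.items():
        uv = project_point(point_world, world_to_cam, *intrinsics, *size)
        if uv is not None:
            prompts[frame_id] = uv
    return prompts
```

Because every pixel prompt derives from the same 3D point, the prompts are consistent across frames by construction, which is the property the entry describes.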
2311.17618
Report |
ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model |
Fukun Yin, Xin Chen, Chi Zhang, Biao Jiang, Zibo Zhao, Jiayuan Fan, Gang Yu, Taihao Li, Tao Chen |
The advent of large language models, enabling flexibility through
instruction-driven approaches, has revolutionized many traditional generative
tasks, but large models for 3D data, particularly in comprehensively handling
3D shapes with other modalities, are still under-explored. By achieving
instruction-based shape generations, versatile multimodal generative shape
models can significantly benefit various fields like 3D virtual construction
and network-aided design. In this work, we present ShapeGPT, a shape-included
multi-modal framework to leverage strong pre-trained language models to address
multiple shape-relevant tasks. Specifically, ShapeGPT employs a
word-sentence-paragraph framework to discretize continuous shapes into shape
words, further assembles these words for shape sentences, as well as integrates
shape with instructional text for multi-modal paragraphs. To learn this
shape-language model, we use a three-stage training scheme, including shape
representation, multimodal alignment, and instruction-based generation, to
align shape-language codebooks and learn the intricate correlations among these
modalities. Extensive experiments demonstrate that ShapeGPT achieves comparable
performance across shape-relevant tasks, including text-to-shape,
shape-to-text, shape completion, and shape editing. |
This paper presents ShapeGPT, a unified shape-included multi-modal framework that leverages pre-trained LLMs to address multiple shape-related tasks. |
Existing methods lack a holistic understanding of the interplay between 3D shapes and other modalities, limiting their versatility across tasks. |
ShapeGPT discretizes shapes into shape words and sentences, integrates them with text for multi-modal paragraphs, and utilizes a three-stage training scheme (shape representation, multimodal alignment, instruction-based generation). |
ShapeGPT achieves comparable performance to state-of-the-art methods in image-to-shape, text-to-shape, and multi-modal-to-shape generation.
It effectively handles additional shape-centric tasks like shape captioning, completion, reasoning, and editing within a single architecture.
Ablation studies highlight the importance of shape token length, language model size, and pre-training for optimal performance. |
ShapeGPT's current capabilities are limited to single-object generation and lack texture generation.
Future work aims to expand ShapeGPT's capabilities to include more shape-centric tasks, textured shapes, and support for additional modalities like voice and video. |
3d shape generation, multimodal learning, large language models, shape-language pre-training, instruction-based generation |
2311.17609
Report |
AnyLens: A Generative Diffusion Model with Any Rendering Lens |
Andrey Voynov, Amir Hertz, Moab Arar, Shlomi Fruchter, Daniel Cohen-Or |
State-of-the-art diffusion models can generate highly realistic images based
on various conditioning like text, segmentation, and depth. However, an
essential aspect often overlooked is the specific camera geometry used during
image capture. The influence of different optical systems on the final scene
appearance is frequently overlooked. This study introduces a framework that
intimately integrates a text-to-image diffusion model with the particular lens
geometry used in image rendering. Our method is based on a per-pixel coordinate
conditioning method, enabling the control over the rendering geometry. Notably,
we demonstrate the manipulation of curvature properties, achieving diverse
visual effects, such as fish-eye, panoramic views, and spherical texturing
using a single diffusion model. |
Introduces AnyLens, a framework integrating text-to-image diffusion models with specific lens geometries for enhanced realism and control over optical effects in generated images. |
Addresses the limitation of existing text-to-image diffusion models in replicating diverse optical effects produced by various camera lenses, enhancing realism and control over image synthesis. |
Utilizes per-pixel coordinate conditioning, providing the diffusion model with spatial locations of pixels in an undistorted view, and introduces self-attention re-weighting to account for content density variations caused by warping. |
Successfully generates images simulating diverse lens effects, including fish-eye and panoramic views.
Enables spherical texturing and panorama generation with accurate surface curvature alignment.
Maintains base text-to-image generation quality while enabling lens-based control. |
Relies on metric representations compatible with the simulated lens geometry.
Limited extrapolation capabilities due to reliance on repetitions. |
diffusion models, lens simulation, image generation, spherical texturing, per-pixel conditioning |
2311.17536
Report |
SmoothVideo: Smooth Video Synthesis with Noise Constraints on Diffusion Models for One-shot Video Tuning |
Liang Peng, Haoran Cheng, Zheng Yang, Ruisi Zhao, Linxuan Xia, Chaotian Song, Qinglin Lu, Boxi Wu, Wei Liu |
Recent one-shot video tuning methods, which fine-tune the network on a
specific video based on pre-trained text-to-image models (e.g., Stable
Diffusion), are popular in the community because of their flexibility. However,
these methods often produce videos marred by incoherence and inconsistency. To
address these limitations, this paper introduces a simple yet effective noise
constraint across video frames. This constraint aims to regulate noise
predictions across their temporal neighbors, resulting in smooth latents. It
can be simply included as a loss term during the training phase. By applying
the loss to existing one-shot video tuning methods, we significantly improve
the overall consistency and smoothness of the generated videos. Furthermore, we
argue that current video evaluation metrics inadequately capture smoothness. To
address this, we introduce a novel metric that considers detailed features and
their temporal dynamics. Experimental results validate the effectiveness of our
approach in producing smoother videos on various one-shot video tuning
baselines. The source codes and video demos are available at
https://github.com/SPengLiang/SmoothVideo. |
This paper introduces a noise constraint loss to improve the smoothness and coherence of videos generated by one-shot video tuning methods. |
Existing one-shot video tuning methods often produce videos with incoherence and flicker, leading to noticeable artifacts and reduced visual quality. |
The authors analyze the relationship between noise predictions and latent representations in the DDIM reverse process. They propose a noise constraint loss that regularizes the noise predictions across adjacent video frames, encouraging smoother transitions and reducing flicker. |
Applying the noise constraint loss to existing one-shot video tuning methods (Tune-A-Video, ControlVideo, Make-A-Protagonist) significantly improves the smoothness of generated videos.
The authors introduce a novel video smoothness metric, VL score, that outperforms traditional CLIP-based metrics in capturing temporal consistency.
The proposed method can be readily integrated into training-free video editing techniques, leading to enhanced coherence and reduced flickering. |
The noise constraint loss can sometimes negatively impact text alignment in the generated videos.
The sliding window design in the VL score metric cannot effectively handle complex scene motions like zoom in and zoom out. |
video generation, one-shot video tuning, diffusion models, video smoothness, noise constraint loss |
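As a rough illustration of the noise constraint in the SmoothVideo entry above, the sketch below penalizes differences between the noise predictions of temporally adjacent frames and adds that penalty to an existing training loss. The exact loss used in the paper is not given in this record, so the mean-squared form, the function names, and the `weight` value are all assumptions.

```python
# Hypothetical sketch of a temporal smoothness penalty on noise predictions.
# The mean-squared form and the default weight are assumptions, not the paper's loss.

def adjacent_frame_noise_penalty(noise_preds):
    """Mean squared difference between noise predictions of adjacent frames.

    noise_preds: list of frames, each a flat list of floats of equal length.
    """
    if len(noise_preds) < 2:
        return 0.0
    total = 0.0
    count = 0
    for prev, curr in zip(noise_preds, noise_preds[1:]):
        for a, b in zip(prev, curr):
            total += (b - a) ** 2
            count += 1
    return total / count


def total_loss(denoise_loss, noise_preds, weight=0.1):
    """Add the weighted smoothness penalty to an existing denoising loss."""
    return denoise_loss + weight * adjacent_frame_noise_penalty(noise_preds)
```

The penalty is zero when all frames share the same noise prediction and grows as neighboring frames diverge, which matches the entry's description of encouraging smoother transitions.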
2311.17528
Report |
HiDiffusion: Unlocking Higher-Resolution Creativity and Efficiency in Pretrained Diffusion Models |
Shen Zhang, Zhaowei Chen, Zhenyu Zhao, Yuhao Chen, Yao Tang, Jiajun Liang |
Diffusion models have become a mainstream approach for high-resolution image
synthesis. However, directly generating higher-resolution images from
pretrained diffusion models will encounter unreasonable object duplication and
exponentially increase the generation time. In this paper, we discover that
object duplication arises from feature duplication in the deep blocks of the
U-Net. Concurrently, we attribute the extended generation times to
self-attention redundancy in U-Net's top blocks. To address these issues, we
propose a tuning-free higher-resolution framework named HiDiffusion.
Specifically, HiDiffusion contains Resolution-Aware U-Net (RAU-Net) that
dynamically adjusts the feature map size to resolve object duplication and
engages Modified Shifted Window Multi-head Self-Attention (MSW-MSA) that
utilizes optimized window attention to reduce computations. We can integrate
HiDiffusion into various pretrained diffusion models to scale image generation
resolutions even to 4096x4096 at 1.5-6x the inference speed of previous
methods. Extensive experiments demonstrate that our approach can address object
duplication and heavy computation issues, achieving state-of-the-art
performance on higher-resolution image synthesis tasks. |
This paper presents HiDiffusion, a tuning-free framework to enable pretrained diffusion models to generate higher-resolution images (e.g., 4096x4096) with high efficiency. |
Existing methods for higher-resolution image synthesis using diffusion models suffer from limitations such as object duplication, lack of fine details, and slow inference speed. HiDiffusion aims to address these issues and improve the scalability of pretrained models. |
HiDiffusion incorporates two novel components: 1) Resolution-Aware U-Net (RAU-Net) dynamically adjusts feature map sizes to mitigate object duplication and retain fine details. 2) Modified Shifted Window Multi-head Self-Attention (MSW-MSA) reduces computational cost by replacing global attention with optimized window attention. |
HiDiffusion successfully scales image generation resolutions up to 4096x4096 while preserving fine details and avoiding object duplication.
The method demonstrates significant speed improvements, achieving 1.5-6x faster inference compared to previous methods.
HiDiffusion is a tuning-free approach, meaning it can be seamlessly integrated with existing pretrained diffusion models like SD 1.5, SD 2.1, SDXL, and SDXL Turbo without requiring additional training. |
While effective, HiDiffusion still relies on the inherent capabilities of the base diffusion model, requiring prompt engineering for optimal results.
Future work could explore better integration with super-resolution techniques to further enhance resolution and image quality. |
diffusion models, high-resolution image synthesis, image generation, model efficiency, tuning-free |
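To make the efficiency argument in the HiDiffusion entry above concrete, the sketch below counts query-key pairs for global self-attention versus non-overlapping window attention on a feature map. It models only the generic cost of window attention; the window shifting and other details of MSW-MSA are not represented, and the function name and sizes are illustrative assumptions.

```python
# Hypothetical cost comparison: global self-attention vs. window attention.
# Only query-key pair counts are modeled; MSW-MSA specifics are not.

def attention_pair_counts(height, width, window_h, window_w):
    """Return (global_pairs, window_pairs) for a height x width token grid."""
    if height % window_h or width % window_w:
        raise ValueError("feature map must divide evenly into windows")
    tokens = height * width
    global_pairs = tokens * tokens                  # every token attends to every token
    window_pairs = tokens * (window_h * window_w)   # every token attends within its window
    return global_pairs, window_pairs
```

For a 64x64 token grid split into 32x32 windows, the ratio of global to window pairs equals the number of windows, here 4.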
2311.17461
Report |
When StyleGAN Meets Stable Diffusion: a $\mathscr{W}_+$ Adapter for Personalized Image Generation |
Xiaoming Li, Xinyu Hou, Chen Change Loy |
Text-to-image diffusion models have remarkably excelled in producing diverse,
high-quality, and photo-realistic images. This advancement has spurred a
growing interest in incorporating specific identities into generated content.
Most current methods employ an inversion approach to embed a target visual
concept into the text embedding space using a single reference image. However,
the newly synthesized faces either closely resemble the reference image in
terms of facial attributes, such as expression, or exhibit a reduced capacity
for identity preservation. Text descriptions intended to guide the facial
attributes of the synthesized face may fall short, owing to the intricate
entanglement of identity information with identity-irrelevant facial attributes
derived from the reference image. To address these issues, we present the novel
use of the extended StyleGAN embedding space $\mathcal{W}_+$, to achieve
enhanced identity preservation and disentanglement for diffusion models. By
aligning this semantically meaningful human face latent space with
text-to-image diffusion models, we succeed in maintaining high fidelity in
identity preservation, coupled with the capacity for semantic editing.
Additionally, we propose new training objectives to balance the influences of
both prompt and identity conditions, ensuring that the identity-irrelevant
background remains unaffected during facial attribute modifications. Extensive
experiments reveal that our method adeptly generates personalized text-to-image
outputs that are not only compatible with prompt descriptions but also amenable
to common StyleGAN editing directions in diverse settings. Our source code will
be available at https://github.com/csxmli2016/w-plus-adapter. |
This paper proposes a novel approach to personalized text-to-image generation that leverages the extended StyleGAN embedding space (W+) to enhance identity preservation and disentanglement for diffusion models. |
Existing methods struggle to simultaneously preserve identity, generate varied facial attributes, and create identity-irrelevant content aligned with text descriptions due to the entangled nature of textual embedding space. |
The approach involves two stages: 1) aligning W+ with Stable Diffusion by training a mapping network to project a w+ embedding to SD latent space and injecting this as an additional identity condition, 2) fine-tuning for in-the-wild generation using a residual cross-attention module and novel regularized training to disentangle identity-relevant and -irrelevant features. |
The method successfully generates personalized text-to-image outputs compatible with prompt descriptions and amenable to StyleGAN editing directions.
Quantitative evaluation shows comparable or superior performance to state-of-the-art methods in terms of CLIP Score, identity distance, and face detection score.
Ablation studies demonstrate the importance of each component, including the two-stage training, residual cross-attention, and regularization loss, in achieving a balance between identity preservation, attribute editability, and text prompt compatibility. |
The process of converting real images to w+ vectors can lead to a loss of detail, impacting identity fidelity.
The current framework is limited to single-face generation and editing. |
text-to-image generation, personalized image synthesis, stylegan, stable diffusion, facial attribute editing |
2311.17338
Report |
VideoAssembler: Identity-Consistent Video Generation with Reference Entities using Diffusion Model |
Haoyu Zhao, Tianyi Lu, Jiaxi Gu, Xing Zhang, Zuxuan Wu, Hang Xu, Yu-Gang Jiang |
Identity-consistent video generation seeks to synthesize videos that are
guided by both textual prompts and reference images of entities. Current
approaches typically utilize cross-attention layers to integrate the appearance
of the entity, which predominantly captures semantic attributes, resulting in
compromised fidelity of entities. Moreover, these methods necessitate iterative
fine-tuning for each new entity encountered, thereby limiting their
applicability. To address these challenges, we introduce VideoAssembler, a
novel end-to-end framework for identity-consistent video generation that can
conduct inference directly when encountering new entities. VideoAssembler is
adept at producing videos that are not only flexible with respect to the input
reference entities but also responsive to textual conditions. Additionally, by
modulating the quantity of input images for the entity, VideoAssembler enables
the execution of tasks ranging from image-to-video generation to sophisticated
video editing. VideoAssembler comprises two principal components: the Reference
Entity Pyramid (REP) encoder and the Entity-Prompt Attention Fusion (EPAF)
module. The REP encoder is designed to infuse comprehensive appearance details
into the denoising stages of the stable diffusion model. Concurrently, the EPAF
module is utilized to integrate text-aligned features effectively. Furthermore,
to mitigate the challenge of scarce data, we present a methodology for the
preprocessing of training data. Our evaluation of the VideoAssembler framework
on the UCF-101, MSR-VTT, and DAVIS datasets indicates that it achieves good
performance in both quantitative and qualitative analyses (346.84 in FVD and
48.01 in IS on UCF-101). Our project page is at
https://gulucaptain.github.io/videoassembler/. |
This paper introduces VideoAssembler, a novel end-to-end framework for identity-consistent video generation that can run inference directly on new entities without retraining. |
Identity-consistent video generation is challenging because it requires generating content-reasonable videos while accurately injecting given entity information. Existing methods struggle with appearance fidelity, weak action guidance, and reliance on few-shot fine-tuning. |
VideoAssembler utilizes a Reference Entity Pyramid (REP) encoder to infuse detailed appearance into the denoising stages of the stable diffusion model and an Entity-Prompt Attention Fusion (EPAF) module to integrate text-aligned features. It also introduces a data preprocessing methodology to address training data scarcity. |
VideoAssembler achieves state-of-the-art performance on UCF-101 and MSR-VTT datasets in terms of FVD and IS metrics.
The method exhibits strong entity fidelity and action control, as evidenced by qualitative comparisons with other methods like VideoDreamer.
It demonstrates flexibility in handling different numbers of input entities, enabling image-to-video generation and video editing tasks. |
The optimal number of input entities for the best performance needs further investigation.
The model's generative creativity might be slightly limited when solely relying on the REP encoder. |
video generation, identity-consistent, reference entities, diffusion models, video editing |
2311.17261
Report |
SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors |
Dave Zhenyu Chen, Haoxuan Li, Hsin-Ying Lee, Sergey Tulyakov, Matthias Nießner |
We propose SceneTex, a novel method for effectively generating high-quality
and style-consistent textures for indoor scenes using depth-to-image diffusion
priors. Unlike previous methods that either iteratively warp 2D views onto a
mesh surface or distill diffusion latent features without accurate geometric
and style cues, SceneTex formulates the texture synthesis task as an
optimization problem in the RGB space where style and geometry consistency are
properly reflected. At its core, SceneTex proposes a multiresolution texture
field to implicitly encode the mesh appearance. We optimize the target texture
via a score-distillation-based objective function in respective RGB renderings.
To further secure the style consistency across views, we introduce a
cross-attention decoder to predict the RGB values by cross-attending to the
pre-sampled reference locations in each instance. SceneTex enables various and
accurate texture synthesis for 3D-FRONT scenes, demonstrating significant
improvements in visual quality and prompt fidelity over the prior texture
generation methods. |
This paper proposes SceneTex, a novel method for generating high-quality and style-consistent textures for indoor scenes using depth-to-image diffusion priors. |
Generating realistic and style-consistent textures for 3D scenes is crucial for various applications but remains challenging due to the need for accurate geometry and style cues. |
SceneTex introduces a multiresolution texture field to represent scene appearance and leverages a cross-attention texture decoder for global style consistency. It optimizes the texture by distilling knowledge from a pre-trained depth-conditioned diffusion prior. |
SceneTex generates high-quality textures superior to baseline methods based on CLIP score and Inception Score.
The method effectively reflects input prompts in the generated textures, as demonstrated by user studies.
Ablation studies confirm the importance of the multiresolution texture field and cross-attention decoder for achieving high-quality and style-consistent results. |
The generated textures sometimes exhibit shading effects, potentially due to the diffusion prior.
Future work could explore addressing the shading issue and extending the method to handle more complex lighting conditions. |
texture synthesis, 3d scenes, diffusion models, cross-attention, style consistency |
2311.17245
Report |
LightGaussian: Unbounded 3D Gaussian Compression with 15x Reduction and 200+ FPS |
Zhiwen Fan, Kevin Wang, Kairun Wen, Zehao Zhu, Dejia Xu, Zhangyang Wang |
Recent advancements in real-time neural rendering using point-based
techniques have paved the way for the widespread adoption of 3D
representations. However, foundational approaches like 3D Gaussian Splatting
come with a substantial storage overhead caused by growing the SfM points to
millions, often demanding gigabyte-level disk space for a single unbounded
scene, posing significant scalability challenges and hindering the splatting
efficiency.
To address this challenge, we introduce LightGaussian, a novel method
designed to transform 3D Gaussians into a more efficient and compact format.
Drawing inspiration from the concept of Network Pruning, LightGaussian
identifies Gaussians that are insignificant in contributing to the scene
reconstruction and adopts a pruning and recovery process, effectively reducing
redundancy in Gaussian counts while preserving visual effects. Additionally,
LightGaussian employs distillation and pseudo-view augmentation to distill
spherical harmonics to a lower degree, allowing knowledge transfer to more
compact representations while maintaining reflectance. Furthermore, we propose
a hybrid scheme, VecTree Quantization, to quantize all attributes, resulting in
lower bitwidth representations with minimal accuracy losses.
In summary, LightGaussian achieves an averaged compression rate over 15x
while boosting the FPS from 139 to 215, enabling an efficient representation of
complex scenes on Mip-NeRF 360, Tank and Temple datasets.
Project website: https://lightgaussian.github.io/ |
This paper presents LightGaussian, a novel method that compresses 3D Gaussian representations for efficient novel view synthesis, achieving a 15x reduction in size and boosting rendering speed to 200+ FPS. |
3D Gaussian Splatting offers high-quality novel view synthesis but suffers from a large storage overhead due to millions of Gaussians. |
LightGaussian employs 1) Gaussian Pruning & Recovery to eliminate redundant Gaussians based on global significance, 2) SH Distillation with pseudo-view augmentation to transfer high-degree SH information to compact lower-degree representations, and 3) Gaussian Attribute Vector Quantization based on global significance to reduce the representation bit-width. |
LightGaussian achieves over 15x compression on the Mip-NeRF 360 dataset, reducing storage from 724MB to 42MB.
Rendering speed is improved from 119 FPS to 209 FPS.
Visual fidelity is maintained with minimal quality loss (SSIM decrease of 0.005). |
VQ applied to all Gaussian attributes leads to significant accuracy loss.
Future work includes exploring zero-shot compression for different 3D Gaussian Splatting frameworks. |
novel view synthesis, 3d gaussian splatting, model compression, knowledge distillation, vector quantization |
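The pruning step in the LightGaussian entry above can be sketched as ranking Gaussians by a per-Gaussian significance score and dropping the lowest-scoring fraction. How the global significance score is computed is not specified in this record, so the scores are taken as given; the function names and the `prune_ratio` parameter are assumptions, and the recovery (fine-tuning) step is not shown. A small helper also reproduces the storage ratio from the reported 724MB and 42MB figures.

```python
# Hypothetical sketch of significance-based pruning; score computation is assumed given.

def prune_by_significance(scores, prune_ratio):
    """Return sorted indices of Gaussians kept after removing the lowest-scoring fraction."""
    if not 0.0 <= prune_ratio < 1.0:
        raise ValueError("prune_ratio must be in [0, 1)")
    n_prune = int(len(scores) * prune_ratio)
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # ascending significance
    return sorted(order[n_prune:])


def compression_ratio(size_before_mb, size_after_mb):
    """Storage reduction factor, e.g. for the 724MB -> 42MB figures reported above."""
    return size_before_mb / size_after_mb
```

With the sizes quoted in the entry, `compression_ratio(724, 42)` is roughly 17.2, consistent with the "over 15x" claim.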
2311.17216
Report |
Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation |
Hang Li, Chengzhi Shen, Philip Torr, Volker Tresp, Jindong Gu |
Diffusion-based models have gained significant popularity for text-to-image
generation due to their exceptional image-generation capabilities. A risk with
these models is the potential generation of inappropriate content, such as
biased or harmful images. However, the underlying reasons for generating such
undesired content from the perspective of the diffusion model's internal
representation remain unclear. Previous work interprets vectors in an
interpretable latent space of diffusion models as semantic concepts. However,
existing approaches cannot discover directions for arbitrary concepts, such as
those related to inappropriate concepts. In this work, we propose a novel
self-supervised approach to find interpretable latent directions for a given
concept. With the discovered vectors, we further propose a simple approach to
mitigate inappropriate generation. Extensive experiments have been conducted to
verify the effectiveness of our mitigation approach, namely, for fair
generation, safe generation, and responsible text-enhancing generation. Project
page: https://interpretdiffusion.github.io. |
This paper proposes a self-supervised approach to discover and utilize interpretable latent directions within diffusion models' internal representations for responsible text-to-image generation. |
Existing methods for interpreting and manipulating diffusion models struggle to discover directions for arbitrary concepts, especially those related to inappropriate content, hindering responsible generation. |
The authors optimize a concept vector by minimizing the reconstruction loss of images generated with concept-related prompts, forcing the vector to represent the missing concept information. This vector is then added to the model's internal activations during inference to guide responsible generation. |
The method successfully generates images with balanced representations of societal groups, mitigating bias in professions like doctors.
It effectively eliminates harmful content from inappropriate prompts, outperforming existing safety methods.
The approach enhances text guidance for responsible prompts, accurately representing concepts like 'no violence' in generated images. |
The linear manipulation of concepts might not fully capture complex relationships between different attributes.
The approach's reliance on synthesized data for concept discovery might not fully represent real-world diversity. |
diffusion models, responsible ai, text-to-image generation, fairness, safety |
2311.17138
Report |
Shadows Don't Lie and Lines Can't Bend! Generative Models don't know Projective Geometry...for now |
Ayush Sarkar, Hanlin Mai, Amitabh Mahapatra, Svetlana Lazebnik, D. A. Forsyth, Anand Bhattad |
Generative models can produce impressively realistic images. This paper
demonstrates that generated images have geometric features different from those
of real images. We build a set of collections of generated images, prequalified
to fool simple, signal-based classifiers into believing they are real. We then
show that prequalified generated images can be identified reliably by
classifiers that only look at geometric properties. We use three such
classifiers. All three classifiers are denied access to image pixels, and look
only at derived geometric features. The first classifier looks at the
perspective field of the image, the second looks at lines detected in the
image, and the third looks at relations between detected objects and shadows.
Our procedure detects generated images more reliably than SOTA local signal
based detectors, for images from a number of distinct generators. Saliency maps
suggest that the classifiers can identify geometric problems reliably. We
conclude that current generators cannot reliably reproduce geometric properties
of real images. |
This paper reveals that while AI-generated images are increasingly realistic, they often contain subtle geometric inconsistencies, particularly in perspective and shadow accuracy, which can be used to distinguish them from real images. |
This research is crucial as it highlights a key limitation in current generative models: their struggle to accurately replicate the rules of projective geometry, a fundamental aspect of realistic image creation. |
The authors curate a dataset of real and generated images, filtered to remove easily detectable artifacts. They then train three classifiers on geometric features: one analyzing line segments, another examining perspective fields, and the last focusing on object-shadow relationships. None of the classifiers see the image pixels. |
The classifiers, despite not analyzing image pixels directly, are highly effective at identifying generated images, achieving AUCs ranging from 0.72 to 0.97.
The analysis suggests that generative models struggle to maintain consistent vanishing points, leading to inaccuracies in line convergence and perspective distortion.
Shadow inconsistencies, such as mismatched directions and lengths relative to objects and light sources, are also reliably detected. |
The study primarily focuses on indoor and outdoor scenes, potentially limiting the generalizability of findings to other image types.
Future work could explore the use of these geometric inconsistencies as feedback mechanisms to improve the realism of future generative models. |
generative models, projective geometry, image forensics, shadow analysis, perspective analysis |
2311.17137
Report |
Generative Models: What do they know? Do they know things? Let's find out! |
Xiaodan Du, Nicholas Kolkin, Greg Shakhnarovich, Anand Bhattad |
Generative models have been shown to be capable of synthesizing highly
detailed and realistic images. It is natural to suspect that they implicitly
learn to model some image intrinsics such as surface normals, depth, or
shadows. In this paper, we present compelling evidence that generative models
indeed internally produce high-quality scene intrinsic maps. We introduce
Intrinsic LoRA (I LoRA), a universal, plug-and-play approach that transforms
any generative model into a scene intrinsic predictor, capable of extracting
intrinsic scene maps directly from the original generator network without
needing additional decoders or fully fine-tuning the original network. Our
method employs a Low-Rank Adaptation (LoRA) of key feature maps, with newly
learned parameters that make up less than 0.6% of the total parameters in the
generative model. Optimized with a small set of labeled images, our
model-agnostic approach adapts to various generative architectures, including
Diffusion models, GANs, and Autoregressive models. We show that the scene
intrinsic maps produced by our method compare well with, and in some cases
surpass those generated by leading supervised techniques. |
The paper introduces a universal, plug-and-play approach called Intrinsic LoRA (I LoRA) that transforms any generative model into a scene intrinsic predictor, enabling the extraction of intrinsic scene maps like normals, depth, albedo, and shading directly from the original generator network. |
This is important for understanding the knowledge generative models possess, leveraging them for real image understanding, and potentially improving their quality. |
The methodology uses Low-Rank Adaptation (LoRA) of key feature maps (attention layers, affine layers, or convolutional attention layers depending on the model) to modulate the features and enable the extraction of scene intrinsics. The LoRA modules are trained using a small set of labeled images and pseudo-ground truth generated by state-of-the-art models. |
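As a rough illustration of the low-rank adaptation idea described above, the following pure-Python sketch adds a trainable rank-r update to a frozen linear map. The function names, the `scale` parameter, and the layer sizes are assumptions chosen for illustration; this is not the authors' implementation, which attaches LoRA modules to specific layers of each generator.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w_frozen, lora_a, lora_b, scale=1.0):
    """Compute x @ W + scale * (x @ A) @ B.

    W stays frozen; only the low-rank factors A (d_in x r) and B (r x d_out)
    would be trained. `scale` is a hypothetical scaling knob.
    """
    base = matmul(x, w_frozen)
    update = matmul(matmul(x, lora_a), lora_b)
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]


def lora_param_fraction(d_in, d_out, rank):
    """Parameters added by one rank-r adapter, relative to the d_in x d_out layer."""
    return rank * (d_in + d_out) / (d_in * d_out)
```

With `lora_b` initialised to zeros the adapted layer reproduces the frozen layer exactly, which is the usual starting point for LoRA training. The per-layer fraction computed here is only indicative; the under-0.6% figure in the report refers to the whole generative model.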
All tested generative models (diffusion, GANs, autoregressive) can be adapted to extract scene intrinsic maps using Intrinsic LoRA.
High-quality intrinsic extraction is achieved, outperforming SOTA supervised methods in some tasks, using minimal additional parameters (<0.6%) and limited labeled data (as few as 250 images).
A correlation exists between the quality of the generated images and the accuracy of extracted intrinsics, suggesting stronger generative models lead to better intrinsic predictions. |
Multi-step diffusion, while promising, introduces misalignment and color shift issues which need to be addressed.
Further work is needed to explore explicitly incorporating scene intrinsics into the learning process of generative models and developing evaluation metrics based on physical properties. |
generative models, scene intrinsic extraction, low-rank adaptation (lora), diffusion models, gans |
2311.17132
Report |
TransNeXt: Robust Foveal Visual Perception for Vision Transformers |
Dai Shi |
Due to the depth degradation effect in residual connections, many efficient
Vision Transformer models that rely on stacking layers for information
exchange often fail to form sufficient information mixing, leading to unnatural
visual perception. To address this issue, in this paper, we propose Aggregated
Attention, a biomimetic design-based token mixer that simulates biological
foveal vision and continuous eye movement while enabling each token on the
feature map to have a global perception. Furthermore, we incorporate learnable
tokens that interact with conventional queries and keys, which further
diversifies the generation of affinity matrices beyond merely relying on the
similarity between queries and keys. Our approach does not rely on stacking for
information exchange, thus effectively avoiding depth degradation and achieving
natural visual perception. Additionally, we propose Convolutional GLU, a
channel mixer that bridges the gap between GLU and SE mechanism, which empowers
each token to have channel attention based on its nearest neighbor image
features, enhancing local modeling capability and model robustness. We combine
aggregated attention and convolutional GLU to create a new visual backbone
called TransNeXt. Extensive experiments demonstrate that our TransNeXt achieves
state-of-the-art performance across multiple model sizes. At a resolution of
$224^2$, TransNeXt-Tiny attains an ImageNet accuracy of 84.0%, surpassing
ConvNeXt-B with 69% fewer parameters. Our TransNeXt-Base achieves an ImageNet
accuracy of 86.2% and an ImageNet-A accuracy of 61.6% at a resolution of
$384^2$, a COCO object detection mAP of 57.1, and an ADE20K semantic
segmentation mIoU of 54.7. |
This paper introduces TransNeXt, a novel visual backbone network for computer vision tasks, incorporating two key components: Aggregated Attention (a biomimetic token mixer inspired by foveal vision) and Convolutional GLU (a channel mixer with gated channel attention). |
The authors aim to address limitations in existing Vision Transformer (ViT) models, such as depth degradation effects and unnatural visual perception arising from stacking layers for information exchange. |
The paper presents Aggregated Attention, which combines dual-path design (fine-grained local and coarse-grained global perception), query embedding, and positional attention to mimic human visual information processing. They also propose Convolutional GLU, which integrates local feature-based channel attention into GLU, improving model robustness. |
TransNeXt achieves state-of-the-art performance across multiple model sizes on various tasks including image classification, object detection, and semantic segmentation.
TransNeXt exhibits superior robustness compared to previous models, particularly on challenging datasets like ImageNet-A.
The CUDA implementation significantly accelerates training and inference, showcasing the practical efficiency of the proposed architecture. |
The model's throughput, while competitive, has room for improvement compared to models utilizing highly optimized dense GPU operators.
Further investigation into the potential trade-off associated with query embedding in out-of-distribution test sets is warranted. |
vision transformer, computer vision, biomimetic design, aggregated attention, convolutional glu |
2311.17126
Report |
Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis |
Xiaohui Chen, Yongfei Liu, Yingxiang Yang, Jianbo Yuan, Quanzeng You, Li-Ping Liu, Hongxia Yang |
Recent advancements in text-to-image (T2I) generative models have shown
remarkable capabilities in producing diverse and imaginative visuals based on
text prompts. Despite the advancement, these diffusion models sometimes
struggle to translate the semantic content from the text into images entirely.
While conditioning on the layout has been shown to be effective in improving the
compositional ability of T2I diffusion models, such methods typically require manual
layout input. In this work, we introduce a novel approach to improving T2I
diffusion models using Large Language Models (LLMs) as layout generators. Our
method leverages the Chain-of-Thought prompting of LLMs to interpret text and
generate spatially reasonable object layouts. The generated layout is then used
to enhance the generated images' composition and spatial accuracy. Moreover, we
propose an efficient adapter based on a cross-attention mechanism, which
explicitly integrates the layout information into the stable diffusion models.
Our experiments demonstrate significant improvements in image quality and
layout accuracy, showcasing the potential of LLMs in augmenting generative
image models. |
This paper introduces a novel approach to enhancing text-to-image diffusion models by employing Large Language Models (LLMs) as layout generators, leveraging Chain-of-Thought (CoT) prompting for improved compositionality. |
Existing text-to-image models often struggle with complex compositions involving multiple objects and spatial relations. This work leverages the reasoning and language understanding capabilities of LLMs to guide image generation with spatially-aware layouts. |
1. LLMs are used to generate object layouts (bounding boxes) from text prompts using CoT prompting for improved spatial reasoning. 2. A novel adapter, LACA, is proposed to integrate these layouts into Stable Diffusion models via cross-attention masks, explicitly guiding object placement during image generation. |
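To make the layout-to-mask step concrete, the sketch below rasterises one bounding box onto a grid of attention positions. It assumes boxes are given as normalised `(x0, y0, x1, y1)` tuples and that a cell belongs to the object when its centre falls inside the box; both conventions and the function name are illustrative assumptions, not details taken from the LACA adapter.

```python
def box_to_mask(box, grid_h, grid_w):
    """Rasterise a normalised bounding box into a grid_h x grid_w 0/1 mask.

    box is (x0, y0, x1, y1) with coordinates in [0, 1]; a cell is set to 1
    when its centre lies inside the box.
    """
    x0, y0, x1, y1 = box
    mask = []
    for r in range(grid_h):
        cy = (r + 0.5) / grid_h  # vertical centre of this row of cells
        row = []
        for c in range(grid_w):
            cx = (c + 0.5) / grid_w  # horizontal centre of this cell
            row.append(1 if (x0 <= cx < x1 and y0 <= cy < y1) else 0)
        mask.append(row)
    return mask
```

A mask like this could then restrict which spatial positions attend to the text tokens of the corresponding object, one mask per generated box.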
The method significantly improves image quality and layout accuracy compared to baseline Stable Diffusion models.
CoT prompting with in-context examples enhances the quality of LLM-generated layouts, resulting in more accurate object placements.
The approach exhibits improved generative counting accuracy, accurately depicting the number of objects specified in the text prompt. |
The reliance on LLM-generated layouts introduces a dependency on the accuracy and robustness of the LLM.
Future work can explore fine-tuning LLMs specifically for layout generation and investigate alternative layout representations beyond bounding boxes. |
text-to-image generation, diffusion models, large language models, chain-of-thought prompting, layout generation |
2311.17123
Report |
ConTex-Human: Free-View Rendering of Human from a Single Image with Texture-Consistent Synthesis |
Xiangjun Gao, Xiaoyu Li, Chaopeng Zhang, Qi Zhang, Yanpei Cao, Ying Shan, Long Quan |
In this work, we propose a method to address the challenge of rendering a 3D
human from a single image in a free-view manner. Some existing approaches could
achieve this by using generalizable pixel-aligned implicit fields to
reconstruct a textured mesh of a human or by employing a 2D diffusion model as
guidance with the Score Distillation Sampling (SDS) method, to lift the 2D
image into 3D space. However, a generalizable implicit field often results in
an over-smooth texture field, while the SDS method tends to lead to a
texture-inconsistent novel view with the input image. In this paper, we
introduce a texture-consistent back view synthesis module that could transfer
the reference image content to the back view through depth and text-guided
attention injection. Moreover, to alleviate the color distortion that occurs in
the side region, we propose a visibility-aware patch consistency regularization
for texture mapping and refinement combined with the synthesized back view
texture. With the above techniques, we could achieve high-fidelity and
texture-consistent human rendering from a single image. Experiments conducted
on both real and synthetic data demonstrate the effectiveness of our method and
show that our approach outperforms previous baseline methods. |
This paper introduces "ConTex-Human", a novel framework for generating high-fidelity, texture-consistent 3D human models from single images. |
Free-view human synthesis from single images is crucial for various applications, but existing methods struggle to generate high-fidelity results with consistent textures, especially in unseen areas. |
The framework uses a three-stage approach: (1) Reconstructing a coarse radiance field using a Zero-1-to-3 diffusion prior. (2) Synthesizing a texture-consistent back view image using a depth and text-guided attention injection module. (3) Refining the geometry and texture using a DMTet mesh and a visibility-aware patch consistency loss, guided by the synthesized back view and reference image. |
Outperforms baseline methods on the THuman2.0 dataset in terms of PSNR, LPIPS, and CLIP metrics, demonstrating better alignment with ground truth.
Generates higher-quality and more texture-consistent results than baseline methods on both synthetic and real datasets (THuman2.0 and SSHQ).
Ablation studies confirm the importance of the texture-consistent back view synthesis and visibility-aware patch consistency loss for achieving high-quality and consistent texture. |
Generated geometry can be coarse, especially in hand and foot regions, and struggles to recover from significant errors in the coarse stage.
Side and invisible regions, though color-consistent, exhibit lower quality and occasional noise compared to front and back views. |
3d human rendering, single image reconstruction, texture consistency, diffusion models, score distillation sampling |
2311.17117
Report |
Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation |
Li Hu, Xin Gao, Peng Zhang, Ke Sun, Bang Zhang, Liefeng Bo |
Character Animation aims to generate character videos from still images
through driving signals. Currently, diffusion models have become the mainstream
in visual generation research, owing to their robust generative capabilities.
However, challenges persist in the realm of image-to-video, especially in
character animation, where temporally maintaining consistency with detailed
information from the character remains a formidable problem. In this paper, we
leverage the power of diffusion models and propose a novel framework tailored
for character animation. To preserve consistency of intricate appearance
features from the reference image, we design ReferenceNet to merge detail features
via spatial attention. To ensure controllability and continuity, we introduce
an efficient pose guider to direct the character's movements and employ an
effective temporal modeling approach to ensure smooth inter-frame transitions
between video frames. By expanding the training data, our approach can animate
arbitrary characters, yielding superior results in character animation compared
to other image-to-video methods. Furthermore, we evaluate our method on
benchmarks for fashion video and human dance synthesis, achieving
state-of-the-art results. |
Presents *Animate Anyone*, a novel diffusion model-based framework for character animation that generates animated videos from character images and desired pose sequences while maintaining appearance consistency and temporal stability. |
Current image-to-video methods struggle to maintain temporal consistency with detailed information from character images, especially in character animation. |
Leverages Stable Diffusion architecture and incorporates: 1) ReferenceNet with spatial attention to preserve detailed appearance features, 2) a lightweight pose guider for controllable movements, and 3) a temporal layer for smooth inter-frame transitions. |
Maintains spatial and temporal consistency of character appearance in videos.
Produces high-definition videos without temporal jitter or flickering.
Achieves state-of-the-art results on fashion video and human dance synthesis benchmarks, outperforming existing image-to-video methods in character animation. |
May struggle to generate highly stable hand movements, leading to occasional distortions and motion blur.
Generating unseen parts during character movement can be unstable due to limited information from a single-view image.
Lower operational efficiency compared to non-diffusion-model-based methods due to DDPM sampling. |
character animation, diffusion models, image-to-video synthesis, appearance consistency, temporal stability |
2311.17095
Report |
Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models |
Jiayun Luo, Siddhesh Khandelwal, Leonid Sigal, Boyang Li |
From image-text pairs, large-scale vision-language models (VLMs) learn to
implicitly associate image regions with words, which proves effective for tasks
like visual question answering. However, leveraging the learned association for
open-vocabulary semantic segmentation remains a challenge. In this paper, we
propose a simple, yet extremely effective, training-free technique,
Plug-and-Play Open-Vocabulary Semantic Segmentation (PnP-OVSS) for this task.
PnP-OVSS leverages a VLM with direct text-to-image cross-attention and an
image-text matching loss. To balance between over-segmentation and
under-segmentation, we introduce Salience Dropout; by iteratively dropping
patches that the model is most attentive to, we are able to better resolve the
entire extent of the segmentation mask. PnP-OVSS does not require any
neural network training and performs hyperparameter tuning without the need for
any segmentation annotations, even for a validation set. PnP-OVSS demonstrates
substantial improvements over comparable baselines (+29.4% mIoU on Pascal VOC,
+13.2% mIoU on Pascal Context, +14.0% mIoU on MS COCO, and +11.4% mIoU on
ADE-20K) and even outperforms most baselines that conduct additional network
training on top of pretrained VLMs. Our codebase is at
https://github.com/letitiabanana/PnP-OVSS. |
This paper introduces PnP-OVSS, a simple and effective training-free technique for open-vocabulary semantic segmentation, leveraging VLMs with cross-attention and image-text matching loss. |
Bridging the gap between VLMs' ability to associate image regions with words and their application in open-vocabulary semantic segmentation. |
PnP-OVSS extracts cross-attention maps from a VLM, sharpens them with GradCAM using ITM loss gradients, refines them iteratively with Salience Dropout, and applies Gaussian blur and Dense CRF for final segmentation. |
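The iterative idea behind Salience Dropout can be sketched as follows. This is a toy under stated assumptions: `attention_fn` is a hypothetical callable standing in for the VLM, the number of patches dropped per round is arbitrary, and aggregating by element-wise maximum is an illustrative choice rather than the rule used in the paper.

```python
def salience_dropout(attention_fn, num_patches, rounds=3, drop_per_round=1):
    """Aggregate attention over several rounds, hiding the most attended patches each time.

    attention_fn(visible) is a hypothetical callable: given a list of booleans
    marking visible patches, it returns one non-negative score per patch
    (0.0 for hidden patches).
    """
    visible = [True] * num_patches
    aggregated = [0.0] * num_patches
    for _ in range(rounds):
        scores = attention_fn(visible)
        # Keep the strongest response each patch has received so far.
        aggregated = [max(a, s) for a, s in zip(aggregated, scores)]
        # Hide the currently most attended visible patches before the next round.
        ranked = sorted((i for i in range(num_patches) if visible[i]),
                        key=lambda i: scores[i], reverse=True)
        for i in ranked[:drop_per_round]:
            visible[i] = False
        if not any(visible):
            break
    return aggregated


def toy_attention(visible, raw=(0.6, 0.3, 0.08, 0.02)):
    """Toy stand-in for a VLM: renormalise fixed raw saliency over visible patches."""
    total = sum(r for r, v in zip(raw, visible) if v)
    return [r / total if v else 0.0 for r, v in zip(raw, visible)]
```

Calling `salience_dropout(toy_attention, 4)` shows the intended effect: once the dominant patch is hidden, weaker patches receive a larger share of attention, so the aggregated map covers more of the object than a single pass would.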
PnP-OVSS achieves substantial improvements over training-free baselines (e.g., +29.4% mIoU on Pascal VOC).
It outperforms most methods requiring finetuning but not using image-text pairs (e.g., +13.7% mIoU on Pascal VOC).
PnP-OVSS even surpasses several techniques requiring finetuning on image-text pairs, particularly on datasets with more classes per image. |
The performance of PnP-OVSS heavily relies on the choice of cross-attention layers and heads.
PnP-OVSS struggles with images containing multiple small object instances or a clutter of different objects. |
open-vocabulary semantic segmentation, vision-language models, zero-shot learning, cross-attention, gradcam |
2311.17092
Report |
SEED-Bench-2: Benchmarking Multimodal Large Language Models |
Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, Ying Shan |
Multimodal large language models (MLLMs), building upon the foundation of
powerful large language models (LLMs), have recently demonstrated exceptional
capabilities in generating not only texts but also images given interleaved
multimodal inputs (acting like a combination of GPT-4V and DALL-E 3). However,
existing MLLM benchmarks remain limited to assessing only models' comprehension
ability of single image-text inputs, failing to keep up with the strides made
in MLLMs. A comprehensive benchmark is imperative for investigating the
progress and uncovering the limitations of current MLLMs. In this work, we
categorize the capabilities of MLLMs into hierarchical levels from $L_0$ to
$L_4$ based on the modalities they can accept and generate, and propose
SEED-Bench-2, a comprehensive benchmark that evaluates the
hierarchical capabilities of MLLMs. Specifically, SEED-Bench-2
comprises 24K multiple-choice questions with accurate human annotations, which
span 27 dimensions, including the evaluation of both text and image
generation. Multiple-choice questions with ground-truth options derived from
human annotation enable an objective and efficient assessment of model
performance, eliminating the need for human or GPT intervention during
evaluation. We further evaluate the performance of 23 prominent open-source
MLLMs and summarize valuable observations. By revealing the limitations of
existing MLLMs through extensive evaluations, we aim for SEED-Bench-2 to
provide insights that will motivate future research towards the goal of General
Artificial Intelligence. Dataset and evaluation code are available at
https://github.com/AILab-CVC/SEED-Bench. |
This paper presents SEED-Bench-2, a comprehensive benchmark designed to evaluate the hierarchical capabilities of Multimodal Large Language Models (MLLMs) up to L3, including their ability to generate both text and images from interleaved image-text inputs. |
Existing MLLM benchmarks primarily focus on single image-text comprehension and fail to showcase the full range of MLLM capabilities, hindering progress in the field. A comprehensive benchmark is crucial for effectively evaluating and advancing MLLMs towards general artificial intelligence. |
The authors categorize MLLM capabilities into hierarchical levels (L0-L4) and construct SEED-Bench-2 with 24K multiple-choice questions spanning 27 evaluation dimensions. The benchmark utilizes a sophisticated pipeline with foundation models, adapts existing datasets, and incorporates human-designed questions to ensure diversity and quality. Multiple-choice format enables objective evaluation using accuracy. |
Existing MLLMs have not yet reached the ceiling level of capability L1 for fixed-form image and text comprehension, with the top model achieving only 60% accuracy.
MLLMs struggle with comprehending free-form interleaved image-text inputs (L2) more than fixed-format inputs, likely due to training data limitations.
Only a few MLLMs have reached capability L3 (image and text generation), highlighting the need for more research in this area. |
Not all MLLMs with image generation capabilities utilize visual autoregression, limiting the evaluation strategy for image output.
Future work includes incorporating evaluations for capability level L4 (open-form interleaved image-text input and output) and expanding evaluation dimensions. |
multimodal large language models, benchmarking, multimodal comprehension, image generation, artificial intelligence |
2311.17091
Report |
Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models |
Zhihe Lu, Jiawang Bai, Xin Li, Zeyu Xiao, Xinchao Wang |
Fine-tuning pre-trained vision-language models (VLMs), e.g., CLIP, for
open-world generalization has gained increasing popularity due to its practical
value. However, performance advancements are limited when relying solely on
intricate algorithmic designs for a single model, even one exhibiting strong
performance, e.g., CLIP-ViT-B/16. This paper, for the first time, explores the
collaborative potential of leveraging much weaker VLMs to enhance the
generalization of a robust single model. The affirmative findings motivate us
to address the generalization problem from a novel perspective, i.e., ensemble
of pre-trained VLMs. We introduce three customized ensemble strategies, each
tailored to one specific scenario. Firstly, we introduce the zero-shot
ensemble, automatically adjusting the logits of different models based on their
confidence when only pre-trained VLMs are available. Furthermore, for scenarios
with extra few-shot samples, we propose the training-free and tuning ensemble,
offering flexibility based on the availability of computing resources. The
proposed ensemble strategies are evaluated on zero-shot, base-to-new, and
cross-dataset generalization, achieving new state-of-the-art performance.
Notably, this work represents an initial stride toward enhancing the
generalization performance of VLMs via ensemble. The code is available at
https://github.com/zhiheLu/Ensemble_VLM.git. |
This paper explores ensemble learning to improve the open-world generalization of pre-trained vision-language models (VLMs), proposing three strategies: zero-shot ensemble, training-free ensemble, and tuning ensemble. |
Existing methods relying on a single VLM, even when powerful, have reached performance saturation in generalization tasks. This paper shows that leveraging multiple VLMs, even weaker ones, can significantly enhance performance. |
The authors introduce three ensemble strategies: (1) **Zero-shot ensemble:** Assigns confidence-aware weights to VLMs based on their prediction confidences. (2) **Training-free ensemble:** Uses a greedy search to find optimal weights on a small training set. (3) **Tuning ensemble:** Trains a sample-aware weight generator on a training set to dynamically generate weights for test samples. |
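A minimal sketch of a confidence-aware zero-shot ensemble is given below. It assumes each model's confidence is its maximum softmax probability and that weights are these confidences normalised to sum to one; the exact weighting rule in the paper may differ, and the function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_weighted_ensemble(per_model_logits):
    """Combine per-model logits for one sample using confidence-derived weights.

    per_model_logits: list of logit lists, one per model, all over the same classes.
    Returns (combined_logits, weights).
    """
    # Assumed confidence measure: the highest class probability of each model.
    confidences = [max(softmax(logits)) for logits in per_model_logits]
    total = sum(confidences)
    weights = [c / total for c in confidences]
    n_classes = len(per_model_logits[0])
    combined = [sum(w * logits[k] for w, logits in zip(weights, per_model_logits))
                for k in range(n_classes)]
    return combined, weights
```

In this sketch a model that is nearly uniform over the classes contributes little, while a confident model dominates the combined logits; no training data is needed, which matches the zero-shot setting.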
Zero-shot ensemble achieves an average accuracy gain of 2.61% across 11 diverse datasets.
Tuning ensemble achieves state-of-the-art performance on base-to-new and cross-dataset generalization benchmarks.
The paper demonstrates the effectiveness of the 'weak helps strong' phenomenon, where weaker VLMs contribute significantly to the ensemble's performance. |
The explored ensemble strategies only scratch the surface of ensemble learning potential for VLM generalization, leaving room for further investigation.
The current study primarily focuses on image classification tasks; future work could explore its applicability to other downstream tasks. |
vision-language models, ensemble learning, open-world generalization, zero-shot learning, few-shot learning |
2311.17089
Report |
Multi-Scale 3D Gaussian Splatting for Anti-Aliased Rendering |
Zhiwen Yan, Weng Fei Low, Yu Chen, Gim Hee Lee |
3D Gaussians have recently emerged as a highly efficient representation for
3D reconstruction and rendering. Despite their high rendering quality and speed
at high resolutions, both deteriorate drastically when rendering at lower
resolutions or from a faraway camera position. During low-resolution or faraway
rendering, the pixel size of the image can fall below the Nyquist frequency
compared to the screen size of each splatted 3D Gaussian, which leads to aliasing
effects. The rendering is also drastically slowed down by the sequential alpha
blending of more splatted Gaussians per pixel. To address these issues, we
propose a multi-scale 3D Gaussian splatting algorithm, which maintains
Gaussians at different scales to represent the same scene. Higher-resolution
images are rendered with more small Gaussians, and lower-resolution images are
rendered with fewer larger Gaussians. With similar training time, our algorithm
can achieve 13%-66% PSNR and 160%-2400% rendering speed improvement at
4$\times$-128$\times$ scale rendering on Mip-NeRF360 dataset compared to the
single scale 3D Gaussian splatting. |
This paper introduces a multi-scale 3D Gaussian splatting algorithm for novel view synthesis that enhances rendering quality and speed at low resolutions or when viewed from a distance. |
Existing 3D Gaussian splatting methods suffer from severe aliasing and slow rendering speeds at low resolutions, limiting their use in large-scale scenes. |
The algorithm utilizes multi-scale 3D Gaussians to represent the scene at varying levels of detail. Small Gaussians are aggregated into larger ones for coarser representations. During rendering, Gaussians are selectively chosen based on their 'pixel coverage', ensuring appropriate level of detail for the given resolution. |
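The selection step can be illustrated with a simplified notion of pixel coverage. The sketch assumes isotropic Gaussians described by a world-space radius and a camera depth, a pinhole camera with focal length in pixels, and hypothetical `min_px` / `max_px` thresholds; the paper's actual coverage definition and thresholds are not given in this summary.

```python
def pixel_coverage(radius_world, depth, focal_px):
    """Approximate on-screen diameter, in pixels, of an isotropic Gaussian footprint."""
    return 2.0 * focal_px * radius_world / depth


def select_gaussians(gaussians, focal_px, min_px=1.0, max_px=float("inf")):
    """Keep Gaussians whose projected size falls inside [min_px, max_px].

    gaussians: list of dicts with 'radius' (world units) and 'depth' (distance to camera).
    """
    kept = []
    for g in gaussians:
        coverage = pixel_coverage(g["radius"], g["depth"], focal_px)
        if min_px <= coverage <= max_px:
            kept.append(g)
    return kept
```

At a low resolution (small `focal_px`) the small Gaussians fall below `min_px` and are skipped, leaving the larger aggregated ones; at a high resolution the same small Gaussians pass the test and contribute fine detail.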
The method achieves 13%-66% PSNR and 160%-2400% rendering speed improvement at 4x-128x downsampled scales on the Mip-NeRF360 dataset.
It maintains comparable rendering quality and speed to single-scale methods at the original resolution.
Qualitative comparisons demonstrate significant reduction in aliasing artifacts and improved visual fidelity at low resolutions. |
Gaussian filtering based on 'pixel coverage' requires splatting all Gaussians before filtering, introducing overhead at very low resolutions.
Future work will explore lightweight criteria for filtering Gaussians before splatting for further speed enhancements. |
3d gaussian splatting, novel view synthesis, anti-aliasing, multi-scale representation, computer graphics |
2311.17086
Report |
PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in non-English Text-to-Image Generation |
Jian Ma, Chen Chen, Qingsong Xie, Haonan Lu |
Text-to-image diffusion models are well-known for their ability to generate
realistic images based on textual prompts. However, the existing works have
predominantly focused on English, lacking support for non-English text-to-image
models. The most commonly used translation methods cannot solve generation
problems rooted in language-specific culture, while training from scratch on a specific
language dataset is prohibitively expensive. In this paper, we are inspired to
propose a simple plug-and-play language transfer method based on knowledge
distillation. All we need to do is train a lightweight MLP-like
parameter-efficient adapter (PEA) with only 6M parameters under teacher
knowledge distillation along with a small parallel data corpus. We are
surprised to find that freezing the parameters of UNet can still achieve
remarkable performance on the language-specific prompt evaluation set,
demonstrating that PEA can stimulate the potential generation ability of the
original UNet. Additionally, it closely approaches the performance of the
English text-to-image model on a general prompt evaluation set. Furthermore,
our adapter can be used as a plugin to achieve significant results in
downstream tasks in cross-lingual text-to-image generation. Code will be
available at: https://github.com/OPPO-Mente-Lab/PEA-Diffusion |
This paper presents PEA-Diffusion, a plug-and-play adapter with knowledge distillation for parameter-efficient adaptation of English text-to-image diffusion models to other languages. |
Existing solutions for non-English text-to-image generation are costly (training from scratch) or struggle with capturing culture-specific concepts. |
PEA-Diffusion uses a lightweight MLP adapter trained with knowledge distillation from a pre-trained English Stable Diffusion model. It aligns representation spaces between the new (non-English) text encoder and the frozen image generator, requiring only a small amount of parallel data. |
PEA-Diffusion effectively captures cultural nuances, outperforming translation, multilingual models, and direct fine-tuning on language-specific prompts.
The method retains strong general image generation abilities, achieving comparable results to the original English model on general prompts.
The plug-and-play adapter seamlessly integrates with downstream applications like LoRA, ControlNet, Inpainting, etc., facilitating cross-lingual adaptation of the English SD ecosystem. |
Performance relies on the quality and representational power of the language-specific CLIP text encoder.
The approach is bounded by the capabilities of the base English model, unable to surpass its general synthesis limits. |
text-to-image generation, cross-lingual transfer, knowledge distillation, parameter-efficient adaptation, diffusion models |
2311.17083
Report |
CLiC: Concept Learning in Context |
Mehdi Safaee, Aryan Mikaeili, Or Patashnik, Daniel Cohen-Or, Ali Mahdavi-Amiri |
This paper addresses the challenge of learning a local visual pattern of an
object from one image, and generating images depicting objects with that
pattern. Learning a localized concept and placing it on an object in a target
image is a nontrivial task, as the objects may have different orientations and
shapes. Our approach builds upon recent advancements in visual concept
learning. It involves acquiring a visual concept (e.g., an ornament) from a
source image and subsequently applying it to an object (e.g., a chair) in a
target image. Our key idea is to perform in-context concept learning, acquiring
the local visual concept within the broader context of the objects they belong
to. To localize the concept learning, we employ soft masks that contain both
the concept within the mask and the surrounding image area. We demonstrate our
approach through object generation within an image, showcasing plausible
embedding of in-context learned concepts. We also introduce methods for
directing acquired concepts to specific locations within target images,
employing cross-attention mechanisms, and establishing correspondences between
source and target objects. The effectiveness of our method is demonstrated
through quantitative and qualitative experiments, along with comparisons
against baseline techniques. |
This paper proposes a method for learning local visual patterns from a single image and transferring them to other objects or generating new objects with the learned pattern, all while preserving the context of the pattern within the object. |
Existing image editing methods struggle to effectively transfer local patterns while maintaining their context and relationship to the object they belong to. This method aims to address this challenge by learning patterns in the context of their surrounding object. |
The method utilizes a diffusion model with in-context concept learning. It learns a token representing the pattern by optimizing multiple loss functions that encourage the model to focus on the pattern region, learn it in the context of the object, and avoid overfitting to the specific instance in the source image. For transfer, it uses masked blended diffusion editing and cross-attention guidance. |
The method can successfully transfer various local patterns, such as ornaments, to objects of the same or different classes.
It enables the generation of new objects that incorporate the learned pattern in a contextually relevant manner.
Quantitative and qualitative comparisons demonstrate the superiority of the proposed method over existing personalization methods like Custom Diffusion, Break-A-Scene, and RealFill. |
The method's performance may deteriorate when the source and target images have significant domain differences.
The optimization process, while effective, is time-consuming and not suitable for real-time applications. |
concept learning, image editing, image generation, diffusion models, pattern transfer |
2311.17082
Report |
DreamPropeller: Supercharge Text-to-3D Generation with Parallel Sampling |
Linqi Zhou, Andy Shih, Chenlin Meng, Stefano Ermon |
Recent methods such as Score Distillation Sampling (SDS) and Variational
Score Distillation (VSD) using 2D diffusion models for text-to-3D generation
have demonstrated impressive generation quality. However, the long generation
time of such algorithms significantly degrades the user experience. To tackle
this problem, we propose DreamPropeller, a drop-in acceleration algorithm that
can be wrapped around any existing text-to-3D generation pipeline based on
score distillation. Our framework generalizes Picard iterations, a classical
algorithm for parallel sampling an ODE path, and can account for non-ODE paths
such as momentum-based gradient updates and changes in dimensions during the
optimization process as in many cases of 3D generation. We show that our
algorithm trades parallel compute for wallclock time and empirically achieves
up to 4.7x speedup with a negligible drop in generation quality for all tested
frameworks. |
This paper introduces DreamPropeller, a general acceleration algorithm applicable to any text-to-3D generation pipeline using score distillation. |
Current text-to-3D methods based on score distillation, while producing impressive results, suffer from prohibitively long generation times, hindering their practical use. |
The method leverages parallel computation by generalizing Picard iterations, a classic technique for parallel sampling of ODE paths, to accommodate the complexities of 3D generation, such as changing parameter dimensions and momentum-based gradient updates. |
DreamPropeller consistently achieves more than a 4x speedup across various 3D representations and score distillation frameworks.
The algorithm's performance improves with higher computational demands per iteration, making it especially beneficial for complex methods like ProlificDreamer.
DreamPropeller maintains high generation quality, comparable to the original non-parallelized methods. |
The current implementation relies on fixed random seeds to eliminate stochasticity during parallel computation, which might limit the exploration of diverse solutions.
Further investigation is needed to explore more efficient strategies for handling LoRA model updates in VSD, potentially through asynchronous parameter sharing or gradient compression. |
text-to-3d generation, score distillation, parallel computation, picard iteration, 3d gaussian splatting |
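The Picard iteration that the DreamPropeller report refers to can be illustrated on a toy problem. The sketch below is the classical fixed-point scheme applied to an Euler-discretized scalar ODE, not DreamPropeller's generalization: it has no momentum, no changing dimensions, and no windowing. The names `picard_solve` and `sequential_solve` are hypothetical and used only for illustration.

```python
def sequential_solve(drift, x0, step, num_steps):
    """Baseline: Euler steps computed one after another."""
    traj = [x0]
    for _ in range(num_steps):
        traj.append(traj[-1] + step * drift(traj[-1]))
    return traj


def picard_solve(drift, x0, step, num_steps, max_iters=100, tol=1e-9):
    """Refine a guess of the whole trajectory until it stops changing."""
    # Initial guess: every point of the trajectory equals the start point.
    traj = [x0] * (num_steps + 1)
    iteration = 0
    for iteration in range(1, max_iters + 1):
        # Each drift evaluation depends only on the previous guess, so all of
        # them could run in parallel (they run serially in this sketch).
        drifts = [drift(x) for x in traj[:-1]]
        # A cumulative sum rebuilds the trajectory from the drifts.
        new_traj = [x0]
        running = x0
        for d in drifts:
            running += step * d
            new_traj.append(running)
        change = max(abs(a - b) for a, b in zip(new_traj, traj))
        traj = new_traj
        if change < tol:
            break
    return traj, iteration


if __name__ == "__main__":
    decay = lambda x: -x
    parallel, iters = picard_solve(decay, 1.0, 0.1, 20)
    serial = sequential_solve(decay, 1.0, 0.1, 20)
    print(iters, max(abs(a - b) for a, b in zip(parallel, serial)))
```

Each sweep costs as much compute as a full sequential pass, but its drift evaluations are independent of one another. This is the trade of parallel compute for wall-clock time that the abstract describes.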
2311.17076
Report |
Compositional Chain-of-Thought Prompting for Large Multimodal Models |
Chancharik Mitra, Brandon Huang, Trevor Darrell, Roei Herzig |
The combination of strong visual backbones and Large Language Model (LLM)
reasoning has led to Large Multimodal Models (LMMs) becoming the current
standard for a wide range of vision and language (VL) tasks. However, recent
research has shown that even the most advanced LMMs still struggle to capture
aspects of compositional visual reasoning, such as attributes and relationships
between objects. One solution is to utilize scene graphs (SGs)--a formalization
of objects and their relations and attributes that has been extensively used as
a bridge between the visual and textual domains. Yet, scene graph data requires
scene graph annotations, which are expensive to collect and thus not easily
scalable. Moreover, finetuning an LMM based on SG data can lead to catastrophic
forgetting of the pretraining objective. To overcome this, inspired by
chain-of-thought methods, we propose Compositional Chain-of-Thought (CCoT), a
novel zero-shot Chain-of-Thought prompting method that utilizes SG
representations in order to extract compositional knowledge from an LMM.
Specifically, we first generate an SG using the LMM, and then use that SG in
the prompt to produce a response. Through extensive experiments, we find that
the proposed CCoT approach not only improves LMM performance on several vision
and language (VL) compositional benchmarks but also improves the performance of
several popular LMMs on general multimodal benchmarks, without the need for
fine-tuning or annotated ground-truth SGs. Code:
https://github.com/chancharikmitra/CCoT |
This paper proposes CCoT, a novel zero-shot Chain-of-Thought prompting method for Large Multimodal Models (LMMs) to improve compositional visual reasoning by leveraging scene graph (SG) representations. |
Existing LMMs struggle to capture compositional aspects of visual scenes, often treating them as a 'bag of objects'. CCoT aims to address this by incorporating structured scene graph information into the reasoning process. |
CCoT employs a two-step process: 1) Scene Graph Generation: An LMM is prompted to generate a scene graph relevant to the input image and task. 2) Response Generation: The LMM is prompted with the image, task prompt, and the *generated* scene graph to produce a response, leveraging the compositional information. |
CCoT significantly improves performance on compositional visual reasoning benchmarks like Winoground and WHOOPS!.
It also enhances performance on general multimodal benchmarks like SEEDBench, MMBench, and LLaVA-Bench In-the-Wild.
Ablations confirm the importance of structured SGs over captions, JSON formatting, and optimal SG length for improved reasoning. |
The method's performance is limited by the context length of current LLM backbones used in LMMs.
Scene graphs might not be suitable for multimodal tasks with a stronger emphasis on language over visual reasoning. |
large multimodal models, compositional reasoning, scene graphs, chain-of-thought prompting, zero-shot learning |
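The two-step CCoT procedure in the report above can be sketched as a pair of calls to the same model. This is a minimal sketch under stated assumptions: `lmm` is a hypothetical callable that takes an image reference and a prompt and returns text, and the prompt wording is illustrative rather than the authors' exact prompts.

```python
from typing import Callable

# Hypothetical interface: (image reference, text prompt) -> generated text.
LMM = Callable[[str, str], str]


def ccot_answer(lmm: LMM, image: str, question: str) -> str:
    """Zero-shot two-step prompting: generate a scene graph, then answer with it."""
    # Step 1: ask the model itself for a scene graph relevant to the task.
    sg_prompt = (
        f"Task: {question}\n"
        "For the provided image and task, generate a scene graph in JSON format "
        "listing the relevant objects, their attributes, and their relationships."
    )
    scene_graph = lmm(image, sg_prompt)

    # Step 2: ask again, now with the generated scene graph in the prompt.
    answer_prompt = (
        f"Scene graph: {scene_graph}\n"
        f"Task: {question}\n"
        "Use the image and the scene graph as context to answer."
    )
    return lmm(image, answer_prompt)
```

No weights are updated and no annotated scene graph is required. The only extra cost over plain prompting is the additional generation call.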
2311.17043
Report |
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models |
Yanwei Li, Chengyao Wang, Jiaya Jia |
In this work, we present a novel method to tackle the token generation
challenge in Vision Language Models (VLMs) for video and image understanding,
called LLaMA-VID. Current VLMs, while proficient in tasks like image captioning
and visual question answering, face computational burdens when processing long
videos due to the excessive visual tokens. LLaMA-VID addresses this issue by
representing each frame with two distinct tokens, namely context token and
content token. The context token encodes the overall image context based on
user input, whereas the content token encapsulates visual cues in each frame.
This dual-token strategy significantly reduces the overload of long videos
while preserving critical information. Generally, LLaMA-VID empowers existing
frameworks to support hour-long videos and pushes their upper limit with an
extra context token. It is shown to surpass previous methods on most video-
or image-based benchmarks. Code is available at
https://github.com/dvlab-research/LLaMA-VID |
LLaMA-VID is a novel method addressing the token generation challenge in Vision Language Models (VLMs) for video and image understanding. |
Existing VLMs struggle to process long videos due to the computational burden of excessive visual tokens from consecutive frames. |
LLaMA-VID represents each frame with two tokens: a context token encoding overall image context based on user input, and a content token encapsulating frame-specific visual cues. This reduces token overload while preserving vital information. |
LLaMA-VID enables existing VLMs to support hour-long videos.
It achieves state-of-the-art results on multiple video and image understanding benchmarks.
The method is computationally efficient, completing training in 2 days on a single machine with 8 A100 GPUs. |
Performance slightly decreases when the content token is significantly compressed (e.g., to 1 token/frame).
Future work involves exploring dynamic token compression based on content importance and resource availability. |
vision language models, video understanding, token generation, long video processing, instruction tuning |
2311.17009
Report |
Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer |
Danah Yatim, Rafail Fridman, Omer Bar-Tal, Yoni Kasten, Tali Dekel |
We present a new method for text-driven motion transfer - synthesizing a
video that complies with an input text prompt describing the target objects and
scene while maintaining an input video's motion and scene layout. Prior methods
are confined to transferring motion across two subjects within the same or
closely related object categories and are applicable for limited domains (e.g.,
humans). In this work, we consider a significantly more challenging setting in
which the target and source objects differ drastically in shape and
fine-grained motion characteristics (e.g., translating a jumping dog into a
dolphin). To this end, we leverage a pre-trained and fixed text-to-video
diffusion model, which provides us with generative and motion priors. The
pillar of our method is a new space-time feature loss derived directly from the
model. This loss guides the generation process to preserve the overall motion
of the input video while complying with the target object in terms of shape and
fine-grained motion traits. |
This paper introduces a zero-shot method for text-driven motion transfer, enabling the transfer of motion from a source video to a target object specified by a text prompt, even when the source and target objects have significant differences in shape and motion characteristics. |
This approach addresses limitations of existing motion transfer techniques that struggle with significant structural deviations between source and target objects, particularly those relying on explicit pose estimation or similar object categories. |
The method leverages the generative motion priors of a pre-trained text-to-video diffusion model, guiding the generation process using a novel loss function based on pairwise differences of spatial marginal mean features extracted from the model. |
The method successfully transfers motion while accommodating significant shape variations and generating plausible scene elements.
Quantitative evaluation demonstrates superior performance in preserving motion fidelity and achieving high edit fidelity compared to existing text-driven video editing methods.
User studies confirm the method's effectiveness, with participants consistently preferring its results over baselines. |
Performance is limited by the pre-trained text-to-video model's ability to handle out-of-distribution motion and object combinations.
Current text-to-video models have limitations in quality, resolution, and video length, restricting the applicability of the method. |
motion transfer, text-driven video editing, diffusion models, space-time feature analysis, generative motion priors |
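The loss described in the motion transfer report above, built on "pairwise differences of spatial marginal mean features", can be illustrated with plain lists. This is a rough sketch, not the authors' implementation. It assumes features are stored as `video[frame][position][channel]` and compares the source and generated videos with a mean squared error. The function names are hypothetical.

```python
def spatial_mean(frame):
    """Average a frame's feature vectors over all spatial positions."""
    count = len(frame)
    channels = len(frame[0])
    return [sum(pos[c] for pos in frame) / count for c in range(channels)]


def pairwise_differences(video):
    """Difference between the spatial means of every ordered pair of frames."""
    means = [spatial_mean(frame) for frame in video]
    return [[[a - b for a, b in zip(mi, mj)] for mj in means] for mi in means]


def motion_loss(source, generated):
    """Mean squared gap between the two videos' pairwise frame differences."""
    diff_src = pairwise_differences(source)
    diff_gen = pairwise_differences(generated)
    total, count = 0.0, 0
    for row_s, row_g in zip(diff_src, diff_gen):
        for vec_s, vec_g in zip(row_s, row_g):
            total += sum((a - b) ** 2 for a, b in zip(vec_s, vec_g))
            count += 1
    return total / count
```

Only differences between frames enter the loss, so a constant offset applied to the features of every frame leaves it at zero, while reordering the frames does not. Averaging over positions also discards the exact spatial layout, which fits the report's point that the target object may differ in shape from the source.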
2311.17002
Report |
Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following |
Yutong Feng, Biao Gong, Di Chen, Yujun Shen, Yu Liu, Jingren Zhou |
Existing text-to-image (T2I) diffusion models usually struggle in
interpreting complex prompts, especially those with quantity, object-attribute
binding, and multi-subject descriptions. In this work, we introduce a semantic
panel as the middleware in decoding texts to images, supporting the generator
to better follow instructions. The panel is obtained through arranging the
visual concepts parsed from the input text by the aid of large language models,
and then injected into the denoising network as a detailed control signal to
complement the text condition. To facilitate text-to-panel learning, we come up
with a carefully designed semantic formatting protocol, accompanied by a
fully-automatic data preparation pipeline. Thanks to such a design, our
approach, which we call Ranni, manages to enhance a pre-trained T2I generator
regarding its textual controllability. More importantly, the introduction of
the generative middleware brings a more convenient form of interaction (i.e.,
directly adjusting the elements in the panel or using language instructions)
and further allows users to finely customize their generation, based on which
we develop a practical system and showcase its potential in continuous
generation and chatting-based editing. Our project page is at
https://ranni-t2i.github.io/Ranni. |
Introduces Ranni, an improved text-to-image generation framework that uses a 'semantic panel' as a middleware to enhance the accuracy and controllability of image generation from text prompts. |
Addresses the limitations of existing text-to-image models in interpreting complex prompts, particularly those involving quantity, attribute binding, and multi-subject descriptions. |
Utilizes large language models (LLMs) to parse text prompts into visual concepts, which are then arranged into a structured semantic panel. This panel serves as a detailed control signal for a diffusion-based image generation model, enabling more precise control over image content and attributes. |
Demonstrates superior performance in following complex prompts, particularly those involving quantity, spatial relationships, and attribute binding.
Enables interactive image editing through direct manipulation of the semantic panel, allowing for intuitive modifications to object attributes, positions, and relationships.
Explores the potential of LLM-powered chatting-based editing, enabling users to refine images through natural language instructions. |
The text-to-panel stage, relying on LLMs, can sometimes produce inaccurate or overlapping object placements, leading to generation errors.
The panel-to-image generation, while more controllable, still exhibits some degree of robustness, sometimes rectifying improper layouts from the text-to-panel stage, which might not always align with user intent. |
text-to-image synthesis, diffusion models, large language models, semantic panel, interactive image editing |
2311.16974
Report |
COLE: A Hierarchical Generation Framework for Multi-Layered and Editable Graphic Design |
Peidong Jia, Chenxuan Li, Yuhui Yuan, Zeyu Liu, Yichao Shen, Bohan Chen, Xingru Chen, Yinglin Zheng, Dong Chen, Ji Li, Xiaodong Xie, Shanghang Zhang, Baining Guo |
Graphic design, which has been evolving since the 15th century, plays a
crucial role in advertising. The creation of high-quality designs demands
design-oriented planning, reasoning, and layer-wise generation. Unlike the
recent CanvaGPT, which integrates GPT-4 with existing design templates to build
a custom GPT, this paper introduces the COLE system - a hierarchical generation
framework designed to comprehensively address these challenges. This COLE
system can transform a vague intention prompt into a high-quality multi-layered
graphic design, while also supporting flexible editing based on user input.
Examples of such input might include directives like "design a poster for
Hisaishi's concert." The key insight is to dissect the complex task of
text-to-design generation into a hierarchy of simpler sub-tasks, each addressed
by specialized models working collaboratively. The results from these models
are then consolidated to produce a cohesive final output. Our hierarchical task
decomposition can streamline the complex process and significantly enhance
generation reliability. Our COLE system comprises multiple fine-tuned Large
Language Models (LLMs), Large Multimodal Models (LMMs), and Diffusion Models
(DMs), each specifically tailored for design-aware layer-wise captioning,
layout planning, reasoning, and the task of generating images and text.
Furthermore, we construct the DESIGNINTENTION benchmark to demonstrate the
superiority of our COLE system over existing methods in generating high-quality
graphic designs from user intent. Last, we present a Canva-like multi-layered
image editing tool to support flexible editing of the generated multi-layered
graphic design images. We perceive our COLE system as an important step towards
addressing more complex and multi-layered graphic design generation tasks in
the future. |
This paper presents a novel hierarchical generation framework named COLE, designed to simplify the complex process of graphic design generation. COLE leverages the power of large multimodal models (LMMs), large language models (LLMs), and diffusion models to decompose the task into manageable, coordinated sub-tasks. |
Existing text-to-image models often struggle with generating high-quality, editable graphic designs from simple user intentions. COLE addresses these limitations by enabling design-oriented planning, reasoning, and layer-wise generation, resulting in multi-layered and editable graphic designs. |
COLE employs a hierarchical approach, utilizing specialized models for each sub-task: 1) Design-LLM translates user intentions into structured JSON files. 2) Cascaded diffusion models generate background and object layers with visual planning and reasoning. 3) Typography-LMM predicts typography attributes based on visual content. 4) Multi-layered SVG editor allows for user editing. |
COLE demonstrates competitive performance against state-of-the-art image generators like DALL-E and CanvaGPT, achieving superior results in design layout, typography, and innovation according to GPT-4V evaluation.
The hierarchical task decomposition and specialized models in COLE facilitate the generation of high-quality graphic design images that are both editable and aligned with user intentions.
The proposed COLE system exhibits strong generalization capability in layout planning and typography attribute reasoning, as evidenced by its performance on the Crello text box placement task. |
The system exhibits limitations in typography block arrangement, the variety of editable visual elements, and typography color selection.
Future work will focus on addressing these limitations and exploring the generation of more complex and diverse graphic designs. |
graphic design generation, hierarchical generation, large language models, diffusion models, typography |
2311.16973
Report |
DemoFusion: Democratising High-Resolution Image Generation With No $$$ |
Ruoyi Du, Dongliang Chang, Timothy Hospedales, Yi-Zhe Song, Zhanyu Ma |
High-resolution image generation with Generative Artificial Intelligence
(GenAI) has immense potential but, due to the enormous capital investment
required for training, it is increasingly centralised to a few large
corporations, and hidden behind paywalls. This paper aims to democratise
high-resolution GenAI by advancing the frontier of high-resolution generation
while remaining accessible to a broad audience. We demonstrate that existing
Latent Diffusion Models (LDMs) possess untapped potential for higher-resolution
image generation. Our novel DemoFusion framework seamlessly extends open-source
GenAI models, employing Progressive Upscaling, Skip Residual, and Dilated
Sampling mechanisms to achieve higher-resolution image generation. The
progressive nature of DemoFusion requires more passes, but the intermediate
results can serve as "previews", facilitating rapid prompt iteration. |
DemoFusion is a novel framework that extends open-source latent diffusion models (LDMs) for high-resolution image generation without requiring additional training or excessive memory resources. |
High-resolution image generation with GenAI is becoming increasingly centralized and commercialized. DemoFusion aims to democratize this technology by enabling users with modest hardware to generate high-resolution images using existing open-source models. |
DemoFusion builds upon the MultiDiffusion framework and introduces three key mechanisms: (i) Progressive Upscaling: Generates images iteratively from low to high resolutions, refining details in each phase. (ii) Skip Residual: Enhances global consistency by integrating noise-inverted representations from lower resolutions as residuals. (iii) Dilated Sampling: Improves global semantic coherence by using dilated sampling of denoising paths. |
DemoFusion successfully generates high-resolution images (up to 4096^2 and beyond) with rich local details and global semantic coherence.
Quantitative evaluations using FID, IS, and CLIP Score demonstrate DemoFusion's superior performance compared to SDXL, MultiDiffusion, SDXL+BSRGAN, and SCALECRAFTER.
DemoFusion enables high-resolution generation on consumer-grade GPUs, making it accessible to a wider audience. |
DemoFusion requires longer inference times compared to baseline methods due to its progressive upscaling and patch-wise denoising processes.
Performance heavily relies on the underlying LDM's ability to generate coherent local patches at higher resolutions, and can be limited by the LDM's inherent biases. |
generative artificial intelligence, high-resolution image generation, latent diffusion models, multidiffusion, progressive upscaling |
2311.16961
Report |
HumanRef: Single Image to 3D Human Generation via Reference-Guided Diffusion |
Jingbo Zhang, Xiaoyu Li, Qi Zhang, Yanpei Cao, Ying Shan, Jing Liao |
Generating a 3D human model from a single reference image is challenging
because it requires inferring textures and geometries in invisible views while
maintaining consistency with the reference image. Previous methods utilizing 3D
generative models are limited by the availability of 3D training data.
Optimization-based methods that lift text-to-image diffusion models to 3D
generation often fail to preserve the texture details of the reference image,
resulting in inconsistent appearances in different views. In this paper, we
propose HumanRef, a 3D human generation framework from a single-view input. To
ensure the generated 3D model is photorealistic and consistent with the input
image, HumanRef introduces a novel method called reference-guided score
distillation sampling (Ref-SDS), which effectively incorporates image guidance
into the generation process. Furthermore, we introduce region-aware attention
to Ref-SDS, ensuring accurate correspondence between different body regions.
Experimental results demonstrate that HumanRef outperforms state-of-the-art
methods in generating 3D clothed humans with fine geometry, photorealistic
textures, and view-consistent appearances. |
This paper presents HumanRef, a novel framework for generating 3D clothed humans from a single image, which leverages a reference-guided score distillation sampling (Ref-SDS) loss to produce realistic and view-consistent textures. |
Reconstructing 3D humans from single images is challenging due to the need to infer textures and geometries in unseen areas while maintaining consistency with the input view. |
HumanRef uses a hash-encoded SDF network optimized in a coarse-to-fine manner, incorporating human geometry constraints, a Ref-SDS loss that injects image guidance into a pretrained diffusion model, and region-aware attention for precise local-region guidance. |
Outperforms state-of-the-art methods in generating 3D clothed humans with fine geometry and photorealistic textures.
Successfully generates view-consistent results matching the reference image.
Produces more realistic textures compared to methods relying solely on text-guided diffusion models. |
May suffer from the Janus problem in side views due to lack of view-specific constraints.
Can fail in cases of extreme poses where body estimation is inaccurate. |
3d human generation, diffusion models, score distillation sampling, region-aware attention, single image reconstruction |
2311.16933
Report |
SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models |
Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, Bo Dai |
The development of text-to-video (T2V), i.e., generating videos with a given
text prompt, has been significantly advanced in recent years. However, relying
solely on text prompts often results in ambiguous frame composition due to
spatial uncertainty. The research community thus leverages the dense structure
signals, e.g., per-frame depth/edge sequences, to enhance controllability,
whose collection accordingly increases the burden of inference. In this work,
we present SparseCtrl to enable flexible structure control with temporally
sparse signals, requiring only one or a few inputs, as shown in Figure 1. It
incorporates an additional condition encoder to process these sparse signals
while leaving the pre-trained T2V model untouched. The proposed approach is
compatible with various modalities, including sketches, depth maps, and RGB
images, providing more practical control for video generation and promoting
applications such as storyboarding, depth rendering, keyframe animation, and
interpolation. Extensive experiments demonstrate the generalization of
SparseCtrl on both original and personalized T2V generators. Codes and models
will be publicly available at https://guoyww.github.io/projects/SparseCtrl. |
This paper presents SparseCtrl, an efficient method for controlling text-to-video (T2V) generation using temporally sparse condition maps, such as sketches, depth maps, or RGB images, via an add-on encoder. |
Current T2V models struggle with fine-grained control and often require dense condition maps for every frame, leading to impractical inference costs. This paper addresses these limitations by enabling control with only a few keyframe conditions. |
SparseCtrl employs a condition encoder with temporal-aware layers to propagate information from conditioned keyframes to unconditioned frames. It leverages masking strategies to handle varying sparsity levels and improves upon ControlNet's design by removing the noised sample input to the encoder. |
SparseCtrl achieves high-fidelity control over the generated video content, closely adhering to the input conditions even with sparse input.
The method demonstrates strong generalization ability, successfully applied to various tasks like sketch-to-video generation, depth-guided generation, image animation, and video interpolation.
Extensive experiments and comparisons with existing methods showcase SparseCtrl's effectiveness in maintaining temporal consistency and achieving comparable or superior performance on chosen tasks. |
The quality and domain of generated videos are limited by the pre-trained T2V backbone and training data.
Out-of-domain inputs, like anime images, can pose challenges due to data scarcity in the training dataset. Future work could explore domain-specific backbones or more diverse training data. |
text-to-video generation, sparse control, diffusion models, controllable video synthesis, keyframe animation |
2311.16922
Report |
Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding |
Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, Lidong Bing |
Large Vision-Language Models (LVLMs) have advanced considerably, intertwining
visual recognition and language understanding to generate content that is not
only coherent but also contextually attuned. Despite their success, LVLMs still
suffer from the issue of object hallucinations, where models generate plausible
yet incorrect outputs that include objects that do not exist in the images. To
mitigate this issue, we introduce Visual Contrastive Decoding (VCD), a simple
and training-free method that contrasts output distributions derived from
original and distorted visual inputs. The proposed VCD effectively reduces the
over-reliance on statistical bias and unimodal priors, two essential causes of
object hallucinations. This adjustment ensures the generated content is closely
grounded to visual inputs, resulting in contextually accurate outputs. Our
experiments show that VCD, without either additional training or the usage of
external tools, significantly mitigates the object hallucination issue across
different LVLM families. Beyond mitigating object hallucinations, VCD also
excels in general LVLM benchmarks, highlighting its wide-ranging applicability. |
This paper introduces Visual Contrastive Decoding (VCD), a training-free method to mitigate object hallucinations in Large Vision-Language Models (LVLMs) by contrasting output distributions from original and distorted visual inputs. |
Object hallucinations, the generation of plausible yet incorrect object descriptions, present a significant challenge to the reliability and applicability of LVLMs in real-world scenarios. |
VCD contrasts output distributions generated from original and distorted images, effectively calibrating the model's over-reliance on statistical bias and unimodal priors (language priors). This approach requires no additional training or external tools. |
VCD significantly reduces object hallucinations across different LVLM families (LLaVA-1.5, InstructBLIP, Qwen-VL) and datasets (MSCOCO, A-OKVQA, GQA).
VCD shows consistent improvements on object hallucination benchmarks, including up to +7.4 F1 score boost on POPE and +18% improvement on MME.
Beyond mitigating hallucinations, VCD enhances the general perception capabilities of LVLMs, as demonstrated by improved performance on MME and LLaVA-Bench. |
The current implementation relies on basic Gaussian noise for visual distortion; exploring more fine-grained techniques could be beneficial.
The study focuses on image and text LVLMs; future work could explore extending VCD to video understanding. |
object hallucination, vision-language models, contrastive decoding, multimodal learning, artificial intelligence |
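The contrast between original and distorted inputs in the VCD report above can be illustrated on a toy vocabulary. The sketch uses a common contrastive decoding form, `(1 + alpha) * original - alpha * distorted` on the logits. Both the weight `alpha` and this exact form are assumptions here, since the report does not spell the formula out. The logits are invented for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_distribution(logits_original, logits_distorted, alpha=1.0):
    """Boost tokens supported by the clean image and penalize tokens that
    stay likely even when the image is distorted."""
    contrasted = [
        (1 + alpha) * lo - alpha * ld
        for lo, ld in zip(logits_original, logits_distorted)
    ]
    return softmax(contrasted)


if __name__ == "__main__":
    vocab = ["dog", "frisbee", "bench"]
    # Invented logits: "dog" stays likely without a clear image (a prior),
    # while "frisbee" is supported only by the clean image.
    original = [2.0, 1.5, 0.5]
    distorted = [2.0, 0.2, 0.4]
    probs = contrastive_distribution(original, distorted)
    print(vocab[probs.index(max(probs))])
```

A token whose score barely changes under distortion is being driven by priors rather than by the image, so subtracting the distorted logits reduces its probability relative to visually grounded tokens.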
2311.16918
Report |
RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D |
Lingteng Qiu, Guanying Chen, Xiaodong Gu, Qi Zuo, Mutian Xu, Yushuang Wu, Weihao Yuan, Zilong Dong, Liefeng Bo, Xiaoguang Han |
Lifting 2D diffusion for 3D generation is a challenging problem due to the
lack of geometric prior and the complex entanglement of materials and lighting
in natural images. Existing methods have shown promise by first creating the
geometry through score-distillation sampling (SDS) applied to rendered surface
normals, followed by appearance modeling. However, relying on a 2D RGB
diffusion model to optimize surface normals is suboptimal due to the
distribution discrepancy between natural images and normal maps, leading to
instability in optimization. In this paper, recognizing that the normal and
depth information effectively describe scene geometry and can be automatically
estimated from images, we propose to learn a generalizable Normal-Depth
diffusion model for 3D generation. We achieve this by training on the
large-scale LAION dataset together with the generalizable image-to-depth and
normal prior models. In an attempt to alleviate the mixed illumination effects
in the generated materials, we introduce an albedo diffusion model to impose
data-driven constraints on the albedo component. Our experiments show that when
integrated into existing text-to-3D pipelines, our models significantly enhance
the detail richness, achieving state-of-the-art results. Our project page is
https://aigc3d.github.io/richdreamer/. |
This paper presents RichDreamer, a novel text-to-3D generation method that leverages a generalizable Normal-Depth diffusion model for enhanced detail and fidelity. |
Existing text-to-3D methods struggle to generate high-quality, detailed objects due to the limitations of 2D diffusion models in capturing 3D geometry and material properties. |
The authors propose a two-stage approach: 1) Train a Normal-Depth diffusion model on a massive real-world dataset (LAION) and fine-tune it on a synthetic dataset (Objaverse) to provide robust geometric priors. 2) Introduce a depth-conditioned albedo diffusion model to disentangle albedo from lighting effects, leading to more accurate appearance modeling. |
RichDreamer significantly outperforms state-of-the-art methods in terms of both geometry and appearance quality, as evidenced by CLIP score comparisons and user studies.
Pre-training the Normal-Depth diffusion model on a large-scale real-world dataset proves crucial for generalization ability.
The depth-conditioned albedo diffusion model effectively separates albedo from lighting artifacts, resulting in more realistic relighting. |
The current method primarily focuses on object-level generation, limiting its applicability to complex scenes.
Further research is needed to develop a comprehensive appearance prior model that regularizes both diffuse and specular components. |
text-to-3d, diffusion model, geometry prior, albedo diffusion, appearance modeling |
2311.16854
Report |
A Unified Approach for Text- and Image-guided 4D Scene Generation |
Yufeng Zheng, Xueting Li, Koki Nagano, Sifei Liu, Karsten Kreis, Otmar Hilliges, Shalini De Mello |
Large-scale diffusion generative models are greatly simplifying image, video
and 3D asset creation from user-provided text prompts and images. However, the
challenging problem of text-to-4D dynamic 3D scene generation with diffusion
guidance remains largely unexplored. We propose Dream-in-4D, which features a
novel two-stage approach for text-to-4D synthesis, leveraging (1) 3D and 2D
diffusion guidance to effectively learn a high-quality static 3D asset in the
first stage; (2) a deformable neural radiance field that explicitly
disentangles the learned static asset from its deformation, preserving quality
during motion learning; and (3) a multi-resolution feature grid for the
deformation field with a displacement total variation loss to effectively learn
motion with video diffusion guidance in the second stage. Through a user
preference study, we demonstrate that our approach significantly advances image
and motion quality, 3D consistency and text fidelity for text-to-4D generation
compared to baseline approaches. Thanks to its motion-disentangled
representation, Dream-in-4D can also be easily adapted for controllable
generation where appearance is defined by one or multiple images, without the
need to modify the motion learning stage. Thus, our method offers, for the
first time, a unified approach for text-to-4D, image-to-4D and personalized 4D
generation tasks. |
Presents Dream-in-4D, a novel two-stage approach for text-to-4D dynamic 3D scene generation with diffusion guidance. |
Addresses the limitations of existing methods in generating high-quality, 3D-consistent, and text-faithful dynamic 3D scenes from text prompts. |
Leverages 3D and 2D diffusion guidance for high-quality static 3D asset generation in the first stage. Employs a deformable neural radiance field to disentangle static assets from motion, enabling motion learning with video diffusion guidance in the second stage. Introduces a multi-resolution feature grid and a displacement total variation loss for detailed and smooth motion. |
Significantly improves image and motion quality, 3D consistency, and text fidelity for text-to-4D generation compared to baseline approaches.
Enables controllable generation where appearance is defined by one or multiple images, without modifying the motion learning stage.
Offers a unified approach for text-to-4D, image-to-4D, and personalized 4D generation tasks. |
The combination of 3D and 2D diffusion priors may not always learn correct static 3D representations, particularly for complex prompts.
The method cannot recover or learn correct motion if the initial static representation is inaccurate. |
text-to-4d, diffusion models, deformable nerf, 3d scene generation, motion synthesis |
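The displacement total variation loss mentioned in the Dream-in-4D entry above can be illustrated with a small sketch. The entry does not give the exact formulation, so this assumes a dense 2D grid of displacement vectors and an L1 penalty on forward differences between neighbouring cells; the function name `displacement_tv_loss` and the grid layout are hypothetical, not the authors' code.

```python
def displacement_tv_loss(disp):
    """Mean absolute difference between neighbouring displacement vectors.

    `disp[i][j]` is the displacement vector (a list of floats) stored at
    grid cell (i, j). Assumption: L1 forward differences along both axes.
    """
    total, count = 0.0, 0
    rows, cols = len(disp), len(disp[0])
    for i in range(rows):
        for j in range(cols):
            for di, dj in ((1, 0), (0, 1)):  # neighbour below, neighbour to the right
                ni, nj = i + di, j + dj
                if ni < rows and nj < cols:
                    total += sum(abs(a - b) for a, b in zip(disp[ni][nj], disp[i][j]))
                    count += 1
    return total / count if count else 0.0


# A constant displacement field is perfectly smooth, so the penalty vanishes.
smooth = [[[0.1, 0.0, 0.0] for _ in range(3)] for _ in range(3)]
# A single displaced cell creates differences with its two forward neighbours.
bumpy = [[[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
         [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]
print(displacement_tv_loss(smooth), displacement_tv_loss(bumpy))
```

A penalty of this kind discourages abrupt spatial changes in the deformation field, which is one plausible reading of how the loss supports smooth motion.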
2311.16737
Report |
Point'n Move: Interactive Scene Object Manipulation on Gaussian Splatting Radiance Fields |
Jiajun Huang, Hongchuan Yu |
We propose Point'n Move, a method that achieves interactive scene object
manipulation with exposed region inpainting. Interactivity here further comes
from intuitive object selection and real-time editing. To achieve this, we
adopt Gaussian Splatting Radiance Field as the scene representation and fully
leverage its explicit nature and speed advantage. Its explicit representation
formulation allows us to devise a 2D prompt points to 3D mask dual-stage
self-prompting segmentation algorithm, perform mask refinement and merging,
minimize change as well as provide good initialization for scene inpainting and
perform editing in real-time without per-editing training, all of which lead to
superior quality and performance. We test our method by performing editing on
both forward-facing and 360 scenes. We also compare our method against existing
scene object removal methods, showing superior quality despite being more
capable and having a speed advantage. |
Point'n Move, an interactive method for scene object manipulation with exposed region inpainting on Gaussian Splatting Radiance Fields. |
Allows users to intuitively select, manipulate, and rearrange objects within 3D scenes for various applications like virtual home furnishing and AR/VR environment creation. |
Leverages the explicit nature and speed of Gaussian Splatting Radiance Fields to perform 2D point prompt to 3D segmentation, scene content revealing pruning, reprojection-based initialization for inpainting, and direct manipulation of primitives for real-time editing. |
Achieves high-quality object selection and editing in both 360 and forward-facing scenes.
Demonstrates competitive performance against existing object removal methods, particularly in terms of inpainting quality and speed.
Shows effectiveness of the dual-stage segmentation, content-revealing pruning, and reprojection-based initialization through ablation studies. |
Currently does not handle lighting or texture, focusing solely on geometry editing.
Inaccuracies in segmentation can lead to artifacts in the inpainted regions, particularly with shadows. |
3d scene manipulation, gaussian splatting radiance fields, exposed region inpainting, interactive editing, 3d segmentation |
2311.16711
Report |
LEDITS++: Limitless Image Editing using Text-to-Image Models |
Manuel Brack, Felix Friedrich, Katharina Kornmeier, Linoy Tsaban, Patrick Schramowski, Kristian Kersting, Apolinário Passos |
Text-to-image diffusion models have recently received increasing interest for
their astonishing ability to produce high-fidelity images from solely text
inputs. Subsequent research efforts aim to exploit and apply their capabilities
to real image editing. However, existing image-to-image methods are often
inefficient, imprecise, and of limited versatility. They either require
time-consuming fine-tuning, deviate unnecessarily strongly from the input
image, and/or lack support for multiple, simultaneous edits. To address these
issues, we introduce LEDITS++, an efficient yet versatile and precise textual
image manipulation technique. LEDITS++'s novel inversion approach requires no
tuning nor optimization and produces high-fidelity results with a few diffusion
steps. Second, our methodology supports multiple simultaneous edits and is
architecture-agnostic. Third, we use a novel implicit masking technique that
limits changes to relevant image regions. We propose the novel TEdBench++
benchmark as part of our exhaustive evaluation. Our results demonstrate the
capabilities of LEDITS++ and its improvements over previous methods. The
project page is available at https://leditsplusplus-project.static.hf.space . |
LEDITS++ is a novel, efficient, versatile, and precise method for text-driven image editing using text-to-image diffusion models. |
Existing image-to-image editing methods are often inefficient, imprecise, and limited in versatility. They often require time-consuming fine-tuning, deviate significantly from the input image, and lack support for multiple simultaneous edits. |
LEDITS++ employs a three-pronged approach: 1) Efficient image inversion using a modified DPM-Solver++ for faster and perfect reconstruction. 2) Versatile textual editing through a novel guidance term that allows multiple edits with individual control. 3) Semantic grounding of edits by combining attention and noise-based masking to restrict changes to relevant image regions. |
LEDITS++ achieves perfect image reconstruction with significantly faster runtime compared to existing methods.
LEDITS++ effectively performs various complex edits, including multi-concept editing, outperforming competing methods in fidelity and preserving image composition.
Implicit masking within LEDITS++ is shown to effectively identify and ground edits to semantically relevant regions, as demonstrated through a segmentation task proxy. |
Editing success is partially dependent on the capabilities of the underlying pre-trained diffusion model.
While excelling at compositional robustness, object coherence within the edited region can be further improved. |
image editing, text-to-image synthesis, diffusion models, semantic guidance, image inversion |
2311.16635
Report |
MotionZero: Exploiting Motion Priors for Zero-shot Text-to-Video Generation |
Sitong Su, Litao Guo, Lianli Gao, Hengtao Shen, Jingkuan Song |
Zero-shot Text-to-Video synthesis generates videos based on prompts without
any videos. Without motion information from videos, motion priors implied in
prompts are vital guidance. For example, the prompt "airplane landing on the
runway" indicates motion priors that the "airplane" moves downwards while the
"runway" stays static. However, previous approaches do not fully exploit
these motion priors, which leads to two nontrivial issues: 1) the motion
variation pattern remains unaltered and prompt-agnostic because motion priors
are disregarded; 2) the motion control of different objects is inaccurate and entangled
without considering the independent motion priors of different objects. To
tackle the two issues, we propose a prompt-adaptive and disentangled motion
control strategy coined as MotionZero, which derives motion priors from prompts
of different objects by Large-Language-Models and accordingly applies motion
control of different objects to corresponding regions in disentanglement.
Furthermore, to facilitate videos with varying degrees of motion amplitude, we
propose a Motion-Aware Attention scheme which adjusts attention among frames by
motion amplitude. Extensive experiments demonstrate that our strategy could
correctly control motion of different objects and support versatile
applications including zero-shot video edit. |
MotionZero, a novel zero-shot text-to-video generation method, introduces a prompt-adaptive and disentangled motion control strategy by leveraging motion priors from prompts and first frames, enabling precise and realistic motion generation in synthesized videos. |
Existing zero-shot text-to-video generation methods fail to fully utilize motion information inherent in prompts, leading to unrealistic motion patterns and inaccurate control of multiple objects. |
1) **Extracting Motion Priors**: LLMs are employed to extract motion priors from text prompts and the generated first frame, identifying moving objects and their directions. 2) **Disentangled Motion Control**: Motion priors are applied separately to corresponding objects in the feature space, utilizing a segmentation model to locate object positions. 3) **Motion-Aware Attention**: A novel attention scheme adjusts attention among frames based on motion amplitude to accommodate videos with varying degrees of motion. |
MotionZero demonstrates superior accuracy in motion control compared to existing methods, evidenced by quantitative evaluations and visual comparisons.
The method effectively disentangles the motion of multiple objects, enabling independent and realistic movement within a scene.
MotionZero supports various applications, including zero-shot video editing with background and foreground manipulation, human body control through skeleton information, camera motion simulation, and evolving event depiction. |
The reliance on external segmentation models and LLMs introduces computational overhead.
The current implementation focuses on generating videos with a fixed number of frames. |
zero-shot learning, text-to-video generation, motion control, large language models, video editing |
2311.16567
Report |
MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices |
Yang Zhao, Yanwu Xu, Zhisheng Xiao, Tingbo Hou |
The deployment of large-scale text-to-image diffusion models on mobile
devices is impeded by their substantial model size and slow inference speed. In
this paper, we propose MobileDiffusion, a highly efficient
text-to-image diffusion model obtained through extensive optimizations in both
architecture and sampling techniques. We conduct a comprehensive examination of
model architecture design to reduce redundancy, enhance computational
efficiency, and minimize the model's parameter count, while preserving image
generation quality. Additionally, we employ distillation and diffusion-GAN
finetuning techniques on MobileDiffusion to achieve 8-step and 1-step inference
respectively. Empirical studies, conducted both quantitatively and
qualitatively, demonstrate the effectiveness of our proposed techniques.
MobileDiffusion achieves a remarkable sub-second inference speed for
generating a $512\times512$ image on mobile devices, establishing a new state
of the art. |
This paper introduces MobileDiffusion, a highly efficient text-to-image diffusion model designed for mobile devices, achieved by optimizing the model architecture and sampling techniques. |
Deploying large-scale text-to-image models on mobile devices is challenging due to their size and slow inference speed. This work addresses this challenge by substantially reducing inference time and model size, paving the way for on-device text-to-image generation. |
The authors perform a comprehensive analysis and modification of the UNet architecture, including reducing redundancy in transformer blocks, employing separable convolutions, and pruning residual blocks. They also utilize distillation and diffusion-GAN finetuning to reduce sampling steps. |
MobileDiffusion achieves sub-second inference speed for generating 512x512 images on mobile devices (e.g., 0.2 seconds on iPhone 15 Pro).
The model achieves comparable image quality to existing models like Stable Diffusion while being significantly smaller and faster.
MobileDiffusion demonstrates strong performance in downstream tasks like controllable generation and LoRA finetuning. |
The model still struggles with generating images requiring uncommon knowledge or complex quantity interpretation, potentially limited by the text encoder.
Future work could explore extending MobileDiffusion to pixel-based diffusion models. |
text-to-image generation, diffusion models, mobile ai, model compression, efficient architecture |
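The MobileDiffusion entry above mentions separable convolutions as one of the architectural changes. The arithmetic behind that choice can be sketched as follows, assuming the common depthwise-separable form (a per-channel k x k convolution followed by a 1 x 1 pointwise convolution) and ignoring biases. The channel count of 320 is purely illustrative and is not taken from the paper.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(c_in, c_out, k):
    """Weights in a depthwise k x k convolution plus a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


full = conv_params(320, 320, 3)
separable = separable_conv_params(320, 320, 3)
print(full, separable, round(full / separable, 2))
```

For a 3 x 3 layer with equal input and output widths, the separable form needs roughly one ninth of the weights, which is why it is a common choice when the parameter budget is tight.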
2311.16513
Report |
Fine-grained Appearance Transfer with Diffusion Models |
Yuteng Ye, Guanwen Li, Hang Zhou, Cai Jiale, Junqing Yu, Yawei Luo, Zikai Song, Qilong Xing, Youjia Zhang, Wei Yang |
Image-to-image translation (I2I), and particularly its subfield of appearance
transfer, which seeks to alter the visual appearance between images while
maintaining structural coherence, presents formidable challenges. Despite
significant advancements brought by diffusion models, achieving fine-grained
transfer remains complex, particularly in terms of retaining detailed
structural elements and ensuring information fidelity. This paper proposes an
innovative framework designed to surmount these challenges by integrating
various aspects of semantic matching, appearance transfer, and latent
deviation. A pivotal aspect of our approach is the strategic use of the
predicted $x_0$ space by diffusion models within the latent space of diffusion
processes. This is identified as a crucial element for the precise and natural
transfer of fine-grained details. Our framework exploits this space to
accomplish semantic alignment between source and target images, facilitating
mask-wise appearance transfer for improved feature acquisition. A significant
advancement of our method is the seamless integration of these features into
the latent space, enabling more nuanced latent deviations without necessitating
extensive model retraining or fine-tuning. The effectiveness of our approach is
demonstrated through extensive experiments, which showcase its ability to
adeptly handle fine-grained appearance transfers across a wide range of
categories and domains. We provide our code at
https://github.com/babahui/Fine-grained-Appearance-Transfer |
This paper proposes a novel framework for fine-grained appearance transfer in image-to-image translation, leveraging the predicted x_0 space of diffusion models. |
Fine-grained appearance transfer, aiming to alter visual appearance while preserving structure, is challenging for existing diffusion models, particularly in maintaining details and fidelity. |
The framework integrates semantic matching in the x_0 space for detail alignment, appearance transfer based on matched features, and a latent deviation method for smooth integration of transferred features into the latent space. |
The method effectively transfers fine-grained details across diverse categories and domains, outperforming existing image-guided methods in qualitative and quantitative comparisons.
The framework excels in preserving structural integrity and information fidelity, contrasting with limitations of text-guided methods in precise detail control.
Ablation studies validate the importance of both semantic matching and latent deviation components for accurate and visually plausible results. |
The method's performance is susceptible to significant viewpoint differences and size discrepancies between source and target images.
Future work will focus on extending the framework to broader image transfer contexts, addressing challenges beyond appearance transfer while maintaining fine-grained detail focus. |
image-to-image translation, appearance transfer, diffusion models, semantic matching, latent deviation |
2311.16512
Report |
CoSeR: Bridging Image and Language for Cognitive Super-Resolution |
Haoze Sun, Wenbo Li, Jianzhuang Liu, Haoyu Chen, Renjing Pei, Xueyi Zou, Youliang Yan, Yujiu Yang |
Existing super-resolution (SR) models primarily focus on restoring local
texture details, often neglecting the global semantic information within the
scene. This oversight can lead to the omission of crucial semantic details or
the introduction of inaccurate textures during the recovery process. In our
work, we introduce the Cognitive Super-Resolution (CoSeR) framework, empowering
SR models with the capacity to comprehend low-resolution images. We achieve
this by marrying image appearance and language understanding to generate a
cognitive embedding, which not only activates prior information from large
text-to-image diffusion models but also facilitates the generation of
high-quality reference images to optimize the SR process. To further improve
image fidelity, we propose a novel condition injection scheme called
"All-in-Attention", consolidating all conditional information into a single
module. Consequently, our method successfully restores semantically correct and
photorealistic details, demonstrating state-of-the-art performance across
multiple benchmarks. Code: https://github.com/VINHYU/CoSeR |
Introduces Cognitive Super-Resolution (CoSeR), a novel framework that empowers super-resolution models with cognitive abilities by generating semantic embeddings and high-quality reference images from low-resolution inputs using text-to-image diffusion models. |
Existing SR models often neglect global semantic information, leading to the loss of crucial details or introduction of inaccurate textures. CoSeR addresses this by incorporating cognitive understanding similar to human perception. |
A cognitive encoder extracts semantic and textural embeddings from LR images. These embeddings generate high-fidelity reference images and are integrated with LR input into a denoising U-Net via a novel All-in-Attention (AiA) module. |
CoSeR achieves state-of-the-art performance on multiple benchmarks, including ImageNet Test2000, RealSR, and DRealSR.
Generated reference images exhibit high semantic similarity to LR inputs, effectively guiding the restoration process.
The AiA module enhances the fidelity of SR results, ensuring consistency with the input image. |
The improvement from using multiple reference images plateaus beyond 2-3 images.
Further research on accelerating the sampling process in diffusion-based SR models is needed. |
image super-resolution, cognitive super-resolution, diffusion models, reference image generation, all-in-attention module |
2311.16507
Report |
Exploring Straighter Trajectories of Flow Matching with Diffusion Guidance |
Siyu Xing, Jie Cao, Huaibo Huang, Xiao-Yu Zhang, Ran He |
Flow matching, as a paradigm of generative modeling, achieves notable success
across various domains. However, existing methods use either multi-round
training or knowledge within minibatches, posing challenges in finding a
favorable coupling strategy for straight trajectories. To address this issue,
we propose a novel approach, Straighter trajectories of Flow Matching
(StraightFM). It straightens trajectories with a coupling strategy guided by
a diffusion model at the level of the entire distribution. First, we propose a coupling
strategy to straighten trajectories, creating couplings between image and noise
samples under diffusion model guidance. Second, StraightFM also integrates real
data to enhance training, employing a neural network to parameterize another
coupling process from images to noise samples. StraightFM is jointly optimized
with couplings from the above two mutually complementary directions, resulting in
straighter trajectories and enabling both one-step and few-step generation.
Extensive experiments demonstrate that StraightFM yields high quality samples
with fewer steps. StraightFM generates visually appealing images with a lower
FID among diffusion and traditional flow matching methods within 5 sampling
steps when trained on pixel space. In the latent space (i.e., Latent
Diffusion), StraightFM achieves a lower KID value compared to existing methods
on the CelebA-HQ 256 dataset in fewer than 10 sampling steps. |
StraightFM is a novel approach for flow matching generative models that leverages diffusion model guidance to straighten trajectories and enable one-step and few-step generation. |
Existing flow matching methods rely on multi-round training or minibatch knowledge for coupling strategies, leading to limitations in finding favorable couplings for straight trajectories. |
StraightFM uses a pre-trained diffusion model to guide the coupling strategy by creating couplings between image and noise samples. It also integrates real data to enhance training by using a neural network to parameterize another coupling process from images to noise samples. |
StraightFM achieves straighter paths and high-quality image generation in fewer steps, even with one-step generation, on CIFAR-10.
StraightFM outperforms latent diffusion models on CelebA-HQ 256x256 dataset in under 10 sampling steps.
StraightFM demonstrates promising results in image inpainting, highlighting the efficacy of natural optimal transport couplings for flow matching in restoration tasks. |
The training of StraightFM depends on the coupling mechanism of diffusion model sampling, which can be slower than random coupling strategies.
Future work can explore balancing coupling speed and sample quality. |
generative models, flow matching, diffusion models, optimal transport, image generation |
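To make the link between straight trajectories and one-step generation in the StraightFM entry above concrete, here is a minimal sketch of the generic flow matching setup. It assumes the usual linear interpolation between a noise sample `x0` and an image sample `x1` with a constant target velocity; it does not reproduce StraightFM's diffusion-guided coupling, and all function names are hypothetical.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (noise) to x1 (image) at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the velocity network on a straight path: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, num_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-size Euler steps."""
    x, dt = list(x0), 1.0 / num_steps
    for step in range(num_steps):
        v = velocity_fn(x, step * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise, image = [0.5, -1.0], [2.0, 1.0]
velocity = target_velocity(noise, image)
# With a perfectly straight trajectory the velocity is constant,
# so a single Euler step already lands on the coupled image sample.
one_step = euler_sample(noise, lambda x, t: velocity, num_steps=1)
print(one_step, interpolate(noise, image, 0.5))
```

The straighter the learned trajectories, the smaller the discretization error of coarse Euler steps, which is the motivation for seeking couplings that yield straight paths.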
2311.16504
Report |
Rethinking Directional Integration in Neural Radiance Fields |
Congyue Deng, Jiawei Yang, Leonidas Guibas, Yue Wang |
Recent works use the Neural radiance field (NeRF) to perform multi-view 3D
reconstruction, providing a significant leap in rendering photorealistic
scenes. However, despite its efficacy, NeRF exhibits limited capability of
learning view-dependent effects compared to light field rendering or
image-based view synthesis. To that end, we introduce a modification to the
NeRF rendering equation which is as simple as a few lines of code change for
any NeRF variations, while greatly improving the rendering quality of
view-dependent effects. By swapping the integration operator and the direction
decoder network, we only integrate the positional features along the ray and
move the directional terms out of the integration, resulting in a
disentanglement of the view-dependent and independent components. The modified
equation is equivalent to the classical volumetric rendering in ideal cases on
object surfaces with Dirac densities. Furthermore, we prove that with the
errors caused by network approximation and numerical integration, our rendering
equation exhibits better convergence properties with lower error accumulations
compared to the classical NeRF. We also show that the modified equation can be
interpreted as light field rendering with learned ray embeddings. Experiments
on different NeRF variations show consistent improvements in the quality of
view-dependent effects with our simple modification. |
This paper introduces LiNeRF, a simple modification to the Neural Radiance Field (NeRF) rendering equation that enhances the rendering quality of view-dependent effects by disentangling view-dependent and view-independent components. |
NeRFs struggle to effectively model view-dependent effects due to redundant view-direction queries that over-consume network capacity. This modification addresses this issue by integrating positional features along rays and decoding the aggregated feature with view direction, leading to a more efficient and accurate representation. |
The core methodology involves swapping the integration operator and the direction decoder network in the NeRF rendering equation. This modification, which can be implemented with minimal code changes, integrates positional features along the ray and decodes the aggregated feature with view direction, similar to light field rendering with learned ray embeddings. |
LiNeRF demonstrates consistent improvements in rendering view-dependent effects across various NeRF architectures and input encodings.
Theoretical analysis proves that LiNeRF provides a better numerical estimator of radiance integration with a tighter error bound compared to classic NeRF.
Experimental results on synthetic and real-world datasets, including Shiny Blender and Shiny datasets, showcase LiNeRF's capability to effectively model diverse view-dependent effects like reflections, refractions, and light interferences. |
LiNeRF, while showing significant improvements over classic NeRF, still lags behind image-based view synthesis methods specifically designed for non-Lambertian effects, which benefit from explicit pixel value representations.
The research team aims to investigate tighter integration of implicit radiance field rendering with explicit pixel-based rendering techniques and explore optimal feature selection strategies for integration based on network architectures. |
neural radiance fields, nerf, view synthesis, light field rendering, view-dependent effects |
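The operator swap described in the LiNeRF entry above can be illustrated with a toy sketch. It uses the standard volume-rendering weights and a fixed nonlinear function, `toy_decoder`, as a hypothetical stand-in for the learned direction decoder; features and view directions are assumed to have the same dimension only to keep the example short. This is not the authors' implementation.

```python
import math


def render_weights(densities, deltas):
    """Volume-rendering weights w_i = T_i * (1 - exp(-sigma_i * delta_i))."""
    weights, transmittance = [], 1.0
    for sigma, delta in zip(densities, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights


def toy_decoder(feature, view_dir):
    """Hypothetical direction decoder: a fixed nonlinear map, not a trained network."""
    return math.tanh(sum(f * d for f, d in zip(feature, view_dir)))


def classic_render(features, weights, view_dir):
    # Classic NeRF order: decode each sample with the view direction, then integrate.
    return sum(w * toy_decoder(f, view_dir) for f, w in zip(features, weights))


def swapped_render(features, weights, view_dir):
    # Swapped order: integrate positional features along the ray, then decode once.
    dim = len(features[0])
    aggregated = [sum(w * f[k] for f, w in zip(features, weights)) for k in range(dim)]
    return toy_decoder(aggregated, view_dir)


features = [[0.2, 0.1, 0.0], [0.5, -0.3, 0.8], [0.9, 0.9, 0.9]]
view_dir = [0.0, 0.6, 0.8]
# A near-Dirac density at the second sample puts all weight on that sample.
surface_weights = render_weights([0.0, 1e9, 5.0], [0.1, 0.1, 0.1])
print(classic_render(features, surface_weights, view_dir),
      swapped_render(features, surface_weights, view_dir))
```

With all weight on one sample the two orders agree, which matches the abstract's statement about Dirac densities on object surfaces. For spread-out weights the results differ in general because the decoder is nonlinear.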
2311.16499
Report |
Deceptive-Human: Prompt-to-NeRF 3D Human Generation with 3D-Consistent Synthetic Images |
Shiu-hong Kao, Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang |
This paper presents Deceptive-Human, a novel Prompt-to-NeRF framework
capitalizing on state-of-the-art control diffusion models (e.g., ControlNet) to
generate a high-quality controllable 3D human NeRF. Different from direct 3D
generative approaches, e.g., DreamFusion and DreamHuman, Deceptive-Human
employs a progressive refinement technique to elevate the reconstruction
quality. This is achieved by utilizing high-quality synthetic human images
generated through the ControlNet with view-consistent loss. Our method is
versatile and readily extensible, accommodating multimodal inputs, including a
text prompt and additional data such as 3D mesh, poses, and seed images. The
resulting 3D human NeRF model empowers the synthesis of highly photorealistic
novel views from 360-degree perspectives. The key to our Deceptive-Human for
hallucinating multi-view consistent synthetic human images lies in our
progressive finetuning strategy. This strategy involves iteratively enhancing
views using the provided multimodal inputs at each intermediate step to improve
the human NeRF model. Within this iterative refinement process, view-dependent
appearances are systematically eliminated to prevent interference with the
underlying density estimation. Extensive qualitative and quantitative
experimental comparison shows that our deceptive human models achieve
state-of-the-art application quality. |
Presents Deceptive-Human, a novel Prompt-to-NeRF framework that leverages control diffusion models to generate high-quality, controllable 3D human NeRFs. |
Addresses the challenge of creating realistic 3D human models, particularly in generating high-fidelity appearances and enabling controllability. |
Employs a progressive refinement technique. Generates coarse NeRF using view-consistent diffusion models. Iteratively enhances the coarse NeRF by denoising rendered images. |
Achieves state-of-the-art application quality for 3D human generation.
Demonstrates the first 3D human model that accepts inputs with various controls like text, pose, style, edges, depth, and seed images.
Generates highly photorealistic novel views from 360-degree perspectives, showcasing the model's capability for novel view synthesis. |
The quality of the generated 3D human relies on the accuracy of the mesh estimation module, which can be improved.
Extending the framework to generate animated 3D humans, potentially leveraging style and pose controls, presents an interesting direction for future work. |
3d human generation, neural radiance fields, diffusion models, controllable generation, progressive refinement |
2311.16498
Report |
MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model |
Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, Mike Zheng Shou |
This paper studies the human image animation task, which aims to generate a
video of a certain reference identity following a particular motion sequence.
Existing animation works typically employ the frame-warping technique to
animate the reference image towards the target motion. Despite achieving
reasonable results, these approaches face challenges in maintaining temporal
consistency throughout the animation due to the lack of temporal modeling and
poor preservation of reference identity. In this work, we introduce
MagicAnimate, a diffusion-based framework that aims at enhancing temporal
consistency, preserving reference image faithfully, and improving animation
fidelity. To achieve this, we first develop a video diffusion model to encode
temporal information. Second, to maintain the appearance coherence across
frames, we introduce a novel appearance encoder to retain the intricate details
of the reference image. Leveraging these two innovations, we further employ a
simple video fusion technique to encourage smooth transitions for long video
animation. Empirical results demonstrate the superiority of our method over
baseline approaches on two benchmarks. Notably, our approach outperforms the
strongest baseline by over 38% in terms of video fidelity on the challenging
TikTok dancing dataset. Code and model will be made available. |
This paper presents MagicAnimate, a novel diffusion-based human image animation framework that leverages temporal modeling and robust appearance encoding for generating temporally consistent and high-fidelity animations. |
Existing animation methods often struggle with maintaining temporal consistency and preserving fine-grained details of the reference image, leading to flickering and unrealistic results. |
MagicAnimate employs a video diffusion model with temporal attention blocks for capturing temporal information and introduces an appearance encoder to retain detailed features from the reference image. It also utilizes an image-video joint training strategy and a video fusion technique for enhancing animation quality and smoothness. |
MagicAnimate achieves state-of-the-art performance on two benchmarks, TikTok and TED-talks, surpassing baselines in video fidelity and single-frame quality.
The method demonstrates superior temporal consistency compared to existing diffusion-based animation approaches.
MagicAnimate exhibits strong generalization ability, enabling cross-identity animation, unseen domain animation, and multi-person animation. |
The higher L1 error on the TED-talks dataset suggests potential limitations in handling dynamic backgrounds due to the use of DensePose as motion input.
Future work includes exploring alternative motion representations and extending the framework to handle more complex scenes and interactions. |
image animation, diffusion models, temporal consistency, appearance encoding, human avatar |
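The video fusion step named in the MagicAnimate report above can be pictured as splitting a long sequence into overlapping segments and averaging the per-frame predictions wherever segments overlap. The sketch below is only an illustration under that assumption: the segment length, the overlap, and the function names `split_segments` and `fuse_segments` are hypothetical, and a single float per frame stands in for a frame's latent.

```python
def split_segments(num_frames, seg_len, overlap):
    """Return (start, end) index pairs of overlapping windows covering all frames."""
    if not 0 <= overlap < seg_len:
        raise ValueError("overlap must satisfy 0 <= overlap < seg_len")
    stride = seg_len - overlap
    segments = []
    start = 0
    while True:
        end = min(start + seg_len, num_frames)
        segments.append((start, end))
        if end == num_frames:
            break
        start += stride
    return segments


def fuse_segments(num_frames, segments, predictions):
    """Average per-frame values over every segment that contains the frame."""
    totals = [0.0] * num_frames
    counts = [0] * num_frames
    for (start, _end), values in zip(segments, predictions):
        for offset, value in enumerate(values):
            totals[start + offset] += value
            counts[start + offset] += 1
    return [total / count for total, count in zip(totals, counts)]
```

Frames covered by two segments receive the mean of both predictions, which is one simple way to smooth the transition between neighbouring segments.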
2311.16492
Report |
VLPrompt: Vision-Language Prompting for Panoptic Scene Graph Generation |
Zijian Zhou, Miaojing Shi, Holger Caesar |
Panoptic Scene Graph Generation (PSG) aims at achieving a comprehensive image
understanding by simultaneously segmenting objects and predicting relations
among objects. However, the long-tail problem among relations leads to
unsatisfactory results in real-world applications. Prior methods predominantly
rely on vision information or utilize limited language information, such as
object or relation names, thereby overlooking the utility of language
information. Leveraging the recent progress in Large Language Models (LLMs), we
propose to use language information to assist relation prediction, particularly
for rare relations. To this end, we propose the Vision-Language Prompting
(VLPrompt) model, which acquires vision information from images and language
information from LLMs. Then, through a prompter network based on attention
mechanism, it achieves precise relation prediction. Our extensive experiments
show that VLPrompt significantly outperforms previous state-of-the-art methods
on the PSG dataset, proving the effectiveness of incorporating language
information and alleviating the long-tail problem of relations. |
This paper proposes VLPrompt, a novel Vision-Language Prompting model for Panoptic Scene Graph Generation (PSG) that leverages the rich language information from Large Language Models (LLMs) to address the long-tail problem in relation categories. |
Current PSG models struggle to accurately predict rare relations due to the long-tail problem. This paper explores the use of LLMs to provide common sense knowledge and improve relation prediction, particularly for rare relations. |
VLPrompt consists of three components: (1) Vision feature extractor: extracts visual features from object pairs using a segmentation network; (2) Language feature extractor: employs designed prompts and LLMs to generate descriptions for potential relations and judgments on relation triplets, encoding them into features; (3) Vision-language prompter: utilizes an attention-based network to enable interaction between vision and language features for relation prediction, which are then fused for the final prediction. |
VLPrompt significantly outperforms previous state-of-the-art methods on the PSG dataset, demonstrating the effectiveness of incorporating language information.
Ablation studies validate the contribution of each component and the effectiveness of design choices.
VLPrompt shows significant improvement in predicting rare relations, effectively alleviating the long-tail problem. |
The model's efficiency could be further improved, as it currently requires more FLOPs than some previous models.
The reliance on pre-trained LLMs may limit its generalizability to open-set relation prediction. |
panoptic scene graph generation, large language models, vision-language model, long-tail problem, relation prediction |
2311.16473
Report |
GS-IR: 3D Gaussian Splatting for Inverse Rendering |
Zhihao Liang, Qi Zhang, Ying Feng, Ying Shan, Kui Jia |
We propose GS-IR, a novel inverse rendering approach based on 3D Gaussian
Splatting (GS) that leverages forward mapping volume rendering to achieve
photorealistic novel view synthesis and relighting results. Unlike previous
works that use implicit neural representations and volume rendering (e.g.
NeRF), which suffer from low expressive power and high computational
complexity, we extend GS, a top-performance representation for novel view
synthesis, to estimate scene geometry, surface material, and environment
illumination from multi-view images captured under unknown lighting conditions.
There are two main problems when introducing GS to inverse rendering: 1) GS
does not support producing plausible normal natively; 2) forward mapping (e.g.
rasterization and splatting) cannot trace the occlusion like backward mapping
(e.g. ray tracing). To address these challenges, our GS-IR proposes an
efficient optimization scheme that incorporates a depth-derivation-based
regularization for normal estimation and a baking-based occlusion to model
indirect lighting. The flexible and expressive GS representation allows us to
achieve fast and compact geometry reconstruction, photorealistic novel view
synthesis, and effective physically-based rendering. We demonstrate the
superiority of our method over baseline methods through qualitative and
quantitative evaluations on various challenging scenes. |
GS-IR, a novel 3D Gaussian-based inverse rendering framework, leverages forward mapping splatting to deduce the physical attributes of a complex scene from multi-view images captured under unknown lighting conditions. |
Existing inverse rendering methods using implicit neural representations suffer from low expressive power and high computational complexity. GS-IR, based on 3D Gaussian Splatting, offers a more compact and efficient representation for faster, real-time rendering while achieving high quality. |
GS-IR employs a three-stage strategy: 1) Optimizes 3D Gaussians for geometry reconstruction and uses depth gradient to supervise normal estimation. 2) Precomputes occlusion information and stores it in spherical harmonics-based architectures to model indirect illumination. 3) Uses differentiable splatting with a physically-based rendering pipeline to optimize illumination and material-aware 3D Gaussians. |
GS-IR achieves superior novel view synthesis and albedo quality compared to baseline methods on the TensoIR Synthetic dataset.
The method demonstrates fast convergence and supports real-time rendering due to its efficient 3D Gaussian representation and tile-based rasterizer.
GS-IR effectively handles complex real unbounded scenes, reconstructing high-fidelity geometry and materials. |
Modeling the specular term of indirect illumination remains a limitation.
Spherical Harmonics used for occlusion modeling are only suitable for low-frequency details. |
inverse rendering, 3d gaussian splatting, physically-based rendering, novel view synthesis, relighting |
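The first stage of the GS-IR report above uses the depth gradient to supervise normal estimation. As a rough illustration of that idea, the sketch below derives per-pixel normals from a depth map with finite differences. It assumes an orthographic camera and unit pixel spacing, and the function name `normals_from_depth` is hypothetical; it is not the authors' regularizer.

```python
import math


def normals_from_depth(depth):
    """Estimate per-pixel normals from a depth map using finite differences.

    Assumes an orthographic camera with unit pixel spacing; border pixels
    fall back to one-sided differences.
    """
    height, width = len(depth), len(depth[0])
    normals = []
    for y in range(height):
        row = []
        for x in range(width):
            x0, x1 = max(x - 1, 0), min(x + 1, width - 1)
            y0, y1 = max(y - 1, 0), min(y + 1, height - 1)
            dzdx = (depth[y][x1] - depth[y][x0]) / max(x1 - x0, 1)
            dzdy = (depth[y1][x] - depth[y0][x]) / max(y1 - y0, 1)
            # A surface z = f(x, y) has the unnormalized normal (-df/dx, -df/dy, 1).
            length = math.sqrt(dzdx * dzdx + dzdy * dzdy + 1.0)
            row.append((-dzdx / length, -dzdy / length, 1.0 / length))
        normals.append(row)
    return normals
```

A flat depth map yields normals pointing straight along the viewing axis, while a tilted plane yields the same tilted normal at every pixel.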
2311.16465
Report |
TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering |
Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, Furu Wei |
The diffusion model has been proven a powerful generative model in recent
years, yet remains a challenge in generating visual text. Several methods
alleviated this issue by incorporating explicit text position and content as
guidance on where and what text to render. However, these methods still suffer
from several drawbacks, such as limited flexibility and automation, constrained
capability of layout prediction, and restricted style diversity. In this paper,
we present TextDiffuser-2, aiming to unleash the power of language models for
text rendering. Firstly, we fine-tune a large language model for layout
planning. The large language model is capable of automatically generating
keywords for text rendering and also supports layout modification through
chatting. Secondly, we utilize the language model within the diffusion model to
encode the position and texts at the line level. Unlike previous methods that
employed tight character-level guidance, this approach generates more diverse
text images. We conduct extensive experiments and incorporate user studies
involving human participants as well as GPT-4V, validating TextDiffuser-2's
capacity to achieve a more rational text layout and generation with enhanced
diversity. The code and model will be available at
https://aka.ms/textdiffuser-2. |
This document provides guidelines for formatting author responses to peer reviews, specifically for LaTeX users. |
Standardized formatting ensures that author responses are clear, concise, and easy for reviewers to read and assess. |
The document outlines specific formatting requirements including page limits, font sizes, margin spacing, figure placement, and referencing styles for LaTeX. |
Author responses must be no longer than one page.
Figures and references should be numbered separately from the main paper.
Font sizes and line widths should be legible in a printed copy. |
The guidelines are specific to LaTeX, potentially limiting accessibility for authors using other typesetting systems.
Further clarification on acceptable content beyond factual errors or requested information could be beneficial. |
latex, author response, formatting guidelines, peer review, academic publishing |
2311.16254
Report |
Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models |
Samuele Poppi, Tobia Poppi, Federico Cocchi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara |
Large-scale vision-and-language models, such as CLIP, are typically trained
on web-scale data, which can introduce inappropriate content and lead to the
development of unsafe and biased behavior. This, in turn, hampers their
applicability in sensitive and trustworthy contexts and could raise significant
concerns in their adoption. Our research introduces a novel approach to
enhancing the safety of vision-and-language models by diminishing their
sensitivity to NSFW (not safe for work) inputs. In particular, our methodology
seeks to sever "toxic" linguistic and visual concepts, unlearning the linkage
between unsafe linguistic or visual items and unsafe regions of the embedding
space. We show how this can be done by fine-tuning a CLIP model on synthetic
data obtained from a large language model trained to convert between safe and
unsafe sentences, and a text-to-image generator. We conduct extensive
experiments on the resulting embedding space for cross-modal retrieval,
text-to-image, and image-to-text generation, where we show that our model can
be remarkably employed with pre-trained generative models. Our source code and
trained models are available at: https://github.com/aimagelab/safe-clip. |
Introduces Safe-CLIP, a fine-tuning methodology to enhance the safety of pre-trained CLIP models by diminishing their sensitivity to NSFW inputs. |
Addresses the issue of inappropriate content and biased behavior in large-scale vision-and-language models trained on web-scale data, enhancing their applicability in sensitive contexts. |
Fine-tunes CLIP using a synthetic dataset of safe and unsafe images and texts, generated via a large language model trained to convert between safe and unsafe sentences and a text-to-image generator. Employs multiple loss functions to redirect inappropriate content to safe regions while preserving the embedding space structure. |
Safe-CLIP significantly reduces the retrieval of NSFW images and text when using unsafe queries.
Significantly reduces the probability of generating NSFW images using Stable Diffusion with both I2P and VISU prompts.
Effectively reduces the probability of generating inappropriate textual descriptions by multimodal LLMs (e.g., LLaVA) when provided with NSFW images. |
Limited guarantee of success, with potential failure cases.
Ethical implications of the toxic language model used for dataset generation. |
trustworthy ai, vision-and-language, nsfw concepts, cross-modal retrieval, text-to-image generation |
2311.16122
Report |
Semantic Generative Augmentations for Few-Shot Counting |
Perla Doubinsky, Nicolas Audebert, Michel Crucianu, Hervé Le Borgne |
With the availability of powerful text-to-image diffusion models, recent
works have explored the use of synthetic data to improve image classification
performances. These works show that it can effectively augment or even replace
real data. In this work, we investigate how synthetic data can benefit few-shot
class-agnostic counting. This requires generating images that correspond to a
given input number of objects. However, text-to-image models struggle to grasp
the notion of count. We propose to rely on a double conditioning of Stable
Diffusion with both a prompt and a density map in order to augment a training
dataset for few-shot counting. Due to the small dataset size, the fine-tuned
model tends to generate images close to the training images. We propose to
enhance the diversity of synthesized images by exchanging captions between
images thus creating unseen configurations of object types and spatial layout.
Our experiments show that our diversified generation strategy significantly
improves the counting accuracy of two recent, well-performing few-shot counting
models on FSC147 and CARPK. |
This paper introduces a novel data augmentation strategy for few-shot object counting that leverages the power of text-to-image diffusion models by conditioning them on both text prompts and density maps. |
Few-shot object counting suffers from limited data, hindering performance. This work addresses this challenge with a data augmentation strategy based on text-to-image diffusion models tailored for counting tasks. |
The authors fine-tune a pre-trained Stable Diffusion model using ControlNet to generate new images conditioned on both textual prompts and density maps. To enhance diversity, they propose a caption swapping mechanism based on semantic similarity, generating unseen combinations of objects and spatial layouts. |
The proposed diverse augmentation strategy significantly improves counting accuracy over traditional augmentation methods on the FSC147 benchmark dataset.
The method also improves the generalization capability of the models, as demonstrated by state-of-the-art performance on the CARPK dataset for car counting.
Experiments demonstrate the importance of caption similarity-based swapping and the optimal balance between real and synthetic data during training. |
Changing the object category through caption swapping may lead to inaccurate exemplar bounding boxes if the new object's shape differs significantly.
Further exploration is needed for effectively refining exemplar boxes in cases where caption swapping leads to mismatches between the generated object and the original bounding box. |
few-shot learning, object counting, data augmentation, diffusion models, controlnet |
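The few-shot counting report above conditions generation on a density map, so it helps to recall what such a map encodes: each annotated object contributes a blob of total mass 1, and the map sums to the object count. The sketch below builds one from point annotations. The Gaussian kernel, the `sigma` value, and the function names are illustrative assumptions, not the paper's exact recipe.

```python
import math


def density_map(points, height, width, sigma=1.5):
    """Build a density map whose values sum to the number of annotated points."""
    grid = [[0.0] * width for _ in range(height)]
    for px, py in points:
        # Unnormalized Gaussian weights centered on the annotated point.
        weights = [
            [math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)
        ]
        total = sum(map(sum, weights))
        # Normalize inside the image so each object contributes exactly 1.
        for y in range(height):
            for x in range(width):
                grid[y][x] += weights[y][x] / total
    return grid


def count_from_density(grid):
    """Recover the object count by summing the density map."""
    return sum(map(sum, grid))
```

For example, `count_from_density(density_map([(2, 3), (5, 5), (7, 1)], 8, 10))` is 3 up to floating-point error, whatever the spatial layout of the points.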
2311.16103
Report |
Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models |
Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, Li Yuan |
Video-based large language models (Video-LLMs) have been recently introduced,
targeting both fundamental improvements in perception and comprehension, and a
diverse range of user inquiries. In pursuit of the ultimate goal of achieving
artificial general intelligence, a truly intelligent Video-LLM model should not
only see and understand the surroundings, but also possess human-level
commonsense, and make well-informed decisions for the users. To guide the
development of such a model, the establishment of a robust and comprehensive
evaluation system becomes crucial. To this end, this paper proposes
Video-Bench, a new comprehensive benchmark along with a toolkit
specifically designed for evaluating Video-LLMs. The benchmark comprises 10
meticulously crafted tasks, evaluating the capabilities of Video-LLMs across
three distinct levels: Video-exclusive Understanding, Prior Knowledge-based
Question-Answering, and Comprehension and Decision-making. In addition, we
introduce an automatic toolkit tailored to process model outputs for various
tasks, facilitating the calculation of metrics and generating convenient final
scores. We evaluate 8 representative Video-LLMs using Video-Bench. The
findings reveal that current Video-LLMs still fall considerably short of
achieving human-like comprehension and analysis of real-world videos, offering
valuable insights for future research directions. The benchmark and toolkit are
available at: https://github.com/PKU-YuanGroup/Video-Bench. |
This paper introduces "Video-Bench," a comprehensive benchmark and toolkit for evaluating Video-LLMs across three levels of capability: Video-exclusive Understanding, Prior Knowledge-based Question-Answering, and Comprehension and Decision-making. |
A robust evaluation system is crucial for guiding the development of Video-LLMs towards achieving artificial general intelligence, as existing benchmarks lack comprehensiveness in assessing these capabilities. |
The benchmark comprises 10 meticulously crafted tasks, evaluating various aspects of Video-LLM abilities. An automatic toolkit processes model outputs, calculates metrics, and generates final scores, streamlining the evaluation workflow. |
Current Video-LLMs excel at summarizing basic video content but struggle with temporal reasoning and detail-oriented tasks.
Lack of domain-specific prior knowledge limits Video-LLMs' ability to understand and answer questions requiring external knowledge.
Most tested models exhibit limited proficiency in comprehension and decision-making within complex scenarios, suggesting a need for larger-scale training and enhanced multimodal understanding. |
The reliance on multiple-choice questions, while simplifying evaluation, may not fully capture the nuances of Video-LLM responses.
Future work should explore more robust evaluation metrics for long-form text responses and address the need for efficient long video understanding. |
video-llms, benchmarking, video understanding, multimodal learning, artificial general intelligence |
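The Video-Bench report above describes a toolkit that turns free-form model outputs into metrics for multiple-choice questions. A minimal sketch of that kind of post-processing follows. The letter-matching rule and the function names `extract_choice` and `accuracy` are assumptions for illustration and are not taken from the released toolkit.

```python
import re


def extract_choice(response, options="ABCD"):
    """Return the first standalone option letter in a response, or None."""
    match = re.search(rf"\b([{options}])\b", response.strip())
    return match.group(1) if match else None


def accuracy(responses, answers):
    """Fraction of responses whose extracted choice equals the reference answer."""
    if not answers:
        return 0.0
    correct = sum(
        extract_choice(response) == answer
        for response, answer in zip(responses, answers)
    )
    return correct / len(answers)
```

This simple rule is case-sensitive and would misread a sentence that starts with the article "A" as option A, which is one reason a real toolkit may need a more robust mapping from text to options.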
2311.16101
Report |
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs |
Haoqin Tu, Chenhang Cui, Zijun Wang, Yiyang Zhou, Bingchen Zhao, Junlin Han, Wangchunshu Zhou, Huaxiu Yao, Cihang Xie |
This work focuses on the potential of Vision LLMs (VLLMs) in visual
reasoning. Different from prior studies, we shift our focus from evaluating
standard performance to introducing a comprehensive safety evaluation suite,
covering both out-of-distribution (OOD) generalization and adversarial
robustness. For the OOD evaluation, we present two novel VQA datasets, each
with one variant, designed to test model performance under challenging
conditions. In exploring adversarial robustness, we propose a straightforward
attack strategy for misleading VLLMs to produce visual-unrelated responses.
Moreover, we assess the efficacy of two jailbreaking strategies, targeting
either the vision or language component of VLLMs. Our evaluation of 21 diverse
models, ranging from open-source VLLMs to GPT-4V, yields interesting
observations: 1) Current VLLMs struggle with OOD texts but not images, unless
the visual information is limited; and 2) These VLLMs can be easily misled by
deceiving vision encoders only, and their vision-language training often
compromises safety protocols. We release this safety evaluation suite at
https://github.com/UCSC-VLAA/vllm-safety-benchmark. |
This paper introduces a comprehensive safety evaluation suite for Vision Large Language Models (VLLMs) encompassing out-of-distribution (OOD) generalization and adversarial robustness. |
Assessing the safety of VLLMs is crucial for their responsible integration into real-world applications, as existing benchmarks primarily focus on standard performance. |
The authors propose two novel VQA datasets for OOD evaluation and a simple attack strategy to mislead VLLMs. They also benchmark two jailbreaking attacks targeting vision and language components. Evaluation is performed on 21 models, including open-source VLLMs and GPT-4V. |
VLLMs excel in comprehending OOD visual content but struggle with OOD textual input, highlighting the importance of language understanding.
VLLMs, including GPT-4V, face challenges processing sketch images due to limited information content.
Simple attacks targeting CLIP's vision encoder effectively mislead VLLMs, while GPT-4V exhibits a higher tendency to refuse answers to inappropriate inputs. |
The study primarily focuses on CLIP-based VLLMs, leaving room for future research on other architectures.
The proposed attack strategies, while effective, might not encompass the full spectrum of potential vulnerabilities in VLLMs. |
vision language models, safety evaluation, out-of-distribution generalization, adversarial robustness, jailbreaking attacks |
2311.16099
Report |
GART: Gaussian Articulated Template Models |
Jiahui Lei, Yufu Wang, Georgios Pavlakos, Lingjie Liu, Kostas Daniilidis |
We introduce Gaussian Articulated Template Model GART, an explicit,
efficient, and expressive representation for non-rigid articulated subject
capturing and rendering from monocular videos. GART utilizes a mixture of
moving 3D Gaussians to explicitly approximate a deformable subject's geometry
and appearance. It takes advantage of a categorical template model prior (SMPL,
SMAL, etc.) with learnable forward skinning while further generalizing to more
complex non-rigid deformations with novel latent bones. GART can be
reconstructed via differentiable rendering from monocular videos in seconds or
minutes and rendered in novel poses faster than 150fps. |
This paper introduces GART, a novel explicit and efficient representation for capturing and rendering non-rigid articulated subjects from monocular videos using Gaussian Mixture Models (GMM). |
Current implicit methods like NeRFs, though high-quality, suffer from slow rendering speeds. Explicit methods, while efficient, often lack quality. GART bridges this gap by explicitly approximating the implicit radiance field, combining the strengths of both. |
GART leverages a template model (e.g., SMPL, SMAL) and represents the canonical shape and appearance using GMM. It employs learnable forward skinning for animation and introduces latent bones to capture complex deformations, such as loose clothing. |
GART achieves state-of-the-art performance on monocular human reconstruction and rendering benchmarks (ZJU-MoCap, People-Snapshot) with superior efficiency.
GART demonstrates high fidelity in reconstructing challenging clothing like long dresses from the UBC-Fashion dataset, outperforming baselines like InstantAvatar.
GART successfully extends to animal reconstruction, capturing detailed appearances of diverse dog breeds from in-the-wild videos using the D-SMAL template. |
The method currently relies on the availability of template pose estimators, limiting its applicability to species without readily available estimators.
Future work could explore capturing category-level priors from large in-the-wild video collections to generalize beyond single-video fitting. |
3d reconstruction, articulated motion, monocular video, gaussian mixture model, differentiable rendering |
2311.16097
Report |
CG-HOI: Contact-Guided 3D Human-Object Interaction Generation |
Christian Diller, Angela Dai |
We propose CG-HOI, the first method to address the task of generating dynamic
3D human-object interactions (HOIs) from text. We model the motion of both
human and object in an interdependent fashion, as semantically rich human
motion rarely happens in isolation without any interactions. Our key insight is
that explicitly modeling contact between the human body surface and object
geometry can be used as strong proxy guidance, both during training and
inference. Using this guidance to bridge human and object motion enables
generating more realistic and physically plausible interaction sequences, where
the human body and corresponding object move in a coherent manner. Our method
first learns to model human motion, object motion, and contact in a joint
diffusion process, inter-correlated through cross-attention. We then leverage
this learned contact for guidance during inference to synthesize realistic and
coherent HOIs. Extensive evaluation shows that our joint contact-based
human-object interaction approach generates realistic and physically plausible
sequences, and we show two applications highlighting the capabilities of our
method. Conditioned on a given object trajectory, we can generate the
corresponding human motion without re-training, demonstrating strong
human-object interdependency learning. Our approach is also flexible, and can
be applied to static real-world 3D scene scans. |
This paper proposes CG-HOI, a novel method for generating dynamic 3D human-object interactions (HOIs) from text descriptions by jointly modeling human motion, object motion, and contact between them. |
Realistic modeling of human-object interactions is crucial for various applications, but previous methods struggled to generate plausible and coherent interactions due to neglecting the interdependency of human and object motions. |
CG-HOI utilizes a denoising diffusion process with cross-attention to learn the correlations between human, object, and contact representations. A contact-based object transform weighting scheme ensures object motion is primarily influenced by the body part in closest contact. During inference, a contact-based guidance refines generated sequences for physical plausibility. |
CG-HOI generates more realistic and physically plausible HOIs compared to baselines, effectively mitigating artifacts like object floating.
The method demonstrates strong human-object interdependency learning, enabling conditional generation of human motion given object trajectories without retraining.
CG-HOI can be applied to populate static 3D scene scans with realistic HOIs. |
The method currently focuses on interactions with a single object, limiting its applicability to more complex scenarios with multiple objects.
The reliance on expensive 3D HOI data for training and manual text annotations poses challenges for scalability and generalization. |
3d human-object interaction, denoising diffusion model, contact modeling, text-to-motion generation, scene population |
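The contact-based object transform weighting in the CG-HOI report above lets the body part in closest contact dominate the object's motion. One way to picture this is a softmax over negative contact distances, used to blend per-part object translations. The softmax form, the `temperature` value, and the function names below are assumptions made for illustration, not the paper's formulation.

```python
import math


def contact_weights(distances, temperature=0.05):
    """Softmax over negative distances: closer body parts get larger weights."""
    nearest = min(distances)
    # Subtracting the minimum keeps the exponentials numerically stable.
    exps = [math.exp(-(d - nearest) / temperature) for d in distances]
    total = sum(exps)
    return [e / total for e in exps]


def blend_translations(distances, translations, temperature=0.05):
    """Blend per-part 3D object translations using the contact weights."""
    weights = contact_weights(distances, temperature)
    return tuple(
        sum(w * t[axis] for w, t in zip(weights, translations))
        for axis in range(3)
    )
```

With one part touching the object (distance 0) and another far away, the blended translation is almost exactly the one proposed by the touching part; with equal distances both parts contribute equally.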
2311.16096
Report |
Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling |
Zhe Li, Zerong Zheng, Lizhen Wang, Yebin Liu |
Modeling animatable human avatars from RGB videos is a long-standing and
challenging problem. Recent works usually adopt MLP-based neural radiance
fields (NeRF) to represent 3D humans, but it remains difficult for pure MLPs to
regress pose-dependent garment details. To this end, we introduce Animatable
Gaussians, a new avatar representation that leverages powerful 2D CNNs and 3D
Gaussian splatting to create high-fidelity avatars. To associate 3D Gaussians
with the animatable avatar, we learn a parametric template from the input
videos, and then parameterize the template on two front & back canonical
Gaussian maps where each pixel represents a 3D Gaussian. The learned template
is adaptive to the wearing garments for modeling looser clothes like dresses.
Such template-guided 2D parameterization enables us to employ a powerful
StyleGAN-based CNN to learn the pose-dependent Gaussian maps for modeling
detailed dynamic appearances. Furthermore, we introduce a pose projection
strategy for better generalization given novel poses. Overall, our method can
create lifelike avatars with dynamic, realistic and generalized appearances.
Experiments show that our method outperforms other state-of-the-art approaches.
Code: https://github.com/lizhe00/AnimatableGaussians |
This paper presents Animatable Gaussians, a novel method for creating high-fidelity animatable human avatars from multi-view RGB videos using 3D Gaussian splatting and 2D CNNs. |
Existing methods struggle to model fine-grained, dynamic details due to the limitations of MLPs in representing implicit functions. This work aims to overcome these limitations by leveraging the strengths of explicit representations and 2D CNNs. |
The method learns a parametric template from the input videos to capture garment shape. It then parameterizes 3D Gaussians on this template and employs a StyleGAN-based network to predict pose-dependent Gaussian maps. Finally, it utilizes LBS for deforming Gaussians and differentiable splatting for rendering. |
Creates high-fidelity avatars with detailed dynamic appearances, surpassing existing methods in visual quality.
Learns a character-specific template, allowing for accurate animation of complex garments like long dresses.
Introduces a pose projection strategy, enhancing generalization to novel, out-of-distribution poses. |
Limited to entangled modeling of body and clothes, hindering applications like virtual try-on.
Requires multi-view input for template reconstruction, limiting its applicability to monocular videos. |
animatable avatars, 3d gaussian splatting, human modeling, computer vision, deep learning |
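The Animatable Gaussians report above deforms the Gaussians with linear blend skinning (LBS). The sketch below shows the standard LBS formula applied to a single 3D point, such as a Gaussian's center: the posed position is the weighted sum of the point transformed by each bone. The function names and the `(rotation, translation)` bone layout are illustrative choices, and the sketch does not cover how a Gaussian's covariance would be rotated.

```python
def apply_rigid(rotation, translation, point):
    """Apply a 3x3 rotation (row-major nested lists) and a translation to a point."""
    return tuple(
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )


def linear_blend_skinning(point, weights, bones):
    """Blend the per-bone rigid transforms of `point` using skinning weights."""
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("skinning weights must sum to 1")
    posed = [0.0, 0.0, 0.0]
    for weight, (rotation, translation) in zip(weights, bones):
        moved = apply_rigid(rotation, translation, point)
        for axis in range(3):
            posed[axis] += weight * moved[axis]
    return tuple(posed)
```

A point bound equally to a static bone and to a bone translated by 2 units along x moves by 1 unit along x, which is the blending behaviour that lets surfaces near joints deform smoothly.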
2311.16090
Report |
Self-correcting LLM-controlled Diffusion Models |
Tsung-Han Wu, Long Lian, Joseph E. Gonzalez, Boyi Li, Trevor Darrell |
Text-to-image generation has witnessed significant progress with the advent
of diffusion models. Despite the ability to generate photorealistic images,
current text-to-image diffusion models still often struggle to accurately
interpret and follow complex input text prompts. In contrast to existing models
that aim to generate images only with their best effort, we introduce
Self-correcting LLM-controlled Diffusion (SLD). SLD is a framework that
generates an image from the input prompt, assesses its alignment with the
prompt, and performs self-corrections on the inaccuracies in the generated
image. Steered by an LLM controller, SLD turns text-to-image generation into an
iterative closed-loop process, ensuring correctness in the resulting image. SLD
is not only training-free but can also be seamlessly integrated with diffusion
models behind API access, such as DALL-E 3, to further boost the performance of
state-of-the-art diffusion models. Experimental results show that our approach
can rectify a majority of incorrect generations, particularly in generative
numeracy, attribute binding, and spatial relationships. Furthermore, by simply
adjusting the instructions to the LLM, SLD can perform image editing tasks,
bridging the gap between text-to-image generation and image editing pipelines.
We will make our code available for future research and applications. |
Introduces Self-correcting LLM-controlled Diffusion (SLD), a framework that enhances text-to-image alignment in diffusion models by iteratively identifying and rectifying errors in generated images through LLM-guided object detection and latent space operations. |
Addresses the limitations of existing text-to-image diffusion models that often struggle to accurately interpret and follow complex prompts, especially those requiring numeracy, spatial relationships, and attribute binding. |
Employs an LLM parser to extract key objects from user prompts, an open-vocabulary detector to locate objects in the image, and an LLM controller to analyze discrepancies and suggest correction operations (addition, deletion, repositioning, attribute modification) implemented via latent space composition. |
Significantly improves generation correctness over state-of-the-art diffusion models on complex prompts, as demonstrated by the LMD benchmark.
Achieves substantial performance gains on numeracy, attribute binding, and spatial reasoning tasks.
Effectively unifies text-to-image generation and image editing tasks within a single framework. |
Faces challenges with objects of complex shapes due to limitations in the object segmentation module.
Future work includes exploring the integration of advanced LMMs for more streamlined image assessment and editing. |
text-to-image generation, diffusion models, large language models, image editing, self-correction |
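The detect, compare, and correct loop described in the SLD report above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the "image" is a toy multiset of labels, and `plan_corrections`, `self_correct`, `detect`, and `apply_op` are hypothetical names standing in for the LLM controller, the open-vocabulary detector, and the latent-space operations. Only count mismatches (addition and deletion) are modeled.

```python
from collections import Counter


def plan_corrections(required, detected):
    """Compare required object counts with detected ones and list edit operations."""
    ops = []
    for name in sorted(set(required) | set(detected)):
        want = required.get(name, 0)
        have = detected.get(name, 0)
        if have < want:
            ops.extend(("add", name) for _ in range(want - have))
        elif have > want:
            ops.extend(("delete", name) for _ in range(have - want))
    return ops


def self_correct(required, detect, apply_op, max_rounds=3):
    """Closed loop: detect, plan, apply, and stop once nothing is left to fix."""
    for _ in range(max_rounds):
        ops = plan_corrections(required, detect())
        if not ops:
            return True
        for op in ops:
            apply_op(op)
    return not plan_corrections(required, detect())


# Toy "image": a multiset of object labels standing in for detector output (assumption).
scene = Counter({"cat": 1, "dog": 3})


def detect():
    return dict(scene)


def apply_op(op):
    kind, name = op
    scene[name] += 1 if kind == "add" else -1


required = {"cat": 2, "dog": 1}  # what a prompt like "two cats and a dog" would ask for
first_plan = plan_corrections(required, detect())
converged = self_correct(required, detect, apply_op)
```

In a real system the plan would also carry bounding boxes and attribute changes; the loop structure is the part this sketch is meant to show.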
2311.16043
Report |
Relightable 3D Gaussian: Real-time Point Cloud Relighting with BRDF Decomposition and Ray Tracing |
Jian Gao, Chun Gu, Youtian Lin, Hao Zhu, Xun Cao, Li Zhang, Yao Yao |
We present a novel differentiable point-based rendering framework for
material and lighting decomposition from multi-view images, enabling editing,
ray-tracing, and real-time relighting of the 3D point cloud. Specifically, a 3D
scene is represented as a set of relightable 3D Gaussian points, where each
point is additionally associated with a normal direction, BRDF parameters, and
incident lights from different directions. To achieve robust lighting
estimation, we further divide incident lights of each point into global and
local components, as well as view-dependent visibilities. The 3D scene is
optimized through the 3D Gaussian Splatting technique while BRDF and lighting
are decomposed by physically-based differentiable rendering. Moreover, we
introduce an innovative point-based ray-tracing approach based on the bounding
volume hierarchy for efficient visibility baking, enabling real-time rendering
and relighting of 3D Gaussian points with accurate shadow effects. Extensive
experiments demonstrate improved BRDF estimation and novel view rendering
results compared to state-of-the-art material estimation approaches. Our
framework showcases the potential to revolutionize the mesh-based graphics
pipeline with a relightable, traceable, and editable rendering pipeline solely
based on point cloud. Project
page: https://nju-3dv.github.io/projects/Relightable3DGaussian/. |
This paper introduces a novel differentiable point-based rendering framework named Relightable 3D Gaussian for material and lighting decomposition from multi-view images. This enables editing, ray-tracing, and real-time relighting of the reconstructed 3D point cloud. |
The proposed framework offers a potential alternative to the mesh-based graphics pipeline with a relightable, traceable, and editable rendering pipeline solely based on point cloud. |
The framework represents a 3D scene as a set of relightable 3D Gaussian points, each associated with normal direction, BRDF parameters, and incident lights. It optimizes the scene representation using a combination of 3D Gaussian Splatting, physically-based differentiable rendering, and a novel point-based ray tracing approach based on the bounding volume hierarchy. |
The method achieves improved BRDF estimation compared to existing material estimation approaches.
It enables high-quality novel view synthesis, outperforming several state-of-the-art methods.
The framework allows for real-time rendering and relighting of scenes with realistic shadow effects. |
The method struggles with unbounded scenes and requires object masks during optimization.
The integration of multi-view stereo (MVS) into the optimization process for more accurate geometry is left for future work. |
differentiable rendering, point-based rendering, material and lighting decomposition, ray tracing, 3d gaussian splatting |
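For readers unfamiliar with physically-based rendering, the decomposition described in the Relightable 3D Gaussian report above builds on the standard rendering equation. The first line below is that textbook equation. The second line is only one plausible way to write the split of incident light into a visibility-weighted global term and a local term. Its symbols ($V$, $L_{\mathrm{global}}$, $L_{\mathrm{local}}$) are assumptions made for illustration and are not taken from the report.

```latex
% Outgoing radiance at point x with normal n, viewed from direction \omega_o
L_o(\omega_o, x) = \int_{\Omega} f(\omega_o, \omega_i, x)\, L_i(\omega_i, x)\, (\omega_i \cdot n)\, d\omega_i

% Assumed form of the global/local split with a visibility term V
L_i(\omega_i, x) = V(\omega_i, x)\, L_{\mathrm{global}}(\omega_i) + L_{\mathrm{local}}(\omega_i, x)
```

Here $f$ is the BRDF and $\Omega$ is the hemisphere around the normal $n$.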
2311.16037
Report |
GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions |
Jiemin Fang, Junjie Wang, Xiaopeng Zhang, Lingxi Xie, Qi Tian |
Recently, impressive results have been achieved in 3D scene editing with text
instructions based on a 2D diffusion model. However, current diffusion models
primarily generate images by predicting noise in the latent space, and the
editing is usually applied to the whole image, which makes it challenging to
perform delicate, especially localized, editing for 3D scenes. Inspired by
recent 3D Gaussian splatting, we propose a systematic framework, named
GaussianEditor, to edit 3D scenes delicately via 3D Gaussians with text
instructions. Benefiting from the explicit property of 3D Gaussians, we design
a series of techniques to achieve delicate editing. Specifically, we first
extract the region of interest (RoI) corresponding to the text instruction,
aligning it to 3D Gaussians. The Gaussian RoI is further used to control the
editing process. Our framework can achieve more delicate and precise editing of
3D scenes than previous methods while enjoying much faster training speed, i.e.
within 20 minutes on a single V100 GPU, more than twice as fast as
Instruct-NeRF2NeRF (45 minutes -- 2 hours). |
This paper proposes GaussianEditor, a novel framework to edit 3D scenes delicately using text instructions and 3D Gaussian splatting. |
Existing 3D scene editing methods using 2D diffusion models lack the ability to localize editing regions, making it difficult to perform delicate and precise 3D scene editing. |
The method consists of three steps: 1) Region of Interest (RoI) extraction from text instruction, 2) Aligning the instruction RoI to 3D Gaussians through an image grounding model and training, 3) Editing the original 3D Gaussians within the obtained Gaussian RoI by a 2D diffusion model. |
GaussianEditor enables separate foreground and background editing, even in complex multi-object scenes.
It achieves more delicate and precise 3D scene editing compared to previous methods like Instruct-NeRF2NeRF.
The method exhibits fast training time, completing within 20 minutes on a single V100 GPU. |
The scene description generation might be inaccurate when descriptions from different views of the same object vary significantly.
The system's performance is limited by the accuracy of the grounding segmentation and diffusion models. |
3d scene editing, text-guided editing, 3d gaussian splatting, region of interest, diffusion models |
2311.15980
Report |
Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion |
Yuanxun Lu, Jingyang Zhang, Shiwei Li, Tian Fang, David McKinnon, Yanghai Tsin, Long Quan, Xun Cao, Yao Yao |
Recent advances in generative AI have unveiled significant potential for the
creation of 3D content. However, current methods either apply a pre-trained 2D
diffusion model with the time-consuming score distillation sampling (SDS), or a
direct 3D diffusion model trained on limited 3D data losing generation
diversity. In this work, we approach the problem by employing a multi-view 2.5D
diffusion fine-tuned from a pre-trained 2D diffusion model. The multi-view 2.5D
diffusion directly models the structural distribution of 3D data, while still
maintaining the strong generalization ability of the original 2D diffusion
model, filling the gap between 2D diffusion-based and direct 3D diffusion-based
methods for 3D content generation. During inference, multi-view normal maps are
generated using the 2.5D diffusion, and a novel differentiable rasterization
scheme is introduced to fuse the almost consistent multi-view normal maps into
a consistent 3D model. We further design a normal-conditioned multi-view image
generation module for fast appearance generation given the 3D geometry. Our
method is a one-pass diffusion process and does not require any SDS
optimization as post-processing. We demonstrate through extensive experiments
that our direct 2.5D generation with the specially-designed fusion scheme can
achieve diverse, mode-seeking-free, and high-fidelity 3D content generation in
only 10 seconds. Project page: https://nju-3dv.github.io/projects/direct25. |
This paper introduces a novel approach to rapidly generate textured 3D meshes from text prompts, leveraging fine-tuned multi-view 2.5D diffusion models. |
This work bridges the gap between computationally expensive Score Distillation Sampling (SDS) methods and limited generalizability of direct 3D diffusion models for text-to-3D generation. |
The approach uses two fine-tuned diffusion models: one for multi-view normal map generation and another for texture generation conditioned on the normals. A differentiable rasterization scheme fuses the multi-view normals into a 3D mesh, and texture mapping completes the process. |
The method generates diverse, high-fidelity 3D content in just 10 seconds, significantly faster than SDS-based techniques.
It exhibits strong generalization to complex text prompts, surpassing direct 3D diffusion methods.
The two-stage architecture allows for geometry-appearance disentanglement, enabling flexible content control. |
The limited number of views (four) may result in incomplete reconstruction of unseen areas, such as concavities.
Texture generation quality is constrained by the training data and could be enhanced with more sophisticated techniques. |
text-to-3d generation, diffusion models, multi-view synthesis, differentiable rasterization, 2.5d representation |
2311.15864
Report |
InterControl: Generate Human Motion Interactions by Controlling Every Joint |
Zhenzhi Wang, Jingbo Wang, Yixuan Li, Dahua Lin, Bo Dai |
Text-conditioned human motion synthesis has made remarkable progress with the
emergence of diffusion models in recent research. However, the majority of
these motion diffusion models are primarily designed for a single character and
overlook multi-human interactions. In our approach, we strive to explore this
problem by synthesizing human motion with interactions for a group of
characters of any size. The key aspect of our approach is the adaptation of
human-wise interactions as pairs of human joints that can be either in contact
or separated by a desired distance. In contrast to existing methods that
necessitate training motion generation models on multi-human motion datasets
with a fixed number of characters, our approach inherently possesses the
flexibility to model human interactions involving an arbitrary number of
individuals, thereby transcending the limitations imposed by the training data.
We introduce a novel controllable motion generation method, InterControl, to
encourage the synthesized motions to maintain the desired distance between
joint pairs. It consists of a motion controller and an inverse kinematics
guidance module that realistically and accurately aligns the joints of
synthesized characters to the desired location. Furthermore, we demonstrate
that the distance between joint pairs for human-wise interactions can be
generated using an off-the-shelf Large Language Model (LLM). Experimental
results highlight the capability of our framework to generate interactions with
multiple human characters and its potential to work with off-the-shelf
physics-based character simulators. |
InterControl generates multi-person interactions using a single-person motion generation model trained on single-person data by precisely controlling the position of every joint in every person at any time, conditioned on text prompts and joint relations. |
This approach overcomes the limitations of previous methods that require multi-human motion datasets with fixed numbers of characters and struggle with precise spatial control for realistic interactions. |
InterControl integrates a Motion ControlNet (inspired by ControlNet) to process spatial control signals and an Inverse Kinematics (IK) Guidance module to align the synthesized motions to the desired locations. It uses joint contact pairs, automatically generated from text prompts by an off-the-shelf LLM, as control signals for interaction generation. |
InterControl achieves state-of-the-art performance in semantic-level metrics (FID, R-precision, Diversity) on single-person motion generation.
It demonstrates superior accuracy in spatial control metrics (Trajectory error, Location error, Average error) compared to previous spatially controllable methods.
InterControl generates realistic multi-person interactions, confirmed by low spatial errors and a strong preference (80.4%) over prior work in a user study. |
InterControl's interaction definition currently relies on distance and orientation, potentially limiting the complexity of interactions.
The plausibility of generated interactions depends on the quality of the single-person motion data and the LLM's ability to infer joint contact pairs consistent with interaction descriptions. |
motion synthesis, human interaction generation, diffusion models, controllable motion generation, inverse kinematics |
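The InterControl report above describes interactions as pairs of joints that should be in contact or at a desired distance. A toy geometric sketch of that idea follows. It is not the paper's IK guidance module, which acts on the motion representation of a diffusion model. The functions `distance_loss` and `guidance_step` and the joint coordinates are hypothetical, chosen only to show how a distance error between two joints can be measured and reduced.

```python
import math


def joint_distance(a, b):
    """Euclidean distance between two 3D joint positions."""
    return math.dist(a, b)


def distance_loss(a, b, target):
    """Squared error between the current joint distance and the desired one."""
    return (joint_distance(a, b) - target) ** 2


def guidance_step(a, b, target, step=0.5):
    """Move both joints along the line joining them to shrink the distance error."""
    d = joint_distance(a, b)
    if d == 0.0:
        return a, b  # direction undefined; leave the joints untouched
    err = d - target
    direction = tuple((bi - ai) / d for ai, bi in zip(a, b))
    shift = 0.5 * step * err  # each joint absorbs half of the correction
    new_a = tuple(ai + shift * ui for ai, ui in zip(a, direction))
    new_b = tuple(bi - shift * ui for bi, ui in zip(b, direction))
    return new_a, new_b


# Two wrists that should touch (target distance 0), e.g. for a handshake.
wrist_a, wrist_b = (0.0, 1.0, 0.0), (0.6, 1.0, 0.0)
target = 0.0
loss_before = distance_loss(wrist_a, wrist_b, target)
for _ in range(20):
    wrist_a, wrist_b = guidance_step(wrist_a, wrist_b, target)
loss_after = distance_loss(wrist_a, wrist_b, target)
```

With `step=0.5` the distance error halves on every call, so the pair converges geometrically toward the requested distance.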
2311.15841
Report |
Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation |
Siteng Huang, Biao Gong, Yutong Feng, Xi Chen, Yuqian Fu, Yu Liu, Donglin Wang |
This study focuses on a novel task in text-to-image (T2I) generation, namely
action customization. The objective of this task is to learn the co-existing
action from limited data and generalize it to unseen humans or even animals.
Experimental results show that existing subject-driven customization methods
fail to learn the representative characteristics of actions and struggle in
decoupling actions from context features, including appearance. To overcome the
preference for low-level features and the entanglement of high-level features,
we propose an inversion-based method Action-Disentangled Identifier (ADI) to
learn action-specific identifiers from the exemplar images. ADI first expands
the semantic conditioning space by introducing layer-wise identifier tokens,
thereby increasing the representational richness while distributing the
inversion across different features. Then, to block the inversion of
action-agnostic features, ADI extracts the gradient invariance from the
constructed sample triples and masks the updates of irrelevant channels. To
comprehensively evaluate the task, we present an ActionBench that includes a
variety of actions, each accompanied by meticulously selected samples. Both
quantitative and qualitative results show that our ADI outperforms existing
baselines in action-customized T2I generation. Our project page is at
https://adi-t2i.github.io/ADI. |
This paper introduces Action Customization for text-to-image generation, enabling the learning of specific actions from limited examples and their transfer to new subjects, including humans and animals. |
Generating images with specific actions is challenging due to the difficulty in providing precise text descriptions and the limitations of existing controllable generation methods relying on skeletons or sketches. |
The paper proposes ADI, which expands the semantic conditioning space with layer-wise identifier tokens and utilizes gradient masking to decouple action-related features from action-agnostic information like appearance. |
ADI achieves high accuracy in generating specified actions while maintaining the fidelity of generated subjects, outperforming baselines like Stable Diffusion and ControlNet.
The learned action identifiers can be effectively combined with various characters and animals to generate high-quality images, demonstrating generalization ability.
Ablation studies confirm the effectiveness of layer-wise identifier tokens and gradient masking strategies in improving action customization performance. |
The optimal masking ratio in ADI might need to be adjusted for different actions to achieve the best performance.
Future work could explore incorporating action dynamics and temporal information to enhance the expressiveness of generated actions. |
text-to-image generation, action customization, diffusion models, gradient masking, controllable image synthesis |
2311.15813
Report |
FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax |
Yu Lu, Linchao Zhu, Hehe Fan, Yi Yang |
Text-to-video (T2V) generation is a rapidly growing research area that aims
to translate the scenes, objects, and actions within complex video text into a
sequence of coherent visual frames. We present FlowZero, a novel framework that
combines Large Language Models (LLMs) with image diffusion models to generate
temporally-coherent videos. FlowZero uses LLMs to understand complex
spatio-temporal dynamics from text, where LLMs can generate a comprehensive
dynamic scene syntax (DSS) containing scene descriptions, object layouts, and
background motion patterns. These elements in DSS are then used to guide the
image diffusion model for video generation with smooth object motions and
frame-to-frame coherence. Moreover, FlowZero incorporates an iterative
self-refinement process, enhancing the alignment between the spatio-temporal
layouts and the textual prompts for the videos. To enhance global coherence, we
propose enriching the initial noise of each frame with motion dynamics to
control the background movement and camera motion adaptively. By using
spatio-temporal syntaxes to guide the diffusion process, FlowZero achieves
improvement in zero-shot video synthesis, generating coherent videos with vivid
motion. |
This paper presents FlowZero, a novel framework that combines LLMs with image diffusion models to generate temporally coherent videos by converting text prompts into a dynamic scene syntax (DSS) comprising scene descriptions, object layouts, and background motion patterns. |
Generating coherent dynamic visual scenes in videos from text prompts remains challenging due to the succinct and abstract nature of video text prompts. |
FlowZero uses LLMs to generate DSS, employs iterative self-refinement to ensure layout accuracy, and introduces motion-guided noise shifting to enhance global coherence. A modified U-Net with cross-attention mechanisms synthesizes the video frames. |
FlowZero generates videos with accurate object motion and transformations, surpassing existing zero-shot and some training-based methods.
The self-refinement process significantly improves the alignment of generated layouts with text prompts, enhancing spatial and temporal accuracy.
Motion-guided noise shifting effectively controls background motion, leading to smoother and more coherent video synthesis. |
The framework currently relies on pre-defined motion directions for background motion.
Further research is needed to explore the generation of videos with longer durations and more complex scenes. |
text-to-video generation, large language models, diffusion models, dynamic scene syntax, temporal coherence |
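The FlowZero report above mentions motion-guided noise shifting without giving its details. One simple way to picture the idea is to start every frame from the same base noise and translate it a little further along a motion direction on each frame. The sketch below does exactly that with a circular shift on a small 2D grid. The functions `shift_noise` and `frame_noises`, the grid size, and the wrap-around behavior are all assumptions for illustration, not the paper's procedure.

```python
import random


def make_noise(height, width, seed=0):
    """Base Gaussian noise as a height x width grid of floats."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]


def shift_noise(noise, dx, dy):
    """Circularly shift a 2D grid by dx columns (right) and dy rows (down)."""
    h, w = len(noise), len(noise[0])
    return [[noise[(y - dy) % h][(x - dx) % w] for x in range(w)] for y in range(h)]


def frame_noises(base, num_frames, direction):
    """Frame t starts from the base noise shifted t times along `direction`."""
    dx, dy = direction
    return [shift_noise(base, dx * t, dy * t) for t in range(num_frames)]


base = make_noise(4, 6)
frames = frame_noises(base, num_frames=3, direction=(1, 0))  # pan to the right
```

Because every frame reuses the same noise values in shifted positions, neighboring frames start from correlated noise, which is the intuition behind using it to steer background or camera motion.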
2311.15776
Report |
Stable Segment Anything Model |
Qi Fan, Xin Tao, Lei Ke, Mingqiao Ye, Yuan Zhang, Pengfei Wan, Zhongyuan Wang, Yu-Wing Tai, Chi-Keung Tang |
The Segment Anything Model (SAM) achieves remarkable promptable segmentation
given high-quality prompts which, however, often require good skills to
specify. To make SAM robust to casual prompts, this paper presents the first
comprehensive analysis on SAM's segmentation stability across a diverse
spectrum of prompt qualities, notably imprecise bounding boxes and insufficient
points. Our key finding reveals that given such low-quality prompts, SAM's mask
decoder tends to activate image features that are biased towards the background
or confined to specific object parts. To mitigate this issue, our key idea
consists of calibrating solely SAM's mask attention by adjusting the sampling
locations and amplitudes of image features, while the original SAM model
architecture and weights remain unchanged. Consequently, our deformable
sampling plugin (DSP) enables SAM to adaptively shift attention to the prompted
target regions in a data-driven manner, facilitated by our effective robust
training strategy (RTS). During inference, dynamic routing plugin (DRP) is
proposed that toggles SAM between the deformable and regular grid sampling
modes, conditioned on the input prompt quality. Thus, our solution, termed
Stable-SAM, offers several advantages: 1) improved SAM's segmentation stability
across a wide range of prompt qualities, while 2) retaining SAM's powerful
promptable segmentation efficiency and generality, with 3) minimal learnable
parameters (0.08 M) and fast adaptation (by 1 training epoch). Extensive
experiments across multiple datasets validate the effectiveness and advantages
of our approach, underscoring Stable-SAM as a more robust solution for
segmenting anything. Codes will be released upon acceptance.
https://github.com/fanq15/Stable-SAM |
The paper introduces Stable-SAM, a novel method to enhance the robustness of the Segment Anything Model (SAM) to inaccurate or insufficient prompts. |
SAM's performance heavily relies on high-quality prompts, which are often difficult to obtain in real-world applications. This limits SAM's practical use in scenarios with casual or imprecise user inputs. |
The paper proposes a Deformable Sampling Plugin (DSP) that calibrates SAM's mask attention by adjusting the sampling positions and amplitudes of image features based on a learnable offset network. Additionally, a Dynamic Routing Plugin (DRP) is introduced to toggle between DSP and regular grid sampling based on prompt quality. A robust training strategy (RTS) incorporating diverse prompt qualities further enhances the model's stability. |
Stable-SAM significantly improves SAM's segmentation accuracy and stability across various prompt qualities, particularly for imprecise boxes and sparse points.
Stable-SAM maintains SAM's zero-shot generalization ability and achieves competitive performance on multiple benchmarks, including MS COCO and SGinW.
Stable-SAM exhibits strong model scalability, requiring minimal learnable parameters (0.08M) and achieving fast adaptation with only one epoch of training. |
The spatial attention mechanism is not as effective as the proposed deformable sampling plugin in adapting SAM to handle suboptimal prompts.
The robust training strategy, while improving stability, slightly compromises performance with high-quality prompts. |
segment anything model, deformable attention, robust segmentation, zero-shot learning, prompt engineering |
2311.15773
Report |
Check, Locate, Rectify: A Training-Free Layout Calibration System for Text-to-Image Generation |
Biao Gong, Siteng Huang, Yutong Feng, Shiwei Zhang, Yuyuan Li, Yu Liu |
Diffusion models have recently achieved remarkable progress in generating
realistic images. However, challenges remain in accurately understanding and
synthesizing the layout requirements in the textual prompts. To align the
generated image with layout instructions, we present a training-free layout
calibration system SimM that intervenes in the generative process on the fly
during inference time. Specifically, following a "check-locate-rectify"
pipeline, the system first analyses the prompt to generate the target layout
and compares it with the intermediate outputs to automatically detect errors.
Then, by moving the located activations and making intra- and inter-map
adjustments, the rectification process can be performed with negligible
computational overhead. To evaluate SimM over a range of layout requirements,
we present a benchmark SimMBench that compensates for the lack of superlative
spatial relations in existing datasets. Both quantitative and qualitative
results demonstrate the effectiveness of the proposed SimM in calibrating the
layout inconsistencies. Our project page is at https://simm-t2i.github.io/SimM. |
This paper presents SimM, a training-free layout calibration system for text-to-image generation that aligns generated images with layout instructions in textual prompts. |
Most text-to-image generators struggle to accurately understand and interpret textual layout instructions, compromising the quality and fidelity of generated images. |
SimM follows a "check-locate-rectify" pipeline. It checks for layout requirements and discrepancies, locates misplaced objects in intermediate cross-attention maps, and rectifies the activations by transferring them to target regions and performing intra-/inter-map activation adjustments. |
SimM achieves state-of-the-art generation accuracy on both DrawBench and a newly proposed benchmark focusing on superlative spatial relations.
The system effectively rectifies layout inconsistencies while maintaining excellent image quality.
SimM operates in real-time with negligible computational overhead compared to training-based or large language model-based layout control methods. |
A single adjustment strength parameter may not be optimal for all generation scenarios, leading to potential errors in complex layouts.
The current implementation focuses on single-view image generation and could be extended to multi-view generation. |
text-to-image generation, layout calibration, diffusion models, spatial relations, real-time system |
2311.15744
Report |
One More Step: A Versatile Plug-and-Play Module for Rectifying Diffusion Schedule Flaws and Enhancing Low-Frequency Controls |
Minghui Hu, Jianbin Zheng, Chuanxia Zheng, Chaoyue Wang, Dacheng Tao, Tat-Jen Cham |
It is well known that many open-released foundational diffusion models have
difficulty in generating images that substantially depart from average
brightness, despite such images being present in the training data. This is due
to an inconsistency: while denoising starts from pure Gaussian noise during
inference, the training noise schedule retains residual data even in the final
timestep distribution, due to difficulties in numerical conditioning in
mainstream formulation, leading to unintended bias during inference. To
mitigate this issue, certain $\epsilon$-prediction models are combined with an
ad-hoc offset-noise methodology. In parallel, some contemporary models have
adopted zero-terminal SNR noise schedules together with
$\mathbf{v}$-prediction, which necessitate major alterations to pre-trained
models. However, such changes risk destabilizing a large multitude of
community-driven applications anchored on these pre-trained models. In light of
this, our investigation revisits the fundamental causes, leading to our
proposal of an innovative and principled remedy, called One More Step (OMS). By
integrating a compact network and incorporating an additional simple yet
effective step during inference, OMS elevates image fidelity and harmonizes the
dichotomy between training and inference, while preserving original model
parameters. Once trained, various pre-trained diffusion models with the same
latent domain can share the same OMS module. |
This paper proposes "One More Step" (OMS), a plug-and-play method to improve image fidelity in pre-trained diffusion models without modifying their parameters. |
Existing diffusion models struggle to generate images that depart substantially from average brightness, because the terminal noise distribution seen in training still contains residual data while inference starts from pure Gaussian noise. |
OMS introduces a compact, text-conditional network that maps pure Gaussian noise to the data-adulterated noise expected by pre-trained models at the start of sampling. |
OMS enables generation of images with a wider range of brightness levels.
The method is adaptable to various diffusion models and can share the same module across models with the same latent domain.
Modifying prompts in OMS allows control over low-frequency image aspects like brightness and color. |
Integrating OMS into the student model through distillation could reduce computational cost.
Further exploration of OMS integration during model training from scratch or fine-tuning could be beneficial. |
diffusion models, text-to-image synthesis, noise schedule, image fidelity, one more step |
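The train/inference mismatch described in the OMS report above can be made concrete by computing how much signal a typical noise schedule leaves at the final timestep. In the usual forward process, x_T = sqrt(alpha_bar_T) * x_0 + sqrt(1 - alpha_bar_T) * eps, so any nonzero sqrt(alpha_bar_T) means the training-time terminal distribution is not pure noise. The sketch below assumes a scaled-linear schedule with beta from 0.00085 to 0.012 over 1000 steps, values commonly associated with Stable Diffusion-style models. These numbers are an assumption here and do not come from the report.

```python
import math


def scaled_linear_betas(beta_start=0.00085, beta_end=0.012, num_steps=1000):
    """Betas spaced linearly in sqrt-space, then squared (assumed schedule)."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * i / (num_steps - 1)) ** 2 for i in range(num_steps)]


def terminal_signal_and_noise(betas):
    """Return (sqrt(alpha_bar_T), sqrt(1 - alpha_bar_T)) for the last timestep."""
    alpha_bar = 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta
    return math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)


signal_scale, noise_scale = terminal_signal_and_noise(scaled_linear_betas())
terminal_snr = (signal_scale / noise_scale) ** 2  # zero only for a zero-terminal-SNR schedule
print(f"signal scale at T: {signal_scale:.4f}, terminal SNR: {terminal_snr:.5f}")
```

The signal scale comes out small but clearly above zero, which is the residual that a module like OMS is meant to account for when sampling starts from pure Gaussian noise.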
2311.15732
Report |
GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition? |
Wenhao Wu, Huanjin Yao, Mengxi Zhang, Yuxin Song, Wanli Ouyang, Jingdong Wang |
This paper does not present a novel method. Instead, it delves into an
essential, yet must-know baseline in light of the latest advancements in
Generative Artificial Intelligence (GenAI): the utilization of GPT-4 for visual
understanding. Our study centers on the evaluation of GPT-4's linguistic and
visual capabilities in zero-shot visual recognition tasks: Firstly, we explore
the potential of its generated rich textual descriptions across various
categories to enhance recognition performance without any training. Secondly,
we evaluate GPT-4's visual proficiency in directly recognizing diverse visual
content. We conducted extensive experiments to systematically evaluate GPT-4's
performance across images, videos, and point clouds, using 16 benchmark
datasets to measure top-1 and top-5 accuracy. Our findings show that GPT-4,
enhanced with rich linguistic descriptions, significantly improves zero-shot
recognition, offering an average top-1 accuracy increase of 7% across all
datasets. GPT-4 excels in visual recognition, outshining OpenAI-CLIP's ViT-L
and rivaling EVA-CLIP's ViT-E, particularly in video datasets HMDB-51 and
UCF-101, where it leads by 22% and 9%, respectively. We hope this research
contributes valuable data points and experience for future studies. We release
our code at https://github.com/whwu95/GPT4Vis. |
This paper presents a comprehensive evaluation of GPT-4's linguistic and visual capabilities for zero-shot visual recognition across images, videos, and point clouds. |
This evaluation is important because it provides quantitative insights into GPT-4's visual understanding abilities, a crucial aspect of multimodal AI development. |
The authors evaluate GPT-4's performance on 16 benchmark datasets using two approaches: 1) Leveraging GPT-4 to generate rich textual descriptions to enhance CLIP-based zero-shot recognition. 2) Directly evaluating GPT-4V's visual recognition accuracy. |
GPT-4's generated descriptions consistently improve zero-shot recognition, achieving an average 7% top-1 accuracy gain across all datasets.
GPT-4V demonstrates strong visual recognition capabilities, rivaling or exceeding EVA-CLIP's ViT-E, particularly on video datasets like UCF-101 and HMDB-51.
Both GPT-4-enhanced CLIP and GPT-4V struggle with tasks heavily reliant on temporal modeling, like Something-Something V1. |
The study focuses solely on visual recognition, neglecting other important vision tasks like object detection.
The prompting strategy for GPT-4V is basic and may be suboptimal, potentially limiting performance. |
gpt-4, zero-shot learning, visual recognition, multimodal ai, computer vision |
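The first approach in the GPT4Vis report above, using generated descriptions to enhance CLIP-based zero-shot recognition, can be illustrated with a toy nearest-prototype classifier. The sketch below averages the text embeddings of a class name and its descriptions into one prototype and picks the class with the highest cosine similarity to the image embedding. The averaging rule, the function names, and the three-dimensional hand-made vectors are assumptions for illustration. Real embeddings would come from a CLIP text and image encoder, and the paper's exact fusion rule is not given in this report.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def mean_embedding(vectors):
    """Unit-length mean of a list of vectors (the class prototype)."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return normalize(mean)


def classify(image_emb, class_text_embs):
    """Return the best label and all cosine scores against class prototypes."""
    image_emb = normalize(image_emb)
    scores = {}
    for label, embs in class_text_embs.items():
        proto = mean_embedding([normalize(e) for e in embs])
        scores[label] = sum(a * b for a, b in zip(image_emb, proto))
    return max(scores, key=scores.get), scores


# Hand-made stand-ins: first vector per class is the class name, the rest are descriptions.
class_text_embs = {
    "guitar": [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]],
    "violin": [[0.0, 1.0, 0.0], [0.0, 0.8, 0.6]],
}
image_emb = [0.9, 0.4, 0.1]
label, scores = classify(image_emb, class_text_embs)
```

Adding more description vectors per class only changes the prototype, so the same function covers both the name-only baseline and the description-enhanced variant.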
2311.15707
Report |
SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation |
Jiehong Lin, Lihua Liu, Dekun Lu, Kui Jia |
Zero-shot 6D object pose estimation involves the detection of novel objects
with their 6D poses in cluttered scenes, presenting significant challenges for
model generalizability. Fortunately, the recent Segment Anything Model (SAM)
has showcased remarkable zero-shot transfer performance, which provides a
promising solution to tackle this task. Motivated by this, we introduce SAM-6D,
a novel framework designed to realize the task through two steps, including
instance segmentation and pose estimation. Given the target objects, SAM-6D
employs two dedicated sub-networks, namely Instance Segmentation Model (ISM)
and Pose Estimation Model (PEM), to perform these steps on cluttered RGB-D
images. ISM takes SAM as an advanced starting point to generate all possible
object proposals and selectively preserves valid ones through meticulously
crafted object matching scores in terms of semantics, appearance and geometry.
By treating pose estimation as a partial-to-partial point matching problem, PEM
performs a two-stage point matching process featuring a novel design of
background tokens to construct dense 3D-3D correspondence, ultimately yielding
the pose estimates. Without bells and whistles, SAM-6D outperforms the existing
methods on the seven core datasets of the BOP Benchmark for both instance
segmentation and pose estimation of novel objects. |
This paper introduces SAM-6D, a novel framework for zero-shot 6D object pose estimation from RGB-D images that leverages the Segment Anything Model (SAM) for proposal generation and a two-stage point matching process for accurate pose prediction. |
Zero-shot 6D object pose estimation is crucial for real-world applications but challenging due to the need for model generalizability to novel objects. |
SAM-6D consists of two sub-networks: ISM leverages SAM for proposal generation and introduces an object matching score based on semantics, appearance, and geometry for proposal selection. PEM formulates pose estimation as a partial-to-partial point matching problem, using background tokens and a two-stage matching process with novel Sparse-to-Dense Point Transformers for accurate pose calculation. |
Outperforms existing methods in both instance segmentation and pose estimation of novel objects on seven BOP benchmark datasets.
Proposed object matching score effectively identifies proposals corresponding to novel objects.
Two-stage point matching process with background tokens and Sparse-to-Dense Point Transformers enables accurate pose estimation even with sparse correspondence. |
Reliance on depth information may limit applicability in scenarios where depth sensing is unreliable.
Computational cost, especially with SAM-based segmentation, may hinder real-time performance in certain applications. |
6d object pose estimation, zero-shot learning, segment anything model, point matching, instance segmentation |
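The SAM-6D entry above ends its pipeline with dense 3D-3D correspondences that yield a pose estimate. The sketch below illustrates only that last, standard step: recovering a rigid rotation and translation from matched 3D points with the weighted Kabsch (SVD) solution. It is not the authors' PEM; the function name `rigid_pose_from_correspondences` and the optional per-match weights are assumptions made for illustration.

```python
import numpy as np

def rigid_pose_from_correspondences(src, dst, weights=None):
    """Weighted Kabsch: find R, t minimizing sum_i w_i * ||R @ src_i + t - dst_i||^2.

    src, dst: (N, 3) arrays of matched 3D points (hypothetical inputs).
    weights:  optional (N,) non-negative match confidences.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    w = np.ones(len(src)) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()

    # Weighted centroids, then center both point sets.
    src_mean = w @ src
    dst_mean = w @ dst
    src_c = src - src_mean
    dst_c = dst - dst_mean

    # 3x3 weighted cross-covariance and its SVD.
    H = (src_c * w[:, None]).T @ dst_c
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t
```

In a learned pipeline the weights would typically come from predicted matching scores, with low weights for points assigned to no counterpart; that mapping is left out here because the entry does not specify it.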
2311.15658
Report |
Regularization by Texts for Latent Diffusion Inverse Solvers |
Jeongsol Kim, Geon Yeong Park, Hyungjin Chung, Jong Chul Ye |
The recent advent of diffusion models has led to significant progress in
solving inverse problems, leveraging these models as effective generative
priors. Nonetheless, there remain challenges related to the ill-posed nature of
such problems, often due to inherent ambiguities in measurements or intrinsic
system symmetries. To address this, drawing inspiration from the human ability
to resolve visual ambiguities through perceptual biases, here we introduce a
novel latent diffusion inverse solver by regularization by texts (TReg).
Specifically, TReg applies a textual description of the preconception of the
solution during reverse diffusion sampling, and this description is
dynamically reinforced through null-text optimization for adaptive negation.
Our comprehensive experimental results demonstrate that TReg successfully
mitigates ambiguity in the inverse problems, enhancing their effectiveness and
accuracy. |
Introduces "Regularization by Texts" (TReg), a novel latent diffusion inverse solver that uses textual descriptions to reduce ambiguity in inverse problems. |
Diffusion-based inverse solvers, while powerful, often struggle with inherent ambiguities in measurements. TReg aims to bridge this gap by incorporating human-like perceptual biases through textual descriptions. |
TReg integrates textual descriptions during the reverse diffusion sampling process using an adaptive negation method. This method dynamically refines the textual guidance through null-text optimization, ensuring alignment with the evolving image reconstruction. |
TReg successfully mitigates ambiguity in inverse problems, leading to more consistent and accurate solutions.
Quantitative evaluations demonstrate superior performance in super-resolution and deblurring tasks compared to baseline methods, exhibiting lower LPIPS and y-MSE values.
Qualitative results showcase TReg's ability to generate high-fidelity reconstructions that adhere to both the provided text prompts and the measurement data. |
The effectiveness of TReg can be limited by the specificity and accuracy of the provided text prompt.
Identifying informative text prompts solely from severely degraded measurements in real-world applications poses a challenge. |
inverse problems, text regularization, latent diffusion models, generative priors, image reconstruction |
2311.15657
Report |
Enhancing Diffusion Models with Text-Encoder Reinforcement Learning |
Chaofeng Chen, Annan Wang, Haoning Wu, Liang Liao, Wenxiu Sun, Qiong Yan, Weisi Lin |
Text-to-image diffusion models are typically trained to optimize the
log-likelihood objective, which presents challenges in meeting specific
requirements for downstream tasks, such as image aesthetics and image-text
alignment. Recent research addresses this issue by refining the diffusion U-Net
using human rewards through reinforcement learning or direct backpropagation.
However, many of them overlook the importance of the text encoder, which is
typically pretrained and fixed during training. In this paper, we demonstrate
that by finetuning the text encoder through reinforcement learning, we can
enhance the text-image alignment of the results, thereby improving the visual
quality. Our primary motivation comes from the observation that the current
text encoder is suboptimal, often requiring careful prompt adjustment. While
fine-tuning the U-Net can partially improve performance, it still suffers
from the suboptimal text encoder. Therefore, we propose to use reinforcement
learning with low-rank adaptation to finetune the text encoder based on
task-specific rewards, referred to as TexForce. We first show that
finetuning the text encoder can improve the performance of diffusion models.
Then, we illustrate that TexForce can be simply combined with existing U-Net
finetuned models to get much better results without additional training.
Finally, we showcase the adaptability of our method in diverse applications,
including the generation of high-quality face and hand images. |
Presents TexForce, a novel method employing reinforcement learning with low-rank adaptation to fine-tune the text encoder in text-to-image diffusion models, enhancing text-image alignment and improving visual quality. |
Existing diffusion models often struggle with text-image alignment and achieving specific requirements for downstream tasks. While fine-tuning the U-Net has shown promise, the fixed, suboptimal text encoder limits overall efficacy. |
Leverages DDPO (a PPO variant for diffusion models) to update the text encoder based on task-specific rewards. Employs LoRA for efficient adaptation and combination of learned capabilities from diverse tasks. |
Fine-tuning the text encoder significantly improves text-image alignment and visual quality compared to the original Stable Diffusion model and other state-of-the-art methods.
TexForce can be seamlessly integrated with existing fine-tuned U-Net models, further enhancing their performance without additional training.
Demonstrates strong adaptability across various tasks, including generating high-quality face and hand images, and allows for combining learned capabilities from different tasks. |
Similar to other RL-based methods, TexForce faces challenges in terms of sample efficiency.
Engineering suitable reward functions for specific tasks can be complex. |
text-to-image synthesis, diffusion models, reinforcement learning, text encoder fine-tuning, lora |
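The TexForce entry above relies on low-rank adaptation (LoRA) to fine-tune the text encoder. As a rough illustration of that building block only, the sketch below wraps one frozen linear layer with a trainable low-rank update. The class name `LoRALinear`, the `alpha / rank` scaling, and the zero initialization of `B` follow common LoRA practice and are assumptions here, not details taken from the entry; the reinforcement learning loop is omitted.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (illustrative sketch)."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                                  # frozen pretrained weight
        self.A = rng.normal(scale=0.01, size=(rank, d_in))    # trainable, small random init
        self.B = np.zeros((d_out, rank))                      # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # y = x W^T + scale * (x A^T) B^T; only A and B would receive gradients.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def merged_weight(self):
        # Folding the update into W gives an equivalent single matrix for inference.
        return self.weight + self.scale * self.B @ self.A
```

Because `B` starts at zero, the wrapped layer initially reproduces the pretrained layer exactly, and the update adds only `rank * (d_in + d_out)` trainable parameters per layer.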
2311.15648
Report |
Reinforcement Learning from Diffusion Feedback: Q* for Image Search |
Aboli Marathe |
Large vision-language models are steadily gaining personalization
capabilities at the cost of fine-tuning or data augmentation. We present two
models for image generation using model-agnostic learning that align semantic
priors with generative capabilities. RLDF, or Reinforcement Learning from
Diffusion Feedback, is a singular approach for visual imitation through
prior-preserving reward function guidance. This employs Q-learning (with
standard Q*) for generation and follows a semantic-rewarded trajectory for
image search through finite encoding-tailored actions. The second proposed
method, noisy diffusion gradient, is optimization driven. At the root of both
methods is a special CFG encoding that we propose for continual semantic
guidance. Using only a single input image and no text input, RLDF generates
high-quality images over varied domains including retail, sports and
agriculture showcasing class-consistency and strong visual diversity. Project
website is available at https://infernolia.github.io/RLDF. |
Presents RLDF and noisy diffusion gradient (nDg), two model-agnostic methods for class-driven semantic image imitation from a single input image, without text guidance or fine-tuning. |
Addresses the bottleneck of human feedback in visual prompt engineering by enabling context-driven image generation guided by semantic priors. |
Formulates image search as a Markov Decision Process (MDP) where an agent navigates a semantic encoding space derived from Context-Free Grammar. It employs Q-learning with semantic rewards based on diffusion feedback to guide the generation process towards the target image's semantic attributes. |
RLDF generates high-quality images across various domains with class-consistency and visual diversity.
Demonstrates model-agnostic stability across DALLE-2, SD 1.4, and SD 2.1 models.
Generates a photo-realistic ImageNet clone with a distribution closer to the original ImageNet compared to baseline methods. |
Computational cost increases in larger, more complex environments.
Subject inconsistency persists, as the focus is on class-consistency over specific object replication. |
image generation, semantic guidance, reinforcement learning, diffusion models, text-to-image synthesis |
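The RLDF entry above frames image search as a Markov Decision Process solved with Q-learning. The sketch below shows the tabular Q-learning update on a toy five-state chain. The environment `toy_step`, its reward, and all hyperparameters are hypothetical stand-ins; the paper's semantic encoding space and diffusion-feedback reward are not reproduced here.

```python
import random

def q_learning(n_states, n_actions, step, episodes=500, alpha=0.1,
               gamma=0.9, epsilon=0.1, max_steps=50, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration and random tie-breaking."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0  # every episode starts in state 0
        for _ in range(max_steps):
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                best = max(Q[s])
                a = rng.choice([k for k in range(n_actions) if Q[s][k] == best])
            s_next, r, done = step(s, a)
            # Bellman target: bootstrap from the best next action unless terminal.
            target = r if done else r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
            if done:
                break
    return Q

def toy_step(s, a):
    """Hypothetical chain: action 1 moves right, action 0 moves left; reward 1 on reaching state 4."""
    s_next = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    done = s_next == 4
    return s_next, (1.0 if done else 0.0), done
```

Calling `q_learning(5, 2, toy_step)` returns a table in which moving right is valued above moving left in every non-terminal state, which is the greedy trajectory toward the rewarded state.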
2311.15561
Report |
ET3D: Efficient Text-to-3D Generation via Multi-View Distillation |
Yiming Chen, Zhiqi Li, Peidong Liu |
Recent breakthroughs in text-to-image generation have shown encouraging
results via large generative models. Due to the scarcity of 3D assets, it is
hard to transfer the success of text-to-image generation to that of
text-to-3D generation. Existing text-to-3D generation methods usually adopt the
paradigm of DreamFusion, which conducts per-asset optimization by distilling a
pretrained text-to-image diffusion model. The generation speed usually ranges
from several minutes to tens of minutes per 3D asset, which degrades the user
experience and also imposes a burden on the service providers due to the high
computational budget.
In this work, we present an efficient text-to-3D generation method, which
requires only around 8 ms to generate a 3D asset given the text prompt on a
consumer graphics card. The main insight is that we exploit the images generated
by a large pre-trained text-to-image diffusion model, to supervise the training
of a text conditioned 3D generative adversarial network. Once the network is
trained, we are able to efficiently generate a 3D asset via a single forward
pass. Our method requires no 3D training data and provides an alternative
approach for efficient text-to-3D generation by distilling pre-trained image
diffusion models. |
This paper proposes ET3D, an efficient text-to-3D generation method that distills knowledge from pre-trained text-to-multi-view image diffusion models to enable rapid 3D asset creation. |
Existing text-to-3D generation techniques, often relying on time-consuming optimization processes, hinder user experience and escalate computational costs. ET3D addresses this by offering a fast and efficient alternative. |
ET3D employs a teacher-student framework. A pre-trained text-to-multi-view image diffusion model acts as the teacher, generating multi-view images from text prompts. A text-conditioned GAN, the student, learns to generate 3D objects that, when rendered, match the teacher's multi-view image distribution. |
ET3D generates 3D assets in approximately 8ms on a consumer-grade GPU, significantly faster than optimization-based methods.
Evaluations demonstrate that ET3D achieves comparable or superior text-to-3D alignment compared to state-of-the-art approaches.
The method exhibits strong generalization ability, effectively handling unseen text prompts and composing novel objects and styles. |
The current implementation is trained on a limited set of text prompts due to resource constraints, potentially affecting performance on a wider range of concepts.
Future work will explore incorporating larger and more diverse datasets to further enhance ET3D's generative capabilities. |
text-to-3d generation, generative adversarial networks, multi-view distillation, diffusion models, efficient 3d content creation |
2311.15556
Report |
PKU-I2IQA: An Image-to-Image Quality Assessment Database for AI Generated Images |
Jiquan Yuan, Xinyan Cao, Changjin Li, Fanyi Yang, Jinlong Lin, Xixin Cao |
As image generation technology advances, AI-based image generation has been
applied in various fields and Artificial Intelligence Generated Content (AIGC)
has garnered widespread attention. However, the development of AI-based image
generative models also brings new problems and challenges. A significant
challenge is that AI-generated images (AIGI) may exhibit unique distortions
compared to natural images, and not all generated images meet the requirements
of the real world. Therefore, it is of great significance to evaluate AIGIs
more comprehensively. Although previous work has established several human
perception-based AIGC image quality assessment (AIGCIQA) databases for
text-generated images, the AI image generation technology includes scenarios
like text-to-image and image-to-image, and assessing only the images generated
by text-to-image models is insufficient. To address this issue, we establish a
human perception-based image-to-image AIGCIQA database, named PKU-I2IQA. We
conduct a well-organized subjective experiment to collect quality labels for
AIGIs and then conduct a comprehensive analysis of the PKU-I2IQA database.
Furthermore, we have proposed two benchmark models: NR-AIGCIQA based on the
no-reference image quality assessment method and FR-AIGCIQA based on the
full-reference image quality assessment method. Finally, leveraging this
database, we conduct benchmark experiments and compare the performance of the
proposed benchmark models. The PKU-I2IQA database and benchmarks will be
released to facilitate future research at
https://github.com/jiquan123/I2IQA. |
This paper introduces PKU-I2IQA, the first human perception-based image-to-image database for assessing the quality of AI-generated images. |
Existing AIGC image quality assessment (AIGCIQA) methods primarily focus on text-to-image generation, neglecting the image-to-image scenario. This new database addresses this gap and enables more comprehensive evaluation of AIGC image quality. |
The researchers collected images from 200 ImageNet categories and used them as prompts for two image-to-image generation models. They then conducted subjective experiments to collect human ratings on the generated images' quality, authenticity, and text-image correspondence. Two benchmark models were proposed: NR-AIGCIQA (no-reference) and FR-AIGCIQA (full-reference), leveraging different input combinations during training and testing. |
FR-AIGCIQA outperforms NR-AIGCIQA, highlighting the benefit of using reference images.
ResNet18 backbone network achieved the best performance for quality and correspondence scores.
ResNet50 achieved the best overall performance. |
The proposed models show promise but have room for improvement in terms of performance.
Future work will explore incorporating reference images for text-to-image generation and enhancing the generalization ability of AIGCIQA models across different AI image generators. |
aigc, image-to-image generation, image quality assessment, nr-aigciqa, fr-aigciqa |
2311.15551
Report |
Instruct2Attack: Language-Guided Semantic Adversarial Attacks |
Jiang Liu, Chen Wei, Yuxiang Guo, Heng Yu, Alan Yuille, Soheil Feizi, Chun Pong Lau, Rama Chellappa |
We propose Instruct2Attack (I2A), a language-guided semantic attack that
generates semantically meaningful perturbations according to free-form language
instructions. We make use of state-of-the-art latent diffusion models, where we
adversarially guide the reverse diffusion process to search for an adversarial
latent code conditioned on the input image and text instruction. Compared to
existing noise-based and semantic attacks, I2A generates more natural and
diverse adversarial examples while providing better controllability and
interpretability. We further automate the attack process with GPT-4 to generate
diverse image-specific text instructions. We show that I2A can successfully
break state-of-the-art deep neural networks even under strong adversarial
defenses, and demonstrate great transferability among a variety of network
architectures. |
The paper proposes Instruct2Attack (I2A), a novel language-guided semantic attack method that generates semantically meaningful adversarial perturbations using free-form language instructions. |
I2A addresses the limitations of noise-based and existing semantic attacks by generating more natural and diverse adversarial examples with better controllability and interpretability, providing insights into model failure modes beyond pixel-level perturbations. |
I2A leverages a latent conditional diffusion model, adversarially guiding the reverse diffusion process to find an adversarial latent code conditioned on the input image and text instruction. It also uses a perceptual constraint (LPIPS) to ensure similarity between the original and adversarial images. Additionally, it automates the instruction generation process with GPT-4. |
I2A achieves significantly higher attack success rates than baseline attacks on ImageNet, especially under strong defenses (e.g., adversarial training, DiffPure).
I2A shows better transferability under black-box settings compared to noise-based and existing semantic attacks.
The generated adversarial examples are visually appealing and interpretable, reflecting the vulnerabilities of DNNs to common natural semantic modifications. |
The current implementation of I2A has high computational cost.
The quality and plausibility of automatically generated instructions need further improvement. |
adversarial attack, semantic attack, diffusion model, language-guided image editing, gpt-4 |
2311.15537
Report |
SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation |
Bin Xie, Jiale Cao, Jin Xie, Fahad Shahbaz Khan, Yanwei Pang |
Open-vocabulary semantic segmentation strives to distinguish pixels into
different semantic groups from an open set of categories. Most existing methods
explore utilizing pre-trained vision-language models, in which the key is to
adopt the image-level model for pixel-level segmentation task. In this paper,
we propose a simple encoder-decoder, named SED, for open-vocabulary semantic
segmentation, which comprises a hierarchical encoder-based cost map generation
and a gradual fusion decoder with category early rejection. The hierarchical
encoder-based cost map generation employs a hierarchical backbone, instead of
a plain transformer, to predict the pixel-level image-text cost map. Compared to
a plain transformer, a hierarchical backbone better captures local spatial
information and has linear computational complexity with respect to input size.
Our gradual fusion decoder employs a top-down structure to combine cost map and
the feature maps of different backbone levels for segmentation. To accelerate
inference speed, we introduce a category early rejection scheme in the decoder
that rejects many non-existing categories at the early layer of the decoder,
resulting in at most 4.7 times acceleration without accuracy degradation.
Experiments are performed on multiple open-vocabulary semantic segmentation
datasets, which demonstrate the efficacy of our SED method. When using
ConvNeXt-B, our SED method achieves an mIoU score of 31.6% on ADE20K with 150
categories at 82 milliseconds (ms) per image on a single A6000. We will
release it at https://github.com/xb534/SED.git. |
This paper proposes SED, a novel encoder-decoder model for open-vocabulary semantic segmentation, featuring a hierarchical encoder for improved cost map generation and a gradual fusion decoder with category early rejection for efficient inference. |
Existing open-vocabulary semantic segmentation methods struggle with either weak local spatial information, high computational cost, or slow inference speed. This work addresses these limitations to achieve a better balance between accuracy and efficiency. |
The hierarchical encoder extracts multi-scale features to generate a pixel-level image-text cost map. The gradual fusion decoder combines cost map and hierarchical features for segmentation, employing category early rejection to accelerate inference by eliminating unlikely categories early on. |
SED outperforms state-of-the-art methods on multiple open-vocabulary semantic segmentation benchmarks, including ADE20K, PASCAL VOC, and PASCAL-Context.
The hierarchical encoder significantly improves performance compared to plain transformer-based encoders, thanks to its ability to capture rich local spatial information.
The category early rejection scheme accelerates inference speed by up to 4.7 times without noticeable performance degradation. |
The model sometimes struggles with differentiating near-synonym categories.
Future work includes exploring category attention strategies and leveraging large-scale fine-grained datasets to address the synonym challenge. |
open-vocabulary semantic segmentation, vision-language models, hierarchical encoder, gradual fusion decoder, category early rejection |
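The SED entry above describes category early rejection: discarding categories unlikely to appear in the image before the later decoder layers run. One plausible selection rule, assumed here rather than taken from the entry, keeps every category that ranks in the top k for at least one pixel of the cost map. The function name `reject_categories` and the `(C, H, W)` layout are illustrative choices.

```python
import numpy as np

def reject_categories(cost_map, k):
    """Return sorted indices of categories that are in the top-k at any pixel.

    cost_map: (C, H, W) array of per-pixel category scores (higher = more likely).
    k:        number of categories kept per pixel, with 1 <= k <= C.
    """
    num_categories = cost_map.shape[0]
    flat = cost_map.reshape(num_categories, -1)            # (C, H*W)
    # argpartition on the negated scores puts the k largest scores of each pixel first.
    topk = np.argpartition(-flat, k - 1, axis=0)[:k]       # (k, H*W)
    return np.unique(topk)                                 # union over all pixels
```

Later decoder layers would then operate on `cost_map[kept]` instead of the full map, so the cost of those layers scales with the number of surviving categories rather than the full vocabulary.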
2311.15510
Report |
CaesarNeRF: Calibrated Semantic Representation for Few-shot Generalizable Neural Rendering |
Haidong Zhu, Tianyu Ding, Tianyi Chen, Ilya Zharkov, Ram Nevatia, Luming Liang |
Generalizability and few-shot learning are key challenges in Neural Radiance
Fields (NeRF), often due to the lack of a holistic understanding in pixel-level
rendering. We introduce CaesarNeRF, an end-to-end approach that leverages
scene-level CAlibratEd SemAntic Representation along with pixel-level
representations to advance few-shot, generalizable neural rendering,
facilitating a holistic understanding without compromising high-quality
details. CaesarNeRF explicitly models pose differences of reference views to
combine scene-level semantic representations, providing a calibrated holistic
understanding. This calibration process aligns various viewpoints with precise
location and is further enhanced by sequential refinement to capture varying
details. Extensive experiments on public datasets, including LLFF, Shiny,
mip-NeRF 360, and MVImgNet, show that CaesarNeRF delivers state-of-the-art
performance across varying numbers of reference views, proving effective even
with a single reference image. The project page of this work can be found at
https://haidongz-usc.github.io/project/caesarnerf. |
Introduces CaesarNeRF, a novel few-shot generalizable NeRF method leveraging calibrated scene-level semantic representations alongside pixel-level features, enabling high-quality rendering of novel scenes from as few as one reference view. |
Addresses the limitations of existing generalizable NeRF methods that struggle with few-shot rendering due to their reliance solely on pixel-level features, lacking a holistic scene understanding. |
Employs a shared encoder to generate both scene-level and pixel-level features. It calibrates semantic representations across views using camera pose transformations and introduces a sequential refinement module to capture varying details at different rendering stages. |
Achieves state-of-the-art performance on LLFF, Shiny, mip-NeRF 360, and MVImgNet datasets, demonstrating superior quality and consistency, especially with one or two reference views.
Shows significant improvement over existing methods in few-shot scenarios, effectively mitigating depth ambiguity and producing sharper, more detailed renderings.
Demonstrates adaptability by integrating the Caesar pipeline with other state-of-the-art NeRF architectures, leading to consistent performance gains. |
CaesarNeRF's performance could be further enhanced by incorporating explicit depth information.
Exploring the integration of generative capabilities within the Caesar framework could further improve rendering quality. |
neural radiance fields, novel view synthesis, few-shot learning, generalizable nerf, semantic representation |
2311.15478
Report |
HawkI: Homography & Mutual Information Guidance for 3D-free Single Image to Aerial View |
Divya Kothandaraman, Tianyi Zhou, Ming Lin, Dinesh Manocha |
We present HawkI, for synthesizing aerial-view images from text and an
exemplar image, without any additional multi-view or 3D information for
finetuning or at inference. HawkI uses techniques from classical computer
vision and information theory. It seamlessly blends the visual features from
the input image within a pretrained text-to-2D-image stable diffusion model with
a test-time optimization process for a careful bias-variance trade-off, which
uses an Inverse Perspective Mapping (IPM) homography transformation to provide
subtle cues for aerial-view synthesis. At inference, HawkI employs a unique
mutual information guidance formulation to steer the generated image towards
faithfully replicating the semantic details of the input-image, while
maintaining a realistic aerial perspective. Mutual information guidance
maximizes the semantic consistency between the generated image and the input
image, without enforcing pixel-level correspondence between vastly different
viewpoints. Through extensive qualitative and quantitative comparisons against
text + exemplar-image based methods and 3D/multi-view based novel-view
synthesis methods on proposed synthetic and real datasets, we demonstrate that
our method achieves a significantly better bias-variance trade-off towards
generating high fidelity aerial-view images. Code and data are available at
https://github.com/divyakraman/HawkI2024. |
HawkI synthesizes aerial-view images from text and a single exemplar image without relying on multi-view or 3D data during finetuning or inference. |
This method is valuable for generating diverse aerial-view synthetic data for tasks like aerial perception and providing weak supervision in cross-view synthesis applications like localization and mapping. |
HawkI employs a test-time optimization process to incorporate the input image's features into a pretrained text-to-2D-image stable diffusion model. It utilizes Inverse Perspective Mapping (IPM) for weak aerial-view guidance and a novel mutual information guidance formulation to ensure semantic consistency between generated aerial views and input images. |
HawkI generates more accurate aerial viewpoints compared to text + exemplar-image based methods like DreamBooth and Imagic.
It demonstrates superior fidelity to the input image compared to prior text-based aerial view synthesis techniques, as evidenced by higher CLIP-I, SSCD, and DINO scores.
Despite being 3D-free, HawkI achieves comparable or better results than 3D-based novel-view synthesis methods on benchmark tasks, highlighting the effectiveness of its classical guidance approaches. |
The lack of explicit 3D information limits precise camera angle control in the generated scenes.
Further improvement in fidelity with respect to the input image is needed for more accurate cross-view synthesis applications. |
aerial view synthesis, text-to-image generation, stable diffusion, inverse perspective mapping, mutual information guidance |
2311.15477
Report |
DreamCreature: Crafting Photorealistic Virtual Creatures from Imagination |
Kam Woh Ng, Xiatian Zhu, Yi-Zhe Song, Tao Xiang |
Recent text-to-image (T2I) generative models allow for high-quality synthesis
following either text instructions or visual examples. Despite their
capabilities, these models face limitations in creating new, detailed creatures
within specific categories (e.g., virtual dog or bird species), which are
valuable in digital asset creation and biodiversity analysis. To bridge this
gap, we introduce a novel task, Virtual Creatures Generation: Given a set of
unlabeled images of the target concepts (e.g., 200 bird species), we aim to
train a T2I model capable of creating new, hybrid concepts within diverse
backgrounds and contexts. We propose a new method called DreamCreature, which
identifies and extracts the underlying sub-concepts (e.g., body parts of a
specific species) in an unsupervised manner. The T2I thus adapts to generate
novel concepts (e.g., new bird species) with faithful structures and
photorealistic appearance by seamlessly and flexibly composing learned
sub-concepts. To enhance sub-concept fidelity and disentanglement, we extend
the textual inversion technique by incorporating an additional projector and
tailored attention loss regularization. Extensive experiments on two
fine-grained image benchmarks demonstrate the superiority of DreamCreature over
prior methods in both qualitative and quantitative evaluation. Ultimately, the
learned sub-concepts facilitate diverse creative applications, including
innovative consumer product designs and nuanced property modifications. |
This paper introduces DreamCreature, a novel method for virtual creature generation that automatically discovers and composes sub-concepts from unlabeled images, enabling the creation of new, hybrid concepts (e.g., novel bird species). |
Existing text-to-image models struggle to create new, detailed concepts within specific categories, limiting their application in areas like digital asset creation and biodiversity analysis. DreamCreature addresses this gap by enabling the creation of novel concepts with realistic appearances and structures. |
DreamCreature uses unsupervised learning to identify sub-concepts (e.g., body parts) within a dataset. It then leverages textual inversion with a dedicated projector and an attention loss to disentangle and learn representations for each sub-concept, enabling their flexible composition during generation. |
DreamCreature outperforms existing personalization methods in generating new creatures by combining sub-concepts from different species, as evidenced by higher Exact Matching Rate (EMR) and Cosine Similarity (CoSim) scores.
The method demonstrates superior performance in conventional image generation tasks, achieving better FID, CLIP, and DINO scores compared to other approaches.
The learned sub-concepts exhibit strong transferability, allowing for creative applications like property modification in images and innovative digital asset design. |
The accuracy of sub-concept discovery may be limited by the use of a self-supervised pre-trained feature extractor.
Composing small sub-concepts (e.g., tails, legs) presents a challenge. |
virtual creature generation, text-to-image synthesis, sub-concept learning, textual inversion, creative ai |
2311.15475
Report |
MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers |
Yawar Siddiqui, Antonio Alliegro, Alexey Artemov, Tatiana Tommasi, Daniele Sirigatti, Vladislav Rosov, Angela Dai, Matthias Nießner |
We introduce MeshGPT, a new approach for generating triangle meshes that
reflects the compactness typical of artist-created meshes, in contrast to dense
triangle meshes extracted by iso-surfacing methods from neural fields. Inspired
by recent advances in powerful large language models, we adopt a sequence-based
approach to autoregressively generate triangle meshes as sequences of
triangles. We first learn a vocabulary of latent quantized embeddings, using
graph convolutions, which inform these embeddings of the local mesh geometry
and topology. These embeddings are sequenced and decoded into triangles by a
decoder, ensuring that they can effectively reconstruct the mesh. A transformer
is then trained on this learned vocabulary to predict the index of the next
embedding given previous embeddings. Once trained, our model can be
autoregressively sampled to generate new triangle meshes, directly generating
compact meshes with sharp edges, more closely imitating the efficient
triangulation patterns of human-crafted meshes. MeshGPT demonstrates a notable
improvement over state-of-the-art mesh generation methods, with a 9% increase
in shape coverage and a 30-point enhancement in FID scores across various
categories. |
Introduces MeshGPT, a novel method for generating compact and efficient triangle meshes, mimicking the style of human-crafted meshes, using a GPT-inspired transformer trained on a vocabulary of learned geometric embeddings. |
Existing 3D shape generation methods often rely on representations like voxels, point clouds, or neural fields, which require post-processing to convert into meshes, resulting in dense and over-tessellated outputs. MeshGPT addresses this by directly generating compact meshes, reflecting the efficient triangulation patterns found in artist-created models. |
Learns a vocabulary of quantized geometric embeddings from mesh triangles using graph convolutions. A GPT-style decoder-only transformer is trained on this vocabulary to autoregressively predict sequences of triangle embeddings, which are then decoded into mesh faces. |
Achieves a 9% improvement in shape coverage and a 30-point enhancement in FID scores compared to state-of-the-art methods.
Generates compact meshes with sharp edges and high fidelity, surpassing baselines in visual quality.
Demonstrates shape novelty, producing shapes that differ from the training dataset while maintaining realism. |
Autoregressive generation leads to slower sampling times, posing challenges for real-time applications.
Limited context window size of the transformer might restrict the generation of large-scale scenes. |
mesh generation, transformers, geometric deep learning, generative models, 3d shape synthesis |
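The vocabulary-then-transformer pipeline described in the MeshGPT entry rests on vector quantization: each continuous per-triangle embedding is replaced by the index of its nearest codebook entry, and those indices become the tokens a transformer predicts. The following is a minimal NumPy sketch of that lookup step only. The function name `quantize`, the four-entry codebook, and the toy 2-D embeddings are assumptions made for illustration; the paper learns its embeddings with graph convolutions, which this sketch does not attempt.

```python
import numpy as np

def quantize(embeddings: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Return, for each embedding, the index of the nearest codebook vector."""
    # Pairwise squared Euclidean distances, shape (num_embeddings, codebook_size).
    diff = embeddings[:, None, :] - codebook[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    return np.argmin(dist, axis=1)

# Hypothetical toy data: a 4-entry codebook and three face embeddings.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
face_embeddings = np.array([[0.9, 0.1], [0.1, 0.8], [0.2, 0.1]])

tokens = quantize(face_embeddings, codebook)   # discrete token sequence
reconstructed = codebook[tokens]               # embeddings a decoder would see
```

The token sequence is what an autoregressive model would be trained on; `reconstructed` shows the information that survives quantization.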
2311.15435
Report |
Functional Diffusion |
Biao Zhang, Peter Wonka |
We propose a new class of generative diffusion models, called functional
diffusion. In contrast to previous work, functional diffusion works on samples
that are represented by functions with a continuous domain. Functional
diffusion can be seen as an extension of classical diffusion models to an
infinite-dimensional domain. Functional diffusion is very versatile as images,
videos, audio, 3D shapes, deformations, etc., can be handled by the same
framework with minimal changes. In addition, functional diffusion is especially
suited for irregular data or data defined in non-standard domains. In our work,
we derive the necessary foundations for functional diffusion and propose a
first implementation based on the transformer architecture. We show generative
results on complicated signed distance functions and deformation functions
defined on 3D surfaces. |
Introduces functional diffusion, a novel class of generative diffusion models that operate on samples represented as functions with continuous domains, extending diffusion models to infinite-dimensional spaces. |
Provides a versatile framework for generating various data types (images, videos, audio, 3D shapes, deformations) within a unified framework, especially suitable for irregular data or non-standard domains. |
Represents functions using continuous latent vectors and sampled function values, trains a denoising network to progressively denoise functions from noisy initial states, and leverages a DDIM-based sampling method for efficient inference. |
Generates high-quality, detailed 3D shapes from sparse point clouds, outperforming existing methods in terms of visual fidelity and quantitative metrics.
Successfully models and generates 3D deformation fields from sparse correspondences, demonstrating superior performance compared to baseline methods.
Demonstrates the capability to generate raw signed distance functions (SDFs) directly, unlike previous methods that predict binary occupancies or truncated SDFs. |
Requires significant computational resources for training, potentially limiting its scalability to large datasets.
The sampling rate of the sampled function representation is an additional hyperparameter that has to be explored during training. |
generative diffusion models, functional data, 3d shape generation, deformation fields, neural fields |
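The Functional Diffusion entry mentions a DDIM-based sampling method. As a rough illustration, the sketch below applies the standard deterministic DDIM update (eta = 0, noise-prediction form) to a finite vector of sampled function values. The function name `ddim_step` and the use of cumulative noise-schedule terms `alpha_bar_t` and `alpha_bar_prev` are assumptions; the paper's own parameterization of the denoiser may differ.

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update from noise level t to the previous level."""
    # Estimate the clean sample implied by the predicted noise.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Re-noise the estimate to the previous (less noisy) level.
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred

# Toy check with a perfect noise prediction: function values sampled at 4 points.
x0 = np.array([0.5, -1.0, 2.0, 0.0])
eps = np.array([1.0, 0.5, -0.5, 2.0])
alpha_bar_t = 0.5
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

# Stepping to alpha_bar_prev = 1.0 (no noise) should return the clean values.
x_prev = ddim_step(x_t, eps, alpha_bar_t, 1.0)
```

In practice `eps_pred` comes from the trained denoising network, so the recovered values are only as good as that prediction.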
2311.15383
Report |
Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding |
Zhihao Yuan, Jinke Ren, Chun-Mei Feng, Hengshuang Zhao, Shuguang Cui, Zhen Li |
3D Visual Grounding (3DVG) aims to localize 3D objects based on textual
descriptions. Conventional supervised methods for 3DVG often necessitate
extensive annotations and a predefined vocabulary, which can be restrictive. To
address this issue, we propose a novel visual programming approach for
zero-shot open-vocabulary 3DVG, leveraging the capabilities of large language
models (LLMs). Our approach begins with a unique dialog-based method, engaging
with LLMs to establish a foundational understanding of zero-shot 3DVG. Building
on this, we design a visual program that consists of three types of modules,
i.e., view-independent, view-dependent, and functional modules. These modules,
specifically tailored for 3D scenarios, work collaboratively to perform complex
reasoning and inference. Furthermore, we develop an innovative language-object
correlation module to extend the scope of existing 3D object detectors into
open-vocabulary scenarios. Extensive experiments demonstrate that our zero-shot
approach can outperform some supervised baselines, marking a significant stride
towards effective 3DVG. |
This paper proposes a novel zero-shot visual programming approach for 3D Visual Grounding (3DVG) that leverages the capabilities of large language models (LLMs) to localize 3D objects in a scene based on textual descriptions, without the need for extensive annotations or a predefined vocabulary. |
Existing supervised 3DVG methods are limited by the need for extensive annotations and a predefined vocabulary, making them difficult to apply in real-world scenarios. |
The approach involves a dialog-based method to establish an understanding of zero-shot 3DVG with LLMs and designs a visual program consisting of view-independent, view-dependent, and functional modules for reasoning and inference. It also introduces a language-object correlation (LOC) module to extend 3D object detectors to open-vocabulary scenarios. |
The zero-shot approach outperforms some existing supervised methods on the ScanRefer and Nr3D datasets.
The LOC module effectively combines 3D geometric information and 2D appearance features for improved object localization in open-vocabulary settings.
The visual programming approach demonstrates the ability to handle complex spatial relations and perform multi-step reasoning for 3DVG. |
The accuracy of the approach heavily relies on the quality of the generated visual programs and the performance of the LLMs.
Expanding the range of spatial relations and modules within the visual programming framework can further enhance the capabilities and address more complex 3DVG scenarios. |
3d visual grounding, large language models, zero-shot learning, visual programming, open vocabulary |
2311.15368
Report |
Flow-Guided Diffusion for Video Inpainting |
Bohai Gu, Yongsheng Yu, Heng Fan, Libo Zhang |
Video inpainting has been challenged by complex scenarios like large
movements and low-light conditions. Current methods, including emerging
diffusion models, face limitations in quality and efficiency. This paper
introduces the Flow-Guided Diffusion model for Video Inpainting (FGDVI), a
novel approach that significantly enhances temporal consistency and inpainting
quality via reusing an off-the-shelf image generation diffusion model. We
employ optical flow for precise one-step latent propagation and introduce a
model-agnostic flow-guided latent interpolation technique. This technique
expedites denoising, seamlessly integrating with any Video Diffusion Model
(VDM) without additional training. Our FGDVI demonstrates a remarkable 10%
improvement in flow warping error E_warp over existing state-of-the-art
methods. Our comprehensive experiments validate the superior performance of FGDVI,
offering a promising direction for advanced video inpainting. The code and
detailed results will be publicly available at
https://github.com/NevSNev/FGDVI. |
This paper presents FGDVI, a novel flow-guided diffusion model for video inpainting that leverages optical flow and reuses an off-the-shelf image generation diffusion model for enhanced temporal consistency and inpainting quality. |
Video inpainting in complex scenarios with large movements and low-light conditions remains challenging for existing methods, demanding improved quality and efficiency. |
FGDVI employs optical flow for precise one-step latent propagation and introduces a model-agnostic flow-guided latent interpolation technique to accelerate denoising, integrating seamlessly with any Video Diffusion Model (VDM) without additional training. |
FGDVI demonstrates a remarkable 10% improvement in flow warping error (E_warp) over state-of-the-art methods.
The proposed flow-guided latent interpolation method boosts inference speed by approximately 29% compared to vanilla diffusion.
FGDVI excels in qualitative and quantitative evaluations, especially in handling complex scenarios with large masks and object removal. |
The paper uses a pre-trained LDM instead of more powerful Stable Diffusion models with cross-attention for text input.
Future work aims to design algorithms with fewer keyframes for flow-based interpolation to further enhance temporal consistency. |
video inpainting, diffusion models, optical flow, latent interpolation, temporal consistency |
2311.15308
Report |
AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset |
Zhixi Cai, Shreya Ghosh, Aman Pankaj Adatia, Munawar Hayat, Abhinav Dhall, Kalin Stefanov |
The detection and localization of highly realistic deepfake audio-visual
content are challenging even for the most advanced state-of-the-art methods.
While most of the research efforts in this domain are focused on detecting
high-quality deepfake images and videos, only a few works address the problem
of the localization of small segments of audio-visual manipulations embedded in
real videos. In this research, we emulate the process of such content
generation and propose the AV-Deepfake1M dataset. The dataset contains
content-driven (i) video manipulations, (ii) audio manipulations, and (iii)
audio-visual manipulations for more than 2K subjects resulting in a total of
more than 1M videos. The paper provides a thorough description of the proposed
data generation pipeline accompanied by a rigorous analysis of the quality of
the generated data. The comprehensive benchmark of the proposed dataset
utilizing state-of-the-art deepfake detection and localization methods
indicates a significant drop in performance compared to previous datasets. The
proposed dataset will play a vital role in building the next-generation
deepfake localization methods. The dataset and associated code are available at
https://github.com/ControlNet/AV-Deepfake1M . |
This paper presents AV-Deepfake1M, a large-scale content-driven audio-visual dataset for temporal deepfake localization generated using ChatGPT for realistic transcript manipulation and state-of-the-art audio and video generation methods. |
Discriminating real from fake content is increasingly challenging with advancements in content generation, making reliable detection methods vital, especially for localized manipulations within real content, which existing datasets lack. |
The dataset is generated in a three-stage pipeline: 1) ChatGPT manipulates real transcripts with insertions, deletions, and replacements, 2) High-quality audio is generated using VITS and YourTTS, 3) Lip-synced visual frames are generated using TalkLip. |
AV-Deepfake1M significantly surpasses previous datasets in scale and diversity with over 2K subjects and 1M videos, including diverse fake segment lengths and lower average proportions of manipulations.
Benchmarking state-of-the-art deepfake detection and localization methods on AV-Deepfake1M reveals a significant performance drop compared to previous datasets, highlighting the dataset's difficulty.
Human evaluation shows that even experts struggle to detect and localize deepfakes in AV-Deepfake1M, emphasizing the need for advanced detection methods. |
The dataset exhibits an imbalance in terms of the number of fake and real videos.
Potential misuse of the dataset exists despite distribution restrictions and end-user license agreements. |
deepfakes, dataset, temporal localization, content-driven manipulation, large language model |
2311.15291
Report |
Obj-NeRF: Extract Object NeRFs from Multi-view Images |
Zhiyi Li, Lihe Ding, Tianfan Xue |
Neural Radiance Fields (NeRFs) have demonstrated remarkable effectiveness in
novel view synthesis within 3D environments. However, extracting a radiance
field of one specific object from multi-view images encounters substantial
challenges due to occlusion and background complexity, thereby presenting
difficulties in downstream applications such as NeRF editing and 3D mesh
extraction. To solve this problem, in this paper, we propose Obj-NeRF, a
comprehensive pipeline that recovers the 3D geometry of a specific object from
multi-view images using a single prompt. This method combines the 2D
segmentation capabilities of the Segment Anything Model (SAM)
with the 3D reconstruction ability of NeRF. Specifically, we first obtain
multi-view segmentation for the indicated object using SAM with a single
prompt. Then, we use the segmentation images to supervise NeRF construction,
integrating several effective techniques. Additionally, we construct a large
object-level NeRF dataset containing diverse objects, which can be useful in
various downstream tasks. To demonstrate the practicality of our method, we
also apply Obj-NeRF to various applications, including object removal,
rotation, replacement, and recoloring. |
This paper presents Obj-NeRF, a novel pipeline for extracting and reconstructing the 3D geometry of specific objects from multi-view images using a single prompt. |
Extracting object-specific radiance fields from multi-view images is challenging due to occlusion and background complexity, hindering downstream applications like NeRF editing and 3D mesh extraction. Obj-NeRF addresses this by leveraging the strengths of 2D segmentation and 3D NeRF reconstruction. |
Obj-NeRF leverages the Segment Anything Model (SAM) for multi-view segmentation based on user prompts and combines it with NeRF reconstruction techniques. It employs a sparse point cloud for multi-view consistency, handles object obstruction, and incorporates sparse and dense depth supervision for enhanced novel view synthesis. |
Obj-NeRF effectively segments and reconstructs objects from various multi-view datasets, outperforming previous methods in quality.
The pipeline enables the creation of a large, multi-view object NeRF dataset beneficial for tasks like 3D generation.
Extracted object NeRFs are demonstrated for applications like object removal, replacement, rotation, and color changing within existing NeRF scenes. |
Future work includes extending the constructed object NeRF dataset to broader 3D generation tasks.
Investigating methods for further improving the reconstruction quality and handling complex object interactions is crucial. |
neural radiance fields, nerf, 3d object segmentation, novel view synthesis, segment anything model (sam) |
2311.15260
Report |
NeuRAD: Neural Rendering for Autonomous Driving |
Adam Tonderski, Carl Lindström, Georg Hess, William Ljungbergh, Lennart Svensson, Christoffer Petersson |
Neural radiance fields (NeRFs) have gained popularity in the autonomous
driving (AD) community. Recent methods show NeRFs' potential for closed-loop
simulation, enabling testing of AD systems, and as an advanced training data
augmentation technique. However, existing methods often require long training
times, dense semantic supervision, or lack generalizability. This, in turn,
hinders the application of NeRFs for AD at scale. In this paper, we propose
NeuRAD, a robust novel view synthesis method tailored to dynamic AD data. Our
method features simple network design, extensive sensor modeling for both
camera and lidar -- including rolling shutter, beam divergence and ray dropping
-- and is applicable to multiple datasets out of the box. We verify its
performance on five popular AD datasets, achieving state-of-the-art performance
across the board. To encourage further development, we will openly release the
NeuRAD source code. See https://github.com/georghess/NeuRAD . |
NeuRAD is a novel view synthesis method for dynamic autonomous driving data, capable of handling large-scale scenes and generalizing to multiple datasets. |
NeRFs have potential for closed-loop simulation and data augmentation in autonomous driving, but existing methods are limited by long training times, reliance on dense supervision, and lack of generalizability. |
NeuRAD uses a single network with an actor-aware hash encoding for static and dynamic elements. It models sensor characteristics like rolling shutter, beam divergence, and ray dropping. It employs a CNN decoder and proposal sampling for efficiency. |
Achieves state-of-the-art novel view synthesis performance on five AD datasets (PandaSet, nuScenes, KITTI, Argoverse 2, ZOD).
Significantly outperforms previous methods in lidar simulation, accurately capturing ray dropping effects.
Demonstrates generalization to novel viewpoints and actor manipulations, enabling realistic scenario generation. |
Assumes rigid actors, limiting its applicability to pedestrians and other deformable objects.
Struggles with challenging conditions like night scenes and time-dependent object appearance (e.g., brake lights). |
neural radiance fields, autonomous driving, novel view synthesis, lidar simulation, scene generation |
2311.15230
Report |
GAIA: Zero-shot Talking Avatar Generation |
Tianyu He, Junliang Guo, Runyi Yu, Yuchi Wang, Jialiang Zhu, Kaikai An, Leyi Li, Xu Tan, Chunyu Wang, Han Hu, HsiangTao Wu, Sheng Zhao, Jiang Bian |
Zero-shot talking avatar generation aims at synthesizing natural talking
videos from speech and a single portrait image. Previous methods have relied on
domain-specific heuristics such as warping-based motion representation and 3D
Morphable Models, which limit the naturalness and diversity of the generated
avatars. In this work, we introduce GAIA (Generative AI for Avatar), which
eliminates the domain priors in talking avatar generation. In light of the
observation that the speech only drives the motion of the avatar while the
appearance of the avatar and the background typically remain the same
throughout the entire video, we divide our approach into two stages: 1)
disentangling each frame into motion and appearance representations; 2)
generating motion sequences conditioned on the speech and reference portrait
image. We collect a large-scale high-quality talking avatar dataset and train
the model on it with different scales (up to 2B parameters). Experimental
results verify the superiority, scalability, and flexibility of GAIA as 1) the
resulting model beats previous baseline models in terms of naturalness,
diversity, lip-sync quality, and visual quality; 2) the framework is scalable
since larger models yield better results; 3) it is general and enables
different applications like controllable talking avatar generation and
text-instructed avatar generation. |
Introduces GAIA (Generative AI for Avatar), a novel framework for zero-shot talking avatar generation that eliminates domain-specific priors like warping-based motion representations and 3D Morphable Models. |
Existing methods rely on domain-specific heuristics that limit the naturalness and diversity of generated avatars. GAIA aims to overcome these limitations by directly learning from data distributions. |
GAIA uses a two-stage approach: 1) Disentangling motion and appearance representations of video frames with a Variational AutoEncoder (VAE). 2) Generating motion sequences conditioned on speech and a reference portrait image using a diffusion model. |
GAIA outperforms previous state-of-the-art methods in subjective evaluations of naturalness, diversity, lip-sync quality, and visual quality.
The framework is scalable, with larger models yielding better results.
GAIA is a general framework enabling applications like controllable talking avatar generation and text-instructed avatar generation. |
Reliance on pre-trained landmark and head pose extractors might hinder end-to-end learning.
Future work includes exploring fully end-to-end learning and disentangling motion and appearance without landmarks. |
talking avatar generation, zero-shot learning, diffusion models, variational autoencoder, motion and appearance disentanglement |
2311.15157
Report |
Advancing Vision Transformers with Group-Mix Attention |
Chongjian Ge, Xiaohan Ding, Zhan Tong, Li Yuan, Jiangliu Wang, Yibing Song, Ping Luo |
Vision Transformers (ViTs) have been shown to enhance visual recognition
through modeling long-range dependencies with multi-head self-attention (MHSA),
which is typically formulated as Query-Key-Value computation. However, the
attention map generated from the Query and Key captures only token-to-token
correlations at one single granularity. In this paper, we argue that
self-attention should have a more comprehensive mechanism to capture
correlations among tokens and groups (i.e., multiple adjacent tokens) for
higher representational capacity. Therefore, we propose Group-Mix Attention (GMA)
as an advanced replacement for traditional self-attention, which can
simultaneously capture token-to-token, token-to-group, and group-to-group
correlations with various group sizes. To this end, GMA splits the Query, Key,
and Value into segments uniformly and performs different group aggregations to
generate group proxies. The attention map is computed based on the mixtures of
tokens and group proxies and used to re-combine the tokens and groups in Value.
Based on GMA, we introduce a powerful backbone, namely GroupMixFormer, which
achieves state-of-the-art performance in image classification, object
detection, and semantic segmentation with fewer parameters than existing
models. For instance, GroupMixFormer-L (with 70.3M parameters and 384^2 input)
attains 86.2% Top-1 accuracy on ImageNet-1K without external data, while
GroupMixFormer-B (with 45.8M parameters) attains 51.2% mIoU on ADE20K. |
This paper proposes Group-Mix Attention (GMA), an advanced attention mechanism for Vision Transformers (ViTs) that captures token-to-token, token-to-group, and group-to-group correlations to enhance representational capacity. |
Standard self-attention in ViTs only captures token-to-token correlations at a single granularity, limiting their ability to model complex visual patterns. GMA addresses this by incorporating correlations among token groups of various sizes. |
GMA divides input tokens into segments and uses sliding-window-based aggregators (e.g., depth-wise convolutions) to generate group proxies. It then computes attention on mixtures of individual tokens and these group proxies, enabling multi-granularity correlation modeling. |
GroupMixFormer, a hierarchical ViT built on GMA, achieves state-of-the-art performance on ImageNet-1K classification, COCO object detection, and ADE20K semantic segmentation.
Experiments show that GMA effectively models group correlations, leading to fine-grained visual representations beneficial for various vision tasks.
Incorporating GMA into other ViT architectures like Swin and PVT also consistently improves their performance. |
The current implementation of GMA with depth-wise convolutions as aggregators leads to slower inference speed, though this can be improved by using more efficient aggregators.
Exploring alternative aggregator implementations and further optimizing the kernel size configurations could yield additional performance gains. |
vision transformer, self-attention, group-mix attention, image classification, object detection, semantic segmentation |
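The Group-Mix Attention entry describes splitting Query, Key, and Value into segments, aggregating each segment over a different window size to form group proxies, and computing attention on the resulting mixture. The sketch below is a deliberately simplified version of that idea. It works on a 1-D token sequence, uses an unlearned sliding-window mean where the entry mentions depth-wise convolutions, and has a single head; the names `window_mean`, `mix`, and `group_mix_attention` and the kernel sizes are assumptions, not the authors' code.

```python
import numpy as np

def window_mean(x, k):
    """Sliding-window mean over tokens (axis 0) with zero padding; k=1 keeps tokens as-is."""
    if k == 1:
        return x
    pad = k // 2  # k is assumed odd so the output keeps the token count
    padded = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([padded[i:i + k].mean(axis=0) for i in range(x.shape[0])])

def mix(x, kernel_sizes):
    """Split channels into equal segments and aggregate each with its own window size."""
    segments = np.split(x, len(kernel_sizes), axis=1)
    return np.concatenate(
        [window_mean(s, k) for s, k in zip(segments, kernel_sizes)], axis=1
    )

def group_mix_attention(q, k, v, kernel_sizes=(1, 3, 5)):
    """Softmax attention computed on mixtures of tokens and group proxies."""
    q, k, v = (mix(t, kernel_sizes) for t in (q, k, v))
    scores = q @ k.T / np.sqrt(q.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 6))           # 8 tokens, 6 channels (2 per segment)
out, attn = group_mix_attention(x, x, x)  # same shape as the input tokens
```

The segment with kernel size 1 preserves token-to-token correlations, while the larger windows contribute the token-to-group and group-to-group terms.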
2311.15040
Report |
InstaStyle: Inversion Noise of a Stylized Image is Secretly a Style Adviser |
Xing Cui, Zekun Li, Pei Pei Li, Huaibo Huang, Zhaofeng He |
Stylized text-to-image generation focuses on creating images from textual
descriptions while adhering to a style specified by a few reference images.
However, subtle style variations within different reference images can hinder
the model from accurately learning the target style. In this paper, we propose
InstaStyle, a novel approach that excels in generating high-fidelity stylized
images with only a single reference image. Our approach is based on the finding
that the inversion noise from a stylized reference image inherently carries the
style signal, as evidenced by its non-zero signal-to-noise ratio. We employ
DDIM inversion to extract this noise from the reference image and leverage a
diffusion model to generate new stylized images from the "style" noise.
Additionally, the inherent ambiguity and bias of textual prompts impede the
precise conveying of style. To address this, we introduce a learnable style
token via prompt refinement, which enhances the accuracy of the style
description for the reference image. Qualitative and quantitative experimental
results demonstrate that InstaStyle achieves superior performance compared to
current benchmarks. Furthermore, our approach also showcases its capability in
the creative task of style combination with mixed inversion noise. |
Proposes InstaStyle, a novel stylized text-to-image generation method that effectively captures and generates images in specific styles using only a single reference image. |
Addresses limitations of existing methods that struggle to capture subtle style variations from multiple reference images or rely on ambiguous textual prompts. |
Leverages DDIM inversion to extract style information from a single reference image and employs a prompt refinement scheme to learn a style token, enhancing style accuracy and enabling style combination. |
Generates high-fidelity stylized images with fine-grained style details from a single reference image.
Learned style token effectively avoids ambiguity and bias present in human-written textual style descriptions.
Supports creative style combination by mixing inversion noise and employing a composed guidance mechanism. |
Limited exploration of the impact of different masking strategies and prompt mix ratios on style combination.
Reliance on manual selection for prompt refinement, which could be automated in future work. |
stylized image generation, text-to-image synthesis, diffusion models, ddim inversion, prompt refinement |
2311.15027
Report |
Double-Flow-based Steganography without Embedding for Image-to-Image Hiding |
Bingbing Song, Derui Wang, Tianwei Zhang, Renyang Liu, Yu Lin, Wei Zhou |
As an emerging concept, steganography without embedding (SWE) hides a secret
message without directly embedding it into a cover. Thus, SWE has the unique
advantage of being immune to typical steganalysis methods and can better
protect the secret message from being exposed. However, existing SWE methods
are generally criticized for their poor payload capacity and low fidelity of
recovered secret messages. In this paper, we propose a novel
steganography-without-embedding technique, named DF-SWE, which addresses the
aforementioned drawbacks and produces diverse and natural stego images.
Specifically, DF-SWE employs a reversible circulation of double flow to build a
reversible bijective transformation between the secret image and the generated
stego image. Hence, it provides a way to directly generate stego images from
secret images without a cover image. Moreover, by leveraging the invertible
property, DF-SWE can recover a secret image from a generated stego image in a
nearly lossless manner, increasing the fidelity of extracted secret images.
To the best of our knowledge, DF-SWE is the first SWE method that can hide
large images and multiple images in one image of the same size,
significantly enhancing the payload capacity. According to the experimental
results, the payload capacity of DF-SWE reaches 24-72 BPP, which is 8000-16000 times
that of its competitors, while producing diverse images to minimize the
exposure risk. Importantly, DF-SWE can be applied in the steganography of
secret images in various domains without requiring training data from the
corresponding domains. This domain-agnostic property suggests that DF-SWE can
1) be applied to hiding private data and 2) be deployed in resource-limited
systems. |
This paper proposes DF-SWE, a novel steganography-without-embedding technique that uses a reversible circulation of double flow to hide large and multiple images within a single, naturally generated stego image. |
Existing SWE methods suffer from limited payload capacity and low fidelity of recovered secret messages. DF-SWE addresses these limitations by enabling the hiding of large images, even multiple images, without a cover image, thereby significantly enhancing security against steganalysis. |
DF-SWE employs two flow-based models to establish a reversible bijective transformation between secret images and generated stego images. It leverages prior knowledge sampling for initialization, high-dimensional space replacement for information transfer, and distribution consistency transformation to ensure high-quality stego image generation. |
DF-SWE achieves a payload capacity of 24-72 BPP, significantly higher than existing SWE methods.
The method ensures a low extraction error, enabling near-lossless recovery of hidden images.
DF-SWE exhibits domain generalization, enabling it to hide images from different domains without requiring domain-specific training data. |
While achieving excellent secret image recovery, the method is not completely lossless.
Future work includes exploring complete lossless recovery and extending the method to multi-modal data hiding. |
image steganography, steganography without embedding, flow-based model, image hiding, domain generalization |
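The 24-72 BPP (bits per pixel) range quoted in the DF-SWE entry can be read with a short calculation. The sketch below assumes that each secret image is an 8-bit, 3-channel RGB image with the same resolution as the stego image, so that one hidden image contributes 24 bits per stego pixel. That reading is an assumption inferred from the stated numbers, and `payload_bpp` is a hypothetical helper, not part of the paper.

```python
def payload_bpp(num_secret_images, channels=3, bits_per_channel=8):
    """Secret bits carried per stego pixel when every secret image
    has the same resolution as the stego image (assumed setup)."""
    return num_secret_images * channels * bits_per_channel

one_image = payload_bpp(1)     # a single same-size RGB secret image
three_images = payload_bpp(3)  # three same-size RGB secret images
```

Under this assumption, the lower end of the range corresponds to hiding one image and the upper end to hiding three.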
2311.14768
Report |
AdaDiff: Adaptive Step Selection for Fast Diffusion |
Hui Zhang, Zuxuan Wu, Zhen Xing, Jie Shao, Yu-Gang Jiang |
Diffusion models, a type of generative model, have achieved impressive
results in generating images and videos conditioned on text.
However, the generation process of diffusion models involves denoising for
dozens of steps to produce photorealistic images/videos, which is
computationally expensive. Unlike previous methods that design
"one-size-fits-all" approaches for speedup, we argue that denoising steps should
be sample-specific conditioned on the richness of input texts. To this end, we
introduce AdaDiff, a lightweight framework designed to learn instance-specific
step usage policies, which are then used by the diffusion model for generation.
AdaDiff is optimized using a policy gradient method to maximize a carefully
designed reward function, balancing inference time and generation quality. We
conduct experiments on three image generation and two video generation
benchmarks and demonstrate that our approach achieves similar results in terms
of visual quality compared to the baseline using a fixed 50 denoising steps
while reducing inference time by at least 33%, going as high as 40%.
Furthermore, our qualitative analysis shows that our method allocates more
steps to more informative text conditions and fewer steps to simpler text
conditions. |
This paper introduces AdaDiff, an end-to-end framework that learns instance-specific step usage policies for diffusion models conditioned on textual prompts to reduce computational cost and inference time. |
Diffusion models, while effective, require dozens of computationally expensive denoising steps for generating high-quality images/videos. This paper argues that the number of steps should be adaptive to the complexity of the input prompt, unlike traditional "one-size-fits-all" approaches. |
AdaDiff employs a lightweight step selection network trained using reinforcement learning with a policy gradient method. The network learns to maximize a reward function that balances image/video quality (evaluated using an IQS model) and the number of steps saved. |
AdaDiff reduces inference time by 33%-40% compared to fixed-step baselines while maintaining similar visual quality across various image and video generation benchmarks.
The learned adaptive policies demonstrate superior performance over random step selection, achieving better visual quality with similar computational resources.
AdaDiff can be seamlessly integrated with other diffusion model acceleration methods and exhibits promising zero-shot transfer capabilities to different datasets. |
The current implementation primarily focuses on a predefined set of discrete steps for the DDIM sampler.
The IQS model, while effective, might not fully encapsulate the nuances of human perception in all scenarios. |
diffusion models, generative models, text-to-image generation, text-to-video generation, reinforcement learning |
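The step-selection idea summarized above can be illustrated with a toy policy-gradient loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: the step choices, the reward form, the trade_off weight, and toy_quality (a stand-in for the learned quality scorer) are all hypothetical, and a small table of logits replaces the step selection network.

```python
import math
import random

STEP_CHOICES = [10, 20, 30, 40, 50]  # hypothetical discrete step options
MAX_STEPS = 50                       # fixed-step baseline from the abstract


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward(quality, steps, trade_off=0.5):
    """Hypothetical reward: quality score plus a bonus for steps saved."""
    return quality + trade_off * (1.0 - steps / MAX_STEPS)


def toy_quality(complexity, steps):
    """Stand-in for a quality scorer: richer prompts need more steps."""
    needed = 10 + 40 * complexity
    return min(1.0, steps / needed)


def train(num_iters=20000, lr=0.1, seed=0):
    rng = random.Random(seed)
    # Tabular policy: one logit vector per prompt kind (0 = simple, 1 = rich).
    logits = {0: [0.0] * len(STEP_CHOICES), 1: [0.0] * len(STEP_CHOICES)}
    baselines = {0: 0.0, 1: 0.0}  # running mean reward per kind
    for _ in range(num_iters):
        kind = rng.randint(0, 1)
        probs = softmax(logits[kind])
        action = rng.choices(range(len(STEP_CHOICES)), weights=probs)[0]
        steps = STEP_CHOICES[action]
        r = reward(toy_quality(float(kind), steps), steps)
        advantage = r - baselines[kind]
        baselines[kind] += 0.05 * (r - baselines[kind])
        # REINFORCE: d log pi(a) / d logit_i = onehot(a)_i - probs_i.
        for i in range(len(STEP_CHOICES)):
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[kind][i] += lr * advantage * grad
    return logits


def greedy_steps(logits, kind):
    best = max(range(len(STEP_CHOICES)), key=lambda i: logits[kind][i])
    return STEP_CHOICES[best]


policy = train()
print("simple prompt:", greedy_steps(policy, 0), "steps")
print("rich prompt:", greedy_steps(policy, 1), "steps")
```

In this toy setting the learned policy assigns fewer steps to the simple prompt kind than to the rich one, mirroring the qualitative behavior the abstract reports.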
2311.14760
Report |
SinSR: Diffusion-Based Image Super-Resolution in a Single Step |
Yufei Wang, Wenhan Yang, Xinyuan Chen, Yaohui Wang, Lanqing Guo, Lap-Pui Chau, Ziwei Liu, Yu Qiao, Alex C. Kot, Bihan Wen |
While super-resolution (SR) methods based on diffusion models exhibit
promising results, their practical application is hindered by the substantial
number of required inference steps. Recent methods utilize degraded images in
the initial state, thereby shortening the Markov chain. Nevertheless, these
solutions either rely on a precise formulation of the degradation process or
still necessitate a relatively lengthy generation path (e.g., 15 iterations).
To enhance inference speed, we propose a simple yet effective method for
achieving single-step SR generation, named SinSR. Specifically, we first derive
a deterministic sampling process from the most recent state-of-the-art (SOTA)
method for accelerating diffusion-based SR. This allows the mapping between the
input random noise and the generated high-resolution image to be obtained in a
reduced and acceptable number of inference steps during training. We show that
this deterministic mapping can be distilled into a student model that performs
SR within only one inference step. Additionally, we propose a novel
consistency-preserving loss to simultaneously leverage the ground-truth image
during the distillation process, ensuring that the performance of the student
model is not solely bound by the feature manifold of the teacher model,
resulting in further performance improvement. Extensive experiments conducted
on synthetic and real-world datasets demonstrate that the proposed method can
achieve comparable or even superior performance compared to both previous SOTA
methods and the teacher model, in just one sampling step, resulting in a
remarkable speedup of up to 10x for inference. Our code will be released at
https://github.com/wyf0912/SinSR |
This paper proposes SinSR, a novel method for single-step super-resolution (SR) image generation using a distilled deterministic mapping from a pre-trained diffusion model. |
Existing diffusion-based SR methods, while effective, suffer from slow inference speed due to the numerous steps required in the Markov chain. |
The authors first derive a deterministic sampling process from a state-of-the-art SR diffusion model (ResShift). Then, they train a student network to learn the deterministic mapping between input noise and the generated HR image in a single step using a novel consistency-preserving loss that leverages ground-truth images. |
SinSR achieves comparable or superior performance to state-of-the-art SR methods on both synthetic and real-world datasets.
The method reduces inference steps from 15 to 1, resulting in a significant speedup.
Directly learning the deterministic mapping between noise and HR images is shown to be more effective than denoising at different noise levels. |
The training process, while faster than training from scratch, still involves solving ODEs which can be computationally expensive.
Further exploration of alternative teacher diffusion models and distillation strategies could potentially yield additional performance gains. |
super-resolution, diffusion models, image generation, single-step inference, knowledge distillation |
2311.14749
Report |
Compositional Zero-shot Learning via Progressive Language-based Observations |
Lin Li, Guikun Chen, Jun Xiao, Long Chen |
Compositional zero-shot learning aims to recognize unseen state-object
compositions by leveraging known primitives (state and object) during training.
However, effectively modeling interactions between primitives and generalizing
knowledge to novel compositions remains a perennial challenge. There are two
key factors: object-conditioned and state-conditioned variance, i.e., the
appearance of states (or objects) can vary significantly when combined with
different objects (or states). For instance, the state "old" can signify a
vintage design for a "car" or an advanced age for a "cat". In this paper, we
argue that these variances can be mitigated by predicting composition
categories based on pre-observed primitives. To this end, we propose Progressive
Language-based Observations (PLO), which can dynamically determine a better
observation order of primitives. These observations comprise a series of
concepts or languages that allow the model to understand image content in a
step-by-step manner. Specifically, PLO adopts pre-trained vision-language
models (VLMs) to empower the model with observation capabilities. We further
devise two variants: 1) PLO-VLM: a two-step method, where a pre-observing
classifier dynamically determines the observation order of two primitives. 2)
PLO-LLM: a multi-step scheme, which utilizes large language models (LLMs) to
craft composition-specific prompts for step-by-step observing. Extensive
ablations on three challenging datasets demonstrate the superiority of PLO
compared with state-of-the-art methods, affirming its abilities in
compositional recognition. |
The paper proposes Progressive Language-based Observations (PLO), a novel approach for compositional zero-shot learning that dynamically determines the order of observations using language to recognize unseen state-object compositions. |
Effectively modeling interactions between primitives (states and objects) and generalizing to novel compositions in CZSL is challenging due to object-conditioned and state-conditioned variance in visual appearance. |
PLO leverages pre-trained vision-language models (VLMs) to enable models to observe image content step-by-step. It has two variants: PLO-VLM uses a pre-observing classifier to dynamically determine the observation order of primitives, while PLO-LLM utilizes large language models (LLMs) to craft composition-specific prompts for step-by-step observation. |
PLO outperforms state-of-the-art CZSL methods on three benchmark datasets (MIT-States, UT-Zappos, and C-GQA) in both closed-world and open-world settings.
Dynamically determining the observation order based on image content leads to better performance than fixed observation orders.
Increasing the number of observation prompts in PLO-LLM generally improves accuracy. |
PLO primarily focuses on recognizing novel compositions of seen states and objects, not entirely new state or object categories.
PLO-LLM's reliance on external language model APIs introduces cost constraints, especially with a large number of composition categories. |
compositional zero-shot learning, vision-language models, large language models, dynamic observation order, progressive observation |
2311.14671
Report |
SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation |
Lingchen Meng, Shiyi Lan, Hengduo Li, Jose M. Alvarez, Zuxuan Wu, Yu-Gang Jiang |
In-context segmentation aims at segmenting novel images using a few labeled
example images, termed as "in-context examples", exploring content similarities
between examples and the target. The resulting models can be generalized
seamlessly to novel segmentation tasks, significantly reducing the labeling and
training costs compared with conventional pipelines. However, in-context
segmentation is more challenging than classic segmentation, as it requires the model to learn
segmentation rules conditioned on a few samples. Unlike previous work with
ad-hoc or non-end-to-end designs, we propose SEGIC, an end-to-end
segment-in-context framework built upon a single vision foundation model (VFM).
In particular, SEGIC leverages the emergent correspondence within VFM to
capture dense relationships between target images and in-context samples. As
such, information from in-context samples is then extracted into three types of
instructions, i.e. geometric, visual, and meta instructions, serving as
explicit conditions for the final mask prediction. SEGIC is a straightforward
yet effective approach that yields state-of-the-art performance on one-shot
segmentation benchmarks. Notably, SEGIC can be easily generalized to diverse
tasks, including video object segmentation and open-vocabulary segmentation.
Code will be available at https://github.com/MengLcool/SEGIC. |
SEGIC is an end-to-end segment-in-context framework that leverages the emergent correspondence of a single frozen vision foundation model for in-context segmentation. |
In-context learning in vision, particularly for segmentation, is challenging but highly desirable as it allows models to generalize to novel segmentation tasks with low training costs. |
SEGIC leverages a pre-trained vision foundation model to establish dense correspondences between target images and in-context examples. It then extracts geometric, visual, and meta instructions from in-context samples to guide a lightweight mask decoder for segmentation. |
SEGIC achieves state-of-the-art performance on one-shot segmentation benchmarks, including COCO-20$^i$, FSS-1000, and LVIS-92$^i$.
Without fine-tuning on video data, SEGIC demonstrates competitive zero-shot video object segmentation performance on DAVIS-17 and YouTube-VOS-18.
It shows strong performance on generic semantic segmentation (COCO, ADE20k) and open-vocabulary semantic segmentation (PC-459, ADE-847) benchmarks. |
Current work mainly focuses on utilizing one in-context example per entity.
Instance-level segmentation in an open-world setting is not extensively explored. |
in-context learning, segmentation generalist, vision foundation model, emergent correspondence, one-shot segmentation |
2311.14631
Report |
CatVersion: Concatenating Embeddings for Diffusion-Based Text-to-Image Personalization |
Ruoyu Zhao, Mingrui Zhu, Shiyin Dong, Nannan Wang, Xinbo Gao |
We propose CatVersion, an inversion-based method that learns the personalized
concept through a handful of examples. Subsequently, users can utilize text
prompts to generate images that embody the personalized concept, thereby
achieving text-to-image personalization. In contrast to existing approaches
that emphasize word embedding learning or parameter fine-tuning for the
diffusion model, which potentially causes concept dilution or overfitting, our
method concatenates embeddings on the feature-dense space of the text encoder
in the diffusion model to learn the gap between the personalized concept and
its base class, aiming to maximize the preservation of prior knowledge in
diffusion models while restoring the personalized concepts. To this end, we
first dissect the text encoder's integration in the image generation process to
identify the feature-dense space of the encoder. Afterward, we concatenate
embeddings on the Keys and Values in this space to learn the gap between the
personalized concept and its base class. In this way, the concatenated
embeddings ultimately manifest as a residual on the original attention output.
To more accurately and unbiasedly quantify the results of personalized image
generation, we improve the CLIP image alignment score based on masks.
Qualitatively and quantitatively, CatVersion helps to restore personalization
concepts more faithfully and enables more robust editing. |
CatVersion, a novel text-to-image personalization method, concatenates embeddings into a highly integrated feature space within the text encoder of diffusion models, learning the difference between a personalized concept and its base class. |
Existing T2I personalization methods struggle with concept dilution or overfitting when learning personalized concepts. This work addresses these limitations by learning in a feature-dense space within the text encoder, improving the fidelity of personalized concept restoration and text-guided editability. |
The authors first identify a feature-dense space within the last few layers of the CLIP text encoder. Then, learnable embeddings are concatenated to the Keys and Values in this space and optimized to learn the difference between the personalized concept and its base class. This difference is ultimately represented as a residual on the original attention output. |
CatVersion demonstrates superior performance in restoring personalized concepts and enabling text-guided editing compared to baseline methods.
Optimizing embeddings in a feature-dense space leads to better learning of the target concept and improves contextual understanding.
Concatenating residual embeddings significantly enhances the reconstruction ability of personalized concepts. |
The current implementation requires separate optimization for each concept, impacting inversion speed.
The method is limited to learning a single concept per optimization process. |
text-to-image generation, personalization, diffusion models, clip, concept inversion |
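The claim that concatenated Key/Value embeddings manifest as a residual on the original attention output can be checked numerically. The following is a minimal single-query sketch, assuming plain scaled dot-product attention; the vectors, extra_key, and extra_value are made-up illustrative numbers, not values from the paper.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Single-query scaled dot-product attention; returns (output, weights)."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights


query = [0.2, -0.1, 0.4, 0.3]
keys = [[0.1, 0.0, 0.3, -0.2], [0.5, 0.2, -0.1, 0.0], [-0.3, 0.4, 0.2, 0.1]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Hypothetical learnable embeddings appended to the Keys and Values.
extra_key = [0.3, -0.2, 0.1, 0.4]
extra_value = [2.0, -1.0]

base_out, _ = attend(query, keys, values)
cat_out, cat_weights = attend(query, keys + [extra_key], values + [extra_value])

# alpha is the attention mass the query assigns to the appended token.
alpha = cat_weights[-1]
residual = [alpha * (ev - b) for ev, b in zip(extra_value, base_out)]
reconstructed = [b + r for b, r in zip(base_out, residual)]

print("concatenated output:", cat_out)
print("base + residual:    ", reconstructed)
```

With one appended token, the new output equals the original output plus alpha * (extra_value - original output), so the original attention result is preserved whenever the appended key receives little attention.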
2311.14603
Report |
Animate124: Animating One Image to 4D Dynamic Scene |
Yuyang Zhao, Zhiwen Yan, Enze Xie, Lanqing Hong, Zhenguo Li, Gim Hee Lee |
We introduce Animate124 (Animate-one-image-to-4D), the first work to animate
a single in-the-wild image into 3D video through textual motion descriptions,
an underexplored problem with significant applications. Our 4D generation
leverages an advanced 4D grid dynamic Neural Radiance Field (NeRF) model,
optimized in three distinct stages using multiple diffusion priors. Initially,
a static model is optimized using the reference image, guided by 2D and 3D
diffusion priors, which serves as the initialization for the dynamic NeRF.
Subsequently, a video diffusion model is employed to learn the motion specific
to the subject. However, the object in the 3D videos tends to drift away from
the reference image over time. This drift is mainly due to the misalignment
between the text prompt and the reference image in the video diffusion model.
In the final stage, a personalized diffusion prior is therefore utilized to
address the semantic drift. As the pioneering image-text-to-4D generation
framework, our method demonstrates significant advancements over existing
baselines, evidenced by comprehensive quantitative and qualitative assessments. |
Animate124 is the first framework to animate a single in-the-wild image into 3D video with motion defined by a text prompt. |
Dynamic 3D scenes effectively represent the real world and have applications in video games, AR, and VR. |
A static-to-dynamic and coarse-to-fine strategy optimizes a 4D grid dynamic NeRF using diffusion priors from 2D image, 3D, and personalized image diffusion models in three stages. |
Animate124 outperforms baselines in generating coherent 3D videos from single images and text prompts.
Animate124 exhibits superior control over the protagonist's motion compared to MAV3D.
Semantic refinement using a personalized diffusion prior effectively mitigates semantic drift. |
The reliance on a large CFG scale for SDS can lead to over-saturation and over-smoothing.
Limited availability of diverse and high-quality image-text-4D datasets. |
3d video generation, dynamic nerf, diffusion models, text-to-3d, image animation |
2311.14552
Report |
Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models |
Yufei Zhan, Yousong Zhu, Zhiyang Chen, Fan Yang, Ming Tang, Jinqiao Wang |
Replicating the innate human ability to detect all objects based on free-form
texts at any granularity remains a formidable challenge for Vision-Language
models. Current Large Vision Language Models (LVLMs) are predominantly
constrained to grounding a single, pre-existing object, relying solely on data
from Referring Expression Comprehension tasks. The limitation leads to a
compromise in model design, necessitating the introduction of visual expert
models or the integration of customized head structures. Beyond these
constraints, our research delves into the untapped potential of LVLMs and
uncovers their inherent capability for basic object perception, allowing them to
accurately identify and locate objects of interest. Building on this insight,
we introduce a novel language-prompted localization dataset designed to fully
unleash the capabilities of LVLMs in integrating fine-grained object perception
with precise location awareness. More importantly, we present
Griffon, a purely LVLM-based baseline, which does not require the
introduction of any special tokens, expert models, or additional detection
modules. It simply maintains a consistent structure with popular LVLMs by
unifying data formats across various localization-related scenarios and is
trained end-to-end through a well-designed pipeline. Comprehensive experiments
demonstrate that Griffon not only achieves state-of-the-art
performance on the fine-grained RefCOCO series but also approaches the
capabilities of the expert model Faster RCNN on the detection benchmark MSCOCO. |
This paper introduces a novel language-prompted localization dataset and a purely LVLM-based baseline model called Griffon, capable of localizing objects at any granularity based on free-form input texts. |
Existing Vision-Language models struggle to locate multiple objects from complex text descriptions, often relying on external expert models or specialized heads, limiting their generalizability and efficiency. |
The authors construct a large-scale dataset with various localization scenarios and train Griffon in two stages: (1) basic scenario pre-training for multi-object perception and (2) full scenario instruction tuning for user intention comprehension. A training-free scoring mechanism ranks object outputs for improved confidence. |
Griffon achieves state-of-the-art results on the RefCOCO series for single referent localization.
Griffon approaches the performance of the expert model Faster RCNN on the MSCOCO object detection benchmark.
Griffon effectively handles complex scenarios, including localizing multiple objects of the same category and refusing to output for non-existing objects. |
The current work mainly focuses on localization tasks.
Future work will explore integrating other vision and language tasks into Griffon. |
vision-language models, object localization, referring expression comprehension, object detection, multi-object perception |
2311.14521
Report |
GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting |
Yiwen Chen, Zilong Chen, Chi Zhang, Feng Wang, Xiaofeng Yang, Yikai Wang, Zhongang Cai, Lei Yang, Huaping Liu, Guosheng Lin |
3D editing plays a crucial role in many areas such as gaming and virtual
reality. Traditional 3D editing methods, which rely on representations like
meshes and point clouds, often fall short in realistically depicting complex
scenes. On the other hand, methods based on implicit 3D representations, like
Neural Radiance Field (NeRF), render complex scenes effectively but suffer from
slow processing speeds and limited control over specific scene areas. In
response to these challenges, our paper presents GaussianEditor, an innovative
and efficient 3D editing algorithm based on Gaussian Splatting (GS), a novel 3D
representation. GaussianEditor enhances precision and control in editing
through our proposed Gaussian semantic tracing, which traces the editing target
throughout the training process. Additionally, we propose Hierarchical Gaussian
splatting (HGS) to achieve stabilized and fine results under stochastic
generative guidance from 2D diffusion models. We also develop editing
strategies for efficient object removal and integration, a challenging task for
existing methods. Our comprehensive experiments demonstrate GaussianEditor's
superior control, efficacy, and rapid performance, marking a significant
advancement in 3D editing. Project Page:
https://buaacyw.github.io/gaussian-editor/ |
Presents GaussianEditor, a novel 3D editing algorithm based on Gaussian Splatting for fast and controllable 3D scene editing. |
Traditional mesh-based editing struggles with complex scenes, while NeRF-based editing is slow and lacks local control. GaussianEditor leverages Gaussian Splatting's advantages for speed and controllability in 3D editing. |
Introduces Gaussian semantic tracing for precise editing target localization and Hierarchical Gaussian Splatting (HGS) for stable optimization under generative guidance. Develops specific algorithms for object removal and integration in Gaussian Splatting. |
Achieves superior control over editing areas compared to previous methods like Instruct-NeRF2NeRF.
Enables efficient object removal and integration within minutes, significantly faster than NeRF-based editing.
Demonstrates effectiveness in various editing tasks, including scene modification, facial swaps, and object manipulation. |
Reliance on 2D diffusion models for guidance limits editing capabilities for complex prompts.
Future work includes exploring alternative guidance mechanisms and expanding editing functionalities. |
3d editing, gaussian splatting, generative guidance, semantic tracing, 3d inpainting |
2311.14494
Report |
MVControl: Adding Conditional Control to Multi-view Diffusion for Controllable Text-to-3D Generation |
Zhiqi Li, Yiming Chen, Lingzhe Zhao, Peidong Liu |
We introduce MVControl, a novel neural network architecture that enhances
existing pre-trained multi-view 2D diffusion models by incorporating additional
input conditions, e.g. edge maps. Our approach enables the generation of
controllable multi-view images and view-consistent 3D content. To achieve
controllable multi-view image generation, we leverage MVDream as our base
model, and train a new neural network module as an additional plugin for
end-to-end task-specific condition learning. To precisely control the shapes
and views of generated images, we innovatively propose a new conditioning
mechanism that predicts an embedding encapsulating the input spatial and view
conditions, which is then injected into the network globally. Once MVControl is
trained, score-distillation (SDS) loss based optimization can be performed to
generate 3D content, in which process we propose to use a hybrid diffusion
prior. The hybrid prior relies on a pre-trained Stable-Diffusion network and
our trained MVControl for additional guidance. Extensive experiments
demonstrate that our method achieves robust generalization and enables the
controllable generation of high-quality 3D content. Code available at
https://github.com/WU-CVGL/MVControl/. |
Introduces MVControl, a novel neural network architecture enhancing pre-trained multi-view 2D diffusion models with additional input conditions (e.g., edge maps) for controllable text-to-3D generation. |
Addresses the limitations of existing text-to-3D generation methods in achieving fine-grained control over generated content, similar to ControlNet in text-to-image generation. |
Leverages MVDream as the base model and incorporates a trainable control network. Employs a conditioning module to predict embeddings from input conditions (e.g., edge maps, camera poses), injecting them into the network for control. Utilizes a hybrid diffusion prior with Stable-Diffusion and MVControl for controllable text-to-3D generation via score distillation optimization. |
Achieves fine-grained control over the shapes and views of generated multi-view images.
Generates high-fidelity controllable multi-view images and view-consistent 3D content.
Demonstrates robust generalization and superior performance compared to prior text-to-3D methods. |
Current implementation primarily explores edge maps as conditional input.
Reliance on pre-trained models might limit the generation of novel object categories. |
text-to-3d generation, multi-view diffusion models, controllable image synthesis, score distillation sampling, 3d deep learning |
2311.14284
Report |
Paragraph-to-Image Generation with Information-Enriched Diffusion Model |
Weijia Wu, Zhuang Li, Yefei He, Mike Zheng Shou, Chunhua Shen, Lele Cheng, Yan Li, Tingting Gao, Di Zhang, Zhongyuan Wang |
Text-to-image (T2I) models have recently experienced rapid development,
achieving astonishing performance in terms of fidelity and textual alignment
capabilities. However, given a long paragraph (up to 512 words), these
generation models still struggle to achieve strong alignment and are unable to
generate images depicting complex scenes. In this paper, we introduce an
information-enriched diffusion model for paragraph-to-image generation task,
termed ParaDiffusion, which delves into the transference of the extensive
semantic comprehension capabilities of large language models to the task of
image generation. At its core is using a large language model (e.g., Llama V2)
to encode long-form text, followed by fine-tuning with LoRA to align the
text-image feature spaces in the generation task. To facilitate the training of
long-text semantic alignment, we also curated a high-quality paragraph-image
pair dataset, namely ParaImage. This dataset contains a small amount of
high-quality, meticulously annotated data, and a large-scale synthetic dataset
with long text descriptions being generated using a vision-language model.
Experiments demonstrate that ParaDiffusion outperforms state-of-the-art models
(SD XL, DeepFloyd IF) on ViLG-300 and ParaPrompts, achieving up to 15% and 45%
human voting rate improvements for visual appeal and text faithfulness,
respectively. The code and dataset will be released to foster community
research on long-text alignment. |
Introduces ParaDiffusion, an information-enriched diffusion model for paragraph-to-image generation, tackling the challenge of aligning long-form text with images. |
Existing T2I models struggle with long paragraphs due to limitations in data (short captions) and architecture (text encoder constraints). |
1. Created ParaImage, a dataset with paragraph-image pairs (up to 400 words), including synthetic (ParaImage-Big) and manually annotated (ParaImage-Small) data. 2. Employed Llama V2 as a text encoder, fine-tuned with LoRA to align text and image feature spaces. 3. Three-stage training: pre-training, paragraph-image alignment learning, quality tuning. |
Outperforms state-of-the-art models (SD XL, DeepFloyd IF) on visual appeal and text faithfulness by up to 15% and 45%, respectively.
Fine-tuning LLM with LoRA significantly improves performance compared to using frozen LLMs.
ParaImage dataset, especially the manually annotated portion, proves crucial for high-quality results. |
Inference speed needs optimization (consider ODE solvers, consistency models).
Occasional unrealistic image generation (address with data augmentation, geometric/semantic constraints). |
text-to-image generation, long-text alignment, diffusion models, large language models, dataset creation |
2311.14282
Report |
Image Super-Resolution with Text Prompt Diffusion |
Zheng Chen, Yulun Zhang, Jinjin Gu, Xin Yuan, Linghe Kong, Guihai Chen, Xiaokang Yang |
Image super-resolution (SR) methods typically model degradation to improve
reconstruction accuracy in complex and unknown degradation scenarios. However,
extracting degradation information from low-resolution images is challenging,
which limits the model performance. To boost image SR performance, one feasible
approach is to introduce additional priors. Inspired by advancements in
multi-modal methods and text prompt image processing, we introduce text prompts
to image SR to provide degradation priors. Specifically, we first design a
text-image generation pipeline to integrate text into the SR dataset through
the text degradation representation and degradation model. The text
representation applies a discretization manner based on the binning method to
describe the degradation abstractly. This method maintains the flexibility of
the text and is user-friendly. Meanwhile, we propose the PromptSR to realize
the text prompt SR. The PromptSR utilizes the pre-trained language model (e.g.,
T5 or CLIP) to enhance restoration. We train the model on the generated
text-image dataset. Extensive experiments indicate that introducing text
prompts into SR yields excellent results on both synthetic and real-world
images. Code is available at: https://github.com/zhengchen1999/PromptSR. |
This paper introduces text prompts as degradation priors to enhance image super-resolution by providing additional information about image degradation. |
Modeling degradation in image super-resolution is crucial, especially in complex real-world scenarios, and using text prompts can provide richer degradation information than solely relying on low-resolution images. |
The authors propose a text-image generation pipeline to create a dataset containing low-resolution images, corresponding high-resolution images, and text prompts describing the degradation. They also introduce PromptSR, a network based on a diffusion model and a pre-trained language model, to perform super-resolution conditioned on both the low-resolution image and the text prompt. |
Introducing text prompts significantly improves super-resolution performance compared to methods without text guidance.
The proposed method exhibits flexibility in handling different degradation and prompt formats, including random order and simplified descriptions.
PromptSR achieves superior performance on both synthetic and real-world datasets, demonstrating the effectiveness of incorporating text prompts in image super-resolution. |
The performance slightly drops when using randomly ordered degradation operations compared to a fixed order.
Future work could explore combining image content descriptions with degradation descriptions in the text prompt for potential further improvements. |
image super-resolution, text prompt, diffusion model, degradation prior, blind super-resolution |
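The binning-based text representation of degradation described in the abstract can be sketched as follows. This is an illustration only: the degradation names, bin edges, and labels in BINS are hypothetical, since the paper's actual values are not given in this summary.

```python
import bisect

# Hypothetical bin edges and labels per degradation type.
# For JPEG the value is a quality factor, so lower values mean heavier degradation.
BINS = {
    "blur": ([1.0, 2.5], ["light", "medium", "heavy"]),
    "noise": ([10.0, 25.0], ["light", "medium", "heavy"]),
    "jpeg compression": ([30.0, 60.0], ["heavy", "medium", "light"]),
}


def describe(name, value):
    """Map one numeric degradation parameter to a coarse text label."""
    edges, labels = BINS[name]
    return f"{labels[bisect.bisect_right(edges, value)]} {name}"


def build_prompt(degradations):
    """Join per-degradation descriptions in the order they were applied."""
    return ", ".join(describe(name, value) for name, value in degradations)


prompt = build_prompt([("blur", 3.0), ("noise", 5.0), ("jpeg compression", 45.0)])
print(prompt)  # heavy blur, light noise, medium jpeg compression
```

Discretizing into a few bins keeps the prompt abstract, so a user can write it without knowing exact degradation parameters.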
2311.14208
Report |
ECRF: Entropy-Constrained Neural Radiance Fields Compression with Frequency Domain Optimization |
Soonbin Lee, Fangwen Shu, Yago Sanchez, Thomas Schierl, Cornelius Hellge |
Explicit feature-grid based NeRF models have shown promising results in terms
of rendering quality and significant speed-up in training. However, these
methods often require a significant amount of data to represent a single scene
or object. In this work, we present a compression model that aims to minimize
the entropy in the frequency domain in order to effectively reduce the data
size. First, we propose using the discrete cosine transform (DCT) on the
tensorial radiance fields to compress the feature-grid. This feature-grid is
transformed into coefficients, which are then quantized and entropy encoded,
following a similar approach to the traditional video coding pipeline.
Furthermore, to achieve a higher level of sparsity, we propose using an entropy
parameterization technique for the frequency domain, specifically for DCT
coefficients of the feature-grid. Since the transformed coefficients are
optimized during the training phase, the proposed model does not require any
fine-tuning or additional information. Our model only requires a lightweight
compression pipeline for encoding and decoding, making it easier to apply
volumetric radiance field methods for real-world applications. Experimental
results demonstrate that our proposed frequency domain entropy model can
achieve superior compression performance across various datasets. The source
code will be made publicly available. |
This paper introduces Entropy-Constrained Radiance Fields (ECRF), a novel compression framework for tensorial radiance fields that minimizes entropy in the DCT coefficient domain for efficient compression. |
Explicit grid-based NeRF models, while efficient in training and rendering, often lead to large storage sizes, hindering their practicality. This work addresses this issue by significantly compressing these models without compromising rendering quality. |
ECRF employs a frequency-domain entropy parameterization technique: it applies the DCT to the feature grid, quantizes the coefficients to 8 bits, and entropy-codes the result for a compact representation. |
ECRF achieves superior compression performance, especially at low bitrates, outperforming existing methods.
The use of DCT and entropy minimization in the frequency domain leads to more sparse and efficient representations compared to spatial domain methods.
The proposed compression pipeline, including quantization and entropy coding, achieves a significant reduction in model size (up to 28x) with minimal impact on rendering quality (PSNR drop of only 0.1 dB). |
The entropy calculation adds overhead to the training time (4-5 minutes longer than the baseline).
Extremely low bitrates can lead to block artifacts due to the loss of high-frequency information. |
neural radiance fields, nerf compression, 3d scene representation, frequency domain compression, discrete cosine transform |
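The transform, quantize, and entropy-code steps described in the ECRF record above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the toy 16x16 "feature plane", the uniform 8-bit quantizer, and the helper names (`dct_matrix`, `quantize_8bit`, `empirical_entropy_bits`) are assumptions for illustration, and the sketch only measures entropy after the fact, whereas the paper optimizes the coefficients during training.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis as an (n, n) matrix (rows are basis vectors)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    basis = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    basis[0, :] /= np.sqrt(2.0)
    return basis

def quantize_8bit(coeffs: np.ndarray):
    """Uniformly quantize to 256 levels; return levels plus offset and step."""
    lo, hi = coeffs.min(), coeffs.max()
    step = (hi - lo) / 255.0 if hi > lo else 1.0
    levels = np.round((coeffs - lo) / step).astype(np.uint8)
    return levels, lo, step

def empirical_entropy_bits(levels: np.ndarray) -> float:
    """Shannon entropy of the quantized symbols, in bits per symbol."""
    counts = np.bincount(levels.ravel(), minlength=256)
    probs = counts[counts > 0] / levels.size
    return float(-(probs * np.log2(probs)).sum())

# Hypothetical smooth feature plane standing in for one slice of a feature grid.
n = 16
x = np.linspace(0.0, 1.0, n)
plane = np.outer(np.sin(2 * np.pi * x), np.cos(2 * np.pi * x))

D = dct_matrix(n)
coeffs = D @ plane @ D.T                      # separable 2D DCT
levels, lo, step = quantize_8bit(coeffs)      # 8-bit symbols to be entropy coded
recon = D.T @ (levels.astype(np.float64) * step + lo) @ D  # dequantize + inverse DCT

bits_per_coeff = empirical_entropy_bits(levels)
max_error = float(np.abs(recon - plane).max())
```

The empirical entropy is a lower bound on what an ideal entropy coder would spend per symbol, so it serves as a rough proxy for the compressed size of the quantized coefficients.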
2311.14097
Report |
ACT-Diffusion: Efficient Adversarial Consistency Training for One-step Diffusion Models |
Fei Kong, Jinhao Duan, Lichao Sun, Hao Cheng, Renjing Xu, Hengtao Shen, Xiaofeng Zhu, Xiaoshuang Shi, Kaidi Xu |
Though diffusion models excel in image generation, their step-by-step
denoising leads to slow generation speeds. Consistency training addresses this
issue with single-step sampling but often produces lower-quality generations
and requires high training costs. In this paper, we show that optimizing
consistency training loss minimizes the Wasserstein distance between target and
generated distributions. As timestep increases, the upper bound accumulates
previous consistency training losses. Therefore, larger batch sizes are needed
to reduce both current and accumulated losses. We propose Adversarial
Consistency Training (ACT), which directly minimizes the Jensen-Shannon (JS)
divergence between distributions at each timestep using a discriminator.
Theoretically, ACT enhances generation quality and convergence. By
incorporating a discriminator into the consistency training framework, our
method achieves improved FID scores on the CIFAR10, ImageNet 64$\times$64, and
LSUN Cat 256$\times$256 datasets, retains zero-shot image inpainting
capabilities, and uses less than $1/6$ of the original batch size and fewer
than $1/2$ of the model parameters and training steps compared to the baseline
method, leading to a substantial reduction in resource consumption. Our code
is available at https://github.com/kong13661/ACT |
This paper introduces Adversarial Consistency Training (ACT), a novel method for improving the efficiency and performance of consistency training in diffusion models. |
Diffusion models excel in image generation but suffer from slow generation speeds due to iterative denoising. Consistency training accelerates this process with single-step sampling, but often compromises generation quality. This work aims to address these limitations. |
The paper analyzes consistency training loss and proves its equivalence to optimizing the upper bound of the Wasserstein distance. To mitigate accumulated errors, ACT incorporates a discriminator into consistency training, directly minimizing the Jensen-Shannon divergence between distributions at each timestep. Additionally, it utilizes gradient penalty-based adaptive data augmentation to further enhance performance. |
ACT achieves significantly better FID scores compared to standard consistency training on CIFAR10, ImageNet 64x64, and LSUN Cat 256x256 datasets.
ACT achieves these improvements with a significantly smaller batch size (less than 1/6th), fewer model parameters, and fewer training steps compared to the baseline consistency training method.
The proposed method retains the zero-shot image inpainting capability inherent to consistency models. |
The interaction between consistency training loss and the adversarial loss introduced by the discriminator requires further investigation.
Exploration of distances beyond Jensen-Shannon Divergence for minimizing the discrepancy between generated and target distributions could be beneficial. |
diffusion models, generative adversarial networks, consistency training, image generation, adversarial consistency training |
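The link between a discriminator and the Jensen-Shannon divergence mentioned in the ACT record above follows from the standard GAN result: at the optimal discriminator, the GAN value function equals $2\,\mathrm{JS}(p, q) - 2\log 2$. The sketch below checks this numerically on two small discrete distributions. The distributions and function names are illustrative assumptions; this is not the ACT training loop.

```python
import numpy as np

def kl(p: np.ndarray, q: np.ndarray) -> float:
    """KL divergence in nats; terms with p == 0 contribute nothing."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """Jensen-Shannon divergence in nats."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def gan_value_at_optimal_discriminator(p: np.ndarray, q: np.ndarray) -> float:
    """E_p[log D*] + E_q[log(1 - D*)] with D* = p / (p + q)."""
    d_star = p / (p + q)
    return float(np.sum(p * np.log(d_star)) + np.sum(q * np.log(1.0 - d_star)))

# Toy "target" and "generated" distributions over three outcomes.
p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])

js = js_divergence(p, q)
value = gan_value_at_optimal_discriminator(p, q)
```

Here `value` and `2 * js - 2 * log(2)` agree up to floating-point error, which is why training against a well-fit discriminator pushes the JS divergence between the two distributions down.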
2311.14029
Report |
Understanding the Vulnerability of CLIP to Image Compression |
Cangxiong Chen, Vinay P. Namboodiri, Julian Padget |
CLIP is a widely used foundational vision-language model that is used for
zero-shot image recognition and other image-text alignment tasks. We
demonstrate that CLIP is vulnerable to changes in image quality under
compression. This surprising result is further analysed using an attribution
method, Integrated Gradients. Using this attribution method, we are able to
better understand, both quantitatively and qualitatively, how compression
affects the zero-shot recognition accuracy of this model.
We evaluate this extensively on CIFAR-10 and STL-10. Our work provides the
basis to understand this vulnerability of CLIP and can help us develop more
effective methods to improve the robustness of CLIP and other vision-language
models. |
This paper investigates the sensitivity of CLIP, a popular vision-language model, to image compression in zero-shot image recognition tasks. |
CLIP, trained on massive datasets with diverse image qualities, is expected to be robust to image degradation. However, this paper discovers its vulnerability to compression, which is crucial for understanding and improving the reliability of vision-language models. |
The authors evaluate CLIP's performance on compressed CIFAR-10 and STL-10 datasets with different image encoders. They employ Integrated Gradients, an attribution method, to analyze how compression affects predictions at the pixel level. |
CLIP's accuracy significantly decreases with increasing image compression on both CIFAR-10 and STL-10.
Integrated Gradients effectively quantifies and visualizes the impact of compression on CLIP's predictions.
The visualizations reveal the inductive biases of different image encoders, such as ResNet-50's locality and ViT-B/32's global attention. |
The study primarily focuses on JPEG compression and a fixed text prompt, limiting the generalizability of findings.
Future work includes investigating mitigation strategies like data augmentation to enhance CLIP's robustness to image quality variations. |
clip, vision-language models, image compression, robustness, integrated gradients |
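Integrated Gradients, the attribution method used in the record above, averages the model's gradient along a straight path from a baseline input to the actual input and scales it by the input difference. The sketch below applies a midpoint Riemann approximation to a toy quadratic function with a known gradient. The function, weights, and baseline are assumptions for illustration; the paper applies the method to CLIP's image encoders, which this sketch does not attempt.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Midpoint Riemann approximation of Integrated Gradients.

    grad_fn maps an input array to the gradient of a scalar output
    with respect to that input.
    """
    alphas = (np.arange(steps) + 0.5) / steps       # midpoints of [0, 1]
    total = np.zeros_like(x, dtype=np.float64)
    for alpha in alphas:
        total += grad_fn(baseline + alpha * (x - baseline))
    return (x - baseline) * total / steps

# Toy differentiable "model": f(x) = sum_i w_i * x_i^2 (hypothetical).
w = np.array([0.5, -1.0, 2.0])

def f(x):
    return float(np.sum(w * x ** 2))

def grad_f(x):
    return 2.0 * w * x

x = np.array([1.0, 2.0, -1.0])
baseline = np.zeros_like(x)   # all-zero baseline, analogous to a black image
attributions = integrated_gradients(grad_f, x, baseline)
```

A useful sanity check is the completeness property: the attributions sum to `f(x) - f(baseline)`, so each entry can be read as that feature's share of the change in the output.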
2311.13833
Report |
Lego: Learning to Disentangle and Invert Concepts Beyond Object Appearance in Text-to-Image Diffusion Models |
Saman Motamed, Danda Pani Paudel, Luc Van Gool |
Diffusion models have revolutionized generative content creation and
text-to-image (T2I) diffusion models in particular have increased the creative
freedom of users by allowing scene synthesis using natural language. T2I models
excel at synthesizing concepts such as nouns, appearances, and styles. To
enable customized content creation based on a few example images of a concept,
methods such as Textual Inversion and DreamBooth invert the desired concept and
enable synthesizing it in new scenes. However, inverting more general concepts
that go beyond object appearance and style (adjectives and verbs) through
natural language remains a challenge. Two key characteristics of these
concepts contribute to the limitations of current inversion methods. 1)
Adjectives and verbs are entangled with nouns (subject) and can hinder
appearance-based inversion methods, where the subject appearance leaks into the
concept embedding and 2) describing such concepts often extends beyond single
word embeddings (being frozen in ice, walking on a tightrope, etc.) that
current methods do not handle.
In this study, we introduce Lego, a textual inversion method designed to
invert subject entangled concepts from a few example images. Lego disentangles
concepts from their associated subjects using a simple yet effective Subject
Separation step and employs a Context Loss that guides the inversion of
single/multi-embedding concepts. In a thorough user study, Lego-generated
concepts were preferred over 70% of the time when compared to the baseline.
Additionally, visual question answering using a large language model suggested
Lego-generated concepts are better aligned with the text description of the
concept. |
Introduces "Lego," a novel textual inversion method for text-to-image diffusion models that inverts general concepts from images, focusing on adjectives and verbs entangled with subjects. |
Current text-to-image models and inversion techniques struggle to represent concepts beyond object appearance, especially adjectives and verbs entangled with subjects, limiting creative control in image generation. |
Lego augments Textual Inversion with (1) "Subject Separation" to disentangle concept embeddings from subject appearance and (2) a contrastive "Context Loss" to guide learning of multi-embedding concepts. |
Lego successfully inverts concepts like "melting," "frozen in ice," and "walking on a rope," outperforming language-guided models.
Human evaluation shows a strong preference for Lego-generated concepts (over 70%) compared to baseline language descriptions.
Lego demonstrates compositionality by combining learned concepts and handling complex, multi-word embedding concepts. |
Lego's ability to invert concepts is limited by the capabilities of the backbone diffusion model (e.g., facial expressions with earlier versions).
Future work includes extending Lego to learn dynamic concepts from videos. |
text-to-image synthesis, diffusion models, textual inversion, concept learning, generative ai |
2311.13831
Report |
Posterior Distillation Sampling |
Juil Koo, Chanho Park, Minhyuk Sung |
We introduce Posterior Distillation Sampling (PDS), a novel optimization
method for parametric image editing based on diffusion models. Existing
optimization-based methods, which leverage the powerful 2D prior of diffusion
models to handle various parametric images, have mainly focused on generation.
Unlike generation, editing requires a balance between conforming to the target
attribute and preserving the identity of the source content. Recent 2D image
editing methods have achieved this balance by leveraging the stochastic latent
encoded in the generative process of diffusion models. To extend the editing
capabilities of diffusion models shown in pixel space to parameter space, we
reformulate the 2D image editing method into an optimization form named PDS.
PDS matches the stochastic latents of the source and the target, enabling the
sampling of targets in diverse parameter spaces that align with a desired
attribute while maintaining the source's identity. We demonstrate that this
optimization resembles running a generative process with the target attribute,
but aligning this process with the trajectory of the source's generative
process. Extensive editing results in Neural Radiance Fields and Scalable
Vector Graphics representations demonstrate that PDS is capable of sampling
targets to fulfill the aforementioned balance across various parameter spaces. |
The paper introduces Posterior Distillation Sampling (PDS), a new optimization method for editing parametric images generated by diffusion models. |
Existing editing methods for parametric images struggle to balance conforming to the target attribute while preserving the source content's identity. PDS addresses this by aligning the generative process of the target image with that of the source image. |
PDS reformulates stochastic diffusion inversion, a 2D image editing method, into an optimization form. It matches the stochastic latents of the source and target images during the generative process, ensuring the target image inherits the source's identity. |
PDS enables complex geometric changes and object addition in NeRF editing, outperforming existing methods in both qualitative and quantitative comparisons.
In SVG editing, PDS makes minimal changes to align with target prompts while preserving structural semantics better than other optimization methods.
User studies for both NeRF and SVG editing demonstrate a strong preference for PDS results over baseline methods. |
The paper notes occasional artifacts in NeRF editing results, mitigated by a refinement stage using SDEdit and a reconstruction loss.
Future work could explore further applications of PDS in other parametric image domains. |
diffusion models, image editing, parametric images, neural radiance fields (nerf), scalable vector graphics (svg) |
2311.13681
Report |
Compact 3D Gaussian Representation for Radiance Field |
Joo Chan Lee, Daniel Rho, Xiangyu Sun, Jong Hwan Ko, Eunbyung Park |
Neural Radiance Fields (NeRFs) have demonstrated remarkable potential in
capturing complex 3D scenes with high fidelity. However, one persistent
challenge that hinders the widespread adoption of NeRFs is the computational
bottleneck due to the volumetric rendering. On the other hand, 3D Gaussian
splatting (3DGS) has recently emerged as an alternative representation that
leverages a 3D Gaussian-based representation and adopts the rasterization
pipeline to render the images rather than volumetric rendering, achieving very
fast rendering speed and promising image quality. However, a significant
drawback arises as 3DGS entails a substantial number of 3D Gaussians to
maintain the high fidelity of the rendered images, which requires a large
amount of memory and storage. To address this critical issue, we place a
specific emphasis on two key objectives: reducing the number of Gaussian points
without sacrificing performance and compressing the Gaussian attributes, such
as view-dependent color and covariance. To this end, we propose a learnable
mask strategy that significantly reduces the number of Gaussians while
preserving high performance. In addition, we propose a compact but effective
representation of view-dependent color by employing a grid-based neural field
rather than relying on spherical harmonics. Finally, we learn codebooks to
compactly represent the geometric attributes of Gaussian by vector
quantization. With model compression techniques such as quantization and
entropy coding, we consistently show over 25$\times$ reduced storage and
enhanced rendering speed, while maintaining the quality of the scene
representation, compared to 3DGS. Our work provides a comprehensive framework
for 3D scene representation, achieving high performance, fast training,
compactness, and real-time rendering. Our project page is available at
https://maincold2.github.io/c3dgs/. |
This paper introduces a novel method for compactly representing 3D scenes using 3D Gaussians, significantly reducing storage and enhancing rendering speed in 3D Gaussian Splatting (3DGS) without compromising quality. |
3DGS, despite its fast rendering, requires significant memory and storage due to the large number of Gaussians and their attributes. This work addresses this limitation, paving the way for efficient and high-quality 3D scene representation. |
The proposed method employs a learnable masking strategy to remove redundant Gaussians based on volume and transparency. It also utilizes a grid-based neural field for compact view-dependent color representation and learnable codebooks for efficient storage of Gaussian geometry (scale and rotation). |
Achieves over 25x storage reduction and significantly enhanced rendering speed compared to 3DGS across various datasets.
Maintains high-quality scene reconstruction, comparable or even superior to 3DGS.
Demonstrates the effectiveness of volume-based masking, compact color representation, and geometry codebooks through ablation studies. |
The training time of the proposed method is slightly longer than 3DGS due to the additional learning components.
Future work includes exploring more efficient neural field architectures and codebook compression techniques for further reducing storage and memory requirements. |
3d gaussian splatting, 3d scene representation, neural rendering, model compression, real-time rendering |
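The codebook idea in the record above (storing Gaussian geometry as indices into a learned codebook) can be sketched with plain nearest-neighbour vector quantization. The codebook size, the synthetic 3D "scale" vectors, and the function names are assumptions for illustration; the paper learns its codebooks during training and may use a more elaborate quantization scheme than this single-stage lookup.

```python
import numpy as np

def vq_encode(vectors: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Return, for each vector, the index of its nearest codebook entry."""
    # Squared Euclidean distances with shape (num_vectors, num_codes).
    d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def vq_decode(indices: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Look up the codebook entry for each stored index."""
    return codebook[indices]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 3))                      # 64 hypothetical codes
# Synthetic per-Gaussian scale vectors clustered near codebook entries.
scales = codebook[rng.integers(0, 64, size=1000)] + 0.01 * rng.normal(size=(1000, 3))

indices = vq_encode(scales, codebook)
recon = vq_decode(indices, codebook)

# Storage comparison: raw float32 attributes vs. uint8 indices plus the codebook.
raw_bytes = scales.astype(np.float32).nbytes
vq_bytes = indices.astype(np.uint8).nbytes + codebook.astype(np.float32).nbytes
```

With many Gaussians sharing a small codebook, the per-Gaussian cost drops from three floats to one small index, and that is where the storage saving comes from.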
2311.13655
Report |
GAN-Avatar: Controllable Personalized GAN-based Human Head Avatar |
Berna Kabadayi, Wojciech Zielonka, Bharat Lal Bhatnagar, Gerard Pons-Moll, Justus Thies |
Digital humans and, especially, 3D facial avatars have raised a lot of
attention in the past years, as they are the backbone of several applications
like immersive telepresence in AR or VR. Despite the progress, facial avatars
reconstructed from commodity hardware are incomplete and miss out on parts of
the side and back of the head, severely limiting the usability of the avatar.
This limitation in prior work stems from their requirement of face tracking,
which fails for profile and back views. To address this issue, we propose to
learn person-specific animatable avatars from images without assuming
access to precise facial expression tracking. At the core of our method, we
leverage a 3D-aware generative model that is trained to reproduce the
distribution of facial expressions from the training data. To train this
appearance model, we assume only a collection of 2D images with the
corresponding camera parameters. For controlling the model, we learn a mapping
from 3DMM facial expression parameters to the latent space of the generative
model. This mapping can be learned by sampling the latent space of the
appearance model and reconstructing the facial parameters from a normalized
frontal view, where facial expression estimation performs well. With this
scheme, we decouple 3D appearance reconstruction and animation control to
achieve high fidelity in image synthesis. In a series of experiments, we
compare our proposed technique to state-of-the-art monocular methods and show
superior quality while not requiring expression tracking of the training data. |
This paper introduces a novel method for generating animatable 3D human head avatars from images without requiring precise facial expression tracking during training. |
Existing methods heavily rely on accurate facial expression tracking, which often fails for profile or back views, limiting their ability to reconstruct complete head avatars. |
The method utilizes a 3D-aware generative model (EG3D) to learn person-specific appearance and geometry from images and camera parameters. It then employs a mapping network to map 3DMM facial expression parameters to the latent space of the generative model for animation control. |
The method achieves superior visual quality compared to state-of-the-art monocular avatar reconstruction techniques, particularly in challenging regions like teeth and hair.
It enables the generation of 3D-consistent novel views, leading to the reconstruction of complete 360-degree head avatars.
The approach demonstrates robustness to imperfect camera poses, outperforming baseline methods when trained on noisy data. |
The training process is computationally expensive, requiring several hours on high-end GPUs.
The method is limited to the facial expressions present in the training data and cannot extrapolate to unseen expressions. |
3d avatar reconstruction, generative adversarial networks, facial expression mapping, novel view synthesis, tracker-free appearance learning |
2311.13620
Report |
The Challenges of Image Generation Models in Generating Multi-Component Images |
Tham Yik Foong, Shashank Kotyan, Po Yuan Mao, Danilo Vasconcellos Vargas |
Recent advances in text-to-image generators have led to substantial
capabilities in image generation. However, the complexity of prompts acts as a
bottleneck in the quality of images generated. A particular under-explored
facet is the ability of generative models to create high-quality images
comprising multiple components given as a prior. In this paper, we propose and
validate a metric called Components Inclusion Score (CIS) to evaluate the
extent to which a model can correctly generate multiple components. Our results
reveal that the evaluated models struggle to incorporate all the visual
elements from prompts with multiple components (8.53% drop in CIS per component
for all evaluated models). We also identify a significant decline in the
quality of the images and context awareness within an image as the number of
components increased (15.91% decrease in Inception Score and 9.62% increase in
Fréchet Inception Distance). To remedy this issue, we fine-tuned Stable
Diffusion V2 on a custom-created test dataset with multiple components,
outperforming its vanilla counterpart. To conclude, these findings reveal a
critical limitation in existing text-to-image generators, shedding light on the
challenge of generating multiple components within a single image using a
complex prompt. |
This paper introduces Components Inclusion Score (CIS), a novel metric to evaluate the ability of text-to-image generators to accurately incorporate multiple components from a prompt into a single image. |
Current text-to-image generators struggle to generate high-quality images with multiple components, limiting their ability to handle complex prompts. This work provides a way to quantify this limitation and analyze the factors contributing to it. |
The paper proposes the CIS metric, which uses the CLIP model to evaluate the presence of each component mentioned in the prompt within the generated image. Additionally, a new dataset, MCID, is created by combining images from ImageNet to train and evaluate models on multi-component image generation. |
Existing image generators show a significant drop in CIS as the number of components in the prompt increases, indicating their difficulty in handling multi-component generation.
The quality of generated images, as measured by IS and FID, also deteriorates with an increase in the number of components.
Fine-tuning Stable Diffusion on MCID leads to improved CIS, emphasizing the importance of data distribution with multi-component images. |
The accuracy of CIS is limited by the capability of the CLIP model used for component identification.
The MCID dataset, while diverse, may not fully represent the complexity of real-world multi-component scenes with natural interactions. |
text-to-image generation, multi-component generation, evaluation metric, clip, stable diffusion |
2311.13617
Report |
Boosting3D: High-Fidelity Image-to-3D by Boosting 2D Diffusion Prior to 3D Prior with Progressive Learning |
Kai Yu, Jinlin Liu, Mengyang Feng, Miaomiao Cui, Xuansong Xie |
We present Boosting3D, a multi-stage single image-to-3D generation method
that can robustly generate reasonable 3D objects in different data domains. The
point of this work is to solve the view consistency problem in single
image-guided 3D generation by modeling a reasonable geometric structure. For
this purpose, we propose to utilize a better 3D prior to train the NeRF. More
specifically, we train an object-level LoRA for the target object using the
original image and the rendering output of the NeRF. We then train the LoRA and
NeRF using a progressive training strategy, so that the LoRA and NeRF boost each
other during training. After the progressive training, the LoRA learns the 3D
information of the generated object and eventually turns into an object-level 3D
prior. In the final stage, we extract the mesh from the trained NeRF and use
the trained LoRA to optimize the structure and appearance of the mesh. The
experiments demonstrate the effectiveness of the proposed method. Boosting3D
learns an object-specific 3D prior that is beyond the ability of pre-trained
diffusion priors and achieves state-of-the-art performance in the single
image-to-3D generation task. |
Introduces Boosting3D, a multi-stage single image-to-3D generation method that robustly generates 3D objects in different data domains by modeling geometric structure. |
Addresses the view consistency problem in single image-guided 3D generation, which struggles with uncommon or asymmetrical objects. |
Uses a three-stage optimization process: coarse NeRF generation, fine NeRF refinement using a progressively trained object-level LoRA, and mesh refinement using the trained LoRA. |
Generates high-quality and stable 3D objects from single images.
Learns object-specific 3D priors beyond pre-trained diffusion models.
Achieves state-of-the-art performance in single image-to-3D generation for both real and synthetic images. |
High computational cost, requiring over an hour of training time.
Future work will focus on optimizing speed using faster 3D representations. |
image-to-3d generation, 3d reconstruction, diffusion models, nerf, lora |
2311.13608
Report |
Breathing Life Into Sketches Using Text-to-Video Priors |
Rinon Gal, Yael Vinker, Yuval Alaluf, Amit H. Bermano, Daniel Cohen-Or, Ariel Shamir, Gal Chechik |
A sketch is one of the most intuitive and versatile tools humans use to
convey their ideas visually. An animated sketch opens another dimension to the
expression of ideas and is widely used by designers for a variety of purposes.
Animating sketches is a laborious process, requiring extensive experience and
professional design skills. In this work, we present a method that
automatically adds motion to a single-subject sketch (hence, "breathing life
into it"), merely by providing a text prompt indicating the desired motion. The
output is a short animation provided in vector representation, which can be
easily edited. Our method does not require extensive training, but instead
leverages the motion prior of a large pretrained text-to-video diffusion model
using a score-distillation loss to guide the placement of strokes. To promote
natural and smooth motion and to better preserve the sketch's appearance, we
model the learned motion through two components. The first governs small local
deformations and the second controls global affine transformations.
Surprisingly, we find that even models that struggle to generate sketch videos
on their own can still serve as a useful backbone for animating abstract
representations. |
This paper introduces a method for automatically animating single-subject sketches using text prompts, leveraging the motion priors of pre-trained text-to-video diffusion models. |
Animating sketches is a laborious task that requires significant artistic expertise. This method simplifies the process, requiring only a static sketch and a text prompt, making animation accessible to a wider audience. |
The method uses a neural network trained with a score-distillation sampling loss to predict displacements for the control points of a vector-based sketch representation. It separates motion into local deformations and global affine transformations to ensure smooth and natural movement while preserving the original sketch's characteristics. |
The method effectively animates sketches across diverse domains and prompts, capturing complex movements like swaying, dancing, and swirling.
It outperforms existing pixel-based image-to-video methods in preserving sketch fidelity and aligning motion with text prompts.
User studies confirm that the method produces animations that are both consistent with the input sketch and aligned with the desired motion. |
The method is currently limited to single-subject sketches and may struggle with complex scenes or sketches with multiple objects.
There is a trade-off between motion quality and preserving the sketch's appearance, requiring careful hyperparameter tuning. |
sketch animation, text-to-video generation, score-distillation sampling, vector graphics, motion priors |
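The two-component motion model in the record above (small local deformations plus a global affine transformation of the stroke control points) can be sketched as follows. The composition order, the restriction to rotation, uniform scale, and translation about the centroid, and the function names are all assumptions for illustration; in the paper a network predicts these quantities per frame under a score-distillation loss, which this sketch omits.

```python
import numpy as np

def global_affine(points, scale=1.0, angle=0.0, translation=(0.0, 0.0)):
    """Rotate and scale 2D points about their centroid, then translate."""
    center = points.mean(axis=0)
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    return (points - center) @ (scale * rot).T + center + np.asarray(translation)

def animate_frame(points, local_offsets, **affine_params):
    """Apply per-point local offsets first, then one global transform (assumed order)."""
    return global_affine(points + local_offsets, **affine_params)

# Control points of a hypothetical square stroke.
points = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
no_offsets = np.zeros_like(points)

frame_identity = animate_frame(points, no_offsets)
frame_shifted = animate_frame(points, no_offsets, translation=(2.0, -1.0))
frame_rotated = animate_frame(points, no_offsets, angle=np.pi / 2)
```

Keeping the global component separate means large motions such as translation or rotation move every control point coherently, while the local offsets stay small and are less likely to distort the sketch's shape.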
2311.13601
Report |
Visual In-Context Prompting |
Feng Li, Qing Jiang, Hao Zhang, Tianhe Ren, Shilong Liu, Xueyan Zou, Huaizhe Xu, Hongyang Li, Chunyuan Li, Jianwei Yang, Lei Zhang, Jianfeng Gao |
In-context prompting in large language models (LLMs) has become a prevalent
approach to improve zero-shot capabilities, but this idea is less explored in
the vision domain. Existing visual prompting methods focus on referring
segmentation to segment the most relevant object, falling short of addressing
many generic vision tasks like open-set segmentation and detection. In this
paper, we introduce a universal visual in-context prompting framework for both
tasks. In particular, we build on top of an encoder-decoder architecture, and
develop a versatile prompt encoder to support a variety of prompts like
strokes, boxes, and points. We further enhance it to take an arbitrary number
of reference image segments as the context. Our extensive explorations show
that the proposed visual in-context prompting elicits extraordinary referring
and generic segmentation capabilities to refer and detect, yielding competitive
performance to close-set in-domain datasets and showing promising results on
many open-set segmentation datasets. By joint training on COCO and SA-1B, our
model achieves $57.7$ PQ on COCO and $23.2$ PQ on ADE20K. Code will be
available at https://github.com/UX-Decoder/DINOv. |
This paper proposes DINOv, a novel visual in-context prompting framework for both referring and generic image segmentation, enabling open-set segmentation using only visual prompts. |
In-context prompting is powerful for LLMs but less explored in vision, particularly for generic tasks like open-set segmentation. Existing visual prompting methods mainly focus on referring segmentation. |
DINOv leverages an encoder-decoder architecture with a prompt encoder to handle various prompts (strokes, boxes, points). It utilizes reference image-prompt pairs to learn visual concepts and adapts prompts to target images. The model is trained jointly on COCO and SA-1B datasets for both referring and generic segmentation. |
DINOv achieves comparable performance to close-set models on in-domain datasets like COCO.
It demonstrates promising generalization ability on open-set benchmarks like ADE20K and SegInW using only visual prompts.
The framework effectively handles video object segmentation in a zero-shot manner by leveraging learned visual prompts from previous frames. |
The model's performance could be further improved by scaling up the semantically labeled data.
Future work can explore incorporating text prompts for enhanced multi-modal understanding. |
visual prompting, in-context learning, open-set segmentation, referring segmentation, video object segmentation |
2311.13600
Report |
ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs |
Viraj Shah, Nataniel Ruiz, Forrester Cole, Erika Lu, Svetlana Lazebnik, Yuanzhen Li, Varun Jampani |
Methods for finetuning generative models for concept-driven personalization
generally achieve strong results for subject-driven or style-driven generation.
Recently, low-rank adaptations (LoRA) have been proposed as a
parameter-efficient way of achieving concept-driven personalization. While
recent work explores the combination of separate LoRAs to achieve joint
generation of learned styles and subjects, existing techniques do not reliably
address the problem; they often compromise either subject fidelity or style
fidelity. We propose ZipLoRA, a method to cheaply and effectively merge
independently trained style and subject LoRAs in order to achieve generation of
any user-provided subject in any user-provided style. Experiments on a wide
range of subject and style combinations show that ZipLoRA can generate
compelling results with meaningful improvements over baselines in subject and
style fidelity while preserving the ability to recontextualize. Project page:
https://ziplora.github.io |
Proposes ZipLoRA, a method for merging independently trained style and subject LoRAs to generate images of any subject in any style using diffusion models. |
Solves the open problem of generating a specific subject in a specific style with diffusion models, enabling greater control and personalization. |
Leverages the sparsity of LoRA updates and minimizes cosine similarity between merged columns to reduce signal interference while preserving individual LoRA capabilities. |
ZipLoRA generates high-quality stylized images superior to direct merging and joint training.
It retains the ability to re-contextualize subjects and control the extent of stylization.
Quantitative user studies and image/text alignment scores demonstrate the effectiveness of ZipLoRA over baselines. |
Relies on the style learning capability of SDXL, which needs further investigation.
Image/text alignment metrics used for evaluation might not perfectly capture stylistic nuances. |
image stylization, diffusion models, lora, personalized image generation, stable diffusion |
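The merging idea summarized in the ZipLoRA entry above can be illustrated with a small NumPy sketch. It assumes each LoRA contributes a weight-update matrix and that the merge scales every column of each update by its own coefficient. The function names, the shapes, and the overlap penalty (the absolute dot product of the two coefficient vectors) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def merge_lora_updates(delta_subject, delta_style, m_subject, m_style):
    """Scale each column of the two updates by its coefficient, then sum.

    delta_*: (rows, cols) weight updates; m_*: (cols,) per-column coefficients.
    """
    return delta_subject * m_subject[None, :] + delta_style * m_style[None, :]

def column_cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between matching columns of a and b (a diagnostic
    for how strongly the two updates point in the same direction)."""
    num = np.sum(a * b, axis=0)
    den = np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0) + eps
    return num / den

def coefficient_overlap(m_subject, m_style):
    """Hypothetical penalty: large when both LoRAs put weight on the same columns."""
    return float(np.abs(np.dot(m_subject, m_style)))
```

In an optimization loop, the coefficients would be the only trainable quantities, with a penalty like `coefficient_overlap` added to losses that keep the merged model close to each original LoRA's behavior.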
2311.13596
Report |
T-Rex: Counting by Visual Prompting |
Qing Jiang, Feng Li, Tianhe Ren, Shilong Liu, Zhaoyang Zeng, Kent Yu, Lei Zhang |
We introduce T-Rex, an interactive object counting model designed to first
detect and then count any objects. We formulate object counting as an open-set
object detection task with the integration of visual prompts. Users can specify
the objects of interest by marking points or boxes on a reference image, and
T-Rex then detects all objects with a similar pattern. Guided by the visual
feedback from T-Rex, users can also interactively refine the counting results
by prompting on missing or falsely-detected objects. T-Rex has achieved
state-of-the-art performance on several class-agnostic counting benchmarks. To
further exploit its potential, we established a new counting benchmark
encompassing diverse scenarios and challenges. Both quantitative and
qualitative results show that T-Rex possesses exceptional zero-shot counting
capabilities. We also present various practical application scenarios for
T-Rex, illustrating its potential in the realm of visual prompting. |
Introduces T-Rex, an interactive object counting model that uses visual prompts (boxes or points) to detect and count objects in an image, achieving state-of-the-art performance on several benchmarks. |
Object counting is important for various fields but existing methods have limitations like unintuitive visualization, closed-set detectors, or reliance on textual descriptions. T-Rex addresses these with visual prompts and interactive refinement. |
T-Rex utilizes an image encoder, prompt encoder, and box decoder. It supports positive-only, positive with negative, and cross-image prompt modes for accurate and user-refined counting. A new benchmark, CA-44, was created to test its capabilities. |
T-Rex outperforms state-of-the-art methods on FSC147 and FSCD-LVIS benchmarks.
It demonstrates superior performance on CA-44, showcasing its zero-shot counting ability across diverse domains.
T-Rex shows higher accuracy than GPT-4V in counting, suggesting its advantage in object perception for this task. |
T-Rex faces challenges in single-target scenes with dense clusters, dense multi-object scenes, and cross-image workflows.
Future work will focus on improving its performance and robustness. |
object counting, visual prompting, interactive model, open-set detection, computer vision |
2311.13570
Report |
WildFusion: Learning 3D-Aware Latent Diffusion Models in View Space |
Katja Schwarz, Seung Wook Kim, Jun Gao, Sanja Fidler, Andreas Geiger, Karsten Kreis |
Modern learning-based approaches to 3D-aware image synthesis achieve high
photorealism and 3D-consistent viewpoint changes for the generated images.
Existing approaches represent instances in a shared canonical space. However,
for in-the-wild datasets a shared canonical system can be difficult to define
or might not even exist. In this work, we instead model instances in view
space, alleviating the need for posed images and learned camera distributions.
We find that in this setting, existing GAN-based methods are prone to
generating flat geometry and struggle with distribution coverage. We hence
propose WildFusion, a new approach to 3D-aware image synthesis based on latent
diffusion models (LDMs). We first train an autoencoder that infers a compressed
latent representation, which additionally captures the images' underlying 3D
structure and enables not only reconstruction but also novel view synthesis. To
learn a faithful 3D representation, we leverage cues from monocular depth
prediction. Then, we train a diffusion model in the 3D-aware latent space,
thereby enabling synthesis of high-quality 3D-consistent image samples,
outperforming recent state-of-the-art GAN-based methods. Importantly, our
3D-aware LDM is trained without any direct supervision from multiview images or
3D geometry and does not require posed images or learned pose or camera
distributions. It directly learns a 3D representation without relying on
canonical camera coordinates. This opens up promising research avenues for
scalable 3D-aware image synthesis and 3D content creation from in-the-wild
image data. See https://katjaschwarz.github.io/wildfusion for videos of our 3D
results. |
Introduces WildFusion, a 3D-aware latent diffusion model for image synthesis that operates in *view space*, eliminating the need for posed images or pre-defined camera distributions. |
Existing 3D-aware generative models struggle with in-the-wild datasets lacking a shared canonical coordinate system and often suffer from limitations like mode collapse in GAN-based approaches. |
Two-stage approach: 1) Trains a 3D-aware autoencoder with adversarial supervision for novel views and incorporates monocular depth cues for improved geometry. 2) Fits a latent diffusion model on the compressed, 3D-aware latent space learned by the autoencoder. |
Outperforms state-of-the-art 3D-aware GANs on unposed image datasets, demonstrating superior image quality, geometry, and diversity.
Achieves high-quality novel view synthesis directly from single images, surpassing GAN-based methods requiring inversion.
Demonstrates promising applications in 3D-aware image manipulation, including semantic interpolation and generative resampling. |
While modeling in *view space* is advantageous, achieving sharp 3D geometry remains challenging.
Current implementation is limited to a predefined range of viewpoints and cannot generate full 360° views. |
3d-aware image synthesis, latent diffusion models, view space, unposed images, novel view synthesis |
2311.13535
Report |
DiffusionMat: Alpha Matting as Sequential Refinement Learning |
Yangyang Xu, Shengfeng He, Wenqi Shao, Kwan-Yee K. Wong, Yu Qiao, Ping Luo |
In this paper, we introduce DiffusionMat, a novel image matting framework
that employs a diffusion model for the transition from coarse to refined alpha
mattes. Diverging from conventional methods that utilize trimaps merely as
loose guidance for alpha matte prediction, our approach treats image matting as
a sequential refinement learning process. This process begins with the addition
of noise to trimaps and iteratively denoises them using a pre-trained diffusion
model, which incrementally guides the prediction towards a clean alpha matte.
The key innovation of our framework is a correction module that adjusts the
output at each denoising step, ensuring that the final result is consistent
with the input image's structures. We also introduce the Alpha Reliability
Propagation, a novel technique designed to maximize the utility of available
guidance by selectively enhancing the trimap regions with confident alpha
information, thus simplifying the correction task. To train the correction
module, we devise specialized loss functions that target the accuracy of the
alpha matte's edges and the consistency of its opaque and transparent regions.
We evaluate our model across several image matting benchmarks, and the results
indicate that DiffusionMat consistently outperforms existing methods. Project
page at https://cnnlstm.github.io/DiffusionMat |
Presents DiffusionMat, a novel image matting framework that uses a diffusion model to refine alpha mattes from coarse to refined, treating image matting as a sequential refinement learning process. |
Overcomes limitations of conventional methods that treat trimaps as static guidance, instead leveraging the iterative feedback of diffusion models to enhance the matting of unknown regions. |
Trains a diffusion model on alpha mattes, injects noise into the input trimap, then iteratively denoises it. Employs a correction module at each step to ensure consistency with the input image. Introduces Alpha Reliability Propagation to focus on refining ambiguous regions. |
Achieves state-of-the-art performance on portrait matting benchmarks (P3M-10K, Human-2K) and general image matting (Composition-1k).
Exhibits robustness against inaccurate trimaps by leveraging the generative prior learned from extensive alpha matte datasets.
Produces perceptually favorable alpha mattes with finer details compared to conventional methods. |
Computational efficiency is lower compared to single-pass methods.
Future work includes exploring more efficient diffusion models to improve speed. |
image matting, diffusion models, sequential refinement learning, alpha matte prediction, trimap guidance |
2311.13443
Report |
Guided Flows for Generative Modeling and Decision Making |
Qinqing Zheng, Matt Le, Neta Shaul, Yaron Lipman, Aditya Grover, Ricky T. Q. Chen |
Classifier-free guidance is a key component for enhancing the performance of
conditional generative models across diverse tasks. While it has previously
demonstrated remarkable improvements in sample quality, it has been employed
exclusively for diffusion models. In this paper, we integrate
classifier-free guidance into Flow Matching (FM) models, an alternative
simulation-free approach that trains Continuous Normalizing Flows (CNFs) based
on regressing vector fields. We explore the usage of Guided Flows for a
variety of downstream applications. We show that Guided Flows significantly
improves the sample quality in conditional image generation and zero-shot
text-to-speech synthesis, boasting state-of-the-art performance. Notably, we
are the first to apply flow models for plan generation in the offline
reinforcement learning setting, showcasing a 10x speedup in computation
compared to diffusion models while maintaining comparable performance. |
This paper integrates classifier-free guidance into Flow Matching (FM) models, enhancing their performance in conditional generation tasks. |
This integration is crucial as it allows FM models, a computationally efficient alternative to diffusion models, to leverage conditional information more effectively, leading to significant improvements in sample quality for various downstream applications. |
The authors introduce 'Guided Flows', an adaptation of classifier-free guidance for FM models. They modify velocity vector fields by combining unconditional and conditional velocity fields, weighted by a guidance parameter. |
Guided Flows significantly enhance sample quality in conditional image generation and zero-shot text-to-speech synthesis, achieving state-of-the-art performance.
This paper demonstrates the first successful application of flow models for return-conditioned plan generation in offline reinforcement learning, achieving comparable performance to diffusion models but with a 10x speedup in computation.
Guided Flows outperform unguided flows in generating coherent state sequences for locomotion tasks, highlighting the importance of guidance for planning. |
The theoretical justification for Guided Flows relies on an assumption that may not hold perfectly in practice, suggesting a need for further investigation.
While replanning at every timestep ensures planning accuracy, exploring heuristics to reuse previously generated plans could further improve computational efficiency. |
flow matching, classifier-free guidance, generative modeling, offline reinforcement learning, plan generation |
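The guidance rule described in the Guided Flows entry above, combining unconditional and conditional velocity fields with a guidance weight, can be sketched as follows. This is a minimal illustration that assumes the common classifier-free form `(1 - w) * v_uncond + w * v_cond` and a plain Euler integrator. The function names and the callable signatures are hypothetical, and the velocity fields would be trained networks in practice.

```python
import numpy as np

def guided_velocity(v_uncond, v_cond, x, t, cond, w):
    """Blend the two velocity fields; w = 1 recovers the conditional field,
    w > 1 extrapolates away from the unconditional one."""
    return (1.0 - w) * v_uncond(x, t) + w * v_cond(x, t, cond)

def euler_sample(v_uncond, v_cond, x0, cond, w, num_steps=100):
    """Integrate dx/dt = guided velocity from t = 0 to t = 1 with Euler steps."""
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * guided_velocity(v_uncond, v_cond, x, t, cond, w)
    return x
```

Because sampling only needs a handful of ODE steps rather than a long stochastic denoising chain, a sampler of this shape is where a speedup over diffusion-based planners would come from.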
2311.13435
Report |
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models |
Shehan Munasinghe, Rusiru Thushara, Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, Mubarak Shah, Fahad Khan |
Extending image-based Large Multimodal Models (LMMs) to videos is challenging
due to the inherent complexity of video data. The recent approaches extending
image-based LMMs to videos either lack the grounding capabilities (e.g.,
VideoChat, Video-ChatGPT, Video-LLaMA) or do not utilize the audio-signals for
better video understanding (e.g., Video-ChatGPT). Addressing these gaps, we
propose PG-Video-LLaVA, the first LMM with pixel-level grounding capability,
integrating audio cues by transcribing them into text to enrich video-context
understanding. Our framework uses an off-the-shelf tracker and a novel
grounding module, enabling it to spatially localize objects in videos following
user instructions. We evaluate PG-Video-LLaVA using video-based generative and
question-answering benchmarks and introduce new benchmarks specifically
designed to measure prompt-based object grounding performance in videos.
Further, we propose the use of Vicuna over GPT-3.5, as utilized in
Video-ChatGPT, for video-based conversation benchmarking, ensuring
reproducibility of results which is a concern with the proprietary nature of
GPT-3.5. Our framework builds on SoTA image-based LLaVA model and extends its
advantages to the video domain, delivering promising gains on video-based
conversation and grounding tasks. Project Page:
https://github.com/mbzuai-oryx/Video-LLaVA |
The paper introduces PG-Video-LLaVA, the first video-based Large Multimodal Model (LMM) capable of pixel-level grounding, integrating audio cues to enhance video understanding. |
Extending image-based LMMs to videos is challenging due to the complexity of video data. Existing video LMMs lack either grounding capabilities or the ability to utilize audio signals effectively. |
PG-Video-LLaVA leverages a CLIP-based visual encoder, audio transcription, and a novel grounding module for object localization. It's trained on a large video instruction dataset and evaluated on video-based generative and question-answering benchmarks. |
PG-Video-LLaVA outperforms existing video-based conversational models like Video-ChatGPT and Video-LLaMA in ungrounded dialogues.
The model effectively localizes objects in videos based on user instructions, demonstrating superior spatial grounding capabilities.
Incorporating audio transcripts significantly enhances the model's understanding of video content, leading to improved accuracy in tasks like question answering. |
The spatial grounding module's reliance on scene segmentation and object tracking can introduce errors in complex scenarios.
Further research is needed to explore more sophisticated methods for integrating audio and visual information for a deeper understanding of video content. |
large multimodal models, video understanding, visual grounding, audio-visual integration, video question answering |
2311.13398
Report |
Depth-Regularized Optimization for 3D Gaussian Splatting in Few-Shot Images |
Jaeyoung Chung, Jeongtaek Oh, Kyoung Mu Lee |
In this paper, we present a method to optimize Gaussian splatting with a
limited number of images while avoiding overfitting. Representing a 3D scene by
combining numerous Gaussian splats has yielded outstanding visual quality.
However, it tends to overfit the training views when only a small number of
images are available. To address this issue, we introduce a dense depth map as
a geometry guide to mitigate overfitting. We obtain the depth map using a
pre-trained monocular depth estimation model and align its scale and offset
using sparse COLMAP feature points. The adjusted depth aids in the color-based
optimization of 3D Gaussian splatting, mitigating floating artifacts, and
ensuring adherence to geometric constraints. We verify the proposed method on
the NeRF-LLFF dataset with varying small numbers of images. Our approach
demonstrates robust geometry compared to the original method that relies solely
on images. Project page: robot0321.github.io/DepthRegGS |
This paper proposes a novel method to optimize 3D Gaussian Splatting using a limited number of images by leveraging depth information to prevent overfitting. |
Reconstructing 3D scenes from a few images is crucial for practical applications but challenging due to limited geometric information, leading to overfitting in existing methods like 3D Gaussian Splatting. |
The method utilizes a pre-trained monocular depth estimation model to obtain dense depth maps, adjusts their scale and offset using sparse COLMAP feature points, and integrates the adjusted depth as a geometry guide during the 3D Gaussian splatting optimization process. Additionally, an early stopping strategy and a smoothness constraint on the depth map further enhance the optimization stability. |
The proposed depth-guided optimization significantly improves the performance of 3D Gaussian Splatting in few-shot scenarios, achieving better visual quality and geometric accuracy compared to the original method.
The approach successfully mitigates overfitting issues and generates plausible 3D reconstructions even with a limited number of input images.
Ablation studies confirm the effectiveness of each component, including depth adjustment, depth loss, smoothness constraint, and early stopping strategy. |
The performance heavily relies on the accuracy of the pre-trained monocular depth estimation model and its generalization ability to different scenes and domains.
Reliance on COLMAP points for depth adjustment limits the applicability to scenes where COLMAP might struggle, such as textureless regions. Future work includes exploring alternative depth regularization methods. |
3d gaussian splatting, few-shot learning, depth estimation, 3d reconstruction, novel view synthesis |
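The scale-and-offset adjustment described in the entry above can be sketched as an ordinary least-squares fit. The sketch assumes a linear relation between the monocular depth and the sparse depths at the feature-point pixels. The function names and argument layout are hypothetical, and the paper's exact fitting procedure may differ.

```python
import numpy as np

def fit_scale_offset(mono_depth_at_points, sparse_depth):
    """Solve min over (s, b) of sum (s * d_mono + b - d_sparse)^2."""
    A = np.stack([mono_depth_at_points, np.ones_like(mono_depth_at_points)], axis=1)
    (scale, offset), *_ = np.linalg.lstsq(A, sparse_depth, rcond=None)
    return scale, offset

def align_depth_map(mono_depth, pixel_rows, pixel_cols, sparse_depth):
    """Apply the fitted scale and offset to the whole dense depth map.

    pixel_rows / pixel_cols index the pixels where sparse depths are known.
    """
    scale, offset = fit_scale_offset(mono_depth[pixel_rows, pixel_cols], sparse_depth)
    return scale * mono_depth + offset
```

The aligned map can then serve as the target of a depth loss alongside the usual color loss.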
2311.13384
Report |
LucidDreamer: Domain-free Generation of 3D Gaussian Splatting Scenes |
Jaeyoung Chung, Suyoung Lee, Hyeongjin Nam, Jaerin Lee, Kyoung Mu Lee |
With the widespread usage of VR devices and contents, demands for 3D scene
generation techniques become more popular. Existing 3D scene generation models,
however, limit the target scene to a specific domain, primarily due to their
training strategies using 3D scan datasets that are far from the real world. To
address such limitation, we propose LucidDreamer, a domain-free scene
generation pipeline by fully leveraging the power of existing large-scale
diffusion-based generative model. Our LucidDreamer has two alternate steps:
Dreaming and Alignment. First, to generate multi-view consistent images from
inputs, we set the point cloud as a geometrical guideline for each image
generation. Specifically, we project a portion of point cloud to the desired
view and provide the projection as a guidance for inpainting using the
generative model. The inpainted images are lifted to 3D space with estimated
depth maps, composing new points. Second, to aggregate the new points into
the 3D scene, we propose an aligning algorithm which harmoniously integrates
the portions of newly generated 3D scenes. The finally obtained 3D scene serves
as initial points for optimizing Gaussian splats. LucidDreamer produces
Gaussian splats that are highly-detailed compared to the previous 3D scene
generation methods, with no constraint on domain of the target scene. Project
page: https://luciddreamer-cvlab.github.io/ |
Introduces LucidDreamer, a domain-free 3D scene generation pipeline that leverages Stable Diffusion, depth estimation, and 3D Gaussian splatting to create diverse, high-quality scenes from various inputs (text, RGB, RGBD). |
Existing 3D scene generation methods are limited to specific domains due to training on restricted 3D scan datasets. LucidDreamer overcomes this by leveraging the power of pre-trained, large-scale image generation models for diverse and high-quality results. |
1. **Point Cloud Construction:** Starting from an initial image/depth map, LucidDreamer iteratively expands the point cloud. It projects the existing points to a new camera view, inpaints the missing regions using Stable Diffusion, estimates depth, and lifts the inpainted pixels to 3D, aligning them for consistency.
2. **Gaussian Splat Optimization:** The final point cloud initializes a 3D Gaussian Splatting model, further optimized using reprojected images for a continuous, high-fidelity 3D scene representation. |
Generates high-quality, multi-view consistent 3D scenes from various input domains (realistic, anime, lego) and formats (text, RGB, RGBD).
Outperforms existing methods like RGBD2 in terms of visual quality, resolution, and domain generalization.
Ablation studies validate the importance of point cloud initialization and masked training for Gaussian Splat optimization. |
Reliance on multiple off-the-shelf models (Stable Diffusion, depth estimation) could lead to accumulated errors.
Future work could explore more efficient point cloud aggregation and alignment strategies for larger-scale scenes. |
3d scene generation, diffusion models, gaussian splatting, multi-view consistency, domain generalization |
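The "lift the inpainted pixels to 3D" step in the LucidDreamer entry above can be sketched with a standard pinhole back-projection. The sketch assumes a known 3x3 intrinsic matrix `K`, a per-pixel depth map, and an optional 4x4 camera-to-world transform. The function name and conventions are illustrative, and the alignment of new points with the existing cloud is not shown.

```python
import numpy as np

def lift_pixels_to_3d(depth, K, cam_to_world=None):
    """Back-project every pixel of a depth map into a 3D point cloud.

    depth: (H, W) depths along the camera z axis; K: 3x3 pinhole intrinsics.
    Returns an (H*W, 3) array in row-major pixel order.
    """
    h, w = depth.shape
    cols, rows = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([cols, rows, np.ones_like(cols)], axis=-1)
    pixels = pixels.reshape(-1, 3).astype(float)
    rays = pixels @ np.linalg.inv(K).T           # directions with z = 1
    points_cam = rays * depth.reshape(-1, 1)     # scale each ray by its depth
    if cam_to_world is None:
        return points_cam
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return points_cam @ R.T + t
```

Projecting the accumulated cloud into a new view with the inverse of this mapping gives the partial image and mask that the inpainting model would complete.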
2311.13231
Report |
Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model |
Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Qimai Li, Weihan Shen, Xiaolong Zhu, Xiu Li |
Using reinforcement learning with human feedback (RLHF) has shown significant
promise in fine-tuning diffusion models. Previous methods start by training a
reward model that aligns with human preferences, then leverage RL techniques to
fine-tune the underlying models. However, crafting an efficient reward model
demands extensive datasets, optimal architecture, and manual hyperparameter
tuning, making the process both time and cost-intensive. The direct preference
optimization (DPO) method, effective in fine-tuning large language models,
eliminates the necessity for a reward model. However, the extensive GPU memory
requirement of the diffusion model's denoising process hinders the direct
application of the DPO method. To address this issue, we introduce the Direct
Preference for Denoising Diffusion Policy Optimization (D3PO) method to
directly fine-tune diffusion models. The theoretical analysis demonstrates that
although D3PO omits training a reward model, it effectively functions as the
optimal reward model trained using human feedback data to guide the learning
process. This approach requires no training of a reward model, proving to be
more direct, cost-effective, and minimizing computational overhead. In
experiments, our method uses the relative scale of objectives as a proxy for
human preference, delivering comparable results to methods using ground-truth
rewards. Moreover, D3PO demonstrates the ability to reduce image distortion
rates and generate safer images, overcoming challenges lacking robust reward
models. Our code is publicly available at https://github.com/yk7333/D3PO. |
Introduces D3PO, a method for directly fine-tuning diffusion models using human feedback without needing a separate reward model. |
Existing RLHF methods for fine-tuning diffusion models require resource-intensive reward model training, making them inefficient and costly. D3PO aims to address this by directly incorporating human preferences. |
D3PO reinterprets the diffusion model's denoising process as a multi-step MDP. By extending the DPO theory to this MDP framework, D3PO directly updates the diffusion model's policy based on human feedback at each denoising step. |
Achieves comparable performance to reward-model-based methods on quantitative objectives like image compressibility and aesthetic quality.
Successfully reduces image distortions in generated hands and anime characters.
Demonstrates the ability to enhance image safety and improve prompt-image alignment based on human feedback. |
Assumes that all state-action pairs within a preferred segment are better than those in a less preferred segment.
Relies on human evaluation, which can be subjective and difficult to scale. |
diffusion models, reinforcement learning, human feedback, direct preference optimization, image generation |
2311.13073
Report |
FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline |
Vladimir Arkhipkin, Zein Shaheen, Viacheslav Vasilev, Elizaveta Dakhova, Andrey Kuznetsov, Denis Dimitrov |
Multimedia generation approaches occupy a prominent place in artificial
intelligence research. Text-to-image models achieved high-quality results over
the last few years. However, video synthesis methods have only recently started
to develop. This paper presents a new two-stage latent diffusion text-to-video
generation architecture based on the text-to-image diffusion model. The first
stage concerns keyframes synthesis to figure the storyline of a video, while
the second one is devoted to interpolation frames generation to make movements
of the scene and objects smooth. We compare several temporal conditioning
approaches for keyframes generation. The results show the advantage of using
separate temporal blocks over temporal layers in terms of metrics reflecting
video generation quality aspects and human preference. The design of our
interpolation model significantly reduces computational costs compared to other
masked frame interpolation approaches. Furthermore, we evaluate different
configurations of MoVQ-based video decoding scheme to improve consistency and
achieve higher PSNR, SSIM, MSE, and LPIPS scores. Finally, we compare our
pipeline with existing solutions and achieve top-2 scores overall and top-1
among open-source solutions: CLIPSIM = 0.2976 and FVD = 433.054. Project page:
https://ai-forever.github.io/kandinsky-video/ |
This paper presents FusionFrames, a novel two-stage latent diffusion model for text-to-video generation, focusing on improving video quality, consistency, and smoothness. |
Video generation models are computationally expensive and require large datasets. This paper addresses these challenges by leveraging pre-trained text-to-image models and introducing efficient architectural designs. |
The model uses a pre-trained T2I model for keyframe generation and introduces separate temporal blocks for enhanced temporal consistency. A novel interpolation model generates intermediate frames, and a MoVQ-GAN based decoder with architectural variations is used for improved decoding. |
Separate temporal blocks outperform traditional mixed spatial-temporal layers in video quality metrics and human evaluation.
The proposed interpolation architecture generates higher-quality interpolated frames with faster inference compared to masked frame interpolation.
A MoVQ-GAN decoder with 3D convolutions and temporal attention yields the best decoding quality. |
Ambiguities in calculating metrics like FVD and IS make comparison with other studies challenging.
Lack of open-source solutions for latent space interpolation limits direct comparison with existing interpolation methods. |
text-to-video generation, latent diffusion models, video frame interpolation, temporal consistency, movq-gan |
2311.12981
Report |
SD-NAE: Generating Natural Adversarial Examples with Stable Diffusion |
Yueqian Lin, Jingyang Zhang, Yiran Chen, Hai Li |
Natural Adversarial Examples (NAEs), images arising naturally from the
environment and capable of deceiving classifiers, are instrumental in robustly
evaluating and identifying vulnerabilities in trained models. In this work,
unlike prior works that passively collect NAEs from real images, we propose to
actively synthesize NAEs using the state-of-the-art Stable Diffusion.
Specifically, our method formulates a controlled optimization process, where we
perturb the token embedding that corresponds to a specified class to generate
NAEs. This generation process is guided by the gradient of loss from the target
classifier, ensuring that the created image closely mimics the ground-truth
class yet fools the classifier. Named SD-NAE (Stable Diffusion for Natural
Adversarial Examples), our innovative method is effective in producing valid
and useful NAEs, which is demonstrated through a meticulously designed
experiment. Code is available at https://github.com/linyueqian/SD-NAE. |
This paper introduces SD-NAE, a novel method for actively synthesizing Natural Adversarial Examples (NAEs) using Stable Diffusion by optimizing the class token embedding in the condition embedding space. |
Robustly evaluating deep image classifiers is challenging, and NAEs are instrumental in identifying model vulnerabilities. Unlike prior passive NAE collection methods, SD-NAE offers greater flexibility and control over generating specific challenging examples. |
SD-NAE optimizes the class-related token embedding of a pre-trained Stable Diffusion model, guided by the loss gradient of a target image classifier. This process aims to induce misclassification while maintaining the image's ground-truth semantic meaning. |
SD-NAE achieves a 43.5% fooling rate against a ResNet-50 ImageNet classifier, demonstrating its effectiveness in generating NAEs.
The generated NAEs exhibit variations in color, background, view angle, and style, highlighting SD-NAE's potential for evaluating model generalization.
Compared to a GAN-based NAE generation method, SD-NAE shows superior performance in both fooling rate and the quality of generated images. |
SD-NAE can be computationally expensive, especially with a large number of optimization steps.
The generated images might sometimes exhibit unnatural appearances, inheriting limitations from the underlying Stable Diffusion model. |
natural adversarial examples, stable diffusion, robustness evaluation, image classification, adversarial machine learning |
2311.12908
Report |
Diffusion Model Alignment Using Direct Preference Optimization |
Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, Nikhil Naik |
Large language models (LLMs) are fine-tuned using human comparison data with
Reinforcement Learning from Human Feedback (RLHF) methods to make them better
aligned with users' preferences. In contrast to LLMs, human preference learning
has not been widely explored in text-to-image diffusion models; the best
existing approach is to fine-tune a pretrained model using carefully curated
high quality images and captions to improve visual appeal and text alignment.
We propose Diffusion-DPO, a method to align diffusion models to human
preferences by directly optimizing on human comparison data. Diffusion-DPO is
adapted from the recently developed Direct Preference Optimization (DPO), a
simpler alternative to RLHF which directly optimizes a policy that best
satisfies human preferences under a classification objective. We re-formulate
DPO to account for a diffusion model notion of likelihood, utilizing the
evidence lower bound to derive a differentiable objective. Using the Pick-a-Pic
dataset of 851K crowdsourced pairwise preferences, we fine-tune the base model
of the state-of-the-art Stable Diffusion XL (SDXL)-1.0 model with
Diffusion-DPO. Our fine-tuned base model significantly outperforms both base
SDXL-1.0 and the larger SDXL-1.0 model consisting of an additional refinement
model in human evaluation, improving visual appeal and prompt alignment. We
also develop a variant that uses AI feedback and has comparable performance to
training on human preferences, opening the door for scaling of diffusion model
alignment methods. |
The paper introduces Diffusion-DPO, a novel method for aligning text-to-image diffusion models with human preferences by directly optimizing the model on pairwise comparison data. |
Current text-to-image diffusion models lack a robust alignment stage with human preferences, limiting their ability to generate images that truly reflect user desires. |
The authors adapt the Direct Preference Optimization (DPO) method from LLMs to diffusion models, utilizing the evidence lower bound to derive a differentiable objective function. They fine-tune Stable Diffusion XL (SDXL)-1.0 model using Diffusion-DPO on the Pick-a-Pic dataset, a large dataset of crowdsourced pairwise preferences. |
Diffusion-DPO significantly outperforms the baseline SDXL and the larger SDXL-(base + refinement) model in human evaluation, achieving a 69% preference rate on the PartiPrompts dataset.
Diffusion-DPO-tuned SDXL generates images with superior visual appeal, better prompt alignment, and finer details compared to the baseline models.
The method can also effectively learn from AI feedback, demonstrating promising results for scaling diffusion model alignment using pretrained scoring networks. |
Potential biases in human preference data could be reflected in the trained model.
The current method is an offline algorithm; online learning variants require further research. |
text-to-image generation, diffusion models, human preference learning, dpo, stable diffusion xl |
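The classification objective that DPO optimizes can be sketched in a few lines of Python. This is the original language-model form of DPO for a single preference pair, not the paper's diffusion objective (which, per the abstract, substitutes an evidence-lower-bound estimate for the likelihoods). The function name `dpo_loss` and the default `beta` are illustrative assumptions.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair (preferred sample w, rejected sample l).

    logp_*     : log-likelihood under the model being fine-tuned
    ref_logp_* : log-likelihood under the frozen reference model
    beta       : regularization strength (assumed value, not from the paper)
    """
    # Implicit reward margin: how much more the fine-tuned model favors the
    # preferred sample over the rejected one, relative to the reference model.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Binary classification loss -log(sigmoid(margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the fine-tuned model and the reference agree, the margin is zero and the loss equals log 2; it falls as the fine-tuned model shifts probability toward the preferred sample.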
2311.12897
Report |
An Efficient 3D Gaussian Representation for Monocular/Multi-view Dynamic Scenes |
Kai Katsumata, Duc Minh Vo, Hideki Nakayama |
In novel view synthesis of scenes from multiple input views, 3D Gaussian
splatting emerges as a viable alternative to existing radiance field
approaches, delivering great visual quality and real-time rendering. While
successful in static scenes, the present advancement of 3D Gaussian
representation, however, faces challenges in dynamic scenes in terms of memory
consumption and the need for numerous observations per time step, due to the
onus of storing 3D Gaussian parameters per time step. In this study, we present
an efficient 3D Gaussian representation tailored for dynamic scenes in which we
define positions and rotations as functions of time while leaving other
time-invariant properties of the static 3D Gaussian unchanged. Notably, our
representation reduces memory usage, which is consistent regardless of the
input sequence length. Additionally, it mitigates the risk of overfitting
observed frames by accounting for temporal changes. The optimization of our
Gaussian representation based on image and flow reconstruction results in a
powerful framework for dynamic scene view synthesis in both monocular and
multi-view cases. We obtain the highest rendering speed of $118$ frames per
second (FPS) at a resolution of $1352 \times 1014$ with a single GPU, showing
the practical usability and effectiveness of our proposed method in dynamic
scene rendering scenarios. |
This paper proposes an efficient dynamic 3D Gaussian representation for real-time novel view synthesis of dynamic scenes from monocular or multi-view videos. |
Existing methods for dynamic scene novel view synthesis either suffer from slow rendering speed (neural radiance fields) or high memory consumption in dynamic scenes (3D Gaussian splatting). |
The method represents 3D Gaussian parameters (position, rotation) as a function of time, allowing for compact representation of dynamic motion. It optimizes the Gaussian parameters by minimizing the reconstruction loss between rendered and target images, and further enhances temporal consistency using optical flow supervision. |
Achieves competitive visual quality with state-of-the-art neural rendering methods on D-NeRF, DyNeRF, and HyperNeRF datasets.
Significantly faster rendering speed than previous high-quality methods, achieving real-time performance even at high resolutions.
Demonstrates lower memory consumption compared to methods storing parameters per timestamp, especially beneficial for long sequences. |
The method assumes each Gaussian exists throughout the entire sequence, limiting its ability to model topological changes like fluid motion.
The explicit representation sacrifices continuity and smoothness of neural rendering, leading to artifacts with inaccurate camera poses and lower generalization performance. |
novel view synthesis, dynamic scenes, 3d gaussian splatting, real-time rendering, optical flow |
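The idea of storing motion as a function of time, instead of storing parameters per time step, can be illustrated with a small Python sketch. The polynomial basis, the class name `DynamicGaussian`, and the field layout are assumptions made for illustration; the summary above does not specify the basis functions, and rotation is omitted for brevity.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class DynamicGaussian:
    # position_coeffs[k] is the 3D coefficient of t**k (polynomial basis assumed).
    position_coeffs: List[Vec3]
    # Time-invariant properties, shared by all frames.
    scale: Vec3
    color: Vec3
    opacity: float

    def position(self, t: float) -> Vec3:
        """Evaluate the Gaussian center at time t."""
        x = y = z = 0.0
        for k, (cx, cy, cz) in enumerate(self.position_coeffs):
            basis = t ** k
            x += cx * basis
            y += cy * basis
            z += cz * basis
        return (x, y, z)

    def num_parameters(self) -> int:
        """Stored floats: depends on the number of coefficients, not on frame count."""
        return 3 * len(self.position_coeffs) + 3 + 3 + 1
```

Because only the coefficients are stored, the parameter count is the same whether the sequence has ten frames or ten thousand, which matches the memory claim in the abstract.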
2311.12891
Report |
Text-Guided Texturing by Synchronized Multi-View Diffusion |
Yuxin Liu, Minshan Xie, Hanyuan Liu, Tien-Tsin Wong |
This paper introduces a novel approach to synthesize texture to dress up a
given 3D object, given a text prompt. Based on the pretrained text-to-image
(T2I) diffusion model, existing methods usually employ a project-and-inpaint
approach, in which a view of the given object is first generated and warped to
another view for inpainting. But it tends to generate inconsistent texture due
to the asynchronous diffusion of multiple views. We believe such asynchronous
diffusion and insufficient information sharing among views are the root causes
of the inconsistent artifact. In this paper, we propose a synchronized
multi-view diffusion approach that allows the diffusion processes from
different views to reach a consensus of the generated content early in the
process, and hence ensures the texture consistency. To synchronize the
diffusion, we share the denoised content among different views in each
denoising step, specifically blending the latent content in the texture domain
from views with overlap. Our method demonstrates superior performance in
generating consistent, seamless, highly detailed textures, compared to
state-of-the-art methods. |
This paper introduces Synchronized Multi-View Diffusion (MVD), a novel approach to generate consistent, seamless, and highly detailed textures on 3D objects from text prompts, leveraging pre-trained text-to-image diffusion models. |
Existing project-and-inpaint methods for text-guided 3D object texturing suffer from inconsistencies and artifacts due to the asynchronous nature of diffusion across multiple views. |
MVD synchronizes the diffusion process across multiple views by sharing denoised latent information in overlapping texture regions during each denoising step, enabling early consensus on texture structure and color distribution. It also leverages self-attention reuse for enhanced consistency. |
MVD generates consistent and seamless textures, effectively addressing the limitations of existing approaches.
The method produces highly detailed textures, preserving fine-grained features.
Quantitative evaluation demonstrates superior performance compared to state-of-the-art methods, achieving the best FID score. |
The method inherits the pre-trained model's bias towards common viewpoints, making it challenging to generate textures for less common views.
Depth discontinuities can lead to imperfect boundaries in denoised views, potentially causing color bleeding during texture extraction. Future work could explore optimization-based extraction methods with perceptual losses or boundary masking. |
texture synthesis, text-guided synthesis, 3d object texturing, diffusion models, multi-view consistency |
2311.12847
Report |
CopyScope: Model-level Copyright Infringement Quantification in the Diffusion Workflow |
Junlei Zhou, Jiashi Gao, Ziwei Wang, Xuetao Wei |
Web-based AI image generation has become an innovative art form that can
generate novel artworks with the rapid development of the diffusion model.
However, this new technique brings potential copyright infringement risks as it
may incorporate the existing artworks without the owners' consent. Copyright
infringement quantification is the primary and challenging step towards
AI-generated image copyright traceability. Previous work only focused on data
attribution from the training data perspective, which is unsuitable for tracing
and quantifying copyright infringement in practice because of the following
reasons: (1) the training datasets are not always publicly available; (2) the
model provider is the responsible party, not the image. Motivated by this, in
this paper, we propose CopyScope, a new framework to quantify the infringement
of AI-generated images from the model level. We first rigorously identify
pivotal components within the AI image generation pipeline. Then, we propose to
take advantage of Fréchet Inception Distance (FID) to effectively capture the
image similarity that fits human perception naturally. We further propose the
FID-based Shapley algorithm to evaluate the infringement contribution among
models. Extensive experiments demonstrate that our work not only reveals the
intricacies of infringement quantification but also effectively depicts the
infringing models quantitatively, thus promoting accountability in AI
image-generation tasks. |
Proposes CopyScope, a framework to quantify copyright infringement in AI-generated images at the model level, addressing limitations of data attribution methods. |
AI image generation tools raise copyright concerns as they may infringe on existing artworks, necessitating model-level infringement quantification for accountability. |
Identifies key infringement components in the diffusion workflow, uses Fréchet Inception Distance (FID) to measure image similarity, and employs a FID-based Shapley algorithm to evaluate model contributions. |
FID effectively captures image similarity aligning with human perception.
FID-Shapley algorithm accurately quantifies infringement contributions of different models in the diffusion workflow.
CopyScope provides a promising solution for copyright traceability and promotes legal AI-generated content use. |
Current work focuses on a single image, Mona Lisa, for evaluation.
Future work will explore extending CopyScope to broader image datasets and real-world infringement scenarios. |
copyright infringement, ai image generation, diffusion models, accountability, fid |
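The Shapley attribution step can be sketched as follows. This is a generic exact Shapley computation, not the paper's FID-based algorithm: here the value function is a toy lookup, whereas in a CopyScope-style setup it would presumably be derived from an FID-based similarity between the reference artwork and images generated with a given subset of models. The component names and scores are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values; `value` maps a frozenset of players to a score."""
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Weight of coalitions of this size that exclude player p.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of p when joining coalition s.
                total += weight * (value(s | {p}) - value(s))
        result[p] = total
    return result

# Hypothetical workflow components and toy scores, for illustration only.
toy_scores = {"base_model": 1.0, "lora": 2.0, "controlnet": 3.0}

def toy_value(coalition):
    # Additive toy value; a real value function would not be additive.
    return sum(toy_scores[c] for c in coalition)

contributions = shapley_values(list(toy_scores), toy_value)
```

The exact computation enumerates every coalition, so it is only practical for a handful of components; the contributions always sum to the value of the full coalition minus the value of the empty one.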
2311.12793
Report |
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions |
Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, Dahua Lin |
In the realm of large multi-modal models (LMMs), efficient modality alignment
is crucial yet often constrained by the scarcity of high-quality image-text
data. To address this bottleneck, we introduce the ShareGPT4V dataset, a
pioneering large-scale resource featuring 1.2 million highly descriptive
captions, which surpasses existing datasets in diversity and information
content, covering world knowledge, object properties, spatial relationships,
and aesthetic evaluations. Specifically, ShareGPT4V originates from a curated
100K high-quality captions collected from advanced GPT4-Vision and has been
expanded to 1.2M with a superb caption model trained on this subset. ShareGPT4V
first demonstrates its effectiveness for the Supervised Fine-Tuning (SFT)
phase, by substituting an equivalent quantity of detailed captions in existing
SFT datasets with a subset of our high-quality captions, significantly
enhancing the LMMs like LLaVA-7B, LLaVA-1.5-13B, and Qwen-VL-Chat-7B on the MME
and MMBench benchmarks, with respective gains of 222.8/22.0/22.3 and
2.7/1.3/1.5. We further incorporate ShareGPT4V data into both the pre-training
and SFT phases, obtaining ShareGPT4V-7B, a superior LMM based on a simple
architecture that has remarkable performance across a majority of the
multi-modal benchmarks. This project is available at
https://ShareGPT4V.github.io to serve as a pivotal resource for advancing the
LMMs community. |
This paper introduces ShareGPT4V, a large-scale dataset of 1.2 million highly descriptive captions, seeded with 100K captions collected from GPT4-Vision and expanded by a caption model trained on that subset, designed for improving multi-modal model training. |
Existing image-text datasets often rely on simplistic captions, hindering the ability of Large Multi-Modal Models (LMMs) to effectively align visual and textual information. ShareGPT4V addresses this by providing high-quality captions rich in details, knowledge, and relationships, enabling better modality alignment. |
The authors first collected 100K images from diverse sources and used carefully crafted prompts to generate detailed descriptions using GPT4-Vision. This data was then used to train a general caption model, Share-Captioner. Finally, they used Share-Captioner to generate captions for 1.2M images, creating the ShareGPT4V-PT dataset. They also developed ShareGPT4V-7B, an LMM trained using the dataset. |
Replacing existing SFT captions with those from ShareGPT4V significantly improves LMM performance across various benchmarks.
ShareGPT4V-7B, a 7B parameter model, outperforms many state-of-the-art LMMs with larger sizes and training datasets on 11 multi-modal benchmarks.
Ablation studies confirm the importance of high-quality captions and fine-tuning strategies for pre-training and fine-tuning LMMs. |
The ShareGPT4V-PT dataset currently uses images from existing public datasets; exploring new image sources could further enhance diversity.
While ShareGPT4V-7B achieves impressive results, future work can investigate scaling the model size and exploring alternative architectures. |
multi-modal learning, image captioning, large language models, dataset, vision-language |
2311.12775
Report |
SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering |
Antoine Guédon, Vincent Lepetit |
We propose a method to allow precise and extremely fast mesh extraction from
3D Gaussian Splatting. Gaussian Splatting has recently become very popular as
it yields realistic rendering while being significantly faster to train than
NeRFs. It is however challenging to extract a mesh from the millions of tiny 3D
gaussians as these gaussians tend to be unorganized after optimization and no
method has been proposed so far. Our first key contribution is a regularization
term that encourages the gaussians to align well with the surface of the scene.
We then introduce a method that exploits this alignment to extract a mesh from
the Gaussians using Poisson reconstruction, which is fast, scalable, and
preserves details, in contrast to the Marching Cubes algorithm usually applied
to extract meshes from Neural SDFs. Finally, we introduce an optional
refinement strategy that binds gaussians to the surface of the mesh, and
jointly optimizes these Gaussians and the mesh through Gaussian splatting
rendering. This enables easy editing, sculpting, rigging, animating,
compositing and relighting of the Gaussians using traditional software by
manipulating the mesh instead of the gaussians themselves. Retrieving such an
editable mesh for realistic rendering is done within minutes with our method,
compared to hours with the state-of-the-art methods on neural SDFs, while
providing a better rendering quality. Our project page is the following:
https://anttwo.github.io/sugar/ |
This paper introduces SuGaR, a method for fast and accurate mesh extraction from 3D Gaussian Splatting representations. |
Mesh-based scene representations are valuable for editing, sculpting, animation, and relighting in Computer Graphics, but extracting meshes from the unstructured point clouds of Gaussian Splatting has been challenging. |
SuGaR first encourages alignment of Gaussian Splats with the scene surface through a novel regularization term during optimization. Then, it efficiently samples points on a level set of the Gaussian density function and uses Poisson reconstruction to generate a mesh. Optionally, it refines the mesh and binds new Gaussians to it for improved rendering. |
SuGaR extracts detailed meshes from Gaussian Splatting representations within minutes on a single GPU.
The method outperforms state-of-the-art mesh-based Novel View Synthesis techniques in terms of rendering quality.
Binding refined Gaussians to the mesh allows for high-quality rendering and facilitates intuitive scene manipulation using traditional mesh editing tools. |
The rendering quality of SuGaR, while exceeding other mesh-based methods, is not yet on par with the best NeRF models or vanilla Gaussian Splatting in all cases.
Future work could explore more sophisticated methods for distinguishing foreground and background points during mesh extraction. |
gaussian splatting, mesh extraction, novel view synthesis, 3d scene representation, computer graphics |
2311.12631
Report |
GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning |
Jiaxi Lv, Yi Huang, Mingfu Yan, Jiancheng Huang, Jianzhuang Liu, Yifan Liu, Yafei Wen, Xiaoxin Chen, Shifeng Chen |
Recent advances in text-to-video generation have harnessed the power of
diffusion models to create visually compelling content conditioned on text
prompts. However, they usually encounter high computational costs and often
struggle to produce videos with coherent physical motions. To tackle these
issues, we propose GPT4Motion, a training-free framework that leverages the
planning capability of large language models such as GPT, the physical
simulation strength of Blender, and the excellent image generation ability of
text-to-image diffusion models to enhance the quality of video synthesis.
Specifically, GPT4Motion employs GPT-4 to generate a Blender script based on a
user textual prompt, which commands Blender's built-in physics engine to craft
fundamental scene components that encapsulate coherent physical motions across
frames. Then these components are inputted into Stable Diffusion to generate a
video aligned with the textual prompt. Experimental results on three basic
physical motion scenarios, including rigid object drop and collision, cloth
draping and swinging, and liquid flow, demonstrate that GPT4Motion can generate
high-quality videos efficiently in maintaining motion coherency and entity
consistency. GPT4Motion offers new insights in text-to-video research,
enhancing its quality and broadening its horizon for further explorations. |
Presents GPT4Motion, a training-free text-to-video generation framework that leverages GPT-4's planning ability to drive Blender simulations and guide Stable Diffusion for generating physically coherent videos. |
Current text-to-video models struggle to generate videos with coherent physical motions due to the lack of physical understanding. This work introduces a novel approach using LLMs and a physics engine to address the challenge. |
GPT-4 generates Blender scripts from user prompts to simulate physical scenarios, producing edge and depth maps as conditions for Stable Diffusion. Cross-frame attention in SDXL enhances temporal consistency. |
GPT4Motion accurately controls physical properties like gravity, wind strength, and viscosity.
Outperforms baselines in generating realistic physical motions with better motion smoothness and less flickering.
User study confirms superior performance in physical accuracy, text-video alignment, and flicker reduction. |
Extending to more complex motions requiring refined LLM instructions is a challenge.
Occasional flickering in generated videos needs further investigation. |
text-to-video generation, physical simulation, large language models, blender, stable diffusion |
2311.12490
Report |
Hyb-NeRF: A Multiresolution Hybrid Encoding for Neural Radiance Fields |
Yifan Wang, Yi Gong, Yuan Zeng |
Recent advances in Neural radiance fields (NeRF) have enabled high-fidelity
scene reconstruction for novel view synthesis. However, NeRF requires hundreds
of network evaluations per pixel to approximate a volume rendering integral,
making it slow to train. Caching NeRFs into explicit data structures can
effectively enhance rendering speed but at the cost of higher memory usage. To
address these issues, we present Hyb-NeRF, a novel neural radiance field with a
multi-resolution hybrid encoding that achieves efficient neural modeling and
fast rendering, which also allows for high-quality novel view synthesis. The
key idea of Hyb-NeRF is to represent the scene using different encoding
strategies from coarse-to-fine resolution levels. Hyb-NeRF exploits
memory-efficient learnable positional features at coarse resolutions and the
fast optimization speed and local details of hash-based feature grids at fine
resolutions. In addition, to further boost performance, we embed cone
tracing-based features in our learnable positional encoding that eliminates
encoding ambiguity and reduces aliasing artifacts. Extensive experiments on
both synthetic and real-world datasets show that Hyb-NeRF achieves faster
rendering speed with better rendering quality and even a lower memory footprint
in comparison to previous state-of-the-art methods. |
Presents Hyb-NeRF, a novel neural radiance field representation using multi-resolution hybrid encoding for memory-efficient and high-quality scene representation and fast rendering. |
Addresses limitations of existing NeRF methods that are either slow to train or memory-intensive, aiming to achieve both fast and high-quality novel view synthesis. |
Combines learnable positional features at coarse resolution levels with hash-based feature grids at fine resolution levels. Integrates cone tracing-based features in the positional encoding to reduce aliasing and improve accuracy. Employs shallow MLPs for efficient processing. |
Achieves faster rendering speed and better rendering quality compared to previous state-of-the-art methods.
Demonstrates significantly lower memory footprint than previous voxel-based methods.
Successfully reconstructs high-quality radiance fields in 9 minutes with the smallest model achieving better rendering quality in 4 minutes. |
Limited exploration of higher resolution levels due to memory constraints.
Further investigation into the application of hybrid encoding in dynamic scene representation. |
neural radiance fields, novel view synthesis, multi-resolution encoding, hybrid encoding, learnable positional encoding |
2311.12386
Report |
Point, Segment and Count: A Generalized Framework for Object Counting |
Zhizhong Huang, Mingliang Dai, Yi Zhang, Junping Zhang, Hongming Shan |
Class-agnostic object counting aims to count all objects in an image with
respect to example boxes or class names, a.k.a. few-shot and zero-shot
counting. In this paper, we propose a generalized framework for both few-shot
and zero-shot object counting based on detection. Our framework combines the
superior advantages of two foundation models without compromising their
zero-shot capability: (i) SAM to segment all possible objects as mask
proposals, and (ii) CLIP to classify proposals to obtain accurate
object counts. However, this strategy meets the obstacles of efficiency
overhead and the small crowded objects that cannot be localized and
distinguished. To address these issues, our framework, termed PseCo, follows
three steps: point, segment, and count. Specifically, we first propose a
class-agnostic object localization to provide accurate but minimal point prompts
for SAM, which consequently not only reduces computation costs but also avoids
missing small objects. Furthermore, we propose a generalized object
classification that leverages CLIP image/text embeddings as the classifier,
following a hierarchical knowledge distillation to obtain discriminative
classifications among hierarchical mask proposals. Extensive experimental
results on FSC-147, COCO, and LVIS demonstrate that PseCo achieves
state-of-the-art performance in both few-shot/zero-shot object
counting/detection. Code: https://github.com/Hzzone/PseCo |
This paper proposes a novel generalized framework named PseCo for few-shot and zero-shot object counting and detection by leveraging the strengths of SAM and CLIP. |
Existing class-agnostic object counting methods often rely on density maps, lacking interpretability and struggling with small object detection. This paper addresses these limitations by combining the power of SAM for segmentation and CLIP for classification. |
PseCo employs a three-step approach: 1) Class-agnostic object localization using a point decoder to identify potential object locations, 2) Segmentation with SAM using the identified points as prompts, 3) Classification of the segmented proposals using CLIP image/text embeddings with a hierarchical knowledge distillation strategy. |
PseCo achieves state-of-the-art performance on few-shot and zero-shot object counting on the FSC-147 dataset, outperforming both density-based and detection-based methods.
The method demonstrates superior performance on object detection tasks compared to baselines, achieving significant improvements on FSC-147 and FSCD-LVIS datasets.
Evaluation on large-scale datasets like COCO and LVIS shows that PseCo achieves substantial performance gains over existing open-vocabulary object detection methods like Detic. |
The method's reliance on the inference of SAM's mask decoder introduces computational overhead compared to traditional object detection methods.
PseCo may face challenges in extremely crowded scenes or with inaccurate example images/text prompts, as highlighted in the failure cases. |
object counting, object detection, few-shot learning, zero-shot learning, sam, clip |
2311.12342
Report |
LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis |
Peiang Zhao, Han Li, Ruiyang Jin, S. Kevin Zhou |
Recent text-to-image diffusion models have reached an unprecedented level in
generating high-quality images. However, their exclusive reliance on textual
prompts often falls short in precise control of image compositions. In this
paper, we propose LoCo, a training-free approach for layout-to-image synthesis
that excels in producing high-quality images aligned with both textual prompts
and layout instructions. Specifically, we introduce a Localized Attention
Constraint (LAC), leveraging semantic affinity between pixels in self-attention
maps to create precise representations of desired objects and effectively
ensure the accurate placement of objects in designated regions. We further
propose a Padding Token Constraint (PTC) to leverage the semantic information
embedded in previously neglected padding tokens, improving the consistency
between object appearance and layout instructions. LoCo seamlessly integrates
into existing text-to-image and layout-to-image models, enhancing their
performance in spatial control and addressing semantic failures observed in
prior methods. Extensive experiments showcase the superiority of our approach,
surpassing existing state-of-the-art training-free layout-to-image methods both
qualitatively and quantitatively across multiple benchmarks. |
This paper introduces LoCo, a training-free approach for layout-to-image synthesis that leverages localized attention constraints and padding token information to generate high-quality images adhering to both textual and layout conditions. |
Existing text-to-image synthesis methods struggle with precise control of image compositions, making it challenging to accurately place objects in desired locations. LoCo addresses this limitation by providing accurate spatial control without requiring model training. |
LoCo utilizes two novel constraints: (1) Localized Attention Constraint (LAC) enhances cross-attention maps using self-attention to achieve precise object representation and alignment with layout instructions. (2) Padding Token Constraint (PTC) leverages semantic information in padding tokens to enhance consistency between object appearance and layout. |
LoCo outperforms state-of-the-art training-free layout-to-image methods on standard benchmarks (HRS-Bench, DrawBench) in terms of spatial accuracy and image quality.
LoCo effectively handles both bounding box and semantic mask layout instructions, demonstrating its versatility.
The method can be seamlessly integrated into fully-supervised layout-to-image models (e.g., GLIGEN) as a plug-and-play booster, enhancing their performance. |
The performance of LoCo depends on the choice of hyperparameters, requiring careful tuning for optimal results.
The current implementation primarily focuses on single-image generation. Exploring extensions to video or sequential image synthesis is a potential area for future work. |
image synthesis, diffusion models, layout-to-image synthesis, spatial control, attention mechanisms |
2311.12193
Report |
Disentangling Structure and Appearance in ViT Feature Space |
Narek Tumanyan, Omer Bar-Tal, Shir Amir, Shai Bagon, Tali Dekel |
We present a method for semantically transferring the visual appearance of
one natural image to another. Specifically, our goal is to generate an image in
which objects in a source structure image are "painted" with the visual
appearance of their semantically related objects in a target appearance image.
To integrate semantic information into our framework, our key idea is to
leverage a pre-trained and fixed Vision Transformer (ViT) model. Specifically,
we derive novel disentangled representations of structure and appearance
extracted from deep ViT features. We then establish an objective function that
splices the desired structure and appearance representations, interweaving them
together in the space of ViT features. Based on our objective function, we
propose two frameworks of semantic appearance transfer -- "Splice", which works
by training a generator on a single and arbitrary pair of structure-appearance
images, and "SpliceNet", a feed-forward real-time appearance transfer model
trained on a dataset of images from a specific domain. Our frameworks do not
involve adversarial training, nor do they require any additional input
information such as semantic segmentation or correspondences. We demonstrate
high-resolution results on a variety of in-the-wild image pairs, under
significant variations in the number of objects, pose, and appearance. Code and
supplementary material are available on our project page: splice-vit.github.io. |
This paper introduces Splice and SpliceNet, novel methods for semantically transferring the visual appearance of one image to another by leveraging disentangled representations of structure and appearance from pre-trained DINO-ViT features. |
Semantic appearance transfer enables generating an image where objects in a source image are “painted” with the appearance of semantically similar objects in a target image, facilitating realistic and meaningful visual transformations. |
The methods use disentangled structure representations (self-similarity of keys in the deepest attention module) and appearance representations (global [CLS] token) extracted from DINO-ViT. Splice trains a generator on a single input image pair, while SpliceNet uses a feed-forward model trained on a domain-specific dataset. |
Splice achieves high-quality semantic appearance transfer on diverse in-the-wild image pairs, outperforming baselines in user studies.
SpliceNet enables real-time semantic appearance transfer within a specific domain, demonstrating superior performance compared to GAN-based methods.
Both Splice and SpliceNet demonstrate the effectiveness of leveraging pre-trained ViT features for encoding and manipulating structure and appearance information. |
The performance of Splice and SpliceNet depends on the quality of semantic representations learned by DINO-ViT.
SpliceNet is limited to a specific domain due to its domain-specific training dataset, while Splice requires training from scratch for each image pair. |
style transfer, vision transformers, appearance transfer, feature disentanglement, semantic image editing |
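The structure representation described above, a self-similarity of attention keys, can be sketched as a pairwise cosine-similarity matrix. This is a plain-Python illustration under stated assumptions: `keys` stands in for per-patch key vectors that would come from the deepest attention module of DINO-ViT, and the function name is hypothetical.

```python
import math

def cosine_self_similarity(keys):
    """Pairwise cosine similarity between per-patch key vectors.

    keys: list of equal-length, non-zero feature vectors (one per image patch).
    Returns an n x n matrix where entry [i][j] compares patch i with patch j.
    """
    norms = [math.sqrt(sum(v * v for v in k)) for k in keys]
    n = len(keys)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(keys[i], keys[j]))
            sim[i][j] = dot / (norms[i] * norms[j])
    return sim
```

The matrix records how patches relate to one another instead of their absolute feature values, which is one way to read why such a descriptor can carry layout while leaving appearance to a separate global token.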
2311.12174
Report |
LABELMAKER: Automatic Semantic Label Generation from RGB-D Trajectories |
Silvan Weder, Hermann Blum, Francis Engelmann, Marc Pollefeys |
Semantic annotations are indispensable to train or evaluate perception
models, yet very costly to acquire. This work introduces a fully automated
2D/3D labeling framework that, without any human intervention, can generate
labels for RGB-D scans at equal (or better) level of accuracy than comparable
manually annotated datasets such as ScanNet. Our approach is based on an
ensemble of state-of-the-art segmentation models and 3D lifting through neural
rendering. We demonstrate the effectiveness of our LabelMaker pipeline by
generating significantly better labels for the ScanNet datasets and
automatically labelling the previously unlabeled ARKitScenes dataset. Code and
models are available at https://labelmaker.org |
This work presents LabelMaker, a fully automated 2D/3D labeling framework that leverages an ensemble of state-of-the-art segmentation models and 3D lifting through neural rendering to generate accurate labels for RGB-D scans without human intervention. |
Semantic annotations are crucial for training and evaluating perception models, but acquiring them is expensive and time-consuming. LabelMaker addresses this challenge by generating high-quality labels at scale without human effort, which facilitates the development and evaluation of perception models and could unlock large unlabeled datasets such as ARKitScenes. |
LabelMaker employs an ensemble of 2D and 3D segmentation models (InternImage, OVSeg, CMX, Mask3D), projects their predictions into a common label space, and applies a consensus voting mechanism to obtain 2D labels for each frame. These 2D predictions are then lifted into 3D using a neural radiance field, which helps to improve consistency and detail, enabling the generation of both 2D and 3D semantic labels. |
LabelMaker generates labels on par with or better than human-annotated datasets like ScanNet, as demonstrated by evaluations on ScanNet and Replica datasets.
The method outperforms existing baselines, including human-annotated ScanNet labels and labels refined with SemanticNeRF, in both 2D and 3D semantic segmentation metrics.
LabelMaker can be used to automatically label large unlabeled datasets, such as ARKitScenes, paving the way for utilizing these datasets in training and evaluating 3D perception models. |
LabelMaker is currently limited to a fixed set of classes, which could be addressed by incorporating language embeddings for more flexibility and ambiguity resolution.
The 3D lifting component relies on SDFStudio, which has many hyperparameters, and further optimization could potentially improve results. |
semantic segmentation, 3d labeling, neural rendering, rgb-d, scannet |
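The consensus voting step in the LabelMaker record above can be illustrated with a small sketch. It assumes every model's prediction has already been mapped into the common label space and uses a plain per-pixel majority vote; the function name `consensus_vote`, the `min_votes` threshold, and the `ignore_label` convention are hypothetical and are not taken from the paper.

```python
import numpy as np

def consensus_vote(predictions, num_classes, min_votes=2, ignore_label=-1):
    """Per-pixel majority vote over label maps that share one label space.

    predictions: list of (H, W) integer arrays, one per model.
    Pixels whose winning class receives fewer than `min_votes` votes are
    set to `ignore_label` (an assumed convention, not from the paper).
    """
    stacked = np.stack(predictions, axis=0)  # (M, H, W)
    # votes[c, y, x] = number of models that predicted class c at (y, x)
    votes = np.stack(
        [(stacked == c).sum(axis=0) for c in range(num_classes)], axis=0
    )
    winner = votes.argmax(axis=0)
    winner_votes = votes.max(axis=0)
    return np.where(winner_votes >= min_votes, winner, ignore_label)

# Three toy 2x2 predictions standing in for three segmentation models.
preds = [
    np.array([[0, 1], [2, 2]]),
    np.array([[0, 1], [1, 2]]),
    np.array([[0, 2], [0, 2]]),
]
labels = consensus_vote(preds, num_classes=3)
```

In this toy case the bottom-left pixel receives three different votes, so it is left unlabeled rather than assigned an arbitrary class.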
2311.12079
Report |
FreeKD: Knowledge Distillation via Semantic Frequency Prompt |
Yuan Zhang, Tao Huang, Jiaming Liu, Tao Jiang, Kuan Cheng, Shanghang Zhang |
Knowledge distillation (KD) has been applied to various tasks successfully,
and mainstream methods typically boost the student model via spatial imitation
losses. However, the consecutive downsamplings induced in the spatial domain of
the teacher model are a type of corruption, hindering the student from analyzing
what specific information needs to be imitated, which results in accuracy
degradation. To better understand the underlying pattern of corrupted feature
maps, we shift our attention to the frequency domain. During frequency
distillation, we encounter a new challenge: the low-frequency bands convey
general but minimal context, while the high-frequency bands are more informative but also
introduce noise. Not every pixel within the frequency bands contributes equally
to the performance. To address the above problem: (1) We propose the Frequency
Prompt plugged into the teacher model, absorbing the semantic frequency context
during finetuning. (2) During the distillation period, a pixel-wise frequency
mask is generated via Frequency Prompt, to localize the pixels of interest
(PoIs) in various frequency bands. Additionally, we employ a position-aware
relational frequency loss for dense prediction tasks, delivering a high-order
spatial enhancement to the student model. We dub our Frequency Knowledge
Distillation method as FreeKD, which determines the optimal localization and
extent for the frequency distillation. Extensive experiments demonstrate that
FreeKD not only outperforms spatial-based distillation methods consistently on
dense prediction tasks (e.g., FreeKD brings 3.8 AP gains for RepPoints-R50 on
COCO2017 and 4.55 mIoU gains for PSPNet-R18 on Cityscapes), but also conveys
more robustness to the student. Notably, we also validate the generalization of
our approach on large-scale vision models (e.g., DINO and SAM). |
This paper introduces FreeKD, a novel knowledge distillation method that operates in the frequency domain for dense prediction tasks. |
Existing spatial-based distillation methods suffer from downsampling corruption, hindering students from effectively learning valuable information. FreeKD addresses this by distilling knowledge from frequency bands. |
FreeKD utilizes Discrete Wavelet Transform for frequency band decomposition. It incorporates a semantic Frequency Prompt to identify crucial pixels of interest in frequency bands and a position-aware relational frequency loss for improved spatial understanding. |
FreeKD consistently outperforms state-of-the-art spatial distillation methods on object detection (COCO) and semantic segmentation (Cityscapes).
Students trained with FreeKD exhibit enhanced robustness and domain generalization capabilities, validated on COCO-C.
The method's efficacy extends to large-scale vision models like DINO and SAM, demonstrating its generality. |
The study primarily focuses on dense prediction tasks; exploring other vision tasks could be beneficial.
Future work could investigate different interaction methods between Frequency Prompt and frequency bands. |
knowledge distillation, frequency domain, dense prediction, frequency prompt, robustness |
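The FreeKD record above relies on a Discrete Wavelet Transform to split feature maps into frequency bands. As a rough illustration only, the sketch below implements a single-level 2D Haar transform in NumPy; the choice of the Haar wavelet, the function names, and the sub-band names are assumptions (naming conventions for the detail bands vary between libraries), and this is not the authors' implementation.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar DWT of an (H, W) array with even H and W.

    Returns one low-frequency band and three detail bands, each (H/2, W/2).
    """
    a = x[0::2, 0::2]  # top-left sample of every 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    low = (a + b + c + d) / 2.0        # local average (low frequency)
    col_diff = (a - b + c - d) / 2.0   # change across columns
    row_diff = (a + b - c - d) / 2.0   # change across rows
    diag = (a - b - c + d) / 2.0       # diagonal detail
    return low, col_diff, row_diff, diag

def haar_idwt2(low, col_diff, row_diff, diag):
    """Inverse of haar_dwt2; reassembles the original (H, W) array."""
    h, w = low.shape
    x = np.empty((2 * h, 2 * w), dtype=np.float64)
    x[0::2, 0::2] = (low + col_diff + row_diff + diag) / 2.0
    x[0::2, 1::2] = (low - col_diff + row_diff - diag) / 2.0
    x[1::2, 0::2] = (low + col_diff - row_diff - diag) / 2.0
    x[1::2, 1::2] = (low - col_diff - row_diff + diag) / 2.0
    return x

feature_map = np.arange(16, dtype=np.float64).reshape(4, 4)
bands = haar_dwt2(feature_map)
restored = haar_idwt2(*bands)
```

Because the transform is invertible, a distillation loss could in principle be applied to any subset of bands without losing the ability to reason about the original feature map.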
2311.12075
Report |
BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning |
Siyuan Liang, Mingli Zhu, Aishan Liu, Baoyuan Wu, Xiaochun Cao, Ee-Chien Chang |
Studying backdoor attacks is valuable for model copyright protection and
enhancing defenses. While existing backdoor attacks have successfully infected
multimodal contrastive learning models such as CLIP, they can be easily
countered by specialized backdoor defenses for MCL models. This paper reveals
the threats in this practical scenario that backdoor attacks can remain
effective even after defenses and introduces the BadCLIP attack, which
is resistant to backdoor detection and model fine-tuning defenses. To achieve
this, we draw motivations from the perspective of the Bayesian rule and propose
a dual-embedding guided framework for backdoor attacks. Specifically, we ensure
that visual trigger patterns approximate the textual target semantics in the
embedding space, making it challenging to detect the subtle parameter
variations induced by backdoor learning on such natural trigger patterns.
Additionally, we optimize the visual trigger patterns to align the poisoned
samples with target vision features in order to hinder the backdoor unlearning
through clean fine-tuning. Extensive experiments demonstrate that our attack
significantly outperforms state-of-the-art baselines (+45.3% ASR) in the
presence of SoTA backdoor defenses, rendering these mitigation and detection
strategies virtually ineffective. Furthermore, our approach effectively attacks
some more rigorous scenarios like downstream tasks. We believe that this paper
raises awareness regarding the potential threats associated with the practical
application of multimodal contrastive learning and encourages the development
of more robust defense mechanisms. |
This paper introduces BadCLIP, a novel backdoor attack framework for Multimodal Contrastive Learning (MCL) models like CLIP, which demonstrates resistance against existing backdoor detection and mitigation techniques. |
The research highlights a significant threat in the practical application of MCL: even with defense mechanisms like backdoor detection and fine-tuning, backdoor attacks can remain effective, potentially compromising the reliability of pre-trained MCL models. |
Inspired by the Bayesian rule, the authors propose a dual-embedding guided framework. This framework optimizes visual trigger patterns to achieve two key goals: 1) minimizing parameter deviations from the clean model to evade detection and 2) aligning poisoned samples with target vision features to resist unlearning during clean fine-tuning. |
BadCLIP significantly outperforms state-of-the-art backdoor attacks by +45.3% ASR against fine-tuning defenses.
The attack successfully evades detection by DECREE, achieving a high $\mathcal{PL}^1$-norm score (0.082), indicating the difficulty in detecting the implanted backdoor.
BadCLIP maintains high effectiveness (87.21% ASR) even when defenders fine-tune the poisoned model with clean data from a different domain. |
The paper primarily focuses on image classification tasks and acknowledges the need to investigate backdoor attacks on more complex tasks built upon MCL.
The authors highlight the need for developing more robust and advanced backdoor detection and mitigation methods specifically designed for MCL models to counter the threats posed by attacks like BadCLIP. |
backdoor attack, multimodal contrastive learning, clip, backdoor detection, fine-tuning defense |
2311.12066
Report |
EditShield: Protecting Unauthorized Image Editing by Instruction-guided Diffusion Models |
Ruoxi Chen, Haibo Jin, Jinyin Chen, Lichao Sun |
Text-to-image diffusion models have emerged as an evolutionary step for producing
creative content in image synthesis. Based on the impressive generation
abilities of these models, instruction-guided diffusion models can edit images
with simple instructions and input images. While they empower users to obtain
their desired edited images with ease, they have raised concerns about
unauthorized image manipulation. Prior research has delved into the
unauthorized use of personalized diffusion models; however, this problem of
instruction-guided diffusion models remains largely unexplored. In this paper,
we first propose a protection method EditShield against unauthorized
modifications from such models. Specifically, EditShield works by adding
imperceptible perturbations that can shift the latent representation used in
the diffusion process, forcing models to generate unrealistic images with
mismatched subjects. Our extensive experiments demonstrate EditShield's
effectiveness across synthetic and real-world datasets. Besides, EditShield also
maintains robustness against various editing types and synonymous instruction
phrases. |
This paper proposes EditShield, a method to protect images from unauthorized editing using instruction-guided diffusion models. |
Instruction-guided diffusion models, while powerful for image editing, pose risks of unauthorized manipulation and misuse, necessitating protective measures. |
EditShield crafts imperceptible perturbations that disrupt the latent representation of images, leading to unrealistic outputs after editing. |
EditShield effectively protects against unauthorized editing, as demonstrated by quantitative metrics and qualitative results.
The method exhibits robustness against various editing types and synonymous instruction phrases.
EditShield remains partially effective even against potential countermeasures like spatial smoothing and JPEG compression. |
The effectiveness of EditShield might be reduced by specific countermeasures designed to mitigate the added perturbations.
Future work includes exploring more sophisticated countermeasures and defenses for a stronger protection mechanism. |
image protection, diffusion models, image editing, unauthorized manipulation, adversarial perturbations |
2311.12063
Report |
DatasetNeRF: Efficient 3D-aware Data Factory with Generative Radiance Fields |
Yu Chi, Fangneng Zhan, Sibo Wu, Christian Theobalt, Adam Kortylewski |
Progress in 3D computer vision tasks demands a huge amount of data, yet
annotating multi-view images with 3D-consistent annotations, or point clouds
with part segmentation is both time-consuming and challenging. This paper
introduces DatasetNeRF, a novel approach capable of generating infinite,
high-quality 3D-consistent 2D annotations alongside 3D point cloud
segmentations, while utilizing minimal 2D human-labeled annotations.
Specifically, we leverage the strong semantic prior within a 3D generative
model to train a semantic decoder, requiring only a handful of fine-grained
labeled samples. Once trained, the decoder efficiently generalizes across the
latent space, enabling the generation of infinite data. The generated data is
applicable across various computer vision tasks, including video segmentation
and 3D point cloud segmentation. Our approach not only surpasses baseline
models in segmentation quality, achieving superior 3D consistency and
segmentation precision on individual images, but also demonstrates versatility
by being applicable to both articulated and non-articulated generative models.
Furthermore, we explore applications stemming from our approach, such as
3D-aware semantic editing and 3D inversion. |
DatasetNeRF is a novel framework that leverages pre-trained 3D GANs to generate infinite, high-quality, 3D-consistent 2D annotations and 3D point cloud segmentations, requiring minimal 2D human-labeled annotations. |
Annotating multi-view images or point clouds with 3D-consistent labels is time-consuming and challenging, hindering progress in 3D computer vision tasks that require large amounts of data. |
The method trains a semantic segmentation branch on a pre-trained 3D GAN, enhancing the feature tri-plane for semantic volumetric rendering. A depth prior from the 3D GAN backbone ensures 3D consistency and enables back-projection of 2D segmentations to 3D point cloud segmentations. |
DatasetNeRF surpasses baseline models in segmentation quality, achieving superior 3D consistency and segmentation precision.
The framework demonstrates versatility by being applicable to both articulated and non-articulated generative models.
DatasetNeRF enables applications such as 3D-aware semantic editing and 3D inversion. |
The performance improvement plateaus with increasing training samples beyond a certain point.
The current method focuses on generating annotations for specific object categories and could be extended to broader and more complex scenes. |
3d computer vision, semantic segmentation, point cloud segmentation, generative adversarial networks, dataset generation |
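The DatasetNeRF record above mentions back-projecting 2D segmentations to 3D point cloud segmentations using depth. A minimal sketch of that idea follows, assuming a pinhole camera with known intrinsics and a depth map aligned with the label map; the function name and the intrinsics `fx`, `fy`, `cx`, `cy` are illustrative assumptions, and the paper's own pipeline may differ.

```python
import numpy as np

def backproject_labels(depth, labels, fx, fy, cx, cy):
    """Lift a per-pixel label map to a labeled point cloud (camera frame).

    depth:  (H, W) array of depths along the optical axis; 0 marks invalid pixels.
    labels: (H, W) integer array of per-pixel classes.
    Returns (N, 3) points and (N,) labels for the valid pixels, in row-major order.
    """
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx  # pinhole model: u = fx * x / z + cx
    y = (v[valid] - cy) * z / fy  # pinhole model: v = fy * y / z + cy
    points = np.stack([x, y, z], axis=1)
    return points, labels[valid]

depth = np.array([[2.0, 2.0], [0.0, 2.0]])
labels = np.array([[0, 1], [1, 1]])
points, point_labels = backproject_labels(depth, labels, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

Each 3D point simply inherits the label of the pixel it came from, so the quality of the point cloud segmentation depends directly on the depth and the 2D labels.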
2311.12024
Report |
PF-LRM: Pose-Free Large Reconstruction Model for Joint Pose and Shape Prediction |
Peng Wang, Hao Tan, Sai Bi, Yinghao Xu, Fujun Luan, Kalyan Sunkavalli, Wenping Wang, Zexiang Xu, Kai Zhang |
We propose a Pose-Free Large Reconstruction Model (PF-LRM) for reconstructing
a 3D object from a few unposed images even with little visual overlap, while
simultaneously estimating the relative camera poses in ~1.3 seconds on a single
A100 GPU. PF-LRM is a highly scalable method utilizing the self-attention
blocks to exchange information between 3D object tokens and 2D image tokens; we
predict a coarse point cloud for each view, and then use a differentiable
Perspective-n-Point (PnP) solver to obtain camera poses. When trained on a huge
amount of multi-view posed data of ~1M objects, PF-LRM shows strong
cross-dataset generalization ability, and outperforms baseline methods by a
large margin in terms of pose prediction accuracy and 3D reconstruction quality
on various unseen evaluation datasets. We also demonstrate our model's
applicability in downstream text/image-to-3D tasks with fast feed-forward
inference. Our project website is at: https://totoro97.github.io/pf-lrm . |
This paper proposes PF-LRM, a method for reconstructing a 3D object from a few unposed images while simultaneously estimating relative camera poses. |
Many real-world scenarios involve sparse image capture with little overlap, making traditional Structure-from-Motion methods unreliable. PF-LRM addresses this by jointly learning camera poses and 3D shapes. |
The method utilizes a single-stream transformer model processing image and 3D object tokens. It predicts a coarse point cloud for each view, enabling camera pose estimation via a differentiable Perspective-n-Point solver. |
PF-LRM achieves state-of-the-art pose estimation accuracy on unseen datasets like OmniObject3D, GSO, and ABO.
It demonstrates strong cross-dataset generalization ability, outperforming baselines in novel view synthesis quality.
The method has potential applications in downstream tasks like text/image-to-3D generation. |
Limitations include ignoring background information for pose estimation and not modeling view-dependent effects.
Future work involves incorporating background cues, handling view-dependent appearance, and increasing reconstruction resolution. |
3d reconstruction, pose estimation, transformer, nerf, sparse views |
2311.11700
Report |
GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting |
Chi Yan, Delin Qu, Dan Xu, Bin Zhao, Zhigang Wang, Dong Wang, Xuelong Li |
In this paper, we introduce GS-SLAM, the first method to utilize a 3D Gaussian
representation in a Simultaneous Localization and Mapping (SLAM) system. It
facilitates a better balance between efficiency and accuracy. Compared to
recent SLAM methods employing neural implicit representations, our method
utilizes a real-time differentiable splatting rendering pipeline that offers
significant speedup to map optimization and RGB-D rendering. Specifically, we
propose an adaptive expansion strategy that adds new or deletes noisy 3D
Gaussians in order to efficiently reconstruct newly observed scene geometry and
improve the mapping of previously observed areas. This strategy is essential to
extend 3D Gaussian representation to reconstruct the whole scene rather than
synthesize a static object in existing methods. Moreover, in the pose tracking
process, an effective coarse-to-fine technique is designed to select reliable
3D Gaussian representations to optimize camera pose, resulting in runtime
reduction and robust estimation. Our method achieves competitive performance
compared with existing state-of-the-art real-time methods on the Replica and
TUM-RGBD datasets. Project page: https://gs-slam.github.io/. |
This paper introduces GS-SLAM, a novel dense visual SLAM method that leverages 3D Gaussian Splatting for efficient and accurate scene reconstruction and camera pose estimation. |
Existing SLAM methods struggle to balance efficiency and accuracy, particularly in generating detailed dense maps. GS-SLAM addresses this by utilizing the speed of splatting rendering and 3D Gaussian representations. |
GS-SLAM optimizes camera tracking and mapping with a differentiable RGB-D rendering approach using 3D Gaussians and splatting. It employs an adaptive expansion strategy to manage 3D Gaussian elements and a coarse-to-fine approach for pose estimation. |
Achieves state-of-the-art performance in dense neural RGB-D SLAM on Replica and TUM-RGBD datasets.
Exhibits superior rendering performance, achieving up to 100x faster speeds than previous methods.
Effectively balances efficiency and accuracy for real-time tracking, mapping, and rendering. |
Reliance on high-quality depth data may limit performance in certain conditions.
High memory requirements for large scenes, suggesting future work on optimization via techniques like quantization or clustering. |
slam, 3d gaussian splatting, dense mapping, camera pose estimation, real-time rendering |
2311.11697
Report |
Cut-and-Paste: Subject-Driven Video Editing with Attention Control |
Zhichao Zuo, Zhao Zhang, Yan Luo, Yang Zhao, Haijun Zhang, Yi Yang, Meng Wang |
This paper presents a novel framework termed Cut-and-Paste for real-world
semantic video editing under the guidance of a text prompt and an additional
reference image. While the text-driven video editing has demonstrated
remarkable ability to generate highly diverse videos following given text
prompts, the fine-grained semantic edits are hard to control by plain textual
prompt only in terms of object details and edited region, and cumbersome long
text descriptions are usually needed for the task. We therefore investigate
subject-driven video editing for more precise control of both edited regions
and background preservation, and fine-grained semantic generation. We achieve
this goal by introducing a reference image as supplementary input to the
text-driven video editing, which avoids racking your brain to come up with a
cumbersome text prompt describing the detailed appearance of the object. To
limit the editing area, we refer to a method of cross attention control in
image editing and successfully extend it to video editing by fusing the
attention map of adjacent frames, which strikes a balance between maintaining
video background and spatio-temporal consistency. Compared with current
methods, the whole process of our method is like "cutting" the source object to be
edited and then "pasting" the target object provided by the reference image. We
demonstrate that our method performs favorably over prior art for video
editing under the guidance of a text prompt and an extra reference image, as
measured by both quantitative and subjective evaluations. |
This paper proposes Cut-and-Paste, a novel subject-driven video editing framework that uses both text prompts and reference images for fine-grained control over semantic video editing. |
Existing text-driven video editing methods lack precise control over semantic edits and struggle to preserve the original video's background and temporal consistency. This paper aims to solve this by leveraging the semantic information of a reference image in addition to the text prompts. |
The proposed method combines a pre-trained text-to-image Latent Diffusion Model (LDM) with a multimodal encoder (BLIP-2) to fuse text prompts and reference image representations. It employs an attention control mechanism with adjacent frames to maintain spatio-temporal consistency. |
Cut-and-Paste demonstrates superior performance over state-of-the-art text-driven video editing methods in terms of fine-grained control, background preservation, and spatio-temporal consistency.
Quantitative evaluations using CLIP Score and LPIPS show that Cut-and-Paste achieves higher text-image similarity and lower deviation from the original video frames.
A user study confirms that users strongly prefer Cut-and-Paste for both text-video alignment and video fidelity. |
The current method faces limitations in editing multiple objects simultaneously and changing object size on a large scale.
Future work includes enhancing the model's capability to handle multiple objects and remove existing objects in video frames. Additionally, eliminating the fine-tuning process could make the approach more user-friendly. |
video editing, diffusion models, text-guided synthesis, attention control, subject-driven editing |
2311.11695
Report |
Clarity ChatGPT: An Interactive and Adaptive Processing System for Image Restoration and Enhancement |
Yanyan Wei, Zhao Zhang, Jiahuan Ren, Xiaogang Xu, Richang Hong, Yi Yang, Shuicheng Yan, Meng Wang |
The generalization capability of existing image restoration and enhancement
(IRE) methods is constrained by the limited pre-trained datasets, making it
difficult to handle agnostic inputs such as different degradation levels and
scenarios beyond their design scopes. Moreover, they are not equipped with
interactive mechanisms to consider user preferences or feedback, and their
end-to-end settings cannot provide users with more choices. Faced with the
above-mentioned IRE methods' limited performance and insufficient
interactivity, we try to address these problems at the engineering and system framework
levels. Specifically, we propose Clarity ChatGPT, a transformative system that
combines the conversational intelligence of ChatGPT with multiple IRE methods.
Clarity ChatGPT can automatically detect image degradation types and select
appropriate IRE methods to restore images, or iteratively generate satisfactory
results based on user feedback. Its innovative features include a CLIP-powered
detector for accurate degradation classification, no-reference image quality
evaluation for performance evaluation, region-specific processing for precise
enhancements, and advanced fusion techniques for optimal restoration results.
Clarity ChatGPT marks a significant advancement in integrating language and
vision, enhancing image-text interactions, and providing a robust,
high-performance IRE solution. Our case studies demonstrate that Clarity
ChatGPT effectively improves the generalization and interaction capabilities in
the IRE, and also fills the gap in the low-level domain of the existing
vision-language model. |
This paper presents Clarity ChatGPT, a system that bridges conversational AI (ChatGPT) with image restoration and enhancement (IRE) methods using Visual and Restoration & Enhancement Foundation Models (VFMs & REFMs). |
Existing IRE methods lack adaptability to diverse degradation types and user feedback. Clarity ChatGPT addresses these limitations by integrating LLMs with VFMs and REFMs for interactive, user-centric IRE solutions. |
Clarity ChatGPT employs a pipeline including: (1) CLIP-powered degradation detection, (2) no-reference IQA, (3) region-specific processing using SAM and GroundingDINO, and (4) multi-result fusion with a U-Net architecture. |
Fine-tuned CLIP achieves 94.57% Top-1 accuracy for degradation classification, significantly outperforming the original CLIP (38.27%).
Multiple results fusion for low-light enhancement with denoising shows superior performance (PSNR: 27.23, SSIM: 0.823) compared to individual methods.
Case studies demonstrate Clarity ChatGPT's capability in handling complex IRE tasks, including region-specific enhancements and challenging degradation types, exceeding ChatGPT-4V's performance. |
Limited model sharing and collaborative development of IRE algorithms.
Lack of a comprehensive user feedback mechanism for system optimization and personalization. |
image restoration, image enhancement, chatgpt, vision-language models, interactive image processing |
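The Clarity ChatGPT record above reports fusion quality as PSNR and SSIM. For readers unfamiliar with the first metric, here is a minimal sketch of the standard PSNR definition, 10 * log10(MAX^2 / MSE); the function name and the 8-bit default `max_value=255.0` are assumptions, and the evaluation protocol actually used in the paper is not specified in this record.

```python
import numpy as np

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

clean = np.zeros((4, 4))
off_by_one = clean + 1.0   # every pixel differs by one gray level
value = psnr(clean, off_by_one)
```

Higher values mean the restored image is closer to the reference; a uniform error of one gray level on an 8-bit scale gives roughly 48 dB.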
2311.11666
Report |
OmniSeg3D: Omniversal 3D Segmentation via Hierarchical Contrastive Learning |
Haiyang Ying, Yixuan Yin, Jinzhi Zhang, Fan Wang, Tao Yu, Ruqi Huang, Lu Fang |
Towards holistic understanding of 3D scenes, a general 3D segmentation method
is needed that can segment diverse objects without restrictions on object
quantity or categories, while also reflecting the inherent hierarchical
structure. To achieve this, we propose OmniSeg3D, an omniversal segmentation
method that aims to segment anything in 3D all at once. The key insight is to
lift multi-view inconsistent 2D segmentations into a consistent 3D feature
field through a hierarchical contrastive learning framework, which is
accomplished by two steps. Firstly, we design a novel hierarchical
representation based on category-agnostic 2D segmentations to model the
multi-level relationship among pixels. Secondly, image features rendered from
the 3D feature field are clustered at different levels, which can be further
drawn closer or pushed apart according to the hierarchical relationship between
different levels. In tackling the challenges posed by inconsistent 2D
segmentations, this framework yields a globally consistent 3D feature field,
which further enables hierarchical segmentation, multi-object selection, and
global discretization. Extensive experiments demonstrate the effectiveness of
our method on high-quality 3D segmentation and accurate hierarchical structure
understanding. A graphical user interface further facilitates flexible
interaction for omniversal 3D segmentation. |
This paper presents OmniSeg3D, an omniversal 3D segmentation method that segments diverse objects in 3D without restrictions on categories or quantity, while also capturing hierarchical structure. |
Holistic 3D scene understanding requires a general 3D segmentation method that overcomes limitations of existing methods, such as category restrictions and inability to reflect hierarchical structure. |
The method leverages multi-view 2D segmentations and lifts them into a consistent 3D feature field through a hierarchical contrastive learning framework. This is achieved by: (1) Designing a hierarchical 2D representation based on category-agnostic segmentations to model multi-level relationships. (2) Hierarchically clustering image features rendered from the 3D feature field, drawing them closer or pushing them apart based on their hierarchical relationships. |
OmniSeg3D achieves state-of-the-art performance on hierarchical 3D segmentation benchmarks, demonstrating its ability to understand scene structure across scales.
The method outperforms baseline methods in 3D instance segmentation tasks, showcasing its effectiveness in segmenting individual objects.
A user-friendly graphical user interface enables interactive 3D segmentation, facilitating applications like annotation and object manipulation. |
The lack of a clear definition for hierarchy levels in the current method may lead to inconsistent segmentation levels across different objects.
Objects that never appear in the same image might exhibit similar semantic features due to contrastive learning being applied on single images. |
3d segmentation, hierarchical representation learning, contrastive learning, multi-view consistency, interactive segmentation |
2311.11600
Report |
Deep Equilibrium Diffusion Restoration with Parallel Sampling |
Jiezhang Cao, Yue Shi, Kai Zhang, Yulun Zhang, Radu Timofte, Luc Van Gool |
Diffusion model-based image restoration (IR) aims to use diffusion models to
recover high-quality (HQ) images from degraded images, achieving promising
performance. Due to the inherent property of diffusion models, most existing
methods need long serial sampling chains to restore HQ images step-by-step,
resulting in expensive sampling time and high computation costs. Moreover, such
long sampling chains hinder understanding the relationship between inputs and
restoration results since it is hard to compute the gradients in the whole
chains. In this work, we aim to rethink the diffusion model-based IR models
through a different perspective, i.e., a deep equilibrium (DEQ) fixed point
system, called DeqIR. Specifically, we derive an analytical solution by
modeling the entire sampling chain in these IR models as a joint multivariate
fixed point system. Based on the analytical solution, we can conduct parallel
sampling and restore HQ images without training. Furthermore, we compute fast
gradients via DEQ inversion and find that initialization optimization can
boost image quality and control the generation direction. Extensive experiments
on benchmarks demonstrate the effectiveness of our method on typical IR tasks
and real-world settings. |
This paper presents DeqIR, a zero-shot image restoration method based on deep equilibrium (DEQ) fixed-point systems for parallel sampling and initialization optimization in diffusion models. |
Existing diffusion model-based IR methods suffer from long serial sampling chains, leading to high computational costs and difficulties in understanding the relationship between inputs and outputs. |
The authors model the entire sampling chain as a joint multivariate fixed point system, deriving an analytical solution for parallel sampling. DEQ inversion enables efficient gradient computation for initialization optimization. |
DeqIR achieves parallel sampling, enabling faster inference and multi-GPU training compared to sequential sampling methods.
DEQ inversion allows for efficient computation of gradients, facilitating initialization optimization to improve restoration quality and control generation direction.
Extensive experiments demonstrate DeqIR's effectiveness on various IR tasks, outperforming existing zero-shot methods and achieving comparable results to supervised approaches, with promising real-world applicability. |
The performance of DeqIR depends on the accuracy of the degradation matrix, which might be unknown or inaccurate in some real-world scenarios.
Exploring the application of DEQ inversion to extend DeqIR for supervised learning is a potential future direction. |
image restoration, diffusion models, deep equilibrium models, parallel sampling, initialization optimization |
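The DeqIR record above describes treating a whole sampling chain as one joint fixed-point system so that all steps can be updated in parallel. The toy sketch below shows the general idea on a made-up deterministic chain x_{t-1} = step(x_t, t), solved with plain Jacobi-style sweeps. The `step` function, the solver, and all names are assumptions chosen for illustration; the paper derives its own analytical formulation and this is not it.

```python
import numpy as np

def step(x, t):
    # Hypothetical deterministic update standing in for one sampling step.
    return 0.5 * x + np.cos(t)

def sequential_sample(x_T, T):
    """Reference: run the chain one step at a time, t = T, ..., 1."""
    x = x_T
    for t in range(T, 0, -1):
        x = step(x, t)
    return x

def parallel_fixed_point(x_T, T, max_iters=None, tol=1e-12):
    """Update all T states at once until they stop changing.

    states[i] is the current estimate of x_{T-1-i}. Each sweep applies
    `step` to every state simultaneously, which is what makes it parallel.
    """
    x_T = np.asarray(x_T, dtype=np.float64)
    max_iters = T if max_iters is None else max_iters
    ts = np.arange(T, 0, -1).reshape((T,) + (1,) * x_T.ndim)
    states = np.zeros((T,) + x_T.shape)
    iters = 0
    for _ in range(max_iters):
        # Inputs to the T updates: x_T followed by the previous estimates.
        inputs = np.concatenate([x_T[None], states[:-1]], axis=0)
        new_states = step(inputs, ts)
        iters += 1
        converged = np.max(np.abs(new_states - states)) < tol
        states = new_states
        if converged:
            break
    return states[-1], iters

x_T = np.array([1.0, -2.0, 0.5])
x0_seq = sequential_sample(x_T, T=10)
x0_par, n_iters = parallel_fixed_point(x_T, T=10)
```

In this toy chain, after k sweeps the first k states already match the sequential result, so T sweeps always suffice; each sweep is a single batched call rather than T dependent ones.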
2311.11469
Report |
DiffGANPaint: Fast Inpainting Using Denoising Diffusion GANs |
Moein Heidari, Alireza Morsali, Tohid Abedini, Samin Heydarian |
Free-form image inpainting is the task of reconstructing parts of an image
specified by an arbitrary binary mask. In this task, it is typically desired to
generalize model capabilities to unseen mask types, rather than learning
certain mask distributions. Capitalizing on the advances in diffusion models,
in this paper, we propose a Denoising Diffusion Probabilistic Model (DDPM)
based model that fills missing pixels quickly by modeling the backward
diffusion process with the generator of a generative adversarial network (GAN),
which reduces the sampling cost of diffusion models. Experiments on
general-purpose image inpainting datasets verify that our approach performs
better than or on par with most contemporary works. |
Presents DiffGANPaint, a novel image inpainting method combining a Denoising Diffusion Probabilistic Model (DDPM) with a Generative Adversarial Network (GAN) for fast and high-quality reconstruction of missing image regions. |
Addresses the computational expense of traditional DDPM-based image inpainting methods while maintaining high visual quality. |
Utilizes a trained DDPM to denoise the input image, then employs a trained GAN generator to fill in the masked regions, leveraging the structural consistency of DDPM and the generation speed of GANs. |
DiffGANPaint generates high-quality inpainted images with superior or comparable performance to contemporary methods.
The method demonstrates strong generalization capabilities across diverse datasets, including CelebA-HQ faces and generic images.
DiffGANPaint achieves fast inpainting with a low computational budget compared to traditional DDPM approaches. |
The paper does not provide quantitative comparisons to other state-of-the-art inpainting methods.
Further exploration of different GAN architectures and their impact on inpainting quality is a potential avenue for future work. |
image inpainting, diffusion models, generative adversarial networks, ddpm, gan |
2311.11465
Report |
Understanding Segment Anything Model: SAM is Biased Towards Texture Rather than Shape |
Chaoning Zhang, Yu Qiao, Shehbaz Tariq, Sheng Zheng, Chenshuang Zhang, Chenghao Li, Hyundong Shin, Choong Seon Hong |
In contrast to human vision, which mainly depends on shape for
recognizing objects, deep image recognition models are widely known to be
biased toward texture. Recently, the Meta research team released the first
foundation model for image segmentation, termed the Segment Anything Model (SAM),
which has attracted significant attention. In this work, we study SAM from
the perspective of texture vs. shape. Unlike label-oriented
recognition tasks, SAM is trained to predict a mask covering the object
shape based on a prompt. With this said, it seems self-evident that SAM is
biased towards shape. In this work, however, we reveal an interesting finding:
the SAM is strongly biased towards texture-like dense features rather than
shape. This intriguing finding is supported by a novel setup where we
disentangle texture and shape cues and design texture-shape cue conflict for
mask prediction. |
This paper investigates whether the Segment Anything Model (SAM) prioritizes texture or shape cues when predicting object masks, revealing a surprising bias towards texture. |
Understanding the role of texture and shape in SAM's decision-making process is crucial for comprehending its capabilities and limitations as a foundation model for image segmentation. |
The authors disentangle texture and shape cues by creating images with only one type of cue and images with conflicting cues, then analyze SAM's mask predictions on these manipulated images. |
Texture alone can be sufficient for accurate mask prediction.
Shape alone leads to less accurate mask predictions.
In cases of conflicting cues, SAM predominantly relies on texture over shape. |
The analysis primarily focuses on silhouette-based images, potentially limiting the generalizability of findings.
Further research is needed to investigate the impact of different texture types and complexities on SAM's bias. |
segment anything model (sam), image segmentation, texture bias, shape bias, computer vision |
2311.11325
Report |
MoVideo: Motion-Aware Video Generation with Diffusion Models |
Jingyun Liang, Yuchen Fan, Kai Zhang, Radu Timofte, Luc Van Gool, Rakesh Ranjan |
While recent years have witnessed great progress in using diffusion models
for video generation, most existing methods are simple extensions of image generation
frameworks, which fail to explicitly consider one of the key differences
between videos and images, i.e., motion. In this paper, we propose a novel
motion-aware video generation (MoVideo) framework that takes motion into
consideration from two aspects: video depth and optical flow. The former
regulates motion by per-frame object distances and spatial layouts, while the
latter describes motion by cross-frame correspondences that help in preserving
fine details and improving temporal consistency. More specifically, given a key
frame that either exists or is generated from text prompts, we first design a diffusion
model with spatio-temporal modules to generate the video depth and the
corresponding optical flows. Then, the video is generated in the latent space
by another spatio-temporal diffusion model under the guidance of depth, optical
flow-based warped latent video and the calculated occlusion mask. Lastly, we
use optical flows again to align and refine different frames for better video
decoding from the latent space to the pixel space. In experiments, MoVideo
achieves state-of-the-art results in both text-to-video and image-to-video
generation, showing promising prompt consistency, frame consistency and visual
quality. |
This paper proposes MoVideo, a novel motion-aware video generation framework that explicitly incorporates depth and optical flow to control video motion. |
Existing video generation diffusion models often lack explicit motion modeling and struggle to generate videos with natural and consistent motion. MoVideo addresses this by leveraging depth for spatial layout guidance and optical flow for temporal consistency. |
MoVideo consists of four stages: 1) Key frame generation from text prompts using Latent Diffusion, 2) Video depth and optical flow generation conditioned on the key frame using a 3D diffusion model, 3) Latent video generation guided by depth, optical flow-warped latent video, and occlusion mask using another 3D diffusion model, 4) Optical flow-augmented video decoding for enhanced temporal consistency. |
MoVideo achieves state-of-the-art results in both text-to-video and image-to-video generation on various datasets like WebVid-10M and DAVIS.
The generated videos exhibit strong prompt consistency, frame consistency, and high visual quality.
Ablation studies validate the contribution of each component, particularly the use of warped video, occlusion masks, and flow-augmented decoding. |
The current model is limited to generating videos from a single key frame.
Exploring higher-resolution video generation and more complex motion patterns is left for future work. |
video generation, diffusion models, motion modeling, optical flow, depth estimation |
2311.11284
Report |
LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching |
Yixun Liang, Xin Yang, Jiantao Lin, Haodong Li, Xiaogang Xu, Yingcong Chen |
The recent advancements in text-to-3D generation mark a significant milestone
in generative models, unlocking new possibilities for creating imaginative 3D
assets across various real-world scenarios. While recent advancements in
text-to-3D generation have shown promise, they often fall short in rendering
detailed and high-quality 3D models. This problem is especially prevalent as
many methods base themselves on Score Distillation Sampling (SDS). This paper
identifies a notable deficiency in SDS: it yields inconsistent and
low-quality update directions for the 3D model, causing an over-smoothing
effect. To address this, we propose a novel approach called Interval Score
Matching (ISM). ISM employs deterministic diffusing trajectories and utilizes
interval-based score matching to counteract over-smoothing. Furthermore, we
incorporate 3D Gaussian Splatting into our text-to-3D generation pipeline.
Extensive experiments show that our model largely outperforms the
state-of-the-art in quality and training efficiency. |
This paper proposes LucidDreamer, a novel text-to-3D generation framework that leverages Interval Score Matching (ISM) to enhance the fidelity of generated 3D models. |
Existing text-to-3D generation methods often produce overly smooth models lacking intricate details. This stems from limitations in Score Distillation Sampling (SDS), which relies on inconsistent and low-quality pseudo-ground-truth data. |
The authors introduce ISM, which utilizes deterministic diffusing trajectories through DDIM inversion and conducts matching between interval steps in the diffusion process. This, coupled with employing 3D Gaussian Splatting as the 3D representation, facilitates high-quality 3D generation. |
LucidDreamer generates highly realistic and detailed 3D models, surpassing state-of-the-art methods in visual quality.
ISM effectively addresses the over-smoothing issue prevalent in SDS-based approaches.
The proposed framework exhibits efficiency in training and rendering, enabling high-resolution outputs with reduced computational burden. |
The influence of interval length on generation quality necessitates further investigation and potential refinements.
Exploring the full potential of ISM for advanced editing tasks, such as 2D/3D manipulation and control, holds promise for future work. |
text-to-3d generation, score distillation sampling, interval score matching, 3d gaussian splatting, generative models |
2311.11261
Report |
Adversarial Prompt Tuning for Vision-Language Models |
Jiaming Zhang, Xingjun Ma, Xin Wang, Lingyu Qiu, Jiaqi Wang, Yu-Gang Jiang, Jitao Sang |
With the rapid advancement of multimodal learning, pre-trained
Vision-Language Models (VLMs) such as CLIP have demonstrated remarkable
capacities in bridging the gap between visual and language modalities. However,
these models remain vulnerable to adversarial attacks, particularly in the
image modality, presenting considerable security risks. This paper introduces
Adversarial Prompt Tuning (AdvPT), a novel technique to enhance the adversarial
robustness of image encoders in VLMs. AdvPT innovatively leverages learnable
text prompts and aligns them with adversarial image embeddings, to address the
vulnerabilities inherent in VLMs without the need for extensive parameter
training or modification of the model architecture. We demonstrate that AdvPT
improves resistance against white-box and black-box adversarial attacks and
exhibits a synergistic effect when combined with existing
image-processing-based defense techniques, further boosting defensive
capabilities. Comprehensive experimental analyses provide insights into
adversarial prompt tuning, a novel paradigm devoted to improving resistance to
adversarial images through textual input modifications, paving the way for
future robust multimodal learning research. These findings open up new
possibilities for enhancing the security of VLMs. Our code is available at
https://github.com/jiamingzhang94/Adversarial-Prompt-Tuning. |
This paper proposes Adversarial Prompt Tuning (AdvPT), a novel technique to enhance the adversarial robustness of image encoders in VLMs by aligning learnable text prompts with adversarial image embeddings. |
Existing VLMs are vulnerable to adversarial attacks, particularly in the image modality, presenting security risks. AdvPT addresses this by improving robustness without extensive parameter training or model architecture modification. |
AdvPT generates an adversarial image embedding bank. It then optimizes learnable text prompts by aligning them with these adversarial embeddings through backpropagation in the text encoder, leaving the image encoder untouched. |
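The alignment step described above can be illustrated with a minimal sketch. It assumes a CLIP-style cross-entropy over cosine similarities between a fixed bank of adversarial image embeddings and one text embedding per class; the function names, the temperature value, and the toy inputs are hypothetical and not taken from the paper or its code release.

```python
import numpy as np

def l2_normalize(v):
    # Scale each row to unit length so dot products become cosine similarities.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def prompt_alignment_loss(adv_image_bank, class_text_emb, labels, temperature=0.01):
    """Cross-entropy between adversarial image embeddings and class text embeddings.

    adv_image_bank: (N, D) precomputed embeddings of adversarial images (kept fixed).
    class_text_emb: (K, D) text-encoder outputs for the K learnable class prompts.
    labels:         (N,) integer class index of each image.
    temperature:    assumed value, equivalent to a logit scale of 100.
    """
    img = l2_normalize(adv_image_bank)
    txt = l2_normalize(class_text_emb)
    logits = img @ txt.T / temperature                    # (N, K) similarity scores
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# Toy check: text embeddings that match the bank give a lower loss than mismatched ones.
bank = np.eye(3)
text = np.eye(3)
loss_aligned = prompt_alignment_loss(bank, text, np.array([0, 1, 2]))
loss_misaligned = prompt_alignment_loss(bank, text, np.array([1, 2, 0]))
```

In an actual training loop only the prompt vectors feeding the text encoder would receive gradients from this loss; the image bank is computed once and reused.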
AdvPT significantly improves robustness against both white-box and black-box attacks compared to vanilla CLIP.
It demonstrates synergy with existing image-based defense techniques, further boosting robustness.
The paper provides insights into AdvPT's working mechanism, generalization-robustness trade-off, and transferability across datasets. |
The evaluation of adversarial robustness is limited by the specific attacks used.
The focus is restricted to image recognition tasks. |
adversarial robustness, vision-language models, prompt tuning, adversarial attacks, multimodal learning |
2311.11243
Report |
AutoStory: Generating Diverse Storytelling Images with Minimal Human Effort |
Wen Wang, Canyu Zhao, Hao Chen, Zhekai Chen, Kecheng Zheng, Chunhua Shen |
Story visualization aims to generate a series of images that match the story
described in texts, and it requires the generated images to be of high
quality, aligned with the text description, and consistent in character
identities. Given the complexity of story visualization, existing methods
drastically simplify the problem by considering only a few specific characters
and scenarios, or requiring the users to provide per-image control conditions
such as sketches. However, these simplifications render these methods
incompetent for real applications. To this end, we propose an automated story
visualization system that can effectively generate diverse, high-quality, and
consistent sets of story images, with minimal human interactions. Specifically,
we utilize the comprehension and planning capabilities of large language models
for layout planning, and then leverage large-scale text-to-image models to
generate sophisticated story images based on the layout. We empirically find
that sparse control conditions, such as bounding boxes, are suitable for layout
planning, while dense control conditions, e.g., sketches and keypoints, are
suitable for generating high-quality image content. To obtain the best of both
worlds, we devise a dense condition generation module to transform simple
bounding box layouts into sketch or keypoint control conditions for final image
generation, which not only improves the image quality but also allows easy and
intuitive user interactions. In addition, we propose a simple yet effective
method to generate multi-view consistent character images, eliminating the
reliance on human labor to collect or draw character images. |
Proposes AutoStory, a fully automated story visualization system that generates diverse, high-quality, and consistent story images with minimal human interaction, using LLMs for layout planning and text-to-image models for image generation. |
Story visualization is important for various applications like art creation, education, and cultural heritage, but existing methods are limited in versatility and require significant user effort. |
Uses LLMs to generate layouts from story texts, a dense condition generation module to transform layouts into sketch or keypoint conditions, and a multi-subject customization model for image generation. A training-free method generates multi-view consistent character images, eliminating the need for user-provided character images. |
Generates high-quality, text-aligned, and identity-consistent story images in diverse styles.
Achieves superior quantitative results in text-to-image and image-to-image similarity compared to existing methods.
Outperforms competing approaches in user studies evaluating text alignment, identity preservation, and image quality. |
Multi-concept customization process can be slow.
Future work aims to accelerate the customization for real-time generation. |
story visualization, text-to-image generation, large language models, diffusion models, controllable image generation |
2311.11221
Report |
GaussianDiffusion: 3D Gaussian Splatting for Denoising Diffusion Probabilistic Models with Structured Noise |
Xinhai Li, Huaibin Wang, Kuo-Kun Tseng |
Text-to-3D, known for its efficient generation methods and expansive creative
potential, has garnered significant attention in the AIGC domain. However, the
amalgamation of NeRF and 2D diffusion models frequently yields oversaturated
images, posing severe limitations on downstream industrial applications due to
the constraints of the pixelwise rendering method. Gaussian splatting has recently
superseded the traditional pointwise sampling technique prevalent in NeRF-based
methodologies, revolutionizing various aspects of 3D reconstruction. This paper
introduces a novel text to 3D content generation framework based on Gaussian
splatting, enabling fine control over image saturation through individual
Gaussian sphere transparencies, thereby producing more realistic images. The
challenge of achieving multi-view consistency in 3D generation significantly
impedes modeling complexity and accuracy. Taking inspiration from SJC, we
explore employing multi-view noise distributions to perturb images generated by
3D Gaussian splatting, aiming to rectify inconsistencies in multi-view
geometry. We ingeniously devise an efficient method to generate noise that
produces Gaussian noise from diverse viewpoints, all originating from a shared
noise source. Furthermore, vanilla 3D Gaussian-based generation tends to trap
models in local minima, causing artifacts like floaters, burrs, or
proliferative elements. To mitigate these issues, we propose the variational
Gaussian splatting technique to enhance the quality and stability of 3D
appearance. To our knowledge, our approach represents the first comprehensive
utilization of Gaussian splatting across the entire spectrum of 3D content
generation processes. |
This paper presents GaussianDiffusion, a novel text-to-3D generation framework based on Gaussian splatting for accelerated rendering and realistic 3D content creation from text prompts. |
Existing text-to-3D methods suffer from limitations like oversaturated images, slow rendering speed, multi-view inconsistency, and artifacts in generation. This work addresses these limitations with a novel Gaussian splatting based framework. |
The proposed GaussianDiffusion framework leverages Gaussian splatting for 3D representation and addresses multi-view consistency through a structured noise injection approach. It further introduces variational Gaussian splatting to enhance appearance quality and mitigate artifacts. |
GaussianDiffusion achieves significantly faster convergence compared to previous state-of-the-art methods like SJC and 3DFuse.
The introduction of structured noise effectively addresses multi-view geometric inconsistency, leading to better 3D structure generation.
Variational Gaussian splatting enhances the generated 3D appearance by reducing artifacts such as floaters, burrs, and proliferative elements. |
The use of variational Gaussian splatting, while improving realism, introduces some blurriness and haze in the generated output.
Future work will focus on refining the variational Gaussian splatting technique to mitigate blurriness and enhance overall appearance quality. |
text-to-3d, gaussian splatting, 3d content generation, multi-view consistency, variational gaussian splatting |
2311.11207
Report |
On the Noise Scheduling for Generating Plausible Designs with Diffusion Models |
Jiajie Fan, Laure Vuaille, Thomas Bäck, Hao Wang |
Deep Generative Models (DGMs) are widely used to create innovative designs
across multiple industries, ranging from fashion to the automotive sector. In
addition to generating images of high visual quality, the task of structural
design generation imposes more stringent constraints on the semantic expression,
e.g., no floating material or missing part, which we refer to as plausibility
in this work. We delve into the impact of noise schedules of diffusion models
on the plausibility of the outcome: there exists a range of noise levels at
which the model's performance determines the plausibility of the result. Also, we propose
two techniques to determine such a range for a given image set and devise a
novel parametric noise schedule for better plausibility. We apply this noise
schedule to the training and sampling of the well-known diffusion model EDM and
compare it to its default noise schedule. Compared to EDM, our schedule
significantly improves the rate of plausible designs from 83.4% to 93.5% and
Fréchet Inception Distance (FID) from 7.84 to 4.87. Further applications of
advanced image editing tools demonstrate the model's solid understanding of
structure. |
This paper proposes a Plausibility-oriented Diffusion Model (PoDM) that prioritizes a specific range of noise levels during training and sampling to improve the plausibility of generated structural designs. |
Existing diffusion models often prioritize visual quality over the plausibility of generated structures, leading to unrealistic designs. |
The authors identify a 'plausibility-relevant' range of noise levels in the diffusion process. They then modify the noise schedule of an existing diffusion model (EDM) to prioritize this range during both training and sampling. |
PoDM significantly increases the rate of plausible designs from 83.4% (EDM) to 93.5%, almost reaching the performance of DDPM (94%) but with a much faster sampling speed.
PoDM achieves a FID of 4.87, improving upon EDM's 7.84.
The authors demonstrate PoDM's ability to semantically manipulate structural designs using image editing techniques like interpolation, dragging, and inpainting. |
The study focuses solely on the BIKED dataset, potentially limiting the generalizability of the findings.
Future work could explore automated methods for evaluating the plausibility of generated images. |
diffusion models, generative design, structural design, noise schedule, image plausibility |
2311.10995
Report |
Behavior Optimized Image Generation |
Varun Khurana, Yaman K Singla, Jayakumar Subramanian, Rajiv Ratn Shah, Changyou Chen, Zhiqiang Xu, Balaji Krishnamurthy |
The last few years have witnessed great success on image generation, which
has crossed the acceptance thresholds of aesthetics, making it directly
applicable to personal and commercial applications. However, images, especially
in marketing and advertising applications, are often created as a means to an
end as opposed to just aesthetic concerns. The goal can be increasing sales,
getting more clicks, likes, or image sales (in the case of stock businesses).
Therefore, the generated images need to perform well on these key performance
indicators (KPIs), in addition to being aesthetically good. In this paper, we
make the first endeavor to answer the question: "How can one infuse the
knowledge of the end-goal within the image generation process itself to create
not just better-looking images but also 'better-performing' images?" We
propose BoigLLM, an LLM that understands both image content and user behavior.
BoigLLM knows how an image should look to get a certain required KPI. We show
that BoigLLM outperforms 13x larger models such as GPT-3.5 and GPT-4 in this
task, demonstrating that while these state-of-the-art models can understand
images, they lack information on how these images perform in the real world. To
generate actual pixels of behavior-conditioned images, we train a
diffusion-based model (BoigSD) to align with a proposed BoigLLM-defined reward.
We show the performance of the overall pipeline on two datasets covering two
different behaviors: a stock dataset with the number of forward actions as the
KPI and a dataset containing tweets with the total likes as the KPI, denoted as
BoigBench. To advance research in the direction of utility-driven image
generation and understanding, we release BoigBench, a benchmark dataset
containing 168 million enterprise tweets with their media, brand account names,
time of post, and total likes. |
Introduces behavior-optimized image generation (BOIG), focusing on generating images that not only look good but also perform well on key performance indicators (KPIs) like likes and downloads. |
Images often serve a purpose beyond aesthetics, especially in marketing. Aligning image generation with user behavior can lead to more effective marketing campaigns. |
1. Creates BoigLLM, an LLM fine-tuned to understand image content and predict user behavior (likes, downloads). 2. Uses BoigLLM as a reward model to train BoigSD, a diffusion model that generates images optimized for desired KPIs. |
BoigLLM outperforms larger LLMs (GPT-3.5, GPT-4) in predicting image attributes based on desired behavior.
BoigSD generates images that score higher on BoigLLM's reward model, indicating better alignment with desired KPIs.
Supervised fine-tuning of stable diffusion on high-KPI images alone does not improve performance. |
The reward function relies on non-differentiable featurizers, limiting the use of end-to-end analytic policy gradients.
Current work focuses on likes and downloads; exploring other KPIs and user behaviors is crucial. |
image generation, user behavior, large language models, diffusion models, marketing |
2311.10982
Report |
Make Pixels Dance: High-Dynamic Video Generation |
Yan Zeng, Guoqiang Wei, Jiani Zheng, Jiaxin Zou, Yang Wei, Yuchen Zhang, Hang Li |
Creating high-dynamic videos such as motion-rich actions and sophisticated
visual effects poses a significant challenge in the field of artificial
intelligence. Unfortunately, current state-of-the-art video generation methods,
primarily focusing on text-to-video generation, tend to produce video clips
with minimal motions despite maintaining high fidelity. We argue that relying
solely on text instructions is insufficient and suboptimal for video
generation. In this paper, we introduce PixelDance, a novel approach based on
diffusion models that incorporates image instructions for both the first and
last frames in conjunction with text instructions for video generation.
Comprehensive experimental results demonstrate that PixelDance trained with
public data exhibits significantly better proficiency in synthesizing videos
with complex scenes and intricate motions, setting a new standard for video
generation. |
This paper introduces PixelDance, a novel video generation approach using diffusion models that incorporates image instructions for the first and last frames alongside text instructions. |
Current video generation models struggle to create high-dynamic videos with complex scenes and motions. This approach aims to address this limitation by providing more direct visual guidance. |
The method utilizes a latent diffusion model conditioned on text and image instructions. The image instructions, encoded using a VAE, are concatenated with the latent video representation. The model is trained to avoid directly replicating the last frame, allowing flexibility during inference. |
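The concatenation step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the frame latents are taken as already encoded by the VAE, frames without an image instruction are filled with zeros, and the concatenation runs along the channel axis. The function name and tensor layout are hypothetical.

```python
import numpy as np

def build_denoiser_input(noisy_latents, first_frame_latent, last_frame_latent):
    """Attach first/last-frame image instructions to the noisy video latents.

    noisy_latents:      (F, C, H, W) latent video at the current noise level.
    first_frame_latent: (C, H, W) encoded first-frame instruction.
    last_frame_latent:  (C, H, W) encoded last-frame instruction.
    Returns an array of shape (F, 2C, H, W).
    """
    # Assumption: frames that carry no image instruction are padded with zeros.
    condition = np.zeros_like(noisy_latents)
    condition[0] = first_frame_latent
    condition[-1] = last_frame_latent
    # Assumption: conditioning is concatenated along the channel dimension.
    return np.concatenate([noisy_latents, condition], axis=1)

rng = np.random.default_rng(0)
num_frames, channels, height, width = 4, 2, 3, 3
latents = rng.standard_normal((num_frames, channels, height, width))
first = rng.standard_normal((channels, height, width))
last = rng.standard_normal((channels, height, width))
denoiser_input = build_denoiser_input(latents, first, last)
```

With this layout the denoiser sees the instruction latents at every step, while the intermediate frames remain free to be generated.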
PixelDance achieves state-of-the-art results on zero-shot video generation benchmarks MSR-VTT and UCF-101, outperforming existing methods in metrics like FVD and CLIP-similarity.
The model demonstrates superior performance in generating long videos with temporal consistency compared to autoregressive and hierarchical approaches.
PixelDance exhibits strong generalization ability, generating high-quality videos in out-of-domain styles like comics and cartoons despite being trained primarily on realistic data. |
The model's performance could be further enhanced by training on larger, higher-quality, and more diverse video datasets.
Incorporating annotated texts describing key video elements and motions could improve alignment with user instructions. |
video generation, diffusion models, image instruction, long video generation, zero-shot video editing |
2311.10807
Report |
SENetV2: Aggregated dense layer for channelwise and global representations |
Mahendran Narayanan |
Convolutional Neural Networks (CNNs) have revolutionized image classification
by extracting spatial features and enabling state-of-the-art accuracy in
vision-based tasks. The module proposed in the squeeze-and-excitation network gathers
channelwise representations of the input. Multilayer perceptrons (MLPs) learn
global representations from the data and, in most image classification models,
are used to learn from the extracted features of the image. In this paper, we introduce a
novel aggregated multilayer perceptron, a multi-branch dense layer, within the
Squeeze excitation residual module designed to surpass the performance of
existing architectures. Our approach leverages a combination of squeeze
excitation network module with dense layers. This fusion enhances the network's
ability to capture channel-wise patterns and global knowledge, leading to
a better feature representation. This proposed model has a negligible increase
in parameters when compared to SENet. We conduct extensive experiments on
benchmark datasets to validate the model and compare them with established
architectures. Experimental results demonstrate a remarkable increase in the
classification accuracy of the proposed model. |
This paper introduces SENetV2, which extends the Squeeze and Excitation Network (SENet) with a Squeeze Aggregated Excitation (SaE) module that improves feature representation by incorporating multi-branch fully connected layers within the SENet architecture. |
The authors aim to address the limitations of CNNs in capturing global representations and enhance the performance of SENet by introducing an aggregated multi-layer perceptron (MLP) within the Squeeze Excitation Residual Module. |
The authors propose a novel SaE module which incorporates multi-branch fully connected layers within the squeeze operation of the SENet module. This enables the model to learn richer global representations while maintaining a relatively lightweight structure. The proposed module is integrated into a ResNet architecture and evaluated on CIFAR-10, CIFAR-100, and a modified ImageNet dataset. |
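A rough forward-pass sketch of a multi-branch squeeze-and-excitation block is given below. It is an illustration only: the branch count, reduction size, ReLU and sigmoid activations, and the concatenation of branch outputs are assumptions in the spirit of the description above, and `sae_block` is a hypothetical name rather than the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sae_block(x, branch_weights, excite_weight):
    """Channel re-weighting with several parallel fully connected branches.

    x:              (N, C, H, W) feature map.
    branch_weights: list of (C, R) matrices, one per branch (R = reduced size).
    excite_weight:  (len(branch_weights) * R, C) matrix mapping back to C channels.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = x.mean(axis=(2, 3))                                   # (N, C)
    # Aggregated dense layer: each branch reduces the descriptor independently.
    branches = [np.maximum(squeezed @ w, 0.0) for w in branch_weights]
    aggregated = np.concatenate(branches, axis=1)                    # (N, B * R)
    # Excitation: map back to C channel gates in (0, 1).
    gates = sigmoid(aggregated @ excite_weight)                      # (N, C)
    # Scale every channel of the input by its gate.
    return x * gates[:, :, None, None]

rng = np.random.default_rng(0)
batch, channels, height, width = 2, 8, 4, 4
cardinality, reduced = 4, 2   # hypothetical branch count and reduction size
x = rng.standard_normal((batch, channels, height, width))
branch_weights = [0.1 * rng.standard_normal((channels, reduced)) for _ in range(cardinality)]
excite_weight = 0.1 * rng.standard_normal((cardinality * reduced, channels))
out = sae_block(x, branch_weights, excite_weight)
```

Because the gates are sigmoid outputs, the block can only attenuate channels; in a residual network the result would be added back to the skip connection.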
SENetV2 outperforms vanilla ResNet and SENet on CIFAR-10 and CIFAR-100 datasets, demonstrating the effectiveness of the aggregated FC layers.
The proposed model achieves competitive results on the modified ImageNet dataset, further validating its capability in improving image classification accuracy.
The SaE module proves to be effective in enhancing feature representation by combining spatial, channel-wise, and global representations. |
The paper acknowledges the computational limitations, particularly with the modified ImageNet dataset, which could have limited the full potential of SENetV2.
Further exploration of different cardinality values and reduction sizes within the SaE module could lead to additional performance improvements. |
image classification, convolutional neural networks, squeeze and excitation networks, aggregated modules, global representations |
2311.10794
Report |
Text-to-Sticker: Style Tailoring Latent Diffusion Models for Human Expression |
Animesh Sinha, Bo Sun, Anmol Kalia, Arantxa Casanova, Elliot Blanchard, David Yan, Winnie Zhang, Tony Nelli, Jiahui Chen, Hardik Shah, Licheng Yu, Mitesh Kumar Singh, Ankit Ramchandani, Maziar Sanjabi, Sonal Gupta, Amy Bearman, Dhruv Mahajan |
We introduce Style Tailoring, a recipe to finetune Latent Diffusion Models
(LDMs) in a distinct domain with high visual quality, prompt alignment and
scene diversity. We choose sticker image generation as the target domain, as
the images significantly differ from photorealistic samples typically generated
by large-scale LDMs. We start with a competent text-to-image model, like Emu,
and show that relying on prompt engineering with a photorealistic model to
generate stickers leads to poor prompt alignment and scene diversity. To
overcome these drawbacks, we first finetune Emu on millions of sticker-like
images collected using weak supervision to elicit diversity. Next, we curate
human-in-the-loop (HITL) Alignment and Style datasets from model generations,
and finetune to improve prompt alignment and style alignment respectively.
Sequential finetuning on these datasets poses a tradeoff between better style
alignment and prompt alignment gains. To address this tradeoff, we propose a
novel fine-tuning method called Style Tailoring, which jointly fits the content
and style distribution and achieves the best tradeoff. Evaluation results show our
method improves visual quality by 14%, prompt alignment by 16.2% and scene
diversity by 15.3%, compared to prompt engineering the base Emu model for
sticker generation. |
Introduces Style Tailoring, a novel fine-tuning method for Latent Diffusion Models (LDMs) to generate images in a distinct domain (sticker images) with high visual quality, prompt alignment, and scene diversity. |
Addresses the limitations of existing LDM fine-tuning methods that struggle to simultaneously improve prompt alignment, visual diversity, visual appeal, and adherence to a specific style. |
Employs a multi-stage fine-tuning approach: (1) Domain alignment using weakly aligned sticker-like images. (2) Prompt alignment using a human-in-the-loop (HITL) dataset. (3) Style alignment using an expert-in-the-loop (EITL) dataset. Introduces Style Tailoring, which jointly optimizes for content and style by training on different data distributions at different denoising timesteps. |
Style Tailoring achieves the best trade-off between prompt alignment, style alignment, visual quality, and scene diversity compared to baseline methods and sequential fine-tuning.
Domain alignment fine-tuning significantly improves scene diversity and moderately enhances prompt alignment.
Human and expert-in-the-loop datasets are crucial for achieving high prompt and style alignment, respectively. |
Rare occurrences of photorealistic backgrounds in generated stickers, potentially due to unseen concepts during training.
Subjectivity in human evaluation of generative models, as preferences can shift over time. |
latent diffusion models, fine-tuning, style transfer, text-to-image generation, human-in-the-loop |
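The Style Tailoring idea summarized above, training on different data distributions at different denoising timesteps, can be illustrated with a minimal Python sketch. The schedule length, the split point `SPLIT_T`, the direction of the split (noisier steps draw from the content data), and all function and variable names are assumptions made for illustration; none of them is confirmed by this summary.

```python
import random

NUM_TIMESTEPS = 1000  # assumed length of the diffusion noise schedule
SPLIT_T = 900         # hypothetical boundary between "content" and "style" timesteps


def sample_training_example(content_data, style_data, rng):
    """Draw a timestep, then draw the example from the dataset assigned to it."""
    t = rng.randrange(NUM_TIMESTEPS)
    # Assumption: noisier steps (large t) shape content, cleaner steps shape style.
    source = content_data if t >= SPLIT_T else style_data
    return t, rng.choice(source)


rng = random.Random(0)
content = ["content_img_0", "content_img_1"]  # placeholder for a prompt-alignment set
style = ["style_img_0", "style_img_1"]        # placeholder for a style set
samples = [sample_training_example(content, style, rng) for _ in range(1000)]
```

In a real training loop, each `(t, example)` pair would feed the usual denoising loss, so a single model sees both datasets within one fine-tuning run instead of in two sequential runs.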
2311.10708
Report |
SelfEval: Leveraging the discriminative nature of generative models for evaluation |
Sai Saketh Rambhatla, Ishan Misra |
In this work, we show that text-to-image generative models can be 'inverted'
to assess their own text-image understanding capabilities in a completely
automated manner.
Our method, called SelfEval, uses the generative model to compute the
likelihood of real images given text prompts, making the generative model
directly applicable to discriminative tasks.
Using SelfEval, we repurpose standard datasets created for evaluating
multimodal text-image discriminative models to evaluate generative models in a
fine-grained manner: assessing their performance on attribute binding, color
recognition, counting, shape recognition, spatial understanding.
To the best of our knowledge SelfEval is the first automated metric to show a
high degree of agreement for measuring text-faithfulness with the gold-standard
human evaluations across multiple models and benchmarks.
Moreover, SelfEval enables us to evaluate generative models on challenging
tasks such as Winoground image-score where they demonstrate competitive
performance to discriminative models.
We also show severe drawbacks of standard automated metrics such as
CLIP-score to measure text faithfulness on benchmarks such as DrawBench, and
how SelfEval sidesteps these issues.
We hope SelfEval enables easy and reliable automated evaluation for diffusion
models. |
The paper introduces "SelfEval," a method to automatically assess the text-image understanding of text-to-image generative models by inverting them to perform discriminative tasks. |
Automated evaluation of text-to-image models is crucial for efficient research and comparison but current methods rely on external models like CLIP, introducing biases and limitations. |
SelfEval estimates the likelihood of real images given text prompts using the diffusion model itself, converting it into a discriminative model for image-text matching tasks. |
SelfEval's ranking of text-faithfulness across different diffusion models aligns with human evaluation.
Latent diffusion models show superior text-faithfulness compared to pixel diffusion models, confirmed by both SelfEval and human evaluations.
SelfEval enables diffusion models to achieve competitive performance on challenging benchmarks like Winoground, surpassing previous methods and some discriminative models. |
SelfEval's computational cost is directly proportional to the number of timesteps in the diffusion process.
Future work could explore generalizing SelfEval to non-diffusion based generative models. |
text-to-image generation, diffusion models, automated evaluation, text faithfulness, image-text matching |
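As a rough illustration of the SelfEval summary above, the sketch below turns a text-conditioned noise predictor into an image-text matcher: the caption under which the model predicts the added noise most accurately is taken as the best match. The toy denoiser, the noise schedule `alpha = 1 - t`, and every name here are assumptions for illustration; the paper's actual likelihood estimate may differ.

```python
import numpy as np


def caption_scores(image, captions, predict_noise, timesteps, rng, n_trials=4):
    """Score each caption by the negated mean squared noise-prediction error."""
    scores = []
    for caption in captions:
        errors = []
        for t in timesteps:
            alpha = 1.0 - t  # toy noise schedule with t in (0, 1); an assumption
            for _ in range(n_trials):
                eps = rng.standard_normal(image.shape)
                noisy = np.sqrt(alpha) * image + np.sqrt(1.0 - alpha) * eps
                errors.append(np.mean((predict_noise(noisy, t, caption) - eps) ** 2))
        scores.append(-float(np.mean(errors)))
    return np.array(scores)


# Toy stand-in for a text-conditioned denoiser: it "believes" each caption
# corresponds to one fixed prototype image and infers the noise accordingly.
rng = np.random.default_rng(0)
prototypes = {
    "a red square": rng.standard_normal((8, 8)),
    "a blue circle": rng.standard_normal((8, 8)),
}


def toy_predict_noise(noisy, t, caption):
    alpha = 1.0 - t
    return (noisy - np.sqrt(alpha) * prototypes[caption]) / np.sqrt(1.0 - alpha)


captions = list(prototypes)
image = prototypes["a blue circle"]
scores = caption_scores(image, captions, toy_predict_noise, [0.2, 0.5, 0.8], rng)
best_caption = captions[int(np.argmax(scores))]
```

The nested loops over timesteps and noise draws also show why the cost grows with the number of timesteps evaluated, which the report lists as a limitation.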
2311.10522
Report |
Enhancing Object Coherence in Layout-to-Image Synthesis |
Yibin Wang, Weizhong Zhang, Jianwei Zheng, Cheng Jin |
Layout-to-image synthesis is an emerging technique in conditional image
generation. It aims to generate complex scenes, where users require fine
control over the layout of the objects in a scene. However, it remains
challenging to control the object coherence, including semantic coherence
(e.g., the cat looks at the flowers or not) and physical coherence (e.g., the
hand and the racket should not be misaligned). In this paper, we propose a
novel diffusion model with effective global semantic fusion (GSF) and
self-similarity feature enhancement modules to guide the object coherence for
this task. For semantic coherence, we argue that the image caption contains
rich information for defining the semantic relationship within the objects in
the images. Instead of simply employing cross-attention between captions and
generated images, which addresses the highly relevant layout restriction and
semantic coherence separately and thus leads to unsatisfying results shown in
our experiments, we develop GSF to fuse the supervision from the layout
restriction and semantic coherence requirement and exploit it to guide the
image synthesis process. Moreover, to improve the physical coherence, we
develop a Self-similarity Coherence Attention (SCA) module to explicitly
integrate local contextual physical coherence into each pixel's generation
process. Specifically, we adopt a self-similarity map to encode the coherence
restrictions and employ it to extract coherent features from text embedding.
Through visualization of our self-similarity map, we explore the essence of
SCA, revealing that its effectiveness is not only in capturing reliable
physical coherence patterns but also in enhancing complex texture generation.
Extensive experiments demonstrate the superiority of our proposed method in
both image generation quality and controllability. |
This paper presents EOCNet, a novel diffusion model for layout-to-image synthesis (LIS) that addresses object coherence challenges by incorporating global semantic fusion (GSF) and self-similarity feature enhancement (SFE) modules. |
LIS often struggles with maintaining object coherence, both semantically (e.g., ensuring a cat looks at flowers) and physically (e.g., aligning a hand with a racket). EOCNet tackles these issues to achieve higher quality and controllability in generated images. |
EOCNet leverages a pre-trained text-to-image diffusion model. GSF integrates semantic coherence cues from captions and layout restrictions. SFE, comprising rectified cross-attention (RCA) and self-similarity coherence attention (SCA), refines object generation with contextual awareness. |
EOCNet outperforms SOTA methods on FID and DS, indicating superior image quality and diversity.
Visualization of SCA's self-similarity maps reveals its effectiveness in capturing physical coherence patterns and enhancing complex texture generation.
Caption integration enables fine-grained control over semantic coherence and image style. |
EOCNet encounters difficulties generating highly intricate textures, like realistic hands.
Semantic misalignments may occur when the caption's coherence requirements conflict with the layout. |
layout-to-image synthesis, diffusion models, object coherence, semantic fusion, self-similarity attention |
2311.10329
Report |
High-fidelity Person-centric Subject-to-Image Synthesis |
Yibin Wang, Weizhong Zhang, Jianwei Zheng, Cheng Jin |
Current subject-driven image generation methods encounter significant
challenges in person-centric image generation. The reason is that they learn
the semantic scene and person generation by fine-tuning a common pre-trained
diffusion, which involves an irreconcilable training imbalance. Precisely, to
generate realistic persons, they need to sufficiently tune the pre-trained
model, which inevitably causes the model to forget the rich semantic scene
prior and makes scene generation over-fit to the training data. Moreover, even
with sufficient fine-tuning, these methods still cannot generate high-fidelity
persons since joint learning of the scene and person generation also leads to
quality compromise. In this paper, we propose Face-diffuser, an effective
collaborative generation pipeline to eliminate the above training imbalance and
quality compromise. Specifically, we first develop two specialized pre-trained
diffusion models, i.e., Text-driven Diffusion Model (TDM) and Subject-augmented
Diffusion Model (SDM), for scene and person generation, respectively. The
sampling process is divided into three sequential stages, i.e., semantic scene
construction, subject-scene fusion, and subject enhancement. The first and last
stages are performed by TDM and SDM respectively. In the subject-scene fusion
stage, the two models collaborate through a novel and highly effective
mechanism, Saliency-adaptive Noise Fusion (SNF). Specifically, it is based on
our key observation that there exists a robust link between classifier-free
guidance responses and the saliency of generated images. In each time step, SNF
leverages the unique strengths of each model and allows for the spatial
blending of predicted noises from both models automatically in a saliency-aware
manner. Extensive experiments confirm the impressive effectiveness and
robustness of the Face-diffuser. |
This paper introduces Face-diffuser, a novel pipeline for person-centric image generation that addresses limitations of existing methods by independently training two diffusion models for semantic scenes and person generation, and then seamlessly fusing their outputs using a Saliency-adaptive Noise Fusion mechanism. |
Existing subject-driven image generation methods struggle with person-centric generation due to training imbalance (overfitting scene generation to the training data and forgetting scene priors) and compromised person quality from joint scene and person learning. |
The proposed method utilizes two independently trained diffusion models, one for scenes (TDM) and one for persons (SDM). During sampling, a three-stage process unfolds: 1) TDM constructs the scene, 2) a novel Saliency-adaptive Noise Fusion (SNF) mechanism combines outputs from TDM and SDM based on saliency maps derived from classifier-free guidance responses, and 3) SDM refines person details. |
Face-diffuser quantitatively outperforms state-of-the-art methods in both single- and multi-subject generation, demonstrating superior identity preservation and prompt consistency.
Qualitative results showcase Face-diffuser's ability to generate high-fidelity persons consistently embedded within diverse semantic scenes, surpassing the capabilities of existing methods.
Ablation studies confirm the importance of each stage in the pipeline and the effectiveness of the proposed SNF mechanism for seamless and high-quality image synthesis. |
The strong reliance on reference images for person generation raises privacy concerns due to the potential for unauthorized use of facial features.
The current method faces limitations in editing specific attributes of generated persons. Future work aims to address these limitations and enhance control over attribute editing. |
image generation, diffusion models, person-centric generation, saliency-adaptive fusion, classifier-free guidance |
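The Saliency-adaptive Noise Fusion step described in the Face-diffuser report above can be sketched as follows. The saliency definition (channel-averaged absolute classifier-free guidance response, normalized to sum to one), the hard per-pixel mask, the guidance scale, and all names are assumptions for illustration; the paper's exact formulation may differ.

```python
import numpy as np


def cfg(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance combination of two noise predictions."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


def saliency(eps_uncond, eps_cond):
    """Per-pixel guidance response, averaged over channels and normalized."""
    response = np.abs(eps_cond - eps_uncond).mean(axis=0)  # shape (H, W)
    return response / (response.sum() + 1e-12)


def fuse_noise(tdm_uncond, tdm_cond, sdm_uncond, sdm_cond, scale=7.5):
    """Take the person model's noise where its response dominates, else the scene model's."""
    mask = saliency(sdm_uncond, sdm_cond) > saliency(tdm_uncond, tdm_cond)
    fused = np.where(mask[None], cfg(sdm_uncond, sdm_cond, scale),
                     cfg(tdm_uncond, tdm_cond, scale))
    return fused, mask


# Toy inputs: the scene model (TDM) responds on the left half of the image,
# the person model (SDM) on the right half.
shape = (4, 8, 8)  # (channels, height, width)
zeros = np.zeros(shape)
tdm_cond = np.zeros(shape)
tdm_cond[:, :, :4] = 1.0
sdm_cond = np.zeros(shape)
sdm_cond[:, :, 4:] = 1.0
fused, mask = fuse_noise(zeros, tdm_cond, zeros, sdm_cond)
```

Here `mask` selects the right half of the image for the person model, so each model contributes noise only where its own guidance signal is stronger.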
2311.10123
Report |
MetaDreamer: Efficient Text-to-3D Creation With Disentangling Geometry and Texture |
Lincong Feng, Muyu Wang, Maoyu Wang, Kuo Xu, Xiaoli Liu |
Generative models for 3D object synthesis have seen significant advancements
with the incorporation of prior knowledge distilled from 2D diffusion models.
Nevertheless, challenges persist in the form of multi-view geometric
inconsistencies and slow generation speeds within the existing 3D synthesis
frameworks. This can be attributed to two factors: firstly, the deficiency of
abundant geometric a priori knowledge in optimization, and secondly, the
entanglement issue between geometry and texture in conventional 3D generation
methods. In response, we introduce MetaDreamer, a two-stage optimization
approach that leverages rich 2D and 3D prior knowledge. In the first stage, our
emphasis is on optimizing the geometric representation to ensure multi-view
consistency and accuracy of 3D objects. In the second stage, we concentrate on
fine-tuning the geometry and optimizing the texture, thereby achieving a more
refined 3D object. Through leveraging 2D and 3D prior knowledge in two stages,
respectively, we effectively mitigate the interdependence between geometry and
texture. MetaDreamer establishes clear optimization objectives for each stage,
resulting in significant time savings in the 3D generation process. Ultimately,
MetaDreamer can generate high-quality 3D objects based on textual prompts
within 20 minutes, and to the best of our knowledge, it is the most efficient
text-to-3D generation method. Furthermore, we introduce image control into the
process, enhancing the controllability of 3D generation. Extensive empirical
evidence confirms that our method is not only highly efficient but also
achieves a quality level that is at the forefront of current state-of-the-art
3D generation techniques. |
MetaDreamer is a novel text-to-3D generation method that employs a two-stage, coarse-to-fine optimization process to efficiently generate high-quality 3D geometry and textures. |
Existing 3D generation methods suffer from slow generation speeds and struggle to balance geometric accuracy with high-quality textures. This is due to a lack of geometric prior knowledge and entanglement of geometry and texture optimization. |
MetaDreamer disentangles geometry and texture learning by using 3D priors (view-dependent diffusion model, depth, and reference image) in the first stage for coarse geometric optimization. In the second stage, it utilizes fine-tuned 2D priors (text-to-image diffusion model) for texture refinement and geometric detailing. |
MetaDreamer generates high-quality 3D objects with strong multi-view consistency and detailed textures within 20 minutes, outperforming state-of-the-art methods in both speed and quality.
Quantitative evaluations using CLIP similarity and T3Bench demonstrate MetaDreamer's superior performance in text-3D consistency and visual quality.
Ablation studies confirm the effectiveness of the two-stage disentanglement approach, highlighting the complementary roles of 3D and 2D priors. |
MetaDreamer faces limitations in multi-object generation scenarios due to the lack of multi-object priors in current geometric knowledge.
Future work will focus on incorporating richer multi-object geometric priors to enhance the model's capabilities. |
text-to-3d generation, 3d object synthesis, disentanglement learning, geometric priors, texture priors |
2311.10081
Report |
DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback |
Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, Ajay Divakaran |
We present DRESS, a large vision language model (LVLM) that innovatively
exploits Natural Language feedback (NLF) from Large Language Models to enhance
its alignment and interactions by addressing two key limitations in the
state-of-the-art LVLMs. First, prior LVLMs generally rely only on the
instruction finetuning stage to enhance alignment with human preferences.
Without incorporating extra feedback, they are still prone to generate
unhelpful, hallucinated, or harmful responses. Second, while the visual
instruction tuning data is generally structured in a multi-turn dialogue
format, the connections and dependencies among consecutive conversational turns
are weak. This reduces the capacity for effective multi-turn interactions. To
tackle these, we propose a novel categorization of the NLF into two key types:
critique and refinement. The critique NLF identifies the strengths and
weaknesses of the responses and is used to align the LVLMs with human
preferences. The refinement NLF offers concrete suggestions for improvement and
is adopted to improve the interaction ability of the LVLMs, which focuses on
LVLMs' ability to refine responses by incorporating feedback in multi-turn
interactions. To address the non-differentiable nature of NLF, we generalize
conditional reinforcement learning for training. Our experimental results
demonstrate that DRESS can generate more helpful (9.76%), honest (11.52%), and
harmless (21.03%) responses, and more effectively learn from feedback during
multi-turn interactions compared to SOTA LVLMs. |
This paper proposes DRESS, a large vision language model (LVLM) that utilizes Natural Language Feedback (NLF) from Large Language Models to enhance its alignment with human preferences and improve multi-turn interaction capabilities. |
Existing LVLMs often generate unhelpful, hallucinated, or harmful responses due to limited alignment with human preferences and weak multi-turn interaction abilities. |
The approach categorizes NLF into 'critique' for evaluating response quality and 'refinement' for suggesting improvements. DRESS is trained using a generalized conditional reinforcement learning algorithm to incorporate this non-differentiable feedback. |
DRESS generates responses that are significantly more helpful, honest, and harmless compared to state-of-the-art LVLMs.
The model demonstrates superior multi-turn interaction ability, effectively learning from feedback to refine responses iteratively.
The paper introduces a new dataset, VLSafe, designed for evaluating and aligning LVLMs for harmlessness. |
The reliance on GPT-4 for feedback and evaluation introduces a dependency on its capabilities and limitations.
Future work could explore scaling up the RLAIF stage using web-scale data and developing more sophisticated refinement NLF modeling techniques. |
large vision language models, natural language feedback, alignment, multi-turn interaction, harmlessness |
2311.09753
Report |
DIFFNAT: Improving Diffusion Image Quality Using Natural Image Statistics |
Aniket Roy, Maiterya Suin, Anshul Shah, Ketul Shah, Jiang Liu, Rama Chellappa |
Diffusion models have advanced generative AI significantly in terms of
editing and creating naturalistic images. However, efficiently improving
generated image quality is still of paramount interest. In this context, we
propose a generic "naturalness" preserving loss function, viz., kurtosis
concentration (KC) loss, which can be readily applied to any standard diffusion
model pipeline to elevate the image quality. Our motivation stems from the
projected kurtosis concentration property of natural images, which states that
natural images have nearly constant kurtosis values across different band-pass
versions of the image. To retain the "naturalness" of the generated images, we
enforce reducing the gap between the highest and lowest kurtosis values across
the band-pass versions (e.g., Discrete Wavelet Transform (DWT)) of images. Note
that our approach does not require any additional guidance like classifier or
classifier-free guidance to improve the image quality. We validate the proposed
approach for three diverse tasks, viz., (1) personalized few-shot finetuning
using text guidance, (2) unconditional image generation, and (3) image
super-resolution. Integrating the proposed KC loss has improved the perceptual
quality across all these tasks in terms of FID, MUSIQ score, and user
evaluation. |
This paper proposes DiffNat, a novel kurtosis concentration (KC) loss function to improve the image quality of diffusion models by leveraging the statistical properties of natural images. |
Despite advancements in diffusion models, generated images can lack naturalness, especially in few-shot learning scenarios. This new loss function aims to address this limitation. |
The KC loss leverages the kurtosis concentration property of natural images, which states that the kurtosis values across different bandpass filtered versions of an image tend to be constant. The KC loss minimizes the difference between maximum and minimum kurtosis values across DWT filtered versions of the generated image, thereby enhancing naturalness. |
Adding the KC loss to DreamBooth and Custom Diffusion for few-shot finetuning results in improved image quality as measured by FID and MUSIQ scores.
Integrating the KC loss with DDPM for unconditional image generation leads to better perceptual quality across diverse datasets.
Incorporating the KC loss in image super-resolution diffusion models (Guided Diffusion and Latent Diffusion) significantly enhances the perceptual quality of super-resolved images. |
The paper primarily focuses on visual quality improvement and does not explicitly address potential limitations related to computational overhead or generalization ability.
Future work could explore the application of KC loss to other generative tasks and investigate its effectiveness in conjunction with different diffusion model architectures. |
diffusion models, image quality, natural image statistics, kurtosis concentration, generative ai |
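The kurtosis concentration loss summarized in the DiffNat report above (the gap between the largest and smallest kurtosis across DWT band-pass versions of an image) can be sketched in NumPy. This version uses a hand-written single-level Haar transform and only its three detail subbands; the choice of wavelet, the number of levels, and all function names are assumptions for illustration. A training implementation would need the same computation in a differentiable framework.

```python
import numpy as np


def haar_detail_subbands(img):
    """Single-level 2D Haar transform; returns the three detail (band-pass) subbands."""
    img = img[: img.shape[0] // 2 * 2, : img.shape[1] // 2 * 2]  # crop to even size
    a, b = img[0::2, 0::2], img[0::2, 1::2]
    c, d = img[1::2, 0::2], img[1::2, 1::2]
    vertical_diff = (a + b - c - d) / 2.0    # difference between adjacent rows
    horizontal_diff = (a - b + c - d) / 2.0  # difference between adjacent columns
    diagonal_diff = (a - b - c + d) / 2.0
    return [vertical_diff, horizontal_diff, diagonal_diff]


def kurtosis(x, eps=1e-12):
    """Fourth standardized moment (not excess kurtosis)."""
    centered = x.ravel() - x.mean()
    variance = np.mean(centered ** 2)
    return float(np.mean(centered ** 4) / (variance ** 2 + eps))


def kc_loss(img):
    """Gap between the largest and smallest subband kurtosis."""
    values = [kurtosis(band) for band in haar_detail_subbands(img)]
    return max(values) - min(values)


rng = np.random.default_rng(0)
gaussian_image = rng.standard_normal((256, 256))
loss = kc_loss(gaussian_image)
```

For the Gaussian test image, every detail subband is itself Gaussian, so the three kurtosis values sit close together and the loss is small; the loss grows as the subband statistics diverge from one another.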
2311.09571
Report |
3D Paintbrush: Local Stylization of 3D Shapes with Cascaded Score Distillation |
Dale Decatur, Itai Lang, Kfir Aberman, Rana Hanocka |
In this work we develop 3D Paintbrush, a technique for automatically
texturing local semantic regions on meshes via text descriptions. Our method is
designed to operate directly on meshes, producing texture maps which seamlessly
integrate into standard graphics pipelines. We opt to simultaneously produce a
localization map (to specify the edit region) and a texture map which conforms
to it. This synergistic approach improves the quality of both the localization
and the stylization. To enhance the details and resolution of the textured
area, we leverage multiple stages of a cascaded diffusion model to supervise
our local editing technique with generative priors learned from images at
different resolutions. Our technique, referred to as Cascaded Score
Distillation (CSD), simultaneously distills scores at multiple resolutions in a
cascaded fashion, enabling control over both the granularity and global
understanding of the supervision. We demonstrate the effectiveness of 3D
Paintbrush to locally texture a variety of shapes within different semantic
regions. Project page: https://threedle.github.io/3d-paintbrush |
3D Paintbrush is a method for automatically texturing local semantic regions on 3D meshes using text descriptions, producing texture maps compatible with standard graphics pipelines. |
Existing 3D editing methods struggle with precise local edits based on text prompts. 3D Paintbrush addresses this by generating both detailed texture maps and accurate localization maps for specified regions on meshes. |
The method uses neural networks to represent localization and texture maps. It leverages a novel Cascaded Score Distillation (CSD) technique that utilizes multiple stages of a cascaded diffusion model for high-resolution, text-driven supervision. |
3D Paintbrush generates highly detailed and localized textures on a variety of 3D shapes.
Simultaneous optimization of localization and texture maps improves the quality and detail of both.
CSD allows for control over the granularity and global understanding of the text-driven supervision, enabling high-resolution results. |
Currently, editing capabilities are limited to textures.
Future work includes expanding to other localized edits like deformations and materials, as well as co-texturing multiple shapes. |
3d texturing, local editing, text-to-3d, cascaded diffusion models, score distillation |
2311.09221
Report |
Single-Image 3D Human Digitization with Shape-Guided Diffusion |
Badour AlBahar, Shunsuke Saito, Hung-Yu Tseng, Changil Kim, Johannes Kopf, Jia-Bin Huang |
We present an approach to generate a 360-degree view of a person with a
consistent, high-resolution appearance from a single input image. NeRF and its
variants typically require videos or images from different viewpoints. Most
existing approaches taking monocular input either rely on ground-truth 3D scans
for supervision or lack 3D consistency. While recent 3D generative models show
promise of 3D consistent human digitization, these approaches do not generalize
well to diverse clothing appearances, and the results lack photorealism. Unlike
existing work, we utilize high-capacity 2D diffusion models pretrained for
general image synthesis tasks as an appearance prior of clothed humans. To
achieve better 3D consistency while retaining the input identity, we
progressively synthesize multiple views of the human in the input image by
inpainting missing regions with shape-guided diffusion conditioned on
silhouette and surface normal. We then fuse these synthesized multi-view images
via inverse rendering to obtain a fully textured high-resolution 3D mesh of the
given person. Experiments show that our approach outperforms prior methods and
achieves photorealistic 360-degree synthesis of a wide range of clothed humans
with complex textures from a single image. |
This paper presents a novel approach to generate a 360-degree view of a person with consistent, high-resolution appearance from a single image. |
Creating photorealistic 3D human models typically requires multi-view images or 3D scans, which are difficult to obtain. This work aims to address this challenge by enabling personalized 3D human digitization from easily accessible single images. |
The method leverages a pre-trained 2D diffusion model for general image synthesis as a human appearance prior. It reconstructs the 3D geometry, synthesizes multi-view images via shape-guided diffusion inpainting using normal and silhouette maps, and finally fuses these images into a textured 3D mesh. |
The approach outperforms previous methods in generating high-fidelity textured 3D humans from single images.
It effectively leverages the power of large-scale pre-trained 2D diffusion models for 3D human digitization.
Shape guidance using both normal and silhouette maps during inpainting significantly improves the preservation of shape and structural details. |
The approach currently relies on off-the-shelf methods for base geometry reconstruction and back-view synthesis, inheriting their limitations.
The generated textures lack view-dependency, which could be addressed in future work. |
digital humans, single-image 3d reconstruction, diffusion models, shape-guided synthesis, multi-view fusion |
2311.09215
Report |
ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy |
Kirill Vishniakov, Zhiqiang Shen, Zhuang Liu |
Modern computer vision offers a great variety of models to practitioners, and
selecting a model from multiple options for specific applications can be
challenging. Conventionally, competing model architectures and training
protocols are compared by their classification accuracy on ImageNet. However,
this single metric does not fully capture performance nuances critical for
specialized tasks. In this work, we conduct an in-depth comparative analysis of
model behaviors beyond ImageNet accuracy, for both ConvNet and Vision
Transformer architectures, each across supervised and CLIP training paradigms.
Although our selected models have similar ImageNet accuracies and compute
requirements, we find that they differ in many other aspects: types of
mistakes, output calibration, transferability, and feature invariance, among
others. This diversity in model characteristics, not captured by traditional
metrics, highlights the need for more nuanced analysis when choosing among
different models. Our code is available at
https://github.com/kirill-vish/Beyond-INet. |
This paper presents a comparative analysis of ConvNet (ConvNeXt) and Vision Transformer (ViT) models trained with supervised and CLIP paradigms, going beyond traditional ImageNet accuracy evaluation to explore their behavioral nuances. |
Selecting appropriate vision models for specific tasks is challenging with numerous architectures and training methods. Relying solely on ImageNet accuracy is insufficient as it overlooks important model behaviors, particularly for specialized applications. |
The authors analyze four pretrained models (ConvNeXt and ViT, each with supervised and CLIP training) with similar ImageNet accuracies and computational costs. They evaluate various properties like model mistakes, shape/texture bias, calibration, robustness, transferability, performance on synthetic data, and transformation invariance. |
CLIP models exhibit better transferability and fewer classification errors relative to their ImageNet accuracy, while supervised models excel in robustness benchmarks and calibration.
Supervised ConvNeXt demonstrates strong performance across various benchmarks, including transferability, challenging the dominance of CLIP models in this aspect.
ConvNeXt outperforms ViT on synthetic data, while ViT shows a higher shape bias. This highlights architecture-specific strengths and weaknesses beyond ImageNet performance. |
The robustness evaluation is limited to ImageNet variants, potentially biasing the results.
The study primarily focuses on pretrained models, neglecting the impact of fine-tuning on specific downstream tasks. |
model selection, convnext, vision transformer, clip, benchmarking |
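Calibration, one of the properties compared in the report above, is commonly quantified with the expected calibration error (ECE). The sketch below is a generic equal-width-bin ECE in NumPy; the bin count and function name are assumptions for illustration, and the paper's exact metric settings are not stated in this summary.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between accuracy and mean confidence per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for low, high in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > low) & (confidences <= high)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)


# A model that is 75% confident and right 3 times out of 4 is well calibrated;
# one that is always fully confident but right half the time is not.
well_calibrated = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
overconfident = expected_calibration_error([1.0] * 4, [1, 0, 1, 0])
```

Two models with the same accuracy can have very different ECE values, which is why the report treats calibration as a separate axis from ImageNet accuracy.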
2311.09191
Report |
Domain Aligned CLIP for Few-shot Classification |
Muhammad Waleed Gondal, Jochen Gast, Inigo Alonso Ruiz, Richard Droste, Tommaso Macri, Suren Kumar, Luitpold Staudigl |
Large vision-language representation learning models like CLIP have
demonstrated impressive performance for zero-shot transfer to downstream tasks
while largely benefiting from inter-modal (image-text) alignment via
contrastive objectives. This downstream performance can further be enhanced by
full-scale fine-tuning which is often compute intensive, requires large
labelled data, and can reduce out-of-distribution (OOD) robustness.
Furthermore, sole reliance on inter-modal alignment might overlook the rich
information embedded within each individual modality. In this work, we
introduce a sample-efficient domain adaptation strategy for CLIP, termed Domain
Aligned CLIP (DAC), which improves both intra-modal (image-image) and
inter-modal alignment on target distributions without fine-tuning the main
model. For intra-modal alignment, we introduce a lightweight adapter that is
specifically trained with an intra-modal contrastive objective. To improve
inter-modal alignment, we introduce a simple framework to modulate the
precomputed class text embeddings. The proposed few-shot fine-tuning framework
is computationally efficient, robust to distribution shifts, and does not alter
CLIP's parameters. We study the effectiveness of DAC by benchmarking on 11
widely used image classification tasks with consistent improvements in 16-shot
classification upon strong baselines by about 2.3% and demonstrate competitive
performance on 4 OOD robustness benchmarks. |
This paper proposes Domain Aligned CLIP (DAC), a sample-efficient domain adaptation strategy for CLIP that improves few-shot classification by aligning both intra-modal (image-image) and inter-modal (image-text) representations on target distributions. |
Adapting large vision-language models like CLIP to downstream tasks often requires fine-tuning, which can be resource-intensive and prone to overfitting. DAC offers a computationally efficient alternative that leverages few-shot data for improved domain adaptation. |
DAC utilizes a two-stage adaptation strategy. First, a lightweight adapter layer is trained with a supervised contrastive objective to improve intra-modal alignment. Second, the precomputed class text embeddings are fine-tuned to enhance inter-modal alignment, resulting in DAC-VT. |
DAC-VT consistently outperforms competitive few-shot CLIP adaptation baselines on 11 image classification benchmarks, demonstrating the effectiveness of aligning both intra- and inter-modal representations.
DAC-V, which only aligns visual features, shows better robustness to distribution shifts compared to methods focusing solely on inter-modal alignment.
Analysis reveals that DAC's intra- and inter-modal classifiers make uncorrelated errors, leading to improved performance through ensembling. |
The two-stage adaptation process increases the computational overhead during fine-tuning compared to some baselines.
The ensemble of intra- and inter-modal classifiers could be improved further, since the two still make uncorrelated errors. |
few-shot learning, domain adaptation, vision-language models, contrastive learning, clip |
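The DAC report above attributes part of its gain to ensembling an intra-modal (image-image) classifier with an inter-modal (image-text) one. A minimal sketch of that idea follows. The mean-similarity aggregation, the weight `alpha`, and all function names are illustrative assumptions, not the paper's exact formulation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def inter_modal_logits(image_feat, class_text_feats):
    """Image-text scores: similarity of the query image to each class text embedding."""
    return [cosine(image_feat, t) for t in class_text_feats]

def intra_modal_logits(image_feat, support_feats, support_labels, num_classes):
    """Image-image scores: mean similarity to the few-shot images of each class
    (an assumed aggregation, chosen for simplicity)."""
    sums = [0.0] * num_classes
    counts = [0] * num_classes
    for feat, label in zip(support_feats, support_labels):
        sums[label] += cosine(image_feat, feat)
        counts[label] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

def ensemble_predict(image_feat, class_text_feats, support_feats, support_labels, alpha=1.0):
    """Add the two score vectors; alpha (hypothetical) weights the intra-modal term."""
    inter = inter_modal_logits(image_feat, class_text_feats)
    intra = intra_modal_logits(image_feat, support_feats, support_labels,
                               len(class_text_feats))
    combined = [a + alpha * b for a, b in zip(inter, intra)]
    return combined.index(max(combined)), combined

# Toy 2-class example with 2-D features.
text_feats = [[1.0, 0.0], [0.0, 1.0]]
support_feats = [[1.0, 0.2], [0.2, 1.0]]
support_labels = [0, 1]
pred, scores = ensemble_predict([0.9, 0.1], text_feats, support_feats, support_labels)
```

When the two classifiers err on different samples, the summed scores can recover cases that either one alone would miss, which is the intuition behind the reported ensembling benefit.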
2311.08403
Report |
Instant3D: Instant Text-to-3D Generation |
Ming Li, Pan Zhou, Jia-Wei Liu, Jussi Keppo, Min Lin, Shuicheng Yan, Xiangyu Xu |
Text-to-3D generation has attracted much attention from the computer vision
community. Existing methods mainly optimize a neural field from scratch for
each text prompt, incurring heavy and repetitive training costs that impede
their practical deployment. In this paper, we propose a novel framework for
fast text-to-3D generation, dubbed Instant3D. Once trained, Instant3D is able
to create a 3D object for an unseen text prompt in less than one second with a
single run of a feedforward network. We achieve this remarkable speed by
devising a new network that directly constructs a 3D triplane from a text
prompt. The core innovation of our Instant3D lies in our exploration of
strategies to effectively inject text conditions into the network. In
particular, we propose to combine three key mechanisms: cross-attention, style
injection, and token-to-plane transformation, which collectively ensure precise
alignment of the output with the input text. Furthermore, we propose a simple
yet effective activation function, the scaled-sigmoid, to replace the original
sigmoid function, which speeds up the training convergence by more than ten
times. Finally, to address the Janus (multi-head) problem in 3D generation, we
propose an adaptive Perp-Neg algorithm that can dynamically adjust its concept
negation scales according to the severity of the Janus problem during training,
effectively reducing the multi-head effect. Extensive experiments on a wide
variety of benchmark datasets demonstrate that the proposed algorithm performs
favorably against the state-of-the-art methods both qualitatively and
quantitatively, while achieving significantly better efficiency. The code,
data, and models are available at https://github.com/ming1993li/Instant3DCodes. |
This paper presents Instant3D, a novel framework for fast text-to-3D generation capable of creating a 3D object from an unseen text prompt in under one second with a single run of a feedforward network. |
Existing text-to-3D generation methods rely on computationally expensive optimization for each new text prompt, hindering their practical deployment due to slow response times and lack of shared 3D priors across objects. |
Instant3D leverages a conditional feedforward network that directly constructs a 3D triplane representation from a text prompt. It employs three key mechanisms for effective text condition injection: cross-attention, style injection with Adaptive Instance Normalization (AdaIN), and token-to-plane transformation. It also introduces a scaled-sigmoid activation function for faster training convergence and an adaptive Perp-Neg algorithm to address the multi-head problem. |
Instant3D achieves high-quality 3D generation with accurate text-3D alignment, outperforming state-of-the-art methods qualitatively and quantitatively on various benchmark datasets.
The proposed method demonstrates superior efficiency, generating 3D objects in under a second compared to hours required by existing optimization-based approaches.
Ablation studies confirm the effectiveness of each proposed component, highlighting their contributions to fast and accurate text-to-3D generation. |
Current benchmark prompt sets, while diverse, are relatively small compared to text-to-image datasets, limiting the model's generalization ability.
The computational cost of training text-to-3D networks remains high, posing challenges for scaling up to larger datasets. |
text-to-3d generation, neural radiance fields, deep learning, computer vision, generative models |
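The Instant3D report above lists style injection with Adaptive Instance Normalization (AdaIN) among its text-conditioning mechanisms. As a hedged sketch: AdaIN normalizes a feature channel to zero mean and unit variance, then rescales and shifts it with condition-dependent parameters. The linear map from a text embedding to a scalar (`style_param`) is a hypothetical stand-in for whatever network the authors use.

```python
import math

def adain(channel, scale, shift, eps=1e-5):
    """Normalize one feature channel over its positions, then apply scale and shift."""
    n = len(channel)
    mean = sum(channel) / n
    var = sum((x - mean) ** 2 for x in channel) / n
    std = math.sqrt(var + eps)
    return [scale * (x - mean) / std + shift for x in channel]

def style_param(text_emb, weights, bias):
    """Hypothetical linear layer mapping a text embedding to one style scalar."""
    return sum(w * t for w, t in zip(weights, text_emb)) + bias

# Toy example: a 4-position channel conditioned on a 2-D text embedding.
text_emb = [1.0, 2.0]
scale = style_param(text_emb, [0.5, 0.75], 0.0)   # 0.5*1 + 0.75*2 = 2.0
shift = style_param(text_emb, [1.0, 2.0], 0.0)    # 1*1 + 2*2 = 5.0
styled = adain([1.0, 2.0, 3.0, 4.0], scale, shift)

out_mean = sum(styled) / len(styled)
out_std = math.sqrt(sum((x - out_mean) ** 2 for x in styled) / len(styled))
```

After the call, the channel's mean and standard deviation are set by the text-derived `shift` and `scale`, so the prompt controls feature statistics rather than individual activations.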
2311.08400
Report |
Towards Open-Ended Visual Recognition with Large Language Model |
Qihang Yu, Xiaohui Shen, Liang-Chieh Chen |
Localizing and recognizing objects in the open-ended physical world poses a
long-standing challenge within the domain of machine perception. Recent methods
have endeavored to address the issue by employing a class-agnostic mask (or
box) proposal model, complemented by an open-vocabulary classifier (e.g., CLIP)
using pre-extracted text embeddings. However, it is worth noting that these
open-vocabulary recognition models still exhibit limitations in practical
applications. On one hand, they rely on the provision of class names during
testing, where the recognition performance heavily depends on this predefined
set of semantic classes by users. On the other hand, when training with
multiple datasets, human intervention is required to alleviate the label
definition conflict between them. In this paper, we introduce the OmniScient
Model (OSM), a novel Large Language Model (LLM) based mask classifier, as a
straightforward and effective solution to the aforementioned challenges.
Specifically, OSM predicts class labels in a generative manner, thus removing
the need to supply class names during both training and testing. It also enables
cross-dataset training without any human interference, exhibiting robust
generalization capabilities due to the world knowledge acquired from the LLM.
By combining OSM with an off-the-shelf mask proposal model, we present
promising results on various benchmarks, and demonstrate its effectiveness in
handling novel concepts. Code/model are available at
https://github.com/bytedance/OmniScient-Model. |
This paper presents OmniScient Model (OSM), a novel generative framework for open-ended recognition tasks that leverages a Large Language Model (LLM) to predict class labels directly without predefined vocabularies. |
Existing open-vocabulary recognition models rely on predefined class names, hindering their applicability to real-world scenarios with novel concepts and complicating training with multiple datasets. |
OSM combines a frozen CLIP-ViT for feature extraction, a trainable bridging module (Mask Query Former) for mask-aware feature resampling, and a frozen LLM for generative class label prediction. The model is trained with an instruction tuning approach on multiple segmentation datasets. |
OSM achieves comparable accuracy to discriminative models when evaluated on mask classification with ground-truth masks, demonstrating the effectiveness of generative models for discriminative tasks.
OSM exhibits strong generalization ability, achieving state-of-the-art performance on open-vocabulary benchmarks and handling novel concepts beyond predefined vocabularies.
The proposed Mode Query mechanism allows OSM to balance between vocabulary-specific and vocabulary-agnostic predictions, making it adaptable to diverse real-world scenarios. |
The trade-off between accuracy and generalization ability in OSM requires further investigation to mitigate potential overfitting to training vocabularies.
Exploring stronger base models and larger datasets with better diversity could further improve OSM's performance and expressiveness in class label prediction. |
open-ended recognition, open-vocabulary, generative model, large language model, segmentation |
2311.08046
Report |
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding |
Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, Li Yuan |
Large language models have demonstrated impressive universal capabilities
across a wide range of open-ended tasks and have extended their utility to
encompass multimodal conversations. However, existing methods encounter
challenges in effectively handling both image and video understanding,
particularly with limited visual tokens. In this work, we introduce Chat-UniVi,
a Unified Vision-language model capable of comprehending and engaging in
conversations involving images and videos through a unified visual
representation. Specifically, we employ a set of dynamic visual tokens to
uniformly represent images and videos. This representation framework empowers
the model to efficiently utilize a limited number of visual tokens to
simultaneously capture the spatial details necessary for images and the
comprehensive temporal relationship required for videos. Moreover, we leverage
a multi-scale representation, enabling the model to perceive both high-level
semantic concepts and low-level visual details. Notably, Chat-UniVi is trained
on a mixed dataset containing both images and videos, allowing direct
application to tasks involving both mediums without requiring any
modifications. Extensive experimental results demonstrate that Chat-UniVi
consistently outperforms even existing methods exclusively designed for either
images or videos. Code is available at
https://github.com/PKU-YuanGroup/Chat-UniVi. |
This paper introduces Chat-UniVi, a unified vision-language model capable of understanding and engaging in conversations involving both images and videos through a shared representation framework. |
Existing methods for multimodal conversations often specialize in either image or video understanding, struggling to effectively capture both spatial details and temporal relationships with limited visual tokens. |
Chat-UniVi leverages dynamic visual tokens to uniformly represent images and videos. It employs a token merging method based on the DPC-KNN clustering algorithm to progressively merge visual tokens with similar semantic meanings, reducing the token number while preserving crucial information. Additionally, a multi-scale representation is used to capture both high-level semantic concepts and low-level visual details. |
Chat-UniVi consistently outperforms existing methods exclusively designed for either images or videos in both GPT-based and question-answering evaluations.
The model achieves impressive results in object hallucination benchmarks, indicating its strong capability to comprehend visual content and resist generating unrealistic descriptions.
Joint training on a mixed dataset of images and videos is shown to be crucial, allowing Chat-UniVi to excel in tasks involving both media types without requiring any modifications. |
The model currently relies on the capabilities of pre-trained large language models, inheriting their potential vulnerabilities such as hallucination and limitations in long sequence processing.
While natural language serves as a flexible interface for various tasks, it might not be optimal for tasks demanding structured outputs or generating dense predictions. |
multimodal learning, vision-language model, large language model, dynamic visual tokens, multi-scale representation |
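The Chat-UniVi report above says visual tokens are merged with a method based on DPC-KNN clustering. The sketch below follows the usual DPC-KNN recipe: estimate local density from the k nearest neighbours, measure the distance to the nearest denser point, pick the highest density-times-distance points as centers, and average each cluster. Merging by plain averaging and the function names are assumptions; the paper applies merging progressively over several stages.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two token vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dpc_knn_merge(tokens, num_clusters, k):
    """Cluster tokens with DPC-KNN and return (merged_tokens, assignment)."""
    n = len(tokens)
    d2 = [[sq_dist(tokens[i], tokens[j]) for j in range(n)] for i in range(n)]

    # Local density: exp of minus the mean squared distance to the k nearest neighbours.
    rho = []
    for i in range(n):
        nearest = sorted(d2[i][j] for j in range(n) if j != i)[:k]
        rho.append(math.exp(-sum(nearest) / len(nearest)))

    # Distance to the closest denser token (or the farthest token for the densest one).
    delta = []
    for i in range(n):
        denser = [math.sqrt(d2[i][j]) for j in range(n) if rho[j] > rho[i]]
        delta.append(min(denser) if denser else math.sqrt(max(d2[i])))

    # Centers are the tokens with the largest density * distance score.
    order = sorted(range(n), key=lambda i: rho[i] * delta[i], reverse=True)
    centers = order[:num_clusters]

    # Assign every token to its nearest center, then average each cluster.
    assign = [min(range(num_clusters), key=lambda c: d2[i][centers[c]]) for i in range(n)]
    dim = len(tokens[0])
    merged = []
    for c in range(num_clusters):
        members = [tokens[i] for i in range(n) if assign[i] == c]
        merged.append([sum(m[d] for m in members) / len(members) for d in range(dim)])
    return merged, assign

# Toy example: two well-separated groups of 2-D tokens reduced to two merged tokens.
tokens = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
merged, assign = dpc_knn_merge(tokens, num_clusters=2, k=2)
```

Because the number of clusters is a free parameter, the same routine can yield a coarse or a fine token set, which matches the report's description of reducing token count while keeping the important information.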
2311.07885
Report |
One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion |
Minghua Liu, Ruoxi Shi, Linghao Chen, Zhuoyang Zhang, Chao Xu, Xinyue Wei, Hansheng Chen, Chong Zeng, Jiayuan Gu, Hao Su |
Recent advancements in open-world 3D object generation have been remarkable,
with image-to-3D methods offering superior fine-grained control over their
text-to-3D counterparts. However, most existing models fall short in
simultaneously providing rapid generation speeds and high fidelity to input
images - two features essential for practical applications. In this paper, we
present One-2-3-45++, an innovative method that transforms a single image into
a detailed 3D textured mesh in approximately one minute. Our approach aims to
fully harness the extensive knowledge embedded in 2D diffusion models and
priors from valuable yet limited 3D data. This is achieved by initially
finetuning a 2D diffusion model for consistent multi-view image generation,
followed by elevating these images to 3D with the aid of multi-view conditioned
3D native diffusion models. Extensive experimental evaluations demonstrate that
our method can produce high-quality, diverse 3D assets that closely mirror the
original input image. Our project webpage:
https://sudo-ai-3d.github.io/One2345plus_page. |
Presents One-2-3-45++, a method that generates textured 3D meshes from a single image in approximately one minute, leveraging 2D diffusion models and 3D data priors for fast generation and high fidelity to the input. |
Addresses limitations of existing image-to-3D methods that are either slow and high-quality (optimization-based) or fast and low-quality (feed-forward). |
1. **Consistent Multi-view Generation:** Fine-tunes a 2D diffusion model to generate consistent multi-view images from a single input image. 2. **3D Diffusion with Multi-View Condition:** Employs a multi-view conditioned 3D diffusion model to generate a textured mesh from the multi-view images. 3. **Texture Refinement:** Uses a lightweight optimization technique to refine the texture of the generated mesh using the multi-view images. |
Achieves state-of-the-art results on the GSO dataset in terms of F-Score, CLIP similarity, and user preference.
Outperforms existing text-to-3D methods in terms of CLIP similarity and user preference.
Demonstrates significant speed advantages over optimization-based methods while maintaining high fidelity to the input image. |
Potential to improve geometry robustness and detail by incorporating additional guiding conditions from 2D diffusion models.
Reliance on accurate camera pose estimation for multi-view generation. |
3d generation, image-to-3d, text-to-3d, diffusion models, multi-view consistency |
2311.07414
Report |
FIRST: A Million-Entry Dataset for Text-Driven Fashion Synthesis and Design |
Zhen Huang, Yihao Li, Dong Pei, Jiapeng Zhou, Xuliang Ning, Jianlin Han, Xiaoguang Han, Xuejun Chen |
Text-driven fashion synthesis and design is an extremely valuable part of
artificial intelligence generated content (AIGC), which has the potential to
propel a tremendous revolution in the traditional fashion industry. To advance
the research on text-driven fashion synthesis and design, we introduce a new
dataset comprising a million high-resolution fashion images with rich
structured textual (FIRST) descriptions. FIRST covers a wide range of
attire categories, and each image-paired textual description is organized at
multiple hierarchical levels. Experiments on prevalent generative models
trained on FIRST demonstrate its necessity. We invite the community to
further develop more intelligent fashion synthesis and design systems that make
fashion design more creative and imaginative based on our dataset. The dataset
will be released soon. |
Introduces FIRST, a million-entry dataset of high-resolution fashion images with rich, structured textual descriptions for advancing text-driven fashion synthesis and design. |
Existing fashion datasets lack either textual descriptions or have limited scale and unstructured text, hindering the development of intelligent fashion design systems. |
Collected over a million raw images from the internet and commercial partners, cleaned for quality, and hierarchically annotated using GPT-4V and human revision. |
FIRST is the largest fashion dataset with hierarchical annotations, covering diverse attire categories and photographic scenes.
Fine-tuning Stable Diffusion on FIRST significantly improves FID and CLIP-S scores, demonstrating enhanced generation quality and text control.
Human feedback confirms improved quality and text-image alignment of generated images after fine-tuning on FIRST. |
Current diffusion models struggle with long text prompts like those in FIRST, limiting their capacity to handle detailed descriptions.
Generating cohesive fashion collections from shared design philosophies remains a challenge, requiring models to understand abstract concepts and translate them into coherent visual styles. |
fashion synthesis, text-to-image generation, dataset, diffusion models, computer vision |
2311.06978
Report |
Augmented Bridge Matching |
Valentin De Bortoli, Guan-Horng Liu, Tianrong Chen, Evangelos A. Theodorou, Weilie Nie |
Flow and bridge matching are a novel class of processes which encompass
diffusion models. One of the main aspects of their increased flexibility is that
these models can interpolate between arbitrary data distributions, i.e., they
generalize beyond generative modeling and can be applied to learning stochastic
(and deterministic) processes of arbitrary transfer tasks between two given
distributions. In this paper, we highlight that while flow and bridge matching
processes preserve the information of the marginal distributions, they do
not necessarily preserve the coupling information unless additional,
stronger optimality conditions are met. This can be problematic if one aims at
preserving the original empirical pairing. We show that a simple modification
of the matching process recovers this coupling by augmenting the velocity field
(or drift) with the information of the initial sample point. Doing so, we lose
the Markovian property of the process but preserve the coupling information
between distributions. We illustrate the efficiency of our augmentation in
learning a mixture of image translation tasks. |
This paper investigates flow/bridge matching, showing that while it preserves marginal distributions, it doesn't always preserve coupling information, crucial for tasks like image translation where paired data relationships are key. |
Preserving coupling information is important in applications like image translation where the paired training data encodes the relationship between degraded and clean images. |
The authors leverage Doob h-transform theory to analyze flow/bridge matching fixed points and propose "Augmented Bridge Matching," which modifies the drift term to explicitly incorporate initial sample information, thus preserving the coupling. |
Bridge matching preserves the original coupling if and only if the training coupling is the optimal transport coupling (Schrödinger Bridge).
Augmenting the bridge matching drift term with initial sample information allows for the preservation of the training coupling.
Augmented Bridge Matching outperforms standard bridge matching in multi-domain image-to-image translation tasks, both qualitatively and quantitatively (using FID). |
The impact of intermediate augmentation levels (conditioning on X_{αt} with α ∈ (0,1)) on coupling preservation remains unclear.
High entropy in the training coupling can hinder the training process due to increased loss variance. |
diffusion models, bridge matching, coupling preservation, image translation, doob h-transform |
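The difference between standard and augmented bridge matching described above is easiest to see in how a training example is built. The sketch assumes a scalar Brownian bridge between a paired sample `(x0, x1)` and the standard bridge drift target `(x1 - x_t) / (1 - t)`. The only change in the augmented variant is that the drift network also receives `x0`. Function names and the scalar setting are illustrative.

```python
import math
import random

def sample_bridge_point(x0, x1, t, sigma, rng):
    """Sample X_t from a Brownian bridge pinned at x0 (t=0) and x1 (t=1)."""
    mean = (1.0 - t) * x0 + t * x1
    std = sigma * math.sqrt(t * (1.0 - t))
    return mean + std * rng.gauss(0.0, 1.0)

def drift_target(x_t, x1, t):
    """Regression target: the drift that steers X_t towards x1 by time 1."""
    return (x1 - x_t) / (1.0 - t)

def make_training_example(x0, x1, t, sigma, rng, augmented):
    """Return (network_inputs, target) for one paired sample.

    Standard bridge matching conditions the drift on (t, x_t) only.
    The augmented variant also passes the initial point x0, which is
    what lets the learned process keep track of the original pairing.
    """
    x_t = sample_bridge_point(x0, x1, t, sigma, rng)
    target = drift_target(x_t, x1, t)
    inputs = (t, x_t, x0) if augmented else (t, x_t)
    return inputs, target

# With sigma = 0 the bridge is a straight line, so the target equals x1 - x0.
rng = random.Random(0)
inputs, target = make_training_example(1.0, 3.0, 0.25, 0.0, rng, augmented=True)
plain_inputs, _ = make_training_example(1.0, 3.0, 0.25, 0.0, rng, augmented=False)
```

A drift network would then be fit by minimizing the squared error between its output on `inputs` and `target`. Conditioning on `x0` makes the process non-Markovian, as the abstract notes.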
2311.06791
Report |
InfMLLM: A Unified Framework for Visual-Language Tasks |
Qiang Zhou, Zhibin Wang, Wei Chu, Yinghui Xu, Hao Li, Yuan Qi |
Large language models (LLMs) have proven their remarkable versatility in
handling a comprehensive range of language-centric applications. To expand
LLMs' capabilities to a broader spectrum of modal inputs, multimodal large
language models (MLLMs) have attracted growing interest. This work delves into
enabling LLMs to tackle more vision-language-related tasks, particularly image
captioning, visual question answering (VQA), and visual grounding. To this end,
we implemented a three-stage training scheme: starting with lightweight
alignment pretraining, then moderate-weight multitask hybrid training, and
finally, LLM fine-tuning to improve instruction following capability.
Throughout the training process, the requirements on GPU memory gradually
increase. To effectively manage the number of visual embeddings passed to the
LLM while preserving their positional information, we introduce a
straightforward visual adapter module dubbed pool-adapter. Our experiments
demonstrate that preserving the positional information of visual embeddings
through the pool-adapter is particularly beneficial for tasks like visual
grounding. We name our proposed approach InfMLLM and have evaluated it
extensively on various benchmark datasets. Our results demonstrate that InfMLLM
achieves either state-of-the-art (SOTA) performance or performance comparable
to recent MLLMs. The code and model will be made open-source at:
https://github.com/mightyzau/InfMLLM. |
Presents InfMLLM, a MultiModal Large Language Model framework that uses a pool-adapter to adjust the number of image embeddings dynamically while preserving positional information for enhanced performance in vision-language tasks |
Extends the capabilities of LLMs to multimodal domains, enabling them to handle tasks like image captioning, visual question answering, and visual grounding more effectively |
Implements a three-stage training scheme: lightweight alignment pretraining of a visual adapter, moderate-weight multitask hybrid training, and LLM fine-tuning for improved instruction following. Introduces pool-adapter to align visual features with text embeddings while maintaining positional information |
InfMLLM achieves state-of-the-art results in visual grounding and visual question answering tasks
Demonstrates competitive performance in image captioning and text-oriented VQA tasks
Shows that increasing visual embeddings generally improves performance, and online adjustment of embedding quantity offers a balance between speed and accuracy |
Multitask finetuning presents optimization conflicts between individual tasks, requiring careful tuning of loss weights and data ratios
Exploring more effective solutions for multitask finetuning is crucial |
multimodal learning, large language models, vision-language tasks, image captioning, visual question answering, visual grounding |
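The InfMLLM report above describes a pool-adapter that reduces the number of visual embeddings while keeping their positional layout. One way to picture this, under the assumption that the adapter pools over a 2-D grid of patch embeddings, is block-wise average pooling: each output cell summarizes one spatial block, so neighbouring outputs still correspond to neighbouring image regions. The function name, average pooling, and evenly dividing grid sizes are all illustrative assumptions. Any projection into the LLM embedding space is omitted.

```python
def pool_grid(grid, out_h, out_w):
    """Average-pool an H x W grid of embedding vectors down to out_h x out_w."""
    in_h, in_w = len(grid), len(grid[0])
    if in_h % out_h or in_w % out_w:
        raise ValueError("output size must evenly divide the input grid")
    block_h, block_w = in_h // out_h, in_w // out_w
    dim = len(grid[0][0])
    pooled = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Collect the embeddings inside this spatial block.
            cell = [grid[r * block_h + i][c * block_w + j]
                    for i in range(block_h) for j in range(block_w)]
            row.append([sum(v[d] for v in cell) / len(cell) for d in range(dim)])
        pooled.append(row)
    return pooled

# Toy 4x4 grid whose embedding at (row, col) is simply [row, col].
grid = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
pooled = pool_grid(grid, 2, 2)
num_tokens_before = 4 * 4
num_tokens_after = len(pooled) * len(pooled[0])
```

Choosing a different output size changes the token count passed to the LLM without retraining the pooling step, in line with the reported trade-off between speed and accuracy.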
2311.06783
Report |
Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models |
Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Kaixin Xu, Chunyi Li, Jingwen Hou, Guangtao Zhai, Geng Xue, Wenxiu Sun, Qiong Yan, Weisi Lin |
Multi-modality foundation models, as represented by GPT-4V, have brought a
new paradigm for low-level visual perception and understanding tasks, in which
a single model can respond to a broad range of natural human instructions. While
existing foundation models have shown exciting potential on low-level visual
tasks, their related abilities are still preliminary and need to be improved.
In order to enhance these models, we conduct a large-scale subjective
experiment collecting a vast number of real human feedbacks on low-level
vision. Each feedback follows a pathway that starts with a detailed description
of the low-level visual appearance (*e.g.*, clarity, color, brightness) of an
image, and ends with an overall conclusion, with an average length of 45 words.
The constructed **Q-Pathway** dataset includes 58K detailed human feedbacks on
18,973 images with diverse low-level appearance. Moreover, to enable foundation
models to robustly respond to diverse types of questions, we design a
GPT-participated conversion to process these feedbacks into 200K diverse-format
instruction-response pairs. Experimental results indicate that
**Q-Instruct** consistently elevates low-level perception and understanding
abilities across several foundational models. We anticipate that our datasets
can pave the way for a future in which general intelligence can perceive and
understand low-level visual appearance and evaluate visual quality like a
human. Our dataset, model zoo, and demo are published at:
https://q-future.github.io/Q-Instruct. |
This paper introduces Q-Instruct, the first large-scale dataset for low-level visual instruction tuning of Multi-modality Large Language Models (MLLMs). |
Existing MLLMs excel at high-level visual tasks but struggle with low-level visual perception and understanding due to the lack of dedicated training data. |
The authors first collected Q-Pathway, a dataset of 58K human text feedbacks on the low-level aspects of 18,973 images. They then used GPT to automatically transform Q-Pathway into Q-Instruct, a dataset of 200K instruction-response pairs suitable for instruction tuning. |
Fine-tuning MLLMs on Q-Instruct significantly improves their performance on low-level visual question answering (up to 17% improvement on distortion-related questions).
Q-Instruct enhances the ability of MLLMs to provide detailed descriptions of low-level visual attributes and image quality.
Remarkably, text-driven instruction tuning with Q-Instruct effectively aligns MLLMs with numerical image quality assessment, exhibiting strong generalization even to unseen image types. |
While improving low-level visual abilities, fine-tuning with Q-Instruct might compromise performance on general-purpose or reasoning-intensive tasks.
Despite the improvement, Q-Instruct tuned models still fall short of human performance and may require further development to fully replace human judgment on low-level visual tasks. |
multi-modality large language models, low-level vision, instruction tuning, image quality assessment, visual question answering |
2311.06612
Report |
PerceptionGPT: Effectively Fusing Visual Perception into LLM |
Renjie Pi, Lewei Yao, Jiahui Gao, Jipeng Zhang, Tong Zhang |
The integration of visual inputs with large language models (LLMs) has led to
remarkable advancements in multi-modal capabilities, giving rise to visual
large language models (VLLMs). However, effectively harnessing VLLMs for
intricate visual perception tasks remains a challenge. In this paper, we
present a novel end-to-end framework named PerceptionGPT, which efficiently and
effectively equips the VLLMs with visual perception abilities by leveraging the
representation power of LLMs' token embedding. Our proposed method treats the
token embedding of the LLM as the carrier of spatial information, then leverages
lightweight visual task encoders and decoders to perform visual perception
tasks (e.g., detection, segmentation). Our approach significantly alleviates
the training difficulty suffered by previous approaches that formulate the
visual outputs as discrete tokens, and enables achieving superior performance
with fewer trainable parameters, less training data and shorter training time.
Moreover, as only one token embedding is required to decode the visual outputs,
the resulting sequence length during inference is significantly reduced.
Consequently, our approach enables accurate and flexible representations,
seamless integration of visual perception tasks, and efficient handling of
multiple visual outputs. We validate the effectiveness and efficiency of our
approach through extensive experiments. The results demonstrate significant
improvements over previous methods with much fewer trainable parameters and GPU
hours, which facilitates future research in enabling LLMs with visual
perception abilities. |
This paper introduces PerceptionGPT, a novel framework for efficiently training perception-enhanced vision language models (P-VLMs) by leveraging the representation power of LLM's token embedding. |
Existing methods for integrating visual perception into VLLMs face challenges such as training difficulty, quantization errors from discrete token representation, and increased context length. PerceptionGPT addresses these limitations. |
PerceptionGPT utilizes lightweight visual task encoders and decoders to represent visual perception signals (bounding boxes, segmentation masks) within the LLM's token embedding space, eliminating the need for discrete tokenization. |
PerceptionGPT achieves state-of-the-art performance on referring expression comprehension and segmentation tasks with only parameter-efficient tuning.
The method significantly reduces training difficulty, enabling good performance even with a small fraction of tunable parameters.
By representing perception signals with a single token embedding, PerceptionGPT accelerates decoding speed, especially for complex information like segmentation masks. |
The paper primarily focuses on object detection and segmentation, with potential for incorporating other perception tasks.
Further exploration of model scaling and its impact on performance is an area for future research. |
vision language model, visual perception, large language model, token embedding, multi-modal learning |
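The PerceptionGPT report above names quantization error as one drawback of writing visual outputs as discrete tokens. The short sketch below makes that concrete for a single normalized coordinate: rounding to one of `num_bins` evenly spaced levels loses up to half a bin width, whereas a continuous decoder has no such floor. The bin counts and function names are hypothetical, not the paper's settings.

```python
def quantize(coord, num_bins):
    """Map a coordinate in [0, 1] to the index of the nearest of num_bins levels."""
    return round(coord * (num_bins - 1))

def dequantize(index, num_bins):
    """Map a bin index back to its coordinate value."""
    return index / (num_bins - 1)

def max_roundtrip_error(num_bins, samples=10001):
    """Largest |coord - dequantize(quantize(coord))| over an evenly spaced sweep."""
    worst = 0.0
    for i in range(samples):
        coord = i / (samples - 1)
        recon = dequantize(quantize(coord, num_bins), num_bins)
        worst = max(worst, abs(coord - recon))
    return worst

coarse_error = max_roundtrip_error(100)
fine_error = max_roundtrip_error(1000)
half_bin_coarse = 0.5 / (100 - 1)   # theoretical bound for 100 bins
```

More bins shrink the error but enlarge the vocabulary, and a box still costs several discrete tokens. The report's alternative is to decode the output from a single token embedding.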
2311.06243
Report |
Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization |
Weiyang Liu, Zeju Qiu, Yao Feng, Yuliang Xiu, Yuxuan Xue, Longhui Yu, Haiwen Feng, Zhen Liu, Juyeon Heo, Songyou Peng, Yandong Wen, Michael J. Black, Adrian Weller, Bernhard Schölkopf |
Large foundation models are becoming ubiquitous, but training them from
scratch is prohibitively expensive. Thus, efficiently adapting these powerful
models to downstream tasks is increasingly important. In this paper, we study a
principled finetuning paradigm -- Orthogonal Finetuning (OFT) -- for downstream
task adaptation. Despite demonstrating good generalizability, OFT still uses a
fairly large number of trainable parameters due to the high dimensionality of
orthogonal matrices. To address this, we start by examining OFT from an
information transmission perspective, and then identify a few key desiderata
that enable better parameter-efficiency. Inspired by how the Cooley-Tukey fast
Fourier transform algorithm enables efficient information transmission, we
propose an efficient orthogonal parameterization using butterfly structures. We
apply this parameterization to OFT, creating a novel parameter-efficient
finetuning method, called Orthogonal Butterfly (BOFT). By subsuming OFT as a
special case, BOFT introduces a generalized orthogonal finetuning framework.
Finally, we conduct an extensive empirical study of adapting large vision
transformers, large language models, and text-to-image diffusion models to
various downstream tasks in vision and language. |
This paper proposes Orthogonal Butterfly (BOFT), a parameter-efficient finetuning method for foundation models that leverages butterfly structures to create dense orthogonal matrices for weight updates. |
Efficiently adapting large foundation models to downstream tasks is crucial, and BOFT offers a principled approach to finetuning that improves upon existing methods like Orthogonal Finetuning (OFT) and LoRA. |
BOFT parameterizes a dense orthogonal matrix as a product of multiple sparse orthogonal matrices, inspired by the butterfly structures used in the fast Fourier transform algorithm. This allows for a significant reduction in trainable parameters while maintaining expressiveness and stability. |
BOFT consistently outperforms LoRA and OFT in terms of accuracy and parameter efficiency across various tasks, including natural language understanding, mathematical reasoning, image classification, high-quality segmentation, and controllable text-to-image generation.
The butterfly structure in BOFT introduces a beneficial inductive bias for generalization, as evidenced by its superior performance compared to OFT with the same effective block size.
BOFT enables smooth weight interpolation by gradually setting trained orthogonal butterfly components to identity matrices, highlighting its ability to preserve semantic information and explore a favorable weight space. |
BOFT introduces a slight training runtime overhead compared to OFT due to the multiplication of multiple orthogonal matrices.
The optimality of the butterfly structure for information transmission in this context remains an open question, and exploring other network topologies could potentially yield further improvements. |
parameter-efficient finetuning, foundation models, orthogonal matrices, butterfly structures, information transmission |
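The butterfly factorization described in the BOFT report above can be illustrated with a small sketch. It is not the authors' implementation: the report does not say how each sparse factor is parameterized, so plain 2x2 rotations are swapped in as the orthogonal blocks, and the names `butterfly_factor` and `butterfly_orthogonal` are hypothetical. The point is only that a product of log2(d) sparse orthogonal factors gives a dense orthogonal matrix with far fewer parameters.

```python
import math
import random

def butterfly_factor(d, stride, angles):
    """One sparse orthogonal factor: a 2x2 rotation on each index pair (i, i XOR stride)."""
    B = [[0.0] * d for _ in range(d)]
    k = 0
    for i in range(d):
        j = i ^ stride  # partner index in the butterfly pattern
        if i < j:
            c, s = math.cos(angles[k]), math.sin(angles[k])
            B[i][i], B[i][j] = c, -s
            B[j][i], B[j][j] = s, c
            k += 1
    return B

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][t] * B[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

def butterfly_orthogonal(d, all_angles):
    """Multiply the factors for strides 1, 2, 4, ...; all_angles[l] holds d/2 angles."""
    Q = [[float(i == j) for j in range(d)] for i in range(d)]
    stride = 1
    for angles in all_angles:
        Q = matmul(butterfly_factor(d, stride, angles), Q)
        stride *= 2
    return Q

random.seed(0)
d = 8
levels = int(math.log2(d))
# Assumption: angles drawn away from multiples of pi/2 so no entry vanishes.
angles = [[random.uniform(0.2, 1.2) for _ in range(d // 2)] for _ in range(levels)]
Q = butterfly_orthogonal(d, angles)
QtQ = matmul([list(row) for row in zip(*Q)], Q)  # should be the identity
```

In this toy setting the product uses (d/2)·log2(d) = 12 angles for d = 8, whereas a general 8x8 orthogonal matrix has 28 degrees of freedom. Setting every angle to zero recovers the identity, which loosely mirrors the weight-interpolation finding listed in the report.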
2311.05770
Report |
PolyMaX: General Dense Prediction with Mask Transformer |
Xuan Yang, Liangzhe Yuan, Kimberly Wilber, Astuti Sharma, Xiuye Gu, Siyuan Qiao, Stephanie Debats, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Liang-Chieh Chen |
Dense prediction tasks, such as semantic segmentation, depth estimation, and
surface normal prediction, can be easily formulated as per-pixel classification
(discrete outputs) or regression (continuous outputs). This per-pixel
prediction paradigm has remained popular due to the prevalence of fully
convolutional networks. However, on the recent frontier of segmentation task,
the community has been witnessing a shift of paradigm from per-pixel prediction
to cluster-prediction with the emergence of transformer architectures,
particularly the mask transformers, which directly predict a label for a mask
instead of a pixel. Despite this shift, methods based on the per-pixel
prediction paradigm still dominate the benchmarks on the other dense prediction
tasks that require continuous outputs, such as depth estimation and surface
normal prediction. Motivated by the success of DORN and AdaBins in depth
estimation, achieved by discretizing the continuous output space, we propose to
generalize the cluster-prediction based method to general dense prediction
tasks. This allows us to unify dense prediction tasks with the mask transformer
framework. Remarkably, the resulting model PolyMaX demonstrates
state-of-the-art performance on three benchmarks of NYUD-v2 dataset. We hope
our simple yet effective design can inspire more research on exploiting mask
transformers for more dense prediction tasks. Code and model will be made
available. |
Proposes PolyMaX, a novel mask transformer framework that unifies various dense prediction tasks such as semantic segmentation, depth estimation, and surface normal prediction using a cluster-prediction paradigm. |
Addresses the limitations of existing dense prediction models that struggle to generalize across tasks with different output domains (discrete vs. continuous) by introducing a unified architecture based on mask transformers. |
Extends the cluster-prediction approach used in semantic segmentation to continuous domains by discretizing the output space into learnable clusters. Employs a mask transformer to learn cluster centers and their corresponding probability distribution maps, which are then linearly combined to generate the final predictions. |
Achieves state-of-the-art performance on NYUD-v2 dataset for semantic segmentation, depth estimation, and surface normal prediction, outperforming existing methods by a significant margin.
Demonstrates superior scalability compared to conventional per-pixel prediction methods when pretrained on larger datasets.
Provides high-quality pseudo-labels for semantic segmentation on Taskonomy dataset, facilitating future research in multi-task dense prediction. |
Despite achieving high performance, the model still exhibits limitations in handling transparent and reflective surfaces in depth and surface normal prediction.
Future work may explore better loss functions to address the issue of over-smoothness observed in depth and surface normal predictions. |
dense prediction, mask transformer, cluster-prediction, semantic segmentation, depth estimation, surface normal prediction |
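The cluster-prediction step in the PolyMaX report above (probability maps linearly combined with cluster centers) can be sketched for a single pixel. This is a minimal illustration that assumes a softmax over per-pixel cluster logits. The function name `predict_pixel` and the depth values are hypothetical, and the mask transformer that would produce the logits and centers is omitted.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_pixel(cluster_logits, cluster_centers):
    """Continuous prediction = probability-weighted sum of cluster centers."""
    probs = softmax(cluster_logits)
    return sum(p * c for p, c in zip(probs, cluster_centers))

# Hypothetical depth cluster centers, in meters.
centers = [0.5, 2.0, 5.0, 10.0]
uniform = predict_pixel([0.0, 0.0, 0.0, 0.0], centers)   # mean of the centers
peaked = predict_pixel([0.0, 0.0, 20.0, 0.0], centers)   # close to the 5.0 center
```

A pixel that is confidently assigned to one cluster takes that cluster's value, while an uncertain pixel blends several centers. That is how a discretized output space can still yield continuous predictions.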
2311.05613
Report |
Window Attention is Bugged: How not to Interpolate Position Embeddings |
Daniel Bolya, Chaitanya Ryali, Judy Hoffman, Christoph Feichtenhofer |
Window attention, position embeddings, and high resolution finetuning are
core concepts in the modern transformer era of computer vision. However, we
find that naively combining these near ubiquitous components can have a
detrimental effect on performance. The issue is simple: interpolating position
embeddings while using window attention is wrong. We study two state-of-the-art
methods that have these three components, namely Hiera and ViTDet, and find
that both do indeed suffer from this bug. To fix it, we introduce a simple
absolute window position embedding strategy, which solves the bug outright in
Hiera and allows us to increase both speed and performance of the model in
ViTDet. We finally combine the two to obtain HieraDet, which achieves 61.7 box
mAP on COCO, making it state-of-the-art for models that only use ImageNet-1k
pretraining. This all stems from what is essentially a 3 line bug fix, which we
name "absolute win". |
This paper identifies a bug that occurs when interpolating absolute position embeddings in models using window attention, particularly during high-resolution fine-tuning. |
The bug negatively impacts performance in tasks like image recognition and object detection, hindering the effectiveness of high-resolution fine-tuning in vision transformers. |
The authors analyze the interaction between window attention and position embeddings, demonstrating the misalignment caused by naive interpolation. They propose "absolute win", a method separating position embeddings into window and global embeddings, enabling correct interpolation. |
Absolute win significantly improves image recognition accuracy when fine-tuning at higher resolutions, outperforming baselines like Swin and MViTv2.
In object detection, absolute win boosts performance in both ViTDet and HieraDet, achieving state-of-the-art results with ImageNet-1k pretraining.
The method also increases inference speed by mitigating the need for computationally expensive relative position embeddings. |
The study primarily focuses on Hiera and ViTDet, leaving the exploration of absolute win's impact on other architectures for future work.
Further investigation into the optimal strategies for training fully supervised transformers and closing the performance gap with MAE pre-trained models is needed. |
vision transformers, position embeddings, window attention, high-resolution fine-tuning, object detection |
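The split into window and global embeddings in the report above can be sketched in one dimension. This is a simplified reading, not the authors' code: it assumes the window embedding is tiled unchanged across the sequence while only the global embedding is interpolated to the new resolution. The names `interpolate_1d` and `position_embedding` are hypothetical.

```python
def interpolate_1d(values, new_len):
    """Linearly resample a list to new_len samples with aligned endpoints."""
    if len(values) == 1:
        return [values[0]] * new_len
    if new_len == 1:
        return [values[0]]
    scale = (len(values) - 1) / (new_len - 1)
    out = []
    for i in range(new_len):
        x = i * scale
        lo = min(int(x), len(values) - 2)
        frac = x - lo
        out.append(values[lo] * (1 - frac) + values[lo + 1] * frac)
    return out

def position_embedding(window_embed, global_embed, num_tokens):
    """Tile the window part (never interpolated); interpolate only the global part."""
    w = len(window_embed)
    tiled = [window_embed[i % w] for i in range(num_tokens)]
    stretched = interpolate_1d(global_embed, num_tokens)
    return [t + g for t, g in zip(tiled, stretched)]

# With a zero global part, the result repeats with the window period at any length.
pe = position_embedding([1.0, 2.0, 3.0, 4.0], [0.0, 0.0], 16)
# With a trivial window part, only the interpolated global ramp remains.
ramp = position_embedding([0.0], [0.0, 1.0], 5)
```

Because the tiled part keeps its period when the token count changes, every window sees the same within-window offsets after high-resolution fine-tuning. Naively interpolating a single embedding would not preserve that alignment.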
2311.05556
Report |
LCM-LoRA: A Universal Stable-Diffusion Acceleration Module |
Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick von Platen, Apolinário Passos, Longbo Huang, Jian Li, Hang Zhao |
Latent Consistency Models (LCMs) have achieved impressive performance in
accelerating text-to-image generative tasks, producing high-quality images with
minimal inference steps. LCMs are distilled from pre-trained latent diffusion
models (LDMs), requiring only ~32 A100 GPU training hours. This report further
extends LCMs' potential in two aspects: First, by applying LoRA distillation to
Stable-Diffusion models including SD-V1.5, SSD-1B, and SDXL, we have expanded
LCM's scope to larger models with significantly less memory consumption,
achieving superior image generation quality. Second, we identify the LoRA
parameters obtained through LCM distillation as a universal Stable-Diffusion
acceleration module, named LCM-LoRA. LCM-LoRA can be directly plugged into
various Stable-Diffusion fine-tuned models or LoRAs without training, thus
representing a universally applicable accelerator for diverse image generation
tasks. Compared with previous numerical PF-ODE solvers such as DDIM,
DPM-Solver, LCM-LoRA can be viewed as a plug-in neural PF-ODE solver that
possesses strong generalization abilities. Project page:
https://github.com/luosiallen/latent-consistency-model. |
This work introduces LCM-LoRA, a universal training-free acceleration module for Stable-Diffusion (SD) that acts as an independent neural network-based solver module to predict the solution of Probability Flow ODE (PF-ODE), enabling fast inference with minimal steps on various fine-tuned SD models and LoRAs. |
Current open-source models and acceleration techniques have yet to achieve real-time generation on standard consumer GPUs, highlighting the need for a balance between speed and quality in LDM-generated imagery. |
The work extends Latent Consistency Models (LCMs) by: (1) applying LoRA distillation to Stable-Diffusion models (SD-V1.5, SSD-1B, and SDXL) to reduce memory consumption and achieve superior image generation quality, and (2) identifying LoRA parameters from LCM distillation as a universal SD acceleration module (LCM-LoRA), which can be directly plugged into fine-tuned SD models or LoRAs without training. |
LCM-LoRA significantly reduces memory requirements during training.
The latent consistency distillation (LCD) paradigm scales effectively to larger models like SDXL and SSD-1B.
LCM-LoRA demonstrates robust generalization capabilities, achieving fast inference with minimal steps when combined with other fine-tuned SD models and LoRAs. |
Further investigation is needed to fully understand the impact of combining LCM-LoRA with LoRA parameters from various datasets.
Exploration of different linear combination strategies for acceleration and style vectors may further improve performance. |
stable diffusion, latent consistency models, image generation, model acceleration, lora |
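The plug-in behavior described in the LCM-LoRA report above amounts to adding weight deltas to a base model. The sketch below assumes a plain weighted sum of an "acceleration" delta and a "style" delta. The parameter name `attn.q`, the scales, and the function `combine` are hypothetical, and real LoRA deltas would be low-rank matrix products rather than flat lists.

```python
def combine(base, deltas_with_scales):
    """Return base weights plus a weighted sum of deltas, leaving base untouched."""
    merged = {name: list(w) for name, w in base.items()}
    for delta, scale in deltas_with_scales:
        for name, values in delta.items():
            merged[name] = [w + scale * v for w, v in zip(merged[name], values)]
    return merged

# Toy one-parameter "model"; all numbers are made up for illustration.
base = {"attn.q": [1.0, 2.0]}
style_delta = {"attn.q": [0.1, 0.0]}   # stands in for a style LoRA
lcm_delta = {"attn.q": [0.0, -0.5]}    # stands in for the LCM-LoRA acceleration delta
merged = combine(base, [(style_delta, 0.8), (lcm_delta, 1.0)])
```

Since the combination is purely additive, no extra training is involved. How the two scales should be chosen is exactly the open question listed under the report's limitations.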
2311.05463
Report |
ControlStyle: Text-Driven Stylized Image Generation Using Diffusion Priors |
Jingwen Chen, Yingwei Pan, Ting Yao, Tao Mei |
Recently, the multimedia community has witnessed the rise of diffusion models
trained on large-scale multi-modal data for visual content creation,
particularly in the field of text-to-image generation. In this paper, we
propose a new task for "stylizing" text-to-image models, namely text-driven
stylized image generation, that further enhances editability in content
creation. Given input text prompt and style image, this task aims to produce
stylized images which are both semantically relevant to input text prompt and
meanwhile aligned with the style image in style. To achieve this, we present a
new diffusion model (ControlStyle) via upgrading a pre-trained text-to-image
model with a trainable modulation network enabling more conditions of text
prompts and style images. Moreover, diffusion style and content regularizations
are simultaneously introduced to facilitate the learning of this modulation
network with these diffusion priors, pursuing high-quality stylized
text-to-image generation. Extensive experiments demonstrate the effectiveness
of our ControlStyle in producing more visually pleasing and artistic results,
surpassing a simple combination of text-to-image model and conventional style
transfer techniques. |
This paper introduces a new diffusion model, ControlStyle, for text-driven stylized image generation, which allows users to create images that match both a given text prompt and the style of a given image. |
The task enhances editability in visual content creation, moving beyond existing methods that require a content image or struggle with accurate style descriptions. |
ControlStyle builds on a pre-trained text-to-image diffusion model, adding a trainable modulation network that incorporates style image information and utilizes diffusion style and content regularizations to maintain image structure and style consistency. |
ControlStyle produces higher quality results, with better visual appeal and style alignment, compared to cascaded text-to-image and style transfer methods, as per quantitative metrics and user study.
The use of diffusion regularizations, leveraging image priors from the diffusion model's auto-encoder, proves more effective than perceptual loss, resulting in fewer artifacts.
ControlStyle demonstrates strong generalizability by effectively adapting to styles not present in its training dataset. |
The selection of features from the upsampling blocks for diffusion regularizations requires careful consideration to avoid performance degradation.
Further exploration of combining ControlStyle with other conditional control models like ControlNet could lead to even more powerful and interesting applications. |
diffusion models, text-to-image generation, style transfer, stylized image generation, content creation |
2311.04498
Report |
NExT-Chat: An LMM for Chat, Detection and Segmentation |
Ao Zhang, Yuan Yao, Wei Ji, Zhiyuan Liu, Tat-Seng Chua |
The development of large language models (LLMs) has greatly advanced the
field of multimodal understanding, leading to the emergence of large multimodal
models (LMMs). In order to enhance the level of visual comprehension, recent
studies have equipped LMMs with region-level understanding capabilities by
representing object bounding box coordinates as a series of text sequences
(pix2seq). In this paper, we introduce a novel paradigm for object location
modeling called pix2emb method, where we ask the LMM to output the location
embeddings and then decode them with different decoders. This paradigm allows
us to use different location formats (such as bounding boxes and masks) in
multimodal conversations. Leveraging the proposed pix2emb method, we train an
LMM named NExT-Chat and demonstrate its capability of handling multiple tasks
like visual grounding, region captioning, and grounded reasoning. Comprehensive
experiments show the effectiveness of our NExT-Chat on various tasks, e.g.,
NExT-Chat (87.7) vs. Shikra (86.9) on POPE-Random, NExT-Chat (68.9) vs. LISA
(67.9) on referring expression segmentation task, and NExT-Chat (79.6) vs.
Kosmos-2 (62.3) on region caption task. The code and model are released at
https://github.com/NExT-ChatV/NExT-Chat. |
This paper introduces pix2emb, a novel paradigm for object location modeling in large multimodal models (LMMs) that utilizes embeddings to accommodate different location formats like bounding boxes and segmentation masks. |
Existing LMMs often rely on pix2seq, which is limited to discrete coordinate outputs and struggles with fine-grained formats like masks. Pix2emb addresses these limitations by enabling flexible output formats and leveraging established localization practices. |
Pix2emb introduces two special tokens: a trigger token that initiates localization and a location token that serves as a placeholder for location embeddings. This allows LMMs to predict various location formats and to use established localization losses such as L1, IoU, and GIoU. The authors train NExT-Chat, an LMM based on pix2emb, using a three-stage process: pre-training, instruction tuning, and segmentation training. |
NExT-Chat achieves state-of-the-art results on the POPE benchmark for image hallucination diagnosis.
It outperforms existing methods in referring expression segmentation, showing superior cIoU scores on RefCOCO, RefCOCO+, and RefCOCOg datasets.
NExT-Chat exhibits strong performance in region captioning, surpassing baselines like Kosmos-2 in CIDEr score on RefCOCOg. |
NExT-Chat is primarily trained on single image inputs, limiting its ability to handle multiple images.
Lack of diverse training data hinders its performance in specialized domains like medical or satellite imagery. |
large multimodal models, object location modeling, pix2emb, visual grounding, region captioning |
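The NExT-Chat report above mentions that decoding boxes from embeddings permits standard regression losses such as L1 and GIoU. A minimal sketch of such a combined box loss follows, assuming corner-format boxes (x1, y1, x2, y2) with positive area. The function names and loss weights are hypothetical and are not taken from the paper.

```python
def giou(box_a, box_b):
    """Generalized IoU for two boxes in (x1, y1, x2, y2) format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull

def box_loss(pred, target, l1_weight=1.0, giou_weight=1.0):
    """Weighted sum of an L1 term and a (1 - GIoU) term; the weights are assumptions."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    return l1_weight * l1 + giou_weight * (1.0 - giou(pred, target))

same = giou((0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0))
apart = giou((0.0, 0.0, 1.0, 1.0), (2.0, 0.0, 3.0, 1.0))
```

Unlike plain IoU, GIoU stays informative for disjoint boxes: it turns negative as the boxes move apart. That is useful when an early prediction does not yet overlap the target.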
2311.04391
Report |
3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features |
Chenfeng Xu, Huan Ling, Sanja Fidler, Or Litany |
We present 3DiffTection, a state-of-the-art method for 3D object detection
from single images, leveraging features from a 3D-aware diffusion model.
Annotating large-scale image data for 3D detection is resource-intensive and
time-consuming. Recently, pretrained large image diffusion models have become
prominent as effective feature extractors for 2D perception tasks. However,
these features are initially trained on paired text and image data, which are
not optimized for 3D tasks, and often exhibit a domain gap when applied to the
target data. Our approach bridges these gaps through two specialized tuning
strategies: geometric and semantic. For geometric tuning, we fine-tune a
diffusion model to perform novel view synthesis conditioned on a single image,
by introducing a novel epipolar warp operator. This task meets two essential
criteria: the necessity for 3D awareness and reliance solely on posed image
data, which are readily available (e.g., from videos) and does not require
manual annotation. For semantic refinement, we further train the model on
target data with detection supervision. Both tuning phases employ ControlNet to
preserve the integrity of the original feature capabilities. In the final step,
we harness these enhanced capabilities to conduct a test-time prediction
ensemble across multiple virtual viewpoints. Through our methodology, we obtain
3D-aware features that are tailored for 3D detection and excel in identifying
cross-view point correspondences. Consequently, our model emerges as a powerful
3D detector, substantially surpassing previous benchmarks, e.g., Cube-RCNN, a
precedent in single-view 3D detection by 9.43% in AP3D on the
Omni3D-ARkitscene dataset. Furthermore, 3DiffTection showcases robust data
efficiency and generalization to cross-domain data. |
This paper presents 3DiffTection, a novel method for single-image 3D object detection by leveraging pre-trained 2D diffusion models and enhancing their 3D awareness. |
Annotating data for 3D object detection is expensive and time-consuming. 3DiffTection addresses this challenge by leveraging readily available pre-trained 2D diffusion models and enhancing their capabilities for 3D tasks. |
The method uses two ControlNets: a geometric ControlNet trained for novel view synthesis using epipolar warping to instill 3D awareness, and a semantic ControlNet jointly trained with a 3D detection head for task-specific adaptation. It further enhances detection by ensembling predictions across virtually generated views. |
3DiffTection significantly outperforms previous state-of-the-art methods on the Omni3D-ARKitScenes dataset for single-view 3D object detection.
The method shows strong data efficiency, achieving superior performance with significantly less training data than competing approaches.
3DiffTection exhibits strong generalization to cross-domain data, effectively transferring learned 3D awareness to new datasets. |
The method currently relies on accurate camera pose information, which can be challenging to obtain for in-the-wild video data.
The use of Stable Diffusion architecture leads to high memory and runtime demands, limiting its applicability in real-time settings and requiring further optimization. |
3d object detection, diffusion models, novel view synthesis, controlnet, data efficiency |
2311.04315
Report |
A Data Perspective on Enhanced Identity Preservation for Diffusion Personalization |
Xingzhe He, Zhiwen Cao, Nicholas Kolkin, Lantao Yu, Kun Wan, Helge Rhodin, Ratheesh Kalarot |
Large text-to-image models have revolutionized the ability to generate
imagery using natural language. However, particularly unique or personal visual
concepts, such as pets and furniture, will not be captured by the original
model. This has led to interest in how to personalize a text-to-image model.
Despite significant progress, this task remains a formidable challenge,
particularly in preserving the subject's identity. Most researchers attempt to
address this issue by modifying model architectures. These methods are capable
of keeping the subject structure and color but fail to preserve identity
details. Towards this issue, our approach takes a data-centric perspective. We
introduce a novel regularization dataset generation strategy on both the text
and image level. This strategy enables the model to preserve fine details of
the desired subjects, such as text and logos. Our method is
architecture-agnostic and can be flexibly applied on various text-to-image
models. We show on established benchmarks that our data-centric approach forms
the new state of the art in terms of identity preservation and text alignment. |
This paper introduces a data-centric approach for enhancing identity preservation in diffusion-based text-to-image personalization, addressing overfitting issues observed in prior methods. |
Existing methods for personalizing text-to-image models struggle to maintain subject identity and often overfit to training data, leading to reduced quality and diversity in generated images. |
The proposed method generates a structured regularization dataset using formatted prompts that describe both foreground and background elements. This regularization dataset, combined with enhanced training prompts, improves the model's ability to learn personalized concepts without overfitting. |
The method demonstrates superior subject identity preservation, retaining intricate details like logos and textures.
It exhibits improved text alignment, generating images that are more faithful to the input text prompts.
The approach is effective with both inanimate objects and living entities, showing adaptability across diverse subject types. |
Generating the regularization dataset adds computational overhead to the training process.
The current approach assumes manual annotation of training images, which could be automated in future work. |
text-to-image generation, diffusion models, personalization, identity preservation, regularization dataset |
2311.04287
Report |
Holistic Evaluation of Text-To-Image Models |
Tony Lee, Michihiro Yasunaga, Chenlin Meng, Yifan Mai, Joon Sung Park, Agrim Gupta, Yunzhi Zhang, Deepak Narayanan, Hannah Benita Teufel, Marco Bellagente, Minguk Kang, Taesung Park, Jure Leskovec, Jun-Yan Zhu, Li Fei-Fei, Jiajun Wu, Stefano Ermon, Percy Liang |
The stunning qualitative improvement of recent text-to-image models has led
to their widespread attention and adoption. However, we lack a comprehensive
quantitative understanding of their capabilities and risks. To fill this gap,
we introduce a new benchmark, Holistic Evaluation of Text-to-Image Models
(HEIM). Whereas previous evaluations focus mostly on text-image alignment and
image quality, we identify 12 aspects, including text-image alignment, image
quality, aesthetics, originality, reasoning, knowledge, bias, toxicity,
fairness, robustness, multilinguality, and efficiency. We curate 62 scenarios
encompassing these aspects and evaluate 26 state-of-the-art text-to-image
models on this benchmark. Our results reveal that no single model excels in all
aspects, with different models demonstrating different strengths. We release
the generated images and human evaluation results for full transparency at
https://crfm.stanford.edu/heim/v1.1.0 and the code at
https://github.com/stanford-crfm/helm, which is integrated with the HELM
codebase. |
The paper introduces HEIM, a novel benchmark for evaluating text-to-image models across 12 aspects, including aesthetics, originality, bias, and toxicity, addressing the limitations of existing benchmarks that primarily focus on image quality and text-image alignment. |
Existing benchmarks for text-to-image models lack comprehensiveness, often overlooking crucial aspects like originality, aesthetics, bias, and toxicity. HEIM aims to fill this gap by providing a holistic evaluation framework. |
HEIM evaluates 26 text-to-image models on 62 scenarios using both automated metrics (e.g., CLIPScore, FID) and human evaluation to provide a comprehensive assessment across the 12 identified aspects. |
No single model excels in all aspects, highlighting the need for models with balanced capabilities.
Weak correlations between human and automated metrics, particularly for photorealism and aesthetics, underscore the importance of human evaluation.
Most models show poor performance in reasoning and multilinguality, indicating areas needing further research. |
The 12 identified aspects may not be exhaustive and could be expanded in future work.
The reliance on crowdsourced human evaluation, while valuable, has limitations, particularly for subjective aspects like aesthetics and originality. |
benchmark, text-to-image generation, evaluation, bias, toxicity |
2311.04257
Report |
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration |
Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou |
Multi-modal Large Language Models (MLLMs) have demonstrated impressive
instruction abilities across various open-ended tasks. However, previous
methods primarily focus on enhancing multi-modal capabilities. In this work, we
introduce a versatile multi-modal large language model, mPLUG-Owl2, which
effectively leverages modality collaboration to improve performance in both
text and multi-modal tasks. mPLUG-Owl2 utilizes a modularized network design,
with the language decoder acting as a universal interface for managing
different modalities. Specifically, mPLUG-Owl2 incorporates shared functional
modules to facilitate modality collaboration and introduces a modality-adaptive
module that preserves modality-specific features. Extensive experiments reveal
that mPLUG-Owl2 is capable of generalizing both text tasks and multi-modal
tasks and achieving state-of-the-art performances with a single generic model.
Notably, mPLUG-Owl2 is the first MLLM model that demonstrates the modality
collaboration phenomenon in both pure-text and multi-modal scenarios, setting a
pioneering path in the development of future multi-modal foundation models. |
Presents mPLUG-Owl2, a multi-modal large language model that leverages modality collaboration to enhance performance in both text and multi-modal tasks. |
Existing multi-modal LLMs struggle to balance the benefits of cross-modal interaction with the risk of modality interference, limiting their ability to excel in both text and multi-modal tasks simultaneously. |
Introduces a modularized network design with a modality-adaptive module to facilitate cross-modal interaction while preserving modality-specific features. Employs a two-stage training paradigm consisting of vision-language pre-training and joint vision-language instruction tuning. |
mPLUG-Owl2 achieves state-of-the-art performance on 8 classic vision-language benchmarks and ranks first or second on 5 recent zero-shot multi-modal benchmarks.
mPLUG-Owl2 demonstrates state-of-the-art results on multiple pure-text benchmarks, highlighting the benefits of modality collaboration for enhancing text-based capabilities.
Analysis confirms the positive impact of modality collaboration, especially in improving text-based understanding, knowledge, and reasoning abilities. |
Limited number of test samples in certain benchmarks (e.g., MME) may lead to performance fluctuations.
Despite efforts to mitigate bias, the model may still inherit some biases from the pre-trained LLM and web-sourced data. |
multi-modal large language models, modality collaboration, modality-adaptive module, joint vision-language instruction tuning, zero-shot learning |
2311.04251
Report |
MixtureGrowth: Growing Neural Networks by Recombining Learned Parameters |
Chau Pham, Piotr Teterwak, Soren Nelson, Bryan A. Plummer |
Most deep neural networks are trained under fixed network architectures and
require retraining when the architecture changes. If expanding the network's
size is needed, it is necessary to retrain from scratch, which is expensive. To
avoid this, one can grow from a small network by adding random weights over
time to gradually achieve the target network size. However, this naive approach
falls short in practice as it brings too much noise to the growing process.
Prior work tackled this issue by leveraging the already learned weights and
training data for generating new weights through conducting a computationally
expensive analysis step. In this paper, we introduce MixtureGrowth, a new
approach to growing networks that circumvents the initialization overhead in
prior work. Before growing, each layer in our model is generated with a linear
combination of parameter templates. Newly grown layer weights are generated by
using a new linear combination of existing templates for a layer. On one hand,
these templates are already trained for the task, providing a strong
initialization. On the other, the new coefficients provide flexibility for the
added layer weights to learn something new. We show that our approach boosts
top-1 accuracy over the state-of-the-art by 2-2.5% on CIFAR-100 and ImageNet
datasets, while achieving comparable performance with fewer FLOPs to a larger
network trained from scratch. Code is available at
https://github.com/chaudatascience/mixturegrowth. |
Introduces MixtureGrowth, a novel method for growing neural networks by reusing and recombining learned parameter templates from smaller, pre-trained networks. |
Reduces the computational cost of training large neural networks by leveraging knowledge from smaller models and avoiding expensive weight initialization procedures used in prior work. |
Trains two small networks with shared parameter templates, fuses them into a larger network, initializes new weights by learning new linear combinations of existing templates, and fine-tunes the entire network. |
Achieves 2-2.5% higher top-1 accuracy than state-of-the-art growing methods on CIFAR-100 and ImageNet.
Outperforms target network accuracy on CIFAR-100 with half the FLOPs.
Demonstrates robustness to growth point and benefits from fusing two small models over growing from a single one. |
Limited exploration of growth beyond doubling network size.
Further investigation into the relationship between template diversity and growth performance is needed. |
neural network growing, template mixing, parameter sharing, model fusion, computational efficiency |
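The template mechanism in the MixtureGrowth report above (layer weights as linear combinations of shared templates, with new coefficients for grown weights) can be sketched with flat weight vectors. The template values, the coefficients, and the helper `mix` are hypothetical, and real templates would be weight tensors trained on the task.

```python
def mix(templates, coeffs):
    """Layer weights as a linear combination of shared parameter templates."""
    size = len(templates[0])
    return [sum(c * t[i] for c, t in zip(coeffs, templates)) for i in range(size)]

# Two toy templates standing in for already-trained parameter templates.
templates = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]

# Existing layer: coefficients learned before growing (made-up values).
old_weights = mix(templates, [0.5, 0.5])

# Grown layer: the same templates recombined with a fresh set of coefficients.
new_weights = mix(templates, [1.0, -1.0])
```

The grown weights reuse trained templates, so they start from a task-relevant initialization. The new coefficients let them differ from the existing layer rather than duplicate it.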
2311.04246
Report |
ADFactory: An Effective Framework for Generalizing Optical Flow with Nerf |
Han Ling |
A significant challenge facing current optical flow methods is the difficulty
in generalizing them well to the real world. This is mainly due to the high
cost of hand-crafted datasets, and existing self-supervised methods are limited
by indirect loss and occlusions, resulting in fuzzy outcomes. To address this
challenge, we introduce a novel optical flow training framework: automatic data
factory (ADF). ADF only requires RGB images as input to effectively train the
optical flow network on the target data domain. Specifically, we use advanced
Nerf technology to reconstruct scenes from photo groups collected by a
monocular camera, and then calculate optical flow labels between camera pose
pairs based on the rendering results. To eliminate erroneous labels caused by
defects in the scene reconstructed by Nerf, we screened the generated labels
from multiple aspects, such as optical flow matching accuracy, radiation field
confidence, and depth consistency. The filtered labels can be directly used for
network supervision. Experimentally, the generalization ability of ADF on KITTI
surpasses existing self-supervised optical flow and monocular scene flow
algorithms. In addition, ADF achieves impressive results in real-world
zero-shot generalization evaluations and surpasses most supervised methods. |
This paper proposes Automatic Data Factory (ADF), a novel optical flow training framework that uses scenes reconstructed by Neural Radiance Fields (NeRF) to train deep optical flow networks. |
ADF addresses the challenge of generalizing optical flow methods to real-world scenarios by providing a cost-effective way to generate large-scale, high-quality optical flow training data without manual annotation or expensive equipment. |
ADF uses NeRF to reconstruct scenes from monocular camera images, generates optical flow labels between rendered camera poses, and employs data filtering techniques like structural similarity, radiation field confidence, and depth consistency to ensure label accuracy. |
ADF-trained models demonstrate superior zero-shot generalization on real-world optical flow estimation compared to existing self-supervised and supervised methods.
The proposed data filtering mechanism significantly improves the performance of trained optical flow models.
ADF proves effective in training both traditional optical flow networks like RAFT and more advanced normalized scene flow models like Scale-flow. |
ADF currently relies on static scenes due to limitations of NeRF, hindering its application in dynamic scenarios like KITTI raw data.
Further research is needed to bridge the gap between the optical flow generated by ADF (closer to light flow) and the object flow typically used in supervised learning. |
optical flow, neural radiance fields, self-supervised learning, zero-shot generalization, data generation |
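The label-generation step in the ADF record above (flow between camera pose pairs in a static reconstructed scene) can be illustrated with the standard rigid-flow computation: back-project each pixel with its depth, move it into the second camera, and re-project. This is a minimal sketch under assumptions, not the authors' pipeline: the function name `rigid_flow`, the pinhole intrinsics `K`, and the pose convention (`R`, `t` map first-camera coordinates to second-camera coordinates) are all hypothetical choices, and the depth is assumed to come from a rendered depth map.

```python
import numpy as np

def rigid_flow(depth, K, R, t):
    """Optical flow induced purely by camera motion in a static scene.

    depth: (H, W) depth of each pixel in the first camera.
    K:     (3, 3) pinhole intrinsics, assumed shared by both views.
    R, t:  rotation (3, 3) and translation (3,) taking first-camera
           coordinates to second-camera coordinates (assumed convention).
    Returns an (H, W, 2) array of per-pixel displacements (dx, dy).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                    # back-projected rays
    pts1 = rays * depth[..., None]                     # 3D points, camera 1
    pts2 = pts1 @ R.T + t                              # 3D points, camera 2
    proj = pts2 @ K.T                                  # project into view 2
    uv2 = proj[..., :2] / proj[..., 2:3]
    return uv2 - pix[..., :2]

# Example: a fronto-parallel plane at depth 2 and a sideways camera shift.
K = np.array([[100.0, 0.0, 8.0], [0.0, 100.0, 6.0], [0.0, 0.0, 1.0]])
depth = np.full((12, 16), 2.0)
flow = rigid_flow(depth, K, np.eye(3), np.array([0.1, 0.0, 0.0]))
```

For a pure x-translation `tx` and constant depth `Z`, every pixel moves by `fx * tx / Z` horizontally, which gives a quick sanity check on the pose convention. Because the computation assumes a static scene, it is consistent with the limitation noted in the record.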
2311.04219
Report |
OtterHD: A High-Resolution Multi-modality Model |
Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, Ziwei Liu |
In this paper, we present OtterHD-8B, an innovative multimodal model evolved
from Fuyu-8B, specifically engineered to interpret high-resolution visual
inputs with granular precision. Unlike conventional models that are constrained
by fixed-size vision encoders, OtterHD-8B boasts the ability to handle flexible
input dimensions, ensuring its versatility across various inference
requirements. Alongside this model, we introduce MagnifierBench, an evaluation
framework designed to scrutinize models' ability to discern minute details and
spatial relationships of small objects. Our comparative analysis reveals that
while current leading models falter on this benchmark, OtterHD-8B, particularly
when directly processing high-resolution inputs, outperforms its counterparts
by a substantial margin. The findings illuminate the structural variances in
visual information processing among different models and the influence that the
vision encoders' pre-training resolution disparities have on model
effectiveness within such benchmarks. Our study highlights the critical role of
flexibility and high-resolution input capabilities in large multimodal models
and also exemplifies the potential inherent in the Fuyu architecture's
simplicity for handling complex visual data. |
Presents OtterHD-8B, a novel multimodal model based on Fuyu-8B, designed to process high-resolution visual inputs with flexibility and introduces MagnifierBench, a benchmark to evaluate models' ability to discern minute details in large images. |
Addresses the limitations of conventional LMMs that rely on fixed-size vision encoders and lack fine-grained perception abilities, crucial for tasks requiring detailed visual understanding. |
Extends Fuyu-8B with instruction tuning to handle various resolutions up to 1024x1024 pixels and develops MagnifierBench using images from the PVSG dataset with meticulously designed question-answer pairs focused on small objects. |
OtterHD-8B outperforms existing LMMs on MagnifierBench, demonstrating its superior fine-grained perception abilities.
Increasing input resolution leads to improved performance on MagnifierBench, highlighting the importance of resolution flexibility.
Dynamic resolution training further enhances OtterHD-8B's ability to generalize to unseen resolutions. |
Limited instruction tuning data compared to other state-of-the-art LMMs.
Further exploration of image augmentation methods like random cropping is needed. |
multimodal learning, large language models, computer vision, fine-grained perception, benchmarking |
2311.04212
Report |
Video Instance Matting |
Jiachen Li, Roberto Henschel, Vidit Goel, Marianna Ohanyan, Shant Navasardyan, Humphrey Shi |
Conventional video matting outputs one alpha matte for all instances
appearing in a video frame so that individual instances are not distinguished.
While video instance segmentation provides time-consistent instance masks,
results are unsatisfactory for matting applications, especially due to applied
binarization. To remedy this deficiency, we propose Video Instance
Matting (VIM), that is, estimating alpha mattes of each instance at each frame
of a video sequence. To tackle this challenging problem, we present MSG-VIM, a
Mask Sequence Guided Video Instance Matting neural network, as a novel baseline
model for VIM. MSG-VIM leverages a mixture of mask augmentations to make
predictions robust to inaccurate and inconsistent mask guidance. It
incorporates temporal mask and temporal feature guidance to improve the
temporal consistency of alpha matte predictions. Furthermore, we build a new
benchmark for VIM, called VIM50, which comprises 50 video clips with multiple
human instances as foreground objects. To evaluate performance on the VIM
task, we introduce a suitable metric called Video Instance-aware Matting
Quality (VIMQ). Our proposed model MSG-VIM sets a strong baseline on the VIM50
benchmark and outperforms existing methods by a large margin. The project is
open-sourced at https://github.com/SHI-Labs/VIM. |
This paper proposes Video Instance Matting (VIM), a new task focused on estimating alpha mattes for each instance within a video sequence, addressing the limitations of conventional video matting and instance segmentation. |
VIM enables instance-aware video editing, surpassing the limitations of traditional methods by providing separate alpha mattes for individual instances, crucial for applications like instance-selective removal. |
The authors introduce MSG-VIM, a Mask Sequence Guided VIM network that uses mask sequences from video instance segmentation as guidance. MSG-VIM employs a mixture of mask augmentations for robustness, and temporal mask and feature guidance for temporal consistency. |
MSG-VIM significantly outperforms existing video matting, instance segmentation, and image matting methods on the newly created VIM50 benchmark.
A proposed mixture of mask augmentations successfully enhances the robustness of MSG-VIM against inaccurate mask guidance.
The incorporation of temporal guidance, both for masks and features, demonstrably improves the temporal consistency of alpha matte predictions. |
The reliance on an external video instance segmentation model for mask guidance introduces a dependency on its accuracy.
The computational demands of processing longer video chunks are constrained by memory limitations. |
video matting, instance segmentation, alpha matte, video editing, deep learning |
2311.04145
Report |
I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models |
Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, Jingren Zhou |
Video synthesis has recently made remarkable strides benefiting from the
rapid development of diffusion models. However, it still encounters challenges
in terms of semantic accuracy, clarity and spatio-temporal continuity. They
primarily arise from the scarcity of well-aligned text-video data and the
complex inherent structure of videos, making it difficult for the model to
simultaneously ensure semantic and qualitative excellence. In this report, we
propose a cascaded I2VGen-XL approach that enhances model performance by
decoupling these two factors and ensures the alignment of the input data by
utilizing static images as a form of crucial guidance. I2VGen-XL consists of
two stages: i) the base stage guarantees coherent semantics and preserves
content from input images by using two hierarchical encoders, and ii) the
refinement stage enhances the video's details by incorporating an additional
brief text and improves the resolution to 1280$\times$720. To improve the
diversity, we collect around 35 million single-shot text-video pairs and 6
billion text-image pairs to optimize the model. By this means, I2VGen-XL can
simultaneously enhance the semantic accuracy, continuity of details and clarity
of generated videos. Through extensive experiments, we have investigated the
underlying principles of I2VGen-XL and compared it with current top methods,
demonstrating its effectiveness on diverse data. The source code and
models will be publicly available at https://i2vgen-xl.github.io. |
This paper introduces I2VGen-XL, a cascaded diffusion model that synthesizes high-quality videos from single images. |
Current video synthesis models struggle with semantic accuracy, clarity, and spatio-temporal continuity due to limited aligned text-video data and the complexity of videos. |
I2VGen-XL uses a two-stage approach: 1) a base stage with hierarchical encoders ensures semantic coherence and content preservation at low resolution, and 2) a refinement stage enhances details and resolution using an additional text prompt. |
I2VGen-XL generates videos with more realistic and diverse motions than state-of-the-art methods like Gen-2 and Pika Labs.
The refinement stage significantly improves spatial details, reduces noise, and enhances spatio-temporal continuity.
The model shows promising generalization ability across diverse categories like human faces, cartoons, and animals. |
Generating natural and diverse human body motions remains challenging.
The model is currently limited to generating short, single-shot videos. |
video synthesis, diffusion models, image-to-video generation, cascaded diffusion, high-resolution video generation |
2311.03943
Report |
CLIP Guided Image-perceptive Prompt Learning for Image Enhancement |
Weiwen Chen, Qiuhong Ke, Zinuo Li |
Image enhancement is a significant research area in the fields of computer
vision and image processing. In recent years, many learning-based methods for
image enhancement have been developed, where the Look-up-table (LUT) has proven
to be an effective tool. In this paper, we delve into the potential of
Contrastive Language-Image Pre-Training (CLIP) Guided Prompt Learning,
proposing a simple structure called CLIP-LUT for image enhancement. We found
that the prior knowledge of CLIP can effectively discern the quality of
degraded images, which can provide reliable guidance. Specifically, we
first learn image-perceptive prompts that distinguish between original and
target images using the CLIP model. Meanwhile, we introduce a very simple
enhancement network, built on a simple baseline, that predicts the weights of
three different LUTs. The learned prompts are used to steer
the enhancement network like a loss function and improve the performance of
the model. We demonstrate that by simply combining a straightforward method with
CLIP, we can obtain satisfactory results. |
This paper proposes CLIP-LUT, a novel image enhancement approach that leverages CLIP's image quality discerning ability through prompt learning to guide a simple LUT-based enhancement network. |
This approach bridges the gap between powerful visual-language models like CLIP and low-level image enhancement tasks, offering a new avenue for leveraging CLIP's knowledge in this domain. |
The method learns image-perceptive prompts using CLIP to distinguish between original and enhanced images. These prompts guide a lightweight UNet-like network predicting the weights for three different LUTs, which are then combined to enhance the input image. |
CLIP-LUT achieves competitive results on benchmark datasets like MIT-Adobe FiveK, HDR+, and FilmSet, outperforming several existing methods in terms of PSNR, SSIM, and color accuracy.
Ablation studies confirm the efficacy of learned image-perceptive prompts in guiding the enhancement process compared to using no prompts or random prompts.
The simplicity of the proposed approach highlights the potential of integrating CLIP into low-level computer vision tasks for effective performance. |
The paper acknowledges the preliminary stage of the research and suggests exploring more effective prompt learning techniques and loss functions.
Further investigations into lightweight network architectures for enhancement and extending the approach to other low-level vision tasks are proposed. |
image enhancement, clip, prompt learning, look-up table (lut), visual-language models |
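The LUT-combination step in the CLIP-LUT record above (a network predicts weights for three LUTs, which are combined to enhance the image) can be sketched as a weighted sum of basis 3D LUTs followed by a color lookup. This is an illustrative sketch only: `identity_lut` and `apply_weighted_luts` are hypothetical names, the weights are passed in by hand rather than predicted by a network, and nearest-neighbor lookup is used for brevity where a practical implementation would more likely interpolate between LUT entries.

```python
import numpy as np

def identity_lut(n):
    """3D LUT of shape (n, n, n, 3) that maps every color to itself."""
    g = np.linspace(0.0, 1.0, n)
    r, gr, b = np.meshgrid(g, g, g, indexing="ij")
    return np.stack([r, gr, b], axis=-1)

def apply_weighted_luts(image, luts, weights):
    """Fuse basis LUTs with the given weights and apply them to an image.

    image:   (H, W, 3) floats in [0, 1].
    luts:    (M, n, n, n, 3) stack of basis LUTs.
    weights: (M,) per-image weights (hand-set here, assumed to come
             from a predictor network in the described method).
    """
    fused = np.tensordot(weights, luts, axes=1)        # (n, n, n, 3)
    n = fused.shape[0]
    # Nearest-neighbor lookup: snap each channel to the closest grid index.
    idx = np.clip(np.rint(image * (n - 1)).astype(int), 0, n - 1)
    return fused[idx[..., 0], idx[..., 1], idx[..., 2]]

# Example: blend an identity LUT with a darkening LUT and an all-black LUT.
base = identity_lut(5)
luts = np.stack([base, 0.5 * base, 0.0 * base])
image = np.random.default_rng(0).integers(0, 5, size=(4, 4, 3)) / 4.0
enhanced = apply_weighted_luts(image, luts, np.array([0.5, 0.5, 0.0]))
```

Fusing the LUTs before the lookup keeps the per-pixel cost at a single table read regardless of how many basis LUTs are combined.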
2311.03873
Report |
Mini but Mighty: Finetuning ViTs with Mini Adapters |
Imad Eddine Marouf, Enzo Tartaglione, Stéphane Lathuilière |
Vision Transformers (ViTs) have become one of the dominant architectures in
computer vision, and pre-trained ViT models are commonly adapted to new tasks
via fine-tuning. Recent works proposed several parameter-efficient transfer
learning methods, such as adapters, to avoid the prohibitive training and
storage cost of finetuning. In this work, we observe that adapters perform
poorly when the dimension of adapters is small, and we propose MiMi, a training
framework that addresses this issue. We start with large adapters which can
reach high performance, and iteratively reduce their size. To enable automatic
estimation of the hidden dimension of every adapter, we also introduce a new
scoring function, specifically designed for adapters, that compares the neuron
importance across layers. Our method outperforms existing methods in finding
the best trade-off between accuracy and trained parameters across the three
dataset benchmarks DomainNet, VTAB, and Multi-task, for a total of 29 datasets. |
The paper proposes MiMi, an iterative training framework for Vision Transformers (ViTs) that reduces the size of adapter modules for parameter-efficient transfer learning. |
Fine-tuning pre-trained ViTs for new tasks is computationally and storage expensive. Adapters offer a parameter-efficient alternative, but their performance degrades with small sizes. MiMi addresses this by iteratively reducing adapter dimensions while maintaining high performance. |
MiMi starts with large adapters and iteratively prunes neurons based on a novel importance score that considers both down-sampling and up-sampling layers. This score enables dynamic adjustment of adapter sizes and even removal if deemed unnecessary. |
MiMi outperforms existing parameter-efficient transfer learning methods on 29 datasets across DomainNet, VTAB, and Multi-task benchmarks.
The proposed importance score for neuron selection effectively guides adapter size reduction, leading to better performance than vanilla training.
MiMi demonstrates its generalizability by achieving comparable performance to full fine-tuning with significantly fewer parameters across various ViT backbones. |
The paper mainly focuses on image classification tasks; further investigation is needed for other vision tasks.
The influence of hyperparameter ρ (amount of neuron removal) requires further exploration for optimal performance across diverse datasets. |
vision transformer, parameter-efficient finetuning, adapters, pruning, transfer learning |
2311.03830
Report |
Reducing Spatial Fitting Error in Distillation of Denoising Diffusion Models |
Shengzhe Zhou, Zejian Lee, Shengyuan Zhang, Lefan Hou, Changyuan Yang, Guang Yang, Zhiyuan Yang, Lingyun Sun |
Denoising Diffusion models have exhibited remarkable capabilities in image
generation. However, generating high-quality samples requires a large number of
iterations. Knowledge distillation for diffusion models is an effective method
to address this limitation with a shortened sampling process but causes
degraded generative quality. Based on our analysis with bias-variance
decomposition and experimental observations, we attribute the degradation to
the spatial fitting error occurring in the training of both the teacher and
student model. Accordingly, we propose the Spatial Fitting-Error Reduction
Distillation model (SFERD). SFERD utilizes attention
guidance from the teacher model and a designed semantic gradient predictor to
reduce the student's fitting error. Empirically, our proposed model facilitates
high-quality sample generation in a few function evaluations. We achieve an FID
of 5.31 on CIFAR-10 and 9.39 on ImageNet 64$\times$64 with only one step,
outperforming existing diffusion methods. Our study provides a new perspective
on diffusion distillation by highlighting the intrinsic denoising ability of
models. Project link: https://github.com/Sainzerjj/SFERD. |
This paper proposes SFERD, a novel Spatial Fitting-Error Reduction Distillation model for Denoising Diffusion Models, to generate high-quality images in a few function evaluations. |
Generating high-quality samples from Denoising Diffusion Models typically requires many iterations, leading to slow sampling speed. Distillation methods address this issue but often compromise image quality. This paper aims to improve the quality of distilled diffusion models. |
The paper analyzes the distillation process and identifies the fitting errors in both the teacher and student models as the root cause of quality degradation. To address this, SFERD utilizes two novel components: (1) **attention guidance** for the teacher model to reduce error by highlighting semantically important regions, and (2) a **semantic gradient predictor** for the student model to enhance training by incorporating semantic information from a learned latent space. |
SFERD significantly outperforms existing distillation methods (PD, CD) on CIFAR-10 and ImageNet 64x64 datasets, achieving impressive FID scores with only a few sampling steps.
Notably, SFERD-CD achieves single-step FID scores of 5.31 and 9.39 on CIFAR-10 and ImageNet 64x64, respectively.
Applying SFERD to fine-tune pre-trained diffusion models directly also leads to improved performance. |
The attention guidance method, while effective, currently relies on unsupervised learning, which can lead to challenges in promptly detecting and correcting instances of incorrect self-attention direction.
The student model with the semantic gradient predictor, though offering improved performance, experiences a slight increase in inference time compared to the original student model due to the extra predictor. |
diffusion models, knowledge distillation, image generation, attention mechanisms, semantic gradient prediction |
2311.03426
Report |
GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values |
Farnoosh Javadi, Walid Ahmed, Habib Hajimolahoseini, Foozhan Ataiefard, Mohammad Hassanpour, Saina Asani, Austin Wen, Omar Mohamed Awad, Kangling Liu, Yang Liu |
Massive transformer-based models face several challenges, including slow and
computationally intensive pre-training and over-parametrization. This paper
addresses these challenges by proposing a versatile method called GQKVA, which
generalizes query, key, and value grouping techniques. GQKVA is designed to
speed up transformer pre-training while reducing the model size. Our
experiments with various GQKVA variants highlight a clear trade-off between
performance and model size, allowing for customized choices based on resource
and time limitations. Our findings also indicate that the conventional
multi-head attention approach is not always the best choice, as there are
lighter and faster alternatives available. We tested our method on ViT, which
achieved an approximate 0.3% increase in accuracy while reducing the model size
by about 4% in the task of image classification. Additionally, our most
aggressive model reduction experiment resulted in a reduction of approximately
15% in model size, with only around a 1% drop in accuracy. |
This paper introduces GQKVA, a novel method for reducing the size and speeding up the pre-training of transformer models by generalizing query, key, and value grouping techniques. |
Massive transformer models suffer from slow and computationally expensive pre-training, along with over-parametrization. This method addresses these challenges by enabling faster training and smaller model sizes. |
The study explores various GQKVA variants by grouping queries, keys, and values within the self-attention mechanism of a ViT-small model. The variants are evaluated based on accuracy, model size, and training time per sample (TPS). |
GKVA, a variant of GQKVA, achieves the highest accuracy while reducing model size by 4-5% compared to standard multi-head attention (MHA).
Certain GQKVA variants outperform MQA in accuracy despite having the same or fewer parameters.
Results reveal a linear correlation between model size/TPS and performance, indicating a trade-off that allows for customization based on resource limits. |
The study is limited to evaluating GQKVA on the ViT-small model due to resource constraints.
Further research should explore applying GQKVA to larger transformer models to unlock greater potential speed-ups and memory savings. |
transformer, pre-training, model compression, attention mechanism, gqkva |
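The GQKVA record above describes grouping queries, keys, and values inside self-attention to shrink the model. The sketch below shows only one member of that family, sharing each key/value head across a group of query heads (as in grouped-query attention); it is an assumption-laden illustration, not necessarily the paper's exact grouping scheme. The names `grouped_kv_attention` and `kv_projection_params` are hypothetical, and the parameter count ignores biases.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_kv_attention(q, k, v):
    """Attention where several query heads share one key/value head.

    q:    (Hq, T, d) per-head queries.
    k, v: (Hkv, T, d) per-group keys and values; Hq must be a multiple
          of Hkv. Hkv == Hq is ordinary multi-head attention and
          Hkv == 1 is multi-query attention.
    """
    hq, _, d = q.shape
    hkv = k.shape[0]
    assert hq % hkv == 0, "query heads must split evenly into KV groups"
    rep = hq // hkv
    k_full = np.repeat(k, rep, axis=0)        # share each KV head rep times
    v_full = np.repeat(v, rep, axis=0)
    scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores) @ v_full           # (Hq, T, d)

def kv_projection_params(d_model, n_heads, n_kv_heads):
    """Weights in the key and value projections (biases ignored)."""
    head_dim = d_model // n_heads
    return 2 * d_model * n_kv_heads * head_dim

# Example: 6 query heads sharing 2 key/value heads.
rng = np.random.default_rng(0)
q = rng.normal(size=(6, 10, 64))
k = rng.normal(size=(2, 10, 64))
v = rng.normal(size=(2, 10, 64))
out = grouped_kv_attention(q, k, v)
```

Reducing the number of key/value heads shrinks only the key and value projection matrices, which is why the size reduction is a fraction of the total model size rather than proportional to the grouping factor.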
2311.03356
Report |
GLaMM: Pixel Grounding Large Multimodal Model |
Hanoona Rasheed, Muhammad Maaz, Sahal Shaji Mullappilly, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, Fahad S. Khan |
Large Multimodal Models (LMMs) extend Large Language Models to the vision
domain. Initial LMMs used holistic images and text prompts to generate
ungrounded textual responses. Recently, region-level LMMs have been used to
generate visually grounded responses. However, they are limited to only
referring to a single object category at a time, require users to specify the
regions, or cannot offer dense pixel-wise object grounding. In this work, we
present Grounding LMM (GLaMM), the first model that can generate natural
language responses seamlessly intertwined with corresponding object
segmentation masks. GLaMM not only grounds objects appearing in the
conversations but is flexible enough to accept both textual and optional visual
prompts (region of interest) as input. This empowers users to interact with the
model at various levels of granularity, both in textual and visual domains. Due
to the lack of standard benchmarks for the novel setting of visually Grounded
Conversation Generation (GCG), we introduce a comprehensive evaluation protocol
with our curated grounded conversations. Our proposed GCG task requires densely
grounded concepts in natural scenes at a large scale. To this end, we propose a
densely annotated Grounding-anything Dataset (GranD) using our proposed
automated annotation pipeline that encompasses 7.5M unique concepts grounded in
a total of 810M regions available with segmentation masks. Besides GCG, GLaMM
also performs effectively on several downstream tasks, e.g., referring
expression segmentation, image and region-level captioning and vision-language
conversations. |
Introduces GLaMM, the first large multimodal model capable of generating natural language responses with corresponding object segmentation masks, enabling visually grounded conversations. |
Addresses limitations of existing LMMs that lack region-specific understanding or can't provide detailed pixel-level grounding for truly interactive visual-language tasks. |
Develops GLaMM with a novel architecture that combines global and region encoders, an LLM, and a pixel decoder, trained end-to-end on a new densely annotated dataset (GranD). |
GLaMM outperforms existing LMMs on the newly proposed Grounded Conversation Generation (GCG) task.
Demonstrates strong performance on various downstream tasks such as referring expression segmentation, region-level captioning, and image captioning.
Introduces GranD, a large-scale dataset with 7.5M unique concepts grounded in 810M regions, created through an automated annotation pipeline for scalable data generation. |
Automated annotation pipeline in GranD may introduce noise in the labels, requiring further research on noise reduction techniques.
Future work includes expanding GLaMM to incorporate other modalities like video and 3D data. |
large multimodal models, grounded conversation generation, pixel-level grounding, dense image captioning, automated dataset annotation |
2311.03355
Report |
SegGen: Supercharging Segmentation Models with Text2Mask and Mask2Img Synthesis |
Hanrong Ye, Jason Kuen, Qing Liu, Zhe Lin, Brian Price, Dan Xu |
We propose SegGen, a highly-effective training data generation method for
image segmentation, which pushes the performance limits of state-of-the-art
segmentation models to a significant extent. SegGen designs and integrates two
data generation strategies: MaskSyn and ImgSyn. (i) MaskSyn synthesizes new
mask-image pairs via our proposed text-to-mask generation model and
mask-to-image generation model, greatly improving the diversity in segmentation
masks for model supervision; (ii) ImgSyn synthesizes new images based on
existing masks using the mask-to-image generation model, strongly improving
image diversity for model inputs. On the highly competitive ADE20K and COCO
benchmarks, our data generation method markedly improves the performance of
state-of-the-art segmentation models in semantic segmentation, panoptic
segmentation, and instance segmentation. Notably, in terms of the ADE20K mIoU,
Mask2Former R50 is largely boosted from 47.2 to 49.9 (+2.7); Mask2Former Swin-L
is also significantly increased from 56.1 to 57.4 (+1.3). These promising
results strongly suggest the effectiveness of our SegGen even when abundant
human-annotated training data is utilized. Moreover, training with our
synthetic data makes the segmentation models more robust towards unseen
domains. Project website: https://seggenerator.github.io |
This paper proposes SegGen, a novel method for generating high-quality, diverse training data for image segmentation using text-to-mask and mask-to-image generation models. |
Existing segmentation datasets are limited in size, hindering model performance and generalization ability. SegGen addresses this by synthesizing large-scale, high-quality training data. |
SegGen leverages two generative models: 1) Text2Mask generates new segmentation masks from text prompts. 2) Mask2Img synthesizes images conditioned on these masks or human-annotated ones. Two data generation approaches are proposed: MaskSyn focuses on new mask generation, while ImgSyn creates new images from existing masks. |
SegGen significantly boosts the performance of state-of-the-art segmentation models (e.g., Mask2Former) on ADE20K and COCO benchmarks for semantic, panoptic, and instance segmentation.
The method achieves state-of-the-art results on these benchmarks without relying on additional human-annotated data.
Models trained with SegGen's synthetic data exhibit improved generalization ability, performing better on images from unseen domains (e.g., PASCAL VOC) and AI-generated images. |
Generating instance segmentation data from text remains a challenge due to the difficulty of inferring instance information from color maps.
Further exploration is needed to optimize the scale and diversity of synthetic data for even greater performance improvements. |
image segmentation, data augmentation, synthetic data generation, text-to-image synthesis, mask-to-image synthesis |
2311.03354
Report |
CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding |
Junyan Li, Delin Chen, Yining Hong, Zhenfang Chen, Peihao Chen, Yikang Shen, Chuang Gan |
A remarkable ability of human beings resides in compositional reasoning,
i.e., the capacity to make "infinite use of finite means". However, current
large vision-language foundation models (VLMs) fall short of such compositional
abilities due to their "bag-of-words" behaviors and inability to construct
words that correctly represent visual entities and the relations among the
entities. To this end, we propose CoVLM, which can guide the LLM to explicitly
compose visual entities and relationships among the text and dynamically
communicate with the vision encoder and detection network to achieve
vision-language communicative decoding. Specifically, we first devise a set of
novel communication tokens for the LLM, for dynamic communication between the
visual detection system and the language system. A communication token is
generated by the LLM following a visual entity or a relation, to inform the
detection network to propose regions that are relevant to the sentence
generated so far. The proposed regions-of-interests (ROIs) are then fed back
into the LLM for better language generation contingent on the relevant regions.
The LLM is thus able to compose the visual entities and relationships through
the communication tokens. The vision-to-language and language-to-vision
communication are iteratively performed until the entire sentence is generated.
Our framework seamlessly bridges the gap between visual perception and LLMs and
outperforms previous VLMs by a large margin on compositional reasoning
benchmarks (e.g., ~20% in HICO-DET mAP, ~14% in Cola top-1 accuracy, and ~3% on
ARO top-1 accuracy). We also achieve state-of-the-art performances on
traditional vision-language tasks such as referring expression comprehension
and visual question answering. |
This paper proposes CoVLM, a novel approach that integrates detection networks into LLMs to enable dynamic interaction and compositionality over visual entities and relations for improved image understanding. |
Current VLMs lack compositional reasoning abilities crucial for visual understanding tasks, often exhibiting "bag-of-words" behavior and failing to represent entities and relationships accurately. |
CoVLM introduces communication tokens into LLMs, enabling step-by-step communication with visual components and relations. The vision module uses a detection network to propose relevant regions based on language inputs, and the LLM uses these regions for better language generation. |
CoVLM outperforms previous VLMs on compositional reasoning benchmarks like ARO, Cola, and HICO-DET by significant margins.
It demonstrates superior performance in tasks requiring fine-grained object recognition and reasoning about relations between entities.
CoVLM achieves competitive results on traditional vision-language tasks like referring expression comprehension and visual question answering. |
The paper doesn't delve deeply into object-attribute or spatial event compositionality.
Further exploration of these aspects is crucial for future work. |
vision-language models, compositional reasoning, object detection, large language models, vision-language communication |
2311.03352
Report |
Rethinking Evaluation Metrics of Open-Vocabulary Segmentation |
Hao Zhou, Tiancheng Shen, Xu Yang, Hai Huang, Xiangtai Li, Lu Qi, Ming-Hsuan Yang |
In this paper, we highlight a problem with the evaluation metrics adopted in
open-vocabulary segmentation. That is, the evaluation process still heavily
relies on closed-set metrics on zero-shot or cross-dataset pipelines without
considering the similarity between predicted and ground truth categories. To
tackle this issue, we first survey eleven similarity measurements between two
categorical words using WordNet linguistics statistics, text embedding, and
language models by comprehensive quantitative analysis and user study. Built
upon those explored measurements, we designed novel evaluation metrics, namely
Open mIoU, Open AP, and Open PQ, tailored for three open-vocabulary
segmentation tasks. We benchmarked the proposed evaluation metrics on 12
open-vocabulary methods of three segmentation tasks. Despite the relative
subjectivity of similarity distance, we demonstrate that our metrics can still
effectively evaluate the open ability of the existing open-vocabulary segmentation
methods. We hope that our work can bring the community new thinking about
how to evaluate the open ability of models. The evaluation code is released on
GitHub. |
This paper proposes novel evaluation metrics (Open mIoU, Open AP, Open PQ) for open-vocabulary segmentation, addressing the limitations of existing closed-set metrics by incorporating semantic similarity between predicted and ground-truth labels. |
Existing open-vocabulary segmentation metrics rely on closed-set metrics and don't account for semantic similarity, failing to capture the true open-world performance of models. |
The authors explore and compare eleven similarity measurements, preferring WordNet's path similarity. They introduce open metrics that employ class-agnostic matching and semantic similarity-based scoring for semantic, instance, and panoptic segmentation. |
Open metrics consistently yield higher scores than the corresponding vanilla metrics, reflecting the credit they give to predictions that are semantically close to the ground-truth label.
Open metrics demonstrate sensitivity to segmentation and recognition quality, providing a more accurate evaluation.
User studies confirm that the proposed Open metrics and Path Similarity align well with human judgment. |
The choice of similarity measurement, while preferred, remains subjective and may not be universally suitable.
Future work could explore alternative similarity measures and their impact on open-vocabulary evaluation. |
open-vocabulary segmentation, evaluation metrics, semantic similarity, wordnet, class-agnostic matching |
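To make the role of path similarity in the report above concrete, the following is a minimal sketch. It uses the standard WordNet-style definition of path similarity, 1 / (1 + shortest path length in the hypernym graph), on a tiny hand-written taxonomy. The taxonomy, the function names, and the `soft_iou` weighting are illustrative assumptions, not the authors' definition of Open mIoU, Open AP, or Open PQ.

```python
from collections import deque

# Hypothetical toy hypernym graph (child -> parents); not real WordNet data.
HYPERNYMS = {
    "sofa": ["seat"],
    "chair": ["seat"],
    "seat": ["furniture"],
    "table": ["furniture"],
    "furniture": ["entity"],
    "dog": ["animal"],
    "animal": ["entity"],
    "entity": [],
}


def _neighbors(word):
    """Hypernym links are treated as undirected edges."""
    parents = HYPERNYMS.get(word, [])
    children = [c for c, ps in HYPERNYMS.items() if word in ps]
    return parents + children


def path_similarity(a, b):
    """Return 1 / (1 + shortest path length), or 0.0 if no path exists."""
    if a not in HYPERNYMS or b not in HYPERNYMS:
        return 0.0
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return 1.0 / (1.0 + dist)
        for nxt in _neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return 0.0


def soft_iou(iou, pred_label, gt_label):
    """Assumed scoring rule: scale a class-agnostic mask IoU by label similarity."""
    return iou * path_similarity(pred_label, gt_label)


print(path_similarity("sofa", "chair"))  # two edges apart
print(soft_iou(0.9, "sofa", "chair"))
```

Under a closed-set metric, predicting "sofa" for a "chair" mask scores zero; a similarity-weighted score such as the one sketched here gives partial credit instead.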
2311.03335
Report |
Cross-Image Attention for Zero-Shot Appearance Transfer |
Yuval Alaluf, Daniel Garibi, Or Patashnik, Hadar Averbuch-Elor, Daniel Cohen-Or |
Recent advancements in text-to-image generative models have demonstrated a
remarkable ability to capture a deep semantic understanding of images. In this
work, we leverage this semantic knowledge to transfer the visual appearance
between objects that share similar semantics but may differ significantly in
shape. To achieve this, we build upon the self-attention layers of these
generative models and introduce a cross-image attention mechanism that
implicitly establishes semantic correspondences across images. Specifically,
given a pair of images -- one depicting the target structure and the other
specifying the desired appearance -- our cross-image attention combines the
queries corresponding to the structure image with the keys and values of the
appearance image. This operation, when applied during the denoising process,
leverages the established semantic correspondences to generate an image
combining the desired structure and appearance. In addition, to improve the
output image quality, we harness three mechanisms that either manipulate the
noisy latent codes or the model's internal representations throughout the
denoising process. Importantly, our approach is zero-shot, requiring no
optimization or training. Experiments show that our method is effective across
a wide range of object categories and is robust to variations in shape, size,
and viewpoint between the two input images. |
This paper introduces a zero-shot method for semantic-based appearance transfer between objects in natural images, leveraging the implicit semantic correspondences captured by pretrained text-to-image diffusion models. |
The method addresses the limitations of existing appearance transfer techniques that require per-domain or per-image training, enabling flexible transfer across objects with variations in shape, size, and viewpoint. |
The core of the method is Cross-Image Attention, which replaces the self-attention layers in the diffusion model's decoder. It mixes queries from the structure image with keys and values from the appearance image to establish semantic correspondences. Further enhancements include attention map contrasting, appearance guidance using classifier-free guidance, and AdaIN for color distribution alignment. |
The method effectively transfers visual appearance between semantically similar objects, even with significant shape variations.
It outperforms existing techniques in qualitative comparisons, capturing finer details and preserving source structure.
Quantitative evaluations and user studies confirm its superiority in appearance fidelity and overall quality. |
Transferring appearance between objects lacking shared semantics remains challenging.
The quality of transfer relies on accurate and editable inversions of input images, which can be sensitive to inversion settings. |
appearance transfer, diffusion models, cross-image attention, zero-shot learning, semantic correspondences |
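The cross-image attention step described in the report above can be sketched as ordinary scaled dot-product attention in which the queries come from the structure image and the keys and values come from the appearance image. The sketch below uses plain Python lists in place of the diffusion model's feature tensors; the function names are hypothetical, and the attention map contrasting, appearance guidance, and AdaIN steps are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_image_attention(q_struct, k_app, v_app):
    """softmax(Q_struct K_app^T / sqrt(d)) V_app, one output row per structure query."""
    d = len(q_struct[0])
    out = []
    for q in q_struct:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in k_app]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, v_app))
            for j in range(len(v_app[0]))
        ])
    return out


# One structure query that strongly matches the first appearance key.
q_struct = [[10.0, 0.0]]
k_app = [[1.0, 0.0], [0.0, 1.0]]
v_app = [[1.0, 2.0], [3.0, 4.0]]
print(cross_image_attention(q_struct, k_app, v_app))
```

Each output row is a weighted average of appearance values, so the spatial layout follows the structure image while the content is drawn from the appearance image.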
2311.03287
Report |
Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges |
Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Linjun Zhang, James Zou, Huaxiu Yao |
While GPT-4V(ision) impressively models both visual and textual information
simultaneously, its hallucination behavior has not been systematically
assessed. To bridge this gap, we introduce a new benchmark, namely, the Bias
and Interference Challenges in Visual Language Models (Bingo). This benchmark
is designed to evaluate and shed light on the two common types of
hallucinations in visual language models: bias and interference. Here, bias
refers to the model's tendency to hallucinate certain types of responses,
possibly due to imbalance in its training data. Interference pertains to
scenarios where the judgment of GPT-4V(ision) can be disrupted due to how the
text prompt is phrased or how the input image is presented. We identify a
notable regional bias, whereby GPT-4V(ision) is better at interpreting Western
images or images with English writing compared to images from other countries
or containing text in other languages. Moreover, GPT-4V(ision) is vulnerable to
leading questions and is often confused when interpreting multiple images
together. Popular mitigation approaches, such as self-correction and
chain-of-thought reasoning, are not effective in resolving these challenges. We
also identified similar biases and interference vulnerabilities with LLaVA and
Bard. Our results characterize the hallucination challenges in GPT-4V(ision)
and state-of-the-art visual-language models, and highlight the need for new
solutions. The Bingo benchmark is available at https://github.com/gzcch/Bingo. |
This paper introduces Bingo, a new benchmark to analyze hallucinations in vision-language models (VLMs), focusing on GPT-4V(ision). |
Understanding the limitations and potential biases of VLMs like GPT-4V(ision) is crucial for improving their reliability and safety. |
The authors curated a benchmark of 190 failure instances across various categories of bias (regional, OCR, factual) and interference (image-to-image, text-to-image), comparing GPT-4V(ision)'s performance with LLaVA and Bard. Mitigation strategies like self-correction and chain-of-thought reasoning were also explored. |
GPT-4V(ision) exhibits significant regional bias, performing better on Western images and English text.
It is highly susceptible to interference, struggling to differentiate similar images and often adhering to inaccurate user claims.
Self-correction moderately reduces hallucinations, while chain-of-thought reasoning shows limited success. |
The benchmark's scope is limited to specific metrics and tasks.
Data curation relies on human judgment, potentially introducing bias. |
vision-language models, hallucination, bias, interference, gpt-4v |
2311.03233
Report |
Navigating Scaling Laws: Compute Optimality in Adaptive Model Training |
Sotiris Anagnostidis, Gregor Bachmann, Imanol Schlag, Thomas Hofmann |
In recent years, the state-of-the-art in deep learning has been dominated by
very large models that have been pre-trained on vast amounts of data. The
paradigm is very simple: investing more computational resources (optimally)
leads to better performance, and even predictably so; neural scaling laws have
been derived that accurately forecast the performance of a network for a
desired level of compute. This leads to the notion of a `compute-optimal'
model, i.e. a model that allocates a given level of compute during training
optimally to maximize performance. In this work, we extend the concept of
optimality by allowing for an `adaptive' model, i.e. a model that can change
its shape during training. By doing so, we can design adaptive models that
optimally traverse between the underlying scaling laws and outpace their
`static' counterparts, leading to a significant reduction in the required
compute to reach a given target performance. We show that our approach
generalizes across modalities and different shape parameters. |
This paper proposes an adaptive model training methodology that adjusts model shape (e.g., patch size for ViTs, context length for LLMs) during training to traverse between scaling laws, significantly reducing compute requirements for a target performance. |
Alleviating the exponential compute increase needed for performance improvement in deep learning, especially for large pre-trained models. |
By leveraging scaling laws for different model shapes, the method identifies the shape yielding the fastest performance gain at a given compute budget, enabling dynamic shape scheduling. |
Adaptive patch size/context length scheduling for ViTs/LLMs reduces required compute by up to 50% for a given performance.
The method generalizes to other shape parameters like model width, batch size, and training objectives.
Dynamically scheduled models consistently outperform static (fixed shape) models in terms of compute efficiency. |
The study primarily focuses on FLOPs, assuming a strong correlation with accelerator time, which might not always hold.
Determining the optimal scheduler necessitates knowledge of scaling behavior for different shape parameters, potentially incurring high computational cost. |
deep learning, scaling laws, adaptive training, vision transformers, language models |
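The shape-scheduling rule described in the report above (pick the shape whose scaling law gives the fastest loss reduction per unit of compute) can be sketched as follows. The sketch assumes each shape follows a power law L(C) = a * C^(-b) + c; the coefficients, the shape names, and the helper functions are made up for illustration and are not taken from the paper.

```python
# Hypothetical scaling-law coefficients (a, b, c) for L(C) = a * C**(-b) + c.
LAWS = {
    "patch32": (10.0, 0.5, 2.0),  # cheap per step, but plateaus at a higher loss
    "patch16": (20.0, 0.5, 1.0),  # costlier, but reaches a lower loss
}


def compute_at_loss(law, loss):
    """Invert the law: compute this shape needs to reach `loss` (None if unreachable)."""
    a, b, c = law
    if loss <= c:
        return None
    return ((loss - c) / a) ** (-1.0 / b)


def slope_at_loss(law, loss):
    """dL/dC of the law at the point where it attains `loss` (0.0 if unreachable)."""
    a, b, _ = law
    comp = compute_at_loss(law, loss)
    if comp is None:
        return 0.0
    return -a * b * comp ** (-b - 1.0)


def best_shape(laws, loss):
    """Shape with the steepest loss decrease per unit compute at the current loss."""
    return min(laws, key=lambda name: slope_at_loss(laws[name], loss))


for current_loss in (4.0, 2.5, 1.5):
    print(current_loss, best_shape(LAWS, current_loss))
```

With these made-up coefficients the schedule starts with the coarse shape and switches to the finer one once the coarse law flattens, which is the qualitative behavior the report attributes to adaptive models.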
2311.03149
Report |
Asymmetric Masked Distillation for Pre-Training Small Foundation Models |
Zhiyu Zhao, Bingkun Huang, Sen Xing, Gangshan Wu, Yu Qiao, Limin Wang |
Self-supervised foundation models have shown great potential in computer
vision thanks to the pre-training paradigm of masked autoencoding. Scale is a
primary factor influencing the performance of these foundation models. However,
these large foundation models often result in high computational cost. This
paper focuses on pre-training relatively small vision transformer models that
could be efficiently adapted to downstream tasks. Specifically, taking
inspiration from knowledge distillation in model compression, we propose a new
asymmetric masked distillation (AMD) framework for pre-training relatively
small models with autoencoding. The core of AMD is to devise an asymmetric
masking strategy, where the teacher model is enabled to see more context
information with a lower masking ratio, while the student model is still
equipped with a high masking ratio. We design customized multi-layer feature
alignment between the teacher encoder and student encoder to regularize the
pre-training of student MAE. To demonstrate the effectiveness and versatility
of AMD, we apply it to both ImageMAE and VideoMAE for pre-training relatively
small ViT models. AMD achieves 84.6% classification accuracy on IN1K using the
ViT-B model, and 73.3% classification accuracy using the ViT-B
model on the Something-Something V2 dataset, a 3.7% improvement over the
original ViT-B model from VideoMAE. We also transfer AMD pre-trained models to
downstream tasks and obtain consistent performance improvement over the
original masked autoencoding. The code and models are available at
https://github.com/MCG-NJU/AMD. |
This paper proposes Asymmetric Masked Distillation (AMD), a novel framework for pre-training smaller vision transformer models using an asymmetric masking strategy within a knowledge distillation framework. |
Large foundation models are computationally expensive. AMD aims to pre-train smaller models efficiently, enabling their adaptation to downstream tasks with reduced computational costs. |
AMD employs a teacher-student distillation framework. The teacher model (larger, pre-trained) uses a lower masking ratio, accessing more context. The student model (smaller) has a higher masking ratio, learning through pixel reconstruction and multi-layer feature alignment with the teacher. |
AMD achieves 73.3% accuracy on SSV2 using ViT-B, closing the gap with the larger teacher model (74.3%).
AMD demonstrates robust transfer learning performance, improving accuracy on action recognition tasks like SSV2, UCF101, and HMDB51.
AMD outperforms symmetric distillation methods like DMAE, highlighting the benefits of asymmetric masking and feature alignment in capturing richer context information. |
The optimal masking ratio for the teacher model needs careful consideration for balancing performance and computational cost.
Exploring AMD with more complex architectures and larger datasets could further enhance performance. |
knowledge distillation, vision transformers, self-supervised learning, masked autoencoding, action recognition |
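The asymmetric masking idea in the report above can be sketched as drawing two visible-token sets from one random permutation, so the teacher (lower masking ratio) sees more patches than the student (higher masking ratio). The ratios, the function name, and the choice to make the student's visible set a subset of the teacher's are assumptions for illustration; the report does not specify them.

```python
import random


def asymmetric_masks(num_tokens, teacher_ratio, student_ratio, seed=0):
    """Return (teacher_visible, student_visible) token indices.

    Assumption: both sets are prefixes of the same shuffled order, so every
    token the student sees is also visible to the teacher.
    """
    if not 0.0 <= teacher_ratio <= student_ratio < 1.0:
        raise ValueError("expected 0 <= teacher_ratio <= student_ratio < 1")
    order = list(range(num_tokens))
    random.Random(seed).shuffle(order)
    n_teacher = num_tokens - int(num_tokens * teacher_ratio)
    n_student = num_tokens - int(num_tokens * student_ratio)
    return sorted(order[:n_teacher]), sorted(order[:n_student])


# 14 x 14 = 196 patches; the masking ratios below are hypothetical.
teacher_visible, student_visible = asymmetric_masks(196, 0.75, 0.90)
print(len(teacher_visible), len(student_visible))
```

Feature alignment between teacher and student would then be computed on tokens both models can see, which is why the subset relationship is convenient in this sketch.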
2311.03079
Report |
CogVLM: Visual Expert for Pretrained Language Models |
Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, Jie Tang |
We introduce CogVLM, a powerful open-source visual language foundation model.
Unlike the popular shallow alignment method, which maps image features
into the input space of the language model, CogVLM bridges the gap between the
frozen pretrained language model and image encoder by a trainable visual expert
module in the attention and FFN layers. As a result, CogVLM enables deep fusion
of vision language features without sacrificing any performance on NLP tasks.
CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal
benchmarks, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+,
RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and ranks second on
VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X
55B. Codes and checkpoints are available at https://github.com/THUDM/CogVLM. |
This paper introduces CogVLM, an open-source visual language foundation model that deeply integrates visual and linguistic features while preserving the capabilities of a pretrained large language model. |
Existing methods for integrating vision into large language models either rely on shallow alignment, limiting performance, or risk catastrophic forgetting of language abilities when directly training on image-text data. CogVLM addresses these limitations. |
CogVLM employs a trainable visual expert module within the language model's architecture, enabling deep fusion of visual and linguistic information without modifying the original language model parameters. This approach facilitates comprehensive multimodal pretraining and fine-tuning on diverse datasets. |
CogVLM-17B achieves state-of-the-art results across 17 visual language benchmarks, including image captioning, visual question answering, LVLM benchmarks, and visual grounding.
The visual expert module is shown to be crucial, outperforming shallow alignment methods and even surpassing models with larger language models or specialized training.
Ablation studies validate the design choices of CogVLM, including the visual expert's architecture, initialization, and the use of causal attention masks for visual tokens. |
The model's performance in handling complex compositional reasoning or tasks requiring extensive external knowledge is not extensively evaluated.
Future work can explore advanced alignment techniques like RLHF and anti-hallucination strategies to further enhance CogVLM's capabilities and address potential biases. |
multimodal learning, vision language models, deep fusion, visual expert, large language models |
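The visual expert described in the report above can be pictured as per-token routing: image tokens pass through trainable expert weights while text tokens keep using the frozen language-model weights, and the results stay in one sequence so attention can run over both. The sketch below shows only that routing for a single projection, using plain lists; the function names and the toy weights are hypothetical.

```python
def matvec(weight, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def route_projection(hidden, is_image, w_text, w_image):
    """Project each token with the expert weights if it is an image token,
    otherwise with the (frozen) language-model weights."""
    return [
        matvec(w_image if img else w_text, h)
        for h, img in zip(hidden, is_image)
    ]


# Toy weights: identity for text tokens, a scaled identity for image tokens.
w_text = [[1.0, 0.0], [0.0, 1.0]]
w_image = [[2.0, 0.0], [0.0, 2.0]]
hidden = [[1.0, 2.0], [3.0, 4.0]]  # token 0 is an image token, token 1 is text
print(route_projection(hidden, [True, False], w_text, w_image))
```

Because text tokens never touch the expert weights, a text-only input produces the same projections as the original language model, which is consistent with the report's claim that NLP performance is preserved.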
2311.02848
Report |
Consistent4D: Consistent 360° Dynamic Object Generation from Monocular Video |
Yanqin Jiang, Li Zhang, Jin Gao, Weimin Hu, Yao Yao |
In this paper, we present Consistent4D, a novel approach for generating 4D
dynamic objects from uncalibrated monocular videos. Uniquely, we cast the
360-degree dynamic object reconstruction as a 4D generation problem,
eliminating the need for tedious multi-view data collection and camera
calibration. This is achieved by leveraging the object-level 3D-aware image
diffusion model as the primary supervision signal for training Dynamic Neural
Radiance Fields (DyNeRF). Specifically, we propose a Cascade DyNeRF to
facilitate stable convergence and temporal continuity under the supervision
signal which is discrete along the time axis. To achieve spatial and temporal
consistency, we further introduce an Interpolation-driven Consistency Loss. It
is optimized by minimizing the discrepancy between rendered frames from DyNeRF
and interpolated frames from a pre-trained video interpolation model. Extensive
experiments show that Consistent4D performs competitively with prior art
alternatives, opening up new possibilities for 4D dynamic object generation
from monocular videos, whilst also demonstrating advantages for conventional
text-to-3D generation tasks. Our project page is
https://consistent4d.github.io/. |
Presents Consistent4D, a novel framework for generating 360° 4D dynamic objects from uncalibrated, static monocular videos using a Cascade Dynamic Neural Radiance Field (DyNeRF) optimized by a 2D image diffusion model and a novel Interpolation-driven Consistency Loss (ICL). |
Addresses limitations of existing 4D reconstruction methods reliant on multi-view data or restricted capture setups by enabling dynamic object generation from readily available monocular videos. |
Leverages a cascade DyNeRF architecture trained with Score Distillation Sampling (SDS) from a pre-trained image diffusion model. Introduces ICL to enforce spatial and temporal consistency by minimizing discrepancies between rendered and interpolated frames from a video interpolation model. A lightweight video enhancer further refines the generated output. |
Outperforms baseline dynamic NeRF methods in quantitative metrics (LPIPS, CLIP similarity) for novel view synthesis of dynamic objects.
Demonstrates superior spatial and temporal consistency compared to methods without ICL, effectively mitigating issues like multi-face artifacts.
The proposed ICL also shows promise in alleviating multi-face problems in conventional text-to-3D generation tasks. |
Struggles to generate accurate representations when the object's motion is overly complex or abrupt.
ICL, while effective in most cases, does not completely eliminate multi-face artifacts in all text-to-3D generation scenarios. |
4d generation, dynamic nerf, monocular video, score distillation sampling, interpolation-driven consistency |
2311.02826
Report |
InstructPix2NeRF: Instructed 3D Portrait Editing from a Single Image |
Jianhui Li, Shilong Liu, Zidong Liu, Yikai Wang, Kaiwen Zheng, Jinghui Xu, Jianmin Li, Jun Zhu |
With the success of Neural Radiance Field (NeRF) in 3D-aware portrait
editing, a variety of works have achieved promising results regarding both
quality and 3D consistency. However, these methods heavily rely on per-prompt
optimization when handling natural language as editing instructions. Due to the
lack of labeled human face 3D datasets and effective architectures, the area of
human-instructed 3D-aware editing for open-world portraits in an end-to-end
manner remains under-explored. To solve this problem, we propose an end-to-end
diffusion-based framework termed InstructPix2NeRF, which enables instructed
3D-aware portrait editing from a single open-world image with human
instructions. At its core lies a conditional latent 3D diffusion process that
lifts 2D editing to 3D space by learning the correlation between the paired
images' difference and the instructions via triplet data. With the help of our
proposed token position randomization strategy, we could even achieve
multi-semantic editing through one single pass with the portrait identity
well-preserved. Besides, we further propose an identity consistency module that
directly modulates the extracted identity signals into our diffusion process,
which increases the multi-view 3D identity consistency. Extensive experiments
verify the effectiveness of our method and show its superiority against strong
baselines quantitatively and qualitatively. Source code and pre-trained models
can be found on our project page:
https://mybabyyh.github.io/InstructPix2NeRF. |
Presents InstructPix2NeRF, an end-to-end diffusion-based framework for 3D-aware portrait editing from a single image guided by human instructions. |
Addresses the lack of end-to-end models for instructed 3D-aware portrait editing, aiming for more user-friendly and precise control over edits. |
Combines a conditional latent 3D diffusion process with NeRF-based generators. Utilizes a triplet dataset (original face, edited face, instruction) and introduces token position randomization and an identity consistency module. |
Achieves high fidelity to human instructions while maintaining identity consistency.
Enables multi-semantic editing from a single instruction, surpassing baseline methods in qualitative and quantitative evaluations.
Demonstrates superior performance in user studies for instruction correspondence and identity consistency. |
Slight variations in color output can occur between semantically similar instructions.
Fine details like eye shape and eyelashes can be further improved. |
3d-aware editing, human instructions, diffusion models, nerf, portrait editing |
2311.02709
Report |
Benchmarking a Benchmark: How Reliable is MS-COCO? |
Eric Zimmermann, Justin Szeto, Jerome Pasquero, Frederic Ratle |
Benchmark datasets are used to profile and compare algorithms across a
variety of tasks, ranging from image classification to segmentation, and also
play a large role in image pretraining algorithms. Emphasis is placed on
results with little regard to the actual content within the dataset. It is
important to question what kind of information is being learned from these
datasets and what nuances and biases lie within them. In the following
work, Sama-COCO, a re-annotation of MS-COCO, is used to discover potential
biases by leveraging a shape analysis pipeline. A model is trained and
evaluated on both datasets to examine the impact of different annotation
conditions. Results demonstrate that annotation styles are important and that
annotation pipelines should closely consider the task of interest. The dataset
is made publicly available at https://www.sama.com/sama-coco-dataset/ . |
This paper presents Sama-COCO, a re-annotated version of the MS-COCO dataset focused on tighter polygons and decomposed crowd instances, to investigate potential biases in annotation styles and their impact on model performance. |
Benchmark datasets like MS-COCO are crucial for evaluating computer vision algorithms. However, inconsistencies and biases within these datasets can impact the reliability of performance comparisons and the development of robust models. |
The authors re-annotated the MS-COCO dataset with stricter guidelines, emphasizing precise polygon boundaries and individual instance labeling. They trained a Faster R-CNN model on both datasets and compared performance using standard metrics. A shape analysis pipeline, based on contour analysis and distance transforms, was employed to quantify differences between corresponding annotations in both datasets. |
Annotation styles significantly impact model performance, with models performing better when trained and evaluated on datasets with consistent annotation guidelines.
MS-COCO exhibits a bias towards avoiding annotation around occluding objects, leading to variations in model outputs compared to Sama-COCO, which emphasizes pixel-level accuracy.
Even a theoretically perfect predictor trained on one dataset might exhibit degraded performance on another due to variations in annotation styles. |
The analysis focuses on single polygon shapes and relies on bounding box shape consistency assumptions for matching annotations.
The study primarily utilizes a Faster R-CNN model. Exploring the impact of annotation styles on other architectures could provide further insights. |
dataset bias, annotation quality, instance segmentation, ms-coco, sama-coco |
2311.02542
Report |
VR-NeRF: High-Fidelity Virtualized Walkable Spaces |
Linning Xu, Vasu Agrawal, William Laney, Tony Garcia, Aayush Bansal, Changil Kim, Samuel Rota Bulò, Lorenzo Porzi, Peter Kontschieder, Aljaž Božič, Dahua Lin, Michael Zollhöfer, Christian Richardt |
We present an end-to-end system for the high-fidelity capture, model
reconstruction, and real-time rendering of walkable spaces in virtual reality
using neural radiance fields. To this end, we designed and built a custom
multi-camera rig to densely capture walkable spaces in high fidelity and with
multi-view high dynamic range images in unprecedented quality and density. We
extend instant neural graphics primitives with a novel perceptual color space
for learning accurate HDR appearance, and an efficient mip-mapping mechanism
for level-of-detail rendering with anti-aliasing, while carefully optimizing
the trade-off between quality and speed. Our multi-GPU renderer enables
high-fidelity volume rendering of our neural radiance field model at the full
VR resolution of dual 2K×2K at 36 Hz on our custom demo machine. We
demonstrate the quality of our results on our challenging high-fidelity
datasets, and compare our method and datasets to existing baselines. We release
our dataset on our project website. |
VR-NeRF: An end-to-end system for high-fidelity capture, reconstruction, and real-time rendering of walkable spaces in VR using neural radiance fields. |
Existing approaches for VR view synthesis are limited to either small headbox volumes or lower quality scene-scale rendering. |
A custom multi-camera rig ("Eyeful Tower") captures dense, high-resolution HDR images. A novel NeRF model with perceptual color space and efficient mip-mapping enables high-fidelity reconstruction and rendering. A multi-GPU renderer enables real-time VR exploration. |
Captures large-scale datasets with unprecedented quality and density (thousands of 50MP HDR images).
Proposed VR-NeRF model outperforms baselines in visual fidelity for large-scale HDR scenes.
Achieves real-time rendering (36 FPS) at VR resolution on a custom 20-GPU workstation. |
Limited ability to handle dynamic scene elements like moving objects or lighting.
View extrapolation capabilities are limited by the capture density. |
neural radiance fields, virtual reality, novel view synthesis, high dynamic range imaging, multi-view capture |
2311.02536
Report |
Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models |
Jingru Yi, Burak Uzkent, Oana Ignat, Zili Li, Amanmeet Garg, Xiang Yu, Linda Liu |
Grounding-based vision and language models have been successfully applied to
low-level vision tasks, aiming to precisely locate objects referred to in
captions. The effectiveness of grounding representation learning heavily relies
on the scale of the training dataset. Despite being a useful data enrichment
strategy, data augmentation has received minimal attention in existing vision
and language tasks as augmentation for image-caption pairs is non-trivial. In
this study, we propose a robust phrase grounding model trained with
text-conditioned and text-unconditioned data augmentations. Specifically, we
apply text-conditioned color jittering and horizontal flipping to ensure
semantic consistency between images and captions. To guarantee image-caption
correspondence in the training samples, we modify the captions according to
pre-defined keywords when applying horizontal flipping. Additionally, inspired
by recent masked signal reconstruction, we propose to use pixel-level masking
as a novel form of data augmentation. While we demonstrate our data
augmentation method with MDETR framework, the proposed approach is applicable
to common grounding-based vision and language tasks with other frameworks.
Finally, we show that an image encoder pretrained on large-scale image and
language datasets (such as CLIP) can further improve the results. Through
extensive experiments on three commonly used datasets (Flickr30k, referring
expressions, and GQA), our method demonstrates improved performance over the
state of the art on various metrics. Code can be found at
https://github.com/amzn/augment-the-pairs-wacv2024. |
The paper proposes a robust phrase grounding model trained with text-conditioned and text-unconditioned data augmentations to improve grounding-based vision and language models, particularly for object localization within captions. |
Data augmentation, though crucial for enriching datasets and model generalization, has been under-explored in phrase grounding tasks due to the challenge of maintaining image-caption semantic consistency. |
The methodology involves text-conditioned color jittering and horizontal flipping applied selectively based on caption content to ensure semantic consistency. It also uses pixel-level masking as a novel data augmentation approach. The approach is demonstrated with the MDETR framework and incorporates image encoders pre-trained on large-scale image and language datasets like CLIP. |
The proposed method consistently outperforms MDETR, a state-of-the-art phrase grounding model, on various benchmarks like Flickr30k, Referring Expressions, and GQA.
Text-conditioned horizontal flipping, which modifies captions based on a keyword list, significantly improves performance on the Referring Expressions dataset, highlighting its effectiveness in learning complex image-caption relationships.
The model exhibits better semantic understanding and robustness in phrase grounding, demonstrated by its ability to suppress redundant detections and make accurate distinctions between objects based on subtle cues. |
The model still faces challenges in detecting small objects that blend with the background and recognizing text within images.
Future work will explore generalizing to a wider range of data augmentations to further enhance the diversity of the input space. |
phrase grounding, data augmentation, vision and language, object localization, mdetr |
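The text-conditioned augmentations in the report above can be sketched as three small helpers: swap direction keywords in the caption when the image is flipped horizontally, mirror the box coordinates, and skip color jittering when the caption mentions a color. The keyword lists, function names, and box format (x_min, y_min, x_max, y_max in pixels) are assumptions for illustration, not the authors' pre-defined lists.

```python
import re

# Hypothetical keyword lists; the report does not enumerate the real ones.
SWAP = {"left": "right", "right": "left"}
COLOR_WORDS = {"red", "green", "blue", "yellow", "black", "white", "brown"}
_DIRECTION = re.compile(r"\b(left|right)\b", re.IGNORECASE)


def _swap_word(match):
    word = match.group(0)
    swapped = SWAP[word.lower()]
    # Keep a leading capital; other casing is normalised to lowercase.
    return swapped.capitalize() if word[0].isupper() else swapped


def flip_caption(caption):
    """Swap 'left' and 'right' so the caption still matches a mirrored image."""
    return _DIRECTION.sub(_swap_word, caption)


def flip_box(box, image_width):
    """Mirror an (x_min, y_min, x_max, y_max) box about the vertical axis."""
    x_min, y_min, x_max, y_max = box
    return (image_width - x_max, y_min, image_width - x_min, y_max)


def allow_color_jitter(caption):
    """Only jitter colors when the caption does not depend on a color word."""
    words = set(re.findall(r"[a-z]+", caption.lower()))
    return not (words & COLOR_WORDS)


print(flip_caption("the man on the left of the right door"))
print(flip_box((10, 20, 50, 80), 100))
print(allow_color_jitter("a red car"))
```

A single regex pass is used for the swap so that "left" and "right" are exchanged simultaneously; two sequential string replacements would turn every direction word into the same one.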
2311.02343
Report |
Stable Diffusion Reference Only: Image Prompt and Blueprint Jointly Guided Multi-Condition Diffusion Model for Secondary Painting |
Hao Ai, Lu Sheng |
Stable Diffusion and ControlNet have achieved excellent results in the field
of image generation and synthesis. However, due to the granularity and method
of their control, the efficiency improvement is limited for professional artistic
creations such as comics and animation production, whose main work is secondary
painting. In the current workflow, fixing characters and image styles often
needs lengthy text prompts, and even requires further training through
TextualInversion, DreamBooth or other methods, which is very complicated and
expensive for painters. Therefore, we present a new method in this paper,
Stable Diffusion Reference Only, an images-to-image self-supervised model that
uses only two types of conditional images for precise control generation to
accelerate secondary painting. The first type of conditional image serves as an
image prompt, supplying the necessary conceptual and color information for
generation. The second type is a blueprint image, which controls the visual
structure of the generated image. It is natively embedded into the original
UNet, eliminating the need for ControlNet. We released all the code for the
module and pipeline, and trained a controllable character line art coloring
model at https://github.com/aihao2000/stable-diffusion-reference-only, that
achieved state-of-the-art results in this field. This verifies the
effectiveness of the structure and greatly improves the production efficiency
of animations, comics, and fanworks. |
This paper introduces Stable Diffusion Reference Only, a novel image-to-image self-supervised model for secondary painting that utilizes two conditional images: an image prompt and a blueprint image. |
Existing text-to-image models are inefficient for professional artistic creations like comics and animation, as they rely heavily on text prompts and require extensive training for specific styles. |
The model leverages a modified UNet architecture with cross-attention mechanisms to incorporate both image prompt and blueprint information. It is trained in a self-supervised manner using a dataset of anime images and CLIP similarity scores. |
Stable Diffusion Reference Only demonstrates state-of-the-art results in line art coloring.
The model exhibits generalization capabilities, enabling style transfer between different anime characters.
It outperforms existing methods like ControlNet Reference Only and IP-Adapter in terms of accuracy and generalization. |
The model's performance on complex backgrounds and diverse artistic styles requires further investigation.
Future work could explore extending the blueprint image to other forms like sketches and poses. |
image generation, secondary painting, stable diffusion, image-to-image translation, anime art |
2311.01813
Report |
FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation |
Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo Chen, Xu Sun, Lu Hou |
Recently, open-domain text-to-video (T2V) generation models have made
remarkable progress. However, the promising results are mainly shown by the
qualitative cases of generated videos, while the quantitative evaluation of T2V
models still faces two critical problems. Firstly, existing studies lack
fine-grained evaluation of T2V models on different categories of text prompts.
Although some benchmarks have categorized the prompts, their categorization
either only focuses on a single aspect or fails to consider the temporal
information in video generation. Secondly, it is unclear whether the automatic
evaluation metrics are consistent with human standards. To address these
problems, we propose FETV, a benchmark for Fine-grained Evaluation of
Text-to-Video generation. FETV is multi-aspect, categorizing the prompts based
on three orthogonal aspects: the major content, the attributes to control and
the prompt complexity. FETV is also temporal-aware, which introduces several
temporal categories tailored for video generation. Based on FETV, we conduct
comprehensive manual evaluations of four representative T2V models, revealing
their pros and cons on different categories of prompts from different aspects.
We also extend FETV as a testbed to evaluate the reliability of automatic T2V
metrics. The multi-aspect categorization of FETV enables fine-grained analysis
of the metrics' reliability in different scenarios. We find that existing
automatic metrics (e.g., CLIPScore and FVD) correlate poorly with human
evaluation. To address this problem, we explore several solutions to improve
CLIPScore and FVD, and develop two automatic metrics that exhibit significantly
higher correlation with humans than existing metrics. Benchmark page:
https://github.com/llyx97/FETV. |
The paper introduces FETV, a benchmark designed for fine-grained evaluation of open-domain text-to-video generation models. |
Existing quantitative evaluations of T2V models lack fine-grained analysis across categories and reliable automatic metrics. |
FETV categorizes text prompts based on major content (spatial/temporal), attribute control (spatial/temporal), and complexity. The authors manually evaluate four representative T2V models on FETV and analyze their performance across different categories. FETV is also used as a testbed to assess the reliability of automatic metrics. |
Existing T2V models struggle with generating high-quality videos, particularly for categories involving actions, kinetic motions, quantity control, motion direction, and event order.
Widely used automatic metrics (e.g., CLIPScore, FID, FVD) show poor correlation with human evaluation.
The authors develop two new automatic metrics, FVD-UMT (video quality) and UMTScore (video-text alignment), which exhibit higher correlation with human judgment than existing metrics. |
The number of evaluated T2V models is limited due to the scarcity of open-sourced models.
While improved, the proposed UMT-based metrics still have room for better alignment with human evaluation. |
text-to-video generation, benchmark, evaluation metrics, fine-grained evaluation, vision-language models |
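The reliability analysis in the FETV entry above comes down to checking how closely an automatic metric ranks videos compared with human raters. As a minimal sketch, the code below computes a rank correlation in plain Python. The choice of Spearman's coefficient, the function names, and any scores passed in are illustrative assumptions and are not taken from the FETV code.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(metric_scores, human_scores):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))
```

A value near 1 means the metric orders the videos the way the human raters do; a value near 0 means the two orderings are essentially unrelated, which is the failure mode the entry reports for existing metrics.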
2311.01804
Report |
inkn'hue: Enhancing Manga Colorization from Multiple Priors with Alignment Multi-Encoder VAE |
Tawin Jiramahapokee |
Manga, a form of Japanese comics and distinct visual storytelling, has
captivated readers worldwide. Traditionally presented in black and white,
manga's appeal lies in its ability to convey complex narratives and emotions
through intricate line art and shading. Yet, the desire to experience manga in
vibrant colors has sparked the pursuit of manga colorization, a task of
paramount significance for artists. However, existing methods, originally
designed for line art and sketches, face challenges when applied to manga.
These methods often fall short in achieving the desired results, leading to the
need for specialized manga-specific solutions. Existing approaches frequently
rely on a single training step or extensive manual artist intervention, which
can yield less satisfactory outcomes. To address these challenges, we propose a
specialized framework for manga colorization. Leveraging established models for
shading and vibrant coloring, our approach aligns both using a multi-encoder
VAE. This structured workflow ensures clear and colorful results, with the
option to incorporate reference images and manual hints. |
This paper introduces a novel framework for user-guided manga colorization that leverages a multi-encoder VAE to enhance the color consistency, shading, and line art quality of manga pages, building upon existing models for shading and colorization. |
Existing methods for manga colorization often result in color bleeding, text clarity issues, and a lack of vibrancy. This framework aims to address these limitations and provide a streamlined approach for producing high-quality colorized manga. |
The framework combines a shading model and a rough colorization model, aligning their outputs using a multi-encoder VAE. This VAE is trained to correct inconsistencies and enhance the overall visual quality, while CIELAB interpolation is employed as a post-processing step to fine-tune color saturation and truthfulness. |
The framework effectively restores line art details lost during the initial colorization process, resulting in sharper features and improved text clarity.
The multi-encoder VAE successfully corrects color outliers and inconsistencies from the rough colorization stage, producing more uniform and realistic results.
User studies show a strong preference for the framework's post-processed outputs over the rough-colorized priors, highlighting its effectiveness in enhancing visual appeal. |
The lack of publicly available datasets for manga colorization necessitated the compilation of a training dataset from various sources, potentially introducing bias.
The study primarily focused on comparing the framework's performance against its internal stages rather than external benchmarks due to the lack of user-guidance support in some existing models. |
manga colorization, multi-encoder vae, cielab interpolation, user-guided colorization, deep learning |
2311.01797
Report |
On the Generalization Properties of Diffusion Models |
Puheng Li, Zhong Li, Huishuai Zhang, Jiang Bian |
Diffusion models are a class of generative models that serve to establish a
stochastic transport map between an empirically observed, yet unknown, target
distribution and a known prior. Despite their remarkable success in real-world
applications, a theoretical understanding of their generalization capabilities
remains underdeveloped. This work embarks on a comprehensive theoretical
exploration of the generalization attributes of diffusion models. We establish
theoretical estimates of the generalization gap that evolves in tandem with the
training dynamics of score-based diffusion models, suggesting a polynomially
small generalization error ($O(n^{-2/5}+m^{-4/5})$) on both the sample size $n$
and the model capacity $m$, evading the curse of dimensionality (i.e., not
exponentially large in the data dimension) when early-stopped. Furthermore, we
extend our quantitative analysis to a data-dependent scenario, wherein target
distributions are portrayed as a succession of densities with progressively
increasing distances between modes. This precisely elucidates the adverse
effect of "modes shift" in ground truths on the model generalization. Moreover,
these estimates are not solely theoretical constructs but have also been
confirmed through numerical simulations. Our findings contribute to the
rigorous understanding of diffusion models' generalization properties and
provide insights that may guide practical applications. |
This paper provides a theoretical analysis of the generalization capability of diffusion models, showing that the generalization gap is polynomially small on sample size and model capacity with early-stopping. |
Understanding the generalization properties of diffusion models is crucial to address memorization, privacy, and copyright concerns arising from their impressive empirical performance and practical applications. |
The authors derive upper bounds of the generalization gap, measured by KL divergence, along the training dynamics, using techniques like Rademacher complexity and convex optimization analysis. |
The generalization error scales polynomially with sample size ($O(n^{-2/5})$) and model capacity ($O(m^{-4/5})$) when early-stopped, avoiding the curse of dimensionality.
For target distributions with distant multi-modes, the generalization capability is adversely affected by the distance between modes, as illustrated by the example of Gaussian mixtures.
Numerical simulations on synthetic and real-world datasets verify the theoretical findings of early-stopping generalization and the modes shift effect. |
The analysis focuses on a specific score network architecture (random feature model) and might not directly extend to other variants in the diffusion models family.
Future work could explore extending the theoretical framework to more complex score networks, like neural tangent kernels or mean-field models. |
diffusion models, score-based generative models, generalization, memorization, modes shift |
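To make the rate $O(n^{-2/5}+m^{-4/5})$ from the entry above more concrete, the following sketch evaluates only the shape of the bound. The function names are hypothetical, and the constants and any dependence on other quantities hidden by the big-O notation are deliberately omitted, so the numbers are meaningful only as ratios.

```python
def sample_term(n):
    """Shape of the sample-size term, n^(-2/5), with constants omitted."""
    return n ** (-2 / 5)


def capacity_term(m):
    """Shape of the model-capacity term, m^(-4/5), with constants omitted."""
    return m ** (-4 / 5)


def bound_shape(n, m):
    """Sum of the two terms; an illustration, not the paper's full bound."""
    return sample_term(n) + capacity_term(m)


# Ratios show how each term reacts to a 32-fold increase in its argument.
sample_ratio = sample_term(32 * 1000) / sample_term(1000)
capacity_ratio = capacity_term(32 * 1000) / capacity_term(1000)
```

Since $32^{2/5}=4$ and $32^{4/5}=16$, a 32-fold increase in the sample size shrinks the first term by a factor of 4, while a 32-fold increase in capacity shrinks the second term by a factor of 16.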
2311.01773
Report |
PDF: Point Diffusion Implicit Function for Large-scale Scene Neural Representation |
Yuhan Ding, Fukun Yin, Jiayuan Fan, Hui Li, Xin Chen, Wen Liu, Chongshan Lu, Gang YU, Tao Chen |
Recent advances in implicit neural representations have achieved impressive
results by sampling and fusing individual points along sampling rays in the
sampling space. However, due to the explosively growing sampling space, finely
representing and synthesizing detailed textures remains a challenge for
unbounded large-scale outdoor scenes. To alleviate the dilemma of using
individual points to perceive the entire colossal space, we explore learning
the surface distribution of the scene to provide structural priors and reduce
the samplable space, and propose a Point Diffusion implicit Function, PDF, for
large-scale scene neural representation. The core of our method is a
large-scale point cloud super-resolution diffusion module that enhances the
sparse point cloud reconstructed from several training images into a dense
point cloud as an explicit prior. Then in the rendering stage, only sampling
points with prior points within the sampling radius are retained. That is, the
sampling space is reduced from the unbounded space to the scene surface.
Meanwhile, to fill in the background of the scene that cannot be provided by
point clouds, the region sampling based on Mip-NeRF 360 is employed to model
the background representation. Extensive experiments have demonstrated the
effectiveness of our method for large-scale scene novel view synthesis, which
outperforms relevant state-of-the-art baselines. |
This paper proposes **PDF**, a **P**oint **D**iffusion implicit **F**unction, for large-scale scene neural representation to enable more efficient and detailed novel view synthesis. |
Existing implicit neural representation methods struggle with large-scale outdoor scenes due to the explosively growing sampling space needed to represent details. |
PDF uses a two-stage approach: (1) It leverages a point diffusion model to generate a dense point cloud from a sparse point cloud reconstructed from training images, providing a surface prior. (2) It employs Point-NeRF for foreground rendering and Mip-NeRF 360 for background rendering, fusing the features for novel view synthesis. |
PDF outperforms state-of-the-art methods on large-scale scene datasets (OMMO, BlendedMVS) in terms of PSNR, SSIM, and LPIPS.
The point diffusion module effectively captures scene surface distribution and generates dense point cloud representations.
The fusion of foreground and background rendering modules leads to photorealistic results with fine details. |
The current approach trains a separate diffusion model for each scene, limiting efficiency.
Exploring cross-scene point cloud up-sampling generalization and generalized point diffusion NeRF are interesting future directions. |
neural radiance fields, novel view synthesis, point cloud processing, diffusion models, large-scale scene representation |
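The rendering-stage rule in the PDF entry above (keep a ray sample only if a prior point lies within the sampling radius) can be sketched as a simple filter. This is a brute-force illustration under assumed names and data layout (points as coordinate tuples); a practical implementation would presumably use a spatial index rather than comparing every sample with every prior point.

```python
import math


def retain_samples_near_surface(samples, prior_points, radius):
    """Keep only the samples that have at least one prior point within `radius`.

    `samples` and `prior_points` are sequences of 3D coordinate tuples.
    Brute force: every sample is compared against every prior point.
    """
    return [
        sample
        for sample in samples
        if any(math.dist(sample, point) <= radius for point in prior_points)
    ]


# Five samples along a ray on the z axis and one prior point on the surface.
ray_samples = [(0, 0, t) for t in range(5)]
dense_prior = [(0, 0, 2)]
kept = retain_samples_near_surface(ray_samples, dense_prior, radius=0.5)
```

Only the sample that coincides with the prior point survives here, which mirrors the idea of shrinking the sampling space from the whole ray to the neighbourhood of the scene surface.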
2311.01714
Report |
EXIM: A Hybrid Explicit-Implicit Representation for Text-Guided 3D Shape Generation |
Zhengzhe Liu, Jingyu Hu, Ka-Hei Hui, Xiaojuan Qi, Daniel Cohen-Or, Chi-Wing Fu |
This paper presents a new text-guided technique for generating 3D shapes. The
technique leverages a hybrid 3D shape representation, namely EXIM, combining
the strengths of explicit and implicit representations. Specifically, the
explicit stage controls the topology of the generated 3D shapes and enables
local modifications, whereas the implicit stage refines the shape and paints it
with plausible colors. Also, the hybrid approach separates the shape and color
and generates color conditioned on shape to ensure shape-color consistency.
Unlike the existing state-of-the-art methods, we achieve high-fidelity shape
generation from natural-language descriptions without the need for
time-consuming per-shape optimization or reliance on human-annotated texts
during training or test-time optimization. Further, we demonstrate the
applicability of our approach to generate indoor scenes with consistent styles
using text-induced 3D shapes. Through extensive experiments, we demonstrate the
compelling quality of our results and the high coherency of our generated
shapes with the input texts, surpassing the performance of existing methods by
a significant margin. Code and models are released at
https://github.com/liuzhengzhe/EXIM. |
This paper introduces EXIM, a novel hybrid explicit-implicit 3D shape representation, for high-fidelity text-guided 3D shape generation and modification. |
Existing methods for text-guided 3D shape generation suffer from limitations such as time-consuming optimization, unrealistic outputs, and difficulties in local shape modification. This work addresses these limitations by leveraging a hybrid shape representation. |
EXIM employs a two-stage pipeline. The first stage uses a 3D diffusion model in a compact wavelet domain to generate a coarse shape based on text. The second stage utilizes an implicit network to enhance details and generate color conditioned on the shape, guided by the text. |
EXIM generates high-fidelity 3D shapes from text descriptions, outperforming existing methods in terms of detail and realism.
The hybrid representation allows for local shape modification and independent editing of color and shape based on text.
The method can generate style-consistent indoor scenes by composing shapes generated from text prompts. |
The performance on categories trained with pseudo-annotations is limited by the quality of the image captioning model.
Automatic and accurate localization of regions of interest for modification based on text remains a challenge. |
3d shape generation, text-guided synthesis, hybrid representation, shape modification, indoor scene generation |
2311.01462
Report |
Idempotent Generative Network |
Assaf Shocher, Amil Dravid, Yossi Gandelsman, Inbar Mosseri, Michael Rubinstein, Alexei A. Efros |
We propose a new approach for generative modeling based on training a neural
network to be idempotent. An idempotent operator is one that can be applied
sequentially without changing the result beyond the initial application, namely
$f(f(z))=f(z)$. The proposed model $f$ is trained to map a source distribution
(e.g., Gaussian noise) to a target distribution (e.g., realistic images) using
the following objectives: (1) Instances from the target distribution should map
to themselves, namely $f(x)=x$. We define the target manifold as the set of all
instances that $f$ maps to themselves. (2) Instances that form the source
distribution should map onto the defined target manifold. This is achieved by
optimizing the idempotence term, $f(f(z))=f(z)$, which encourages the range of
$f(z)$ to be on the target manifold. Under ideal assumptions such a process
provably converges to the target distribution. This strategy results in a model
capable of generating an output in one step, maintaining a consistent latent
space, while also allowing sequential applications for refinement.
Additionally, we find that by processing inputs from both target and source
distributions, the model adeptly projects corrupted or modified data back to
the target manifold. This work is a first step towards a "global projector"
that enables projecting any input into a target data distribution. |
This paper introduces Idempotent Generative Networks (IGN), a novel generative model trained to project inputs onto a target data manifold by optimizing for idempotence (i.e., $f(f(z))=f(z)$). |
IGN aims to create a "global projector" capable of mapping various inputs, including noise, corrupted instances, and alternative data distributions, onto a desired target distribution (e.g., realistic images). |
IGN uses a self-adversarial training process with three objectives: (1) Reconstruction: mapping target distribution samples to themselves, (2) Idempotence: ensuring the model output, when fed back as input, remains unchanged, and (3) Tightness: keeping the set of inputs that the model maps to themselves as small as possible, so that the model does not trivially act as the identity. |
Theoretically, under ideal conditions, IGN’s generated distribution converges to the target distribution.
Experiments on MNIST and CelebA demonstrate IGN’s ability to generate images from noise, showing that sequential applications can refine generated samples.
IGN exhibits out-of-distribution projection capabilities, successfully handling tasks like denoising, colorization, and sketch-to-image translation without explicit training on these tasks. |
Similar to GANs, IGN can suffer from mode collapse, potentially limiting the diversity of generated samples.
The generated samples can appear blurry, a common issue in autoencoder-based models. |
generative models, idempotence, image generation, image-to-image translation, out-of-distribution projection |
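The two objectives stated in the abstract of the entry above, $f(x)=x$ on target samples and $f(f(z))=f(z)$ on source samples, can be written as two loss functions. The sketch below is an assumption-laden illustration: it uses squared Euclidean distance, plain Python lists instead of tensors, toy hand-written maps instead of a trained network, and it leaves out the third (tightness) objective and all gradient handling.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def reconstruction_loss(f, target_batch):
    """Mean distance between f(x) and x; zero when targets are fixed points of f."""
    return sum(squared_distance(f(x), x) for x in target_batch) / len(target_batch)


def idempotence_loss(f, source_batch):
    """Mean distance between f(f(z)) and f(z); zero when f is idempotent."""
    total = 0.0
    for z in source_batch:
        once = f(z)
        twice = f(once)
        total += squared_distance(twice, once)
    return total / len(source_batch)


def project_to_first_axis(v):
    """Toy idempotent map: keep the first coordinate, zero out the rest."""
    return [v[0]] + [0.0] * (len(v) - 1)


def halve(v):
    """Toy non-idempotent map: applying it twice differs from applying it once."""
    return [0.5 * a for a in v]
```

With the projection, the idempotence loss vanishes for any input, and the reconstruction loss vanishes only for inputs already on the first axis, which plays the role of the target manifold in this toy setting. The halving map fails the idempotence check because its output keeps changing under repeated application.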
2311.01410
Report |
The Blessing of Randomness: SDE Beats ODE in General Diffusion-based Image Editing |
Shen Nie, Hanzhong Allan Guo, Cheng Lu, Yuhao Zhou, Chenyu Zheng, Chongxuan Li |
We present a unified probabilistic formulation for diffusion-based image
editing, where a latent variable is edited in a task-specific manner and
generally deviates from the corresponding marginal distribution induced by the
original stochastic or ordinary differential equation (SDE or ODE). Instead, it
defines a corresponding SDE or ODE for editing. In the formulation, we prove
that the Kullback-Leibler divergence between the marginal distributions of the
two SDEs gradually decreases while that for the ODEs remains as the time
approaches zero, which shows the promise of SDE in image editing. Inspired by
it, we provide the SDE counterparts for widely used ODE baselines in various
tasks including inpainting and image-to-image translation, where SDE shows a
consistent and substantial improvement. Moreover, we propose SDE-Drag -- a
simple yet effective method built upon the SDE formulation for point-based
content dragging. We build a challenging benchmark (termed DragBench) with
open-set natural, art, and AI-generated images for evaluation. A user study on
DragBench indicates that SDE-Drag significantly outperforms our ODE baseline,
existing diffusion-based methods, and the renowned DragGAN. Our results
demonstrate the superiority and versatility of SDE in image editing and push
the boundary of diffusion-based editing methods. |
This paper introduces a unified probabilistic perspective for analyzing diffusion-based image editing and demonstrates the superiority of stochastic differential equation (SDE) formulations over the commonly used ordinary differential equation (ODE) methods. |
Existing diffusion-based image editing methods lack a probabilistic understanding, and ODE formulations are predominantly used due to ease of implementation. This work provides theoretical justification for SDE's advantages in editing. |
The paper presents a unified formulation encompassing existing methods, proves theoretically that SDEs reduce the divergence between edited and data distributions while ODEs do not, and proposes SDE-Drag, a novel SDE-based method for point-based content dragging. |
SDE counterparts consistently outperform ODE baselines in inpainting and image-to-image translation tasks.
SDE-Drag, evaluated on a new challenging benchmark (DragBench), significantly outperforms ODE-Drag, existing diffusion-based dragging methods, and DragGAN.
SDE-based methods achieve superior results without increasing computational cost. |
Theoretical analysis does not fully account for model approximation and discretization errors.
Open-set image dragging remains a challenge with certain failure cases. |
diffusion model, image editing, sde, ode, image dragging |
2311.01373
Report |
Recognize Any Regions |
Haosen Yang, Chuofan Ma, Bin Wen, Yi Jiang, Zehuan Yuan, Xiatian Zhu |
Understanding the semantics of individual regions or patches within
unconstrained images, such as in open-world object detection, represents a
critical yet challenging task in computer vision. Building on the success of
powerful image-level vision-language (ViL) foundation models like CLIP, recent
efforts have sought to harness their capabilities by either training a
contrastive model from scratch with an extensive collection of region-label
pairs or aligning the outputs of a detection model with image-level
representations of region proposals. Despite notable progress, these approaches
are plagued by computationally intensive training requirements, susceptibility
to data noise, and deficiency in contextual information. To address these
limitations, we explore the synergistic potential of off-the-shelf foundation
models, leveraging their respective strengths in localization and semantics. We
introduce a novel, generic, and efficient region recognition architecture,
named RegionSpot, designed to integrate position-aware localization knowledge
from a localization foundation model (e.g., SAM) with semantic information
extracted from a ViL model (e.g., CLIP). To fully exploit pretrained knowledge
while minimizing training overhead, we keep both foundation models frozen,
focusing optimization efforts solely on a lightweight attention-based knowledge
integration module. Through extensive experiments in the context of open-world
object recognition, our RegionSpot demonstrates significant performance
improvements over prior alternatives, while also providing substantial
computational savings. For instance, our model can be trained on 3 million data
samples in a single day using 8 V100 GPUs. It outperforms GLIP by 6.5% in mean
average precision (mAP), with an even larger margin of 14.8% for more
challenging and rare categories. |
This paper introduces RegionSpot, a novel region recognition framework that leverages frozen vision and vision-language foundation models (e.g., SAM and CLIP) for efficient region recognition. |
Understanding the semantics of individual regions in images is crucial for tasks like open-world object detection. Existing methods suffer from high training costs, data noise susceptibility, and contextual information loss. |
RegionSpot integrates position-aware tokens from a localization model (SAM) with semantic features from a ViL model (CLIP) using cross-attention. This approach allows for efficient training by keeping both foundation models frozen. |
RegionSpot significantly outperforms previous methods in open-world object recognition, achieving a 6.5% higher mAP than GLIP.
The model demonstrates robustness to noisy region proposals, achieving state-of-the-art performance when using proposals from a ViL detector.
RegionSpot exhibits superior training efficiency, requiring substantially fewer GPU hours compared to GLIP and RegionCLIP. |
The current implementation primarily focuses on object recognition and relies on external object proposals.
Future work will explore end-to-end learning to incorporate both object localization and recognition. |
region recognition, open-world object detection, vision-language models, foundation models, zero-shot learning |
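The RegionSpot entry above says that position-aware region tokens are combined with semantic image features through cross-attention. A generic single-head scaled dot-product cross-attention, written in plain Python, is sketched below. It is not the RegionSpot module: the learned query, key, and value projections, multiple heads, and the real token dimensions are all omitted, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections.

    queries: region tokens (one vector per region).
    keys, values: image feature vectors (one pair per spatial location).
    Returns one output vector per query: a weighted mix of the values.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs
```

Each region token acts as a query and pulls in the image features it matches best, so only this small mixing step would need training while the two feature extractors stay frozen, as the entry describes.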
2311.01090
Report |
Infusion: Internal Diffusion for Video Inpainting |
Nicolas Cherel, Andrés Almansa, Yann Gousseau, Alasdair Newson |
Video inpainting is the task of filling a desired region in a video in a
visually convincing manner. It is a very challenging task due to the high
dimensionality of the signal and the temporal consistency required for
obtaining convincing results. Recently, diffusion models have shown impressive
results in modeling complex data distributions, including images and videos.
Diffusion models remain nonetheless very expensive to train and perform
inference with, which strongly restricts their application to video. We show
that in the case of video inpainting, thanks to the highly self-similar nature
of videos, the training of a diffusion model can be restricted to the video to
inpaint and still produce very satisfying results. This leads us to adopt an
internal learning approach, which also allows for a greatly reduced network
size. We call our approach "Infusion": an internal learning algorithm for video
inpainting through diffusion. Due to our frugal network, we are able to propose
the first video inpainting approach based purely on diffusion. Other methods
require supporting elements such as optical flow estimation, which limits their
performance in the case of dynamic textures for example. We introduce a new
method for efficient training and inference of diffusion models in the context
of internal learning. We split the diffusion process into different learning
intervals, which greatly simplifies the learning steps. We show qualitative
and quantitative results, demonstrating that our method reaches
state-of-the-art performance, in particular in the case of dynamic backgrounds
and textures. |
This paper introduces "Infusion," a purely diffusion-based video inpainting approach using internal learning, which enables high-quality video inpainting by training a lightweight network on a single video. |
Existing video inpainting methods often struggle with dynamic textures and rely on computationally expensive diffusion models or optical flow estimation; this work addresses those limitations. |
The method employs a 3D UNet architecture with a novel "interval training" scheme, where the network is trained on subsets of diffusion timesteps, leading to efficient training and inference. |
Achieves state-of-the-art performance on video inpainting tasks, particularly excelling in handling dynamic textures.
Significantly outperforms competing methods in reconstructing complex dynamic textures, as evidenced by LPIPS and VFID metrics.
Demonstrates superior performance in object removal scenarios, as indicated by the VFID metric. |
Limited temporal receptive field due to the convolutional architecture, potentially impacting long-range temporal consistency.
Longer inference times compared to some deep learning-based methods. |
video inpainting, diffusion models, internal learning, dynamic textures, interval training |
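The Infusion entry above describes splitting the diffusion process into learning intervals. One way to picture this is a helper that partitions the timestep range into contiguous blocks. The equal-width contiguous split, the function names, and the half-open interval convention are assumptions made for illustration; the entry does not specify how the intervals are chosen or in what order they are trained.

```python
def split_timesteps(num_timesteps, num_intervals):
    """Partition timesteps 0..num_timesteps-1 into contiguous half-open intervals.

    Returns a list of (start, end) pairs with end exclusive. Interval widths
    differ by at most one when the division is not exact.
    """
    if not 1 <= num_intervals <= num_timesteps:
        raise ValueError("num_intervals must be between 1 and num_timesteps")
    return [
        (
            i * num_timesteps // num_intervals,
            (i + 1) * num_timesteps // num_intervals,
        )
        for i in range(num_intervals)
    ]


def interval_of(timestep, intervals):
    """Return the index of the interval that contains `timestep`."""
    for index, (start, end) in enumerate(intervals):
        if start <= timestep < end:
            return index
    raise ValueError("timestep is outside every interval")
```

Under this reading, the network would only ever see noise levels from one interval at a time, which is a narrower learning problem than covering the full range of timesteps at once.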
2311.01015
Report |
Act As You Wish: Fine-Grained Control of Motion Diffusion Model with Hierarchical Semantic Graphs |
Peng Jin, Yang Wu, Yanbo Fan, Zhongqian Sun, Yang Wei, Li Yuan |
Most text-driven human motion generation methods employ sequential modeling
approaches, e.g., transformer, to extract sentence-level text representations
automatically and implicitly for human motion synthesis. However, these compact
text representations may overemphasize the action names at the expense of other
important properties and lack fine-grained details to guide the synthesis of
subtly distinct motion. In this paper, we propose hierarchical semantic graphs
for fine-grained control over motion generation. Specifically, we disentangle
motion descriptions into hierarchical semantic graphs including three levels of
motions, actions, and specifics. Such global-to-local structures facilitate a
comprehensive understanding of motion description and fine-grained control of
motion generation. Correspondingly, to leverage the coarse-to-fine topology of
hierarchical semantic graphs, we decompose the text-to-motion diffusion process
into three semantic levels, which correspond to capturing the overall motion,
local actions, and action specifics. Extensive experiments on two benchmark
human motion datasets, HumanML3D and KIT, show superior performance and
justify the efficacy of our method. More encouragingly, by modifying the edge
weights of hierarchical semantic graphs, our method can continuously refine the
generated motion, which may have a far-reaching impact on the community. Code
and pre-trained weights are available at
https://github.com/jpthu17/GraphMotion. |
This paper proposes GraphMotion, which leverages hierarchical semantic graphs for fine-grained control over text-driven human motion generation. |
Existing methods often overemphasize action names in text descriptions and lack fine-grained control over generated motions. This work aims to address this by using a more detailed and structured text representation. |
The authors use semantic role parsing to represent text descriptions as hierarchical semantic graphs with three levels: motions, actions, and specifics. They then design a coarse-to-fine motion diffusion model that progressively generates motion details, guided by the graph structure. |
GraphMotion outperforms state-of-the-art methods on HumanML3D and KIT benchmarks, demonstrating superior controllability and motion quality.
Modifying edge weights in the semantic graph enables fine-tuning of action attributes and durations.
The method successfully generates plausible motion even when verbs and action names are masked from the input text. |
The randomness inherent to diffusion models may occasionally lead to undesirable outputs.
The quality of generated motion is limited by the performance of the pre-trained motion variational autoencoder. |
human motion generation, text-to-motion, diffusion models, semantic graphs, fine-grained control |
2311.00990
Report |
VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning |
Hong Chen, Xin Wang, Guanning Zeng, Yipeng Zhang, Yuwei Zhou, Feilin Han, Wenwu Zhu |
Customized text-to-video generation aims to generate text-guided videos with
customized user-given subjects, which has gained increasing attention recently.
However, existing works are primarily limited to generating videos for a single
subject, leaving the more challenging problem of customized multi-subject
text-to-video generation largely unexplored. In this paper, we fill this gap
and propose a novel VideoDreamer framework. VideoDreamer can generate
temporally consistent text-guided videos that faithfully preserve the visual
features of the given multiple subjects. Specifically, VideoDreamer leverages
the pretrained Stable Diffusion with latent-code motion dynamics and temporal
cross-frame attention as the base video generator. The video generator is
further customized for the given multiple subjects by the proposed Disen-Mix
Finetuning and Human-in-the-Loop Re-finetuning strategy, which can tackle the
attribute binding problem of multi-subject generation. We also introduce
MultiStudioBench, a benchmark for evaluating customized multi-subject
text-to-video generation models. Extensive experiments demonstrate the
remarkable ability of VideoDreamer to generate videos with new content such as
new events and backgrounds, tailored to the customized multiple subjects. Our
project page is available at https://videodreamer23.github.io/. |
This paper introduces VideoDreamer, a novel framework for customized multi-subject text-to-video generation, capable of generating videos featuring user-specified subjects while adhering to textual prompts. |
This research addresses the unexplored area of customized multi-subject text-to-video generation, overcoming limitations of previous works that focused primarily on single-subject generation. |
VideoDreamer leverages a pretrained Stable Diffusion model enhanced with latent-code motion dynamics and temporal cross-frame attention. It employs a Disen-Mix Finetuning strategy to customize the model for multiple subjects, mitigating the attribute binding problem. An optional Human-in-the-Loop Re-finetuning (HLR) strategy can further enhance performance. |
VideoDreamer demonstrates superior subject fidelity compared to baseline models, effectively preserving visual features of multiple subjects.
The Disen-Mix finetuning strategy ensures high textual fidelity, preventing overfitting to subject-irrelevant information in mixed data.
VideoDreamer exhibits strong temporal consistency and minimal artifacts in generated videos, surpassing limitations of existing customization methods. |
The reliance on a single prompt to guide all frames limits the generation of videos with dynamic backgrounds or multiple events.
Temporal consistency leaves room for improvement because the video generator is not pretrained on text-video pairs. |
text-to-video generation, multi-subject customization, stable diffusion, disen-mix finetuning, human-in-the-loop refinement |
2311.00941
Report |
Gaussian Mixture Solvers for Diffusion Models |
Hanzhong Guo, Cheng Lu, Fan Bao, Tianyu Pang, Shuicheng Yan, Chao Du, Chongxuan Li |
Recently, diffusion models have achieved great success in generative tasks.
Sampling from diffusion models is equivalent to solving the reverse diffusion
stochastic differential equations (SDEs) or the corresponding probability flow
ordinary differential equations (ODEs). In comparison, SDE-based solvers can
generate samples of higher quality and are suited for image translation tasks
like stroke-based synthesis. During inference, however, existing SDE-based
solvers are severely constrained by the efficiency-effectiveness dilemma. Our
investigation suggests that this is because the Gaussian assumption in the
reverse transition kernel is frequently violated (even in the case of simple
mixture data) given a limited number of discretization steps. To overcome this
limitation, we introduce a novel class of SDE-based solvers called
Gaussian Mixture Solvers (GMS) for diffusion models. Our solver
estimates the first three-order moments and optimizes the parameters of a
Gaussian mixture transition kernel using generalized methods of moments in each
step during sampling. Empirically, our solver outperforms numerous SDE-based
solvers in terms of sample quality in image generation and stroke-based
synthesis in various diffusion models, which validates the motivation and
effectiveness of GMS. Our code is available at
https://github.com/Guohanzhong/GMS. |
The paper proposes a novel Gaussian Mixture Solver (GMS) for diffusion models, which employs a Gaussian mixture transition kernel in the reverse process to better approximate the true distribution and reduce discretization errors. |
Existing SDE-based solvers for diffusion models suffer from an efficiency-effectiveness dilemma, especially in tasks like image translation. This is because the Gaussian assumption for the reverse transition kernel often fails under limited discretization steps. |
GMS utilizes a noise prediction network with multiple heads to estimate high-order moments of the reverse transition kernel. It then fits a Gaussian mixture transition kernel in each sampling step using the generalized method of moments. |
GMS outperforms numerous SDE-based solvers in terms of sample quality on CIFAR10 and ImageNet 64x64, achieving a 4.44 FID improvement over the state-of-the-art SDE-based solver with 10 steps on CIFAR10.
In stroke-based image synthesis, GMS achieves higher realism than existing SDE-based and ODE-based solvers while maintaining comparable computation cost and faithfulness.
Theoretical and empirical evidence demonstrates that the true reverse transition kernel deviates from a Gaussian distribution, particularly with fewer sampling steps. |
GMS still requires more computation time than simpler SDE-based solvers, although it still shows improvements over them under the same maximum computation cost.
Like other generative models, diffusion models with GMS can potentially generate problematic fake content. |
diffusion models, gaussian mixture, sde solvers, image generation, stroke-based synthesis |
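The moment-matching idea in the Gaussian Mixture Solvers entry above can be illustrated with a minimal one-dimensional sketch. It assumes a two-component mixture with fixed weights (1/3 and 2/3) and a shared component variance, which makes the fit closed-form; these choices, the function names, and the scalar setting are illustrative assumptions, not the authors' solver, which optimizes the kernel parameters with the generalized method of moments at every sampling step.

```python
import numpy as np

def fit_two_component_mixture(mean, var, third_central, w1=1.0 / 3.0):
    """Match the first three moments with a 2-component Gaussian mixture.

    Assumes fixed weights (w1, 1 - w1) with w1 != 0.5 and a shared variance.
    Returns the two component means and the shared variance.
    """
    w2 = 1.0 - w1
    # Offsets d1, d2 of the component means from the overall mean satisfy
    # w1*d1 + w2*d2 = 0, so the third central moment reduces to
    # w1*d1**3 * (1 - (w1/w2)**2).
    skew_factor = w1 * (1.0 - (w1 / w2) ** 2)
    d1 = np.cbrt(third_central / skew_factor)
    d2 = -w1 * d1 / w2
    # Variance explained by the spread of the component means.
    between = w1 * d1**2 + w2 * d2**2
    # Clamp at zero; if clamping triggers, the variance is no longer matched.
    shared_var = max(var - between, 0.0)
    return (mean + d1, mean + d2), shared_var

def mixture_moments(means, shared_var, w1=1.0 / 3.0):
    """Mean, variance, and third central moment of the fitted mixture."""
    w = np.array([w1, 1.0 - w1])
    mu = np.array(means, dtype=float)
    m = w @ mu
    d = mu - m
    return m, shared_var + w @ d**2, w @ d**3
```

A zero third moment collapses both components onto the mean, which recovers the usual single-Gaussian kernel as a special case.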
2311.00750
Report |
Are These the Same Apple? Comparing Images Based on Object Intrinsics |
Klemen Kotar, Stephen Tian, Hong-Xing Yu, Daniel L. K. Yamins, Jiajun Wu |
The human visual system can effortlessly recognize an object under different
extrinsic factors such as lighting, object poses, and background, yet current
computer vision systems often struggle with these variations. An important step
to understanding and improving artificial vision systems is to measure image
similarity purely based on intrinsic object properties that define object
identity. This problem has been studied in the computer vision literature as
re-identification, though mostly restricted to specific object categories such
as people and cars. We propose to extend it to general object categories,
exploring an image similarity metric based on object intrinsics. To benchmark
such measurements, we collect the Common paired objects Under differenT
Extrinsics (CUTE) dataset of 18,000 images of 180 objects under different
extrinsic factors such as lighting, poses, and imaging conditions. While
existing methods such as LPIPS and CLIP scores do not measure object intrinsics
well, we find that combining deep features learned from contrastive
self-supervised learning with foreground filtering is a simple yet effective
approach to approximating the similarity. We conduct an extensive survey of
pre-trained features and foreground extraction methods to arrive at a strong
baseline that best measures intrinsic object-centric image similarity among
current methods. Finally, we demonstrate that our approach can aid in
downstream applications such as acting as an analog for human subjects and
improving generalizable re-identification. Please see our project website at
https://s-tian.github.io/projects/cute/ for visualizations of the data and
demos of our metric. |
This paper introduces a new dataset and a simple but effective method for measuring the visual similarity of general objects based solely on their intrinsic properties, aiming to mimic human perception's robustness to extrinsic factors like lighting, pose, and background. |
This contribution is important because current computer vision systems struggle to generalize across varying conditions, and measuring intrinsic object similarity is crucial for improving robustness and developing AI systems that understand the visual world like humans. |
The authors collect a new dataset called CUTE, containing 18,000 images of 180 objects under controlled and in-the-wild conditions with varying extrinsics. They propose a method called Foreground Feature Averaging (FFA), which combines foreground filtering with deep features learned from contrastive self-supervised learning (DINOv2), and benchmark its performance against existing metrics. |
FFA outperforms prior image similarity methods like LPIPS, SSIM, and CLIPScore, especially in challenging in-the-wild settings.
FFA demonstrates better alignment with human perception in a qualitative study where participants preferred object orderings generated by FFA compared to LPIPS and CLIPScore.
FFA, when combined with existing vehicle re-identification models, improves their generalization ability across different datasets. |
The proposed method is currently focused and evaluated on images containing single objects, limiting its applicability to more complex scenes.
The CUTE dataset, while diverse, has limitations in terms of object occlusion, pose variation relative to the camera, and geographical diversity in capture conditions. |
object similarity, intrinsic image similarity, self-supervised learning, computer vision, object recognition |
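The Foreground Feature Averaging (FFA) idea described in the entry above can be sketched as follows. The sketch assumes that per-patch features and a boolean foreground mask are already available as arrays; in the entry these would come from a DINOv2 backbone and a foreground extractor, neither of which is reproduced here. The function names and the cosine-similarity scoring are illustrative assumptions.

```python
import numpy as np

def foreground_feature_average(patch_features, foreground_mask):
    """Average the patch features that fall on the foreground object.

    patch_features: (num_patches, dim) array of per-patch features.
    foreground_mask: (num_patches,) boolean array, True for object patches.
    """
    fg = patch_features[foreground_mask]
    if fg.shape[0] == 0:
        # Fall back to all patches when the mask is empty.
        fg = patch_features
    return fg.mean(axis=0)

def ffa_similarity(feats_a, mask_a, feats_b, mask_b):
    """Cosine similarity between two foreground-averaged feature vectors."""
    a = foreground_feature_average(feats_a, mask_a)
    b = foreground_feature_average(feats_b, mask_b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Toy example: same object vector, different background vectors.
obj = np.array([1.0, 0.0, 0.0, 0.0])
feats_a = np.stack([obj, obj, [0.0, 5.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0]])
feats_b = np.stack([obj, obj, [0.0, 0.0, 5.0, 0.0], [0.0, 0.0, 5.0, 0.0]])
mask = np.array([True, True, False, False])
all_patches = np.ones(4, dtype=bool)

masked_score = ffa_similarity(feats_a, mask, feats_b, mask)
unmasked_score = ffa_similarity(feats_a, all_patches, feats_b, all_patches)
```

In this toy case the masked score stays high because only object patches are averaged, while the unmasked score drops because the differing backgrounds dominate the mean.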
2311.00618
Report |
De-Diffusion Makes Text a Strong Cross-Modal Interface |
Chen Wei, Chenxi Liu, Siyuan Qiao, Zhishuai Zhang, Alan Yuille, Jiahui Yu |
We demonstrate text as a strong cross-modal interface. Rather than relying on
deep embeddings to connect image and language as the interface representation,
our approach represents an image as text, from which we enjoy the
interpretability and flexibility inherent to natural language. We employ an
autoencoder that uses a pre-trained text-to-image diffusion model for decoding.
The encoder is trained to transform an input image into text, which is then fed
into the fixed text-to-image diffusion decoder to reconstruct the original
input -- a process we term De-Diffusion. Experiments validate both the
precision and comprehensiveness of De-Diffusion text representing images, such
that it can be readily ingested by off-the-shelf text-to-image tools and LLMs
for diverse multi-modal tasks. For example, a single De-Diffusion model can
generalize to provide transferable prompts for different text-to-image tools,
and also achieves a new state of the art on open-ended vision-language tasks by
simply prompting large language models with few-shot examples. |
This paper introduces De-Diffusion, a novel approach using text as a cross-modal interface for images by representing them as “scrambled captions” that are both precise and comprehensive. |
This approach leverages the flexibility and interpretability of natural language and bypasses the need for deep embedding adaptation in multi-modal tasks. |
The method uses an autoencoder architecture with a pre-trained text-to-image diffusion model as the decoder and trains the encoder to convert images into text descriptions. |
De-Diffusion text enables transferable prompts for different text-to-image tools, outperforming human captions in reconstruction quality.
It allows off-the-shelf LLMs to perform open-ended visual question answering with state-of-the-art results in few-shot settings.
De-Diffusion text facilitates multi-modal dialogue with chatbots and enables novel applications like text-based image blending. |
The quality of De-Diffusion text relies on the performance of the pre-trained text-to-image model used as the decoder.
Further exploration of techniques to improve coherence and reduce redundancy in the generated text descriptions is needed. |
cross-modal interface, text representation, de-diffusion, text-to-image generation, vision-language tasks |
2311.00571
Report |
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing |
Wei-Ge Chen, Irina Spiridonova, Jianwei Yang, Jianfeng Gao, Chunyuan Li |
LLaVA-Interactive is a research prototype for multimodal human-AI
interaction. The system can have multi-turn dialogues with human users by
taking multimodal user inputs and generating multimodal responses. Importantly,
LLaVA-Interactive goes beyond language prompt, where visual prompt is enabled
to align human intents in the interaction. The development of LLaVA-Interactive
is extremely cost-efficient as the system combines three multimodal skills of
pre-built AI models without additional model training: visual chat of LLaVA,
image segmentation from SEEM, as well as image generation and editing from
GLIGEN. A diverse set of application scenarios is presented to demonstrate the
promises of LLaVA-Interactive and to inspire future research in multimodal
interactive systems. |
LLaVA-Interactive is an open-source research prototype system for multimodal human-AI interaction, enabling multi-turn dialogues with multimodal inputs and responses, including visual prompts. |
It addresses the limitations of existing LMMs like GPT-4V, which primarily focus on language-based interaction and lack visual prompting, hindering the development of open-source multimodal AI agents. |
LLaVA-Interactive leverages pre-built AI models without additional training, combining the visual chat capabilities of LLaVA, image segmentation from SEEM, and image generation/editing from GLIGEN. |
Supports flexible visual prompts like strokes, drag-and-drop, and bounding boxes for tasks involving segmentation, generation, and editing.
Demonstrates enhanced user interaction and enables novel application scenarios, such as aiding photographic artists and co-creating visual scenes.
Highlights the potential of composing pre-trained models for building general-purpose assistants without extensive training. |
Capabilities limited by the performance of individual pre-trained models.
Lack of emergent skills arising from latent task composition, as it relies on the existing abilities of individual models. |
multimodal ai, visual prompting, human-ai interaction, image segmentation, image generation |
2311.00457
Report |
Single-view 3D Scene Reconstruction with High-fidelity Shape and Texture |
Yixin Chen, Junfeng Ni, Nan Jiang, Yaowei Zhang, Yixin Zhu, Siyuan Huang |
Reconstructing detailed 3D scenes from single-view images remains a
challenging task due to limitations in existing approaches, which primarily
focus on geometric shape recovery, overlooking object appearances and fine
shape details. To address these challenges, we propose a novel framework for
simultaneous high-fidelity recovery of object shapes and textures from
single-view images. Our approach utilizes the proposed Single-view neural
implicit Shape and Radiance field (SSR) representations to leverage both
explicit 3D shape supervision and volume rendering of color, depth, and surface
normal images. To overcome shape-appearance ambiguity under partial
observations, we introduce a two-stage learning curriculum incorporating both
3D and 2D supervisions. A distinctive feature of our framework is its ability
to generate fine-grained textured meshes while seamlessly integrating rendering
capabilities into the single-view 3D reconstruction model. This integration
enables not only improved textured 3D object reconstruction by 27.7% and 11.6%
on the 3D-FRONT and Pix3D datasets, respectively, but also supports the
rendering of images from novel viewpoints. Beyond individual objects, our
approach facilitates composing object-level representations into flexible scene
representations, thereby enabling applications such as holistic scene
understanding and 3D scene editing. We conduct extensive experiments to
demonstrate the effectiveness of our method. |
A novel framework for reconstructing high-fidelity 3D shapes and textures from single-view images using neural implicit shape and radiance field representations. |
Single-view 3D reconstruction is crucial for machines to understand and interact with the 3D world, with applications in VR/AR and robotics. Existing methods often neglect object textures and struggle to capture fine shape details. |
The framework leverages both 3D shape supervision (SDF) and volume rendering of color, depth, and normal images. It utilizes a two-stage learning curriculum to overcome shape-appearance ambiguity under partial observations. Pixel-aligned and instance-aligned features are used for SDF and color prediction, respectively. |
Achieves state-of-the-art performance on 3D object reconstruction benchmarks, with significant improvement in capturing fine-grained shape details and textures.
Enables rendering of color, depth, and normal images from novel viewpoints, showcasing its capability in novel view synthesis and single-view depth/normal estimation.
Demonstrates potential for holistic scene understanding and 3D scene editing applications by composing object-level representations into flexible scene representations. |
Struggles with reconstructing objects with thin surfaces and severe occlusion.
Generalization to unseen object categories remains a challenge. Future work includes integrating unsigned distance fields and incorporating large-scale 2D/3D priors for improved generalizability. |
single-view reconstruction, 3d object reconstruction, neural implicit representation, volume rendering, scene editing |
2311.00213
Report |
Consistent Video-to-Video Transfer Using Synthetic Dataset |
Jiaxin Cheng, Tianjun Xiao, Tong He |
We introduce a novel and efficient approach for text-based video-to-video
editing that eliminates the need for resource-intensive per-video-per-model
finetuning. At the core of our approach is a synthetic paired video dataset
tailored for video-to-video transfer tasks. Inspired by Instruct Pix2Pix's
image transfer via editing instruction, we adapt this paradigm to the video
domain. Extending the Prompt-to-Prompt to videos, we efficiently generate
paired samples, each with an input video and its edited counterpart. Alongside
this, we introduce the Long Video Sampling Correction during sampling, ensuring
consistent long videos across batches. Our method surpasses current methods
like Tune-A-Video, heralding substantial progress in text-based video-to-video
editing and suggesting exciting avenues for further exploration and deployment. |
This paper introduces Instruct Video-to-Video, a novel diffusion-based model for text-based video-to-video editing that eliminates the need for per-video-per-model finetuning. |
Existing text-based video editing approaches suffer from limitations such as requiring resource-intensive per-video-per-model finetuning and demanding users to describe both the original and target video. |
The paper proposes a synthetic paired video dataset tailored for video-to-video transfer tasks, generated using a large language model and a video diffusion model adapted from the Prompt-to-Prompt method. It also introduces Long Video Sampling Correction (LVSC) to ensure consistency across extended video sequences. |
The approach eliminates the need for per-video-per-model finetuning, enabling a universal one-model-all-video transfer.
It simplifies user interaction by requiring only an intuitive editing prompt.
The proposed method outperforms existing techniques like Tune-A-Video in text-based video editing, as demonstrated through user studies and automated metrics. |
The model may struggle with videos containing objects that are difficult to detect due to size, positioning, or occlusion.
Future work includes exploring the generation of longer videos and improving the model's ability to handle complex editing scenarios. |
video editing, diffusion models, synthetic data, prompt-to-prompt, long video generation |
2311.00047
Report |
Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans? |
Yichi Zhang, Jiayi Pan, Yuchen Zhou, Rui Pan, Joyce Chai |
Vision-Language Models (VLMs) are trained on vast amounts of data captured by
humans emulating our understanding of the world. However, known as visual
illusions, human's perception of reality isn't always faithful to the physical
world. This raises a key question: do VLMs have the similar kind of illusions
as humans do, or do they faithfully learn to represent reality? To investigate
this question, we build a dataset containing five types of visual illusions and
formulate four tasks to examine visual illusions in state-of-the-art VLMs. Our
findings have shown that although the overall alignment is low, larger models
are closer to human perception and more susceptible to visual illusions. Our
dataset and initial findings will promote a better understanding of visual
illusions in humans and machines and provide a stepping stone for future
computational models that can better align humans and machines in perceiving
and communicating about the shared visual world. The code and data are
available at https://github.com/vl-illusion/dataset. |
This paper introduces Grounding Visual Illusion in Language (GVIL), the first dataset for evaluating how well machines recognize and respond to visual illusions in language. |
Understanding how well machines perceive visual illusions, a phenomenon inherent in human vision, is crucial for improving human-machine alignment in tasks involving vision and language. |
The authors created GVIL, encompassing five illusion categories and four benchmark tasks: Same-Different Question Answering, Referential Question Answering, Attribute Question Answering, and Referential Localization. Four state-of-the-art vision-language models with varying sizes were evaluated. |
Larger models demonstrate a stronger tendency towards humanlike illusion recognition and are more likely to align with human responses under illusion contexts.
While models show promising alignment in object localization under illusions, they struggle with visual question-answering tasks.
The degree of alignment between machine and human responses varies across different categories of visual illusions. |
The current dataset size is modest, limiting the generalizability of the findings.
Further research is needed to understand the discrepancy in model performance across different tasks. |
visual illusion, vision-language models, human-machine alignment, dataset, benchmark |
2310.20700
Report |
SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction |
Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, Ziwei Liu |
Recently video generation has achieved substantial progress with realistic
results. Nevertheless, existing AI-generated videos are usually very short
clips ("shot-level") depicting a single scene. To deliver a coherent long video
("story-level"), it is desirable to have creative transition and prediction
effects across different clips. This paper presents a short-to-long video
diffusion model, SEINE, that focuses on generative transition and prediction.
The goal is to generate high-quality long videos with smooth and creative
transitions between scenes and varying lengths of shot-level videos.
Specifically, we propose a random-mask video diffusion model to automatically
generate transitions based on textual descriptions. By providing the images of
different scenes as inputs, combined with text-based control, our model
generates transition videos that ensure coherence and visual quality.
Furthermore, the model can be readily extended to various tasks such as
image-to-video animation and autoregressive video prediction. To conduct a
comprehensive evaluation of this new generative task, we propose three
assessing criteria for smooth and creative transition: temporal consistency,
semantic similarity, and video-text semantic alignment. Extensive experiments
validate the effectiveness of our approach over existing methods for generative
transition and prediction, enabling the creation of story-level long videos.
Project page: https://vchitect.github.io/SEINE-project/ . |
This paper introduces SEINE, a short-to-long (S2L) video diffusion model that focuses on generating high-quality long videos with smooth and creative transitions between scenes and varying lengths of shot-level videos using a random-mask approach conditioned on text descriptions. |
Existing AI-generated videos are typically short clips depicting a single scene, while longer, story-level videos require creative transitions and prediction effects across different clips for coherent storytelling. |
SEINE employs a random-mask diffusion model that leverages text descriptions and visible conditional frames to generate unseen transition and prediction frames, enabling smooth and creative scene connections and video extension. |
SEINE outperforms comparison methods in generating transitions, as measured by temporal consistency, semantic similarity, and video-text alignment.
SEINE demonstrates diverse and controllable transition generation, enabling variations in transition style and camera movement control.
SEINE shows promising results in long video generation through auto-regressive prediction and image-to-video animation, expanding its applicability. |
Limitations include the need for similarity between source and target scenes for smooth transitions and potential text-video misalignment.
Future work focuses on improving text-image alignment, addressing limitations related to scene similarity, and mitigating watermark generation. |
video generation, diffusion models, generative transition, video prediction, text-to-video |
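The random-mask conditioning described in the SEINE entry above can be sketched roughly as below. The entry only states that visible frames and text condition the generation of unseen frames; the specific mask patterns per task, the `keep_prob` parameter, and the choice to append the mask as an extra channel are assumptions made for illustration.

```python
import numpy as np

def frame_mask(num_frames, task, rng=None, keep_prob=0.3):
    """Boolean mask over frames; True means the frame is given as a condition."""
    mask = np.zeros(num_frames, dtype=bool)
    if task == "transition":
        # First and last frames come from the two scenes to be connected.
        mask[0] = mask[-1] = True
    elif task == "prediction":
        # Only the first frame is visible; later frames are generated.
        mask[0] = True
    elif task == "random":
        # Training-style masking: each frame is visible with prob. keep_prob.
        rng = np.random.default_rng() if rng is None else rng
        mask = rng.random(num_frames) < keep_prob
    else:
        raise ValueError(f"unknown task: {task}")
    return mask

def build_condition(video, mask):
    """Zero out unseen frames and append the mask as an extra channel.

    video: (T, H, W, C) array. Returns an array of shape (T, H, W, C + 1).
    """
    visible = mask[:, None, None, None]
    masked = video * visible
    mask_channel = np.broadcast_to(visible, video.shape[:3] + (1,))
    return np.concatenate([masked, mask_channel.astype(video.dtype)], axis=-1)
```

Under this layout, a single model can in principle serve transition, prediction, and image-to-video tasks by changing only which frames are marked visible.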
2310.20649
Report |
Dynamic Batch Norm Statistics Update for Natural Robustness |
Shahbaz Rezaei, Mohammad Sadegh Norouzzadeh |
DNNs trained on natural clean samples have been shown to perform poorly on
corrupted samples, such as noisy or blurry images. Various data augmentation
methods have been recently proposed to improve DNN's robustness against common
corruptions. Despite their success, they require computationally expensive
training and cannot be applied to off-the-shelf trained models. Recently, it
has been shown that updating BatchNorm (BN) statistics of an off-the-shelf
model on a single corruption improves its accuracy on that corruption
significantly. However, adopting the idea at inference time when the type of
corruption is unknown and changing decreases the effectiveness of this method.
In this paper, we harness the Fourier domain to detect the corruption type, a
challenging task in the image domain. We propose a unified framework consisting
of a corruption-detection model and BN statistics update that improves the
corruption accuracy of any off-the-shelf trained model. We benchmark our
framework on different models and datasets. Our results demonstrate about 8%
and 4% accuracy improvement on CIFAR10-C and ImageNet-C, respectively.
Furthermore, our framework can further improve the accuracy of state-of-the-art
robust models, such as AugMix and DeepAug. |
This paper presents a framework to improve the robustness of pre-trained vision models against corrupted images, by dynamically updating Batch Normalization (BN) statistics based on the detected corruption type. |
DNNs are known to be vulnerable to image corruptions. Existing data augmentation methods that address this are computationally expensive and cannot be applied to already trained models. This work offers a computationally lightweight alternative for improving the robustness of off-the-shelf models. |
The framework utilizes a corruption type detection model, trained on the Fourier spectrum of images, to identify the corruption present in an input image. Based on the detected corruption, the BN statistics of the pre-trained model are updated with pre-computed values specific to that corruption type, fetched from a lookup table. |
Achieves around 8% and 4% accuracy improvement on CIFAR10-C and ImageNet-C, respectively, compared to the base model.
Outperforms the inference-time adaptation of previous BN update methods when the corruption type dynamically changes.
Can be applied to existing state-of-the-art robust models, like AugMix and DeepAug, and further enhance their performance. |
Requires data samples from all corruption types during training to construct the corruption detection model and the BN statistics lookup table.
Performance improvement is limited by the effectiveness of the BN statistics update method for the specific corruption type. |
robustness, image corruption, batch normalization, fourier domain, domain adaptation |
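The detect-then-swap pipeline in the entry above can be sketched in a few lines. This is a toy illustration under stated assumptions: the nearest-centroid detector stands in for the trained corruption-detection model, the lookup table values are made up, and a single normalization call stands in for updating the BN layers of a real network.

```python
import numpy as np

def fourier_spectrum(image):
    """Log-amplitude spectrum of a 2-D grayscale image, zero frequency centered."""
    return np.log1p(np.abs(np.fft.fftshift(np.fft.fft2(image))))

def detect_corruption(image, centroids):
    """Hypothetical stand-in detector: nearest class centroid in spectrum space."""
    spec = fourier_spectrum(image)
    return min(centroids, key=lambda name: np.linalg.norm(spec - centroids[name]))

def batch_norm(x, mean, var, eps=1e-5):
    """Normalize activations with externally supplied BN statistics."""
    return (x - mean) / np.sqrt(var + eps)

def robust_forward(image, activations, centroids, bn_table):
    """Detect the corruption, fetch its BN statistics, and normalize."""
    corruption = detect_corruption(image, centroids)
    mean, var = bn_table[corruption]
    return corruption, batch_norm(activations, mean, var)

# Toy setup: every number below is made up for illustration.
rng = np.random.default_rng(0)
clean = np.outer(np.linspace(0.0, 1.0, 32), np.linspace(0.0, 1.0, 32))

def add_noise(img):
    """Additive Gaussian noise as an example corruption."""
    return img + rng.normal(0.0, 0.5, img.shape)

centroids = {
    "clean": fourier_spectrum(clean),
    "gaussian_noise": np.mean(
        [fourier_spectrum(add_noise(clean)) for _ in range(8)], axis=0
    ),
}
# Lookup table: corruption type -> precomputed (mean, variance) for one BN layer.
bn_table = {"clean": (0.0, 1.0), "gaussian_noise": (0.1, 1.5)}
```

Because the statistics are precomputed per corruption type, inference only adds the cost of the detector and a table lookup, which is consistent with the lightweight framing in the entry.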
2310.20550
Report |
CapsFusion: Rethinking Image-Text Data at Scale |
Qiying Yu, Quan Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Yue Cao, Xinlong Wang, Jingjing Liu |
Large multimodal models demonstrate remarkable generalist ability to perform
diverse multimodal tasks in a zero-shot manner. Large-scale web-based
image-text pairs contribute fundamentally to this success, but suffer from
excessive noise. Recent studies use alternative captions synthesized by
captioning models and have achieved notable benchmark performance. However, our
experiments reveal significant Scalability Deficiency and World Knowledge Loss
issues in models trained with synthetic captions, which have been largely
obscured by their initial benchmark success. Upon closer examination, we
identify the root cause as the overly-simplified language structure and lack of
knowledge details in existing synthetic captions. To provide higher-quality and
more scalable multimodal pretraining data, we propose CapsFusion, an advanced
framework that leverages large language models to consolidate and refine
information from both web-based image-text pairs and synthetic captions.
Extensive experiments show that CapsFusion captions exhibit remarkable
all-round superiority over existing captions in terms of model performance
(e.g., 18.8 and 18.3 improvements in CIDEr score on COCO and NoCaps), sample
efficiency (requiring 11-16 times less computation than baselines), world
knowledge depth, and scalability. These effectiveness, efficiency and
scalability advantages position CapsFusion as a promising candidate for future
scaling of LMM training. |
This paper introduces CapsFusion, a novel framework that leverages large language models (LLMs) to refine large-scale image-text data for improved training of large multimodal models (LMMs). |
Existing methods for generating image-text training data, such as using web-based pairs or synthetic captions, suffer from either excessive noise or a lack of real-world knowledge and scalability. |
CapsFusion uses a captioning model to generate synthetic captions and then employs ChatGPT to fuse them with web-based captions, extracting real-world knowledge while maintaining structure. To ensure scalability, a fine-tuned LLaMA model is used for large-scale caption fusion. |
CapsFusion captions significantly outperform raw, synthetic, and mixed captions in LMM training, achieving substantial improvements in CIDEr scores on multiple benchmarks.
CapsFusion demonstrates superior sample efficiency, requiring 11-16 times less computation to reach similar performance levels as baseline captions.
LMMs trained on CapsFusion captions exhibit richer world knowledge compared to those trained on synthetic captions, as evidenced by their ability to identify celebrities, artworks, and locations. |
The caption fusion process relies on heuristics and could benefit from further exploration of automatic quality control mechanisms.
Future work can explore the generalization of CapsFusion to other modalities beyond image-text pairs. |
large multimodal models, image captioning, data augmentation, large language models, world knowledge |
2310.19909
Report |
Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks |
Micah Goldblum, Hossein Souri, Renkun Ni, Manli Shu, Viraj Prabhu, Gowthami Somepalli, Prithvijit Chattopadhyay, Mark Ibrahim, Adrien Bardes, Judy Hoffman, Rama Chellappa, Andrew Gordon Wilson, Tom Goldstein |
Neural network based computer vision systems are typically built on a
backbone, a pretrained or randomly initialized feature extractor. Several years
ago, the default option was an ImageNet-trained convolutional neural network.
However, the recent past has seen the emergence of countless backbones
pretrained using various algorithms and datasets. While this abundance of
choice has led to performance increases for a range of systems, it is difficult
for practitioners to make informed decisions about which backbone to choose.
Battle of the Backbones (BoB) makes this choice easier by benchmarking a
diverse suite of pretrained models, including vision-language models, those
trained via self-supervised learning, and the Stable Diffusion backbone, across
a diverse set of computer vision tasks ranging from classification to object
detection to OOD generalization and more. Furthermore, BoB sheds light on
promising directions for the research community to advance computer vision by
illuminating strengths and weaknesses of existing approaches through a
comprehensive analysis conducted on more than 1500 training runs. While vision
transformers (ViTs) and self-supervised learning (SSL) are increasingly
popular, we find that convolutional neural networks pretrained in a supervised
fashion on large training sets still perform best on most tasks among the
models we consider. Moreover, in apples-to-apples comparisons on the same
architectures and similarly sized pretraining datasets, we find that SSL
backbones are highly competitive, indicating that future works should perform
SSL pretraining with advanced architectures and larger pretraining datasets. We
release the raw results of our experiments along with code that allows
researchers to put their own backbones through the gauntlet here:
https://github.com/hsouri/Battle-of-the-Backbones |
This paper presents "Battle of the Backbones" (BoB), a benchmark comparing diverse pretrained computer vision backbones across a wide range of tasks including classification, object detection, out-of-distribution generalization, and image retrieval. |
The abundance of pretrained backbone models makes it difficult for practitioners to choose the best option. BoB aims to guide practitioners and researchers by providing a comprehensive evaluation of backbones and identifying strengths and weaknesses of existing approaches. |
The authors benchmark publicly available pretrained models with different architectures (CNNs, ViTs, Swin Transformers, Stable Diffusion encoder), pretraining algorithms (supervised, self-supervised, vision-language), and pretraining datasets (ImageNet, LAION, CLIP dataset, depth datasets). They evaluate these backbones on a diverse set of tasks using various learning protocols (fine-tuning, linear probing, frozen backbone) and report performance using standard metrics for each task. |
Supervised ConvNeXt-Base, SwinV2-Base (trained on ImageNet-21k), and CLIP ViT-Base consistently rank among the top performers across various tasks and settings.
Supervised pretraining generally yields superior results, largely because the supervised backbones were trained on larger datasets. However, self-supervised or vision-language pretrained models perform better when comparing backbones trained on similar-sized datasets.
Performance across tasks is highly correlated, suggesting the possibility of developing universal backbones suitable for various computer vision tasks. |
The insights are limited by the specific tasks, backbones, and settings considered in the benchmark.
Larger backbone models (beyond ConvNeXt-Base) were not included, potentially affecting the ranking, especially for transformers which benefit more from scale. |
backbone, benchmark, computer vision, self-supervised learning, vision-language models |
2310.19776
Report |
Learn to Categorize or Categorize to Learn? Self-Coding for Generalized Category Discovery |
Sarah Rastegar, Hazel Doughty, Cees G. M. Snoek |
In the quest for unveiling novel categories at test time, we confront the
inherent limitations of traditional supervised recognition models that are
restricted by a predefined category set. While strides have been made in the
realms of self-supervised and open-world learning towards test-time category
discovery, a crucial yet often overlooked question persists: what exactly
delineates a category? In this paper, we conceptualize a category through the
lens of optimization, viewing it as an optimal solution to a well-defined
problem. Harnessing this unique conceptualization, we propose a novel,
efficient and self-supervised method capable of discovering previously unknown
categories at test time. A salient feature of our approach is the assignment of
minimum length category codes to individual data instances, which encapsulates
the implicit category hierarchy prevalent in real-world datasets. This
mechanism affords us enhanced control over category granularity, thereby
equipping our model to handle fine-grained categories adeptly. Experimental
evaluations, bolstered by state-of-the-art benchmark comparisons, testify to
the efficacy of our solution in managing unknown categories at test time.
Furthermore, we fortify our proposition with a theoretical foundation,
providing proof of its optimality. Our code is available at
https://github.com/SarahRastegar/InfoSieve. |
This paper proposes InfoSieve, a novel self-supervised method for discovering unknown categories at test time by conceptualizing 'category' as an optimization problem solution. |
Traditional supervised models struggle with open-world recognition due to the lack of a clear 'category' definition. This work addresses label inconsistencies, incorporates category hierarchies, and tackles open-world recognition by learning category codes instead of relying on predefined labels. |
The proposed method uses algorithmic and Shannon information theory to define an optimization problem for finding optimal category codes. It leverages contrastive learning to maximize mutual information between input data and binary category codes, while minimizing code length to reduce search space. A masking mechanism allows for flexible handling of category granularity. |
InfoSieve outperforms state-of-the-art methods in generalized category discovery on fine-grained datasets.
The method demonstrates robustness to different category granularities and long-tailed data distributions.
Qualitative analysis reveals the model's ability to learn an implicit category hierarchy from the data. |
The approach assumes an implicit hierarchical tree underlying categorization, which may not always hold true.
The current implementation requires unlabeled data from unknown categories during training. |
generalized category discovery, novel class discovery, self-supervised learning, information theory, contrastive learning |
2310.19731
Report |
ViR: Towards Efficient Vision Retention Backbones |
Ali Hatamizadeh, Michael Ranzinger, Shiyi Lan, Jose M. Alvarez, Sanja Fidler, Jan Kautz |
Vision Transformers (ViTs) have gained considerable popularity in recent
years, due to their exceptional capabilities in modeling long-range spatial
dependencies and scalability for large scale training. Although the training
parallelism of the self-attention mechanism plays an important role in retaining
great performance, its quadratic complexity hinders the application of ViTs in
many scenarios which demand fast inference. This effect is even more pronounced
in applications in which autoregressive modeling of input features is required.
In Natural Language Processing (NLP), a new stream of efforts has proposed
parallelizable models with recurrent formulation that allows for efficient
inference in generative applications. Inspired by this trend, we propose a new
class of computer vision models, dubbed Vision Retention Networks (ViR), with
dual parallel and recurrent formulations, which strike an optimal balance
between fast inference and parallel training with competitive performance. In
particular, ViR scales favorably for image throughput and memory consumption in
tasks that require higher-resolution images due to its flexible formulation in
processing large sequence lengths. The ViR is the first attempt to realize dual
parallel and recurrent equivalency in a general vision backbone for recognition
tasks. We have validated the effectiveness of ViR through extensive experiments
with different dataset sizes and various image resolutions and achieved
competitive performance. Code: https://github.com/NVlabs/ViR |
The paper introduces Vision Retention Networks (ViR), a novel computer vision model architecture that leverages both parallel and recurrent formulations, enabling efficient inference for tasks requiring high-resolution images. |
ViTs excel in capturing long-range dependencies but suffer from quadratic complexity, making them slow for high-resolution image processing. ViR addresses this limitation by introducing a recurrent formulation that enables fast inference without compromising accuracy. |
ViR utilizes a retention mechanism with dual parallel and recurrent representations. The recurrent mode processes tokens sequentially, reducing complexity for long sequences. A hybrid chunkwise mode combines parallel and recurrent processing for optimal performance. |
ViR achieves competitive performance on ImageNet-1K classification benchmarks, outperforming other ViT-based models in terms of accuracy and throughput.
ViR with 2D retention demonstrates superior performance for downstream tasks like object detection and semantic segmentation on MS COCO and ADE20K datasets.
ViR exhibits favorable scaling characteristics for high-resolution images, achieving higher throughput and utilizing memory more efficiently than ViTs, especially for larger batch sizes. |
Exploration of relative position embeddings in two dimensions for potential performance improvement.
Extension of ViR to other vision tasks beyond recognition, leveraging its efficiency for high-resolution image processing. |
vision transformer, recurrent neural network, efficient inference, high-resolution images, computer vision |
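The dual parallel and recurrent formulations described in the ViR record above can be illustrated with a minimal NumPy sketch. It assumes the generic single-head, 1D causal retention with a scalar decay `gamma`, as used in retention-style language models. It is not ViR's actual implementation: the 2D retention, chunkwise mode, and any normalization are omitted, and all function names are hypothetical.

```python
import numpy as np

def retention_parallel(q, k, v, gamma):
    """Parallel form: (Q K^T * D) V with a causal decay mask D (assumed form)."""
    n = q.shape[0]
    idx = np.arange(n)
    diff = idx[:, None] - idx[None, :]
    # D[i, j] = gamma^(i - j) for j <= i, and 0 for future positions.
    decay = np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)
    return ((q @ k.T) * decay) @ v

def retention_recurrent(q, k, v, gamma):
    """Recurrent form: one token at a time, carrying a fixed-size state."""
    state = np.zeros((k.shape[1], v.shape[1]))
    out = np.empty((q.shape[0], v.shape[1]))
    for t in range(q.shape[0]):
        # S_t = gamma * S_{t-1} + k_t^T v_t, then o_t = q_t S_t.
        state = gamma * state + np.outer(k[t], v[t])
        out[t] = q[t] @ state
    return out

# Toy inputs: 6 tokens with 4-dimensional queries, keys, and values.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((6, 4)) for _ in range(3))
same = np.allclose(retention_parallel(q, k, v, 0.9),
                   retention_recurrent(q, k, v, 0.9))
```

The parallel form costs quadratic time in the number of tokens but suits training, while the recurrent form keeps only a `d x d` state per step, which is what makes inference on long token sequences cheaper.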
2310.19540
Report |
IterInv: Iterative Inversion for Pixel-Level T2I Models |
Chuanming Tang, Kai Wang, Joost van de Weijer |
Large-scale text-to-image diffusion models have been a ground-breaking
development in generating convincing images following an input text prompt. The
goal of image editing research is to give users control over the generated
images by modifying the text prompt. Current image editing techniques
predominantly hinge on DDIM inversion as a prevalent practice rooted in Latent
Diffusion Models (LDM). However, the large pretrained T2I models working on the
latent space suffer from losing details due to the first compression stage with
an autoencoder mechanism. In contrast, other mainstream T2I pipelines working on the
pixel level, such as Imagen and DeepFloyd-IF, circumvent this problem.
They are commonly composed of multiple stages, typically starting with a
text-to-image stage and followed by several super-resolution stages. In this
pipeline, the DDIM inversion fails to find the initial noise and generate the
original image given that the super-resolution diffusion models are not
compatible with the DDIM technique. According to our experimental findings,
iteratively concatenating the noisy image as the condition is the root of this
problem. Based on this observation, we develop an iterative inversion (IterInv)
technique for this category of T2I models and verify IterInv with the
open-source DeepFloyd-IF model. Specifically, IterInv employs NTI for the
inversion and reconstruction of the low-resolution image generation stage. In stages 2
and 3, we update the latent variance at each timestep to find the deterministic
inversion trace and promote the reconstruction process. By combining our method
with a popular image editing method, we demonstrate the application prospects of
IterInv. The code is available at
https://github.com/Tchuanm/IterInv.git. |
This paper introduces IterInv, a novel iterative inversion technique for pixel-level Text-to-Image (T2I) diffusion models like DeepFloyd-IF, addressing the limitations of DDIM inversion in such models. |
Existing text-guided image editing methods rely on latent diffusion models (LDMs) that often lead to detail loss. Pixel-level T2I models offer a solution but lack effective inversion techniques for accurate real image reconstruction, hindering editing capabilities. |
IterInv leverages Null-Text Inversion (NTI) with classifier-free guidance and iteratively optimizes latent variance at each timestep to find a deterministic inversion trace, enabling accurate image reconstruction in the pixel space. |
IterInv demonstrates superior reconstruction quality compared to DDIM inversion across various stages of DeepFloyd-IF, achieving results comparable to SDXL's autoencoder.
The method exhibits robustness to variations in the classifier-free guidance scale, ensuring consistent performance.
Combining IterInv with DiffEdit enables effective text-guided image editing on DeepFloyd-IF, opening possibilities for advanced editing techniques in pixel-level diffusion models. |
The current study focuses solely on the DeepFloyd model, limiting the generalizability of IterInv.
The compatibility of IterInv with other image editing methods beyond DiffEdit remains unexplored. |
image inversion, image reconstruction, image editing, text-to-image, pixel diffusion |
2310.19512
Report |
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation |
Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, Ying Shan |
Video generation has increasingly gained interest in both academia and
industry. Although commercial tools can generate plausible videos, there is a
limited number of open-source models available for researchers and engineers.
In this work, we introduce two diffusion models for high-quality video
generation, namely text-to-video (T2V) and image-to-video (I2V) models. T2V
models synthesize a video based on a given text input, while I2V models
incorporate an additional image input. Our proposed T2V model can generate
realistic and cinematic-quality videos with a resolution of $1024 \times 576$,
outperforming other open-source T2V models in terms of quality. The I2V model
is designed to produce videos that strictly adhere to the content of the
provided reference image, preserving its content, structure, and style. This
model is the first open-source I2V foundation model capable of transforming a
given image into a video clip while maintaining content preservation
constraints. We believe that these open-source video generation models will
contribute significantly to the technological advancements within the
community. |
This paper introduces two open-source diffusion models for video generation: a text-to-video (T2V) model and an image-to-video (I2V) model. |
Existing open-source video generation models have limitations in quality, resolution, and content preservation, while commercial models are not accessible for research. |
The T2V model extends Stable Diffusion with temporal attention and joint image-video training. The I2V model incorporates a CLIP-based image embedding branch into the T2V architecture. |
The T2V model generates high-quality videos (1024x576 resolution) with cinematic quality, outperforming other open-source models.
The I2V model is the first open-source model capable of strictly preserving content and structure of the input image while animating it.
Both models demonstrate superior performance compared to existing open-source alternatives and achieve comparable results to some commercial models. |
The current models are limited to 2-second video generation.
Further improvements in motion quality, resolution, and success rate are needed. |
video generation, diffusion models, text-to-video, image-to-video, open-source |
2310.19464
Report |
Generative Neural Fields by Mixtures of Neural Implicit Functions |
Tackgeun You, Mijeong Kim, Jungtaek Kim, Bohyung Han |
We propose a novel approach to learning the generative neural fields
represented by linear combinations of implicit basis networks. Our algorithm
learns basis networks in the form of implicit neural representations and their
coefficients in a latent space by either conducting meta-learning or adopting
auto-decoding paradigms. The proposed method easily enlarges the capacity of
generative neural fields by increasing the number of basis networks while
maintaining the size of a network for inference to be small through their
weighted model averaging. Consequently, sampling instances using the model is
efficient in terms of latency and memory footprint. Moreover, we customize
denoising diffusion probabilistic model for a target task to sample latent
mixture coefficients, which allows our final model to generate unseen data
effectively. Experiments show that our approach achieves competitive generation
performance on diverse benchmarks for images, voxel data, and NeRF scenes
without sophisticated designs for specific modalities and domains. |
This paper proposes mNIF, a novel method for learning generative neural fields using linear combinations of basis networks, each of which is an implicit neural representation (INR). |
mNIF offers a more efficient and scalable approach to represent complex data distributions in various domains (images, voxels, NeRF scenes) compared to existing generative neural field methods. |
The method learns a set of basis INRs and a latent space of mixture coefficients. Two training stages are employed: (1) context adaptation via meta-learning or auto-decoding to optimize basis networks and mixture coefficients for reconstruction, and (2) task-specific generalization using a denoising diffusion probabilistic model for sampling unseen data. |
mNIF achieves competitive or state-of-the-art generation quality on image, voxel, and NeRF scene benchmarks.
The method exhibits significantly better inference efficiency (smaller model size and faster speed) than existing methods.
Analysis reveals that the learned latent space captures a smooth data manifold and benefits from increasing the number of mixture components and the latent dimensionality. |
Limited scalability beyond fine-grained datasets is observed, potentially due to the limitations of the SIREN architecture used.
Future work will focus on incorporating local information and exploring alternative architectures to enhance performance on diverse datasets. |
generative neural fields, implicit neural representations, mixture of experts, denoising diffusion probabilistic models, meta-learning |
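The weighted model averaging in the mNIF record above can be sketched as follows. This is a toy illustration under stated assumptions: one mixture-coefficient vector is shared across all layers, biases are omitted, sine activations stand in for the SIREN architecture mentioned in the limitations, and every name (`mix_weights`, `siren_forward`, the layer sizes) is hypothetical rather than taken from the authors' code.

```python
import numpy as np

def mix_weights(basis_weights, coeffs):
    """Average M basis networks into one network of the same size.

    basis_weights: list over layers, each array shaped (M, out_dim, in_dim).
    coeffs: mixture coefficients shaped (M,), assumed shared across layers.
    """
    return [np.tensordot(coeffs, w, axes=1) for w in basis_weights]

def siren_forward(weights, x, w0=30.0):
    """Evaluate a bias-free MLP with sine activations on coordinates x."""
    h = x
    for w in weights[:-1]:
        h = np.sin(w0 * (h @ w.T))
    return h @ weights[-1].T

rng = np.random.default_rng(0)
num_basis = 3
# Toy field: 2D coordinates in, RGB out, two hidden layers of width 16.
basis = [0.1 * rng.standard_normal((num_basis, 16, 2)),
         0.1 * rng.standard_normal((num_basis, 16, 16)),
         0.1 * rng.standard_normal((num_basis, 3, 16))]
coeffs = np.array([0.2, 0.5, 0.3])   # would come from the latent space
mixed = mix_weights(basis, coeffs)
coords = rng.uniform(-1.0, 1.0, size=(5, 2))
out = siren_forward(mixed, coords)   # shape (5, 3)
```

Adding basis networks grows the capacity of the family, but the mixed network evaluated at inference time keeps the size of a single basis network, which is the efficiency argument made in the abstract.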
2310.19415
Report |
Text-to-3D with Classifier Score Distillation |
Xin Yu, Yuan-Chen Guo, Yangguang Li, Ding Liang, Song-Hai Zhang, Xiaojuan Qi |
Text-to-3D generation has made remarkable progress recently, particularly
with methods based on Score Distillation Sampling (SDS) that leverages
pre-trained 2D diffusion models. While the usage of classifier-free guidance is
well acknowledged to be crucial for successful optimization, it is considered
an auxiliary trick rather than the most essential component. In this paper, we
re-evaluate the role of classifier-free guidance in score distillation and
discover a surprising finding: the guidance alone is enough for effective
text-to-3D generation tasks. We name this method Classifier Score Distillation
(CSD), which can be interpreted as using an implicit classification model for
generation. This new perspective reveals new insights for understanding
existing techniques. We validate the effectiveness of CSD across a variety of
text-to-3D tasks including shape generation, texture synthesis, and shape
editing, achieving results superior to those of state-of-the-art methods. Our
project page is https://xinyu-andy.github.io/Classifier-Score-Distillation |
This paper introduces Classifier Score Distillation (CSD), a novel method for text-to-3D generation that utilizes the classifier component of pre-trained 2D diffusion models, challenging the prevailing assumption that generative priors are essential for this task. |
This research reshapes the understanding of text-to-3D generation by demonstrating the critical role of implicit classifiers within diffusion models, potentially leading to more efficient and effective generation techniques. |
The authors analyze the role of classifier-free guidance in Score Distillation Sampling (SDS), revealing that the classifier score, rather than the generative prior, is the driving force behind successful optimization. They propose CSD, which leverages only the classifier score for 3D scene refinement. |
CSD achieves state-of-the-art results in text-to-3D generation, surpassing existing SDS-based methods in visual quality and text alignment.
The study reveals that negative prompts act as dual-objective classifier scores, and introduces an annealed negative classifier score optimization strategy for improved quality and fidelity.
CSD proves effective for text-guided 3D editing, allowing modifications to existing scenes while preserving desired attributes. |
While empirical results highlight CSD's superiority, a formal distribution-based objective function for this optimization process is yet to be defined.
Applying CSD to 2D image optimization results in artifacts, suggesting potential limitations or the need for further research to bridge the gap between 2D and 3D applications. |
text-to-3d generation, score distillation, classifier-free guidance, diffusion models, 3d scene editing |
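The decomposition behind the CSD record above can be written as a short sketch. It assumes the guidance convention in which the guided noise estimate is `eps_cond + w * (eps_cond - eps_uncond)`. The per-timestep weighting and the renderer Jacobian are omitted, random arrays stand in for a diffusion model's noise predictions, and the function names are hypothetical.

```python
import numpy as np

def sds_direction(eps_cond, eps_uncond, noise, w):
    """SDS update direction split into its two terms (assumed convention)."""
    generative_prior = eps_cond - noise        # conditional prediction vs. injected noise
    classifier_score = eps_cond - eps_uncond   # implicit classifier direction
    return generative_prior + w * classifier_score

def csd_direction(eps_cond, eps_uncond):
    """CSD keeps only the classifier-score term."""
    return eps_cond - eps_uncond

# Stand-ins for the noise predictions of a diffusion model.
rng = np.random.default_rng(0)
eps_cond, eps_uncond, noise = (rng.standard_normal((4, 4)) for _ in range(3))
w = 100.0
sds = sds_direction(eps_cond, eps_uncond, noise, w)
csd = csd_direction(eps_cond, eps_uncond)
# What SDS adds on top of the scaled CSD direction is the generative-prior term.
residual = sds - w * csd
```

With a large guidance weight, the classifier-score term dominates the SDS direction, which is consistent with the paper's claim that the guidance alone can drive optimization.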
2310.19248
Report |
IMPRESS: Evaluating the Resilience of Imperceptible Perturbations Against Unauthorized Data Usage in Diffusion-Based Generative AI |
Bochuan Cao, Changjiang Li, Ting Wang, Jinyuan Jia, Bo Li, Jinghui Chen |
Diffusion-based image generation models, such as Stable Diffusion or DALL-E
2, are able to learn from given images and generate high-quality samples
following the guidance from prompts. For instance, they can be used to create
artistic images that mimic the style of an artist based on his/her original
artworks or to maliciously edit the original images for fake content. However,
such ability also brings serious ethical issues without proper authorization
from the owner of the original images. In response, several attempts have been
made to protect the original images from such unauthorized data usage by adding
imperceptible perturbations, which are designed to mislead the diffusion model
and make it unable to properly generate new samples. In this work, we introduce
a perturbation purification platform, named IMPRESS, to evaluate the
effectiveness of imperceptible perturbations as a protective measure. IMPRESS
is based on the key observation that imperceptible perturbations could lead to
a perceptible inconsistency between the original image and the
diffusion-reconstructed image, which can be used to devise a new optimization
strategy for purifying the image, which may weaken the protection of the
original image from unauthorized data usage (e.g., style mimicking, malicious
editing). The proposed IMPRESS platform offers a comprehensive evaluation of
several contemporary protection methods, and can be used as an evaluation
platform for future protection methods. |
This paper introduces IMPRESS, a platform for evaluating the effectiveness of imperceptible perturbations in protecting images from unauthorized use in diffusion models by purifying perturbed images with consistency-based losses. |
This evaluation is crucial to understand the robustness of existing protection methods like GLAZE and PhotoGuard against adaptive attacks and guide the development of future protection mechanisms. |
IMPRESS leverages the inconsistency between a perturbed image and its diffusion-reconstructed version. It formulates an optimization problem with two losses: a similarity loss ensuring the purified image is close to the perturbed one and a consistency loss ensuring the purified image can be reconstructed by the diffusion model. |
IMPRESS successfully weakens the protection of GLAZE on style mimicking, increasing the accuracy of generated images mimicking protected styles to near clean-image levels (87% vs. 90.8% for CLIP classifier).
IMPRESS also diminishes the effectiveness of PhotoGuard on malicious editing, leading to edited images closer to edited clean images according to PSNR and VIF-p metrics.
Adaptive protection methods incorporating consistency-based losses are explored but show limited improvement, suggesting the complexity of balancing multiple objectives and potential inherent conflicts. |
The reliance on specific similarity metrics (e.g., LPIPS) and the effectiveness of simple post-processing techniques on malicious editing highlight potential vulnerabilities.
Designing robust adaptive protection methods for IMPRESS remains challenging due to complex loss optimization and potential conflicts between protection and purification goals. |
image protection, diffusion models, adversarial attacks, image editing, style mimicking |
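The two-loss purification objective in the IMPRESS record above can be illustrated with a toy example. Everything here is a stand-in chosen for illustration: an orthogonal projection replaces the diffusion model's reconstruction, a squared L2 distance replaces the perceptual similarity metric, and plain gradient descent with hypothetical hyperparameters (`alpha`, `lr`, `steps`) replaces the optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, rank = 32, 8
basis, _ = np.linalg.qr(rng.standard_normal((dim, rank)))
proj = basis @ basis.T   # toy "reconstruction": project onto a subspace

def consistency_loss(x):
    """How far x is from its reconstruction (stand-in for diffusion reconstruction)."""
    return float(np.sum((x - proj @ x) ** 2))

def similarity_loss(x, x_ptb):
    """How far the purified x drifts from the perturbed input (L2 stand-in)."""
    return float(np.sum((x - x_ptb) ** 2))

def purify(x_ptb, alpha=0.1, lr=0.1, steps=200):
    """Minimize consistency_loss(x) + alpha * similarity_loss(x, x_ptb)."""
    x = x_ptb.copy()
    for _ in range(steps):
        # Analytic gradient of both terms; valid because proj is an orthogonal projector.
        grad = 2.0 * (x - proj @ x) + 2.0 * alpha * (x - x_ptb)
        x -= lr * grad
    return x

clean = proj @ rng.standard_normal(dim)            # reconstructs perfectly
x_ptb = clean + 0.05 * rng.standard_normal(dim)    # "protected" image
x_pur = purify(x_ptb)
before, after = consistency_loss(x_ptb), consistency_loss(x_pur)
```

In this toy setting the perturbation component that the reconstruction cannot reproduce is shrunk, while the similarity term keeps the result near the input. How the two terms are weighted is the kind of trade-off the record's limitations point to.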
2310.18949
Report |
Customize StyleGAN with One Hand Sketch |
Shaocong Zhang |
Generating images from human sketches typically requires dedicated networks
trained from scratch. In contrast, the emergence of the pre-trained
Vision-Language models (e.g., CLIP) has propelled generative applications based
on controlling the output imagery of existing StyleGAN models with text inputs
or reference images. In parallel, our work proposes a framework to control
StyleGAN imagery with a single user sketch. In particular, we learn a
conditional distribution in the latent space of a pre-trained StyleGAN model
via energy-based learning and propose two novel energy functions leveraging
CLIP for cross-domain semantic supervision. Once trained, our model can
generate multi-modal images semantically aligned with the input sketch.
Quantitative evaluations on synthesized datasets have shown that our approach
improves significantly over previous methods in the one-shot regime. The
superiority of our method is further underscored when experimenting with a wide
range of human sketches of diverse styles and poses. Surprisingly, our models
outperform the previous baseline regarding both the range of sketch inputs and
image qualities despite operating with a stricter setting: with no extra
training data and single sketch input. |
This paper proposes a novel framework to control the imagery generated by a pre-trained StyleGAN model using a single user sketch, eliminating the need for dedicated networks or training datasets. |
This approach aligns with the recent trend of utilizing pre-trained generative models and enables a more intuitive and flexible way for users to control image generation through sketches. |
The framework leverages energy-based learning to learn a conditional distribution in the latent space of the StyleGAN model. It introduces two novel energy functions based on CLIP to provide cross-domain semantic supervision, guiding the generated images to align with the input sketch. |
Quantitative evaluations on synthesized datasets demonstrate significant improvement over previous methods in one-shot image generation.
Experiments with real human sketches show the method's robustness to diverse sketch styles and poses, outperforming the baseline in terms of image quality and adaptability.
The proposed framework integrates seamlessly with other StyleGAN-based manipulations like latent space editing and natural image inversion, broadening its application in image editing. |
The method may struggle with sketches representing rare modes not well-represented in the source StyleGAN model's training data.
Future work could explore explicit control over the degree of output realism and extend the framework to other generative models beyond StyleGAN. |
image generation, sketch-to-image synthesis, stylegan, clip, energy-based models |
2310.18936
Report |
Adversarial Examples Are Not Real Features |
Ang Li, Yifei Wang, Yiwen Guo, Yisen Wang |
The existence of adversarial examples has been a mystery for years and
attracted much interest. A well-known theory by Ilyas et al. (2019)
explains adversarial vulnerability from a data perspective by showing that one
can extract non-robust features from adversarial examples and these features
alone are useful for classification. However, the explanation remains quite
counter-intuitive since non-robust features are mostly noise features to
humans. In this paper, we re-examine the theory from a larger context by
incorporating multiple learning paradigms. Notably, we find that contrary to
their good usefulness under supervised learning, non-robust features attain
poor usefulness when transferred to other self-supervised learning paradigms,
such as contrastive learning, masked image modeling, and diffusion models. It
reveals that non-robust features are not really as useful as robust or natural
features that enjoy good transferability between these paradigms. Meanwhile,
for robustness, we also show that naturally trained encoders from robust
features are largely non-robust under AutoAttack. Our cross-paradigm
examination suggests that the non-robust features are not really useful but
more like paradigm-wise shortcuts, and robust features alone might be
insufficient to attain reliable model robustness. Code is available at
https://github.com/PKU-ML/AdvNotRealFeatures. |
This paper challenges the prevailing view of adversarial examples as explained by the existence of non-robust features. It argues that these features are not truly useful but act as paradigm-specific shortcuts. |
Understanding the true nature of adversarial examples and their relation to data features is crucial for developing robust machine learning models. |
The authors propose a cross-paradigm evaluation framework, testing the usefulness and robustness of robust and non-robust features across various learning paradigms (Supervised, Contrastive, Masked Image Modeling, Diffusion). |
Non-robust features, while useful in supervised learning, show poor transferability and are largely useless in other self-supervised paradigms.
Robust features, claimed to be sufficient for robustness, fail to provide robustness when learned with different paradigms, especially under more reliable attacks.
Adversarial examples themselves show poor transferability across paradigms, suggesting a strong dependence on the learning objective. |
The study primarily focuses on image classification tasks, leaving its generalizability to other domains unexplored.
Further investigation is needed to understand the influence of data augmentation on the robustness of models trained on robust datasets. |
adversarial examples, robustness, non-robust features, cross-paradigm learning, transferability |
2310.18274
Report |
LipSim: A Provably Robust Perceptual Similarity Metric |
Sara Ghazanfari, Alexandre Araujo, Prashanth Krishnamurthy, Farshad Khorrami, Siddharth Garg |
Recent years have seen growing interest in developing and applying perceptual
similarity metrics. Research has shown the superiority of perceptual metrics
over pixel-wise metrics in aligning with human perception and serving as a
proxy for the human visual system. On the other hand, as perceptual metrics
rely on neural networks, there is a growing concern regarding their resilience,
given the established vulnerability of neural networks to adversarial attacks.
It is indeed logical to infer that perceptual metrics may inherit both the
strengths and shortcomings of neural networks. In this work, we demonstrate the
vulnerability of state-of-the-art perceptual similarity metrics based on an
ensemble of ViT-based feature extractors to adversarial attacks. We then
propose a framework to train a robust perceptual similarity metric called
LipSim (Lipschitz Similarity Metric) with provable guarantees. By leveraging
1-Lipschitz neural networks as the backbone, LipSim provides guarded areas
around each data point and certificates for all perturbations within an
$\ell_2$ ball. Finally, a comprehensive set of experiments shows the
performance of LipSim in terms of natural and certified scores and on the image
retrieval application. The code is available at
https://github.com/SaraGhazanfari/LipSim. |
The paper proposes LipSim, the first certifiably robust perceptual similarity metric, by leveraging 1-Lipschitz neural networks and a student-teacher training approach with DreamSim as the teacher model. |
Existing perceptual similarity metrics, while effective, are vulnerable to adversarial attacks, potentially compromising applications like image retrieval and copy detection. LipSim aims to address this vulnerability with provable robustness guarantees. |
LipSim utilizes a 1-Lipschitz feature extractor trained via knowledge distillation from DreamSim on ImageNet. It then fine-tunes the feature extractor with a hinge loss on the NIGHT dataset and projects the embeddings onto a unit hypersphere, enabling certified robustness. |
LipSim demonstrates higher empirical robustness compared to state-of-the-art perceptual metrics under various adversarial attacks.
The paper proves theoretical guarantees for LipSim's robustness, providing certified accuracy within a specified perturbation budget.
LipSim achieves good performance on image retrieval, showcasing its practical applicability for finding semantically similar images even with adversarial queries. |
The current implementation of LipSim is limited to 2AFC datasets and could be expanded for broader applicability.
Future work could explore LipSim's performance on a wider range of applications, such as copy detection and feature inversion. |
perceptual similarity, certified robustness, lipschitz networks, adversarial attacks, image retrieval |
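The LipSim summary above rests on one property: a 1-Lipschitz feature extractor cannot move an embedding further than the input perturbation. The Python sketch below illustrates how that yields a certificate for a two-alternative forced choice (2AFC). It is a simplified Euclidean variant, not the paper's exact certificate for its projected cosine distance; the function names `l2` and `certified_radius_2afc` and the toy embeddings are hypothetical.

```python
import math


def l2(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def certified_radius_2afc(f_ref, f_a, f_b):
    """Return the 2AFC choice and a radius within which it cannot flip.

    Assumption: f_ref = f(x_ref) for a feature extractor f that is
    1-Lipschitz in the l2 norm, so perturbing x_ref by delta moves f_ref
    by at most ||delta||. Each distance then changes by at most ||delta||
    and the margin by at most 2 * ||delta||. Only the reference image is
    assumed to be perturbed; the two alternatives stay fixed.
    """
    margin = l2(f_ref, f_b) - l2(f_ref, f_a)  # > 0 means "a" is closer
    choice = "a" if margin > 0 else "b"
    return choice, abs(margin) / 2.0


# Toy embeddings (made up for illustration).
choice, radius = certified_radius_2afc([0.0, 0.0], [1.0, 0.0], [0.0, 3.0])
print(choice, radius)
```

Under these assumptions, any perturbation of the reference image with l2 norm below the returned radius leaves the choice unchanged.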
2310.17880
Report |
Reconstructive Latent-Space Neural Radiance Fields for Efficient 3D Scene Representations |
Tristan Aumentado-Armstrong, Ashkan Mirzaei, Marcus A. Brubaker, Jonathan Kelly, Alex Levinshtein, Konstantinos G. Derpanis, Igor Gilitschenski |
Neural Radiance Fields (NeRFs) have proven to be powerful 3D representations,
capable of high quality novel view synthesis of complex scenes. While NeRFs
have been applied to graphics, vision, and robotics, problems with slow
rendering speed and characteristic visual artifacts prevent adoption in many
use cases. In this work, we investigate combining an autoencoder (AE) with a
NeRF, in which latent features (instead of colours) are rendered and then
convolutionally decoded. The resulting latent-space NeRF can produce novel
views with higher quality than standard colour-space NeRFs, as the AE can
correct certain visual artifacts, while rendering over three times faster. Our
work is orthogonal to other techniques for improving NeRF efficiency. Further,
we can control the tradeoff between efficiency and image quality by shrinking
the AE architecture, achieving over 13 times faster rendering with only a small
drop in performance. We hope that our approach can form the basis of an
efficient, yet high-fidelity, 3D scene representation for downstream tasks,
especially when retaining differentiability is useful, as in many robotics
scenarios requiring continual learning. |
This paper introduces Reconstructive Latent-Space NeRF (ReLS-NeRF), a novel 3D scene representation that combines an autoencoder (AE) with a NeRF for faster rendering and higher visual fidelity. |
NeRFs, while powerful, suffer from slow rendering speeds and visual artifacts, hindering their application in robotics and other fields. This work addresses these limitations to broaden NeRF's applicability. |
ReLS-NeRF renders low-resolution latent features instead of colors, using an AE to decode them into high-resolution images. The model is trained in three stages: AE training, joint NeRF fitting, and decoder fine-tuning. |
ReLS-NeRF achieves faster rendering (over 3 times) than standard NeRFs.
It improves visual quality on several metrics, including PSNR, LPIPS, and video quality metrics like DOVER.
The trade-off between speed and quality can be controlled by adjusting the AE architecture. |
The AE introduces temporal artifacts (view inconsistencies) not captured by standard metrics.
Future work includes exploring task-specific AEs and geometry-aware decoders. |
neural radiance fields, nerf, autoencoder, 3d scene representation, novel view synthesis |
2310.17527
Report |
Masked Space-Time Hash Encoding for Efficient Dynamic Scene Reconstruction |
Feng Wang, Zilong Chen, Guokang Wang, Yafei Song, Huaping Liu |
In this paper, we propose the Masked Space-Time Hash encoding (MSTH), a novel
method for efficiently reconstructing dynamic 3D scenes from multi-view or
monocular videos. Based on the observation that dynamic scenes often contain
substantial static areas that result in redundancy in storage and computations,
MSTH represents a dynamic scene as a weighted combination of a 3D hash encoding
and a 4D hash encoding. The weights for the two components are represented by a
learnable mask which is guided by an uncertainty-based objective to reflect the
spatial and temporal importance of each 3D position. With this design, our
method can reduce the hash collision rate by avoiding redundant queries and
modifications on static areas, making it feasible to represent a large number
of space-time voxels by hash tables with small size. Besides, without the
requirement to fit large numbers of temporally redundant features
independently, our method is easier to optimize and converges rapidly with only
twenty minutes of training for a 300-frame dynamic scene. As a result, MSTH
obtains consistently better results than previous methods with only 20 minutes
of training time and 130 MB of memory storage. Code is available at
https://github.com/masked-spacetime-hashing/msth |
This paper proposes Masked Space-Time Hash encoding (MSTH), a novel, efficient method for reconstructing dynamic 3D scenes from multi-view or monocular videos. |
Reconstructing dynamic scenes is crucial for various applications, but existing methods struggle with efficiency, memory usage, and rendering quality. |
MSTH uses a weighted combination of 3D and 4D hash encodings, guided by a learnable mask reflecting spatial and temporal importance, to reduce hash collisions and improve efficiency. |
MSTH achieves consistently better reconstruction metrics (PSNR, DSSIM, LPIPS) than state-of-the-art methods on multiple datasets.
The method requires only 20 minutes of training time, significantly faster than previous approaches.
MSTH maintains a compact memory footprint of 130MB, thanks to its efficient encoding scheme. |
MSTH may struggle with scenes lacking detailed dynamic information, leading to artifacts.
Future work includes addressing complex scenes, motion dynamics, and integrating multiple information sources for enhanced reconstruction. |
dynamic 3d scene reconstruction, neural radiance fields, hash encoding, uncertainty estimation, multi-view video |
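The masked combination described in the MSTH summary above can be sketched in a few lines of Python. This is a toy illustration only: the two feature functions are hypothetical stand-ins for real 3D and 4D hash encodings, and the convention that a mask near 1 favours the static branch is an assumption, not taken from the paper.

```python
import math


def static_features(xyz):
    """Toy stand-in for a 3D hash encoding (independent of time)."""
    x, y, z = xyz
    return [math.sin(x), math.sin(y), math.sin(z)]


def dynamic_features(xyz, t):
    """Toy stand-in for a 4D hash encoding (depends on position and time)."""
    x, y, z = xyz
    return [math.sin(x + t), math.sin(y + t), math.sin(z + t)]


def masked_features(xyz, t, mask_logit):
    """Blend static and dynamic features with a mask in (0, 1).

    Assumption: the mask is the sigmoid of a per-position logit, and a
    mask near 1 means the position is treated as static.
    """
    m = 1.0 / (1.0 + math.exp(-mask_logit))
    s = static_features(xyz)
    d = dynamic_features(xyz, t)
    return [m * si + (1.0 - m) * di for si, di in zip(s, d)]


point = (0.1, 0.2, 0.3)
print(masked_features(point, t=0.0, mask_logit=50.0))
print(masked_features(point, t=5.0, mask_logit=50.0))  # nearly identical: static
```

With a strongly positive logit the output no longer depends on time, which is the intuition behind skipping redundant 4D queries for static regions.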
2310.17347
Report |
CADS: Unleashing the Diversity of Diffusion Models through Condition-Annealed Sampling |
Seyedmorteza Sadat, Jakob Buhmann, Derek Bradley, Otmar Hilliges, Romann M. Weber |
While conditional diffusion models are known to have good coverage of the
data distribution, they still face limitations in output diversity,
particularly when sampled with a high classifier-free guidance scale for
optimal image quality or when trained on small datasets. We attribute this
problem to the role of the conditioning signal in inference and offer an
improved sampling strategy for diffusion models that can increase generation
diversity, especially at high guidance scales, with minimal loss of sample
quality. Our sampling strategy anneals the conditioning signal by adding
scheduled, monotonically decreasing Gaussian noise to the conditioning vector
during inference to balance diversity and condition alignment. Our
Condition-Annealed Diffusion Sampler (CADS) can be used with any pretrained
model and sampling algorithm, and we show that it boosts the diversity of
diffusion models in various conditional generation tasks. Further, using an
existing pretrained diffusion model, CADS achieves a new state-of-the-art FID
of 1.70 and 2.31 for class-conditional ImageNet generation at 256$\times$256
and 512$\times$512 respectively. |
This paper introduces Condition-Annealed Diffusion Sampler (CADS), a novel sampling strategy for diffusion models to enhance generation diversity without compromising quality. |
Conditional diffusion models, while powerful, often lack diversity in their outputs, especially at high classifier-free guidance scales or when trained on smaller datasets. This limits their ability to fully capture the breadth of the data distribution. |
CADS anneals the conditioning signal by adding scheduled, monotonically decreasing Gaussian noise during inference. This disrupts the strong dependence on the conditioning signal initially and gradually restores it, promoting exploration of the data distribution while maintaining alignment with the input condition. |
CADS significantly boosts diversity across various tasks (class-conditional ImageNet generation, pose-to-image, text-to-image, and identity-conditioned face generation) as measured by FID, Recall, and similarity scores.
CADS achieves state-of-the-art FID scores on class-conditional ImageNet generation at 256x256 and 512x512 resolutions by leveraging higher guidance scales without sacrificing diversity.
The method is compatible with various diffusion samplers (DDPM, DDIM, PNDM, DPM++) and consistently improves their performance. |
Applying CADS to complex conditioning contexts like dense segmentation maps requires further investigation.
While CADS mitigates the diversity-quality trade-off, finding the optimal annealing schedule might require task-specific tuning. |
diffusion models, generative modeling, diversity, classifier-free guidance, sampling strategies |
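The condition annealing described in the CADS summary above can be illustrated with a small Python sketch. The piecewise-linear schedule, the thresholds `tau1` and `tau2`, and `noise_scale` are assumptions chosen for illustration, and the conditioning vector is a plain list of floats rather than a real embedding.

```python
import math
import random


def gamma(t, tau1=0.6, tau2=0.9):
    """Annealing coefficient for diffusion time t in [0, 1].

    Assumption: t = 1 is the start of sampling (pure noise) and t = 0 the
    end, so the coefficient rises from 0 to 1 as sampling proceeds.
    """
    if t <= tau1:
        return 1.0
    if t >= tau2:
        return 0.0
    return (tau2 - t) / (tau2 - tau1)


def anneal_condition(y, t, noise_scale=0.1, rng=random):
    """Mix the conditioning vector with Gaussian noise according to gamma(t).

    Early in sampling the condition is mostly noise; the noise weight
    decreases monotonically until the clean condition is restored.
    """
    g = gamma(t)
    return [
        math.sqrt(g) * yi + noise_scale * math.sqrt(1.0 - g) * rng.gauss(0.0, 1.0)
        for yi in y
    ]


rng = random.Random(0)
for t in (1.0, 0.75, 0.0):
    print(t, anneal_condition([1.0, 2.0], t, rng=rng))
```

At `t = 0.0` the function returns the condition unchanged, so the final denoising steps see the original conditioning signal.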
2310.17050
Report |
Exploring Question Decomposition for Zero-Shot VQA |
Zaid Khan, Vijay Kumar BG, Samuel Schulter, Manmohan Chandraker, Yun Fu |
Visual question answering (VQA) has traditionally been treated as a
single-step task where each question receives the same amount of effort, unlike
natural human question-answering strategies. We explore a question
decomposition strategy for VQA to overcome this limitation. We probe the
ability of recently developed large vision-language models to use human-written
decompositions and produce their own decompositions of visual questions,
finding they are capable of learning both tasks from demonstrations alone.
However, we show that naive application of model-written decompositions can
hurt performance. We introduce a model-driven selective decomposition approach
for second-guessing predictions and correcting errors, and validate its
effectiveness on eight VQA tasks across three domains, showing consistent
improvements in accuracy, including improvements of >20% on medical VQA
datasets and boosting the zero-shot performance of BLIP-2 above chance on a VQA
reformulation of the challenging Winoground task. Project Site:
https://zaidkhan.me/decomposition-0shot-vqa/ |
This paper explores question decomposition as a strategy for zero-shot visual question answering (VQA) with large vision-language models (VLMs), enabling them to approach reasoning-heavy VQA as a two-step process. |
Traditional VQA treats all questions as single-step tasks, unlike natural human question-answering strategies where more complex questions receive more effort. This work aims to address this limitation by introducing a question decomposition strategy for VQA. |
The authors first probe the ability of large VLMs to use human-written and then model-generated question decompositions in a zero-shot setting. They then introduce a model-driven 'selective decomposition' approach to address limitations of naive decomposition, evaluating its effectiveness on eight VQA tasks across three domains. |
Large VLMs can effectively use human-written decompositions to improve VQA accuracy without explicit training and do not merely exploit surface-level statistics.
Generative, instruction-tuned language models can produce effective decompositions zero-shot without task-specific training.
Selective decomposition, which applies decomposition only when the model is uncertain about its initial answer, consistently improves VQA accuracy across datasets and domains, with significant gains on medical VQA datasets and the challenging Winoground task. |
The study primarily considers two-step decomposition approaches. Exploring multi-step approaches for more complex reasoning remains a future direction.
While the paper focuses on in-context learning, investigating the benefits of explicitly training models to produce and consume decompositions is left for future work. |
visual question answering, question decomposition, zero-shot learning, vision-language models, selective prediction |
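The selective decomposition idea summarized above (decompose only when the model is unsure of its first answer) can be sketched as a small control loop in Python. The callables `answer_fn` and `decompose_fn`, the confidence threshold, and the toy stubs are all hypothetical; a real system would back them with a vision-language model.

```python
def selective_decomposition(question, answer_fn, decompose_fn, threshold=0.5):
    """Answer a question, second-guessing only low-confidence predictions.

    Assumed interfaces (hypothetical):
      answer_fn(question, context) -> (answer, confidence in [0, 1])
      decompose_fn(question)       -> list of sub-questions
    Returns (final_answer, whether_decomposition_was_used).
    """
    answer, confidence = answer_fn(question, context=[])
    if confidence >= threshold:
        return answer, False  # confident enough: keep the first answer

    # Low confidence: answer the sub-questions, then re-ask with them as context.
    context = []
    for sub_q in decompose_fn(question):
        sub_a, _ = answer_fn(sub_q, context=[])
        context.append((sub_q, sub_a))
    revised, _ = answer_fn(question, context=context)
    return revised, True


# Toy stubs standing in for a vision-language model.
def toy_answer(question, context):
    if question == "Is the animal a pet?":
        return ("yes", 0.9) if context else ("no", 0.3)
    return ("dog", 0.95)


def toy_decompose(question):
    return ["What animal is shown?"]


print(selective_decomposition("Is the animal a pet?", toy_answer, toy_decompose))
```

Raising or lowering `threshold` trades extra model calls against the chance of correcting an initial error.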
2310.16951
Report |
The Teenager's Problem: Efficient Garment Decluttering With Grasp Optimization |
Aviv Adler, Ayah Ahmad, Shengyin Wang, Wisdom C. Agboh, Edith Llontop, Tianshuang Qiu, Jeffrey Ichnowski, Mehmet Dogar, Thomas Kollar, Richard Cheng, Ken Goldberg |
This paper addresses the "Teenager's Problem": efficiently removing
scattered garments from a planar surface. As grasping and transporting
individual garments is highly inefficient, we propose analytical policies to
select grasp locations for multiple garments using an overhead camera. Two
classes of methods are considered: depth-based, which use overhead depth data
to find efficient grasps, and segment-based, which use segmentation on the RGB
overhead image (without requiring any depth data); grasp efficiency is measured
by Objects per Transport, which denotes the average number of objects removed
per trip to the laundry basket. Experiments suggest that both depth- and
segment-based methods easily improve Objects per Transport (OpT) by $20\%$;
furthermore, these approaches complement each other, with combined hybrid
methods yielding improvements of $34\%$. Finally, a method employing
consolidation (with segmentation) is considered, which manipulates the garments
on the work surface to increase OpT; this yields an improvement of $67\%$ over
the baseline, though at a cost of additional physical actions. |
This paper introduces the "Teenager's Problem" - efficient decluttering of garments from a surface, proposing depth-based, segment-based, and hybrid methods to optimize grasp locations for removing multiple garments simultaneously. |
Efficient garment manipulation is important in various domains like hotels, retail, and manufacturing, where current individual garment grasping methods are inefficient. |
The paper evaluates different grasp planning methods, including depth-based (highest point, max volume), segment-based (using segmentation to grasp multiple garments), hybrid (combining depth and segmentation), and a baseline random grasping method. These methods are tested with a real robot to compare their efficiency in clearing a workspace of scattered garments. |
Both depth-based and segment-based methods individually increase grasping efficiency (Objects per Transport) by 20%.
Hybrid methods, combining depth and segmentation, yield even larger improvements, reaching up to 34%.
Incorporating consolidation actions (rearranging garments within the workspace) with segmentation achieves a 67% improvement but requires additional physical actions. |
The methods rely on accurate separation of garments from the background using color, which may not generalize well to different setups.
The grasps use a fixed height and vertical orientation, potentially limiting efficiency. Future work could explore optimizing grasp height and angle. |
robotics, garment manipulation, decluttering, grasp planning, image segmentation |
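The Objects per Transport (OpT) metric used in the record above is simply the average number of objects removed per trip. A minimal Python sketch of the metric and of a relative improvement over a baseline follows; the trip logs are made-up numbers for illustration and are not results from the paper.

```python
def objects_per_transport(objects_per_trip):
    """Average number of objects removed per trip to the basket."""
    if not objects_per_trip:
        raise ValueError("at least one trip is required")
    return sum(objects_per_trip) / len(objects_per_trip)


def relative_improvement(method_opt, baseline_opt):
    """Relative OpT gain of a method over a baseline (0.2 means +20%)."""
    return (method_opt - baseline_opt) / baseline_opt


# Hypothetical logs: both runs clear 10 garments.
baseline = objects_per_transport([1, 1, 2, 1, 1, 1, 2, 1])  # 8 trips
method = objects_per_transport([2, 2, 2, 2, 2])             # 5 trips
print(baseline, method, relative_improvement(method, baseline))
```

Because the same number of garments is cleared in both runs, a higher OpT directly corresponds to fewer trips.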
2310.16858
Report |
4D-Editor: Interactive Object-level Editing in Dynamic Neural Radiance Fields via Semantic Distillation |
Dadong Jiang, Zhihui Ke, Xiaobo Zhou, Xidong Shi |
This paper targets interactive object-level editing (e.g., deletion,
recoloring, transformation, composition) in dynamic scenes. Recently, some
methods aiming for flexible editing of static scenes represented by neural
radiance fields (NeRF) have shown impressive synthesis quality, while similar
capabilities in time-variant dynamic scenes remain limited. To solve this
problem, we propose 4D-Editor, an interactive semantic-driven editing
framework, allowing editing multiple objects in a dynamic NeRF with user
strokes on a single frame. We propose an extension to the original dynamic NeRF
by incorporating a hybrid semantic feature distillation to maintain
spatial-temporal consistency after editing. In addition, we design Recursive
Selection Refinement that significantly boosts object segmentation accuracy
within a dynamic NeRF to aid the editing process. Moreover, we develop
Multi-view Reprojection Inpainting to fill holes caused by incomplete scene
capture after editing. Extensive experiments and editing examples on real-world
scenes demonstrate that 4D-Editor achieves photo-realistic editing on dynamic NeRFs.
Project page: https://patrickddj.github.io/4D-Editor |
4D-Editor, an interactive object-level editing framework for dynamic neural radiance fields (NeRFs), allows users to edit multiple objects with strokes on a single reference frame, propagating modifications throughout the entire dynamic NeRF. |
Existing NeRF editing methods are limited to static scenes or lack object-level control in dynamic scenes. 4D-Editor addresses this gap by enabling interactive and precise object editing in dynamic NeRFs, crucial for applications like VR/AR and animation. |
4D-Editor utilizes hybrid semantic feature distillation from a pre-trained DINO model to guide object segmentation. It introduces Recursive Selection Refinement for accurate object selection and Multi-view Reprojection Inpainting to fill holes caused by object removal. |
4D-Editor achieves precise object-level editing in dynamic NeRFs with user-friendly strokes, demonstrated on challenging datasets.
Recursive Selection Refinement significantly improves object segmentation accuracy compared to traditional methods.
Multi-view Reprojection Inpainting effectively fills holes after object removal, preserving spatial-temporal consistency. |
Removing shadows of moving objects remains challenging.
Scene inpainting might exhibit spatial-temporal inconsistencies in some cases, requiring further investigation. |
neural radiance fields, dynamic scene editing, interactive editing, semantic distillation, 4d object segmentation |
2310.16825
Report |
CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images |
Aaron Gokaslan, A. Feder Cooper, Jasmine Collins, Landan Seguin, Austin Jacobson, Mihir Patel, Jonathan Frankle, Cory Stephenson, Volodymyr Kuleshov |
We assemble a dataset of Creative-Commons-licensed (CC) images, which we use
to train a set of open diffusion models that are qualitatively competitive with
Stable Diffusion 2 (SD2). This task presents two challenges: (1)
high-resolution CC images lack the captions necessary to train text-to-image
generative models; (2) CC images are relatively scarce. In turn, to address
these challenges, we use an intuitive transfer learning technique to produce a
set of high-quality synthetic captions paired with curated CC images. We then
develop a data- and compute-efficient training recipe that requires as little
as 3% of the LAION-2B data needed to train existing SD2 models, but obtains
comparable quality. These results indicate that we have a sufficient number of
CC images (~70 million) for training high-quality models. Our training recipe
also implements a variety of optimizations that achieve ~3X training speed-ups,
enabling rapid model iteration. We leverage this recipe to train several
high-quality text-to-image models, which we dub the CommonCanvas family. Our
largest model achieves comparable performance to SD2 on a human evaluation,
despite being trained on our CC dataset that is significantly smaller than
LAION and using synthetic captions for training. We release our models, data,
and code at
https://github.com/mosaicml/diffusion/blob/main/assets/common-canvas.md |
This paper introduces CommonCanvas, a suite of text-to-image latent diffusion models trained solely on Creative Commons images and synthetically generated captions. |
This work addresses copyright and reproducibility concerns associated with training diffusion models on web-scraped data (like LAION). |
The authors curate a dataset of CC images and use a pre-trained BLIP-2 model to generate captions for these images. They also develop efficient training techniques that allow them to train high-quality models with significantly less data. |
Training diffusion models on less than 3% of the data used to train Stable Diffusion 2 (SD2) yields comparable performance on standard metrics.
Synthetic captions can be as effective as human-generated captions for training diffusion models.
CommonCanvas models, despite being trained on a smaller dataset with synthetic captions, achieve comparable performance to SD2 on human evaluations. |
The CC image dataset used is smaller and potentially less diverse than web-scraped datasets.
The reliance on a pre-trained BLIP-2 model for captions introduces potential biases. |
diffusion models, copyright, synthetic data, image captioning, data efficiency |
2310.16818
Report |
DreamCraft3D: Hierarchical 3D Generation with Bootstrapped Diffusion Prior |
Jingxiang Sun, Bo Zhang, Ruizhi Shao, Lizhen Wang, Wen Liu, Zhenda Xie, Yebin Liu |
We present DreamCraft3D, a hierarchical 3D content generation method that
produces high-fidelity and coherent 3D objects. We tackle the problem by
leveraging a 2D reference image to guide the stages of geometry sculpting and
texture boosting. A central focus of this work is to address the consistency
issue that existing works encounter. To sculpt geometries that render
coherently, we perform score distillation sampling via a view-dependent
diffusion model. This 3D prior, alongside several training strategies,
prioritizes the geometry consistency but compromises the texture fidelity. We
further propose Bootstrapped Score Distillation to specifically boost the
texture. We train a personalized diffusion model, Dreambooth, on the augmented
renderings of the scene, imbuing it with 3D knowledge of the scene being
optimized. The score distillation from this 3D-aware diffusion prior provides
view-consistent guidance for the scene. Notably, through an alternating
optimization of the diffusion prior and 3D scene representation, we achieve
mutually reinforcing improvements: the optimized 3D scene aids in training the
scene-specific diffusion model, which offers increasingly view-consistent
guidance for 3D optimization. The optimization is thus bootstrapped and leads
to substantial texture boosting. With tailored 3D priors throughout the
hierarchical generation, DreamCraft3D generates coherent 3D objects with
photorealistic renderings, advancing the state-of-the-art in 3D content
generation. Code available at https://github.com/deepseek-ai/DreamCraft3D. |
This paper presents DreamCraft3D, a hierarchical 3D content generation method that produces high-fidelity and coherent 3D objects by leveraging a 2D reference image to guide geometry sculpting and texture boosting. |
Addresses the limitations of existing 3D content generation methods that struggle to create complex objects with consistent geometry and textures. |
A hierarchical pipeline with two main stages: (1) Geometry sculpting: Uses a view-conditioned diffusion model and progressive view training to create detailed, consistent geometry from a 2D reference image. (2) Texture boosting: Employs a bootstrapped score distillation (BSD) approach that iteratively refines the 3D texture by jointly optimizing the 3D representation and a personalized DreamBooth diffusion model. |
Generates creative 3D assets with intricate geometric structures and realistic textures rendered coherently in 360 degrees.
Outperforms existing text-to-3D and image-to-3D methods in terms of texture quality, geometric consistency, and overall visual fidelity.
Demonstrates superior performance in user studies, with a strong preference for DreamCraft3D-generated models. |
Occasionally incorporates frontal-view details into textures due to depth ambiguity.
Does not explicitly separate material and lighting information from the 2D reference image. |
3d content generation, diffusion models, dreambooth, texture synthesis, view consistency |
2310.16656
Report |
A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation |
Eyal Segalis, Dani Valevski, Danny Lumen, Yossi Matias, Yaniv Leviathan |
Text-to-image diffusion models achieved a remarkable leap in capabilities
over the last few years, enabling high-quality and diverse synthesis of images
from a textual prompt. However, even the most advanced models often struggle to
precisely follow all of the directions in their prompts. The vast majority of
these models are trained on datasets consisting of (image, caption) pairs where
the images often come from the web, and the captions are their HTML alternate
text. A notable example is the LAION dataset, used by Stable Diffusion and
other models. In this work we observe that these captions are often of low
quality, and argue that this significantly affects the model's capability to
understand nuanced semantics in the textual prompts. We show that by relabeling
the corpus with a specialized automatic captioning model and training a
text-to-image model on the recaptioned dataset, the model benefits
substantially across the board. First, in overall image quality: e.g. FID 14.84
vs. the baseline of 17.87, and 64.3% improvement in faithful image generation
according to human evaluation. Second, in semantic alignment, e.g. semantic
object accuracy 84.34 vs. 78.90, counting alignment errors 1.32 vs. 1.44 and
positional alignment 62.42 vs. 57.60. We analyze various ways to relabel the
corpus and provide evidence that this technique, which we call RECAP, both
reduces the train-inference discrepancy and provides the model with more
information per example, increasing sample efficiency and allowing the model to
better understand the relations between captions and images. |
This paper introduces RECAP, a method that improves text-to-image models by training them on synthetically generated captions. It involves fine-tuning an automatic captioning system and using it to generate more detailed and contextually relevant captions for the training images. |
Existing text-to-image models often struggle to accurately follow nuanced prompts because they are trained on datasets with low-quality captions (e.g., HTML alt text). This method aims to address this limitation and improve the models' fidelity and semantic understanding. |
The method has 3 steps: 1) Fine-tune an image-to-text model (PaLI) on human-annotated captions to generate detailed descriptions. 2) Use the fine-tuned model to re-caption the image training dataset. 3) Fine-tune a text-to-image model (Stable Diffusion) on the dataset with the new captions. |
Significantly improved image quality metrics, with FID improving from 17.87 to 14.84.
Improved semantic alignment between generated images and prompts, demonstrated by increased object accuracy, counting alignment, and positional alignment scores.
Human evaluation showed 64.3% relative improvement in generating images successfully following the prompts. |
The study primarily focuses on fine-tuning a pre-trained model; exploring the impact of training from scratch with RECAP captions is left for future work.
The impact of RECAP on larger models and datasets is yet to be explored. |
text-to-image generation, image captioning, synthetic data, semantic alignment, diffusion models |
2310.16400
Report |
Fuse Your Latents: Video Editing with Multi-source Latent Diffusion Models |
Tianyi Lu, Xing Zhang, Jiaxi Gu, Hang Xu, Renjing Pei, Songcen Xu, Zuxuan Wu |
Latent Diffusion Models (LDMs) are renowned for their powerful capabilities
in image and video synthesis. Yet, video editing methods suffer from
insufficient pre-training data or video-by-video re-training cost. In
addressing this gap, we propose FLDM (Fused Latent Diffusion Model), a
training-free framework to achieve text-guided video editing by applying
off-the-shelf image editing methods in video LDMs. Specifically, FLDM fuses
latents from an image LDM and a video LDM during the denoising process. In
this way, temporal consistency can be kept with the video LDM while the high
fidelity of the image LDM can also be exploited. Meanwhile, FLDM possesses high
flexibility since both the image LDM and the video LDM can be replaced, so advanced
image editing methods such as InstructPix2Pix and ControlNet can be exploited.
To the best of our knowledge, FLDM is the first method to adapt off-the-shelf
image editing methods into video LDMs for video editing. Extensive quantitative
and qualitative experiments demonstrate that FLDM can improve the textual
alignment and temporal consistency of edited videos. |
This paper proposes FLDM (Fused Latent Diffusion Model), a training-free framework for text-guided video editing using off-the-shelf image editing methods within video LDMs. |
Existing video editing methods are limited by insufficient pre-training data or require costly video-by-video retraining. This work addresses these limitations by leveraging the strengths of both image and video LDMs. |
FLDM fuses latent representations from an image LDM and a video LDM during the denoising process. This allows for control over the balance between temporal consistency (from the video LDM) and editing fidelity (from the image LDM). |
FLDM improves the textual alignment and temporal consistency of edited videos compared to using image or video LDMs alone.
The method is flexible and can be used with different off-the-shelf image editing techniques, such as InstructPix2Pix and ControlNet.
FLDM demonstrates the complementary nature of image and video LDMs in achieving high-quality video editing. |
The paper uses a re-implemented video diffusion model due to the lack of publicly available high-quality pre-trained models.
Further exploration is needed to optimize the fusion strategy and apply it to other video editing tasks. |
video editing, latent diffusion models, text-guided editing, temporal consistency, multi-source fusion |
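The latent fusion described in the FLDM record above can be illustrated as a convex combination of two latents at a denoising step. The Python sketch below is a simplification under stated assumptions: latents are flat lists of floats, the mixing weight `alpha` is a hypothetical parameter, and the paper's actual fusion schedule is not reproduced here.

```python
def fuse_latents(video_latent, image_latent, alpha):
    """Blend latents from a video model and an image model.

    Assumption: alpha in [0, 1] weights the video latent (temporal
    consistency) and 1 - alpha weights the image latent (editing fidelity).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(video_latent) != len(image_latent):
        raise ValueError("latents must have the same length")
    return [alpha * v + (1.0 - alpha) * i
            for v, i in zip(video_latent, image_latent)]


# Toy latents for a single denoising step.
print(fuse_latents([1.0, 2.0], [3.0, 4.0], alpha=0.5))
```

Setting `alpha` to 1 recovers the video model's latent and 0 recovers the image model's latent, which makes the balance between the two sources explicit.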
2310.16383
Report |
Open-NeRF: Towards Open Vocabulary NeRF Decomposition |
Hao Zhang, Fang Li, Narendra Ahuja |
In this paper, we address the challenge of decomposing Neural Radiance Fields
(NeRF) into objects from an open vocabulary, a critical task for object
manipulation in 3D reconstruction and view synthesis. Current techniques for
NeRF decomposition involve a trade-off between the flexibility of processing
open-vocabulary queries and the accuracy of 3D segmentation. We present
Open-vocabulary Embedded Neural Radiance Fields (Open-NeRF), which leverage
large-scale, off-the-shelf segmentation models like the Segment Anything Model
(SAM) and introduce an integrate-and-distill paradigm with hierarchical
embeddings to achieve both the flexibility of open-vocabulary querying and 3D
segmentation accuracy. Open-NeRF first utilizes large-scale foundation models
to generate hierarchical 2D mask proposals from varying viewpoints. These
proposals are then aligned via tracking approaches and integrated within the 3D
space and subsequently distilled into the 3D field. This process ensures
consistent recognition and granularity of objects from different viewpoints,
even in challenging scenarios involving occlusion and indistinct features. Our
experimental results show that the proposed Open-NeRF outperforms
state-of-the-art methods such as LERF and FFD in
open-vocabulary scenarios. Open-NeRF offers a promising solution to NeRF
decomposition, guided by open-vocabulary queries, enabling novel applications
in robotics and vision-language interaction in open-world 3D scenes. |
Open-NeRF decomposes Neural Radiance Fields (NeRF) into objects from an open vocabulary using an integrate-and-distill paradigm with hierarchical embeddings. |
NeRF decomposition is crucial for object manipulation in 3D reconstruction and view synthesis, but existing methods struggle with the trade-off between handling open-vocabulary queries and accurate 3D segmentation. |
Open-NeRF uses large-scale foundation models (SAM, OpenCLIP) to generate 2D mask proposals from multiple viewpoints, aligns them via tracking, integrates them in 3D space, and distills them into the 3D field. It also employs hierarchical embeddings to handle queries at different scales (object, part, background). |
Open-NeRF outperforms state-of-the-art methods (LERF, FFD) in open-vocabulary scenarios.
It accurately segments both common and novel objects regardless of viewpoint.
It enables flexible object manipulation based on various attributes like product name, brand, color, and text. |
The performance of Open-NeRF is limited by the capabilities of the underlying foundation models (SAM, OpenCLIP).
Future work could explore more robust methods for handling background regions. |
nerf, 3d scene understanding, open vocabulary, segmentation, vision-language models |
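The summary above does not spell out how an open-vocabulary query selects parts of the 3D field. A common pattern with CLIP-style embeddings is to compare a text embedding against per-point embeddings by cosine similarity and keep the points above a threshold. The sketch below illustrates only that generic pattern; the function names, the threshold value, and the flat list-of-vectors layout are assumptions for illustration, not the authors' implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as matching nothing.
    return dot / (norm_a * norm_b)


def select_points(point_embeddings, query_embedding, threshold=0.8):
    """Return indices of points whose embedding matches the query.

    `threshold` is a hypothetical cut-off chosen for illustration.
    """
    return [
        index
        for index, embedding in enumerate(point_embeddings)
        if cosine_similarity(embedding, query_embedding) >= threshold
    ]
```

In a hierarchical setting, the same comparison could be run once per scale (object, part, background), each with its own embedding set.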
2310.16167
Report |
iNVS: Repurposing Diffusion Inpainters for Novel View Synthesis |
Yash Kant, Aliaksandr Siarohin, Michael Vasilkovsky, Riza Alp Guler, Jian Ren, Sergey Tulyakov, Igor Gilitschenski |
We present a method for generating consistent novel views from a single
source image. Our approach focuses on maximizing the reuse of visible pixels
from the source image. To achieve this, we use a monocular depth estimator that
transfers visible pixels from the source view to the target view. Starting from
a pre-trained 2D inpainting diffusion model, we train our method on the
large-scale Objaverse dataset to learn 3D object priors. While training we use
a novel masking mechanism based on epipolar lines to further improve the
quality of our approach. This allows our framework to perform zero-shot novel
view synthesis on a variety of objects. We evaluate the zero-shot abilities of
our framework on three challenging datasets: Google Scanned Objects, Ray Traced
Multiview, and Common Objects in 3D. See our webpage for more details:
https://yashkant.github.io/invs/ |
The paper introduces iNVS, a novel method for synthesizing new views of an object from a single source image by leveraging a pretrained 2D inpainting diffusion model and maximizing the reuse of visible pixels through depth-based warping. |
Generating high-fidelity novel views from a single image is crucial for various applications but remains challenging due to the need to infer 3D geometry from limited information. Existing methods often struggle with consistency, quality, or generalization. |
iNVS uses a monocular depth estimator to warp visible pixels from the source to the target view. It then employs an inpainting diffusion model, finetuned on the Objaverse dataset, to recover missing regions, guided by an epipolar mask that identifies newly visible areas. |
iNVS outperforms baseline methods on PSNR and achieves comparable LPIPS scores, indicating good noise reduction and perceptual similarity.
The method excels at preserving text and fine details from the source image due to its pixel reuse strategy.
While iNVS demonstrates strong performance, it can exhibit limitations in accurately reconstructing object shapes due to reliance on monocular depth estimation, leading to lower SSIM scores. |
The method's reliance on monocular depth estimation can lead to structural inconsistencies, particularly in regions with significant viewpoint changes.
Future work could explore auto-regressive schemes for novel view generation to address limitations in generating consistent textures in unseen regions. |
novel view synthesis, diffusion models, inpainting, epipolar geometry, single image |
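The depth-based warping step described above can be sketched for a single pixel under a pinhole camera model: back-project the pixel using its estimated depth, move the 3D point into the target camera frame, and project it again. The intrinsics tuple, the rotation/translation convention (source camera to target camera), and the function name are assumptions made for this illustration; the paper's actual warping code may differ.

```python
def warp_pixel(u, v, depth, intrinsics, rotation, translation):
    """Warp one source pixel into the target view.

    intrinsics: (fx, fy, cx, cy) of a pinhole camera (assumed shared by both views).
    rotation: 3x3 nested list, translation: length-3 list, mapping
    source-camera coordinates to target-camera coordinates.
    Returns (u_target, v_target, depth_target), or None if the point
    ends up behind the target camera.
    """
    fx, fy, cx, cy = intrinsics

    # Back-project the pixel to a 3D point in the source camera frame.
    point = (
        (u - cx) / fx * depth,
        (v - cy) / fy * depth,
        depth,
    )

    # Apply the rigid transform into the target camera frame.
    moved = [
        sum(rotation[row][col] * point[col] for col in range(3)) + translation[row]
        for row in range(3)
    ]
    if moved[2] <= 0.0:
        return None

    # Project back to pixel coordinates in the target view.
    return (
        fx * moved[0] / moved[2] + cx,
        fy * moved[1] / moved[2] + cy,
        moved[2],
    )
```

Target pixels that receive no warped source pixel are the regions the inpainting model would have to fill.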
2310.16044
Report |
Stanford-ORB: A Real-World 3D Object Inverse Rendering Benchmark |
Zhengfei Kuang, Yunzhi Zhang, Hong-Xing Yu, Samir Agarwala, Shangzhe Wu, Jiajun Wu |
We introduce Stanford-ORB, a new real-world 3D Object inverse Rendering
Benchmark. Recent advances in inverse rendering have enabled a wide range of
real-world applications in 3D content generation, moving rapidly from research
and commercial use cases to consumer devices. While the results continue to
improve, there is no real-world benchmark that can quantitatively assess and
compare the performance of various inverse rendering methods. Existing
real-world datasets typically only consist of the shape and multi-view images
of objects, which are not sufficient for evaluating the quality of material
recovery and object relighting. Methods capable of recovering material and
lighting often resort to synthetic data for quantitative evaluation, which on
the other hand does not guarantee generalization to complex real-world
environments. We introduce a new dataset of real-world objects captured under a
variety of natural scenes with ground-truth 3D scans, multi-view images, and
environment lighting. Using this dataset, we establish the first comprehensive
real-world evaluation benchmark for object inverse rendering tasks from
in-the-wild scenes, and compare the performance of various existing methods. |
This paper introduces Stanford-ORB, a novel real-world 3D object inverse rendering benchmark designed to address the lack of standardized evaluation for inverse rendering methods in complex, real-world settings. |
Accurately evaluating inverse rendering methods is crucial as their applications in 3D content creation and robotics expand. However, current benchmarks often rely on synthetic data, limiting generalization to real-world scenarios. |
The authors created a dataset of 14 objects captured in 7 diverse real-world scenes, including ground-truth 3D scans, multi-view images, and environment lighting. They established three evaluation benchmarks: geometry estimation, novel scene relighting, and novel view synthesis. |
IDR excels in geometry reconstruction and novel view synthesis, outperforming NeRF and its variants.
NVDiffRecMC demonstrates superior performance in novel scene relighting compared to other inverse rendering methods.
Methods using ground-truth shape and material information significantly outperform those relying solely on learned priors, highlighting areas for future research. |
The dataset is currently limited to non-translucent objects and faces difficulties capturing thin, deformable objects.
Future work includes expanding the dataset with more diverse objects, incorporating multi-object scenes, and capturing complete environment maps. |
inverse rendering, benchmarking, 3d reconstruction, relighting, novel view synthesis |
2310.16002
Report |
Integrating View Conditions for Image Synthesis |
Jinbin Bai, Zhen Dong, Aosong Feng, Xiao Zhang, Tian Ye, Kaicheng Zhou |
In the field of image processing, applying intricate semantic modifications
within existing images remains an enduring challenge. This paper introduces a
pioneering framework that integrates viewpoint information to enhance the
control of image editing tasks, especially for interior design scenes. By
surveying existing object editing methodologies, we distill three essential
criteria -- consistency, controllability, and harmony -- that should be met for
an image editing method. In contrast to previous approaches, our framework
takes the lead in satisfying all three requirements for addressing the
challenge of image synthesis. Through comprehensive experiments, encompassing
both quantitative assessments and qualitative comparisons with contemporary
state-of-the-art methods, we present compelling evidence of our framework's
superior performance across multiple dimensions. This work establishes a
promising avenue for advancing image synthesis techniques and empowering
precise object modifications while preserving the visual coherence of the
entire composition. |
This paper introduces a novel image editing framework that leverages viewpoint information to enhance control over object manipulation in images, particularly for interior design scenes. |
Existing image editing methods struggle to simultaneously achieve consistency in object appearance, controllability over object pose and position, and harmonious integration with the scene. This framework addresses these limitations. |
The framework combines several components: 1) an LLM planner to extract object and pose information from user prompts, 2) pose estimation and synthesis modules for generating target objects with desired viewpoints, and 3) a personalized diffusion model with ControlNets for harmoniously integrating the synthesized object into the scene. |
The framework outperforms state-of-the-art reference-based image synthesis methods in terms of consistency, harmony, and controllability, as demonstrated through qualitative comparisons and human evaluations.
Ablation studies confirm the necessity of each component, highlighting the importance of view conditions for accurate object synthesis.
The framework demonstrates robustness to slight errors in view condition specifications. |
Future work aims to develop an end-to-end solution by integrating view control directly within the latent space of the diffusion model, improving efficiency.
The current implementation relies on explicit pose estimation and synthesis, which can be further streamlined. |
image editing, view control, pose synthesis, diffusion models, interior design |
2310.15747
Report |
Large Language Models are Temporal and Causal Reasoners for Video Question Answering |
Dohwan Ko, Ji Soo Lee, Wooyoung Kang, Byungseok Roh, Hyunwoo J. Kim |
Large Language Models (LLMs) have shown remarkable performances on a wide
range of natural language understanding and generation tasks. We observe that
the LLMs provide effective priors in exploiting $\textit{linguistic shortcuts}$
for temporal and causal reasoning in Video Question Answering (VideoQA).
However, such priors often cause suboptimal results on VideoQA by leading the
model to over-rely on questions, $\textit{i.e.}$, $\textit{linguistic bias}$,
while ignoring visual content. This is also known as `ungrounded guesses' or
`hallucinations'. To address this problem while leveraging LLMs' prior on
VideoQA, we propose a novel framework, Flipped-VQA, encouraging the model to
predict all the combinations of $\langle$V, Q, A$\rangle$ triplet by flipping
the source pair and the target label to understand their complex relationships,
$\textit{i.e.}$, predict A, Q, and V given the VQ, VA, and QA pairs,
respectively. In this paper, we develop LLaMA-VQA by applying Flipped-VQA to
LLaMA, and it outperforms both LLMs-based and non-LLMs-based models on five
challenging VideoQA benchmarks. Furthermore, our Flipped-VQA is a general
framework that is applicable to various LLMs (OPT and GPT-J) and consistently
improves their performances. We empirically demonstrate that Flipped-VQA not
only enhances the exploitation of linguistic shortcuts but also mitigates the
linguistic bias, which causes incorrect answers over-relying on the question.
Code is available at https://github.com/mlvlab/Flipped-VQA. |
This paper investigates the temporal and causal reasoning abilities of Large Language Models (LLMs) on Video Question Answering (VideoQA) and proposes Flipped-VQA, a novel framework that leverages LLMs' knowledge for this task. |
Challenging VideoQA benchmarks require understanding of temporal and causal relationships, and LLMs, pretrained on massive text data, inherently possess such reasoning abilities. However, they can be prone to linguistic bias, relying heavily on questions while ignoring visual content. |
Flipped-VQA consists of three objectives: 1) VQ -> A (main task: predicting answer from video and question), 2) VA -> Q (predicting question from video and answer), and 3) QA -> V (predicting video from question and answer). This encourages understanding the complex relationships within the VQA triplet. |
Larger LLMs exhibit significantly better performance on temporal and causal VideoQA questions, highlighting the importance of their pretrained knowledge.
Flipped-VQA significantly improves the performance of various LLMs (LLaMA, OPT, GPT-J) on five challenging VideoQA datasets, surpassing previous state-of-the-art models.
Extensive analyses demonstrate that Flipped-VQA effectively mitigates linguistic bias by encouraging the model to utilize visual content more effectively, while still leveraging linguistic shortcuts when beneficial. |
The framework's applicability to encoder-decoder LLMs with objectives beyond next-token prediction requires further exploration.
Despite using a small number of trainable parameters, the reliance on large backbone LLMs results in significant memory usage. |
video question answering, large language models, temporal and causal reasoning, linguistic bias mitigation, multi-modal understanding |
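The three Flipped-VQA objectives listed above amount to rotating which element of the (V, Q, A) triplet is the prediction target. A minimal sketch of that bookkeeping follows; the task labels, the function names, and the default equal weighting of the three losses are assumptions for illustration and are not taken from the summary.

```python
def flipped_vqa_tasks(video, question, answer):
    """Return the three (name, source pair, target) tasks for one triplet."""
    return [
        ("VQ->A", (video, question), answer),   # main task
        ("VA->Q", (video, answer), question),   # flipped: recover the question
        ("QA->V", (question, answer), video),   # flipped: recover the video
    ]


def combined_loss(task_losses, weights=None):
    """Weighted sum of per-task losses.

    `task_losses` maps a task name to a scalar loss. Equal weights are a
    hypothetical default, not a value reported in the summary.
    """
    if weights is None:
        weights = {name: 1.0 for name in task_losses}
    return sum(weights[name] * loss for name, loss in task_losses.items())
```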
2310.15308
Report |
SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding |
Haoxiang Wang, Pavan Kumar Anasosalu Vasu, Fartash Faghri, Raviteja Vemulapalli, Mehrdad Farajtabar, Sachin Mehta, Mohammad Rastegari, Oncel Tuzel, Hadi Pouransari |
The landscape of publicly available vision foundation models (VFMs), such as
CLIP and Segment Anything Model (SAM), is expanding rapidly. VFMs are endowed
with distinct capabilities stemming from their pre-training objectives. For
instance, CLIP excels in semantic understanding, while SAM specializes in
spatial understanding for segmentation. In this work, we introduce a simple
recipe to efficiently merge VFMs into a unified model that absorbs their
expertise. Our method integrates techniques of multi-task learning, continual
learning, and distillation. Further, it demands significantly less
computational cost compared to traditional multi-task training from scratch,
and it only needs a small fraction of the pre-training datasets that were
initially used to train individual models. By applying our method to SAM and
CLIP, we obtain SAM-CLIP: a unified model that combines the capabilities of SAM
and CLIP into a single vision transformer. Compared with deploying SAM and CLIP
independently, our merged model, SAM-CLIP, reduces storage and compute costs
for inference, making it well-suited for edge device applications. We show that
SAM-CLIP not only retains the foundational strengths of SAM and CLIP, but also
introduces synergistic functionalities, notably in zero-shot semantic
segmentation, where SAM-CLIP establishes new state-of-the-art results on 5
benchmarks. It outperforms previous models that are specifically designed for
this task by a large margin, including +6.8% and +5.9% mean IoU improvement on
Pascal-VOC and COCO-Stuff datasets, respectively. |
This paper introduces a method for efficiently merging pre-trained Vision Foundation Models (VFMs) into a unified model, combining their expertise without requiring extensive training from scratch. |
Maintaining separate VFMs for different tasks is inefficient, while traditional multi-task learning is computationally expensive. This work offers a middle ground by efficiently merging VFMs with minimal training. |
The method treats merging as a continual learning problem, using multi-task distillation and a small replay dataset to transfer knowledge from an auxiliary VFM to a base VFM while mitigating forgetting. |
The merged model, combining SAM and CLIP (called SAM-CLIP), retains the zero-shot capabilities of both original models (instance segmentation and image classification).
SAM-CLIP exhibits stronger representation learning abilities compared to individual SAM and CLIP models.
SAM-CLIP demonstrates emergent capability in zero-shot semantic segmentation, achieving state-of-the-art results on 5 benchmarks. |
The merged model might inherit limitations (e.g., biases in data distribution) from the original VFMs.
The merged model requires an additional head for the auxiliary model, increasing the overall size. |
vision foundation models, model merging, knowledge distillation, continual learning, zero-shot learning |
2310.15169
Report |
FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling |
Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, Ziwei Liu |
With the availability of large-scale video datasets and the advances of
diffusion models, text-driven video generation has achieved substantial
progress. However, existing video generation models are typically trained on a
limited number of frames, resulting in the inability to generate high-fidelity
long videos during inference. Furthermore, these models only support
single-text conditions, whereas real-life scenarios often require multi-text
conditions as the video content changes over time. To tackle these challenges,
this study explores the potential of extending the text-driven capability to
generate longer videos conditioned on multiple texts. 1) We first analyze the
impact of initial noise in video diffusion models. Then building upon the
observation of noise, we propose FreeNoise, a tuning-free and time-efficient
paradigm to enhance the generative capabilities of pretrained video diffusion
models while preserving content consistency. Specifically, instead of
initializing noises for all frames, we reschedule a sequence of noises for
long-range correlation and perform temporal attention over them by window-based
function. 2) Additionally, we design a novel motion injection method to support
the generation of videos conditioned on multiple text prompts. Extensive
experiments validate the superiority of our paradigm in extending the
generative capabilities of video diffusion models. It is noteworthy that
compared with the previous best-performing method which brought about 255%
extra time cost, our method incurs only negligible time cost of approximately
17%. Generated video samples are available at our website:
http://haonanqiu.com/projects/FreeNoise.html. |
This paper proposes FreeNoise, a tuning-free and time-efficient paradigm to enhance the generative capabilities of pre-trained video diffusion models for longer and multi-prompt video generation. |
Existing video generation models are limited in their ability to generate high-fidelity long videos and often only support single-text conditions, hindering their applicability to real-life scenarios. |
The proposed FreeNoise method leverages noise rescheduling with local shuffling and window-based attention fusion to enable longer video generation while maintaining content consistency. It also introduces a motion injection strategy for multi-prompt video generation by modulating the influence of text prompts during the denoising process. |
FreeNoise outperforms previous methods in generating longer videos with better content consistency and visual quality, as evidenced by quantitative metrics (FVD, KVD, CLIP-SIM) and user studies.
The proposed motion injection method effectively enables multi-prompt video generation with smooth transitions and coherent motion continuity.
Compared to previous best methods, FreeNoise incurs significantly less computational overhead during inference (17% vs. 255%). |
The weakening effect of repeated locally shuffled noises might limit the introduction of new content as video length increases.
The performance of FreeNoise is constrained by the base model's ability to handle videos with significant subject movement. |
video generation, diffusion models, long video generation, multi-prompt video generation, content consistency |
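Noise rescheduling with local shuffling, as described above, can be illustrated as follows: the initial noise frames are reused for the extra frames, but their order is shuffled inside small local windows so that distant frames stay correlated without being identical copies. This is a simplified sketch; the window size, the seed handling, and the function name are assumptions, and the window-based attention fusion is not shown.

```python
import random


def reschedule_noise(base_noise, total_frames, window, seed=0):
    """Extend a list of per-frame noises to `total_frames` entries.

    The first len(base_noise) frames keep their original noise. Further
    frames reuse the same noises, shuffled within consecutive windows of
    size `window` (a hypothetical hyperparameter).
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    rng = random.Random(seed)
    count = len(base_noise)
    order = list(range(count))
    while len(order) < total_frames:
        for start in range(0, count, window):
            chunk = list(range(start, min(start + window, count)))
            rng.shuffle(chunk)  # Shuffle only inside the local window.
            order.extend(chunk)
    return [base_noise[index] for index in order[:total_frames]]
```

Because every extra frame draws from the same pool of noises, the summary's stated limitation follows directly: repeated reuse can restrict how much new content appears as the video grows longer.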
2310.15160
Report |
FreeMask: Synthetic Images with Dense Annotations Make Stronger Segmentation Models |
Lihe Yang, Xiaogang Xu, Bingyi Kang, Yinghuan Shi, Hengshuang Zhao |
Semantic segmentation has witnessed tremendous progress due to the proposal
of various advanced network architectures. However, they are extremely hungry
for delicate annotations to train, and the acquisition is laborious and
unaffordable. Therefore, we present FreeMask in this work, which resorts to
synthetic images from generative models to ease the burden of both data
collection and annotation procedures. Concretely, we first synthesize abundant
training images conditioned on the semantic masks provided by realistic
datasets. This yields extra well-aligned image-mask training pairs for semantic
segmentation models. We surprisingly observe that, solely trained with
synthetic images, we already achieve comparable performance with real ones
(e.g., 48.3 vs. 48.5 mIoU on ADE20K, and 49.3 vs. 50.5 on COCO-Stuff). Then, we
investigate the role of synthetic images by joint training with real images, or
pre-training for real images. Meanwhile, we design a robust filtering principle
to suppress incorrectly synthesized regions. In addition, we propose to
treat different semantic masks unequally to prioritize the harder ones and
sample more corresponding synthetic images for them. As a result, either
jointly trained or pre-trained with our filtered and re-sampled synthesized
images, segmentation models can be greatly enhanced, e.g., from 48.7 to 52.0 on
ADE20K. Code is available at https://github.com/LiheYoung/FreeMask. |
This paper presents FreeMask, a novel method to enhance fully-supervised semantic segmentation by leveraging synthetic images generated from semantic masks. |
Collecting and annotating real images for semantic segmentation is laborious and expensive. This work explores using synthetic data from generative models to address this challenge. |
The authors use FreestyleNet, a mask-to-image synthesis model, to generate synthetic images from real semantic masks. They propose two strategies: 1) Filtering noisy synthetic regions based on class-level loss analysis, and 2) Re-sampling synthetic images based on mask-level hardness to prioritize challenging layouts. |
Training solely on synthetic images achieves comparable performance to training on real images (e.g., 48.3 vs 48.5 mIoU on ADE20K).
Jointly training on real and synthetic images significantly improves performance over using real images alone (e.g., 48.7 to 52.0 mIoU on ADE20K).
Pre-training on synthetic images and fine-tuning on real images also leads to substantial improvements. |
Generating synthetic images can be time-consuming.
The proposed method's effectiveness in more complex real-world scenarios requires further investigation. |
semantic segmentation, synthetic data, generative models, image synthesis, data augmentation |
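The class-level filtering strategy described above can be sketched as a per-pixel rule: if the loss at a synthetic pixel exceeds a threshold associated with its class, the pixel is marked as ignored so it does not contribute to training. How the thresholds are derived is not stated in the summary, so `class_thresholds` is simply passed in; the ignore value of 255 and the function name are likewise assumptions for illustration.

```python
IGNORE_INDEX = 255  # Conventional "ignore" label; an assumption here.


def filter_noisy_pixels(labels, losses, class_thresholds):
    """Mask out synthetic pixels whose loss is too high for their class.

    labels: 2D list of class ids, losses: 2D list of per-pixel losses,
    class_thresholds: dict mapping class id -> maximum tolerated loss.
    Classes without a threshold are left untouched.
    """
    filtered = []
    for label_row, loss_row in zip(labels, losses):
        new_row = []
        for label, loss in zip(label_row, loss_row):
            threshold = class_thresholds.get(label)
            if threshold is not None and loss > threshold:
                new_row.append(IGNORE_INDEX)
            else:
                new_row.append(label)
        filtered.append(new_row)
    return filtered
```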
2310.15111
Report |
Matryoshka Diffusion Models |
Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Josh Susskind, Navdeep Jaitly |
Diffusion models are the de facto approach for generating high-quality images
and videos, but learning high-dimensional models remains a formidable task due
to computational and optimization challenges. Existing methods often resort to
training cascaded models in pixel space or using a downsampled latent space of
a separately trained auto-encoder. In this paper, we introduce Matryoshka
Diffusion Models (MDM), an end-to-end framework for high-resolution image and
video synthesis. We propose a diffusion process that denoises inputs at
multiple resolutions jointly and uses a NestedUNet architecture where features
and parameters for small-scale inputs are nested within those of large scales.
In addition, MDM enables a progressive training schedule from lower to higher
resolutions, which leads to significant improvements in optimization for
high-resolution generation. We demonstrate the effectiveness of our approach on
various benchmarks, including class-conditioned image generation,
high-resolution text-to-image, and text-to-video applications. Remarkably, we
can train a single pixel-space model at resolutions of up to 1024x1024 pixels,
demonstrating strong zero-shot generalization using the CC12M dataset, which
contains only 12 million images. |
This paper introduces Matryoshka Diffusion Models (MDM), an end-to-end diffusion model framework for high-resolution image and video synthesis that addresses the computational and optimization challenges of traditional high-dimensional models. |
Scaling diffusion models to high resolutions for complex generation tasks like text-to-image synthesis is challenging. Existing methods rely on cascaded or latent approaches, which complicate training, inference, and can limit generation quality. |
MDM uses a multi-resolution diffusion process in an extended space, jointly denoising inputs at multiple resolutions using a NestedUNet architecture. It employs a progressive training schedule, starting from lower resolutions and gradually adding higher resolutions. |
Joint multi-resolution denoising and a nested architecture lead to faster convergence and better quality compared to single-resolution diffusion.
Progressive training significantly speeds up the training process for high-resolution models, outperforming cascaded diffusion baselines.
MDM achieves high performance in text-to-image generation up to 1024x1024 resolution on a relatively small dataset (CC12M), demonstrating strong zero-shot generalization. |
The paper primarily explores a limited set of architectures, leaving room for further improvements in weight sharing and parameter distribution across resolutions.
Although MDM is compared against Latent Diffusion Models (LDM), a more thorough investigation of combining MDM with autoencoder-based approaches is left as future work. |
diffusion models, high-resolution synthesis, text-to-image generation, text-to-video generation, multi-resolution modeling |
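To make the multi-resolution idea above concrete, the sketch below builds the nested set of resolutions that a joint denoiser would operate on, using 2x average pooling. The choice of average pooling, the factor of 2 between levels, and the function names are assumptions for illustration; the noising step and the NestedUNet itself are not shown.

```python
def downsample_2x(image):
    """Average-pool a 2D grid (list of rows) by a factor of 2."""
    height, width = len(image), len(image[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    return [
        [
            (image[r][c] + image[r][c + 1]
             + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
            for c in range(0, width, 2)
        ]
        for r in range(0, height, 2)
    ]


def build_pyramid(image, levels):
    """Return [full resolution, half resolution, ...] with `levels` entries."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(downsample_2x(pyramid[-1]))
    return pyramid
```

Under a progressive schedule, training would start with only the coarsest entries of such a pyramid and add finer ones later.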
2310.15110
Report |
Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model |
Ruoxi Shi, Hansheng Chen, Zhuoyang Zhang, Minghua Liu, Chao Xu, Xinyue Wei, Linghao Chen, Chong Zeng, Hao Su |
We report Zero123++, an image-conditioned diffusion model for generating
3D-consistent multi-view images from a single input view. To take full
advantage of pretrained 2D generative priors, we develop various conditioning
and training schemes to minimize the effort of finetuning from off-the-shelf
image diffusion models such as Stable Diffusion. Zero123++ excels in producing
high-quality, consistent multi-view images from a single image, overcoming
common issues like texture degradation and geometric misalignment. Furthermore,
we showcase the feasibility of training a ControlNet on Zero123++ for enhanced
control over the generation process. The code is available at
https://github.com/SUDO-AI-3D/zero123plus. |
This paper introduces Zero123++, an image-conditioned diffusion model that generates consistent multi-view images from a single input view by finetuning Stable Diffusion with novel conditioning and training schemes. |
The work addresses the limitations of previous methods such as Zero-1-to-3 in achieving 3D consistency across generated multi-view images, aiming to bridge the gap with true 3D scene representation. |
The method uses a multi-view tiling strategy, leverages Stable Diffusion's local and global conditioning mechanisms, adopts a linear noise schedule, and implements a phased training approach for optimal prior utilization. |
Generates high-quality, consistent multi-view images from single inputs, outperforming previous methods in visual fidelity and consistency.
Demonstrates strong generalization ability, effectively handling real photos, AI-generated images, and 2D illustrations.
Presents a depth-controlled version using ControlNet, enabling geometry-guided generation with superior consistency (LPIPS of 0.086). |
The current model is trained on a medium-scale dataset (Objaverse), which may limit its representational capacity.
Exploration of two-stage refiner models and further dataset scaling are planned to enhance detail and generalization. |
multi-view generation, diffusion models, 3d consistency, image conditioning, controlnet |
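The multi-view tiling strategy mentioned above places several views into one image so that a single diffusion pass models them jointly. A minimal sketch of the tiling operation follows; the grid shape is left as a parameter because the summary does not state the layout, and the nested-list image format and function name are assumptions for illustration.

```python
def tile_views(views, rows, cols):
    """Tile equally sized views into one rows x cols grid image.

    views: list of 2D images (each a list of pixel rows), ordered
    row-major. All views are assumed to share the same height and width.
    """
    if len(views) != rows * cols:
        raise ValueError("number of views must equal rows * cols")
    height = len(views[0])
    tiled = []
    for grid_row in range(rows):
        for y in range(height):
            line = []
            for grid_col in range(cols):
                # Concatenate the y-th pixel row of each view in this grid row.
                line.extend(views[grid_row * cols + grid_col][y])
            tiled.append(line)
    return tiled
```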
2310.15008
Report |
Wonder3D: Single Image to 3D using Cross-Domain Diffusion |
Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, Wenping Wang |
In this work, we introduce Wonder3D, a novel method for efficiently
generating high-fidelity textured meshes from single-view images. Recent methods
based on Score Distillation Sampling (SDS) have shown the potential to recover
3D geometry from 2D diffusion priors, but they typically suffer from
time-consuming per-shape optimization and inconsistent geometry. In contrast,
certain works directly produce 3D information via fast network inferences, but
their results are often of low quality and lack geometric details. To
holistically improve the quality, consistency, and efficiency of image-to-3D
tasks, we propose a cross-domain diffusion model that generates multi-view
normal maps and the corresponding color images. To ensure consistency, we
employ a multi-view cross-domain attention mechanism that facilitates
information exchange across views and modalities. Lastly, we introduce a
geometry-aware normal fusion algorithm that extracts high-quality surfaces from
the multi-view 2D representations. Our extensive evaluations demonstrate that
our method achieves high-quality reconstruction results, robust generalization,
and reasonably good efficiency compared to prior works. |
This paper introduces Wonder3D, a novel method for efficiently generating high-fidelity textured meshes from single-view images. |
Existing methods for single-view 3D reconstruction either suffer from time-consuming optimization, inconsistent geometry, or limited generalizability. This paper aims to address these limitations. |
Wonder3D leverages a cross-domain diffusion model to generate consistent multi-view normal maps and color images. It then employs a geometry-aware normal fusion algorithm to extract high-quality surfaces from the 2D representations. |
Wonder3D achieves high-quality reconstruction results with fine-grained details.
The method demonstrates robust generalization across diverse image styles.
Wonder3D offers good efficiency, reconstructing textured meshes in just 2 minutes. |
The limited number of views (six) poses challenges for reconstructing objects with thin structures or severe occlusions.
Scaling up to more views requires addressing increased computational demands during training. |
3d reconstruction, single-view reconstruction, diffusion models, normal fusion, cross-domain learning |
2310.14942
Report |
Domain Watermark: Effective and Harmless Dataset Copyright Protection is Closed at Hand |
Junfeng Guo, Yiming Li, Lixu Wang, Shu-Tao Xia, Heng Huang, Cong Liu, Bo Li |
The prosperity of deep neural networks (DNNs) has largely benefited from
open-source datasets, based on which users can evaluate and improve their
methods. In this paper, we revisit backdoor-based dataset ownership
verification (DOV), which is currently the only feasible approach to protect
the copyright of open-source datasets. We reveal that these methods are
fundamentally harmful given that they could introduce malicious
misclassification behaviors to watermarked DNNs by the adversaries. In this
paper, we design DOV from another perspective by making watermarked models
(trained on the protected dataset) correctly classify some `hard' samples that
will be misclassified by the benign model. Our method is inspired by the
generalization property of DNNs, where we find a \emph{hardly-generalized
domain} for the original dataset (as its \emph{domain watermark}). It can be
easily learned with the protected dataset containing modified samples.
Specifically, we formulate the domain generation as a bi-level optimization and
propose to optimize a set of visually-indistinguishable clean-label modified
data with similar effects to domain-watermarked samples from the
hardly-generalized domain to ensure watermark stealthiness. We also design a
hypothesis-test-guided ownership verification via our domain watermark and
provide the theoretical analyses of our method. Extensive experiments on three
benchmark datasets are conducted, which verify the effectiveness of our method
and its resistance to potential adaptive methods. The code for reproducing main
experiments is available at
https://github.com/JunfengGo/Domain-Watermark. |
This paper revisits dataset ownership verification (DOV), reveals the harm of backdoor-based methods, and proposes a harmless DOV approach using a 'domain watermark.' |
Protecting the copyright of open-source datasets is crucial, but existing backdoor-based DOV methods introduce security risks. |
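The authors find a 'hardly-generalized domain' for the original dataset, add visually indistinguishable clean-label modified samples to the protected dataset so that models trained on it learn this domain, and verify ownership with a hypothesis test on whether a suspect model correctly classifies domain-watermarked samples that a benign model misclassifies. |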
The domain watermark achieves high benign accuracy and verification success rates.
It is resistant to adaptive methods like fine-tuning and model pruning.
The method successfully distinguishes between models trained on the protected dataset and those trained independently. |
The verification success rate is restricted by the benign accuracy.
Future work will explore lower watermarking rates and resistance to more adaptive methods. |
dataset ownership verification, domain watermark, harmless verification, copyright protection, deep neural networks |
2310.14532
Report |
Practical Deep Dispersed Watermarking with Synchronization and Fusion |
Hengchang Guo, Qilong Zhang, Junwei Luo, Feng Guo, Wenbin Zhang, Xiaodong Su, Minglei Li |
Deep learning based blind watermarking works have gradually emerged and
achieved impressive performance. However, previous deep watermarking studies
mainly focus on fixed low-resolution images while paying less attention to
arbitrary resolution images, especially widespread high-resolution images
nowadays. Moreover, most works usually demonstrate robustness against typical
non-geometric attacks (e.g., JPEG compression) but ignore common
geometric attacks (e.g., Rotate) and more challenging combined
attacks. To overcome the above limitations, we propose a practical deep
Dispersed Watermarking with Synchronization and
Fusion, called DWSF. Specifically, given an
arbitrary-resolution cover image, we adopt a dispersed embedding scheme which
sparsely and randomly selects several fixed small-size cover blocks to embed a
consistent watermark message by a well-trained encoder. In the extraction
stage, we first design a watermark synchronization module to locate and rectify
the encoded blocks in the noised watermarked image. We then utilize a decoder
to obtain messages embedded in these blocks, and propose a message fusion
strategy based on similarity to make full use of the consistency among
messages, thus determining a reliable message. Extensive experiments conducted
on different datasets convincingly demonstrate the effectiveness of our
proposed DWSF. Compared with state-of-the-art approaches, our blind
watermarking achieves better performance: it improves the bit accuracy
by 5.28% and 5.93% on average against single and combined attacks, respectively, and
shows a smaller file size increment and better visual quality. Our code is available
at https://github.com/bytedance/DWSF. |
This paper proposes DWSF, a practical deep blind watermarking framework for arbitrary-resolution images, addressing the limitations of existing methods in handling high-resolution images and complex attacks. |
Existing deep watermarking methods struggle with high-resolution images common in real-world scenarios and lack robustness against complex, combined attacks. |
DWSF uses a dispersed embedding scheme to embed a consistent watermark message into randomly selected small image blocks. It then employs a watermark synchronization module to locate and rectify encoded blocks, even under geometric distortions. Finally, a message fusion strategy leverages message consistency for a reliable final watermark. |
DWSF achieves significantly higher visual quality (PSNR) and lower file size increment compared to state-of-the-art methods.
DWSF demonstrates superior robustness against a wide range of single and combined attacks, consistently achieving over 98% bit accuracy.
DWSF shows practical value with high bit check accuracy, indicating its ability to correctly decode the entire watermark message in realistic scenarios. |
The current implementation of DWSF primarily focuses on image watermarking, with potential extensions to other media like videos left for future work.
Exploring more sophisticated message fusion techniques and further improving the efficiency of the watermark synchronization module are promising directions for future research. |
robust blind watermarking, deep learning, dispersed embedding, watermark synchronization, message fusion |
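The message fusion step in the DWSF entry above can be illustrated with a minimal sketch. It assumes each located block has already been decoded to a bit list of equal length, and it uses a plain per-bit majority vote as a stand-in: the paper describes a similarity-based fusion strategy, which may differ from this. The function name `fuse_messages` is hypothetical.

```python
from collections import Counter


def fuse_messages(messages):
    """Fuse per-block decoded bit lists into one message by per-bit majority vote.

    Stand-in for a similarity-based fusion; not the authors' implementation.
    """
    if not messages:
        raise ValueError("need at least one decoded message")
    length = len(messages[0])
    if any(len(m) != length for m in messages):
        raise ValueError("all messages must have the same length")

    fused = []
    for position in range(length):
        votes = Counter(m[position] for m in messages)
        # Ties resolve to 1; this is an arbitrary choice for the sketch.
        fused.append(1 if votes[1] >= votes[0] else 0)
    return fused


# Three blocks carry the same 4-bit message; two of them have one bit flipped.
decoded = [
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
print(fuse_messages(decoded))  # prints [1, 0, 1, 1]
```

Because every block embeds the same message, isolated bit errors in individual blocks are outvoted as long as most blocks decode that position correctly.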
2310.14487
Report |
VQ-NeRF: Vector Quantization Enhances Implicit Neural Representations |
Yiying Yang, Wen Liu, Fukun Yin, Xin Chen, Gang Yu, Jiayuan Fan, Tao Chen |
Recent advancements in implicit neural representations have contributed to
high-fidelity surface reconstruction and photorealistic novel view synthesis.
However, the computational complexity inherent in these methodologies presents
a substantial impediment, constraining the attainable frame rates and
resolutions in practical applications. In response to this predicament, we
propose VQ-NeRF, an effective and efficient pipeline for enhancing implicit
neural representations via vector quantization. The essence of our method
involves reducing the sampling space of NeRF to a lower resolution and
subsequently reinstating it to the original size utilizing a pre-trained VAE
decoder, thereby effectively mitigating the sampling time bottleneck
encountered during rendering. Although the codebook furnishes representative
features, reconstructing fine texture details of the scene remains challenging
due to high compression rates. To overcome this constraint, we design an
innovative multi-scale NeRF sampling scheme that concurrently optimizes the
NeRF model at both compressed and original scales to enhance the network's
ability to preserve fine details. Furthermore, we incorporate a semantic loss
function to improve the geometric fidelity and semantic coherence of our 3D
reconstructions. Extensive experiments demonstrate the effectiveness of our
model in achieving the optimal trade-off between rendering quality and
efficiency. Evaluation on the DTU, BlendedMVS, and H3DS datasets confirms the
superior performance of our approach. |
Presents VQ-NeRF, a novel framework that leverages Vector Quantization (VQ) to enhance Neural Radiance Fields (NeRF) for efficient and high-quality 3D surface representation. |
Addresses the computational bottleneck in traditional NeRF methods, which limits their practical applications in terms of achievable frame rates and resolutions. |
Reduces the sampling space of NeRF to a lower resolution, restores it to the original size with a pre-trained VQ-VAE decoder, and introduces a multi-scale semantic consistency module to recover texture details and ensure realism in rendered images. |
Achieves optimal trade-off between rendering quality and efficiency, outperforming baselines like NeRF, VolSDF, and Coco-INR.
Significantly reduces rendering time (more than 10 times faster) compared to state-of-the-art methods while maintaining high visual fidelity.
Demonstrates superior performance in quantitative metrics (PSNR, SSIM, LPIPS) and qualitative comparisons on DTU, BlendedMVS, and H3DS datasets. |
VQ-NeRF requires significant training time due to scene-specific optimization.
Future work will explore general representations for different scenes to improve generalization and reduce training time. |
neural radiance fields, vector quantization, 3d surface reconstruction, novel view synthesis, vq-vae |
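The codebook lookup underlying vector quantization, which the VQ-NeRF entry above relies on, can be sketched as a nearest-neighbour search. This is a generic illustration of the technique and not the VQ-NeRF pipeline itself; the function name `quantize` and the toy codebook are hypothetical.

```python
def quantize(vector, codebook):
    """Return (index, codeword) of the codebook entry nearest to `vector`.

    Distance is squared Euclidean; ties go to the lowest index.
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(
        range(len(codebook)),
        key=lambda i: squared_distance(vector, codebook[i]),
    )
    return index, codebook[index]


# Toy 2-D codebook with three codewords.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
index, codeword = quantize([0.9, 1.2], codebook)
print(index, codeword)  # prints 1 [1.0, 1.0]
```

Each continuous feature is replaced by its nearest codeword, which is why high compression rates can lose fine texture detail, as the abstract notes.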
2310.14189
Report |
Improved Techniques for Training Consistency Models |
Yang Song, Prafulla Dhariwal |
Consistency models are a nascent family of generative models that can sample
high quality data in one step without the need for adversarial training.
Current consistency models achieve optimal sample quality by distilling from
pre-trained diffusion models and employing learned metrics such as LPIPS.
However, distillation limits the quality of consistency models to that of the
pre-trained diffusion model, and LPIPS causes undesirable bias in evaluation.
To tackle these challenges, we present improved techniques for consistency
training, where consistency models learn directly from data without
distillation. We delve into the theory behind consistency training and identify
a previously overlooked flaw, which we address by eliminating Exponential
Moving Average from the teacher consistency model. To replace learned metrics
like LPIPS, we adopt Pseudo-Huber losses from robust statistics. Additionally,
we introduce a lognormal noise schedule for the consistency training objective,
and propose to double total discretization steps every set number of training
iterations. Combined with better hyperparameter tuning, these modifications
enable consistency models to achieve FID scores of 2.51 and 3.25 on CIFAR-10
and ImageNet $64\times 64$ respectively in a single sampling step. These scores
mark a 3.5$\times$ and 4$\times$ improvement compared to prior consistency
training approaches. Through two-step sampling, we further reduce FID scores to
2.24 and 2.77 on these two datasets, surpassing those obtained via distillation
in both one-step and two-step settings, while narrowing the gap between
consistency models and other state-of-the-art generative models. |
This paper introduces improved consistency training (iCT) techniques for consistency models, a new class of generative models that produce high-quality samples in one step without adversarial training, achieving state-of-the-art results without relying on pre-trained diffusion models or learned metrics like LPIPS. |
Consistency training (CT) allows consistency models to learn directly from data, making them a distinct family of generative models. Previous CT methods were outperformed by distillation-based methods and relied on learned metrics, limiting their potential and introducing bias. |
The paper analyzes and improves CT by: 1) optimizing weighting functions, noise embeddings, and dropout, 2) removing Exponential Moving Average from the teacher network, 3) adopting Pseudo-Huber losses instead of LPIPS, 4) introducing an improved curriculum for total discretization steps, and 5) proposing a new noise schedule based on lognormal distributions. |
iCT achieves FID scores of 2.51 and 3.25 on CIFAR-10 and ImageNet 64x64 in one step, surpassing distillation-based methods and representing 3.5x and 4x improvements over prior CT methods.
Two-step iCT achieves FIDs of 2.24 and 2.77 on CIFAR-10 and ImageNet 64x64, exceeding distillation-based methods in both one-step and two-step settings.
iCT demonstrates comparable or superior performance to top-tier diffusion models and GANs, showcasing its potential as a new independent family of generative models. |
The study primarily focuses on CIFAR-10 and ImageNet 64x64; further investigation is needed to validate effectiveness on higher-resolution datasets.
While iCT significantly reduces the computational overhead of distillation, it still requires careful hyperparameter tuning, especially for the Pseudo-Huber loss. |
generative models, consistency models, consistency training, image generation, deep learning |
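Two of the techniques named in the consistency-training entry above can be sketched in a few lines. The Pseudo-Huber metric has the standard form sqrt(||x - y||^2 + c^2) - c. The step schedule shown here is a simplified assumption that only captures "double the discretization steps every set number of iterations"; the parameter names `s0`, `s1`, and `period` are hypothetical, and the paper's exact schedule may differ.

```python
import math


def pseudo_huber(x, y, c):
    """Pseudo-Huber distance sqrt(||x - y||^2 + c^2) - c between two vectors."""
    squared_norm = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.sqrt(squared_norm + c * c) - c


def discretization_steps(iteration, s0, s1, period):
    """Start at s0 steps and double every `period` iterations, capped at s1.

    Simplified sketch of a doubling curriculum, not the paper's exact formula.
    """
    return min(s0 * 2 ** (iteration // period), s1)


# With c = 0 the metric reduces to the Euclidean distance.
print(pseudo_huber([0, 0], [3, 4], 0.0))  # prints 5.0

# Steps double from 10 towards a cap of 1280 as training proceeds.
for k in (0, 1000, 2500, 100000):
    print(k, discretization_steps(k, s0=10, s1=1280, period=1000))
```

For small differences the metric behaves like a squared error, and for large differences like an absolute error, which is what makes it robust to outliers.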
2310.14108
Report |
CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement |
Mohammadreza Salehi, Mehrdad Farajtabar, Maxwell Horton, Fartash Faghri, Hadi Pouransari, Raviteja Vemulapalli, Oncel Tuzel, Ali Farhadi, Mohammad Rastegari, Sachin Mehta |
Contrastive language image pretraining (CLIP) is a standard method for
training vision-language models. While CLIP is scalable, promptable, and robust
to distribution shifts on image classification tasks, it lacks object
localization capabilities. This paper studies the following question: Can we
augment CLIP training with task-specific vision models from model zoos to
improve its visual representations? Towards this end, we leverage open-source
task-specific vision models to generate pseudo-labels for an uncurated and
noisy image-text dataset. Subsequently, we train CLIP models on these
pseudo-labels in addition to the contrastive training on image and text pairs.
This simple setup shows substantial improvements of up to 16.3% across
different vision tasks, including segmentation, detection, depth estimation,
and surface normal estimation. Importantly, these enhancements are achieved
without compromising CLIP's existing capabilities, including its proficiency in
promptable zero-shot classification. |
This paper proposes CLIP trained with Experts (CLIPTe), a method to improve CLIP's visual representations by leveraging task-specific vision models from model zoos to generate pseudo-labels for training on an uncurated and noisy image-text dataset. |
While CLIP excels in image classification, it lacks object localization capabilities. CLIPTe aims to bridge this gap by enhancing CLIP's visual representations without compromising its existing strengths. |
CLIPTe generates pseudo-labels for an uncurated image-text dataset using open-source task-specific vision models (experts) for segmentation, depth estimation, and surface normal estimation. It then trains CLIP models on these pseudo-labels along with the standard contrastive training on image-text pairs. |
CLIPTe significantly improves CLIP's performance on various vision tasks, including segmentation, detection, depth estimation, and surface normal estimation, with up to 16.3% improvement in probing accuracy.
The method exhibits positive transfer of representations to downstream tasks, indicating its ability to generalize learned knowledge.
Importantly, CLIPTe preserves CLIP's inherent strengths, including zero-shot classification capabilities, ensuring its versatility across different vision domains. |
The paper mainly focuses on finetuning pre-trained CLIP models on CC3M, leaving exploration with larger datasets and diverse experts for future work.
Further investigation into the impact of pseudo-label quality and noise on CLIPTe's performance is crucial. |
clip, vision-language models, pseudo-supervision, multi-task learning, object localization |
2310.13772
Report |
TexFusion: Synthesizing 3D Textures with Text-Guided Image Diffusion Models |
Tianshi Cao, Karsten Kreis, Sanja Fidler, Nicholas Sharp, Kangxue Yin |
We present TexFusion (Texture Diffusion), a new method to synthesize textures
for given 3D geometries, using large-scale text-guided image diffusion models.
In contrast to recent works that leverage 2D text-to-image diffusion models to
distill 3D objects using a slow and fragile optimization process, TexFusion
introduces a new 3D-consistent generation technique specifically designed for
texture synthesis that employs regular diffusion model sampling on different 2D
rendered views. Specifically, we leverage latent diffusion models, apply the
diffusion model's denoiser on a set of 2D renders of the 3D object, and
aggregate the different denoising predictions on a shared latent texture map.
Final output RGB textures are produced by optimizing an intermediate neural
color field on the decodings of 2D renders of the latent texture. We thoroughly
validate TexFusion and show that we can efficiently generate diverse, high
quality and globally coherent textures. We achieve state-of-the-art text-guided
texture synthesis performance using only image diffusion models, while avoiding
the pitfalls of previous distillation-based methods. The text-conditioning
offers detailed control and we also do not rely on any ground truth 3D textures
for training. This makes our method versatile and applicable to a broad range
of geometry and texture types. We hope that TexFusion will advance AI-based
texturing of 3D assets for applications in virtual reality, game design,
simulation, and more. |
TexFusion is a novel method for synthesizing high-quality, globally coherent 3D textures on given meshes, guided by text prompts. |
Existing methods for text-driven 3D texture synthesis either lack global coherence or rely on slow and unstable optimization processes. |
TexFusion leverages latent diffusion models and introduces the Sequential Interlaced Multiview Sampler (SIMS), which interlaces denoising iterations with texture map aggregation across multiple camera views. |
TexFusion generates textures with natural color tones and fewer artifacts compared to the state-of-the-art TEXTure method.
User studies show preference for TexFusion results in terms of natural color, detail, cleanliness, and alignment with prompts.
The method is significantly faster than previous optimization-based techniques, achieving comparable speed to TEXTure. |
The sharpness of the generated textures is not yet ideal and could be further improved.
Texture generation is not real-time, limiting its applicability in interactive settings. |
3d texture synthesis, text-guided generation, diffusion models, multi-view consistency, latent space |
2310.13730
Report |
Localizing and Editing Knowledge in Text-to-Image Generative Models |
Samyadeep Basu, Nanxuan Zhao, Vlad Morariu, Soheil Feizi, Varun Manjunatha |
Text-to-Image Diffusion Models such as Stable-Diffusion and Imagen have
achieved unprecedented quality of photorealism with state-of-the-art FID scores
on MS-COCO and other generation benchmarks. Given a caption, image generation
requires fine-grained knowledge about attributes such as object structure,
style, and viewpoint amongst others. Where does this information reside in
text-to-image generative models? In our paper, we tackle this question and
understand how knowledge corresponding to distinct visual attributes is stored
in large-scale text-to-image diffusion models. We adapt Causal Mediation
Analysis for text-to-image models and trace knowledge about distinct visual
attributes to various (causal) components in the (i) UNet and (ii) text-encoder
of the diffusion model. In particular, we show that unlike generative
large-language models, knowledge about different attributes is not localized in
isolated components, but is instead distributed amongst a set of components in
the conditional UNet. These sets of components are often distinct for different
visual attributes. Remarkably, we find that the CLIP text-encoder in public
text-to-image models such as Stable-Diffusion contains only one causal state
across different visual attributes, and this is the first self-attention layer
corresponding to the last subject token of the attribute in the caption. This
is in stark contrast to the causal states in other language models which are
often the mid-MLP layers. Based on this observation of only one causal state in
the text-encoder, we introduce a fast, data-free model editing method
Diff-QuickFix which can effectively edit concepts in text-to-image models.
Diff-QuickFix can edit (ablate) concepts in under a second with a closed-form
update, providing a significant 1000x speedup and comparable editing
performance to existing fine-tuning based editing methods. |
This paper investigates how knowledge about different visual attributes is stored in large-scale text-to-image diffusion models (e.g., Stable Diffusion). |
Understanding where this knowledge resides is crucial for interpreting these models and enabling controlled edits. |
The authors adapt Causal Mediation Analysis to trace knowledge about visual attributes to specific components in the UNet and the text-encoder. |
Knowledge is distributed across the UNet with different distributions for distinct attributes, unlike in large language models where it is localized.
The CLIP text-encoder exhibits a single causal state for all visual attributes: the first self-attention layer corresponding to the last subject token.
This localized causal state in the text-encoder allows for efficient model editing. |
The study primarily focuses on Stable Diffusion, limiting the generalizability of the findings to other model architectures.
Further investigation into individual components within each layer (e.g., neurons) is left for future work. |
text-to-image synthesis, diffusion models, interpretability, causal mediation analysis, model editing |
2310.13545
Report |
ScaleLong: Towards More Stable Training of Diffusion Model via Scaling Network Long Skip Connection |
Zhongzhan Huang, Pan Zhou, Shuicheng Yan, Liang Lin |
In diffusion models, UNet is the most popular network backbone, since its
long skip connections (LSCs), which connect distant network blocks, can aggregate
long-distance information and alleviate vanishing gradients. Unfortunately, UNet
often suffers from unstable training in diffusion models, which can be
alleviated by scaling its LSC coefficients smaller. However, a theoretical
understanding of the instability of UNet in diffusion models and of the
performance improvement from LSC scaling is still absent. To solve this issue,
we theoretically show that the coefficients of LSCs in UNet strongly affect
the stability of the forward and backward propagation and the robustness of UNet.
Specifically, the hidden feature and gradient of UNet at any layer can
oscillate and their oscillation ranges are actually large which explains the
instability of UNet training. Moreover, UNet is also provably sensitive to
perturbed input, and predicts an output distant from the desired output,
yielding oscillatory loss and thus oscillatory gradient. Besides, we also
observe the theoretical benefits of LSC coefficient scaling in UNet for the
stability of hidden features and gradients, as well as for robustness. Finally,
inspired by our theory, we propose an effective coefficient scaling framework,
ScaleLong, that scales the coefficients of the LSCs in UNet and improves the
training stability of UNet. Experimental results on four well-known datasets show
that our methods stabilize training better and yield about 1.5x
training acceleration on different diffusion models with UNet or UViT
backbones. Code: https://github.com/sail-sg/ScaleLong |
This paper theoretically analyzes the training instability of UNet in diffusion models and proposes a framework ScaleLong with two scaling methods (CS and LS) for long skip connections to improve stability. |
UNet, a popular backbone in diffusion models, often suffers from unstable training. Understanding this instability and finding ways to stabilize training are crucial for improving diffusion model performance and efficiency. |
The authors theoretically analyze the stability of forward and backward propagation in UNet, as well as its robustness to noisy input. They derive bounds for hidden feature oscillation, gradient magnitude, and robustness error, showing the influence of long skip connection coefficients. Inspired by the theoretical analysis, they propose ScaleLong, which includes two methods: CS (constant scaling) and LS (learnable scaling). |
Scaling the coefficients of long skip connections can effectively stabilize UNet training in diffusion models.
CS with exponentially decaying coefficients is more effective than universal scaling methods like 1/√2-scaling.
LS, which learns scaling coefficients adaptively, further improves training stability and convergence speed. |
CS requires manual selection of the scaling coefficient within an estimated range.
LS introduces a small but non-negligible number of additional parameters and computational cost. |
diffusion models, unet, training stability, long skip connections, coefficient scaling |
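The constant scaling (CS) idea in the ScaleLong entry above, exponentially decaying coefficients on the long skip connections, can be sketched as follows. The sketch assumes the i-th skip connection is scaled by kappa^(i-1) with 0 < kappa <= 1; which skip is counted first, and the function names, are assumptions made for illustration only.

```python
def constant_scaling_coefficients(num_skips, kappa):
    """Return kappa ** (i - 1) for i = 1..num_skips (exponentially decaying)."""
    if not 0.0 < kappa <= 1.0:
        raise ValueError("kappa is assumed to lie in (0, 1]")
    return [kappa ** i for i in range(num_skips)]


def scale_skips(skip_features, kappa):
    """Scale each skip feature (a list of floats) by its decaying coefficient."""
    coefficients = constant_scaling_coefficients(len(skip_features), kappa)
    return [
        [coefficient * value for value in feature]
        for coefficient, feature in zip(coefficients, skip_features)
    ]


print(constant_scaling_coefficients(4, 0.5))  # prints [1.0, 0.5, 0.25, 0.125]
print(scale_skips([[2.0, 4.0], [2.0, 4.0]], 0.5))  # prints [[2.0, 4.0], [1.0, 2.0]]
```

In a real UNet the scaled feature would be combined with the decoder activation at the matching resolution. The learnable variant (LS) would replace the fixed `kappa` powers with trained coefficients.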
2310.13165
Report |
CycleNet: Rethinking Cycle Consistency in Text-Guided Diffusion for Image Manipulation |
Sihan Xu, Ziqiao Ma, Yidong Huang, Honglak Lee, Joyce Chai |
Diffusion models (DMs) have enabled breakthroughs in image synthesis tasks
but lack an intuitive interface for consistent image-to-image (I2I)
translation. Various methods have been explored to address this issue,
including mask-based methods, attention-based methods, and image-conditioning.
However, it remains a critical challenge to enable unpaired I2I translation
with pre-trained DMs while maintaining satisfying consistency. This paper
introduces CycleNet, a novel but simple method that incorporates cycle
consistency into DMs to regularize image manipulation. We validate CycleNet on
unpaired I2I tasks of different granularities. Besides the scene and object
level translation, we additionally contribute a multi-domain I2I translation
dataset to study the physical state changes of objects. Our empirical studies
show that CycleNet is superior in translation consistency and quality, and can
generate high-quality images for out-of-domain distributions with a simple
change of the textual prompt. CycleNet is a practical framework, which is
robust even with very limited training data (around 2k) and requires minimal
computational resources (1 GPU) to train. Project homepage:
https://cyclenetweb.github.io/ |
The paper introduces CycleNet, a novel method that incorporates cycle consistency into diffusion models (DMs) for image-to-image (I2I) translation to improve consistency in image manipulation. |
Consistency in image manipulation is crucial for various DM applications, especially in unpaired I2I scenarios where correspondence between source and target domain images is not guaranteed. |
CycleNet leverages cycle consistency regularization over the image translation cycle by introducing reconstruction loss, cycle consistency loss, and invariance loss. It utilizes a ControlNet with pre-trained Stable Diffusion as the backbone and incorporates text prompts and image conditioning to guide the translation. |
CycleNet demonstrates superior translation consistency and quality compared to previous approaches on scene, object, and state-level I2I tasks.
It is computationally efficient, requiring only limited training data and a single GPU.
CycleNet exhibits robust zero-shot I2I translation capability, generating faithful and high-quality images for out-of-domain distributions with a simple change of the textual prompt. |
Cycle consistency constraints can be too restrictive, leading to trade-offs between consistency and translation quality.
Maintaining global consistency while making faithful local edits remains challenging for LDM-based approaches. |
image-to-image translation, diffusion models, cycle consistency, image manipulation, zero-shot learning |
2310.13119
Report |
DreamSpace: Dreaming Your Room Space with Text-Driven Panoramic Texture Propagation |
Bangbang Yang, Wenqi Dong, Lin Ma, Wenbo Hu, Xiao Liu, Zhaopeng Cui, Yuewen Ma |
Diffusion-based methods have achieved prominent success in generating 2D
media. However, accomplishing similar proficiencies for scene-level mesh
texturing in 3D spatial applications, e.g., XR/VR, remains constrained,
primarily due to the intricate nature of 3D geometry and the necessity for
immersive free-viewpoint rendering. In this paper, we propose a novel indoor
scene texturing framework, which delivers text-driven texture generation with
enchanting details and authentic spatial coherence. The key insight is to first
imagine a stylized 360° panoramic texture from the central viewpoint of
the scene, and then propagate it to the remaining areas with inpainting and
imitating techniques. To ensure meaningful and aligned textures to the scene,
we develop a novel coarse-to-fine panoramic texture generation approach with
dual texture alignment, which both considers the geometry and texture cues of
the captured scenes. To cope with cluttered geometries during texture
propagation, we design a separated strategy, which conducts texture inpainting
in confidential regions and then learns an implicit imitating network to
synthesize textures in occluded and tiny structural areas. Extensive
experiments and the immersive VR application on real-world indoor scenes
demonstrate the high quality of the generated textures and the engaging
experience on VR headsets. Project webpage:
https://ybbbbt.com/publication/dreamspace |
DreamSpace: a novel text-driven framework for generating semantically meaningful and spatially coherent scene textures for real-world indoor scenes represented as meshes, suitable for immersive VR applications. |
Existing methods for scene stylization either lack semantic meaning, are computationally expensive for VR, or struggle with real-world scene complexities. DreamSpace addresses these limitations by enabling text-driven, high-quality texture generation for real-world indoor scenes with immersive VR experiences. |
DreamSpace uses a top-down approach: 1) Generates a stylized panoramic texture from the central viewpoint using a coarse-to-fine panoramic diffusion process with dual texture alignment. 2) Propagates the texture to the rest of the scene using confidential texture inpainting for visible areas and an implicit texture imitating network for occluded/tiny areas. |
Generates high-resolution, semantically meaningful textures for real-world indoor scenes based on text prompts.
Outperforms existing methods in terms of visual quality, image-text matching, and user preference.
Enables immersive VR experiences by generating textured meshes compatible with standard rendering pipelines and HMD devices. |
Baked lighting in the generated textures limits custom lighting and dynamic shadows in rendering.
Reliance on real-world textures and high-quality scene reconstruction as input may limit applicability. |
text-driven texture generation, scene stylization, panoramic diffusion, immersive vr, 3d scene understanding |
2310.13102
Report |
Particle Guidance: non-I.I.D. Diverse Sampling with Diffusion Models |
Gabriele Corso, Yilun Xu, Valentin de Bortoli, Regina Barzilay, Tommi Jaakkola |
In light of the widespread success of generative models, a significant amount
of research has gone into speeding up their sampling time. However, generative
models are often sampled multiple times to obtain a diverse set, incurring a
cost that is orthogonal to sampling time. We tackle the question of how to
improve diversity and sample efficiency by moving beyond the common assumption
of independent samples. We propose particle guidance, an extension of
diffusion-based generative sampling where a joint-particle time-evolving
potential enforces diversity. We analyze theoretically the joint distribution
that particle guidance generates, how to learn a potential that achieves
optimal diversity, and the connections with methods in other disciplines.
Empirically, we test the framework both in the setting of conditional image
generation, where we are able to increase diversity without affecting quality,
and molecular conformer generation, where we reduce the state-of-the-art median
error by 13% on average. |
This paper introduces "particle guidance", a novel framework that enhances the sample efficiency of diffusion models by guiding them to generate diverse sets of samples instead of independent samples. |
While deep generative models have achieved remarkable success, their reliance on generating numerous independent samples for diversity is computationally expensive. This work tackles the challenge of improving both diversity and sample efficiency in generative models. |
Particle guidance modifies the reverse diffusion process by introducing a time-evolving potential that encourages diversity among a set of particles being sampled simultaneously. Two instantiations are presented: fixed potential PG, which uses hand-crafted potentials for efficient diverse sampling without additional training, and learned potential PG, which learns potentials to achieve specific joint distributions and preserve marginal distributions. |
Theoretical analysis of particle guidance leads to an expression for the joint marginal distribution of the sampled process under any arbitrary guidance potential.
A training objective is derived to learn a time-evolving potential that enables sampling from a desired joint distribution, ensuring optimality under given diversity constraints.
Empirical evaluations on text-to-image generation and molecular conformer generation demonstrate particle guidance's effectiveness in improving diversity and sample efficiency. In text-to-image generation, it increases diversity without sacrificing quality, and in molecular conformer generation, it achieves a 13% reduction in median error compared to state-of-the-art methods. |
Computational overhead of particle guidance can increase with the number of particles and the complexity of the potential function.
Carefully choosing the potential or guidance weight is crucial to prevent degradation of sample quality due to excessive deviation from the marginal likelihood. |
diffusion models, generative models, sample efficiency, diversity, particle guidance |
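The fixed-potential variant of particle guidance summarized above can be illustrated with a small sketch. It assumes a hand-crafted RBF-kernel similarity between particles and adds the negative gradient of that similarity to each particle's update, so particles sampled together repel one another. The kernel choice, the `bandwidth` and `weight` parameters, the `denoise_drift` callback, and the omission of the stochastic noise term are all illustrative assumptions, not the authors' implementation.

```python
import math

def rbf(a, b, bandwidth):
    """RBF similarity between two particles (hypothetical potential choice)."""
    sq = sum((u - v) ** 2 for u, v in zip(a, b))
    return math.exp(-sq / (2.0 * bandwidth ** 2))

def repulsion(particles, bandwidth=1.0):
    """For each particle x_i, return -grad_{x_i} sum_{j != i} k(x_i, x_j)."""
    forces = []
    for i, xi in enumerate(particles):
        force = [0.0] * len(xi)
        for j, xj in enumerate(particles):
            if i == j:
                continue
            k = rbf(xi, xj, bandwidth)
            for d in range(len(xi)):
                # The negative kernel gradient points away from the neighbour x_j.
                force[d] += k * (xi[d] - xj[d]) / bandwidth ** 2
        forces.append(force)
    return forces

def guided_step(particles, denoise_drift, step_size, weight, bandwidth=1.0):
    """One deterministic update: per-particle drift plus weighted joint repulsion."""
    forces = repulsion(particles, bandwidth)
    return [
        [x + step_size * (f + weight * r)
         for x, f, r in zip(xi, denoise_drift(xi), fi)]
        for xi, fi in zip(particles, forces)
    ]
```

Setting `weight` to zero recovers independent updates for every particle, which makes the trade-off named in the limitations easy to probe: a larger weight buys more diversity at the risk of pushing samples away from the marginal distribution.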
2310.12973
Report |
Frozen Transformers in Language Models Are Effective Visual Encoder Layers |
Ziqi Pang, Ziyang Xie, Yunze Man, Yu-Xiong Wang |
This paper reveals that large language models (LLMs), despite being trained
solely on textual data, are surprisingly strong encoders for purely visual
tasks in the absence of language. Even more intriguingly, this can be achieved
by a simple yet previously overlooked strategy -- employing a frozen
transformer block from pre-trained LLMs as a constituent encoder layer to
directly process visual tokens. Our work pushes the boundaries of leveraging
LLMs for computer vision tasks, significantly departing from conventional
practices that typically necessitate a multi-modal vision-language setup with
associated language prompts, inputs, or outputs. We demonstrate that our
approach consistently enhances performance across a diverse range of tasks,
encompassing pure 2D and 3D visual recognition tasks (e.g., image and point
cloud classification), temporal modeling tasks (e.g., action recognition),
non-semantic tasks (e.g., motion forecasting), and multi-modal tasks (e.g.,
2D/3D visual question answering and image-text retrieval). Such improvements
are a general phenomenon, applicable to various types of LLMs (e.g., LLaMA and
OPT) and different LLM transformer blocks. We additionally propose the
information filtering hypothesis to explain the effectiveness of pre-trained
LLMs in visual encoding -- the pre-trained LLM transformer blocks discern
informative visual tokens and further amplify their effect. This hypothesis is
empirically supported by the observation that the feature activation, after
training with LLM transformer blocks, exhibits a stronger focus on relevant
regions. We hope that our work inspires new perspectives on utilizing LLMs and
deepening our understanding of their underlying mechanisms. Code is available
at https://github.com/ziqipang/LM4VisualEncoding. |
This paper discovers that a frozen transformer block from a pre-trained large language model (LLM) can surprisingly serve as an effective visual encoder layer, enhancing performance across various computer vision tasks even without language input. |
This finding challenges the conventional view of LLMs as solely language-specific models, suggesting their potential for more general representation learning across modalities. |
The authors insert a frozen LLM transformer block, pre-trained on text data, into existing visual encoders, keeping the LLM block frozen during training. They evaluate this approach on diverse tasks like image classification, point cloud recognition, action recognition, and visual question answering. |
Incorporating a frozen LLM transformer block consistently improves performance across a wide range of visual tasks, including 2D and 3D recognition, temporal modeling, and multi-modal tasks.
This improvement is observed across different types of LLMs (e.g., LLaMA, OPT) and various LLM transformer blocks.
The authors propose the 'information filtering' hypothesis, suggesting that pre-trained LLM transformers can identify and amplify informative visual tokens, contributing to their effectiveness in visual encoding. |
The paper primarily focuses on exploring the potential of frozen LLM transformers for visual encoding rather than achieving state-of-the-art results on all tasks.
The information filtering hypothesis, while insightful, requires further investigation to fully understand the mechanisms by which LLMs benefit visual encoding, such as quantifying layer-wise utility and analyzing training dynamics. |
large language models, computer vision, visual encoding, representation learning, information filtering |
2310.12474
Report |
Enhancing High-Resolution 3D Generation through Pixel-wise Gradient Clipping |
Zijie Pan, Jiachen Lu, Xiatian Zhu, Li Zhang |
High-resolution 3D object generation remains a challenging task primarily due
to the limited availability of comprehensive annotated training data. Recent
advancements have aimed to overcome this constraint by harnessing image
generative models, pretrained on extensive curated web datasets, using
knowledge transfer techniques like Score Distillation Sampling (SDS).
Efficiently addressing the requirements of high-resolution rendering often
necessitates the adoption of latent representation-based models, such as the
Latent Diffusion Model (LDM). In this framework, a significant challenge
arises: To compute gradients for individual image pixels, it is necessary to
backpropagate gradients from the designated latent space through the frozen
components of the image model, such as the VAE encoder used within LDM.
However, this gradient propagation pathway has never been optimized, remaining
uncontrolled during training. We find that the unregulated gradients adversely
affect the 3D model's capacity in acquiring texture-related information from
the image generative model, leading to poor quality appearance synthesis. To
address this overarching challenge, we propose an innovative operation termed
Pixel-wise Gradient Clipping (PGC) designed for seamless integration into
existing 3D generative models, thereby enhancing their synthesis quality.
Specifically, we control the magnitude of stochastic gradients by clipping the
pixel-wise gradients efficiently, while preserving crucial texture-related
gradient directions. Despite this simplicity and minimal extra cost, extensive
experiments demonstrate the efficacy of our PGC in enhancing the performance of
existing 3D generative models for high-resolution object rendering. |
This paper introduces Pixel-wise Gradient Clipping (PGC), a simple yet effective technique to enhance texture quality in high-resolution 3D object generation using Latent Diffusion Models (LDMs). |
Existing 3D generation methods using LDMs suffer from uncontrolled pixel-wise gradients during backpropagation through the VAE encoder, leading to poor texture synthesis, especially at high resolutions. |
PGC regulates the magnitudes of pixel-wise gradients by clipping them against predefined thresholds while preserving crucial texture information encoded in the gradient direction. |
PGC consistently enhances texture details compared to baselines, particularly when using SDXL for guidance.
Integrating PGC enables the successful utilization of SDXL, which otherwise fails in high-resolution 3D generation.
PGC is shown to be beneficial across various LDM-based 3D generation pipelines, including text-to-3D and image-to-3D tasks. |
The method may inherit biases from the pre-trained text-to-image models.
The effectiveness of PGC with larger multi-view diffusion models needs further investigation. |
3d object generation, texture synthesis, latent diffusion models, score distillation sampling, gradient clipping |
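As a rough illustration of the clipping step in PGC, the sketch below rescales each pixel's gradient vector so that its norm never exceeds a threshold, which bounds the magnitude while leaving the direction unchanged. The norm-based rule, the nested-list `H x W x C` layout, and the `threshold` parameter are assumptions made for this example; the paper's exact clipping rule may differ.

```python
import math

def clip_pixel_gradients(grad, threshold):
    """Clip an H x W x C gradient (nested lists) pixel by pixel.

    Each pixel's channel vector is rescaled to norm <= threshold,
    so its direction is preserved.
    """
    clipped = []
    for row in grad:
        new_row = []
        for g in row:
            norm = math.sqrt(sum(c * c for c in g))
            # Only shrink gradients that exceed the threshold.
            scale = min(1.0, threshold / norm) if norm > 0.0 else 1.0
            new_row.append([c * scale for c in g])
        clipped.append(new_row)
    return clipped
```

In a real pipeline this would run on the image-space gradient after it has been backpropagated through the frozen VAE encoder and before it reaches the 3D representation.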
2310.12395
Report |
Closed-Form Diffusion Models |
Christopher Scarvelis, Haitz Sáez de Ocáriz Borde, Justin Solomon |
Score-based generative models (SGMs) sample from a target distribution by
iteratively transforming noise using the score function of the perturbed
target. For any finite training set, this score function can be evaluated in
closed form, but the resulting SGM memorizes its training data and does not
generate novel samples. In practice, one approximates the score by training a
neural network via score-matching. The error in this approximation promotes
generalization, but neural SGMs are costly to train and sample, and the
effective regularization this error provides is not well-understood
theoretically. In this work, we instead explicitly smooth the closed-form score
to obtain an SGM that generates novel samples without training. We analyze our
model and propose an efficient nearest-neighbor-based estimator of its score
function. Using this estimator, our method achieves sampling times competitive
with neural SGMs while running on consumer-grade CPUs. |
This paper introduces smoothed closed-form diffusion models (smoothed CFDMs), training-free diffusion models that generate novel samples from finite training sets by smoothing the score function of the perturbed data distribution. |
Addresses the limitations of neural SGMs, such as high training costs and unclear generalization mechanisms, by providing a training-free, efficient, and theoretically grounded approach to generative modeling. |
Explicitly smooths the closed-form score function, derived from a finite training set, to promote generalization. Utilizes a nearest-neighbor-based estimator of the smoothed score and a reduced number of sampling steps for efficiency. |
Smoothing the closed-form score function promotes generalization by enabling the generation of novel samples that are convex combinations of training points.
The support of the model's samples converges towards barycenters of tuples of training points as the number of sampling steps increases.
Achieves competitive sample quality and significantly faster sampling times compared to neural SGMs, even on consumer-grade CPUs, by employing a nearest-neighbor-based score estimator and reduced sampling steps. |
Generating high-quality images requires sampling in the latent space of a pretrained autoencoder, limiting direct application to pixel-level image generation.
The theoretical analysis assumes specific noise distributions (Gumbel) for characterizing the distribution of one-step samples, leaving room for exploring alternative noise distributions. |
generative models, diffusion models, score-based models, training-free methods, nearest neighbor search |
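The closed-form score mentioned in this report can be written down directly when the perturbed target is a Gaussian mixture centered on the training points: the score at `x` is a softmax-weighted mean of the training points minus `x`, divided by the noise variance. The sketch below implements that formula and then one plausible smoothing scheme, averaging the score over Gaussian perturbations of the query. The smoothing scheme, `smoothing_std`, and `num_samples` are assumptions for illustration, and the paper's nearest-neighbor estimator is not reproduced here.

```python
import math
import random

def closed_form_score(x, train_points, sigma):
    """Score of the mixture (1/N) sum_i N(x_i, sigma^2 I), evaluated at x."""
    logits = [
        -sum((a - b) ** 2 for a, b in zip(x, p)) / (2.0 * sigma ** 2)
        for p in train_points
    ]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    mean = [
        sum(w * p[d] for w, p in zip(weights, train_points))
        for d in range(len(x))
    ]
    return [(m_d - x_d) / sigma ** 2 for m_d, x_d in zip(mean, x)]

def smoothed_score(x, train_points, sigma, smoothing_std, num_samples, rng):
    """Monte Carlo average of the closed-form score over perturbed queries."""
    acc = [0.0] * len(x)
    for _ in range(num_samples):
        x_pert = [x_d + rng.gauss(0.0, smoothing_std) for x_d in x]
        s = closed_form_score(x_pert, train_points, sigma)
        acc = [a + s_d for a, s_d in zip(acc, s)]
    return [a / num_samples for a in acc]

rng = random.Random(0)
points = [[-1.0], [1.0]]
print(closed_form_score([0.5], points, sigma=1.0))
print(smoothed_score([0.5], points, 1.0, 0.3, 64, rng))
```

With `smoothing_std` set to zero the smoothed score reduces to the unsmoothed one, which drives samples back onto the training points as the noise level shrinks; that is the memorization behavior the report describes.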
2310.12190
Report |
DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors |
Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Wangbo Yu, Hanyuan Liu, Xintao Wang, Tien-Tsin Wong, Ying Shan |
Animating a still image offers an engaging visual experience. Traditional
image animation techniques mainly focus on animating natural scenes with
stochastic dynamics (e.g. clouds and fluid) or domain-specific motions (e.g.
human hair or body motions), which limits their applicability to more
general visual content. To overcome this limitation, we explore the synthesis
of dynamic content for open-domain images, converting them into animated
videos. The key idea is to utilize the motion prior of text-to-video diffusion
models by incorporating the image into the generative process as guidance.
Given an image, we first project it into a text-aligned rich context
representation space using a query transformer, which facilitates the video
model to digest the image content in a compatible fashion. However, some visual
details still struggle to be preserved in the resultant videos. To supplement
with more precise image information, we further feed the full image to the
diffusion model by concatenating it with the initial noises. Experimental
results show that our proposed method can produce visually convincing and more
logical and natural motions, as well as higher conformity to the input image.
Comparative evaluation demonstrates the notable superiority of our approach
over existing competitors. |
This paper presents DynamiCrafter, a novel method for animating open-domain images by leveraging the generative capabilities of pretrained text-to-video diffusion models. |
Existing image animation techniques often struggle to animate open-domain images due to their reliance on specific object categories or stochastic motion patterns. This new method overcomes this limitation by utilizing the rich dynamic priors present in text-to-video diffusion models. |
DynamiCrafter employs a dual-stream image injection paradigm. It projects the input image into a text-aligned context representation space using a query transformer to facilitate semantic understanding. Additionally, it directly feeds the image to the diffusion model alongside the initial noise to preserve visual details. |
DynamiCrafter generates temporally coherent and visually convincing animations that closely adhere to the input image content.
It outperforms existing open-source methods in quantitative evaluations using FVD, KVD, and a newly introduced Perceptual Input Conformity (PIC) metric.
Qualitative comparisons and user studies confirm its superiority over previous approaches and demonstrate comparable performance to state-of-the-art commercial products like PikaLabs and Gen-2. |
DynamiCrafter's performance is limited by the capabilities of the underlying text-to-video diffusion model, particularly in terms of resolution, duration, and potential flickering artifacts.
Achieving fine-grained control over specific motions remains challenging, although the paper explores text-based motion control as a promising direction. |
image animation, video diffusion models, text-to-video generation, generative models, open-domain animation |
2310.12149
Report |
Object-aware Inversion and Reassembly for Image Editing |
Zhen Yang, Ganggui Ding, Wen Wang, Hao Chen, Bohan Zhuang, Chunhua Shen |
By comparing the original and target prompts, we can obtain numerous editing
pairs, each comprising an object and its corresponding editing target. To allow
editability while maintaining fidelity to the input image, existing editing
methods typically involve a fixed number of inversion steps that project the
whole input image to its noisier latent representation, followed by a denoising
process guided by the target prompt. However, we find that the optimal number
of inversion steps for achieving ideal editing results varies significantly
among different editing pairs, owing to varying editing difficulties.
Therefore, the current literature, which relies on a fixed number of inversion
steps, produces sub-optimal generation quality, especially when handling
multiple editing pairs in a natural image. To this end, we propose a new image
editing paradigm, dubbed Object-aware Inversion and Reassembly (OIR), to enable
object-level fine-grained editing. Specifically, we design a new search metric,
which determines the optimal inversion steps for each editing pair, by jointly
considering the editability of the target and the fidelity of the non-editing
region. We use our search metric to find the optimal inversion step for each
editing pair when editing an image. We then edit these editing pairs separately
to avoid concept mismatch. Subsequently, we propose an additional reassembly
step to seamlessly integrate the respective editing results and the non-editing
region to obtain the final edited image. To systematically evaluate the
effectiveness of our method, we collect two datasets called OIRBench for
benchmarking single- and multi-object editing, respectively. Experiments
demonstrate that our method achieves superior performance in editing object
shapes, colors, materials, categories, etc., especially in multi-object editing
scenarios. |
This paper proposes Object-aware Inversion and Reassembly (OIR), a novel text-driven image editing method using diffusion models that addresses the limitation of existing methods which use a fixed inversion step for all editing pairs in an image. |
Existing diffusion-based image editing methods employ a fixed inversion step for all editing pairs, neglecting the varying editing difficulties of different objects, leading to sub-optimal generation quality and concept mismatch. |
OIR utilizes a search metric to determine the optimal inversion step for each editing pair based on editability and fidelity. It then disassembles the image, edits each pair separately using the optimal step, and reassembles them with the non-editing region, ensuring global consistency. |
OIR achieves state-of-the-art performance in multi-object editing scenarios, surpassing existing methods on CLIP score and demonstrating significant qualitative improvements.
The proposed search metric effectively identifies the optimal inversion step for various editing pairs, confirmed through visualizations.
Ablation studies highlight the importance of disassembly and reassembly steps in OIR for achieving high-quality editing and avoiding concept mismatch. |
OIR incurs additional inference time for the optimal inversion step search.
The effectiveness of OIR needs further validation on other editing tasks like video editing. |
image editing, diffusion models, text-driven editing, object-aware editing, inversion |
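The per-pair search in OIR can be pictured as scoring every candidate inversion step and keeping the best one. The sketch below assumes that an editability score and a fidelity score are already available for each candidate step (how they are measured is left out), min-max normalizes both, and picks the step with the highest equal-weight sum. The normalization and the equal weighting are assumptions for this example, not the paper's exact metric.

```python
def min_max(values):
    """Rescale a list of scores to [0, 1]; constant lists map to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def best_inversion_step(editability, fidelity):
    """Return the index of the candidate step with the best combined score.

    editability[i]: how well the edited region matches the target at step i.
    fidelity[i]: how well the non-editing region is preserved at step i.
    """
    e = min_max(editability)
    f = min_max(fidelity)
    combined = [a + b for a, b in zip(e, f)]
    return max(range(len(combined)), key=combined.__getitem__)

# Each editing pair gets its own search, so different pairs can end up
# with different inversion steps (values below are made up).
pairs = {
    "cat -> dog": ([0.1, 0.5, 0.8, 0.9], [0.9, 0.85, 0.6, 0.1]),
    "red car -> blue car": ([0.6, 0.9, 0.95, 0.97], [0.95, 0.9, 0.5, 0.2]),
}
steps = {name: best_inversion_step(e, f) for name, (e, f) in pairs.items()}
print(steps)
```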
2310.12103
Report |
Quality Diversity through Human Feedback |
Li Ding, Jenny Zhang, Jeff Clune, Lee Spector, Joel Lehman |
Reinforcement Learning from Human Feedback (RLHF) has shown potential in
qualitative tasks where clear objectives are lacking. However, its
effectiveness is not fully realized when it is conceptualized merely as a tool
to optimize average human preferences, especially in generative tasks that
demand diverse model responses. Meanwhile, Quality Diversity (QD) algorithms
excel at identifying diverse and high-quality solutions but often rely on
manually crafted diversity metrics. This paper introduces Quality Diversity
through Human Feedback (QDHF), a novel approach integrating human feedback into
the QD framework. QDHF infers diversity metrics from human judgments of
similarity among solutions, thereby enhancing the applicability and
effectiveness of QD algorithms. Our empirical studies show that QDHF
significantly outperforms state-of-the-art methods in automatic diversity
discovery and matches the efficacy of using manually crafted metrics for QD on
standard benchmarks in robotics and reinforcement learning. Notably, in a
latent space illumination task, QDHF substantially enhances the diversity in
images generated by a diffusion model and was more favorably received in user
studies. We conclude by analyzing QDHF's scalability and the quality of its
derived diversity metrics, emphasizing its potential to improve exploration and
diversity in complex, open-ended optimization tasks. Source code is available
on GitHub: https://github.com/ld-ing/qdhf. |
This paper introduces Quality Diversity through Human Feedback (QDHF), a novel approach that integrates human feedback into the Quality Diversity (QD) framework to learn diversity metrics for enhanced optimization in complex tasks. |
Many generative tasks require diverse model responses, and existing QD algorithms often rely on manually crafted diversity metrics which limit their applicability. QDHF addresses this by learning diversity metrics directly from human feedback, improving exploration and diversity in complex optimization. |
QDHF uses latent space projection to characterize diversity and contrastive learning to align the learned diversity representation with human judgment on the similarity of solutions. A progressive training strategy is proposed to refine the diversity metrics throughout the optimization process. |
QDHF significantly outperforms unsupervised diversity discovery methods and matches the performance of QD with ground truth metrics in robotics and reinforcement learning benchmarks.
QDHF, applied to a latent space illumination task for image generation, produces more diverse and high-quality images compared to baseline methods, as evidenced by quantitative metrics and user studies.
Analysis shows a strong correlation between QDHF's performance, the sample size of human feedback, and the accuracy of the learned diversity metrics in reflecting human judgment. |
The preference model used in QDHF might not generalize well to unseen domains, requiring more diverse and strategically collected human feedback.
Future work will focus on applying QDHF to more complex and open-ended tasks in robotics, reinforcement learning, and generative modeling. |
quality diversity, human feedback, contrastive learning, diversity metrics, generative ai |
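One way to read the contrastive step in QDHF is as a triplet objective: a human says which of two solutions is more similar to an anchor, and a learned projection is trained so that the preferred solution lands closer to the anchor in the diversity space. The sketch below evaluates such a loss for a linear projection. The linear form of the projection, the squared Euclidean distance, and the `margin` value are assumptions for illustration.

```python
def project(weights, features):
    """Linear latent projection: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def triplet_loss(weights, anchor, similar, dissimilar, margin=1.0):
    """Hinge loss that is zero once `similar` is closer to `anchor`
    than `dissimilar` by at least `margin` in the projected space."""
    za = project(weights, anchor)
    zs = project(weights, similar)
    zd = project(weights, dissimilar)
    return max(0.0, margin + sq_dist(za, zs) - sq_dist(za, zd))
```

Minimizing the mean of this loss over collected judgments yields the diversity descriptors used by the QD algorithm; repeating the fit as new solutions are found would correspond to the progressive strategy mentioned above.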
2310.11868
Report |
To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now |
Yimeng Zhang, Jinghan Jia, Xin Chen, Aochuan Chen, Yihua Zhang, Jiancheng Liu, Ke Ding, Sijia Liu |
The recent advances in diffusion models (DMs) have revolutionized the
generation of realistic and complex images. However, these models also
introduce potential safety hazards, such as producing harmful content and
infringing data copyrights. Despite the development of safety-driven unlearning
techniques to counteract these challenges, doubts about their efficacy persist.
To tackle this issue, we introduce an evaluation framework that leverages
adversarial prompts to discern the trustworthiness of these safety-driven DMs
after they have undergone the process of unlearning harmful concepts.
Specifically, we investigated the adversarial robustness of DMs, assessed by
adversarial prompts, when eliminating unwanted concepts, styles, and objects.
We develop an effective and efficient adversarial prompt generation approach
for DMs, termed UnlearnDiffAtk. This method capitalizes on the intrinsic
classification abilities of DMs to simplify the creation of adversarial
prompts, thereby eliminating the need for auxiliary classification or diffusion
models. Through extensive benchmarking, we evaluate the robustness of five
widely-used safety-driven unlearned DMs (i.e., DMs after unlearning undesirable
concepts, styles, or objects) across a variety of tasks. Our results
demonstrate the effectiveness and efficiency merits of UnlearnDiffAtk over the
state-of-the-art adversarial prompt generation method and reveal the lack of
robustness of current safety-driven unlearning techniques when applied to DMs.
Codes are available at https://github.com/OPTML-Group/Diffusion-MU-Attack.
WARNING: This paper contains model outputs that may be offensive in nature. |
This paper introduces UnlearnDiffAtk, an adversarial prompt generation method for evaluating the robustness of safety-driven diffusion models (DMs) after they have unlearned harmful concepts, styles, and objects. |
Evaluating the robustness of safety-driven unlearned DMs is crucial to ensure their trustworthiness and prevent the generation of harmful content, despite efforts to remove such influence. |
UnlearnDiffAtk leverages the intrinsic classification abilities of DMs to efficiently generate adversarial prompts without the need for auxiliary diffusion or classification models. The method optimizes the adversarial prompt using a simplified diffusion classifier-guided approach. |
UnlearnDiffAtk effectively bypasses various safety-driven unlearned DMs, leading to the generation of undesirable content across concept, style, and object unlearning tasks.
UnlearnDiffAtk outperforms the concurrent attack method P4D in terms of effectiveness and efficiency, especially in style and object unlearning.
Current safety-driven unlearning techniques exhibit varying degrees of vulnerability to adversarial prompts, highlighting the need for more robust unlearning methods. |
The evaluation primarily focuses on a limited selection of unlearned DMs.
Future work could explore the development of more robust unlearning techniques that can withstand adversarial attacks like UnlearnDiffAtk.
Investigating the attack transferability across different diffusion model architectures and training datasets is another potential direction. |
diffusion models, adversarial attacks, machine unlearning, image generation, ai safety |
2310.11513
Report |
GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment |
Dhruba Ghosh, Hanna Hajishirzi, Ludwig Schmidt |
Recent breakthroughs in diffusion models, multimodal pretraining, and
efficient finetuning have led to an explosion of text-to-image generative
models. Given human evaluation is expensive and difficult to scale, automated
methods are critical for evaluating the increasingly large number of new
models. However, most current automated evaluation metrics like FID or
CLIPScore only offer a holistic measure of image quality or image-text
alignment, and are unsuited for fine-grained or instance-level analysis. In
this paper, we introduce GenEval, an object-focused framework to evaluate
compositional image properties such as object co-occurrence, position, count,
and color. We show that current object detection models can be leveraged to
evaluate text-to-image models on a variety of generation tasks with strong
human agreement, and that other discriminative vision models can be linked to
this pipeline to further verify properties like object color. We then evaluate
several open-source text-to-image models and analyze their relative generative
capabilities on our benchmark. We find that recent models demonstrate
significant improvement on these tasks, though they are still lacking in
complex capabilities such as spatial relations and attribute binding. Finally,
we demonstrate how GenEval might be used to help discover existing failure
modes, in order to inform development of the next generation of text-to-image
models. Our code to run the GenEval framework is publicly available at
https://github.com/djghosh13/geneval. |
Introduces GenEval, an automated object-focused framework for evaluating compositional capabilities of text-to-image models. |
Automated methods are needed for evaluating the increasing number of text-to-image models, but existing metrics lack fine-grained analysis. |
Leverages object detection models to verify object presence, count, and position, and uses additional vision models (e.g., color classifiers) for attribute verification. |
GenEval achieves 83% agreement with human judgment on image correctness.
Recent models like IF-XL show improvement but still struggle with spatial relations and attribute binding.
GenEval's fine-grained output helps identify specific failure modes in text-to-image generation. |
Performance limited by the capabilities of current object detection models.
Evaluation scope depends on the availability of relevant discriminative vision models. |
text-to-image generation, evaluation metrics, object detection, compositional reasoning, attribute binding |
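The object-focused checks in GenEval can be sketched as simple predicates over detector output. The example below assumes each detection is a dictionary with a `label`, a confidence `score`, and a `box` given as `(x0, y0, x1, y1)`; that format, the `min_score` threshold, and the center-based "left of" rule are hypothetical choices for illustration and are not taken from the GenEval code.

```python
from collections import Counter

def box_center(box):
    """Center (x, y) of a box given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def check_counts(detections, required, min_score=0.5):
    """True if each required class appears exactly the requested number of times."""
    counts = Counter(
        d["label"] for d in detections if d["score"] >= min_score
    )
    return all(counts[name] == n for name, n in required.items())

def check_left_of(detections, left_label, right_label, min_score=0.5):
    """True if every confident `left_label` box is centered left of every
    confident `right_label` box, and both classes are present."""
    lefts = [box_center(d["box"])[0] for d in detections
             if d["label"] == left_label and d["score"] >= min_score]
    rights = [box_center(d["box"])[0] for d in detections
              if d["label"] == right_label and d["score"] >= min_score]
    return bool(lefts) and bool(rights) and max(lefts) < min(rights)
```

Because each predicate reports pass or fail for one property, failures can be tallied per task (count, position, co-occurrence), which is the kind of fine-grained output the report credits with exposing specific failure modes.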
2310.11454
Report |
VeRA: Vector-based Random Matrix Adaptation |
Dawid J. Kopiczko, Tijmen Blankevoort, Yuki M. Asano |
Low-rank adaptation (LoRA) is a popular method that reduces the number of
trainable parameters when finetuning large language models, but still faces
acute storage challenges when scaling to even larger models or deploying
numerous per-user or per-task adapted models. In this work, we present
Vector-based Random Matrix Adaptation (VeRA), which significantly reduces the
number of trainable parameters compared to LoRA, yet maintains the same
performance. It achieves this by using a single pair of low-rank matrices
shared across all layers and learning small scaling vectors instead. We
demonstrate its effectiveness on the GLUE and E2E benchmarks, image
classification tasks, and show its application in instruction-tuning of 7B and
13B language models. |
This paper introduces VeRA (Vector-based Random Matrix Adaptation), a new parameter-efficient finetuning method that uses significantly fewer parameters than LoRA while maintaining comparable performance. |
Efficient adaptation methods are crucial for scaling large language models to personalized applications and edge devices due to limited memory constraints. |
VeRA adapts a single pair of frozen, randomly initialized matrices shared across layers using trainable scaling vectors, unlike LoRA which trains separate low-rank matrices per layer. |
VeRA achieves comparable performance to LoRA on GLUE and E2E benchmarks with an order of magnitude fewer parameters.
It successfully performs instruction-tuning of 7B and 13B language models with a 100x reduction in trainable parameters compared to LoRA.
VeRA demonstrates performance comparable to or better than LoRA on image classification tasks with Vision Transformers, using over 10x fewer parameters. |
The current study primarily focuses on Transformer architectures, leaving its applicability to other architectures and domains for future research.
Further performance improvements could be explored by incorporating dynamic parameter allocation, initialization techniques, or regularization. |
parameter-efficient finetuning, large language models, lora, random projections, instruction tuning |
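The difference between the two parameterizations can be made concrete with a small sketch. It computes a VeRA-style weight update applied to an input, `diag(b) B diag(d) A x`, where `A` and `B` are frozen matrices shared across layers and only the vectors `d` and `b` are trained, and it compares trainable-parameter counts with LoRA. The function names and the per-layer counting formulas (one adapted matrix per layer, `rank * (d_in + d_out)` for LoRA versus `rank + d_out` for VeRA) are assumptions made for this illustration.

```python
def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def vera_delta(x, A, B, d_vec, b_vec):
    """Compute diag(b_vec) @ B @ diag(d_vec) @ A @ x.

    A (rank x d_in) and B (d_out x rank) are frozen and shared;
    d_vec (rank) and b_vec (d_out) are the trainable scaling vectors.
    """
    low = matvec(A, x)
    low = [s * u for s, u in zip(d_vec, low)]
    out = matvec(B, low)
    return [s * u for s, u in zip(b_vec, out)]

def lora_trainable_params(num_layers, d_in, d_out, rank):
    """Separate low-rank pair per layer (assumed: one adapted matrix per layer)."""
    return num_layers * rank * (d_in + d_out)

def vera_trainable_params(num_layers, d_out, rank):
    """Only the two scaling vectors per layer are trained."""
    return num_layers * (rank + d_out)

# Hypothetical sizes: 24 layers, 1024-dimensional projections.
print(lora_trainable_params(24, 1024, 1024, rank=8))
print(vera_trainable_params(24, 1024, rank=256))
```

Because the shared matrices are random and frozen, they could be regenerated from a seed instead of stored, so only the small vectors need to be kept per task; initializing `b_vec` to zeros makes the update vanish at the start of training, which is assumed here to mirror the usual LoRA convention.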
2310.11448
Report |
4K4D: Real-Time 4D View Synthesis at 4K Resolution |
Zhen Xu, Sida Peng, Haotong Lin, Guangzhao He, Jiaming Sun, Yujun Shen, Hujun Bao, Xiaowei Zhou |
This paper targets high-fidelity and real-time view synthesis of dynamic 3D
scenes at 4K resolution. Recently, some methods on dynamic view synthesis have
shown impressive rendering quality. However, their speed is still limited when
rendering high-resolution images. To overcome this problem, we propose 4K4D, a
4D point cloud representation that supports hardware rasterization and enables
unprecedented rendering speed. Our representation is built on a 4D feature grid
so that the points are naturally regularized and can be robustly optimized. In
addition, we design a novel hybrid appearance model that significantly boosts
the rendering quality while preserving efficiency. Moreover, we develop a
differentiable depth peeling algorithm to effectively learn the proposed model
from RGB videos. Experiments show that our representation can be rendered at
over 400 FPS on the DNA-Rendering dataset at 1080p resolution and 80 FPS on the
ENeRF-Outdoor dataset at 4K resolution using an RTX 4090 GPU, which is 30x
faster than previous methods and achieves the state-of-the-art rendering
quality. Our project page is available at https://zju3dv.github.io/4k4d/. |
This paper presents 4K4D, a novel 4D point cloud representation designed for real-time, high-fidelity view synthesis of dynamic 3D scenes at 4K resolution. |
Existing dynamic view synthesis methods, though achieving impressive rendering quality, are limited in rendering speed, particularly at high resolutions, hindering their application in VR/AR and other fields. |
4K4D leverages a 4D feature grid for point regularization and robust optimization. It introduces a hybrid appearance model combining a pre-computable image blending model for efficiency and a continuous spherical harmonics model for view-dependent effects. A differentiable depth peeling algorithm renders the representation, enabling hardware rasterization for speed enhancement. |
4K4D achieves state-of-the-art rendering quality, outperforming competitors on benchmark datasets like DNA-Rendering and ENeRF-Outdoor.
The method achieves unprecedented rendering speed, reaching over 400 FPS at 1080p on DNA-Rendering and 80 FPS at 4K on ENeRF-Outdoor, using an RTX 4090 GPU.
4K4D effectively compresses scene information, achieving a low storage cost of approximately 2 MB per frame, including source videos. |
4K4D currently lacks the ability to establish point correspondences across frames, potentially limiting its applicability in tasks requiring temporal coherence.
The storage cost scales linearly with the number of frames, presenting challenges for modeling extensive volumetric video sequences. |
dynamic view synthesis, neural rendering, point cloud representation, real-time rendering, 4k resolution |
2310.11440
Report |
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models |
Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, Ying Shan |
Vision and language generative models have grown rapidly in recent
years. For video generation, various open-sourced models and publicly available
services have been developed to generate high-quality videos. However, these
methods often use a few metrics, e.g., FVD or IS, to evaluate the performance.
We argue that it is hard to judge the large conditional generative models from
the simple metrics since these models are often trained on very large datasets
with multi-aspect abilities. Thus, we propose a novel framework and pipeline
for exhaustively evaluating the performance of the generated videos. Our
approach involves generating a diverse and comprehensive list of 700 prompts
for text-to-video generation, which is based on an analysis of real-world user
data and generated with the assistance of a large language model. Then, we
evaluate the state-of-the-art video generative models on our carefully designed
benchmark, in terms of visual qualities, content qualities, motion qualities,
and text-video alignment with 17 well-selected objective metrics. To obtain the
final leaderboard of the models, we further fit a series of coefficients to
align the objective metrics to the users' opinions. Based on the proposed human
alignment method, our final score shows a higher correlation than simply
averaging the metrics, showing the effectiveness of the proposed evaluation
method. |
This paper proposes EvalCrafter, a comprehensive framework and benchmark for evaluating text-to-video (T2V) generation models. |
Existing metrics for evaluating T2V models are limited and often fail to capture important aspects such as motion quality, temporal consistency, and text-video alignment. |
The authors first construct a benchmark of 700 diverse prompts with detailed annotations. They then evaluate various T2V models on this benchmark using 17 objective metrics across four aspects: visual quality, text-video alignment, motion quality, and temporal consistency. Finally, they align objective metrics with user opinions obtained through user studies. |
Significant variations in model rankings across different evaluation aspects highlight the need for a multi-aspect evaluation approach.
Users prioritize visual appeal and temporal consistency over strict text-video alignment.
Current T2V models struggle with camera motion control, complex scenes, instruction following, and entity details. |
The current benchmark only contains 700 prompts, which might not be enough to represent the complexity of real-world scenarios.
Evaluating motion quality in a general sense remains challenging. |
text-to-video generation, benchmarking, evaluation metrics, human alignment, large generative models |
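The human-alignment step in the EvalCrafter record above (fitting coefficients so that objective metrics track user opinions) can be illustrated with ordinary least squares. This is a minimal sketch under stated assumptions: the linear model, the function names `fit_alignment_weights` and `aligned_score`, and the synthetic data are all hypothetical, and the paper's actual fitting procedure may differ.

```python
import numpy as np

def fit_alignment_weights(metrics, human_scores):
    """Least-squares fit of human_scores ~ metrics @ w + b.

    metrics: (n_videos, n_metrics) objective metric values.
    human_scores: (n_videos,) user ratings.
    Returns the per-metric weights w and the bias b.
    """
    X = np.hstack([metrics, np.ones((metrics.shape[0], 1))])  # append bias column
    coef, *_ = np.linalg.lstsq(X, human_scores, rcond=None)
    return coef[:-1], coef[-1]

def aligned_score(metrics, w, b):
    """Combine objective metrics into one score using fitted coefficients."""
    return metrics @ w + b

# Synthetic illustration only (not data from the paper).
rng = np.random.default_rng(0)
metrics = rng.normal(size=(200, 4))
true_w = np.array([0.6, 0.1, 0.25, 0.05])      # hypothetical user preferences
human = metrics @ true_w + 3.0 + 0.1 * rng.normal(size=200)

w, b = fit_alignment_weights(metrics, human)
corr_fitted = np.corrcoef(aligned_score(metrics, w, b), human)[0, 1]
corr_avg = np.corrcoef(metrics.mean(axis=1), human)[0, 1]  # plain averaging baseline
```

Comparing `corr_fitted` with `corr_avg` mirrors the record's claim that a fitted score correlates better with users than a simple average of the metrics.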
2310.10769
Report |
LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation |
Ruiqi Wu, Liangyu Chen, Tong Yang, Chunle Guo, Chongyi Li, Xiangyu Zhang |
With the impressive progress in diffusion-based text-to-image generation,
extending such powerful generative ability to text-to-video raises enormous
attention. Existing methods either require large-scale text-video pairs and a
large number of training resources or learn motions that are precisely aligned
with template videos. It is non-trivial to balance a trade-off between the
degree of generation freedom and the resource costs for video generation. In
our study, we present a few-shot-based tuning framework, LAMP, which enables
a text-to-image diffusion model to Learn A specific Motion Pattern with 8~16 videos
on a single GPU. Specifically, we design a first-frame-conditioned pipeline
that uses an off-the-shelf text-to-image model for content generation so that
our tuned video diffusion model mainly focuses on motion learning. The
well-developed text-to-image techniques can provide visually pleasing and
diverse content as generation conditions, which highly improves video quality
and generation freedom. To capture the features of temporal dimension, we
expand the pretrained 2D convolution layers of the T2I model to our novel
temporal-spatial motion learning layers and modify the attention blocks to the
temporal level. Additionally, we develop an effective inference trick,
shared-noise sampling, which can improve the stability of videos with
computational costs. Our method can also be flexibly applied to other tasks,
e.g. real-world image animation and video editing. Extensive experiments
demonstrate that LAMP can effectively learn the motion pattern on limited data
and generate high-quality videos. The code and models are available at
https://rq-wu.github.io/projects/LAMP. |
Presents LAMP, a few-shot-based tuning framework enabling text-to-image diffusion models to learn motion patterns from a small set of videos (8-16) on a single GPU. |
Addresses the limitations of existing text-to-video generation methods that either require large datasets and resources or lack generation freedom by relying on template videos. |
Introduces a first-frame-conditioned pipeline decoupling content and motion, temporal-spatial motion learning layers to capture temporal information, and a shared-noise sampling strategy for improved consistency. |
Achieves state-of-the-art performance on textual alignment, frame consistency, and generation diversity compared to existing methods.
Demonstrates good generalization ability, generating high-quality videos with learned motion patterns applied to various scenes and styles.
Successfully applies the framework to real image animation and video editing tasks. |
Learning complex motions can lead to an increased occurrence of failure cases.
Motion of foreground objects can sometimes affect background stability. |
text-to-video generation, diffusion models, few-shot learning, motion pattern learning, video editing |
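The shared-noise sampling trick named in the LAMP record can be sketched as mixing one noise map common to all frames with independent per-frame noise. The record does not give the mixing formula, so the variance-preserving square-root weighting, the `alpha` parameter, and the function name below are assumptions for illustration only.

```python
import numpy as np

def shared_noise(num_frames, frame_shape, alpha=0.2, seed=0):
    """Return initial noise of shape (num_frames, *frame_shape).

    Each frame mixes a shared base noise with its own independent noise.
    The square-root weights keep every element at unit variance, and the
    correlation between any two frames equals alpha (assumed design).
    """
    rng = np.random.default_rng(seed)
    base = rng.standard_normal(frame_shape)                   # shared by all frames
    indep = rng.standard_normal((num_frames, *frame_shape))   # one draw per frame
    return np.sqrt(alpha) * base[None] + np.sqrt(1.0 - alpha) * indep
```

Starting all frames from correlated noise is one plausible way to make the denoised frames more consistent with each other; a larger `alpha` would trade motion diversity for stability.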
2310.10651
Report |
HairCLIPv2: Unifying Hair Editing via Proxy Feature Blending |
Tianyi Wei, Dongdong Chen, Wenbo Zhou, Jing Liao, Weiming Zhang, Gang Hua, Nenghai Yu |
Hair editing has made tremendous progress in recent years. Early hair editing
methods use well-drawn sketches or masks to specify the editing conditions.
Even though they can enable very fine-grained local control, such interaction
modes are inefficient for the editing conditions that can be easily specified
by language descriptions or reference images. Thanks to the recent breakthrough
of cross-modal models (e.g., CLIP), HairCLIP is the first work that enables
hair editing based on text descriptions or reference images. However, such
text-driven and reference-driven interaction modes make HairCLIP unable to
support fine-grained controls specified by sketch or mask. In this paper, we
propose HairCLIPv2, aiming to support all the aforementioned interactions with
one unified framework. Simultaneously, it improves upon HairCLIP with better
irrelevant attributes (e.g., identity, background) preservation and unseen text
descriptions support. The key idea is to convert all the hair editing tasks
into hair transfer tasks, with editing conditions converted into different
proxies accordingly. The editing effects are added upon the input image by
blending the corresponding proxy features within the hairstyle or hair color
feature spaces. Besides the unprecedented user interaction mode support,
quantitative and qualitative experiments demonstrate the superiority of
HairCLIPv2 in terms of editing effects, irrelevant attribute preservation and
visual naturalness. Our code is available at
https://github.com/wty-ustc/HairCLIPv2. |
HairCLIPv2 is proposed as a unified system for hair editing that supports diverse user interactions, including text descriptions, reference images, sketches, masks, and their combinations. |
Previous hair editing methods were limited in the types of user input they supported, lacking the ability to handle both simple language/image-based instructions and fine-grained local controls within a single framework. |
The key idea is to convert all hair editing tasks into hair transfer by generating different proxy images based on user inputs. This is achieved by blending proxy features within the hairstyle or hair color feature space of StyleGAN. |
HairCLIPv2 demonstrates superior performance compared to existing text-driven methods, especially for out-of-domain text descriptions.
It achieves comparable hair transfer results to state-of-the-art methods while offering broader interaction support.
The proposed method excels in sketch-based local hair editing, outperforming existing approaches in terms of editing quality and preservation of non-edited regions. |
The current system focuses solely on image editing and does not extend to facial hair or video editing.
Generating proxies through optimization limits real-time editing capabilities, presenting an area for future research. |
hair editing, hair transfer, stylegan, clip, multimodal editing |
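The proxy feature blending idea in the HairCLIPv2 record can be pictured as a masked mix of two feature maps. This is a rough sketch, assuming a spatial hair mask selects where proxy features replace source features; the function name, tensor shapes, and the use of a mask are hypothetical and not taken from the paper.

```python
import numpy as np

def blend_proxy_features(source_feat, proxy_feat, hair_mask):
    """Blend proxy features into source features inside the hair region.

    source_feat, proxy_feat: (C, H, W) feature maps.
    hair_mask: (H, W) values in [0, 1]; 1 means "take the proxy feature".
    """
    m = hair_mask[None].astype(source_feat.dtype)  # broadcast over channels
    return m * proxy_feat + (1.0 - m) * source_feat
```

Outside the mask the source features pass through unchanged, which is one simple way to preserve identity and background while editing only the hair.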
2310.10649
Report |
A Computational Framework for Solving Wasserstein Lagrangian Flows |
Kirill Neklyudov, Rob Brekelmans, Alexander Tong, Lazar Atanackovic, Qiang Liu, Alireza Makhzani |
The dynamical formulation of the optimal transport can be extended through
various choices of the underlying geometry (kinetic energy), and the
regularization of density paths (potential energy). These
combinations yield different variational problems (Lagrangians),
encompassing many variations of the optimal transport problem such as the
Schrödinger bridge, unbalanced optimal transport, and optimal transport with
physical constraints, among others. In general, the optimal density path is
unknown, and solving these variational problems can be computationally
challenging. Leveraging the dual formulation of the Lagrangians, we propose a
novel deep learning based framework approaching all of these problems from a
unified perspective. Our method does not require simulating or backpropagating
through the trajectories of the learned dynamics, and does not need access to
optimal couplings. We showcase the versatility of the proposed framework by
outperforming previous approaches for the single-cell trajectory inference,
where incorporating prior knowledge into the dynamics is crucial for correct
predictions. |
The paper introduces Wasserstein Lagrangian Flows, a deep learning framework for inferring dynamics and solving marginal interpolation problems using Lagrangian action functionals on manifolds of probability measures. |
This framework unifies various optimal transport problems, including Schrödinger Bridge, unbalanced optimal transport, and optimal transport with physical constraints, allowing for flexible incorporation of prior information in trajectory inference. |
The methodology leverages the dual formulation of Lagrangians, parameterizes cotangent vectors and distributional paths with neural networks, and optimizes a dual objective that is linear in the density. |
The framework outperforms previous approaches for single-cell trajectory inference.
Incorporating mass teleportation into the dynamical formulation improves performance.
Including a physical potential significantly enhances performance, especially when combined with the Wasserstein Fisher-Rao metric. |
The paper focuses on Lagrangians with linearizable dual objectives.
Future work includes exploring various Lagrangian costs and extending the framework to other domains. |
optimal transport, lagrangian mechanics, deep learning, trajectory inference, single-cell rna sequencing |
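For orientation, the dynamical formulation of optimal transport that the record above builds on is the Benamou-Brenier form, written here in standard notation (density path $\rho_t$, velocity field $v_t$, endpoint marginals $\mu_0$, $\mu_1$). The symbols are a notational assumption and may not match the paper's.

```latex
\frac{1}{2} W_2^2(\mu_0, \mu_1)
  = \inf_{(\rho_t, v_t)} \int_0^1 \!\! \int \frac{1}{2}\,\lVert v_t(x) \rVert^2 \,\rho_t(x)\, dx\, dt
\quad \text{s.t.} \quad
\partial_t \rho_t + \nabla \cdot (\rho_t v_t) = 0, \qquad
\rho_0 = \mu_0, \quad \rho_1 = \mu_1 .
```

The integrand is the kinetic energy of the density path. The variants listed in the record correspond to changing this kinetic term or adding a potential energy term on $\rho_t$.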
2310.10644
Report |
TOSS: High-quality Text-guided Novel View Synthesis from a Single Image |
Yukai Shi, Jianan Wang, He Cao, Boshi Tang, Xianbiao Qi, Tianyu Yang, Yukun Huang, Shilong Liu, Lei Zhang, Heung-Yeung Shum |
In this paper, we present TOSS, which introduces text to the task of novel
view synthesis (NVS) from just a single RGB image. While Zero-1-to-3 has
demonstrated impressive zero-shot open-set NVS capability, it treats NVS as a
pure image-to-image translation problem. This approach suffers from the
challengingly under-constrained nature of single-view NVS: the process lacks
means of explicit user control and often results in implausible NVS
generations. To address this limitation, TOSS uses text as high-level semantic
information to constrain the NVS solution space. TOSS fine-tunes text-to-image
Stable Diffusion pre-trained on large-scale text-image pairs and introduces
modules specifically tailored to image and camera pose conditioning, as well as
dedicated training for pose correctness and preservation of fine details.
Comprehensive experiments are conducted with results showing that our proposed
TOSS outperforms Zero-1-to-3 with more plausible, controllable and
multiview-consistent NVS results. We further support these results with
comprehensive ablations that underscore the effectiveness and potential of the
introduced semantic guidance and architecture design. |
Presents TOSS, a zero-shot open-set novel view synthesis (NVS) model that leverages textual descriptions to generate more plausible and controllable novel views from a single RGB image. |
Existing single-view NVS methods often produce implausible results due to the under-constrained nature of the problem. TOSS addresses this by incorporating textual guidance to constrain the solution space and provide explicit user control. |
TOSS adapts a text-to-image Stable Diffusion model by introducing: 1) a dense cross-attention module to condition on input image features, 2) a mechanism for incorporating camera pose information, 3) a training strategy with expert denoisers for pose accuracy and detail preservation. |
TOSS quantitatively outperforms baseline methods on NVS, showing higher PSNR, SSIM, and lower LPIPS and KID values on GSO and RTMV datasets.
TOSS demonstrates superior qualitative results with more plausible, controllable, and multiview-consistent novel view generations.
TOSS improves 3D reconstruction quality with finer details and better mesh quality compared to baselines. |
Current captioning models may not provide sufficiently detailed descriptions for optimal NVS.
Training on synthetic datasets can lead to distribution shift; utilizing real images and videos could alleviate this. |
novel view synthesis, text-guided generation, diffusion models, single image 3d reconstruction, semantic guidance |
2310.10642
Report |
Real-time Photorealistic Dynamic Scene Representation and Rendering with 4D Gaussian Splatting |
Zeyu Yang, Hongye Yang, Zijie Pan, Li Zhang |
Reconstructing dynamic 3D scenes from 2D images and generating diverse views
over time is challenging due to scene complexity and temporal dynamics. Despite
advancements in neural implicit models, limitations persist: (i) Inadequate
Scene Structure: Existing methods struggle to reveal the spatial and temporal
structure of dynamic scenes from directly learning the complex 6D plenoptic
function. (ii) Scaling Deformation Modeling: Explicitly modeling scene element
deformation becomes impractical for complex dynamics. To address these issues,
we consider the spacetime as an entirety and propose to approximate the
underlying spatio-temporal 4D volume of a dynamic scene by optimizing a
collection of 4D primitives, with explicit geometry and appearance modeling.
Learning to optimize the 4D primitives enables us to synthesize novel views at
any desired time with our tailored rendering routine. Our model is conceptually
simple, consisting of a 4D Gaussian parameterized by anisotropic ellipses that
can rotate arbitrarily in space and time, as well as view-dependent and
time-evolved appearance represented by the coefficient of 4D spherindrical
harmonics. This approach offers simplicity, flexibility for variable-length
video and end-to-end training, and efficient real-time rendering, making it
suitable for capturing complex dynamic scene motions. Experiments across
various benchmarks, including monocular and multi-view scenarios, demonstrate
our 4DGS model's superior visual quality and efficiency. |
This paper presents 4D Gaussian Splatting (4DGS), a novel approach to represent and render dynamic scenes using 4D Gaussian primitives that coherently integrate space and time dimensions. |
Reconstructing dynamic 3D scenes from 2D images and generating diverse views over time is challenging, and existing methods struggle with inadequate scene structure representation and scaling deformation modeling. |
The method leverages 4D Gaussian distributions parameterized by anisotropic ellipses capable of rotation in space-time to model scene elements. Additionally, it introduces 4D Spherindrical Harmonics to represent the time evolution of view-dependent color. |
4DGS outperforms state-of-the-art methods in terms of visual quality and efficiency on benchmarks like the Plenoptic Video and D-NeRF datasets.
The method effectively captures underlying 3D motion without explicit supervision or regularization.
4DGS demonstrates capability in handling complex real-world dynamic scenes, including those with volumetric effects, non-Lambertian surfaces, and varying lighting. |
The method might struggle to capture distant background areas when initialized with point clouds from a limited time range.
Reliance on initial point clouds might introduce limitations in scenarios where such information is unavailable. |
novel view synthesis, dynamic scene reconstruction, 4d gaussian splatting, 4d spherindrical harmonics, real-time rendering |
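One way to read "synthesize novel views at any desired time" in the 4DGS record is to slice each 4D Gaussian at the query time with the standard conditional-Gaussian identity, which yields an ordinary 3D Gaussian plus a temporal weight. The sketch below applies that textbook identity; the function name and the (x, y, z, t) parameter layout are assumptions, and the record does not describe the paper's actual rendering routine.

```python
import numpy as np

def condition_on_time(mean4, cov4, t):
    """Slice a 4D Gaussian over (x, y, z, t) at time t.

    mean4: (4,) mean; cov4: (4, 4) symmetric positive-definite covariance.
    Returns the conditional 3D mean, the conditional 3D covariance, and an
    unnormalized temporal weight (marginal density of t, peak value 1).
    """
    mu_s, mu_t = mean4[:3], mean4[3]
    S_ss = cov4[:3, :3]          # spatial block
    S_st = cov4[:3, 3]           # space-time coupling
    S_tt = cov4[3, 3]            # temporal variance
    mean3 = mu_s + S_st * (t - mu_t) / S_tt
    cov3 = S_ss - np.outer(S_st, S_st) / S_tt   # Schur complement
    weight = np.exp(-0.5 * (t - mu_t) ** 2 / S_tt)
    return mean3, cov3, weight
```

A nonzero space-time coupling `S_st` makes the conditional mean move linearly with `t`, which is how a single primitive can encode local motion.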
2310.10640
Report |
LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts |
Hanan Gani, Shariq Farooq Bhat, Muzammal Naseer, Salman Khan, Peter Wonka |
Diffusion-based generative models have significantly advanced text-to-image
generation but encounter challenges when processing lengthy and intricate text
prompts describing complex scenes with multiple objects. While excelling in
generating images from short, single-object descriptions, these models often
struggle to faithfully capture all the nuanced details within longer and more
elaborate textual inputs. In response, we present a novel approach leveraging
Large Language Models (LLMs) to extract critical components from text prompts,
including bounding box coordinates for foreground objects, detailed textual
descriptions for individual objects, and a succinct background context. These
components form the foundation of our layout-to-image generation model, which
operates in two phases. The initial Global Scene Generation utilizes object
layouts and background context to create an initial scene but often falls short
in faithfully representing object characteristics as specified in the prompts.
To address this limitation, we introduce an Iterative Refinement Scheme that
iteratively evaluates and refines box-level content to align them with their
textual descriptions, recomposing objects as needed to ensure consistency. Our
evaluation on complex prompts featuring multiple objects demonstrates a
substantial improvement in recall compared to baseline diffusion models. This
is further validated by a user study, underscoring the efficacy of our approach
in generating coherent and detailed scenes from intricate textual inputs. |
This paper proposes a novel framework, called LLM Blueprint, to address the challenge of generating images from lengthy and detailed text prompts, which often pose difficulties for existing text-to-image models. |
Current text-to-image models, while proficient in handling short prompts, often struggle to faithfully capture all details in longer descriptions, leading to omissions and misrepresentations. |
The framework uses LLMs to extract a 'Scene Blueprint' from the prompt, including object layouts, descriptions, and background context. It then employs a two-phase image generation process: 1) Global Scene Generation: Creating an initial image based on the layout and background. 2) Iterative Refinement Scheme: Refining the content of each bounding box to align with its textual description, using a multi-modal guidance procedure. |
The proposed method achieves a significantly higher Prompt Adherence Recall (PAR) score (85%) compared to baselines like Stable Diffusion (49%), GLIGEN (57%), and LayoutGPT (69%).
Qualitative analysis demonstrates the superiority of the approach in capturing all objects and their details, including accurate spatial positioning, compared to existing methods.
A user study confirms that the proposed method consistently produces more coherent images that better adhere to lengthy textual descriptions than baseline approaches. |
The current method uses a fixed box layout during refinement; exploring dynamic box adjustments could be beneficial.
The approach handles overlapping boxes by size-based sorting; investigating more advanced techniques is a potential future direction. |
text-to-image synthesis, large language models, diffusion models, layout-to-image generation, iterative refinement |
2310.10563
Report |
RefConv: Re-parameterized Refocusing Convolution for Powerful ConvNets |
Zhicheng Cai, Xiaohan Ding, Qiu Shen, Xun Cao |
We propose Re-parameterized Refocusing Convolution (RefConv) as a replacement
for regular convolutional layers, which is a plug-and-play module to improve
the performance without any inference costs. Specifically, given a pre-trained
model, RefConv applies a trainable Refocusing Transformation to the basis
kernels inherited from the pre-trained model to establish connections among the
parameters. For example, a depth-wise RefConv can relate the parameters of a
specific channel of convolution kernel to the parameters of the other kernel,
i.e., make them refocus on the other parts of the model they have never
attended to, rather than focus on the input features only. From another
perspective, RefConv augments the priors of existing model structures by
utilizing the representations encoded in the pre-trained parameters as the
priors and refocusing on them to learn novel representations, thus further
enhancing the representational capacity of the pre-trained model. Experimental
results validated that RefConv can improve multiple CNN-based models by a clear
margin on image classification (up to 1.47% higher top-1 accuracy on ImageNet),
object detection and semantic segmentation without introducing any extra
inference costs or altering the original model structure. Further studies
demonstrated that RefConv can reduce the redundancy of channels and smooth the
loss landscape, which explains its effectiveness. |
This paper proposes Re-parameterized Refocusing Convolution (RefConv), a plug-and-play module that enhances the performance of convolutional layers in convolutional neural networks (CNNs) without increasing inference costs. |
This approach aims to improve CNN performance by augmenting the priors of existing model structures, allowing kernels to learn more diverse representations. |
RefConv replaces standard convolutional layers with a two-step process: 1) it uses a pre-trained convolutional kernel as basis weights, and 2) it applies a trainable Refocusing Transformation to these basis weights, creating transformed weights that are used for inference. |
RefConv consistently improves the performance of various CNN architectures on ImageNet, object detection, and semantic segmentation tasks.
The method is shown to reduce channel redundancy, leading to more diverse and richer representations.
RefConv results in a smoother loss landscape, implying better generalization abilities. |
The current design of Refocusing Transformation is relatively simple, relying solely on convolution.
Future work could explore more advanced operations and non-linearity in the Refocusing Transformation. |
convolutional neural networks, re-parameterization, refocusing convolution, channel redundancy, loss landscape |
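The two-step RefConv process described above (frozen basis weights, then a trainable convolution-based Refocusing Transformation) can be sketched for the depth-wise case as follows. The identity shortcut, the dense cross-channel mixing, and the function name are assumptions made for illustration; the record only states that the transformation relies on convolution and that the transformed weights are used for inference.

```python
import numpy as np

def refocus_depthwise(basis, refocus):
    """Compute transformed kernels from frozen basis kernels.

    basis:   (C, K, K) depth-wise kernels inherited from a pre-trained model.
    refocus: (C, C, k, k) trainable refocusing weights, k odd.
    The basis is treated as a C-channel K x K "feature map", convolved with
    `refocus` using same-padding, then added back (assumed identity shortcut).
    """
    C, K, _ = basis.shape
    k = refocus.shape[-1]
    p = k // 2
    padded = np.pad(basis, ((0, 0), (p, p), (p, p)))
    out = np.zeros(basis.shape, dtype=float)
    for o in range(C):                 # output kernel channel
        for i in range(C):             # basis kernel channel it attends to
            for y in range(K):
                for x in range(K):
                    patch = padded[i, y:y + k, x:x + k]
                    out[o, y, x] += np.sum(patch * refocus[o, i])
    return out + basis
```

With `refocus` initialised to zeros the transformed kernels equal the pre-trained ones, so training would start from the original model's behaviour. After training, the transformed kernels can be computed once and stored, which is consistent with the claim of no extra inference cost.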
2310.10533
Report |
Label-efficient Segmentation via Affinity Propagation |
Wentong Li, Yuqian Yuan, Song Wang, Wenyu Liu, Dongqi Tang, Jian Liu, Jianke Zhu, Lei Zhang |
Weakly-supervised segmentation with label-efficient sparse annotations has
attracted increasing research attention to reduce the cost of laborious
pixel-wise labeling process, while the pairwise affinity modeling techniques
play an essential role in this task. Most of the existing approaches focus on
using the local appearance kernel to model the neighboring pairwise potentials.
However, such a local operation fails to capture the long-range dependencies
and ignores the topology of objects. In this work, we formulate the affinity
modeling as an affinity propagation process, and propose local and global
pairwise affinity terms to generate accurate soft pseudo labels. An efficient
algorithm is also developed to significantly reduce the computational cost. The
proposed approach can be conveniently plugged into existing segmentation
networks. Experiments on three typical label-efficient segmentation tasks, i.e.
box-supervised instance segmentation, point/scribble-supervised semantic
segmentation and CLIP-guided semantic segmentation, demonstrate the superior
performance of the proposed approach. |
This paper proposes Affinity Propagation (APro), a novel component for weakly-supervised segmentation that formulates the task as an affinity propagation process. |
Label-efficient segmentation with sparse annotations is crucial for reducing annotation costs. Existing methods using local appearance kernels for affinity modeling have limitations in capturing long-range dependencies and object topology. |
APro models pairwise affinity both globally and locally. It utilizes a topology-aware tree-based graph for global affinity propagation and a Gaussian kernel-based approach for local affinity propagation. An efficient algorithm reduces the computational complexity from O(N^2) to O(N log N). |
APro outperforms counterparts in box-supervised instance segmentation on Pascal VOC and COCO datasets, achieving significant AP gains.
It achieves state-of-the-art results in point/scribble-supervised semantic segmentation on Pascal VOC2012, surpassing previous best methods in mIoU.
In CLIP-guided annotation-free semantic segmentation, APro consistently improves performance on Pascal VOC2012, Pascal Context, and COCO-Stuff datasets across various CLIP models. |
The method relies on image intensity and color similarities, potentially facing challenges in scenarios like motion blur and occlusions.
Future work will explore integrating APro with large-scale foundation models like SAM for enhanced feature representation and performance. |
weakly-supervised segmentation, affinity propagation, label-efficient learning, pairwise affinity modeling, instance segmentation, semantic segmentation |
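The local, Gaussian kernel-based half of APro described above can be sketched as repeated neighbourhood averaging of soft labels, weighted by intensity similarity. This is an illustrative sketch only: the grayscale input, the `sigma`, `radius`, and `iters` parameters, and the function name are assumptions, and the tree-based global propagation is not shown.

```python
import numpy as np

def local_affinity_propagation(image, probs, sigma=0.1, radius=1, iters=5):
    """Propagate soft labels between neighbouring pixels of similar intensity.

    image: (H, W) intensities in [0, 1].
    probs: (H, W, C) soft labels, each pixel summing to 1.
    Returns refined soft labels of the same shape.
    """
    H, W = image.shape
    out = probs.astype(float).copy()
    pad_img = np.pad(image, radius, mode="edge")
    for _ in range(iters):
        pad_out = np.pad(out, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
        acc = np.zeros_like(out)
        norm = np.zeros((H, W, 1))
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                ys, xs = radius + dy, radius + dx
                nb_img = pad_img[ys:ys + H, xs:xs + W]
                nb_out = pad_out[ys:ys + H, xs:xs + W]
                # Gaussian kernel on intensity difference: similar pixels share labels
                w = np.exp(-((image - nb_img) ** 2) / (2 * sigma ** 2))[..., None]
                acc += w * nb_out
                norm += w
        out = acc / norm
    return out
```

Because the weights collapse across strong intensity edges, labels spread within a region but barely leak across its boundary. That is also why the limitation noted in the record (reliance on intensity and color similarity) matters under blur or occlusion.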
2310.10513
Report |
Unifying Image Processing as Visual Prompting Question Answering |
Yihao Liu, Xiangyu Chen, Xianzheng Ma, Xintao Wang, Jiantao Zhou, Yu Qiao, Chao Dong |
Image processing is a fundamental task in computer vision, which aims at
enhancing image quality and extracting essential features for subsequent vision
applications. Traditionally, task-specific models are developed for individual
tasks and designing such models requires distinct expertise. Building upon the
success of large language models (LLMs) in natural language processing (NLP),
there is a similar trend in computer vision, which focuses on developing
large-scale models through pretraining and in-context learning. This paradigm
shift reduces the reliance on task-specific models, yielding a powerful unified
model to deal with various tasks. However, these advances have predominantly
concentrated on high-level vision tasks, with less attention paid to low-level
vision tasks. To address this issue, we propose a universal model for general
image processing that covers image restoration, image enhancement, image
feature extraction tasks, etc. Our proposed framework, named PromptGIP, unifies
these diverse image processing tasks within a universal framework. Inspired by
NLP question answering (QA) techniques, we employ a visual prompting question
answering paradigm. Specifically, we treat the input-output image pair as a
structured question-answer sentence, thereby reprogramming the image processing
task as a prompting QA problem. PromptGIP can undertake diverse cross-domain
tasks using provided visual prompts, eliminating the need for task-specific
finetuning. Our methodology offers a universal and adaptive solution to general
image processing. While PromptGIP has demonstrated a certain degree of
out-of-domain task generalization capability, further research is expected to
fully explore its more powerful emergent generalization. |
This paper introduces PromptGIP, a universal model for general image processing. PromptGIP can handle image restoration, enhancement, and feature extraction tasks within a unified framework. |
Existing image processing methods often require task-specific models and struggle to generalize across different output domains. This work aims to unify these diverse tasks under one model. |
PromptGIP leverages a visual prompting question answering paradigm. It treats input-output image pairs as structured question-answer sentences, effectively reprogramming image processing tasks as prompting QA problems. |
PromptGIP successfully handles up to 15 diverse image processing tasks with promising visual results.
It outperforms baseline methods, including the original ViT and Painter, on various tasks, demonstrating the effectiveness of the proposed QA paradigm and masked training strategy.
PromptGIP exhibits a certain degree of generalization on out-of-distribution tasks, showcasing its potential for broader application. |
The current model does not excel at generating unexpected or emergent outcomes, indicating a need for further exploration in enabling true out-of-distribution generalization.
The current ViT backbone limits performance on certain tasks, suggesting that incorporating stronger backbones could be beneficial. |
image processing, visual prompting, question answering, in-context learning, vision transformer |
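The "question-answer sentence" framing in the PromptGIP record can be pictured as a canvas holding a prompt pair and a query whose answer is masked out for the model to fill in. The 2x2 layout, the function name, and the mask convention below are assumptions for illustration; the record does not specify how the pairs are arranged.

```python
import numpy as np

def build_qa_canvas(prompt_q, prompt_a, query_q, mask_value=0.0):
    """Assemble a visual question-answer canvas.

    Top row:    prompt input (question) | prompt output (answer)
    Bottom row: query input (question)  | masked slot to be predicted
    All images are (H, W, C) arrays of the same shape.
    Returns the canvas and a boolean (2H, 2W) mask of the region to predict.
    """
    h, w, c = query_q.shape
    blank = np.full((h, w, c), mask_value, dtype=query_q.dtype)
    top = np.concatenate([prompt_q, prompt_a], axis=1)
    bottom = np.concatenate([query_q, blank], axis=1)
    canvas = np.concatenate([top, bottom], axis=0)
    mask = np.zeros(canvas.shape[:2], dtype=bool)
    mask[h:, w:] = True
    return canvas, mask
```

Swapping the prompt pair (for example a noisy/clean pair versus an image/edge-map pair) changes the task without changing the model, which is the sense in which the record says no task-specific finetuning is needed.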
2310.10343
Report |
ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion |
Jiayu Yang, Ziang Cheng, Yunfei Duan, Pan Ji, Hongdong Li |
Given a single image of a 3D object, this paper proposes a novel method
(named ConsistNet) that is able to generate multiple images of the same object,
as if they were captured from different viewpoints, while the 3D
(multi-view) consistencies among those multiple generated images are
effectively exploited. Central to our method is a multi-view consistency block
which enables information exchange across multiple single-view diffusion
processes based on the underlying multi-view geometry principles. ConsistNet is
an extension to the standard latent diffusion model, and consists of two
sub-modules: (a) a view aggregation module that unprojects multi-view features
into global 3D volumes and infers consistency, and (b) a ray aggregation module
that samples and aggregates 3D-consistent features back to each view to enforce
consistency. Our approach departs from previous methods in multi-view image
generation, in that it can be easily dropped into pre-trained LDMs without
requiring explicit pixel correspondences or depth prediction. Experiments show
that our method effectively learns 3D consistency over a frozen Zero123
backbone and can generate 16 surrounding views of the object within 40 seconds
on a single A100 GPU. Our code will be made available on
https://github.com/JiayuYANG/ConsistNet |
This paper proposes ConsistNet, a plug-in module for image diffusion models to generate multi-view consistent images by enforcing 3D consistency. |
3D-consistent multi-view image generation is crucial for applications like 3D asset creation in VR/AR and video games, overcoming limitations of existing methods that struggle to maintain strict multi-view geometry consistency. |
The method uses multiple parallel Latent Diffusion Models (LDMs), one per viewpoint, connected by ConsistNet. This module enforces consistency through view aggregation (unprojecting features to 3D and using attention) and ray aggregation (sampling 3D features and projecting back to 2D). |
ConsistNet effectively learns 3D consistency when applied to a frozen Zero123 model.
It generates 16 surrounding views of an object in 40 seconds on a single A100 GPU.
The model demonstrates good generalization ability when evaluated on unseen data (Google Scanned Objects dataset). |
Quantitative evaluation metrics may not fully capture the inherent ambiguity of generating unseen views from a single image.
Future work includes improving computational efficiency and developing a 3D reconstruction module for mesh generation alongside image denoising. |
multi-view image generation, 3d consistency, diffusion models, latent diffusion models, generative models |
2310.10123
Report |
AutoDIR: Automatic All-in-One Image Restoration with Latent Diffusion |
Yitong Jiang, Zhaoyang Zhang, Tianfan Xue, Jinwei Gu |
In this paper, we aim to solve complex real-world image restoration
situations in which one image may have a variety of unknown degradations. To
this end, we propose an all-in-one image restoration framework with latent
diffusion (AutoDIR), which can automatically detect and address multiple
unknown degradations. Our framework first utilizes a Blind Image Quality
Assessment Module (BIQA) to automatically detect and identify the unknown
dominant image degradation type of the image. Then, an All-in-One Image
Refinement (AIR) Module restores images with multiple kinds of degradation
under the guidance of BIQA. Finally, a Structure Correction Module (SCM) is
proposed to recover the image details distorted by AIR. Our comprehensive
evaluation demonstrates that AutoDIR outperforms state-of-the-art approaches by
achieving superior restoration results while supporting a wider range of tasks.
Notably, AutoDIR is also the first method to automatically handle real-scenario
images with multiple unknown degradations. |
Proposes AutoDIR, an all-in-one image restoration system using latent diffusion for automatic detection and restoration of images with unknown degradations. |
Addresses limitations of single-task image restoration methods and aims to learn a unified model capable of handling real-world images with multiple unknown degradations. |
Combines a Blind Image Quality Assessment (BIQA) module for degradation detection, an All-in-One Image Refinement (AIR) module based on latent diffusion for restoration, and a Structure Correction Module (SCM) for recovering image details distorted by AIR. |
AutoDIR outperforms state-of-the-art methods in seven image restoration tasks: denoising, deblurring, super-resolution, low-light enhancement, dehazing, deraining, and raindrop removal.
Effectively handles images with unknown degradations, as demonstrated on Under-Display Camera and Underwater datasets.
Shows promise as a foundation model for image restoration, exhibiting effective few-shot learning capabilities for new tasks like desnowing. |
Computational cost remains high compared to non-generative networks.
Currently focuses on global image restoration, with limited capabilities for local region-based editing. |
image restoration, latent diffusion model, blind image quality assessment, foundation model, multi-task learning |
2310.09965
Report |
ProteusNeRF: Fast Lightweight NeRF Editing using 3D-Aware Image Context |
Binglun Wang, Niladri Shekhar Dutt, Niloy J. Mitra |
Neural Radiance Fields (NeRFs) have recently emerged as a popular option for
photo-realistic object capture due to their ability to faithfully capture
high-fidelity volumetric content even from handheld video input. Although much
research has been devoted to efficient optimization leading to real-time
training and rendering, options for interactively editing NeRFs remain limited.
We present a very simple but effective neural network architecture that is fast
and efficient while maintaining a low memory footprint. This architecture can
be incrementally guided through user-friendly image-based edits. Our
representation allows straightforward object selection via semantic feature
distillation at the training stage. More importantly, we propose a local
3D-aware image context to facilitate view-consistent image editing that can
then be distilled into fine-tuned NeRFs, via geometric and appearance
adjustments. We evaluate our setup on a variety of examples to demonstrate
appearance and geometric edits and report 10-30x speedup over concurrent work
focusing on text-guided NeRF editing. Video results can be seen on our project
webpage at https://proteusnerf.github.io. |
Presents ProteusNeRF, a fast and lightweight framework for editing NeRF assets using traditional or generative image manipulation tools via a novel 3D-aware image context. |
NeRF editing remains limited despite advancements in real-time training and rendering, creating a need for intuitive, expressive, and efficient editing tools. |
Employs a residual tri-plane feature field for NeRF representation, enables object selection through semantic feature distillation, and utilizes a 3D-aware image context for synchronized multi-view editing. Edits are distilled back into the NeRF via geometric and appearance adjustments. |
Achieves 10-30x speedup over concurrent text-guided NeRF editing methods.
Allows for both small appearance edits (e.g., recoloring) and larger edits involving geometry changes.
Supports layered editing with a minimal memory footprint (4-36KB/edit) for appearance modifications. |
Limited support for large geometric changes.
Inability to handle view-dependent specular effects. |
nerf, neural radiance fields, 3d editing, generative editing, image-based editing |
2310.09912
Report |
Unsupervised Discovery of Interpretable Directions in h-space of Pre-trained Diffusion Models |
Zijian Zhang, Luping Liu, Zhijie Lin, Yichen Zhu, Zhou Zhao |
We propose the first unsupervised and learning-based method to identify
interpretable directions in h-space of pre-trained diffusion models. Our method
is derived from an existing technique that operates on the GAN latent space.
Specifically, we employ a shift control module that works on h-space of
pre-trained diffusion models to manipulate a sample into a shifted version of
itself, followed by a reconstructor to reproduce both the type and the strength
of the manipulation. By jointly optimizing them, the model will spontaneously
discover disentangled and interpretable directions. To prevent the discovery of
meaningless and destructive directions, we employ a discriminator to maintain
the fidelity of shifted samples. Due to the iterative generative process of
diffusion models, our training requires a substantial amount of GPU VRAM to
store numerous intermediate tensors for back-propagating gradients. To address
this issue, we propose a general VRAM-efficient training algorithm based on
the gradient checkpointing technique to back-propagate any gradient through the
whole generative process, with acceptable VRAM occupancy at some cost in
training efficiency. Compared with existing related works on diffusion models,
our method inherently identifies global and scalable directions, without
necessitating any other complicated procedures. Extensive experiments on
various datasets demonstrate the effectiveness of our method. |
The paper proposes the first unsupervised, learning-based method to identify interpretable directions in the h-space of pre-trained diffusion models for semantic image manipulation. |
Existing methods for discovering meaningful directions in the latent space of diffusion models rely on external supervision (e.g., human annotation, CLIP). This work aims to achieve similar results in a fully unsupervised manner. |
The method employs a shift control module and a reconstructor. The shift control module manipulates a sample into a shifted version by modifying its representation in the h-space. The reconstructor aims to reproduce the applied shift. A discriminator is introduced to maintain the fidelity of shifted samples during training. Additionally, a VRAM-efficient training algorithm based on gradient checkpointing is proposed to handle the memory-intensive training process. |
The method successfully discovers disentangled and interpretable directions in the h-space of pre-trained diffusion models, enabling semantic image manipulation.
The proposed VRAM-efficient training algorithm significantly reduces memory consumption during training, at an acceptable cost in training efficiency relative to the standard approach.
Quantitative evaluations using reconstructor classification accuracy (RCA) and mean opinion score (MOS) demonstrate the effectiveness of the proposed method. |
The training and inference speed is limited by the multi-step iterative generation process inherent to diffusion models.
The reliance on adversarial training to maintain sample fidelity may introduce complexity, and simpler alternative methods could be explored. |
diffusion models, unsupervised learning, semantic image manipulation, latent space, interpretable directions |
2310.09711
Report |
LOVECon: Text-driven Training-Free Long Video Editing with ControlNet |
Zhenyi Liao, Zhijie Deng |
Leveraging pre-trained conditional diffusion models for video editing without
further tuning has gained increasing attention due to its promise in film
production, advertising, etc. Yet, seminal works in this line fall short in
generation length, temporal coherence, or fidelity to the source video. This
paper aims to bridge the gap, establishing a simple and effective baseline for
training-free diffusion model-based long video editing. As suggested by prior
art, we build the pipeline upon ControlNet, which excels at various image
editing tasks based on text prompts. To break down the length constraints
caused by limited computational memory, we split the long video into
consecutive windows and develop a novel cross-window attention mechanism to
ensure the consistency of global style and maximize the smoothness among
windows. To achieve more accurate control, we extract the information from the
source video via DDIM inversion and integrate the outcomes into the latent
states of the generations. We also incorporate a video frame interpolation
model to mitigate the frame-level flickering issue. Extensive empirical studies
verify the superior efficacy of our method over competing baselines across
scenarios, including the replacement of the attributes of foreground objects,
style transfer, and background replacement. In particular, our method manages
to edit videos with up to 128 frames according to user requirements. Code is
available at https://github.com/zhijie-group/LOVECon. |
This paper introduces LOVECon, a simple yet effective training-free diffusion model-based method for long video editing. |
Existing training-free methods for video editing struggle with long videos, exhibiting inconsistencies in global style and local details, especially in maintaining temporal coherence and fidelity to the source video. LOVECon aims to address these limitations. |
LOVECon leverages pre-trained Stable Diffusion and ControlNet, incorporating DDIM inversion for source frame information. It introduces three key components: (1) Cross-window attention for inter-window consistency, (2) Latent fusion of source and edited frames for structural fidelity, and (3) Frame interpolation to mitigate flickering. |
LOVECon outperforms baselines in maintaining fidelity to the source video and temporal consistency, as evidenced by quantitative metrics and user studies.
LOVECon demonstrates precise editing capabilities while preserving fine details, unlike some baselines that suffer from semantic leakage or blurring.
LOVECon can effectively edit videos up to 128 frames, showcasing its capability for long video editing. |
LOVECon, relying on ControlNet, is limited in handling significant shape changes in editing.
Suboptimal editing results are observed when the source video contains substantial content changes, such as large movements. |
video editing, diffusion models, controlnet, temporal consistency, long video generation |
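The windowing step described for LOVECon (splitting a long video into consecutive windows so each fits in memory) can be illustrated with a minimal sketch. The function name `split_into_windows` and the window size of 16 are assumptions made for illustration; the source does not state the window length, and the cross-window attention that links the windows is not shown here.

```python
def split_into_windows(num_frames, window_size):
    """Return consecutive, non-overlapping windows of frame indices.

    The last window may be shorter when num_frames is not a multiple
    of window_size.
    """
    if num_frames < 0 or window_size <= 0:
        raise ValueError("num_frames must be >= 0 and window_size must be > 0")
    return [
        list(range(start, min(start + window_size, num_frames)))
        for start in range(0, num_frames, window_size)
    ]


# A 128-frame video with a hypothetical window size of 16 frames.
windows = split_into_windows(128, 16)
print(len(windows), windows[0][0], windows[-1][-1])
```

Each window would then be edited separately, with the cross-window attention mechanism keeping style and motion consistent across window boundaries.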
2310.09469
Report |
Towards More Accurate Diffusion Model Acceleration with A Timestep Aligner |
Mengfei Xia, Yujun Shen, Changsong Lei, Yu Zhou, Ran Yi, Deli Zhao, Wenping Wang, Yong-jin Liu |
A diffusion model, which is formulated to produce an image using thousands of
denoising steps, usually suffers from a slow inference speed. Existing
acceleration algorithms simplify the sampling by skipping most steps yet
exhibit considerable performance degradation. By viewing the generation of
diffusion models as a discretized integrating process, we argue that the
quality drop is partly caused by applying an inaccurate integral direction to a
timestep interval. To rectify this issue, we propose a timestep aligner that
helps find a more accurate integral direction for a particular interval at the
minimum cost. Specifically, at each denoising step, we replace the original
parameterization by conditioning the network on a new timestep, which is
obtained by aligning the sampling distribution to the real distribution.
Extensive experiments show that our plug-in design can be trained efficiently
and boost the inference performance of various state-of-the-art acceleration
methods, especially when there are few denoising steps. For example, when using
10 denoising steps on the popular LSUN Bedroom dataset, we improve the FID of
DDIM from 9.65 to 6.07, simply by adopting our method for a more appropriate
set of timesteps. Code will be made publicly available. |
This paper proposes Timestep Aligner (TSA), a plug-in method to enhance the accuracy of accelerated diffusion model sampling by re-aligning time steps. |
Accelerated diffusion model sampling, while reducing computation, often suffers from quality degradation due to discrepancies between real and sampling distributions. This method aims to bridge this gap for improved sampling fidelity. |
TSA searches for a more suitable time step (τ) for each denoising step, replacing the original time step (t) as input to the pre-trained noise prediction model. It minimizes the distance between predicted noise at the re-aligned time step and the actual noise, effectively aligning the distributions. |
TSA significantly boosts the performance of various acceleration methods, particularly at a low number of function evaluations (NFE).
The improvement is consistent across diverse datasets (CIFAR10, CelebA, LSUN Bedroom, FFHQ, ImageNet, MS-COCO) and tasks (unconditional & conditional generation).
Experiments validate the theoretical claims, showing monotonic FID reduction with progressive time step re-alignment and a decrease in the distribution gap. |
The parallel training strategy, while significantly faster, shows slightly lower performance compared to the sequential approach.
Exploration of methods to enhance the parallel training strategy's performance is a potential future direction. |
diffusion models, image generation, sampling acceleration, time step alignment, truncation error |
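The core idea of the timestep aligner (feeding the noise predictor a re-aligned time step τ instead of the original t, chosen so that the predicted noise is closer to the actual noise) can be sketched on a toy problem. This is only an illustration under stated assumptions: the exhaustive grid search below stands in for whatever optimization the paper actually uses, and `toy_predictor` is a hypothetical stand-in for a pre-trained noise prediction network.

```python
def align_timestep(predict_noise, x_t, true_noise, candidates):
    """Pick the candidate tau whose predicted noise is closest to true_noise.

    Closeness is measured by squared Euclidean distance. Grid search is an
    illustrative substitute, not the paper's training procedure.
    """
    def squared_error(tau):
        predicted = predict_noise(x_t, tau)
        return sum((p - e) ** 2 for p, e in zip(predicted, true_noise))

    return min(candidates, key=squared_error)


def toy_predictor(x_t, tau):
    """Hypothetical noise predictor: scales the input by tau / 10."""
    return [value * tau / 10.0 for value in x_t]


x_t = [1.0, 2.0]
true_noise = [0.7, 1.4]
# The sampler's original step might be t = 5; the search looks for a better tau.
best_tau = align_timestep(toy_predictor, x_t, true_noise, range(1, 11))
print(best_tau)
```

In this toy setup the search returns the time step whose prediction matches the true noise exactly; with a real network the minimum is generally nonzero and τ need not be an integer.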
2310.09458
Report |
PaintHuman: Towards High-fidelity Text-to-3D Human Texturing via Denoised Score Distillation |
Jianhui Yu, Hao Zhu, Liming Jiang, Chen Change Loy, Weidong Cai, Wayne Wu |
Recent advances in zero-shot text-to-3D human generation, which employ the
human model prior (e.g., SMPL) or Score Distillation Sampling (SDS) with
pre-trained text-to-image diffusion models, have been groundbreaking. However,
SDS may provide inaccurate gradient directions under the weak diffusion
guidance, as it tends to produce over-smoothed results and generate body
textures that are inconsistent with the detailed mesh geometry. Therefore,
directly leveraging existing strategies for high-fidelity text-to-3D human
texturing is challenging. In this work, we propose a model called PaintHuman to
address the challenges from two aspects. We first propose a novel score
function, Denoised Score Distillation (DSD), which directly modifies the SDS by
introducing negative gradient components to iteratively correct the gradient
direction and generate high-quality textures. In addition, we use the depth map
as geometric guidance to ensure the texture is semantically aligned to human
mesh surfaces. To guarantee the quality of rendered results, we employ
geometry-aware networks to predict surface materials and render realistic human
textures. Extensive experiments, benchmarked against state-of-the-art methods,
validate the efficacy of our approach. |
Presents PaintHuman, a zero-shot text-to-human texture generation model that leverages a novel Denoised Score Distillation (DSD) method for high-quality, detailed textures aligned to input text. |
Addresses limitations of existing text-to-3D human texturing methods that produce over-smoothed results, inconsistent textures, and semantic misalignment with input texts. |
Introduces DSD, which refines gradient direction using negative image-text pairs during the diffusion process, utilizes depth signals for accurate geometry-aware texturing, and employs a differentiable SV-BRDF network for realistic material prediction and rendering. |
Generates high-quality human avatars with detailed textures, surpassing existing methods in visual fidelity and semantic alignment.
Demonstrates the efficacy of DSD in mitigating over-smoothing issues and achieving higher CLIP scores compared to baseline models.
Shows significant improvements in user study evaluations, highlighting the superior quality and text faithfulness of the generated textures. |
Current semantic zoom implementation relies on manual adjustment for face region; future work may explore automatic detection.
Exploring the potential of DSD in broader 3D texturing tasks beyond human avatars could be a promising research direction. |
text-to-3d, texture generation, diffusion models, score distillation sampling, denoised score distillation |
2310.09382
Report |
LL-VQ-VAE: Learnable Lattice Vector-Quantization For Efficient Representations |
Ahmed Khalil, Robert Piechocki, Raul Santos-Rodriguez |
In this paper we introduce learnable lattice vector quantization and
demonstrate its effectiveness for learning discrete representations. Our
method, termed LL-VQ-VAE, replaces the vector quantization layer in VQ-VAE with
lattice-based discretization. The learnable lattice imposes a structure over
all discrete embeddings, acting as a deterrent against codebook collapse,
leading to high codebook utilization. Compared to VQ-VAE, our method obtains
lower reconstruction errors under the same training conditions, trains in a
fraction of the time, and with a constant number of parameters (equal to the
embedding dimension $D$), making it a very scalable approach. We demonstrate
these results on the FFHQ-1024 dataset and include FashionMNIST and Celeb-A. |
This paper introduces Learnable Lattice VQ-VAE (LL-VQ-VAE), replacing vector quantization in VQ-VAE with a learnable lattice layer for efficient latent discretization. |
The proposed method addresses limitations of VQ-VAE, such as codebook collapse and a trade-off between quantization quality and speed, by imposing a structured lattice representation on discrete embeddings. |
The LL-VQ-VAE utilizes a learnable lattice basis matrix to define the embedding space. The Babai Rounding Estimate is used for quantization, and a size loss term encourages lattice sparsity, controlling the number of effective embeddings. |
LL-VQ-VAE achieves lower reconstruction errors than VQ-VAE and its EMA variant on datasets like FFHQ-1024, Celeb-A, and FashionMNIST.
It significantly reduces the number of trainable parameters in the quantization layer, making it more scalable.
The method demonstrates faster training times compared to VQ-VAE while maintaining high quantization quality and resisting codebook collapse. |
The paper acknowledges the difficulty of imposing a hard upper limit on the number of embeddings due to the infinite nature of the lattice.
Future work aims to explore the relationship between quantization strategies and the preservation of image properties, as well as the resilience of the learned representations to distortions. |
learnable lattice, vector quantization, vq-vae, discrete representation learning, codebook collapse |
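The lattice quantization step in LL-VQ-VAE can be sketched with Babai rounding: express the input in lattice coordinates, round each coordinate to the nearest integer, and map back. The sketch below assumes a diagonal basis (one learnable scale per embedding dimension), which is an assumption suggested by the abstract's remark that the parameter count equals the embedding dimension $D$; the function name `babai_round` is hypothetical, and the size loss term is not shown.

```python
def babai_round(z, basis_diag):
    """Quantize vector z to a lattice with a diagonal basis.

    For a diagonal basis, Babai rounding reduces to per-coordinate
    rounding: coeff_i = round(z_i / b_i), point_i = coeff_i * b_i.
    Python's built-in round() resolves exact ties to the even integer.
    """
    if len(z) != len(basis_diag):
        raise ValueError("z and basis_diag must have the same length")
    coeffs = [round(zi / bi) for zi, bi in zip(z, basis_diag)]
    point = [c * bi for c, bi in zip(coeffs, basis_diag)]
    return coeffs, point


# Hypothetical 2-D example: lattice spacing 0.5 on the first axis, 1.0 on the second.
coeffs, point = babai_round([0.9, -1.3], [0.5, 1.0])
print(coeffs, point)
```

Because the lattice is defined by its basis rather than by a stored codebook, no per-embedding parameters are needed, which is consistent with the scalability claim in the report.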
2310.09199
Report |
PaLI-3 Vision Language Models: Smaller, Faster, Stronger |
Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, Radu Soricut |
This paper presents PaLI-3, a smaller, faster, and stronger vision language
model (VLM) that compares favorably to similar models that are 10x larger. As
part of arriving at this strong performance, we compare Vision Transformer
(ViT) models pretrained using classification objectives to contrastively
(SigLIP) pretrained ones. We find that, while slightly underperforming on
standard image classification benchmarks, SigLIP-based PaLI shows superior
performance across various multimodal benchmarks, especially on localization
and visually-situated text understanding. We scale the SigLIP image encoder up
to 2 billion parameters, and achieve a new state-of-the-art on multilingual
cross-modal retrieval. We hope that PaLI-3, at only 5B parameters, rekindles
research on fundamental pieces of complex VLMs, and could fuel a new generation
of scaled-up models. |
This paper presents PaLI-3, a 5B parameter vision language model (VLM) that achieves state-of-the-art performance on various benchmarks, particularly excelling in visually-situated text understanding and object localization, despite being 10x smaller than previous state-of-the-art models. |
Smaller VLMs are important for practical applications due to easier training, deployment, environmental friendliness, and faster research cycles. |
The model utilizes a contrastively pretrained ViT-G image encoder (SigLIP) instead of classification pretraining, and is trained in three stages: unimodal pretraining, multimodal training with a refined dataset mixture, and resolution increase. |
Achieves new state-of-the-art results on visually-situated text understanding benchmarks, such as TextCaps and TextVQA.
Outperforms previous models on several video QA benchmarks, despite not being pretrained on video data.
Introduces a 2B parameter multilingual SigLIP model that achieves state-of-the-art on multilingual cross-modal retrieval. |
Similar limitations to existing VLMs in terms of potential biases and fairness issues.
Further improvements in reasoning capabilities are needed, particularly for tasks like AI2D and ChartQA. |
vision language model, contrastive pretraining, visually-situated text understanding, object localization, multimodal learning |
2310.08949
Report |
EasyGen: Easing Multimodal Generation with BiDiffuser and LLMs |
Xiangyu Zhao, Bo Liu, Qijiong Liu, Guangyuan Shi, Xiao-Ming Wu |
We present EasyGen, an efficient model designed to enhance multimodal
understanding and generation by harnessing the capabilities of diffusion models
and large language models (LLMs). Unlike existing multimodal models that
predominantly depend on encoders like CLIP or ImageBind and need ample amounts
of training data to bridge modalities, EasyGen leverages BiDiffuser, a
bidirectional conditional diffusion model, to foster more efficient modality
interactions. EasyGen achieves text generation by training a projection layer
linking BiDiffuser and an LLM, and facilitates image generation by training an
adapter to align the LLM's text space with the BiDiffuser's image space.
Comprehensive quantitative and qualitative experiments show that EasyGen excels
in data-efficient training, high-quality image generation, and extendibility,
effectively addressing the challenges in multimodal generation. The source code
is available at https://github.com/zxy556677/EasyGen. |
This paper introduces EasyGen, an end-to-end multimodal model that leverages a bidirectional conditional diffusion model (BiDiffuser) and LLMs for efficient multimodal understanding and generation. |
Existing multimodal models often struggle with inefficient modality interactions and limited generation capabilities beyond text. EasyGen aims to address these limitations. |
EasyGen utilizes BiDiffuser, a fine-tuned version of UniDiffuser, for bidirectional image-text generation. It employs a projection layer to align BiDiffuser with LLMs for text generation tasks like image captioning and VQA. For image generation, an adapter integrates semantic information from LLM into BiDiffuser. |
EasyGen achieves competitive performance on image captioning and VQA tasks with high data efficiency.
The model demonstrates superior image generation quality compared to other end-to-end MLLMs.
EasyGen is easily extendable, accommodating advanced visual encoders or enhancing existing multimodal LLMs like LLaVA. |
The diffusion-based approach may lead to longer processing times for image-to-text and text-to-image generation.
Future work could focus on exploring efficient sampling methods to enhance EasyGen's overall efficiency. |
multimodal generation, diffusion models, large language models, image captioning, visual question answering |
2310.08921
Report |
Feature Proliferation -- the "Cancer" in StyleGAN and its Treatments |
Shuang Song, Yuanbang Liang, Jing Wu, Yu-Kun Lai, Yipeng Qin |
Despite the success of StyleGAN in image synthesis, the images it synthesizes
are not always perfect and the well-known truncation trick has become a
standard post-processing technique for StyleGAN to synthesize high-quality
images. Although effective, it has long been noted that the truncation trick
tends to reduce the diversity of synthesized images and unnecessarily
sacrifices many distinct image features. To address this issue, in this paper,
we first delve into the StyleGAN image synthesis mechanism and discover an
important phenomenon, namely Feature Proliferation, which demonstrates how
specific features reproduce with forward propagation. Then, we show how the
occurrence of Feature Proliferation results in StyleGAN image artifacts. As an
analogy, we refer to it as the "cancer" in StyleGAN because of its proliferating and
malignant nature. Finally, we propose a novel feature rescaling method that
identifies and modulates risky features to mitigate feature proliferation.
Thanks to our discovery of Feature Proliferation, the proposed feature
rescaling method is less destructive and retains more useful image features
than the truncation trick, as it is more fine-grained and works in a
lower-level feature space rather than a high-level latent space. Experimental
results justify the validity of our claims and the effectiveness of the
proposed feature rescaling method. Our code is available at
https://github.com/songc42/Feature-proliferation. |
This paper introduces "Feature Proliferation", a phenomenon in StyleGAN where certain features with abnormal values reproduce during forward propagation, leading to image artifacts. |
This phenomenon explains a cause of artifacts in StyleGAN-generated images and allows for targeted correction without sacrificing image diversity as much as the truncation trick. |
The authors analyze the StyleGAN architecture, particularly the weight modulation and demodulation steps, to identify how Feature Proliferation occurs. They propose a feature rescaling method that identifies and adjusts risky features early in the network. |
Feature Proliferation is directly linked to image artifacts in StyleGAN.
Proposed feature rescaling mitigates artifacts while better preserving image features compared to the truncation trick.
The method is compatible with StyleGAN latent space operations like interpolation and image editing. |
Current feature identification and rescaling may still remove some useful features.
Future work will focus on more precise feature processing and investigating Feature Proliferation in other network architectures. |
stylegan, generative adversarial networks, image synthesis, feature proliferation, artifact removal |
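For context on the baseline that the Feature Proliferation report compares against, the truncation trick is commonly written as a linear interpolation of a latent code toward the average latent. The sketch below shows that baseline only, not the paper's feature rescaling method; the names `truncate`, `w_avg`, and `psi` follow common usage but are assumptions here.

```python
def truncate(w, w_avg, psi):
    """Truncation trick: pull latent code w toward the average latent w_avg.

    psi = 1.0 leaves w unchanged; psi = 0.0 collapses every sample onto
    w_avg, which is why smaller psi values reduce diversity.
    """
    if len(w) != len(w_avg):
        raise ValueError("w and w_avg must have the same length")
    return [avg + psi * (wi - avg) for wi, avg in zip(w, w_avg)]


# Hypothetical 2-D latent with a zero average latent.
print(truncate([2.0, -2.0], [0.0, 0.0], 0.5))
```

Because the interpolation acts on the whole latent code at once, it cannot distinguish risky features from useful ones, which is the coarseness the report's lower-level feature rescaling is meant to avoid.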
2310.08587
Report |
Pseudo-Generalized Dynamic View Synthesis from a Video |
Xiaoming Zhao, Alex Colburn, Fangchang Ma, Miguel Angel Bautista, Joshua M. Susskind, Alexander G. Schwing |
Rendering scenes observed in a monocular video from novel viewpoints is a
challenging problem. For static scenes the community has studied both
scene-specific optimization techniques, which optimize on every test scene, and
generalized techniques, which only run a deep net forward pass on a test scene.
In contrast, for dynamic scenes, scene-specific optimization techniques exist,
but, to the best of our knowledge, there is currently no generalized method for
dynamic novel view synthesis from a given monocular video. To answer whether
generalized dynamic novel view synthesis from monocular videos is possible
today, we establish an analysis framework based on existing techniques and work
toward the generalized approach. We find a pseudo-generalized process without
scene-specific appearance optimization is possible, but geometrically and
temporally consistent depth estimates are needed. Despite no scene-specific
appearance optimization, the pseudo-generalized approach improves upon some
scene-specific methods. |
This paper investigates the feasibility of generalizing dynamic novel view synthesis from monocular videos, aiming to reduce reliance on computationally expensive scene-specific optimization. |
Developing generalized methods for dynamic view synthesis from monocular videos is crucial for applications in AR/VR and robotics but remains challenging due to the task's ill-posed nature. Existing methods heavily rely on scene-specific optimization, hindering scalability. |
The authors establish an analysis framework inspired by scene-specific methods, separating the rendering of static and dynamic content. They adapt a pre-trained generalizable NeRF transformer (GNT) for static content and investigate the use of depth and temporal priors (optical flow and tracking) for dynamic content rendering. |
Complete generalization with current depth estimation or tracking methods is not yet achievable.
A pseudo-generalized approach, using consistent depth estimates but avoiding scene-specific appearance optimization, outperforms several scene-specific methods on perceptual quality metrics.
Geometrically and temporally consistent depth estimates are identified as a requirement for pseudo-generalized dynamic novel view synthesis from monocular videos. |
Simple temporal aggregation using tracking methods does not yet yield satisfactory results, indicating a need for more sophisticated designs.
Future work includes exploring context-aware inpainting to address artifacts such as missing foreground parts and blurry backgrounds. |
novel view synthesis, dynamic scenes, monocular video, generalizable methods, consistent depth |
2310.08579
Report |
HyperHuman: Hyper-Realistic Human Generation with Latent Structural Diffusion |
Xian Liu, Jian Ren, Aliaksandr Siarohin, Ivan Skorokhodov, Yanyu Li, Dahua Lin, Xihui Liu, Ziwei Liu, Sergey Tulyakov |
Despite significant advances in large-scale text-to-image models, achieving
hyper-realistic human image generation remains a desirable yet unsolved task.
Existing models like Stable Diffusion and DALL-E 2 tend to generate human
images with incoherent parts or unnatural poses. To tackle these challenges,
our key insight is that human images are inherently structural over multiple
granularities, from the coarse-level body skeleton to fine-grained spatial
geometry. Therefore, capturing such correlations between the explicit
appearance and latent structure in one model is essential to generate coherent
and natural human images. To this end, we propose a unified framework,
HyperHuman, that generates in-the-wild human images of high realism and diverse
layouts. Specifically, 1) we first build a large-scale human-centric dataset,
named HumanVerse, which consists of 340M images with comprehensive annotations
like human pose, depth, and surface normal. 2) Next, we propose a Latent
Structural Diffusion Model that simultaneously denoises the depth and surface
normal along with the synthesized RGB image. Our model enforces the joint
learning of image appearance, spatial relationship, and geometry in a unified
network, where the branches in the model complement each other with both
structural awareness and textural richness. 3) Finally, to further boost the
visual quality, we propose a Structure-Guided Refiner to compose the predicted
conditions for more detailed generation of higher resolution. Extensive
experiments demonstrate that our framework yields the state-of-the-art
performance, generating hyper-realistic human images under diverse scenarios.
Project Page: https://snap-research.github.io/HyperHuman/ |
This paper introduces HyperHuman, a novel framework for generating highly realistic and diverse human images with controllable layouts, addressing the limitations of existing text-to-image models in accurately depicting human anatomy and poses. |
Generating hyper-realistic human images is crucial for various applications like image animation and virtual try-on, but existing models often produce incoherent or unnatural results. HyperHuman aims to overcome these limitations by explicitly modeling the inherent multi-level structure of human images. |
The approach consists of two main stages: 1) Latent Structural Diffusion Model: This model jointly denoises RGB, depth, and surface-normal maps, capturing the correlations between appearance and structure. 2) Structure-Guided Refiner: It leverages the predicted structural maps to generate high-resolution images with improved detail and fidelity. The authors also create a large-scale human-centric dataset, HumanVerse, containing 340M images with comprehensive annotations for training and evaluation. |
HyperHuman significantly outperforms previous state-of-the-art models in terms of image quality, pose accuracy, and text-image alignment on the MS-COCO 2014 validation human subset.
The model demonstrates strong robustness to the impact of random seeds and unseen poses, as shown by the generated results.
Qualitative analysis and user studies confirm that HyperHuman generates more realistic, aesthetically pleasing, and text-aligned human images compared to baseline methods. |
The generation of subtle details like fingers and eyes is limited by the performance of existing pose, depth, and normal estimators.
Future work includes exploring deep priors for text-to-pose generation, eliminating the current reliance on body skeleton input. |
text-to-image generation, human image synthesis, diffusion models, controllable image generation, structure-aware generation |
2310.08577
Report |
Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models |
Vishaal Udandarao, Max F. Burg, Samuel Albanie, Matthias Bethge |
Recent advances in the development of vision-language models (VLMs) are
yielding remarkable success in recognizing visual semantic content, including
impressive instances of compositional image understanding. Here, we introduce
the novel task of Visual Data-Type Identification, a basic perceptual skill
with implications for data curation (e.g., noisy data-removal from large
datasets, domain-specific retrieval) and autonomous vision (e.g.,
distinguishing changing weather conditions from camera lens staining). We
develop two datasets consisting of animal images altered across a diverse set
of 27 visual data-types, spanning four broad categories. An extensive zero-shot
evaluation of 39 VLMs, ranging from 100M to 80B parameters, shows a nuanced
performance landscape. While VLMs are reasonably good at identifying certain
stylistic data-types, such as cartoons and sketches, they struggle
with simpler data-types arising from basic manipulations like image rotations
or additive noise. Our findings reveal that (i) model scaling alone yields
marginal gains for contrastively-trained models like CLIP, and (ii) there is a
pronounced drop in performance for the largest auto-regressively trained VLMs
like OpenFlamingo. This finding points to a blind spot in current frontier
VLMs: they excel in recognizing semantic content but fail to acquire an
understanding of visual data-types through scaling. By analyzing the
pre-training distributions of these models and incorporating data-type
information into the captions during fine-tuning, we achieve a significant
enhancement in performance. By exploring this previously uncharted task, we aim
to set the stage for further advancing VLMs to equip them with visual data-type
understanding. Code and datasets are released at
https://github.com/bethgelab/DataTypeIdentification. |
This paper introduces the novel task of *Visual Data-Type Identification*, where a model identifies how an image was generated (e.g., blurred, rotated) in addition to recognizing semantic content. |
This task is important for various applications such as data curation, filtering, and autonomous vision, where understanding image generation processes can be as crucial as recognizing semantic content. |
The authors create two datasets, *SyntheticTypeIdent* and *NaturalTypeIdent*, consisting of animal images altered with 27 different data-types. They benchmark 39 state-of-the-art VLMs, ranging from 100M to 80B parameters, on their ability to identify these data-types. |
VLMs struggle with identifying many data-types, particularly those involving simple transformations like noise addition or rotation.
Scaling VLM size yields only marginal performance improvements, suggesting current models are not inherently learning to recognize data-types through increased scale alone.
Performance can be significantly improved by fine-tuning VLMs with data explicitly containing data-type information. |
The study is limited to animal images and a specific set of data-types.
Future work can explore alternative training objectives or architectures that explicitly encourage data-type representation learning in VLMs. |
vision-language models, data-type identification, dataset bias, model scaling, fine-tuning |
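The zero-shot protocol summarized above can be illustrated for a contrastively trained model such as CLIP: each candidate data-type is written as a text prompt, and the prediction is the prompt whose embedding is most similar to the image embedding. The sketch below assumes the embeddings are already computed. The function names and the toy vectors are hypothetical, and the benchmark's actual prompts and scoring are not given in this summary.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_data_type(image_emb: Sequence[float],
                        prompt_embs: Dict[str, Sequence[float]]) -> str:
    """Return the data-type whose prompt embedding best matches the image."""
    return max(prompt_embs, key=lambda name: cosine(image_emb, prompt_embs[name]))


# Toy, made-up embeddings (a real model would produce these vectors).
prompt_embs = {
    "rotated": [1.0, 0.0, 0.0],
    "noisy": [0.0, 1.0, 0.0],
    "cartoon": [0.0, 0.0, 1.0],
}
prediction = zero_shot_data_type([0.1, 0.9, 0.2], prompt_embs)
```

Accuracy over a dataset would then be the fraction of images whose predicted data-type matches the applied one.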
2310.08541
Report |
Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation |
Zhengyuan Yang, Jianfeng Wang, Linjie Li, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Lijuan Wang |
We introduce "Idea to Image," a system that enables multimodal iterative
self-refinement with GPT-4V(ision) for automatic image design and generation.
Humans can quickly identify the characteristics of different text-to-image
(T2I) models via iterative explorations. This enables them to efficiently
convert their high-level generation ideas into effective T2I prompts that can
produce good images. We investigate if systems based on large multimodal models
(LMMs) can develop analogous multimodal self-refinement abilities that enable
exploring unknown models or environments via self-refining tries. Idea2Img
cyclically generates revised T2I prompts to synthesize draft images, and
provides directional feedback for prompt revision, both conditioned on its
memory of the probed T2I model's characteristics. The iterative self-refinement
brings Idea2Img various advantages over vanilla T2I models. Notably, Idea2Img
can process input ideas with interleaved image-text sequences, follow ideas
with design instructions, and generate images of better semantic and visual
qualities. The user preference study validates the efficacy of multimodal
iterative self-refinement on automatic image design and generation. |
This paper introduces "Idea to Image" (Idea2Img), a system that utilizes multimodal iterative self-refinement with large multimodal models (LMMs) like GPT-4V for automatic image design and generation. |
The goal is to mimic the human ability to iteratively explore and understand the characteristics of text-to-image (T2I) models, thereby generating more effective prompts and higher-quality images. |
Idea2Img employs a cyclical process where an LMM generates text prompts, selects the best draft image from the T2I model, provides feedback on discrepancies, and refines the prompt iteratively. This process is guided by a memory module storing the history of prompts, images, and feedback. |
Idea2Img can handle complex user ideas containing interleaved image-text sequences and design instructions.
Idea2Img consistently outperforms baseline methods and human-written prompts in user preference studies across various T2I models, including SDXL and DeepFloyd IF.
Stronger T2I models benefit more from Idea2Img's iterative refinement, indicating a synergistic effect between LMM guidance and T2I capabilities. |
The current work focuses on image generation; future work could apply the framework to other multimodal tasks such as GUI navigation and embodied agents.
While the current system explores using a single generation model, extending it to manage and optimize the collaboration of multiple tools is a promising direction. |
image generation, text-to-image, multimodal learning, iterative refinement, large language models |
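The cyclical process described for Idea2Img can be sketched as a loop over prompt revision, draft generation, and feedback, with a memory of past rounds. This is a minimal sketch under stated assumptions: the four callables stand in for the LMM and T2I model and are hypothetical, and the sketch produces one draft per round, so the draft-selection step is omitted.

```python
from typing import Callable, List, Tuple

# One memory entry per round: (prompt, image, feedback).
Memory = List[Tuple[str, str, str]]


def refine_prompt_loop(
    idea: str,
    revise_prompt: Callable[[str, Memory], str],   # hypothetical LMM call
    generate_image: Callable[[str], str],          # hypothetical T2I call
    give_feedback: Callable[[str, str], str],      # hypothetical LMM call
    is_good_enough: Callable[[str, str], bool],    # hypothetical stop check
    max_rounds: int = 3,
) -> Tuple[str, Memory]:
    """Iteratively revise a T2I prompt until the draft image is accepted."""
    memory: Memory = []
    image = ""
    for _ in range(max_rounds):
        # The new prompt is conditioned on the idea and on all earlier rounds.
        prompt = revise_prompt(idea, memory)
        image = generate_image(prompt)
        if is_good_enough(idea, image):
            memory.append((prompt, image, ""))
            break
        # Otherwise record what to change and try again.
        feedback = give_feedback(idea, image)
        memory.append((prompt, image, feedback))
    return image, memory
```

Images are represented as strings here only to keep the sketch self-contained; any image type works as long as the callables agree on it.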
2310.08534
Report |
Animating Street View |
Mengyi Shan, Brian Curless, Ira Kemelmacher-Shlizerman, Steve Seitz |
We present a system that automatically brings street view imagery to life by
populating it with naturally behaving, animated pedestrians and vehicles. Our
approach is to remove existing people and vehicles from the input image, insert
moving objects with proper scale, angle, motion, and appearance, plan paths and
traffic behavior, as well as render the scene with plausible occlusion and
shadowing effects. The system achieves these by reconstructing the still image
street scene, simulating crowd behavior, and rendering with consistent
lighting, visibility, occlusions, and shadows. We demonstrate results on a
diverse range of street scenes including regular still images and panoramas. |
This paper presents a system that automatically animates still street view images by populating them with naturally behaving pedestrians and vehicles, respecting the scene's geometry and illumination. |
This approach enhances the vividness of street view imagery without additional capture and privacy concerns, offering a more immersive experience. |
The system uses a three-stage pipeline: (1) Reconstruction of scene geometry, semantics, and lighting. (2) Simulation of pedestrian and vehicle behaviors. (3) Rendering of 3D assets into the scene with realistic shadows and occlusions. |
Realistic animation of street scenes with moving pedestrians and vehicles.
Accurate shadow rendering and occlusion handling based on 3D scene understanding.
Effective traffic simulation at crosswalks with dynamic pedestrian and vehicle interactions. |
System struggles with curved lanes, hills, and complex shadow scenarios.
Limited diversity in the appearance of generated pedestrians and vehicles. |
image animation, scene reconstruction, crowd simulation, rendering, street view |
2310.08530
Report |
UniPose: Detecting Any Keypoints |
Jie Yang, Ailing Zeng, Ruimao Zhang, Lei Zhang |
This work proposes a unified framework called UniPose to detect keypoints of
any articulated (e.g., human and animal), rigid, and soft objects via visual or
textual prompts for fine-grained vision understanding and manipulation.
A keypoint is a structure-aware, pixel-level, and compact representation of any
object, especially articulated objects. Existing fine-grained promptable tasks
mainly focus on object instance detection and segmentation but often fail to
identify fine-grained granularity and structured information of image and
instance, such as eyes, leg, paw, etc. Meanwhile, prompt-based keypoint
detection is still under-explored. To bridge the gap, we make the first attempt
to develop an end-to-end prompt-based keypoint detection framework called
UniPose to detect keypoints of any objects. As keypoint detection tasks are
unified in this framework, we can leverage 13 keypoint detection datasets with
338 keypoints across 1,237 categories over 400K instances to train a generic
keypoint detection model. UniPose can effectively align text-to-keypoint and
image-to-keypoint due to the mutual enhancement of textual and visual prompts
based on the cross-modality contrastive learning optimization objectives. Our
experimental results show that UniPose has strong fine-grained localization and
generalization abilities across image styles, categories, and poses. Based on
UniPose as a generalist keypoint detector, we hope it could serve fine-grained
visual perception, understanding, and generation. |
This paper introduces UniPose, a unified framework for detecting keypoints of any object (articulated, rigid, or soft) using visual or textual prompts. |
Existing methods are limited to specific object categories or struggle with unseen objects and keypoints. UniPose aims to overcome these limitations and enable fine-grained vision understanding and manipulation. |
UniPose utilizes a coarse-to-fine strategy: 1) it encodes visual and textual prompts, 2) decodes instance information (bounding boxes), and 3) decodes keypoint locations. It's trained on a unified dataset (UniKPT) combining 13 keypoint datasets across various object categories. |
UniPose achieves state-of-the-art results on unseen object and keypoint detection, surpassing previous methods by a large margin.
It outperforms expert keypoint detection models across 12 diverse datasets, demonstrating strong generalization ability.
UniPose exhibits impressive text-to-image similarity, exceeding CLIP's performance in distinguishing object categories and image styles. |
The performance on objects with novel topologies not included in the training data needs improvement.
The model can struggle with heavily occluded keypoints or objects with indistinct visual features. |
keypoint detection, prompt-based learning, multi-modality learning, category-agnostic pose estimation, open-vocabulary vision |
2310.08529
Report |
GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models |
Taoran Yi, Jiemin Fang, Junjie Wang, Guanjun Wu, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Qi Tian, Xinggang Wang |
In recent times, the generation of 3D assets from text prompts has shown
impressive results. Both 2D and 3D diffusion models can help generate decent 3D
objects based on prompts. 3D diffusion models have good 3D consistency, but
their quality and generalization are limited as trainable 3D data is expensive
and hard to obtain. 2D diffusion models enjoy strong abilities of
generalization and fine generation, but 3D consistency is hard to guarantee.
This paper attempts to bridge the power from the two types of diffusion models
via the recent explicit and efficient 3D Gaussian splatting representation. A
fast 3D object generation framework, named as GaussianDreamer, is proposed,
where the 3D diffusion model provides priors for initialization and the 2D
diffusion model enriches the geometry and appearance. Operations of noisy point
growing and color perturbation are introduced to enhance the initialized
Gaussians. Our GaussianDreamer can generate a high-quality 3D instance or 3D
avatar within 15 minutes on one GPU, much faster than previous methods, while
the generated instances can be directly rendered in real time. Demos and code
are available at https://taoranyi.com/gaussiandreamer/. |
This paper proposes GaussianDreamer, a fast text-to-3D generation method that bridges 3D and 2D diffusion models via Gaussian Splatting, achieving both 3D consistency and rich detail. |
Existing methods either lack 3D consistency (2D diffusion models) or struggle with complex prompts and fine details due to limited 3D data (3D diffusion models). |
1. Initialize 3D Gaussians from coarse 3D models generated by text-to-3D or text-to-motion diffusion models. 2. Enhance initialization with noisy point growing and color perturbation. 3. Optimize Gaussians using the Score Distillation Sampling loss with a 2D diffusion model. 4. Render in real time using Gaussian Splatting. |
Generates high-quality 3D objects and avatars with 3D consistency and fine details.
Significantly faster than previous methods (15 minutes on a single GPU).
Achieves real-time rendering without mesh conversion. |
Generated objects may have unsharp edges and unnecessary Gaussians.
Limited effectiveness in generating large-scale scenes. |
text-to-3d, 3d generation, diffusion models, gaussian splatting, real-time rendering |
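Step 2 of the GaussianDreamer method (noisy point growing and color perturbation) can be pictured as densifying a coarse point cloud near its surface and jittering the copied colors. The summary does not give the procedure, so everything below is one hypothetical interpretation: candidates are sampled uniformly in the bounding box, kept if they lie within `max_dist` of an existing point, and colored from the nearest point plus uniform noise. All names and parameters are illustrative.

```python
import random
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def squared_distance(p: Sequence[float], q: Sequence[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))


def grow_and_perturb(points: List[Vec3], colors: List[Vec3], num_candidates: int,
                     max_dist: float, color_noise: float,
                     seed: int = 0) -> Tuple[List[Vec3], List[Vec3]]:
    """Add points near the existing cloud, with jittered nearest-point colors."""
    rng = random.Random(seed)
    lo = [min(p[i] for p in points) for i in range(3)]
    hi = [max(p[i] for p in points) for i in range(3)]
    new_points: List[Vec3] = []
    new_colors: List[Vec3] = []
    for _ in range(num_candidates):
        cand = tuple(rng.uniform(lo[i], hi[i]) for i in range(3))
        # Brute-force nearest neighbour; a KD-tree would be used at scale.
        j = min(range(len(points)), key=lambda k: squared_distance(cand, points[k]))
        if squared_distance(cand, points[j]) <= max_dist ** 2:
            new_points.append(cand)
            # Copy the neighbour's color, perturb it, and clamp to [0, 1].
            new_colors.append(tuple(
                min(1.0, max(0.0, c + rng.uniform(-color_noise, color_noise)))
                for c in colors[j]))
    return list(points) + new_points, list(colors) + new_colors
```

The grown points would then initialize the 3D Gaussians that the 2D diffusion model refines in step 3.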
2310.08528
Report |
4D Gaussian Splatting for Real-Time Dynamic Scene Rendering |
Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, Xinggang Wang |
Representing and rendering dynamic scenes has been an important but
challenging task. Especially, to accurately model complex motions, high
efficiency is usually hard to guarantee. To achieve real-time dynamic scene
rendering while also enjoying high training and storage efficiency, we propose
4D Gaussian Splatting (4D-GS) as a holistic representation for dynamic scenes
rather than applying 3D-GS for each individual frame. In 4D-GS, a novel
explicit representation containing both 3D Gaussians and 4D neural voxels is
proposed. A decomposed neural voxel encoding algorithm inspired by HexPlane is
proposed to efficiently build Gaussian features from 4D neural voxels and then
a lightweight MLP is applied to predict Gaussian deformations at novel
timestamps. Our 4D-GS method achieves real-time rendering under high
resolutions, 82 FPS at an 800×800 resolution on an RTX 3090 GPU while
maintaining comparable or better quality than previous state-of-the-art
methods. More demos and code are available at
https://guanjunwu.github.io/4dgs/. |
This paper proposes 4D Gaussian Splatting (4D-GS), a novel approach for real-time dynamic scene rendering using an efficient Gaussian deformation field to model Gaussian motions and shape changes over time. |
Real-time rendering of dynamic scenes is crucial for applications like VR and AR, but accurately modeling complex motions with high efficiency is challenging. Existing methods struggle with either rendering speed or storage efficiency, especially for long input sequences. |
Represents scenes as 3D Gaussians and models their motion and deformation over time using a Gaussian deformation field network. This network consists of a spatial-temporal structure encoder (multi-resolution HexPlane inspired by K-Planes) to encode features of adjacent Gaussians and a lightweight multi-head decoder to predict Gaussian deformations at novel timestamps. Rendering is achieved through efficient differentiable splatting. |
Achieves real-time rendering on dynamic scenes, up to 82 FPS at 800x800 resolution on synthetic datasets and 30 FPS at 1352x1014 resolution on real datasets.
Maintains comparable or superior rendering quality compared to state-of-the-art methods while ensuring low storage consumption and fast convergence.
Demonstrates potential for 4D object tracking and editing due to its explicit representation of dynamic scenes. |
Modeling large motions and dramatic scene changes, especially in monocular settings, remains a challenge.
Handling urban-scale reconstructions with a massive number of 3D Gaussians requires a more compact algorithm. |
dynamic scene rendering, 4d gaussian splatting, real-time rendering, deformation fields, neural rendering |
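The HexPlane-style encoder mentioned in the 4D-GS method can be illustrated by factorizing a 4D field over (x, y, z, t) into six 2D planes, one per pair of axes, and querying each plane by bilinear interpolation. The sketch below is a simplification under stated assumptions: one scalar feature per grid vertex, a single resolution, coordinates normalized to [0, 1], and plane features combined by a product. The summary does not state how 4D-GS combines plane features, and the decoder that predicts deformations is omitted.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Plane = List[List[float]]  # plane[i][j]: scalar feature at grid vertex (i, j)

# The six axis pairs of (x, y, z, t): (0,1), (0,2), (0,3), (1,2), (1,3), (2,3).
AXIS_PAIRS: List[Tuple[int, int]] = list(combinations(range(4), 2))


def bilinear(plane: Plane, u: float, v: float) -> float:
    """Bilinearly interpolate a plane (at least 2x2) at (u, v) in [0, 1]^2."""
    rows, cols = len(plane), len(plane[0])
    x = min(max(u, 0.0), 1.0) * (rows - 1)
    y = min(max(v, 0.0), 1.0) * (cols - 1)
    i0, j0 = min(int(x), rows - 2), min(int(y), cols - 2)
    fx, fy = x - i0, y - j0
    top = plane[i0][j0] * (1 - fy) + plane[i0][j0 + 1] * fy
    bottom = plane[i0 + 1][j0] * (1 - fy) + plane[i0 + 1][j0 + 1] * fy
    return top * (1 - fx) + bottom * fx


def hexplane_feature(planes: Dict[Tuple[int, int], Plane],
                     xyzt: Sequence[float]) -> float:
    """Query all six planes at a 4D point and multiply the results."""
    feature = 1.0
    for a, b in AXIS_PAIRS:
        feature *= bilinear(planes[(a, b)], xyzt[a], xyzt[b])
    return feature
```

Storing six 2D grids instead of one dense 4D grid is what keeps memory low: cost grows with the square of the resolution rather than its fourth power.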
2310.08465
Report |
MotionDirector: Motion Customization of Text-to-Video Diffusion Models |
Rui Zhao, Yuchao Gu, Jay Zhangjie Wu, David Junhao Zhang, Jiawei Liu, Weijia Wu, Jussi Keppo, Mike Zheng Shou |
Large-scale pre-trained diffusion models have exhibited remarkable
capabilities in diverse video generations. Given a set of video clips of the
same motion concept, the task of Motion Customization is to adapt existing
text-to-video diffusion models to generate videos with this motion. For
example, generating a video with a car moving in a prescribed manner under
specific camera movements to make a movie, or a video illustrating how a bear
would lift weights to inspire creators. Adaptation methods have been developed
for customizing appearance such as subject or style, but remain unexplored for motion. It
is straightforward to extend mainstream adaptation methods for motion
customization, including full model tuning, parameter-efficient tuning of
additional layers, and Low-Rank Adaptations (LoRAs). However, the motion concept
learned by these methods is often coupled with the limited appearances in the
training videos, making it difficult to generalize the customized motion to
other appearances. To overcome this challenge, we propose MotionDirector, with
a dual-path LoRAs architecture to decouple the learning of appearance and
motion. Further, we design a novel appearance-debiased temporal loss to
mitigate the influence of appearance on the temporal training objective.
Experimental results show the proposed method can generate videos of diverse
appearances for the customized motions. Our method also supports various
downstream applications, such as the mixing of different videos with their
appearance and motion respectively, and animating a single image with
customized motions. Our code and model weights will be released. |
This paper introduces MotionDirector, a method for Motion Customization: adapting text-to-video diffusion models to generate videos with user-specified motion concepts, learned from one or multiple reference videos, while preserving appearance diversity. |
While existing text-to-video models allow for appearance customization, controlling the motion generated in videos remains an open challenge. This ability is crucial for users who desire specific motion styles in their generated videos. |
The proposed MotionDirector uses a dual-path Low-Rank Adaptation (LoRA) architecture. The spatial path learns appearance from single frames, while the temporal path learns motion from multiple frames, decoupling the two. An appearance-debiased temporal loss further improves motion learning by mitigating the influence of appearance. |
MotionDirector successfully generates diverse videos with customized motions, outperforming baselines and existing controllable generation methods on two benchmarks.
Human evaluations demonstrate a strong preference for MotionDirector in terms of motion fidelity and appearance diversity.
The method is efficient, requiring minimal additional parameters and training time compared to full model fine-tuning. |
Learning complex motions involving multiple subjects remains challenging.
Future work could explore decoupling motions of multiple subjects for more intricate scenarios. |
text-to-video generation, motion customization, diffusion models, low-rank adaptation (lora), controllable video generation |
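The building block behind MotionDirector's adapters is the generic low-rank update: a frozen weight W is augmented by a trainable product B·A of much smaller matrices. The sketch below shows only this generic LoRA update in plain Python. It does not reproduce the dual-path spatial/temporal arrangement or the appearance-debiased loss, and the function names and the `alpha / rank` scaling convention are assumptions taken from common LoRA practice rather than from this summary.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_weight(w: Matrix, a_down: Matrix, b_up: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / rank) * B @ A.

    w:      frozen weight, shape (d_out, d_in)
    a_down: trainable down-projection A, shape (rank, d_in)
    b_up:   trainable up-projection B, shape (d_out, rank)
    """
    rank = len(a_down)
    scale = alpha / rank
    delta = matmul(b_up, a_down)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


# With B initialised to zeros the adapted layer equals the frozen one,
# so training starts from the pre-trained model's behaviour.
w = [[1.0, 0.0], [0.0, 1.0]]
unchanged = lora_weight(w, [[1.0, 2.0]], [[0.0], [0.0]], alpha=2.0)
```

Only A and B are trained, which is why such adapters add few parameters compared to full fine-tuning.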
2310.08442
Report |
Debias the Training of Diffusion Models |
Hu Yu, Li Shen, Jie Huang, Man Zhou, Hongsheng Li, Feng Zhao |
Diffusion models have demonstrated compelling generation quality by
optimizing the variational lower bound through a simple denoising score
matching loss. In this paper, we provide theoretical evidence that the
prevailing practice of using a constant loss weight strategy in diffusion
models leads to biased estimation during the training phase. Simply optimizing
the denoising network to predict Gaussian noise with constant weighting may
hinder precise estimations of original images. To address the issue, we propose
an elegant and effective weighting strategy grounded in the theoretically
unbiased principle. Moreover, we conduct a comprehensive and systematic
exploration to dissect the inherent bias problem deriving from constant
weighting loss from the perspectives of its existence, impact and reasons.
These analyses are expected to advance our understanding and demystify the
inner workings of diffusion models. Through empirical evaluation, we
demonstrate that our proposed debiased estimation method significantly enhances
sample quality without the reliance on complex techniques, and exhibits
improved efficiency compared to the baseline method both in training and
sampling processes. |
This paper provides theoretical and empirical evidence that the common practice of using constant loss weights in diffusion models leads to biased estimation during training, hindering image quality. It proposes a debiased loss weight strategy to address this issue. |
Understanding and addressing the bias in diffusion model training is crucial as it directly impacts the quality of generated samples and the model's performance. |
The paper theoretically analyzes the impact of constant weighting on the estimation of the original image from noisy samples. It then proposes a debiased weighting strategy that assigns higher weights to later timesteps, improving the estimation of the original image. The effectiveness of this approach is validated through experiments on multiple datasets. |
The proposed debiased estimation method significantly improves sample quality compared to constant weighting.
The method exhibits improved efficiency in both training and sampling, achieving superior performance with fewer training iterations and sampling steps.
The analysis provides insights into the existence, impact, and underlying causes of biased estimation in diffusion models. |
The paper focuses on the standard Gaussian noise prediction objective and does not extensively explore other training targets.
Future work could investigate the impact of noise schedules on the bias problem. |
diffusion models, generative models, debiasing, image generation, loss function |
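One way to see why a constant weight on the noise-prediction error treats timesteps unevenly is the standard DDPM identity x0 = (x_t - sqrt(1 - ᾱ_t)·ε) / sqrt(ᾱ_t): an error of size e in the predicted noise becomes an error of size sqrt((1 - ᾱ_t) / ᾱ_t)·e in the implied estimate of the original image. The sketch below computes that factor for a linear beta schedule. The schedule values are common defaults assumed for illustration, and this is not the paper's weighting formula, which the summary does not give.

```python
import math
from typing import List


def alpha_bar_linear(num_steps: int = 1000, beta_start: float = 1e-4,
                     beta_end: float = 0.02) -> List[float]:
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    out: List[float] = []
    prod = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out


def x0_from_eps(x_t: float, eps_hat: float, alpha_bar_t: float) -> float:
    """Original-image estimate implied by a noise prediction."""
    return (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_hat) / math.sqrt(alpha_bar_t)


def x0_error_gain(alpha_bar_t: float) -> float:
    """Factor by which a noise-prediction error is scaled in the x0 estimate."""
    return math.sqrt((1.0 - alpha_bar_t) / alpha_bar_t)


# The gain grows monotonically with the timestep under this schedule.
gains = [x0_error_gain(a) for a in alpha_bar_linear()]
```

Because the gain is small at early timesteps and large at late ones, the same noise-prediction error costs far more image accuracy late in the forward process, which is the kind of imbalance a timestep-dependent weight can correct.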
2310.08094
Report |
SingleInsert: Inserting New Concepts from a Single Image into Text-to-Image Models for Flexible Editing |
Zijie Wu, Chaohui Yu, Zhen Zhu, Fan Wang, Xiang Bai |
Recent progress in text-to-image (T2I) models enables high-quality image
generation with flexible textual control. To utilize the abundant visual priors
in the off-the-shelf T2I models, a series of methods try to invert an image to
a proper embedding that aligns with the semantic space of the T2I model. However,
these image-to-text (I2T) inversion methods typically need multiple source
images containing the same concept or struggle with the imbalance between
editing flexibility and visual fidelity. In this work, we point out that the
critical problem lies in the foreground-background entanglement when learning
an intended concept, and propose a simple and effective baseline for
single-image I2T inversion, named SingleInsert. SingleInsert adopts a two-stage
scheme. In the first stage, we regulate the learned embedding to concentrate on
the foreground area without being associated with the irrelevant background. In
the second stage, we finetune the T2I model for better visual resemblance and
devise a semantic loss to prevent the language drift problem. With the proposed
techniques, SingleInsert excels in single concept generation with high visual
fidelity while allowing flexible editing. Additionally, SingleInsert can
perform single-image novel view synthesis and multiple concepts composition
without requiring joint training. To facilitate evaluation, we design an
editing prompt list and introduce a metric named Editing Success Rate (ESR) for
quantitative assessment of editing flexibility. Our project page is:
https://jarrentwu1031.github.io/SingleInsert-web/ |
This paper introduces SingleInsert, a single-image image-to-text inversion method for inserting novel concepts into pre-trained text-to-image models, enabling flexible editing. |
Existing methods struggle to balance editing flexibility and visual fidelity when learning concepts from single images due to foreground-background entanglement. |
SingleInsert employs a two-stage scheme: 1) **Inversion stage:** An image encoder learns to map a source image to a textual embedding, optimized using foreground and background losses to disentangle the concept from its background. 2) **Finetuning stage:** The text-to-image model is fine-tuned alongside the encoder, using the same losses and a semantic loss to prevent language drift and preserve class-specific priors. |
SingleInsert achieves high visual fidelity while enabling flexible editing of the learned concept, outperforming existing single-image and multi-image inversion methods.
The proposed method effectively disentangles the learned concept from the background, allowing for novel view synthesis from a single image.
SingleInsert enables composition of multiple independently learned concepts without joint training. |
SingleInsert may struggle with rare concepts due to limited prior knowledge in the base text-to-image model.
Synthesized novel viewpoints can be less accurate when the input image presents an extreme perspective of the concept. |
image-to-text inversion, text-to-image generation, concept learning, novel view synthesis, concept composition |
2310.08092
Report |
Consistent123: Improve Consistency for One Image to 3D Object Synthesis |
Haohan Weng, Tianyu Yang, Jianan Wang, Yu Li, Tong Zhang, C. L. Philip Chen, Lei Zhang |
Large image diffusion models enable novel view synthesis with high quality
and excellent zero-shot capability. However, such models based on
image-to-image translation have no guarantee of view consistency, limiting the
performance for downstream tasks like 3D reconstruction and image-to-3D
generation. To empower consistency, we propose Consistent123 to synthesize
novel views simultaneously by incorporating additional cross-view attention
layers and the shared self-attention mechanism. The proposed attention
mechanism improves the interaction across all synthesized views, as well as the
alignment between the condition view and novel views. In the sampling stage,
such architecture supports simultaneously generating an arbitrary number of
views while training at a fixed length. We also introduce a progressive
classifier-free guidance strategy to achieve the trade-off between texture and
geometry for synthesized object views. Qualitative and quantitative experiments
show that Consistent123 outperforms baselines in view consistency by a large
margin. Furthermore, we demonstrate a significant improvement of Consistent123
on varying downstream tasks, showing its great potential in the 3D generation
field. The project page is available at consistent-123.github.io. |
Consistent123, a novel image-to-3D model that synthesizes consistent multiple views simultaneously, is proposed. |
Existing image-to-image translation based diffusion models for novel view synthesis lack view consistency, hindering their use in downstream tasks like 3D reconstruction. |
Consistent123 incorporates cross-view attention layers and a shared self-attention mechanism into a denoising U-Net to align synthesized views. It employs progressive classifier-free guidance for a trade-off between texture and geometry. |
Significantly improved view consistency over baselines on Objaverse, GSO, and RTMV datasets.
Supports synthesizing arbitrary numbers of views while being trained at a fixed length.
Demonstrates substantial improvement in downstream tasks like 3D reconstruction and image-to-3D generation. |
The model requires a relatively large number of views to achieve high consistency.
Further exploration is needed to optimize the trade-off between view consistency and image quality.
Future work includes exploring the impact of different attention mechanisms and sampling techniques. |
novel view synthesis, view consistency, 3d object synthesis, diffusion models, cross-view attention |
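Classifier-free guidance, which Consistent123 applies progressively, combines an unconditional and a conditional noise prediction as ε = ε_uncond + s·(ε_cond - ε_uncond), where s is the guidance scale. A progressive variant changes s over the sampling steps. The summary does not say how the scale changes, so the linear schedule, its direction, and its endpoints below are assumptions made for illustration.

```python
from typing import List, Sequence


def cfg_combine(eps_uncond: Sequence[float], eps_cond: Sequence[float],
                scale: float) -> List[float]:
    """Standard classifier-free guidance on flattened noise predictions."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def linear_guidance_schedule(step: int, num_steps: int,
                             start_scale: float, end_scale: float) -> float:
    """Hypothetical schedule: interpolate the scale linearly over sampling."""
    frac = step / max(num_steps - 1, 1)
    return start_scale + frac * (end_scale - start_scale)


# Example: the scale applied at each of five sampling steps.
scales = [linear_guidance_schedule(s, 5, 2.0, 10.0) for s in range(5)]
```

A scale of 0 returns the unconditional prediction and a scale of 1 returns the conditional one; larger values push further along the conditional direction.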
2310.07771
Report |
DrivingDiffusion: Layout-Guided multi-view driving scene video generation with latent diffusion model |
Xiaofan Li, Yifu Zhang, Xiaoqing Ye |
With the increasing popularity of autonomous driving based on the powerful
and unified bird's-eye-view (BEV) representation, a demand for high-quality and
large-scale multi-view video data with accurate annotation is urgently
required. However, such large-scale multi-view data is hard to obtain due to
expensive collection and annotation costs. To alleviate the problem, we propose
a spatial-temporal consistent diffusion framework DrivingDiffusion, to generate
realistic multi-view videos controlled by 3D layout. There are three challenges
when synthesizing multi-view videos given a 3D layout: How to keep 1)
cross-view consistency and 2) cross-frame consistency? 3) How to guarantee the
quality of the generated instances? Our DrivingDiffusion solves the problem by
cascading the multi-view single-frame image generation step, the single-view
video generation step shared by multiple cameras, and post-processing that can
handle long video generation. In the multi-view model, the consistency of
multi-view images is ensured by information exchange between adjacent cameras.
In the temporal model, we mainly query the information that needs attention in
subsequent frame generation from the multi-view images of the first frame. We
also introduce the local prompt to effectively improve the quality of generated
instances. In post-processing, we further enhance the cross-view consistency of
subsequent frames and extend the video length by employing temporal sliding
window algorithm. Without any extra cost, our model can generate large-scale
realistic multi-camera driving videos in complex urban scenes, fueling the
downstream driving tasks. The code will be made publicly available. |
DrivingDiffusion, a novel spatial-temporal consistent diffusion framework, generates realistic multi-view driving videos controlled by 3D layout. |
High-quality, annotated multi-view video data is crucial for autonomous driving but expensive to collect. DrivingDiffusion offers a solution by generating such data, supporting BEV perception model development. |
The method uses a cascaded approach with a multi-view image generation model, a single-view temporal model (shared across cameras), and post-processing for consistency and length. Key components include a 3D layout controller, cross-view/frame attention, consistency loss, and local prompt for instance quality. |
DrivingDiffusion achieves state-of-the-art video synthesis on nuScenes, outperforming existing methods in FID and FVD metrics.
The generated data, when used for augmenting training data, demonstrably improves BEV perception tasks, evidenced by increased NDS and decreased mAOE.
Ablation studies confirm the importance of the consistency module and local prompt for overall quality and instance-level performance, respectively. |
Future work involves exploring memory-efficient end-to-end multi-view video generation.
Incorporating NeRF-based approaches is planned to further enhance spatial and temporal consistency. |
multi-view video generation, autonomous driving, layout-guided synthesis, latent diffusion model, data augmentation |
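The DrivingDiffusion entry above mentions extending video length with a temporal sliding window. The following is a minimal sketch of how overlapping frame windows might be scheduled. The function name, the window and stride parameters, and the rule that the last window ends on the final frame are all assumptions made for illustration, not details taken from the paper.

```python
def sliding_windows(total_frames, window, stride):
    """Return (start, end) frame ranges that cover total_frames.

    Hypothetical helper: consecutive windows overlap by (window - stride)
    frames so that each new window can be conditioned on frames that
    were already generated.
    """
    if window <= 0 or stride <= 0 or stride > window:
        raise ValueError("need 0 < stride <= window")
    if total_frames <= window:
        return [(0, total_frames)]
    starts = list(range(0, total_frames - window, stride))
    starts.append(total_frames - window)  # final window ends on the last frame
    return [(s, s + window) for s in starts]


if __name__ == "__main__":
    for start, end in sliding_windows(total_frames=20, window=8, stride=6):
        print(f"generate frames {start}..{end - 1}")
```

A smaller stride gives more overlap between windows, which plausibly helps consistency at the cost of more generation passes.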
2310.07726
Report |
Warfare: Breaking the Watermark Protection of AI-Generated Content |
Guanlin Li, Yifei Chen, Jie Zhang, Jiwei Li, Shangwei Guo, Tianwei Zhang |
AI-Generated Content (AIGC) is gaining great popularity, with many emerging
commercial services and applications. These services leverage advanced
generative models, such as latent diffusion models and large language models,
to generate creative content (e.g., realistic images and fluent sentences) for
users. The usage of such generated content needs to be highly regulated, as the
service providers need to ensure the users do not violate the usage policies
(e.g., abuse for commercialization, generating and distributing unsafe
content). A promising solution to achieve this goal is watermarking, which adds
unique and imperceptible watermarks on the content for service verification and
attribution. Numerous watermarking approaches have been proposed recently.
However, in this paper, we show that an adversary can easily break these
watermarking mechanisms. Specifically, we consider two possible attacks. (1)
Watermark removal: the adversary can easily erase the embedded watermark from
the generated content and then use it freely bypassing the regulation of the
service provider. (2) Watermark forging: the adversary can create illegal
content with forged watermarks from another user, causing the service provider
to make wrong attributions. We propose Warfare, a unified methodology to
achieve both attacks in a holistic way. The key idea is to leverage a
pre-trained diffusion model for content processing and a generative adversarial
network for watermark removal or forging. We evaluate Warfare on different
datasets and embedding setups. The results prove that it can achieve high
success rates while maintaining the quality of the generated content. Compared
to existing diffusion model-based attacks, Warfare is 5,050~11,000x faster. |
Introduces Warfare, a novel method to break the watermark protection of AI-generated content, enabling both watermark removal and forging. |
Highlights a critical vulnerability in current AI-generated content protection mechanisms, emphasizing the need for more robust watermarking techniques. |
Employs a two-stage approach, first training a generator on watermarked images and then using it to either remove or forge watermarks based on specific bit manipulation. |
Warfare achieves high bit accuracy in both watermark removal (up to 99.98%) and forging (up to 99.11%).
Demonstrates effectiveness across different watermarking schemes, including those embedded in latent spaces of diffusion models.
Shows potential for few-shot learning, achieving significant results with limited new data. |
Zero-shot performance, while promising, requires further improvement.
Limited evaluation on real-world datasets beyond LSUN. |
watermark removal, watermark forging, ai-generated content, diffusion models, adversarial attacks |
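The Warfare entry above reports results as bit accuracy. As a point of reference, bit accuracy is commonly computed as the fraction of positions where the bit string decoded from an image matches a reference bit string. The sketch below assumes that definition; the function name is hypothetical, and how the paper interprets the score for removal versus forging is not specified here.

```python
def bit_accuracy(extracted, reference):
    """Fraction of positions where two equal-length bit sequences agree."""
    if len(extracted) != len(reference):
        raise ValueError("bit strings must have the same length")
    if not reference:
        raise ValueError("bit strings must be non-empty")
    matches = sum(1 for a, b in zip(extracted, reference) if a == b)
    return matches / len(reference)


if __name__ == "__main__":
    decoded = [1, 0, 1, 1, 0, 0, 1, 0]   # bits decoded from an image (made-up values)
    target = [1, 0, 0, 1, 0, 0, 1, 0]    # reference watermark (made-up values)
    print(bit_accuracy(decoded, target))
```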
2310.07704
Report |
Ferret: Refer and Ground Anything Anywhere at Any Granularity |
Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, Yinfei Yang |
We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of
understanding spatial referring of any shape or granularity within an image and
accurately grounding open-vocabulary descriptions. To unify referring and
grounding in the LLM paradigm, Ferret employs a novel and powerful hybrid
region representation that integrates discrete coordinates and continuous
features jointly to represent a region in the image. To extract the continuous
features of versatile regions, we propose a spatial-aware visual sampler, adept
at handling varying sparsity across different shapes. Consequently, Ferret can
accept diverse region inputs, such as points, bounding boxes, and free-form
shapes. To bolster the desired capability of Ferret, we curate GRIT, a
comprehensive refer-and-ground instruction tuning dataset including 1.1M
samples that contain rich hierarchical spatial knowledge, with 95K hard
negative data to promote model robustness. The resulting model not only
achieves superior performance in classical referring and grounding tasks, but
also greatly outperforms existing MLLMs in region-based and
localization-demanded multimodal chatting. Our evaluations also reveal a
significantly improved capability of describing image details and a remarkable
alleviation in object hallucination. Code and data will be available at
https://github.com/apple/ml-ferret |
Ferret is a multimodal large language model (MLLM) that combines discrete coordinates and continuous visual features for fine-grained spatial understanding within images, enabling both referring to and grounding of open-vocabulary descriptions. |
Existing methods struggle to unify referring and grounding in one framework, represent versatile region types beyond bounding boxes, and ensure open-vocabulary and robust performance. Ferret addresses these limitations by integrating referring/grounding in an MLLM and supporting diverse region inputs. |
Ferret utilizes a hybrid region representation that combines discrete coordinates with continuous visual features extracted by a novel spatial-aware visual sampler. It is trained on GRIT, a new dataset curated for refer-and-ground instruction tuning and enhanced robustness. |
Ferret outperforms previous MLLMs in conventional referring and grounding tasks, including referring object classification, phrase grounding, and grounded image captioning.
Ferret demonstrates superior performance in region-based multimodal chatting, excelling in tasks like referring description, referring reasoning, and grounding in conversation.
Ferret exhibits strong robustness against object hallucination, significantly outperforming other MLLMs on the POPE benchmark. |
While Ferret supports various region inputs, its evaluation primarily focuses on bounding boxes for benchmarking purposes.
Future work includes enabling Ferret to output segmentation masks for even finer-grained region localization. |
multimodal large language models, referring expression comprehension, visual grounding, spatial understanding, object hallucination |
2310.07702
Report |
ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models |
Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, Ying Shan |
In this work, we investigate the capability of generating images from
pre-trained diffusion models at much higher resolutions than the training image
sizes. In addition, the generated images should have arbitrary image aspect
ratios. When generating images directly at a higher resolution, 1024 x 1024,
with the pre-trained Stable Diffusion using training images of resolution 512 x
512, we observe persistent problems of object repetition and unreasonable
object structures. Existing works for higher-resolution generation, such as
attention-based and joint-diffusion approaches, cannot well address these
issues. As a new perspective, we examine the structural components of the U-Net
in diffusion models and identify the crucial cause as the limited perception
field of convolutional kernels. Based on this key observation, we propose a
simple yet effective re-dilation that can dynamically adjust the convolutional
perception field during inference. We further propose the dispersed convolution
and noise-damped classifier-free guidance, which can enable
ultra-high-resolution image generation (e.g., 4096 x 4096). Notably, our
approach does not require any training or optimization. Extensive experiments
demonstrate that our approach can address the repetition issue well and achieve
state-of-the-art performance on higher-resolution image synthesis, especially
in texture details. Our work also suggests that a pre-trained diffusion model
trained on low-resolution images can be directly used for high-resolution
visual generation without further tuning, which may provide insights for future
research on ultra-high-resolution image and video synthesis. |
This paper proposes a novel method, named ScaleCrafter, for generating high-resolution images from pre-trained diffusion models without requiring any further training or optimization. |
Existing text-to-image diffusion models are limited in resolution, and generating images at higher resolutions often leads to object repetition and unreasonable object structures. This method addresses these limitations, paving the way for high-resolution image synthesis using pre-trained models. |
The authors analyze the structural components of diffusion models and identify the limited perception field of convolutional kernels as the primary cause for object repetition. They propose a dynamic "re-dilation" technique to adjust the convolutional perception field during inference, along with dispersed convolution and noise-damped classifier-free guidance for ultra-high-resolution generation. |
ScaleCrafter effectively addresses the object repetition issue in higher-resolution image synthesis.
The method demonstrates state-of-the-art performance on various diffusion models, including Stable Diffusion versions and a text-to-video model, achieving superior results compared to direct inference and attention-based adaptation.
ScaleCrafter generates images with superior texture details compared to a pre-trained super-resolution model, highlighting the potential of leveraging pre-trained models for high-resolution synthesis. |
The method focuses on adapting pre-trained models and does not explore the impact of training diffusion models directly on high-resolution images.
Further investigation is needed to explore the trade-off between computational cost and generation quality when applying the method to even higher resolutions. |
diffusion models, high-resolution image synthesis, text-to-image generation, re-dilation, perception field |
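The ScaleCrafter entry above attributes object repetition to the limited perception field of convolutional kernels and addresses it by re-dilating them at inference time. The sketch below illustrates only the underlying idea on a 1D signal: reusing the same kernel weights with a larger dilation widens the span of input each output sees. It is a simplified illustration under that assumption, not the paper's re-dilation procedure, and both function names are hypothetical.

```python
def receptive_field(kernel_size, dilation):
    """Number of input samples spanned by one output of a dilated kernel."""
    return (kernel_size - 1) * dilation + 1


def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1D cross-correlation with kernel taps spaced `dilation` apart."""
    span = receptive_field(len(kernel), dilation)
    return [
        sum(w * signal[i + j * dilation] for j, w in enumerate(kernel))
        for i in range(len(signal) - span + 1)
    ]


if __name__ == "__main__":
    signal = [1, 2, 3, 4, 5]
    kernel = [1, 0, -1]  # same weights in both calls; no retraining involved
    print(dilated_conv1d(signal, kernel, dilation=1), receptive_field(3, 1))
    print(dilated_conv1d(signal, kernel, dilation=2), receptive_field(3, 2))
```

With dilation 2, the three-tap kernel covers five input samples instead of three, which is the sense in which the perception field grows without changing any weights.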
2310.07697
Report |
ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation |
Bo Peng, Xinyuan Chen, Yaohui Wang, Chaochao Lu, Yu Qiao |
Recent works have successfully extended large-scale text-to-image models to
the video domain, producing promising results but at a high computational cost
and requiring a large amount of video data. In this work, we introduce
ConditionVideo, a training-free approach to text-to-video generation based on
the provided condition, video, and input text, by leveraging the power of
off-the-shelf text-to-image generation methods (e.g., Stable Diffusion).
ConditionVideo generates realistic dynamic videos from random noise or given
scene videos. Our method explicitly disentangles the motion representation into
condition-guided and scenery motion components. To this end, the ConditionVideo
model is designed with a UNet branch and a control branch. To improve temporal
coherence, we introduce sparse bi-directional spatial-temporal attention
(sBiST-Attn). The 3D control network extends the conventional 2D controlnet
model, aiming to strengthen conditional generation accuracy by additionally
leveraging the bi-directional frames in the temporal domain. Our method
exhibits superior performance in terms of frame consistency, clip score, and
conditional accuracy, outperforming other compared methods. |
Presents ConditionVideo, a training-free method for generating realistic and temporally consistent videos from text descriptions, guided by optional reference videos and various input conditions (e.g., pose, depth, segmentation). |
Addresses limitations of existing text-to-video generation methods that are computationally expensive, require large training datasets, and struggle to generate dynamic backgrounds. |
Leverages pre-trained text-to-image diffusion models (Stable Diffusion, ControlNet) with a novel pipeline that disentangles motion representation into condition-guided and scenery components. Introduces sparse bi-directional spatial-temporal attention (sBiST-Attn) and a 3D control branch for improved temporal consistency and conditional accuracy. |
Generates videos with realistic dynamic backgrounds, unlike previous training-free methods.
Achieves superior temporal consistency and condition alignment compared to existing methods, as demonstrated by quantitative metrics (frame consistency, CLIP score, pose accuracy).
Demonstrates the effectiveness of the proposed sBiST-Attn and 3D control branch through ablation studies. |
Flickering observed in videos generated with sparse conditions (e.g., pose), potentially addressed by denser control inputs and additional temporal structures in future work.
Exploration of hierarchical sampling for long video generation as future work. |
video generation, text-to-video, diffusion models, conditional generation, training-free |
2310.07653
Report |
Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models |
Zeqiang Lai, Xizhou Zhu, Jifeng Dai, Yu Qiao, Wenhai Wang |
The revolution of artificial intelligence content generation has been rapidly
accelerated with the booming text-to-image (T2I) diffusion models. Within just
two years of development, it was unprecedentedly of high-quality, diversity,
and creativity that the state-of-the-art models could generate. However, a
prevalent limitation persists in the effective communication with these popular
T2I models, such as Stable Diffusion, using natural language descriptions. This
typically makes an engaging image hard to obtain without expertise in prompt
engineering with complex word compositions, magic tags, and annotations.
Inspired by the recently released DALLE3 - a T2I model directly built-in
ChatGPT that talks human language, we revisit the existing T2I systems
endeavoring to align human intent and introduce a new task - interactive text
to image (iT2I), where people can interact with LLM for interleaved
high-quality image generation/edit/refinement and question answering with
stronger images and text correspondences using natural language. In addressing
the iT2I problem, we present a simple approach that augments LLMs for iT2I with
prompting techniques and off-the-shelf T2I models. We evaluate our approach for
iT2I in a variety of common-used scenarios under different LLMs, e.g., ChatGPT,
LLAMA, Baichuan, and InternLM. We demonstrate that our approach could be a
convenient and low-cost way to introduce the iT2I ability for any existing LLMs
and any text-to-image models without any training while bringing little
degradation on LLMs' inherent capabilities in, e.g., question answering and
code generation. We hope this work could draw broader attention and provide
inspiration for boosting user experience in human-machine interactions
alongside the image quality of the next-generation T2I systems. |
This paper introduces the concept of interactive text-to-image (iT2I), enabling multi-turn image generation/editing through natural language conversations with large language models (LLMs). |
Existing text-to-image models often require expertise in prompt engineering, making it difficult for general users to obtain desired images. iT2I aims to bridge this gap by providing a more user-friendly and interactive interface using natural language. |
The proposed approach, Mini-DALLE3, prompts LLMs to generate intermediate textual descriptions within special tags. These descriptions are then refined and used to generate images using pre-trained text-to-image models. The system also incorporates hierarchical content consistency control and leverages off-the-shelf T2I models for multi-turn generation. |
Prompting LLMs for iT2I does not significantly impact their inherent abilities like question answering and code generation.
Commercial LLMs (ChatGPT, GPT-4, Claude) effectively generate images with corresponding textual responses, demonstrating successful augmentation for iT2I.
Mini-DALLE3 shows promise in various iT2I use cases, including single/multi-turn image generation and interactive storytelling. |
Evaluation on open-source LLMs shows less satisfactory results, with some struggling to generate images.
Future work could focus on improving performance with open-source LLMs and exploring more sophisticated prompt refinement techniques. |
text-to-image generation, interactive image generation, large language models, prompt engineering, human-computer interaction |
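The Mini-DALLE3 entry above describes prompting an LLM to emit image descriptions inside special tags, which are then passed to a text-to-image model. The sketch below shows one way such a reply could be split into descriptions and display text. The `[img]...[/img]` tag syntax, the placeholder format, and the function name are assumptions for illustration; the paper's actual tag format is not given in this entry.

```python
import re

# Hypothetical tag syntax; the real system's tags may differ.
IMAGE_TAG = re.compile(r"\[img\](.*?)\[/img\]", re.DOTALL)


def split_response(llm_text):
    """Return (descriptions, text) where each tag is replaced by a placeholder."""
    descriptions = [d.strip() for d in IMAGE_TAG.findall(llm_text)]
    counter = iter(range(len(descriptions)))
    text = IMAGE_TAG.sub(lambda _match: f"[image {next(counter)}]", llm_text)
    return descriptions, text


if __name__ == "__main__":
    reply = "Here is your cat. [img] a fluffy orange cat on a sofa [/img] Want changes?"
    prompts, shown = split_response(reply)
    print(prompts)  # each description would go to an off-the-shelf T2I model
    print(shown)
```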
2310.07419
Report |
Multi-Concept T2I-Zero: Tweaking Only The Text Embeddings and Nothing Else |
Hazarapet Tunanyan, Dejia Xu, Shant Navasardyan, Zhangyang Wang, Humphrey Shi |
Recent advances in text-to-image diffusion models have enabled the
photorealistic generation of images from text prompts. Despite the great
progress, existing models still struggle to generate compositional
multi-concept images naturally, limiting their ability to visualize human
imagination. While several recent works have attempted to address this issue,
they either introduce additional training or adopt guidance at inference time.
In this work, we consider a more ambitious goal: natural multi-concept
generation using a pre-trained diffusion model, and with almost no extra cost.
To achieve this goal, we identify the limitations in the text embeddings used
for the pre-trained text-to-image diffusion models. Specifically, we observe
concept dominance and non-localized contribution that severely degrade
multi-concept generation performance. We further design a minimal low-cost
solution that overcomes the above issues by tweaking (not re-training) the text
embeddings for more realistic multi-concept text-to-image generation. Our
Correction by Similarities method tweaks the embedding of concepts by
collecting semantic features from most similar tokens to localize the
contribution. To avoid mixing features of concepts, we also apply Cross-Token
Non-Maximum Suppression, which excludes the overlap of contributions from
different concepts. Experiments show that our approach outperforms previous
methods in text-to-image, image manipulation, and personalization tasks,
despite not introducing additional training or inference costs to the diffusion
steps. |
This paper proposes a novel zero-shot method for multi-concept text-to-image generation using pre-trained diffusion models without additional training or inference-time optimization. The method tweaks the text embeddings to address concept dominance and non-localized contribution issues. |
Existing text-to-image diffusion models struggle to generate compositional multi-concept images due to limitations in text embeddings. Existing solutions require additional training or inference-time guidance, leading to high computational cost. This method offers a low-cost alternative by focusing on text embedding manipulation. |
The method consists of two techniques: 1) Correction by Similarities, which aggregates semantic features from the most similar tokens to localize each concept's contribution, and 2) Cross-Token Non-Maximum Suppression, which excludes overlapping contributions from different concepts to avoid feature mixing. |
Outperforms existing methods in multi-concept text-to-image generation despite not introducing additional training or inference cost.
Enables realistic multi-concept image manipulation by improving contribution localization.
Successfully extends single-concept personalization methods to multi-concept scenarios. |
The effectiveness of the method heavily relies on the quality of the pre-trained text encoder.
Further improvement in concept disentanglement is needed for more complex multi-concept compositions. |
text-to-image generation, diffusion models, multi-concept generation, text embeddings, zero-shot learning |
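The Multi-Concept T2I-Zero entry above describes Cross-Token Non-Maximum Suppression as excluding overlapping contributions from different concepts. One plausible reading, sketched below, keeps at each spatial location only the concept with the largest contribution score and zeroes the rest. The data layout (one 2D score map per concept), the tie-breaking rule, and the function name are assumptions for illustration, not the paper's implementation.

```python
def cross_token_nms(maps):
    """Keep, at every location, only the concept with the highest score.

    maps: dict mapping a concept name to a 2D list of scores; all maps
    must share the same shape. Ties go to the first concept in the dict.
    """
    names = list(maps)
    rows = len(maps[names[0]])
    cols = len(maps[names[0]][0])
    out = {name: [[0.0] * cols for _ in range(rows)] for name in names}
    for r in range(rows):
        for c in range(cols):
            winner = max(names, key=lambda name: maps[name][r][c])
            out[winner][r][c] = maps[winner][r][c]
    return out


if __name__ == "__main__":
    scores = {
        "cat": [[0.9, 0.2], [0.6, 0.1]],
        "dog": [[0.1, 0.8], [0.4, 0.3]],
    }
    for concept, kept in cross_token_nms(scores).items():
        print(concept, kept)
```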
2310.07222
Report |
Uni-paint: A Unified Framework for Multimodal Image Inpainting with Pretrained Diffusion Model |
Shiyuan Yang, Xiaodong Chen, Jing Liao |
Recently, text-to-image denoising diffusion probabilistic models (DDPMs) have
demonstrated impressive image generation capabilities and have also been
successfully applied to image inpainting. However, in practice, users often
require more control over the inpainting process beyond textual guidance,
especially when they want to composite objects with customized appearance,
color, shape, and layout. Unfortunately, existing diffusion-based inpainting
methods are limited to single-modal guidance and require task-specific
training, hindering their cross-modal scalability. To address these
limitations, we propose Uni-paint, a unified framework for multimodal
inpainting that offers various modes of guidance, including unconditional,
text-driven, stroke-driven, exemplar-driven inpainting, as well as a
combination of these modes. Furthermore, our Uni-paint is based on pretrained
Stable Diffusion and does not require task-specific training on specific
datasets, enabling few-shot generalizability to customized images. We have
conducted extensive qualitative and quantitative evaluations that show our
approach achieves comparable results to existing single-modal methods while
offering multimodal inpainting capabilities not available in other methods.
Code will be available at https://github.com/ysy31415/unipaint. |
This paper presents Uni-paint, a unified framework for multimodal image inpainting based on a pretrained diffusion model, supporting unconditional, text-driven, stroke-driven, and exemplar-driven inpainting within a single framework. |
Existing diffusion-based inpainting methods are limited to single-modal guidance and often require task-specific training, hindering their cross-modal scalability and generalization. Uni-paint addresses these limitations by offering flexible and versatile inpainting capabilities. |
The authors finetune a pretrained Stable Diffusion model unconditionally on masked images, enabling context-aware inpainting. They leverage the textual interface (cross-attention) for semantic guidance (text and exemplar) and the spatial interface (image blending) for stroke guidance. A masked attention control mechanism is introduced to restrict inpainted content within the unknown region. |
Uni-paint achieves comparable results to existing single-modal methods in unconditional and text-driven inpainting while not requiring large-scale training.
For exemplar-driven inpainting, Uni-paint shows superior performance in capturing customized object details compared to baselines.
Uni-paint effectively performs stroke-driven inpainting and allows for flexible combinations of text, stroke, and exemplar guidance. |
Uni-paint may struggle to harmonize large gaps between exemplar and input images, leading to unnatural stitching.
Conflicting multi-modal guidance (e.g., stroke and exemplar) can pose challenges in finding a balance between different modalities. |
image inpainting, diffusion models, multimodal guidance, stable diffusion, few-shot learning |
2310.06968
Report |
ObjectComposer: Consistent Generation of Multiple Objects Without Fine-tuning |
Alec Helbling, Evan Montoya, Duen Horng Chau |
Recent text-to-image generative models can generate high-fidelity images from
text prompts. However, these models struggle to consistently generate the same
objects in different contexts with the same appearance. Consistent object
generation is important to many downstream tasks like generating comic book
illustrations with consistent characters and setting. Numerous approaches
attempt to solve this problem by extending the vocabulary of diffusion models
through fine-tuning. However, even lightweight fine-tuning approaches can be
prohibitively expensive to run at scale and in real-time. We introduce a method
called ObjectComposer for generating compositions of multiple objects that
resemble user-specified images. Our approach is training-free, leveraging the
abilities of preexisting models. We build upon the recent BLIP-Diffusion model,
which can generate images of single objects specified by reference images.
ObjectComposer enables the consistent generation of compositions containing
multiple specific objects simultaneously, all without modifying the weights of
the underlying models. |
Introduces ObjectComposer, a training-free method for generating image compositions with multiple user-specified objects using pre-existing diffusion models. |
Existing text-to-image models struggle with consistent object generation across different contexts, limiting their use in applications like comic book illustration. Fine-tuning approaches, while effective, are computationally expensive. |
Leverages BLIP-Diffusion for object generation and a vanilla diffusion model for background composition. Employs cross-attention maps to guide object placement and blends diffusion processes of individual objects and the background. |
Generates images containing user-specified objects while adhering to text prompts.
Maintains consistent object appearance across different backgrounds and compositions.
Outperforms vanilla Stable Diffusion in preserving object fidelity to reference images. |
Object appearance can sometimes deviate from the reference image.
Relies on accurate object localization through cross-attention maps, which might not always be perfect. |
image generation, object composition, diffusion models, blip-diffusion, cross-attention |
2310.06904
Report |
Mitigating stereotypical biases in text to image generative systems |
Piero Esposito, Parmida Atighehchian, Anastasis Germanidis, Deepti Ghadiyaram |
State-of-the-art generative text-to-image models are known to exhibit social
biases and over-represent certain groups like people of perceived lighter skin
tones and men in their outcomes. In this work, we propose a method to mitigate
such biases and ensure that the outcomes are fair across different groups of
people. We do this by finetuning text-to-image models on synthetic data that
varies in perceived skin tones and genders constructed from diverse text
prompts. These text prompts are constructed from multiplicative combinations of
ethnicities, genders, professions, age groups, and so on, resulting in diverse
synthetic data. Our diversity finetuned (DFT) model improves the group fairness
metric by 150% for perceived skin tone and 97.7% for perceived gender. Compared
to baselines, DFT models generate more people with perceived darker skin tone
and more women. To foster open research, we will release all text prompts and
code to generate training images. |
This paper proposes a method for mitigating social biases in text-to-image models by fine-tuning them on synthetically generated data diverse in perceived skin tones, genders, age groups, and professions. |
State-of-the-art text-to-image models often exhibit social biases, over-representing certain demographics. This work aims to address these biases and promote fairness in generated content. |
The authors generate diverse text prompts, synthesize images from these prompts using an off-the-shelf model (SDXL), and fine-tune existing models (Stable Diffusion and Stable Diffusion XL) on this data. |
The diversity fine-tuned (DFT) models significantly improve group fairness metrics for perceived skin tone (up to 150% improvement) and gender (up to 97.7% improvement).
The study finds that training on a balanced distribution of perceived skin tones leads to the most diverse outputs.
Subjective evaluations indicate that fine-tuning on synthetic data does not negatively impact image quality and may even improve it in some cases. |
The study acknowledges limitations such as inheriting issues from the model used for synthetic data generation (SDXL) and occasional reduction in photorealism.
Future work includes addressing other forms of bias and exploring the application of this technique to video generation models. |
social bias, text-to-image generation, fairness, synthetic data, diversity |
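The bias-mitigation entry above builds its training prompts from multiplicative combinations of attributes such as ethnicity, gender, profession, and age group. The sketch below shows that combinatorial construction with a Cartesian product. The attribute values, the prompt template, and the function name are placeholders chosen for illustration, not the authors' actual lists.

```python
from itertools import product


def build_prompts(ethnicities, genders, professions, ages):
    """Return one prompt per combination of the four attribute lists."""
    return [
        f"{age} {ethnicity} {gender} {profession}, portrait photo"
        for ethnicity, gender, profession, age in product(
            ethnicities, genders, professions, ages
        )
    ]


if __name__ == "__main__":
    # Placeholder attribute values, not taken from the paper.
    prompts = build_prompts(
        ethnicities=["East Asian", "Black"],
        genders=["woman", "man"],
        professions=["doctor", "teacher"],
        ages=["young", "elderly"],
    )
    print(len(prompts))
    print(prompts[0])
```

Because the prompt count is the product of the list sizes, adding a single value to any attribute list adds a whole new slice of combinations.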
2310.06836
Report |
What Does Stable Diffusion Know about the 3D Scene? |
Guanqi Zhan, Chuanxia Zheng, Weidi Xie, Andrew Zisserman |
Recent advances in generative models like Stable Diffusion enable the
generation of highly photo-realistic images. Our objective in this paper is to
probe the diffusion network to determine to what extent it 'understands'
different properties of the 3D scene depicted in an image. To this end, we make
the following contributions: (i) We introduce a protocol to evaluate whether
features of an off-the-shelf diffusion model encode a number of physical
'properties' of the 3D scene, by training discriminative classifiers on the
features for these properties. The probes are applied on datasets of real
images with annotations for the property. (ii) We apply this protocol to
properties covering scene geometry, scene material, support relations,
lighting, and view dependent measures. (iii) We find that features from Stable
Diffusion are good for discriminative learning of a number of properties,
including scene geometry, support relations, shadows and depth, but less
performant for occlusion and material. (iv) We also apply the probes to other
networks trained at large-scale, including DINO, CLIP and VQGAN, and find that
DINOv2 has a similar performance to Stable Diffusion, while outperforming
DINOv1, CLIP and VQGAN. |
This paper presents a protocol to evaluate the extent to which diffusion models and other large-scale image networks understand 3D scene properties. |
Understanding what these networks learn about 3D scenes can provide insights into their workings, enable new applications using their features, help detect synthetic images, and guide further training for improved 3D modeling. |
The protocol involves extracting features from different layers and timesteps of the networks, training linear classifiers to predict specific 3D scene properties from these features, and evaluating their performance on real image datasets. |
Stable Diffusion and DINOv2 demonstrate a good understanding of scene geometry, support relations, lighting, and depth.
They are less performant in predicting material and occlusion, indicating areas for potential improvement.
Stable Diffusion and DINOv2 generally outperform other large-scale networks tested, including OpenCLIP, DINOv1, and VQGAN. |
The study primarily focuses on linear probing, which might not fully capture the networks' capabilities.
Future work could explore more complex properties, non-symmetric question formulations, and combinations of features from different layers and timesteps. |
3d physical scene understanding, stable diffusion, representation learning, generative models, linear probing |
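The linear-probing step described in the entry above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' protocol: the toy 2-D clusters stand in for frozen network features, the probe is a plain logistic regression, and the names `make_toy_features`, `train_linear_probe`, and `probe_accuracy` are hypothetical.

```python
import math
import random


def make_toy_features(n, seed):
    """Hypothetical stand-in for frozen features: two well-separated 2-D clusters."""
    rng = random.Random(seed)
    features, labels = [], []
    for i in range(n):
        label = i % 2  # alternate labels so the classes are balanced
        center = 2.0 if label == 1 else -2.0
        features.append([center + rng.uniform(-1, 1), center + rng.uniform(-1, 1)])
        labels.append(label)
    return features, labels


def train_linear_probe(features, labels, lr=0.1, epochs=200):
    """Fit a logistic-regression probe with batch gradient descent.

    Only the probe weights are trained; the features are treated as fixed.
    """
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, features, labels):
    """Fraction of examples whose sign of w.x + b matches the label."""
    correct = 0
    for x, y in zip(features, labels):
        prediction = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(prediction == y)
    return correct / len(labels)


train_x, train_y = make_toy_features(40, seed=0)
test_x, test_y = make_toy_features(20, seed=1)
w, b = train_linear_probe(train_x, train_y)
print(probe_accuracy(w, b, test_x, test_y))
```

High held-out accuracy of such a probe is read as evidence that the property is linearly decodable from the features; it says nothing about how the network uses that information.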
2310.06389
Report |
Learning Stackable and Skippable LEGO Bricks for Efficient, Reconfigurable, and Variable-Resolution Diffusion Modeling |
Huangjie Zheng, Zhendong Wang, Jianbo Yuan, Guanghan Ning, Pengcheng He, Quanzeng You, Hongxia Yang, Mingyuan Zhou |
Diffusion models excel at generating photo-realistic images but come with
significant computational costs in both training and sampling. While various
techniques address these computational challenges, a less-explored issue is
designing an efficient and adaptable network backbone for iterative refinement.
Current options like U-Net and Vision Transformer often rely on
resource-intensive deep networks and lack the flexibility needed for generating
images at variable resolutions or with a smaller network than used in training.
This study introduces LEGO bricks, which seamlessly integrate Local-feature
Enrichment and Global-content Orchestration. These bricks can be stacked to
create a test-time reconfigurable diffusion backbone, allowing selective
skipping of bricks to reduce sampling costs and generate higher-resolution
images than the training data. LEGO bricks enrich local regions with an MLP and
transform them using a Transformer block while maintaining a consistent
full-resolution image across all bricks. Experimental results demonstrate that
LEGO bricks enhance training efficiency, expedite convergence, and facilitate
variable-resolution image generation while maintaining strong generative
performance. Moreover, LEGO significantly reduces sampling time compared to
other methods, establishing it as a valuable enhancement for diffusion models. |
This paper introduces "LEGO bricks", a novel network unit for diffusion models that integrates local feature enrichment and global content orchestration. |
Diffusion models excel at generating realistic images but suffer from high computational costs during training and sampling. This work addresses these limitations by designing a more efficient and flexible network backbone. |
LEGO bricks, built upon Transformer blocks and trained on image patches, are stacked to form a reconfigurable backbone. This allows selective skipping of bricks during sampling, reducing computational cost while enabling variable-resolution image generation. |
LEGO significantly reduces training time and FLOPs compared to U-Net and ViT-based diffusion models while maintaining competitive FID scores.
The LEGO framework enables a 60% reduction in sampling time compared to DiT without sacrificing generation quality.
LEGO can generate coherent images at resolutions much higher than the training data, demonstrated by generating panoramas from models trained on ImageNet. |
The paper primarily focuses on progressive growth and refinement for stacking LEGO bricks, leaving other strategies unexplored.
The current work doesn't explore text-guided image generation with LEGO bricks. |
diffusion models, image generation, generative models, transformer, efficient training |
2310.06347
Report |
JointNet: Extending Text-to-Image Diffusion for Dense Distribution Modeling |
Jingyang Zhang, Shiwei Li, Yuanxun Lu, Tian Fang, David McKinnon, Yanghai Tsin, Long Quan, Yao Yao |
We introduce JointNet, a novel neural network architecture for modeling the
joint distribution of images and an additional dense modality (e.g., depth
maps). JointNet is extended from a pre-trained text-to-image diffusion model,
where a copy of the original network is created for the new dense modality
branch and is densely connected with the RGB branch. The RGB branch is locked
during network fine-tuning, which enables efficient learning of the new
modality distribution while maintaining the strong generalization ability of
the large-scale pre-trained diffusion model. We demonstrate the effectiveness
of JointNet by using RGBD diffusion as an example and through extensive
experiments, showcasing its applicability in a variety of applications,
including joint RGBD generation, dense depth prediction, depth-conditioned
image generation, and coherent tile-based 3D panorama generation. |
This paper presents JointNet, a novel neural network architecture for modeling the joint distribution of images and an additional dense modality (e.g., depth maps) by extending a pre-trained text-to-image diffusion model. |
Existing methods for joint distribution modeling either rely on limited labeled dense image pairs or struggle to retain the generalization ability of pre-trained models. JointNet addresses these limitations, offering high-quality joint generation without sacrificing performance in the original RGB domain. |
JointNet creates a copy of the pre-trained diffusion network for the dense label branch and connects it densely with the RGB branch. It leverages the 'output preserving principle' to ensure smooth adaptation to the new objective by fixing the original RGB branch during training and fine-tuning only the dense label branch and connections. |
JointNet preserves the RGB generation quality of the base model, achieving comparable FID, IS, and CLIP similarity scores.
It performs on par with MiDaS on monocular depth estimation, as measured by RMSE.
JointNet enables coherent tile-based joint data generation, as evidenced by its low intra-LPIPS loss and efficient generation of high-quality RGBD panoramas. |
The inference time of JointNet is doubled due to maintaining two branches.
Directly extending JointNet to support more modalities could lead to further increases in time consumption. |
diffusion models, joint distribution modeling, dense prediction, rgbd generation, panorama generation |
2310.06313
Report |
Advancing Pose-Guided Image Synthesis with Progressive Conditional Diffusion Models |
Fei Shen, Hu Ye, Jun Zhang, Cong Wang, Xiao Han, Wei Yang |
Recent work has showcased the significant potential of diffusion models in
pose-guided person image synthesis. However, owing to the inconsistency in pose
between the source and target images, synthesizing an image with a distinct
pose, relying exclusively on the source image and target pose information,
remains a formidable challenge. This paper presents Progressive Conditional
Diffusion Models (PCDMs) that incrementally bridge the gap between person
images under the target and source poses through three stages. Specifically, in
the first stage, we design a simple prior conditional diffusion model that
predicts the global features of the target image by mining the global alignment
relationship between pose coordinates and image appearance. Then, the second
stage establishes a dense correspondence between the source and target images
using the global features from the previous stage, and an inpainting
conditional diffusion model is proposed to further align and enhance the
contextual features, generating a coarse-grained person image. In the third
stage, we propose a refining conditional diffusion model to utilize the
coarsely generated image from the previous stage as a condition, achieving
texture restoration and enhancing fine-detail consistency. The three-stage
PCDMs work progressively to generate the final high-quality and high-fidelity
synthesized image. Both qualitative and quantitative results demonstrate the
consistency and photorealism of our proposed PCDMs under challenging
scenarios. The code and model will be available at
https://github.com/tencent-ailab/PCDMs. |
This paper proposes Progressive Conditional Diffusion Models (PCDMs), a novel three-stage pipeline for pose-guided person image synthesis that incrementally bridges the gap between source and target poses. |
Synthesizing realistic images with distinct poses from a source image remains a significant challenge due to pose inconsistencies. PCDMs address this by progressively predicting global features, establishing dense correspondences, and refining textures. |
PCDMs consist of: 1) a prior conditional diffusion model predicting global target features from pose and source image, 2) an inpainting diffusion model establishing dense correspondences for a coarse image, and 3) a refining diffusion model enhancing texture and detail consistency. |
PCDMs outperform state-of-the-art methods on DeepFashion and Market-1501 datasets in SSIM and LPIPS, demonstrating improved image quality and realism.
User studies confirm that PCDMs generate more realistic and visually appealing person images compared to existing methods.
PCDMs demonstrate strong applicability in downstream tasks, significantly improving person re-identification performance on Market-1501. |
The use of three diffusion models increases computational costs and inference time.
Future work should explore more efficient methods to reduce computational overhead without sacrificing quality. |
image synthesis, diffusion models, pose-guided generation, person image generation, deep learning |
2310.06311
Report |
Improving Compositional Text-to-image Generation with Large Vision-Language Models |
Song Wen, Guian Fang, Renrui Zhang, Peng Gao, Hao Dong, Dimitris Metaxas |
Recent advancements in text-to-image models, particularly diffusion models,
have shown significant promise. However, compositional text-to-image models
frequently encounter difficulties in generating high-quality images that
accurately align with input texts describing multiple objects, variable
attributes, and intricate spatial relationships. To address this limitation, we
employ large vision-language models (LVLMs) for multi-dimensional assessment of
the alignment between generated images and their corresponding input texts.
Utilizing this assessment, we fine-tune the diffusion model to enhance its
alignment capabilities. During the inference phase, an initial image is
produced using the fine-tuned diffusion model. The LVLM is then employed to
pinpoint areas of misalignment in the initial image, which are subsequently
corrected using the image editing algorithm until no further misalignments are
detected by the LVLM. The resultant image is consequently more closely aligned
with the input text. Our experimental results validate that the proposed
methodology significantly improves text-image alignment in compositional image
generation, particularly with respect to object number, attribute binding,
spatial relationships, and aesthetic quality. |
This paper introduces a novel framework leveraging Large Vision-Language Models (LVLMs) to enhance the quality of compositional image generation, particularly addressing the limitations of existing models in accurately aligning images with complex textual descriptions. |
Compositional text-to-image models often struggle to generate images that accurately reflect the input text, particularly concerning object number, attribute binding, spatial relationships, and aesthetic quality. This work aims to improve the alignment between generated images and complex textual descriptions. |
The proposed method comprises three core components: (1) LVLM-based Evaluation: LVLMs assess the alignment between generated images and input texts by analyzing answers to questions formulated from the text. (2) Model Fine-tuning: Diffusion models are fine-tuned using Reward Feedback Learning (ReFL) based on the LVLM-derived evaluation metrics. (3) LVLM-guided Editing: During inference, LVLMs identify misalignments, guiding image-editing algorithms to iteratively refine the generated image until it aligns with the input text. |
LVLMs effectively evaluate image-text alignment by analyzing the accuracy of answers to questions derived from the input text.
Fine-tuning diffusion models with LVLM-based evaluation metrics significantly improves the alignment between generated images and input texts.
The LVLM-guided editing process effectively corrects misalignments in generated images, resulting in images that are more faithful to the input text. |
The effectiveness of the method is limited by the performance of current LVLMs.
Future work will explore the use of more advanced LVLMs and image editing algorithms. |
text-to-image generation, compositional image generation, large vision-language models (lvlms), reward feedback learning (refl), image editing |
2310.06214
Report |
CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding |
Eslam Mohamed Bakr, Mohamed Ayman, Mahmoud Ahmed, Habib Slim, Mohamed Elhoseiny |
3D visual grounding is the ability to localize objects in 3D scenes
conditioned by utterances. Most existing methods devote the referring head to
localize the referred object directly, causing failure in complex scenarios. In
addition, they do not illustrate how and why the network reaches the final
decision. In this paper, we address the question: "Can we design an
interpretable 3D visual grounding framework that has the potential to mimic the
human perception system?" To this end, we formulate the 3D visual grounding
problem as a sequence-to-sequence (Seq2Seq) task by first predicting a chain of
anchors and then the final target. Interpretability not only improves the
overall performance but also helps us identify failure cases. Following the
chain of thoughts approach enables us to decompose the referring task into
interpretable intermediate steps, boosting the performance and making our
framework extremely data-efficient. Moreover, our proposed framework can be
easily integrated into any existing architecture. We validate our approach
through comprehensive experiments on the Nr3D, Sr3D, and ScanRefer benchmarks
and show consistent performance gains compared to existing methods without
requiring manually annotated data. Furthermore, our proposed framework, dubbed
CoT3DRef, is significantly data-efficient: on the Sr3D dataset, when
trained on only 10% of the data, it matches the SOTA performance of models trained on
the entire dataset. The code is available at
https://eslambakr.github.io/cot3dref.github.io/. |
This paper presents CoT3DRef, a novel 3D visual grounding framework that formulates the task as a sequence-to-sequence problem. By predicting a chain of anchor objects before localizing the final target, CoT3DRef aims to mimic human perception and improve interpretability. |
Existing 3D visual grounding methods fail to provide insights into their decision-making process and struggle in complex scenarios. CoT3DRef addresses these limitations by introducing interpretability and mimicking human-like reasoning. |
CoT3DRef employs a Chain-of-Thoughts decoder that leverages a Pathway module to predict the logical order of anchors extracted from the input utterance. A parallel referring head first localizes anchors and the target, which are then refined by the CoT decoder in a sequential manner. |
CoT3DRef achieves state-of-the-art results on Nr3D, Sr3D, and ScanRefer benchmarks without requiring manually annotated data.
The framework is highly data-efficient: trained on only 10% of the Sr3D data, it matches state-of-the-art methods trained on the full dataset.
Visualizing attention maps provides insights into the model's reasoning process, aiding in the identification of failure cases. |
The pseudo-label module, while effective, limits performance gains on the Nr3D dataset due to the inherent ambiguity in free-form language.
The Pathway module does not currently handle scenarios with multiple valid logical paths. |
3d visual grounding, chain-of-thoughts, interpretability, data efficiency, sequence-to-sequence |
2310.05986
Report |
The Unreasonable Effectiveness of Linear Prediction as a Perceptual Metric |
Daniel Severo, Lucas Theis, Johannes Ballé |
We show how perceptual embeddings of the visual system can be constructed at
inference-time with no training data or deep neural network features. Our
perceptual embeddings are solutions to a weighted least squares (WLS) problem,
defined at the pixel-level, and solved at inference-time, that can capture
global and local image characteristics. The distance in embedding space is used
to define a perceptual similarity metric which we call LASI: Linear
Autoregressive Similarity Index. Experiments on full-reference image quality
assessment datasets show LASI performs competitively with learned deep feature
based methods like LPIPS (Zhang et al., 2018) and PIM (Bhardwaj et al., 2020),
at a similar computational cost to hand-crafted methods such as MS-SSIM (Wang
et al., 2003). We found that increasing the dimensionality of the embedding
space consistently reduces the WLS loss while increasing performance on
perceptual tasks, at the cost of increasing the computational complexity. LASI
is fully differentiable, scales cubically with the number of embedding
dimensions, and can be parallelized at the pixel-level. A Maximum
Differentiation (MAD) competition (Wang & Simoncelli, 2008) between LASI and
LPIPS shows that both methods are capable of finding failure points for the
other, suggesting these metrics can be combined. |
This paper introduces LASI (Linear Autoregressive Similarity Index), a data-free perceptual similarity metric that constructs image embeddings at inference time without needing training data or deep neural networks. |
This is important because current perceptual similarity metrics often rely on expensive training data or complex deep learning models, hindering their applicability. LASI offers a lightweight and efficient alternative. |
LASI uses a weighted least squares (WLS) formulation inspired by lossless compression algorithms. It computes pixel-level embeddings at inference time by fitting the weights that best predict each pixel from its neighbors, capturing both global and local image characteristics. |
LASI achieves competitive performance with learned methods (LPIPS, PIM) on the BAPPS dataset for both 2-AFC and JND tasks.
Increasing LASI's embedding dimensionality improves both its predictive performance and its scores on perceptual tasks.
MAD competition analysis reveals that LASI and LPIPS exhibit distinct failure modes, suggesting potential for combining these metrics. |
The generalization ability of LASI beyond BAPPS and its applicability to larger images requires further investigation.
Future work could explore the usefulness of LASI embeddings in other computer vision tasks beyond perceptual similarity. |
perceptual similarity, image quality assessment, data-free methods, weighted least squares, self-supervision |
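The weighted least squares idea in the entry above can be illustrated with a small sketch. It is a deliberate simplification and not the authors' implementation: it works on a 1-D signal instead of image pixels, the exponential `decay` weighting is an assumption, and the names `solve_linear_system` and `wls_embedding` are hypothetical. The fitted prediction weights play the role of the embedding at the target position.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def wls_embedding(signal, target, order=2, decay=0.9):
    """Fit weights predicting each earlier sample from its `order` predecessors.

    Samples closer to `target` receive larger weight (decay ** distance).
    The normal equations (X^T W X) w = X^T W y are accumulated and solved.
    """
    gram = [[0.0] * order for _ in range(order)]
    moment = [0.0] * order
    for s in range(order, target):
        context = signal[s - order:s]
        weight = decay ** (target - s)
        for i in range(order):
            moment[i] += weight * context[i] * signal[s]
            for j in range(order):
                gram[i][j] += weight * context[i] * context[j]
    return solve_linear_system(gram, moment)


# Toy signal obeying x[t] = 0.3 * x[t-2] + 0.5 * x[t-1].
signal = [1.0, 2.0]
for _ in range(18):
    signal.append(0.5 * signal[-1] + 0.3 * signal[-2])
print(wls_embedding(signal, target=len(signal)))
```

On a signal that follows an exact linear recurrence, the fitted weights recover the recurrence coefficients; on real images the fit is only approximate, and a perceptual distance would then be computed between the embeddings of two images.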
2310.05922
Report |
FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing |
Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren, Yanping Xie, Juan-Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, Sen He |
Text-to-video editing aims to edit the visual appearance of a source video
conditional on textual prompts. A major challenge in this task is to ensure
that all frames in the edited video are visually consistent. Most recent works
apply advanced text-to-image diffusion models to this task by inflating 2D
spatial attention in the U-Net into spatio-temporal attention. Although
temporal context can be added through spatio-temporal attention, it may
introduce some irrelevant information for each patch and therefore cause
inconsistency in the edited video. In this paper, for the first time, we
introduce optical flow into the attention module in the diffusion model's U-Net
to address the inconsistency issue for text-to-video editing. Our method,
FLATTEN, enforces the patches on the same flow path across different frames to
attend to each other in the attention module, thus improving the visual
consistency in the edited videos. Additionally, our method is training-free and
can be seamlessly integrated into any diffusion-based text-to-video editing
methods and improve their visual consistency. Experiment results on existing
text-to-video editing benchmarks show that our proposed method achieves the new
state-of-the-art performance. In particular, our method excels in maintaining
the visual consistency in the edited videos. |
This paper introduces FLATTEN, a flow-guided attention mechanism for text-to-video editing that improves visual consistency by using optical flow to guide attention in diffusion models. |
It addresses the challenge of maintaining visual consistency across edited frames, a key limitation of existing text-to-video editing methods. |
The method inflates a pre-trained text-to-image diffusion model, integrates FLATTEN into the U-Net, uses DDIM inversion for latent noise estimation, and employs DDIM sampling with feature injection for video generation. |
Achieves state-of-the-art performance on text-to-video editing benchmarks, demonstrating superior visual consistency.
Improves visual consistency when integrated into other diffusion-based video editing methods.
Outperforms competing methods in user studies evaluating semantic alignment, visual consistency, and motion preservation. |
Limited ability for dramatic structure editing due to reliance on optical flow from the source video.
Runtime, while comparable to other methods, has room for optimization. |
text-to-video editing, visual consistency, diffusion models, optical flow, attention mechanism |
2310.05917
Report |
Drivable Avatar Clothing: Faithful Full-Body Telepresence with Dynamic Clothing Driven by Sparse RGB-D Input |
Donglai Xiang, Fabian Prada, Zhe Cao, Kaiwen Guo, Chenglei Wu, Jessica Hodgins, Timur Bagautdinov |
Clothing is an important part of human appearance but challenging to model in
photorealistic avatars. In this work we present avatars with dynamically moving
loose clothing that can be faithfully driven by sparse RGB-D inputs as well as
body and face motion. We propose a Neural Iterative Closest Point (N-ICP)
algorithm that can efficiently track the coarse garment shape given sparse
depth input. Given the coarse tracking results, the input RGB-D images are then
remapped to texel-aligned features, which are fed into the drivable avatar
models to faithfully reconstruct appearance details. We evaluate our method
against recent image-driven synthesis baselines, and conduct a comprehensive
analysis of the N-ICP algorithm. We demonstrate that our method can generalize
to a novel testing environment, while preserving the ability to produce
high-fidelity and faithful clothing dynamics and appearance. |
This paper introduces a novel framework for creating photorealistic full-body avatars with dynamic clothing, driven by sparse RGB-D input, enabling faithful telepresence. |
Faithfully capturing clothing dynamics is crucial for realistic avatars and telepresence, addressing the limitations of pose-driven methods that struggle to represent the nuances of loose clothing. |
The framework employs a two-step process: (1) Neural Iterative Closest Point (N-ICP) algorithm for coarse clothing surface tracking, and (2) Texel-conditioned clothed avatars for high-fidelity geometry and appearance reconstruction from sparse RGB-D input and N-ICP tracking results. |
The N-ICP algorithm demonstrates faster convergence than classical optimization solvers.
The full framework outperforms various baselines, including DVA, NeRF-based methods, and sensing-based techniques, in reconstructing dynamic clothing.
The method generalizes to a novel environment with different backgrounds and illumination, preserving appearance style and capturing unseen motion. |
The current model is person- and garment-specific.
The approach cannot handle drastic clothing deformations like topology changes. |
telepresence, photorealistic avatars, clothing capture, neural iterative closest point, texel-conditioned avatars |
2310.05916
Report |
Interpreting CLIP's Image Representation via Text-Based Decomposition |
Yossi Gandelsman, Alexei A. Efros, Jacob Steinhardt |
We investigate the CLIP image encoder by analyzing how individual model
components affect the final representation. We decompose the image
representation as a sum across individual image patches, model layers, and
attention heads, and use CLIP's text representation to interpret the summands.
Interpreting the attention heads, we characterize each head's role by
automatically finding text representations that span its output space, which
reveals property-specific roles for many heads (e.g. location or shape). Next,
interpreting the image patches, we uncover an emergent spatial localization
within CLIP. Finally, we use this understanding to remove spurious features
from CLIP and to create a strong zero-shot image segmenter. Our results
indicate that a scalable understanding of transformer models is attainable and
can be used to repair and improve models. |
This paper investigates and interprets the internal representations of the CLIP image encoder, particularly the ViT-based variant (CLIP-ViT), by decomposing the representation into contributions from individual model components (layers, attention heads, and image patches). |
Understanding the inner workings of CLIP is crucial because of its wide adoption and strong performance in various downstream tasks like image classification, segmentation, and generation. However, the complex representations learned by CLIP remain largely opaque. |
The authors leverage the residual and attention mechanisms of the ViT architecture to decompose the image representation. They first identify that the last few attention layers contribute most significantly to the representation. Then, they propose an algorithm called TextSpan, which uses a greedy approach to find text descriptions that explain the output space of individual attention heads, revealing specialized roles like shape, color, and location for different heads. Lastly, they decompose the representation by image tokens (patches) to visualize the contribution of image regions to specific text concepts. |
The last few attention layers in CLIP-ViT contribute most significantly to the final image representation.
TextSpan reveals that individual attention heads specialize in capturing specific image properties like shape, color, counting, and location.
Decomposing the representation by image tokens yields a state-of-the-art zero-shot semantic image segmenter. |
The study primarily focuses on direct effects of model components, leaving the analysis of indirect effects and information flow between layers for future work.
Not all attention heads have a clear semantic role assigned by TextSpan, potentially due to limitations in the initial pool of text descriptions or the collaborative nature of certain heads. |
clip, vision transformer (vit), explainable ai (xai), image segmentation, zero-shot learning |
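The greedy selection described for TextSpan in the entry above can be sketched roughly as follows. This is a simplified reading of the idea and not the authors' code: at each step it picks the text direction along which the head outputs vary most, then projects that direction out of both the outputs and the remaining text embeddings. The function names and the tiny 3-D vectors are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out(vectors, direction):
    """Remove from every vector its component along `direction`."""
    norm_sq = dot(direction, direction)
    if norm_sq == 0.0:
        return [list(v) for v in vectors]
    return [
        [vi - dot(v, direction) / norm_sq * di for vi, di in zip(v, direction)]
        for v in vectors
    ]


def greedy_text_basis(head_outputs, text_embeddings, basis_size):
    """Return indices of text embeddings chosen greedily by explained variance."""
    outputs = [list(v) for v in head_outputs]
    texts = [list(t) for t in text_embeddings]
    chosen = []
    for _ in range(basis_size):
        best_index, best_variance = None, 0.0
        for j, text in enumerate(texts):
            if j in chosen:
                continue
            scores = [dot(o, text) for o in outputs]
            mean = sum(scores) / len(scores)
            variance = sum((s - mean) ** 2 for s in scores) / len(scores)
            if variance > best_variance:
                best_index, best_variance = j, variance
        if best_index is None:  # nothing left to explain
            break
        direction = texts[best_index]
        chosen.append(best_index)
        outputs = project_out(outputs, direction)
        texts = project_out(texts, direction)
    return chosen


# Head outputs vary mostly along axis 0, then along axis 1.
head_outputs = [[3, 1, 0], [-3, -1, 0], [3, -1, 0], [-3, 1, 0]]
text_embeddings = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
print(greedy_text_basis(head_outputs, text_embeddings, basis_size=2))
```

In practice the candidate pool would be a large set of text descriptions embedded by CLIP's text encoder, and the selected descriptions are then inspected to characterize the head.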
2310.05873
Report |
Implicit Concept Removal of Diffusion Models |
Zhili Liu, Kai Chen, Yifan Zhang, Jianhua Han, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung, James Kwok |
Text-to-image (T2I) diffusion models often inadvertently generate unwanted
concepts such as watermarks and unsafe images. These concepts, termed
"implicit concepts", can be unintentionally learned during training and then
generated uncontrollably during inference. Existing removal methods still
struggle to eliminate implicit concepts, primarily because they depend on
the model's ability to recognize concepts it actually cannot discern. To
address this, we utilize the intrinsic geometric characteristics of implicit
concepts and present Geom-Erasing, a novel concept removal method based on
geometric-driven control. Specifically, once an unwanted implicit concept is
identified, we integrate the existence and geometric information of the concept
into text prompts with the help of an accessible classifier or detector model.
Subsequently, the model is optimized to identify and disentangle this
information, which is adopted as negative prompts for generation. Moreover, we
introduce Implicit Concept Dataset (ICD), a novel image-text dataset imbued
with three typical implicit concepts (i.e., QR codes, watermarks, and text),
reflecting real-life situations where implicit concepts are easily injected.
Geom-Erasing effectively mitigates the generation of implicit concepts,
achieving state-of-the-art results on the Inappropriate Image Prompts (I2P) and
our challenging Implicit Concept Dataset (ICD) benchmarks. |
This paper introduces the Implicit Concept Dataset (ICD) and proposes Geom-Erasing, a novel method for removing implicit concepts (e.g., watermarks, unsafe content) from text-to-image diffusion models. |
Implicit concepts are difficult to remove with existing methods because they are unintentionally learned during training and cannot be reliably controlled with text prompts. This hinders personalized model fine-tuning and can result in the generation of undesired or even harmful content. |
Geom-Erasing leverages the geometric properties of implicit concepts by using a classifier or detector to identify concept existence and location within an image. This information is then integrated into the text prompts, guiding the diffusion model to learn and disentangle the implicit concept. |
Geom-Erasing effectively removes implicit concepts like watermarks and unsafe content from pre-trained Stable Diffusion models.
It outperforms existing erasure methods in personalized fine-tuning scenarios, achieving lower FID and Implicit Concept Ratio (ICR) on ICD datasets.
The method demonstrates that incorporating geometric information significantly improves the model's ability to recognize and eliminate implicit concepts. |
Geom-Erasing currently relies on external detectors for geometric information, which could be replaced with more general localizers in future work.
Further exploration is needed to understand the impact of adding geometric information as negative prompts, which currently improves concept removal but slightly degrades image quality (FID). |
implicit concept removal, text-to-image diffusion models, geometric guidance, personalized fine-tuning, stable diffusion |
2310.05737
Report |
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation |
Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, Alexander G. Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A. Ross, Lu Jiang |
While Large Language Models (LLMs) are the dominant models for generative
tasks in language, they do not perform as well as diffusion models on image and
video generation. To effectively use LLMs for visual generation, one crucial
component is the visual tokenizer that maps pixel-space inputs to discrete
tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a
video tokenizer designed to generate concise and expressive tokens for both
videos and images using a common token vocabulary. Equipped with this new
tokenizer, we show that LLMs outperform diffusion models on standard image and
video generation benchmarks including ImageNet and Kinetics. In addition, we
demonstrate that our tokenizer surpasses the previously top-performing video
tokenizer on two more tasks: (1) video compression comparable to the
next-generation video codec (VVC) according to human evaluations, and (2)
learning effective representations for action recognition tasks. |
MAGVIT-v2 is a novel video tokenizer that leverages lookup-free quantization and architectural advancements to tokenize images and videos using a shared vocabulary. |
A good visual tokenizer is crucial for language models (LMs) to excel in image and video generation, bridging the gap between pixel-based representations and discrete token-based processing inherent to LLMs. |
The paper introduces (1) Lookup-free quantization (LFQ) that eliminates the need for embedding lookup in VQ-VAEs, enabling learning of larger vocabularies beneficial for LMs. (2) Architectural improvements to the tokenizer, including causal 3D CNNs for joint image-video tokenization and modifications for better temporal modeling. |
MAGVIT-v2 significantly outperforms previous state-of-the-art video tokenizers in visual generation tasks on ImageNet and Kinetics datasets.
In human rater studies, MAGVIT-v2 achieves better video compression quality than MAGVIT and HEVC, and is on par with VVC.
The tokens generated by MAGVIT-v2 prove to be effective representations for video understanding, leading to improved performance on action recognition benchmarks. |
While MAGVIT-v2 shows promising results, further research is needed to adapt it for efficient CPU execution, aligning it with standard video codecs.
Exploring the full potential of text-to-image and text-to-video generation with MAGVIT-v2 is left as future work. |
video tokenization, language models, visual generation, video compression, action recognition |
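The lookup-free quantization (LFQ) idea in the MAGVIT-v2 entry above can be illustrated with a minimal sketch. It assumes the common binary formulation in which each latent dimension is quantized independently by its sign, so the token id is read directly off the bits and no embedding table is searched. The function names and the bit ordering are hypothetical choices for illustration, not taken from the authors' code.

```python
def lfq_quantize(latent):
    """Quantize each latent dimension independently to -1.0 or +1.0 by its sign."""
    return [1.0 if z > 0 else -1.0 for z in latent]


def lfq_token_index(latent):
    """Map the sign pattern to an integer token id: bit i is set when latent[i] > 0.

    No nearest-neighbour search over a codebook is needed; a d-dimensional
    latent implies a vocabulary of 2**d tokens.
    """
    return sum(1 << i for i, z in enumerate(latent) if z > 0)


# Toy 4-dimensional encoder output, i.e. an implied vocabulary of 2**4 = 16 tokens.
latent = [0.7, -1.2, 0.1, -0.4]
codes = lfq_quantize(latent)
token = lfq_token_index(latent)
print(codes, token)
```

Because the vocabulary grows as 2**d with the latent dimension d, enlarging it only adds bits rather than rows to an embedding table, which is consistent with the entry's claim that LFQ makes larger vocabularies practical.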
2310.05718
Report |
EdVAE: Mitigating Codebook Collapse with Evidential Discrete Variational Autoencoders |
Gulcin Baykal, Melih Kandemir, Gozde Unal |
Codebook collapse is a common problem in training deep generative models with
discrete representation spaces like Vector Quantized Variational Autoencoders
(VQ-VAEs). We observe that the same problem arises for the alternatively
designed discrete variational autoencoders (dVAEs) whose encoder directly
learns a distribution over the codebook embeddings to represent the data. We
hypothesize that using the softmax function to obtain a probability
distribution causes the codebook collapse by assigning overconfident
probabilities to the best matching codebook elements. In this paper, we propose
a novel way to incorporate evidential deep learning (EDL) instead of softmax to
combat the codebook collapse problem of dVAE. We evidentially monitor the
significance of attaining the probability distribution over the codebook
embeddings, in contrast to softmax usage. Our experiments using various
datasets show that our model, called EdVAE, mitigates codebook collapse while
improving the reconstruction performance, and enhances the codebook usage
compared to dVAE and VQ-VAE based models. Our code can be found at
https://github.com/ituvisionlab/EdVAE . |
This paper proposes EdVAE, an extension of dVAE using evidential deep learning (EDL) to address codebook collapse by incorporating uncertainty awareness in codebook embedding selection. |
Codebook collapse, the under-utilization of codebook embeddings, is a significant problem in discrete representation learning with dVAEs, limiting their expressiveness and performance. |
EdVAE replaces the softmax layer in dVAE’s encoder with an evidential mechanism. It models a distribution over Categorical distributions (representing codebook selections) using a Dirichlet distribution, learning to select embeddings based on data-driven evidence. |
EdVAE significantly improves codebook usage (measured by perplexity) compared to dVAE and achieves comparable or better performance than state-of-the-art VQ-VAE models.
The paper provides evidence for a correlation between uncertainty values and perplexity, supporting the claim that uncertainty awareness improves codebook usage.
EdVAE demonstrates robust performance across various codebook designs, showing less sensitivity to codebook size and dimensionality compared to other methods. |
The method is primarily evaluated on small to medium-sized datasets and may require further exploration for larger, more diverse datasets like ImageNet.
The $\beta$ coefficient, balancing reconstruction and KL divergence terms, requires fine-tuning due to the complexity introduced by the evidential formulation. |
evidential deep learning, discrete variational autoencoders, codebook collapse, generative models, uncertainty quantification |
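The contrast between softmax and an evidential (Dirichlet) head in the EdVAE entry above can be sketched as follows. This is an illustrative sketch of the general evidential deep learning recipe (non-negative evidence, Dirichlet parameters alpha = evidence + 1), not the authors' implementation. The softplus evidence function and all function names are assumptions.

```python
import math


def softplus(x):
    """Numerically stable softplus, used here as an assumed evidence function."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def softmax(logits):
    """Standard softmax over codebook logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dirichlet_from_logits(logits):
    """Turn encoder logits into Dirichlet parameters over codebook entries.

    Returns (alpha, expected_probs, uncertainty), where expected_probs is the
    mean of the Dirichlet and uncertainty = K / sum(alpha) lies in (0, 1].
    """
    evidence = [softplus(x) for x in logits]
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    expected_probs = [a / strength for a in alpha]
    uncertainty = len(alpha) / strength
    return alpha, expected_probs, uncertainty


logits = [4.0, 0.0, 0.0, 0.0]  # toy logits over a 4-entry codebook
alpha, probs, u = dirichlet_from_logits(logits)
print(max(softmax(logits)), max(probs), u)
```

On these toy logits the softmax puts most of its mass on one entry, while the Dirichlet mean is flatter and comes with an explicit uncertainty value. That matches the overconfidence the abstract hypothesizes as a cause of codebook collapse.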
2310.05654
Report |
No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling |
Xuwei Xu, Changlin Li, Yudong Chen, Xiaojun Chang, Jiajun Liu, Sen Wang |
Vision Transformers (ViTs) have demonstrated outstanding performance in
computer vision tasks, yet their high computational complexity prevents their
deployment in computing resource-constrained environments. Various token
pruning techniques have been introduced to alleviate the high computational
burden of ViTs by dynamically dropping image tokens. However, some undesirable
pruning at early stages may result in permanent loss of image information in
subsequent layers, consequently hindering model performance. To address this
problem, we propose IdleViT, a dynamic token-idle-based method that achieves an
excellent trade-off between performance and efficiency. Specifically, in each
layer, IdleViT selects a subset of the image tokens to participate in
computations while keeping the rest of the tokens idle and directly passing
them to this layer's output. By allowing the idle tokens to be re-selected in
the following layers, IdleViT mitigates the negative impact of improper pruning
in the early stages. Furthermore, inspired by the normalized graph cut, we
devise a token cut loss on the attention map as regularization to improve
IdleViT's token selection ability. Our method is simple yet effective and can
be extended to pyramid ViTs since no token is completely dropped. Extensive
experimental results on various ViT architectures have shown that IdleViT can
diminish the complexity of pretrained ViTs by up to 33% with no more than
0.2% accuracy decrease on ImageNet, after finetuning for only 30 epochs.
Notably, when the keep ratio is 0.5, IdleViT outperforms the state-of-the-art
EViT on DeiT-S by 0.5% higher accuracy and even faster inference speed. The
source code is available in the supplementary material. |
Proposes IdleViT, a token-idle-based efficient ViT framework that dynamically selects tokens for computation while keeping the rest idle, allowing re-selection in later layers and mitigating information loss from early pruning. |
Addresses the high computational complexity of ViTs, hindering their deployment in resource-constrained environments, by achieving a better balance between performance and efficiency. |
Preserves unselected tokens (idle) throughout the layer, allowing re-selection. Introduces a token cut loss based on normalized graph cut theory to enhance semantic consistency in token selection. Fine-tunes pretrained ViTs with knowledge distillation and token cut loss. |
Reduces DeiT-S complexity by 33% with only a 0.2% accuracy drop on ImageNet.
Outperforms state-of-the-art EViT on DeiT-S with 0.5% higher accuracy and faster inference speed at a 0.5 keep ratio.
Improves accuracy on a pyramid ViT (Swin-Ti) compared to vanilla DynamicViT at various keep ratios. |
Only evaluates token selection based on class attention; other methods could be explored.
Limited evaluation on pyramid ViTs; more extensive experiments are needed. |
vision transformer, efficient deep learning, token pruning, token idling, normalized graph cut |
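The token-idling mechanism in the IdleViT entry above can be sketched with a toy layer. This is a simplified illustration, not the paper's code: tokens are plain floats, `layer_fn` is a hypothetical per-token stand-in for a transformer block (a real block would apply attention among the selected tokens), and the scores are supplied by hand rather than derived from class attention.

```python
def idle_layer(tokens, scores, keep_ratio, layer_fn):
    """Apply layer_fn to the top-scoring tokens; pass the rest through unchanged.

    Idle tokens stay in the output sequence, so a later layer can select them.
    Returns (output_tokens, sorted indices of the selected tokens).
    """
    k = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    selected = ranked[:k]
    out = list(tokens)  # idle tokens are copied to the layer output as-is
    for i in selected:
        out[i] = layer_fn(tokens[i])
    return out, sorted(selected)


tokens = [1.0, 2.0, 3.0, 4.0]
times_ten = lambda x: x * 10  # toy stand-in for a transformer block

# Layer 1 selects tokens 1 and 3; tokens 0 and 2 idle.
out1, sel1 = idle_layer(tokens, [0.1, 0.9, 0.2, 0.8], 0.5, times_ten)
# Layer 2 re-selects the tokens that idled in layer 1.
out2, sel2 = idle_layer(out1, [0.9, 0.1, 0.8, 0.2], 0.5, times_ten)
print(out1, sel1, out2, sel2)
```

Unlike pruning, the sequence length never shrinks. This is why the entry notes that the approach extends to pyramid ViTs, which need a complete token grid.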
2310.05590
Report |
Perceptual Artifacts Localization for Image Synthesis Tasks |
Lingzhi Zhang, Zhengjie Xu, Connelly Barnes, Yuqian Zhou, Qing Liu, He Zhang, Sohrab Amirghodsi, Zhe Lin, Eli Shechtman, Jianbo Shi |
Recent advancements in deep generative models have facilitated the creation
of photo-realistic images across various tasks. However, these generated images
often exhibit perceptual artifacts in specific regions, necessitating manual
correction. In this study, we present a comprehensive empirical examination of
Perceptual Artifacts Localization (PAL) spanning diverse image synthesis
endeavors. We introduce a novel dataset comprising 10,168 generated images,
each annotated with per-pixel perceptual artifact labels across ten synthesis
tasks. A segmentation model, trained on our proposed dataset, effectively
localizes artifacts across a range of tasks. Additionally, we illustrate its
proficiency in adapting to previously unseen models using minimal training
samples. We further propose an innovative zoom-in inpainting pipeline that
seamlessly rectifies perceptual artifacts in the generated images. Through our
experimental analyses, we elucidate several practical downstream applications,
such as automated artifact rectification, non-referential image quality
evaluation, and abnormal region detection in images. The dataset and code are
released. |
This paper presents a novel dataset and a deep learning model for localizing perceptual artifacts in images generated by various AI image synthesis models. |
Current generative models often produce images with noticeable artifacts, requiring manual correction. This work aims to automate this process and improve the quality of generated images. |
The authors created a dataset of 10,168 generated images annotated with per-pixel artifact labels across ten synthesis tasks. They trained a segmentation model (using Swin-T backbone and UperNet head) on this dataset to localize artifacts. |
The trained model effectively locates artifacts across a range of synthesis tasks and generalizes well to unseen models with minimal fine-tuning.
The authors propose a 'zoom-in' inpainting pipeline, which significantly improves the refinement of artifact regions, especially for detailed objects like faces and hands.
The study demonstrates the effectiveness of using the artifact localization model for downstream tasks such as automatic artifact correction, no-reference image quality assessment, and anomaly detection in real images. |
The study primarily focuses on inpainting as a method for artifact correction, leaving room for exploring other task-specific refinement modules.
The dataset is labeled based on a specific criterion and may not encompass the full spectrum of individual preferences concerning perceptual artifacts. |
image synthesis, perceptual artifacts, artifact localization, deep learning, image quality assessment |
2310.05375
Report |
IPDreamer: Appearance-Controllable 3D Object Generation with Complex Image Prompts |
Bohan Zeng, Shanglin Li, Yutang Feng, Ling Yang, Hong Li, Sicheng Gao, Jiaming Liu, Conghui He, Wentao Zhang, Jianzhuang Liu, Baochang Zhang, Shuicheng Yan |
Recent advances in 3D generation have been remarkable, with methods such as
DreamFusion leveraging large-scale text-to-image diffusion-based models to
supervise 3D object generation. These methods enable the synthesis of detailed
and photorealistic textured objects. However, the appearance of 3D objects
produced by these text-to-3D models is unpredictable, and it is hard for the
single-image-to-3D methods to deal with complex images, thus posing a challenge
in generating appearance-controllable 3D objects. To achieve controllable
complex 3D object synthesis, we propose IPDreamer, a novel approach that
incorporates image prompt adaption to extract detailed and comprehensive
appearance features from complex images, which are then utilized for 3D object
generation. Our results demonstrate that IPDreamer effectively generates
high-quality 3D objects that are consistent with both the provided text and the
appearance of complex image prompts, demonstrating its promising capability in
appearance-controllable 3D object generation. Our code is available at
https://github.com/zengbohan0217/IPDreamer. |
IPDreamer: a novel 3D object generation framework enabling controllable, high-quality 3D object creation from complex image prompts. |
Existing text-to-3D models lack appearance control, and single-image-to-3D methods struggle with complex images. |
Two-stage approach: 1) Train a coarse NeRF model from text/image. 2) Extract 3D mesh and optimize geometry and texture using Image Prompt Score Distillation (IPSD), leveraging image prompt features and a geometry prompt difference. Local Editing with Partial Images (LEPI) handles large appearance discrepancies. |
Effectively transfers complex image styles to 3D objects, enabling high-quality texture editing.
Generates more controllable and realistic 3D objects from text prompts compared to SOTA methods.
Outperforms existing methods in quantitative metrics (FID, CLIP score) and user study. |
Color inconsistency between generated 3D object and image prompt can occur.
Further improvements in processing speed are desired. |
3d object generation, image prompt adaption, score distillation sampling, texture editing, neural radiance fields |
2310.05056
Report |
Open-Vocabulary Animal Keypoint Detection with Semantic-feature Matching |
Hao Zhang, Lumin Xu, Shenqi Lai, Wenqi Shao, Nanning Zheng, Ping Luo, Yu Qiao, Kaipeng Zhang |
Current image-based keypoint detection methods for animal (including human)
bodies and faces are generally divided into fully supervised and few-shot
class-agnostic approaches. The former typically relies on laborious and
time-consuming manual annotations, posing considerable challenges in expanding
keypoint detection to a broader range of keypoint categories and animal
species. The latter, though less dependent on extensive manual input, still
requires necessary support images with annotation for reference during testing.
To realize zero-shot keypoint detection without any prior annotation, we
introduce the Open-Vocabulary Keypoint Detection (OVKD) task, which is
innovatively designed to use text prompts for identifying arbitrary keypoints
across any species. In pursuit of this goal, we have developed a novel
framework named Open-Vocabulary Keypoint Detection with Semantic-feature
Matching (KDSM). This framework synergistically combines vision and language
models, creating an interplay between language features and local keypoint
visual features. KDSM enhances its capabilities by integrating Domain
Distribution Matrix Matching (DDMM) and other special modules, such as the
Vision-Keypoint Relational Awareness (VKRA) module, improving the framework's
generalizability and overall performance. Our comprehensive experiments
demonstrate that KDSM significantly outperforms the baseline in terms of
performance and achieves remarkable success in the OVKD task. Impressively, our
method, operating in a zero-shot fashion, still yields results comparable to
state-of-the-art few-shot species class-agnostic keypoint detection methods. We
will make the source code publicly accessible. |
This paper introduces Open-Vocabulary Keypoint Detection (OVKD), a novel task aiming to detect arbitrary keypoints in images using text prompts, even for unseen animal species and keypoint categories. |
Traditional keypoint detection methods struggle with generalizing to new categories. This work addresses this limitation by leveraging language models to achieve zero-shot detection of diverse keypoints across species. |
The paper proposes KDSM, a framework that combines a Vision-Keypoint Relational Awareness (VKRA) module with Domain Distribution Matrix Matching (DDMM). VKRA enhances interactions between text embeddings and visual features, while DDMM clusters keypoint categories to enable efficient learning and generalization. |
KDSM significantly outperforms the baseline framework on the MP-78 dataset for both diverse keypoint categories and varied animal species settings.
In the zero-shot setting, KDSM achieves comparable performance to state-of-the-art few-shot species class-agnostic keypoint detection methods.
Ablation studies confirm the contribution of DDMM, VKRA, and the choice of pre-trained encoders to KDSM's performance. |
KDSM's performance in challenging scenarios with occlusion, lighting variations, and resolution changes requires further investigation.
Future work could explore integrating stronger text encoders to further enhance the method's capabilities. |
open vocabulary, keypoint detection, zero-shot learning, vision-language models, domain distribution matrix matching |
2310.04995
Report |
SemST: Semantically Consistent Multi-Scale Image Translation via Structure-Texture Alignment |
Ganning Zhao, Wenhui Cui, Suya You, C. -C. Jay Kuo |
Unsupervised image-to-image (I2I) translation learns cross-domain image
mapping that transfers input from the source domain to output in the target
domain while preserving its semantics. One challenge is that different semantic
statistics in source and target domains result in content discrepancy known as
semantic distortion. To address this problem, a novel I2I method that maintains
semantic consistency in translation is proposed and named SemST in this work.
SemST reduces semantic distortion by employing contrastive learning and
aligning the structural and textural properties of input and output by
maximizing their mutual information. Furthermore, a multi-scale approach is
introduced to enhance translation performance, thereby enabling the
applicability of SemST to domain adaptation in high-resolution images.
Experiments show that SemST effectively mitigates semantic distortion and
achieves state-of-the-art performance. Also, the application of SemST to domain
adaptation (DA) is explored. It is demonstrated by preliminary experiments that
SemST can be utilized as a beneficial pre-training for the semantic
segmentation task. |
This paper presents SemST, a novel multi-scale image-to-image translation method that reduces semantic distortion by aligning structural and textural properties of input and output images. |
Semantic distortion, a common problem in cross-domain image translation, can negatively impact the performance of downstream tasks like semantic segmentation and domain adaptation. |
SemST leverages a multi-scale framework to capture both global context and local details, uses a texture-structure consistency loss based on mutual information to align semantic features, and employs semantics-aided hard negative sampling to enhance contrastive learning. |
SemST achieves state-of-the-art performance in image translation across multiple datasets, including GTA5 to Cityscapes, Cityscapes Parsing to Image, and Photo to Maps.
Qualitative results demonstrate SemST's ability to effectively preserve semantic consistency and generate high-quality translated images with fewer artifacts.
Experiments on domain adaptation for semantic segmentation show that using SemST-refined synthetic images during training improves performance, highlighting its potential for UDA pre-training. |
The selection of the TS loss weight requires careful tuning to balance input-output consistency and target domain style learning.
Future work could explore the integration of semantic segmentation loss for explicit semantic guidance during image translation. |
image-to-image translation, semantic consistency, multi-scale learning, contrastive learning, domain adaptation |
2310.04719
Report |
A Comprehensive Survey on Deep Neural Image Deblurring |
Sajjad Amrollahi Biyouki, Hoon Hwangbo |
Image deblurring tries to eliminate degradation elements of an image causing
blurriness and improve the quality of an image for better texture and object
visualization. Traditionally, prior-based optimization approaches predominated
in image deblurring, but deep neural networks recently brought a major
breakthrough in the field. In this paper, we comprehensively review the recent
progress of the deep neural architectures in both blind and non-blind image
deblurring. We outline the most popular deep neural network structures used in
deblurring applications, describe their strengths and novelties, summarize
performance metrics, and introduce broadly used datasets. In addition, we
discuss the current challenges and research gaps in this domain and suggest
potential research directions for future works. |
This paper presents a comprehensive review of deep neural network architectures for both blind and non-blind image deblurring, summarizing their contributions, structures, and mechanisms. |
Image deblurring is crucial for improving image quality in various applications, and deep learning has shown breakthroughs in this field. |
The paper examines various deep neural networks, including CNNs, ResNets, encoder-decoder networks, GANs, and their variations. It analyzes their architectural configurations, training loss functions, and performance on benchmark datasets. |
Multi-scale architectures and GANs have significantly improved blind image deblurring performance.
Attention mechanisms effectively capture blur characteristics and locations.
Deep learning-based image priors, especially deep image prior, have shown promise in enhancing deblurring results. |
Current architectures face challenges in scalability, generalizability, and handling real-world blur.
Future research should focus on developing more efficient feature extraction modules, reducing architecture complexity, and creating realistic datasets. |
image deblurring, deep learning, convolutional neural networks, generative adversarial networks, image restoration |
2310.04672
Report |
EasyPhoto: Your Smart AI Photo Generator |
Ziheng Wu, Jiaqi Xu, Xinyi Zou, Kunzhe Huang, Xing Shi, Jun Huang |
Stable Diffusion web UI (SD-WebUI) is a comprehensive project that provides a
browser interface based on Gradio library for Stable Diffusion models. In this
paper, we propose a novel WebUI plugin called EasyPhoto, which enables the
generation of AI portraits. By training a digital doppelganger of a specific
user ID using 5 to 20 relevant images, the finetuned model (according to the
trained LoRA model) allows for the generation of AI photos using arbitrary
templates. Our current implementation supports the modification of multiple
persons and different photo styles. Furthermore, we allow users to generate
fantastic template images with the strong SDXL model, enhancing EasyPhoto's
capabilities to deliver more diverse and satisfactory results. The source code
for EasyPhoto is available at: https://github.com/aigc-apps/sd-webui-EasyPhoto.
We also support a webui-free version by using diffusers:
https://github.com/aigc-apps/EasyPhoto. We are continuously enhancing our
efforts to expand the EasyPhoto pipeline, making it suitable for any
identification (not limited to just the face), and we enthusiastically welcome
any intriguing ideas or suggestions. |
EasyPhoto, a Stable Diffusion web UI plugin for generating high-quality AI portraits by training a digital doppelganger of a user using a few input images. |
Existing methods for AI portrait generation often result in unrealistic lighting, identity loss, or boundary artifacts. EasyPhoto overcomes these limitations by leveraging the image-to-image capabilities of Stable Diffusion and a novel two-stage diffusion process. |
EasyPhoto uses a multi-stage process involving: (1) Training a LoRA model on user images, incorporating reinforcement learning for identity preservation. (2) A two-stage diffusion process with ControlNet guidance for generating realistic and identity-consistent portraits in various styles and with multiple users. |
EasyPhoto generates high-quality AI portraits that maintain user identity and resemble the input template style.
The two-stage diffusion process effectively addresses issues of boundary artifacts and identity loss.
The system supports multi-user generation and leverages SDXL for diverse and realistic template creation. |
Current implementation primarily focuses on face IDs; expanding to other objects is under development.
Reliance on multiple ControlNet units and diffusion stages can increase computational cost. |
ai portrait generation, stable diffusion, lora, controlnet, digital doppelganger |
2310.04414
Report |
CIFAR-10-Warehouse: Broad and More Realistic Testbeds in Model Generalization Analysis |
Xiaoxiao Sun, Xingjian Leng, Zijian Wang, Yang Yang, Zi Huang, Liang Zheng |
Analyzing model performance in various unseen environments is a critical
research problem in the machine learning community. To study this problem, it
is important to construct a testbed with out-of-distribution test sets that
have broad coverage of environmental discrepancies. However, existing testbeds
typically either have a small number of domains or are synthesized by image
corruptions, hindering algorithm design that demonstrates real-world
effectiveness. In this paper, we introduce CIFAR-10-Warehouse, consisting of
180 datasets collected by prompting image search engines and diffusion models
in various ways. Generally sized between 300 and 8,000 images, the datasets
contain natural images, cartoons, certain colors, or objects that do not
naturally appear. With CIFAR-10-W, we aim to enhance the evaluation and deepen
the understanding of two generalization tasks: domain generalization and model
accuracy prediction in various out-of-distribution environments. We conduct
extensive benchmarking and comparison experiments and show that CIFAR-10-W
offers new and interesting insights inherent to these tasks. We also discuss
other fields that would benefit from CIFAR-10-W. |
This paper introduces CIFAR-10-Warehouse (CIFAR-10-W), a dataset of 180 diverse domains for evaluating model generalization in out-of-distribution (OOD) settings. |
Existing OOD datasets are limited by a small number of domains or reliance on synthetic corruptions, hindering the development of algorithms with real-world effectiveness. |
CIFAR-10-W is constructed by collecting real-world images from various search engines using diverse prompts and by generating synthetic images using a diffusion model. |
CIFAR-10-W provides a more challenging and realistic benchmark for accuracy prediction methods compared to synthetic datasets.
Domain generalization methods show limited improvement over baseline models on CIFAR-10-W, suggesting the need for more robust algorithms.
Accuracy prediction methods can be applied to estimate the performance of domain generalized models on unseen target domains. |
CIFAR-10-W is based on CIFAR-10, which is relatively small compared to ImageNet.
Individual datasets in CIFAR-10-W might not be sufficient for full training due to their relatively small size.
The domain coverage of CIFAR-10-W, while broad, is not exhaustive. |
domain generalization, accuracy prediction, out-of-distribution generalization, dataset, benchmark |
2310.03739
Report |
Aligning Text-to-Image Diffusion Models with Reward Backpropagation |
Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, Katerina Fragkiadaki |
Text-to-image diffusion models have recently emerged at the forefront of
image generation, powered by very large-scale unsupervised or weakly supervised
text-to-image training datasets. Due to their unsupervised training,
controlling their behavior in downstream tasks, such as maximizing
human-perceived image quality, image-text alignment, or ethical image
generation, is difficult. Recent works finetune diffusion models to downstream
reward functions using vanilla reinforcement learning, notorious for the high
variance of the gradient estimators. In this paper, we propose AlignProp, a
method that aligns diffusion models to downstream reward functions using
end-to-end backpropagation of the reward gradient through the denoising
process. While naive implementation of such backpropagation would require
prohibitive memory resources for storing the partial derivatives of modern
text-to-image models, AlignProp finetunes low-rank adapter weight modules and
uses gradient checkpointing, to render its memory usage viable. We test
AlignProp in finetuning diffusion models to various objectives, such as
image-text semantic alignment, aesthetics, compressibility and controllability
of the number of objects present, as well as their combinations. We show
AlignProp achieves higher rewards in fewer training steps than alternatives,
while being conceptually simpler, making it a straightforward choice for
optimizing diffusion models for differentiable reward functions of interest.
Code and Visualization results are available at https://align-prop.github.io/. |
Introduces Alignment by Backpropagation (AlignProp), a differentiable method for finetuning pretrained text-to-image diffusion models using end-to-end backpropagation of reward gradients, addressing limitations of reinforcement learning approaches. |
Important because aligning pretrained diffusion models with downstream objectives like aesthetics, fairness, and text-to-image alignment is crucial, and current methods are either data-hungry, computationally expensive, or inefficient. |
Models the denoising process as a differentiable recurrent policy and finetunes low-rank adapter weights using gradient checkpointing and randomized truncated backpropagation to reduce memory overhead and prevent over-optimization. |
AlignProp achieves higher rewards and better data efficiency (25x) compared to reinforcement learning baselines.
It generalizes better to novel text prompts, demonstrating its ability to learn beyond training data.
Human evaluations show preference for AlignProp-generated images in terms of fidelity and image-text alignment. |
Reliance on differentiable reward functions might lead to over-optimization for imperfect reward functions.
Future work includes exploring the applicability of AlignProp to diffusion-based language models for improved alignment with human feedback. |
diffusion models, text-to-image generation, model alignment, reward learning, backpropagation |
2310.03734
Report |
Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency |
Tianhong Li, Sangnie Bhardwaj, Yonglong Tian, Han Zhang, Jarred Barber, Dina Katabi, Guillaume Lajoie, Huiwen Chang, Dilip Krishnan |
Current vision-language generative models rely on expansive corpora of paired
image-text data to attain optimal performance and generalization capabilities.
However, automatically collecting such data (e.g. via large-scale web scraping)
leads to low quality and poor image-text correlation, while human annotation is
more accurate but requires significant manual effort and expense. We introduce
ITIT (InTegrating Image Text): an innovative training paradigm grounded in
the concept of
cycle consistency which allows vision-language training on unpaired image and
text data. ITIT is comprised of a joint image-text encoder with disjoint image
and text decoders that enable bidirectional image-to-text and text-to-image
generation in a single framework. During training, ITIT leverages a small set
of paired image-text data to ensure its output matches the input reasonably
well in both directions. Simultaneously, the model is also trained on much
larger datasets containing only images or texts. This is achieved by enforcing
cycle consistency between the original unpaired samples and the cycle-generated
counterparts. For instance, it generates a caption for a given input image and
then uses the caption to create an output image, and enforces similarity
between the input and output images. Our experiments show that ITIT with
unpaired datasets exhibits similar scaling behavior as using high-quality
paired data. We demonstrate image generation and captioning performance on par
with state-of-the-art text-to-image and image-to-text models with orders of
magnitude fewer (only 3M) paired image-text data. |
This paper introduces ITIT, a novel training paradigm that leverages cycle consistency to train vision-language models on unpaired image and text data. |
This is important because collecting large-scale paired image-text data is expensive and often results in low quality, while vast amounts of unpaired data remain underutilized. |
The method uses a unified image-text encoder and separate image/text decoders. It leverages a small paired dataset for initial training and then uses cycle consistency losses on unpaired data, enforcing similarity between an input image/text and a reconstructed version generated after passing through text-to-image and then image-to-text generation (or vice versa). |
ITIT achieves performance comparable to state-of-the-art methods on text-to-image and image-to-text benchmarks using significantly fewer paired data samples (up to 100x fewer).
The method exhibits similar scaling behavior with unpaired data as models trained solely on paired data, highlighting its potential for leveraging large unpaired datasets.
ITIT proves robust to low-quality paired data, showing significant improvements when incorporating a large, noisy paired dataset compared to baselines trained only on this data. |
The current implementation leads to increased training time compared to standard paired data training.
Future work includes scaling ITIT to even larger unpaired datasets and exploring its effectiveness with more diverse data sources. |
vision-language models, cycle consistency, unpaired data, text-to-image generation, image captioning |
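The image-side cycle described in the ITIT entry above (image, then caption, then image again, compared against the input) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_captioner`, `toy_generator`, and the mean-squared-error comparison are hypothetical stand-ins chosen so the example runs without any model.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def mean_squared_error(a: Vector, b: Vector) -> float:
    """Mean squared difference between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def image_cycle_loss(
    image: Vector,
    image_to_text: Callable[[Vector], str],
    text_to_image: Callable[[str], Vector],
) -> float:
    """Image -> caption -> image, then compare the result with the input."""
    caption = image_to_text(image)
    reconstructed = text_to_image(caption)
    return mean_squared_error(image, reconstructed)


# Hypothetical toy models: an "image" is a 2-vector and a "caption"
# only says which component is larger.
def toy_captioner(image: Vector) -> str:
    return "left" if image[0] >= image[1] else "right"


def toy_generator(caption: str) -> Vector:
    return [1.0, 0.0] if caption == "left" else [0.0, 1.0]


loss_clean = image_cycle_loss([1.0, 0.0], toy_captioner, toy_generator)
loss_blurry = image_cycle_loss([0.6, 0.4], toy_captioner, toy_generator)
```

An unpaired image needs no caption label here: the loss depends only on the image and its own round trip. The text-side cycle would mirror this with the two models swapped.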
2310.03669
Report |
LumiNet: The Bright Side of Perceptual Knowledge Distillation |
Md. Ismail Hossain, M M Lutfe Elahi, Sameera Ramasinghe, Ali Cheraghian, Fuad Rahman, Nabeel Mohammed, Shafin Rahman |
In knowledge distillation literature, feature-based methods have dominated
due to their ability to effectively tap into extensive teacher models. In
contrast, logit-based approaches, which aim to distill 'dark knowledge' from
teachers, typically exhibit inferior performance compared to feature-based
methods. To bridge this gap, we present LumiNet, a novel knowledge distillation
algorithm designed to enhance logit-based distillation. We introduce the
concept of 'perception', aiming to calibrate logits based on the model's
representation capability. This concept addresses overconfidence issues in
logit-based distillation methods while also introducing a novel way to
distill knowledge from the teacher. It reconstructs the logits of each
sample/instance by considering its relationships with other samples in the batch.
LumiNet excels on benchmarks like CIFAR-100, ImageNet, and MSCOCO,
outperforming leading feature-based methods, e.g., compared to KD with ResNet18
and MobileNetV2 on ImageNet, it shows improvements of 1.5% and 2.05%,
respectively. |
The paper presents LumiNet, a novel knowledge distillation algorithm that generates new representations for instances/samples, addressing the overconfidence issue in logit-based distillation and improving performance. |
Logit-based knowledge distillation, while potentially efficient, often lags behind feature-based methods due to overconfidence issues and limitations in capturing knowledge granularity. LumiNet aims to bridge this gap. |
LumiNet introduces the concept of 'perception,' leveraging mean and variance statistics of logits within a batch to generate a new representation for each instance. This approach enhances logit granularity and mitigates the overconfidence issue. |
LumiNet outperforms state-of-the-art knowledge distillation methods on CIFAR-100, ImageNet, and MS-COCO datasets for image recognition and object detection tasks.
The method demonstrates consistent improvement across various architectures, including ResNet, VGG, ShuffleNet, MobileNet, WRN, and Faster-RCNN-FPN.
LumiNet achieves superior accuracy while maintaining efficiency comparable to traditional knowledge distillation methods. |
While LumiNet shows promising results, exploring its effectiveness with larger batch sizes and diverse datasets could further strengthen its applicability.
Further research can investigate the integration of feature-based methods with LumiNet to potentially enhance performance even further. |
knowledge distillation, deep learning, logit-based distillation, overconfidence, perception |
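The LumiNet entry above describes 'perception' as using the mean and variance of logits within a batch to build a new representation for each instance. A minimal sketch of one way to read that, assuming per-class standardization across the batch (the function name `perception` and the `eps` constant are illustrative choices, not taken from the paper):

```python
import math
from typing import List


def perception(logits: List[List[float]], eps: float = 1e-5) -> List[List[float]]:
    """Standardize each class column of a (batch x classes) logit matrix.

    Assumption: each class's logits are shifted by the batch mean and
    scaled by the batch standard deviation for that class.
    """
    batch = len(logits)
    num_classes = len(logits[0])
    out = [[0.0] * num_classes for _ in range(batch)]
    for c in range(num_classes):
        column = [row[c] for row in logits]
        mean = sum(column) / batch
        var = sum((v - mean) ** 2 for v in column) / batch
        std = math.sqrt(var + eps)
        for i in range(batch):
            out[i][c] = (logits[i][c] - mean) / std
    return out


# Three samples, three classes; class 0 dominates the raw logits.
teacher_logits = [[8.0, 1.0, 0.5], [6.0, 2.0, 0.0], [7.0, 0.0, 1.0]]
recalibrated = perception(teacher_logits)
```

After this step no single class carries a uniformly large logit, which is one plausible way to read the overconfidence claim. Each sample's new values depend on the other samples in the batch.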
2310.03502
Report |
Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion |
Anton Razzhigaev, Arseniy Shakhmatov, Anastasia Maltseva, Vladimir Arkhipkin, Igor Pavlov, Ilya Ryabov, Angelina Kuts, Alexander Panchenko, Andrey Kuznetsov, Denis Dimitrov |
Text-to-image generation is a significant domain in modern computer vision
and has achieved substantial improvements through the evolution of generative
architectures. Among these, there are diffusion-based models that have
demonstrated essential quality enhancements. These models are generally split
into two categories: pixel-level and latent-level approaches. We present
Kandinsky, a novel exploration of latent diffusion architecture, combining the
principles of the image prior models with latent diffusion techniques. The
image prior model is trained separately to map text embeddings to image
embeddings of CLIP. Another distinct feature of the proposed model is the
modified MoVQ implementation, which serves as the image autoencoder component.
Overall, the designed model contains 3.3B parameters. We also deployed a
user-friendly demo system that supports diverse generative modes such as
text-to-image generation, image fusion, text and image fusion, image variations
generation, and text-guided inpainting/outpainting. Additionally, we released
the source code and checkpoints for the Kandinsky models. Experimental
evaluations demonstrate a FID score of 8.03 on the COCO-30K dataset, marking
our model as the top open-source performer in terms of measurable image
generation quality. |
This paper introduces Kandinsky, a novel text-to-image latent diffusion model that combines image prior and latent diffusion techniques. |
Kandinsky achieves state-of-the-art image generation quality among open-source models, evidenced by a low FID score (lower is better) and strong human evaluation results. |
The model utilizes CLIP embeddings for both text and images, employs a transformer-based image prior, a modified MoVQ autoencoder, and a UNet for latent diffusion. |
Kandinsky achieves a FID score of 8.03 on the COCO-30K dataset, outperforming other open-source models.
A linear mapping between text and image embedding spaces proves surprisingly effective for image generation.
The model supports diverse generation modes like text-to-image, image fusion, and inpainting/outpainting. |
Further improvements in semantic coherence between text prompts and generated images are needed.
Continued research on mitigating potential biases and preventing the generation of harmful content is crucial. |
text-to-image synthesis, latent diffusion model, image prior, clip embeddings, movq autoencoder |
2310.03337
Report |
Denoising Diffusion Step-aware Models |
Shuai Yang, Yukang Chen, Luozhou Wang, Shu Liu, Yingcong Chen |
Denoising Diffusion Probabilistic Models (DDPMs) have garnered popularity for
data generation across various domains. However, a significant bottleneck is
the necessity for whole-network computation during every step of the generative
process, leading to high computational overheads. This paper presents a novel
framework, Denoising Diffusion Step-aware Models (DDSM), to address this
challenge. Unlike conventional approaches, DDSM employs a spectrum of neural
networks whose sizes are adapted according to the importance of each generative
step, as determined through evolutionary search. This step-wise network
variation effectively circumvents redundant computational efforts, particularly
in less critical steps, thereby enhancing the efficiency of the diffusion
model. Furthermore, the step-aware design can be seamlessly integrated with
other efficiency-geared diffusion models such as DDIMs and latent diffusion,
thus broadening the scope of computational savings. Empirical evaluations
demonstrate that DDSM achieves computational savings of 49% for CIFAR-10, 61%
for CelebA-HQ, 59% for LSUN-bedroom, 71% for AFHQ, and 76% for ImageNet, all
without compromising the generation quality. |
This paper introduces Denoising Diffusion Step-aware Models (DDSM), a novel framework that accelerates Denoising Diffusion Probabilistic Models (DDPMs) by employing different-sized neural networks for different generative steps based on their importance. |
DDPMs, while effective for data generation, suffer from high computational overhead due to the need for whole-network computation at each step of the generative process. DDSM addresses this challenge by optimizing resource allocation across steps. |
DDSM utilizes a slimmable neural network trained to be executable at various sizes. An evolutionary search algorithm then identifies the optimal model size for each generative step, minimizing computational cost without sacrificing generation quality. |
DDSM achieves computational savings of 49% for CIFAR-10, 61% for CelebA-HQ, 59% for LSUN-bedroom, 71% for AFHQ, and 76% for ImageNet without compromising image generation quality.
The optimal step-aware strategy varies significantly across datasets, highlighting the importance of dataset-specific optimization.
DDSM is compatible with other diffusion model acceleration techniques, like DDIM and latent diffusion, allowing for further efficiency improvements. |
The current search algorithm, while effective, introduces additional computational overhead during the training process.
Further investigation is needed to understand the theoretical underpinnings of step importance in diffusion models. |
denoising diffusion probabilistic models, generative models, model compression, network pruning, evolutionary search |
2310.03324
Report |
Investigating the Limitation of CLIP Models: The Worst-Performing Categories |
Jie-Jing Shao, Jiang-Xin Shi, Xiao-Wen Yang, Lan-Zhe Guo, Yu-Feng Li |
Contrastive Language-Image Pre-training (CLIP) provides a foundation model by
integrating natural language into visual concepts, enabling zero-shot
recognition on downstream tasks. It is usually expected that satisfactory
overall accuracy can be achieved across numerous domains through well-designed
textual prompts. However, we found that their performance in the worst
categories is significantly inferior to the overall performance. For example,
on ImageNet, there are a total of 10 categories with class-wise accuracy as low
as 0%, even though the overall performance has achieved 64.1%. This
phenomenon reveals the potential risks associated with using CLIP models,
particularly in risk-sensitive applications where specific categories hold
significant importance. To address this issue, we investigate the alignment
between the two modalities in the CLIP model and propose the Class-wise
Matching Margin (CMM) to measure the inference confusion. CMM can
effectively identify the worst-performing categories and estimate the potential
performance of the candidate prompts. We further query large language models to
enrich descriptions of worst-performing categories and build a weighted
ensemble to highlight the efficient prompts. Experimental results clearly
verify the effectiveness of our proposal, where the accuracy on the worst-10
categories on ImageNet is boosted to 5.2%, without manual prompt engineering,
laborious optimization, or access to labeled validation data. |
The paper proposes CPE, a method to improve the performance of Contrastive Language-Image Pre-training (CLIP) models on worst-performing categories, which are often overlooked when focusing on overall accuracy. |
CLIP models often exhibit significantly inferior performance in specific categories despite good overall accuracy. This poses potential risks for real-world applications, especially in risk-sensitive domains where performance in certain categories is crucial. |
The authors introduce Class-wise Matching Margin (CMM) to measure inference confusion and identify worst-performing categories. They use CMM to select effective prompt templates and enrich descriptions of worst-performing categories using large language models (LLMs), leading to a weighted prompt ensemble method. |
CPE consistently boosts the accuracy of worst-performing categories across various benchmark datasets.
On ImageNet, the accuracy of the worst-10 categories is boosted from 0% to 5.2%.
CPE achieves comparable overall accuracy to state-of-the-art methods while significantly improving worst-category performance, demonstrating that both can be achieved simultaneously. |
CPE's reliance on pseudo-labels for CMM calculation might introduce errors, especially on challenging datasets.
Further exploration is needed to optimize the prompt selection process and the number of categories to enrich descriptions for, potentially improving CPE's effectiveness further.
The work uses a predefined template pool from CLIP. Future work could explore automatic generation and selection of templates. |
clip, zero-shot learning, worst-case performance, prompt engineering, large language models |
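The weighted prompt ensemble mentioned in the CLIP entry above can be illustrated with a small zero-shot classifier. This sketch covers only the ensembling step. The prompt weights are hypothetical constants (in the paper they are derived from CMM, whose formula is not given here), and the 2-D embeddings are toy values rather than CLIP outputs.

```python
import math
from typing import Dict, List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def ensemble_embedding(
    prompt_embeddings: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Weighted sum of unit-normalized prompt embeddings, renormalized."""
    dim = len(prompt_embeddings[0])
    total = [0.0] * dim
    for emb, w in zip(prompt_embeddings, weights):
        unit = normalize(emb)
        for k in range(dim):
            total[k] += w * unit[k]
    return normalize(total)


def classify(image_embedding: Sequence[float], class_embeddings: Dict[str, List[float]]) -> str:
    """Return the class whose text embedding has the highest cosine similarity."""
    image = normalize(image_embedding)
    scores = {
        name: sum(a * b for a, b in zip(image, emb))
        for name, emb in class_embeddings.items()
    }
    return max(scores, key=scores.get)


# Toy text embeddings for two prompts per class; weights are assumed.
class_embeddings = {
    "cat": ensemble_embedding([[1.0, 0.0], [0.8, 0.6]], weights=[0.75, 0.25]),
    "dog": ensemble_embedding([[0.0, 1.0], [0.6, 0.8]], weights=[0.5, 0.5]),
}
prediction = classify([0.9, 0.1], class_embeddings)
```

Giving a larger weight to a less ambiguous prompt pulls the class embedding away from neighboring classes, which is the effect the weighted ensemble aims for on confusable categories.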
2310.03291
Report |
Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction |
Yiren Jian, Tingkai Liu, Yunzhe Tao, Chunhui Zhang, Soroush Vosoughi, Hongxia Yang |
In this paper, we introduce EVL_Gen, a streamlined
framework designed for the pre-training of visually conditioned language
generation models with high computational demands, utilizing frozen pre-trained
large language models (LLMs). The conventional approach in vision-language
pre-training (VLP) typically involves a two-stage optimization process: an
initial resource-intensive phase dedicated to general-purpose vision-language
representation learning, focused on extracting and consolidating relevant
visual features. This is followed by a subsequent phase that emphasizes
end-to-end alignment between visual and linguistic modalities. Our novel
one-stage, single-loss framework bypasses the computationally demanding first
training stage by gradually merging similar visual tokens during training,
while avoiding model collapse caused by single-stage training of BLIP-2 type
models. The gradual merging process effectively condenses visual information
while preserving semantic richness, resulting in rapid convergence without
compromising performance. Our experimental findings demonstrate that our
approach accelerates the training of vision-language models by a factor of 5
without a noticeable impact on overall performance. Furthermore, we illustrate
that our models significantly narrow the performance gap to current
vision-language models using only 1/10 of the data. Finally, we showcase how
our image-text models can seamlessly adapt to video-conditioned language
generation tasks through novel soft attentive temporal token contextualizing
modules. Code is available at https://github.com/yiren-jian/EVLGen. |
The paper proposes EVL_Gen, a streamlined framework for pre-training visually conditioned language generation models using frozen pre-trained large language models (LLMs). EVL_Gen utilizes a novel token merging transformer, TomeFormer, as an efficient vision-language connector, reducing computational cost and training time. |
Existing vision-language pre-training (VLP) methods, while effective, are computationally expensive, limiting research and exploration of different model configurations. EVL_Gen addresses this challenge by enabling faster and more efficient VLP with comparable or better performance. |
EVL_Gen replaces the resource-intensive two-stage training process of previous methods with a single-stage, single-loss framework. It employs TomeFormer to merge similar visual tokens during training, compressing visual information while preserving semantic richness, leading to faster convergence. |
EVL_Gen achieves performance competitive with BLIP-2 on various image-text benchmarks while using significantly less training time (5x faster) and data.
The proposed temporal token contextualizing module effectively adapts EVL_Gen for video-language tasks, achieving strong performance on video captioning benchmarks.
TomeFormer effectively compresses visual information into semantically rich tokens, simplifying the training process and enabling single-stage optimization. |
The fixed token merging rate (r) in TomeFormer may not be optimal for all images/videos.
TomeFormer lacks the ability for text-specific selection of visual features, potentially limiting performance in tasks like VQA. |
vision-language pre-training, large language models, token merging, vision-language generation, video captioning |
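The idea of merging similar visual tokens, as described in the EVL_Gen entry above, can be sketched with a simple greedy procedure. Note the swap: this is a plain greedy pairwise merge by cosine similarity, not the TomeFormer matching scheme, and it assumes non-zero token vectors. The parameter `r` here counts merges and is only loosely analogous to the merging rate mentioned above.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def merge_most_similar(tokens: Sequence[Sequence[float]], r: int) -> List[List[float]]:
    """Greedily merge the most similar token pair, r times, by averaging."""
    merged = [list(t) for t in tokens]
    for _ in range(r):
        if len(merged) < 2:
            break
        best_pair, best_sim = (0, 1), -2.0
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                sim = cosine(merged[i], merged[j])
                if sim > best_sim:
                    best_pair, best_sim = (i, j), sim
        i, j = best_pair
        merged[i] = [(x + y) / 2 for x, y in zip(merged[i], merged[j])]
        del merged[j]  # j > i, so index i is unaffected
    return merged


# Four toy tokens forming two near-duplicate pairs.
visual_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
compressed = merge_most_similar(visual_tokens, r=2)
```

Each merge shortens the sequence by one token, so the language model sees fewer visual tokens while near-duplicates are collapsed rather than dropped.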
2310.03270
Report |
EfficientDM: Efficient Quantization-Aware Fine-Tuning of Low-Bit Diffusion Models |
Yefei He, Jing Liu, Weijia Wu, Hong Zhou, Bohan Zhuang |
Diffusion models have demonstrated remarkable capabilities in image synthesis
and related generative tasks. Nevertheless, their practicality for real-world
applications is constrained by substantial computational costs and latency
issues. Quantization is a dominant way to compress and accelerate diffusion
models, where post-training quantization (PTQ) and quantization-aware training
(QAT) are two main approaches, each bearing its own properties. While PTQ
exhibits efficiency in terms of both time and data usage, it may lead to
diminished performance in low bit-width. On the other hand, QAT can alleviate
performance degradation but comes with substantial demands on computational and
data resources. In this paper, we introduce a data-free and parameter-efficient
fine-tuning framework for low-bit diffusion models, dubbed EfficientDM, to
achieve QAT-level performance with PTQ-like efficiency. Specifically, we
propose a quantization-aware variant of the low-rank adapter (QALoRA) that can
be merged with model weights and jointly quantized to low bit-width. The
fine-tuning process distills the denoising capabilities of the full-precision
model into its quantized counterpart, eliminating the requirement for training
data. We also introduce scale-aware optimization and temporal learned step-size
quantization to further enhance performance. Extensive experimental results
demonstrate that our method significantly outperforms previous PTQ-based
diffusion models while maintaining similar time and data efficiency.
Specifically, there is only a 0.05 sFID increase when quantizing both weights
and activations of LDM-4 to 4-bit on ImageNet 256x256. Compared to QAT-based
methods, our EfficientDM also boasts a 16.2x faster quantization speed with
comparable generation quality. Code is available at
https://github.com/ThisisBillhe/EfficientDM. |
This paper proposes EfficientDM, a data-free and parameter-efficient fine-tuning framework for low-bit diffusion models, aiming to achieve quantization-aware training (QAT) performance with post-training quantization (PTQ) efficiency. |
Diffusion models, while powerful for image generation, suffer from high computational cost and latency. Quantization effectively addresses these issues but struggles with significant performance degradation at low bit-widths, especially for PTQ methods. |
The paper introduces a quantization-aware low-rank adapter (QALoRA), enabling joint quantization of adapter weights and model weights. This facilitates data-free fine-tuning by minimizing the MSE between estimated noises from full-precision and quantized models. The authors further propose scale-aware LoRA optimization to handle variations in weight quantization scales across layers and temporal activation LSQ (TALSQ) to tackle variations in activation distributions across time steps. |
EfficientDM achieves state-of-the-art performance for low-bit quantization of diffusion models on CIFAR-10, LSUN, and ImageNet datasets, outperforming previous PTQ-based methods while maintaining similar time and data efficiency.
The method enables quantization of LDM-4 model weights to 2-bit for the first time with marginal performance loss.
Ablation studies demonstrate the effectiveness of each proposed component, including QALoRA, scale-aware LoRA optimization, and TALSQ. |
Despite achieving QAT-level performance with PTQ-like data and time efficiency, EfficientDM still requires more GPU memory than pure PTQ methods, especially for large diffusion models.
Exploration of efficient diffusion models for video or 3D generation remains an open area for future work. |
diffusion models, model quantization, low-bit quantization, data-free fine-tuning, parameter-efficient fine-tuning |
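The EfficientDM entry above says the low-rank adapter is merged with the model weights and the result is quantized jointly. A minimal sketch of that merge-then-quantize step, assuming symmetric uniform quantization with a single scale per matrix (the quantizer, the helper names, and the toy matrices are illustrative assumptions; the distillation loop and the paper's scale-aware optimization are not shown):

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def quantize(w: Matrix, bits: int) -> Matrix:
    """Symmetric uniform fake-quantization with one scale for the matrix."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for row in w for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [[max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in row]
            for row in w]


def merged_quantized_weight(w: Matrix, b: Matrix, a: Matrix, bits: int) -> Matrix:
    """Quantize W + B @ A as one matrix, so no separate adapter remains."""
    return quantize(add(w, matmul(b, a)), bits)


w = [[0.5, -0.25], [0.1, 0.3]]   # frozen full-precision weight (out x in)
b = [[0.1], [0.05]]              # low-rank factor (out x rank), rank 1
a = [[1.0, -1.0]]                # low-rank factor (rank x in)
w_q = merged_quantized_weight(w, b, a, bits=4)
```

Because the adapter is folded in before quantization, inference uses a single low-bit matrix and needs no extra full-precision branch.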
2310.03020
Report |
Consistent-1-to-3: Consistent Image to 3D View Synthesis via Geometry-aware Diffusion Models |
Jianglong Ye, Peng Wang, Kejie Li, Yichun Shi, Heng Wang |
Zero-shot novel view synthesis (NVS) from a single image is an essential
problem in 3D object understanding. While recent approaches that leverage
pre-trained generative models can synthesize high-quality novel views from
in-the-wild inputs, they still struggle to maintain 3D consistency across
different views. In this paper, we present Consistent-1-to-3, which is a
generative framework that significantly mitigates this issue. Specifically, we
decompose the NVS task into two stages: (i) transforming observed regions to a
novel view, and (ii) hallucinating unseen regions. We design a scene
representation transformer and view-conditioned diffusion model for performing
these two stages respectively. Inside the models, to enforce 3D consistency, we
propose to employ epipolar-guided attention to incorporate geometry
constraints, and multi-view attention to better aggregate multi-view
information. Finally, we design a hierarchy generation paradigm to generate
long sequences of consistent views, allowing a full 360-degree observation of
the provided object image. Qualitative and quantitative evaluation over
multiple datasets demonstrates the effectiveness of the proposed mechanisms
against state-of-the-art approaches. Our project page is at
https://jianglongye.com/consistent123/ |
Introduces Consistent-1-to-3, a novel framework that generates consistent novel views of objects from any viewpoint given a single image. |
Zero-shot novel view synthesis is essential for 3D object understanding with applications in AR/VR, robotics, and content creation, but existing methods struggle to maintain 3D consistency across different views. |
Decomposes the task into two stages: using a Scene Representation Transformer (SRT) for photometric warping to novel views and a view-conditioned diffusion model to hallucinate unseen regions. Employs epipolar-guided and multi-view attention for 3D consistency and a hierarchical generation paradigm for long sequences of consistent views. |
Significantly improves geometric consistency compared to previous state-of-the-art methods.
Achieves superior performance on Objaverse and Google Scanned Objects datasets in terms of PSNR, SSIM, LPIPS, and flow warping error.
Demonstrates the effectiveness of each component through ablation studies. |
Trade-off between fidelity and consistency when using multi-view attention and hierarchical generation.
Future work includes incorporating better geometry constraints and representations. |
novel view synthesis, 3d consistency, diffusion models, scene representation transformer, epipolar geometry |
2310.03015
Report |
Efficient-3DiM: Learning a Generalizable Single-image Novel-view Synthesizer in One Day |
Yifan Jiang, Hao Tang, Jen-Hao Rick Chang, Liangchen Song, Zhangyang Wang, Liangliang Cao |
The task of novel view synthesis aims to generate unseen perspectives of an
object or scene from a limited set of input images. Nevertheless, synthesizing
novel views from a single image still remains a significant challenge in the
realm of computer vision. Previous approaches tackle this problem by adopting
mesh prediction, multi-plane image construction, or more advanced techniques
such as neural radiance fields. Recently, a pre-trained diffusion model that is
specifically designed for 2D image synthesis has demonstrated its capability in
producing photorealistic novel views, if sufficiently optimized on a 3D
finetuning task. Although the fidelity and generalizability are greatly
improved, training such a powerful diffusion model requires a vast volume of
training data and model parameters, resulting in a notoriously long time and
high computational costs. To tackle this issue, we propose Efficient-3DiM, a
simple but effective framework to learn a single-image novel-view synthesizer.
Motivated by our in-depth analysis of the inference process of diffusion
models, we propose several pragmatic strategies to reduce the training overhead
to a manageable scale, including a crafted timestep sampling strategy, a
superior 3D feature extractor, and an enhanced training scheme. When combined,
our framework is able to reduce the total training time from 10 days to less
than 1 day, significantly accelerating the training process under the same
computational platform (one instance with 8 Nvidia A100 GPUs). Comprehensive
experiments are conducted to demonstrate the efficiency and generalizability of
our proposed method. |
This paper introduces Efficient-3DiM, an efficient diffusion model framework for single-image novel view synthesis that significantly reduces training time without sacrificing performance. |
Training diffusion models for novel view synthesis is computationally expensive and time-consuming, hindering research progress. This work aims to improve training efficiency and make it more accessible. |
The paper introduces three key strategies: 1) a revised timestep sampling method using Gaussian distribution to prioritize important training segments, 2) integration of a self-supervised Vision Transformer (DINO-v2) for superior 3D feature extraction and amalgamation, and 3) an enhanced training paradigm with mixed-precision training and other optimizations. |
Gaussian sampling for timesteps proves more effective than uniform sampling, especially prioritizing later stages.
DINO-v2 encoder outperforms CLIP encoder in capturing 3D features, leading to better performance when combined with multi-scale feature amalgamation.
Efficient-3DiM achieves a 14x speedup compared to the baseline Zero 1-to-3 method, reducing training time from 10 days to less than 1 day on the same hardware. |
Multi-view consistency can be further improved for even better fidelity.
Applying the framework to more complex and realistic datasets beyond synthetic objects remains to be explored. |
novel view synthesis, diffusion models, efficient training, dino-v2, single-image 3d reconstruction |
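The Gaussian timestep sampling mentioned in the Efficient-3DiM entry above replaces a uniform draw over diffusion timesteps with a draw concentrated around a chosen region. A minimal sketch follows. The defaults `mean_frac=0.8` and `std_frac=0.2` are hypothetical, not values from the paper, and out-of-range draws are handled by rejection, which is also an assumption.

```python
import random
from typing import Optional


def sample_timestep(
    num_steps: int = 1000,
    mean_frac: float = 0.8,
    std_frac: float = 0.2,
    rng: Optional[random.Random] = None,
) -> int:
    """Draw an integer timestep in [0, num_steps) from a truncated Gaussian."""
    source = rng if rng is not None else random
    while True:
        t = round(source.gauss(mean_frac * num_steps, std_frac * num_steps))
        if 0 <= t < num_steps:  # reject draws outside the valid range
            return t


rng = random.Random(0)
gaussian_steps = [sample_timestep(rng=rng) for _ in range(1000)]
uniform_steps = [rng.randrange(1000) for _ in range(1000)]
```

Compared with `uniform_steps`, the Gaussian draws spend most training iterations near the chosen mean. Moving `mean_frac` changes which part of the diffusion schedule is prioritized.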
2310.02992
Report |
Kosmos-G: Generating Images in Context with Multimodal Large Language Models |
Xichen Pan, Li Dong, Shaohan Huang, Zhiliang Peng, Wenhu Chen, Furu Wei |
Recent advancements in subject-driven image generation have made significant
strides. However, current methods still fall short in diverse application
scenarios, as they require test-time tuning and cannot accept interleaved
multi-image and text input. These limitations keep them far from the ultimate
goal of "image as a foreign language in image generation." This paper presents
Kosmos-G, a model that leverages the advanced multimodal perception
capabilities of Multimodal Large Language Models (MLLMs) to tackle the
aforementioned challenge. Our approach aligns the output space of MLLM with
CLIP using the textual modality as an anchor and performs compositional
instruction tuning on curated data. Kosmos-G demonstrates an impressive
capability of zero-shot subject-driven generation with interleaved multi-image
and text input. Notably, the score distillation instruction tuning requires no
modifications to the image decoder. This allows for a seamless substitution of
CLIP and effortless integration with a myriad of U-Net techniques ranging from
fine-grained controls to personalized image decoder variants. We posit Kosmos-G
as an initial attempt towards the goal of "image as a foreign language in image
generation." The code can be found at https://aka.ms/Kosmos-G |
Introduces Kosmos-G, a novel model leveraging Multimodal Large Language Models (MLLMs) for zero-shot subject-driven image generation with interleaved multi-image and text input. |
Addresses limitations of current subject-driven image generation methods that require test-time tuning and struggle with complex, multi-entity scenarios, bringing us closer to “image as a foreign language in image generation.” |
Employs a three-stage 'align before instruct' training strategy: 1) Multimodal Language Modeling, 2) Image Decoder Aligning, and 3) Instruction Tuning with a compositional generation task on a curated multimodal dataset. |
Achieves impressive zero-shot generation results across diverse settings, including re-contextualization, stylization, modification, and accessory incorporation.
Outperforms or matches existing fine-tuning methods and test-time-tuning-free methods in single-entity subject-driven generation.
Seamlessly integrates with existing U-Net techniques like ControlNet and LoRA, unlocking a variety of novel applications. |
Evaluation on DreamBench uses only a single input image, which may limit performance relative to methods that use multiple images.
Further improvement possible by exploring different data paradigms and refining the alignment process, particularly for prompts with prefixes. |
image generation, subject-driven generation, multimodal large language model, zero-shot learning, vision-language alignment |
2310.02596
Report |
SweetDreamer: Aligning Geometric Priors in 2D Diffusion for Consistent Text-to-3D |
Weiyu Li, Rui Chen, Xuelin Chen, Ping Tan |
It is inherently ambiguous to lift 2D results from pre-trained diffusion
models to a 3D world for text-to-3D generation. 2D diffusion models solely
learn view-agnostic priors and thus lack 3D knowledge during the lifting,
leading to the multi-view inconsistency problem. We find that this problem
primarily stems from geometric inconsistency, and avoiding misplaced geometric
structures substantially mitigates the problem in the final outputs. Therefore,
we improve the consistency by aligning the 2D geometric priors in diffusion
models with well-defined 3D shapes during the lifting, addressing the vast
majority of the problem. This is achieved by fine-tuning the 2D diffusion model
to be viewpoint-aware and to produce view-specific coordinate maps of
canonically oriented 3D objects. In our process, only coarse 3D information is
used for aligning. This "coarse" alignment not only resolves the multi-view
inconsistency in geometries but also retains the ability in 2D diffusion models
to generate detailed and diversified high-quality objects unseen in the 3D
datasets. Furthermore, our aligned geometric priors (AGP) are generic and can
be seamlessly integrated into various state-of-the-art pipelines, obtaining
high generalizability in terms of unseen shapes and visual appearance while
greatly alleviating the multi-view inconsistency problem. Our method represents
a new state-of-the-art performance with an 85+% consistency rate by human
evaluation, while many previous methods are around 30%. Our project page is
https://sweetdreamer3d.github.io/ |
This paper proposes Aligned Geometric Priors (AGP) to address the multi-view inconsistency problem in text-to-3D generation by aligning 2D geometric priors in diffusion models with well-defined 3D shapes. |
Lifting 2D diffusion results to 3D is inherently ambiguous due to the lack of 3D knowledge, leading to inconsistent 3D structures across different views. This work tackles the primary cause: geometric inconsistency. |
A pre-trained 2D diffusion model is fine-tuned to generate viewpoint-conditioned canonical coordinate maps from a 3D dataset. This aligns the geometric priors in 2D diffusion with consistent 3D geometry, resulting in AGP. |
AGP significantly improves multi-view consistency in text-to-3D generation, achieving over 85% consistency rate compared to around 30% in previous methods.
The 'coarse' alignment using only coarse geometry information preserves the generalizability of 2D diffusion models, enabling diverse and high-quality 3D object generation.
AGP is generically applicable and can be seamlessly integrated into various text-to-3D pipelines using different 3D representations like DMTet and NeRF. |
The paper primarily focuses on geometric consistency and doesn't directly address appearance inconsistencies, which could be a future work direction.
There's a potential risk of degrading the generalizability of geometric priors during AGP training. Investigating regularization constraints could be beneficial. |
text-to-3d generation, multi-view consistency, diffusion models, geometric priors, 3d shape representation |
2310.02279
Report |
Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion |
Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, Stefano Ermon |
Consistency Models (CM) (Song et al., 2023) accelerate score-based diffusion
model sampling at the cost of sample quality but lack a natural way to
trade off quality for speed. To address this limitation, we propose Consistency
Trajectory Model (CTM), a generalization encompassing CM and score-based models
as special cases. CTM trains a single neural network that can -- in a single
forward pass -- output scores (i.e., gradients of log-density) and enables
unrestricted traversal between any initial and final time along the Probability
Flow Ordinary Differential Equation (ODE) in a diffusion process. CTM enables
the efficient combination of adversarial training and denoising score matching
loss to enhance performance and achieves new state-of-the-art FIDs for
single-step diffusion model sampling on CIFAR-10 (FID 1.73) and ImageNet at
64x64 resolution (FID 1.92). CTM also enables a new family of sampling schemes,
both deterministic and stochastic, involving long jumps along the ODE solution
trajectories. It consistently improves sample quality as computational budgets
increase, avoiding the degradation seen in CM. Furthermore, unlike CM, CTM's
access to the score function can streamline the adoption of established
controllable/conditional generation methods from the diffusion community. This
access also enables the computation of likelihood. The code is available at
https://github.com/sony/ctm. |
The paper introduces Consistency Trajectory Model (CTM), a new generative model unifying score-based and distillation models, which enables unrestricted time traversal along the Probability Flow ODE trajectory. |
CTM addresses limitations of current diffusion models: 1) discretization errors in score-based sampling and 2) lack of speed-quality trade-off in distillation sampling. It allows for more efficient and flexible sampling with improved generation quality. |
CTM trains a neural network to predict jumps along the PF ODE trajectory using a novel 'soft consistency matching' distillation loss. Additionally, it incorporates denoising score matching and adversarial losses to enhance student model training. |
CTM achieves state-of-the-art FID scores for single-step diffusion model sampling on CIFAR-10 and ImageNet 64x64.
A new sampling scheme called 'γ-sampling' allows for deterministic and stochastic sampling, providing control over sample variance.
CTM surpasses its teacher model in both density estimation and image generation quality. |
The current CTM implementation relies on discrete timesteps for training, limiting its theoretical potential for continuous time traversal.
Further investigation is needed to explore potential applications of CTM's trajectory control capabilities in downstream tasks like inpainting and colorization. |
generative models, diffusion models, score-based models, distillation models, consistency models |
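The γ-sampling scheme named in the findings above can be illustrated on a toy problem. The sketch below is not taken from the CTM code release. It assumes a variance-exploding schedule with noise scale sigma(t) = t, a one-dimensional Gaussian data distribution for which the PF ODE jump has a closed form (`toy_jump`, a hypothetical stand-in for the trained network), and the update rule "jump from t_n to sqrt(1 - γ²)·t_{n+1}, then add fresh noise of scale γ·t_{n+1}". Under those assumptions, γ = 0 is fully deterministic and γ = 1 re-noises completely at every step.

```python
import math
import random

SIGMA_DATA = 0.5  # assumed toy data distribution: N(0, SIGMA_DATA^2)


def toy_jump(x, t, s):
    """Closed-form PF ODE jump from time t to time s <= t.

    Valid only for the assumed setting (Gaussian data, sigma(t) = t);
    a trained CTM network would replace this function.
    """
    return x * math.sqrt((SIGMA_DATA**2 + s**2) / (SIGMA_DATA**2 + t**2))


def gamma_sample(jump, times, gamma, rng):
    """Draw one sample using decreasing `times` that end at 0."""
    # Start from the marginal at the largest time.
    x = rng.gauss(0.0, math.sqrt(SIGMA_DATA**2 + times[0] ** 2))
    for t_cur, t_next in zip(times, times[1:]):
        # Jump slightly past t_next, then add noise back up to t_next.
        s = math.sqrt(1.0 - gamma**2) * t_next
        x = jump(x, t_cur, s)
        x += gamma * t_next * rng.gauss(0.0, 1.0)
    return x
```

With `gamma=0.0` the result depends only on the initial noise draw, while intermediate values of `gamma` trade determinism for injected randomness at each step.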
2310.02239
Report |
MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens |
Kaizhi Zheng, Xuehai He, Xin Eric Wang |
The effectiveness of Multimodal Large Language Models (MLLMs) demonstrates a
profound capability in multimodal understanding. However, the simultaneous
generation of images with coherent texts is still underdeveloped. Addressing
this, we introduce a novel interleaved vision-and-language generation method,
centered around the concept of "generative vokens". These vokens serve as
pivotal elements contributing to coherent image-text outputs. Our method is
marked by a unique two-stage training strategy for description-free multimodal
generation, which does not necessitate extensive descriptions of images. We
integrate classifier-free guidance to enhance the alignment of generated images
and texts, ensuring more seamless and contextually relevant multimodal
interactions. Our model, MiniGPT-5, exhibits substantial improvement over the
baseline models on multimodal generation datasets, including MMDialog and VIST.
The human evaluation shows MiniGPT-5 is better than the baseline model in more
than 56% of cases for multimodal generation, highlighting its efficacy across
diverse benchmarks. |
The paper introduces MiniGPT-5, a novel interleaved vision-and-language generation method using "generative vokens" to bridge the gap between text and image generation in LLMs, enabling coherent image-text outputs. |
Existing Multimodal Large Language Models (MLLMs) excel in understanding but struggle with coherent multimodal output generation, particularly in tasks requiring integrated vision and language handling. |
The method employs a two-stage training strategy: (1) Description-free pretraining aligns visual features with text-image pairs. (2) Fine-tuning focuses on interleaved vision-and-language generation using multimodal datasets. Classifier-free guidance enhances image-text coherence during training. |
MiniGPT-5 shows superior performance over baseline models on multimodal generation datasets, including MMDialog and VIST.
Human evaluation indicates MiniGPT-5 generates better text narrations (55%), superior image quality (53%), and more coherent multimodal outputs (56%).
MiniGPT-5 effectively leverages long-horizon multimodal inputs, outperforming baselines in generating contextually relevant images and text. |
Maintaining object texture consistency in generated images remains challenging.
Further improvements in generated image quality are possible. |
multimodal generation, large language models, generative vokens, vision-and-language, classifier-free guidance |
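The classifier-free guidance mentioned in this report follows a simple combination rule that can be shown in a few lines. The sketch below gives the standard sampling-time formula only. The function name, the list-based noise vectors, and the guidance scale are illustrative assumptions, and it does not show how MiniGPT-5 connects the guidance to its voken features.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Blend unconditional and conditional noise predictions.

    Standard classifier-free guidance: scale = 0 returns the
    unconditional prediction, scale = 1 the conditional one, and
    scale > 1 extrapolates further toward the condition.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

For example, `cfg_combine([0.0, 1.0], [1.0, 3.0], 2.0)` pushes each component twice as far from the unconditional prediction as the conditional prediction does.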
2310.01830
Report |
AI-Generated Images as Data Source: The Dawn of Synthetic Era |
Zuhao Yang, Fangneng Zhan, Kunhao Liu, Muyu Xu, Shijian Lu |
The advancement of visual intelligence is intrinsically tethered to the
availability of large-scale data. In parallel, generative Artificial
Intelligence (AI) has unlocked the potential to create synthetic images that
closely resemble real-world photographs. This prompts a compelling inquiry: how
much visual intelligence could benefit from the advance of generative AI? This
paper explores the innovative concept of harnessing these AI-generated images
as new data sources, reshaping traditional modeling paradigms in visual
intelligence. In contrast to real data, AI-generated data exhibit remarkable
advantages, including unmatched abundance and scalability, the rapid generation
of vast datasets, and the effortless simulation of edge cases. Built on the
success of generative AI models, we examine the potential of their generated
data in a range of applications, from training machine learning models to
simulating scenarios for computational modeling, testing, and validation. We
probe the technological foundations that support this groundbreaking use of
generative AI, engaging in an in-depth discussion on the ethical, legal, and
practical considerations that accompany this transformative paradigm shift.
Through an exhaustive survey of current technologies and applications, this
paper presents a comprehensive view of the synthetic era in visual
intelligence. A project associated with this paper can be found at
https://github.com/mwxely/AIGS . |
This paper surveys the emerging field of using AI-generated images as data sources (AIGS) for enhancing visual intelligence tasks. |
AIGS offers benefits such as generating large-scale datasets with reduced cost and privacy concerns, leading to improved performance in various computer vision tasks. |
The paper reviews methods like GANs, Diffusion Models, and Neural Rendering for generating synthetic images. It explores their use for data augmentation and automatic label acquisition, enabling diverse applications. |
Models trained solely on synthetic images show promising results, sometimes surpassing real-image training.
Augmenting real datasets with synthetic images significantly boosts performance in classification, segmentation, and detection tasks.
NeRF-based AIGS shows potential for 3D-aware applications like robotics and autonomous driving. |
Explainability of AIGS in handling corner cases and outliers needs further research.
Development of more precise and robust evaluation metrics is crucial for assessing AIGS effectiveness. |
ai-generated images, synthetic data, generative models, neural rendering, computer vision |
2310.01819
Report |
TP2O: Creative Text Pair-to-Object Generation using Balance Swap-Sampling |
Jun Li, Zedong Zhang, Jian Yang |
Generating creative combinatorial objects from two seemingly unrelated object
texts is a challenging task in text-to-image synthesis, often hindered by a
focus on emulating existing data distributions. In this paper, we develop a
straightforward yet highly effective method, called balance
swap-sampling. First, we propose a swapping mechanism that generates a novel
combinatorial object image set by randomly exchanging intrinsic elements of two
text embeddings through a cutting-edge diffusion model. Second, we introduce a
balance swapping region to efficiently sample a small subset from the newly
generated image set by balancing CLIP distances between the new images and
their original generations, increasing the likelihood of accepting the
high-quality combinations. Last, we employ a segmentation method to compare
CLIP distances among the segmented components, ultimately selecting the most
promising object from the sampled subset. Extensive experiments demonstrate
that our approach outperforms recent SOTA T2I methods. Surprisingly, our
results even rival those of human artists, such as frog-broccoli. |
This paper introduces BASS (BAlance Swap-Sampling), a novel approach for generating creative combinatorial objects from two distinct object text descriptions in text-to-image synthesis. |
Current text-to-image models often struggle to generate truly creative and novel combinations of objects, focusing instead on emulating existing data distributions. |
BASS leverages a swapping mechanism to interchange elements of prompt embeddings, creating novel combinations. It then employs a balance region based on CLIP distances to sample high-quality combinatorial images, further refined using the Segment Anything Model (SAM) for semantic coherence. |
BASS generates novel and surprising combinatorial objects, in some cases rivaling the work of human artists in combining unrelated concepts.
The method outperforms SOTA T2I models in generating creative combinations and demonstrates the ability to create out-of-distribution images.
Evaluations using PickScore and HPSv2, trained on human preference datasets, reveal BASS's capability to generate objects with high human-preference value. |
The balance swap region can sometimes lead to nonsensical or chaotic image generation, requiring further investigation and refinement.
Current hyperparameter settings might favor majority classes, necessitating future research into distribution tailoring for enhanced creativity across all categories. |
text-to-image synthesis, combinatorial creativity, diffusion models, clip, out-of-distribution generation |
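The two sampling ideas in this report, swapping elements between two prompt embeddings and keeping only images that sit at a balanced distance from both originals, can be sketched as follows. This is a minimal illustration under assumptions: embeddings are flat lists of floats, the swap is an independent coin flip per element, and the balance test is a plain absolute-difference threshold. The names `swap_embeddings`, `in_balance_region`, `swap_prob`, and `tolerance` are hypothetical, and the paper's exact definitions may differ.

```python
import random


def swap_embeddings(emb_a, emb_b, swap_prob, rng):
    """Return a copy of emb_a with some elements replaced by emb_b's."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same length")
    # Each position independently takes its value from emb_b
    # with probability swap_prob, otherwise from emb_a.
    return [b if rng.random() < swap_prob else a
            for a, b in zip(emb_a, emb_b)]


def in_balance_region(dist_to_a, dist_to_b, tolerance):
    """Accept a candidate whose distances to both originals are similar.

    The distances stand in for CLIP distances between the new image
    and the images generated from each original prompt.
    """
    return abs(dist_to_a - dist_to_b) <= tolerance
```

A candidate with distances 0.30 and 0.35 passes a tolerance of 0.1, while one at 0.10 and 0.60 is rejected as too close to a single source object.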
2310.01779
Report |
HallE-Control: Controlling Object Hallucination in Large Multimodal Models |
Bohan Zhai, Shijia Yang, Chenfeng Xu, Sheng Shen, Kurt Keutzer, Chunyuan Li, Manling Li |
Current Large Multimodal Models (LMMs) achieve remarkable progress, yet there
remains significant uncertainty regarding their ability to accurately apprehend
visual details, that is, in performing detailed captioning. To address this, we
introduce CCEval, a GPT-4 assisted evaluation method for detailed
captioning. Interestingly, while LMMs demonstrate minimal object existence
hallucination in existing VQA benchmarks, our proposed evaluation reveals
continued susceptibility to such hallucinations. In this paper, we make the
first attempt to investigate such hallucination from different aspects,
including image resolution, language decoder size, and the amount, quality,
and granularity of instruction data. Our findings underscore the unwarranted inference
when the language description includes details at a finer object granularity
than what the vision module can ground or verify, thus inducing hallucination.
To control such hallucinations, we further attribute the reliability of
captioning to contextual knowledge (involving only contextually grounded
objects) and parametric knowledge (containing inferred objects by the model).
Thus, we introduce HallE-Control, a controllable LMM in terms of
Hallucination in object Existence. HallE-Control can
condition the captioning to shift between (i) exclusively depicting contextual
knowledge for grounded objects and (ii) blending it with parametric knowledge
to imagine inferred objects. Our method reduces hallucination by 44% compared
to LLaVA-7B and maintains the object coverage. |
The paper introduces HallE-Control, a novel approach for controlling object existence hallucination in large multimodal models (LMMs) trained for detailed image captioning. |
Existing LMMs, while proficient in tasks like VQA, often hallucinate objects in detailed captions, hindering their applicability in real-world scenarios. |
The authors first analyze factors influencing hallucination, identifying a key issue: misalignment between the vision encoder's grounding ability and objects mentioned in training captions. They then propose HallE-Control, which uses a control parameter and specialized datasets to distinguish between contextual (grounded) and parametric (inferred) knowledge, enabling controlled object imagination. |
Increasing image resolution significantly reduces hallucination by improving object grounding.
Scaling the language decoder or instruction data volume alone doesn't consistently mitigate hallucination.
HallE-Control reduces hallucination by 44% compared to the baseline LLaVA model while maintaining object coverage in captions. |
The study primarily focuses on object existence hallucination, leaving other types like attribute and relationship hallucinations for future work.
Further exploration into alternative control mechanisms and their impact on different LMM architectures is necessary. |
large multimodal models, hallucination control, image captioning, vision-language models, object grounding |
2310.01662
Report |
SYRAC: Synthesize, Rank, and Count |
Adriano D'Alessandro, Ali Mahdavi-Amiri, Ghassan Hamarneh |
Crowd counting is a critical task in computer vision, with several important
applications. However, existing counting methods rely on labor-intensive
density map annotations, necessitating the manual localization of each
individual pedestrian. While recent efforts have attempted to alleviate the
annotation burden through weakly or semi-supervised learning, these approaches
fall short of significantly reducing the workload. We propose a novel approach
to eliminate the annotation burden by leveraging latent diffusion models to
generate synthetic data. However, these models struggle to reliably understand
object quantities, leading to noisy annotations when prompted to produce images
with a specific quantity of objects. To address this, we use latent diffusion
models to create two types of synthetic data: one by removing pedestrians from
real images, which generates ranked image pairs with a weak but reliable object
quantity signal, and the other by generating synthetic images with a
predetermined number of objects, offering a strong but noisy counting signal.
Our method utilizes the ranking image pairs for pre-training and then fits a
linear layer to the noisy synthetic images using these crowd quantity features.
We report state-of-the-art results for unsupervised crowd counting. |
This paper introduces a novel unsupervised crowd counting method that leverages latent diffusion models to generate synthetic data, eliminating the need for manual annotations. |
Crowd counting is crucial in computer vision, but traditional methods rely on labor-intensive density map annotations. This work aims to alleviate this annotation burden by proposing an unsupervised approach. |
The method utilizes latent diffusion models to create two types of synthetic data: 1) Ranked image pairs with weak but reliable object quantity signal generated by removing pedestrians from real images. 2) Synthetic images with noisy count labels generated by prompting the model to produce images with a specific number of objects. A Siamese network is pre-trained on the ranked image pairs to learn crowd quantity features, followed by fine-tuning a linear layer on the noisy synthetic counting data. |
Achieves state-of-the-art performance on multiple crowd counting benchmarks for unsupervised crowd counting.
Significantly reduces the annotation burden associated with crowd counting.
Demonstrates the potential of using synthetic data for unsupervised crowd counting. |
The method's performance on datasets with extremely dense crowds is limited by the increased label noise in synthetic data with high object counts.
Future work could explore more sophisticated prompt engineering or alternative generative models to further improve the quality of synthetic data. |
unsupervised crowd counting, synthetic data, latent diffusion models, crowd density estimation, computer vision |
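The two-stage recipe in this report (pre-train on ranked pairs, then fit a linear layer on noisy counts) can be sketched with scalar features. The code below is an illustration under assumptions: it uses a hinge-style pairwise ranking loss and an ordinary least-squares line fit, neither of which is confirmed by the report, and the function names are hypothetical.

```python
def pairwise_ranking_loss(score_more, score_fewer, margin=0.0):
    """Hinge loss for one ranked pair.

    score_more is the network output for the original image and
    score_fewer the output for the copy with pedestrians removed.
    The loss is zero once score_more exceeds score_fewer by margin.
    """
    return max(0.0, margin - (score_more - score_fewer))


def fit_linear(features, counts):
    """Least-squares fit of counts ~ slope * feature + intercept."""
    n = len(features)
    mean_x = sum(features) / n
    mean_y = sum(counts) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(features, counts))
    var = sum((x - mean_x) ** 2 for x in features)
    if var == 0.0:
        raise ValueError("features must not all be equal")
    slope = cov / var
    return slope, mean_y - slope * mean_x
```

The ranking stage only needs to know which image of a pair holds more people, which is why the signal is weak but reliable. The line fit then maps the learned feature to an absolute count using the noisier synthetic labels.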
2310.01596
Report |
ImagenHub: Standardizing the evaluation of conditional image generation models |
Max Ku, Tianle Li, Kai Zhang, Yujie Lu, Xingyu Fu, Wenwen Zhuang, Wenhu Chen |
Recently, a myriad of conditional image generation and editing models have
been developed to serve different downstream tasks, including text-to-image
generation, text-guided image editing, subject-driven image generation,
control-guided image generation, etc. However, we observe huge inconsistencies
in experimental conditions (datasets, inference, and evaluation metrics) that
render fair comparisons difficult. This paper proposes ImagenHub, which is a
one-stop library to standardize the inference and evaluation of all the
conditional image generation models. Firstly, we define seven prominent tasks
and curate high-quality evaluation datasets for them. Secondly, we built a
unified inference pipeline to ensure fair comparison. Thirdly, we design two
human evaluation scores, i.e. Semantic Consistency and Perceptual Quality,
along with comprehensive guidelines to evaluate generated images. We train
expert raters to evaluate the model outputs based on the proposed metrics. Our
human evaluation achieves high inter-worker agreement, with Krippendorff's alpha
higher than 0.4 on 76% of the models. We comprehensively evaluated a
total of around 30 models and observed three key takeaways: (1) the existing
models' performance is generally unsatisfying except for Text-guided Image
Generation and Subject-driven Image Generation, with 74% of models achieving an
overall score lower than 0.5; (2) we examined the claims from published papers
and found that 83% of them hold, with a few exceptions; (3) none of the existing
automatic metrics has a Spearman's correlation higher than 0.2 except in
subject-driven image generation. Moving forward, we will continue our efforts
to evaluate newly published models and update our leaderboard to keep track of
the progress in conditional image generation. |
This paper introduces ImagenHub, a comprehensive library that standardizes the inference and evaluation of conditional image generation models across 7 prominent tasks with curated evaluation datasets. |
Addresses inconsistencies in datasets, inference, and evaluation metrics in existing conditional image generation models, enabling fair comparison and progress tracking. |
Curates standardized evaluation datasets, builds a unified inference pipeline, and defines human evaluation protocols (Semantic Consistency and Perceptual Quality) with comprehensive guidelines. |
Existing models perform poorly in most tasks except for Text-guided Image Generation and Subject-driven Image Generation.
83% of performance claims from published papers are consistent with ImagenHub's evaluation.
Automatic metrics show weak correlation with human preference, except in Subject-driven Image Generation. |
Reliance on human evaluation is expensive and time-consuming.
Future work includes developing generic automatic evaluation methods that better approximate human ratings. |
image generation, benchmarking, evaluation metrics, diffusion models, human evaluation |
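The report compares automatic metrics against human ratings using Spearman's correlation. As a reference for that statistic, the sketch below computes it from scratch as the Pearson correlation of average ranks. The function names are illustrative and this is not ImagenHub's own evaluation code.

```python
import math


def average_ranks(values):
    """Rank values from 1, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(metric_scores, human_scores):
    """Spearman's rank correlation between two score lists."""
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))
```

Because only ranks matter, a metric that orders models the same way as the human raters scores 1.0 regardless of its numeric scale.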
2310.01506
Report |
Direct Inversion: Boosting Diffusion-based Editing with 3 Lines of Code |
Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, Qiang Xu |
Text-guided diffusion models have revolutionized image generation and
editing, offering exceptional realism and diversity. Specifically, in the
context of diffusion-based editing, where a source image is edited according to
a target prompt, the process commences by acquiring a noisy latent vector
corresponding to the source image via the diffusion model. This vector is
subsequently fed into separate source and target diffusion branches for
editing. The accuracy of this inversion process significantly impacts the final
editing outcome, influencing both essential content preservation of the source
image and edit fidelity according to the target prompt. Prior inversion
techniques aimed at finding a unified solution in both the source and target
diffusion branches. However, our theoretical and empirical analyses reveal that
disentangling these branches leads to a distinct separation of responsibilities
for preserving essential content and ensuring edit fidelity. Building on this
insight, we introduce "Direct Inversion," a novel technique achieving optimal
performance of both branches with just three lines of code. To assess image
editing performance, we present PIE-Bench, an editing benchmark with 700 images
showcasing diverse scenes and editing types, accompanied by versatile
annotations and comprehensive evaluation metrics. Compared to state-of-the-art
optimization-based inversion techniques, our solution not only yields superior
performance across 8 editing methods but also achieves nearly an order of
magnitude speed-up. |
This paper introduces Direct Inversion, a simple yet effective technique for inverting diffusion models for image editing. |
Existing inversion techniques for diffusion-based image editing struggle to balance essential content preservation with edit fidelity, often relying on time-consuming optimization and facing error propagation issues. |
Direct Inversion disentangles the source and target diffusion branches, rectifying the deviation path in the source branch directly using a 3-line code modification for better content preservation while leaving the target branch untouched to maximize edit fidelity. |
Direct Inversion enhances essential content preservation by up to 83.2% in Structure Distance and up to 73.9% in background LPIPS compared to state-of-the-art optimization-based techniques.
It improves edit fidelity by up to 8.8% in Edit Region CLIP Similarity.
The method achieves a significant speedup, nearly an order of magnitude faster than optimization-based inversion methods. |
The performance of Direct Inversion is inherently tied to the limitations of existing diffusion-based editing methods, which can be unstable and have low success rates in certain editing tasks.
Future work includes extending the technique to video editing, developing editing models with higher success rates, and creating a more comprehensive metric evaluation system. |
image editing, diffusion models, inversion techniques, content preservation, edit fidelity |
2310.01407
Report |
CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation |
Kangfu Mei, Mauricio Delbracio, Hossein Talebi, Zhengzhong Tu, Vishal M. Patel, Peyman Milanfar |
Large generative diffusion models have revolutionized text-to-image
generation and offer immense potential for conditional generation tasks such as
image enhancement, restoration, editing, and compositing. However, their
widespread adoption is hindered by the high computational cost, which limits
their real-time application. To address this challenge, we introduce a novel
method dubbed CoDi, that adapts a pre-trained latent diffusion model to accept
additional image conditioning inputs while significantly reducing the sampling
steps required to achieve high-quality results. Our method can leverage
architectures such as ControlNet to incorporate conditioning inputs without
compromising the model's prior knowledge gained during large scale
pre-training. Additionally, a conditional consistency loss enforces consistent
predictions across diffusion steps, effectively compelling the model to
generate high-quality images with conditions in a few steps. Our
conditional-task learning and distillation approach outperforms previous
distillation methods, achieving a new state-of-the-art in producing
high-quality images with very few steps (e.g., 1-4) across multiple tasks,
including super-resolution, text-guided image editing, and depth-to-image
generation. |
This paper introduces CoDi, a novel method that adapts pre-trained latent diffusion models to accept image conditioning inputs while significantly reducing sampling steps for high-quality image generation. |
Large generative diffusion models are computationally expensive, limiting their real-time application. CoDi addresses this challenge by enabling rapid generation of high-quality images under various conditional settings. |
CoDi adapts pre-trained models with a conditional encoder and introduces a conditional consistency loss. This loss enforces consistent predictions across diffusion steps, enabling high-quality generation with few steps. |
CoDi outperforms previous distillation methods in visual quality and quantitative metrics across tasks like super-resolution and image editing.
The method enables parameter-efficient distillation (PE-CoDi), adapting models to new tasks with minimal additional parameters.
CoDi achieves performance comparable to models that require 20-200 sampling steps while using only a few steps (e.g., 1-4). |
The current adapter architecture introduces additional computation, which could be addressed with lightweight architectures in future work.
While CoDi enhances diffusion model practicality, potential misuse for deceptive content necessitates ethical considerations. |
diffusion models, conditional image generation, model distillation, parameter-efficient tuning, image-to-image translation |
2310.01406
Report |
HumanNorm: Learning Normal Diffusion Model for High-quality and Realistic 3D Human Generation |
Xin Huang, Ruizhi Shao, Qi Zhang, Hongwen Zhang, Ying Feng, Yebin Liu, Qing Wang |
Recent text-to-3D methods employing diffusion models have made significant
advancements in 3D human generation. However, these approaches face challenges
due to the limitations of text-to-image diffusion models, which lack an
understanding of 3D structures. Consequently, these methods struggle to achieve
high-quality human generation, resulting in smooth geometry and cartoon-like
appearances. In this paper, we propose HumanNorm, a novel approach for
high-quality and realistic 3D human generation. The main idea is to enhance the
model's 2D perception of 3D geometry by learning a normal-adapted diffusion
model and a normal-aligned diffusion model. The normal-adapted diffusion model
can generate high-fidelity normal maps corresponding to user prompts with
view-dependent and body-aware text. The normal-aligned diffusion model learns
to generate color images aligned with the normal maps, thereby transforming
physical geometry details into realistic appearance. Leveraging the proposed
normal diffusion model, we devise a progressive geometry generation strategy
and a multi-step Score Distillation Sampling (SDS) loss to enhance the
performance of 3D human generation. Comprehensive experiments substantiate
HumanNorm's ability to generate 3D humans with intricate geometry and realistic
appearances. HumanNorm outperforms existing text-to-3D methods in both geometry
and texture quality. The project page of HumanNorm is
https://humannorm.github.io/. |
Proposes HumanNorm, a novel approach for generating high-quality, realistic 3D human models from text descriptions by leveraging normal diffusion models to enhance 2D diffusion models' understanding of 3D geometry. |
Existing text-to-3D human generation methods often struggle to produce high-fidelity models, resulting in smooth geometry, unrealistic textures, and artifacts. This work addresses these limitations to achieve more realistic and detailed 3D human generation. |
Introduces a normal-adapted diffusion model for generating detailed geometry from text prompts by learning from multi-view normal maps. Employs a normal-aligned diffusion model to generate textures aligned with the 3D geometry using normal maps as guidance. Utilizes a progressive geometry generation strategy and multi-step Score Distillation Sampling (SDS) loss to enhance performance and realism. |
Generates 3D humans with intricate geometric details like clothing wrinkles and realistic appearances.
Quantitative evaluation shows superior performance over existing text-to-3D methods in terms of FID and CLIP score.
User study confirms HumanNorm produces higher-quality 3D humans compared to state-of-the-art methods. |
Current implementation requires a rigged human skeleton for animation. Future work will integrate SMPL-X for direct animation and improved body details.
Generated textures might exhibit shading inconsistencies. Future research will explore Physically-Based Rendering (PBR) for material estimation and relighting. |
3d human generation, text-to-3d, diffusion models, normal diffusion, score distillation sampling |
2310.01400
Report |
Sequential Data Generation with Groupwise Diffusion Process |
Sangyun Lee, Gayoung Lee, Hyunsu Kim, Junho Kim, Youngjung Uh |
We present the Groupwise Diffusion Model (GDM), which divides data into
multiple groups and diffuses one group at one time interval in the forward
diffusion process. GDM generates data sequentially from one group at one time
interval, leading to several interesting properties. First, as an extension of
diffusion models, GDM generalizes certain forms of autoregressive models and
cascaded diffusion models. As a unified framework, GDM allows us to investigate
design choices that have been overlooked in previous works, such as
data-grouping strategy and order of generation. Furthermore, since one group of
the initial noise affects only a certain group of the generated data, latent
space now possesses group-wise interpretable meaning. We can further extend GDM
to the frequency domain where the forward process sequentially diffuses each
group of frequency components. Dividing the frequency bands of the data as
groups allows the latent variables to become a hierarchical representation
where individual groups encode data at different levels of abstraction. We
demonstrate several applications of such representation including
disentanglement of semantic attributes, image editing, and generating
variations. |
Introduces the Groupwise Diffusion Model (GDM), which divides data into groups and diffuses one group at a time, allowing for sequential data generation and interpretable latent space. |
Provides a unified framework that generalizes certain autoregressive and cascaded diffusion models, offering flexibility in grouping strategies and generation order. |
Extends traditional diffusion models by employing a per-group noise scheduling strategy and generalizing the noise schedule function to a matrix form. |
GDM with specific grouping strategies encapsulates autoregressive models and cascaded diffusion models.
GDM's latent space exhibits group-wise interpretability, where each latent group influences specific data elements.
Extending GDM to the frequency domain (GDM-F) enables hierarchical representation learning, disentanglement of semantic attributes, and image editing applications. |
Sampling efficiency decreases as the number of groups increases.
Further investigation is needed to determine the optimal grouping strategies and generation order for various datasets. |
diffusion models, generative models, interpretable latent space, hierarchical representation learning, image editing |
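For the GDM entry above, the idea of diffusing one group per time interval can be illustrated with a small NumPy sketch. It is only an illustration under stated assumptions: the names `group_noise_levels` and `groupwise_forward` are hypothetical, the schedule is assumed to be linear within each interval, and plain linear interpolation toward noise stands in for whatever noising rule the paper actually uses.

```python
import numpy as np

def group_noise_levels(t, n_groups):
    """Noise level in [0, 1] for each group at time t in [0, 1].

    Assumption: group j is diffused linearly during the interval
    [j / n_groups, (j + 1) / n_groups] and is untouched before it.
    """
    return np.clip(n_groups * t - np.arange(n_groups), 0.0, 1.0)

def groupwise_forward(x0, noise, group_ids, t, n_groups):
    """Move each element toward noise according to its group's level."""
    level = group_noise_levels(t, n_groups)[group_ids]  # per-element level
    return (1.0 - level) * x0 + level * noise

x0 = np.array([1.0, 2.0, 3.0, 4.0])
noise = np.array([10.0, 20.0, 30.0, 40.0])
group_ids = np.array([0, 0, 1, 1])  # two groups of two elements each
half_way = groupwise_forward(x0, noise, group_ids, t=0.5, n_groups=2)
```

At `t=0.5` with two groups, the first group has been fully replaced by noise while the second is still clean, which is the sequential behaviour the summary describes.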
2310.01391
Report |
A Restoration Network as an Implicit Prior |
Yuyang Hu, Mauricio Delbracio, Peyman Milanfar, Ulugbek S. Kamilov |
Image denoisers have been shown to be powerful priors for solving inverse
problems in imaging. In this work, we introduce a generalization of these
methods that allows any image restoration network to be used as an implicit
prior. The proposed method uses priors specified by deep neural networks
pre-trained as general restoration operators. The method provides a principled
approach for adapting state-of-the-art restoration models for other inverse
problems. Our theoretical result analyzes its convergence to a stationary point
of a global functional associated with the restoration operator. Numerical
results show that the method using a super-resolution prior achieves
state-of-the-art performance both quantitatively and qualitatively. Overall,
this work offers a step forward for solving inverse problems by enabling the
use of powerful pre-trained restoration models as priors. |
This paper introduces Deep Restoration Priors (DRP), a novel method that generalizes the use of image denoisers as priors for solving inverse problems to any image restoration network. |
This generalization enables the adaptation of powerful pre-trained restoration models for solving a variety of inverse problems, potentially leading to improved performance. |
DRP incorporates a pre-trained restoration network into an iterative optimization framework similar to plug-and-play priors (PnP), leveraging the network's implicit prior to regularize the solution. |
The paper provides theoretical analysis proving the convergence of DRP to a stationary point of an objective function associated with the restoration operator.
DRP, when using a super-resolution network as a prior, achieves state-of-the-art performance on deblurring and super-resolution tasks, outperforming existing methods based on denoiser priors.
A prior refinement strategy, inspired by the theoretical analysis, is introduced to further improve the performance of DRP. |
The performance of DRP is limited by the quality of the pre-trained restoration model used as a prior.
The theoretical analysis currently relies on the assumption that the restoration model performs MMSE estimation, which might not always hold in practice. |
inverse problems, image restoration, deep learning, plug-and-play priors, super-resolution |
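For the DRP entry above, the general shape of a plug-and-play style iteration with a restoration operator as the prior might look like the following sketch. It is a generic gradient scheme, not the paper's exact update rule: `pnp_restoration_prior`, `gamma`, `tau`, and the toy `restore` callable are all hypothetical names, and a simple shrinkage function stands in for a pre-trained restoration network.

```python
import numpy as np

def pnp_restoration_prior(y, A, restore, gamma=0.5, tau=1.0, n_iters=100):
    """Iterate x <- x - gamma * (data gradient + tau * prior residual).

    y       : measurements, shape (m,)
    A       : linear forward operator, shape (m, n)
    restore : callable standing in for a restoration network (assumption)
    """
    x = A.T @ y  # simple initialisation from the measurements
    for _ in range(n_iters):
        data_grad = A.T @ (A @ x - y)      # gradient of 0.5 * ||Ax - y||^2
        prior_residual = x - restore(x)    # pulls x toward the restorer's output
        x = x - gamma * (data_grad + tau * prior_residual)
    return x

A = np.eye(3)
y = np.array([1.0, 2.0, 3.0])
shrink = lambda v: 0.5 * v  # toy "restorer": not a real network
x_hat = pnp_restoration_prior(y, A, shrink)
```

With this toy setup the fixed point satisfies `(x - y) + 0.5 * x = 0`, so the iterate settles at `y / 1.5`, a compromise between the data term and the stand-in prior.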
2310.01218
Report |
Making LLaMA SEE and Draw with SEED Tokenizer |
Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, Ying Shan |
The great success of Large Language Models (LLMs) has expanded the potential
of multimodality, contributing to the gradual evolution of General Artificial
Intelligence (AGI). A true AGI agent should not only possess the capability to
perform predefined multi-tasks but also exhibit emergent abilities in an
open-world context. However, despite the considerable advancements made by
recent multimodal LLMs, they still fall short in effectively unifying
comprehension and generation tasks, let alone open-world emergent abilities. We
contend that the key to overcoming the present impasse lies in enabling text
and images to be represented and processed interchangeably within a unified
autoregressive Transformer. To this end, we introduce SEED, an elaborate image
tokenizer that empowers LLMs with the ability to SEE and Draw at the same time.
We identify two crucial design principles: (1) Image tokens should be
independent of 2D physical patch positions and instead be produced with a 1D
causal dependency, exhibiting intrinsic interdependence that aligns with the
left-to-right autoregressive prediction mechanism in LLMs. (2) Image tokens
should capture high-level semantics consistent with the degree of semantic
abstraction in words, and be optimized for both discriminativeness and
reconstruction during the tokenizer training phase. With SEED tokens, LLM is
able to perform scalable multimodal autoregression under its original training
recipe, i.e., next-word prediction. SEED-LLaMA is therefore produced by
large-scale pretraining and instruction tuning on the interleaved textual and
visual data, demonstrating impressive performance on a broad range of
multimodal comprehension and generation tasks. More importantly, SEED-LLaMA has
exhibited compositional emergent abilities such as multi-turn in-context
multimodal generation, acting like your AI assistant. |
This paper introduces SEED, a novel image tokenizer designed to enhance Large Language Models (LLMs) with multimodal capabilities, and presents SEED-LLaMA, an MLLM built using SEED that demonstrates strong performance in multimodal comprehension, generation, and emergent abilities. |
Existing MLLMs struggle to effectively unify comprehension and generation tasks and lack open-world emergent abilities. This work addresses these limitations by enabling interchangeable representation and processing of text and images within a unified autoregressive Transformer framework. |
SEED tokenizes images into discrete codes with 1D causal dependency and high-level semantics. SEED-LLaMA is trained via large-scale multimodal pretraining and instruction tuning on interleaved textual and visual data, leveraging the next-word prediction objective of LLMs. |
SEED-LLaMA achieves competitive performance on various multimodal comprehension tasks, including image captioning, image/video question answering, surpassing some models using continuous visual representations.
SEED-LLaMA demonstrates strong text-to-image generation capabilities, producing images that are highly correlated with given textual descriptions.
SEED-LLaMA exhibits emergent abilities such as multi-turn in-context multimodal generation, including image and text generation based on user instructions, and zero-shot compositional image generation (e.g., stylized generation, image blending). |
The authors acknowledge that current VQA benchmarks may not be ideal for evaluating MLLMs with open-ended output as they require exact matches.
Future work will explore enhancements to the SEED tokenizer and expand the pretraining data scale and model size of SEED-LLaMA. |
multimodal learning, large language models, image tokenization, multimodal generation, emergent abilities |
2310.01110
Report |
Prompt-tuning latent diffusion models for inverse problems |
Hyungjin Chung, Jong Chul Ye, Peyman Milanfar, Mauricio Delbracio |
We propose a new method for solving imaging inverse problems using
text-to-image latent diffusion models as general priors. Existing methods using
latent diffusion models for inverse problems typically rely on simple null text
prompts, which can lead to suboptimal performance. To address this limitation,
we introduce a method for prompt tuning, which jointly optimizes the text
embedding on-the-fly while running the reverse diffusion process. This allows
us to generate images that are more faithful to the diffusion prior. In
addition, we propose a method to keep the evolution of latent variables within
the range space of the encoder, by projection. This helps to reduce image
artifacts, a major problem when using latent diffusion models instead of
pixel-based diffusion models. Our combined method, called P2L, outperforms both
image- and latent-diffusion model-based inverse problem solvers on a variety of
tasks, such as super-resolution, deblurring, and inpainting. |
This paper proposes P2L, a novel method for solving imaging inverse problems using text-to-image latent diffusion models as general priors, enhancing restoration quality by jointly optimizing text embedding during inference. |
Existing methods using latent diffusion models for inverse problems often rely on simple null text prompts, leading to suboptimal performance in generating high-fidelity images. |
The method leverages prompt tuning to optimize text embedding on-the-fly while running the reverse diffusion process. Additionally, it introduces a projection technique to constrain the evolution of latent variables within the range space of the encoder, thereby reducing image artifacts. |
P2L significantly outperforms both image- and latent-diffusion model-based inverse problem solvers in perceptual quality (FID, LPIPS) on tasks like super-resolution, deblurring, and inpainting.
The method effectively mitigates artifacts common in latent diffusion model-based restoration, resulting in sharper and more realistic reconstructions.
It demonstrates efficacy in high-resolution image restoration, achieving comparable or superior performance to computationally expensive patch-based methods. |
Prompt tuning, while improving performance, increases computational complexity due to additional forward/backward passes through the model, demanding future research for time-critical applications.
The use of continuous text embedding optimization hinders direct interpretation of the learned text. Employing models with text decoders, like Imagen, could address this limitation. |
inverse problems, latent diffusion models, prompt tuning, image restoration, generative priors |
2310.01107
Report |
Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models |
Hyeonho Jeong, Jong Chul Ye |
Recent endeavors in video editing have showcased promising results in
single-attribute editing or style transfer tasks, either by training
text-to-video (T2V) models on text-video data or adopting training-free
methods. However, when confronted with the complexities of multi-attribute
editing scenarios, they exhibit shortcomings such as omitting or overlooking
intended attribute changes, modifying the wrong elements of the input video,
and failing to preserve regions of the input video that should remain intact.
To address this, here we present a novel grounding-guided video-to-video
translation framework called Ground-A-Video for multi-attribute video editing.
Ground-A-Video attains temporally consistent multi-attribute editing of input
videos in a training-free manner without aforementioned shortcomings. Central
to our method is the introduction of Cross-Frame Gated Attention which
incorporates groundings information into the latent representations in a
temporally consistent fashion, along with Modulated Cross-Attention and optical
flow guided inverted latents smoothing. Extensive experiments and applications
demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline
methods in terms of edit-accuracy and frame consistency. Further results and
code are available at our project page (http://ground-a-video.github.io). |
Ground-A-Video is a novel video editing framework that performs multi-attribute editing on input videos by leveraging both spatially-continuous and discrete conditions. |
Existing video editing methods struggle with multi-attribute editing scenarios, often exhibiting issues like omitting desired edits, modifying incorrect elements, or failing to preserve intended regions. |
Ground-A-Video utilizes a combination of: (1) Optical flow guided latent smoothing for temporal consistency. (2) Inflated ControlNet with depth maps for structural guidance. (3) Modulated Cross-Attention for consistent null-embeddings. (4) Cross-Frame Gated Attention for incorporating video groundings. |
Outperforms baseline methods in terms of edit-accuracy and frame consistency in qualitative and quantitative evaluations.
Demonstrates successful application in video style transfer and text-to-video generation with pose control.
Exhibits strong performance in preserving regions not targeted for editing, especially when combined with an inpainting strategy guided by groundings. |
Performance is heavily reliant on the accuracy of the video groundings; inaccurate groundings can lead to editing errors.
While ControlNet enhances structural guidance, it can limit flexibility. This limitation is mitigated by adjusting the 'ControlNet Scale' parameter. |
video editing, multi-attribute editing, grounding, stable diffusion, controlnet |
2310.01018
Report |
Controlling Vision-Language Models for Multi-Task Image Restoration |
Ziwei Luo, Fredrik K. Gustafsson, Zheng Zhao, Jens Sjölund, Thomas B. Schön |
Vision-language models such as CLIP have shown great impact on diverse
downstream tasks for zero-shot or label-free predictions. However, when it
comes to low-level vision such as image restoration their performance
deteriorates dramatically due to corrupted inputs. In this paper, we present a
degradation-aware vision-language model (DA-CLIP) to better transfer pretrained
vision-language models to low-level vision tasks as a multi-task framework for
image restoration. More specifically, DA-CLIP trains an additional controller
that adapts the fixed CLIP image encoder to predict high-quality feature
embeddings. By integrating the embedding into an image restoration network via
cross-attention, we are able to pilot the model to learn a high-fidelity image
reconstruction. The controller itself will also output a degradation feature
that matches the real corruptions of the input, yielding a natural classifier
for different degradation types. In addition, we construct a mixed degradation
dataset with synthetic captions for DA-CLIP training. Our approach advances
state-of-the-art performance on both degradation-specific and
unified image restoration tasks, showing a promising direction of
prompting image restoration with large-scale pretrained vision-language models.
Our code is available at https://github.com/Algolzw/daclip-uir. |
DA-CLIP, a degradation-aware vision-language model, leverages pretrained VLMs for multi-task image restoration by adapting the image encoder to predict high-quality features from corrupted images and classifying degradation types. |
Existing VLMs struggle with low-level vision tasks like image restoration due to feature misalignment between corrupted inputs and clean captions, limiting their use in this domain. |
DA-CLIP introduces an Image Controller that adapts a frozen CLIP image encoder to output both high-quality content embeddings aligned with clean captions and degradation embeddings matching real corruption types. Trained with contrastive learning on a mixed degradation dataset with synthetic captions, DA-CLIP integrates into image restoration networks via cross-attention and prompt learning, enabling both degradation-specific and unified restoration. |
DA-CLIP improves image restoration performance on various degradation-specific tasks, setting a new state-of-the-art on deraining.
In unified image restoration, DA-CLIP consistently outperforms existing methods in perceptual quality across diverse degradation types.
DA-CLIP effectively classifies ten degradation types, achieving near-perfect accuracy on most and significantly improving over the original CLIP model's classification ability. |
The current dataset consists of images with single degradation types, limiting the model's ability to handle mixed degradations in real-world scenarios.
DA-CLIP introduces additional model complexity and memory requirements compared to baseline models. |
image restoration, vision-language models, clip, unified image restoration, degradation classification |
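The DA-CLIP entry above states that the controller is trained with contrastive learning against caption embeddings. A minimal NumPy sketch of a CLIP-style symmetric contrastive loss is given below. The exact loss used by DA-CLIP is not specified in this record, so the symmetric cross-entropy form, the function name `symmetric_contrastive_loss`, and the default temperature are assumptions.

```python
import numpy as np

def _log_softmax(logits, axis):
    # Subtract the max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    Row i of image_emb is assumed to be paired with row i of text_emb.
    """
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # pairwise cosine similarities
    diag = np.arange(len(logits))
    loss_i2t = -_log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2i = -_log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)

emb = np.eye(4)
matched = symmetric_contrastive_loss(emb, emb)
shuffled = symmetric_contrastive_loss(emb, emb[[1, 2, 3, 0]])
```

Correctly paired embeddings give a loss near zero, while mismatched pairs are penalised heavily. In a DA-CLIP-like setup one such term could be applied to the content embedding and another to the degradation embedding, though that pairing is an interpretation of the summary rather than a confirmed detail.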
2310.00936
Report |
Trained Latent Space Navigation to Prevent Lack of Photorealism in Generated Images on Style-based Models |
Takumi Harada, Kazuyuki Aihara, Hiroyuki Sakai |
Recent studies on StyleGAN variants show promising performances for various
generation tasks. In these models, latent codes have traditionally been
manipulated and searched for the desired images. However, this approach
sometimes suffers from a lack of photorealism in generated images due to a lack
of knowledge about the geometry of the trained latent space. In this paper, we
show a simple unsupervised method that provides well-trained local latent
subspace, enabling latent code navigation while preserving the photorealism of
the generated images. Specifically, the method identifies densely mapped latent
spaces and restricts latent manipulations within the local latent subspace.
Experimental results demonstrate that images generated within the local latent
subspace maintain photorealism even when the latent codes are significantly and
repeatedly manipulated. Moreover, experiments show that the method can be
applied to latent code optimization for various types of style-based models.
Our empirical evidence of the method will benefit applications in style-based
models. |
This paper proposes 'Bounded Local Space', a method for navigating the latent space of StyleGAN-like models while preserving image photorealism, especially during large traversals. |
Manipulating latent codes in StyleGAN often yields unrealistic images because the manipulated codes drift outside the distribution the model was trained on. This method addresses the issue by restricting manipulations to a learned, photorealistic subspace. |
The method computes a 'Bounded Local Space' around a latent code using singular value decomposition of the Jacobian matrix of the StyleGAN mapping network. This space, defined by singular vectors and values, restricts large traversal steps to maintain photorealism. |
Bounded Local Space preserves photorealism even with significant and repeated latent code manipulations, as demonstrated by lower FID scores compared to baseline methods.
The method is effective across various StyleGAN architectures and datasets, showing its generalizability.
Bounded Local Space can be integrated into latent code optimization for tasks like aesthetic image manipulation, masked image search, and text-guided image generation, leading to more realistic results. |
The method's reliance on both Z and W latent spaces limits its application with inversion methods that only operate in extended W spaces.
While the method mitigates unrealistic images during traversal, it does not guarantee photorealism for all manipulations, as StyleGAN itself may generate unrealistic images within its trained distribution. |
generative adversarial networks, stylegan, latent space manipulation, photorealism, image generation |
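The Bounded Local Space entry above relies on a singular value decomposition of the mapping network's Jacobian. The sketch below shows only the subspace part of that idea: it estimates a Jacobian by central differences and projects a proposed latent step onto the top-k right singular vectors. The function names, the finite-difference Jacobian, and the toy linear mapping are assumptions, and the rule the paper uses to bound step sizes with the singular values is not reproduced here.

```python
import numpy as np

def numerical_jacobian(f, z, eps=1e-5):
    """Central-difference Jacobian of f at z, shape (out_dim, in_dim)."""
    z = np.asarray(z, dtype=float)
    cols = []
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = eps
        cols.append((f(z + step) - f(z - step)) / (2.0 * eps))
    return np.stack(cols, axis=1)

def project_to_local_subspace(f, z, direction, k):
    """Keep only the part of `direction` spanned by the top-k right singular vectors."""
    jac = numerical_jacobian(f, z)
    _, _, vt = np.linalg.svd(jac, full_matrices=False)
    basis = vt[:k]                      # rows are orthonormal input directions
    return basis.T @ (basis @ direction)

# Toy stand-in for a mapping network: nearly flat along the third axis.
M = np.diag([3.0, 2.0, 0.001])
mapping = lambda z: M @ z
step = project_to_local_subspace(mapping, np.zeros(3), np.ones(3), k=2)
```

In this toy case the third input direction barely changes the output, so it is dropped and the step keeps only its first two components.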
2310.00808
Report |
Completing Visual Objects via Bridging Generation and Segmentation |
Xiang Li, Yinpeng Chen, Chung-Ching Lin, Hao Chen, Kai Hu, Rita Singh, Bhiksha Raj, Lijuan Wang, Zicheng Liu |
This paper presents a novel approach to object completion, with the primary
goal of reconstructing a complete object from its partially visible components.
Our method, named MaskComp, delineates the completion process through iterative
stages of generation and segmentation. In each iteration, the object mask is
provided as an additional condition to boost image generation, and, in return,
the generated images can lead to a more accurate mask by fusing the
segmentation of images. We demonstrate that the combination of one generation
and one segmentation stage effectively functions as a mask denoiser. Through
alternation between the generation and segmentation stages, the partial object
mask is progressively refined, providing precise shape guidance and yielding
superior object completion results. Our experiments demonstrate the superiority
of MaskComp over existing approaches, e.g., ControlNet and Stable Diffusion,
establishing it as an effective solution for object completion. |
Introduces MaskComp, a novel object completion approach that bridges conditional generation and segmentation by leveraging the observation that generated object quality is directly related to the conditioned mask quality. |
Object completion is challenging due to the need for seamless alignment between generated and partial objects. MaskComp addresses this by iteratively refining incomplete masks to provide comprehensive shape guidance. |
MaskComp utilizes an Iterative Mask Denoising (IMD) process with alternating generation and segmentation stages. The generation stage (CompNet) produces complete object images conditioned on partial objects and masks. The segmentation stage uses an off-the-shelf model (SAM) to refine masks from generated images. |
MaskComp significantly outperforms state-of-the-art methods like ControlNet and Stable Diffusion in FID scores and user study rankings for object completeness and realism.
The quality of the conditioned mask significantly influences the quality of generated objects, with complete masks leading to the best results.
MaskComp is robust to segmentation errors and can potentially be adapted to scenarios without ground-truth complete objects during training. |
The current implementation of MaskComp requires multiple diffusion processes in each IMD step, impacting inference speed.
MaskComp may struggle with uncommon object poses, highlighting the need for more diverse training datasets. |
object completion, image generation, segmentation, mask denoising, conditional generation |
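The alternating generation and segmentation loop described in the MaskComp entry above can be sketched as follows. The callables `generate` and `segment` are placeholders for a conditional generator and an off-the-shelf segmenter. Fusing by majority vote and re-adding the visible region after each round are assumptions made for this illustration, since the record only says that segmentations are fused.

```python
import numpy as np

def iterative_mask_denoising(partial_image, partial_mask, generate, segment,
                             n_iters=3, n_samples=4):
    """Alternate generation and segmentation to refine an object mask.

    generate(image, mask) -> image array   (stand-in for the generation stage)
    segment(image)        -> boolean mask  (stand-in for the segmentation stage)
    """
    visible = partial_mask.astype(bool)
    mask = visible.copy()
    for _ in range(n_iters):
        # Generation stage: sample several completions conditioned on the mask.
        images = [generate(partial_image, mask) for _ in range(n_samples)]
        # Segmentation stage: segment each sample and fuse by majority vote.
        votes = np.mean([segment(img) for img in images], axis=0)
        mask = (votes >= 0.5) | visible  # never drop the visible part
    return mask

# Toy stubs: the "generator" always draws the full object.
full = np.zeros((4, 4), dtype=bool)
full[1:3, :] = True
partial = full.copy()
partial[:, 2:] = False  # right half of the object is occluded
refined = iterative_mask_denoising(
    partial.astype(float), partial,
    generate=lambda img, m: full.astype(float),
    segment=lambda img: img > 0.5,
)
```

With real models each call to `generate` would run a diffusion process, which is why the record lists inference speed as a limitation.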
2310.00632
Report |
Win-Win: Training High-Resolution Vision Transformers from Two Windows |
Vincent Leroy, Jerome Revaud, Thomas Lucas, Philippe Weinzaepfel |
Transformers have become the standard in state-of-the-art vision
architectures, achieving impressive performance on both image-level and dense
pixelwise tasks. However, training vision transformers for high-resolution
pixelwise tasks has a prohibitive cost. Typical solutions boil down to
hierarchical architectures, fast and approximate attention, or training on
low-resolution crops. This latter solution does not constrain architectural
choices, but it leads to a clear performance drop when testing at resolutions
significantly higher than that used for training, thus requiring ad-hoc and
slow post-processing schemes. In this paper, we propose a novel strategy for
efficient training and inference of high-resolution vision transformers. The
key principle is to mask out most of the high-resolution inputs during
training, keeping only N random windows. This allows the model to learn local
interactions between tokens inside each window, and global interactions between
tokens from different windows. As a result, the model can directly process the
high-resolution input at test time without any special trick. We show that this
strategy is effective when using relative positional embedding such as rotary
embeddings. It is 4 times faster to train than a full-resolution network, and
it is straightforward to use at test time compared to existing approaches. We
apply this strategy to three dense prediction tasks with high-resolution data.
First, we show on the task of semantic segmentation that a simple setting with
2 windows performs best, hence the name of our method: Win-Win. Second, we
confirm this result on the task of monocular depth prediction. Third, we
further extend it to the binocular task of optical flow, reaching
state-of-the-art performance on the Spring benchmark that contains Full-HD
images with an order of magnitude faster inference than the best competitor. |
This paper introduces Win-Win, a novel training strategy for high-resolution vision transformers that leverages a multi-window masking approach during training to significantly reduce computational cost while maintaining the ability to process full-resolution images at test time. |
Training vision transformers on high-resolution images for dense prediction tasks is computationally expensive, and existing solutions often involve architectural changes or lead to performance degradation at test time. This paper proposes a simple yet effective training scheme to overcome these limitations. |
The proposed method masks out most of the input image during training, keeping only a small number of randomly selected windows. This allows the model to learn both local and global interactions, crucial for high-resolution generalization. A temperature parameter in the attention softmax is adjusted at test time to account for the change in token distribution. |
Using two windows is sufficient to achieve optimal performance, leading to the name Win-Win.
The method achieves comparable or better performance than full-resolution training or tiling-based approaches while being significantly faster.
Win-Win obtains state-of-the-art results on the challenging Spring optical flow benchmark with Full-HD images. |
The method relies on relative positional embeddings and might not be suitable for architectures using absolute positional embeddings.
Future work could explore extending the multi-window strategy to other vision tasks beyond dense prediction. |
vision transformers, high-resolution images, dense prediction, semantic segmentation, optical flow |
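The window-based masking in the Win-Win entry above amounts to choosing which patch tokens to keep during training. The sketch below selects the token indices covered by N random square windows on a patch grid. The function name, the square window shape, and the uniform sampling of window positions are assumptions for illustration.

```python
import numpy as np

def sample_window_tokens(grid_h, grid_w, win, n_windows, rng):
    """Return flat indices of tokens inside n_windows random win x win windows."""
    keep = np.zeros((grid_h, grid_w), dtype=bool)
    for _ in range(n_windows):
        top = rng.integers(0, grid_h - win + 1)   # upper bound is exclusive
        left = rng.integers(0, grid_w - win + 1)
        keep[top:top + win, left:left + win] = True
    return np.flatnonzero(keep)  # windows may overlap, so indices are de-duplicated

rng = np.random.default_rng(0)
two_windows = sample_window_tokens(32, 32, win=8, n_windows=2, rng=rng)
one_window = sample_window_tokens(32, 32, win=8, n_windows=1, rng=rng)
```

Only the returned tokens would be passed to the transformer during training, for example via `tokens[two_windows]`. At test time every token of the full-resolution input is kept.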
2310.00426
Report |
PixArt-$α$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis |
Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, Zhenguo Li |
The most advanced text-to-image (T2I) models require significant training
costs (e.g., millions of GPU hours), seriously hindering the fundamental
innovation for the AIGC community while increasing CO2 emissions. This paper
introduces PIXART-$\alpha$, a Transformer-based T2I diffusion model whose image
generation quality is competitive with state-of-the-art image generators (e.g.,
Imagen, SDXL, and even Midjourney), reaching near-commercial application
standards. Additionally, it supports high-resolution image synthesis up to
1024px resolution with low training cost, as shown in Figure 1 and 2. To
achieve this goal, three core designs are proposed: (1) Training strategy
decomposition: We devise three distinct training steps that separately optimize
pixel dependency, text-image alignment, and image aesthetic quality; (2)
Efficient T2I Transformer: We incorporate cross-attention modules into
Diffusion Transformer (DiT) to inject text conditions and streamline the
computation-intensive class-condition branch; (3) High-informative data: We
emphasize the significance of concept density in text-image pairs and leverage
a large Vision-Language model to auto-label dense pseudo-captions to assist
text-image alignment learning. As a result, PIXART-$\alpha$'s training speed
markedly surpasses existing large-scale T2I models, e.g., PIXART-$\alpha$ only
takes 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU
days), saving nearly \$300,000 (\$26,000 vs. \$320,000) and reducing 90% CO2
emissions. Moreover, compared with a larger SOTA model, RAPHAEL, our training
cost is merely 1%. Extensive experiments demonstrate that PIXART-$\alpha$
excels in image quality, artistry, and semantic control. We hope
PIXART-$\alpha$ will provide new insights to the AIGC community and startups to
accelerate building their own high-quality yet low-cost generative models from
scratch. |
PixArt-$\alpha$ is a computationally efficient Transformer-based text-to-image diffusion model that achieves competitive image generation quality with state-of-the-art models while significantly reducing training costs and CO2 emissions. |
Current state-of-the-art text-to-image models require immense computational resources and incur high training costs, hindering innovation and accessibility within the AIGC community. |
The paper introduces three core designs: (1) Decomposition of the training strategy into pixel dependency learning, text-image alignment learning, and high-resolution aesthetic image generation. (2) An efficient T2I Transformer that incorporates cross-attention modules and streamlines computation. (3) Utilization of high-informative data through an auto-labeling pipeline with LLaVA for generating precise and detailed image captions. |
PixArt-$\alpha$ achieves an FID score of 7.32 on the COCO dataset with significantly less training time (12%) and data (1.25%) compared to Stable Diffusion v1.5.
It demonstrates superior performance in compositional text-to-image generation, excelling in attribute binding, object relationships, and complex compositions, as evaluated on T2I-CompBench.
User study results indicate that PixArt-$\alpha$ outperforms existing SOTA models in terms of image quality and text-image alignment. |
The model exhibits limitations in accurately controlling the number of objects and representing specific details, such as human hands.
Its ability to render text within generated images is limited because the training dataset contains few font- and letter-related images. |
text-to-image generation, diffusion models, transformer, computational efficiency, auto-labeling |
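The cost figures quoted in the PixArt-$\alpha$ abstract above can be checked with a few lines of arithmetic. The variable names are illustrative, and the numbers are taken directly from the abstract.

```python
# Figures quoted in the abstract (A100 GPU days and US dollars).
pixart_gpu_days, sd15_gpu_days = 675, 6250
pixart_cost, sd15_cost = 26_000, 320_000

time_ratio = pixart_gpu_days / sd15_gpu_days   # share of SD v1.5 training time
cost_saving = sd15_cost - pixart_cost          # absolute saving in dollars

print(f"{time_ratio:.1%} of SD v1.5 training time; ${cost_saving:,} saved")
```

The ratio works out to 10.8% and the saving to $294,000, which matches the abstract's "10.8%" and "nearly \$300,000".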
2310.00390
Report |
InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists |
Yulu Gan, Sungwoo Park, Alexander Schubert, Anthony Philippakis, Ahmed M. Alaa |
Recent advances in generative diffusion models have enabled text-controlled
synthesis of realistic and diverse images with impressive quality. Despite
these remarkable advances, the application of text-to-image generative models
in computer vision for standard visual recognition tasks remains limited. The
current de facto approach for these tasks is to design model architectures and
loss functions that are tailored to the task at hand. In this paper, we develop
a unified language interface for computer vision tasks that abstracts away
task-specific design choices and enables task execution by following natural
language instructions. Our approach involves casting multiple computer vision
tasks as text-to-image generation problems. Here, the text represents an
instruction describing the task, and the resulting image is a visually-encoded
task output. To train our model, we pool commonly-used computer vision datasets
covering a range of tasks, including segmentation, object detection, depth
estimation, and classification. We then use a large language model to
paraphrase prompt templates that convey the specific tasks to be conducted on
each image, and through this process, we create a multi-modal and multi-task
training dataset comprising input and output images along with annotated
instructions. Following the InstructPix2Pix architecture, we apply
instruction-tuning to a text-to-image diffusion model using our constructed
dataset, steering its functionality from a generative model to an
instruction-guided multi-task vision learner. Experiments demonstrate that our
model, dubbed InstructCV, performs competitively compared to other generalist
and task-specific vision models. Moreover, it exhibits compelling
generalization capabilities to unseen data, categories, and user instructions. |
This paper introduces InstructCV, a unified language interface for computer vision tasks using a text-to-image generation approach, where text instructions specify the task and the generated image represents the visual output. |
The current approach in computer vision relies on task-specific models, lacking generalizability. InstructCV aims to bridge this gap by learning generalized representations for multiple vision tasks through a unified language interface. |
The paper proposes instruction-tuning a pre-trained conditional diffusion model (Stable Diffusion). They create a multi-modal, multi-task training dataset with image pairs, textual instructions, and visually encoded task outputs. This dataset is used to fine-tune the diffusion model to function as an instruction-guided multi-task vision learner. |
InstructCV achieves competitive results on various vision tasks, including semantic segmentation, object detection, depth estimation, and classification.
It demonstrates compelling generalization ability to unseen data and categories, outperforming state-of-the-art generalist models.
The model shows strong performance in adapting to new, user-written instructions. |
Although InstructCV reduces computational costs compared to previous generalist models, its inference speed lags behind specialized task-specific models.
The model's semantic flexibility is limited by the diversity of the instruction-tuning dataset, which relies on rephrased template prompts, potentially hindering its ability to handle more nuanced instructions. |
computer vision, multi-task learning, diffusion models, instruction tuning, natural language interface |
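The dataset construction described for InstructCV (pairing an input image with a paraphrased instruction and a visually encoded output) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' pipeline: the template strings, the `paraphrase` stub standing in for the LLM rephrasing step, and the file names are all hypothetical.

```python
import random
from dataclasses import dataclass

# Hypothetical prompt templates, one list per task (not taken from the paper).
TEMPLATES = {
    "segmentation": ["Segment the {category}.", "Create a mask for the {category}."],
    "detection": ["Detect the {category}.", "Draw a box around the {category}."],
    "depth": ["Estimate the depth map of this image."],
}


@dataclass
class Example:
    input_image: str   # path to the source image
    instruction: str   # natural-language task description
    output_image: str  # path to the visually encoded task output


def paraphrase(template, rng):
    """Stand-in for the LLM paraphrasing step: it only varies a polite prefix."""
    prefix = rng.choice(["", "Please "])
    if prefix:
        return prefix + template[0].lower() + template[1:]
    return template


def build_example(task, input_image, output_image, category="", seed=0):
    """Pick a template for the task, paraphrase it, and fill in the category."""
    rng = random.Random(seed)
    template = rng.choice(TEMPLATES[task])
    instruction = paraphrase(template, rng).format(category=category)
    return Example(input_image, instruction, output_image)


ex = build_example("segmentation", "img_001.jpg", "img_001_mask.png", category="dog")
print(ex.instruction)
```

Keeping the instruction as free text, rather than a task ID, is what lets a single text-conditioned model cover all tasks.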
2310.00240
Report |
Learning Mask-aware CLIP Representations for Zero-Shot Segmentation |
Siyu Jiao, Yunchao Wei, Yaowei Wang, Yao Zhao, Humphrey Shi |
Recently, pre-trained vision-language models have been increasingly used to
tackle the challenging zero-shot segmentation task. Typical solutions follow
the paradigm of first generating mask proposals and then adopting CLIP to
classify them. To maintain CLIP's zero-shot transferability, previous
practices favor freezing CLIP during training. However, in this paper, we
reveal that CLIP is insensitive to different mask proposals and tends to
produce similar predictions for various mask proposals of the same image. This
insensitivity results in numerous false positives when classifying mask
proposals. This issue mainly relates to the fact that CLIP is trained with
image-level supervision. To alleviate this issue, we propose a simple yet
effective method, named Mask-aware Fine-tuning (MAFT). Specifically,
Image-Proposals CLIP Encoder (IP-CLIP Encoder) is proposed to handle arbitrary
numbers of image and mask proposals simultaneously. Then, mask-aware loss and
self-distillation loss are designed to fine-tune IP-CLIP Encoder, ensuring CLIP
is responsive to different mask proposals while not sacrificing
transferability. In this way, mask-aware representations can be easily learned
to make the true positives stand out. Notably, our solution can seamlessly plug
into most existing methods without introducing any new parameters during the
fine-tuning process. We conduct extensive experiments on the popular zero-shot
benchmarks. With MAFT, the performance of the state-of-the-art methods is
improved by a large margin: 50.4% (+8.2%) on COCO, 81.8% (+3.2%) on
Pascal-VOC, and 8.7% (+4.3%) on ADE20K in terms of mIoU for unseen classes. The
code is available at https://github.com/jiaosiyu1999/MAFT.git. |
This paper identifies a critical issue in zero-shot segmentation where frozen pre-trained vision-language models (like CLIP) are insensitive to variations in mask proposals, leading to false positives. To address this, the authors propose Mask-aware Fine-tuning (MAFT) to make CLIP sensitive to different mask proposals without compromising its transferability to novel classes. |
Zero-shot segmentation aims to segment objects from unseen categories using text descriptions. Existing methods rely on frozen pre-trained vision-language models to classify mask proposals, but these models are often insensitive to the quality of proposals, leading to inaccurate segmentations. |
The authors propose MAFT, which consists of an Image-Proposal CLIP Encoder (IP-CLIP) and two losses: mask-aware loss and self-distillation loss. IP-CLIP handles images and mask proposals simultaneously, mask-aware loss aligns classification scores with IoU scores of mask proposals, and self-distillation loss preserves CLIP's transferability. |
MAFT significantly improves the performance of various zero-shot segmentation methods on COCO-Stuff, Pascal-VOC, and ADE20K.
MAFT achieves state-of-the-art results on zero-shot segmentation benchmarks, demonstrating its effectiveness in enhancing the sensitivity of CLIP to mask proposals.
MAFT also shows significant improvements in the open-vocabulary setting, outperforming previous methods on ADE20K, Pascal-Context, and Pascal-VOC datasets. |
The performance of MAFT is still limited by the capabilities of pre-trained vision-language models in recognizing novel classes.
Future work will focus on further improving the generalization ability of the model for unseen classes. |
zero-shot segmentation, vision-language models, clip, mask-aware fine-tuning, transfer learning |
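The idea behind MAFT's mask-aware loss (classification scores should track the IoU of each mask proposal) can be illustrated with a small sketch. The SmoothL1 penalty and the flat 0/1 mask encoding are assumptions made for this example; the summary above does not specify the exact loss form.

```python
def iou(mask_a, mask_b):
    """IoU of two binary masks given as flat lists of 0/1."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    return inter / union if union else 0.0


def smooth_l1(x, y, beta=1.0):
    """Standard SmoothL1 (Huber-style) penalty between two scalars."""
    d = abs(x - y)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta


def mask_aware_loss(scores, proposals, gt_mask):
    """Mean SmoothL1 gap between each proposal's class score and its IoU with the ground truth."""
    losses = [smooth_l1(s, iou(p, gt_mask)) for s, p in zip(scores, proposals)]
    return sum(losses) / len(losses)


gt = [1, 1, 0, 0]
proposals = [[1, 1, 0, 0], [0, 0, 1, 1]]  # one perfect proposal, one disjoint proposal

# A mask-insensitive classifier gives both proposals the same score and is penalized;
# a mask-aware classifier whose scores match the IoUs incurs zero loss.
insensitive = mask_aware_loss([0.9, 0.9], proposals, gt)
sensitive = mask_aware_loss([1.0, 0.0], proposals, gt)
print(insensitive, sensitive)
```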
2310.00161
Report |
Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection |
Dahun Kim, Anelia Angelova, Weicheng Kuo |
We present a new open-vocabulary detection approach based on
detection-oriented image-text pretraining to bridge the gap between image-level
pretraining and open-vocabulary object detection. At the pretraining phase, we
replace the commonly used classification architecture with the detector
architecture, which better serves the region-level recognition needs of
detection by enabling the detector heads to learn from noisy image-text pairs.
Using only standard contrastive loss and no pseudo-labeling, our approach is a
simple yet effective extension of the contrastive learning method to learn
emergent object-semantic cues. In addition, we propose a shifted-window
learning approach upon window attention to make the backbone representation
more robust, translation-invariant, and less biased by the window pattern. On
the popular LVIS open-vocabulary detection benchmark, our approach sets a new
state of the art of 40.4 mask AP$_r$ using the common ViT-L backbone,
significantly outperforming the best existing approach by +6.5 mask AP$_r$ at
system level. On the COCO benchmark, we achieve very competitive 40.8 novel AP
without pseudo labeling or weak supervision. In addition, we evaluate our
approach on the transfer detection setup, where ours outperforms the baseline
significantly. Visualization reveals emerging object locality from the
pretraining recipes compared to the baseline. Code and models will be publicly
released. |
This paper presents a new open-vocabulary object detection approach that leverages detection-oriented image-text pretraining to improve the alignment between image-level pretraining and open-vocabulary object detection. |
Existing open-vocabulary detectors often exhibit suboptimal generalization due to the reliance on classification-based pretrained models and training detector heads from scratch on limited datasets. This work aims to bridge this gap by incorporating detection-specific knowledge into the pretraining phase. |
The methodology involves two key components: (1) Detection-Oriented Pretraining (DOP) utilizes a detector architecture with FPN and Faster R-CNN head during pretraining, enabling learning from noisy image-text pairs. (2) Shifted-Window Learning (SWL) enhances the robustness and translation invariance of vision transformer backbone features by rolling and combining shifted features. |
The proposed approach achieves a new state of the art of 40.4 mask AP$_r$ on LVIS, outperforming the best existing approach by +6.5 mask AP$_r$.
On COCO, the proposed approach achieves a competitive 40.8 novel AP without using pseudo-labels or joint training.
In transfer detection from LVIS to Objects365, the proposed approach outperforms existing methods with comparable backbone sizes. |
The reliance on web-scale image-text data might inherit potential biases and stereotypes present in those datasets.
Future work could focus on more rigorous fairness evaluations and explore the impact of different pretraining datasets. |
open-vocabulary detection, image-text pretraining, contrastive learning, vision transformers, shifted-window learning |
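The "rolling and combining shifted features" step of Shifted-Window Learning can be pictured with a toy example. In this sketch `window_mean` is only a stand-in for window attention, and equal-weight averaging of the two branches is an assumption; neither is taken from the paper.

```python
def roll2d(grid, shift):
    """Cyclically shift a 2D list by `shift` cells along both axes."""
    n_rows, n_cols = len(grid), len(grid[0])
    return [[grid[(r - shift) % n_rows][(c - shift) % n_cols] for c in range(n_cols)]
            for r in range(n_rows)]


def window_mean(grid, w):
    """Toy stand-in for window attention: each cell becomes the mean of its w x w window."""
    n_rows, n_cols = len(grid), len(grid[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for r0 in range(0, n_rows, w):
        for c0 in range(0, n_cols, w):
            cells = [(r, c)
                     for r in range(r0, min(r0 + w, n_rows))
                     for c in range(c0, min(c0 + w, n_cols))]
            mean = sum(grid[r][c] for r, c in cells) / len(cells)
            for r, c in cells:
                out[r][c] = mean
    return out


def shifted_window_features(grid, w):
    """Average the windowed features with a copy computed on a half-window-shifted grid."""
    plain = window_mean(grid, w)
    shifted = roll2d(window_mean(roll2d(grid, w // 2), w), -(w // 2))
    return [[0.5 * (p + s) for p, s in zip(p_row, s_row)]
            for p_row, s_row in zip(plain, shifted)]


grid = [[0.0] * 4 for _ in range(4)]
grid[0][0] = 1.0
features = shifted_window_features(grid, 2)
```

Because the shifted branch uses different window boundaries, information from the cell at (0, 0) reaches cells outside its original window, which is the intuition behind reducing bias from the fixed window pattern.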
2310.00031
Report |
Text-image Alignment for Diffusion-based Perception |
Neehar Kondapaneni, Markus Marks, Manuel Knott, Rogerio Guimaraes, Pietro Perona |
Diffusion models are generative models with impressive text-to-image
synthesis capabilities and have spurred a new wave of creative methods for
classical machine learning tasks. However, the best way to harness the
perceptual knowledge of these generative models for visual tasks is still an
open question. Specifically, it is unclear how to use the prompting interface
when applying diffusion backbones to vision tasks. We find that automatically
generated captions can improve text-image alignment and significantly enhance a
model's cross-attention maps, leading to better perceptual performance. Our
approach improves upon the current state-of-the-art (SOTA) in diffusion-based
semantic segmentation on ADE20K and the current overall SOTA for depth
estimation on NYUv2. Furthermore, our method generalizes to the cross-domain
setting. We use model personalization and caption modifications to align our
model to the target domain and find improvements over unaligned baselines. Our
cross-domain object detection model, trained on Pascal VOC, achieves SOTA
results on Watercolor2K. Our cross-domain segmentation method, trained on
Cityscapes, achieves SOTA results on Dark Zurich-val and Nighttime Driving.
Project page: https://www.vision.caltech.edu/tadp/. Code:
https://github.com/damaggu/TADP. |
This paper proposes Text-Aligned Diffusion Perception (TADP), a novel method to enhance diffusion-based perception models by aligning text prompts with images using automated caption generation. |
It addresses the open question of how to best leverage the perceptual knowledge of diffusion models for visual tasks, particularly in improving text-image alignment for enhanced performance. |
The methodology involves using BLIP-2 for automated image captioning to generate aligned text prompts for diffusion models, replacing previously used averaged embedding techniques. This is extended to cross-domain tasks by incorporating target domain information into the captions, further aided by model personalization methods like Textual Inversion and DreamBooth. |
Automated captioning significantly improves performance in semantic segmentation, depth estimation, and object detection tasks by enhancing text-image alignment.
The method demonstrates strong cross-domain generalizability, achieving state-of-the-art results on benchmarks like Cityscapes to Dark Zurich, Cityscapes to Nighttime Driving, and Pascal VOC to Watercolor2K.
Analysis reveals that diffusion models are sensitive to missing objects in captions (recall) and benefit from longer, more descriptive captions. |
The current method relies on open-vocabulary captioning models, which can be improved by exploring closed-vocabulary, task-specific captioners.
Future work can explore extending the framework for multi-domain generalization to unseen domains. |
diffusion models, image captioning, semantic segmentation, depth estimation, cross-domain learning |
2309.17400
Report |
Directly Fine-Tuning Diffusion Models on Differentiable Rewards |
Kevin Clark, Paul Vicol, Kevin Swersky, David J Fleet |
We present Direct Reward Fine-Tuning (DRaFT), a simple and effective method
for fine-tuning diffusion models to maximize differentiable reward functions,
such as scores from human preference models. We first show that it is possible
to backpropagate the reward function gradient through the full sampling
procedure, and that doing so achieves strong performance on a variety of
rewards, outperforming reinforcement learning-based approaches. We then propose
more efficient variants of DRaFT: DRaFT-K, which truncates backpropagation to
only the last K steps of sampling, and DRaFT-LV, which obtains lower-variance
gradient estimates for the case when K=1. We show that our methods work well
for a variety of reward functions and can be used to substantially improve the
aesthetic quality of images generated by Stable Diffusion 1.4. Finally, we draw
connections between our approach and prior work, providing a unifying
perspective on the design space of gradient-based fine-tuning algorithms. |
This paper presents Direct Reward Fine-Tuning (DRaFT), a method for efficiently fine-tuning diffusion models to maximize differentiable reward functions. |
Fine-tuning diffusion models on reward functions like human preferences is crucial for aligning model behavior with user needs, such as generating aesthetically pleasing or fair images. |
DRaFT leverages backpropagation through the sampling process to compute gradients for fine-tuning on a variety of reward functions. It employs techniques like LoRA and gradient checkpointing for efficiency, and introduces variants DRaFT-K and DRaFT-LV to further improve efficiency and training stability. |
DRaFT outperforms reinforcement learning-based approaches in terms of sample efficiency by a large margin.
DRaFT-LV, a variant of DRaFT, achieves state-of-the-art performance on the Human Preference Score v2 benchmark.
DRaFT is shown to be applicable to a wide range of reward functions beyond aesthetic quality, such as image compressibility, object detection, and generating adversarial examples. |
The paper identifies reward hacking as a challenge where the model might overfit to the reward function and lose diversity.
Future work could involve exploring more robust reward functions and investigating the theoretical properties of DRaFT in more depth.
Future work could explore applying DRaFT to other diffusion model applications beyond text-to-image generation. |
diffusion models, reward learning, fine-tuning, image generation, human preferences |
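The truncation used by DRaFT-K (backpropagating the reward gradient through only the last K sampling steps) can be illustrated on a scalar toy problem. The linear "sampler" x_{t+1} = a * x_t + theta, the quadratic reward, and all parameter values are assumptions for this sketch. Forward-mode differentiation is used only to keep the example dependency-free; for this scalar case it gives the same value as truncated backpropagation.

```python
def draft_k_gradient(theta, x0, num_steps, k, a=0.5, target=1.0):
    """Gradient of r(x_T) = -(x_T - target)^2 with respect to theta, differentiating
    only through the last k steps of the toy sampler x_{t+1} = a * x_t + theta."""
    x, dx_dtheta = x0, 0.0
    for t in range(num_steps):
        if t == num_steps - k:
            dx_dtheta = 0.0  # stop-gradient: earlier steps are treated as constants
        dx_dtheta = a * dx_dtheta + 1.0  # derivative of the update with respect to theta
        x = a * x + theta
    reward_grad = -2.0 * (x - target)  # dr/dx at the final sample
    return reward_grad * dx_dtheta


full = draft_k_gradient(theta=0.0, x0=0.0, num_steps=3, k=3)       # all steps
truncated = draft_k_gradient(theta=0.0, x0=0.0, num_steps=3, k=1)  # last step only
print(full, truncated)
```

The truncated gradient keeps the same sign as the full one here but drops the contributions of earlier steps, which is the cost paid for the lower memory and compute of small K.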
2309.17261
Report |
Consistent123: One Image to Highly Consistent 3D Asset Using Case-Aware Diffusion Priors |
Yukang Lin, Haonan Han, Chaoqun Gong, Zunnan Xu, Yachao Zhang, Xiu Li |
Reconstructing 3D objects from a single image guided by pretrained diffusion
models has demonstrated promising outcomes. However, because existing methods
use a case-agnostic rigid strategy, their generalization ability to arbitrary cases
and the 3D consistency of their reconstructions are still poor. In this work, we
propose Consistent123, a case-aware two-stage method for highly consistent 3D
asset reconstruction from one image with both 2D and 3D diffusion priors. In
the first stage, Consistent123 utilizes only 3D structural priors for
sufficient geometry exploitation, with a CLIP-based case-aware adaptive
detection mechanism embedded within this process. In the second stage, 2D
texture priors are introduced and progressively take on a dominant guiding
role, delicately sculpting the details of the 3D model. Consistent123 aligns
more closely with the evolving trends in guidance requirements, adaptively
providing adequate 3D geometric initialization and suitable 2D texture
refinement for different objects. Consistent123 can obtain highly 3D-consistent
reconstruction and exhibits strong generalization ability across various
objects. Qualitative and quantitative experiments show that our method
significantly outperforms state-of-the-art image-to-3D methods. See
https://Consistent123.github.io for a more comprehensive exploration of our
generated 3D assets. |
Consistent123 is a novel two-stage method for reconstructing highly consistent 3D assets from a single image, leveraging both 2D and 3D diffusion priors in a case-aware manner. |
Existing single-image 3D reconstruction methods suffer from limitations such as poor generalization ability, 3D inconsistency, and the 'multi-face' issue. Consistent123 addresses these challenges by adaptively combining 2D and 3D priors. |
Consistent123 employs a two-stage approach: (1) 3D structural initialization guided solely by 3D priors, with a CLIP-based adaptive mechanism to determine the optimal transition point to stage 2. (2) Dynamic Prior optimization, gradually integrating 2D texture priors while reducing the emphasis on 3D priors, ensuring both geometric fidelity and textural detail. |
Consistent123 outperforms state-of-the-art methods in terms of 3D consistency, as evidenced by higher CLIP-Similarity scores on RealFusion15 and C10 datasets.
The method effectively addresses the 'multi-face' issue observed in previous works, ensuring consistent appearance across all views.
Consistent123 demonstrates superior geometric and textural quality, as evidenced by quantitative metrics (PSNR, LPIPS) and a user study. |
The reconstruction quality in stage 1 is influenced by the input image's viewpoint due to the heavy reliance on 3D priors.
Output quality is dependent on the clarity and specificity of the asset description used in stage 2, with overly brief descriptions potentially leading to inaccuracies. |
3d reconstruction, diffusion models, single image, 3d consistency, dynamic prior |
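The two mechanisms in the Consistent123 summary, a CLIP-based check for when to leave stage 1 and a gradual hand-over from 3D to 2D priors, can be sketched as follows. The change-rate criterion, the window and threshold values, and the linear weight schedule are all assumptions chosen for illustration; the paper's actual rule and schedule may differ.

```python
def should_enter_stage_two(similarities, window=3, threshold=0.01):
    """Switch once the mean absolute change in CLIP similarity over the
    last `window` checks falls below `threshold` (hypothetical criterion)."""
    if len(similarities) < window + 1:
        return False
    recent = similarities[-(window + 1):]
    mean_change = sum(abs(b - a) for a, b in zip(recent, recent[1:])) / window
    return mean_change < threshold


def prior_weights(step, total_steps):
    """Linear hand-over: the 3D prior weight decays while the 2D prior weight grows."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return {"w_3d": 1.0 - progress, "w_2d": progress}


print(should_enter_stage_two([0.1, 0.2, 0.3, 0.4]))  # similarity still rising
print(should_enter_stage_two([0.5, 0.5, 0.5, 0.5]))  # similarity has plateaued
print(prior_weights(5, 10))
```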
2309.17164
Report |
Retail-786k: a Large-Scale Dataset for Visual Entity Matching |
Bianca Lamm, Janis Keuper |
Entity Matching (EM) defines the task of learning to group objects by
transferring semantic concepts from example groups (=entities) to unseen data.
Despite the general availability of image data in the context of many
EM problems, most currently available EM algorithms rely solely on (textual)
metadata. In this paper, we introduce the first publicly available large-scale
dataset for "visual entity matching", based on a production-level use case in
the retail domain. Using scanned advertisement leaflets, collected over several
years from different European retailers, we provide a total of ~786k manually
annotated, high resolution product images containing ~18k different individual
retail products which are grouped into ~3k entities. The annotation of these
product entities is based on a price comparison task, where each entity forms
an equivalence class of comparable products. Following a first baseline
evaluation, we show that the proposed "visual entity matching" constitutes a
novel learning problem that cannot be sufficiently solved using standard
image-based classification and retrieval algorithms. Instead, novel approaches
that can transfer example-based visual equivalence classes to new data are
needed to address the proposed problem. The aim of this paper is to provide a
benchmark for such algorithms.
Information about the dataset, evaluation code and download instructions are
provided under https://www.retail-786k.org/. |
The paper introduces Retail-786k, the first large-scale publicly available dataset for "visual entity matching," aimed at benchmarking algorithms for grouping visually similar retail products. |
Entity Matching (EM) is crucial for tasks like price monitoring, but existing methods rely heavily on textual data. Retail-786k enables research on visual EM using a practical, real-world dataset. |
The dataset was created from a large collection of scanned advertisement leaflets, manually annotated to group over 786k product images into 3,298 entities representing comparable products. Baseline experiments were conducted using image classification and retrieval approaches. |
Visual entity matching presents a novel learning problem distinct from standard classification and retrieval.
Existing image classification models achieved a maximum F1-score of 83.2% on the dataset.
An image retrieval approach attained an R@10 score of 56%, indicating limitations in handling entity variance. |
The dataset currently lacks textual information (e.g., price, product description) that could be valuable for multi-modal solutions.
Entity definitions are specific to price monitoring, potentially limiting generalizability to other EM tasks. |
entity matching, images, long-tail, retail, leaflets |
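For readers unfamiliar with the R@10 figure quoted for the Retail-786k retrieval baseline, a common definition of recall at k is sketched below: the fraction of queries whose top-k retrieved items include at least one item of the same entity. This definition is an assumption here; the paper's evaluation code may define the metric differently.

```python
def recall_at_k(ranked_labels, query_labels, k=10):
    """Fraction of queries whose top-k retrieved labels contain the query's entity label."""
    hits = sum(1 for ranked, q in zip(ranked_labels, query_labels) if q in ranked[:k])
    return hits / len(query_labels)


# Two hypothetical queries with the entity labels of their ranked retrieval results.
ranked = [["a", "b"], ["c", "d"]]
queries = ["b", "x"]
print(recall_at_k(ranked, queries, k=2))
```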
2309.17128
Report |
HAvatar: High-fidelity Head Avatar via Facial Model Conditioned Neural Radiance Field |
Xiaochen Zhao, Lizhen Wang, Jingxiang Sun, Hongwen Zhang, Jinli Suo, Yebin Liu |
The problem of modeling an animatable 3D human head avatar under light-weight
setups is of significant importance but has not been well solved. Existing 3D
representations perform well either in the realism of portrait image synthesis
or the accuracy of expression control, but not both. To address the problem, we
introduce a novel hybrid explicit-implicit 3D representation, Facial Model
Conditioned Neural Radiance Field, which integrates the expressiveness of NeRF
and the prior information from the parametric template. At the core of our
representation, a synthetic-renderings-based condition method is proposed to
fuse the prior information from the parametric model into the implicit field
without constraining its topological flexibility. Besides, based on the hybrid
representation, we overcome the inconsistent shape issue present in
existing methods and improve the animation stability. Moreover, by adopting an
overall GAN-based architecture using an image-to-image translation network, we
achieve high-resolution, realistic and view-consistent synthesis of dynamic
head appearance. Experiments demonstrate that our method can achieve
state-of-the-art performance for 3D head avatar animation compared with
previous methods. |
This paper proposes a novel facial model conditioned Neural Radiance Field for high-fidelity and controllable 3D head avatar animation using monocular or sparse-view videos. |
Modeling animatable 3D human head avatars with realistic appearance and accurate expression control under lightweight setups is crucial for various applications but remains a challenge. |
The method integrates a parametric facial model with NeRF. It leverages synthetic renderings of the model to condition feature volume generation, enabling flexible topology and precise control. Learnable embeddings modulate feature generation to address shape inconsistency, and an image-to-image translation network enhances realism. |
Achieves state-of-the-art performance for 3D head avatar animation with realistic appearance and accurate expression control.
Addresses the inconsistent shape issue present in existing NeRF-based avatar modeling methods and significantly improves animation stability.
Enables high-resolution, photo-realistic, and view-consistent synthesis of dynamic head appearances. |
The proxy shapes generated by the method are not as accurate as some surface-based methods.
Handling out-of-distribution head poses and extreme expressions remains challenging. |
head avatar, image synthesis, neural radiance field, parametric facial model, image-to-image translation |
2309.17102
Report |
Guiding Instruction-based Image Editing via Multimodal Large Language Models |
Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, Zhe Gan |
Instruction-based image editing improves the controllability and flexibility
of image manipulation via natural commands without elaborate descriptions or
regional masks. However, human instructions are sometimes too brief for current
methods to capture and follow. Multimodal large language models (MLLMs) show
promising capabilities in cross-modal understanding and visual-aware response
generation via LMs. We investigate how MLLMs facilitate edit instructions and
present MLLM-Guided Image Editing (MGIE). MGIE learns to derive expressive
instructions and provides explicit guidance. The editing model jointly captures
this visual imagination and performs manipulation through end-to-end training.
We evaluate various aspects of Photoshop-style modification, global photo
optimization, and local editing. Extensive experimental results demonstrate
that expressive instructions are crucial to instruction-based image editing,
and our MGIE can lead to a notable improvement in automatic metrics and human
evaluation while maintaining competitive inference efficiency. |
This paper introduces MLLM-Guided Image Editing (MGIE), which leverages Multimodal Large Language Models (MLLMs) to improve instruction-based image editing by deriving more expressive and detailed instructions. |
Human instructions for image editing are often too brief and ambiguous for existing methods to understand fully. MGIE addresses this by utilizing the cross-modal understanding and visual awareness of MLLMs to generate more effective guidance. |
MGIE employs an MLLM to generate concise, expressive instructions from initial, brief instructions and the input image. These instructions, along with visual tokens, guide a diffusion model to perform the desired edits. The MLLM and diffusion model are jointly trained end-to-end. |
Expressive instructions significantly enhance image editing performance compared to using only brief instructions.
Visual awareness is crucial for deriving effective expressive instructions, leading to superior results over language-only methods.
MGIE achieves state-of-the-art performance on various editing tasks, including Photoshop-style modification, global photo optimization, and local object alteration, while maintaining competitive inference efficiency. |
MGIE may inherit potential biases present in the pre-trained foundation models (LLaVA and StableDiffusion).
Complex compositional commands or those requiring precise numerical or spatial understanding remain challenging. |
image editing, instruction-based editing, multimodal large language models, diffusion models, expressive instructions |
2309.17074
Report |
DeeDiff: Dynamic Uncertainty-Aware Early Exiting for Accelerating Diffusion Model Generation |
Shengkun Tang, Yaqing Wang, Caiwen Ding, Yi Liang, Yao Li, Dongkuan Xu |
Diffusion models achieve great success in generating diverse and
high-fidelity images. The performance improvements come with low generation
speed per image, which hinders the application of diffusion models in real-time
scenarios. While certain predictions benefit from the full computation of
the model in each sample iteration, not every iteration requires the same
amount of computation, potentially leading to computation waste. In this work,
we propose DeeDiff, an early exiting framework that adaptively allocates
computation resources in each sampling step to improve the generation
efficiency of diffusion models. Specifically, we introduce a timestep-aware
uncertainty estimation module (UEM) for diffusion models which is attached to
each intermediate layer to estimate the prediction uncertainty of each layer.
The uncertainty is regarded as the signal to decide if the inference
terminates. Moreover, we propose uncertainty-aware layer-wise loss to fill the
performance gap between full models and early-exited models. With this loss
strategy, our model obtains results comparable to those of full-layer models.
Extensive experiments of class-conditional, unconditional, and text-guided
generation on several datasets show that our method achieves state-of-the-art
performance and efficiency trade-off compared with existing early exiting
methods on diffusion models. More importantly, our method even brings extra
benefits to baseline models and obtains better performance on CIFAR-10 and
Celeb-A datasets. Full code and model are released for reproduction. |
This paper proposes DeeDiff, a novel early exiting framework designed to accelerate inference in diffusion models for image generation. |
Diffusion models, while powerful in generating high-quality images, suffer from slow generation speed due to the multi-step inference process, hindering their application in real-time scenarios. |
DeeDiff introduces a timestep-aware uncertainty estimation module (UEM) to estimate the prediction uncertainty of each layer in the diffusion model. This uncertainty guides the early exiting decisions, adaptively allocating computational resources based on the difficulty of the generation step. Additionally, an uncertainty-aware layer-wise loss function is proposed to minimize the performance gap between the full model and the early-exited model. |
DeeDiff achieves state-of-the-art performance and efficiency compared to existing early exiting methods on diffusion models, reducing inference time by up to 40% with minimal performance drop.
The proposed uncertainty-aware layer-wise loss strategy not only improves the efficiency of early exiting but also enhances the performance of the baseline diffusion models, even without early exiting.
DeeDiff demonstrates compatibility with different diffusion models (CNN-based and Transformer-based) and acceleration methods like DPM-Solver, highlighting its generalizability and potential for broader applications. |
While DeeDiff achieves a favorable performance-efficiency trade-off, further improvement is needed to maintain high image quality (low FID) at higher efficiency levels (over 60% layer reduction).
The current implementation of DeeDiff focuses on early exiting based on the depth (number of layers) of the diffusion model. Exploring adaptive width (adaptive sampling steps) remains an open avenue for future research. |
diffusion models, early exiting, image generation, uncertainty estimation, inference acceleration |
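The early-exit decision described for DeeDiff (stop at the first layer whose estimated uncertainty is low enough) can be sketched with plain callables. The layers and uncertainty heads below are toy functions, and the timestep conditioning of the real uncertainty estimation module is omitted; all names and values are assumptions for illustration.

```python
def early_exit_forward(x, layers, uncertainty_heads, threshold):
    """Run layers in order and stop as soon as a layer's estimated uncertainty
    drops below `threshold`. Returns (output, number of layers used)."""
    for depth, (layer, head) in enumerate(zip(layers, uncertainty_heads), start=1):
        x = layer(x)
        if head(x) < threshold:
            return x, depth
    return x, len(layers)


# Toy model: each layer adds 1, and uncertainty shrinks as the activation grows.
layers = [lambda v: v + 1.0] * 4
heads = [lambda v: 1.0 / (1.0 + v)] * 4

out, used = early_exit_forward(0.0, layers, heads, threshold=0.3)
print(out, used)
```

Raising the threshold exits earlier and saves more computation at the risk of lower output quality, which is the trade-off the layer-wise loss is meant to soften.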
2309.16992
Report |
Segment Anything Model is a Good Teacher for Local Feature Learning |
Jingqian Wu, Rongtao Xu, Zach Wood-Doughty, Changwei Wang, Shibiao Xu, Edmund Lam |
Local feature detection and description play an important role in many
computer vision tasks, which are designed to detect and describe keypoints in
"any scene" and "any downstream task". Data-driven local feature learning
methods need to rely on pixel-level correspondence for training, which is
challenging to acquire at scale, thus hindering further improvements in
performance. In this paper, we propose SAMFeat to introduce SAM (segment
anything model), a fundamental model trained on 11 million images, as a teacher
to guide local feature learning and thus inspire higher performance on limited
datasets. To do so, first, we construct an auxiliary task of Pixel Semantic
Relational Distillation (PSRD), which distills feature relations with
category-agnostic semantic information learned by the SAM encoder into a local
feature learning network, to improve local feature description using semantic
discrimination. Second, we develop a technique called Weakly Supervised
Contrastive Learning Based on Semantic Grouping (WSC), which utilizes semantic
groupings derived from SAM as weakly supervised signals, to optimize the metric
space of local descriptors. Third, we design an Edge Attention Guidance (EAG)
to further improve the accuracy of local feature detection and description by
prompting the network to pay more attention to the edge region guided by SAM.
SAMFeat's performance on various tasks such as image matching on HPatches, and
long-term visual localization on Aachen Day-Night showcases its superiority
over previous local features. The released code is available at
https://github.com/vignywang/SAMFeat. |
This paper proposes SAMFeat, a novel local feature learning method that leverages the Segment Anything Model (SAM) as a teacher to enhance performance. |
Existing data-driven local feature learning methods rely heavily on pixel-level correspondence, neglecting semantic information crucial for robust feature description. SAMFeat bridges this gap by incorporating SAM's rich semantic understanding. |
SAMFeat utilizes three key strategies: 1) Pixel Semantic Relational Distillation (PSRD) to distill category-agnostic semantic relations from SAM, 2) Weakly Supervised Contrastive Learning Based on Semantic Grouping (WSC) to optimize descriptor space using SAM-derived semantic groupings, and 3) Edge Attention Guidance (EAG) to prioritize edge regions identified by SAM for enhanced detection and description. |
SAMFeat achieves state-of-the-art results on HPatches image matching benchmark, demonstrating superior accuracy across various thresholds.
In long-term visual localization tasks on the Aachen Day-Night dataset, SAMFeat exhibits competitive performance against methods specifically designed for localization, highlighting its strong generalization capabilities.
Ablation studies validate the contribution of each individual component (PSRD, WSC, EAG) to the overall performance improvement. |
Exploration of alternative visual foundation models (e.g., DINO, SEEM) as potential teachers is left for future work.
The impact of different weighting schemes for individual loss functions in SAMFeat's total loss could be further investigated. |
local feature learning, segment anything model, semantic segmentation, image matching, visual localization |
2309.16948
Report |
Denoising Diffusion Bridge Models |
Linqi Zhou, Aaron Lou, Samar Khanna, Stefano Ermon |
Diffusion models are powerful generative models that map noise to data using
stochastic processes. However, for many applications such as image editing, the
model input comes from a distribution that is not random noise. As such,
diffusion models must rely on cumbersome methods like guidance or projected
sampling to incorporate this information in the generative process. In our
work, we propose Denoising Diffusion Bridge Models (DDBMs), a natural
alternative to this paradigm based on diffusion bridges, a family of processes
that interpolate between two paired distributions given as endpoints. Our
method learns the score of the diffusion bridge from data and maps from one
endpoint distribution to the other by solving a (stochastic) differential
equation based on the learned score. Our method naturally unifies several
classes of generative models, such as score-based diffusion models and
OT-Flow-Matching, allowing us to adapt existing design and architectural
choices to our more general problem. Empirically, we apply DDBMs to challenging
image datasets in both pixel and latent space. On standard image translation
problems, DDBMs achieve significant improvement over baseline methods, and,
when we reduce the problem to image generation by setting the source
distribution to random noise, DDBMs achieve comparable FID scores to
state-of-the-art methods despite being built for a more general task. |
This paper proposes Denoising Diffusion Bridge Models (DDBMs), a novel framework for distribution translation by building a stochastic bridge between paired samples with tractable marginal distributions. |
Standard diffusion models are limited to mapping to simple Gaussian distributions, making them ill-suited for tasks like image translation that require mapping between arbitrary distributions. |
The authors leverage diffusion bridges, stochastic processes that interpolate between paired distributions, and learn the score of the diffusion bridge by matching against a tractable closed-form score. |
DDBMs achieve strong performance in image-to-image translation tasks, outperforming baseline methods on metrics like FID and LPIPS.
When applied to unconditional image generation, DDBMs achieve comparable FID scores to state-of-the-art diffusion models.
The proposed preconditioning and hybrid sampler are shown to be crucial for the empirical success of DDBMs. |
The predict-x parameterization, while effective for pixel-space generation, may be less suitable for latent-space translation.
Future work includes exploring alternative parameterizations and extending DDBMs to handle more complex data modalities beyond images. |
diffusion models, generative models, image translation, diffusion bridges, score matching |
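To make the idea of a bridge with tractable marginals concrete, here is a minimal sketch of the textbook Brownian bridge pinned at a pair of endpoints. It assumes a unit diffusion coefficient and is not necessarily the bridge family used in the DDBM paper; the function names `bridge_marginal`, `sample_bridge`, and `bridge_score` are hypothetical. The Gaussian marginal gives both a sampler and a closed-form score of the kind a network could be regressed against.

```python
import math
import random

def bridge_marginal(x0, xT, t, T=1.0):
    """Mean vector and scalar variance of the bridge at time t in [0, T]."""
    # The mean moves linearly from x0 (t = 0) to xT (t = T).
    mean = [a + (t / T) * (b - a) for a, b in zip(x0, xT)]
    # The variance vanishes at both endpoints and peaks at t = T / 2.
    var = t * (T - t) / T
    return mean, var

def sample_bridge(x0, xT, t, T=1.0, rng=random):
    """Draw x_t from the Gaussian marginal of the bridge."""
    mean, var = bridge_marginal(x0, xT, t, T)
    std = math.sqrt(var)
    return [m + std * rng.gauss(0.0, 1.0) for m in mean]

def bridge_score(x, x0, xT, t, T=1.0):
    """Closed-form score of the Gaussian marginal (defined for 0 < t < T)."""
    mean, var = bridge_marginal(x0, xT, t, T)
    return [(m - xi) / var for xi, m in zip(x, mean)]
```

Because the variance is zero at `t = 0` and `t = T`, every sample path starts exactly at `x0` and ends exactly at `xT`, which is the property that lets a bridge translate between paired samples.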
2309.16738
Report |
ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens |
Yangyang Guo, Haoyu Zhang, Yongkang Wong, Liqiang Nie, Mohan Kankanhalli |
Learning a versatile language-image model is computationally prohibitive
under a limited computing budget. This paper delves into efficient
language-image pre-training, an area that has received relatively little
attention despite its importance in reducing computational cost and footprint.
To that end, we propose a vision token pruning and merging method ELIP, to
remove less influential tokens based on the supervision of language outputs.
Our method is designed with several strengths, such as being
computation-efficient, memory-efficient, and trainable-parameter-free, and is
distinguished from previous vision-only token pruning approaches by its
alignment with task objectives. We implement this method in a progressive
pruning manner using several sequential blocks. To evaluate its generalization
performance, we apply ELIP to three commonly used language-image pre-training
models and utilize public image-caption pairs with 4M images for pre-training.
Our experiments demonstrate that with the removal of ~30% vision tokens
across 12 ViT layers, ELIP maintains significantly comparable performance with
baselines (~0.32 accuracy drop on average) over various downstream tasks
including cross-modal retrieval, VQA, image captioning, etc. In
addition, the GPU resources spared by ELIP allow us to scale up with larger
batch sizes, thereby accelerating model pre-training and sometimes even
enhancing downstream model performance. |
This paper presents ELIP, a vision token pruning and merging method for efficient language-image pre-training, which removes less influential vision tokens based on the supervision of language outputs. |
Learning versatile language-image models is computationally expensive. ELIP aims to improve efficiency by reducing the computational cost and footprint during pre-training. |
ELIP progressively prunes and merges less important vision tokens in a multi-stage manner, guided by the fusion of image and text [CLS] token features. It divides the ViT encoder into four blocks, with increasing pruning ratios for deeper blocks. |
ELIP achieves comparable performance to baseline models on various downstream tasks (e.g., retrieval, VQA) while removing ~30% of vision tokens.
The reduced computational cost allows for scaling up pre-training with larger batch sizes, leading to faster training and sometimes even improved downstream performance.
ELIP effectively removes less important background information while preserving salient object features, as demonstrated through visualizations. |
The pruning ratio in ELIP is fixed and could benefit from an adaptive approach based on image complexity.
Further efficiency improvements can be explored by integrating ELIP with other techniques like mixed-precision training. |
vision-language pre-training, efficient deep learning, vision token pruning, multi-modal learning, vision transformer |
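A single prune-and-merge step of the kind described in the ELIP entry above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the two [CLS] features are fused by simple averaging, tokens are scored by a dot product with the fused vector, and all pruned tokens are merged into one mean token. The function name `prune_and_merge` and the `keep_ratio` parameter are hypothetical.

```python
def prune_and_merge(tokens, img_cls, txt_cls, keep_ratio=0.7):
    """Keep the highest-scoring vision tokens and merge the rest into one token."""
    # Assumption: fuse image and text [CLS] features by averaging them.
    guide = [(a + b) / 2 for a, b in zip(img_cls, txt_cls)]
    # Score each token by its dot product with the fused guidance vector.
    scores = [sum(x * g for x, g in zip(tok, guide)) for tok in tokens]
    n_keep = max(1, round(keep_ratio * len(tokens)))
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(order[:n_keep])  # restore the original token order
    drop = order[n_keep:]
    kept = [tokens[i] for i in keep]
    if drop:
        # Assumption: merge pruned tokens by an unweighted mean.
        dim = len(tokens[0])
        merged = [sum(tokens[i][d] for i in drop) / len(drop) for d in range(dim)]
        kept.append(merged)
    return kept
```

Applying such a step in several successive blocks, with a smaller `keep_ratio` in deeper blocks, would mirror the progressive schedule the summary describes. The step itself adds no trainable parameters.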
2309.16671
Report |
Demystifying CLIP Data |
Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer |
Contrastive Language-Image Pre-training (CLIP) is an approach that has
advanced research and applications in computer vision, fueling modern
recognition systems and generative models. We believe that the main ingredient
to the success of CLIP is its data and not the model architecture or
pre-training objective. However, CLIP only provides very limited information
about its data and how it has been collected, leading to works that aim to
reproduce CLIP's data by filtering with its model parameters. In this work, we
intend to reveal CLIP's data curation approach and in our pursuit of making it
open to the community introduce Metadata-Curated Language-Image Pre-training
(MetaCLIP). MetaCLIP takes a raw data pool and metadata (derived from CLIP's
concepts) and yields a balanced subset over the metadata distribution. Our
experimental study rigorously isolates the model and training settings,
concentrating solely on data. MetaCLIP applied to CommonCrawl with 400M
image-text data pairs outperforms CLIP's data on multiple standard benchmarks.
In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy,
surpassing CLIP's 68.3% on ViT-B models. Scaling to 1B data, while maintaining
the same training budget, attains 72.4%. Our observations hold across various
model sizes, exemplified by ViT-H achieving 80.5%, without any
bells-and-whistles. Curation code and training data distribution on metadata is
made available at https://github.com/facebookresearch/MetaCLIP. |
The paper introduces MetaCLIP, a method to reveal CLIP's data curation process by leveraging metadata (derived from CLIP's visual concepts) to create a balanced training dataset from a raw data pool. |
CLIP's training data is key to its success, but it is not publicly available, hindering reproducibility and further research. Existing attempts to replicate CLIP's data rely on filtering with the CLIP model, potentially introducing biases. |
The authors reconstruct CLIP's metadata, perform sub-string matching on a raw data pool (CommonCrawl), and then balance the distribution of data points over the metadata. |
MetaCLIP applied to CommonCrawl with 400M image-text pairs outperforms CLIP's data on 26 standard benchmarks.
In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy, surpassing CLIP's 68.3% on ViT-B models.
Scaling MetaCLIP to 2.5B data points, while maintaining the same training budget, reaches 79.2% accuracy on ViT-L/14. |
The paper only uses CommonCrawl as a data source, and other sources might yield different results.
Future work could explore more sophisticated methods for balancing data distribution. |
clip, data curation, vision-language pre-training, metadata, zero-shot learning |
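The two curation steps in the MetaCLIP entry above (sub-string matching, then balancing over metadata) can be sketched as follows. This is a simplified illustration, not the released curation code: the per-entry cap `t`, the sampling rule, and the names `match_entries` and `curate` are assumptions introduced here.

```python
import random
from collections import Counter

def match_entries(text, metadata):
    """Return the metadata entries that occur as sub-strings of the text."""
    low = text.lower()
    return [m for m in metadata if m in low]

def curate(texts, metadata, t, seed=0):
    """Drop unmatched texts and down-sample texts matched by frequent entries."""
    rng = random.Random(seed)
    matches = [match_entries(x, metadata) for x in texts]
    counts = Counter(m for ms in matches for m in ms)
    kept = []
    for text, ms in zip(texts, matches):
        # Assumption: an entry with count c selects a text with probability
        # min(1, t / c), so entries with at most t matches keep all their texts.
        if any(rng.random() < min(1.0, t / counts[m]) for m in ms):
            kept.append(text)
    return kept
```

With a small `t`, head entries are down-sampled while tail entries are kept in full, which flattens the distribution over metadata. Texts that match no entry are dropped.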
2309.16653
Report |
DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation |
Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, Gang Zeng |
Recent advances in 3D content creation mostly leverage optimization-based 3D
generation via score distillation sampling (SDS). Though promising results have
been exhibited, these methods often suffer from slow per-sample optimization,
limiting their practical usage. In this paper, we propose DreamGaussian, a
novel 3D content generation framework that achieves both efficiency and quality
simultaneously. Our key insight is to design a generative 3D Gaussian Splatting
model with companioned mesh extraction and texture refinement in UV space. In
contrast to the occupancy pruning used in Neural Radiance Fields, we
demonstrate that the progressive densification of 3D Gaussians converges
significantly faster for 3D generative tasks. To further enhance the texture
quality and facilitate downstream applications, we introduce an efficient
algorithm to convert 3D Gaussians into textured meshes and apply a fine-tuning
stage to refine the details. Extensive experiments demonstrate the superior
efficiency and competitive generation quality of our proposed approach.
Notably, DreamGaussian produces high-quality textured meshes in just 2 minutes
from a single-view image, achieving approximately 10 times acceleration
compared to existing methods. |
This paper presents DreamGaussian, a novel 3D content generation framework that achieves fast and high-quality 3D generation by adapting 3D Gaussian Splatting into generative settings with companioned mesh extraction and texture refinement. |
Existing optimization-based 3D generation methods suffer from slow per-sample optimization, limiting their practical usage. |
1. Adapt 3D Gaussian Splatting into generative settings for efficient 3D content creation. 2. Design an efficient mesh extraction algorithm from 3D Gaussians. 3. Propose a UV-space texture refinement stage to further enhance the generation quality. |
DreamGaussian significantly reduces the generation time of optimization-based 2D lifting methods for 3D content creation.
DreamGaussian achieves approximately 10 times acceleration compared to existing methods, producing high-quality textured meshes in just 2 minutes from a single-view image.
The proposed mesh extraction algorithm and UV-space texture refinement stage effectively enhance the generation quality of 3D content. |
The generated models may exhibit limitations such as the multi-face Janus problem, over-saturated texture, and baked lighting.
The back-view texture generated in image-to-3D results may appear blurry. |
3d generation, gaussian splatting, score distillation sampling, mesh extraction, texture refinement |
2309.16633
Report |
Mixup Your Own Pairs |
Yilei Wu, Zijian Dong, Chongyao Chen, Wangchunshu Zhou, Juan Helen Zhou |
In representation learning, regression has traditionally received less
attention than classification. Directly applying representation learning
techniques designed for classification to regression often results in
fragmented representations in the latent space, yielding sub-optimal
performance. In this paper, we argue that the potential of contrastive learning
for regression has been overshadowed due to the neglect of two crucial aspects:
ordinality-awareness and hardness. To address these challenges, we advocate
"mixup your own contrastive pairs for supervised contrastive regression",
instead of relying solely on real/augmented samples. Specifically, we propose
Supervised Contrastive Learning for Regression with Mixup (SupReMix). It takes
anchor-inclusive mixtures (mixup of the anchor and a distinct negative sample)
as hard negative pairs and anchor-exclusive mixtures (mixup of two distinct
negative samples) as hard positive pairs at the embedding level. This strategy
formulates harder contrastive pairs by integrating richer ordinal information.
Through extensive experiments on six regression datasets including 2D images,
volumetric images, text, tabular data, and time-series signals, coupled with
theoretical analysis, we demonstrate that SupReMix pre-training fosters
continuous ordered representations of regression data, resulting in significant
improvement in regression performance. Furthermore, SupReMix is superior to
other approaches in a range of regression challenges including transfer
learning, imbalanced training data, and scenarios with fewer training samples. |
The paper proposes SupReMix, a supervised contrastive learning framework for regression that leverages mixup to generate hard positive and negative pairs, improving representation learning for regression tasks by considering ordinality and hardness. |
Directly applying contrastive learning methods designed for classification to regression often leads to suboptimal performance due to neglecting the inherent ordinal nature of regression data and the importance of hard contrastive pairs. |
SupReMix creates hard negative pairs using anchor-inclusive mixup (anchor and a negative sample) and hard positive pairs via anchor-exclusive mixup (two negative samples with a combined label equal to the anchor's). It also introduces distance magnifying weights for negative pairs based on label distance. |
SupReMix consistently outperforms baseline methods, including vanilla deep regression and other supervised contrastive learning frameworks, across six datasets with various input modalities (text, 2D/3D images, tabular data, time series).
SupReMix significantly improves performance on imbalanced regression, transfer learning, and scenarios with reduced training data, demonstrating its robustness and data efficiency.
Ablation studies confirm the effectiveness of each proposed component (hard negative/positive pairs, distance magnifying weights) in boosting the performance. |
The choice of hyperparameters, such as the mixup Beta distribution and the window size for hard positive pair generation, requires careful tuning.
Future work could explore alternative methods for generating hard contrastive pairs or investigate the effectiveness of SupReMix on other regression tasks and domains. |
contrastive learning, regression, representation learning, mixup, ordinality-awareness |
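The two kinds of mixtures in the SupReMix entry above can be sketched at the embedding level as follows. This is a minimal illustration under stated assumptions: the mixing coefficient for hard negatives is passed in directly (the summary mentions a Beta distribution, which is not modelled here), and the names `mix`, `hard_negative`, and `hard_positive` are hypothetical.

```python
def mix(u, v, lam):
    """Convex combination lam * u + (1 - lam) * v of two embeddings."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(u, v)]

def hard_negative(z_anchor, z_neg, lam):
    """Anchor-inclusive mixture: blend the anchor with one negative sample."""
    return mix(z_anchor, z_neg, lam)

def hard_positive(y_anchor, z_i, y_i, z_j, y_j):
    """Anchor-exclusive mixture of two negatives whose mixed label equals y_anchor."""
    if not (min(y_i, y_j) < y_anchor < max(y_i, y_j)):
        raise ValueError("anchor label must lie strictly between the two negative labels")
    # Solve lam * y_i + (1 - lam) * y_j = y_anchor for lam.
    lam = (y_j - y_anchor) / (y_j - y_i)
    return mix(z_i, z_j, lam), lam
```

For hard positives the coefficient is determined by the labels, so only pairs of negatives whose labels bracket the anchor's label can be used. This is where the ordinal information enters.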
2309.16608
Report |
KV Inversion: KV Embeddings Learning for Text-Conditioned Real Image Action Editing |
Jiancheng Huang, Yifan Liu, Jin Qin, Shifeng Chen |
Text-conditioned image editing is a recently emerged and highly practical
task, and its potential is immeasurable. However, most of the concurrent
methods are unable to perform action editing, i.e., they cannot produce results
that conform to the action semantics of the editing prompt and preserve the
content of the original image. To solve the problem of action editing, we
propose KV Inversion, a method that can achieve satisfactory reconstruction
performance and action editing, which can solve two major problems: 1) the
edited result can match the corresponding action, and 2) the edited object can
retain the texture and identity of the original real image. In addition, our
method does not require training the Stable Diffusion model itself, nor does it
require scanning a large-scale dataset to perform time-consuming training. |
This paper presents KV Inversion, a method for text-conditioned action editing of real images with Stable Diffusion that requires no training of the diffusion model itself and preserves object identity and texture while modifying the action. |
It addresses the limitations of existing image editing techniques, which struggle to perform action editing on real images while maintaining fidelity to the original object's appearance. |
The method introduces Content Preserving self-attention (CP-attn), which learns Key and Value embeddings during a tuning stage to better preserve source image content. During editing, these learned embeddings, combined with the target text prompt, guide the generation of the edited image. |
Achieves superior action editing results on real images compared to concurrent methods, successfully modifying actions while retaining object identity and texture.
Demonstrates consistent performance across different domains, including natural images and anime-style images.
Provides a more efficient alternative by eliminating the need for finetuning the diffusion model or training on extensive datasets. |
Editing results may be unsatisfactory if the prompted action drastically conflicts with the original image pose.
Reliance on user-provided prompts without additional guidance (e.g., skeleton maps) can limit control over complex action editing. |
real image editing, diffusion model, text-to-image generation, action editing, content preserving |
2309.16588
Report |
Vision Transformers Need Registers |
Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski |
Transformers have recently emerged as a powerful tool for learning visual
representations. In this paper, we identify and characterize artifacts in
feature maps of both supervised and self-supervised ViT networks. The artifacts
correspond to high-norm tokens appearing during inference primarily in
low-informative background areas of images, that are repurposed for internal
computations. We propose a simple yet effective solution based on providing
additional tokens to the input sequence of the Vision Transformer to fill that
role. We show that this solution fixes that problem entirely for both
supervised and self-supervised models, sets a new state of the art for
self-supervised visual models on dense visual prediction tasks, enables object
discovery methods with larger models, and most importantly leads to smoother
feature maps and attention maps for downstream visual processing. |
The paper identifies artifacts in feature maps of Vision Transformer (ViT) networks, both supervised and self-supervised, and proposes a solution by adding register tokens to absorb these artifacts. |
Artifacts, corresponding to high-norm tokens in low-informative areas, negatively impact feature map smoothness and hinder object discovery methods. Addressing this issue improves performance in dense prediction tasks and enables object discovery with larger models. |
The authors analyze the artifacts, characterize them as high-norm tokens, and observe their emergence during training. They then propose adding register tokens to the input sequence, allowing the model to use them for internal computations instead of repurposing patch tokens. |
Adding register tokens effectively eliminates high-norm outlier tokens.
Models trained with registers show improved performance in dense prediction tasks like semantic segmentation and depth estimation.
Object discovery methods, previously incompatible with newer ViT models, become viable and show significant performance improvement when using models trained with registers. |
The study focuses on DINOv2 and may need further investigation for generalizability to other self-supervised methods.
Future work includes exploring regularization techniques for register tokens and investigating their potential for multi-modal tasks. |
vision transformers, artifacts, self-supervised learning, object discovery, register tokens |
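The register mechanism in the entry above amounts to a small change in how the token sequence is assembled and read out. The sketch below assumes the registers sit right after the [CLS] token and are discarded at the output. Both choices are assumptions made for illustration, as are the names `add_registers` and `split_output`.

```python
def add_registers(cls_token, patch_tokens, registers):
    """Build the input sequence: [CLS], then register tokens, then patch tokens."""
    return [cls_token] + list(registers) + list(patch_tokens)

def split_output(sequence, n_registers):
    """Read out [CLS] and patch outputs, dropping the register positions."""
    cls_out = sequence[0]
    patch_out = sequence[1 + n_registers:]
    return cls_out, patch_out
```

In a real model the registers would be learned embeddings shared across all images. Downstream heads would see only the [CLS] and patch outputs, so the extra tokens act purely as scratch space for internal computation.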
2309.16553
Report |
MatrixCity: A Large-scale City Dataset for City-scale Neural Rendering and Beyond |
Yixuan Li, Lihan Jiang, Linning Xu, Yuanbo Xiangli, Zhenzhi Wang, Dahua Lin, Bo Dai |
Neural radiance fields (NeRF) and its subsequent variants have led to
remarkable progress in neural rendering. While most recent neural rendering
works focus on objects and small-scale scenes, developing neural rendering
methods for city-scale scenes is of great potential in many real-world
applications. However, this line of research is impeded by the absence of a
comprehensive and high-quality dataset, yet collecting such a dataset over real
city-scale scenes is costly, sensitive, and technically difficult. To this end,
we build a large-scale, comprehensive, and high-quality synthetic dataset for
city-scale neural rendering research. Leveraging the Unreal Engine 5 City
Sample project, we develop a pipeline to easily collect aerial and street city
views, accompanied by ground-truth camera poses and a range of additional data
modalities. Flexible control over environmental factors such as lighting, weather,
and human and car crowds is also available in our pipeline, supporting the needs of
various tasks covering city-scale neural rendering and beyond. The resulting
pilot dataset, MatrixCity, contains 67k aerial images and 452k street images
from two city maps of total size 28 km². On top of MatrixCity, a thorough
benchmark is also conducted, which not only reveals unique challenges of the
task of city-scale neural rendering, but also highlights potential improvements
for future works. The dataset and code will be publicly available at our
project page: https://city-super.github.io/matrixcity/. |
This paper introduces MatrixCity, a large-scale, high-quality synthetic dataset designed for city-scale neural rendering research. |
Existing datasets for neural rendering are inadequate for city-scale scenes due to limited size, diversity, controllability, and availability. MatrixCity aims to bridge this gap and facilitate research in this area. |
The authors leveraged Unreal Engine 5 to create MatrixCity, developing a plugin for automatic data collection and incorporating diverse urban elements, controllable environments, and multiple ground-truth properties (depth, normal, reflectance). |
Modeling high-rise areas in aerial data is more challenging than low-rise areas due to complex occlusions.
Street-level data, being richer in detail, poses greater challenges for model capacity compared to aerial data.
Fusing aerial and street-level data naively degrades performance due to significant differences in detail and viewpoint. |
The current dataset focuses on static scenes; incorporating dynamic elements like moving objects and lighting changes more realistically is left for future work.
Exploring advanced algorithms to effectively fuse multi-view data with varying levels of detail is crucial for future research. |
neural rendering, city-scale 3d reconstruction, synthetic dataset, unreal engine 5, nerf |
2309.16421
Report |
Distilling ODE Solvers of Diffusion Models into Smaller Steps |
Sanghwan Kim, Hao Tang, Fisher Yu |
Diffusion models have recently gained prominence as a novel category
of generative models. Despite their success, these models face a notable
drawback in terms of slow sampling speeds, requiring a high number of function
evaluations (NFE) in the order of hundreds or thousands. In response, both
learning-free and learning-based sampling strategies have been explored to
expedite the sampling process. Learning-free sampling employs various ordinary
differential equation (ODE) solvers based on the formulation of diffusion ODEs.
However, it encounters challenges in faithfully tracking the true sampling
trajectory, particularly for small NFE. Conversely, learning-based sampling
methods, such as knowledge distillation, demand extensive additional training,
limiting their practical applicability. To overcome these limitations, we
introduce Distilled-ODE solvers (D-ODE solvers), a straightforward distillation
approach grounded in ODE solver formulations. Our method seamlessly integrates
the strengths of both learning-free and learning-based sampling. D-ODE solvers
are constructed by introducing a single parameter adjustment to existing ODE
solvers. Furthermore, we optimize D-ODE solvers with smaller steps using
knowledge distillation from ODE solvers with larger steps across a batch of
samples. Comprehensive experiments demonstrate the superior performance of
D-ODE solvers compared to existing ODE solvers, including DDIM, PNDM,
DPM-Solver, DEIS, and EDM, particularly in scenarios with fewer NFE. Notably,
our method incurs negligible computational overhead compared to previous
distillation techniques, facilitating straightforward and rapid integration
with existing samplers. Qualitative analysis reveals that D-ODE solvers not
only enhance image quality but also faithfully follow the target ODE
trajectory. |
This paper introduces Distilled-ODE solvers (D-ODE solvers), a novel distillation method to enhance the efficiency of diffusion model sampling by optimizing ODE solvers with minimal additional training. |
Diffusion models, despite their ability to generate high-quality samples, often suffer from slow sampling speeds due to the need for numerous function evaluations. Existing solutions, such as learning-free and learning-based sampling, present limitations in terms of trajectory accuracy or excessive training requirements. D-ODE solvers address these limitations by combining the strengths of both approaches. |
D-ODE solvers introduce a single adjustable parameter to existing ODE solvers, linearly combining current and previous denoising network outputs. This parameter is then optimized for each dataset by minimizing the difference between the outputs of D-ODE solvers with smaller steps (student) and those of ODE solvers with larger steps (teacher). The distillation process requires only one batch of samples, making it computationally efficient. |
D-ODE solvers consistently outperform state-of-the-art ODE solvers in terms of FID scores across various image generation benchmarks, particularly with a smaller number of function evaluations (NFE).
The method's efficiency is demonstrated through significantly reduced distillation times compared to previous distillation techniques, requiring only a few CPU minutes.
Visual analysis reveals that D-ODE solvers effectively guide the sampling process toward the target data manifold, enhancing image quality while maintaining fidelity to the original ODE trajectory. |
The single-parameter nature of D-ODE solvers may limit their effectiveness in generating high-resolution images.
Future research could explore incorporating local-specific parameters to address this limitation. |
diffusion models, generative models, knowledge distillation, ode solvers, fast sampling |
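The single-parameter idea in the D-ODE entry above can be illustrated with a closed-form least-squares fit. This is a simplified sketch, not the paper's procedure. It assumes the combined output has the form `d_cur + lam * (d_cur - d_prev)`, and that a per-sample teacher target for that combined output is available, whereas the summary describes matching solver outputs. The names `combine` and `fit_lambda` are hypothetical.

```python
def combine(d_cur, d_prev, lam):
    """Assumed one-parameter combination of current and previous denoiser outputs."""
    return [c + lam * (c - p) for c, p in zip(d_cur, d_prev)]

def fit_lambda(batch_cur, batch_prev, batch_target):
    """Least-squares lam minimising ||combine(d_cur, d_prev, lam) - target||^2 over a batch."""
    num = 0.0
    den = 0.0
    for d_cur, d_prev, target in zip(batch_cur, batch_prev, batch_target):
        for c, p, t in zip(d_cur, d_prev, target):
            diff = c - p
            num += (t - c) * diff
            den += diff * diff
    # If current and previous outputs coincide, lam has no effect; fall back to 0.
    return num / den if den > 0.0 else 0.0
```

Because only one scalar is fitted, a single batch is enough to estimate it. That is consistent with the low distillation cost reported in the summary.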
2309.16414
Report |
AutoCLIP: Auto-tuning Zero-Shot Classifiers for Vision-Language Models |
Jan Hendrik Metzen, Piyapat Saranrittichai, Chaithanya Kumar Mummadi |
Classifiers built upon vision-language models such as CLIP have shown
remarkable zero-shot performance across a broad range of image classification
tasks. Prior work has studied different ways of automatically creating
descriptor sets for every class based on prompt templates, ranging from
manually engineered templates over templates obtained from a large language
model to templates built from random words and characters. Up until now,
deriving zero-shot classifiers from the respective encoded class descriptors
has remained nearly unchanged, i.e., classify to the class that maximizes
cosine similarity between its averaged encoded class descriptors and the image
encoding. However, weighing all class descriptors equally can be suboptimal
when certain descriptors match visual clues on a given image better than
others. In this work, we propose AutoCLIP, a method for auto-tuning zero-shot
classifiers. AutoCLIP tunes per-image weights to each prompt template at
inference time, based on statistics of class descriptor-image similarities.
AutoCLIP is fully unsupervised, has very low computational overhead, and can be
easily implemented in a few lines of code. We show that AutoCLIP outperforms
baselines across a broad range of vision-language models, datasets, and prompt
templates consistently and by up to 3 percentage points in accuracy. |
This paper introduces AutoCLIP, a method for auto-tuning zero-shot image classifiers built on top of vision-language models (VLMs). AutoCLIP adapts the weights of prompt templates at test time based on the similarity between the encoded image and class descriptors. |
Zero-shot classifiers based on VLMs rely heavily on prompt engineering. Existing methods either use fixed prompts or employ computationally expensive test-time prompt tuning. AutoCLIP offers a more efficient alternative by dynamically adjusting prompt weights for each image, enhancing the zero-shot classifier's performance. |
AutoCLIP leverages the embedding space of VLMs. It computes weights for prompt templates based on the similarity between encoded class descriptors and the encoded image. These weights, determined through gradient ascent on a logsumexp aggregation of similarities, are then used to compute weighted class queries for classification. |
AutoCLIP consistently outperforms baseline zero-shot classifiers on a wide range of datasets, VLMs, and prompt templates, particularly with larger and more diverse sets of prompts.
AutoCLIP shows an average accuracy improvement of 0.45 percentage points and up to 3 percentage points in certain settings, with minimal computational overhead.
Ablation studies confirm the effectiveness of the logsumexp aggregation and the robustness of AutoCLIP to the choice of the entropy reduction factor for the step size. |
The default entropy reduction factor, while generally robust, could be suboptimal for some datasets, suggesting further exploration.
Future work could explore extending AutoCLIP to other zero-shot tasks like object detection and multi-modal prompting. |
zero-shot learning, vision-language models, prompt engineering, test-time adaptation, image classification |
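The weight update in the AutoCLIP entry above can be sketched in a few lines. This is a simplified illustration under stated assumptions: it takes precomputed similarities `sims[c][t]` between the image and the descriptor of class `c` under template `t`, scores each class by a weighted sum of those similarities, and uses a fixed `step_size` instead of the entropy-based step-size selection mentioned in the summary. The function names are hypothetical.

```python
import math

def softmax(v):
    """Numerically stable softmax of a list of floats."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    z = sum(e)
    return [x / z for x in e]

def autotune_weights(sims, step_size=1.0):
    """One gradient-ascent step on logsumexp of class scores, starting from uniform weights."""
    n_classes, n_templates = len(sims), len(sims[0])
    w = [1.0 / n_templates] * n_templates  # softmax of all-zero logits
    scores = [sum(wt * s for wt, s in zip(w, row)) for row in sims]
    p = softmax(scores)  # gradient of logsumexp(scores) with respect to scores
    # Chain rule through the softmax parameterisation of the weights.
    grad = [w[t] * sum(p[c] * (sims[c][t] - scores[c]) for c in range(n_classes))
            for t in range(n_templates)]
    return softmax([step_size * g for g in grad])

def classify(sims, w):
    """Predict the class with the highest weighted similarity."""
    scores = [sum(wt * s for wt, s in zip(w, row)) for row in sims]
    return max(range(len(scores)), key=scores.__getitem__)
```

The step raises the weight of templates whose descriptors match the image better than average, using only similarities that are already computed for classification. No labels are needed.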
2309.16364
Report |
FG-NeRF: Flow-GAN based Probabilistic Neural Radiance Field for Independence-Assumption-Free Uncertainty Estimation |
Songlin Wei, Jiazhao Zhang, Yang Wang, Fanbo Xiang, Hao Su, He Wang |
Neural radiance fields with stochasticity have garnered significant interest
by enabling the sampling of plausible radiance fields and quantifying
uncertainty for downstream tasks. Existing works rely on the independence
assumption of points in the radiance field or the pixels in input views to
obtain tractable forms of the probability density function. However, this
assumption inadvertently impacts performance when dealing with intricate
geometry and texture. In this work, we propose an independence-assumption-free
probabilistic neural radiance field based on Flow-GAN. By combining the
generative capability of adversarial learning and the powerful expressivity of
normalizing flow, our method explicitly models the density-radiance
distribution of the whole scene. We represent our probabilistic NeRF as a
mean-shifted probabilistic residual neural model. Our model is trained without
an explicit likelihood function, thereby avoiding the independence assumption.
Specifically, we downsample the training images with different strides and
centers to form fixed-size patches which are used to train the generator with
patch-based adversarial learning. Through extensive experiments, our method
demonstrates state-of-the-art performance by predicting lower rendering errors
and more reliable uncertainty on both synthetic and real-world datasets. |
This paper proposes Flow-GAN NeRF (FG-NeRF), a novel probabilistic neural radiance field that leverages adversarial learning and normalizing flows to estimate uncertainty in neural scene representations without relying on independence assumptions. |
Estimating uncertainty in neural radiance fields is crucial for applications like robotics, autonomous driving, and human-computer interaction where understanding the confidence of predictions is essential. Existing methods often make strong independence assumptions that limit their accuracy, especially in complex scenes. |
FG-NeRF decomposes the radiance field into deterministic and probabilistic branches, with the latter implemented using conditional normalizing flow. It employs a GAN framework where the generator synthesizes image patches by volume rendering samples from the learned distribution, while the discriminator distinguishes them from real patches. This adversarial training scheme allows FG-NeRF to learn complex density-radiance distributions without relying on explicit likelihood computations or independence assumptions. |
FG-NeRF achieves state-of-the-art uncertainty estimation performance on LLFF, ScanNet, and Replica datasets, outperforming previous methods like S-NeRF and CF-NeRF.
The method effectively captures intricate geometry and appearance details, resulting in high-quality uncertainty maps that highlight uncertain regions like object edges and areas with high-frequency textures.
Ablation studies demonstrate the effectiveness of key components like adversarial learning and the deterministic branch in improving uncertainty estimation and rendering quality. |
FG-NeRF is computationally expensive, requiring significant resources for training even with acceleration techniques like multi-level hash encoding.
The rendering quality, while good, is not on par with the latest advancements in NeRF rendering, leaving room for improvement by incorporating scene priors, advanced training strategies, and novel network architectures. |
neural radiance fields, uncertainty estimation, generative adversarial networks, normalizing flows, scene representation |
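The FG-NeRF abstract describes forming fixed-size patches by downsampling training images with different strides and centers. A minimal sketch of one way such a patch could be extracted is given below; the function name `strided_patch`, the centering convention, and the out-of-bounds handling are assumptions for illustration, not the authors' implementation.

```python
def strided_patch(image, center, stride, size):
    """Sample a size x size patch by taking every `stride`-th pixel around `center`.

    image:  2D list indexed as image[row][col]
    center: (row, col) of the patch center in image coordinates
    A larger stride covers a wider image region with the same number of pixels.
    """
    height, width = len(image), len(image[0])
    center_row, center_col = center
    half = size // 2
    rows = [center_row + (i - half) * stride for i in range(size)]
    cols = [center_col + (j - half) * stride for j in range(size)]
    if min(rows) < 0 or max(rows) >= height or min(cols) < 0 or max(cols) >= width:
        raise ValueError("patch falls outside the image")
    return [[image[r][c] for c in cols] for r in rows]


# Toy 8x8 "image" whose pixel value encodes its position.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
fine_patch = strided_patch(image, center=(4, 4), stride=1, size=3)
coarse_patch = strided_patch(image, center=(4, 4), stride=2, size=3)
```

Both patches have the same 3 x 3 shape, so a patch discriminator with a fixed input size could, under this assumption, see local detail (stride 1) and wider context (stride 2).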
2309.16354
Report |
Transformer-VQ: Linear-Time Transformers via Vector Quantization |
Lucas D. Lingle |
We introduce Transformer-VQ, a decoder-only transformer computing
softmax-based dense self-attention in linear time. Transformer-VQ's efficient
attention is enabled by vector-quantized keys and a novel caching mechanism. In
our large-scale experiments, Transformer-VQ is shown highly competitive in
quality, obtaining 0.99 bpb on Enwik8, 26.6 ppl on PG-19, and 3.16 bpb on
ImageNet64. In addition, the optimized implementation of Transformer-VQ is over
3x faster than a comparable quadratic-time transformer at sequence length 8k,
is over 12x faster at 32k, and can scale to 131k with similar throughput. Code
available: https://github.com/transformer-vq/transformer_vq |
This paper introduces Transformer-VQ, a decoder-only Transformer that uses vector-quantized keys to compute dense self-attention in linear time. |
Standard Transformers have quadratic time complexity, limiting their practicality for long sequences. Efficient Transformers are crucial for long-context applications. |
Transformer-VQ combines vector-quantized keys, localized positional biases, and a compressive cache for efficient attention. Theorems prove the linear-time complexity and equivalence to dense attention. The model is trained with a cross-entropy loss and a codebook commitment loss. |
Transformer-VQ achieves 0.99 bpb on Enwik8, matching Transformer-XL with fewer parameters and a shorter cache.
It achieves 26.6 ppl on PG-19, near state-of-the-art, showing the efficacy of standalone self-attention for long sequences.
It sets a new state-of-the-art of 3.16 bpb on ImageNet64, generating high-fidelity samples in linear time. |
Overfitting was a significant issue on Enwik8, requiring careful tuning of regularization parameters.
Future work includes exploring formal scaling laws, larger models, and porting to lower-level frameworks. |
transformer, linear-time attention, vector quantization, long-range dependencies, efficient transformers |
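To illustrate why vector-quantized keys make dense softmax attention cheaper, the sketch below checks, for a single query and without causal masking, that attention over quantized keys equals attention computed through the codebook alone. It is a simplified toy under stated assumptions: the helper names are hypothetical, and the paper's causal compressive cache and positional biases are not reproduced.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nearest_code(key, codebook):
    """Index of the codebook vector closest to `key` (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda c: sum((k - e) ** 2 for k, e in zip(key, codebook[c])))


def dense_attention(query, keys, values):
    """Reference softmax attention: one score per key."""
    weights = [math.exp(dot(query, k)) for k in keys]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def vq_attention(query, code_ids, values, codebook):
    """Same output via the codebook: one score per code instead of one per key."""
    num_codes, dim = len(codebook), len(values[0])
    counts = [0] * num_codes
    value_sums = [[0.0] * dim for _ in range(num_codes)]
    for c, v in zip(code_ids, values):  # query-independent summary of the sequence
        counts[c] += 1
        for d in range(dim):
            value_sums[c][d] += v[d]
    scores = [math.exp(dot(query, e)) for e in codebook]
    total = sum(s * n for s, n in zip(scores, counts))
    return [sum(s * vs[d] for s, vs in zip(scores, value_sums)) / total for d in range(dim)]


codebook = [[1.0, 0.0], [0.0, 1.0]]
raw_keys = [[0.9, 0.1], [0.2, 0.8], [1.1, -0.1], [0.0, 1.2]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 3.0]]
query = [0.5, -0.5]

code_ids = [nearest_code(k, codebook) for k in raw_keys]
quantized_keys = [codebook[c] for c in code_ids]
dense_out = dense_attention(query, quantized_keys, values)
vq_out = vq_attention(query, code_ids, values, codebook)
```

Because every key sharing a code receives the same score, the per-code counts and value sums can be accumulated once and reused, so the per-query cost depends on the codebook size rather than the sequence length.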
2309.16351
Report |
Dark Side Augmentation: Generating Diverse Night Examples for Metric Learning |
Albert Mohwald, Tomas Jenicek, Ondřej Chum |
Image retrieval methods based on CNN descriptors rely on metric learning from
a large number of diverse examples of positive and negative image pairs.
Domains, such as night-time images, with limited availability and variability
of training data suffer from poor retrieval performance even with methods
performing well on standard benchmarks. We propose to train a GAN-based
synthetic-image generator, translating available day-time image examples into
night images. Such a generator is used in metric learning as a form of
augmentation, supplying training data to the scarce domain. Various types of
generators are evaluated and analyzed. We contribute with a novel light-weight
GAN architecture that enforces the consistency between the original and
translated image through edge consistency. The proposed architecture also
allows a simultaneous training of an edge detector that operates on both night
and day images. To further increase the variability in the training examples
and to maximize the generalization of the trained model, we propose a novel
method of diverse anchor mining.
The proposed method improves over the state-of-the-art results on a standard
Tokyo 24/7 day-night retrieval benchmark while preserving the performance on
Oxford and Paris datasets. This is achieved without the need of training image
pairs of matching day and night images. The source code is available at
https://github.com/mohwald/gandtr. |
This paper introduces a novel approach for training deep neural networks to generate image descriptors for day-night illumination-invariant image retrieval, utilizing synthetically generated night images instead of relying on corresponding pairs of night and day training images. |
This method addresses the challenge of limited availability and variability of training data for night-time images in image retrieval tasks, leading to improved performance in day-night retrieval scenarios. |
The proposed method utilizes a GAN-based synthetic-image generator to translate day-time images into night images for augmenting the training data. The authors propose a novel light-weight GAN architecture that enforces consistency between the original and translated image through edge consistency and enables simultaneous training of an edge detector effective on both night and day images. Additionally, a diverse anchor mining method is introduced to enhance the variability of training examples. |
The method surpasses previous approaches, including those using ground-truth day-night image pairs, in retrieval performance.
A larger diversity of synthesized training data proves more beneficial than a smaller set of real training data.
The proposed light-weight generator, utilizing edge consistency, demonstrates comparable performance to more computationally intensive generators while training significantly faster. |
The impact of using different edge detectors on the generator's performance requires further investigation.
Exploring the potential benefits of combining the proposed method with other domain adaptation techniques is a promising direction for future research. |
image retrieval, generative adversarial networks, data augmentation, illumination invariance, metric learning |
2309.16108
Report |
Channel Vision Transformers: An Image Is Worth 1 x 16 x 16 Words |
Yujia Bao, Srinivasan Sivanandan, Theofanis Karaletsos |
Vision Transformer (ViT) has emerged as a powerful architecture in the realm
of modern computer vision. However, its application in certain imaging fields,
such as microscopy and satellite imaging, presents unique challenges. In these
domains, images often contain multiple channels, each carrying semantically
distinct and independent information. Furthermore, the model must demonstrate
robustness to sparsity in input channels, as they may not be densely available
during training or testing. In this paper, we propose a modification to the ViT
architecture that enhances reasoning across the input channels and introduce
Hierarchical Channel Sampling (HCS) as an additional regularization technique
to ensure robustness when only partial channels are presented during test time.
Our proposed model, ChannelViT, constructs patch tokens independently from each
input channel and utilizes a learnable channel embedding that is added to the
patch tokens, similar to positional embeddings. We evaluate the performance of
ChannelViT on ImageNet, JUMP-CP (microscopy cell imaging), and So2Sat
(satellite imaging). Our results show that ChannelViT outperforms ViT on
classification tasks and generalizes well, even when a subset of input channels
is used during testing. Across our experiments, HCS proves to be a powerful
regularizer, independent of the architecture employed, suggesting itself as a
straightforward technique for robust ViT training. Lastly, we find that
ChannelViT generalizes effectively even when there is limited access to all
channels during training, highlighting its potential for multi-channel imaging
under real-world conditions with sparse sensors. Our code is available at
https://github.com/insitro/ChannelViT. |
This paper proposes ChannelViT, a Vision Transformer (ViT) modification for multi-channel imaging, which improves channel reasoning and handles sparse channel availability. |
ViTs struggle with multi-channel imaging due to semantically distinct information in each channel and potential sparsity in channel availability during training and testing. |
ChannelViT generates separate patch tokens per channel, adds learnable channel embeddings to them, and employs Hierarchical Channel Sampling (HCS) during training for robustness when only a subset of channels is available at test time. |
ChannelViT outperforms ViT on ImageNet, JUMP-CP (microscopy), and So2Sat (satellite) datasets.
HCS significantly improves channel robustness, enabling models to generalize well to unseen channel combinations.
ChannelViT demonstrates data efficiency, performing well even with limited access to all channels during training. |
ChannelViT's longer token sequence (one set of patch tokens per channel) increases computational cost.
Future work includes exploring more efficient attention mechanisms to reduce computational overhead. |
vision transformer, multi-channel imaging, channel robustness, hierarchical channel sampling, self-supervised learning |
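The per-channel tokenization described for ChannelViT can be sketched as follows. This is a toy illustration, not the released code: it assumes a patch projection and positional embeddings shared across channels, and all names (`channel_tokens`, `proj`, `chan_emb`, `pos_emb`) are hypothetical.

```python
def channel_tokens(image, patch, proj, chan_emb, pos_emb):
    """Build one token per (channel, patch) pair.

    image:    [C][H][W] nested lists
    proj:     [D][patch*patch] projection (assumed shared by all channels)
    chan_emb: [C][D] learnable channel embeddings
    pos_emb:  [N][D] positional embeddings, N = (H // patch) * (W // patch)
    """
    num_channels, height, width = len(image), len(image[0]), len(image[0][0])
    tokens = []
    for c in range(num_channels):
        position = 0
        for top in range(0, height, patch):
            for left in range(0, width, patch):
                flat = [image[c][top + i][left + j]
                        for i in range(patch) for j in range(patch)]
                embedded = [sum(w * x for w, x in zip(row, flat)) for row in proj]
                tokens.append([e + ce + pe for e, ce, pe
                               in zip(embedded, chan_emb[c], pos_emb[position])])
                position += 1
    return tokens


C, H, W, P, D = 3, 4, 4, 2, 2
image = [[[0.0] * W for _ in range(H)] for _ in range(C)]
proj = [[0.1] * (P * P) for _ in range(D)]
chan_emb = [[float(c)] * D for c in range(C)]
pos_emb = [[0.01 * n] * D for n in range((H // P) * (W // P))]
tokens = channel_tokens(image, P, proj, chan_emb, pos_emb)
```

With 3 channels and 4 patches per channel the sequence has 12 tokens instead of the 4 a standard ViT would produce, which is the source of the extra computational cost noted in the limitations.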
2309.15954
Report |
The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering |
Haichao Yu, Yu Tian, Sateesh Kumar, Linjie Yang, Heng Wang |
The quality of pre-training data plays a critical role in the performance of
foundation models. Popular foundation models often design their own recipe for
data filtering, which makes it hard to analyze and compare different data
filtering approaches. DataComp is a new benchmark dedicated to evaluating
different methods for data filtering. This paper describes our learning and
solution when participating in the DataComp challenge. Our filtering strategy
includes three stages: single-modality filtering, cross-modality filtering, and
data distribution alignment. We integrate existing methods and propose new
solutions, such as computing CLIP score on horizontally flipped images to
mitigate the interference of scene text, using vision and language models to
retrieve training samples for target downstream tasks, rebalancing the data
distribution to improve the efficiency of allocating the computational budget,
etc. We slice and dice our design choices, provide in-depth analysis, and
discuss open questions. Our approach outperforms the best method from the
DataComp paper by over 4% on the average performance of 38 tasks and by over 2%
on ImageNet. |
This paper presents a three-stage data filtering framework for multi-modal pre-training, aiming to enhance the performance of foundation models on various vision and language tasks. |
Data quality is crucial for foundation model performance, and understanding effective filtering strategies is vital for improving these models. |
The framework consists of: (1) Single-modality filtering (image/text quality), (2) Cross-modality filtering (image-text similarity using flipped-CLIP and BLIP-ITM), and (3) Distribution alignment (dataset diversity, computational budget allocation, and downstream task alignment). |
The proposed approach outperforms the best method from the DataComp paper by over 4% on average across 38 tasks.
Flipped-CLIP score is found to be a more effective filtering metric compared to the original CLIP score.
Aligning pre-training data distribution with downstream tasks, especially for digit recognition, significantly improves performance. |
The generalization of data filtering methods to different data distributions requires further investigation.
A better balance is needed when filtering images containing scene text to benefit tasks requiring such information. |
data filtering, multi-modal learning, foundation models, dataset curation, clip |
2309.15842
Report |
Exploiting the Signal-Leak Bias in Diffusion Models |
Martin Nicolas Everaert, Athanasios Fitsios, Marco Bocchio, Sami Arpa, Sabine Süsstrunk, Radhakrishna Achanta |
There is a bias in the inference pipeline of most diffusion models. This bias
arises from a signal leak whose distribution deviates from the noise
distribution, creating a discrepancy between training and inference processes.
We demonstrate that this signal-leak bias is particularly significant when
models are tuned to a specific style, causing sub-optimal style matching.
Recent research tries to avoid the signal leakage during training. We instead
show how we can exploit this signal-leak bias in existing diffusion models to
allow more control over the generated images. This enables us to generate
images with more varied brightness, and images that better match a desired
style or color. By modeling the distribution of the signal leak in the spatial
frequency and pixel domains, and including a signal leak in the initial latent,
we generate images that better match expected results without any additional
training. |
This paper analyzes the signal-leak bias in diffusion models, showing that it is caused by discrepancies between noise and data distributions at the last training timestep, and proposes a method to exploit it for controlling image generation. |
The signal-leak bias limits control over generated images, particularly affecting style adaptation and brightness diversity. This paper offers a simple solution for more controllable and varied image generation. |
The proposed method models the distribution of the signal leak from target images, either in the pixel or frequency domain. This distribution is then used to sample the initial latent at inference time, effectively biasing the generated image towards desired characteristics. |
Significantly improves style matching in models fine-tuned for specific styles without additional training.
Enables style-specific image generation with non-style-specific diffusion models by leveraging the signal leak.
Generates images with more diverse brightness and color variations by modeling low-frequency components of the signal leak. |
The proposed method relies on random sampling of the signal leak, potentially limiting control over brightness matching with the textual prompt.
Certain specific styles may not be easily captured by the proposed pixel-domain model, requiring alternative modeling or fine-tuning. |
diffusion models, signal-leak bias, style adaptation, image generation, frequency domain analysis |
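A minimal sketch of the pixel-domain variant summarized above: fit a per-element Gaussian to a few target latents, sample a signal leak from it, and mix it with noise to form the initial latent. The mixing rule x_T = sqrt(alpha_bar_T) * leak + sqrt(1 - alpha_bar_T) * noise is assumed here from the standard forward-diffusion equation; the function names and the value of `alpha_bar_T` are illustrative, not taken from the paper's code.

```python
import math
import random
import statistics


def fit_pixel_model(target_latents):
    """Per-element mean and standard deviation over flattened target latents."""
    columns = list(zip(*target_latents))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds


def sample_initial_latent(means, stds, alpha_bar_T, rng):
    """Initial latent = scaled signal leak + scaled Gaussian noise (assumed mixing rule)."""
    leak = [rng.gauss(m, s) for m, s in zip(means, stds)]
    noise = [rng.gauss(0.0, 1.0) for _ in means]
    signal_scale = math.sqrt(alpha_bar_T)
    noise_scale = math.sqrt(1.0 - alpha_bar_T)
    return [signal_scale * l + noise_scale * n for l, n in zip(leak, noise)]


rng = random.Random(0)
target_latents = [[0.8, -0.2, 0.1], [1.0, 0.0, 0.3], [1.2, 0.2, 0.2]]
means, stds = fit_pixel_model(target_latents)
x_T = sample_initial_latent(means, stds, alpha_bar_T=0.005, rng=rng)  # illustrative value
```

Sampling from pure noise corresponds to dropping the leak term; keeping it shifts the starting point toward the statistics of the target images without any retraining.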
2309.15830
Report |
OrthoPlanes: A Novel Representation for Better 3D-Awareness of GANs |
Honglin He, Zhuoqian Yang, Shikai Li, Bo Dai, Wayne Wu |
We present a new method for generating realistic and view-consistent images
with fine geometry from 2D image collections. Our method proposes a hybrid
explicit-implicit representation called OrthoPlanes, which encodes
fine-grained 3D information in feature maps that can be efficiently generated
by modifying 2D StyleGANs. Compared to previous representations, our method has
better scalability and expressiveness with clear and explicit information. As a
result, our method can handle more challenging view-angles and synthesize
articulated objects with high spatial degree of freedom. Experiments
demonstrate that our method achieves state-of-the-art results on FFHQ and SHHQ
datasets, both quantitatively and qualitatively. Project page:
https://orthoplanes.github.io/. |
This paper presents OrthoPlanes, a hybrid explicit-implicit 3D representation that improves 3D awareness and geometry quality in 2D GANs, particularly for articulated objects. |
Existing 3D-aware GANs struggle to accurately reconstruct complex 3D shapes from 2D images, limiting realism and view-consistency, especially for non-rigid objects. |
Uses StyleGAN2 to generate feature maps representing sectional projections of a 3D scene onto groups of parallel planes. Location embeddings enhance spatial awareness and a lightweight MLP decodes features for volumetric rendering. |
Achieves state-of-the-art results on FFHQ and SHHQ datasets for 3D-aware image synthesis.
Exhibits superior view-consistency, handling challenging angles better than previous methods.
Demonstrates improved geometry reconstruction, especially for articulated objects like human bodies. |
Background artifacts persist, requiring further exploration of modeling assumptions.
Inconsistencies under view variations arise from the two-stage rendering process, suggesting direct RGB rendering as future work. |
3d-aware gans, image synthesis, neural rendering, view consistency, 3d reconstruction |
2309.15818
Report |
Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation |
David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, Mike Zheng Shou |
Significant advancements have been achieved in the realm of large-scale
pre-trained text-to-video Diffusion Models (VDMs). However, previous methods
either rely solely on pixel-based VDMs, which come with high computational
costs, or on latent-based VDMs, which often struggle with precise text-video
alignment. In this paper, we are the first to propose a hybrid model, dubbed as
Show-1, which marries pixel-based and latent-based VDMs for text-to-video
generation. Our model first uses pixel-based VDMs to produce a low-resolution
video of strong text-video correlation. After that, we propose a novel expert
translation method that employs the latent-based VDMs to further upsample the
low-resolution video to high resolution. Compared to latent VDMs, Show-1 can
produce high-quality videos of precise text-video alignment; compared to pixel
VDMs, Show-1 is much more efficient (GPU memory usage during inference is 15G
vs 72G). We also validate our model on standard video generation benchmarks.
Our code and model weights are publicly available at
https://github.com/showlab/Show-1. |
This paper introduces Show-1, a hybrid text-to-video generation model that combines the strengths of pixel-based and latent-based Video Diffusion Models (VDMs). |
Existing pixel-based VDMs are computationally expensive while latent-based VDMs often struggle with text-video alignment. |
The model uses pixel-based VDMs for low-resolution keyframe and temporal interpolation generation, ensuring strong text-video correlation. It then employs a novel expert translation method based on latent-based VDMs for efficient upsampling to high resolution. |
Show-1 generates high-quality videos with precise text-video alignment.
It achieves state-of-the-art performance on UCF-101 and MSR-VTT benchmarks.
The model offers significant efficiency improvements, requiring only 15GB of GPU memory during inference compared to 72GB for pixel-based methods. |
The model's reliance on web data for training may lead to biases in the generated content.
Future work could explore methods for mitigating bias and further improving the model's efficiency. |
text-to-video generation, diffusion models, video synthesis, deep learning, computer vision |
2309.15807
Report |
Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack |
Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, Matthew Yu, Abhishek Kadian, Filip Radenovic, Dhruv Mahajan, Kunpeng Li, Yue Zhao, Vladan Petrovic, Mitesh Kumar Singh, Simran Motwani, Yi Wen, Yiwen Song, Roshan Sumbaly, Vignesh Ramanathan, Zijian He, Peter Vajda, Devi Parikh |
Training text-to-image models with web scale image-text pairs enables the
generation of a wide range of visual concepts from text. However, these
pre-trained models often face challenges when it comes to generating highly
aesthetic images. This creates the need for aesthetic alignment post
pre-training. In this paper, we propose quality-tuning to effectively guide a
pre-trained model to exclusively generate highly visually appealing images,
while maintaining generality across visual concepts. Our key insight is that
supervised fine-tuning with a set of surprisingly small but extremely visually
appealing images can significantly improve the generation quality. We pre-train
a latent diffusion model on 1.1 billion image-text pairs and fine-tune it
with only a few thousand carefully selected high-quality images. The resulting
model, Emu, achieves a win rate of 82.9% compared with its pre-trained only
counterpart. Compared to the state-of-the-art SDXLv1.0, Emu is preferred
68.4% and 71.3% of the time on visual appeal on the standard PartiPrompts
and our Open User Input benchmark based on the real-world usage of
text-to-image models. In addition, we show that quality-tuning is a generic
approach that is also effective for other architectures, including pixel
diffusion and masked generative transformer models. |
This paper proposes "quality-tuning", a method for enhancing the aesthetics of text-to-image models by fine-tuning them on a small, curated dataset of high-quality images. |
Pre-trained text-to-image models often struggle to consistently generate visually appealing images. This quality-tuning approach addresses the need for improved aesthetic alignment post pre-training. |
The authors pre-train a latent diffusion model on 1.1 billion image-text pairs. They then fine-tune this model using a dataset of 2000 meticulously selected, high-quality images. This dataset is curated through a combination of automatic filtering and rigorous two-stage human evaluation based on photographic principles. |
Quality-tuning significantly improves the visual appeal of generated images, outperforming a state-of-the-art SDXLv1.0 model.
The effectiveness of quality-tuning is demonstrated even with a surprisingly small fine-tuning dataset, emphasizing the importance of quality over quantity.
The approach is generalizable and shows improvements on other architectures, including pixel diffusion and masked generative transformer models. |
Human evaluation of aesthetics is inherently subjective and may vary based on prompts, annotators, and guidelines.
Despite improved aesthetics, limitations from the pre-training stage might persist, such as difficulty generating specific objects not well-represented during pre-training. |
text-to-image generation, aesthetic alignment, quality-tuning, latent diffusion model, fine-tuning |
2309.15664
Report |
Dynamic Prompt Learning: Addressing Cross-Attention Leakage for Text-Based Image Editing |
Kai Wang, Fei Yang, Shiqi Yang, Muhammad Atif Butt, Joost van de Weijer |
Large-scale text-to-image generative models have been a ground-breaking
development in generative AI, with diffusion models showing their astounding
ability to synthesize convincing images following an input text prompt. The
goal of image editing research is to give users control over the generated
images by modifying the text prompt. Current image editing techniques are
susceptible to unintended modifications of regions outside the targeted area,
such as on the background or on distractor objects which have some semantic or
visual relationship with the targeted object. According to our experimental
findings, inaccurate cross-attention maps are at the root of this problem.
Based on this observation, we propose Dynamic Prompt Learning (DPL) to force
cross-attention maps to focus on correct noun words in the text prompt. By
updating the dynamic tokens for nouns in the textual input with the proposed
leakage repairment losses, we achieve fine-grained image editing over
particular objects while preventing undesired changes to other image regions.
Our method DPL, based on the publicly available Stable Diffusion, is
extensively evaluated on a wide range of images, and consistently obtains
superior results both quantitatively (CLIP score, Structure-Dist) and
qualitatively (on user-evaluation). We show improved prompt editing results for
Word-Swap, Prompt Refinement, and Attention Re-weighting, especially for
complex multi-object scenes. |
This paper proposes Dynamic Prompt Learning (DPL), a method to improve text-guided image editing in diffusion models by addressing cross-attention leakage to background and distractor objects. |
Current text-guided image editing techniques often produce undesired modifications outside the targeted area due to inaccurate cross-attention maps. |
DPL optimizes dynamic tokens for nouns in the text prompt using three losses: Disjoint Object Attention Loss, Background Leakage Loss, and Attention Balancing Loss. It leverages DDIM and Null-Text inversion for image reconstruction and editing. |
DPL significantly improves the accuracy of cross-attention maps compared to the baseline (Null-Text Inversion).
DPL achieves superior quantitative results in image editing tasks like Word-Swap, as measured by CLIP-Score and Structure-Dist.
User studies confirm that DPL leads to significantly better image editing results compared to the baseline, especially for complex multi-object scenes. |
The reliance on smaller cross-attention maps (16x16) limits fine-grained structure control.
The method currently doesn't handle scenarios where multiple noun words in the prompt refer to the same object. |
image editing, diffusion models, text-to-image, cross-attention, stable diffusion |
2309.15508
Report |
DreamCom: Finetuning Text-guided Inpainting Model for Image Composition |
Lingxiao Lu, Jiangtong Li, Bo Zhang, Li Niu |
The goal of image composition is merging a foreground object into a
background image to obtain a realistic composite image. Recently, generative
composition methods are built on large pretrained diffusion models, due to
their unprecedented image generation ability. However, they are weak in
preserving the foreground object details. Inspired by recent text-to-image
generation customized for certain object, we propose DreamCom by treating image
composition as text-guided image inpainting customized for certain object.
Specifically, we finetune a pretrained text-guided image inpainting model based
on a few reference images containing the same object, during which the text
prompt contains a special token associated with this object. Then, given a new
background, we can insert this object into the background with the text prompt
containing the special token. In practice, the inserted object may be adversely
affected by the background, so we propose masked attention mechanisms to avoid
negative background interference. Experimental results on DreamEditBench and
our contributed MureCom dataset show the outstanding performance of our
DreamCom. |
This supplementary material for the DreamCom paper further explores the impact of masked self-attention in the model and provides additional experiments and comparisons. |
The supplementary materials provide deeper insights into the DreamCom model's effectiveness for image composition by analyzing the roles of different components and comparing it with other state-of-the-art methods. |
The authors conducted ablation studies on masking self-attention layers, experimented with varying numbers of reference images, and presented a visual comparison of DreamCom with baselines like DreamEdit, ObjectStitch, and PbE. |
Masking outer decoder self-attention layers hurts foreground-background compatibility, while masking other layers helps prevent color leakage.
Increasing the number of reference images generally improves performance, with 4 or 5 images yielding similar results.
DreamCom outperforms baselines by achieving better foreground-background integration in terms of pose and lighting and preserving foreground details. |
The impact of the number of reference images might vary depending on the background complexity.
Future work could explore automatically determining the optimal number of reference images. |
image composition, text-guided inpainting, self-attention, reference images, dreamcom |
2309.15505
Report |
Finite Scalar Quantization: VQ-VAE Made Simple |
Fabian Mentzer, David Minnen, Eirikur Agustsson, Michael Tschannen |
We propose to replace vector quantization (VQ) in the latent representation
of VQ-VAEs with a simple scheme termed finite scalar quantization (FSQ), where
we project the VAE representation down to a few dimensions (typically less than
10). Each dimension is quantized to a small set of fixed values, leading to an
(implicit) codebook given by the product of these sets. By appropriately
choosing the number of dimensions and values each dimension can take, we obtain
the same codebook size as in VQ. On top of such discrete representations, we
can train the same models that have been trained on VQ-VAE representations. For
example, autoregressive and masked transformer models for image generation,
multimodal generation, and dense prediction computer vision tasks. Concretely,
we employ FSQ with MaskGIT for image generation, and with UViM for depth
estimation, colorization, and panoptic segmentation. Despite the much simpler
design of FSQ, we obtain competitive performance in all these tasks. We
emphasize that FSQ does not suffer from codebook collapse and does not need the
complex machinery employed in VQ (commitment losses, codebook reseeding, code
splitting, entropy penalties, etc.) to learn expressive discrete
representations. |
This paper proposes replacing vector quantization (VQ) in VQ-VAEs with finite scalar quantization (FSQ), a simpler method that projects representations to low dimensions and quantizes each dimension to a small set of values. |
VQ suffers from codebook collapse and requires complex techniques for optimization. FSQ simplifies the process while aiming for high codebook utilization without auxiliary losses. |
The authors replace VQ with FSQ in MaskGIT (for image generation) and UViM (for depth estimation, colorization, and segmentation), comparing their performance across various codebook sizes. |
FSQ achieves comparable performance to VQ in image generation and dense computer vision tasks.
FSQ exhibits high codebook utilization (almost 100%) without auxiliary losses, unlike VQ which struggles with larger codebooks.
FSQ performance scales with codebook size, leading to better reconstruction and sample quality. |
The study primarily focuses on MaskGIT and UViM, potentially limiting the generalizability of the findings.
Further investigation into the semantic properties of FSQ representations is needed. |
vector quantization, vq-vae, image generation, dense prediction, representation learning |
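The FSQ idea summarized above (bound each of a few latent dimensions, round it to a small fixed set of values, and treat the product of those sets as an implicit codebook) can be sketched as follows. This is a forward-pass-only illustration under assumptions: odd level counts, a tanh bound, and the example levels `[7, 5, 5, 5]` are chosen for the demo, and gradient handling during training is omitted.

```python
import math


def fsq_quantize(z, levels):
    """Quantize each coordinate of z to one of levels[i] integer values.

    Assumes every entry of `levels` is odd, so the grid is symmetric around 0.
    """
    codes = []
    for value, num_levels in zip(z, levels):
        half = (num_levels - 1) / 2            # e.g. 5 levels -> grid {-2, ..., 2}
        bounded = half * math.tanh(value)      # squash into (-half, half)
        codes.append(int(round(bounded)))      # snap to the nearest grid point
    return tuple(codes)


def codebook_size(levels):
    """Size of the implicit codebook: the product of the per-dimension level counts."""
    return math.prod(levels)


def code_index(codes, levels):
    """Map a quantized vector to a single integer in [0, codebook_size)."""
    index = 0
    for code, num_levels in zip(codes, levels):
        index = index * num_levels + (code + (num_levels - 1) // 2)
    return index


levels = [7, 5, 5, 5]                          # illustrative choice
codes = fsq_quantize([0.0, 10.0, -10.0, 0.0], levels)
token_id = code_index(codes, levels)
```

No codebook is stored or learned here: every grid point is a valid code by construction, which is consistent with the summary's point that FSQ avoids codebook collapse without auxiliary losses.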
2309.15275
Report |
Efficient Low-rank Backpropagation for Vision Transformer Adaptation |
Yuedong Yang, Hung-Yueh Chiang, Guihong Li, Diana Marculescu, Radu Marculescu |
The increasing scale of vision transformers (ViT) has made the efficient
fine-tuning of these large models for specific needs a significant challenge in
various applications. This issue originates from the computationally demanding
matrix multiplications required during the backpropagation process through
linear layers in ViT. In this paper, we tackle this problem by proposing a new
Low-rank BackPropagation via Walsh-Hadamard Transformation (LBP-WHT) method.
Intuitively, LBP-WHT projects the gradient into a low-rank space and carries
out backpropagation. This approach substantially reduces the computation needed
for adapting ViT, as matrix multiplication in the low-rank space is far less
resource-intensive. We conduct extensive experiments with different models
(ViT, hybrid convolution-ViT model) on multiple datasets to demonstrate the
effectiveness of our method. For instance, when adapting an EfficientFormer-L1
model on CIFAR100, our LBP-WHT achieves 10.4% higher accuracy than the
state-of-the-art baseline, while requiring 9 MFLOPs less computation. As the
first work to accelerate ViT adaptation with low-rank backpropagation, our
LBP-WHT method is complementary to many prior efforts and can be combined with
them for better performance. |
This paper proposes LBP-WHT, a novel low-rank backpropagation method using Walsh-Hadamard Transformation to accelerate the adaptation of Vision Transformers (ViT) for specific tasks. |
Adapting large-scale ViT models, especially on resource-constrained edge devices, is challenging due to the computational complexity of backpropagation through dense linear layers. |
LBP-WHT projects gradients into a low-rank space using WHT, performs efficient matrix multiplications in this reduced space, and finally projects the results back to the original space. This reduces computational cost while maintaining accuracy. |
LBP-WHT consistently outperforms LoRA-based methods in terms of both speed and accuracy across various image classification and semantic segmentation tasks.
The method offers flexibility in balancing accuracy and computational cost by adjusting the rank used for low-rank projection.
LBP-WHT with carefully chosen ranks can achieve accuracy comparable to or even exceeding that of full-rank backpropagation while significantly reducing computational requirements. |
Accuracy degradation is observed when using a very small number of ranks for full model training with LBP-WHT.
Further research on improved approximation methods could potentially mitigate this issue. |
vision transformer, model adaptation, low-rank backpropagation, walsh-hadamard transform, efficient training |
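To make the low-rank backpropagation idea in the LBP-WHT report above more concrete, the sketch below approximates the weight gradient of a linear layer by first projecting the input and the output gradient onto a few Walsh-Hadamard basis vectors. It is a simplified illustration, not the authors' implementation: it assumes the projection is applied along the token dimension, uses the first `rank` rows of a Sylvester-ordered Hadamard matrix as the basis, and all function names are hypothetical.

```python
def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def matmul(a, b):
    """Plain matrix product of a (m x k) and b (k x n), both lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def low_rank_weight_grad(x, grad_y, rank):
    """Approximate grad_W = grad_y^T x for y = x W^T.

    x is (tokens x in_features), grad_y is (tokens x out_features).
    Both are projected onto `rank` orthonormal Hadamard rows first, so the
    final product runs over `rank` rows instead of `tokens` rows.
    """
    n = len(x)
    scale = n ** -0.5
    basis = [[v * scale for v in row] for row in hadamard(n)[:rank]]
    x_low = matmul(basis, x)        # (rank x in_features)
    g_low = matmul(basis, grad_y)   # (rank x out_features)
    return matmul(transpose(g_low), x_low)  # (out_features x in_features)
```

When `rank` equals the number of tokens the basis is a full orthonormal set and the result matches the exact gradient; smaller ranks trade accuracy for fewer multiplications, which mirrors the accuracy and cost trade-off the report mentions.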
2309.15164
Report |
3D Reconstruction with Generalizable Neural Fields using Scene Priors |
Yang Fu, Shalini De Mello, Xueting Li, Amey Kulkarni, Jan Kautz, Xiaolong Wang, Sifei Liu |
High-fidelity 3D scene reconstruction has been substantially advanced by
recent progress in neural fields. However, most existing methods train a
separate network from scratch for each individual scene. This is not scalable,
inefficient, and unable to yield good results given limited views. While
learning-based multi-view stereo methods alleviate this issue to some extent,
their multi-view setting makes it less flexible to scale up and to broad
applications. Instead, we introduce training generalizable Neural Fields
incorporating scene Priors (NFPs). The NFP network maps any single-view RGB-D
image into signed distance and radiance values. A complete scene can be
reconstructed by merging individual frames in the volumetric space WITHOUT a
fusion module, which provides better flexibility. The scene priors can be
trained on large-scale datasets, allowing for fast adaptation to the
reconstruction of a new scene with fewer views. NFP not only demonstrates SOTA
scene reconstruction performance and efficiency, but it also supports
single-image novel-view synthesis, which is underexplored in neural fields.
More qualitative results are available at:
https://oasisyang.github.io/neural-prior |
This paper proposes Neural Fields with scene Priors (NFPs), a novel method for fast and scalable 3D scene reconstruction that leverages single-view RGB-D images to learn generalizable scene priors. |
Existing neural field methods for 3D reconstruction are often scene-specific, requiring separate training for each new scene, which is time-consuming and data-intensive. NFPs address these limitations by learning generalizable priors that can be quickly adapted to novel scenes. |
NFPs employ a two-stage training paradigm: (1) a geometric reconstruction network learns to map depth images to local SDFs, and (2) this pretrained network serves as a geometric prior to train a color reconstruction network (texture prior) using volumetric rendering. |
NFPs achieve state-of-the-art scene reconstruction quality with fine geometric details and realistic textures, even with limited input views.
The method exhibits significantly faster convergence speed compared to existing approaches.
NFPs also enable high-quality single-image novel-view synthesis, which is underexplored in existing neural field methods. |
The current NFPs model requires at least sparse depth information as input and cannot be directly applied to RGB-only images.
Future work could explore incorporating SfM techniques to enable the use of NFPs with RGB images. |
3d scene reconstruction, neural fields, scene priors, single-view reconstruction, novel view synthesis |
2309.15103
Report |
LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models |
Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing Wu, Chenyang Si, Yuming Jiang, Cunjian Chen, Chen Change Loy, Bo Dai, Dahua Lin, Yu Qiao, Ziwei Liu |
This work aims to learn a high-quality text-to-video (T2V) generative model
by leveraging a pre-trained text-to-image (T2I) model as a basis. It is a
highly desirable yet challenging task to simultaneously a) accomplish the
synthesis of visually realistic and temporally coherent videos while b)
preserving the strong creative generation nature of the pre-trained T2I model.
To this end, we propose LaVie, an integrated video generation framework that
operates on cascaded video latent diffusion models, comprising a base T2V
model, a temporal interpolation model, and a video super-resolution model. Our
key insights are two-fold: 1) We reveal that the incorporation of simple
temporal self-attentions, coupled with rotary positional encoding, adequately
captures the temporal correlations inherent in video data. 2) Additionally, we
validate that the process of joint image-video fine-tuning plays a pivotal role
in producing high-quality and creative outcomes. To enhance the performance of
LaVie, we contribute a comprehensive and diverse video dataset named Vimeo25M,
consisting of 25 million text-video pairs that prioritize quality, diversity,
and aesthetic appeal. Extensive experiments demonstrate that LaVie achieves
state-of-the-art performance both quantitatively and qualitatively.
Furthermore, we showcase the versatility of pre-trained LaVie models in various
long video generation and personalized video synthesis applications. |
This paper introduces LaVie, a cascaded video generation framework that leverages pre-trained text-to-image diffusion models to generate high-quality, temporally coherent videos from text descriptions. |
Training text-to-video models from scratch is computationally expensive. LaVie offers a more efficient approach by building upon pre-trained models while maintaining high visual quality and creative control. |
LaVie consists of three cascaded video latent diffusion models: a base model for generating key frames, a temporal interpolation model for smoother transitions, and a video super-resolution model for enhanced visual quality. The model is trained on a new high-quality video dataset, Vimeo25M, and uses joint image-video fine-tuning to prevent catastrophic forgetting and enhance concept compositionality. |
LaVie achieves state-of-the-art performance on zero-shot text-to-video generation benchmarks, outperforming existing methods in terms of visual fidelity and text-video semantic similarity.
Joint image-video fine-tuning proves crucial in preventing catastrophic forgetting and enabling the transfer of concepts from the image domain to video generation.
The introduction of the Vimeo25M dataset significantly contributes to generating higher-quality videos compared to using existing datasets like WebVid10M. |
LaVie faces challenges in generating scenes with multiple interacting subjects and struggles to synthesize realistic human hands.
Future work will focus on extending LaVie's capabilities to generate longer videos with complex transitions and movie-level quality from script descriptions. |
text-to-video generation, diffusion models, video generation, generative ai, computer vision |
2309.15091
Report |
VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning |
Han Lin, Abhay Zala, Jaemin Cho, Mohit Bansal |
Although recent text-to-video (T2V) generation methods have seen significant
advancements, most of these works focus on producing short video clips of a
single event with a single background (i.e., single-scene videos). Meanwhile,
recent large language models (LLMs) have demonstrated their capability in
generating layouts and programs to control downstream visual modules such as
image generation models. This raises an important question: can we leverage the
knowledge embedded in these LLMs for temporally consistent long video
generation? In this paper, we propose VideoDirectorGPT, a novel framework for
consistent multi-scene video generation that uses the knowledge of LLMs for
video content planning and grounded video generation. Specifically, given a
single text prompt, we first ask our video planner LLM (GPT-4) to expand it
into a 'video plan', which involves generating the scene descriptions, the
entities with their respective layouts, the background for each scene, and
consistency groupings of the entities and backgrounds. Next, guided by this
output from the video planner, our video generator, Layout2Vid, has explicit
control over spatial layouts and can maintain temporal consistency of
entities/backgrounds across scenes, while being trained only with image-level
annotations. Our experiments demonstrate that the VideoDirectorGPT framework
substantially improves layout and movement control in both single- and
multi-scene video generation and can generate multi-scene videos with visual
consistency across scenes, while achieving competitive performance with SOTAs
in open-domain single-scene T2V generation. We also demonstrate that our
framework can dynamically control the strength for layout guidance and can also
generate videos with user-provided images. We hope our framework can inspire
future work on better integrating the planning ability of LLMs into consistent
long video generation. |
This paper proposes VideoDirectorGPT, a novel two-stage framework for generating temporally consistent long videos with multiple scenes by leveraging the planning abilities of Large Language Models (LLMs). |
Existing text-to-video generation methods struggle to produce long, multi-scene videos with consistent entities and backgrounds. This work addresses this challenge by incorporating LLM-based planning for improved control and consistency. |
The proposed VideoDirectorGPT framework consists of two stages: (1) Video Planning: An LLM (GPT-4) generates a 'video plan' containing detailed scene descriptions, entity layouts, and consistency groupings for entities/backgrounds. (2) Video Generation: A novel grounded video generation module (Layout2Vid) renders videos based on the video plan, using image/text-based layout control and ensuring entity-level temporal consistency. |
VideoDirectorGPT significantly outperforms baselines in object layout and movement control in single-scene video generation.
The framework excels at generating multi-scene videos with impressive visual consistency across scenes.
VideoDirectorGPT achieves competitive performance with state-of-the-art models on open-domain single-scene text-to-video generation benchmarks. |
The reliance on powerful LLM APIs for video plan generation can be expensive.
The current implementation is limited by the capabilities of the underlying text-to-video generation backbone (ModelScopeT2V). |
text-to-video generation, multi-scene video generation, large language models, video planning, layout control |
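The abstract of the VideoDirectorGPT report above lists the parts of a 'video plan': scene descriptions, entities with layouts, a background per scene, and consistency groupings. One way such a plan could be represented is sketched below. All class names, field names, and the name-matching helper are hypothetical; in the paper the groupings come from the planner LLM, whereas this sketch derives them by simple name matching.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    # One (x0, y0, x1, y1) box per keyframe, in normalized [0, 1] coordinates.
    boxes: list


@dataclass
class Scene:
    description: str
    background: str
    entities: list = field(default_factory=list)


@dataclass
class VideoPlan:
    scenes: list
    # Maps an entity or background name to the indices of the scenes that
    # should render it with the same appearance.
    consistency_groups: dict = field(default_factory=dict)


def build_consistency_groups(scenes):
    """Group scene indices by shared entity or background name."""
    groups = {}
    for index, scene in enumerate(scenes):
        names = [scene.background] + [entity.name for entity in scene.entities]
        for name in names:
            groups.setdefault(name, []).append(index)
    # Only names that recur across scenes need a consistency constraint.
    return {name: idx for name, idx in groups.items() if len(idx) > 1}
```

A generator that consumes this structure could reuse one appearance embedding per group, which is the kind of cross-scene consistency the report describes.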
2309.14868
Report |
Cross-Dataset-Robust Method for Blind Real-World Image Quality Assessment |
Yuan Chen, Zhiliang Ma, Yang Zhao |
Although many effective models and real-world datasets have been presented
for blind image quality assessment (BIQA), recent BIQA models usually tend to
fit a specific training set. Hence, it is still difficult to accurately and
robustly measure the visual quality of an arbitrary real-world image. In this
paper, a robust BIQA method is designed based on three aspects, i.e., robust
training strategy, large-scale real-world dataset, and powerful backbone.
First, many individual models based on popular and state-of-the-art (SOTA)
Swin-Transformer (SwinT) are trained on different real-world BIQA datasets
respectively. Then, these biased SwinT-based models are jointly used to
generate pseudo-labels, which adopt the probability of relative quality of two
random images instead of a fixed quality score. A large-scale real-world image
dataset with 1,000,000 image pairs and pseudo-labels is then proposed for
training the final cross-dataset-robust model. Experimental results on
cross-dataset tests show that the performance of the proposed method is even
better than some SOTA methods that are directly trained on these datasets, thus
verifying the robustness and generalization of our method. |
This paper proposes a cross-dataset-robust blind image quality assessment (BIQA) network and training strategy to improve the generalization ability of BIQA models in real-world scenarios. |
Existing BIQA models often overfit to specific training datasets, hindering their ability to accurately assess the quality of arbitrary real-world images. |
The method involves training multiple SwinT-IQA models on different datasets, using them to generate pseudo-labels (relative quality probabilities) for a large-scale real-world image dataset, and finally training a CDR-BIQA model on this dataset using a learning-to-rank framework. |
The CDR-BIQA model outperforms many state-of-the-art methods in cross-dataset testing, demonstrating its strong generalization ability.
The use of relative probability as pseudo-labels and the combination of multiple biased models contribute to the robustness of the proposed method.
The Swin-Transformer backbone shows superior learning ability for BIQA compared to other backbones. |
The performance improvement plateaus with more than 500,000 training image pairs, suggesting potential limitations in scaling beyond this point.
The study primarily focuses on authentic distortions, and future work could explore robustness in synthetically distorted datasets. |
blind image quality assessment, cross-dataset robustness, swin-transformer, pseudo-labels, learning-to-rank |
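The BIQA report above uses several dataset-specific models to produce a relative-quality probability for an image pair, then trains a final model with a learning-to-rank objective. The sketch below shows one plausible way to do both steps. The vote-based aggregation and the cross-entropy loss on a sigmoid of the score difference are assumptions chosen for illustration; the paper's exact pseudo-label rule and loss are not reproduced here, and the function names are hypothetical.

```python
import math


def pairwise_pseudo_label(scores_a, scores_b):
    """Fraction of models that rate image A above image B (ties count half).

    scores_a[k] and scores_b[k] are the quality scores that the k-th
    dataset-specific model assigns to images A and B.
    """
    votes = 0.0
    for a, b in zip(scores_a, scores_b):
        if a > b:
            votes += 1.0
        elif a == b:
            votes += 0.5
    return votes / len(scores_a)


def ranking_loss(pred_a, pred_b, target_prob, eps=1e-7):
    """Cross-entropy between sigmoid(pred_a - pred_b) and the pseudo-label."""
    p = 1.0 / (1.0 + math.exp(-(pred_a - pred_b)))
    p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
    return -(target_prob * math.log(p) + (1.0 - target_prob) * math.log(1.0 - p))
```

Because only the ordering within each model is used, the individual models' score scales, which differ across training datasets, never need to be aligned.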
2309.14756
Report |
On quantifying and improving realism of images generated with diffusion |
Yunzhuo Chen, Naveed Akhtar, Nur Al Hasan Haldar, Ajmal Mian |
Recent advances in diffusion models have led to a quantum leap in the quality
of generative visual content. However, quantification of realism of the content
is still challenging. Existing evaluation metrics, such as Inception Score and
Fréchet inception distance, fall short on benchmarking diffusion models due
to the versatility of the generated images. Moreover, they are not designed to
quantify realism of an individual image. This restricts their application in
forensic image analysis, which is becoming increasingly important in the
emerging era of generative models. To address that, we first propose a metric,
called Image Realism Score (IRS), computed from five statistical measures of a
given image. This non-learning based metric not only efficiently quantifies
realism of the generated images, it is readily usable as a measure to classify
a given image as real or fake. We experimentally establish the model- and
data-agnostic nature of the proposed IRS by successfully detecting fake images
generated by Stable Diffusion Model (SDM), Dalle2, Midjourney and BigGAN.
We further leverage this attribute of our metric to minimize an IRS-augmented
generative loss of SDM, and demonstrate a convenient yet considerable quality
improvement of the SDM-generated content with our modification. Our efforts
have also led to Gen-100 dataset, which provides 1,000 samples for 100 classes
generated by four high-quality models. We will release the dataset and code. |
This paper proposes Image Realism Score (IRS), a non-learning based metric to quantify the realism of images, particularly those generated by diffusion models, for distinguishing them from natural images. |
Existing metrics like Inception Score and Fréchet Inception Distance are limited in benchmarking diffusion models due to their reliance on specific datasets or models, making them unsuitable for assessing the realism of individual images, crucial for forensics in the age of generative models. |
IRS combines five image statistics: Canny Edge Density, GLCM Contrast, GLCM Energy, Variance of Laplacian, and Mean Spectrum. These measures are carefully calibrated and arranged in a specific order within a pentagon-shaped geometric representation, with the pentagon's area defining the IRS value. |
IRS values effectively benchmark popular generative models, aligning with their known capabilities.
IRS demonstrates strong performance in fake image detection across various models.
Incorporating IRS into the Stable Diffusion Model's training loss significantly enhances the realism of the generated images. |
The effectiveness of IRS for images generated by emerging diffusion models needs further investigation.
Exploring the potential of IRS in refining other generative models beyond SDMs presents a promising avenue for future research. |
image realism score, diffusion models, fake image detection, generative models, image forensics |
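The IRS report above says the five calibrated measures are placed in a fixed order on a pentagon-shaped representation whose area gives the score. Assuming a radar-chart layout, with each measure plotted as a distance along one of five axes spaced 72 degrees apart, the area reduces to a sum of triangle areas between neighboring axes. The sketch below computes only that geometric step; the calibration of the individual statistics is not reproduced, and the function name is hypothetical.

```python
import math


def pentagon_area(measures):
    """Area of the polygon with vertices at distance measures[i] along
    five axes spaced 72 degrees apart (radar-chart layout, an assumption).

    Each pair of neighboring axes spans a triangle of area
    0.5 * r_i * r_j * sin(72 degrees).
    """
    if len(measures) != 5:
        raise ValueError("expected exactly five measures")
    wedge = math.sin(2 * math.pi / 5)
    pair_sum = sum(measures[i] * measures[(i + 1) % 5] for i in range(5))
    return 0.5 * wedge * pair_sum
```

Because the area depends on products of neighboring measures, reordering the same five values changes the result, which is consistent with the report's note that the measures are arranged in a specific order.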
2309.14623
Report |
Text-to-Image Generation for Abstract Concepts |
Jiayi Liao, Xu Chen, Qiang Fu, Lun Du, Xiangnan He, Xiang Wang, Shi Han, Dongmei Zhang |
Recent years have witnessed the substantial progress of large-scale models
across various domains, such as natural language processing and computer
vision, facilitating the expression of concrete concepts. Unlike concrete
concepts that are usually directly associated with physical objects, expressing
abstract concepts through natural language requires considerable effort, which
results from their intricate semantics and connotations. An alternative
approach is to leverage images to convey rich visual information as a
supplement. Nevertheless, existing Text-to-Image (T2I) models are primarily
trained on concrete physical objects and tend to fail to visualize abstract
concepts. Inspired by the three-layer artwork theory that identifies critical
factors, intent, object and form during artistic creation, we propose a
framework of Text-to-Image generation for Abstract Concepts (TIAC). The
abstract concept is clarified into a clear intent with a detailed definition to
avoid ambiguity. LLMs then transform it into semantic-related physical objects,
and the concept-dependent form is retrieved from an LLM-extracted form pattern
set. Information from these three aspects will be integrated to generate
prompts for T2I models via LLM. Evaluation results from human assessments and
our newly designed metric concept score demonstrate the effectiveness of our
framework in creating images that can sufficiently express abstract concepts. |
This paper introduces TIAC, a novel framework for text-to-image generation of abstract concepts using LLMs. |
Existing T2I models struggle to visualize abstract concepts due to their training on concrete objects, hindering effective communication of complex ideas. |
TIAC leverages LLMs to: 1) clarify user intent with WordNet definitions, 2) transform abstract concepts into related objects, and 3) retrieve relevant form patterns from a prompt dataset. These elements are combined to generate effective T2I prompts. |
TIAC outperforms baseline methods in human evaluations, indicating its ability to generate images better representing abstract concepts.
A novel metric, concept score, combining visual-semantic similarity and aesthetic score, shows higher consistency with human preferences.
Case studies demonstrate TIAC's ability to generate meaningful images for various abstract concepts across different T2I models. |
The current implementation relies on a precise mapping of input concepts to WordNet, requiring further exploration for real-world scenarios.
Future work includes investigating methods to automatically determine the optimal level of abstraction for object transformation. |
text-to-image generation, abstract concepts, large language models, prompt optimization, concept visualization |
2309.14494
Report |
Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator |
Hanzhuo Huang, Yufan Feng, Cheng Shi, Lan Xu, Jingyi Yu, Sibei Yang |
Text-to-video is a rapidly growing research area that aims to generate a
semantically, identically, and temporally coherent sequence of frames that
accurately aligns with the input text prompt. This study focuses on zero-shot
text-to-video generation for its data and cost efficiency. To generate a
semantic-coherent video, exhibiting a rich portrayal of temporal semantics such
as the whole process of flower blooming rather than a set of "moving images",
we propose a novel Free-Bloom pipeline that harnesses large language models
(LLMs) as the director to generate a semantically coherent prompt sequence, and
pre-trained latent diffusion models (LDMs) as the animator to generate
high-fidelity frames. Furthermore, to ensure temporal and identical coherence while
maintaining semantic coherence, we propose a series of annotative modifications
to adapt LDMs in the reverse process, including joint noise sampling,
step-aware attention shift, and dual-path interpolation. Without any video data
and training requirements, Free-Bloom generates vivid and high-quality videos,
awe-inspiring in generating complex scenes with semantically meaningful frame
sequences. In addition, Free-Bloom is naturally compatible with LDMs-based
extensions. |
This paper presents Free-Bloom, a zero-shot, training-free text-to-video generator that leverages Large Language Models (LLMs) for semantic coherence and pre-trained Latent Diffusion Models (LDMs) for high-quality frame generation. |
Addresses the challenge of zero-shot text-to-video generation by generating videos with meaningful temporal variations and avoiding extensive data and computational requirements. |
A three-stage pipeline: (1) Serial Prompting: LLM generates a sequence of prompts describing frame content, (2) Video Generation: LDM generates frames using joint noise sampling and step-aware attention shift for coherence, and (3) Interpolation Empowerment: Increases frame rate via dual latent space interpolation considering both contextual and semantic information. |
Generates videos exhibiting semantic coherence by depicting complete events aligned with the narrative.
Maintains identical coherence and temporal coherence, ensuring smooth transitions and consistent content.
Outperforms existing zero-shot methods in user studies regarding fidelity and semantic representation while achieving comparable temporal coherence to trained methods. |
Inherits limitations from the underlying LLM and LDM models, such as difficulties with complex scenes and sensitivity to initial noise.
Future work includes improving temporal consistency and combining strengths of zero-shot and trained approaches. |
text-to-video generation, zero-shot learning, large language models, latent diffusion models, video interpolation |
2309.14338
Report |
3D Indoor Instance Segmentation in an Open-World |
Mohamed El Amine Boudjoghra, Salwa K. Al Khatib, Jean Lahoud, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Khan |
Existing 3D instance segmentation methods typically assume that all semantic
classes to be segmented would be available during training and only seen
categories are segmented at inference. We argue that such a closed-world
assumption is restrictive and explore for the first time 3D indoor instance
segmentation in an open-world setting, where the model is allowed to
distinguish a set of known classes as well as identify an unknown object as
unknown and then later incrementally learn the semantic category of the
unknown when the corresponding category labels are available. To this end, we
introduce an open-world 3D indoor instance segmentation method, where an
auto-labeling scheme is employed to produce pseudo-labels during training and
induce separation between known and unknown category labels. We further
improve the pseudo-label quality at inference by adjusting the unknown class
probability based on the objectness score distribution. We also introduce
carefully curated open-world splits leveraging realistic scenarios based on
inherent object distribution, region-based indoor scene exploration and
randomness aspect of open-world classes. Extensive experiments reveal the
efficacy of the proposed contributions leading to promising open-world 3D
instance segmentation performance. |
This paper proposes the first open-world 3D indoor instance segmentation method, enabling identification of unknown objects and incremental learning of new classes. |
Existing 3D instance segmentation methods rely on a closed-world assumption, limiting their applicability in real-world scenarios with numerous unseen object classes. |
The method utilizes an auto-labeling scheme for pseudo-label generation, contrastive clustering for class separation, and a reachability-based probability correction scheme for improved unknown object recognition. It also employs exemplar replay for incremental learning. |
The proposed method outperforms adapted baselines in open-world 3D instance segmentation.
It effectively preserves knowledge of previously learned classes during incremental learning.
Qualitative results demonstrate the method's ability to correctly identify and segment both known and unknown objects. |
The confidence thresholding approach, while improving known class performance, limits unknown class segmentation due to fewer pseudo-labels.
Probability correction's efficacy depends on cluster characteristics and may deteriorate in imbalanced data scenarios. |
3d instance segmentation, open-world learning, incremental learning, pseudo-labeling, contrastive clustering |
2309.14335
Report |
UnitedHuman: Harnessing Multi-Source Data for High-Resolution Human Generation |
Jianglin Fu, Shikai Li, Yuming Jiang, Kwan-Yee Lin, Wayne Wu, Ziwei Liu |
Human generation has achieved significant progress. Nonetheless, existing
methods still struggle to synthesize specific regions such as faces and hands.
We argue that the main reason is rooted in the training data. A holistic human
dataset inevitably has insufficient and low-resolution information on local
parts. Therefore, we propose to use multi-source datasets with various
resolution images to jointly learn a high-resolution human generative model.
However, multi-source data inherently a) contains different parts that do not
spatially align into a coherent human, and b) comes with different scales. To
tackle these challenges, we propose an end-to-end framework, UnitedHuman, that
empowers continuous GAN with the ability to effectively utilize multi-source
data for high-resolution human generation. Specifically, 1) we design a
Multi-Source Spatial Transformer that spatially aligns multi-source images to
full-body space with a human parametric model. 2) Next, a continuous GAN is
proposed with global-structural guidance and CutMix consistency. Patches from
different datasets are then sampled and transformed to supervise the training
of this scale-invariant generative model. Extensive experiments demonstrate
that our model jointly learned from multi-source data achieves higher quality
than those learned from a holistic dataset. |
UnitedHuman is a novel end-to-end framework that leverages multi-source datasets to generate high-resolution, full-body human images. |
Existing human generation methods struggle to synthesize high-fidelity images, especially for detailed regions like faces and hands, due to the limitations of holistic human datasets. |
The framework employs a two-stage approach: 1) Multi-source Spatial Transformer aligns body parts from different datasets using a parametric human model. 2) Continuous GAN, trained with global structural guidance and CutMix consistency, synthesizes image patches at various scales and stitches them together for the final output. |
UnitedHuman outperforms baseline methods (StyleGAN-Human, InsetGAN, AnyRes) in generating high-resolution human images with finer details, even when trained on a smaller dataset of high-resolution images.
Quantitative evaluations (kFID, pFID) demonstrate the superiority of UnitedHuman in capturing local textures and details, particularly for hands and faces.
Ablation studies confirm the efficacy of the proposed Multi-Source Spatial Transformer and the Continuous GAN in improving alignment and leveraging multi-source datasets effectively. |
The underlying StyleGAN3 architecture may limit the representation of high-frequency information, causing potential artifacts in upscaling.
The diversity of generated poses and garments is constrained by the training data, necessitating future work on data augmentation and incorporating more varied datasets. |
human generation, multi-scale generation, generative adversarial networks (gans), multi-source data, human body alignment |
2309.14289
Report |
CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free |
Monika Wysoczańska, Michaël Ramamonjisoa, Tomasz Trzciński, Oriane Siméoni |
The emergence of CLIP has opened the way for open-world image perception. The
zero-shot classification capabilities of the model are impressive but are
harder to use for dense tasks such as image segmentation. Several methods have
proposed different modifications and learning schemes to produce dense output.
Instead, we propose in this work an open-vocabulary semantic segmentation
method, dubbed CLIP-DIY, which does not require any additional training or
annotations, but instead leverages existing unsupervised object localization
approaches. In particular, CLIP-DIY is a multi-scale approach that directly
exploits CLIP classification abilities on patches of different sizes and
aggregates the decision in a single map. We further guide the segmentation
using foreground/background scores obtained using unsupervised object
localization methods. With our method, we obtain state-of-the-art zero-shot
semantic segmentation results on PASCAL VOC and perform on par with the best
methods on COCO. The code is available at
http://github.com/wysoczanska/clip-diy |
Introduces CLIP-DIY, a zero-shot open-vocabulary semantic segmentation method that leverages CLIP's classification ability and unsupervised object localization. |
Addresses limitations of supervised semantic segmentation methods that require expensive annotations and struggle with open vocabularies. |
Performs multi-scale dense inference with CLIP on image patches, aggregating predictions into a single map. Refines the map using an unsupervised foreground/background segmentation model (FOUND) for objectness guidance. |
Achieves state-of-the-art zero-shot open-vocabulary semantic segmentation on PASCAL VOC (+4.9 mIoU over previous best).
Performs on par with the best methods on COCO Object dataset.
Demonstrates effectiveness of multi-scale patch classification and objectness guidance for accurate and robust segmentation. |
Performance limited by accuracy of the unsupervised object localization model in complex scenes.
Sensitivity to text ambiguities inherited from CLIP can lead to misclassifications. |
semantic segmentation, open vocabulary, zero-shot learning, clip, unsupervised object localization |
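The multi-scale aggregation and objectness guidance described in the CLIP-DIY entry above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `score_patch` is a hypothetical stand-in for CLIP's per-patch class scores, the objectness map is supplied by the caller rather than computed by FOUND, and averaging over scales with a hard objectness threshold is an illustrative choice.

```python
from typing import Callable, List, Sequence

Grid = List[List[float]]


def aggregate_multiscale(
    height: int,
    width: int,
    num_classes: int,
    patch_sizes: Sequence[int],
    score_patch: Callable[[int, int, int], List[float]],
) -> List[List[List[float]]]:
    """Average per-patch class scores over several patch sizes.

    score_patch(top, left, size) is a hypothetical classifier that returns
    one score per class for the square patch whose corner is (top, left).
    """
    total = [[[0.0] * num_classes for _ in range(width)] for _ in range(height)]
    for size in patch_sizes:
        # Non-overlapping tiling: every pixel is covered once per scale.
        for top in range(0, height, size):
            for left in range(0, width, size):
                scores = score_patch(top, left, size)
                for y in range(top, min(top + size, height)):
                    for x in range(left, min(left + size, width)):
                        for c in range(num_classes):
                            total[y][x][c] += scores[c] / len(patch_sizes)
    return total


def segment(scores, objectness: Grid, threshold: float = 0.5, background: int = -1):
    """Label each pixel with its best class, or background if objectness is low."""
    labels = []
    for y, row in enumerate(scores):
        out_row = []
        for x, cls_scores in enumerate(row):
            if objectness[y][x] < threshold:
                out_row.append(background)
            else:
                best = max(range(len(cls_scores)), key=cls_scores.__getitem__)
                out_row.append(best)
        labels.append(out_row)
    return labels
```

In practice the scoring function would wrap a CLIP image encoder and text prompts; the sketch only shows how patch-level decisions at different sizes can be merged into one dense map.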
2309.14207
Report |
Automatic Animation of Hair Blowing in Still Portrait Photos |
Wenpeng Xiao, Wentao Liu, Yitong Wang, Bernard Ghanem, Bing Li |
We propose a novel approach to animate human hair in a still portrait photo.
Existing work has largely studied the animation of fluid elements such as water
and fire. However, hair animation for a real image remains underexplored, which
is a challenging problem, due to the high complexity of hair structure and
dynamics. Considering the complexity of hair structure, we innovatively treat
hair wisp extraction as an instance segmentation problem, where a hair wisp is
referred to as an instance. With advanced instance segmentation networks, our
method extracts meaningful and natural hair wisps. Furthermore, we propose a
wisp-aware animation module that animates hair wisps with pleasing motions
without noticeable artifacts. The extensive experiments show the superiority of
our method. Our method provides the most pleasing and compelling viewing
experience in the qualitative experiments and outperforms state-of-the-art
still-image animation methods by a large margin in the quantitative evaluation.
Project URL: https://nevergiveu.github.io/AutomaticHairBlowing/ |
This paper proposes a novel approach to automatically animate human hair in still portrait photos, converting them into dynamic and engaging cinemagraphs. |
Existing methods for still-image animation primarily focus on fluid elements and lack the ability to realistically animate hair, a crucial aspect for creating compelling portrait visuals. This work addresses this gap by enabling automatic hair animation in real-world portrait photos. |
The method employs a three-step process: (1) Instance-based Hair Wisp Extraction (IHWE) identifies and segments individual hair wisps using instance segmentation networks trained on a novel hair wisp dataset. (2) Hair Wisp Animation (HWA) represents each wisp with a multi-layer mesh and simulates natural motions using a physics-based mass-spring system. (3) Depth-aware frame composition ensures proper occlusion relationships between animated hair, face, and background during video generation. |
The proposed method outperforms state-of-the-art single-image-to-video generation techniques in both quantitative metrics like Frechet Video Distance (FVD) and Warping Error, indicating superior video quality and temporal consistency.
Qualitative comparisons demonstrate the method's ability to generate more realistic and visually appealing hair animations compared to baselines, avoiding artifacts like distortions, unnatural movements, and flickering.
Subjective user studies confirm that the generated videos are significantly preferred by human viewers, highlighting the effectiveness of the approach in enhancing portrait aesthetics. |
The current method primarily focuses on animating hair blowing in the wind and may not generalize well to other hair motions like shaking or swaying.
Future work could explore incorporating user controls to fine-tune animation parameters like wind direction and intensity. |
hair animation, cinemagraph generation, instance segmentation, physics-based animation, portrait image animation |
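The hair animation entry above mentions a physics-based mass-spring system. A minimal sketch of one such simulation step is given below, assuming a single 2D chain of point masses pinned at the root, a constant wind force, and semi-implicit Euler integration. The function name, the parameter values, and the chain layout are illustrative assumptions, not the paper's multi-layer mesh.

```python
from typing import List, Tuple

Vec = Tuple[float, float]


def step_chain(
    pos: List[Vec],
    vel: List[Vec],
    rest_len: float,
    wind: Vec,
    k: float = 50.0,
    damping: float = 2.0,
    mass: float = 1.0,
    dt: float = 0.01,
) -> Tuple[List[Vec], List[Vec]]:
    """Advance a pinned chain of point masses by one semi-implicit Euler step."""
    n = len(pos)
    forces = [[wind[0], wind[1]] for _ in range(n)]
    for i in range(n - 1):
        dx = pos[i + 1][0] - pos[i][0]
        dy = pos[i + 1][1] - pos[i][1]
        length = (dx * dx + dy * dy) ** 0.5
        if length == 0.0:
            continue
        # Hooke's law: positive magnitude when the spring is stretched.
        f = k * (length - rest_len)
        fx, fy = f * dx / length, f * dy / length
        forces[i][0] += fx
        forces[i][1] += fy
        forces[i + 1][0] -= fx
        forces[i + 1][1] -= fy
    # Node 0 is the pinned root: it never moves.
    new_pos: List[Vec] = [pos[0]]
    new_vel: List[Vec] = [(0.0, 0.0)]
    for i in range(1, n):
        ax = (forces[i][0] - damping * vel[i][0]) / mass
        ay = (forces[i][1] - damping * vel[i][1]) / mass
        vx = vel[i][0] + ax * dt
        vy = vel[i][1] + ay * dt
        new_vel.append((vx, vy))
        new_pos.append((pos[i][0] + vx * dt, pos[i][1] + vy * dt))
    return new_pos, new_vel
```

Calling `step_chain` repeatedly with a time-varying `wind` produces a swaying motion; the damping term keeps the oscillation from growing.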
2309.14136
Report |
Masked Image Residual Learning for Scaling Deeper Vision Transformers |
Guoxi Huang, Hongtao Fu, Adrian G. Bors |
Deeper Vision Transformers (ViTs) are more challenging to train. We expose a
degradation problem in deeper layers of ViT when using masked image modeling
(MIM) for pre-training. To ease the training of deeper ViTs, we introduce a
self-supervised learning framework called Masked Image Residual Learning
(MIRL), which significantly alleviates the degradation problem, making scaling
ViT along depth a promising direction for performance upgrade. We reformulate
the pre-training objective for deeper layers of ViT as learning to recover the
residual of the masked image. We provide extensive empirical evidence showing
that deeper ViTs can be effectively optimized using MIRL and easily gain
accuracy from increased depth. With the same level of computational complexity
as ViT-Base and ViT-Large, we instantiate 4.5× and 2× deeper
ViTs, dubbed ViT-S-54 and ViT-B-48. The deeper ViT-S-54, costing 3× less
than ViT-Large, achieves performance on par with ViT-Large. ViT-B-48 achieves
86.2% top-1 accuracy on ImageNet. On one hand, deeper ViTs pre-trained with
MIRL exhibit excellent generalization capabilities on downstream tasks, such as
object detection and semantic segmentation. On the other hand, MIRL
demonstrates high pre-training efficiency. With less pre-training time, MIRL
yields competitive performance compared to other approaches. |
This paper identifies a degradation problem in deeper layers of Vision Transformers (ViTs) pre-trained with Masked Image Modeling (MIM) and proposes Masked Image Residual Learning (MIRL) to address it. |
Scaling ViTs along the depth dimension is challenging due to the degradation problem, which hinders performance improvement. This work makes deep ViTs a promising direction for performance upgrade. |
MIRL reformulates the pre-training objective for deeper layers of ViT as learning to recover the residual of the masked image, leveraging a multi-decoding process and shortcut connections. |
Deeper ViTs pre-trained with MIRL outperform shallower counterparts with similar complexity (ViT-S-54 surpasses ViT-B).
MIRL enables training ViTs with significantly increased depth, achieving competitive results with less computational cost (ViT-B-48 outperforms ViT-L).
MIRL shows strong generalization capabilities, improving performance on downstream tasks like object detection and semantic segmentation. |
A comprehensive theoretical explanation for MIRL’s effectiveness is still under exploration.
Further exploration of depth scaling beyond the presented 54 blocks is needed. |
vision transformer, self-supervised learning, masked image modeling, image residual learning, deep learning |
2309.14068
Report |
Soft Mixture Denoising: Beyond the Expressive Bottleneck of Diffusion Models |
Yangming Li, Boris van Breugel, Mihaela van der Schaar |
Because diffusion models have shown impressive performances in a number of
tasks, such as image synthesis, there is a trend in recent works to prove (with
certain assumptions) that these models have strong approximation capabilities.
In this paper, we show that current diffusion models actually have an
expressive bottleneck in backward denoising and that an assumption made by
existing theoretical guarantees is too strong. Based on this finding, we prove
that diffusion models have unbounded errors in both local and global denoising.
In light of our theoretical studies, we introduce soft mixture denoising (SMD),
an expressive and efficient model for backward denoising. SMD not only permits
diffusion models to well approximate any Gaussian mixture distributions in
theory, but also is simple and efficient for implementation. Our experiments on
multiple image datasets show that SMD significantly improves different types of
diffusion models (e.g., DDPM), especially in the situation of few backward
iterations. |
This paper identifies an expressive bottleneck in diffusion models' backward denoising process due to its Gaussian parameterization, limiting its ability to approximate multimodal data distributions, and proposes Soft Mixture Denoising (SMD) to address this limitation. |
Existing theoretical guarantees for diffusion models often assume bounded score estimation errors, which is shown to be too strong an assumption. This paper demonstrates the limitations of the Gaussian denoising paradigm and aims to provide a more expressive alternative. |
The paper provides theoretical proofs to demonstrate the unbounded errors in both local and global denoising in current diffusion models. It then introduces SMD, a continuous relaxation of a Gaussian mixture model for the denoising posterior, and proves its ability to accurately approximate any Gaussian mixture distribution. |
Current diffusion models suffer from an expressive bottleneck in backward denoising, leading to unbounded denoising errors.
The assumption of bounded score estimation errors made by existing theoretical guarantees for diffusion models is too strong.
SMD significantly improves the generation quality of various diffusion models, especially with few backward iterations, enabling faster sampling and reduced computational costs. |
The paper assumes globally optimized neural networks in its theoretical analysis of SMD.
Future work includes extending SMD to other applications like text-to-image translation and speech synthesis, and integrating it into diffusion model libraries to speed up training and inference. |
diffusion models, generative models, denoising, gaussian mixture models, expressive bottleneck |
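The expressive bottleneck named in the Soft Mixture Denoising entry above can be illustrated with a small numeric example. The sketch below is not SMD itself: it only compares how well a single Gaussian and a two-component Gaussian mixture fit bimodal 1D data, which is the kind of gap the entry attributes to Gaussian parameterization. The function names and the sign-based split into two modes are assumptions made for illustration.

```python
import math
from typing import List, Tuple


def fit_gaussian(xs: List[float]) -> Tuple[float, float]:
    """Return the maximum-likelihood mean and variance of xs."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, var


def gauss_pdf(x: float, mean: float, var: float) -> float:
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def avg_loglik_single(xs: List[float]) -> float:
    """Average log-likelihood under one moment-matched Gaussian."""
    mean, var = fit_gaussian(xs)
    return sum(math.log(gauss_pdf(x, mean, var)) for x in xs) / len(xs)


def avg_loglik_two_modes(xs: List[float]) -> float:
    """Average log-likelihood under a two-component mixture.

    Assumes the data has two modes separated by the overall mean and that
    each side contains at least two distinct values.
    """
    mean, _ = fit_gaussian(xs)
    left = [x for x in xs if x < mean]
    right = [x for x in xs if x >= mean]
    parts = [(len(s) / len(xs),) + fit_gaussian(s) for s in (left, right)]
    total = 0.0
    for x in xs:
        density = sum(w * gauss_pdf(x, m, v) for w, m, v in parts)
        total += math.log(density)
    return total / len(xs)
```

On data clustered around two well-separated values, the mixture assigns far higher likelihood than the single Gaussian, which has to spread its mass over the empty region between the modes.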
2309.14052
Report |
Single Image Test-Time Adaptation for Segmentation |
Klara Janouskova, Tamir Shor, Chaim Baskin, Jiri Matas |
Test-Time Adaptation (TTA) methods improve the robustness of deep neural
networks to domain shift on a variety of tasks such as image classification or
segmentation. This work explores adapting segmentation models to a single
unlabelled image with no other data available at test-time. In particular, this
work focuses on adaptation by optimizing self-supervised losses at test-time.
Multiple baselines based on different principles are evaluated under diverse
conditions and a novel adversarial training is introduced for adaptation with
mask refinement. Our additions to the baselines result in 3.51% and 3.28%
increases over non-adapted baselines; without these improvements, the increases
would be only 1.7% and 2.16%. |
This paper explores adapting segmentation models to a single unlabeled image at test-time by optimizing self-supervised losses, introducing novel adversarial training for adaptation with mask refinement. |
Test-time adaptation (TTA) methods improve the robustness of deep neural networks to domain shift, a common problem when deploying models in real-world scenarios. |
The authors evaluate several TTA baselines based on entropy minimization, pseudo-labeling, mask refinement, augmentation consistency, and adversarial transformation invariance. They introduce a novel adversarial training method for mask refinement and propose new evaluation metrics to account for class imbalance and per-image performance. |
Optimizing Intersection over Union (IoU) loss consistently outperforms cross-entropy loss for all evaluated methods.
Test-time training with pseudo-labels and mask refinement with adversarial training are identified as the overall best-performing methods.
The effectiveness of TTA methods is highly dependent on domain shift type and severity, and a single set of hyperparameters may not perform well across all conditions. |
Single image TTA by optimizing model parameters has high computational cost.
Finding optimal hyperparameters for different domain shifts remains a challenge. |
test-time adaptation, semantic segmentation, domain shift, adversarial training, mask refinement |
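The test-time adaptation entry above contrasts an IoU loss with cross-entropy and lists entropy minimization and pseudo-labeling among the baselines. A minimal sketch of these quantities for a flattened binary mask follows. The soft (differentiable) IoU form, the 0.5 pseudo-label threshold, and the function names are assumptions, since the entry does not give the exact formulation used by the authors.

```python
import math
from typing import List


def pseudo_labels(probs: List[float], threshold: float = 0.5) -> List[float]:
    """Turn predicted foreground probabilities into hard pseudo-labels."""
    return [1.0 if p >= threshold else 0.0 for p in probs]


def soft_iou_loss(probs: List[float], targets: List[float], eps: float = 1e-7) -> float:
    """1 - soft IoU, where products and sums replace set intersection and union."""
    inter = sum(p * t for p, t in zip(probs, targets))
    union = sum(p + t - p * t for p, t in zip(probs, targets))
    return 1.0 - (inter + eps) / (union + eps)


def mean_entropy(probs: List[float], eps: float = 1e-12) -> float:
    """Mean binary entropy of the predictions, the quantity entropy minimization reduces."""
    total = 0.0
    for p in probs:
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= p * math.log(p) + (1.0 - p) * math.log(1.0 - p)
    return total / len(probs)
```

With no ground truth at test time, `targets` would be the model's own pseudo-labels, for example `soft_iou_loss(probs, pseudo_labels(probs))`.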
2309.13956
Report |
In-Domain GAN Inversion for Faithful Reconstruction and Editability |
Jiapeng Zhu, Yujun Shen, Yinghao Xu, Deli Zhao, Qifeng Chen, Bolei Zhou |
Generative Adversarial Networks (GANs) have significantly advanced image
synthesis through mapping randomly sampled latent codes to high-fidelity
synthesized images. However, applying well-trained GANs to real image editing
remains challenging. A common solution is to find an approximate latent code
that can adequately recover the input image to edit, which is also known as GAN
inversion. To invert a GAN model, prior works typically focus on reconstructing
the target image at the pixel level, yet few studies are conducted on whether
the inverted result can well support manipulation at the semantic level. This
work fills in this gap by proposing in-domain GAN inversion, which consists of
a domain-guided encoder and a domain-regularized optimizer, to regularize the
inverted code in the native latent space of the pre-trained GAN model. In this
way, we manage to sufficiently reuse the knowledge learned by GANs for image
reconstruction, facilitating a wide range of editing applications without any
retraining. We further make comprehensive analyses on the effects of the
encoder structure, the starting inversion point, as well as the inversion
parameter space, and observe the trade-off between the reconstruction quality
and the editing property. Such a trade-off sheds light on how a GAN model
represents an image with various semantics encoded in the learned latent
distribution. Code, models, and demo are available at the project page:
https://genforce.github.io/idinvert/. |
This paper proposes IDInvert, an in-domain GAN inversion method that focuses on the semantic properties of inverted latent codes to ensure editability for real image editing. |
Existing GAN inversion methods primarily focus on pixel-level reconstruction and overlook the semantic meaning of inverted codes, limiting their use in real-world image editing applications. |
IDInvert uses a domain-guided encoder trained on real images to map images to the latent space of a pre-trained GAN. This encoder acts as a regularizer during a subsequent domain-regularized optimization step, ensuring the inverted code stays within the semantic domain of the generator. |
IDInvert produces inverted codes that are more semantically meaningful, as demonstrated by better performance in attribute classification tasks compared to baselines.
The method facilitates high-quality image editing applications like interpolation and semantic manipulation, outperforming existing techniques in visual quality and semantic consistency.
The study reveals a trade-off between reconstruction quality and editability, showing that increasing reconstruction accuracy often comes at the cost of reduced editing capabilities. |
IDInvert's performance is limited to images that share a similar distribution with the training data.
Future work could explore methods to mitigate the trade-off between reconstruction quality and editability. |
gan inversion, image editing, semantic editing, latent space, generative adversarial networks |
2309.13415
Report |
Dream the Impossible: Outlier Imagination with Diffusion Models |
Xuefeng Du, Yiyou Sun, Xiaojin Zhu, Yixuan Li |
Utilizing auxiliary outlier datasets to regularize the machine learning model
has demonstrated promise for out-of-distribution (OOD) detection and safe
prediction. Due to the labor intensity in data collection and cleaning,
automating outlier data generation has been a long-desired alternative. Despite
the appeal, generating photo-realistic outliers in the high dimensional pixel
space has been an open challenge for the field. To tackle the problem, this
paper proposes a new framework DREAM-OOD, which enables imagining
photo-realistic outliers by way of diffusion models, provided with only the
in-distribution (ID) data and classes. Specifically, DREAM-OOD learns a
text-conditioned latent space based on ID data, and then samples outliers in
the low-likelihood region via the latent, which can be decoded into images by
the diffusion model. Different from prior works, DREAM-OOD enables visualizing
and understanding the imagined outliers, directly in the pixel space. We
conduct comprehensive quantitative and qualitative studies to understand the
efficacy of DREAM-OOD, and show that training with the samples generated by
DREAM-OOD can benefit OOD detection performance. Code is publicly available at
https://github.com/deeplearning-wisc/dream-ood. |
DREAM-OOD is the first method to generate photo-realistic high-resolution outliers in the pixel space for improving OOD detection. |
Existing methods for OOD detection rely on auxiliary outlier datasets, which are labor-intensive to collect and curate. DREAM-OOD addresses this limitation by automating outlier generation using diffusion models. |
DREAM-OOD learns a text-conditioned latent space using ID data and a pre-trained diffusion model (Stable Diffusion). Outliers are then generated by: 1) sampling new embeddings in the low-likelihood regions of the latent space and 2) decoding these embeddings into images using the diffusion model. |
DREAM-OOD significantly improves OOD detection performance on CIFAR-100 and ImageNet-100 benchmarks, outperforming existing methods including those based on GANs and latent-space outlier synthesis (VOS, NPOS).
Analysis of generated images shows that DREAM-OOD effectively creates a wide spectrum of outliers ranging from near-OOD to far-OOD.
An extension of the method, DREAM-ID, demonstrates the ability to generate in-distribution samples, which improves model generalization on ImageNet, ImageNet-A and ImageNet-v2. |
The outlier generation process in DREAM-OOD relies on the quality and diversity of the pre-trained diffusion model.
Finding the optimal parameters for sampling outlier embeddings, such as variance and neighborhood size, requires careful tuning. |
out-of-distribution (ood) detection, diffusion models, outlier generation, data augmentation, generalization |
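The first step of the outlier generation described in the DREAM-OOD entry above, sampling embeddings in low-likelihood regions, can be sketched as follows. This is a simplified stand-in under stated assumptions: the class embeddings are modeled with an isotropic Gaussian, candidates are drawn with inflated variance, and the least likely ones are kept. The function names and the `scale`, `num_candidates`, and `keep` parameters are hypothetical, and decoding the embeddings with a diffusion model is not shown.

```python
import math
import random
from typing import List, Tuple


def fit_isotropic(embs: List[List[float]]) -> Tuple[List[float], float]:
    """Fit a mean vector and one shared variance to the embeddings."""
    n, d = len(embs), len(embs[0])
    mean = [sum(e[j] for e in embs) / n for j in range(d)]
    var = sum((e[j] - mean[j]) ** 2 for e in embs for j in range(d)) / (n * d)
    return mean, var


def log_density(x: List[float], mean: List[float], var: float) -> float:
    """Log-density of x under the isotropic Gaussian (mean, var)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, mean))
    return -0.5 * (sq / var + len(x) * math.log(2 * math.pi * var))


def sample_outliers(
    embs: List[List[float]],
    num_candidates: int = 200,
    keep: int = 10,
    scale: float = 2.0,
    seed: int = 0,
) -> List[List[float]]:
    """Draw candidates with inflated variance and keep the least likely ones."""
    rng = random.Random(seed)
    mean, var = fit_isotropic(embs)
    std = (scale * var) ** 0.5
    cands = [[rng.gauss(m, std) for m in mean] for _ in range(num_candidates)]
    # Sort by ascending log-density, so the lowest-likelihood candidates come first.
    cands.sort(key=lambda c: log_density(c, mean, var))
    return cands[:keep]
```

The variance scale and the number of kept candidates play a role similar to the sampling parameters that the entry says require careful tuning.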
2309.13274
Report |
GLOBER: Coherent Non-autoregressive Video Generation via GLOBal Guided Video DecodER |
Mingzhen Sun, Weining Wang, Zihan Qin, Jiahui Sun, Sihan Chen, Jing Liu |
Video generation necessitates both global coherence and local realism. This
work presents a novel non-autoregressive method GLOBER, which first generates
global features to obtain comprehensive global guidance and then synthesizes
video frames based on the global features to generate coherent videos.
Specifically, we propose a video auto-encoder, where a video encoder encodes
videos into global features, and a video decoder, built on a diffusion model,
decodes the global features and synthesizes video frames in a
non-autoregressive manner. To achieve maximum flexibility, our video decoder
perceives temporal information through normalized frame indexes, which enables
it to synthesize arbitrary sub video clips with predetermined starting and
ending frame indexes. Moreover, a novel adversarial loss is introduced to
improve the global coherence and local realism between the synthesized video
frames. Finally, we employ a diffusion-based video generator to fit the global
features outputted by the video encoder for video generation. Extensive
experimental results demonstrate the effectiveness and efficiency of our
proposed method, and new state-of-the-art results have been achieved on
multiple benchmarks. |
GLOBER: a novel non-autoregressive diffusion-based video generation method that prioritizes global guidance by first generating 2D global features and then synthesizing video frames based on these features, enhancing coherence and realism. |
Existing video generation methods struggle to balance global coherence and local realism due to computational limitations and the potentially infinite number of video frames. GLOBER addresses this by separating global guidance generation from frame-wise detail synthesis. |
GLOBER utilizes a video auto-encoder: an encoder compresses video keyframes into 2D global features, and a diffusion-based decoder synthesizes frames based on these features and frame indexes. A novel Coherence and Realism Adversarial (CRA) loss improves global coherence and local realism. A separate diffusion model generates novel global features for video generation. |
Achieves new state-of-the-art results on multiple benchmarks, including UCF-101, Sky Time-lapse, and TaiChi-HD.
Significantly faster in generating video frames compared to autoregressive methods, thanks to its non-autoregressive strategy.
Generates videos with enhanced coherence and realism due to the use of global features as guidance. |
Difficulty in processing videos with frequent scene changes.
Limited exploration in open-domain video generation tasks due to computational constraints. |
video generation, diffusion models, global guidance, non-autoregressive, coherence and realism |
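The GLOBER entry above says the decoder perceives time through normalized frame indexes, which lets it synthesize arbitrary sub-clips. A minimal sketch of that conditioning signal is shown below, assuming a linear mapping of frame positions to the range [0, 1]; the exact normalization used by the authors is not stated in the entry, and the function name is hypothetical.

```python
from typing import List


def normalized_indexes(start: int, end: int, total_frames: int) -> List[float]:
    """Map frame indexes start..end (inclusive) of a video to values in [0, 1]."""
    if not 0 <= start <= end < total_frames:
        raise ValueError("expected 0 <= start <= end < total_frames")
    denom = max(total_frames - 1, 1)  # avoid division by zero for 1-frame videos
    return [i / denom for i in range(start, end + 1)]
```

Because each frame is conditioned only on the global features and its own normalized index, frames of a sub-clip can be decoded independently and in parallel rather than one after another.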
2309.13196
Report |
ClusterFormer: Clustering As A Universal Visual Learner |
James C. Liang, Yiming Cui, Qifan Wang, Tong Geng, Wenguan Wang, Dongfang Liu |
This paper presents CLUSTERFORMER, a universal vision model that is based on
the CLUSTERing paradigm with TransFORMER. It comprises two novel designs: 1.
recurrent cross-attention clustering, which reformulates the cross-attention
mechanism in Transformer and enables recursive updates of cluster centers to
facilitate strong representation learning; and 2. feature dispatching, which
uses the updated cluster centers to redistribute image features through
similarity-based metrics, resulting in a transparent pipeline. This elegant
design streamlines an explainable and transferable workflow, capable of
tackling heterogeneous vision tasks (i.e., image classification, object
detection, and image segmentation) with varying levels of clustering
granularity (i.e., image-, box-, and pixel-level). Empirical results
demonstrate that CLUSTERFORMER outperforms various well-known specialized
architectures, achieving 83.41% top-1 acc. over ImageNet-1K for image
classification, 54.2% and 47.0% mAP over MS COCO for object detection and
instance segmentation, 52.4% mIoU over ADE20K for semantic segmentation, and
55.8% PQ over COCO Panoptic for panoptic segmentation. For its efficacy, we
hope our work can catalyze a paradigm shift in universal models in computer
vision. |
This paper introduces ClusterFormer, a universal vision model leveraging clustering within a Transformer architecture to excel at various vision tasks with varying levels of granularity. |
Inspired by the human vision system's ability to group visual information for diverse tasks, ClusterFormer aims to achieve similar versatility and performance in a single model. |
ClusterFormer uses a recurrent cross-attention clustering mechanism to iteratively update cluster centers based on image features. Then, it employs feature dispatching to redistribute image features based on their similarity to updated cluster centers. |
ClusterFormer outperforms Swin Transformer on ImageNet classification by up to 0.39% in top-1 accuracy.
For object detection on MS COCO, ClusterFormer surpasses DINO by up to 1.1% mAP.
ClusterFormer achieves state-of-the-art performance on instance segmentation (MS COCO), semantic segmentation (ADE20K), and panoptic segmentation (COCO Panoptic). |
The computational cost of ClusterFormer can be high for high-resolution images.
Exploring the optimal number of clusters for different tasks and datasets is crucial for future work. |
universal vision model, clustering, transformer, recurrent cross-attention, feature dispatching |
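The two designs named in the ClusterFormer entry above, recurrent cluster-center updates and feature dispatching, can be sketched in simplified form. The code below is an illustration under assumptions rather than the paper's layer: it uses dot-product similarity with a softmax taken across clusters, a fixed number of update steps, and a residual addition during dispatch, with no learned projections.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def update_centers(features: List[Vector], centers: List[Vector], steps: int = 3) -> List[Vector]:
    """Recursively refine cluster centers from soft feature-to-center assignments."""
    dim = len(features[0])
    for _ in range(steps):
        # Each feature distributes its attention over the clusters.
        assign = [softmax([dot(f, c) for c in centers]) for f in features]
        new_centers = []
        for k in range(len(centers)):
            weight = sum(a[k] for a in assign)
            new_centers.append(
                [sum(a[k] * f[j] for a, f in zip(assign, features)) / weight for j in range(dim)]
            )
        centers = new_centers
    return centers


def dispatch(features: List[Vector], centers: List[Vector]) -> List[Vector]:
    """Add to each feature a similarity-weighted mix of the updated centers."""
    out = []
    for f in features:
        sims = softmax([dot(f, c) for c in centers])
        mix = [sum(s * c[j] for s, c in zip(sims, centers)) for j in range(len(f))]
        out.append([fj + mj for fj, mj in zip(f, mix)])
    return out
```

Since every step is a weighted average driven by explicit similarities, the assignment weights can be inspected directly, which is the sense in which the entry calls the pipeline transparent.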
2309.13101
Report |
Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction |
Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, Xiaogang Jin |
Implicit neural representation has paved the way for new approaches to
dynamic scene reconstruction and rendering. Nonetheless, cutting-edge dynamic
neural rendering methods rely heavily on these implicit representations, which
frequently struggle to capture the intricate details of objects in the scene.
Furthermore, implicit methods have difficulty achieving real-time rendering in
general dynamic scenes, limiting their use in a variety of tasks. To address
the issues, we propose a deformable 3D Gaussians Splatting method that
reconstructs scenes using 3D Gaussians and learns them in canonical space with
a deformation field to model monocular dynamic scenes. We also introduce an
annealing smoothing training mechanism with no extra overhead, which can
mitigate the impact of inaccurate poses on the smoothness of time interpolation
tasks in real-world datasets. Through a differential Gaussian rasterizer, the
deformable 3D Gaussians not only achieve higher rendering quality but also
real-time rendering speed. Experiments show that our method outperforms
existing methods significantly in terms of both rendering quality and speed,
making it well-suited for tasks such as novel-view synthesis, time
interpolation, and real-time rendering. |
This paper presents a novel method for high-fidelity monocular dynamic scene reconstruction using deformable 3D Gaussians, achieving real-time rendering and high-quality results. |
Existing dynamic neural rendering methods often struggle to capture intricate scene details and achieve real-time performance. This work addresses these limitations by utilizing a deformable 3D Gaussian framework. |
The proposed method learns 3D Gaussians in canonical space and employs a deformation field to model their variations over time. This is accomplished using a differential Gaussian rasterization pipeline and a novel annealing smoothing training mechanism. |
The method significantly outperforms previous approaches in terms of rendering quality and speed on both synthetic and real-world datasets.
It effectively reconstructs fine details and ensures temporal smoothness, even with inaccurate pose estimations.
The method achieves real-time rendering capabilities when the number of 3D Gaussians is below 250k. |
The method's performance is dependent on viewpoint diversity and pose estimation accuracy.
Scenes with an extremely high number of 3D Gaussians can lead to increased training time and memory consumption. |
neural rendering, dynamic scene reconstruction, 3d gaussians, deformable models, real-time rendering |
2309.13097
Report |
Zero-Shot Object Counting with Language-Vision Models |
Jingyi Xu, Hieu Le, Dimitris Samaras |
Class-agnostic object counting aims to count object instances of an arbitrary
class at test time. It is challenging but also enables many potential
applications. Current methods require human-annotated exemplars as inputs which
are often unavailable for novel categories, especially for autonomous systems.
Thus, we propose zero-shot object counting (ZSC), a new setting where only the
class name is available during test time. This obviates the need for human
annotators and enables automated operation. To perform ZSC, we propose finding
a few object crops from the input image and use them as counting exemplars. The
goal is to identify patches containing the objects of interest while also being
visually representative for all instances in the image. To do this, we first
construct class prototypes using large language-vision models, including CLIP
and Stable Diffusion, to select the patches containing the target objects.
Furthermore, we propose a ranking model that estimates the counting error of
each patch to select the most suitable exemplars for counting. Experimental
results on a recent class-agnostic counting dataset, FSC-147, validate the
effectiveness of our method. |
Introduces zero-shot object counting (ZSC), counting instances of a specific class using only the class name, without exemplars. |
Enables automated object counting for arbitrary classes without requiring human-annotated exemplars, unlike traditional methods. |
Two-step approach: 1) Class-relevant patch selection using class prototypes generated from either a VAE (trained on semantic embeddings) or Stable Diffusion (conditioned on class name). 2) Optimal patch selection via an error prediction network that predicts counting error for candidate patches. |
Significantly reduces error rates compared to using RPN proposals directly as exemplars.
Selected patches effectively represent target objects and lead to meaningful density maps.
Generalizes well to other exemplar-based counting methods. |
Performance may be limited by the quality of generated prototypes.
Further research is needed on handling diverse object scales and occlusions. |
zero-shot learning, object counting, class-agnostic counting, language-vision models, stable diffusion |
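The two-step exemplar selection in the zero-shot counting entry above can be sketched as a filter followed by a ranking. The code below assumes patch features and a class prototype that live in the same vector space, uses cosine similarity for the class-relevance step, and takes `predict_error` as a hypothetical callable standing in for the entry's error prediction network.

```python
from typing import Callable, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def select_exemplars(
    patch_feats: List[Vector],
    prototype: Vector,
    predict_error: Callable[[Vector], float],
    top_k: int = 5,
    num_exemplars: int = 3,
) -> List[int]:
    """Return indexes of patches chosen as counting exemplars.

    Step 1 keeps the top_k patches most similar to the class prototype.
    Step 2 ranks those by predicted counting error, lowest first.
    """
    ranked = sorted(
        range(len(patch_feats)),
        key=lambda i: cosine(patch_feats[i], prototype),
        reverse=True,
    )
    relevant = ranked[:top_k]
    relevant.sort(key=lambda i: predict_error(patch_feats[i]))
    return relevant[:num_exemplars]
```

The returned indexes would then be passed, as cropped patches, to an exemplar-based counter in place of human-annotated boxes.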
2309.13042
Report |
MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation |
Jiahao Xie, Wei Li, Xiangtai Li, Ziwei Liu, Yew Soon Ong, Chen Change Loy |
We present MosaicFusion, a simple yet effective diffusion-based data
augmentation approach for large vocabulary instance segmentation. Our method is
training-free and does not rely on any label supervision. Two key designs
enable us to employ an off-the-shelf text-to-image diffusion model as a useful
dataset generator for object instances and mask annotations. First, we divide
an image canvas into several regions and perform a single round of diffusion
process to generate multiple instances simultaneously, conditioning on
different text prompts. Second, we obtain corresponding instance masks by
aggregating cross-attention maps associated with object prompts across layers
and diffusion time steps, followed by simple thresholding and edge-aware
refinement processing. Without bells and whistles, our MosaicFusion can produce
a significant amount of synthetic labeled data for both rare and novel
categories. Experimental results on the challenging LVIS long-tailed and
open-vocabulary benchmarks demonstrate that MosaicFusion can significantly
improve the performance of existing instance segmentation models, especially
for rare and novel categories. Code will be released at
https://github.com/Jiahao000/MosaicFusion. |
This paper proposes MosaicFusion, a training-free data augmentation pipeline using diffusion models to generate images with multiple objects and their corresponding masks for large vocabulary instance segmentation. |
Large vocabulary instance segmentation suffers from data scarcity, especially for rare and novel categories, limiting model performance. MosaicFusion addresses this by synthesizing labeled data for these categories. |
MosaicFusion leverages a text-to-image diffusion model (Stable Diffusion) with two key components: 1) Image generation: an image canvas is divided into regions, each assigned a text prompt describing a specific object. The diffusion process runs on each region in parallel to generate the final image. 2) Mask generation: cross-attention maps corresponding to object prompts are aggregated across layers and time steps, then thresholded and refined to produce instance masks. |
Generating multiple objects per image is more effective than single object generation.
MosaicFusion consistently improves performance across different instance segmentation baselines (Mask R-CNN, CenterNet2) and backbones.
It significantly boosts performance on rare and novel categories in both long-tailed and open-vocabulary settings on LVIS, showing complementarity with CLIP-based methods. |
The study is limited to Stable Diffusion; exploring other diffusion models could be beneficial.
Current diffusion models have limited expressiveness, leading to a domain gap between synthetic and real images. Generating more complex scenes is a future direction. |
text-to-image diffusion models, data augmentation, long tail, open vocabulary, instance segmentation |
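The mask generation step in the MosaicFusion entry above, aggregating cross-attention maps and thresholding them, can be sketched as below. This is a minimal illustration assuming all attention maps have already been resized to a common resolution. Plain averaging, min-max normalization, and the 0.5 threshold are assumptions, and the edge-aware refinement mentioned in the entry is omitted.

```python
from typing import List

Map2D = List[List[float]]


def aggregate_maps(maps: List[Map2D]) -> Map2D:
    """Average attention maps collected across layers and time steps."""
    n, h, w = len(maps), len(maps[0]), len(maps[0][0])
    return [[sum(m[y][x] for m in maps) / n for x in range(w)] for y in range(h)]


def threshold_mask(avg: Map2D, threshold: float = 0.5) -> List[List[int]]:
    """Min-max normalize the averaged map and binarize it."""
    flat = [v for row in avg for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0.0:
        # A constant map carries no localization signal.
        return [[0 for _ in row] for row in avg]
    return [[1 if (v - lo) / span >= threshold else 0 for v in row] for row in avg]
```

In a full pipeline the input maps would come from the diffusion model's cross-attention for the token of each object prompt, one set of maps per generated instance.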
2309.13038
Report |
Privacy Assessment on Reconstructed Images: Are Existing Evaluation Metrics Faithful to Human Perception? |
Xiaoxiao Sun, Nidham Gazagnadou, Vivek Sharma, Lingjuan Lyu, Hongdong Li, Liang Zheng |
Hand-crafted image quality metrics, such as PSNR and SSIM, are commonly used
to evaluate model privacy risk under reconstruction attacks. Under these
metrics, reconstructed images that are determined to resemble the original one
generally indicate more privacy leakage. Images determined as overall
dissimilar, on the other hand, indicate higher robustness against attack.
However, there is no guarantee that these metrics well reflect human opinions,
which, as a judgement for model privacy leakage, are more trustworthy. In this
paper, we comprehensively study the faithfulness of these hand-crafted metrics
to human perception of privacy information from the reconstructed images. On 5
datasets ranging from natural images, faces, to fine-grained classes, we use 4
existing attack methods to reconstruct images from many different
classification models and, for each reconstructed image, we ask multiple human
annotators to assess whether this image is recognizable. Our studies reveal
that the hand-crafted metrics only have a weak correlation with the human
evaluation of privacy leakage and that even these metrics themselves often
contradict each other. These observations suggest risks of current metrics in
the community. To address this potential risk, we propose a learning-based
measure called SemSim to evaluate the Semantic Similarity between the original
and reconstructed images. SemSim is trained with a standard triplet loss, using
an original image as an anchor, one of its recognizable reconstructed images as
a positive sample, and an unrecognizable one as a negative. By training on
human annotations, SemSim exhibits a greater reflection of privacy leakage on
the semantic level. We show that SemSim has a significantly higher correlation
with human judgment compared with existing metrics. Moreover, this strong
correlation generalizes to unseen datasets, models and attack methods. |
This paper investigates the faithfulness of existing hand-crafted image quality metrics (e.g., PSNR, SSIM) in evaluating the privacy risks of image classification models under reconstruction attacks, and proposes a new learning-based metric, SemSim, which demonstrates a stronger correlation with human perception of privacy leakage. |
Existing image quality metrics, commonly used to evaluate model privacy risks, often show inconsistency with human perception of privacy leakage from reconstructed images, indicating potential risks in privacy assessment. |
The authors collect human annotations on the recognizability of reconstructed images from various datasets, models, and attack methods. They then analyze the correlation between human evaluation and existing metrics. Finally, they propose SemSim, a learning-based metric trained with a triplet loss on human-annotated data to better capture semantic similarity and reflect privacy leakage. |
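The triplet objective mentioned above has a standard form, sketched here on plain feature vectors. The Euclidean distance and the margin value are assumptions made for illustration; the source only states that a standard triplet loss is used, with the original image as anchor, a recognizable reconstruction as positive, and an unrecognizable one as negative.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: pull the positive closer than the negative.

    anchor:   features of the original image
    positive: features of a recognizable reconstruction
    negative: features of an unrecognizable reconstruction
    margin:   hypothetical value, not taken from the paper
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

The loss is zero once the positive is closer to the anchor than the negative by at least the margin, which is what lets the learned distance serve as a recognizability score.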
Existing image quality metrics show weak correlation with human perception of privacy leakage.
SemSim exhibits a significantly stronger correlation with human judgment compared to existing metrics.
SemSim generalizes well to unseen datasets, models, and attack methods. |
SemSim's performance might decrease when facing significant distributional shifts.
The binary nature of privacy leakage in the current study could be extended to a more continuous measure in future work. |
privacy, reconstruction attacks, image quality assessment, human perception, semantic similarity |
2309.12969
Report |
Detect Everything with Few Examples |
Xinyu Zhang, Yuting Wang, Abdeslam Boularias |
Few-shot object detection aims at detecting novel categories given a few
example images. Recent methods focus on finetuning strategies, with complicated
procedures that prohibit a wider application. In this paper, we introduce
DE-ViT, a few-shot object detector without the need for finetuning. DE-ViT's
novel architecture is based on a new region-propagation mechanism for
localization. The propagated region masks are transformed into bounding boxes
through a learnable spatial integral layer. Instead of training prototype
classifiers, we propose to use prototypes to project ViT features into a
subspace that is robust to overfitting on base classes. We evaluate DE-ViT on
few-shot and one-shot object detection benchmarks with Pascal VOC, COCO, and
LVIS. DE-ViT establishes new state-of-the-art results on all benchmarks.
Notably, for COCO, DE-ViT surpasses the few-shot SoTA by 15 mAP on 10-shot and
7.2 mAP on 30-shot and one-shot SoTA by 2.8 AP50. For LVIS, DE-ViT outperforms
few-shot SoTA by 20 box APr. |
This paper introduces DE-ViT, a fast few-shot object detector that can detect novel objects without finetuning by leveraging the generalization power of strong pretrained ViT backbones. |
Existing few-shot object detection methods rely heavily on finetuning, which leads to limitations in practical use, overfitting on base classes, and a large accuracy gap between base and novel classes. |
DE-ViT utilizes a novel region-propagation-based localization architecture with a learnable spatial integral layer to transform predicted regions into bounding boxes. It also employs feature subspace projection using class prototypes to mitigate overfitting on base classes. |
DE-ViT achieves state-of-the-art results on few-shot object detection benchmarks, including Pascal VOC, COCO, and LVIS.
On COCO, DE-ViT surpasses the previous SoTA LVC by 15 mAP on 10-shot and 7.2 mAP on 30-shot.
On LVIS, DE-ViT outperforms the previous SoTA DiGeo by 20 box APr. |
The feature subspace projection creates separate features for each class, introducing inference overhead.
The full potential of the region propagation network and spatial integral layer for general object detection tasks, including segmentation, is not fully explored in this work. |
few-shot object detection, vision transformer, region propagation, spatial integral layer, feature subspace projection |
2309.12790
Report |
NTO3D: Neural Target Object 3D Reconstruction with Segment Anything |
Xiaobao Wei, Renrui Zhang, Jiarui Wu, Jiaming Liu, Ming Lu, Yandong Guo, Shanghang Zhang |
Neural 3D reconstruction from multi-view images has recently attracted
increasing attention from the community. Existing methods normally learn a
neural field for the whole scene, while it is still under-explored how to
reconstruct a target object indicated by users. Considering the Segment
Anything Model (SAM) has shown effectiveness in segmenting any 2D images, in
this paper, we propose NTO3D, a novel high-quality Neural Target Object 3D
(NTO3D) reconstruction method, which leverages the benefits of both neural
field and SAM. We first propose a novel strategy to lift the multi-view 2D
segmentation masks of SAM into a unified 3D occupancy field. The 3D occupancy
field is then projected into 2D space and generates the new prompts for SAM.
This process is iterative until convergence to separate the target object from
the scene. After this, we then lift the 2D features of the SAM encoder into a
3D feature field in order to improve the reconstruction quality of the target
object. NTO3D lifts the 2D masks and features of SAM into the 3D neural field
for high-quality neural target object 3D reconstruction. We conduct detailed
experiments on several benchmark datasets to demonstrate the advantages of our
method. The code will be available at: https://github.com/ucwxb/NTO3D. |
This paper introduces NTO3D, a novel method for reconstructing a user-specified target object from multi-view images by leveraging the strengths of neural fields and the Segment Anything Model (SAM). |
Existing neural 3D reconstruction methods typically model the entire scene, neglecting the need to isolate and reconstruct specific objects. NTO3D addresses this gap by enabling user-guided, on-the-fly target object reconstruction, improving both flexibility and reconstruction quality. |
NTO3D operates in two stages. First, it trains a 3D occupancy field to merge multi-view 2D segmentation masks obtained from SAM, effectively isolating the target object in 3D space. Second, it refines the reconstruction by lifting SAM's 2D features into a 3D feature field, enhancing surface quality and detail. |
NTO3D achieves superior segmentation accuracy compared to baselines, generating high-quality multi-view consistent masks of target objects.
NTO3D demonstrates substantial improvements in novel view synthesis quality, as evidenced by higher PSNR and SSIM values compared to state-of-the-art methods.
The reconstructed 3D models produced by NTO3D exhibit higher fidelity, achieving lower Chamfer distances compared to existing techniques. |
NTO3D's performance depends on SAM's ability to segment the target object effectively; challenges arise when SAM encounters complex scenes or fails to provide accurate masks.
Future work could focus on enhancing NTO3D's robustness by incorporating techniques like parameter-efficient fine-tuning for SAM to handle challenging segmentation scenarios. |
3d reconstruction, neural fields, segment anything model (sam), target object segmentation, multi-view images |
2309.12757
Report |
Masking Improves Contrastive Self-Supervised Learning for ConvNets, and Saliency Tells You Where |
Zhi-Yi Chin, Chieh-Ming Jiang, Ching-Chun Huang, Pin-Yu Chen, Wei-Chen Chiu |
While image data starts to enjoy the simple-but-effective self-supervised
learning scheme built upon masking and self-reconstruction objective thanks to
the introduction of tokenization procedure and vision transformer backbone,
convolutional neural networks as another important and widely-adopted
architecture for image data, though having contrastive-learning techniques to
drive the self-supervised learning, still face the difficulty of leveraging
such straightforward and general masking operation to benefit their learning
process significantly. In this work, we aim to alleviate the burden of
including masking operation into the contrastive-learning framework for
convolutional neural networks as an extra augmentation method. In addition to
the additive but unwanted edges (between masked and unmasked regions) as well
as other adverse effects caused by the masking operations for ConvNets, which
have been discussed by prior works, we particularly identify the potential
problem where for one view in a contrastive sample-pair the randomly-sampled
masking regions could be overly concentrated on important/salient objects thus
resulting in misleading contrastiveness to the other view. To this end, we
propose to explicitly take the saliency constraint into consideration in which
the masked regions are more evenly distributed among the foreground and
background for realizing the masking-based augmentation. Moreover, we introduce
hard negative samples by masking larger regions of salient patches in an input
image. Extensive experiments conducted on various datasets, contrastive
learning mechanisms, and downstream tasks well verify the efficacy as well as
the superior performance of our proposed method with respect to several
state-of-the-art baselines. |
This paper proposes a novel saliency-guided masking augmentation method for improving contrastive self-supervised learning in convolutional neural networks. |
While masking and self-reconstruction have been successful in self-supervised learning for vision transformers, convolutional networks struggle to effectively utilize these techniques. This paper addresses this gap by incorporating masking as an augmentation strategy in a contrastive learning framework. |
The method utilizes a pretrained localization network to generate saliency maps, guiding the masking operation to distribute masked patches evenly between foreground and background. It introduces three masking strategies to mitigate parasitic edges: high-pass filtering, strong blurring, and mean filling. Additionally, it explores the creation of hard negative samples by masking large portions of salient patches. |
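One way to read "distribute masked patches evenly between foreground and background" is to mask the same fraction of patches in each group. The sketch below does exactly that under stated assumptions: per-patch saliency scores are given as a flat list, a fixed threshold splits foreground from background, and the function name and defaults are hypothetical rather than the paper's.

```python
import random

def saliency_balanced_mask(saliency, mask_ratio=0.25, fg_threshold=0.5, seed=0):
    """Pick patches to mask so foreground and background lose the same fraction.

    saliency: flat list of per-patch saliency scores in [0, 1].
    Returns the sorted indices of the masked patches.
    """
    rng = random.Random(seed)
    foreground = [i for i, s in enumerate(saliency) if s >= fg_threshold]
    background = [i for i, s in enumerate(saliency) if s < fg_threshold]
    masked = []
    for group in (foreground, background):
        k = round(mask_ratio * len(group))   # same ratio in each group
        masked.extend(rng.sample(group, k))
    return sorted(masked)
```

Purely random masking, by contrast, could place most masked patches on the salient object in one view, which is the failure mode the paper identifies.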
Saliency-guided masking consistently outperforms random masking and other baselines in various downstream tasks, including image classification, object detection, and instance segmentation.
Masking solely the query branch of the Siamese network, motivated by variance manipulation, further improves performance.
Hard negative samples, generated by masking salient patches, provide additional performance benefits. |
The strong blurring strategy's efficiency is limited by GPU I/O, requiring further optimization.
The impact of different localization network choices on the performance requires further investigation. |
self-supervised learning, contrastive learning, convolutional neural networks, masking augmentation, saliency |
2309.12412
Report |
Speeding up Resnet Architecture with Layers Targeted Low Rank Decomposition |
Walid Ahmed, Habib Hajimolahoseini, Austin Wen, Yang Liu |
Compression of a neural network can help in speeding up both the training and
the inference of the network. In this research, we study applying compression
using low rank decomposition on network layers. Our research demonstrates that,
to obtain a speedup, the compression methodology should be aware of the
underlying hardware, and analysis should be done to choose which layers to
compress. The advantage of our approach is demonstrated via a case study of
compressing ResNet50 and training on full ImageNet-ILSVRC2012. We tested on two
different hardware systems, Nvidia V100 and Huawei Ascend910. With
hardware-targeted compression, results showed a 5.36% training speedup on
Ascend910 and a 15.79% inference speedup on Ascend310, with only a 1% drop in
accuracy compared to the original uncompressed model. |
This paper proposes a hardware-aware low-rank decomposition (LRD) framework for compressing neural networks to achieve faster training and inference. |
Compressing large neural networks is crucial for deployment on resource-constrained devices and for faster training and inference. |
The proposed method uses LRD with tensor decomposition, selectively compressing layers based on hardware and introducing compression modes to find the optimal layers for compression. It also includes final dense layer compression rate adjustment and rank quantization. |
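To make the low-rank idea concrete, the sketch below factorizes a dense weight matrix with a truncated SVD and compares parameter counts. It is an illustration only: the paper applies tensor decomposition to ResNet50 layers, whereas this example uses a plain matrix, and the helper names are hypothetical.

```python
import numpy as np

def low_rank_factors(weight, rank):
    """Approximate weight (m x n) as a @ b with a: (m x rank), b: (rank x n)."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # fold the singular values into the left factor
    b = vt[:rank, :]
    return a, b

def param_counts(m, n, rank):
    """Parameters before and after replacing one m x n layer by two factors."""
    return m * n, rank * (m + n)
```

For a 2048 x 1000 layer at rank 100, `param_counts` gives 2,048,000 parameters before and 304,800 after. As the findings below note, such a reduction in parameters does not by itself guarantee a speedup on a given device.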
Not all layers benefit from compression equally, and hardware plays a significant role in determining the effectiveness.
The proposed framework achieved a 5.36% training speedup on Huawei Ascend910 and a 15.79% inference speedup on Ascend310 for ResNet50 with minimal accuracy loss.
Simply reducing parameters or FLOPs does not guarantee speedup, and careful layer selection and hardware awareness are essential. |
The study focuses on ResNet50 and two hardware platforms; further validation on diverse architectures and hardware is needed.
The search for the optimal compression mode could be automated and potentially improved by incorporating hardware-specific performance models. |
model compression, low-rank decomposition, hardware-aware compression, neural network speedup, resnet50 |
2309.11955
Report |
A Study of Forward-Forward Algorithm for Self-Supervised Learning |
Jonas Brenig, Radu Timofte |
Self-supervised representation learning has seen remarkable progress in the
last few years, with some of the recent methods being able to learn useful
image representations without labels. These methods are trained using
backpropagation, the de facto standard. Recently, Geoffrey Hinton proposed the
forward-forward algorithm as an alternative training method. It utilizes two
forward passes and a separate loss function for each layer to train the network
without backpropagation.
In this study, for the first time, we study the performance of
forward-forward vs. backpropagation for self-supervised representation learning
and provide insights into the learned representation spaces. Our benchmark
employs four standard datasets, namely MNIST, F-MNIST, SVHN and CIFAR-10, and
three commonly used self-supervised representation learning techniques, namely
rotation, flip and jigsaw.
Our main finding is that while the forward-forward algorithm performs
comparably to backpropagation during (self-)supervised training, the transfer
performance is significantly lagging behind in all the studied settings. This
may be caused by a combination of factors, including having a loss function for
each layer and the way the supervised training is realized in the
forward-forward paradigm. In comparison to backpropagation, the forward-forward
algorithm focuses more on the boundaries and drops part of the information
unnecessary for making decisions which harms the representation learning goal.
Further investigation and research are necessary to stabilize the
forward-forward strategy for self-supervised learning, to work beyond the
datasets and configurations demonstrated by Geoffrey Hinton. |
This paper presents the first study on the performance of the forward-forward algorithm, a novel alternative to backpropagation, for self-supervised representation learning. |
The forward-forward algorithm, while promising for its biological plausibility, requires evaluation in the context of self-supervised learning, a powerful technique for learning image representations without labels. |
The authors benchmark the forward-forward algorithm against traditional backpropagation on four datasets (MNIST, F-MNIST, SVHN, CIFAR-10) and three self-supervised tasks (rotation, flip, jigsaw). They analyze accuracy on the self-supervised tasks and the transfer learning performance on classification using a linear classifier. |
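For readers unfamiliar with the per-layer objective, the sketch below follows Hinton's original formulation of forward-forward, in which a layer's "goodness" is the sum of its squared activations and should exceed a threshold for positive inputs and fall below it for negative ones. The threshold value and function names are assumptions, and the exact loss used in this study may differ.

```python
import math

def goodness(activations):
    """Goodness of one layer: sum of squared activations."""
    return sum(a * a for a in activations)

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def layer_loss(pos_activations, neg_activations, theta=2.0):
    """Local loss for one layer; no gradient flows to other layers.

    Low when positive goodness is above theta and negative goodness is below.
    theta is a hypothetical threshold chosen for illustration.
    """
    return (softplus(theta - goodness(pos_activations))
            + softplus(goodness(neg_activations) - theta))
```

Because each layer is trained only against its own loss, information that does not help separate positive from negative inputs can be dropped early, which is consistent with the weaker transfer performance reported below.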
The forward-forward algorithm performs comparably to backpropagation during self-supervised pre-training.
The transfer performance of forward-forward significantly lags behind backpropagation in all tested settings, indicating a difficulty in generalizing learned representations.
The forward-forward algorithm appears to focus heavily on features directly relevant to the self-supervised task, potentially discarding information crucial for downstream tasks. |
The specific self-supervised tasks and their implementation details might influence the performance of the forward-forward algorithm.
Future work should explore alternative SSL tasks better suited for the forward-forward algorithm and investigate Siamese network structures and generative extensions of the algorithm. |
forward-forward algorithm, self-supervised learning, representation learning, backpropagation, transfer learning |
2309.11923
Report |
TextCLIP: Text-Guided Face Image Generation And Manipulation Without Adversarial Training |
Xiaozhou You, Jian Zhang |
Text-guided image generation aims to generate desired images conditioned on
given texts, while text-guided image manipulation refers to semantically editing
parts of a given image based on specified texts. For these two similar tasks,
the key point is to ensure image fidelity as well as semantic consistency. Many
previous approaches require complex multi-stage generation and adversarial
training, while struggling to provide a unified framework for both tasks. In
this work, we propose TextCLIP, a unified framework for text-guided image
generation and manipulation without adversarial training. The proposed method
accepts input from images or random noise corresponding to these two different
tasks, and under the condition of the specific texts, a carefully designed
mapping network that exploits the powerful generative capabilities of StyleGAN
and the text-image representation capabilities of Contrastive Language-Image
Pre-training (CLIP) generates images of up to $1024\times1024$ resolution, the
highest that can currently be generated. Extensive experiments on the Multi-modal CelebA-HQ
dataset have demonstrated that our proposed method outperforms existing
state-of-the-art methods, both on text-guided generation tasks and manipulation
tasks. |
TextCLIP, a unified framework for text-guided image generation and manipulation without adversarial training, is proposed. |
Many previous methods for text-guided image generation and manipulation rely on complex multi-stage generation and adversarial training, leading to challenges in efficiency and training difficulty. This work aims to address these limitations and provide a unified framework for both tasks. |
TextCLIP leverages a pretrained encoder, level-channel mapper, and a pretrained StyleGAN generator. It maps input (image or random noise) to StyleGAN's latent space, uses a level-channel mapper to encode text information, and feeds the resulting latent code to the StyleGAN generator for image synthesis. Different processing is applied based on the task (generation or manipulation). |
TextCLIP outperforms state-of-the-art methods in both text-guided image generation and manipulation tasks on the Multi-modal CelebA-HQ dataset.
The proposed level-channel mapper effectively maps textual information to the latent space of StyleGAN for high-quality image generation.
The designed loss functions ensure both image realism and semantic alignment between generated images and given texts. |
TextCLIP is currently limited to the face domain and needs further exploration for generalization to other domains like flowers or birds.
The performance of TextCLIP is affected by the inherent limitations of StyleGAN and CLIP, such as the representation of certain attributes and potential vulnerabilities of CLIP. |
text-guided image generation, text-guided image manipulation, stylegan, clip, adversarial training |
2309.11747
Report |
MarkNerf: Watermarking for Neural Radiance Field |
Lifeng Chen, Jia Liu, Yan Ke, Wenquan Sun, Weina Dong, Xiaozhong Pan |
A watermarking algorithm is proposed in this paper to address the copyright
protection issue of implicit 3D models. The algorithm involves embedding
watermarks into the images in the training set through an embedding network,
and subsequently utilizing the NeRF model for 3D modeling. A copyright verifier
is employed to generate a backdoor image by providing a secret perspective as
input to the neural radiance field. Subsequently, a watermark extractor is
devised using the hyperparameterization method of the neural network to extract
the embedded watermark image from that perspective. In a black box scenario, if
there is a suspicion that the 3D model has been used without authorization, the
verifier can extract watermarks from a secret perspective to verify network
copyright. Experimental results demonstrate that the proposed algorithm
effectively safeguards the copyright of 3D models. Furthermore, the extracted
watermarks exhibit favorable visual effects and demonstrate robust resistance
against various types of noise attacks. |
This paper proposes MarkNerf, a novel watermarking algorithm for Neural Radiance Fields (NeRF) to address copyright protection issues in implicit 3D models. |
With the increasing popularity of NeRF in generating and sharing 3D content, protecting the copyright of these implicit 3D models becomes crucial. |
The algorithm embeds watermarks into training images using an embedding network before training the NeRF model. A secret perspective, acting as the key, is used during training. Watermark extraction employs an over-parameterized network, only revealing the watermark when presented with an image rendered from the secret perspective. |
The embedded watermarks exhibit high imperceptibility, ensuring minimal visual difference between watermarked and original images.
The algorithm demonstrates robustness against various noise attacks, preserving watermark integrity.
Watermark extraction is only successful when the image is rendered from the secret perspective, guaranteeing copyright protection. |
The extraction network's structure could be further improved to mitigate potential watermark extraction from adjacent views.
Future work can focus on exploring different watermark embedding and extraction techniques to enhance security and robustness. |
neural radiance field, 3d watermarking, copyright protection, deep learning, implicit 3d models |
2309.11525
Report |
Light Field Diffusion for Single-View Novel View Synthesis |
Yifeng Xiong, Haoyu Ma, Shanlin Sun, Kun Han, Hao Tang, Xiaohui Xie |
Single-view novel view synthesis (NVS), the task of generating images from
new viewpoints based on a single reference image, is important but challenging
in computer vision. Recent advancements in NVS have leveraged Denoising
Diffusion Probabilistic Models (DDPMs) for their exceptional ability to produce
high-fidelity images. However, current diffusion-based methods typically
utilize camera pose matrices to globally and implicitly enforce 3D constraints,
which can lead to inconsistencies in images generated from varying viewpoints,
particularly in regions with complex textures and structures.
To address these limitations, we present Light Field Diffusion (LFD), a novel
conditional diffusion-based approach that transcends the conventional reliance
on camera pose matrices. Starting from the camera pose matrices, LFD transforms
them into light field encoding, with the same shape as the reference image, to
describe the direction of each ray. By integrating light field encoding with
the reference image, our method imposes local pixel-wise constraints within the
diffusion process, fostering enhanced view consistency. Our approach not only
involves training image LFD on the ShapeNet Car dataset but also includes
fine-tuning a pre-trained latent diffusion model on the Objaverse dataset. This
enables our latent LFD model to exhibit remarkable zero-shot generalization
capabilities across out-of-distribution datasets like RTMV as well as
in-the-wild images. Experiments demonstrate that LFD not only produces
high-fidelity images but also achieves superior 3D consistency in complex
regions, outperforming existing novel view synthesis methods. |
This paper presents Light Field Diffusion (LFD), a novel conditional diffusion-based approach for single-view novel view synthesis that utilizes light field encoding of camera poses to impose local pixel-wise constraints, enhancing view consistency. |
Existing diffusion-based methods for novel view synthesis rely on camera pose matrices, which provide only global and implicit 3D constraints, leading to inconsistencies in generated images from varying viewpoints, particularly in complex regions. |
LFD transforms camera pose matrices into light field encoding, describing the direction of each ray. This encoding is integrated with the reference and target images during the diffusion process using a U-Net architecture with cross-attention, enabling local pixel correspondences and enhanced view consistency. Both image-space and latent-space implementations are explored. |
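The conversion from a camera pose to a per-pixel ray description can be sketched as below. This is an assumption-laden illustration, not the paper's exact encoding: it uses a pinhole camera looking down the negative z-axis with the principal point at the image center, and stores ray origin plus unit direction as six channels per pixel. The function name and parameters are hypothetical.

```python
import numpy as np

def ray_encoding(height, width, focal, cam_to_world):
    """Return an (H, W, 6) array: ray origin (3) and unit direction (3) per pixel.

    cam_to_world: 4x4 camera-to-world pose matrix.
    focal: focal length in pixels (assumed equal for x and y).
    """
    rows, cols = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    # Ray directions in camera space through each pixel center.
    dirs_cam = np.stack(
        [(cols - width / 2 + 0.5) / focal,
         -(rows - height / 2 + 0.5) / focal,
         -np.ones_like(cols, dtype=float)],
        axis=-1,
    )
    rotation, translation = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs_world = dirs_cam @ rotation.T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    origins = np.broadcast_to(translation, dirs_world.shape)
    return np.concatenate([origins, dirs_world], axis=-1)
```

Because the result has the same spatial shape as the image, it can be combined with the image pixel by pixel, which is what allows local rather than global conditioning.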
LFD outperforms existing methods on Objaverse and ShapeNet datasets in terms of view consistency and image quality.
The latent LFD model demonstrates zero-shot generalization, effectively synthesizing novel views for out-of-distribution datasets like RTMV and in-the-wild images.
LFD with light field encoding shows superior performance compared to using camera pose matrices directly in the diffusion model. |
Latent LFD faces challenges with highly complex, in-the-wild images, particularly landscapes, due to its training predominantly on synthetic data.
The current light field encoding does not explicitly provide depth information or details about the scene's light source. |
novel view synthesis, diffusion models, light field, single-view reconstruction, 3d vision |
2309.11497
Report |
FreeU: Free Lunch in Diffusion U-Net |
Chenyang Si, Ziqi Huang, Yuming Jiang, Ziwei Liu |
In this paper, we uncover the untapped potential of diffusion U-Net, which
serves as a "free lunch" that substantially improves the generation quality on
the fly. We initially investigate the key contributions of the U-Net
architecture to the denoising process and identify that its main backbone
primarily contributes to denoising, whereas its skip connections mainly
introduce high-frequency features into the decoder module, causing the network
to overlook the backbone semantics. Capitalizing on this discovery, we propose
a simple yet effective method, termed "FreeU", that enhances generation quality
without additional training or finetuning. Our key insight is to strategically
re-weight the contributions sourced from the U-Net's skip connections and
backbone feature maps, to leverage the strengths of both components of the
U-Net architecture. Promising results on image and video generation tasks
demonstrate that our FreeU can be readily integrated to existing diffusion
models, e.g., Stable Diffusion, DreamBooth, ModelScope, Rerender and ReVersion,
to improve the generation quality with only a few lines of code. All you need
is to adjust two scaling factors during inference. Project page:
https://chenyangsi.top/FreeU/. |
Proposes "FreeU," a method to improve diffusion model sample quality during inference by re-weighting feature contributions from the U-Net's backbone and skip connections. |
Diffusion U-Net's internal properties are under-explored, and improving sample quality usually requires computationally expensive training or fine-tuning. |
Analyzes the contributions of the U-Net backbone and skip connections to the denoising process, then introduces backbone and skip feature scaling factors during inference to re-balance their contributions. |
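The re-weighting can be sketched on plain arrays as below. This is a simplified illustration under assumptions: backbone features are multiplied by a factor `b`, and the low-frequency band of the skip features is scaled by a factor `s` in the Fourier domain before the two are concatenated. The function names, the window radius, and the default values are hypothetical, and the released FreeU code differs in detail.

```python
import numpy as np

def scale_low_freq(skip, s, radius=1):
    """Scale the low-frequency band of skip features (C, H, W) by s."""
    axes = (-2, -1)
    spectrum = np.fft.fftshift(np.fft.fft2(skip, axes=axes), axes=axes)
    h, w = skip.shape[-2:]
    ch, cw = h // 2, w // 2   # zero frequency sits here after fftshift
    spectrum[..., ch - radius:ch + radius + 1, cw - radius:cw + radius + 1] *= s
    restored = np.fft.ifft2(np.fft.ifftshift(spectrum, axes=axes), axes=axes)
    return restored.real

def freeu_merge(backbone, skip, b=1.2, s=0.9):
    """Re-weight backbone and skip features, then concatenate along channels."""
    return np.concatenate([backbone * b, scale_low_freq(skip, s)], axis=0)
```

Setting `b=1.0` and `s=1.0` recovers the ordinary concatenation, so the change only affects inference-time weighting and needs no retraining.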
Amplifying backbone features improves denoising and image quality.
Skip connections primarily contribute high-frequency information, and their modulation has less impact on overall quality.
FreeU significantly improves image and video generation quality in Stable Diffusion, DreamBooth, ModelScope, Rerender, and ReVersion without additional training. |
The impact of different scaling factors and their optimal values need further investigation.
The oversmoothing of textures when amplifying backbone features needs to be addressed more robustly. |
diffusion models, u-net, image generation, video generation, sample quality |
2309.11043
Report |
Score Mismatching for Generative Modeling |
Senmao Ye, Fei Liu |
We propose a new score-based model with one-step sampling. Previously,
score-based models were burdened with heavy computations due to iterative
sampling. For substituting the iterative process, we train a standalone
generator to compress all the time steps with the gradient backpropagated from
the score network. In order to produce meaningful gradients for the generator,
the score network is trained to simultaneously match the real data distribution
and mismatch the fake data distribution. This model has the following
advantages: 1) For sampling, it generates a fake image with only one step
forward. 2) For training, it only needs 10 diffusion steps. 3) Compared with
consistency model, it is free of the ill-posed problem caused by consistency
loss. On the popular CIFAR-10 dataset, our model outperforms Consistency Model
and Denoising Score Matching, which demonstrates the potential of the
framework. We further provide more examples on the MNIST and LSUN datasets.
The code is available on GitHub. |
This paper proposes Score Mismatching (SMM), a novel score-based generative model that achieves one-step sampling by training a standalone generator to compress all diffusion time steps, guided by gradients from a score network trained to match real and mismatch fake data distributions. |
Score-based models typically suffer from heavy computational burdens due to iterative sampling. This work aims to accelerate this process for wider application, particularly in resource-constrained environments like mobile devices. |
SMM trains a generator and a score network jointly. The score network is trained to match the score of the true data distribution and mismatch the generated data distribution. The standalone generator is trained to fool the score network by generating samples close to the real data manifold. A zero-mean noise injection pipeline is used during training to eliminate noise corruption in the generated samples. |
SMM outperforms state-of-the-art one-step score-based models, such as Consistency Model, on CIFAR-10 image generation.
Only 10 diffusion steps are needed during training, significantly less than traditional score-based models.
SMM avoids the ill-posed problem encountered in Consistency Model by not relying on direct pixel-wise mapping between noisy and clean images. |
The performance of SMM is sensitive to the choice of noise corruption strategy and network architectures.
Exploiting pre-trained score-based models for further performance improvement is challenging due to their reliance on noisy score estimation. |
generative models, score-based models, one-step sampling, diffusion models, adversarial training |
2309.11009
Report |
Controllable Dynamic Appearance for Neural 3D Portraits |
ShahRukh Athar, Zhixin Shu, Zexiang Xu, Fujun Luan, Sai Bi, Kalyan Sunkavalli, Dimitris Samaras |
Recent advances in Neural Radiance Fields (NeRFs) have made it possible to
reconstruct and reanimate dynamic portrait scenes with control over head-pose,
facial expressions and viewing direction. However, training such models assumes
photometric consistency over the deformed region, e.g., the face must be evenly
lit as it deforms with changing head-pose and facial expression. Such
photometric consistency across frames of a video is hard to maintain, even in
studio environments, thus making the created reanimatable neural portraits
prone to artifacts during reanimation. In this work, we propose CoDyNeRF, a
system that enables the creation of fully controllable 3D portraits in
real-world capture conditions. CoDyNeRF learns to approximate illumination
dependent effects via a dynamic appearance model in the canonical space that is
conditioned on predicted surface normals and the facial expressions and
head-pose deformations. The surface normals prediction is guided using 3DMM
normals that act as a coarse prior for the normals of the human head, where
direct prediction of normals is hard due to rigid and non-rigid deformations
induced by head-pose and facial expression changes. Using only a
smartphone-captured short video of a subject for training, we demonstrate the
effectiveness of our method on free view synthesis of a portrait scene with
explicit head pose and expression controls, and realistic lighting effects. The
project page can be found here:
http://shahrukhathar.github.io/2023/08/22/CoDyNeRF.html |
This paper introduces CoDyNeRF, a system that creates controllable and reanimatable 3D neural portraits from videos captured in real-world lighting conditions. |
Existing NeRF-based portrait animation methods often fail in realistic lighting due to the assumption of photometric consistency, leading to artifacts in relighting and shadowing. |
CoDyNeRF employs a dynamic canonical appearance model conditioned on surface normals, head pose, facial expressions, and other shading cues. It predicts dynamic surface normals using an MLP trained with 3DMM and scene normals as priors. |
CoDyNeRF realistically reproduces shadowing, shading, and specularity effects during reanimation.
Quantitative evaluations demonstrate superior performance compared to state-of-the-art methods like RigNeRF and Neural Head Avatars.
Ablation studies confirm the importance of dynamic appearance conditioning and accurate normal prediction. |
CoDyNeRF is currently subject-specific, requiring training for each individual.
It does not support relighting with novel lighting conditions. |
neural radiance fields, 3d portrait animation, dynamic appearance modeling, surface normal prediction, realistic relighting |
2309.10810
Report |
PGDiff: Guiding Diffusion Models for Versatile Face Restoration via Partial Guidance |
Peiqing Yang, Shangchen Zhou, Qingyi Tao, Chen Change Loy |
Exploiting pre-trained diffusion models for restoration has recently become a
favored alternative to the traditional task-specific training approach.
Previous works have achieved noteworthy success by limiting the solution space
using explicit degradation models. However, these methods often fall short when
faced with complex degradations as they generally cannot be precisely modeled.
In this paper, we propose PGDiff by introducing partial guidance, a fresh
perspective that is more adaptable to real-world degradations compared to
existing works. Rather than specifically defining the degradation process, our
approach models the desired properties, such as image structure and color
statistics of high-quality images, and applies this guidance during the reverse
diffusion process. These properties are readily available and make no
assumptions about the degradation process. When combined with a diffusion
prior, this partial guidance can deliver appealing results across a range of
restoration tasks. Additionally, PGDiff can be extended to handle composite
tasks by consolidating multiple high-quality image properties, achieved by
integrating the guidance from respective tasks. Experimental results
demonstrate that our method not only outperforms existing diffusion-prior-based
approaches but also competes favorably with task-specific models. |
This paper introduces PGDiff, a novel approach for image restoration that leverages the generative prior of pre-trained diffusion models without relying on explicit degradation models. |
Existing diffusion-based restoration methods, while versatile, struggle with complex, real-world degradations due to their dependence on accurate degradation modeling. PGDiff addresses this limitation, offering a more generalizable solution. |
PGDiff employs 'partial guidance', which focuses on modeling desired properties of high-quality images (e.g., structure, color statistics) rather than the degradation process. This guidance, implemented using classifier guidance with dynamic adjustments, steers the diffusion model's denoising process. |
PGDiff effectively handles various restoration tasks, including blind face restoration, colorization, and inpainting, outperforming existing diffusion-prior-based methods.
The method demonstrates strong performance on challenging cases, such as old photo restoration with scratches, by combining guidance from multiple restoration tasks.
PGDiff exhibits flexibility in incorporating additional guidance, exemplified by reference-based restoration using identity features and quality enhancement with perceptual and adversarial losses. |
The performance of PGDiff is contingent upon the capabilities of the pre-trained diffusion model used.
The current implementation primarily focuses on face restoration due to the use of a face-specific diffusion model. Extending it to broader object categories is left for future work. |
image restoration, diffusion models, generative prior, partial guidance, classifier guidance |
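The guidance step described in the PGDiff report above can be written in the generic loss-guided form of classifier guidance. This is a sketch under stated assumptions rather than the paper's exact update: the symbols $\mathcal{L}$ (a loss on a desired property such as structure or color statistics), $y$ (the target property taken from the degraded input), $s$ (a guidance scale), and the choice to evaluate the loss on the denoiser's clean-image estimate $\hat{x}_0(x_t)$ are all assumed here for illustration.

```latex
x_{t-1} \sim \mathcal{N}\!\left(
  \mu_\theta(x_t, t) \;-\; s\,\Sigma_\theta(x_t, t)\,
  \nabla_{x_t}\,\mathcal{L}\big(\hat{x}_0(x_t),\, y\big),\;
  \Sigma_\theta(x_t, t)
\right)
```

Setting $s = 0$ recovers the unguided reverse step; a larger $s$ pulls each sample more strongly toward images whose chosen property matches $y$, without modeling how the input was degraded.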
2309.10713
Report |
Interpret Vision Transformers as ConvNets with Dynamic Convolutions |
Chong Zhou, Chen Change Loy, Bo Dai |
There has been a debate over whether vision Transformers or ConvNets make the
better backbone for computer vision models. Although they are
usually considered as two completely different architectures, in this paper, we
interpret vision Transformers as ConvNets with dynamic convolutions, which
enables us to characterize existing Transformers and dynamic ConvNets in a
unified framework and compare their design choices side by side. In addition,
our interpretation can also guide the network design as researchers now can
consider vision Transformers from the design space of ConvNets and vice versa.
We demonstrate such potential through two specific studies. First, we inspect
the role of softmax in vision Transformers as the activation function and find
it can be replaced by commonly used ConvNets modules, such as ReLU and Layer
Normalization, which results in a faster convergence rate and better
performance. Second, following the design of depth-wise convolution, we create
a corresponding depth-wise vision Transformer that is more efficient with
comparable performance. The potential of the proposed unified interpretation is
not limited to the given examples and we hope it can inspire the community and
give rise to more advanced network architectures. |
Presents a novel interpretation of vision Transformers as ConvNets with dynamic convolutions, providing a unified framework to understand and compare these architectures. |
This interpretation bridges the gap between Transformers and ConvNets, enabling the transfer of design principles between them for developing more advanced architectures. |
The authors reformulate the self-attention mechanism in Transformers as a series of operations equivalent to static and dynamic convolutions. |
Softmax in vision Transformers can be effectively replaced by other normalization and activation techniques common in ConvNets, such as Layer Normalization and ReLU, leading to faster convergence and even improved performance.
Inspired by depth-wise convolutions, the authors propose depth-wise vision Transformers, achieving comparable performance to standard Transformers while being more efficient.
The proposed framework provides a new lens for analyzing design choices in Transformers and ConvNets side-by-side, fostering cross-architectural inspiration. |
The paper focuses on $1 \times 1$ dynamic convolutions for Transformers, leaving the exploration of larger kernel sizes for future work.
Further investigation is needed to explore the full potential of this unified framework, such as designing self-attention-like dynamic convolutions with strides. |
vision transformers, convolutional neural networks, dynamic convolutions, self-attention, network architecture design |
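The interpretation in the report above can be illustrated with a small sketch: self-attention produces input-dependent ("dynamic") mixing weights that are then applied to the tokens, and the function that turns raw scores into weights is swappable. The code below is a minimal illustration, not the authors' implementation. The `relu_normalize` function is a hypothetical stand-in (ReLU followed by sum normalization) for the ReLU and Layer Normalization combination the report mentions, and queries, keys, and values are taken to be the raw tokens for brevity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    # Standard softmax, shifted by the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relu_normalize(scores, eps=1e-6):
    # Hypothetical replacement for softmax: ReLU, then sum normalization.
    pos = [max(s, 0.0) for s in scores]
    total = sum(pos) + eps
    return [p / total for p in pos]


def dynamic_weights(queries, keys, activation):
    # One row of mixing weights per query; the weights depend on the input.
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [activation([dot(q, k) * scale for k in keys]) for q in queries]


def apply_weights(weights, values):
    # Each output token is a weighted sum of the value tokens.
    dim = len(values[0])
    return [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
        for row in weights
    ]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_soft = dynamic_weights(tokens, tokens, softmax)
w_relu = dynamic_weights(tokens, tokens, relu_normalize)
out_soft = apply_weights(w_soft, tokens)
out_relu = apply_weights(w_relu, tokens)
```

Unlike softmax, the ReLU variant assigns exactly zero weight to tokens with non-positive scores, which is one way the choice of activation changes the resulting dynamic weights.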
2309.10556
Report |
Forgedit: Text Guided Image Editing via Learning and Forgetting |
Shiwen Zhang, Shuai Xiao, Weilin Huang |
Text-guided image editing on real or synthetic images, given only the
original image itself and the target text prompt as inputs, is a very general
and challenging task. It requires an editing model to estimate by itself which
part of the image should be edited, and then perform either rigid or non-rigid
editing while preserving the characteristics of the original image. In this paper,
we design a novel text-guided image editing method named Forgedit. First,
we propose a vision-language joint optimization framework capable of
reconstructing the original image in 30 seconds, much faster than the previous SOTA
and with much less overfitting. Then we propose a novel vector projection mechanism
in the text embedding space of Diffusion Models, which can control
identity similarity and editing strength separately. Finally, we discover a
general property of the UNet in Diffusion Models, i.e., the UNet encoder learns space
and structure, while the UNet decoder learns appearance and identity. With such a
property, we design forgetting mechanisms to successfully tackle the fatal and
inevitable overfitting issues when fine-tuning Diffusion Models on one image,
thus significantly boosting the editing capability of Diffusion Models. Our
method, Forgedit, built on Stable Diffusion, achieves new state-of-the-art
results on the challenging text-guided image editing benchmark: TEdBench,
surpassing the previous SOTA methods such as Imagic with Imagen, in terms of
both CLIP score and LPIPS score. Code is available at
https://github.com/witcherofresearch/Forgedit |
Proposes Forgedit, a novel text-guided image editing method using diffusion models, addressing limitations of previous optimization-based methods like slow fine-tuning and overfitting. |
Enables efficient and precise text-guided image editing on real or synthetic images, crucial for tasks like visual storytelling and content creation, while preserving characteristics of the original image. |
Employs a two-stage approach: 1) **Joint fine-tuning:** Reconstructs the original image using a vision-language joint optimization framework with a BLIP-generated source prompt, achieving faster convergence and less overfitting. 2) **Editing:** Utilizes a novel vector projection mechanism in text embedding space for controlled editing and a forgetting strategy based on the observed UNet property (encoder learns structure, decoder learns appearance) to mitigate overfitting during sampling. |
Achieves state-of-the-art results on TEdBench, outperforming previous SOTA methods like Imagic.
Significantly faster fine-tuning than Imagic (30 seconds vs. 7 minutes).
Successfully tackles overfitting issues common in optimization-based editing methods. |
Editing quality can be influenced by randomness in fine-tuning and sampling.
Limited by the editing capabilities of the underlying diffusion model. |
text-guided image editing, diffusion models, overfitting, vector projection, visual storytelling |
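The vector projection idea in the Forgedit report above can be sketched as follows: split the target-prompt embedding into a component parallel to the source-prompt embedding and a component orthogonal to it, then weight the two parts separately. This is a minimal sketch under assumptions, not the released implementation. The coefficient names `alpha` (identity similarity) and `beta` (editing strength), the blending rule, and the use of flat vectors instead of per-token embedding matrices are all hypothetical choices made for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def decompose(e_tgt, e_src):
    # Project the target embedding onto the source embedding.
    coeff = dot(e_tgt, e_src) / dot(e_src, e_src)
    parallel = [coeff * s for s in e_src]
    # The remainder is orthogonal to the source embedding.
    orthogonal = [t - p for t, p in zip(e_tgt, parallel)]
    return parallel, orthogonal


def blend(e_src, e_tgt, alpha, beta):
    # alpha scales the source direction (assumed to carry identity);
    # beta scales the orthogonal direction (assumed to carry the edit).
    _, orthogonal = decompose(e_tgt, e_src)
    return [alpha * s + beta * o for s, o in zip(e_src, orthogonal)]


e_src = [1.0, 0.0, 0.0]
e_tgt = [2.0, 1.0, 0.0]
parallel, orthogonal = decompose(e_tgt, e_src)
edited = blend(e_src, e_tgt, alpha=0.8, beta=1.2)
```

Because the two components are orthogonal, changing `beta` does not alter how much of the source direction is kept, which is what allows the two controls to be tuned independently in this sketch.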
2309.10503
Report |
Steganography for Neural Radiance Fields by Backdooring |
Weina Dong, Jia Liu, Yan Ke, Lifeng Chen, Wenquan Sun, Xiaozhong Pan |
The utilization of implicit representation for visual data (such as images,
videos, and 3D models) has recently gained significant attention in computer
vision research. In this letter, we propose a novel model steganography scheme
with implicit neural representation. The message sender leverages Neural
Radiance Fields (NeRF) and its viewpoint synthesis capabilities by introducing
a viewpoint as a key. The NeRF model generates a secret viewpoint image, which
serves as a backdoor. Subsequently, we train a message extractor using
overfitting to establish a one-to-one mapping between the secret message and
the secret viewpoint image. The sender delivers the trained NeRF model and the
message extractor to the receiver over the open channel, and the receiver
utilizes the key shared by both parties to obtain the rendered image in the
secret view from the NeRF model, and then obtains the secret message through
the message extractor. The inherent complexity of the viewpoint information
prevents attackers from stealing the secret message accurately. Experimental
results demonstrate that the message extractor trained in this letter achieves
high-capacity steganography with fast performance, achieving 100\% accuracy
in message extraction. Furthermore, the extensive viewpoint key space of NeRF
ensures the security of the steganography scheme. |
This paper proposes a novel model steganography scheme utilizing Neural Radiance Fields (NeRF) and implicit neural representation. |
The scheme leverages the viewpoint synthesis capabilities of NeRF to create a backdoor for embedding and extracting secret messages, addressing the need for secure communication in the context of increasing use of deep learning models. |
The sender trains a NeRF model on a 3D scene and selects a secret viewpoint as the key. A message extractor is trained via overfitting to establish a one-to-one mapping between the secret viewpoint image and the secret message. The receiver uses the shared key and the received model to extract the message. |
The message extractor achieves 100% accuracy in message extraction from the secret viewpoint image.
The scheme demonstrates high message capacity.
Even slight deviations from the secret viewpoint render message extraction impossible, ensuring steganographic security. |
The scheme currently requires republishing the message extractor with each message, incurring security and efficiency drawbacks.
Future work includes enhancing the message extractor with additional image processing functionalities to improve steganographic robustness. |
steganography, neural radiance fields (nerf), implicit neural representation, model steganography, message extractor |
2309.10388
Report |
SideGAN: 3D-Aware Generative Model for Improved Side-View Image Synthesis |
Kyungmin Jo, Wonjoon Jin, Jaegul Choo, Hyunjoon Lee, Sunghyun Cho |
While recent 3D-aware generative models have shown photo-realistic image
synthesis with multi-view consistency, the synthesized image quality degrades
depending on the camera pose (e.g., a face with a blurry and noisy boundary at
a side viewpoint). Such degradation is mainly caused by the difficulty of
learning both pose consistency and photo-realism simultaneously from a dataset
with heavily imbalanced poses. In this paper, we propose SideGAN, a novel 3D
GAN training method to generate photo-realistic images irrespective of the
camera pose, especially for faces of side-view angles. To ease the challenging
problem of learning photo-realistic and pose-consistent image synthesis, we
split the problem into two subproblems, each of which can be solved more
easily. Specifically, we formulate the problem as a combination of two simple
discrimination problems, one of which learns to discriminate whether a
synthesized image looks real or not, and the other learns to discriminate
whether a synthesized image agrees with the camera pose. Based on this, we
propose a dual-branched discriminator with two discrimination branches. We also
propose a pose-matching loss to learn the pose consistency of 3D GANs. In
addition, we present a pose sampling strategy to increase learning
opportunities for steep angles in a pose-imbalanced dataset. With extensive
validation, we demonstrate that our approach enables 3D GANs to generate
high-quality geometries and photo-realistic images irrespective of the camera
pose. |
This paper proposes SideGAN, a 3D GAN training method to generate photo-realistic images irrespective of the camera pose, particularly for faces at side-view angles. |
Existing 3D GANs struggle to generate high-quality images at steep angles due to the difficulty of learning both pose consistency and photo-realism simultaneously from datasets with imbalanced poses (more frontal views). |
SideGAN splits the problem into two subproblems: real/fake image discrimination and pose-consistency discrimination. It introduces a dual-branched discriminator, a pose-matching loss, and an additional uniform pose sampling (AUPS) strategy to address the challenges. |
SideGAN generates high-quality images and shapes at various camera poses, outperforming baselines in terms of FID and depth accuracy.
SideGAN effectively learns to synthesize realistic details (e.g., ears) even at steep angles, unlike previous methods which produce blurry results.
Ablation studies demonstrate the benefits of each proposed component, including improved FID scores and more accurate 3D geometry. |
The model sometimes generates artifacts (black spots) behind ears, especially for animal faces.
Background separation may not be perfect despite using a background network. |
generative adversarial networks (gans), 3d-aware image synthesis, multi-view consistency, pose-controllable image generation, neural radiance fields (nerf) |
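The pose sampling strategy in the SideGAN report above can be illustrated with a toy sketch: in addition to poses drawn from a frontal-heavy dataset, some poses are drawn uniformly so that steep angles are seen more often during training. The report does not give the exact scheme, so the mixing probability `uniform_prob`, the yaw range, and the Gaussian toy dataset below are assumptions made for illustration only.

```python
import random


def sample_yaw(dataset_yaws, uniform_prob, rng, max_yaw=90.0):
    """Draw one yaw angle in degrees.

    With probability uniform_prob, draw uniformly from [-max_yaw, max_yaw];
    otherwise reuse a yaw taken from the dataset.
    """
    if rng.random() < uniform_prob:
        return rng.uniform(-max_yaw, max_yaw)
    return rng.choice(dataset_yaws)


def steep_fraction(yaws, threshold=45.0):
    # Fraction of poses whose absolute yaw exceeds the threshold.
    return sum(abs(y) > threshold for y in yaws) / len(yaws)


rng = random.Random(0)
# Toy pose-imbalanced dataset: yaws concentrated around the frontal view.
dataset_yaws = [max(-90.0, min(90.0, rng.gauss(0.0, 10.0))) for _ in range(1000)]

baseline = [sample_yaw(dataset_yaws, 0.0, rng) for _ in range(5000)]
mixed = [sample_yaw(dataset_yaws, 0.5, rng) for _ in range(5000)]
```

Comparing `steep_fraction(baseline)` with `steep_fraction(mixed)` shows how the additional uniform draws raise the share of side-view poses the generator is asked to render.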
2309.10336
Report |
Anti-Aliased Neural Implicit Surfaces with Encoding Level of Detail |
Yiyu Zhuang, Qi Zhang, Ying Feng, Hao Zhu, Yao Yao, Xiaoyu Li, Yan-Pei Cao, Ying Shan, Xun Cao |
We present LoD-NeuS, an efficient neural representation for high-frequency
geometry detail recovery and anti-aliased novel view rendering. Drawing
inspiration from voxel-based representations with the level of detail (LoD), we
introduce a multi-scale tri-plane-based scene representation that is capable of
capturing the LoD of the signed distance function (SDF) and the space radiance.
Our representation aggregates space features from a multi-convolved
featurization within a conical frustum along a ray and optimizes the LoD
feature volume through differentiable rendering. Additionally, we propose an
error-guided sampling strategy to guide the growth of the SDF during the
optimization. Both qualitative and quantitative evaluations demonstrate that
our method achieves superior surface reconstruction and photorealistic view
synthesis compared to state-of-the-art approaches. |
Presents LoD-NeuS, a novel neural implicit surface representation that leverages encoding level of detail (LoD) for high-quality surface reconstruction and anti-aliased novel view rendering. |
Addresses the limitations of existing neural implicit surface reconstruction methods in capturing fine-grained details and mitigating aliasing artifacts. |
Introduces a multi-scale tri-plane-based scene representation to capture LoD of SDF and radiance, employs multi-convolved featurization within conical frustums to approximate cone sampling, and develops an error-guided sampling strategy for SDF growth. |
Achieves superior surface reconstruction with finer details compared to state-of-the-art methods, particularly for objects with intricate geometries.
Effectively reduces aliasing artifacts in novel view rendering by accounting for pixel size and shape through cone sampling approximation.
Demonstrates computational efficiency with reduced MLP queries compared to super-sampling techniques. |
SDF growth refinement, while effective, has been tested on a limited number of cases due to computational constraints.
Extending the proposed method to dynamic scenes with time-varying geometry and appearance remains to be explored. |
neural implicit surface, signed distance function, volume rendering, anti-aliasing, level of detail |
2309.10279
Report |
360$^\circ$ Reconstruction From a Single Image Using Space Carved Outpainting |
Nuri Ryu, Minsu Gong, Geonung Kim, Joo-Haeng Lee, Sunghyun Cho |
We introduce POP3D, a novel framework that creates a full $360^\circ$-view 3D
model from a single image. POP3D resolves two prominent issues that limit the
single-view reconstruction. Firstly, POP3D offers substantial generalizability
to arbitrary categories, a trait that previous methods struggle to achieve.
Secondly, POP3D further improves reconstruction fidelity and naturalness, a
crucial aspect that concurrent works fall short of. Our approach marries the
strengths of four primary components: (1) a monocular depth and normal
predictor that serves to predict crucial geometric cues, (2) a space carving
method capable of demarcating the potentially unseen portions of the target
object, (3) a generative model pre-trained on a large-scale image dataset that
can complete unseen regions of the target, and (4) a neural implicit surface
reconstruction method tailored to reconstructing objects using RGB images along
with monocular geometric cues. The combination of these components enables
POP3D to readily generalize across various in-the-wild images and generate
state-of-the-art reconstructions, outperforming similar works by a significant
margin. Project page: http://cg.postech.ac.kr/research/POP3D |
POP3D, a novel framework for reconstructing full 360° 3D models from single images, addresses limitations in generalizability and reconstruction fidelity. |
Generating high-quality 3D models from single images is crucial for various applications but remains challenging due to limited generalizability and fidelity in existing methods. |
POP3D leverages pre-trained priors for depth, normals, and image generation to progressively outpaint unseen regions. It uses a camera schedule to capture 360° views and refines a neural implicit surface representation using the generated pseudo-ground-truth data. |
Outperforms existing methods in input-view reconstruction fidelity.
Generates semantically similar and high-quality novel views compared to ground truth.
Produces high-fidelity 3D shapes and appearances surpassing alternative approaches. |
Performance relies on the accuracy of off-the-shelf priors used in the pipeline.
Reconstruction time can be long due to iterative nature and reliance on 3D model training. |
single-view 3d reconstruction, shape and appearance reconstruction, novel-view synthesis, space carving, outpainting |
2309.10206
Report |
Image-Text Pre-Training for Logo Recognition |
Mark Hubenthal, Suren Kumar |
Open-set logo recognition is commonly solved by first detecting possible logo
regions and then matching the detected parts against an ever-evolving dataset
of cropped logo images. The matching model, a metric learning problem, is
especially challenging for logo recognition due to the mixture of text and
symbols in logos. We propose two novel contributions to improve the matching
model's performance: (a) using image-text paired samples for pre-training, and
(b) an improved metric learning loss function. A standard paradigm of
fine-tuning ImageNet pre-trained models fails to discover the text sensitivity
necessary to solve the matching problem effectively. This work demonstrates the
importance of pre-training on image-text pairs, which significantly improves
the performance of a visual embedder trained for the logo retrieval task,
especially for more text-dominant classes. We construct a composite public logo
dataset combining LogoDet3K, OpenLogo, and FlickrLogos-47, dubbed
OpenLogoDet3K47. We show that the same vision backbone pre-trained on
image-text data, when fine-tuned on OpenLogoDet3K47, achieves $98.6\%$
recall@1, significantly improving performance over pre-training on ImageNet1K
($97.6\%$). We generalize the ProxyNCA++ loss function to propose ProxyNCAHN++
which incorporates class-specific hard negative images. The proposed method
sets new state-of-the-art on five public logo datasets considered, with a
$3.5\%$ zero-shot recall@1 improvement on LogoDet3K test, $4\%$ on OpenLogo,
$6.5\%$ on FlickrLogos-47, $6.2\%$ on Logos In The Wild, and $0.6\%$ on
BelgaLogo. |
The paper introduces a novel approach for open-set logo recognition by utilizing image-text pre-training for improved text sensitivity in logo matching models and proposes a new metric learning loss function, ProxyNCAHN++, for enhanced class separation. |
Open-set logo recognition is crucial for various applications but challenging due to the evolving nature of logo designs and the presence of text and symbols. Existing methods often struggle with text sensitivity, leading to inaccurate matching. |
The authors leverage image-text paired data for pre-training a vision backbone, enabling it to develop inherent OCR capabilities. They also introduce ProxyNCAHN++, a metric learning loss function that incorporates hard negative sampling to refine class boundaries in the embedding space. |
Image-text pre-trained models significantly outperform ImageNet pre-trained models, achieving up to a 6.5% improvement in recall@1 on various public logo datasets.
The proposed ProxyNCAHN++ loss function further enhances class separation, leading to a 0.1% improvement in recall@1 on the OpenLogoDet3K47 dataset.
The study demonstrates the effectiveness of image-text pre-training in improving text sensitivity for logo recognition, particularly for text-dominant logo classes. |
The impact of logo detector accuracy on the overall system performance requires further investigation.
Challenges such as poorly aligned bounding boxes, blurry logo regions, and stylized text need to be addressed in future work. |
logo recognition, open-set recognition, metric learning, image-text pre-training, hard negative mining |
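The recall@1 figures quoted in the report above measure how often the nearest gallery embedding to a query has the query's class. A minimal sketch of that metric, assuming cosine similarity and a separate query and gallery set, is shown below; the function names and the toy embeddings are hypothetical and are not taken from the paper's evaluation code.

```python
import math


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def recall_at_1(query_embs, query_labels, gallery_embs, gallery_labels):
    # A query counts as a hit when its most similar gallery item
    # carries the same class label.
    hits = 0
    for q, label in zip(query_embs, query_labels):
        best = max(range(len(gallery_embs)), key=lambda i: cosine(q, gallery_embs[i]))
        hits += gallery_labels[best] == label
    return hits / len(query_embs)


gallery_embs = [[1.0, 0.0], [0.0, 1.0]]
gallery_labels = ["brand_a", "brand_b"]
query_embs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
query_labels = ["brand_a", "brand_b", "brand_b"]

score = recall_at_1(query_embs, query_labels, gallery_embs, gallery_labels)
```

In this toy case the third query lies closer to the `brand_a` gallery item than to its own class, so two of the three queries are retrieved correctly.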
2309.09858
Report |
Unsupervised Open-Vocabulary Object Localization in Videos |
Ke Fan, Zechen Bai, Tianjun Xiao, Dominik Zietlow, Max Horn, Zixu Zhao, Carl-Johann Simon-Gabriel, Mike Zheng Shou, Francesco Locatello, Bernt Schiele, Thomas Brox, Zheng Zhang, Yanwei Fu, Tong He |
In this paper, we show that recent advances in video representation learning
and pre-trained vision-language models allow for substantial improvements in
self-supervised video object localization. We propose a method that first
localizes objects in videos via a slot attention approach and then assigns text
to the obtained slots. The latter is achieved by reading localized semantic
information from the pre-trained CLIP model in an unsupervised way. The resulting
video object localization is entirely unsupervised apart from the implicit
annotation contained in CLIP, and it is effectively the first unsupervised
approach that yields good results on regular video benchmarks. |
This paper introduces the first unsupervised approach for localizing and naming objects in real-world videos, leveraging slot attention and a modified CLIP model for local feature alignment. |
This work is important because it bypasses the need for expensive manual annotation in video object localization, paving the way for self-supervised video understanding. |
The method uses a three-part pipeline: 1) Video slot learning with a self-supervised video encoder and slot attention for spatiotemporal object localization. 2) Video slot labeling by adapting CLIP for local feature alignment and assigning text labels to the localized slots. 3) Post-processing to merge slots and improve localization and labeling using both visual and semantic information. |
The approach achieves high-quality and spatio-temporally consistent object localization in real-world videos without any labeled training data.
The proposed patch-based CLIP adaptation effectively aligns CLIP with local features for improved semantic labeling.
Joint optimization using both text and image features significantly improves both localization and labeling performance, as demonstrated by quantitative results and qualitative examples. |
The current method struggles to differentiate between individual object instances.
The resolution of patch tokens limits the precision of object localization. |
unsupervised learning, object localization, video understanding, slot attention, clip |
2309.09818
Report |
Grasp-Anything: Large-scale Grasp Dataset from Foundation Models |
An Dinh Vuong, Minh Nhat Vu, Hieu Le, Baoru Huang, Binh Huynh, Thieu Vo, Andreas Kugi, Anh Nguyen |
Foundation models such as ChatGPT have made significant strides in robotic
tasks due to their universal representation of real-world domains. In this
paper, we leverage foundation models to tackle grasp detection, a persistent
challenge in robotics with broad industrial applications. Despite numerous
grasp datasets, their object diversity remains limited compared to real-world
figures. Fortunately, foundation models possess an extensive repository of
real-world knowledge, including objects we encounter in our daily lives. As a
consequence, a promising solution to the limited representation in previous
grasp datasets is to harness the universal knowledge embedded in these
foundation models. We present Grasp-Anything, a new large-scale grasp dataset
synthesized from foundation models to implement this solution. Grasp-Anything
excels in diversity and magnitude, boasting 1M samples with text descriptions
and more than 3M objects, surpassing prior datasets. Empirically, we show that
Grasp-Anything successfully facilitates zero-shot grasp detection on
vision-based tasks and real-world robotic experiments. Our dataset and code are
available at https://grasp-anything-2023.github.io. |
The paper introduces Grasp-Anything, a new large-scale language-driven dataset for robotic grasp detection, leveraging foundation models to overcome object diversity limitations in previous datasets. |
Existing grasp datasets are limited in object diversity and real-world scene representation, hindering robust generalization in grasp detection. This dataset addresses these limitations using the knowledge embedded in foundation models. |
The dataset is generated in a multi-stage process involving: 1) Prompt engineering with ChatGPT to create diverse scene descriptions. 2) Image synthesis using Stable Diffusion based on the generated text prompts. 3) Automatic grasp pose annotation and evaluation using a pre-trained model (RAGT-3/3) and a physics-based evaluation method. |
Grasp-Anything significantly surpasses previous datasets in diversity and magnitude, containing 1 million samples with text descriptions and over 3 million objects.
Zero-shot grasp detection experiments demonstrate that Grasp-Anything effectively supports generalization to unseen objects and outperforms other datasets in cross-dataset transfer learning.
Real-world robot experiments confirm the effectiveness of Grasp-Anything, achieving higher grasp success rates compared to models trained on other datasets. |
Dataset creation is time-consuming and relies on access to commercial APIs like ChatGPT.
The dataset currently lacks 3D point clouds, which could enhance its applicability in robotic tasks. Future work could explore generating point clouds from the existing data. |
grasp detection, robotics, dataset, foundation models, zero-shot learning |
2309.09724
Report |
Robust Geometry-Preserving Depth Estimation Using Differentiable Rendering |
Chi Zhang, Wei Yin, Gang Yu, Zhibin Wang, Tao Chen, Bin Fu, Joey Tianyi Zhou, Chunhua Shen |
In this study, we address the challenge of 3D scene structure recovery from
monocular depth estimation. While traditional depth estimation methods leverage
labeled datasets to directly predict absolute depth, recent advancements
advocate for mix-dataset training, enhancing generalization across diverse
scenes. However, such mixed dataset training yields depth predictions only up
to an unknown scale and shift, hindering accurate 3D reconstructions. Existing
solutions necessitate extra 3D datasets or geometry-complete depth annotations,
constraints that limit their versatility. In this paper, we propose a learning
framework that trains models to predict geometry-preserving depth without
requiring extra data or annotations. To produce realistic 3D structures, we
render novel views of the reconstructed scenes and design loss functions to
promote depth estimation consistency across different views. Comprehensive
experiments underscore our framework's superior generalization capabilities,
surpassing existing state-of-the-art methods on several benchmark datasets
without leveraging extra training information. Moreover, our innovative loss
functions empower the model to autonomously recover domain-specific
scale-and-shift coefficients using solely unlabeled images. |
This paper proposes a novel depth estimation learning framework that produces geometry-preserving depth for 3D scene recovery without requiring extra data or annotations. |
Existing mix-dataset training methods for depth estimation, while offering strong generalization, produce depth predictions with unknown scale and shift, hindering accurate 3D reconstruction. This paper addresses this challenge to enable robust 3D scene recovery from monocular images. |
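To make the scale-and-shift ambiguity concrete, the sketch below fits the two unknown coefficients by ordinary least squares against a reference depth. This is only an illustration of the ambiguity: it relies on reference depth values, whereas the paper recovers the coefficients from unlabeled images through its consistency losses. The function name `fit_scale_shift` and the sample numbers are hypothetical.

```python
def fit_scale_shift(pred, ref):
    """Least-squares s, t minimizing sum((s * pred_i + t - ref_i) ** 2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_r = sum(ref) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    s = cov / var_p  # assumes the prediction is not constant
    t = mean_r - s * mean_p
    return s, t


# Hypothetical affine-invariant prediction: reference depth = 2.0 * pred + 0.5
pred = [0.1, 0.4, 0.7, 1.0]
ref = [2.0 * p + 0.5 for p in pred]
s, t = fit_scale_shift(pred, ref)
```

Until `s` and `t` are known, unprojecting `pred` into a point cloud places points at the wrong distances, and an uncorrected shift in particular distorts the reconstructed shape.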
The proposed framework leverages differentiable rendering to generate novel views of the 3D scene reconstructed from the predicted depth. It then enforces consistency between depth predictions of the original and rendered views using novel loss functions. |
The method outperforms existing geometry-preserving depth estimation techniques without using additional data or annotations.
The proposed consistency loss enables self-supervised recovery of domain-specific scale and shift coefficients for pre-trained models.
The framework can also estimate camera intrinsic parameters, such as focal length, by minimizing the proposed consistency losses over a range of possible values. |
The focal length needs to be estimated for point cloud reconstruction when not available.
Future work could explore extending the framework to handle dynamic scenes. |
depth estimation, 3d reconstruction, differentiable rendering, self-supervised learning, multi-view consistency |
2309.09614
Report |
Gradpaint: Gradient-Guided Inpainting with Diffusion Models |
Asya Grechka, Guillaume Couairon, Matthieu Cord |
Denoising Diffusion Probabilistic Models (DDPMs) have recently achieved
remarkable results in conditional and unconditional image generation. The
pre-trained models can be adapted without further training to different
downstream tasks, by guiding their iterative denoising process at inference
time to satisfy additional constraints. For the specific task of image
inpainting, the current guiding mechanism relies on copying-and-pasting the
known regions from the input image at each denoising step. However, diffusion
models are strongly conditioned by the initial random noise, and therefore
struggle to harmonize predictions inside the inpainting mask with the real
parts of the input image, often producing results with unnatural artifacts.
Our method, dubbed GradPaint, steers the generation towards a globally
coherent image. At each step in the denoising process, we leverage the model's
"denoised image estimation" by calculating a custom loss measuring its
coherence with the masked input image. Our guiding mechanism uses the gradient
obtained from backpropagating this loss through the diffusion model itself.
GradPaint generalizes well to diffusion models trained on various datasets,
improving upon current state-of-the-art supervised and unsupervised methods. |
Presents GradPaint, a training-free algorithm that guides diffusion models for image inpainting by harmonizing generated content with the known context through gradient descent. |
Leveraging pre-trained diffusion models for inpainting without retraining is highly desirable because training such models is computationally expensive. Existing training-free methods struggle to harmonize generated regions with the known image content, leading to unnatural artifacts. |
Introduces a novel gradient-based update mechanism during the diffusion denoising process. Uses a custom loss combining masked MSE and an alignment loss to ensure smooth transitions between generated and real regions. Backpropagates the loss through the diffusion model itself to guide the generation at each step. |
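The gradient-guided update can be illustrated with a toy one-dimensional example. This is a minimal sketch under strong assumptions, not the authors' implementation: the diffusion network is replaced by a hypothetical linear blur (`toy_denoiser`) so that the gradient can be written by hand, only the masked MSE term is used (the alignment loss is omitted), and the mask is taken to be 1 on known pixels.

```python
def toy_denoiser(x):
    """Hypothetical linear 'denoised image estimate': a symmetric 3-tap blur."""
    n = len(x)
    return [
        0.5 * x[i]
        + 0.25 * (x[i - 1] if i > 0 else 0.0)
        + 0.25 * (x[i + 1] if i < n - 1 else 0.0)
        for i in range(n)
    ]


def masked_mse(x0_hat, known, mask):
    """Squared error counted only where mask == 1 (known pixels)."""
    return sum(m * (a - b) ** 2 for a, b, m in zip(x0_hat, known, mask))


def guided_step(x_t, known, mask, lr=0.5):
    """One gradient step on x_t for the masked MSE of the denoised estimate."""
    x0_hat = toy_denoiser(x_t)
    residual = [m * (a - b) for a, b, m in zip(x0_hat, known, mask)]
    # The blur is symmetric, so backpropagating through it applies it again.
    grad = [2.0 * g for g in toy_denoiser(residual)]
    return [x - lr * g for x, g in zip(x_t, grad)]


known = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]  # values are only trusted where mask == 1
mask = [1, 1, 1, 0, 0, 0]
x_t = [0.0] * 6
loss_before = masked_mse(toy_denoiser(x_t), known, mask)
x_next = guided_step(x_t, known, mask)
loss_after = masked_mse(toy_denoiser(x_next), known, mask)
```

Because the loss is differentiated through the denoiser, the update also moves the pixel just inside the hole (`x_next[3]`), which is the harmonizing effect that plain copy-and-paste guidance lacks. In the actual method this gradient would come from automatic differentiation through the diffusion model.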
Achieves state-of-the-art FID scores on FFHQ, outperforming training-based and training-free inpainting methods.
Significantly improves harmonization between generated and real regions, resulting in more natural and coherent inpainted images.
Generalizes well to various datasets (CelebA-HQ, FFHQ, ImageNet, Places2) and pre-trained diffusion models, including latent diffusion models (Stable Diffusion). |
Computational cost is higher than gradient-free baselines, although early stopping can mitigate this.
May introduce unintended bias from the background context into the generated regions in certain cases. |
image inpainting, diffusion models, generative models, gradient-based optimization, zero-shot learning |
2309.09466
Report |
Progressive Text-to-Image Diffusion with Soft Latent Direction |
YuTeng Ye, Jiale Cai, Hang Zhou, Guanwen Li, Youjia Zhang, Zikai Song, Chenxing Gao, Junqing Yu, Wei Yang |
In spite of the rapidly evolving landscape of text-to-image generation, the
synthesis and manipulation of multiple entities while adhering to specific
relational constraints pose enduring challenges. This paper introduces an
innovative progressive synthesis and editing operation that systematically
incorporates entities into the target image, ensuring their adherence to
spatial and relational constraints at each sequential step. Our key insight
stems from the observation that while a pre-trained text-to-image diffusion
model adeptly handles one or two entities, it often falters when dealing with a
greater number. To address this limitation, we propose harnessing the
capabilities of a Large Language Model (LLM) to decompose intricate and
protracted text descriptions into coherent directives adhering to stringent
formats. To facilitate the execution of directives involving distinct semantic
operations (namely insertion, editing, and erasing), we formulate the Stimulus,
Response, and Fusion (SRF) framework. Within this framework, latent regions are
gently stimulated in alignment with each operation, followed by the fusion of
the responsive latent components to achieve cohesive entity manipulation. Our
proposed framework yields notable advancements in object synthesis,
particularly when confronted with intricate and lengthy textual inputs.
Consequently, it establishes a new benchmark for text-to-image generation
tasks, further elevating the field's performance standards. |
This paper introduces a novel progressive text-to-image diffusion framework that uses a Large Language Model (LLM) to decompose complex text descriptions into a sequence of short prompts, enabling the synthesis and manipulation of multiple entities with specific spatial and relational constraints. |
Existing text-to-image generation methods struggle with synthesizing or editing images with multiple objects and complex relationships described in lengthy textual prompts. This work aims to improve the accuracy and controllability of multi-entity image generation. |
The proposed framework consists of three main components: 1) Text Decomposition using a fine-tuned GPT model to break down complex text into short, structured prompts. 2) Stimulus & Response, where a stimulus loss function guides the cross-attention map of a diffusion model to focus on relevant spatial regions for object generation. 3) Latent Fusion, which seamlessly blends the generated object features with the background image from the previous stage. |
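The Latent Fusion step can be pictured as a mask-weighted blend of two latents. The sketch below is an assumption about the general form of such a blend, not the paper's exact formulation; `fuse_latents`, the flat-list latent layout, and the soft mask values are hypothetical.

```python
def fuse_latents(z_object, z_background, mask):
    """Blend two latents element-wise; mask values lie in [0, 1].

    mask == 1 keeps the newly generated object latent,
    mask == 0 keeps the background latent from the previous stage.
    """
    return [m * o + (1.0 - m) * b for o, b, m in zip(z_object, z_background, mask)]


fused = fuse_latents([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [1.0, 0.5, 0.0])
```

A soft mask (values strictly between 0 and 1 near the object boundary) avoids a hard seam between the inserted entity and the existing scene.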
The proposed method outperforms single-stage generation baselines in object recall and relation accuracy, demonstrating its ability to handle complex scenes with multiple objects and relationships.
Compared to other progressive generation methods, this approach achieves superior image fidelity and controllability in synthesizing, editing, and erasing objects according to textual instructions.
Ablation studies confirm the importance of both Stimulus & Response and Latent Fusion components in achieving accurate and consistent image generation. |
The text decomposition method may not be effective for all types of complex sentences, particularly those with deeply nested clauses and relationships.
Future work includes improving the parsing capabilities of the GPT model for handling a wider variety of complex text descriptions. |
text-to-image generation, diffusion models, large language models, progressive synthesis, image manipulation |
2309.09456
Report |
Object2Scene: Putting Objects in Context for Open-Vocabulary 3D Detection |
Chenming Zhu, Wenwei Zhang, Tai Wang, Xihui Liu, Kai Chen |
Point cloud-based open-vocabulary 3D object detection aims to detect 3D
categories that do not have ground-truth annotations in the training set. It is
extremely challenging because of the limited data and annotations (bounding
boxes with class labels or text descriptions) of 3D scenes. Previous approaches
leverage large-scale richly-annotated image datasets as a bridge between 3D and
category semantics but require an extra alignment process between 2D images and
3D points, limiting the open-vocabulary ability of 3D detectors. Instead of
leveraging 2D images, we propose Object2Scene, the first approach that
leverages large-scale large-vocabulary 3D object datasets to augment existing
3D scene datasets for open-vocabulary 3D object detection. Object2Scene inserts
objects from different sources into 3D scenes to enrich the vocabulary of 3D
scene datasets and generates text descriptions for the newly inserted objects.
We further introduce a framework that unifies 3D detection and visual
grounding, named L3Det, and propose a cross-domain category-level contrastive
learning approach to mitigate the domain gap between 3D objects from different
datasets. Extensive experiments on existing open-vocabulary 3D object detection
benchmarks show that Object2Scene obtains superior performance over existing
methods. We further verify the effectiveness of Object2Scene on a new benchmark
OV-ScanNet-200, by holding out all rare categories as novel categories not seen
during training. |
This paper presents Object2Scene, a novel approach that leverages large-scale 3D object datasets to enrich existing 3D scene datasets, enabling open-vocabulary 3D object detection. |
Open-vocabulary 3D object detection is crucial for real-world applications but challenging due to the limited annotations in 3D scenes. This work addresses this by leveraging readily available 3D object datasets. |
Object2Scene inserts objects from large-vocabulary 3D object datasets into 3D scenes and generates language grounding prompts to guide model training. It introduces a unified model, L3Det, for 3D detection and grounding, incorporating cross-domain contrastive learning to mitigate domain gaps. |
Object2Scene achieves state-of-the-art performance, outperforming previous methods by significant margins on OV-ScanNet20 and OV-SUN RGB-D20 benchmarks.
Relative location prompts, generated by Object2Scene, prove to be highly effective for open-vocabulary 3D detection.
Cross-domain category-level contrastive learning effectively reduces the domain gap between inserted objects and original scenes, improving performance. |
The model may struggle with objects that have significant variations in point cloud distributions, such as chairs tucked under tables.
Future work will explore better ways to align point cloud distributions from different sources and address challenging cases. |
open-vocabulary learning, 3d object detection, 3d visual grounding, point cloud processing, domain adaptation |
2309.09256
Report |
LiDAR Data Synthesis with Denoising Diffusion Probabilistic Models |
Kazuto Nakashima, Ryo Kurazume |
Generative modeling of 3D LiDAR data is an emerging task with promising
applications for autonomous mobile robots, such as scalable simulation, scene
manipulation, and sparse-to-dense completion of LiDAR point clouds. While
existing approaches have demonstrated the feasibility of image-based LiDAR data
generation using deep generative models, they still struggle with fidelity and
training stability. In this work, we present R2DM, a novel generative model for
LiDAR data that can generate diverse and high-fidelity 3D scene point clouds
based on the image representation of range and reflectance intensity. Our
method is built upon denoising diffusion probabilistic models (DDPMs), which
have shown impressive results among generative model frameworks in recent
years. To effectively train DDPMs in the LiDAR domain, we first conduct an
in-depth analysis of data representation, loss functions, and spatial inductive
biases. Leveraging our R2DM model, we also introduce a flexible LiDAR
completion pipeline based on the powerful capabilities of DDPMs. We demonstrate
that our method surpasses existing methods in generating tasks on the KITTI-360
and KITTI-Raw datasets, as well as in the completion task on the KITTI-360
dataset. Our project page can be found at https://kazuto1011.github.io/r2dm. |
This paper presents R2DM, a novel denoising diffusion probabilistic model for generating realistic LiDAR range and reflectance images, and demonstrates its effectiveness for LiDAR point cloud completion. |
Generative modeling of LiDAR point clouds is crucial for applications like autonomous driving, enabling scalable simulation, scene manipulation, and sparse-to-dense completion of LiDAR data. |
The authors investigate various aspects of DDPM design for LiDAR data, including loss functions, data representation, and spatial inductive bias. They find that using Fourier features for positional encoding significantly improves generation quality. They also integrate R2DM with the RePaint method for LiDAR completion tasks. |
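A Fourier-feature positional encoding of the kind mentioned above can be sketched as follows. The octave-spaced frequencies, the default number of bands, and the assumption that the input is a normalized coordinate in [0, 1] are illustrative choices, not details confirmed by the summary.

```python
import math


def fourier_features(coord, num_freqs=4):
    """Encode a scalar coordinate as sin/cos pairs at octave frequencies."""
    feats = []
    for k in range(num_freqs):
        angle = (2.0 ** k) * math.pi * coord
        feats.append(math.sin(angle))
        feats.append(math.cos(angle))
    return feats


# Encode one hypothetical normalized pixel coordinate of a range image.
encoded = fourier_features(0.25, num_freqs=4)
```

Each pixel coordinate of the range image would be encoded this way and supplied to the network as extra input channels, giving it an explicit notion of where in the scan a value lies.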
R2DM achieves state-of-the-art generation performance on KITTI-360 and KITTI-Raw datasets, outperforming previous GAN-based and diffusion-based methods.
Fourier features as spatial inductive bias are found to be crucial for generating high-fidelity LiDAR point clouds.
The proposed R2DM-based completion pipeline outperforms baseline methods on beam-level upsampling tasks, demonstrating its effectiveness for LiDAR data completion. |
The paper mainly focuses on relatively clean and downsampled LiDAR data, and future work could explore noise-robust training and handling full-resolution point clouds.
Further investigation is needed to explore the scalability of the model and its applications to downstream perception tasks. |
lidar, generative model, diffusion model, point cloud completion, autonomous driving |
2309.08957
Report |
ExBluRF: Efficient Radiance Fields for Extreme Motion Blurred Images |
Dongwoo Lee, Jeongtaek Oh, Jaesung Rim, Sunghyun Cho, Kyoung Mu Lee |
We present ExBluRF, a novel view synthesis method for extreme motion blurred
images based on efficient radiance fields optimization. Our approach consists
of two main components: 6-DOF camera trajectory-based motion blur formulation
and voxel-based radiance fields. From extremely blurred images, we optimize the
sharp radiance fields by jointly estimating the camera trajectories that
generate the blurry images. In training, multiple rays along the camera
trajectory are accumulated to reconstruct a single blurry color, which is
equivalent to the physical motion blur operation. We minimize the
photo-consistency loss on blurred image space and obtain the sharp radiance
fields with camera trajectories that explain the blur of all images. The joint
optimization on the blurred image space demands painfully increasing
computation and resources proportional to the blur size. Our method solves this
problem by replacing the MLP-based framework with low-dimensional 6-DOF camera
poses and voxel-based radiance fields. Compared with the existing works, our
approach restores much sharper 3D scenes from challenging motion blurred views
with on the order of 10 times less training time and GPU memory consumption. |
This paper presents ExBluRF, an efficient radiance field method for synthesizing sharp novel views from images with extreme motion blur. |
Existing NeRF-based methods struggle with extreme motion blur due to shape-radiance ambiguity and computational limitations. |
Models motion blur with a 6-DOF camera trajectory parameterized by Bézier curves and represents the scene with voxel-based radiance fields for efficiency. |
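The blur formulation can be sketched in a few lines: sample camera poses along a Bézier curve and average the sharp colors rendered at those poses. This is a simplified illustration, not the authors' code. It covers only the translational part of the trajectory (the rotational part of the 6-DOF pose is omitted), and `render_sharp` is a hypothetical stand-in for rendering a pixel from the voxel-based radiance field.

```python
def bezier_point(control_points, u):
    """Evaluate a Bezier curve at u in [0, 1] with de Casteljau's algorithm."""
    pts = [list(p) for p in control_points]
    while len(pts) > 1:
        pts = [
            [(1.0 - u) * a + u * b for a, b in zip(p0, p1)]
            for p0, p1 in zip(pts[:-1], pts[1:])
        ]
    return pts[0]


def blurry_color(render_sharp, control_points, num_samples=5):
    """Average sharp colors rendered at poses sampled along the trajectory."""
    total = None
    for i in range(num_samples):
        u = i / (num_samples - 1)
        color = render_sharp(bezier_point(control_points, u))
        total = color if total is None else [a + b for a, b in zip(total, color)]
    return [c / num_samples for c in total]


def render_sharp(position):
    """Hypothetical renderer: color depends only on the camera x-coordinate."""
    return [position[0], 0.0, 1.0 - position[0]]


controls = [(0.0, 0.0, 0.0), (0.5, 1.0, 0.0), (1.0, 0.0, 0.0)]
blurred = blurry_color(render_sharp, controls)
```

During training, the control points and the radiance field would be optimized jointly so that `blurred` matches the observed blurry pixel. A higher-order curve adds control points, which is where the overfitting risk noted in the limitations comes from.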
Achieves superior deblurring and novel view synthesis compared to previous methods on both real and synthetic datasets.
Significantly reduces memory consumption and training time, enabling scalability to extreme blur.
Estimated camera trajectories accurately converge to ground truth trajectories without additional supervision. |
Potential overfitting with unnecessarily high-order Bézier curves.
Reliance on accurate camera pose estimation for evaluation on real datasets with limited ground truth. |
neural radiance fields, motion deblurring, novel view synthesis, voxel-based radiance fields, camera trajectory estimation |
2309.08826
Report |
Dual-Camera Joint Deblurring-Denoising |
Shayan Shekarforoush, Amanpreet Walia, Marcus A. Brubaker, Konstantinos G. Derpanis, Alex Levinshtein |
Recent image enhancement methods have shown the advantages of using a pair of
long and short-exposure images for low-light photography. These image
modalities offer complementary strengths and weaknesses. The former yields an
image that is clean but blurry due to camera or object motion, whereas the
latter is sharp but noisy due to low photon count. Motivated by the fact that
modern smartphones come equipped with multiple rear-facing camera sensors, we
propose a novel dual-camera method for obtaining a high-quality image. Our
method uses a synchronized burst of short exposure images captured by one
camera and a long exposure image simultaneously captured by another. Having a
synchronized short exposure burst alongside the long exposure image enables us
to (i) obtain better denoising by using a burst instead of a single image, (ii)
recover motion from the burst and use it for motion-aware deblurring of the
long exposure image, and (iii) fuse the two results to further enhance quality.
Our method is able to achieve state-of-the-art results on synthetic dual-camera
images from the GoPro dataset with five times fewer training parameters
compared to the next best method. We also show that our method qualitatively
outperforms competing approaches on real synchronized dual-camera captures. |
This paper introduces a novel dual-camera method for enhancing image quality, leveraging a synchronized burst of short exposure images from one camera and a long exposure image from another. |
This approach overcomes limitations of single-image restoration by utilizing complementary information from both short and long exposure modalities. |
The method employs a flow-guided deblurring network to remove motion blur from the long exposure image based on optical flow estimated from the burst. It also incorporates a burst denoising module to produce a clean image from the short exposures. Finally, a fusion module combines features from both deblurred and denoised outputs. |
The method achieves state-of-the-art results on synthetic dual-camera images, surpassing previous joint deblurring-denoising approaches.
It outperforms single-task baselines, demonstrating the effectiveness of combining deblurring and denoising.
Qualitative evaluations on real dual-camera captures show superior performance compared to competing methods. |
The current implementation assumes relative rigidity between cameras, limiting its applicability in scenarios with significant camera motion.
Deployment on smartphones with limited computational resources requires further optimization and engineering. |
image deblurring, image denoising, dual-camera systems, burst processing, optical flow |
2309.08816
Report |
EgoObjects: A Large-Scale Egocentric Dataset for Fine-Grained Object Understanding |
Chenchen Zhu, Fanyi Xiao, Andres Alvarado, Yasmine Babaei, Jiabo Hu, Hichem El-Mohri, Sean Chang Culatana, Roshan Sumbaly, Zhicheng Yan |
Object understanding in egocentric visual data is arguably a fundamental
research topic in egocentric vision. However, existing object datasets are
either non-egocentric or have limitations in object categories, visual content,
and annotation granularities. In this work, we introduce EgoObjects, a
large-scale egocentric dataset for fine-grained object understanding. Its Pilot
version contains over 9K videos collected by 250 participants from 50+
countries using 4 wearable devices, and over 650K object annotations from 368
object categories. Unlike prior datasets containing only object category
labels, EgoObjects also annotates each object with an instance-level
identifier, and includes over 14K unique object instances. EgoObjects was
designed to capture the same object under diverse background complexities,
surrounding objects, distance, lighting and camera motion. In parallel to the
data collection, we conducted data annotation by developing a multi-stage
federated annotation process to accommodate the growing nature of the dataset.
To bootstrap the research on EgoObjects, we present a suite of 4 benchmark
tasks around egocentric object understanding, including novel instance-level
and classical category-level object detection. Moreover, we also
introduce 2 novel continual learning object detection tasks. The dataset and
API are available at https://github.com/facebookresearch/EgoObjects. |
EgoObjects, a large-scale egocentric video dataset for fine-grained object understanding, is introduced. It contains over 9K videos, 650K object annotations from 368 categories, and 14K unique object instances captured under diverse conditions (background, lighting, distance, camera motion). |
Existing object datasets are limited in their suitability for egocentric object understanding due to factors like non-egocentric viewpoints, limited object categories, lack of visual content variations, and coarse annotation granularities. |
Data is collected by participants worldwide using various wearable devices. A multi-stage federated annotation process ensures rich annotations including bounding boxes, category labels, and instance IDs. |
A target-aware instance detection model significantly outperforms the target-agnostic baseline.
Continual learning benchmarks for object detection at both instance and category levels are established.
Category-level object detection on EgoObjects presents unique challenges compared to existing exocentric datasets. |
Current version is a pilot release representing 10% of the full dataset.
Continual learning models require further research to address scalability and architecture limitations. |
egocentric vision, object detection, instance-level detection, continual learning, dataset |
2309.08586
Report |
Replacing softmax with ReLU in Vision Transformers |
Mitchell Wortsman, Jaehoon Lee, Justin Gilmer, Simon Kornblith |
Previous research observed accuracy degradation when replacing the attention
softmax with a point-wise activation such as ReLU. In the context of vision
transformers, we find that this degradation is mitigated when dividing by
sequence length. Our experiments training small to large vision transformers on
ImageNet-21k indicate that ReLU-attention can approach or match the performance
of softmax-attention in terms of scaling behavior as a function of compute. |
This paper demonstrates that replacing the softmax function in the attention mechanism of Vision Transformers with ReLU, when divided by the sequence length, can achieve comparable performance to traditional softmax attention in terms of scaling behavior with compute. |
The softmax operation in traditional attention is computationally expensive and difficult to parallelize. This research offers a more efficient alternative using ReLU that is easier to parallelize and may lead to faster training and inference. |
The authors conducted experiments replacing the softmax in the attention mechanism with ReLU divided by the sequence length. They trained Vision Transformers of various sizes on ImageNet-21k and compared their performance to models using traditional softmax attention. Additionally, they explored the effects of different sequence length scaling factors, alternative activation functions, qk-layernorm, and the addition of a gated unit. |
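The replacement described above can be sketched for a single attention head in plain Python. This is an illustrative sketch rather than the authors' implementation: the `1/sqrt(d)` score scaling is carried over from standard scaled dot-product attention as an assumption, and `relu_attention` is a hypothetical name.

```python
import math


def relu_attention(q, k, v):
    """Single-head attention with softmax replaced by ReLU / seq_len.

    q, k, v are lists of vectors (lists of floats); k and v have equal length.
    """
    seq_len = len(k)
    d = len(q[0])
    out = []
    for qi in q:
        # Scaled dot-product scores against every key.
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        # Point-wise ReLU, then divide by sequence length instead of normalizing.
        weights = [max(0.0, s) / seq_len for s in scores]
        out.append(
            [sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(len(v[0]))]
        )
    return out


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[2.0, 0.0], [0.0, 4.0]]
out = relu_attention(q, k, v)
```

Unlike softmax, the weights for one query need not sum to 1, and computing them requires no reduction over the sequence axis, since the divisor is the known sequence length. That is the property the summary points to when it describes the operation as easier to parallelize.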
ReLU-attention, when scaled by the inverse of the sequence length, exhibits similar scaling behavior to softmax-attention in terms of compute for Vision Transformers trained on ImageNet-21k.
Scaling the activation function by a factor involving sequence length is crucial for achieving high accuracy with ReLU-attention.
Using qk-layernorm and adding a gated attention unit did not significantly impact the performance of ReLU-attention with sequence length scaling. |
The theoretical reasoning behind the effectiveness of the sequence length scaling factor remains unclear and needs further investigation.
Future research could explore the performance of ReLU-attention with a learnable sequence length scaling factor and investigate other potentially more effective activation functions. |
vision transformer, attention mechanism, relu, softmax, sequence length scaling |
2309.08523
Report |
Breathing New Life into 3D Assets with Generative Repainting |
Tianfu Wang, Menelaos Kanakis, Konrad Schindler, Luc Van Gool, Anton Obukhov |
Diffusion-based text-to-image models have attracted immense attention from the
vision community, artists, and content creators. Broad adoption of these models
is due to significant improvement in the quality of generations and efficient
conditioning on various modalities, not just text. However, lifting the rich
generative priors of these 2D models into 3D is challenging. Recent works have
proposed various pipelines powered by the entanglement of diffusion models and
neural fields. We explore the power of pretrained 2D diffusion models and
standard 3D neural radiance fields as independent, standalone tools and
demonstrate their ability to work together in a non-learned fashion. Such
modularity has the intrinsic advantage of eased partial upgrades, which became
an important property in such a fast-paced domain. Our pipeline accepts any
legacy renderable geometry, such as textured or untextured meshes, orchestrates
the interaction between 2D generative refinement and 3D consistency enforcement
tools, and outputs a painted input geometry in several formats. We conduct a
large-scale study on a wide range of objects and categories from the
ShapeNetSem dataset and demonstrate the advantages of our approach, both
qualitatively and quantitatively. Project page:
https://www.obukhov.ai/repainting_3d_assets |
This paper presents a novel pipeline for text-guided painting of 3D assets by leveraging pre-trained 2D image diffusion models and neural radiance fields (NeRF) in a modular and interpretable fashion. |
Lifting the power of 2D generative models into 3D is challenging, and existing methods often suffer from limitations like UV unwrapping artifacts and lack of modularity. This work addresses these issues by utilizing readily available tools. |
The pipeline iteratively generates novel views using a text- and depth-conditioned diffusion model, remaps existing views for consistency, and employs NeRF for global reconciliation. This process allows for painting complex geometries without relying on UV maps. |
The method achieves state-of-the-art results on the ShapeNetSem dataset, outperforming existing methods on FID and KID metrics.
The modular design allows for partial upgrades as diffusion models and NeRF technology advance.
The pipeline supports various input formats and can even be extended to text-to-3D generation using Point-E. |
The method currently assumes opaque surfaces, limiting its applicability to some object types.
Future work could explore faster and more efficient view selection strategies and incorporate recent advancements in diffusion and NeRF research. |
generative 3d models, text-guided painting, diffusion models, neural radiance fields, 3d asset creation |
2309.08273
Report |
A Generative Framework for Self-Supervised Facial Representation Learning |
Ruian He, Zhen Xing, Weimin Tan, Bo Yan |
Self-supervised representation learning has gained increasing attention for
strong generalization ability without relying on paired datasets. However, it
has not been explored sufficiently for facial representation. Self-supervised
facial representation learning remains unsolved due to the coupling of facial
identities, expressions, and external factors like pose and light. Prior
methods primarily focus on contrastive learning and pixel-level consistency,
leading to limited interpretability and suboptimal performance. In this paper,
we propose LatentFace, a novel generative framework for self-supervised facial
representations. We suggest that the disentangling problem can also be
formulated as generative objectives in space and time, and propose the solution
using a 3D-aware latent diffusion model. First, we introduce a 3D-aware
autoencoder to encode face images into 3D latent embeddings. Second, we propose
a novel representation diffusion model to disentangle 3D latent into facial
identity and expression. Consequently, our method achieves state-of-the-art
performance in facial expression recognition (FER) and face verification among
self-supervised facial representation learning models. Our model achieves a
3.75% advantage in FER accuracy on RAF-DB and 3.35% on AffectNet compared to
SOTA methods. |
This paper proposes LatentFace, a novel generative framework for self-supervised facial representation learning using a 3D-aware latent diffusion model to disentangle facial identity and expression. |
Self-supervised facial representation learning is important for its generalization ability without paired datasets, but previous methods suffer from limited interpretability and performance due to the coupling of facial identities, expressions, and external factors. |
The methodology involves two stages: 1) 3D Latent Autoencoding disentangles facial texture and shape from pose and illumination using a 3D-aware autoencoder. 2) Latent Space Disentangling predicts facial identity as the time-invariant component of facial features using a Representation Diffusion Model (RDM) trained on video sequences. |
LatentFace achieves state-of-the-art performance in facial expression recognition (FER) and face verification among self-supervised methods.
The model outperforms previous SOTA methods by 3.75% in FER accuracy on RAF-DB and 3.35% on AffectNet.
Qualitative results demonstrate improved disentanglement of facial identity and expression compared to previous methods. |
Interpreting faces with large deflection angles remains challenging due to occlusion.
Potential application risks exist due to the model's ability to generate realistic facial textures and shapes. |
self-supervised learning, facial representation learning, diffusion models, 3d face modeling, disentanglement |
2309.08009
Report |
Measuring the Quality of Text-to-Video Model Outputs: Metrics and Dataset |
Iya Chivileva, Philip Lynch, Tomas E. Ward, Alan F. Smeaton |
Evaluating the quality of videos generated from text-to-video (T2V) models is
important if they are to produce plausible outputs that convince a viewer of
their authenticity. We examine some of the metrics used in this area and
highlight their limitations. The paper presents a dataset of more than 1,000
generated videos from 5 very recent T2V models on which some of those commonly
used quality metrics are applied. We also include extensive human quality
evaluations on those videos, allowing the relative strengths and weaknesses of
metrics, including human assessment, to be compared. The contribution is an
assessment of commonly used quality metrics, and a comparison of their
performances and the performance of human evaluations on an open dataset of T2V
videos. Our conclusion is that naturalness and semantic matching with the text
prompt used to generate the T2V output are important but there is no single
measure to capture these subtleties in assessing T2V model output. |
This paper presents a dataset of over 1,000 videos generated by 5 recent text-to-video models and uses it to compare commonly used quality metrics with human evaluations, revealing limitations in existing metrics. |
Evaluating the quality of text-to-video models is crucial for producing plausible outputs, but developing reliable metrics is an often-overlooked challenge. |
The authors generated videos from various models, computed quality metrics (including their own ensemble metric), and collected human annotations for alignment and perception. These results were compared to assess the metrics' effectiveness. |
Human evaluations generally, but not always, agree with common metrics, which highlights the metrics' limitations.
Text2Video-Zero was the best-performing model, while Aphantasia performed the worst.
Shorter prompts generally resulted in better video quality across all models. |
The naturalness classifier needs further training with more diverse cartoon-style videos.
Future work could explore alternative metrics and ensemble approaches for comprehensive quality assessment. |
text-to-video models, video synthesis, evaluation metrics, human evaluation, dataset |
2309.07986
Report |
Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models |
James Burgess, Kuan-Chieh Wang, Serena Yeung |
Text-to-image diffusion models understand spatial relationships between
objects, but do they represent the true 3D structure of the world from only 2D
supervision? We demonstrate that yes, 3D knowledge is encoded in 2D image
diffusion models like Stable Diffusion, and we show that this structure can be
exploited for 3D vision tasks. Our method, Viewpoint Neural Textual Inversion
(ViewNeTI), controls the 3D viewpoint of objects in generated images from
frozen diffusion models. We train a small neural mapper to take camera
viewpoint parameters and predict text encoder latents; the latents then
condition the diffusion generation process to produce images with the desired
camera viewpoint.
ViewNeTI naturally addresses Novel View Synthesis (NVS). By leveraging the
frozen diffusion model as a prior, we can solve NVS with very few input views;
we can even do single-view novel view synthesis. Our single-view NVS
predictions have good semantic details and photorealism compared to prior
methods. Our approach is well suited for modeling the uncertainty inherent in
sparse 3D vision problems because it can efficiently generate diverse samples.
Our view-control mechanism is general, and can even change the camera view in
images generated by user-defined prompts. |
The paper introduces Viewpoint Neural Textual Inversion (ViewNeTI), a method to control the 3D viewpoint of objects in images generated by frozen text-to-image diffusion models, enabling novel view synthesis from as little as a single input view. |
Leveraging pre-trained diffusion models for 3D vision tasks is appealing due to their large and diverse training data and their ability to model ambiguity inherent in sparse 3D data. |
ViewNeTI trains a small neural network (view-mapper) that takes camera parameters as input and predicts text encoder latents, conditioning the diffusion model to generate images from the desired viewpoint. |
The method successfully interpolates novel viewpoints when trained on a single scene with sparse viewpoints.
Pre-training the view-mapper on a multi-scene dataset allows for extrapolation to novel viewpoints and generalization to new scenes, even enabling single-view novel view synthesis.
ViewNeTI can be used for view-controlled text-to-image generation by prepending the view-mapper's token to user-defined text prompts. |
A major limitation is the potential misalignment of generated objects compared to ground truth, impacting PSNR scores.
Generating precise object details remains challenging, although this is an active research area in textual inversion. |
novel view synthesis, textual inversion, diffusion models, 3d vision, single-view reconstruction |
2309.07920
Report |
Large-Vocabulary 3D Diffusion Model with Transformer |
Ziang Cao, Fangzhou Hong, Tong Wu, Liang Pan, Ziwei Liu |
Creating diverse and high-quality 3D assets with an automatic generative
model is highly desirable. Despite extensive efforts on 3D generation, most
existing works focus on the generation of a single category or a few
categories. In this paper, we introduce a diffusion-based feed-forward
framework for synthesizing massive categories of real-world 3D objects with a
single generative model. Notably, there are three major challenges for this
large-vocabulary 3D generation: a) the need for expressive yet efficient 3D
representation; b) large diversity in geometry and texture across categories;
c) complexity in the appearances of real-world objects. To this end, we propose
a novel triplane-based 3D-aware Diffusion model with TransFormer, DiffTF, for
handling challenges via three aspects. 1) Considering efficiency and
robustness, we adopt a revised triplane representation and improve the fitting
speed and accuracy. 2) To handle the drastic variations in geometry and
texture, we regard the features of all 3D objects as a combination of
generalized 3D knowledge and specialized 3D features. To extract generalized 3D
knowledge from diverse categories, we propose a novel 3D-aware transformer with
shared cross-plane attention. It learns the cross-plane relations across
different planes and aggregates the generalized 3D knowledge with specialized
3D features. 3) In addition, we devise the 3D-aware encoder/decoder to enhance
the generalized 3D knowledge in the encoded triplanes for handling categories
with complex appearances. Extensive experiments on ShapeNet and OmniObject3D
(over 200 diverse real-world categories) convincingly demonstrate that a single
DiffTF model achieves state-of-the-art large-vocabulary 3D object generation
performance with large diversity, rich semantics, and high quality. |
This paper introduces DiffTF, a novel triplane-based 3D-aware diffusion model with Transformer, for synthesizing massive categories of real-world 3D objects with a single generative model. |
Generating diverse and high-quality 3D assets across a large vocabulary is crucial for applications in gaming, robotics, and architecture, but existing methods struggle to maintain robustness across diverse objects. |
DiffTF utilizes a revised triplane representation with improved fitting speed and accuracy. It leverages a 3D-aware transformer to extract generalized 3D knowledge across various categories and integrate it with specialized 3D features of individual objects. Additionally, a 3D-aware encoder/decoder enhances 3D awareness and semantic information in triplanes. |
DiffTF achieves state-of-the-art performance in large-vocabulary 3D object generation on ShapeNet and OmniObject3D, surpassing GAN-based and other diffusion-based methods in both 2D image quality and 3D geometry metrics.
The generated 3D objects exhibit large diversity, rich semantics, and high quality, demonstrating the effectiveness of the proposed 3D-aware modules.
The method shows promising results in capturing complex geometry and textures, even for challenging categories like fruits and sculptures. |
The triplane fitting process, while accelerated, remains time-consuming when scaled to millions of objects.
Details in generated triplanes for some complicated categories have room for improvement. |
3d object generation, diffusion models, transformers, triplane representation, large vocabulary |
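The triplane representation mentioned in the DiffTF summary above can be illustrated with a minimal sketch. It assumes three axis-aligned square feature planes, a nearest-neighbour lookup, and summation to combine the three features. These choices and the names `plane_lookup` and `triplane_feature` are illustrative assumptions, not details taken from the paper.

```python
def plane_lookup(plane, u, v):
    """Nearest-neighbour lookup of a feature vector on a square plane.

    `plane` is a res x res grid of feature vectors; u and v lie in [-1, 1].
    """
    res = len(plane)
    col = min(res - 1, max(0, int((u + 1.0) / 2.0 * res)))
    row = min(res - 1, max(0, int((v + 1.0) / 2.0 * res)))
    return plane[row][col]


def triplane_feature(planes, point):
    """Project a 3D point onto the XY, XZ and YZ planes and sum the features."""
    x, y, z = point
    f_xy = plane_lookup(planes["xy"], x, y)
    f_xz = plane_lookup(planes["xz"], x, z)
    f_yz = plane_lookup(planes["yz"], y, z)
    # Summation is an assumed aggregation rule for this sketch.
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]


# Hypothetical 2x2 planes with one feature channel each.
planes = {
    "xy": [[[1.0], [2.0]], [[3.0], [4.0]]],
    "xz": [[[10.0], [10.0]], [[10.0], [10.0]]],
    "yz": [[[100.0], [100.0]], [[100.0], [100.0]]],
}
feature = triplane_feature(planes, (0.5, -0.5, 0.0))
```

Three 2D planes store far fewer values than a dense 3D grid of the same resolution, which is one reason triplanes are described as expressive yet efficient.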
2309.07906
Report |
Generative Image Dynamics |
Zhengqi Li, Richard Tucker, Noah Snavely, Aleksander Holynski |
We present an approach to modeling an image-space prior on scene motion. Our
prior is learned from a collection of motion trajectories extracted from real
video sequences depicting natural, oscillatory dynamics such as trees, flowers,
candles, and clothes swaying in the wind. We model this dense, long-term motion
prior in the Fourier domain: given a single image, our trained model uses a
frequency-coordinated diffusion sampling process to predict a spectral volume,
which can be converted into a motion texture that spans an entire video. Along
with an image-based rendering module, these trajectories can be used for a
number of downstream applications, such as turning still images into seamlessly
looping videos, or allowing users to realistically interact with objects in
real pictures by interpreting the spectral volumes as image-space modal bases,
which approximate object dynamics. |
This paper introduces a novel method for modeling an image-space prior on scene motion, enabling the animation of still images with natural, oscillatory dynamics. |
Synthesizing realistic scene motion is crucial for visual content creation, as human perception is highly sensitive to motion. Existing methods struggle with issues like temporal inconsistency and unrealistic motion. |
The method leverages spectral volumes, a frequency-domain motion representation, to capture long-range pixel trajectories. A diffusion model trained on real video sequences learns to predict spectral volumes conditioned on a single input image. An image-based rendering module then animates the image using the predicted motion. |
The approach significantly outperforms previous single-image animation methods in terms of realism and temporal coherence.
The method enables the creation of seamlessly looping videos and interactive dynamic images from a single picture.
The use of spectral volumes allows for efficient representation of long-range motions and facilitates long-term temporal consistency in generated videos. |
The model may not accurately capture non-oscillatory or high-frequency motions.
Generating motions that require large amounts of novel content can lead to visual artifacts. |
motion synthesis, diffusion models, image animation, spectral volumes, interactive dynamics |
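The conversion from a spectral volume to a motion texture described in the Generative Image Dynamics summary above can be sketched for a single pixel and a single axis. The sketch assumes the spectral volume stores one complex coefficient per integer frequency and that the trajectory is recovered with an inverse discrete Fourier transform. The function names and example coefficients are hypothetical.

```python
import cmath
import math


def displacement_at(spectral_coeffs, t, num_frames):
    """Displacement of one pixel at frame t from its frequency coefficients."""
    total = 0.0
    for k, coeff in spectral_coeffs.items():
        # Frequency k contributes a sinusoid with k cycles per clip.
        total += (coeff * cmath.exp(2j * math.pi * k * t / num_frames)).real
    return total


def motion_texture(spectral_coeffs, num_frames):
    """Per-frame displacements for the whole clip (inverse DFT, real part)."""
    return [displacement_at(spectral_coeffs, t, num_frames)
            for t in range(num_frames)]


# Hypothetical pixel: a slow sway (k=1) plus a weaker, faster flutter (k=3).
coeffs = {1: 2.0 + 0.0j, 3: 0.0 - 0.5j}
x_disp = motion_texture(coeffs, num_frames=60)
```

Because only integer frequencies are used in this sketch, the trajectory at frame `num_frames` coincides with frame 0, so a few low-frequency coefficients describe a long, periodic motion compactly.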
2309.07867
Report |
Beta Diffusion |
Mingyuan Zhou, Tianqi Chen, Zhendong Wang, Huangjie Zheng |
We introduce beta diffusion, a novel generative modeling method that
integrates demasking and denoising to generate data within bounded ranges.
Using scaled and shifted beta distributions, beta diffusion utilizes
multiplicative transitions over time to create both forward and reverse
diffusion processes, maintaining beta distributions in both the forward
marginals and the reverse conditionals, given the data at any point in time.
Unlike traditional diffusion-based generative models relying on additive
Gaussian noise and reweighted evidence lower bounds (ELBOs), beta diffusion is
multiplicative and optimized with KL-divergence upper bounds (KLUBs) derived
from the convexity of the KL divergence. We demonstrate that the proposed KLUBs
are more effective for optimizing beta diffusion compared to negative ELBOs,
which can also be derived as the KLUBs of the same KL divergence with its two
arguments swapped. The loss function of beta diffusion, expressed in terms of
Bregman divergence, further supports the efficacy of KLUBs for optimization.
Experimental results on both synthetic data and natural images demonstrate the
unique capabilities of beta diffusion in generative modeling of range-bounded
data and validate the effectiveness of KLUBs in optimizing diffusion models,
thereby making them valuable additions to the family of diffusion-based
generative models and the optimization techniques used to train them. |
This paper introduces beta diffusion, a novel generative modeling method specifically designed for range-bounded data by incorporating multiplicative noise, unlike traditional Gaussian diffusion using additive noise. |
Most existing diffusion models rely on additive Gaussian noise. This paper proposes a diffusion model that instead uses multiplicative noise and is designed specifically for range-bounded data. |
The paper proposes to use scaled and shifted beta distributions for both forward and reverse diffusion processes. It further proposes a novel objective function, Kullback-Leibler Upper Bounds (KLUBs), along with its corresponding Bregman divergence formulation for efficient optimization. |
KLUBs are more effective than negative ELBOs in optimizing beta diffusion.
Beta diffusion demonstrates superior performance in modeling range-bounded data, including point masses, compared to Gaussian diffusion.
Beta diffusion exhibits competitive performance on CIFAR-10 image generation. |
Training beta diffusion can be computationally expensive, similar to Gaussian diffusion.
Further exploration of network architectures and hyperparameter optimization tailored for beta diffusion is needed. |
generative modeling, diffusion models, beta distribution, klub, range-bounded data |
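To make the idea of beta-distributed forward marginals in the beta diffusion summary above concrete, here is a minimal sketch. It assumes a marginal of the form Beta(eta * alpha_t * x0, eta * (1 - alpha_t * x0)) for data x0 in (0, 1). This parameterization, the names `eta` and `alpha_t`, and the helper functions are assumptions made for illustration and are not stated in the summary.

```python
import random


def sample_forward_marginal(x0, alpha_t, eta, rng):
    """Draw a noisy value in [0, 1] whose mean is alpha_t * x0.

    Assumed parameterization: Beta(eta * m, eta * (1 - m)) with m = alpha_t * x0.
    """
    mean = alpha_t * x0
    return rng.betavariate(eta * mean, eta * (1.0 - mean))


def to_data_range(z, lo, hi):
    """Scale and shift a value from [0, 1] into the bounded range [lo, hi]."""
    return lo + (hi - lo) * z


rng = random.Random(0)
samples = [sample_forward_marginal(0.8, 0.5, 100.0, rng) for _ in range(20000)]
pixels = [to_data_range(z, -1.0, 1.0) for z in samples]
```

Unlike additive Gaussian noise, every sample stays inside the bounded range by construction, and in this sketch shrinking `alpha_t` pulls the mean toward zero.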
2309.07749
Report |
OmnimatteRF: Robust Omnimatte with 3D Background Modeling |
Geng Lin, Chen Gao, Jia-Bin Huang, Changil Kim, Yipeng Wang, Matthias Zwicker, Ayush Saraf |
Video matting has broad applications, from adding interesting effects to
casually captured movies to assisting video production professionals. Matting
with associated effects such as shadows and reflections has also attracted
increasing research activity, and methods like Omnimatte have been proposed to
separate dynamic foreground objects of interest into their own layers. However,
prior works represent video backgrounds as 2D image layers, limiting their
capacity to express more complicated scenes, thus hindering application to
real-world videos. In this paper, we propose a novel video matting method,
OmnimatteRF, that combines dynamic 2D foreground layers and a 3D background
model. The 2D layers preserve the details of the subjects, while the 3D
background robustly reconstructs scenes in real-world videos. Extensive
experiments demonstrate that our method reconstructs scenes with better quality
on various videos. |
The paper proposes OmnimatteRF, a novel video matting method combining 2D foreground layers for details with a 3D background model for robust scene reconstruction in real-world videos with parallax effects. |
Existing video matting methods struggle with complex scenes and parallax effects due to their 2D background representation. OmnimatteRF addresses this limitation, enabling high-quality matting in more realistic settings. |
OmnimatteRF utilizes a two-branch network: a foreground branch predicting RGBA layers for each object and a background branch employing a 3D radiance field. The model is trained jointly with reconstruction and regularization losses. A masked retraining step refines the background, removing artifacts. |
Outperforms state-of-the-art methods (Omnimatte, D^2NeRF, LNA) in background reconstruction quality on synthetic datasets with parallax effects.
Demonstrates robustness and generalization to diverse real-world videos with complex scenes and camera motions.
Enables cleaner background reconstruction by leveraging learned foreground masks in a retraining step. |
Background reconstruction can be impacted if a region is constantly shadowed.
Unrelated background motions might be captured by the foreground layer due to the static nature of the background model. |
video matting, 3d background modeling, radiance fields, omnimatte, parallax effects |
2309.07499
Report |
Efficiently Robustify Pre-trained Models |
Nishant Jain, Harkirat Behl, Yogesh Singh Rawat, Vibhav Vineet |
A recent trend in deep learning has been towards training large-scale
models with high parameter counts on big datasets. However, the
robustness of such large-scale models in real-world settings is still a
less-explored topic. In this work, we first benchmark the performance of these
models under different perturbations and datasets that represent
real-world shifts, and highlight their degrading performance under these
shifts. We then discuss how existing robustification schemes based on complete
model fine-tuning might not be a scalable option for very large-scale
networks and can also lead them to forget some of the desired characteristics.
Finally, we propose a simple and cost-effective method to solve this problem,
inspired by the knowledge transfer literature. It involves robustifying smaller
models at a lower computation cost and then using them as teachers to tune a
fraction of these large-scale networks, reducing the overall computational
overhead. We evaluate our proposed method under various vision perturbations,
including the ImageNet-C, -R, -S, and -A datasets, and also in transfer learning
and zero-shot evaluation setups on different datasets. Benchmark results show
that our method induces robustness in these large-scale models efficiently,
requiring significantly less time, and also preserves the transfer learning and
zero-shot properties of the original model, which none of the existing methods
are able to achieve. |
This paper proposes a novel knowledge transfer method to efficiently induce robustness in large pre-trained vision models, preserving their original properties like clean accuracy and transfer learning capabilities. |
Large vision models, though achieving impressive performance on various tasks, are brittle under distribution shifts. Existing robustification methods are computationally expensive and can lead to forgetting of the original properties. |
The method involves robustifying a smaller teacher model using advanced augmentation techniques and then distilling this robustness to a small, tunable portion of the large student model using an uncertainty-aware knowledge distillation technique. This allows for selective utilization of clean and robust heads during inference based on input characteristics. |
The proposed method outperforms existing approaches on robust accuracy while maintaining comparable clean accuracy.
It preserves transfer learning capabilities, unlike methods involving extensive fine-tuning.
It is computationally efficient, requiring significantly lower training time compared to full fine-tuning. |
The paper lacks a theoretical analysis of the proposed approach.
Future work could explore test-time adaptation of small models and subsequent distillation to large models. |
robustness, knowledge distillation, distribution shift, large pre-trained models, computer vision |
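The teacher-to-student transfer in the robustification summary above builds on knowledge distillation. The sketch below shows the standard temperature-scaled distillation loss, not the uncertainty-aware variant the paper describes. The function names and default temperature are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0.0)
    return kl * temperature ** 2


# Hypothetical logits from a small robust teacher and a large student head.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, 0.0]
loss = distillation_loss(student, teacher)
```

In a setup like the one summarized, only a small tunable part of the large model would receive gradients from such a loss, which keeps the cost well below that of full fine-tuning.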
2309.07277
Report |
Limitations of Face Image Generation |
Harrison Rosenberg, Shimaa Ahmed, Guruprasad V Ramesh, Ramya Korlakai Vinayak, Kassem Fawaz |
Text-to-image diffusion models have achieved widespread popularity due to
their unprecedented image generation capability. In particular, their ability
to synthesize and modify human faces has spurred research into using generated
face images in both training data augmentation and model performance
assessments. In this paper, we study the efficacy and shortcomings of
generative models in the context of face generation. Utilizing a combination of
qualitative and quantitative measures, including embedding-based metrics and
user studies, we present a framework to audit the characteristics of generated
faces conditioned on a set of social attributes. We apply our framework to
faces generated through state-of-the-art text-to-image diffusion models. We
identify several limitations of face image generation that include faithfulness
to the text prompt, demographic disparities, and distributional shifts.
Furthermore, we present an analytical model that provides insights into how
training data selection contributes to the performance of generative models. |
This paper presents a framework to audit the characteristics of generated faces conditioned on social attributes, focusing on the efficacy and shortcomings of text-to-image diffusion models in face generation. |
The ability of diffusion models to synthesize and modify human faces makes them valuable for data augmentation and model performance assessments in facial recognition, necessitating an understanding of their capabilities and limitations. |
The study uses a data generation pipeline with Stable Diffusion and a fine-tuned Realistic Vision model, combined with SEGA for attribute manipulation. Evaluation includes quantitative metrics, face verification accuracy, and user studies assessing image quality and attribute correctness. |
Generated faces exhibit demographic disparities in face recognition systems and user-perceived quality, often favoring majority demographics.
Quantitative metrics like CLIP-I and DINO-I show weak correlation with human perception of identity retention and transformation correctness.
An analytical model demonstrates how bias in training data propagates to generated images, impacting the fidelity of synthetic datasets. |
Limited exploration of the Own Race Effect (ORE) in the context of generated images.
Reliance on CLIP, which has its own biases and limitations in understanding nuanced facial features and cultural constructs. |
face generation, diffusion models, demographic bias, image quality assessment, user study |
2309.07125
Report |
Text-Guided Generation and Editing of Compositional 3D Avatars |
Hao Zhang, Yao Feng, Peter Kulits, Yandong Wen, Justus Thies, Michael J. Black |
Our goal is to create a realistic 3D facial avatar with hair and accessories
using only a text description. While this challenge has attracted significant
recent interest, existing methods either lack realism, produce unrealistic
shapes, or do not support editing, such as modifications to the hairstyle. We
argue that existing methods are limited because they employ a monolithic
modeling approach, using a single representation for the head, face, hair, and
accessories. Our observation is that the hair and face, for example, have very
different structural qualities that benefit from different representations.
Building on this insight, we generate avatars with a compositional model, in
which the head, face, and upper body are represented with traditional 3D
meshes, and the hair, clothing, and accessories with neural radiance fields
(NeRF). The model-based mesh representation provides a strong geometric prior
for the face region, improving realism while enabling editing of the person's
appearance. By using NeRFs to represent the remaining components, our method is
able to model and synthesize parts with complex geometry and appearance, such
as curly hair and fluffy scarves. Our novel system synthesizes these
high-quality compositional avatars from text descriptions. The experimental
results demonstrate that our method, Text-guided generation and Editing of
Compositional Avatars (TECA), produces avatars that are more realistic than
those of recent methods while being editable because of their compositional
nature. For example, our TECA enables the seamless transfer of compositional
features like hairstyles, scarves, and other accessories between avatars. This
capability supports applications such as virtual try-on. |
This paper presents TECA, a novel method for generating realistic 3D facial avatars with hair and accessories from text descriptions using a compositional model that combines mesh-based representations for the face and body with NeRFs for hair and clothing. |
Existing text-to-3D avatar generation methods struggle with realism, shape fidelity, and editability. TECA addresses these limitations by using distinct representations for different avatar components, resulting in higher-quality, customizable avatars. |
The pipeline starts by generating a face image from text using Stable Diffusion and fitting an SMPL-X model to extract 3D geometry. Texture is generated via iterative inpainting. Hair and accessories are generated using latent NeRF, optimized with SDS and guided by CLIPSeg segmentation masks. Finally, refinement is performed in RGB space using SDS and BLIP losses. |
Outperforms SOTA methods in visual realism and text consistency, as shown in qualitative comparisons and a perceptual study.
Enables editing of individual components, such as transferring hairstyles and accessories between avatars.
Demonstrates animation capabilities through SMPL-X parameter manipulation. |
Reliance on CLIPSeg for segmentation can lead to artifacts if segmentation is inaccurate.
Limited ability to handle complex dynamics, such as realistic hair and clothing movement. |
3d avatar generation, text-to-3d, neural radiance fields (nerf), compositional modeling, score distillation sampling (sds) |
2309.06933
Report |
DreamStyler: Paint by Style Inversion with Text-to-Image Diffusion Models |
Namhyuk Ahn, Junsoo Lee, Chunggi Lee, Kunhee Kim, Daesik Kim, Seung-Hun Nam, Kibeom Hong |
Recent progress in large-scale text-to-image models has yielded remarkable
accomplishments, finding various applications in the art domain. However,
expressing unique characteristics of an artwork (e.g., brushwork, color tone, or
composition) with text prompts alone may encounter limitations due to the
inherent constraints of verbal description. To this end, we introduce
DreamStyler, a novel framework designed for artistic image synthesis,
proficient in both text-to-image synthesis and style transfer. DreamStyler
optimizes a multi-stage textual embedding with a context-aware text prompt,
resulting in prominent image quality. In addition, with content and style
guidance, DreamStyler exhibits flexibility to accommodate a range of style
references. Experimental results demonstrate its superior performance across
multiple scenarios, suggesting its promising potential in artistic product
creation. |
DreamStyler is a novel framework for artistic image synthesis that excels at both text-to-image synthesis and style transfer using a single style reference image. |
Existing methods struggle to accurately capture and apply the unique styles of artworks, often leading to a trade-off between preserving the content of text prompts and replicating artistic styles. |
DreamStyler introduces multi-stage textual inversion to increase style representation capacity, context-aware prompt augmentation to disentangle style from content, and style and context guidance for user control over image generation. |
DreamStyler demonstrates superior performance in balancing text prompt adherence and accurate style replication.
It surpasses state-of-the-art methods in style transfer while preserving content structure.
The multi-stage textual inversion enables novel style mixing from diverse references. |
The framework's applicability to abstract or highly nuanced artistic styles requires further investigation.
Determining when style and context guidance are most effective remains an open question for future research. |
text-to-image synthesis, style transfer, artistic image generation, diffusion models, textual inversion |
2309.06922
Report |
Hydra: Multi-head Low-rank Adaptation for Parameter Efficient Fine-tuning |
Sanghyeon Kim, Hyunmo Yang, Younghyun Kim, Youngjoon Hong, Eunbyung Park |
The recent surge in large-scale foundation models has spurred the development
of efficient methods for adapting these models to various downstream tasks.
Low-rank adaptation methods, such as LoRA, have gained significant attention
due to their outstanding parameter efficiency and no additional inference
latency. This paper investigates a more general form of adapter module based on
the analysis that parallel and sequential adaptation branches learn novel and
general features during fine-tuning, respectively. The proposed method, named
Hydra, due to its multi-head computational branches, combines parallel and
sequential branches to integrate their capabilities, which is more expressive than
existing single branch methods and enables the exploration of a broader range
of optimal points in the fine-tuning process. In addition, the proposed
adaptation method explicitly leverages the pre-trained weights by performing a
linear combination of the pre-trained features. It allows the learned features
to have better generalization performance across diverse downstream tasks.
Furthermore, we perform a comprehensive analysis of the characteristics of each
adaptation branch with empirical evidence. Through an extensive range of
experiments, encompassing comparisons and ablation studies, we substantiate the
efficiency and demonstrate the superior performance of Hydra. This
comprehensive evaluation underscores the potential impact and effectiveness of
Hydra in a variety of applications. Our code is available at
https://github.com/extremebird/Hydra |
This paper proposes "Hydra," a novel adapter module for parameter-efficient fine-tuning (PEFT) that combines parallel and sequential branches for enhanced expressiveness and generalization performance. |
Efficiently adapting large-scale pre-trained models to downstream tasks is crucial due to their size and computational demands. Existing PEFT methods, particularly adapter-based ones, are limited to either parallel or sequential approaches, potentially missing learning opportunities. |
Hydra integrates the parallel feature learning of LoRA (learning novel features) with the sequential approach (leveraging pre-trained features for generalizability) using linear adapter modules to avoid inference latency. This multi-branch structure is applied to MLP blocks in transformers and evaluated on various vision and NLP tasks. |
Hydra consistently outperforms other PEFT methods, achieving higher accuracy on the ELEVATER benchmark and VTAB-1k benchmark.
Analysis of the weight matrices and feature space visualizations reveals that parallel and sequential branches learn distinct and complementary features.
Ablation studies confirm the advantage of combining branches and the MLP block as the optimal position for Hydra. |
While theoretically similar in complexity, Hydra's multi-branch design might lead to slight bottlenecks on GPUs compared to single-branch methods.
Exploration of more sophisticated adapter modules within the Hydra framework could further enhance performance. |
parameter-efficient fine-tuning, adapter modules, transformer networks, few-shot learning, transfer learning |
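The Hydra summary above states that linear parallel and sequential branches add no inference latency. A minimal sketch shows why, assuming the adapted output has the form h = W x + B_par A_par x + B_seq A_seq (W x). This formula, the rank-1 shapes, and all variable names are assumptions for illustration, not the paper's exact definition.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add(a, b):
    """Element-wise sum of two matrices with the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


W = [[1.0, 2.0], [0.0, 1.0]]   # frozen pre-trained weight (2x2)
B_par = [[0.5], [0.0]]         # parallel branch, rank 1
A_par = [[1.0, -1.0]]
B_seq = [[0.0], [0.2]]         # sequential branch, rank 1
A_seq = [[1.0, 1.0]]
x = [[3.0], [4.0]]             # input column vector

# Training-time view: three separate branches.
h0 = matmul(W, x)
parallel = matmul(B_par, matmul(A_par, x))        # acts on the input
sequential = matmul(B_seq, matmul(A_seq, h0))     # acts on pre-trained features
h_branches = add(add(h0, parallel), sequential)

# Inference-time view: fold both branches into a single weight matrix.
W_merged = add(add(W, matmul(B_par, A_par)), matmul(matmul(B_seq, A_seq), W))
h_merged = matmul(W_merged, x)
```

Because every branch is linear, both views give the same output, so a deployed model only needs the merged matrix and runs at the cost of the original layer.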
2309.06895
Report |
MagiCapture: High-Resolution Multi-Concept Portrait Customization |
Junha Hyung, Jaeyo Shin, Jaegul Choo |
Large-scale text-to-image models including Stable Diffusion are capable of
generating high-fidelity photorealistic portrait images. There is an active
research area dedicated to personalizing these models, aiming to synthesize
specific subjects or styles using provided sets of reference images. However,
despite the plausible results from these personalization methods, they tend to
produce images that often fall short of realism and are not yet on a
commercially viable level. This is particularly noticeable in portrait image
generation, where any unnatural artifact in human faces is easily discernible
due to our inherent human bias. To address this, we introduce MagiCapture, a
personalization method for integrating subject and style concepts to generate
high-resolution portrait images using just a few subject and style references.
For instance, given a handful of random selfies, our fine-tuned model can
generate high-quality portrait images in specific styles, such as passport or
profile photos. The main challenge with this task is the absence of ground
truth for the composed concepts, leading to a reduction in the quality of the
final output and an identity shift of the source subject. To address these
issues, we present a novel Attention Refocusing loss coupled with auxiliary
priors, both of which facilitate robust learning within this weakly supervised
learning setting. Our pipeline also includes additional post-processing steps
to ensure the creation of highly realistic outputs. MagiCapture outperforms
other baselines in both quantitative and qualitative evaluations and can also
be generalized to other non-human objects. |
This paper introduces MagiCapture, a novel multi-concept personalization method for generating high-resolution portrait images that blend subject identity and reference style from a few input images. |
Existing personalization methods for text-to-image models often lack realism, especially in challenging areas like portrait generation where identity preservation is crucial. |
MagiCapture leverages a two-phase optimization process with masked reconstruction, a novel attention refocusing loss to enhance information disentanglement, and composed prompt learning with pseudo-labels for robust style integration. |
MagiCapture quantitatively outperforms baselines like DreamBooth, Textual Inversion, and Custom Diffusion in identity similarity, style preservation, and aesthetic quality.
Qualitative evaluations demonstrate superior image fidelity and faithful reflection of both source and reference images, as supported by a user study.
The method exhibits generalization capabilities, allowing for further image manipulation using textual prompts and adaptation to non-human objects. |
The model may occasionally generate unrealistic body parts and shows limitations in handling diverse ethnicities and gender representations.
Addressing the inherent biases of pre-trained text-to-image models within a few-shot setting poses a challenge for future work. |
image generation, personalization, text-to-image synthesis, diffusion models, few-shot learning |
2309.06802
Report |
Dynamic NeRFs for Soccer Scenes |
Sacha Lewin, Maxime Vandegar, Thomas Hoyoux, Olivier Barnich, Gilles Louppe |
The long-standing problem of novel view synthesis has many applications,
notably in sports broadcasting. Photorealistic novel view synthesis of soccer
actions, in particular, is of enormous interest to the broadcast industry. Yet
only a few industrial solutions have been proposed, and even fewer that achieve
near-broadcast quality of the synthetic replays. Except for their setup of
multiple static cameras around the playfield, the best proprietary systems
disclose close to no information about their inner workings. Leveraging
multiple static cameras for such a task indeed presents a challenge rarely
tackled in the literature, for a lack of public datasets: the reconstruction of
a large-scale, mostly static environment, with small, fast-moving elements.
Recently, the emergence of neural radiance fields has induced stunning progress
in many novel view synthesis applications, leveraging deep learning principles
to produce photorealistic results in the most challenging settings. In this
work, we investigate the feasibility of basing a solution to the task on
dynamic NeRFs, i.e., neural models purposed to reconstruct general dynamic
content. We compose synthetic soccer environments and conduct multiple
experiments using them, identifying key components that help reconstruct soccer
scenes with dynamic NeRFs. We show that, although this approach cannot fully
meet the quality requirements for the target application, it suggests promising
avenues toward a cost-efficient, automatic solution. We also make our
dataset and code publicly available, with the goal of encouraging further efforts
from the research community on the task of novel view synthesis for dynamic
soccer scenes. For code, data, and video results, please see
https://soccernerfs.isach.be. |
This work explores the feasibility of using dynamic Neural Radiance Fields (NeRFs) for novel view synthesis of soccer scenes, aiming to create broadcast-quality replays. |
Developing an automated and cost-efficient solution for generating photorealistic virtual replays of soccer actions is highly valuable for sports broadcasting. |
The authors compose increasingly complex synthetic soccer environments and conduct experiments using state-of-the-art dynamic NeRF models (K-Planes and NeRFPlayer) to evaluate their performance under different camera setups. |
Dynamic NeRFs can reconstruct detailed soccer scenes with close-up camera views.
Performance significantly degrades with distant, broadcast-style camera setups, even with enhancements like ray importance sampling.
While promising, general dynamic NeRFs currently fall short of broadcast-quality standards for complex soccer scene reconstruction. |
The study is limited to synthetic datasets due to the lack of suitable public real-world soccer datasets.
Domain-specific knowledge and additional components, such as incorporating broadcast camera views, may be necessary to reach broadcast-quality results. |
neural radiance fields, novel view synthesis, dynamic scene reconstruction, sports broadcasting, soccer replays |
2309.06714
Report |
MPI-Flow: Learning Realistic Optical Flow with Multiplane Images |
Yingping Liang, Jiaming Liu, Debing Zhang, Ying Fu |
The accuracy of learning-based optical flow estimation models heavily relies
on the realism of the training datasets. Current approaches for generating such
datasets either employ synthetic data or generate images with limited realism.
However, the domain gap of these data with real-world scenes constrains the
generalization of the trained model to real-world applications. To address this
issue, we investigate generating realistic optical flow datasets from
real-world images. Firstly, to generate highly realistic new images, we
construct a layered depth representation, known as multiplane images (MPI),
from single-view images. This allows us to generate novel view images that are
highly realistic. To generate optical flow maps that correspond accurately to
the new image, we calculate the optical flows of each plane using the camera
matrix and plane depths. We then project these layered optical flows into the
output optical flow map with volume rendering. Secondly, to ensure the realism
of motion, we present an independent object motion module that can separate the
camera and dynamic object motion in MPI. This module addresses the deficiency
in MPI-based single-view methods, where optical flow is generated only by
camera motion and does not account for any object movement. We additionally
devise a depth-aware inpainting module to merge new images with dynamic objects
and address unnatural motion occlusions. We show the superior performance of
our method through extensive experiments on real-world datasets. Moreover, our
approach achieves state-of-the-art performance in both unsupervised and
supervised training of learning-based models. The code will be made publicly
available at: https://github.com/Sharpiless/MPI-Flow. |
This paper presents MPI-Flow, a novel framework for generating large-scale optical flow datasets from single-view images, enhancing both image realism and motion realism for training optical flow estimation models. |
Existing synthetic optical flow datasets lack realism, hindering the generalization of trained models to real-world applications. This paper tackles this limitation by enabling the creation of highly realistic datasets from real-world images. |
The method leverages Multiplane Images (MPI) to construct layered depth representations of single-view images. It utilizes volume rendering for realistic novel view synthesis and incorporates an independent object motion module to simulate diverse motions. A depth-aware inpainting module further refines the generated images by addressing occlusions. |
MPI-Flow generates significantly more realistic images and optical flows compared to previous methods like Depthstillation and RealFlow.
Training optical flow estimation models (specifically RAFT) on datasets generated by MPI-Flow achieves superior performance on real-world benchmarks, outperforming models trained on synthetic data or datasets from other generation methods.
The proposed approach exhibits strong generalization capabilities across diverse datasets like Sintel, KITTI, and DAVIS, demonstrating its potential for advancing real-world optical flow estimation. |
The current implementation primarily focuses on single-object motion, limiting its applicability to scenes with complex multi-object interactions.
Further exploration of advanced techniques for handling occlusions and disocclusions in dynamic scenes can further enhance the realism of generated datasets. |
optical flow, dataset generation, multiplane images (mpi), novel view synthesis, computer vision |
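The per-plane flow computation and compositing described in the MPI-Flow entry above can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes a pinhole camera with focal length `f` and principal point `(cx, cy)`, fronto-parallel planes, a relative motion that is a pure translation `t` applied to scene points in the camera frame (no rotation), and standard front-to-back alpha compositing as the volume-rendering step. All function names are hypothetical.

```python
def plane_flow(u, v, depth, t, f, cx, cy):
    """Flow at pixel (u, v) for a fronto-parallel plane at `depth`.

    Assumes scene points move as X2 = X + t in the camera frame,
    with t = (tx, ty, tz) and no rotation.
    """
    # Back-project the pixel to a 3D point on the plane.
    x = (u - cx) / f * depth
    y = (v - cy) / f * depth
    # Apply the relative translation.
    x2, y2, z2 = x + t[0], y + t[1], depth + t[2]
    # Re-project and take the pixel displacement.
    u2 = f * x2 / z2 + cx
    v2 = f * y2 / z2 + cy
    return (u2 - u, v2 - v)


def composite_flow(flows, alphas):
    """Blend per-plane flows front to back with alpha compositing.

    flows  : list of (du, dv), ordered nearest plane first
    alphas : per-plane opacity at this pixel, same order
    """
    du = dv = 0.0
    transmittance = 1.0  # fraction of the ray not yet absorbed
    for (fu, fv), a in zip(flows, alphas):
        w = transmittance * a
        du += w * fu
        dv += w * fv
        transmittance *= 1.0 - a
    return (du, dv)


# Two planes at depths 2 and 4, relative translation of 0.1 units along x.
depths = [2.0, 4.0]
flows = [plane_flow(50, 40, d, (0.1, 0.0, 0.0), f=100.0, cx=64.0, cy=48.0)
         for d in depths]
flow = composite_flow(flows, alphas=[0.5, 1.0])
```

Nearer planes receive larger displacements for the same translation, which is why a layered depth representation is enough to produce parallax-consistent flow in this simplified setting.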
2309.06660
Report |
Generalizable Neural Fields as Partially Observed Neural Processes |
Jeffrey Gu, Kuan-Chieh Wang, Serena Yeung |
Neural fields, which represent signals as a function parameterized by a
neural network, are a promising alternative to traditional discrete vector or
grid-based representations. Compared to discrete representations, neural
representations scale well with increasing resolution, are continuous, and
can be many-times differentiable. However, given a dataset of signals that we
would like to represent, having to optimize a separate neural field for each
signal is inefficient, and cannot capitalize on shared information or
structures among signals. Existing generalization methods view this as a
meta-learning problem and employ gradient-based meta-learning to learn an
initialization which is then fine-tuned with test-time optimization, or learn
hypernetworks to produce the weights of a neural field. We instead propose a
new paradigm that views the large-scale training of neural representations as a
part of a partially-observed neural process framework, and leverage neural
process algorithms to solve this task. We demonstrate that this approach
outperforms both state-of-the-art gradient-based meta-learning approaches and
hypernetwork approaches. |
This paper introduces a new neural process-inspired framework (PONP) for the efficient training of neural fields over large datasets, addressing the challenge of learning a single neural field representation for multiple signals. |
Existing methods for generalizing neural fields to multiple signals, such as gradient-based meta-learning and hypernetworks, have limitations in terms of efficiency, flexibility, and scalability. This paper proposes a neural process-based approach as a more promising alternative. |
The paper proposes a partially-observed neural process (PONP) framework that consists of a task-specific encoder to aggregate partial observations into a representation and a decoder with a conditional neural field conditioned on this representation. The framework utilizes probabilistic inference for training and accommodates various neural process architectures and latent variable approaches. |
PONP significantly outperforms gradient-based meta-learning and hypernetwork methods on tasks such as 2D image regression and completion.
PONP demonstrates superior performance in 2D CT reconstruction from sparse projections, exceeding the performance of Reptile and random initialization, even without test-time optimization.
In the ShapeNet view synthesis task, PONP achieves comparable or better results than the state-of-the-art Transformer INR method, showcasing its effectiveness in handling complex 3D data. |
While PONP outperforms previous approaches, there is still room for improvement, particularly in further closing the performance gap in fully-observed settings.
The choice of neural process architecture and encoder design significantly influences PONP's effectiveness, suggesting a need for further research into task-specific architectures. |
neural fields, neural processes, meta-learning, implicit representations, generalization |
2309.06581
Report |
Zero-Shot Visual Classification with Guided Cropping |
Piyapat Saranrittichai, Mauricio Munoz, Volker Fischer, Chaithanya Kumar Mummadi |
Pretrained vision-language models, such as CLIP, show promising zero-shot
performance across a wide variety of datasets. For closed-set classification
tasks, however, there is an inherent limitation: CLIP image encoders are
typically designed to extract generic image-level features that summarize
superfluous or confounding information for the target tasks. This results in
degradation of classification performance, especially when objects of interest
cover small areas of input images. In this work, we propose CLIP with Guided
Cropping (GC-CLIP), where we use an off-the-shelf zero-shot object detection
model in a preprocessing step to increase the focus of the zero-shot classifier on the
object of interest and minimize the influence of extraneous image regions. We
empirically show that our approach improves zero-shot classification results
across architectures and datasets, favorably for small objects. |
The paper proposes GC-CLIP, a method to improve CLIP's zero-shot object classification performance by cropping input images around objects of interest using bounding boxes from OWL-ViT. |
CLIP image encoders extract generic image-level features that may include superfluous or confounding information for specific classification tasks, degrading performance, especially for small objects. |
GC-CLIP uses OWL-ViT to extract bounding boxes around potential objects of interest. These boxes are then used to crop the input image before passing it to the CLIP image encoder. |
GC-CLIP consistently improves zero-shot classification accuracy, especially for images with small objects.
Test-time box augmentation further improves performance, with Multi-Margin Augmentation (MAug) generally outperforming Random Crop Augmentation (RAug).
Using OWL-ViT directly as a classifier results in poor performance compared to CLIP baselines. |
The current box selection strategy does not dynamically weight the importance of context information.
Future work could investigate methods for dynamically weighting context based on distance and semantic relationship to the target object. |
zero-shot learning, object classification, clip, owl-vit, guided cropping |
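The guided cropping and multi-margin box augmentation described in the GC-CLIP entry above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: the margin ratio `alpha` is assumed to interpolate linearly between the detected box (`alpha=0`) and the full image (`alpha=1`), and the CLIP encoding and feature averaging steps are not shown. Function names and the default margins are hypothetical.

```python
def expand_box(box, image_size, alpha):
    """Interpolate between a detected box (alpha=0) and the full image (alpha=1)."""
    x0, y0, x1, y1 = box
    width, height = image_size
    return (
        x0 * (1.0 - alpha),
        y0 * (1.0 - alpha),
        x1 + (width - x1) * alpha,
        y1 + (height - y1) * alpha,
    )


def multi_margin_boxes(box, image_size, alphas=(0.0, 0.25, 0.5)):
    """Crop boxes at several margins. Each crop would then be encoded by the
    image encoder and the features combined (encoder not shown here)."""
    return [expand_box(box, image_size, a) for a in alphas]


boxes = multi_margin_boxes((40, 60, 80, 100), image_size=(200, 200))
```

Larger margins keep more surrounding context, so a set of margins trades off focus on the object against contextual cues.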
2309.06380
Report |
InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation |
Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, Qiang Liu |
Diffusion models have revolutionized text-to-image generation with their
exceptional quality and creativity. However, their multi-step sampling process is
known to be slow, often requiring tens of inference steps to obtain
satisfactory results. Previous attempts to improve its sampling speed and
reduce computational costs through distillation have been unsuccessful in
achieving a functional one-step model. In this paper, we explore a recent
method called Rectified Flow, which, thus far, has only been applied to small
datasets. The core of Rectified Flow lies in its \emph{reflow} procedure, which
straightens the trajectories of probability flows, refines the coupling between
noises and images, and facilitates the distillation process with student
models. We propose a novel text-conditioned pipeline to turn Stable Diffusion
(SD) into an ultra-fast one-step model, in which we find reflow plays a
critical role in improving the assignment between noise and images. Leveraging
our new pipeline, we create, to the best of our knowledge, the first one-step
diffusion-based text-to-image generator with SD-level image quality, achieving
an FID (Frechet Inception Distance) of $23.3$ on MS COCO 2017-5k, surpassing
the previous state-of-the-art technique, progressive distillation, by a
significant margin ($37.2$ $\rightarrow$ $23.3$ in FID). By utilizing an
expanded network with 1.7B parameters, we further improve the FID to $22.4$. We
call our one-step models \emph{InstaFlow}. On MS COCO 2014-30k, InstaFlow
yields an FID of $13.1$ in just $0.09$ second, the best in $\leq 0.1$ second
regime, outperforming the recent StyleGAN-T ($13.9$ in $0.1$ second). Notably,
the training of InstaFlow only costs 199 A100 GPU days. Codes and pre-trained
models are available at github.com/gnobitab/InstaFlow. |
This paper introduces InstaFlow, the first one-step text-to-image diffusion model based on Stable Diffusion that achieves high-quality generation. |
Large-scale text-to-image generation models are computationally expensive and time-consuming. InstaFlow offers a solution for ultra-fast generation with minimal quality loss. |
The authors propose a novel text-conditioned Rectified Flow pipeline. This pipeline leverages a "reflow" procedure to straighten the trajectories of probability flows, improving the coupling between noise and image data. This refined coupling facilitates the distillation process, leading to a high-quality one-step model. |
InstaFlow-0.9B achieves an FID of 23.4 on MS COCO 2017-5k in just 0.09 seconds, surpassing previous state-of-the-art distillation techniques.
InstaFlow-1.7B, a larger variant, further reduces the FID to 22.4.
On MS COCO 2014-30k, InstaFlow-0.9B achieves an FID of 13.1 in 0.09 seconds, outperforming StyleGAN-T in speed and quality. |
InstaFlow can struggle with complex compositions within text prompts.
Future work includes exploring longer training durations and larger datasets for potential improvements. |
text-to-image generation, diffusion models, rectified flow, knowledge distillation, one-step generation |
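The role of straight trajectories in the InstaFlow entry above can be illustrated with a toy example. This is a sketch, not the authors' pipeline: it assumes the common rectified-flow convention in which `x0` is noise at `t=0` and `x1` is the image at `t=1`, and it replaces the learned velocity network with hand-written toy velocity functions. All names are hypothetical.

```python
def interpolate(x0, x1, t):
    """Point on the straight line from noise x0 (t=0) to image x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def velocity_target(x0, x1):
    """Regression target for the velocity model: constant along the line."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(velocity, x0, steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise, image = [0.0, 2.0], [1.0, -2.0]

# Toy "perfectly straightened" flow: the velocity does not depend on x or t.
straight = lambda x, t: velocity_target(noise, image)
one_step = euler_sample(straight, noise, steps=1)
four_step = euler_sample(straight, noise, steps=4)

# Toy curved flow with the same endpoint: the velocity changes over time,
# so a single Euler step is inaccurate.
curved = lambda x, t: [2.0 * t, -8.0 * t]
curved_one = euler_sample(curved, noise, steps=1)
curved_many = euler_sample(curved, noise, steps=1000)
```

With the straight flow, one Euler step lands on the same point as four steps. With the curved flow, one step does not move at all because the velocity at `t=0` is zero, while many small steps approach the true endpoint.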
2309.06323
Report |
SAMPLING: Scene-adaptive Hierarchical Multiplane Images Representation for Novel View Synthesis from a Single Image |
Xiaoyu Zhou, Zhiwei Lin, Xiaojun Shan, Yongtao Wang, Deqing Sun, Ming-Hsuan Yang |
Recent novel view synthesis methods obtain promising results for relatively
small scenes, e.g., indoor environments and scenes with a few objects, but tend
to fail for unbounded outdoor scenes with a single image as input. In this
paper, we introduce SAMPLING, a Scene-adaptive Hierarchical Multiplane Images
Representation for Novel View Synthesis from a Single Image based on improved
multiplane images (MPI). Observing that depth distribution varies significantly
for unbounded outdoor scenes, we employ an adaptive-bins strategy for MPI to
arrange planes in accordance with each scene image. To represent intricate
geometry and multi-scale details, we further introduce a hierarchical
refinement branch, which results in high-quality synthesized novel views. Our
method demonstrates considerable performance gains in synthesizing large-scale
unbounded outdoor scenes using a single image on the KITTI dataset and
generalizes well to the unseen Tanks and Temples dataset. The code and models
will soon be made available. |
This paper proposes SAMPLING, a novel single-image view synthesis method for unbounded outdoor scenes based on an improved Multiplane Images (MPI) representation. |
Existing methods struggle to synthesize novel views of large-scale outdoor scenes from single images due to limitations in handling complex geometry and multi-scale details. |
SAMPLING utilizes an adaptive-bins strategy to arrange MPI planes according to each scene's depth distribution and employs a hierarchical refinement branch to capture multi-scale features. |
SAMPLING achieves state-of-the-art performance on the KITTI dataset for outdoor scene synthesis.
The method demonstrates strong generalization ability, achieving competitive results on the unseen Tanks and Temples dataset despite being trained only on KITTI.
Ablation studies confirm the effectiveness of the adaptive-bins strategy and the hierarchical refinement branch in improving synthesis quality. |
SAMPLING, based on MPI, struggles with synthesizing views far from the input view, leading to distortions.
Areas with strong diffuse reflections and thin structures pose challenges for accurate scene representation. |
novel view synthesis, multiplane images, unbounded outdoor scenes, single image, hierarchical refinement |
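One way to picture the adaptive-bins strategy in the SAMPLING entry above is to place each MPI plane at the centre of a per-image depth bin. The sketch below assumes an AdaBins-style parameterisation in which a network predicts relative bin widths for each image. Whether SAMPLING uses exactly this form is an assumption, and the function name is hypothetical.

```python
def plane_depths(bin_widths, d_min, d_max):
    """Place one MPI plane at the centre of each predicted depth bin."""
    total = sum(bin_widths)
    depths, left = [], 0.0
    for w in bin_widths:
        w = w / total  # normalise so the bins tile [d_min, d_max]
        depths.append(d_min + (d_max - d_min) * (left + w / 2.0))
        left += w
    return depths


# Equal widths reproduce a fixed, uniform plane arrangement.
uniform = plane_depths([1, 1, 1, 1], d_min=1.0, d_max=9.0)
# Unequal widths move planes toward the depth ranges with narrower bins.
adaptive = plane_depths([0.5, 0.25, 0.25], d_min=1.0, d_max=9.0)
```

Because the widths are predicted per image, the plane arrangement can follow each scene's depth distribution instead of a fixed sampling scheme.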
2309.06169
Report |
Elucidating the solution space of extended reverse-time SDE for diffusion models |
Qinpeng Cui, Xinyi Zhang, Zongqing Lu, Qingmin Liao |
Diffusion models (DMs) demonstrate potent image generation capabilities in
various generative modeling tasks. Nevertheless, their primary limitation lies
in slow sampling speed, requiring hundreds or thousands of sequential function
evaluations through large neural networks to generate high-quality images.
Sampling from DMs can be seen alternatively as solving corresponding stochastic
differential equations (SDEs) or ordinary differential equations (ODEs). In
this work, we formulate the sampling process as an extended reverse-time SDE
(ER SDE), unifying prior explorations into ODEs and SDEs. Leveraging the
semi-linear structure of ER SDE solutions, we offer exact solutions and
arbitrarily high-order approximate solutions for VP SDE and VE SDE,
respectively. Based on the solution space of the ER SDE, we yield mathematical
insights elucidating the superior performance of ODE solvers over SDE solvers
in terms of fast sampling. Additionally, we unveil that VP SDE solvers stand on
par with their VE SDE counterparts. Finally, we devise fast and training-free
samplers, ER-SDE-Solvers, achieving state-of-the-art performance across all
stochastic samplers. Experimental results show that they achieve 3.45 FID in 20
function evaluations and 2.24 FID in 50 function evaluations on the ImageNet
$64\times64$ dataset. |
This paper proposes ER-SDE-Solvers, a family of fast and training-free samplers for diffusion models based on an extended reverse-time SDE (ER SDE) formulation. |
Diffusion models excel in image generation but suffer from slow sampling speed. This work aims to improve sampling speed without retraining. |
The paper unifies previous ODE and SDE sampling methods into the ER SDE framework, derives exact and approximate solutions for both VP and VE SDEs, analyzes the solution space to understand sampler performance, and designs customized noise scale functions for fast sampling. |
The paper provides mathematical insight into why ODE solvers outperform SDE solvers for fast sampling, attributing the gap to lower discretization errors.
Demonstrates that VP SDE solvers achieve comparable image quality to VE SDE solvers given consistent pretrained models.
ER-SDE-Solvers achieve state-of-the-art performance among stochastic samplers, significantly accelerating generation across various datasets (e.g., 3.45 FID in 20 NFE on ImageNet 64x64). |
ER-SDE-Solvers focus on fast sampling and may not be suitable for accelerating likelihood evaluation in diffusion models.
While achieving state-of-the-art performance among training-free stochastic samplers, ER-SDE-Solvers are still not as fast as some highly optimized GANs or flow-based models. |
diffusion models, fast sampling, training-free, stochastic differential equations, ordinary differential equations |
2309.06023
Report |
Learning from History: Task-agnostic Model Contrastive Learning for Image Restoration |
Gang Wu, Junjun Jiang, Kui Jiang, Xianming Liu |
Contrastive learning has emerged as a prevailing paradigm for high-level
vision tasks, which, by introducing properly negative samples, has also been
exploited for low-level vision tasks to achieve a compact optimization space to
account for their ill-posed nature. However, existing methods rely on manually
predefined and task-oriented negatives, which often exhibit pronounced
task-specific biases. To address this challenge, our paper introduces an
innovative method termed 'learning from history', which dynamically generates
negative samples from the target model itself. Our approach, named Model
Contrastive Learning for Image Restoration (MCLIR), rejuvenates latency models
as negative models, making it compatible with diverse image restoration tasks.
We propose the Self-Prior guided Negative loss (SPN) to enable it. This
approach significantly enhances existing models when retrained with the
proposed model contrastive paradigm. The results show significant improvements
in image restoration across various tasks and architectures. For example,
models retrained with SPN outperform the original FFANet and DehazeFormer by
3.41 dB and 0.57 dB on the RESIDE indoor dataset for image dehazing. Similarly,
they achieve notable improvements of 0.47 dB on SPA-Data over IDT for image
deraining and 0.12 dB on Manga109 for a 4x scale super-resolution over
lightweight SwinIR, respectively. Code and retrained models are available at
https://github.com/Aitical/MCLIR. |
This paper proposes Model Contrastive Learning for Image Restoration (MCLIR), a novel method that dynamically generates negative samples from the target model itself for contrastive learning in image restoration tasks. |
Existing contrastive learning methods for image restoration rely on manually predefined negative samples, limiting their generalization capability and introducing task-specific biases. MCLIR addresses these limitations by generating adaptive negatives directly from the model. |
MCLIR utilizes a latency model updated with exponential moving averages (EMA) of the target model's parameters. A Self-Prior guided Negative loss (SPN) compares features from the target model's output and the latency model, guiding the target model towards a more optimal solution. |
MCLIR consistently improves performance across various image restoration tasks (super-resolution, dehazing, deraining, deblurring) and architectures (CNNs, Transformers).
Retrained models with MCLIR outperform their original counterparts and even surpass some state-of-the-art methods.
Ablation studies confirm the effectiveness of using EMA for latency model updates, the impact of negative step size, and the importance of the balancing coefficient in the loss function. |
The paper primarily focuses on image restoration tasks, leaving its application to other dense prediction tasks unexplored.
Hyperparameter tuning was not extensively evaluated across all image restoration tasks. |
contrastive learning, image restoration, self-supervised learning, negative sample generation, deep learning |
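The EMA update of the latency model mentioned in the MCLIR entry above can be sketched as follows. This is a minimal illustration on plain Python floats rather than network tensors. The function name, the parameter dictionary layout, and the decay values are assumptions, not taken from the paper.

```python
def ema_update(latency_params, target_params, decay=0.999):
    """Move the latency (negative) model slowly toward the target model."""
    return {
        name: decay * latency_params[name] + (1.0 - decay) * target_params[name]
        for name in latency_params
    }


latency = {"w": 0.0, "b": 1.0}
target = {"w": 1.0, "b": 1.0}
for _ in range(3):
    latency = ema_update(latency, target, decay=0.5)
```

A decay close to 1 keeps the latency model well behind the target model, so its outputs remain slightly worse and can serve as negatives. The small decay used here only makes the numbers easy to follow.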
2309.05956
Report |
Beyond Generation: Harnessing Text to Image Models for Object Detection and Segmentation |
Yunhao Ge, Jiashu Xu, Brian Nlong Zhao, Neel Joshi, Laurent Itti, Vibhav Vineet |
We propose a new paradigm to automatically generate training data with
accurate labels at scale using the text-to-image synthesis frameworks (e.g.,
DALL-E, Stable Diffusion, etc.). The proposed approach decouples training data
generation into foreground object generation, and contextually coherent
background generation. To generate foreground objects, we employ a
straightforward textual template, incorporating the object class name as input
prompts. This is fed into a text-to-image synthesis framework, producing
various foreground images set against isolated backgrounds. A
foreground-background segmentation algorithm is then used to generate
foreground object masks. To generate context images, we begin by creating
language descriptions of the context. This is achieved by applying an image
captioning method to a small set of images representing the desired context.
These textual descriptions are then transformed into a diverse array of context
images via a text-to-image synthesis framework. Subsequently, we composite
these with the foreground object masks produced in the initial step, utilizing
a cut-and-paste method, to formulate the training data. We demonstrate the
advantages of our approach on five object detection and segmentation datasets,
including Pascal VOC and COCO. We found that detectors trained solely on
synthetic data produced by our method achieve performance comparable to those
trained on real data (Fig. 1). Moreover, a combination of real and synthetic
data yields even better results. Further analysis indicates that the
synthetic data distribution complements the real data distribution effectively.
Additionally, we emphasize the compositional nature of our data generation
approach in out-of-distribution and zero-shot data generation scenarios. We
open-source our code at https://github.com/gyhandy/Text2Image-for-Detection |
This paper presents a novel method leveraging text-to-image synthesis models (e.g., DALL-E, Stable Diffusion) to automatically generate large-scale, accurately labeled training datasets for object detection and segmentation. |
This approach addresses the high cost and labor-intensive nature of acquiring large labeled datasets, essential for training modern deep learning models. |
The proposed pipeline generates foreground object masks and contextually coherent backgrounds separately. Foreground masks are generated by feeding object class names into a text-to-image model and segmenting the output. Backgrounds are generated by captioning a few exemplar images (or using predefined templates in zero-shot scenarios) and feeding the augmented captions to the text-to-image model. Finally, foregrounds are pasted onto backgrounds to create pseudo-labeled training data. |
Training object detectors solely on synthetic data generated by this method achieves performance comparable to training on real data, particularly in low-resource scenarios.
Combining synthetic and real data further improves performance, indicating the synthetic data effectively complements real data.
The method generalizes well to multiple object detection and segmentation datasets and benefits from the compositionality of language, enabling easy modification of generated data by editing textual descriptions. |
The current approach lacks control over factors like illumination, viewpoint, and object pose.
The method is not directly applicable to 3D geometry tasks like 3D object pose estimation. |
synthetic data generation, text-to-image synthesis, object detection, instance segmentation, low-resource learning |
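The final cut-and-paste step in the entry above, which composites a masked foreground onto a generated background and derives the label from the mask, can be sketched as follows. Images are plain nested lists here to keep the example dependency-free. The function names and the inclusive box convention are assumptions for illustration, not the released code.

```python
def mask_bbox(mask):
    """Tight bounding box (x0, y0, x1, y1), inclusive, of the nonzero mask pixels."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return (cols[0], rows[0], cols[-1], rows[-1])


def paste(background, foreground, mask, top, left):
    """Copy masked foreground pixels onto a copy of the background.

    Returns the composite image and the pasted object's bounding box,
    which serves as the pseudo detection label.
    """
    out = [row[:] for row in background]
    for r, row in enumerate(mask):
        for c, keep in enumerate(row):
            if keep:
                out[top + r][left + c] = foreground[r][c]
    x0, y0, x1, y1 = mask_bbox(mask)
    return out, (left + x0, top + y0, left + x1, top + y1)


background = [[0] * 6 for _ in range(5)]
foreground = [[7, 7, 7], [7, 7, 7]]
mask = [[0, 1, 1], [0, 1, 0]]
image, box = paste(background, foreground, mask, top=2, left=1)
```

Because the paste location and mask are known, the box and segmentation labels come for free, which is what makes the generated data usable without manual annotation.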
2309.05940
Report |
Catch You Everything Everywhere: Guarding Textual Inversion via Concept Watermarking |
Weitao Feng, Jiyan He, Jie Zhang, Tianwei Zhang, Wenbo Zhou, Weiming Zhang, Nenghai Yu |
AIGC (AI-Generated Content) has achieved tremendous success in many
applications such as text-to-image tasks, where the model can generate
high-quality images with diverse prompts, namely, different descriptions in
natural languages. More surprisingly, the emerging personalization techniques
even succeed in describing unseen concepts with only a few personal images as
references, and there have been some commercial platforms for sharing the
valuable personalized concept. However, such an advanced technique also
introduces a severe threat, where malicious users can misuse the target concept
to generate highly-realistic illegal images. Therefore, it becomes necessary
for the platform to trace malicious users and hold them accountable.
In this paper, we focus on guarding the most popular lightweight
personalization model, i.e., Textual Inversion (TI). To achieve it, we propose
the novel concept watermarking, where watermark information is embedded into
the target concept and then extracted from generated images based on the
watermarked concept. Specifically, we jointly train a watermark encoder and a
watermark decoder with the sampler in the loop.
It shows great resilience to different diffusion sampling processes possibly
chosen by malicious users, meanwhile preserving utility for normal use. In
practice, the concept owner can upload his concept with different watermarks
(i.e., serial numbers) to the platform, and the platform allocates different
users with different serial numbers for subsequent tracing and forensics. |
This paper proposes "concept watermarking," a novel method to embed watermark information into Textual Inversion embeddings for tracing the misuse of personalized AI-generated content. |
Sharing personalized AI concepts raises concerns about misuse for illegal commercial purposes or generating harmful content. Tracing malicious users is crucial for accountability. |
The method jointly trains a watermark encoder and decoder with the diffusion sampler in the loop, ensuring the watermark's robustness against different diffusion configurations and preserving the fidelity and editability of the original concept. |
The proposed method effectively embeds watermarks into concepts with a high success rate while preserving image fidelity and textual editability.
It exhibits robustness against various distortions, including different diffusion sampling configurations, post-processing on generated images, and pre-processing on watermarked concepts.
The method shows resilience against adaptive attacks like retraining concept embeddings and forgery attacks. |
The capacity for encoding information is limited by the number of tokens used in Textual Inversion.
The training process can be computationally expensive due to the involvement of the diffusion sampling pipeline. |
ai-generated content, textual inversion, concept watermarking, copyright protection, diffusion models |
2309.05793
Report |
PhotoVerse: Tuning-Free Image Customization with Text-to-Image Diffusion Models |
Li Chen, Mengyi Zhao, Yiheng Liu, Mingxu Ding, Yangyang Song, Shizun Wang, Xu Wang, Hao Yang, Jing Liu, Kang Du, Min Zheng |
Personalized text-to-image generation has emerged as a powerful and
sought-after tool, empowering users to create customized images based on their
specific concepts and prompts. However, existing approaches to personalization
encounter multiple challenges, including long tuning times, large storage
requirements, the necessity for multiple input images per identity, and
limitations in preserving identity and editability. To address these obstacles,
we present PhotoVerse, an innovative methodology that incorporates a
dual-branch conditioning mechanism in both text and image domains, providing
effective control over the image generation process. Furthermore, we introduce
facial identity loss as a novel component to enhance the preservation of
identity during training. Remarkably, our proposed PhotoVerse eliminates the
need for test time tuning and relies solely on a single facial photo of the
target identity, significantly reducing the resource cost associated with image
generation. After a single training phase, our approach enables generating
high-quality images within only a few seconds. Moreover, our method can produce
diverse images that encompass various scenes and styles. The extensive
evaluation demonstrates the superior performance of our approach, which
achieves the dual objectives of preserving identity and facilitating
editability. Project page: https://photoverse2d.github.io/ |
This paper presents PhotoVerse, a novel personalized text-to-image generation method that uses dual-branch conditioning (text and image) and a facial identity loss to preserve identity while enabling image editing. |
Existing personalized text-to-image generation methods suffer from long tuning times, large storage needs, and limitations in preserving identity and editability. PhotoVerse addresses these challenges. |
PhotoVerse uses dual-branch conditioning to project a reference image into a pseudo-word and image feature. These are injected into a fine-tuned text-to-image diffusion model (Stable Diffusion) alongside a facial identity loss during training. |
PhotoVerse generates personalized images in seconds using a single reference image, eliminating test-time tuning.
It surpasses state-of-the-art methods in preserving identity attributes (facial features, expressions, hair) while enabling stylization and scene generation.
Ablation studies highlight the importance of the dual-branch conditioning, facial identity loss, and regularization for high-quality personalized image generation. |
The model's performance across different ethnicities may be influenced by biases in the pre-trained model.
Future work could explore further improvements in pose control and incorporating additional control mechanisms. |
text-to-image generation, personalization, diffusion models, identity preservation, image editing |
2309.05569
Report |
ITI-GEN: Inclusive Text-to-Image Generation |
Cheng Zhang, Xuanbai Chen, Siqi Chai, Chen Henry Wu, Dmitry Lagun, Thabo Beeler, Fernando De la Torre |
Text-to-image generative models often reflect the biases of the training
data, leading to unequal representations of underrepresented groups. This study
investigates inclusive text-to-image generative models that generate images
based on human-written prompts and ensure the resulting images are uniformly
distributed across attributes of interest. Unfortunately, directly expressing
the desired attributes in the prompt often leads to sub-optimal results due to
linguistic ambiguity or model misrepresentation. Hence, this paper proposes a
drastically different approach that adheres to the maxim that "a picture is
worth a thousand words". We show that, for some attributes, images can
represent concepts more expressively than text. For instance, categories of
skin tones are typically hard to specify by text but can be easily represented
by example images. Building upon these insights, we propose a novel approach,
ITI-GEN, that leverages readily available reference images for Inclusive
Text-to-Image GENeration. The key idea is learning a set of prompt embeddings
to generate images that can effectively represent all desired attribute
categories. More importantly, ITI-GEN requires no model fine-tuning, making it
computationally efficient to augment existing text-to-image models. Extensive
experiments demonstrate that ITI-GEN largely improves over state-of-the-art
models to generate inclusive images from a prompt. Project page:
https://czhang0528.github.io/iti-gen. |
This paper introduces ITI-GEN, a novel framework that leverages reference images to learn inclusive prompts, thus improving text-to-image generation diversity across various attributes (e.g., skin tone, age) without retraining the generative model. |
Existing text-to-image models often inherit biases from training data, resulting in an under-representation of certain demographics and attributes. Current bias mitigation techniques in text-to-image generation are limited by linguistic ambiguity and computational complexity. |
ITI-GEN uses a pre-trained CLIP model and a reference image set for each attribute. It learns inclusive token embeddings by aligning the direction of image and prompt embeddings, and employs a semantic consistency loss to preserve language semantics. By sampling from a diverse set of learned prompts, ITI-GEN generates images with a balanced representation of attributes. |
ITI-GEN effectively balances single binary attributes, achieving near-perfect performance on most of the 40 attributes from CelebA.
It generalizes well to multiple attributes, generating diverse images with various category combinations by aggregating inclusive tokens.
The approach demonstrates strong performance in handling multi-category attributes like age and skin tone, even when using synthetic images. |
Limitations: ITI-GEN may not be optimal for subtle facial attributes or highly entangled attribute combinations.
Future Work: Explore lifelong learning capabilities for adding new attributes and investigate the application to other attribute types like 3D geometry. |
inclusive text-to-image generation, bias mitigation, prompt learning, clip, diversity |
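The direction-alignment idea in the ITI-GEN methodology above can be illustrated with a minimal sketch. It assumes embeddings are plain lists of floats and uses a `1 - cosine` loss between the image-embedding direction (difference of per-category means) and the prompt-embedding direction. The function names and the exact loss form are illustrative assumptions, not the paper's implementation, and the semantic consistency loss is not shown.

```python
import math

def _mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def _sub(u, v):
    """Component-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def direction_alignment_loss(img_a, img_b, prompt_a, prompt_b):
    """Hypothetical loss: 1 - cos(image direction, prompt direction).

    img_a, img_b: reference image embeddings for two categories of one attribute.
    prompt_a, prompt_b: embeddings of the two learnable prompts.
    """
    image_dir = _sub(_mean(img_a), _mean(img_b))
    prompt_dir = _sub(prompt_a, prompt_b)
    return 1.0 - cosine(image_dir, prompt_dir)

# Toy 2-D embeddings: the prompt direction is parallel to the image direction.
loss = direction_alignment_loss(
    img_a=[[1.0, 0.0], [3.0, 0.0]],
    img_b=[[0.0, 0.0], [0.0, 0.0]],
    prompt_a=[5.0, 1.0],
    prompt_b=[4.0, 1.0],
)
```

In this toy setup the loss is zero when the two directions are parallel and grows as they diverge; in practice the embeddings would come from a CLIP image and text encoder.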
2309.05448
Report |
Panoptic Vision-Language Feature Fields |
Haoran Chen, Kenneth Blomqvist, Francesco Milano, Roland Siegwart |
Recently, methods have been proposed for 3D open-vocabulary semantic
segmentation. Such methods are able to segment scenes into arbitrary classes
based on text descriptions provided during runtime. In this paper, we propose,
to the best of our knowledge, the first algorithm for open-vocabulary panoptic
segmentation in 3D scenes. Our algorithm, Panoptic Vision-Language Feature
Fields (PVLFF), learns a semantic feature field of the scene by distilling
vision-language features from a pretrained 2D model, and jointly fits an
instance feature field through contrastive learning using 2D instance segments
on input frames. Despite not being trained on the target classes, our method
achieves panoptic segmentation performance similar to the state-of-the-art
closed-set 3D systems on the HyperSim, ScanNet and Replica datasets and
additionally outperforms current 3D open-vocabulary systems in terms of
semantic segmentation. We ablate the components of our method to demonstrate
the effectiveness of our model architecture. Our code will be available at
https://github.com/ethz-asl/pvlff. |
PVLFF, the first open-vocabulary 3D panoptic segmentation system, reconstructs scenes implicitly and enables panoptic segmentation under open-vocabulary prompts. |
Existing 3D panoptic segmentation methods are limited to closed-set predictions. PVLFF bridges this gap, allowing for flexible semantic queries and instance segmentation. |
PVLFF learns a semantic feature field by distilling vision-language embeddings and an instance feature field via contrastive learning on 2D instance proposals, all within a neural radiance field framework. |
PVLFF achieves comparable panoptic segmentation performance to state-of-the-art closed-set methods on HyperSim, ScanNet, and Replica datasets.
PVLFF outperforms zero-shot methods in both 2D and 3D semantic segmentation on ScanNet.
The learned instance features exhibit a hierarchical structure, enabling instance segmentation at different scales. |
The instance segmentation performance is limited by the quality and granularity of the pre-computed object-agnostic 2D instance proposals.
The semantic segmentation relies on a vision-language model with a closed vocabulary, limiting its performance on unseen categories. |
3d panoptic segmentation, open-vocabulary, neural radiance fields, contrastive learning, vision-language |
2309.05418
Report |
FlowIBR: Leveraging Pre-Training for Efficient Neural Image-Based Rendering of Dynamic Scenes |
Marcel Büsching, Josef Bengtson, David Nilsson, Mårten Björkman |
We introduce FlowIBR, a novel approach for efficient monocular novel view
synthesis of dynamic scenes. Existing techniques already show impressive
rendering quality but tend to focus on optimization within a single scene
without leveraging prior knowledge, resulting in long optimization times per
scene. FlowIBR circumvents this limitation by integrating a neural image-based
rendering method, pre-trained on a large corpus of widely available static
scenes, with a per-scene optimized scene flow field. Utilizing this flow field,
we bend the camera rays to counteract the scene dynamics, thereby presenting
the dynamic scene as if it were static to the rendering network. The proposed
method reduces per-scene optimization time by an order of magnitude, achieving
comparable rendering quality to existing methods -- all on a single
consumer-grade GPU. |
FlowIBR, a novel view synthesis method for dynamic scenes, reduces training time by combining a pre-trained generalizable neural image-based rendering method with a per-scene optimized scene flow field. |
Existing methods for dynamic novel view synthesis have long training times and struggle with fast-changing scenes due to relying solely on per-scene optimization without leveraging prior knowledge. |
The method uses a pre-trained Generalizable NeRF Transformer (GNT) for static scenes and learns a per-scene scene flow field. This field bends camera rays to compensate for scene dynamics, allowing the static GNT to render dynamic scenes. |
Reduces per-scene optimization time to 1.5 hours, an order of magnitude faster than previous methods.
Achieves comparable rendering quality to state-of-the-art methods on the Nvidia Dynamic Scenes Dataset.
Enables training on a single consumer-grade GPU due to a dynamics-focused optimization process. |
Moderate rendering speed due to the general-purpose rendering backbone and multi-view projection.
Long sequences can be challenging for the single scene flow network to capture. |
novel view synthesis, dynamic scenes, scene flow, neural rendering, image-based rendering |
2309.05375
Report |
Toward a Deeper Understanding: RetNet Viewed through Convolution |
Chenghao Li, Chaoning Zhang |
The success of Vision Transformer (ViT) has been widely reported on a wide
range of image recognition tasks. ViT can learn global dependencies superior to
CNN, yet CNN's inherent locality can substitute for expensive training
resources. Recently, the outstanding performance of RetNet in the field of
language modeling has garnered attention, surpassing that of the Transformer
with explicit local modeling, shifting researchers' focus towards Transformers
in the CV field. This paper investigates the effectiveness of RetNet from a CNN
perspective and presents a variant of RetNet tailored to the visual domain.
Similar to RetNet, we improve ViT's local modeling by applying a weight mask on
the original self-attention matrix. A straightforward way to locally adapt the
self-attention matrix can be realized by an element-wise learnable weight mask
(ELM), for which our preliminary experiments show promising results. However, the
element-wise simple learnable weight mask not only induces a non-trivial
additional parameter overhead but also increases the optimization complexity.
To this end, this work proposes a novel Gaussian mixture mask (GMM) in which
one mask only has two learnable parameters and it can be conveniently used in
any ViT variants whose attention mechanism allows the use of masks.
Experimental results on multiple small datasets demonstrate the
effectiveness of our proposed Gaussian mask for boosting ViTs for free (almost
zero additional parameter or computation cost). Our code is publicly
available at https://github.com/CatworldLee/Gaussian-Mixture-Mask-Attention. |
This paper proposes a Gaussian Mixture Mask (GMM) for Vision Transformers (ViTs) to enhance their local modeling capabilities, especially on small datasets, with negligible parameter and computation overhead. |
Transformers often struggle to match the performance of CNNs on small datasets due to their lack of inherent local inductive bias. While techniques like pre-training and explicit local modeling exist, they come with limitations. This work aims to address this by introducing a lightweight, effective, and plug-and-play module for ViTs. |
The paper first introduces an Element-wise Learnable Mask (ELM) added to the attention scores, revealing two key characteristics: locality (preference for nearby patches) and extroversion (reduced self-attention). Building upon these findings, they propose GMM, which uses a mixture of Gaussian functions to generate the attention mask dynamically. This approach significantly reduces the number of learnable parameters compared to ELM while achieving superior performance. |
GMM-ViT consistently outperforms standard ViT and even achieves comparable performance to Swin Transformer on small datasets with minimal additional parameters.
GMM effectively improves the performance of deep ViTs, mitigating the accuracy drop observed with increasing depth.
Visualization of attention maps shows that GMM-ViT exhibits stronger expressive power than both standard ViT and ELM-ViT. |
The paper mainly focuses on small-scale datasets and further investigation is needed to evaluate the effectiveness of GMM on large-scale datasets.
The impact of GMM on different ViT variants is explored, but a more comprehensive analysis on a wider range of architectures is desirable. |
vision transformer, local modeling, gaussian mixture mask, attention mechanism, small datasets |
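As a rough illustration of the Gaussian mixture mask described in the RetNet/GMM record above, the sketch below builds a patch-to-patch mask in which each Gaussian contributes two parameters (an amplitude and a width). It assumes the mask depends only on the squared grid distance between patches; the function name, the parameterization, and how the mask is combined with attention scores are assumptions and are not taken from the paper's code.

```python
import math

def gaussian_mixture_mask(grid, amplitudes, sigmas):
    """Build an (N x N) mask over N = grid * grid patches.

    mask[i][j] = sum_k amplitudes[k] * exp(-d(i, j)^2 / (2 * sigmas[k]^2)),
    where d(i, j) is the Euclidean distance between patches i and j on the grid.
    In a real model the amplitudes and sigmas would be learnable parameters.
    """
    coords = [(r, c) for r in range(grid) for c in range(grid)]
    n = len(coords)
    mask = [[0.0] * n for _ in range(n)]
    for i, (ri, ci) in enumerate(coords):
        for j, (rj, cj) in enumerate(coords):
            d2 = (ri - rj) ** 2 + (ci - cj) ** 2
            mask[i][j] = sum(
                a * math.exp(-d2 / (2.0 * s * s))
                for a, s in zip(amplitudes, sigmas)
            )
    return mask

# Two Gaussians (four parameters in total) on a 3 x 3 patch grid.
mask = gaussian_mixture_mask(grid=3, amplitudes=[1.0, 0.5], sigmas=[1.0, 2.0])
```

With positive amplitudes, mask values decay with patch distance, which is one way to encode the locality preference reported for the element-wise learnable mask.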
2309.05251
Report |
Multi3DRefer: Grounding Text Description to Multiple 3D Objects |
Yiming Zhang, ZeMing Gong, Angel X. Chang |
We introduce the task of localizing a flexible number of objects in
real-world 3D scenes using natural language descriptions. Existing 3D visual
grounding tasks focus on localizing a unique object given a text description.
However, such a strict setting is unnatural as localizing potentially multiple
objects is a common need in real-world scenarios and robotic tasks (e.g.,
visual navigation and object rearrangement). To address this setting we propose
Multi3DRefer, generalizing the ScanRefer dataset and task. Our dataset contains
61926 descriptions of 11609 objects, where zero, single or multiple target
objects are referenced by each description. We also introduce a new evaluation
metric and benchmark methods from prior work to enable further investigation of
multi-modal 3D scene understanding. Furthermore, we develop a better baseline
leveraging 2D features from CLIP by rendering object proposals online with
contrastive learning, which outperforms the state of the art on the ScanRefer
benchmark. |
Introduces the task of localizing multiple 3D objects in real-world scenes from natural language descriptions, addressing the limitation of previous work that assumes a single target object. |
Enables more realistic and flexible 3D visual grounding, crucial for robotics, AR/VR, and other applications requiring interaction with 3D environments. |
Creates the Multi3DRefer dataset, augmenting ScanRefer with descriptions referencing zero, single, or multiple objects using ChatGPT and manual verification. Proposes M3DRef-CLIP, a CLIP-based approach with online rendering and contrastive learning, and benchmarks it against existing methods. |
M3DRef-CLIP outperforms state-of-the-art methods on ScanRefer and achieves competitive results on Nr3D.
Training on Multi3DRefer improves performance on ScanRefer, demonstrating the dataset's value.
Analysis shows that CLIP features, contrastive learning, and the Hungarian matching strategy improve performance on Multi3DRefer. |
Current design relies heavily on 3D object detector features, potentially limiting the understanding of global context.
Exploring positional encoding to improve spatial reasoning is an area for future work. |
3d visual grounding, multi-object localization, vision-language models, clip, 3d scene understanding |
2309.04917
Report |
Editing 3D Scenes via Text Prompts without Retraining |
Shuangkang Fang, Yufeng Wang, Yi Yang, Yi-Hsuan Tsai, Wenrui Ding, Shuchang Zhou, Ming-Hsuan Yang |
Numerous diffusion models have recently been applied to image synthesis and
editing. However, editing 3D scenes is still in its early stages. It poses
various challenges, such as the requirement to design specific methods for
different editing types, retraining new models for various 3D scenes, and the
absence of convenient human interaction during editing. To tackle these issues,
we introduce a text-driven editing method, termed DN2N, which allows for the
direct acquisition of a NeRF model with universal editing capabilities,
eliminating the requirement for retraining. Our method employs off-the-shelf
text-based editing models of 2D images to modify the 3D scene images, followed
by a filtering process to discard poorly edited images that disrupt 3D
consistency. We then consider the remaining inconsistency as a problem of
removing noise perturbation, which can be solved by generating training data
with similar perturbation characteristics for training. We further propose
cross-view regularization terms to help the generalized NeRF model mitigate
these perturbations. Our text-driven method allows users to edit a 3D scene
with their desired description, which is more friendly, intuitive, and
practical than prior works. Empirical results show that our method achieves
multiple editing types, including but not limited to appearance editing,
weather transition, material changing, and style transfer. Most importantly,
our method generalizes well with editing abilities shared among a set of model
parameters without requiring a customized editing model for some specific
scenes, thus inferring novel views with editing effects directly from user
input. The project website is available at https://sk-fun.fun/DN2N |
Proposes DN2N, a text-driven 3D scene editing framework with generalization capability, eliminating the need to retrain models for different scenes or editing types. |
Existing 3D scene editing methods often lack user-friendliness, require retraining for different scenes or editing types, and have limited modification capabilities. |
Leverages off-the-shelf 2D image editing models for initial 3D editing, filters poorly edited images, and trains a generalizable NeRF model to remove inconsistencies, treating them as perturbations. |
Achieves diverse editing types, including appearance editing, weather transitions, object changing, and style transfer.
Demonstrates superior performance in preserving image content, aligning with text descriptions, and maintaining 3D consistency compared to existing methods.
Significantly reduces editing time and storage consumption by eliminating retraining for new scenes or editing types. |
The quality of 3D editing is constrained by the capabilities of the underlying 2D editing model.
Quantitative evaluation of editing results remains a challenge, relying on subjective evaluations like user studies. |
3d scene editing, text-driven editing, neural radiance fields, generalizable model, content filter |
2309.04907
Report |
Effective Real Image Editing with Accelerated Iterative Diffusion Inversion |
Zhihong Pan, Riccardo Gherardi, Xiufeng Xie, Stephen Huang |
Despite all recent progress, it is still challenging to edit and manipulate
natural images with modern generative models. When using Generative Adversarial
Network (GAN), one major hurdle is in the inversion process mapping a real
image to its corresponding noise vector in the latent space, since it is
necessary to be able to reconstruct an image to edit its contents. Likewise for
Denoising Diffusion Implicit Models (DDIM), the linearization assumption in
each inversion step makes the whole deterministic inversion process unreliable.
Existing approaches that have tackled the problem of inversion stability often
incur significant trade-offs in computational efficiency. In this work we
propose an Accelerated Iterative Diffusion Inversion method, dubbed AIDI, that
significantly improves reconstruction accuracy with minimal additional overhead
in space and time complexity. By using a novel blended guidance technique, we
show that effective results can be obtained on a large range of image editing
tasks without large classifier-free guidance in inversion. Furthermore, when
compared with other diffusion inversion based works, our proposed process is
shown to be more robust for fast image editing in the 10- and 20-step
diffusion regimes. |
Presents AIDI, an accelerated iterative diffusion inversion method for enhanced real image editing with text-to-image diffusion models. |
Addresses the challenge of unreliable inversion in diffusion models, which limits their effectiveness for real image editing. |
Proposes AIDI, employing fixed-point iteration and acceleration techniques for improved inversion stability. Introduces blended guidance to apply different guidance scales for editing and inversion, enhancing editing control. |
AIDI significantly improves reconstruction accuracy compared to baseline methods, achieving near-exact inversion without classifier-free guidance.
Enables effective image editing with as few as 10 diffusion steps, outperforming competing approaches in terms of editing quality and perceptual similarity.
Proposed stochastic editing recovers from failure cases of deterministic editing, increasing editing flexibility. |
Detailed control of the editable area using coarse cross-attention maps requires further investigation.
Improving inversion stability for large guidance scales remains an area for future research. |
diffusion models, image editing, diffusion inversion, text-to-image synthesis, generative models |
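The fixed-point view of DDIM inversion mentioned in the AIDI record above can be sketched on a scalar toy problem. The inversion step is implicit because the noise estimate depends on the unknown x_t; plain fixed-point iteration re-evaluates the estimate at the current guess. The noise predictor `toy_eps` is a hypothetical stand-in for a trained network, and the acceleration scheme and blended guidance from the paper are not shown.

```python
import math

def toy_eps(x, t):
    """Hypothetical stand-in for a learned noise predictor eps(x_t, t)."""
    return math.tanh(x)

def ddim_step(x_t, alpha_prev, alpha_t, t, eps=toy_eps):
    """Deterministic DDIM denoising step x_t -> x_{t-1} (alphas are cumulative)."""
    e = eps(x_t, t)
    x0 = (x_t - math.sqrt(1.0 - alpha_t) * e) / math.sqrt(alpha_t)
    return math.sqrt(alpha_prev) * x0 + math.sqrt(1.0 - alpha_prev) * e

def invert_step(x_prev, alpha_prev, alpha_t, t, eps=toy_eps, iters=20):
    """Solve x_t = a * x_prev + b * eps(x_t, t) by fixed-point iteration.

    With iters=1 this reduces to naive DDIM inversion, which evaluates the
    noise estimate at x_prev instead of at the unknown x_t.
    """
    a = math.sqrt(alpha_t / alpha_prev)
    b = math.sqrt(1.0 - alpha_t) - math.sqrt(alpha_t * (1.0 - alpha_prev) / alpha_prev)
    x_t = x_prev  # initial guess
    for _ in range(iters):
        x_t = a * x_prev + b * eps(x_t, t)
    return x_t

x_prev = 0.7
naive = invert_step(x_prev, alpha_prev=0.9, alpha_t=0.8, t=1, iters=1)
refined = invert_step(x_prev, alpha_prev=0.9, alpha_t=0.8, t=1, iters=20)
# Round-trip errors: how far denoising the inverted sample lands from x_prev.
naive_err = abs(ddim_step(naive, 0.9, 0.8, 1) - x_prev)
refined_err = abs(ddim_step(refined, 0.9, 0.8, 1) - x_prev)
```

In this toy setting the refined inversion reconstructs `x_prev` almost exactly while the single-evaluation version leaves a visible residual, which mirrors the reconstruction-accuracy motivation stated in the abstract.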
2309.04887
Report |
SortedAP: Rethinking evaluation metrics for instance segmentation |
Long Chen, Yuli Wu, Johannes Stegmaier, Dorit Merhof |
Designing metrics for evaluating instance segmentation revolves around
comprehensively considering object detection and segmentation accuracy.
However, other important properties, such as sensitivity, continuity, and
equality, are overlooked in current studies. In this paper, we reveal that
most existing metrics have a limited resolution of segmentation quality. They
are only conditionally sensitive to the change of masks or false predictions.
For certain metrics, the score can change drastically in a narrow range which
could provide a misleading indication of the quality gap between results.
Therefore, we propose a new metric called sortedAP, which strictly decreases
with both object- and pixel-level imperfections and has an uninterrupted
penalization scale over the entire domain. We provide the evaluation toolkit
and experiment code at https://www.github.com/looooongChen/sortedAP. |
The paper proposes a new evaluation metric for instance segmentation called sorted Average Precision (sortedAP) that addresses limitations in existing metrics. |
Current instance segmentation metrics often lack sensitivity to small changes, exhibit abrupt score changes, or treat objects unequally based on size. This can lead to misleading evaluations. |
The authors analyze existing metrics like mAP and PQ, highlighting their deficiencies. They then introduce sortedAP, which utilizes a "Unique Matching" method based on the Hungarian algorithm to maximize IoU matching and calculate AP over the entire IoU range. |
sortedAP demonstrates smooth and continuous score degradation with gradually introduced errors in simulated experiments, unlike metrics like mAP or PQ.
The Unique Matching method allows for the use of IoU thresholds below 0.5 and handles object overlap effectively.
sortedAP provides a more sensitive and accurate reflection of segmentation quality compared to existing metrics. |
The paper primarily focuses on clustered instances, and further evaluation with different instance types is suggested.
Future work could explore incorporating sortedAP into existing deep learning frameworks for model training and optimization. |
instance segmentation, evaluation metric, sortedap, unique matching, iou |
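The sortedAP record above describes one-to-one matching followed by AP computed over the entire IoU range. The sketch below is one reading of that description: AP at threshold t is taken as TP / (TP + FP + FN), and the score is the area under this step function for t in [0, 1]. Brute-force search over permutations is used here as a stand-in for the Hungarian algorithm (only practical for a handful of objects), and the function names are illustrative, not those of the authors' toolkit.

```python
from itertools import permutations

def matched_ious(iou):
    """One-to-one matching maximizing total IoU.

    iou[i][j] is the IoU between ground-truth object i and prediction j.
    Brute force replaces the Hungarian algorithm for this small example.
    """
    n_rows = len(iou)
    n_cols = len(iou[0]) if n_rows else 0
    if n_rows == 0 or n_cols == 0:
        return []
    if n_rows > n_cols:  # transpose so that rows are the smaller side
        iou = [list(col) for col in zip(*iou)]
        n_rows, n_cols = n_cols, n_rows
    best = max(
        permutations(range(n_cols), n_rows),
        key=lambda p: sum(iou[i][j] for i, j in enumerate(p)),
    )
    # Pairs with zero overlap are not counted as matches.
    return [iou[i][j] for i, j in enumerate(best) if iou[i][j] > 0]

def sorted_ap(ious, n_gt, n_pred):
    """Area under AP(t) = TP(t) / (n_gt + n_pred - TP(t)) for t in [0, 1]."""
    area, prev = 0.0, 0.0
    ordered = sorted(ious)
    for i, value in enumerate(ordered):
        tp = len(ordered) - i  # matches that still count as TP below `value`
        area += (value - prev) * tp / (n_gt + n_pred - tp)
        prev = value
    return area

iou_matrix = [[0.8, 0.1],
              [0.2, 0.6]]
score = sorted_ap(matched_ious(iou_matrix), n_gt=2, n_pred=2)
```

Because every matched IoU contributes its own step, any change in a mask or any added false prediction shifts the score, which is the sensitivity property the abstract emphasizes.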
2309.04820
Report |
ABC Easy as 123: A Blind Counter for Exemplar-Free Multi-Class Class-agnostic Counting |
Michael A. Hobley, Victor A. Prisacariu |
Class-agnostic counting methods enumerate objects of an arbitrary class,
providing tremendous utility in many fields. Prior works have limited
usefulness as they require either a set of examples of the type to be counted
or that the image contains only a single type of object. A significant factor
in these shortcomings is the lack of a dataset to properly address counting in
settings with more than one kind of object present. To address these issues, we
propose the first Multi-class, Class-Agnostic Counting dataset (MCAC) and A
Blind Counter (ABC123), a method that can count multiple types of objects
simultaneously without using examples of each type during training or inference.
ABC123 introduces a new paradigm where instead of requiring exemplars to guide
the enumeration, examples are found after the counting stage to help a user
understand the generated outputs. We show that ABC123 outperforms contemporary
methods on MCAC without the requirement of human in-the-loop annotations. We
also show that this performance transfers to FSC-147, the standard
class-agnostic counting dataset. |
This paper introduces MCAC, the first multi-class class-agnostic counting dataset, and ABC123, an exemplar-free multi-class class-agnostic counting method that outperforms existing methods in multi-class settings. |
Existing class-agnostic counting methods rely on exemplars or single-class images, limiting their real-world applicability. This work addresses the need for accurate counting in multi-class scenarios. |
ABC123 leverages a vision transformer backbone and multiple upsampling heads to regress density maps for potential object classes. It then uses a matching stage to align predictions with ground truth labels during training and employs an example discovery stage to provide interpretable visualizations. |
ABC123 significantly outperforms exemplar-based methods on MCAC, demonstrating its effectiveness in multi-class counting.
The method generalizes well to FSC-147, a photographic counting dataset, highlighting its ability to learn from synthetic data.
ABC123 often identifies 'valid-but-unknown' counts, revealing its potential for novel class discovery. |
The example discovery stage relies on a pre-trained segmentation method, which might limit its accuracy.
Quantitative evaluation on FSC-147 is hindered by discrepancies in class definitions between MCAC and FSC, highlighting the need for further research on aligning synthetic and real-world data. |
class-agnostic counting, multi-class counting, exemplar-free counting, synthetic dataset, object counting |
2309.04581
Report |
Dynamic Mesh-Aware Radiance Fields |
Yi-Ling Qiao, Alexander Gao, Yiran Xu, Yue Feng, Jia-Bin Huang, Ming C. Lin |
Embedding polygonal mesh assets within photorealistic Neural Radiance Fields
(NeRF) volumes, such that they can be rendered and their dynamics simulated in
a physically consistent manner with the NeRF, is under-explored from the system
perspective of integrating NeRF into the traditional graphics pipeline. This
paper designs a two-way coupling between mesh and NeRF during rendering and
simulation. We first review the light transport equations for both mesh and
NeRF, then distill them into an efficient algorithm for updating radiance and
throughput along a cast ray with an arbitrary number of bounces. To resolve the
discrepancy between the linear color space that the path tracer assumes and the
sRGB color space that standard NeRF uses, we train NeRF with High Dynamic Range
(HDR) images. We also present a strategy to estimate light sources and cast
shadows on the NeRF. Finally, we consider how the hybrid surface-volumetric
formulation can be efficiently integrated with a high-performance physics
simulator that supports cloth, rigid and soft bodies. The full rendering and
simulation system can be run on a GPU at interactive rates. We show that a
hybrid system approach outperforms alternatives in visual realism for mesh
insertion, because it allows realistic light transport from volumetric NeRF
media onto surfaces, which affects the appearance of reflective/refractive
surfaces and illumination of diffuse surfaces informed by the dynamic scene. |
This paper presents a hybrid graphics pipeline that integrates neural radiance fields (NeRF) and polygonal meshes for photorealistic rendering and physically-based simulation. |
NeRF excels at capturing photorealistic appearances, while meshes are better suited for simulation and traditional graphics pipelines. Integrating both allows leveraging their respective strengths. |
The method unifies light transport equations for NeRF and surface rendering, enabling seamless switching between ray marching and path tracing. It utilizes HDR NeRF for accurate lighting and estimates light sources for shadow casting. |
Hybrid rendering produces more realistic results than separate rendering or mesh extraction from NeRF.
HDR NeRF provides more accurate lighting compared to standard NeRF, especially for indirect illumination.
The system achieves interactive frame rates on a laptop GPU for real-time applications. |
Current implementation lacks shadow casting and illumination on NeRF points.
Support for advanced rendering features like environment maps and UV textures is limited. |
neural radiance fields, nerf, hybrid rendering, physics simulation, hdr |
2309.04561
Report |
Three Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding |
Ozan Unal, Christos Sakaridis, Suman Saha, Fisher Yu, Luc Van Gool |
3D visual grounding is the task of localizing the object in a 3D scene that
is referred to by a description in natural language. With a wide range of
applications ranging from autonomous indoor robotics to AR/VR, the task has
recently risen in popularity. A common formulation to tackle 3D visual
grounding is grounding-by-detection, where localization is done via bounding
boxes. However, for real-life applications that require physical interactions,
a bounding box insufficiently describes the geometry of an object. We therefore
tackle the problem of dense 3D visual grounding, i.e. referral-based 3D
instance segmentation. We propose a dense 3D grounding network ConcreteNet,
featuring three novel stand-alone modules which aim to improve grounding
performance for challenging repetitive instances, i.e. instances with
distractors of the same semantic class. First, we introduce a bottom-up
attentive fusion module that aims to disambiguate inter-instance relational
cues, next we construct a contrastive training scheme to induce separation in
the latent space, and finally we resolve view-dependent utterances via a
learned global camera token. ConcreteNet ranks 1st on the challenging ScanRefer
online benchmark by a considerable +9.43% accuracy at 50% IoU and has won the
ICCV 3rd Workshop on Language for 3D Scenes "3D Object Localization" challenge. |
This paper presents ConcreteNet, a novel network for dense 3D visual grounding, which localizes objects in a 3D scene based on natural language descriptions and provides detailed 3D instance masks instead of just bounding boxes. |
Dense 3D visual grounding is crucial for real-world applications like robotics, AR/VR, where detailed object geometry is needed for interactions beyond simple detection. |
ConcreteNet uses a grounding-by-selection approach: a 3D instance segmentation backbone generates candidates, followed by a verbo-visual fusion module that selects the target object based on the language input. Three novel modules are introduced: (1) Bottom-up Attentive Fusion (BAF) for disambiguating object relations using local attention, (2) Contrastive learning for better separation of instance embeddings, and (3) a Global Camera Token (GCT) to handle view-dependent descriptions. |
ConcreteNet outperforms state-of-the-art methods on the ScanRefer benchmark, achieving a significant +9.43% accuracy improvement at 50% IoU with test-time augmentation.
The ablation study demonstrates that each proposed module (BAF, contrastive learning, GCT) contributes to the improved performance, especially for challenging repetitive instances.
The paper provides evidence that using 3D instance segmentation for grounding yields more robust localization and tighter predictions compared to 3D object detection. |
The paper acknowledges the challenge of determining camera positions from unlabeled datasets for learning GCT in a fully unsupervised manner.
Future work could explore extending ConcreteNet to handle more complex language and interactions in 3D scenes. |
visual grounding, vision-language fusion, 3d vision, contrastive learning, instance segmentation |
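The ConcreteNet entry above reports accuracy at 50% IoU. The following is a minimal sketch of how such a metric can be computed, not the benchmark's official evaluation code. It assumes each predicted and ground-truth instance is given as a set of point indices; the helper names `mask_iou` and `accuracy_at_iou` are hypothetical, and whether the benchmark computes IoU on masks or on boxes, and whether the threshold is inclusive, are assumptions here.

```python
def mask_iou(pred, gt):
    """IoU between two instance masks given as collections of point indices."""
    pred, gt = set(pred), set(gt)
    union = pred | gt
    if not union:
        # Both masks empty: treat as no overlap (an assumption of this sketch).
        return 0.0
    return len(pred & gt) / len(union)


def accuracy_at_iou(pairs, threshold=0.5):
    """Fraction of (prediction, ground truth) pairs whose IoU reaches the threshold."""
    if not pairs:
        return 0.0
    hits = sum(1 for pred, gt in pairs if mask_iou(pred, gt) >= threshold)
    return hits / len(pairs)


if __name__ == "__main__":
    # Two referring expressions: one prediction overlaps well, one misses entirely.
    samples = [({1, 2, 3}, {2, 3, 4}), ({1}, {2})]
    print(accuracy_at_iou(samples, threshold=0.5))
```

A tighter mask raises the IoU for the same object, which is one reason dense (mask-level) grounding can score differently from box-level grounding under the same threshold.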
2309.04430
Report |
Create Your World: Lifelong Text-to-Image Diffusion |
Gan Sun, Wenqi Liang, Jiahua Dong, Jun Li, Zhengming Ding, Yang Cong |
Text-to-image generative models can produce diverse high-quality images of
concepts with a text prompt, which have demonstrated excellent ability in image
generation, image translation, etc. In this work, we study the problem of
synthesizing instantiations of a user's own concepts in a never-ending manner,
i.e., create your world, where new concepts from the user are quickly learned
with a few examples. To achieve this goal, we propose a Lifelong text-to-image
Diffusion Model (L2DM), which intends to overcome knowledge "catastrophic
forgetting" for the past encountered concepts, and semantic "catastrophic
neglecting" for one or more concepts in the text prompt. In respect of
knowledge "catastrophic forgetting", our L2DM framework devises a task-aware
memory enhancement module and an elastic-concept distillation module, which
could respectively safeguard the knowledge of both prior concepts and each past
personalized concept. When generating images with a user text prompt, the
solution to semantic "catastrophic neglecting" is that a concept attention
artist module can alleviate the semantic neglecting from concept aspect, and an
orthogonal attention module can reduce the semantic binding from attribute
aspect. To this end, our model can generate more faithful images across a range
of continual text prompts in terms of both qualitative and quantitative
metrics, compared with the related state-of-the-art models. The code will
be released at https://wenqiliang.github.io/. |
This paper proposes L$^2$DM, a lifelong text-to-image diffusion model that continually incorporates user-specific concepts while retaining prior knowledge. |
Existing text-to-image models struggle to efficiently learn new concepts without forgetting previously learned ones, limiting their ability to personalize to individual users' needs. |
L$^2$DM addresses catastrophic forgetting with a task-aware memory enhancement (TAME) module for prior knowledge and an elastic concept distillation (ECD) module for personalized knowledge. It tackles catastrophic neglecting during multi-concept generation using a concept attention artist (CAA) module for concept-neglecting and an orthogonal attention artist (OAA) module for attribute-neglecting. |
L$^2$DM outperforms state-of-the-art methods in lifelong single- and multi-concept generation, achieving higher text and image alignment.
It demonstrates superior anti-forgetting ability, evidenced by lower task forgetting rates.
The model is computationally efficient, requiring fewer network parameters than other methods. |
L$^2$DM still faces challenges in generating complex compositions, similar to existing diffusion models.
Catastrophic neglecting remains an issue when composing four or more concepts. |
lifelong machine learning, stable diffusion, image generation, continual learning, text-to-image synthesis |
2309.04410
Report |
DeformToon3D: Deformable 3D Toonification from Neural Radiance Fields |
Junzhe Zhang, Yushi Lan, Shuai Yang, Fangzhou Hong, Quan Wang, Chai Kiat Yeo, Ziwei Liu, Chen Change Loy |
In this paper, we address the challenging problem of 3D toonification, which
involves transferring the style of an artistic domain onto a target 3D face
with stylized geometry and texture. Although fine-tuning a pre-trained 3D GAN
on the artistic domain can produce reasonable performance, this strategy has
limitations in the 3D domain. In particular, fine-tuning can deteriorate the
original GAN latent space, which affects subsequent semantic editing, and
requires independent optimization and storage for each new style, limiting
flexibility and efficient deployment. To overcome these challenges, we propose
DeformToon3D, an effective toonification framework tailored for hierarchical 3D
GAN. Our approach decomposes 3D toonification into subproblems of geometry and
texture stylization to better preserve the original latent space. Specifically,
we devise a novel StyleField that predicts conditional 3D deformation to align
a real-space NeRF to the style space for geometry stylization. Thanks to the
StyleField formulation, which already handles geometry stylization well,
texture stylization can be achieved conveniently via adaptive style mixing that
injects information of the artistic domain into the decoder of the pre-trained
3D GAN. Due to the unique design, our method enables flexible style degree
control and shape-texture-specific style swap. Furthermore, we achieve
efficient training without any real-world 2D-3D training pairs but proxy
samples synthesized from off-the-shelf 2D toonification models. |
This paper proposes DeformToon3D, a novel 3D toonification framework that decomposes geometry and texture stylization, preserving the pre-trained GAN latent space for compatibility with existing editing and animation tools. |
3D toonification is crucial for applications like avatar creation, but existing methods relying on fine-tuning pre-trained GANs suffer from limitations in preserving the original GAN latent space, efficient style control, and storage efficiency. |
DeformToon3D introduces a novel StyleField module to deform the real-space NeRF to the style space for geometry stylization. It further utilizes adaptive style mixing to inject artistic domain information into the decoder for texture stylization. The method is trained on synthetic paired data, eliminating the need for real-world 2D-3D pairs. |
DeformToon3D achieves high-quality geometry and texture toonification over diverse styles, outperforming baselines in identity preservation and FID.
The method retains the original GAN latent space, enabling compatibility with inversion, editing, and animation techniques designed for the pre-trained GAN.
DeformToon3D significantly reduces storage costs by up to 98.5% compared to fine-tuning-based methods, making it suitable for deployment on resource-constrained devices. |
The performance of DeformToon3D relies on the quality of paired training data. Styles with limited information cues might lead to noticeable artifacts.
Future work could explore introducing re-lighting during training, incorporating vision-language models for flexible style control, and integrating 3D animation pipelines. |
toonification, 3d gan, style transfer, nerf, deformation field |
2309.04399
Report |
MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask |
Yupeng Zhou, Daquan Zhou, Zuo-Liang Zhu, Yaxing Wang, Qibin Hou, Jiashi Feng |
Recent advancements in diffusion models have showcased their impressive
capacity to generate visually striking images. Nevertheless, ensuring a close
match between the generated image and the given prompt remains a persistent
challenge. In this work, we identify that a crucial factor leading to the
text-image mismatch issue is the inadequate cross-modality relation learning
between the prompt and the output image. To better align the prompt and image
content, we advance the cross-attention with an adaptive mask, which is
conditioned on the attention maps and the prompt embeddings, to dynamically
adjust the contribution of each text token to the image features. This
mechanism explicitly diminishes the ambiguity in semantic information embedding
from the text encoder, leading to a boost of text-to-image consistency in the
synthesized images. Our method, termed MaskDiffusion, is training-free and
hot-pluggable for popular pre-trained diffusion models. When applied to the
latent diffusion models, our MaskDiffusion can significantly improve the
text-to-image consistency with negligible computation overhead compared to the
original diffusion models. |
This paper introduces a training-free method to enhance text-to-image consistency in diffusion models by adaptively masking cross-attention maps based on prompt embeddings. |
Existing text-to-image diffusion models often struggle to generate images that strictly adhere to the input text prompts, particularly when dealing with long or complex descriptions. |
The method identifies inadequate cross-modality relation learning as a root cause for inconsistency. It then proposes a conditional mask generation algorithm which analyzes the cross-attention maps and modifies them to ensure that the relevant objects and attributes are appropriately represented in the generated image. |
The method significantly improves text-to-image consistency without requiring any additional training data or significantly increasing computational overhead.
Evaluation using CLIP score and a user study demonstrated superior performance over other state-of-the-art techniques.
Ablation studies highlighted the importance of momentum-based attention map updating and selecting the appropriate feature resolution for mask application. |
The method's reliance on the CLIP text encoder can be limiting, as CLIP may not always accurately interpret complex sentences.
Future work aims to address the ambiguity arising from CLIP's semantic understanding and further enhance the method's capability to handle intricate prompts. |
diffusion models, text-to-image synthesis, cross-attention, semantic consistency, training-free methods |
2309.04354
Report |
Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts |
Erik Daxberger, Floris Weers, Bowen Zhang, Tom Gunter, Ruoming Pang, Marcin Eichner, Michael Emmersberger, Yinfei Yang, Alexander Toshev, Xianzhi Du |
Sparse Mixture-of-Experts models (MoEs) have recently gained popularity due
to their ability to decouple model size from inference efficiency by only
activating a small subset of the model parameters for any given input token. As
such, sparse MoEs have enabled unprecedented scalability, resulting in
tremendous successes across domains such as natural language processing and
computer vision. In this work, we instead explore the use of sparse MoEs to
scale-down Vision Transformers (ViTs) to make them more attractive for
resource-constrained vision applications. To this end, we propose a simplified
and mobile-friendly MoE design where entire images rather than individual
patches are routed to the experts. We also propose a stable MoE training
procedure that uses super-class information to guide the router. We empirically
show that our sparse Mobile Vision MoEs (V-MoEs) can achieve a better trade-off
between performance and efficiency than the corresponding dense ViTs. For
example, for the ViT-Tiny model, our Mobile V-MoE outperforms its dense
counterpart by 3.39% on ImageNet-1k. For an even smaller ViT variant with only
54M FLOPs inference cost, our MoE achieves an improvement of 4.66%. |
This paper introduces Mobile Vision MoEs (V-MoEs), a novel approach utilizing sparse Mixture-of-Experts (MoEs) to enhance the efficiency of Vision Transformers (ViTs) for resource-limited vision applications. |
This work addresses the limitations of traditional dense ViT models in resource-constrained environments by leveraging sparse MoEs to decouple model size from inference cost, thereby enhancing their suitability for mobile-friendly vision tasks. |
The paper proposes a simplified MoE design with per-image routing and a robust training strategy employing semantic super-class guidance for expert specialization. Experiments are conducted on ImageNet-1k to evaluate the performance and efficiency trade-offs. |
Mobile V-MoEs consistently outperform their dense ViT counterparts across various model sizes, demonstrating a superior accuracy vs. FLOPs trade-off.
Optimal performance is achieved with 10 experts and 2 MoE layers, indicating a balance between model capacity and routing efficiency.
Per-image routing with semantic super-class guidance proves to be an effective strategy, outperforming end-to-end learned routing and random super-class baselines. |
Future work includes applying the MoE design to more mobile-friendly architectures like MobileNets and exploring its effectiveness in other vision tasks.
On-device latency measurements are still needed for a comprehensive efficiency analysis. |
vision transformers, mixture-of-experts, mobile vision, resource-constrained devices, image classification |
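The per-image routing described in the Mobile V-MoE entry above can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single linear router applied to one image-level feature vector, top-k selection over softmax probabilities, and renormalized gate weights. The names `route_image` and `moe_layer`, and the way the image-level feature is obtained, are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_image(image_feature, router_weights, k=1):
    """Select the top-k experts for a whole image from one image-level feature."""
    logits = [sum(w * x for w, x in zip(row, image_feature)) for row in router_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(i, probs[i]) for i in top]


def moe_layer(patch_tokens, image_feature, router_weights, experts, k=1):
    """Apply the same selected experts to every patch token of the image."""
    selected = route_image(image_feature, router_weights, k)
    norm = sum(p for _, p in selected)  # renormalize gates (assumption of this sketch)
    outputs = []
    for token in patch_tokens:
        mixed = [0.0] * len(token)
        for idx, p in selected:
            expert_out = experts[idx](token)
            mixed = [m + (p / norm) * v for m, v in zip(mixed, expert_out)]
        outputs.append(mixed)
    return outputs


if __name__ == "__main__":
    # Two toy experts and a router that prefers expert 0 for this image.
    experts = [lambda t: [2.0 * v for v in t], lambda t: [-v for v in t]]
    router_weights = [[1.0, 0.0], [0.0, 1.0]]
    tokens = [[1.0, 2.0], [3.0, 4.0]]
    print(moe_layer(tokens, [2.0, 0.0], router_weights, experts, k=1))
```

Because the routing decision is made once per image rather than once per patch, only the selected experts need to be evaluated for that image, which is the property the entry highlights as mobile-friendly.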
2309.04109
Report |
From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models |
Changming Xiao, Qi Yang, Feng Zhou, Changshui Zhang |
Diffusion models have revolutionized the field of text-to-image generation
recently. The unique way of fusing text and image information contributes to
their remarkable capability of generating highly text-related images. From
another perspective, these generative models imply clues about the precise
correlation between words and pixels. In this work, a simple but effective
method is proposed to utilize the attention mechanism in the denoising network
of text-to-image diffusion models. Without re-training or inference-time
optimization, the semantic grounding of phrases can be attained directly. We
evaluate our method on Pascal VOC 2012 and Microsoft COCO 2014 under
weakly-supervised semantic segmentation setting and our method achieves
superior performance to prior methods. In addition, the acquired word-pixel
correlation is found to be generalizable for the learned text embedding of
customized generation methods, requiring only a few modifications. To validate
our discovery, we introduce a new practical task called "personalized referring
image segmentation" with a new dataset. Experiments in various situations
demonstrate the advantages of our method compared to strong baselines on this
task. In summary, our work reveals a novel way to extract the rich multi-modal
knowledge hidden in diffusion models for segmentation. |
This paper proposes a novel method for open-vocabulary segmentation that utilizes the attention mechanism in off-the-shelf text-to-image diffusion models without retraining. |
Existing open-vocabulary segmentation methods rely on discriminative models, which may not have a thorough understanding of images. This work explores the potential of generative diffusion models for segmentation, which are believed to have a better grasp of scene-level structure. |
The method leverages the cross-attention and self-attention mechanisms in diffusion models. It treats self-attention as the affinity matrix of different image patches and propagates cross-attention scores accordingly to capture both unary and pairwise potentials for localization. |
The method achieves state-of-the-art performance on weakly-supervised semantic segmentation benchmarks like PASCAL VOC 2012 and MS COCO 2014.
A new benchmark for personalized referring image segmentation, Mug19, is introduced to evaluate the model's ability to locate user-specific items.
The proposed method outperforms strong baselines on Mug19, demonstrating its superior multi-modal comprehension ability. |
The method exhibits limitations when dealing with semantically similar objects (Cohyponym Entanglement).
The model's capability to handle affordance-related text queries is limited. |
open-vocabulary segmentation, diffusion models, attention mechanism, personalized referring image segmentation, multi-modal learning |
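The methodology in the entry above treats self-attention as an affinity matrix between image patches and propagates cross-attention scores along it. A minimal sketch of that idea follows, assuming a row-normalized self-attention matrix and one cross-attention score per patch for a single text token. The function names, the number of propagation steps, and the min-max thresholding are assumptions for illustration, not the paper's exact procedure.

```python
def propagate(cross_scores, self_attention, steps=1):
    """Refine per-patch cross-attention scores with a patch affinity matrix.

    cross_scores: list of N floats (relevance of each patch to one text token).
    self_attention: N x N list of lists, each row summing to 1.
    """
    scores = list(cross_scores)
    for _ in range(steps):
        scores = [sum(a * s for a, s in zip(row, scores)) for row in self_attention]
    return scores


def to_mask(scores, threshold=0.5):
    """Binarize scores after min-max normalization (threshold is an assumption)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0] * len(scores)
    return [1 if (s - lo) / (hi - lo) >= threshold else 0 for s in scores]


if __name__ == "__main__":
    # Patches 0 and 1 attend to each other; patch 2 attends only to itself.
    affinity = [
        [0.5, 0.5, 0.0],
        [0.5, 0.5, 0.0],
        [0.0, 0.0, 1.0],
    ]
    cross = [1.0, 0.0, 0.0]  # only patch 0 responds to the text token
    refined = propagate(cross, affinity)
    print(refined, to_mask(refined))
```

In this toy case the score of patch 0 spreads to patch 1, which belongs to the same region according to the affinity matrix, while patch 2 stays at zero. This mirrors the unary (cross-attention) and pairwise (self-attention) roles described in the entry.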
2309.03904
Report |
Exploring Sparse MoE in GANs for Text-conditioned Image Synthesis |
Jiapeng Zhu, Ceyuan Yang, Kecheng Zheng, Yinghao Xu, Zifan Shi, Yujun Shen |
Due to the difficulty in scaling up, generative adversarial networks (GANs)
seem to be falling from grace on the task of text-conditioned image synthesis.
Sparsely-activated mixture-of-experts (MoE) has recently been demonstrated as a
valid solution to training large-scale models with limited computational
resources. Inspired by such a philosophy, we present Aurora, a GAN-based
text-to-image generator that employs a collection of experts to learn feature
processing, together with a sparse router to help select the most suitable
expert for each feature point. To faithfully decode the sampling stochasticity
and the text condition to the final synthesis, our router adaptively makes its
decision by taking into account the text-integrated global latent code. At
64x64 image resolution, our model trained on LAION2B-en and COYO-700M achieves
6.2 zero-shot FID on MS COCO. We release the code and checkpoints to facilitate
the community for further development. |
Presents Aurora, a GAN-based text-to-image generator employing sparse mixture-of-experts (MoE) to enhance model capacity for efficient text-conditioned image synthesis. |
Addresses the limitations of GANs in scaling up for text-to-image generation, a task where diffusion models have become dominant. It offers a fast inference alternative to iterative diffusion models. |
Employs a sparse MoE approach with a router considering both input features and text-integrated latent code for expert selection. It uses progressive training with a reference FID indicator for stable and efficient learning. |
Achieves 6.2 zero-shot FID on MS COCO at 64x64 resolution.
Exhibits smooth semantic transitions during text prompt interpolation.
Reveals unexpected behavior in latent space interpolation, challenging the common belief of semantic continuity in GAN latent spaces. |
Latent space interpolation results suggest a need to better disentangle text condition and sampling stochasticity effects.
The current model is trained at 64x64 resolution; future work will focus on directly generating higher-resolution images. |
generative adversarial networks, text-to-image synthesis, sparse mixture-of-experts, progressive training, latent space interpolation |
2309.03903
Report |
Tracking Anything with Decoupled Video Segmentation |
Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, Joon-Young Lee |
Training data for video segmentation are expensive to annotate. This impedes
extensions of end-to-end algorithms to new video segmentation tasks, especially
in large-vocabulary settings. To 'track anything' without training on video
data for every individual task, we develop a decoupled video segmentation
approach (DEVA), composed of task-specific image-level segmentation and
class/task-agnostic bi-directional temporal propagation. Due to this design, we
only need an image-level model for the target task (which is cheaper to train)
and a universal temporal propagation model which is trained once and
generalizes across tasks. To effectively combine these two modules, we use
bi-directional propagation for (semi-)online fusion of segmentation hypotheses
from different frames to generate a coherent segmentation. We show that this
decoupled formulation compares favorably to end-to-end approaches in several
data-scarce tasks including large-vocabulary video panoptic segmentation,
open-world video segmentation, referring video segmentation, and unsupervised
video object segmentation. Code is available at:
https://hkchengrex.github.io/Tracking-Anything-with-DEVA |
This paper proposes DEVA, a decoupled video segmentation approach that leverages task-specific image-level segmentation and class-agnostic bi-directional temporal propagation to 'track anything' in videos. |
Training data for video segmentation is expensive, hindering the extension of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. DEVA addresses this by leveraging external, task-agnostic data, enabling better generalization to tasks with limited annotations. |
DEVA decouples video segmentation into two modules: a task-specific image segmentation model and a universal, class-agnostic temporal propagation model. It employs bi-directional propagation, including in-clip consensus and merging of image and propagated segmentations, to ensure temporal consistency and incorporate new objects. |
DEVA outperforms state-of-the-art end-to-end methods on large-scale video panoptic segmentation (VIPSeg) and open-world video segmentation (BURST).
It also achieves competitive results on referring video segmentation (Ref-DAVIS, Ref-YouTubeVOS) and unsupervised video object segmentation (DAVIS) without end-to-end training.
The approach shows significant improvements when target domain training data is scarce, particularly for rare object categories. |
The temporal propagation model relies on the image segmentation model to detect new objects, leading to potential delays in detection.
End-to-end approaches might still be preferable when sufficient training data is available, mainly in smaller vocabulary settings. |
video segmentation, temporal propagation, open-world learning, large-vocabulary segmentation, tracking-by-detection |
2309.03897
Report |
ProPainter: Improving Propagation and Transformer for Video Inpainting |
Shangchen Zhou, Chongyi Li, Kelvin C. K. Chan, Chen Change Loy |
Flow-based propagation and spatiotemporal Transformer are two mainstream
mechanisms in video inpainting (VI). Despite the effectiveness of these
components, they still suffer from some limitations that affect their
performance. Previous propagation-based approaches are performed separately
either in the image or feature domain. Global image propagation isolated from
learning may cause spatial misalignment due to inaccurate optical flow.
Moreover, memory or computational constraints limit the temporal range of
feature propagation and video Transformer, preventing exploration of
correspondence information from distant frames. To address these issues, we
propose an improved framework, called ProPainter, which involves enhanced
ProPagation and an efficient Transformer. Specifically, we introduce
dual-domain propagation that combines the advantages of image and feature
warping, exploiting global correspondences reliably. We also propose a
mask-guided sparse video Transformer, which achieves high efficiency by
discarding unnecessary and redundant tokens. With these components, ProPainter
outperforms prior art by a large margin of 1.46 dB in PSNR while maintaining
appealing efficiency. |
Proposes ProPainter, a novel video inpainting framework, featuring enhanced dual-domain propagation and a highly efficient mask-guided sparse video Transformer. |
Addresses limitations of existing flow-based propagation and spatiotemporal Transformer methods in video inpainting, aiming for higher quality and efficiency. |
Combines image and feature warping for reliable global propagation, employs a recurrent network for fast flow completion, and introduces a sparse Transformer that discards redundant tokens for efficiency. |
Achieves superior performance with a large margin of 1.46 dB in PSNR compared to state-of-the-art methods.
Demonstrates significant efficiency gains, reducing memory consumption and running time.
Shows particular effectiveness on datasets with dynamic scenes and larger motions. |
Performance improvement is less pronounced on datasets with predominantly static scenes.
Further exploration of sparse attention mechanisms for higher resolutions and longer video sequences is promising for future work. |
video inpainting, dual-domain propagation, sparse video transformer, flow completion, deep learning |
2309.03895
Report |
InstructDiffusion: A Generalist Modeling Interface for Vision Tasks |
Zigang Geng, Binxin Yang, Tiankai Hang, Chen Li, Shuyang Gu, Ting Zhang, Jianmin Bao, Zheng Zhang, Han Hu, Dong Chen, Baining Guo |
We present InstructDiffusion, a unifying and generic framework for aligning
computer vision tasks with human instructions. Unlike existing approaches that
integrate prior knowledge and pre-define the output space (e.g., categories and
coordinates) for each vision task, we cast diverse vision tasks into a
human-intuitive image-manipulating process whose output space is a flexible and
interactive pixel space. Concretely, the model is built upon the diffusion
process and is trained to predict pixels according to user instructions, such
as encircling the man's left shoulder in red or applying a blue mask to the
left car. InstructDiffusion could handle a variety of vision tasks, including
understanding tasks (such as segmentation and keypoint detection) and
generative tasks (such as editing and enhancement). It even exhibits the
ability to handle unseen tasks and outperforms prior methods on novel datasets.
This represents a significant step towards a generalist modeling interface for
vision tasks, advancing artificial general intelligence in the field of
computer vision. |
Introduces InstructDiffusion, a generalist modeling interface for vision tasks, unifying them as image generation through human-intuitive instructions. |
Addresses the challenge of unifying diverse vision tasks with different output formats, methodologies, and continuous input/output spaces. |
Leverages DDPM to handle various vision tasks as instructional image editing, trained on a dataset covering keypoint detection, segmentation, image enhancement, and editing. |
Achieves good performance in individual vision tasks, outperforming other generalist models in keypoint detection and referring segmentation.
Demonstrates enhanced generalization ability through joint training of multiple tasks.
Handles unseen tasks such as detection and classification and performs well on novel datasets, which the authors present as a step toward a generalist interface for vision tasks. |
Limited by the VAE model's information loss, impacting performance in tasks like image enhancement.
Future work includes exploring better unified representations and incorporating self-supervised/unsupervised learning for improved generalization. |
generalist model, vision tasks, instruction following, image generation, diffusion models |
2309.03893
Report |
DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection |
Manlin Zhang, Jie Wu, Yuxi Ren, Ming Li, Jie Qin, Xuefeng Xiao, Wei Liu, Rui Wang, Min Zheng, Andy J. Ma |
Data is the cornerstone of deep learning. This paper reveals that the
recently developed Diffusion Model is a scalable data engine for object
detection. Existing methods for scaling up detection-oriented data often
require manual collection or generative models to obtain target images,
followed by data augmentation and labeling to produce training pairs, which are
costly, complex, or lacking diversity. To address these issues, we
present DiffusionEngine (DE), a data scaling-up engine that provides
high-quality detection-oriented training pairs in a single stage. DE consists
of a pre-trained diffusion model and an effective Detection-Adapter,
contributing to generating scalable, diverse and generalizable detection data
in a plug-and-play manner. Detection-Adapter is learned to align the implicit
semantic and location knowledge in off-the-shelf diffusion models with
detection-aware signals to make better bounding-box predictions. Additionally,
we contribute two datasets, i.e., COCO-DE and VOC-DE, to scale up existing
detection benchmarks for facilitating follow-up research. Extensive experiments
demonstrate that data scaling-up via DE can achieve significant improvements in
diverse scenarios, such as various detection algorithms, self-supervised
pre-training, data-sparse, label-scarce, cross-domain, and semi-supervised
learning. For example, when using DE with a DINO-based adapter to scale up
data, mAP is improved by 3.1% on COCO, 7.6% on VOC, and 11.5% on Clipart. |
This paper introduces DiffusionEngine (DE), a one-stage data engine for object detection that leverages pre-trained diffusion models to generate high-quality, scalable, diverse, and generalizable detection training data. |
Large-scale, high-quality training data is crucial for object detection, but traditional data collection and existing augmentation methods are costly, complex, or lack diversity. DE addresses these limitations by efficiently generating diverse and scalable training pairs. |
DE consists of a frozen pre-trained diffusion model and a trainable Detection-Adapter. The adapter learns to align the implicit semantic and location knowledge within diffusion models with explicit detection signals. During training, DE simulates the last diffusion step on real images to learn from existing detection datasets. During inference, DE generates new images from text prompts and uses the adapter to directly predict bounding boxes. |
DE consistently improves the performance of various object detection algorithms, backbones, and pre-training strategies (including self-supervised) on COCO.
DE outperforms state-of-the-art data scaling techniques like Copy-Paste and DALL-E for Detection on VOC and exhibits greater data scalability.
DE effectively generalizes to out-of-domain scenarios, demonstrating significant improvements in cross-domain object detection (VOC to Clipart) and semi-supervised learning. |
Future work could explore creating an all-in-one model for various detection tasks using task-specific adapters.
Integrating ChatGPT for prompt generation and leveraging RLHF for improved alignment and quality of detection pairs are promising directions. |
object detection, data augmentation, diffusion models, data scaling, synthetic data |
2309.03809
Report |
SimNP: Learning Self-Similarity Priors Between Neural Points |
Christopher Wewer, Eddy Ilg, Bernt Schiele, Jan Eric Lenssen |
Existing neural field representations for 3D object reconstruction either (1)
utilize object-level representations, but suffer from low-quality details due
to conditioning on a global latent code, or (2) are able to perfectly
reconstruct the observations, but fail to utilize object-level prior knowledge
to infer unobserved regions. We present SimNP, a method to learn category-level
self-similarities, which combines the advantages of both worlds by connecting
neural point radiance fields with a category-level self-similarity
representation. Our contribution is two-fold. (1) We design the first neural
point representation on a category level by utilizing the concept of coherent
point clouds. The resulting neural point radiance fields store a high level of
detail for locally supported object regions. (2) We learn how information is
shared between neural points in an unconstrained and unsupervised fashion,
which allows unobserved regions of an object to be derived during the
reconstruction process from given observations. We show that SimNP is able to
outperform previous methods in reconstructing symmetric unseen object regions,
surpassing methods that build upon category-level or pixel-aligned radiance
fields, while providing semantic correspondences between instances. |
Introduces SimNP, a novel neural point radiance field that learns category-level self-similarities for 3D object reconstruction. |
Existing methods either lack detail by relying on global representations or fail to generalize to unseen regions due to overfitting observations. SimNP addresses this by combining local detail with a learned prior of object self-similarities. |
SimNP connects a coherent neural point radiance field to learned embeddings via bipartite attention, encoding self-similarity. During training, this attention learns to connect similar points, allowing information transfer during inference when one side is unseen. |
Outperforms state-of-the-art in reconstructing unseen symmetric object parts from single and two views.
Learns and leverages object symmetries for improved reconstruction.
Provides a disentangled representation space enabling meaningful interpolation. |
Assumes a canonical space with ground-truth point clouds during training, limiting applicability to in-the-wild data.
Point cloud prediction, while effective, could be further improved. |
neural radiance fields, 3d reconstruction, self-similarity, neural points, single-view reconstruction |
2309.03729
Report |
Phasic Content Fusing Diffusion Model with Directional Distribution Consistency for Few-Shot Model Adaption |
Teng Hu, Jiangning Zhang, Liang Liu, Ran Yi, Siqi Kou, Haokun Zhu, Xu Chen, Yabiao Wang, Chengjie Wang, Lizhuang Ma |
Training a generative model with a limited number of samples is a challenging
task. Current methods primarily rely on few-shot model adaption to train the
network. However, in scenarios where data is extremely limited (fewer than 10 samples),
the generative network tends to overfit and suffers from content degradation.
To address these problems, we propose a novel phasic content fusing few-shot
diffusion model with directional distribution consistency loss, which targets
different learning objectives at distinct training stages of the diffusion
model. Specifically, we design a phasic training strategy with phasic content
fusion to help our model learn content and style information when t is large,
and learn local details of target domain when t is small, leading to an
improvement in the capture of content, style and local details. Furthermore, we
introduce a novel directional distribution consistency loss that ensures the
consistency between the generated and source distributions more efficiently and
stably than the prior methods, preventing our model from overfitting. Finally,
we propose a cross-domain structure guidance strategy that enhances structure
consistency during domain adaptation. Theoretical analysis, qualitative and
quantitative experiments demonstrate the superiority of our approach in
few-shot generative model adaption tasks compared to state-of-the-art methods.
The source code is available at:
https://github.com/sjtuplayer/few-shot-diffusion. |
The paper proposes a novel few-shot diffusion model incorporating a phasic content fusing module and a directional distribution consistency loss. |
Training generative models on limited data often results in overfitting and content degradation. Existing few-shot methods suffer from these issues, especially with extremely limited data. |
The method introduces a phasic training strategy with content fusion to enhance content and style capture at different denoising stages. It also proposes a directional distribution consistency loss to ensure consistent structure and prevent distribution rotation during training. Lastly, a cross-domain structure guidance strategy is used to improve structure preservation during inference. |
The proposed model outperforms state-of-the-art few-shot generative models in content preservation and domain adaptation.
The directional distribution consistency loss effectively maintains the structure of the generated distribution and avoids rotation during training.
The iterative cross-domain structure guidance strategy enhances structure consistency in domain translation. |
The model requires careful tuning of hyperparameters, including the phasic factor and style enhancement factor.
The study primarily focuses on image generation, and future work could explore its application to other data modalities. |
few-shot learning, diffusion models, generative models, domain adaptation, image generation |
2309.03599
Report |
Chasing Consistency in Text-to-3D Generation from a Single Image |
Yichen Ouyang, Wenhao Chai, Jiayi Ye, Dapeng Tao, Yibing Zhan, Gaoang Wang |
Text-to-3D generation from a single-view image is a popular but challenging
task in 3D vision. Although numerous methods have been proposed, existing works
still suffer from inconsistency issues, including 1) semantic
inconsistency, 2) geometric inconsistency, and 3) saturation inconsistency,
resulting in distorted, overfitted, and over-saturated generations. In light of
the above issues, we present Consist3D, a three-stage framework Chasing for
semantic-, geometric-, and saturation-Consistent Text-to-3D generation from a
single image, in which the first two stages aim to learn parameterized
consistency tokens, and the last stage is for optimization. Specifically, the
semantic encoding stage learns a token independent of views and estimations,
promoting semantic consistency and robustness. Meanwhile, the geometric
encoding stage learns another token with comprehensive geometry and
reconstruction constraints under novel-view estimations, reducing overfitting
and encouraging geometric consistency. Finally, the optimization stage benefits
from the semantic and geometric tokens, allowing a low classifier-free guidance
scale and therefore preventing oversaturation. Experimental results demonstrate
that Consist3D produces more consistent, faithful, and photo-realistic 3D
assets compared to previous state-of-the-art methods. Furthermore, Consist3D
also allows background and object editing through text prompts. |
Presents Consist3D, a novel three-stage framework for consistent text-to-3D generation from a single image, addressing semantic, geometric, and saturation inconsistencies. |
Existing text-to-3D methods struggle with inconsistencies, resulting in distorted, overfitted, and oversaturated 3D generations. |
Consist3D uses a semantic encoding stage to learn a view-independent token, a geometric encoding stage to learn a token with geometric and reconstruction constraints, and a low-scale score distillation sampling stage for optimization. |
Generates more consistent and photo-realistic 3D assets compared to previous methods.
Enables background and object editing through text prompts.
Achieves high-quality 3D generation with lower classifier-free guidance scales, resulting in more natural saturation. |
Struggles when point cloud estimation is inaccurate.
Complex background prompts may result in low-detail background generation. |
text-to-3d generation, 3d vision, score distillation sampling, consistency, single image 3d reconstruction |
2309.03550
Report |
Text2Control3D: Controllable 3D Avatar Generation in Neural Radiance Fields using Geometry-Guided Text-to-Image Diffusion Model |
Sungwon Hwang, Junha Hyung, Jaegul Choo |
Recent advances in diffusion models such as ControlNet have enabled
geometrically controllable, high-fidelity text-to-image generation. However,
none of them addresses the question of adding such controllability to
text-to-3D generation. In response, we propose Text2Control3D, a controllable
text-to-3D avatar generation method whose facial expression is controllable
given a monocular video casually captured with a hand-held camera. Our main
strategy is to construct the 3D avatar in Neural Radiance Fields (NeRF)
optimized with a set of controlled viewpoint-aware images that we generate from
ControlNet, whose condition input is the depth map extracted from the input
video. When generating the viewpoint-aware images, we utilize cross-reference
attention to inject well-controlled, referential facial expression and
appearance via cross attention. We also apply low-pass filtering to the Gaussian
latent of the diffusion model in order to ameliorate the viewpoint-agnostic
texture problem we observed from our empirical analysis, where the
viewpoint-aware images contain identical textures on identical pixel positions
that are incomprehensible in 3D. Finally, to train NeRF with the images that
are viewpoint-aware yet are not strictly consistent in geometry, our approach
considers per-image geometric variation as a view of deformation from a shared
3D canonical space. Consequently, we construct the 3D avatar in a canonical
space of deformable NeRF by learning a set of per-image deformations via a
deformation field table. We demonstrate the empirical results and discuss the
effectiveness of our method. |
Introduces Text2Control3D, the first controllable text-to-3D avatar generation method that leverages a monocular video for controlling facial expressions and shapes. |
No existing work addresses adding geometric controllability to text-to-3D generation, despite its importance for creating controllable and expressive avatars. |
The method uses a depth-conditional ControlNet to generate viewpoint-aware images with controlled expressions. It introduces cross-reference attention for consistent appearance and expression across viewpoints and employs low-pass filtering of the Gaussian latent to address texture-sticking issues. Finally, it reconstructs the 3D avatar in a deformable NeRF canonical space to handle geometric inconsistencies. |
Generates high-fidelity 3D avatars that reflect text descriptions and source video expressions.
Outperforms baselines like DreamFusion and Instruct-NeRF2NeRF in user studies and quantitative metrics.
Demonstrates the effectiveness of cross-reference attention and low-pass filtering in improving controllability and visual quality. |
Controllability is limited by the capabilities of the key-point conditional ControlNet, particularly for less common expressions.
Future work can explore improving ControlNet's controllability and expanding the method to handle a wider range of expressions and geometric controls. |
text-to-3d generation, controllable avatar generation, neural radiance fields, diffusion models, controlnet |
2309.03549
Report |
Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation |
Jiaxi Gu, Shicong Wang, Haoyu Zhao, Tianyi Lu, Xing Zhang, Zuxuan Wu, Songcen Xu, Wei Zhang, Yu-Gang Jiang, Hang Xu |
Inspired by the remarkable success of Latent Diffusion Models (LDMs) for
image synthesis, we study LDM for text-to-video generation, which is a
formidable challenge due to the computational and memory constraints during
both model training and inference. A single LDM is usually only capable of
generating a very limited number of video frames. Some existing works focus on
separate prediction models for generating more video frames, but these suffer from
additional training cost and frame-level jittering. In this paper, we
propose a framework called "Reuse and Diffuse", dubbed VidRD, to
produce more frames following the frames already generated by an LDM.
Conditioned on an initial video clip with a small number of frames, additional
frames are iteratively generated by reusing the original latent features and
following the previous diffusion process. Besides, for the autoencoder used for
translation between pixel space and latent space, we inject temporal layers
into its decoder and fine-tune these layers for higher temporal consistency. We
also propose a set of strategies for composing video-text data that involve
diverse content from multiple existing datasets including video datasets for
action recognition and image-text datasets. Extensive experiments show that our
method achieves good results in both quantitative and qualitative evaluations.
Our project page is available at
https://anonymous0x233.github.io/ReuseAndDiffuse/. |
This paper introduces VidRD, a text-to-video generation framework that uses Latent Diffusion Models (LDMs) to iteratively generate smooth and coherent videos from text prompts. |
Current text-to-video generation methods struggle to produce long, high-quality videos with consistent content. VidRD addresses this by enabling the generation of longer, smoother videos from text. |
VidRD leverages a pre-trained image LDM and adapts it for video by incorporating temporal layers. It uses a 'reuse and diffuse' strategy to generate videos clip-by-clip, reusing latent features and imitating the diffusion process from previous clips. It also employs novel techniques like Frame-level Noise Reversion (FNR), Past-dependent Noise Sampling (PNS), and Denoising with Staged Guidance (DSG) to ensure temporal consistency. |
VidRD achieves state-of-the-art results on the UCF-101 benchmark for text-to-video generation, outperforming existing methods in terms of Frechet Video Distance (FVD) and Inception Score (IS).
The paper demonstrates the effectiveness of using pseudo-videos created from image-text datasets to enhance temporal consistency in generated videos.
Ablation studies validate the importance of the proposed FNR, PNS, and DSG techniques for generating smooth and coherent videos. |
The paper acknowledges that existing metrics for evaluating video generation models may not fully capture perceptual quality and can be inconsistent with human perception.
Future work could explore techniques to further improve the diversity of generated video content and address potential issues like content cycling. |
text-to-video generation, latent diffusion models, video synthesis, temporal consistency, iterative generation |
2309.03350
Report |
Relay Diffusion: Unifying diffusion process across resolutions for image synthesis |
Jiayan Teng, Wendi Zheng, Ming Ding, Wenyi Hong, Jianqiao Wangni, Zhuoyi Yang, Jie Tang |
Diffusion models have achieved great success in image synthesis, but still face
challenges in high-resolution generation. Through the lens of discrete cosine
transformation, we find the main reason is that the same noise level on a
higher resolution results in a higher Signal-to-Noise Ratio in the frequency
domain. In this work, we present Relay Diffusion Model (RDM), which transfers
a low-resolution image or noise into an equivalent high-resolution one for
diffusion model via blurring diffusion and block noise. Therefore, the
diffusion process can continue seamlessly in any new resolution or model
without restarting from pure noise or low-resolution conditioning. RDM achieves
state-of-the-art FID on CelebA-HQ and sFID on ImageNet 256×256,
surpassing previous works such as ADM, LDM and DiT by a large margin. All
code and checkpoints are open-sourced at
https://github.com/THUDM/RelayDiffusion. |
This paper introduces Relay Diffusion Model (RDM), a novel cascaded diffusion framework that enhances high-resolution image generation by transferring low-resolution images or noise into equivalent high-resolution representations. |
High-resolution image generation with diffusion models is challenging due to limitations in training efficiency and noise schedule design. Existing methods, like cascaded models, while effective, still suffer from drawbacks like time-consuming training and distribution mismatch issues. |
RDM leverages block noise and patch-level blurring diffusion to connect different stages of image generation. It starts diffusion from the previous stage's output instead of pure noise, mitigating the need for low-resolution conditioning and reducing training steps. |
RDM achieves state-of-the-art FID on CelebA-HQ 256x256, outperforming existing methods like StyleSwin with significantly fewer training iterations.
On ImageNet 256x256, RDM achieves state-of-the-art sFID and competitive FID results compared to advanced techniques like MDT-XL/2, even with less training data.
Ablation studies confirm the efficacy of block noise, stochastic sampling, and reduced sampling steps in enhancing RDM's performance. |
While RDM shows promising results, exploring better noise schedules tailored to model size and data distribution remains a future direction.
Investigating the impact of longer training and more granular classifier-free guidance strategies on RDM's FID scores, particularly for ImageNet, is another area for improvement. |
diffusion models, image generation, high-resolution, cascaded diffusion, block noise |
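The block noise mentioned in the Relay Diffusion entry above can be illustrated with a minimal sketch. It assumes that block noise means Gaussian noise sampled on a coarse grid and repeated over each block of pixels, so that a low-resolution noise pattern has a high-resolution equivalent. The function name `block_noise` and its parameters are hypothetical, and the paper's exact scaling and any mixing with per-pixel noise are not reproduced here.

```python
import numpy as np


def block_noise(height, width, block, rng):
    """Sample Gaussian noise that is constant inside each block x block patch.

    Hypothetical helper: noise is drawn on a coarse grid and upsampled by
    nearest-neighbour repetition, so neighbouring pixels in a block are
    perfectly correlated.
    """
    if height % block or width % block:
        raise ValueError("height and width must be multiples of block")
    coarse = rng.standard_normal((height // block, width // block))
    # The Kronecker product with a block of ones repeats every coarse value.
    return np.kron(coarse, np.ones((block, block)))


rng = np.random.default_rng(0)
noise = block_noise(8, 8, 4, rng)
```

Each 4x4 patch of `noise` holds a single value, which is the property that lets noise defined at one resolution be carried over to a higher one.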
2309.03185
Report |
Bayes' Rays: Uncertainty Quantification for Neural Radiance Fields |
Lily Goli, Cody Reading, Silvia Sellán, Alec Jacobson, Andrea Tagliasacchi |
Neural Radiance Fields (NeRFs) have shown promise in applications like view
synthesis and depth estimation, but learning from multiview images faces
inherent uncertainties. Current methods to quantify them are either heuristic
or computationally demanding. We introduce BayesRays, a post-hoc framework to
evaluate uncertainty in any pre-trained NeRF without modifying the training
process. Our method establishes a volumetric uncertainty field using spatial
perturbations and a Bayesian Laplace approximation. We derive our algorithm
statistically and show its superior performance in key metrics and
applications. Additional results are available at: https://bayesrays.github.io. |
Introduces BayesRays, a post-hoc algorithm to estimate the spatial uncertainty of any pre-trained NeRF without modifying the training process. |
Quantifying uncertainty in NeRF is crucial for tasks like outlier detection and next-best-view planning, especially in critical applications like autonomous driving. |
Simulates spatially parametrized perturbations of the radiance field and uses a Bayesian Laplace approximation to produce a volumetric uncertainty field. |
Calculated uncertainties are statistically meaningful and outperform previous works on key metrics like correlation to reconstructed depth error.
Provides a framework for applications like removing 'floater' artifacts from NeRF, matching or improving the state-of-the-art.
Uncertainty field can be rendered as an additional color channel, enabling interactive artifact removal by thresholding. |
Discretization of the deformation field using a uniform grid can lead to high memory cost in regions of little geometric interest. Future work may explore more complex data structures.
Only quantifies epistemic uncertainty and does not capture aleatoric uncertainty caused by noise or inconsistencies between views. Combining with existing frameworks for aleatoric quantification is a potential future direction. |
neural radiance fields, uncertainty quantification, laplace approximation, artifact removal, depth estimation |
2309.03179
Report |
SLiMe: Segment Like Me |
Aliasghar Khani, Saeid Asgari Taghanaki, Aditya Sanghi, Ali Mahdavi Amiri, Ghassan Hamarneh |
Significant strides have been made using large vision-language models, like
Stable Diffusion (SD), for a variety of downstream tasks, including image
editing, image correspondence, and 3D shape generation. Inspired by these
advancements, we explore leveraging these extensive vision-language models for
segmenting images at any desired granularity using as few as one annotated
sample by proposing SLiMe. SLiMe frames this problem as an optimization task.
Specifically, given a single training image and its segmentation mask, we first
extract attention maps, including our novel "weighted accumulated
self-attention map" from the SD prior. Then, using the extracted attention
maps, the text embeddings of Stable Diffusion are optimized such that each of
them learns about a single segmented region from the training image. These
learned embeddings then highlight the segmented region in the attention maps,
which in turn can then be used to derive the segmentation map. This enables
SLiMe to segment any real-world image during inference with the granularity of
the segmented region in the training image, using just one example. Moreover,
leveraging additional training data when available, i.e. few-shot, improves the
performance of SLiMe. We carried out a knowledge-rich set of experiments
examining various design factors and showed that SLiMe outperforms other
existing one-shot and few-shot segmentation methods. |
This paper proposes SLiMe, a one-shot image segmentation method that leverages the semantic knowledge of Stable Diffusion (SD) to segment objects and parts at user-defined granularity levels. |
Current image segmentation methods often require extensive annotated data or training class-specific generative models. SLiMe addresses this by utilizing a single annotated image and the pre-trained SD model to perform accurate segmentation across various object categories and granularity levels. |
SLiMe frames the segmentation problem as a one-shot optimization task. It first extracts cross-attention and a novel weighted accumulated self-attention (WAS) map from SD. Then, it fine-tunes SD's text embeddings to highlight segmented regions within these attention maps, guided by a single reference image and its segmentation mask. During inference, the optimized text embeddings are used to segment unseen images, preserving the granularity of the reference segmentation. |
SLiMe outperforms existing one- and few-shot segmentation methods, including ReGAN and SegDDPM, on PASCAL-Part and CelebAMask-HQ datasets.
The method demonstrates strong generalization capabilities, segmenting unseen object categories and handling occlusions effectively.
Ablation studies confirm the importance of each component in SLiMe, including the WAS-attention map, loss functions, and parameter choices. |
SLiMe may struggle with segmenting tiny objects due to the lower resolution of the extracted attention maps.
Future work includes addressing this limitation and extending SLiMe's applicability to 3D and video segmentation. |
image segmentation, one-shot learning, stable diffusion, attention mechanisms, few-shot learning |
2309.03160
Report |
ResFields: Residual Neural Fields for Spatiotemporal Signals |
Marko Mihajlovic, Sergey Prokudin, Marc Pollefeys, Siyu Tang |
Neural fields, a category of neural networks trained to represent
high-frequency signals, have gained significant attention in recent years due
to their impressive performance in modeling complex 3D data, such as signed
distance (SDFs) or radiance fields (NeRFs), via a single multi-layer perceptron
(MLP). However, despite the power and simplicity of representing signals with
an MLP, these methods still face challenges when modeling large and complex
temporal signals due to the limited capacity of MLPs. In this paper, we propose
an effective approach to address this limitation by incorporating temporal
residual layers into neural fields, dubbed ResFields. It is a novel class of
networks specifically designed to effectively represent complex temporal
signals. We conduct a comprehensive analysis of the properties of ResFields and
propose a matrix factorization technique to reduce the number of trainable
parameters and enhance generalization capabilities. Importantly, our
formulation seamlessly integrates with existing MLP-based neural fields and
consistently improves results across various challenging tasks: 2D video
approximation, dynamic shape modeling via temporal SDFs, and dynamic NeRF
reconstruction. Lastly, we demonstrate the practical utility of ResFields by
showcasing its effectiveness in capturing dynamic 3D scenes from sparse RGBD
cameras of a lightweight capture system. |
Presents ResFields, a new method for increasing the capacity of neural fields when modeling complex temporal signals without increasing the underlying MLP size, thus maintaining efficient training and inference. |
Addresses the limitations of neural fields in representing long, complex temporal signals due to the limited capacity of MLPs, which can hinder applications in computer graphics, vision, and robotics. |
Introduces temporal residual layers that add time-dependent residuals to the MLP weights. These residuals are factorized using a low-rank representation to reduce parameters and enhance generalization. |
ResFields consistently improve performance across various tasks: 2D video approximation, temporal 3D shape modeling, dynamic radiance field reconstruction, and scene flow learning.
Smaller MLPs with ResFields often outperform larger MLPs without them, leading to faster training and lower memory requirements.
A low-rank factorization of residual weights improves generalization compared to alternatives like no factorization or hypernetworks. |
ResFields are less beneficial for ill-posed monocular reconstruction tasks where constraints, rather than capacity, are the bottleneck.
Modeling very long or evolving signals may require chunking the sequence due to limitations in the shared weight matrix capacity. |
neural fields, temporal signals, residual networks, low-rank factorization, dynamic scene reconstruction |
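The temporal residual layer described in the ResFields entry above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the time-dependent weight is taken to be a shared matrix plus a low-rank residual, `W(t) = W + sum_r coeffs[t, r] * basis[r]`. The class name `ResFieldLayer`, its attributes, and the zero initialization of the coefficients are all hypothetical choices.

```python
import numpy as np


class ResFieldLayer:
    """Linear layer whose weights receive a time-dependent low-rank residual."""

    def __init__(self, in_dim, out_dim, num_frames, rank, rng):
        # Shared weights and bias, as in an ordinary MLP layer.
        self.W = rng.standard_normal((out_dim, in_dim)) / np.sqrt(in_dim)
        self.b = np.zeros(out_dim)
        # Per-frame coefficients (assumed zero-initialized, so the layer
        # starts out identical to the plain layer).
        self.coeffs = np.zeros((num_frames, rank))
        # Basis matrices shared across all frames.
        self.basis = rng.standard_normal((rank, out_dim, in_dim)) * 0.01

    def weight_at(self, t):
        # Contract the rank axis: (rank,) x (rank, out, in) -> (out, in).
        residual = np.tensordot(self.coeffs[t], self.basis, axes=1)
        return self.W + residual

    def __call__(self, x, t):
        return x @ self.weight_at(t).T + self.b


rng = np.random.default_rng(0)
layer = ResFieldLayer(in_dim=3, out_dim=4, num_frames=10, rank=2, rng=rng)
x = np.ones((5, 3))
y = layer(x, t=0)
```

With this factorization the residuals cost `num_frames * rank + rank * out_dim * in_dim` parameters instead of `num_frames * out_dim * in_dim` for unfactorized per-frame residuals, which is the parameter reduction the entry refers to.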
2309.03110
Report |
Do We Still Need Non-Maximum Suppression? Accurate Confidence Estimates and Implicit Duplication Modeling with IoU-Aware Calibration |
Johannes Gilg, Torben Teepe, Fabian Herzog, Philipp Wolters, Gerhard Rigoll |
Object detectors are at the heart of many semi- and fully autonomous decision
systems and are poised to become even more indispensable. They are, however,
still lacking in accessibility and can sometimes produce unreliable
predictions. Especially concerning in this regard are the -- essentially
hand-crafted -- non-maximum suppression algorithms that lead to an obfuscated
prediction process and biased confidence estimates. We show that we can
eliminate classic NMS-style post-processing by using IoU-aware calibration.
IoU-aware calibration is a conditional Beta calibration; this makes it
parallelizable with no hyper-parameters. Instead of arbitrary cutoffs or
discounts, it implicitly accounts for the likelihood of each detection being a
duplicate and adjusts the confidence score accordingly, resulting in
empirically based precision estimates for each detection. Our extensive
experiments on diverse detection architectures show that the proposed IoU-aware
calibration can successfully model duplicate detections and improve
calibration. Compared to the standard sequential NMS and calibration approach,
our joint modeling can deliver performance gains over the best NMS-based
alternative while producing consistently better-calibrated confidence
predictions with less complexity. The code for all our experiments is publicly
available at https://github.com/Blueblue4/IoU-AwareCalibration. |
This paper proposes IoU-aware calibration, a method to replace Non-Maximum Suppression (NMS) in object detection by directly modeling the probability of duplicate detections within the confidence score. |
NMS, a crucial step in object detection pipelines, relies on hand-crafted algorithms and often leads to biased confidence estimates and obfuscated prediction processes. IoU-aware calibration aims to address these issues by providing a data-driven approach for accurate and reliable confidence estimates. |
The proposed method utilizes a conditional Beta calibration, conditioned on the minimum Jaccard distance to other detections, to implicitly model the likelihood of a detection being a duplicate. This approach eliminates the need for iterative NMS calculations and enables parallelized computation. |
IoU-aware calibration successfully models duplicate detections and improves calibration across diverse detection architectures.
It achieves performance gains comparable to or exceeding those obtained by fine-tuned NMS, demonstrating its ability to implicitly capture duplicate likelihood.
The method leads to significantly better-calibrated confidence predictions than NMS-based approaches, enabling more reliable probability estimates for object detection. |
Performance may degrade under significant distribution shifts between calibration and deployment data.
Highly crowded scenes, unseen during calibration, may lead to under-confident predictions. |
object detection, non-maximum suppression, confidence calibration, deep learning, computer vision |
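The conditioning feature named in the IoU-aware calibration entry above, the minimum Jaccard distance to other detections, can be sketched as below. The function names are hypothetical, boxes are assumed to be `[x1, y1, x2, y2]` with positive area, and "other detections" is taken to mean every other box in the image. The summary does not say whether the paper restricts the comparison further, and fitting the Beta calibration itself is not shown.

```python
import numpy as np


def pairwise_iou(boxes):
    """IoU between all pairs of boxes given as an (N, 4) array [x1, y1, x2, y2]."""
    x1 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = area[:, None] + area[None, :] - inter
    return inter / union


def min_jaccard_distance(boxes):
    """For each box, 1 - (largest IoU with any other box)."""
    iou = pairwise_iou(boxes)
    np.fill_diagonal(iou, 0.0)  # a box is not compared with itself
    return 1.0 - iou.max(axis=1)


boxes = np.array([
    [0.0, 0.0, 2.0, 2.0],
    [1.0, 0.0, 3.0, 2.0],      # overlaps the first box
    [10.0, 10.0, 12.0, 12.0],  # isolated box
])
distances = min_jaccard_distance(boxes)
```

The feature is computed for every detection independently from one IoU matrix, which is consistent with the entry's point that the approach avoids the sequential loop of greedy NMS.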
2309.02999
Report |
Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning |
Sijin Chen, Hongyuan Zhu, Mingsheng Li, Xin Chen, Peng Guo, Yinjie Lei, Gang Yu, Taihao Li, Tao Chen |
3D dense captioning requires a model to translate its understanding of an
input 3D scene into several captions associated with different object regions.
Existing methods adopt a sophisticated "detect-then-describe" pipeline, which
builds explicit relation modules upon a 3D detector with numerous hand-crafted
components. While these methods have achieved initial success, the cascade
pipeline tends to accumulate errors because of duplicated and inaccurate box
estimations and messy 3D scenes. In this paper, we first propose Vote2Cap-DETR,
a simple-yet-effective transformer framework that decouples the decoding
process of caption generation and object localization through parallel
decoding. Moreover, we argue that object localization and description
generation require different levels of scene understanding, which could be
challenging for a shared set of queries to capture. To this end, we propose an
advanced version, Vote2Cap-DETR++, which decouples the queries into
localization and caption queries to capture task-specific features.
Additionally, we introduce the iterative spatial refinement strategy to vote
queries for faster convergence and better localization performance. We also
insert additional spatial information to the caption head for more accurate
descriptions. Without bells and whistles, extensive experiments on two commonly
used datasets, ScanRefer and Nr3D, demonstrate Vote2Cap-DETR and
Vote2Cap-DETR++ surpass conventional "detect-then-describe" methods by a large
margin. Codes will be made available at
https://github.com/ch3cook-fdu/Vote2Cap-DETR. |
This paper proposes Vote2Cap-DETR and Vote2Cap-DETR++, two novel transformer-based frameworks for 3D dense captioning that decouple caption generation and object localization, unlike conventional "detect-then-describe" pipelines. |
Existing "detect-then-describe" methods suffer from error accumulation due to serial processing and rely heavily on hand-crafted components, limiting their performance in complex 3D scenes. |
The models utilize a transformer encoder-decoder architecture with vote queries for object localization and a dual-clued captioner for description generation. Vote2Cap-DETR++ further decouples queries for task-specific feature extraction, introduces iterative spatial refinement for queries, and injects 3D spatial information into the caption head. |
Vote2Cap-DETR and Vote2Cap-DETR++ significantly outperform previous state-of-the-art methods on ScanRefer and Nr3D datasets.
Vote queries with iterative spatial refinement improve object localization accuracy and convergence speed.
Injecting 3D spatial information into the caption head enhances the quality and informativeness of generated descriptions. |
Limited caption diversity due to small text annotations and beam search.
Future work includes exploring multimodal pre-training and leveraging LLMs for improved caption diversity. |
3d dense captioning, vision-language, transformers, vote queries, spatial refinement |
2309.02773
Report |
Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter |
Jinglong Wang, Xiawei Li, Jing Zhang, Qingyuan Xu, Qin Zhou, Qian Yu, Lu Sheng, Dong Xu |
Pre-trained text-image discriminative models, such as CLIP, have been
explored for open-vocabulary semantic segmentation with unsatisfactory results
due to the loss of crucial localization information and awareness of object
shapes. Recently, there has been a growing interest in expanding the
application of generative models from generation tasks to semantic
segmentation. These approaches utilize generative models either for generating
annotated data or extracting features to facilitate semantic segmentation. This
typically involves generating a considerable amount of synthetic data or
requiring additional mask annotations. To this end, we uncover the potential of
generative text-to-image diffusion models (e.g., Stable Diffusion) as highly
efficient open-vocabulary semantic segmenters, and introduce a novel
training-free approach named DiffSegmenter. The insight is that to generate
realistic objects that are semantically faithful to the input text, both the
complete object shapes and the corresponding semantics are implicitly learned
by diffusion models. We discover that the object shapes are characterized by
the self-attention maps while the semantics are indicated through the
cross-attention maps produced by the denoising U-Net, forming the basis of our
segmentation results. Additionally, we carefully design effective textual
prompts and a category filtering mechanism to further enhance the segmentation
results. Extensive experiments on three benchmark datasets show that the
proposed DiffSegmenter achieves impressive results for open-vocabulary semantic
segmentation. |
This paper introduces DiffSegmenter, a novel training-free method for open-vocabulary semantic segmentation leveraging off-the-shelf text-to-image diffusion models. |
Existing open-vocabulary segmentation methods relying on discriminative models often lose crucial localization information. This work explores the potential of generative diffusion models for this task. |
DiffSegmenter leverages the cross-attention maps from the denoising U-Net of a diffusion model as initial segmentation scores, further refined by the self-attention maps. Textual prompts are designed to enhance semantic understanding. |
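The refinement step described above can be sketched as a matrix product: each spatial token's class score is replaced by the self-attention-weighted average of the scores of all tokens. This is a minimal illustration only. The function names, the array shapes, and the assumption that the self-attention map is row-normalized are hypothetical and not taken from the paper, which may aggregate attention maps across layers and timesteps differently.

```python
import numpy as np

def refine_scores(cross_attn, self_attn):
    """Refine per-class scores with token-to-token affinities.

    cross_attn: (C, N) initial score of each of C classes at N spatial tokens.
    self_attn:  (N, N) affinity matrix, row i holds the weights token i
                assigns to every token (assumed to sum to 1 per row).
    Returns (C, N) where refined[c, i] = sum_j self_attn[i, j] * cross_attn[c, j].
    """
    return cross_attn @ self_attn.T

def segment(cross_attn, self_attn):
    """Assign each token the class with the highest refined score."""
    return refine_scores(cross_attn, self_attn).argmax(axis=0)
```

In a toy case where two tokens attend strongly to each other, a token with a weak initial score inherits the label of its partner, which is the shape-completion effect the summary attributes to self-attention.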
DiffSegmenter achieves impressive results for open-vocabulary semantic segmentation in both zero-shot and weakly-supervised settings.
It outperforms most existing methods on PASCAL VOC 2012, Pascal Context, and COCO-Object datasets.
The method shows potential for downstream tasks like controllable image editing. |
The use of latent features in Stable Diffusion might lead to the disappearance of small objects.
Future work can explore other text-to-image diffusion models or larger latent feature map sizes to address this. |
semantic segmentation, open-vocabulary, diffusion models, generative models, attention mechanisms |
2309.02401
Report |
Prototype-based Dataset Comparison |
Nanne van Noord |
Dataset summarisation is a fruitful approach to dataset inspection. However,
when applied to a single dataset the discovery of visual concepts is restricted
to those most prominent. We argue that a comparative approach can expand upon
this paradigm to enable richer forms of dataset inspection that go beyond the
most prominent concepts. To enable dataset comparison we present a module that
learns concept-level prototypes across datasets. We leverage self-supervised
learning to discover these prototypes without supervision, and we demonstrate
the benefits of our approach in two case-studies. Our findings show that
dataset comparison extends dataset inspection and we hope to encourage more
works in this direction. Code and usage instructions available at
https://github.com/Nanne/ProtoSim |
This paper introduces "dataset comparison" as a novel approach for inspecting datasets and proposes a method for learning concept-level prototypes across datasets called "ProtoSim." |
Dataset inspection is crucial for understanding dataset content, identifying potential biases, and ensuring alignment with usage goals, especially with the increasing size of image datasets used in computer vision. |
ProtoSim, a module integrated into a Vision Transformer (ViT), leverages self-supervised learning, specifically the DINO loss, to discover visual concepts in datasets without relying on class labels. |
ProtoSim successfully identifies both dataset-specific and shared prototypes, effectively distinguishing unique and common visual concepts across datasets.
The comparative approach allows for a richer understanding of dataset content, revealing subtle differences and nuances not apparent from single-dataset analysis.
Case studies on ImageNet/PASS and three artwork datasets demonstrate the efficacy of dataset comparison in revealing dataset biases and highlighting unique characteristics. |
Interpreting the meaning of learned prototypes requires manual inspection, although visualization of attention maps can aid in this process.
The choice of a pre-trained backbone, particularly one trained on a dataset under comparison like ImageNet, might influence the learned prototypes and warrants further investigation. |
dataset comparison, prototype learning, self-supervised learning, vision transformer, dataset inspection |
2309.02270
Report |
SAM-Deblur: Let Segment Anything Boost Image Deblurring |
Siwei Li, Mingxuan Liu, Yating Zhang, Shu Chen, Haoxiang Li, Zifei Dou, Hong Chen |
Image deblurring is a critical task in the field of image restoration, aiming
to eliminate blurring artifacts. However, the challenge of addressing
non-uniform blurring leads to an ill-posed problem, which limits the
generalization performance of existing deblurring models. To solve the problem,
we propose a framework SAM-Deblur, integrating prior knowledge from the Segment
Anything Model (SAM) into the deblurring task for the first time. In
particular, SAM-Deblur is divided into three stages. First, we preprocess the
blurred images, obtain segment masks via SAM, and propose a mask dropout method
for training to enhance model robustness. Then, to fully leverage the
structural priors generated by SAM, we propose a Mask Average Pooling (MAP)
unit specifically designed to average SAM-generated segmented areas, serving as
a plug-and-play component which can be seamlessly integrated into existing
deblurring networks. Finally, we feed the fused features generated by the MAP
Unit into the deblurring model to obtain a sharp image. Experimental results on
the RealBlurJ, ReloBlur, and REDS datasets reveal that incorporating our
methods improves GoPro-trained NAFNet's PSNR by 0.05, 0.96, and 7.03,
respectively. The project page is available at
https://hplqaq.github.io/projects/sam-deblur. |
This paper introduces SAM-Deblur, a novel framework that integrates semantic priors from the Segment Anything Model (SAM) to enhance the performance of image deblurring, particularly in addressing the challenge of non-uniform blurring. |
Image deblurring models often struggle with generalizability, especially when dealing with non-uniform blurring in real-world scenarios. SAM-Deblur aims to improve the generalization performance of these models by incorporating semantic information. |
The SAM-Deblur framework consists of three primary stages: preprocessing of blurred images and mask generation using SAM, a novel Mask Average Pooling (MAP) unit to integrate SAM priors, and finally, feeding the fused features into a deblurring model (NAFNet). A mask dropout method is also used during training to enhance model robustness. |
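The averaging at the heart of the MAP unit can be sketched as follows: every pixel inside a segment is replaced by that segment's mean value. This is a hedged sketch of the pooling step only. The function name, the non-overlapping boolean mask layout, and the choice to leave unmasked pixels unchanged are assumptions, and how the pooled map is fused with the deblurring network's features is not specified in the summary.

```python
import numpy as np

def mask_average_pooling(image, masks):
    """Average pixel values inside each segment.

    image: (H, W, C) array.
    masks: (K, H, W) boolean array, one mask per segment
           (assumed non-overlapping for this sketch).
    Returns an (H, W, C) float array in which each masked pixel holds the
    per-channel mean of its segment; uncovered pixels keep their value.
    """
    out = image.astype(np.float64).copy()
    for m in masks:
        if m.any():  # skip empty masks (e.g. removed by mask dropout)
            out[m] = image[m].mean(axis=0)
    return out
```

Because the output has the same shape as the input, it can be concatenated with the blurred image as extra channels, which is one plausible way to use it as a plug-in prior.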
SAM-Deblur significantly improves the deblurring performance on out-of-distribution datasets like RealBlurJ, REDS, and ReloBlur.
The proposed method enhances PSNR on the tested datasets and reduces the Mode Collapse Rate (MCR), indicating better generalization ability.
The introduced MAP unit proves to be more effective in leveraging SAM priors compared to previously used methods, leading to superior results. |
The reliance on a pre-trained SAM model introduces additional computational overhead.
The performance of SAM-Deblur is contingent on the quality of masks generated by SAM, which can be influenced by factors like image quality and the scale of the SAM model.
Future work includes exploring alternative architectural designs for the MAP unit and evaluating SAM-Deblur in conjunction with other state-of-the-art deblurring models. |
image deblurring, segment anything model, out-of-distribution generalization, mask average pooling, semantic priors |
2309.02224
Report |
Dense Object Grounding in 3D Scenes |
Wencan Huang, Daizong Liu, Wei Hu |
Localizing objects in 3D scenes according to the semantics of a given natural
language is a fundamental and important task in the field of multimedia
understanding, which benefits various real-world applications such as robotics
and autonomous driving. However, the majority of existing 3D object grounding
methods are restricted to a single-sentence input describing an individual
object, which cannot comprehend and reason about more contextualized descriptions of
multiple objects in more practical 3D cases. To this end, we introduce a new
challenging task, called 3D Dense Object Grounding (3D DOG), to jointly
localize multiple objects described in a more complicated paragraph rather than
a single sentence. Instead of naively localizing each sentence-guided object
independently, we found that dense objects described in the same paragraph are
often semantically related and spatially located in a focused region of the 3D
scene. To explore such semantic and spatial relationships of densely referred
objects for more accurate localization, we propose a novel Stacked Transformer
based framework for 3D DOG, named 3DOGSFormer. Specifically, we first devise a
contextual query-driven local transformer decoder to generate initial grounding
proposals for each target object. Then, we employ a proposal-guided global
transformer decoder that exploits the local object features to learn their
correlation for further refining initial grounding proposals. Extensive
experiments on three challenging benchmarks (Nr3D, Sr3D, and ScanRefer) show
that our proposed 3DOGSFormer outperforms state-of-the-art 3D single-object
grounding methods and their dense-object variants by significant margins. |
This paper introduces 3D Dense Object Grounding (3D DOG), a new task aiming to localize multiple objects described in a paragraph within a 3D scene, and proposes 3DOGSFormer, a novel Stacked Transformer-based framework to address this task. |
Existing 3D object grounding methods are limited to single-sentence inputs, failing to capture the contextual semantic and spatial relationships crucial for understanding and localizing multiple objects described in a paragraph. |
3DOGSFormer employs a two-phase grounding pipeline. First, a contextual query-driven local transformer decoder generates initial grounding proposals for each sentence, leveraging semantic relations within the paragraph. Second, a proposal-guided global transformer decoder refines these proposals by capturing 3D spatial relations among objects. |
3D DOG methods, including the proposed 3DOGSFormer, significantly outperform 3D single-object grounding methods adapted to the dense grounding setting.
Jointly modeling multiple target objects within a paragraph through contextual query generation and global reasoning leads to substantial performance improvements in 3D DOG.
3DOGSFormer's proposal-guided global transformer decoder effectively captures 3D spatial relations among objects, further enhancing grounding accuracy. |
The performance of 3DOGSFormer is sensitive to the number of sentences in the input paragraph, with longer descriptions generally leading to better results.
Further research is needed to explore more sophisticated methods for handling complex linguistic structures and long-range dependencies in paragraph descriptions. |
3d dense object grounding, query-based proposal generation, global transformer, 3d vision and language, spatial relation understanding |
2309.02186
Report |
AniPortraitGAN: Animatable 3D Portrait Generation from 2D Image Collections |
Yue Wu, Sicheng Xu, Jianfeng Xiang, Fangyun Wei, Qifeng Chen, Jiaolong Yang, Xin Tong |
Previous animatable 3D-aware GANs for human generation have primarily focused
on either the human head or full body. However, head-only videos are relatively
uncommon in real life, and full body generation typically does not deal with
facial expression control and still has challenges in generating high-quality
results. Towards applicable video avatars, we present an animatable 3D-aware
GAN that generates portrait images with controllable facial expression, head
pose, and shoulder movements. It is a generative model trained on unstructured
2D image collections without using 3D or video data. For the new task, we base
our method on the generative radiance manifold representation and equip it with
learnable facial and head-shoulder deformations. A dual-camera rendering and
adversarial learning scheme is proposed to improve the quality of the generated
faces, which is critical for portrait images. A pose deformation processing
network is developed to generate plausible deformations for challenging regions
such as long hair. Experiments show that our method, trained on unstructured 2D
images, can generate diverse and high-quality 3D portraits with desired control
over different properties. |
This paper introduces AniPortraitGAN, the first animatable 3D-aware GAN for generating portrait images with controllable facial expressions, head poses, and shoulder movements from 2D image collections, without relying on 3D or video data. |
This method addresses limitations of previous 3D-aware GANs that focused solely on either heads or full bodies, aiming to create realistic and controllable virtual human portraits for applications like video conferencing. |
The method utilizes a generative radiance manifold representation with learnable facial and head-shoulder deformations guided by 3DMM and SMPL models. A dual-camera rendering scheme with multiple discriminators enhances face generation quality. A pose deformation processing network ensures plausible deformations, especially for challenging areas like hair. |
AniPortraitGAN generates high-quality, diverse 3D portraits with control over facial expressions, head poses, and shoulder movements.
The dual-camera rendering significantly improves face generation quality compared to single-camera approaches.
The pose deformation processing module ensures plausible and smooth deformations, particularly for hair, addressing limitations of standard skinning weight assignment methods. |
The model exhibits limitations in generating poses and expressions outside the training data distribution.
Control over attributes like eye gaze and environment lighting is absent in the current implementation. |
generative adversarial networks, 3d-aware image generation, animatable avatars, portrait generation, deep learning |
2309.02119
Report |
Hierarchical Masked 3D Diffusion Model for Video Outpainting |
Fanda Fan, Chaoxu Guo, Litong Gong, Biao Wang, Tiezheng Ge, Yuning Jiang, Chunjie Luo, Jianfeng Zhan |
Video outpainting aims to adequately complete missing areas at the edges of
video frames. Compared to image outpainting, it presents an additional
challenge as the model should maintain the temporal consistency of the filled
area. In this paper, we introduce a masked 3D diffusion model for video
outpainting. We use the technique of mask modeling to train the 3D diffusion
model. This allows us to use multiple guide frames to connect the results of
multiple video clip inferences, thus ensuring temporal consistency and reducing
jitter between adjacent frames. Meanwhile, we extract the global frames of the
video as prompts and guide the model to obtain information other than the
current video clip using cross-attention. We also introduce a hybrid
coarse-to-fine inference pipeline to alleviate the artifact accumulation
problem. The existing coarse-to-fine pipeline only uses the infilling strategy,
which brings degradation because the time interval of the sparse frames is too
large. Our pipeline benefits from bidirectional learning of the mask modeling
and thus can employ a hybrid strategy of infilling and interpolation when
generating sparse frames. Experiments show that our method achieves
state-of-the-art results in video outpainting tasks. More results and code are
provided at https://fanfanda.github.io/M3DDM/. |
This paper introduces a novel Masked 3D Diffusion Model (M3DDM) and a hybrid coarse-to-fine inference pipeline specifically designed for video outpainting. |
Video outpainting requires maintaining temporal consistency across frames, a challenge unmet by existing image outpainting techniques. This work addresses the limitations of previous video outpainting methods in handling long videos and complex motions. |
The M3DDM is trained with a mask modeling technique that uses guide frames to enhance temporal consistency and reduce jitter. Global video clips are integrated as prompts to provide global context. The hybrid coarse-to-fine pipeline leverages infilling and interpolation for long videos to minimize artifact accumulation. |
The M3DDM outperforms previous methods in generating high temporal consistency and visually plausible outpainting results on DAVIS, YouTube-VOS, and a 5M E-commerce dataset.
The hybrid coarse-to-fine pipeline effectively mitigates artifact accumulation in long video outpainting.
The use of guide frames and global video prompts significantly improves temporal consistency and content realism. |
The method's reliance on a fixed VAE encoder can lead to limitations in depicting fine structures like human faces.
The model's sensitivity to initial Gaussian noise during sampling may cause edge blurring in some cases.
Future work can focus on improving the robustness of text generation within videos and addressing the limitations of the VAE encoder. |
video outpainting, diffusion model, mask modeling, coarse-to-fine, temporal consistency |
2309.02049
Report |
Diffusion-based 3D Object Detection with Random Boxes |
Xin Zhou, Jinghua Hou, Tingting Yao, Dingkang Liang, Zhe Liu, Zhikang Zou, Xiaoqing Ye, Jianwei Cheng, Xiang Bai |
3D object detection is an essential task for achieving autonomous driving.
Existing anchor-based detection methods rely on empirical, heuristic anchor
settings, which makes the algorithms lack elegance. In recent years, we have
witnessed the rise of several generative models, among which diffusion models
show great potential for learning the transformation of two distributions. Our
proposed Diff3Det migrates the diffusion model to proposal generation for 3D
object detection by considering the detection boxes as generative targets.
During training, the object boxes diffuse from the ground truth boxes to the
Gaussian distribution, and the decoder learns to reverse this noise process. In
the inference stage, the model progressively refines a set of random boxes to
the prediction results. We provide detailed experiments on the KITTI benchmark
and achieve promising performance compared to classical anchor-based 3D
detection methods. |
This paper presents Diff3Det, a novel 3D object detection framework that leverages diffusion models for proposal generation, eliminating the need for pre-defined anchor boxes. |
Existing anchor-based 3D object detection methods rely on manually set anchors, lacking elegance and potentially hindering performance. This work explores the potential of diffusion models in 3D vision for more flexible and effective proposal generation. |
Diff3Det employs a diffusion-guided proposal generator that corrupts ground truth boxes with Gaussian noise during training. A 3D encoder extracts point cloud features, while a decoder learns to recover the original boxes from noisy ones. The inference involves progressively refining randomly generated boxes to predictions through a reverse diffusion process. |
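The training-time corruption of ground truth boxes can be illustrated with the standard DDPM forward process, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. The sketch below assumes a linear beta schedule and a 7-value box encoding (center, size, yaw). Both are illustrative assumptions, as are the function names; the summary does not state which schedule or box parameterization Diff3Det uses.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def corrupt_boxes(gt_boxes, t, alpha_bar, rng):
    """Diffuse ground-truth boxes to timestep t.

    gt_boxes: (M, D) array of box parameters (D = 7 is a hypothetical encoding).
    Returns the noisy boxes and the Gaussian noise that was added, which a
    decoder could be trained to undo.
    """
    noise = rng.standard_normal(gt_boxes.shape)
    a = alpha_bar[t]
    noisy = np.sqrt(a) * gt_boxes + np.sqrt(1.0 - a) * noise
    return noisy, noise
```

At the final timestep alpha_bar is close to zero, so the corrupted boxes are nearly pure Gaussian noise. This matches the inference setting described above, where refinement starts from random boxes.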
Diff3Det achieves competitive performance compared to state-of-the-art anchor-based methods on the KITTI benchmark.
The proposed size correlation and dynamic time step strategies for proposal refinement demonstrate significant performance improvement.
Increasing sampling steps during inference further boosts performance, particularly for hard examples, highlighting the benefit of the iterative denoising process. |
The method suffers from slow convergence due to the difficulty of regressing from random boxes.
Future work will focus on exploring fast-converging diffusion-based 3D object detection. |
3d object detection, diffusion models, proposal generation, autonomous driving, point cloud processing |
2309.01958
Report |
Empowering Low-Light Image Enhancer through Customized Learnable Priors |
Naishan Zheng, Man Zhou, Yanmeng Dong, Xiangyu Rui, Jie Huang, Chongyi Li, Feng Zhao |
Deep neural networks have achieved remarkable progress in enhancing low-light
images by improving their brightness and eliminating noise. However, most
existing methods construct end-to-end mapping networks heuristically,
neglecting the intrinsic prior of image enhancement task and lacking
transparency and interpretability. Although some unfolding solutions have been
proposed to relieve these issues, they rely on proximal operator networks that
deliver ambiguous and implicit priors. In this work, we propose a paradigm for
low-light image enhancement that explores the potential of customized learnable
priors to improve the transparency of the deep unfolding paradigm. Motivated by
the powerful feature representation capability of Masked Autoencoder (MAE), we
customize MAE-based illumination and noise priors and redevelop them from two
perspectives: 1) structure flow: we train the MAE from a normal-light
image to its illumination properties and then embed it into the proximal
operator design of the unfolding architecture; and 2) optimization
flow: we train MAE from a normal-light image to its gradient representation
and then employ it as a regularization term to constrain noise in the model
output. These designs improve the interpretability and representation
capability of the model. Extensive experiments on multiple low-light image
enhancement datasets demonstrate the superiority of our proposed paradigm over
state-of-the-art methods. Code is available at
https://github.com/zheng980629/CUE. |
This paper proposes a new deep unfolding paradigm, Customized Unfolding Enhancer (CUE), for low-light image enhancement, which leverages customized learnable priors for illumination and noise. |
Most existing deep learning methods for low-light image enhancement lack transparency and interpretability due to heuristically constructed networks. This paper addresses this by integrating learnable priors based on intrinsic image properties. |
The authors use a Masked Autoencoder (MAE) to learn illumination and noise priors. The illumination prior is trained to predict illumination maps filtered by a bilateral filter, and is embedded into the unfolding architecture. The noise prior learns gradient representations and is used as a regularization term to reduce noise. |
CUE outperforms state-of-the-art methods on LOL and Huawei datasets in terms of PSNR, SSIM, and NIQE.
The learned illumination prior enhances the transparency of the unfolding architecture and improves visual quality.
The noise prior effectively reduces noise in enhanced images and also demonstrates promising results for image denoising. |
The performance of CUE may be further improved by exploring more sophisticated prior designs.
The computational cost of CUE is relatively high compared to some lightweight methods. |
low-light image enhancement, deep unfolding, learnable priors, masked autoencoder, image denoising |
2309.01858
Report |
Towards Universal Image Embeddings: A Large-Scale Dataset and Challenge for Generic Image Representations |
Nikolaos-Antonios Ypsilantis, Kaifeng Chen, Bingyi Cao, Mário Lipovský, Pelin Dogan-Schönberger, Grzegorz Makosa, Boris Bluntschli, Mojtaba Seyedhosseini, Ondřej Chum, André Araujo |
Fine-grained and instance-level recognition methods are commonly trained and
evaluated on specific domains, in a model per domain scenario. Such an
approach, however, is impractical in real large-scale applications. In this
work, we address the problem of universal image embedding, where a single
universal model is trained and used in multiple domains. First, we leverage
existing domain-specific datasets to carefully construct a new large-scale
public benchmark for the evaluation of universal image embeddings, with 241k
query images, 1.4M index images and 2.8M training images across 8 different
domains and 349k classes. We define suitable metrics, training and evaluation
protocols to foster future research in this area. Second, we provide a
comprehensive experimental evaluation on the new dataset, demonstrating that
existing approaches and simplistic extensions lead to worse performance than an
assembly of models trained for each domain separately. Finally, we conducted a
public research competition on this topic, leveraging industrial datasets,
which attracted the participation of more than 1k teams worldwide. This
exercise generated many interesting research ideas and findings which we
present in detail. Project webpage: https://cmp.felk.cvut.cz/univ_emb/ |
This paper introduces the Universal Embedding Dataset (UnED) for training and evaluating universal image embeddings, which aim to discriminate fine-grained objects across multiple domains. |
Universal image embeddings are crucial for general-purpose visual search systems, as using domain-specific models is impractical. Previous research lacked a standard large-scale dataset for this purpose. |
The authors construct UnED from existing public datasets, encompassing 4.1M images, 349k classes, and 8 domains. They benchmark various pre-trained models and propose universal embedding training methods based on joint and separate classifiers with different sampling strategies. |
DINOv2 pretraining yields the best off-the-shelf performance for universal embedding.
Direct extensions of specialist training to universal embedding show promising results, approaching specialist performance on some domains.
The Google Universal Image Embedding Challenge revealed the effectiveness of image-text foundation models (like CLIP) for pre-training, multi-stage finetuning, and careful training data selection. |
The baseline universal embedding training methods are simplistic and don't exploit domain-specific knowledge.
Future work can explore more advanced training techniques and architectures specifically designed for universal embedding. |
image embedding, universal representation learning, fine-grained recognition, image retrieval, benchmark dataset |
2309.01770
Report |
StyleAdapter: A Single-Pass LoRA-Free Model for Stylized Image Generation |
Zhouxia Wang, Xintao Wang, Liangbin Xie, Zhongang Qi, Ying Shan, Wenping Wang, Ping Luo |
This paper presents a LoRA-free method for stylized image generation that
takes a text prompt and style reference images as inputs and produces an output
image in a single pass. Unlike existing methods that rely on training a
separate LoRA for each style, our method can adapt to various styles with a
unified model. However, this poses two challenges: 1) the prompt loses
controllability over the generated content, and 2) the output image inherits
both the semantic and style features of the style reference image, compromising
its content fidelity. To address these challenges, we introduce StyleAdapter, a
model that comprises two components: a two-path cross-attention module (TPCA)
and three decoupling strategies. These components enable our model to process
the prompt and style reference features separately and reduce the strong
coupling between the semantic and style information in the style references.
StyleAdapter can generate high-quality images that match the content of the
prompts and adopt the style of the references (even for unseen styles) in a
single pass, which is more flexible and efficient than previous methods.
Experiments have been conducted to demonstrate the superiority of our method
over previous works. |
This paper presents StyleAdapter, a LoRA-free method for generating stylized images in a single pass using text prompts and style reference images. |
Existing methods for stylized image generation either struggle to capture detailed style from text descriptions or require computationally expensive fine-tuning for each new style. |
StyleAdapter leverages a two-path cross-attention module (TPCA) to process prompt and style features independently and employs three decoupling strategies to separate semantic and style information in reference images. |
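The idea of processing prompt and style features in separate attention paths can be sketched as below. This is a simplified, single-head illustration without learned projections. The residual connection, the scalar `style_weight` used to fuse the two paths, and all function names are assumptions made for this example and are not confirmed by the summary.

```python
import numpy as np

def cross_attention(q, kv):
    """Single-head scaled dot-product attention of queries q (N, D) over kv (M, D)."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ kv

def two_path_cross_attention(x, text_feats, style_feats, style_weight=1.0):
    """Attend to text and style features independently, then fuse.

    x:           (N, D) image tokens used as queries.
    text_feats:  (T, D) prompt embeddings (path 1).
    style_feats: (S, D) style-reference embeddings (path 2).
    style_weight: hypothetical scalar controlling the style path's contribution.
    """
    text_out = cross_attention(x, text_feats)
    style_out = cross_attention(x, style_feats)
    return x + text_out + style_weight * style_out
```

Keeping the paths separate means the style tokens never compete with prompt tokens inside one softmax, which is one way to read the summary's claim that the prompt retains control over content.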
StyleAdapter generates high-quality images consistent with prompts and style references, even for unseen styles.
It outperforms existing methods in balancing content fidelity and stylization, as demonstrated by qualitative and quantitative comparisons.
StyleAdapter can be integrated with controllable synthesis methods, such as T2I-adapter, for enhanced control over image generation. |
StyleAdapter's stylization performance may not always match LoRA, which is specifically trained for each style.
The reliance on pre-trained Stable Diffusion might lead to the generation of unethical content. |
stylized image generation, lora-free, text-to-image synthesis, two-path cross-attention, semantic and style decoupling |
2309.01694
Report |
No Data Augmentation? Alternative Regularizations for Effective Training on Small Datasets |
Lorenzo Brigato, Stavroula Mougiakakou |
Solving image classification tasks given small training datasets remains an
open challenge for modern computer vision. Aggressive data augmentation and
generative models are among the most straightforward approaches to overcoming
the lack of data. However, the first fails to be agnostic to varying image
domains, while the latter requires additional compute and careful design. In
this work, we study alternative regularization strategies to push the limits of
supervised learning on small image classification datasets. In particular,
along with the model size and training schedule scaling, we employ a heuristic
to select (semi) optimal learning rate and weight decay couples via the norm of
model parameters. By training on only 1% of the original CIFAR-10 training set
(i.e., 50 images per class) and testing on ciFAIR-10, a variant of the original
CIFAR without duplicated images, we reach a test accuracy of 66.5%, on par with
the best state-of-the-art methods. |
This paper introduces a simple yet effective training methodology to enhance image classification accuracy when dealing with small training datasets, achieving results comparable to state-of-the-art methods that rely on extensive data augmentation or generative models. |
Improving data efficiency in deep learning is crucial, particularly in domains where data is scarce or expensive to collect, such as medicine. This method offers a practical and transferable alternative to complex data augmentation techniques. |
The methodology involves a combination of:
- A heuristic for selecting optimal learning rate and weight decay pairs based on the norm of model parameters.
- Removal of momentum from the optimizer.
- Scaling of model size (specifically width).
- Increasing training schedule length. |
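As a rough illustration of the norm-based heuristic listed above, the sketch below computes the L2 norm of each candidate's parameters and ranks (learning rate, weight decay) pairs by it. The selection rule used here (prefer the smallest norm), the function names, and the toy weights are assumptions made for illustration only; the summary does not state the exact criterion.

```python
import math

def parameter_norm(params):
    """L2 norm over all parameter values (params is a list of flattened layers)."""
    return math.sqrt(sum(w * w for layer in params for w in layer))

def rank_by_norm(candidates):
    """Sort (lr, wd) candidates by parameter norm, smallest first.

    `candidates` maps (lr, wd) -> list of flattened layers.
    Preferring the smallest norm is an assumption, not the paper's stated rule.
    """
    scored = [(parameter_norm(p), cfg) for cfg, p in candidates.items()]
    return [cfg for _, cfg in sorted(scored)]

# Toy values standing in for weights obtained after short training runs.
candidates = {
    (0.1, 5e-4): [[3.0, 4.0], [0.0]],    # norm 5.0
    (0.01, 5e-3): [[1.0, 2.0], [2.0]],   # norm 3.0
    (0.001, 5e-2): [[6.0, 8.0], [0.0]],  # norm 10.0
}
ranking = rank_by_norm(candidates)
best_lr, best_wd = ranking[0]
```

In practice the norms would be read from trained checkpoints, and the ranking would be checked against held-out accuracy before being trusted.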
A simple Wide ResNet-16-1 trained with this method achieves 66.5% accuracy on ciFAIR-10, matching the performance of state-of-the-art methods that use complex augmentation strategies.
The proposed hyperparameter selection method based on parameter norm proves to be a reliable predictor of generalization performance.
Increasing model size and training length further boosts performance, highlighting the importance of these factors in small data regimes. |
The experiments were conducted on a single dataset (ciFAIR-10), so further validation on diverse datasets is necessary.
While the optimal hyperparameter combination remained consistent across different model sizes, further investigation is needed to understand this observation fully. |
image classification, small data, data efficiency, regularization, hyperparameter optimization |
2309.01430
Report |
DAT++: Spatially Dynamic Vision Transformer with Deformable Attention |
Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, Gao Huang |
Transformers have shown superior performance on various vision tasks. Their
large receptive field endows Transformer models with higher representation
power than their CNN counterparts. Nevertheless, simply enlarging the receptive
field also raises several concerns. On the one hand, using dense attention in
ViT leads to excessive memory and computational cost, and features can be
influenced by irrelevant parts that are beyond the regions of interest. On the
other hand, the handcrafted attention adopted in PVT or Swin Transformer is
data agnostic and may limit the ability to model long-range relations. To solve
this dilemma, we propose a novel deformable multi-head attention module, where
the positions of key and value pairs in self-attention are adaptively allocated
in a data-dependent way. This flexible scheme enables the proposed deformable
attention to dynamically focus on relevant regions while maintaining the
representation power of global attention. On this basis, we present Deformable
Attention Transformer (DAT), a general vision backbone efficient and effective
for visual recognition. We further build an enhanced version DAT++. Extensive
experiments show that our DAT++ achieves state-of-the-art results on various
visual recognition benchmarks, with 85.9% ImageNet accuracy, 54.5 and 47.0
MS-COCO instance segmentation mAP, and 51.5 ADE20K semantic segmentation mIoU. |
This paper proposes DAT++, a hierarchical Vision Transformer utilizing a novel deformable multi-head attention module (DMHA) for dynamic and data-dependent allocation of key and value pairs in self-attention. |
DAT++ addresses limitations of traditional Vision Transformers, such as dense attention leading to high computational cost and handcrafted sparse attention being data-agnostic, by enabling flexible and adaptive attention to relevant image regions. |
DMHA employs an offset generation network to predict offsets for reference points, guiding bilinear sampling of features from important image regions to form deformed keys and values for attention computation. This allows for data-dependent sparse attention with linear space complexity. |
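The sampling step described above can be sketched in plain Python: reference points are shifted by offsets, and the feature map is read at the shifted positions with bilinear interpolation. This is a minimal single-channel sketch; the function names are hypothetical, the offsets are hard-coded rather than predicted by an offset network, and clamping at the border is an assumption.

```python
import math

def bilinear_sample(feat, x, y):
    """Sample a 2D feature map (list of rows) at continuous coordinates (x, y)."""
    h, w = len(feat), len(feat[0])
    # Clamp so that shifted points cannot leave the map (assumed border rule).
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * feat[y0][x0] + ax * feat[y0][x1]
    bottom = (1 - ax) * feat[y1][x0] + ax * feat[y1][x1]
    return (1 - ay) * top + ay * bottom

def deformed_samples(feat, ref_points, offsets):
    """Shift each reference point by its offset and sample the feature there."""
    return [bilinear_sample(feat, rx + dx, ry + dy)
            for (rx, ry), (dx, dy) in zip(ref_points, offsets)]

feat = [[0.0, 1.0],
        [2.0, 3.0]]
ref_points = [(0.0, 0.0), (1.0, 1.0)]
offsets = [(0.5, 0.5), (-0.5, 0.0)]  # stand-ins for offset-network outputs
samples = deformed_samples(feat, ref_points, offsets)
```

The sampled features would then be projected to keys and values, so the attention cost depends on the number of sampled points rather than on the full spatial grid.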
DAT++ achieves 85.9% Top-1 accuracy on ImageNet image classification.
It achieves 54.5 bbox mAP and 47.0 mask mAP on MS-COCO instance segmentation.
It achieves 51.5 mIoU on ADE20K semantic segmentation, demonstrating state-of-the-art performance across diverse visual recognition tasks. |
The performance gain of incorporating DMHA in earlier stages of the model tends to saturate.
The current implementation of DMHA does not leverage techniques like EMA, LayerScale, or layer-wise learning rate decay, which could potentially lead to further improvements. |
vision transformer, deformable attention, dynamic neural networks, image classification, object detection |
2309.01409
Report |
Implicit Neural Image Stitching |
Minsu Kim, Jaewon Lee, Byeonghun Lee, Sunghoon Im, Kyong Hwan Jin |
Existing frameworks for image stitching often provide visually reasonable
stitchings. However, they suffer from blurry artifacts and disparities in
illumination, depth level, etc. Although recent learning-based stitching methods
relax such disparities, they sacrifice image quality and fail to capture
high-frequency details in stitched images. To
address the problem, we propose a novel approach, implicit Neural Image
Stitching (NIS) that extends arbitrary-scale super-resolution. Our method
estimates Fourier coefficients of images for quality-enhancing warps. Then, the
suggested model blends color mismatches and misalignment in the latent space
and decodes the features into RGB values of stitched images. Our experiments
show that our approach achieves improvement in resolving the low-definition
imaging of the previous deep image stitching with favorable accelerated
image-enhancing methods. Our source code is available at
https://github.com/minshu-kim/NIS. |
This paper proposes Neural Image Stitching (NIS), an implicit neural representation method for image stitching that enhances the resolution and quality of stitched images. |
Existing image stitching methods often result in blurry artifacts and struggle to handle disparities in illumination and parallax errors. This work aims to improve the quality of stitched images by leveraging implicit neural representations. |
NIS uses a two-stage training strategy: 1) learns high-frequency details through supervised learning on synthetic data, 2) learns to blend images and reduce artifacts by minimizing a photometric seam loss on real images. The model utilizes a neural warping module to extract detail-aware features, Fourier coefficients to represent high-frequency details, and a blender to merge features from multiple images. |
NIS outperforms traditional stitching methods (bilinear, bicubic) and a recent deep stitching method (UDIS) in terms of PSNR and SSIM on synthetic images.
On real images, NIS with fine-tuning achieves better NIQE, PIQE, and BRISQUE scores compared to other methods, including feature-based and learning-based stitching.
Ablation studies show the importance of Fourier features and the effectiveness of the two-stage training strategy. |
NIS currently exhibits higher computational cost for very high-resolution images compared to UDIS.
Future work will focus on developing a fully end-to-end deep image stitching pipeline that integrates alignment and reconstruction. |
image stitching, implicit neural representation, super-resolution, image blending, fourier features |
2309.01369
Report |
Exploring Limits of Diffusion-Synthetic Training with Weakly Supervised Semantic Segmentation |
Ryota Yoshihashi, Yuya Otsuka, Kenji Doi, Tomohiro Tanaka, Hirokatsu Kataoka |
The advance of generative models for images has inspired various training
techniques for image recognition utilizing synthetic images. In semantic
segmentation, one promising approach is extracting pseudo-masks from attention
maps in text-to-image diffusion models, which enables
real-image-and-annotation-free training. However, the pioneering training
method using the diffusion-synthetic images and pseudo-masks, i.e., DiffuMask
has limitations in terms of mask quality, scalability, and ranges of applicable
domains. To overcome these limitations, this work introduces three techniques
for diffusion-synthetic semantic segmentation training. First,
reliability-aware robust training, originally used in weakly supervised
learning, helps segmentation with insufficient synthetic mask quality. Second,
we introduce prompt augmentation, data augmentation applied to the prompt text
set to scale up and diversify training images with limited text resources.
Finally, LoRA-based adaptation of Stable Diffusion enables the transfer to a
distant domain, e.g., auto-driving images. Experiments in PASCAL VOC,
ImageNet-S, and Cityscapes show that our method effectively closes the gap
between real and synthetic training in semantic segmentation. |
This paper proposes Attn2mask, a real-image-and-annotation-free semantic segmentation method that leverages diffusion models for synthetic training data generation. |
Attn2mask addresses limitations in previous diffusion-synthetic training methods by framing the problem as weakly supervised learning, enabling accurate segmentation from potentially inaccurate generated labels. |
The method generates training images and pseudo-masks using Stable Diffusion and its cross-attention maps. It then employs reliability-aware robust co-training to handle inaccuracies in the pseudo-masks. Additional techniques include prompt augmentation for data diversity and LoRA-based adaptation for domain transfer. |
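To make the pseudo-mask step concrete, the sketch below thresholds a cross-attention map into foreground, background, and an "ignore" band for uncertain pixels. It is a simplified stand-in for reliability-aware training, not the paper's co-training procedure; the thresholds, the ignore label 255, and the function name are assumptions.

```python
IGNORE = 255  # label for pixels excluded from the loss (assumed convention)

def attention_to_pseudo_mask(attn, class_id, low=0.3, high=0.7):
    """Convert an attention map (list of rows) into a pseudo-mask.

    Pixels with high normalized attention get `class_id`, pixels with low
    attention get background (0), and everything in between is ignored.
    """
    lo_v = min(min(row) for row in attn)
    hi_v = max(max(row) for row in attn)
    scale = (hi_v - lo_v) or 1.0  # avoid division by zero on flat maps
    mask = []
    for row in attn:
        out = []
        for a in row:
            p = (a - lo_v) / scale
            if p >= high:
                out.append(class_id)
            elif p <= low:
                out.append(0)
            else:
                out.append(IGNORE)
        mask.append(out)
    return mask

attn = [[0.0, 0.2],
        [0.5, 1.0]]
pseudo_mask = attention_to_pseudo_mask(attn, class_id=12)
```

A segmentation loss that skips the ignored pixels then trains only on the confident parts of each generated mask.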
Attn2mask achieves 62.2 mIoU on PASCAL VOC without using real images or annotations, outperforming prior diffusion-synthetic methods.
It demonstrates competitive performance on ImageNet-S, showcasing scalability to larger datasets and more classes.
LoRA-based adaptation significantly improves performance on Cityscapes, highlighting its effectiveness for domain transfer. |
The performance of Attn2mask, while impressive for real-image-free training, is still lower than fully supervised or real-image-based weakly supervised methods.
The method relies on the quality and biases present in the Stable Diffusion model and its training data. |
diffusion model, semantic segmentation, weakly supervised learning, diffusion-synthetic training, domain adaptation |
2309.01141
Report |
VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders |
Xuyang Liu, Siteng Huang, Yachen Kang, Honggang Chen, Donglin Wang |
Large-scale text-to-image diffusion models have shown impressive capabilities
for generative tasks by leveraging strong vision-language alignment from
pre-training. However, most vision-language discriminative tasks require
extensive fine-tuning on carefully-labeled datasets to acquire such alignment,
with great cost in time and computing resources. In this work, we explore
directly applying a pre-trained generative diffusion model to the challenging
discriminative task of visual grounding without any fine-tuning or additional
training dataset. Specifically, we propose VGDiffZero, a simple yet effective
zero-shot visual grounding framework based on text-to-image diffusion models.
We also design a comprehensive region-scoring method considering both global
and local contexts of each isolated proposal. Extensive experiments on RefCOCO,
RefCOCO+, and RefCOCOg show that VGDiffZero achieves strong performance on
zero-shot visual grounding. Our code is available at
https://github.com/xuyang-liu16/VGDiffZero. |
This paper presents VGDiffZero, a zero-shot visual grounding framework that leverages pre-trained text-to-image diffusion models without fine-tuning. |
Fine-tuning vision-language models for discriminative tasks like visual grounding is expensive. This paper explores using pre-trained generative diffusion models for this task in a zero-shot setting. |
VGDiffZero uses a pre-trained diffusion model (Stable Diffusion) and proposes a region-scoring method. It injects noise into latent representations of object proposals and uses the denoising process to assess the alignment between the proposal and text query. |
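The scoring idea above can be sketched as follows: add noise to each proposal's latent, ask a text-conditioned denoiser to predict that noise, and prefer the proposal with the lowest prediction error. The `toy_denoiser` below is a hypothetical stand-in for Stable Diffusion's noise predictor, and the sketch uses a single noise level and omits the global and local context handling.

```python
import random

def denoising_error(latent, text, denoiser, n_trials=4, seed=0):
    """Average squared error between injected noise and the predicted noise."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        noise = [rng.gauss(0.0, 1.0) for _ in latent]
        noisy = [z + e for z, e in zip(latent, noise)]
        pred = denoiser(noisy, text)
        total += sum((e - p) ** 2 for e, p in zip(noise, pred)) / len(latent)
    return total / n_trials

def pick_proposal(latents, text, denoiser):
    """Return the index of the proposal with the lowest denoising error."""
    errors = [denoising_error(z, text, denoiser) for z in latents]
    best = min(range(len(errors)), key=errors.__getitem__)
    return best, errors

# Hypothetical denoiser: assumes the clean latent for a text is a fixed prototype.
PROTOTYPES = {"red cup": [1.0, 0.0]}

def toy_denoiser(noisy, text):
    return [n - c for n, c in zip(noisy, PROTOTYPES[text])]

proposal_latents = [[0.0, 1.0], [1.0, 0.0]]
best, errors = pick_proposal(proposal_latents, "red cup", toy_denoiser)
```

With a real diffusion model, the latents would come from encoding each cropped or masked proposal, and the error would typically be averaged over several timesteps.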
VGDiffZero outperforms other zero-shot visual grounding baselines on RefCOCO, RefCOCO+, and RefCOCOg datasets.
Considering both global and local contexts of object proposals improves performance.
Larger-scale pre-trained diffusion models lead to better visual grounding accuracy. |
The performance improvement from using core expressions instead of full expressions is inconsistent across datasets.
Future work can explore different proposal generation methods or more efficient diffusion model architectures. |
visual grounding, diffusion models, zero-shot learning, vision-language models, stable diffusion |
2309.00908
Report |
MagicProp: Diffusion-based Video Editing via Motion-aware Appearance Propagation |
Hanshu Yan, Jun Hao Liew, Long Mai, Shanchuan Lin, Jiashi Feng |
This paper addresses the issue of modifying the visual appearance of videos
while preserving their motion. A novel framework, named MagicProp, is proposed,
which disentangles the video editing process into two stages: appearance
editing and motion-aware appearance propagation. In the first stage, MagicProp
selects a single frame from the input video and applies image-editing
techniques to modify the content and/or style of the frame. The flexibility of
these techniques enables the editing of arbitrary regions within the frame. In
the second stage, MagicProp employs the edited frame as an appearance reference
and generates the remaining frames using an autoregressive rendering approach.
To achieve this, a diffusion-based conditional generation model, called
PropDPM, is developed, which synthesizes the target frame by conditioning on
the reference appearance, the target motion, and its previous appearance. The
autoregressive editing approach ensures temporal consistency in the resulting
videos. Overall, MagicProp combines the flexibility of image-editing techniques
with the superior temporal consistency of autoregressive modeling, enabling
flexible editing of object types and aesthetic styles in arbitrary regions of
input videos while maintaining good temporal consistency across frames.
Extensive experiments in various video editing scenarios demonstrate the
effectiveness of MagicProp. |
MagicProp, a novel two-stage framework, edits video appearances while preserving motion by first editing a reference frame and then propagating the appearance to other frames based on the original motion. |
Existing methods struggle to balance temporal consistency and editing flexibility. MagicProp addresses this by leveraging powerful image editing techniques and autoregressive modeling for flexible editing with high temporal consistency. |
MagicProp uses an image diffusion model (e.g., ControlNet) for appearance editing in the first stage. Then, a novel diffusion-based conditional generation model, PropDPM, synthesizes the target video frame-by-frame, conditioned on the reference appearance, target motion (depth map), and previous frame. |
MagicProp enables flexible editing of object types and aesthetic styles in arbitrary regions of input videos.
The method maintains good temporal consistency across frames thanks to its autoregressive approach.
Zero-Terminal-SNR noise schedule and a novel appearance adaptor help to alleviate error accumulation and color shifting. |
The current implementation may exhibit degradation in quality for videos longer than 30 frames due to error accumulation.
Future work aims to improve MagicProp's capability to handle longer videos. |
video editing, appearance editing, motion preservation, diffusion models, temporal consistency |
2309.00828
Report |
When 3D Bounding-Box Meets SAM: Point Cloud Instance Segmentation with Weak-and-Noisy Supervision |
Qingtao Yu, Heming Du, Chen Liu, Xin Yu |
Learning from bounding-box annotations has shown great potential in
weakly-supervised 3D point cloud instance segmentation. However, we observed
that existing methods would suffer severe performance degradation with
perturbed bounding box annotations. To tackle this issue, we propose a
complementary image prompt-induced weakly-supervised point cloud instance
segmentation (CIP-WPIS) method. CIP-WPIS leverages pretrained knowledge
embedded in the 2D foundation model SAM and a 3D geometric prior to achieve
accurate point-wise instance labels from the bounding box annotations.
Specifically, CIP-WPIS first selects image views in which 3D candidate points of
an instance are fully visible. Then, we generate complementary background and
foreground prompts from projections to obtain SAM 2D instance mask predictions.
According to these, we assign the confidence values to points indicating the
likelihood of points belonging to the instance. Furthermore, we utilize 3D
geometric homogeneity provided by superpoints to decide the final instance
label assignments. In this fashion, we achieve high-quality 3D point-wise
instance labels. Extensive experiments on both ScanNet-v2 and S3DIS benchmarks
demonstrate that our method is robust against noisy 3D bounding-box annotations
and achieves state-of-the-art performance. |
This paper presents CIP-WPIS, a method for weakly-supervised 3D point cloud instance segmentation that is robust to noisy bounding box annotations. |
Existing methods struggle with performance degradation when bounding box annotations are inaccurate, which is common in real-world scenarios. This method leverages readily available noisy annotations to achieve accurate instance segmentation. |
CIP-WPIS utilizes a greedy algorithm to select image views where instance points are visible and generates complementary image prompts for the Segment Anything Model (SAM). It then uses SAM predictions and 3D geometric constraints from superpoints to refine point-wise instance labels. |
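One plausible reading of the greedy view selection is a set-cover style loop: repeatedly pick the view that reveals the most not-yet-covered instance points. The sketch below follows that reading; the set-cover formulation, the function name, and the toy visibility sets are assumptions, since the summary does not give the exact selection rule.

```python
def greedy_view_selection(visible, n_points):
    """Greedily choose views until no view adds newly visible points.

    visible: dict mapping view id -> set of point indices visible in that view.
    Returns the chosen view ids and the points that no view could cover.
    """
    uncovered = set(range(n_points))
    chosen = []
    if not visible:
        return chosen, uncovered
    while uncovered:
        # Pick the view that covers the most still-uncovered points.
        best = max(visible, key=lambda v: len(visible[v] & uncovered))
        gain = visible[best] & uncovered
        if not gain:
            break  # remaining points are not visible in any view
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered

visible = {
    "v0": {0, 1, 2},
    "v1": {2, 3},
    "v2": {3, 4, 5},
    "v3": {1},
}
views, missed = greedy_view_selection(visible, n_points=6)
```

Each selected view would then supply foreground and background prompts to SAM, with the resulting 2D masks projected back onto the points.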
CIP-WPIS achieves state-of-the-art performance on ScanNet-v2 and S3DIS datasets, even with noisy bounding boxes.
The method demonstrates strong robustness to increasing noise levels in bounding box annotations.
Using complementary prompts and 3D geometric consistency significantly improves labeling accuracy compared to using noisy boxes directly. |
Labeling accuracy, while improved, still doesn't reach the level of human annotation.
The greedy view selection, while balancing performance and computation, could be further optimized. |
3d point cloud, instance segmentation, weakly supervised learning, noisy annotations, segment anything model (sam) |
2309.00775
Report |
Contrastive Feature Masking Open-Vocabulary Vision Transformer |
Dahun Kim, Anelia Angelova, Weicheng Kuo |
We present Contrastive Feature Masking Vision Transformer (CFM-ViT) - an
image-text pretraining methodology that achieves simultaneous learning of
image- and region-level representation for open-vocabulary object detection
(OVD). Our approach combines the masked autoencoder (MAE) objective into the
contrastive learning objective to improve the representation for localization
tasks. Unlike standard MAE, we perform reconstruction in the joint image-text
embedding space, rather than the pixel space as is customary with the classical
MAE method, which causes the model to better learn region-level semantics.
Moreover, we introduce Positional Embedding Dropout (PED) to address scale
variation between image-text pretraining and detection finetuning by randomly
dropping out the positional embeddings during pretraining. PED improves
detection performance and enables the use of a frozen ViT backbone as a region
classifier, preventing the forgetting of open-vocabulary knowledge during
detection finetuning. On LVIS open-vocabulary detection benchmark, CFM-ViT
achieves a state-of-the-art 33.9 AP$_r$, surpassing the best approach by 7.6
points and achieves better zero-shot detection transfer. Finally, CFM-ViT
acquires strong image-level representation, outperforming the state of the art
on 8 out of 12 metrics on zero-shot image-text retrieval benchmarks. |
Proposes Contrastive Feature Masking Vision Transformer (CFM-ViT), an image-text pretraining methodology for open-vocabulary object detection (OVD) that combines masked autoencoder objectives with contrastive learning to enhance object localization. |
Addresses the limitations of existing VLMs, which are primarily optimized for image-level tasks and lack adequate utilization of pixel- and region-level information crucial for OVD. |
Introduces Contrastive Feature Masking to predict masked image regions in the joint image-text embedding space and proposes Positional Embedding Dropout (PED) to enhance region-level representation learning and enable frozen ViT encoder usage during detection. |
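Positional Embedding Dropout can be sketched as a small change to how positional embeddings are added to patch tokens. The version below drops the whole positional embedding for a training step with some probability; that granularity, the function name, and the way the probability is passed in are assumptions for illustration.

```python
import random

def add_positional_embedding(tokens, pos_emb, drop_prob, training, rng):
    """Add positional embeddings to tokens, or skip them with probability drop_prob.

    tokens, pos_emb: lists of equal-length vectors, one per patch.
    Dropping is applied only in training mode (assumed behaviour).
    """
    if training and rng.random() < drop_prob:
        return [list(t) for t in tokens]  # positional information dropped
    return [[a + b for a, b in zip(t, p)] for t, p in zip(tokens, pos_emb)]

rng = random.Random(0)
tokens = [[1.0, 2.0], [3.0, 4.0]]
pos_emb = [[0.1, 0.1], [0.2, 0.2]]

dropped = add_positional_embedding(tokens, pos_emb, 1.0, True, rng)
kept = add_positional_embedding(tokens, pos_emb, 0.0, True, rng)
at_eval = add_positional_embedding(tokens, pos_emb, 1.0, False, rng)
```

Training with the embeddings sometimes absent makes the backbone less dependent on the pretraining resolution, which is the scale-variation issue the summary mentions.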
Achieves state-of-the-art 33.9 APr on LVIS OVD benchmark, surpassing previous best by 7.6 points.
Demonstrates competitive novel AP on COCO without pseudo labels or weak supervision, representing the first ViT-based approach on this benchmark.
Exhibits strong zero-shot transfer capabilities, outperforming previous methods on Objects365, and surpassing state-of-the-art on 8 out of 12 image-text retrieval benchmarks. |
Potential overfitting on benchmarks with fewer training categories when using only vanilla detection losses.
Future exploration of alternative techniques to mitigate overfitting on specific benchmarks. |
open-vocabulary object detection, vision transformer, contrastive learning, masked image modeling, positional embedding dropout |
2309.00616
Report |
OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation |
Zhening Huang, Xiaoyang Wu, Xi Chen, Hengshuang Zhao, Lei Zhu, Joan Lasenby |
Current 3D open-vocabulary scene understanding methods mostly utilize
well-aligned 2D images as the bridge to learn 3D features with language.
However, applying these approaches becomes challenging in scenarios where 2D
images are absent. In this work, we introduce a new pipeline, namely,
OpenIns3D, which requires no 2D image inputs, for 3D open-vocabulary scene
understanding at the instance level. The OpenIns3D framework employs a
"Mask-Snap-Lookup" scheme. The "Mask" module learns class-agnostic mask
proposals in 3D point clouds. The "Snap" module generates synthetic scene-level
images at multiple scales and leverages 2D vision language models to extract
interesting objects. The "Lookup" module searches through the outcomes of
"Snap" with the help of Mask2Pixel maps, which contain the precise
correspondence between 3D masks and synthetic images, to assign category names
to the proposed masks. This 2D input-free and flexible approach achieves
state-of-the-art results on a wide range of indoor and outdoor datasets by a
large margin. Moreover, OpenIns3D allows for effortless switching of 2D
detectors without re-training. When integrated with powerful 2D open-world
models such as ODISE and GroundingDINO, excellent results were observed on
open-vocabulary instance segmentation. When integrated with LLM-powered 2D
models like LISA, it demonstrates a remarkable capacity to process highly
complex text queries which require intricate reasoning and world knowledge.
Project page: https://zheninghuang.github.io/OpenIns3D/ |
This paper presents OpenIns3D, a novel framework for 3D open-vocabulary instance understanding that operates solely on 3D point clouds, eliminating the need for aligned 2D images. |
Current 3D open-vocabulary scene understanding methods heavily rely on well-aligned 2D images, limiting their applicability in real-world scenarios where such images are often unavailable. |
OpenIns3D employs a "Mask-Snap-Lookup" scheme: 1) **Mask:** Learns class-agnostic mask proposals from point clouds. 2) **Snap:** Generates synthetic scene-level images and leverages 2D vision-language models to detect objects. 3) **Lookup:** Assigns category names to 3D masks by searching object detections in synthetic images using Mask2Pixel maps. |
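The Lookup step can be illustrated with a small matching routine: given the pixels that a 3D mask occupies in a synthetic image (one Mask2Pixel entry) and a list of 2D detections, assign the label of the best-overlapping detection. Matching by box IoU, the `min_iou` threshold, and all names here are assumptions; the summary does not specify the matching rule.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x0, y0, x1, y1) with exclusive max corners."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def lookup_label(mask_pixels, detections, min_iou=0.3):
    """Return the label of the detection that best overlaps the projected mask."""
    xs = [x for x, _ in mask_pixels]
    ys = [y for _, y in mask_pixels]
    mask_box = (min(xs), min(ys), max(xs) + 1, max(ys) + 1)
    best_label, best_iou = None, min_iou
    for label, box in detections:
        iou = box_iou(mask_box, box)
        if iou >= best_iou:
            best_label, best_iou = label, iou
    return best_label

# Pixels covered by one 3D mask in a rendered image (toy Mask2Pixel entry).
mask_pixels = {(x, y) for x in range(2, 6) for y in range(2, 6)}
detections = [("chair", (2, 2, 6, 7)), ("table", (5, 5, 10, 10))]
label = lookup_label(mask_pixels, detections)
```

Because the 2D detector only appears through the `detections` list, swapping detectors requires no retraining of the 3D mask module, which matches the flexibility claimed above.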
Achieves state-of-the-art results on indoor (S3DIS, ScanNetv2) and outdoor (STPLS3D) datasets for open-vocabulary instance segmentation and object detection.
Demonstrates robustness by not requiring retraining when switching between different 2D detectors.
Exhibits strong capability to comprehend complex language queries, including those requiring reasoning and world knowledge, when integrated with LLM-powered 2D models like LISA. |
Reliance on ground truth instance masks for training the Mask Proposal Module.
Limited performance in semantic segmentation due to prioritization of mask quality over completeness. |
3d open-vocabulary learning, instance segmentation, object detection, point cloud understanding, vision-language models |
2309.00615
Report |
Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following |
Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, Pheng-Ann Heng |
We introduce Point-Bind, a 3D multi-modality model aligning point clouds with
2D image, language, audio, and video. Guided by ImageBind, we construct a joint
embedding space between 3D and multi-modalities, enabling many promising
applications, e.g., any-to-3D generation, 3D embedding arithmetic, and 3D
open-world understanding. On top of this, we further present Point-LLM, the
first 3D large language model (LLM) following 3D multi-modal instructions. By
parameter-efficient fine-tuning techniques, Point-LLM injects the semantics of
Point-Bind into pre-trained LLMs, e.g., LLaMA, which requires no 3D instruction
data, but exhibits superior 3D and multi-modal question-answering capacity. We
hope our work may cast a light on the community for extending 3D point clouds
to multi-modality applications. Code is available at
https://github.com/ZiyuGuo99/Point-Bind_Point-LLM. |
This paper introduces Point-Bind, a 3D multi-modality model aligning point clouds with 2D images, language, audio, and video, and Point-LLM, the first 3D large language model that understands and responds to 3D and multi-modal instructions. |
Extending 3D point clouds to multi-modal applications is crucial for expanding the applications of 3D vision, enabling more robust and diverse 3D understanding, generation, and interaction. |
Point-Bind leverages ImageBind's joint embedding space and contrastive learning to align 3D point clouds with other modalities. Point-LLM is built upon Point-Bind and LLaMA, fine-tuned with vision-language data and parameter-efficient techniques. |
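The contrastive alignment mentioned above is commonly written as an InfoNCE loss between paired embeddings. The sketch below is a generic one-directional version in plain Python; the temperature value and function names are assumptions, and the target embeddings are assumed to be precomputed by the other modality's encoder.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def info_nce(point_embs, target_embs, temperature=0.07):
    """Mean cross-entropy for matching each point-cloud embedding to its pair.

    point_embs[i] and target_embs[i] are assumed to describe the same object.
    """
    p = [l2_normalize(v) for v in point_embs]
    t = [l2_normalize(v) for v in target_embs]
    loss = 0.0
    for i, pv in enumerate(p):
        logits = [sum(a * b for a, b in zip(pv, tv)) / temperature for tv in t]
        m = max(logits)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_sum - logits[i]
    return loss / len(p)

aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each point-cloud embedding is closest to its own pair and large when pairs are mismatched, which is what pulls the 3D encoder into the shared embedding space.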
Point-Bind enables several 3D multi-modal applications, such as any-to-3D generation and 3D embedding arithmetic, and achieves state-of-the-art performance in 3D zero-shot classification and cross-modal retrieval.
Point-LLM successfully demonstrates the ability to understand 3D point clouds and respond to 3D-related instructions in both English and Chinese.
Both Point-Bind and Point-LLM show strong data efficiency, requiring no 3D instruction data for training. |
Future work includes aligning multi-modality with more diverse 3D data like indoor and outdoor scenes.
Exploring more complex 3D instruction following tasks is another potential direction. |
3d vision, multi-modality learning, point cloud, large language model, instruction following |
2309.00613
Report |
Iterative Multi-granular Image Editing using Diffusion Models |
K J Joseph, Prateksha Udhayanan, Tripti Shukla, Aishwarya Agarwal, Srikrishna Karanam, Koustava Goswami, Balaji Vasan Srinivasan |
Recent advances in text-guided image synthesis have dramatically changed how
creative professionals generate artistic and aesthetically pleasing visual
assets. To fully support such creative endeavors, the process should possess
the ability to: 1) iteratively edit the generations and 2) control the spatial
reach of desired changes (global, local or anything in between). We formalize
this pragmatic problem setting as Iterative Multi-granular Editing. While there
has been substantial progress with diffusion-based models for image synthesis
and editing, they are all one shot (i.e., no iterative editing capabilities)
and do not naturally yield multi-granular control (i.e., covering the full
spectrum of local-to-global edits). To overcome these drawbacks, we propose
EMILIE: Iterative Multi-granular Image Editor. EMILIE introduces a novel latent
iteration strategy, which re-purposes a pre-trained diffusion model to
facilitate iterative editing. This is complemented by a gradient control
operation for multi-granular control. We introduce a new benchmark dataset to
evaluate our newly proposed setting. We conduct exhaustive quantitative and
qualitative evaluation against recent state-of-the-art approaches adapted to
our task, to bring out the mettle of EMILIE. We hope our work would attract
attention to this newly identified, pragmatic problem setting. |
This paper introduces EMILIE, a novel diffusion model based framework for iterative and multi-granular image editing. |
Existing diffusion-based image editing methods are one-shot and do not allow user control over the spatial extent of the edits, while creators often require iterative editing and multi-granular control (local to global edits). |
EMILIE leverages a novel latent iteration strategy that re-purposes a pre-trained diffusion model to facilitate iterative editing, and employs gradient control to enable multi-granular control. |
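Gradient control by masking can be pictured as an element-wise gate on the update applied to the latent: where the mask is 0 the latent is left untouched, where it is 1 the full update applies, and values in between give intermediate granularity. The sketch below shows only that gating; the update rule, step size, and names are assumptions rather than the paper's exact formulation.

```python
def masked_latent_update(latent, grad, mask, step_size=0.1):
    """Apply a gradient step to a 2D latent only where the mask allows it.

    latent, grad, mask: lists of rows with identical shapes.
    mask values lie in [0, 1]; 0 freezes a location, 1 applies the full step.
    """
    return [
        [z - step_size * g * m for z, g, m in zip(z_row, g_row, m_row)]
        for z_row, g_row, m_row in zip(latent, grad, mask)
    ]

latent = [[1.0, 1.0], [1.0, 1.0]]
grad = [[1.0, 1.0], [2.0, 2.0]]
mask = [[1.0, 0.0], [0.5, 0.0]]  # edit the left column only
updated = masked_latent_update(latent, grad, mask, step_size=0.5)
```

An all-ones mask recovers a global edit, while a tight mask confines the change to a local region.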
EMILIE effectively reduces artifact accumulation during iterative edits by operating in the latent space.
Gradient control through masking allows for precise localization of edits.
Quantitative and qualitative evaluations on the proposed IMIEBench and EditBench datasets demonstrate EMILIE's superiority over existing methods. |
EMILIE struggles with negative edit instructions (e.g., undoing previous edits).
Future work includes exploring disentanglement of feature representations to address limitations with negative edits and improve consistency. |
image editing, diffusion models, iterative editing, multi-granular control, latent space |
2309.00610
Report |
CityDreamer: Compositional Generative Model of Unbounded 3D Cities |
Haozhe Xie, Zhaoxi Chen, Fangzhou Hong, Ziwei Liu |
3D city generation is a desirable yet challenging task, since humans are more
sensitive to structural distortions in urban environments. Additionally,
generating 3D cities is more complex than 3D natural scenes since buildings, as
objects of the same class, exhibit a wider range of appearances compared to the
relatively consistent appearance of objects like trees in natural scenes. To
address these challenges, we propose CityDreamer, a compositional
generative model designed specifically for unbounded 3D cities. Our key insight
is that 3D city generation should be a composition of different types of neural
fields: 1) various building instances, and 2) background stuff, such as roads
and green lands. Specifically, we adopt the bird's eye view scene
representation and employ a volumetric renderer for both instance-oriented and
stuff-oriented neural fields. The generative hash grid and periodic positional
embedding are tailored as scene parameterization to suit the distinct
characteristics of building instances and background stuff. Furthermore, we
contribute a suite of CityGen Datasets, including OSM and GoogleEarth, which
comprises a vast amount of real-world city imagery to enhance the realism of
the generated 3D cities both in their layouts and appearances. CityDreamer
achieves state-of-the-art performance not only in generating realistic 3D
cities but also in localized editing within the generated cities. |
Proposes CityDreamer, a compositional generative model for creating unbounded 3D cities, which separates the generation of building instances and background stuff (roads, green lands, etc.) to handle the diversity of building appearances. |
Addresses the challenge of 3D city generation, which is more complex than generating natural scenes due to the wide range of appearances exhibited by buildings. This has applications in urban planning, environmental simulations, and game development. |
Employs a bird's eye view (BEV) scene representation and volumetric rendering for both building instances and background stuff. Utilizes a generative hash grid for background and periodic positional encoding for buildings. Leverages CityGen Datasets, including OSM and GoogleEarth, for realistic city layouts and appearances. |
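A periodic positional encoding maps each coordinate through sine and cosine functions, so locations that differ by a full period receive the same code. The sketch below uses a NeRF-style encoding with geometrically increasing frequencies; the exact functional form and the number of frequencies used in CityDreamer are not given in the summary, so both are assumptions.

```python
import math

def periodic_encoding(coord, num_freqs=4):
    """Encode each coordinate with sin/cos at frequencies 2^k * pi."""
    out = []
    for c in coord:
        for k in range(num_freqs):
            freq = (2 ** k) * math.pi
            out.append(math.sin(freq * c))
            out.append(math.cos(freq * c))
    return out

code = periodic_encoding((0.25, 0.5, 0.0), num_freqs=4)
origin = periodic_encoding((0.0,), num_freqs=2)
shifted_a = periodic_encoding((0.3,), num_freqs=3)
shifted_b = periodic_encoding((2.3,), num_freqs=3)  # one full period later
```

Here `shifted_a` and `shifted_b` agree up to floating-point error, which illustrates why such an encoding suits repeating structures such as building facades.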
Achieves state-of-the-art performance in generating realistic 3D cities, as evidenced by FID, KID, depth error, and camera error metrics.
Outperforms baselines in user studies assessing perceptual quality, 3D realism, and view consistency.
Enables localized editing of building instances within the generated cities, including style and height modifications. |
Limited to modeling convex geometries due to the voxel-based representation.
Individual building generation during inference leads to higher computational cost. |
3d city generation, generative adversarial networks, neural radiance fields, unbounded scene generation, compositional modeling |
2309.00398
Report |
VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation |
Xin Li, Wenqing Chu, Ye Wu, Weihang Yuan, Fanglong Liu, Qi Zhang, Fu Li, Haocheng Feng, Errui Ding, Jingdong Wang |
In this paper, we present VideoGen, a text-to-video generation approach,
which can generate a high-definition video with high frame fidelity and strong
temporal consistency using reference-guided latent diffusion. We leverage an
off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to
generate an image with high content quality from the text prompt, as a
reference image to guide video generation. Then, we introduce an efficient
cascaded latent diffusion module conditioned on both the reference image and
the text prompt, for generating latent video representations, followed by a
flow-based temporal upsampling step to improve the temporal resolution.
Finally, we map latent video representations into a high-definition video
through an enhanced video decoder. During training, we use the first frame of a
ground-truth video as the reference image for training the cascaded latent
diffusion module. The main characteristics of our approach include: the reference
image generated by the text-to-image model improves the visual fidelity; using
it as the condition makes the diffusion model focus more on learning the video
dynamics; and the video decoder is trained over unlabeled video data, thus
benefiting from high-quality easily-available videos. VideoGen sets a new
state-of-the-art in text-to-video generation in terms of both qualitative and
quantitative evaluation. See \url{https://videogen.github.io/VideoGen/} for
more samples. |
VideoGen, a novel text-to-video generation approach that leverages a reference-guided latent diffusion model to produce high-definition videos with strong temporal consistency and fidelity. |
Text-to-video generation is a challenging task, requiring high visual quality, temporally consistent motion, and handling limited video-text pair datasets. Existing methods often struggle to balance these aspects. |
VideoGen utilizes a pre-trained text-to-image model to generate a reference image from the text prompt. This image guides a cascaded latent video diffusion model, conditioned on the text and reference, to generate latent video representations. Flow-based temporal upsampling enhances resolution, and a video decoder trained on unlabeled video data maps latent representations to the final video. |
Achieves state-of-the-art results on UCF-101 and MSR-VTT benchmarks, demonstrating superior video quality and text-video alignment.
Significantly improves Inception Score (IS) compared to previous methods, indicating high quality and diversity in generated videos.
User studies confirm that VideoGen generates videos with better visual quality and text alignment compared to Make-A-Video and Imagen Video. |
While achieving competitive results, further exploration is needed to improve Frechet Video Distance (FVD) for even better distribution alignment with real videos.
Fine-tuning the text-to-image model specifically for video generation could further enhance content fidelity to the target domain. |
text-to-video generation, latent diffusion model, reference-guided synthesis, temporal consistency, high-definition video |
2309.00339
Report |
Robust Point Cloud Processing through Positional Embedding |
Jianqiao Zheng, Xueqian Li, Sameera Ramasinghe, Simon Lucey |
End-to-end trained per-point embeddings are an essential ingredient of any
state-of-the-art 3D point cloud processing such as detection or alignment.
Methods like PointNet, or the more recent point cloud transformer -- and its
variants -- all employ learned per-point embeddings. Despite impressive
performance, such approaches are sensitive to out-of-distribution (OOD) noise
and outliers. In this paper, we explore the role of an analytical per-point
embedding based on the criterion of bandwidth. The concept of bandwidth enables
us to draw connections with an alternate per-point embedding -- positional
embedding, particularly random Fourier features. We present compelling robust
results across downstream tasks such as point cloud classification and
registration with several categories of OOD noise. |
This paper investigates the use of untrained, analytical positional embeddings (PE) as a more robust alternative to learned per-point embeddings (PPE) in 3D point cloud processing tasks. |
Learned PPEs in popular architectures like PointNet and PCT are sensitive to out-of-distribution (OOD) noise and outliers, leading to significant performance degradation in practical applications. |
The authors theoretically connect the concept of bandwidth and spatial locality to the variance of weights in PE, specifically random Fourier features (RFF). They empirically evaluate the robustness of PE-based PointNet and PCT variants on classification and registration tasks using ModelNet40 and ModelNet40-C datasets with various OOD corruptions. |
PE-based embeddings show comparable performance to learned PPEs on clean data.
PE-based methods significantly outperform learned PPEs on various OOD corruptions, including noise and outliers.
The bandwidth of PE can be easily tuned to control its robustness to different levels of noise. |
PE-based methods do not show clear advantages over learned PPEs for corruptions like density changes and transformations.
Future work includes investigating more specialized PE functions and data normalization techniques to address these limitations. |
point cloud processing, positional embedding, out-of-distribution robustness, pointnet, point cloud transformer |
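A minimal sketch of the kind of untrained positional embedding the record above describes, using random Fourier features. It assumes frequencies drawn from a zero-mean Gaussian whose standard deviation `sigma` plays the role of the bandwidth; the function name `make_rff_embedding`, the `2*pi` phase factor, and the `1/sqrt(num_features)` scaling are illustrative choices, not details taken from the paper.

```python
import math
import random


def make_rff_embedding(in_dim, num_features, sigma, seed=0):
    """Return a fixed (untrained) random Fourier feature embedding function.

    sigma is the standard deviation of the random frequencies; a larger
    sigma gives a higher-bandwidth (more spatially local) embedding.
    All names and scaling conventions here are assumptions for illustration.
    """
    rng = random.Random(seed)
    # Frequencies are sampled once and then frozen: nothing is learned.
    freqs = [[rng.gauss(0.0, sigma) for _ in range(in_dim)]
             for _ in range(num_features)]
    scale = 1.0 / math.sqrt(num_features)

    def embed(point):
        out = []
        for w in freqs:
            phase = 2.0 * math.pi * sum(wi * xi for wi, xi in zip(w, point))
            out.append(scale * math.cos(phase))
            out.append(scale * math.sin(phase))
        return out

    return embed


embed = make_rff_embedding(in_dim=3, num_features=64, sigma=1.0)
e = embed([0.1, -0.2, 0.3])
```

Because each frequency contributes one cosine and one sine term, the embedding has `2 * num_features` entries and, under this scaling, unit Euclidean norm for every input point.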
2309.00107
Report |
Unsupervised evaluation of GAN sample quality: Introducing the TTJac Score |
Egor Sevriugov, Ivan Oseledets |
Evaluation metrics are essential for assessing the performance of generative
models in image synthesis. However, existing metrics often involve high memory
and time consumption as they compute the distance between generated samples and
real data points. In our study, a new evaluation metric called the "TTJac
score" is proposed to measure the fidelity of individual synthesized images in
a data-free manner. The study first establishes a theoretical approach to
directly evaluate the generated sample density. Then, a method incorporating
feature extractors and discrete function approximation through tensor train is
introduced to effectively assess the quality of generated samples. Furthermore,
the study demonstrates that this new metric can be used to improve the
fidelity-variability trade-off when applying the truncation trick. The
experimental results of applying the proposed metric to StyleGAN 2 and StyleGAN
2 ADA models on FFHQ, AFHQ-Wild, LSUN-Cars, and LSUN-Horse datasets are
presented. The code used in this research will be made publicly available
online for the research community to access and utilize. |
This paper introduces 'TTJac score', a novel data-free metric for evaluating the fidelity of individual synthesized images generated by GANs. |
Existing image synthesis evaluation metrics suffer from high memory and time consumption due to their reliance on comparing generated samples with real data points. |
The TTJac score leverages the generator's Jacobian to directly calculate sample density. It uses feature extractors (VGG19) to reduce Jacobian size and tensor train decomposition for efficient approximation and inference time reduction. |
TTJac score effectively identifies low-quality images with artifacts, performing comparably to data-dependent metrics such as the Realism score.
It enables a better fidelity-variability trade-off when used with the truncation trick, particularly on the LSUN-Cars dataset.
Qualitative evaluation shows TTJac score's ability to detect visual artifacts and unrealistic elements in various domains (FFHQ, AFHQ-Wild, LSUN-Cars, LSUN-Horse). |
Error in metric approximation limits achieving maximum precision with minimal recall.
Challenges arise in domains like AFHQ-Wild where GAN models already exhibit high precision. |
generative adversarial networks, image synthesis, evaluation metrics, tensor train decomposition, feature density |
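For readers unfamiliar with the truncation trick mentioned in the record above, here is a minimal sketch of its standard StyleGAN form: a latent code is pulled toward the average latent by a factor `psi`. How the TTJac score is combined with truncation is not specified in this record, so the example shows only the basic interpolation; the function name `truncate` and the toy vectors are illustrative.

```python
def truncate(w, w_avg, psi):
    """Interpolate a latent code w toward the mean latent w_avg.

    psi = 1.0 leaves w unchanged (full variability);
    psi = 0.0 collapses every sample onto w_avg (maximum fidelity, no variety).
    """
    return [a + psi * (x - a) for x, a in zip(w, w_avg)]


w = [2.0, -1.0, 0.5]       # toy latent code (hypothetical values)
w_avg = [0.0, 0.0, 0.5]    # toy mean latent (hypothetical values)
half = truncate(w, w_avg, psi=0.5)
```

Lower `psi` values trade diversity for fidelity, which is the trade-off a per-sample quality score could in principle help tune.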
2309.00096
Report |
AttrSeg: Open-Vocabulary Semantic Segmentation via Attribute Decomposition-Aggregation |
Chaofan Ma, Yuhuan Yang, Chen Ju, Fei Zhang, Ya Zhang, Yanfeng Wang |
Open-vocabulary semantic segmentation is a challenging task that requires
segmenting novel object categories at inference time. Recent studies have
explored vision-language pre-training to handle this task, but suffer from
unrealistic assumptions in practical scenarios, i.e., low-quality textual
category names. For example, this paradigm assumes that new textual categories
will be accurately and completely provided, and exist in lexicons during
pre-training. However, exceptions often happen when encountering ambiguity for
brief or incomplete names, new words that are not present in the pre-trained
lexicons, and difficult-to-describe categories for users. To address these
issues, this work proposes a novel attribute decomposition-aggregation
framework, AttrSeg, inspired by human cognition in understanding new concepts.
Specifically, in the decomposition stage, we decouple class names into diverse
attribute descriptions to complement semantic contexts from multiple
perspectives. Two attribute construction strategies are designed: using large
language models for common categories, and involving manual labeling for
human-invented categories. In the aggregation stage, we group diverse
attributes into an integrated global description, to form a discriminative
classifier that distinguishes the target object from others. One hierarchical
aggregation architecture is further proposed to achieve multi-level
aggregations, leveraging the meticulously designed clustering module. The final
results are obtained by computing the similarity between aggregated attributes
and image embeddings. To evaluate the effectiveness, we annotate three types
of datasets with attribute descriptions, and conduct extensive experiments and
ablation studies. The results show the superior performance of attribute
decomposition-aggregation. |
This paper proposes AttrSeg, an attribute decomposition-aggregation framework for open-vocabulary semantic segmentation to address the limitations of relying solely on potentially ambiguous or unfamiliar category names. |
Existing open-vocabulary segmentation methods struggle with ambiguous category names, new words (neologisms), and difficult-to-describe categories, limiting their real-world practicality. |
The framework decomposes class names into detailed attribute descriptions, generated by LLMs or manual annotation. These attributes are then hierarchically aggregated into a global representation, enabling segmentation based on similarity with image embeddings. |
AttrSeg outperforms state-of-the-art methods on PASCAL-5$^i$, COCO-20$^i$, PASCAL VOC, and PASCAL Context, even when using only attribute descriptions.
The method demonstrates strong performance on the newly introduced 'Fantastic Beasts' dataset, specifically designed to test neologisms and unnameable categories.
Ablation studies validate the effectiveness of the hierarchical aggregation strategy, the importance of each component, and the model's robustness to noisy attribute inputs. |
The reliance on CLIP's training data may introduce biases into the model's predictions.
Attribute decomposition using LLMs can also potentially introduce biases if not carefully controlled. |
open-vocabulary semantic segmentation, attribute decomposition-aggregation, vision-language pre-training, hierarchical aggregation, fantastic beasts dataset |
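The final step in the record above, labeling by similarity between aggregated attributes and image embeddings, can be sketched as follows. This is an assumption-laden illustration: plain mean pooling stands in for the paper's learned hierarchical aggregation, and the function names and toy 3-dimensional embeddings are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def aggregate(attr_embeddings):
    """Mean-pool attribute embeddings into one class vector.

    Stand-in for the learned hierarchical aggregation described above.
    """
    dim = len(attr_embeddings[0])
    count = len(attr_embeddings)
    mean = [sum(e[i] for e in attr_embeddings) / count for i in range(dim)]
    return normalize(mean)


def classify_pixels(pixel_embeddings, class_attr_embeddings):
    """Assign each pixel the class whose aggregated vector is most similar."""
    class_names = list(class_attr_embeddings)
    class_vecs = [aggregate(class_attr_embeddings[c]) for c in class_names]
    labels = []
    for p in pixel_embeddings:
        p = normalize(p)
        # Cosine similarity reduces to a dot product on unit vectors.
        sims = [sum(a * b for a, b in zip(p, c)) for c in class_vecs]
        labels.append(class_names[sims.index(max(sims))])
    return labels


# Toy attribute embeddings per class (hypothetical values).
attrs = {
    "cat": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "sky": [[0.0, 0.0, 1.0], [0.0, 0.1, 0.9]],
}
labels = classify_pixels([[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]], attrs)
```

Note that no class name embedding is used here: each class is represented only by its attribute descriptions, mirroring the attribute-only setting the record mentions.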
2309.00035
Report |
FACET: Fairness in Computer Vision Evaluation Benchmark |
Laura Gustafson, Chloe Rolland, Nikhila Ravi, Quentin Duval, Aaron Adcock, Cheng-Yang Fu, Melissa Hall, Candace Ross |
Computer vision models have known performance disparities across attributes
such as gender and skin tone. This means during tasks such as classification
and detection, model performance differs for certain classes based on the
demographics of the people in the image. These disparities have been shown to
exist, but until now there has not been a unified approach to measure these
differences for common use-cases of computer vision models. We present a new
benchmark named FACET (FAirness in Computer Vision EvaluaTion), a large,
publicly available evaluation set of 32k images for some of the most common
vision tasks - image classification, object detection and segmentation. For
every image in FACET, we hired expert reviewers to manually annotate
person-related attributes such as perceived skin tone and hair type, manually
draw bounding boxes and label fine-grained person-related classes such as disk
jockey or guitarist. In addition, we use FACET to benchmark state-of-the-art
vision models and present a deeper understanding of potential performance
disparities and challenges across sensitive demographic attributes. With the
exhaustive annotations collected, we probe models using single demographics
attributes as well as multiple attributes using an intersectional approach
(e.g. hair color and perceived skin tone). Our results show that
classification, detection, segmentation, and visual grounding models exhibit
performance disparities across demographic attributes and intersections of
attributes. These harms suggest that not all people represented in datasets
receive fair and equitable treatment in these vision tasks. We hope current and
future results using our benchmark will contribute to fairer, more robust
vision models. FACET is available publicly at https://facet.metademolab.com/ |
FACET is a large-scale, publicly available fairness benchmark for evaluating bias in computer vision models across a variety of tasks. |
Existing fairness datasets lack exhaustive demographic annotations and often only support a limited number of vision tasks, making it difficult to thoroughly analyze model fairness. |
FACET comprises 32,000 images with annotations for 52 person-related classes and 17 attributes, including demographic attributes (gender, skin tone, age), physical presentation (hair type, accessories), and robustness factors (lighting, occlusion). Expert annotators from diverse geographical regions manually labeled the dataset. |
CLIP image classification model shows disparities in performance based on perceived gender, reflecting societal biases.
Faster R-CNN object detection model demonstrates lower accuracy in detecting people with darker skin tones, especially in precise localization, and this issue is exacerbated for certain hair types.
Mask R-CNN exhibits similar performance disparities across gender for both person detection and segmentation tasks, though disparities are slightly larger for detection. |
The use of perceived attributes instead of self-identified ones introduces potential for annotator bias.
The discrete nature of labels for gender and age might lead to the erasure of certain identities. |
fairness, computer vision, benchmark, bias, demographic attributes |
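One simple way to quantify the kind of performance disparity described in the record above is to compare recall between two demographic groups for the same class. The sketch below assumes that formulation; the record layout (`group`, `correct` fields), function names, and toy data are hypothetical and do not reflect FACET's actual annotation format.

```python
def recall(records, group):
    """Fraction of ground-truth instances in `group` that the model got right."""
    hits = [r["correct"] for r in records if r["group"] == group]
    return sum(hits) / len(hits)


def disparity(records, group_a, group_b):
    """Recall difference between two groups; 0 means equal performance."""
    return recall(records, group_a) - recall(records, group_b)


# Toy per-person outcomes for a single class (hypothetical data).
records = [
    {"group": "A", "correct": 1},
    {"group": "A", "correct": 1},
    {"group": "A", "correct": 1},
    {"group": "A", "correct": 0},
    {"group": "B", "correct": 1},
    {"group": "B", "correct": 0},
    {"group": "B", "correct": 0},
    {"group": "B", "correct": 0},
]
gap = disparity(records, "A", "B")
```

A positive value means the model performs better on group A than on group B for that class; intersectional analysis would apply the same comparison to groups defined by combinations of attributes.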
2308.16911
Report |
PointLLM: Empowering Large Language Models to Understand Point Clouds |
Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, Dahua Lin |
The unprecedented advancements in Large Language Models (LLMs) have shown a
profound impact on natural language processing but are yet to fully embrace the
realm of 3D understanding. This paper introduces PointLLM, a preliminary effort
to fill this gap, enabling LLMs to understand point clouds and offering a new
avenue beyond 2D visual data. PointLLM understands colored object point clouds
with human instructions and generates contextually appropriate responses,
illustrating its grasp of point clouds and common sense. Specifically, it
leverages a point cloud encoder with a powerful LLM to effectively fuse
geometric, appearance, and linguistic information. We collect a novel dataset
comprising 660K simple and 70K complex point-text instruction pairs to enable a
two-stage training strategy: aligning latent spaces and subsequently
instruction-tuning the unified model. To rigorously evaluate the perceptual and
generalization capabilities of PointLLM, we establish two benchmarks:
Generative 3D Object Classification and 3D Object Captioning, assessed through
three different methods, including human evaluation, GPT-4/ChatGPT evaluation,
and traditional metrics. Experimental results reveal PointLLM's superior
performance over existing 2D and 3D baselines, with a notable achievement in
human-evaluated object captioning tasks where it surpasses human annotators in
over 50% of the samples. Codes, datasets, and benchmarks are available at
https://github.com/OpenRobotLab/PointLLM . |
Introduces PointLLM, a multi-modal large language model capable of understanding colored point clouds of objects, addressing the lack of LLM integration with 3D data. |
Enables LLMs to move beyond 2D visual data and understand 3D structures, paving the way for applications like interactive 3D content creation and robot manipulation through natural language. |
Leverages a point cloud encoder and a pre-trained LLM, trained in two stages: aligning latent spaces using a novel point-text instruction dataset, and instruction-tuning the unified model for complex instruction understanding. |
Outperforms 2D and 3D baselines in generative 3D object classification on ModelNet40 and Objaverse datasets.
Achieves superior performance in 3D object captioning, surpassing human annotators in over 50% of cases in human evaluation.
Demonstrates accurate understanding of object details, including those often obscured by occlusion in 2D images. |
Further improvement in reducing hallucination rates to match human-level precision.
Exploration of more efficient, point cloud-specific fusion mechanisms for MLLMs. |
large language models, point cloud understanding, 3d object recognition, multi-modal learning, generative ai |
2308.16909
Report |
StyleInV: A Temporal Style Modulated Inversion Network for Unconditional Video Generation |
Yuhan Wang, Liming Jiang, Chen Change Loy |
Unconditional video generation is a challenging task that involves
synthesizing high-quality videos that are both coherent and of extended
duration. To address this challenge, researchers have used pretrained StyleGAN
image generators for high-quality frame synthesis and focused on motion
generator design. The motion generator is trained in an autoregressive manner
using heavy 3D convolutional discriminators to ensure motion coherence during
video generation. In this paper, we introduce a novel motion generator design
that uses a learning-based inversion network for GAN. The encoder in our method
captures rich and smooth priors from encoding images to latents, and given the
latent of an initially generated frame as guidance, our method can generate
smooth future latent by modulating the inversion encoder temporally. Our method
enjoys the advantage of sparse training and naturally constrains the generation
space of our motion generator with the inversion network guided by the initial
frame, eliminating the need for heavy discriminators. Moreover, our method
supports style transfer with simple fine-tuning when the encoder is paired with
a pretrained StyleGAN generator. Extensive experiments conducted on various
benchmarks demonstrate the superiority of our method in generating long and
high-resolution videos with decent single-frame quality and temporal
consistency. |
This paper introduces StyleInV, a novel motion generator for unconditional video generation that leverages a learning-based inversion network for GANs to generate temporally coherent motion latent codes. |
Unconditional video generation is challenging, especially in synthesizing high-resolution videos with long durations and coherent motion. Existing methods often struggle with motion collapse or require computationally heavy discriminators. |
StyleInV utilizes a GAN inversion network modulated by temporal style codes. It employs a first-frame-aware acyclic positional encoding (FFA-APE) and first-frame-aware sparse training (FFA-ST) to ensure smooth motion and accurate initial frame reconstruction. |
StyleInV generates high-quality, long-duration videos with superior identity preservation on human-face datasets compared to baselines like MoCoGAN-HD, DIGAN, StyleGAN-V, and Long-Video-GAN.
The method supports fine-tuning-based style transfer by leveraging the pretrained StyleGAN generator, allowing for easy adaptation to different artistic styles.
StyleInV enables initial-frame conditioned generation, allowing users to generate videos with specific starting content. |
The model's performance on datasets with global motions (e.g., SkyTimelapse) needs improvement.
The two-stage training process, although efficient for hyperparameter tuning, is more computationally expensive than single-stage methods like StyleGAN-V. |
video generation, gan inversion, motion generation, style transfer, temporal consistency |
2308.16880
Report |
Text2Scene: Text-driven Indoor Scene Stylization with Part-aware Details |
Inwoo Hwang, Hyeonwoo Kim, Young Min Kim |
We propose Text2Scene, a method to automatically create realistic textures
for virtual scenes composed of multiple objects. Guided by a reference image
and text descriptions, our pipeline adds detailed texture on labeled 3D
geometries in the room such that the generated colors respect the hierarchical
structure or semantic parts that are often composed of similar materials.
Instead of applying flat stylization on the entire scene at a single step, we
obtain weak semantic cues from geometric segmentation, which are further
clarified by assigning initial colors to segmented parts. Then we add texture
details for individual objects such that their projections on image space
exhibit feature embedding aligned with the embedding of the input. The
decomposition makes the entire pipeline tractable to a moderate amount of
computation resources and memory. As our framework utilizes the existing
resources of image and text embedding, it does not require dedicated datasets
with high-quality textures designed by skillful artists. To the best of our
knowledge, it is the first practical and scalable approach that can create
detailed and realistic textures of the desired style that maintain structural
context for scenes with multiple objects. |
Text2Scene: a novel method for automatically generating realistic textures for virtual 3D scenes composed of multiple objects, guided by a reference image and text descriptions. |
Creating realistic virtual scenes is crucial for various applications but current methods are either manual and not scalable or lack detail and realism. Text2Scene addresses this by automating texture creation while respecting object part boundaries and style consistency. |
Text2Scene uses a coarse-to-fine strategy: 1) Retrieves texture for walls, ceilings, and floors from a material library. 2) Decomposes objects into parts based on geometric features and texture similarity. 3) Assigns base colors to parts, optimizing for global scene harmony. 4) Adds detailed texture to individual objects using local neural style fields, respecting part boundaries and guided by text descriptions. |
Generates realistic textures with clear part boundaries, outperforming baselines in user studies.
Successfully discovers part segments for diverse objects without explicit part labels.
Enables efficient scene stylization, allowing object manipulation and diverse outputs from various text prompts and target images. |
Current pipeline handles objects separately after base color assignment, potentially limiting holistic scene understanding.
Requires class labels or text descriptions per object, which could be further automated. |
texture synthesis, 3d scene stylization, text-to-3d, part discovery, neural style fields |
2308.16825
Report |
Coarse-to-Fine Amodal Segmentation with Shape Prior |
Jianxiong Gao, Xuelin Qian, Yikai Wang, Tianjun Xiao, Tong He, Zheng Zhang, Yanwei Fu |
Amodal object segmentation is a challenging task that involves segmenting
both visible and occluded parts of an object. In this paper, we propose a novel
approach, called Coarse-to-Fine Segmentation (C2F-Seg), that addresses this
problem by progressively modeling the amodal segmentation. C2F-Seg initially
reduces the learning space from the pixel-level image space to the
vector-quantized latent space. This enables us to better handle long-range
dependencies and learn a coarse-grained amodal segment from visual features and
visible segments. However, this latent space lacks detailed information about
the object, which makes it difficult to provide a precise segmentation
directly. To address this issue, we propose a convolution refine module to
inject fine-grained information and provide a more precise amodal object
segmentation based on visual features and coarse-predicted segmentation. To
help the studies of amodal object segmentation, we create a synthetic amodal
dataset, named as MOViD-Amodal (MOViD-A), which can be used for both image and
video amodal object segmentation. We extensively evaluate our model on two
benchmark datasets: KINS and COCO-A. Our empirical results demonstrate the
superiority of C2F-Seg. Moreover, we exhibit the potential of our approach for
video amodal object segmentation tasks on FISHBOWL and our proposed MOViD-A.
Project page at: http://jianxgao.github.io/C2F-Seg. |
Proposes C2F-Seg, a coarse-to-fine framework for amodal segmentation that leverages shape priors learned in a latent space via transformers and refines them with a convolutional module. |
Amodal segmentation is challenging due to the ill-posed nature of predicting occluded regions. Shape priors can help but need to be refined with visual details. |
C2F-Seg first generates a coarse amodal mask from a latent representation of the visible mask and image features using a mask-and-predict transformer. This mask is then refined with a convolutional module guided by an attention mechanism derived from the coarse mask and image features. |
C2F-Seg achieves state-of-the-art performance on KINS and COCO-A image amodal segmentation benchmarks.
It also excels in video amodal segmentation, surpassing baselines on FISHBOWL and the newly proposed MOViD-A dataset.
Ablation studies demonstrate the effectiveness of the convolutional refinement, attention mechanism, and iterative inference process. |
Reliance on pre-detected visible masks limits efficiency in multi-object scenes; future work aims to explore single-point input or end-to-end detection integration.
Handling heavily occluded objects remains challenging; future efforts will focus on leveraging spatio-temporal priors and enforcing inter-frame consistency. |
amodal segmentation, shape prior, transformers, convolutional refinement, video amodal segmentation |
2308.16758
Report |
Towards High-Fidelity Text-Guided 3D Face Generation and Manipulation Using only Images |
Cuican Yu, Guansong Lu, Yihan Zeng, Jian Sun, Xiaodan Liang, Huibin Li, Zongben Xu, Songcen Xu, Wei Zhang, Hang Xu |
Generating 3D faces from textual descriptions has a multitude of
applications, such as gaming, movies, and robotics. Recent progress has
demonstrated the success of unconditional 3D face generation and text-to-3D
shape generation. However, due to the limited text-3D face data pairs,
text-driven 3D face generation remains an open problem. In this paper, we
propose a text-guided 3D face generation method, referred to as TG-3DFace, for
generating realistic 3D faces using text guidance. Specifically, we adopt an
unconditional 3D face generation framework and equip it with text conditions,
which learns the text-guided 3D face generation with only text-2D face data. On
top of that, we propose two text-to-face cross-modal alignment techniques,
including the global contrastive learning and the fine-grained alignment
module, to facilitate high semantic consistency between generated 3D faces and
input texts. Besides, we present directional classifier guidance during the
inference process, which encourages creativity for out-of-domain generations.
Compared to the existing methods, TG-3DFace creates more realistic and
aesthetically pleasing 3D faces, boosting 9% multi-view consistency (MVIC) over
Latent3D. The rendered face images generated by TG-3DFace achieve better FID
and CLIP scores than text-to-2D face/image generation models, demonstrating our
superiority in generating realistic and semantic-consistent textures. |
This paper introduces TG-3DFace, a novel framework for generating high-fidelity 3D faces from textual descriptions using only text-2D face image pairs. |
Text-guided 3D face generation is highly sought after in fields like gaming and film, but is challenging due to limited text-3D face data and the need for semantic alignment between generated faces and text. |
TG-3DFace employs a text-conditional 3D GAN trained on text-2D face images and leverages two key techniques: global text-to-face contrastive learning for semantic consistency and fine-grained text-to-face alignment for detailed attribute control. Additionally, directional classifier guidance is used during inference to enable out-of-domain generation. |
TG-3DFace generates more realistic and aesthetically pleasing 3D faces, outperforming baseline Latent3D in multi-view consistency by 9%.
Rendered face images from TG-3DFace achieve better FID and CLIP scores compared to existing text-to-2D face generation models, demonstrating superior realism and semantic consistency.
TG-3DFace has applications in single-view 3D face reconstruction and text-guided 3D face manipulation. |
Current limitations include the inability to infer identity from text, occasional asymmetry in generated faces, and limited racial diversity.
Future work aims to address these limitations, improve shape quality, and enhance racial representation in the generated faces. |
3d face generation, text-to-3d, text-guided synthesis, generative adversarial networks, cross-modal alignment |
2308.16689
Report |
ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation |
Weihan Wang, Zhen Yang, Bin Xu, Juanzi Li, Yankui Sun |
Vision-language pre-training (VLP) methods have blossomed recently, and their
crucial goal is to jointly learn visual and textual features via a
transformer-based architecture, demonstrating promising improvements on a
variety of vision-language tasks. Prior work usually focuses on how to align
visual and textual features, but strategies for improving the robustness of
the model and speeding up model convergence are left insufficiently explored.
In this paper, we propose a novel method, ViLTA, comprising two components
that further help the model learn fine-grained representations among
image-text pairs. For Masked Language Modeling (MLM), we propose a
cross-distillation method to generate soft labels to enhance the robustness of
the model, which alleviates the problem of treating synonyms of masked words as
negative samples in one-hot labels. For Image-Text Matching (ITM), we leverage
the current language encoder to synthesize hard negatives based on the context
of language input, encouraging the model to learn high-quality representations
by increasing the difficulty of the ITM task. By leveraging the above
techniques, our ViLTA can achieve better performance on various vision-language
tasks. Extensive experiments on benchmark datasets demonstrate the
effectiveness of ViLTA and its promising potential for vision-language
pre-training. |
This paper introduces ViLTA, a novel vision-language pre-training method that enhances model representation ability through textual augmentation. |
Existing VLP models suffer from limitations in MLM robustness due to one-hot labels and sub-optimal negative sample selection in ITM. |
ViLTA employs two key components: (1) cross-distillation for MLM using a frozen language encoder to generate soft labels and (2) synthetic hard negative generation for ITM based on the current language encoder. |
ViLTA achieves state-of-the-art performance on various vision-language tasks, including VQA, visual reasoning, and image captioning.
Cross-distillation in MLM improves model robustness and learning efficiency.
Synthetic hard negatives for ITM enhance model convergence and downstream performance. |
The study mainly focuses on MLM and ITM, potentially limiting performance gains in retrieval tasks compared to models incorporating ITC.
Future work can explore the impact of large-scale pre-training on ViLTA's performance and investigate advanced negative sampling techniques. |
vision-language pre-training, textual augmentation, knowledge distillation, hard negative mining, multimodal learning |
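The soft-label idea described for ViLTA's MLM objective in the entry above can be illustrated with a small sketch. It assumes that a frozen teacher encoder and the student each emit one logit per vocabulary word at the masked position. The toy vocabulary, the logit values, and the function names are illustrative assumptions, not ViLTA's actual code.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, student_logits):
    """Cross-entropy between a target distribution and the student's prediction."""
    student_probs = softmax(student_logits)
    return -sum(t * math.log(s)
                for t, s in zip(target_probs, student_probs) if t > 0)

# Hypothetical 4-word vocabulary: index 0 is the masked word, index 1 a synonym.
vocab = ["big", "large", "red", "dog"]
teacher_logits = [4.0, 3.5, 0.0, -1.0]  # assumed output of the frozen teacher
student_logits = [2.0, 2.5, 0.0, 0.0]   # student slightly prefers the synonym

one_hot = [1.0, 0.0, 0.0, 0.0]          # conventional MLM target
soft = softmax(teacher_logits)          # soft label that keeps mass on the synonym

hard_loss = cross_entropy(one_hot, student_logits)
soft_loss = cross_entropy(soft, student_logits)
```

With the one-hot target, all probability the student places on "large" is penalized as if it were a wrong answer. The soft target assigns part of the mass to the synonym, so the same prediction incurs a smaller loss.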
2308.16632
Report |
3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation |
Changli Wu, Yiwei Ma, Qi Chen, Haowei Wang, Gen Luo, Jiayi Ji, Xiaoshuai Sun |
In 3D Referring Expression Segmentation (3D-RES), the earlier approach adopts
a two-stage paradigm, extracting segmentation proposals and then matching them
with referring expressions. However, this conventional paradigm encounters
significant challenges, most notably in terms of the generation of lackluster
initial proposals and a pronounced deceleration in inference speed. Recognizing
these limitations, we introduce an innovative end-to-end Superpoint-Text
Matching Network (3D-STMN) that is enriched by dependency-driven insights. One
of the keystones of our model is the Superpoint-Text Matching (STM) mechanism.
Unlike traditional methods that navigate through instance proposals, STM
directly correlates linguistic indications with their respective superpoints,
clusters of semantically related points. This architectural decision empowers
our model to efficiently harness cross-modal semantic relationships, primarily
leveraging densely annotated superpoint-text pairs, as opposed to the more
sparse instance-text pairs. In pursuit of enhancing the role of text in guiding
the segmentation process, we further incorporate the Dependency-Driven
Interaction (DDI) module to deepen the network's semantic comprehension of
referring expressions. Using the dependency trees as a beacon, this module
discerns the intricate relationships between primary terms and their associated
descriptors in expressions, thereby elevating both the localization and
segmentation capacities of our model. Comprehensive experiments on the
ScanRefer benchmark reveal that our model not only sets new performance
standards, registering an mIoU gain of 11.7 points, but also achieves a
staggering enhancement in inference speed, surpassing traditional methods by
95.7 times. The code and models are available at
https://github.com/sosppxo/3D-STMN. |
This paper proposes 3D-STMN, an efficient end-to-end Superpoint-Text Matching Network for 3D Referring Expression Segmentation (3D-RES), addressing limitations of previous two-stage methods. |
3D-RES is crucial for applications like robotics and self-driving by enabling precise object identification and segmentation from language descriptions in 3D scenes. |
3D-STMN leverages superpoints and a novel Superpoint-Text Matching mechanism (STM) for aligning visual and language modalities. It incorporates a Dependency-Driven Interaction (DDI) module to exploit sentence structure for improved semantic comprehension. |
3D-STMN significantly outperforms prior art (TGNN) on the ScanRefer benchmark, achieving an 11.7-point improvement in mIoU.
It achieves a remarkable 95.7x speedup in inference time compared to TGNN, enabling near real-time performance.
Qualitative analysis demonstrates 3D-STMN's superior accuracy in localizing and segmenting target objects, even in complex scenes with similar objects. |
The model's performance depends on the effectiveness of individual components like BERT, Sparse 3D U-Net, and the dependency parser.
Future work can explore incorporating contextual information from the 3D scene to further enhance object disambiguation. |
3d referring expression segmentation, 3d visual grounding, superpoint, dependency parsing, multimodal learning |
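The matching step in the 3D-STMN entry above, which scores superpoints directly against the expression instead of going through instance proposals, can be sketched in a few lines. This is a minimal illustration under stated assumptions: features are plain lists, similarity is cosine with a fixed threshold, and all function names are hypothetical. The actual model uses learned features and its own matching head.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def match_superpoints(superpoint_feats, text_feat, threshold=0.5):
    """Return one boolean per superpoint: does it match the expression?"""
    return [cosine(f, text_feat) > threshold for f in superpoint_feats]

def superpoint_mask_to_points(superpoint_mask, point_to_superpoint):
    """Broadcast each superpoint's decision to the points it contains."""
    return [superpoint_mask[s] for s in point_to_superpoint]

# Toy scene: 3 superpoints with 2-D features, 5 points assigned to them.
superpoint_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
text_feat = [1.0, 0.0]                 # assumed embedding of the expression
point_to_superpoint = [0, 0, 1, 2, 2]  # superpoint index of each point

sp_mask = match_superpoints(superpoint_feats, text_feat)
point_mask = superpoint_mask_to_points(sp_mask, point_to_superpoint)
```

Because the decision is made once per superpoint and then broadcast, the per-point mask comes from a single pass with no separate proposal stage.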
2308.16582
Report |
Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images |
Qingping Zheng, Yuanfan Guo, Jiankang Deng, Jianhua Han, Ying Li, Songcen Xu, Hang Xu |
Stable diffusion, a generative model used in text-to-image synthesis,
frequently encounters resolution-induced composition problems when generating
images of varying sizes. This issue primarily stems from the model being
trained on pairs of single-scale images and their corresponding text
descriptions. Moreover, direct training on images of unlimited sizes is
unfeasible, as it would require an immense number of text-image pairs and
entail substantial computational expenses. To overcome these challenges, we
propose a two-stage pipeline named Any-Size-Diffusion (ASD), designed to
efficiently generate well-composed images of any size, while minimizing the
need for high-memory GPU resources. Specifically, the initial stage, dubbed Any
Ratio Adaptability Diffusion (ARAD), leverages a selected set of images with a
restricted range of ratios to optimize the text-conditional diffusion model,
thereby improving its ability to adjust composition to accommodate diverse
image sizes. To support the creation of images at any desired size, we further
introduce a technique called Fast Seamless Tiled Diffusion (FSTD) at the
subsequent stage. This method allows for the rapid enlargement of the ASD
output to any high-resolution size, avoiding seaming artifacts or memory
overloads. Experimental results on the LAION-COCO and MM-CelebA-HQ benchmarks
demonstrate that ASD can produce well-structured images of arbitrary sizes,
cutting down the inference time by 2x compared to the traditional tiled
algorithm. |
This paper introduces Any-Size-Diffusion (ASD), a two-stage pipeline for generating high-resolution images of any size from text prompts, addressing resolution-induced composition problems in existing text-to-image synthesis models. |
Existing models struggle with resolution changes, leading to poor composition in generated images of varying sizes. This issue arises from training on single-scale images and poses challenges in handling the vast range of possible image sizes. |
ASD employs a two-stage approach: 1) Any Ratio Adaptability Diffusion (ARAD) trains on multi-aspect ratio images to generate well-composed images adaptable to different sizes. 2) Fast Seamless Tiled Diffusion (FSTD) magnifies ARAD outputs to arbitrary sizes using an implicit overlap technique for efficiency and seam artifact avoidance. |
ASD outperforms baseline models in quantitative metrics (FID, IS, CLIP) and qualitative comparisons, demonstrating superior composition quality.
Multi-aspect ratio training in ARAD significantly improves composition consistency across different sizes compared to single-size trained models.
FSTD effectively mitigates seaming artifacts while achieving comparable inference time to non-overlapping tiled sampling. |
The current implementation relies on a pre-defined set of aspect ratios for training.
Future work will explore expanding the range of aspect ratios and optimizing the computational efficiency of FSTD for even faster inference. |
text-to-image synthesis, stable diffusion, multi-resolution generation, compositionality, tiled diffusion |
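The entry above says FSTD avoids seams through implicit overlap while keeping the cost close to non-overlapping tiling. One way to obtain that behaviour, sketched below, is to keep tiles non-overlapping within a step but shift the tile grid by a fresh random offset at every denoising step, so that tile borders never stay in the same place. Whether this matches FSTD's exact procedure is an assumption; `shifted_tiles` and its parameters are hypothetical names.

```python
import random

def shifted_tiles(width, height, tile, offset_x=0, offset_y=0):
    """Split a width x height canvas into non-overlapping boxes (x0, y0, x1, y1).

    The grid is shifted by (offset_x, offset_y); boxes are clipped at the borders.
    """
    if tile <= 0:
        raise ValueError("tile must be positive")
    offset_x %= tile
    offset_y %= tile
    # Cut positions: canvas start, shifted grid lines, canvas end.
    xs = [0] + list(range(offset_x or tile, width, tile)) + [width]
    ys = [0] + list(range(offset_y or tile, height, tile)) + [height]
    return [(x0, y0, x1, y1)
            for y0, y1 in zip(ys, ys[1:])
            for x0, x1 in zip(xs, xs[1:])]

# Draw a new grid offset for each (hypothetical) denoising step.
rng = random.Random(0)
grids = []
for step in range(3):
    ox, oy = rng.randrange(4), rng.randrange(4)
    grids.append(shifted_tiles(10, 6, 4, ox, oy))
```

Each grid covers every pixel exactly once, so the per-step cost equals that of plain non-overlapping tiling. Only the positions of the tile borders change from step to step.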
2308.16512
Report |
MVDream: Multi-view Diffusion for 3D Generation |
Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, Xiao Yang |
We introduce MVDream, a diffusion model that is able to generate consistent
multi-view images from a given text prompt. Learning from both 2D and 3D data,
a multi-view diffusion model can achieve the generalizability of 2D diffusion
models and the consistency of 3D renderings. We demonstrate that such a
multi-view diffusion model is implicitly a generalizable 3D prior agnostic to
3D representations. It can be applied to 3D generation via Score Distillation
Sampling, significantly enhancing the consistency and stability of existing
2D-lifting methods. It can also learn new concepts from a few 2D examples, akin
to DreamBooth, but for 3D generation. |
This paper presents MVDream, a novel multi-view diffusion model that generates consistent multi-view images from text prompts and can serve as a prior for 3D generation. |
Existing 2D-lifting methods for text-to-3D generation often produce inconsistent images across different viewpoints. MVDream addresses this by directly incorporating multi-view consistency into a diffusion-based framework. |
The authors achieve this by (1) adapting a pre-trained text-to-image diffusion model to generate multiple views using inflated 3D self-attention and camera embeddings, (2) training the model on a combination of 3D rendered data and a large-scale 2D image-text dataset, and (3) using the trained model as a prior for 3D generation via Score Distillation Sampling (SDS). |
MVDream effectively addresses the multi-view consistency issue common in 2D-lifting methods.
The method demonstrates strong generalizability, generating coherent multi-view images from unseen and potentially counterfactual prompts.
MVDream can be extended to a DreamBooth model for personalized 3D generation, outperforming previous state-of-the-art methods in terms of quality and detail. |
The current implementation is limited to a 256x256 resolution and its generalizability depends on the base model.
The generated image styles can be influenced by the rendered dataset, highlighting the need for larger and more diverse 3D datasets. |
multi-view diffusion model, text-to-3d generation, score distillation sampling, dreambooth, multi-view consistency |
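For reference, the Score Distillation Sampling gradient mentioned in the MVDream entry above is commonly written in the following DreamFusion-style form. The symbols are assumptions introduced here for illustration: $\theta$ are the parameters of the 3D representation, $g$ is a differentiable renderer, $c$ is the camera, $y$ is the text prompt, $\epsilon_\phi$ is the noise predicted by the diffusion model, and $w(t)$ is a timestep weighting. Conditioning the noise prediction on $c$ reflects the multi-view setting and is not taken from the entry itself.

$$
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
= \mathbb{E}_{t,\,\epsilon,\,c}\!\left[\, w(t)\,\bigl(\epsilon_\phi(x_t;\, y,\, c,\, t) - \epsilon\bigr)\,\frac{\partial x}{\partial \theta} \right],
\qquad x = g(\theta, c), \quad x_t = \alpha_t\, x + \sigma_t\, \epsilon .
$$

The gradient pushes rendered views toward images the diffusion prior considers likely for the prompt, without backpropagating through the diffusion network itself.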
2308.16510
Report |
Robust GAN inversion |
Egor Sevriugov, Ivan Oseledets |
Recent advancements in real image editing have been attributed to the
exploration of Generative Adversarial Networks (GANs) latent space. However,
the main challenge of this procedure is GAN inversion, which aims to map the
image to the latent space accurately. Existing methods that work on extended
latent space $W+$ are unable to achieve low distortion and high editability
simultaneously. To address this issue, we propose an approach which works in
native latent space $W$ and tunes the generator network to restore missing
image details. We introduce a novel regularization strategy with learnable
coefficients obtained by training a randomized StyleGAN 2 model, WRanGAN. This
method outperforms traditional approaches in terms of reconstruction quality
and computational efficiency, achieving the lowest distortion with 4 times
fewer parameters. Furthermore, we observe a slight improvement in the quality
of constructing hyperplanes corresponding to binary image attributes. We
demonstrate the effectiveness of our approach on two complex datasets:
Flickr-Faces-HQ and LSUN Church. |
This paper introduces WRanGAN, a novel GAN inversion approach using adaptive regularization in the native latent space (W) to enhance image reconstruction quality while maintaining editability. |
Existing GAN inversion methods struggle to balance high-fidelity image reconstruction with preserving editability, limiting their use in real image editing applications. |
The method leverages a randomized StyleGAN 2 model (WRanGAN), in which a subset of generator weights is randomized. This allows regularization coefficients to be learned through adversarial training. During inversion, these coefficients are applied, guiding the optimization towards high-quality reconstruction without compromising model fidelity. |
WRanGAN achieves superior image reconstruction compared to baselines, evidenced by lower MSE and higher MS-SSIM values.
It outperforms the best baseline (PTI) in reconstruction quality while being significantly faster and requiring less memory.
The method retains good image editing capabilities, comparable to or slightly better than the original StyleGAN 2 model. |
The approach focuses on StyleGAN 2 architecture and its generalizability to other GAN architectures needs further investigation.
While the method maintains good editability, further exploration of techniques for even finer-grained control is possible. |
gan inversion, image editing, generative adversarial networks, stylegan, regularization |
2308.16481
Report |
Point-TTA: Test-Time Adaptation for Point Cloud Registration Using Multitask Meta-Auxiliary Learning |
Ahmed Hatem, Yiming Qian, Yang Wang |
We present Point-TTA, a novel test-time adaptation framework for point cloud
registration (PCR) that improves the generalization and the performance of
registration models. While learning-based approaches have achieved impressive
progress, generalization to unknown testing environments remains a major
challenge due to the variations in 3D scans. Existing methods typically train a
generic model and the same trained model is applied on each instance during
testing. This could be sub-optimal since it is difficult for the same model to
handle all the variations during testing. In this paper, we propose a test-time
adaptation approach for PCR. Our model can adapt to unseen distributions at
test-time without requiring any prior knowledge of the test data. Concretely,
we design three self-supervised auxiliary tasks that are optimized jointly with
the primary PCR task. Given a test instance, we adapt our model using these
auxiliary tasks and the updated model is used to perform the inference. During
training, our model is trained using a meta-auxiliary learning approach, such
that the adapted model via auxiliary tasks improves the accuracy of the primary
task. Experimental results demonstrate the effectiveness of our approach in
improving generalization of point cloud registration and outperforming other
state-of-the-art approaches. |
This paper introduces Point-TTA, a novel test-time adaptation framework designed for point cloud registration (PCR) that enhances the generalization and performance of registration models. |
Generalization to unknown testing environments remains a challenge for learning-based PCR approaches due to variations in 3D scans, making a single set of model parameters sub-optimal. |
The method utilizes three self-supervised auxiliary tasks: point cloud reconstruction, feature learning (BYOL), and correspondence classification, to adapt the model to unseen distributions at test time. It employs a meta-auxiliary learning approach based on MAML to train the model, ensuring the adapted model improves the accuracy of the primary PCR task. |
Point-TTA significantly improves registration recall and reduces rotation and translation errors on the 3DMatch benchmark, outperforming state-of-the-art methods.
The method exhibits strong generalization capabilities, demonstrated by significant performance improvements in cross-dataset evaluations between 3DMatch and KITTI datasets, as well as robustness on the low-overlapping 3DLoMatch dataset.
Point-TTA, integrated into a multi-way registration pipeline, enhances the accuracy of 3D reconstruction scenes on the Augmented ICL-NUIM dataset, surpassing baseline methods. |
The paper acknowledges the potential limitation of slightly worse rotation error observed in some cases when using all three auxiliary tasks.
Future work could explore extending the approach to handle dynamic scenes and incorporating additional self-supervised auxiliary tasks. |
point cloud registration, test-time adaptation, meta-learning, self-supervised learning, 3d vision |
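The test-time adaptation loop in the Point-TTA entry above can be sketched as follows: for each test instance, copy the trained parameters, take a few gradient steps on a self-supervised auxiliary loss, run inference with the adapted copy, and discard it afterwards. The scalar toy loss and all names here are hypothetical. The meta-auxiliary (MAML-based) training that makes these steps help the primary task is not shown.

```python
def adapt_on_instance(params, instance, aux_grad, lr=0.1, steps=5):
    """Return a per-instance copy of params after a few gradient steps
    on an auxiliary loss. The input params are left unmodified."""
    adapted = list(params)
    for _ in range(steps):
        grads = aux_grad(adapted, instance)
        adapted = [p - lr * g for p, g in zip(adapted, grads)]
    return adapted

def toy_aux_grad(params, instance):
    """Gradient of a toy auxiliary loss (params[0] - instance) ** 2."""
    return [2.0 * (params[0] - instance)]

base = [0.0]  # stands in for the trained model's parameters
adapted = adapt_on_instance(base, instance=1.0, aux_grad=toy_aux_grad)
```

Starting every test instance from the same `base` keeps adaptations independent, so an unusual scan cannot degrade predictions on later ones.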
2308.16110
Report |
Improving Few-shot Image Generation by Structural Discrimination and Textural Modulation |
Mengping Yang, Zhe Wang, Wenyi Feng, Qian Zhang, Ting Xiao |
Few-shot image generation, which aims to produce plausible and diverse images
for one category given a few images from this category, has drawn extensive
attention. Existing approaches either globally interpolate different images or
fuse local representations with pre-defined coefficients. However, such an
intuitive combination of images/features only exploits the most relevant
information for generation, leading to poor diversity and coarse-grained
semantic fusion. To remedy this, this paper proposes a novel textural
modulation (TexMod) mechanism to inject external semantic signals into internal
local representations. Parameterized by the feedback from the discriminator,
our TexMod enables more fine-grained semantic injection while maintaining the
synthesis fidelity. Moreover, a global structural discriminator (StructD) is
developed to explicitly guide the model to generate images with reasonable
layout and outline. Furthermore, the frequency awareness of the model is
reinforced by encouraging the model to distinguish frequency signals. Together
with these techniques, we build a novel and effective model for few-shot image
generation. The effectiveness of our model is identified by extensive
experiments on three popular datasets and various settings. Besides achieving
state-of-the-art synthesis performance on these datasets, our proposed
techniques could be seamlessly integrated into existing models for a further
performance boost. |
This paper proposes SDTM-GAN, a few-shot image generation model, which improves global coherence and enables fine-grained semantic fusion through structural discrimination (StructD) and textural modulation (TexMod). |
Few-shot image generation models struggle to achieve desirable diversity and fidelity due to limitations in semantic fusion and lack of structural guidance. |
TexMod injects external semantic layouts into internal textural styles using a two-stage injection mechanism. StructD uses Laplacian representations to provide global structural guidelines. A frequency discriminator encourages high-frequency signal capture. |
SDTM-GAN significantly improves FID and LPIPS scores on Flowers, Animal Faces, and VGGFace datasets, achieving state-of-the-art performance.
The generated images show improved global coherence, fine-grained semantic details, and diversity.
The proposed techniques are complementary to existing models and improve downstream classification accuracy when used for data augmentation. |
The model's performance might degrade on datasets with large class variances or in cross-domain generation with substantial domain gaps.
Future work includes exploring data augmentation for one-shot generation, capturing more distributional information, and investigating diffusion models. |
few-shot image generation, textural modulation, structural discrimination, generative adversarial networks, semantic fusion |
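The SDTM-GAN entry above says StructD relies on Laplacian representations to expose global structure. The exact operator is not stated in this entry, so the sketch below assumes a plain 4-neighbour discrete Laplacian purely for illustration. It responds to edges and outlines and is zero on flat regions, which is the kind of signal a structural discriminator could consume.

```python
def laplacian(image):
    """Apply a 4-neighbour Laplacian to a 2D list, replicating border pixels."""
    h, w = len(image), len(image[0])

    def px(y, x):
        # Clamp coordinates so border pixels reuse their nearest neighbour.
        return image[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    return [[px(y - 1, x) + px(y + 1, x) + px(y, x - 1) + px(y, x + 1)
             - 4 * image[y][x]
             for x in range(w)]
            for y in range(h)]

flat = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
impulse = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]

flat_response = laplacian(flat)        # no structure, so no response
impulse_response = laplacian(impulse)  # response concentrated around the dot
```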
2308.15854
Report |
Zero-shot Inversion Process for Image Attribute Editing with Diffusion Models |
Zhanbo Feng, Zenan Ling, Ci Gong, Feng Zhou, Jie Li, Robert C. Qiu |
Denoising diffusion models have shown outstanding performance in image
editing. Existing works tend to use either image-guided methods, which provide
a visual reference but lack control over semantic coherence, or text-guided
methods, which ensure faithfulness to text guidance but lack visual quality. To
address the problem, we propose the Zero-shot Inversion Process (ZIP), a
framework that injects a fusion of generated visual reference and text guidance
into the semantic latent space of a \textit{frozen} pre-trained diffusion
model. Only using a tiny neural network, the proposed ZIP produces diverse
content and attributes under the intuitive control of the text prompt.
Moreover, ZIP shows remarkable robustness for both in-domain and out-of-domain
attribute manipulation on real images. We perform detailed experiments on
various benchmark datasets. Compared to state-of-the-art methods, ZIP produces
images of equivalent quality while providing a realistic editing effect. |
This paper introduces Zero-shot Inversion Process (ZIP), a novel framework for realistic image attribute editing that injects a fusion of generated visual reference and text guidance into the semantic latent space of a *frozen* pre-trained diffusion model. |
Existing image editing methods using diffusion models often lack control over semantic coherence (image-guided) or suffer from low visual quality (text-guided). ZIP addresses this by combining text guidance for intuitive control and generated visual references for fine-grained visual patterns. |
ZIP leverages a pre-trained text-to-image diffusion model to generate a reference image from the target attribute. An attribute encoder (a small neural network) is trained to encode this reference image into features, which are then integrated into the latent space of a frozen pre-trained diffusion model (editing generator). The editing process is guided by a text prompt and optimized using a CLIP-based loss function. |
ZIP enables consistent and controllable editing by generating specific attributes aligned with the reference image, unlike text-guided methods like Null-text Inversion.
ZIP demonstrates superior performance in both in-domain and out-of-domain attribute editing compared to image-guided methods (ILVR) and text-guided methods (Asyrp), achieving higher CLIP scores while preserving visual quality.
ZIP exhibits versatility across diverse datasets (CelebA-HQ, LSUN-church, LSUN-bedroom), successfully synthesizing attributes and manipulating images based on textual semantics. |
ZIP currently lacks the capability to utilize a target mask for precise editing, which may lead to unintended modifications.
Future work will focus on improving the accuracy of attribute acquisition from reference images, particularly for attributes with similar visual features. |
image editing, diffusion models, text-guided synthesis, zero-shot learning, semantic manipulation |
2308.15692
Report |
Intriguing Properties of Diffusion Models: An Empirical Study of the Natural Attack Capability in Text-to-Image Generative Models |
Takami Sato, Justin Yue, Nanze Chen, Ningfei Wang, Qi Alfred Chen |
Denoising probabilistic diffusion models have shown breakthrough performance
to generate more photo-realistic images or human-level illustrations than the
prior models such as GANs. This high image-generation capability has stimulated
the creation of many downstream applications in various areas. However, we find
that this technology is actually a double-edged sword: We identify a new type
of attack, called the Natural Denoising Diffusion (NDD) attack based on the
finding that state-of-the-art deep neural network (DNN) models still hold their
prediction even if we intentionally remove their robust features, which are
essential to the human visual system (HVS), through text prompts. The NDD
attack shows a significantly high capability to generate low-cost,
model-agnostic, and transferable adversarial attacks by exploiting the natural
attack capability in diffusion models. To systematically evaluate the risk of
the NDD attack, we perform a large-scale empirical study with our newly created
dataset, the Natural Denoising Diffusion Attack (NDDA) dataset. We evaluate the
natural attack capability by answering 6 research questions. Through a user
study, we find that it can achieve an 88% detection rate while being stealthy
to 93% of human subjects; we also find that the non-robust features embedded by
diffusion models contribute to the natural attack capability. To confirm the
model-agnostic and transferable attack capability, we perform the NDD attack
against the Tesla Model 3 and find that 73% of the physically printed attacks
can be detected as stop signs. Our hope is that the study and dataset can help
our community be aware of the risks in diffusion models and facilitate further
research toward robust DNN models. |
This paper identifies a new security threat, the Natural Denoising Diffusion (NDD) attack, which exploits the natural attack capability of diffusion models to generate model-agnostic and transferable adversarial attacks. |
Diffusion models, while revolutionary for image generation, introduce new security risks by embedding imperceptible features that can fool DNN models. |
The authors construct the Natural Denoising Diffusion Attack (NDDA) dataset by generating images with and without robust features (shape, color, text, pattern) using various diffusion models. They evaluate the attack capability against object detectors and image classifiers and conduct a user study to assess the stealthiness of the attacks. |
Diffusion models exhibit a significantly higher natural attack capability compared to prior image generation models like GANs.
The NDD attack can achieve a high attack success rate (88% for stop signs) while remaining stealthy to human perception (93% for stop signs).
Non-robust features, imperceptible yet predictive to DNNs, play a significant role in the NDD attack's effectiveness. |
The study primarily focuses on three object classes, requiring further evaluation on a wider range of categories.
While the research provides empirical evidence, the root causes of the natural attack capability in diffusion models require further theoretical and large-scale empirical investigation. |
adversarial attacks, diffusion models, computer vision, deep learning security, ndd attack |
2308.15547
Report |
Efficient Ray Sampling for Radiance Fields Reconstruction |
Shilei Sun, Ming Liu, Zhongyi Fan, Yuxue Liu, Chengwei Lv, Liquan Dong, Lingqin Kong |
Accelerating neural radiance fields training is of substantial practical
value, as the ray sampling strategy profoundly impacts network convergence.
More efficient ray sampling can thus directly enhance existing NeRF models'
training efficiency. We therefore propose a novel ray sampling approach for
neural radiance fields that improves training efficiency while retaining
photorealistic rendering results. First, we analyze the relationship between
the pixel loss distribution of sampled rays and rendering quality. This reveals
redundancy in the original NeRF's uniform ray sampling. Guided by this finding,
we develop a sampling method leveraging pixel regions and depth boundaries. Our
main idea is to sample fewer rays in training views, yet with each ray more
informative for scene fitting. Sampling probability increases in pixel areas
exhibiting significant color and depth variation, greatly reducing wasteful
rays from other regions without sacrificing precision. Through this method, not
only can the convergence of the network be accelerated, but the spatial
geometry of a scene can also be perceived more accurately. Rendering outputs
are enhanced, especially for texture-complex regions. Experiments demonstrate
that our method significantly outperforms state-of-the-art techniques on public
benchmark datasets. |
This paper proposes a novel ray sampling method for neural radiance fields that improves training efficiency while maintaining high-quality rendering results. |
Accelerating neural radiance fields training is crucial, and efficient ray sampling is key to achieving faster convergence and better utilizing resources. |
The proposed method leverages pixel regions and depth boundaries to guide ray sampling. It increases sampling probability in areas with significant color and depth variations, reducing redundant rays in other regions. |
The method significantly accelerates convergence, achieving comparable results to traditional methods in much shorter times.
It improves rendering quality, particularly in regions with rich texture and complex details.
The method is easily integrated into existing NeRF frameworks, consistently demonstrating improvements in speed and rendering quality. |
The current implementation primarily focuses on static and dynamic scenes with simple backgrounds.
Future work will explore extending the method to handle complex backgrounds and further enhance its generalization capabilities. |
neural radiance fields, ray sampling, view synthesis, training acceleration, rendering quality |
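The sampling rule in the ray-sampling entry above, which raises the probability of picking rays where color varies strongly, can be sketched as a weighted draw over pixels. This is a simplified illustration under stated assumptions: the image is a grayscale 2D list, variation is a sum of neighbour differences, depth boundaries are ignored, and the `floor` parameter and function names are hypothetical.

```python
import random

def color_variation(image):
    """Per-pixel sum of absolute differences to the right and bottom neighbours."""
    h, w = len(image), len(image[0])
    var = []
    for y in range(h):
        row = []
        for x in range(w):
            dx = abs(image[y][min(x + 1, w - 1)] - image[y][x])
            dy = abs(image[min(y + 1, h - 1)][x] - image[y][x])
            row.append(dx + dy)
        var.append(row)
    return var

def sampling_probs(variation, floor=0.1):
    """Normalize variation into probabilities; floor keeps flat regions reachable."""
    weights = [[v + floor for v in row] for row in variation]
    total = sum(sum(row) for row in weights)
    return [[wgt / total for wgt in row] for row in weights]

def sample_pixels(probs, n, rng):
    """Draw n pixel coordinates (y, x) according to the probability map."""
    coords = [(y, x) for y in range(len(probs)) for x in range(len(probs[0]))]
    flat = [p for row in probs for p in row]
    return rng.choices(coords, weights=flat, k=n)

# Toy image with a vertical edge between columns 1 and 2.
image = [[0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0]]
probs = sampling_probs(color_variation(image))
rays = sample_pixels(probs, 8, random.Random(0))
```

Pixels next to the edge receive most of the probability mass, while the small floor ensures that flat regions are still sampled occasionally.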
2308.15472
Report |
Learning Modulated Transformation in GANs |
Ceyuan Yang, Qihang Zhang, Yinghao Xu, Jiapeng Zhu, Yujun Shen, Bo Dai |
The success of style-based generators largely benefits from style modulation,
which helps take care of the cross-instance variation within data. However, the
instance-wise stochasticity is typically introduced via regular convolution,
where kernels interact with features at some fixed locations, limiting its
capacity for modeling geometric variation. To alleviate this problem, we equip
the generator in generative adversarial networks (GANs) with a plug-and-play
module, termed as modulated transformation module (MTM). This module predicts
spatial offsets under the control of latent codes, based on which the
convolution operation can be applied at variable locations for different
instances, and hence offers the model an additional degree of freedom to handle
geometry deformation. Extensive experiments suggest that our approach can be
faithfully generalized to various generative tasks, including image generation,
3D-aware image synthesis, and video generation, and is compatible with
state-of-the-art frameworks without any hyper-parameter tuning. It is
noteworthy that, towards human generation on the challenging TaiChi dataset, we
improve the FID of StyleGAN3 from 21.36 to 13.60, demonstrating the efficacy of
learning modulated geometry transformation. |
This paper proposes a plug-and-play module for GAN generators called Modulated Transformation Module (MTM) to improve the handling of large geometric variations in generated content. |
Standard GAN generators, while effective for aligned datasets with limited geometric variance, struggle with datasets like ImageNet or videos with complex motions due to the limitations of regular convolutions in modeling diverse geometry. |
MTM predicts spatial offsets for each spatial location in the feature map conditioned on the latent code. These offsets allow convolution operations to be performed at variable locations, enabling the model to learn and represent diverse geometric transformations. |
MTM consistently improves both image and video generation quality across different datasets and baseline models, as evidenced by metrics like FID, CLIP-FD, sFID, and FVD.
Applying MTM to low-resolution layers of the generator offers the best performance-efficiency trade-off.
Disabling the learnable offsets in MTM after training results in a collapse of geometric variation, highlighting its role in enabling explicit deformation. |
The paper doesn't explore the effectiveness of MTM on other generative models like auto-regressive models or diffusion models.
The impact of MTM on large-scale generative tasks such as text-to-image generation remains unexplored. |
generative adversarial networks, geometric variation, image generation, video generation, spatial transformation |
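The mechanism in the MTM entry above, where convolution taps are read at locations displaced by predicted offsets, resembles a deformable convolution. The sketch below evaluates one 3x3 response with bilinear interpolation at the displaced taps. In MTM the offsets are predicted from the latent code; here they are passed in by hand, and all function names are hypothetical.

```python
import math

def bilinear_sample(feature, y, x):
    """Sample a 2D feature map at a fractional location, clamping to the border."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = feature[y0][x0] * (1 - wx) + feature[y0][x1] * wx
    bottom = feature[y1][x0] * (1 - wx) + feature[y1][x1] * wx
    return top * (1 - wy) + bottom * wy

def offset_conv_at(feature, kernel, cy, cx, offsets):
    """3x3 convolution response at (cy, cx); tap i is displaced by offsets[i] = (dy, dx)."""
    taps = [(ky, kx) for ky in (-1, 0, 1) for kx in (-1, 0, 1)]
    out = 0.0
    for (ky, kx), (dy, dx) in zip(taps, offsets):
        out += kernel[ky + 1][kx + 1] * bilinear_sample(
            feature, cy + ky + dy, cx + kx + dx)
    return out

feature = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
zero_offsets = [(0.0, 0.0)] * 9
shifted = list(zero_offsets)
shifted[4] = (0.0, 0.5)  # move only the centre tap half a pixel to the right

regular = offset_conv_at(feature, identity, 1, 1, zero_offsets)
deformed = offset_conv_at(feature, identity, 1, 1, shifted)
```

With all offsets at zero the operation reduces to a regular convolution. Non-zero offsets let the same kernel read from different locations per instance, which is the extra degree of freedom the entry describes.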
2308.15070
Report |
DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior |
Xinqi Lin, Jingwen He, Ziyan Chen, Zhaoyang Lyu, Bo Dai, Fanghua Yu, Wanli Ouyang, Yu Qiao, Chao Dong |
We present DiffBIR, a general restoration pipeline that could handle
different blind image restoration tasks in a unified framework. DiffBIR
decouples blind image restoration problem into two stages: 1) degradation
removal: removing image-independent content; 2) information regeneration:
generating the lost image content. Each stage is developed independently but
they work seamlessly in a cascaded manner. In the first stage, we use
restoration modules to remove degradations and obtain high-fidelity restored
results. For the second stage, we propose IRControlNet that leverages the
generative ability of latent diffusion models to generate realistic details.
Specifically, IRControlNet is trained based on specially produced condition
images without distracting noisy content for stable generation performance.
Moreover, we design a region-adaptive restoration guidance that can modify the
denoising process during inference without model re-training, allowing users to
balance realness and fidelity through a tunable guidance scale. Extensive
experiments have demonstrated DiffBIR's superiority over state-of-the-art
approaches for blind image super-resolution, blind face restoration and blind
image denoising tasks on both synthetic and real-world datasets. The code is
available at https://github.com/XPixelGroup/DiffBIR. |
DiffBIR, a unified two-stage blind image restoration pipeline achieving state-of-the-art performance on blind image super-resolution (BSR), blind face restoration (BFR), and blind image denoising (BID). |
Existing blind image restoration (BIR) methods struggle to generalize to real-world degradations or lack the ability to generate realistic details, limiting their practicality. |
DiffBIR decouples BIR into degradation removal (using task-specific modules) and information regeneration (using IRControlNet, a novel generation module leveraging latent diffusion models conditioned on restored images). It also introduces region-adaptive restoration guidance for fidelity-quality trade-off during inference. |
DiffBIR significantly outperforms state-of-the-art methods in BSR, BFR, and BID on both synthetic and real-world datasets, demonstrating its superior generalization ability.
IRControlNet, with its efficient condition encoding and feature modulation, proves crucial for high-quality image reconstruction in BIR.
The region-adaptive restoration guidance allows for flexible control over fidelity and quality based on user preferences. |
The current implementation of DiffBIR is computationally expensive, requiring 50 sampling steps per image.
The effectiveness of the two-stage pipeline on other BIR tasks requires further exploration. |
blind image restoration, blind super-resolution, blind face restoration, blind image denoising, latent diffusion models |
2308.15049
Report |
Pose-Free Neural Radiance Fields via Implicit Pose Regularization |
Jiahui Zhang, Fangneng Zhan, Yingchen Yu, Kunhao Liu, Rongliang Wu, Xiaoqin Zhang, Ling Shao, Shijian Lu |
Pose-free neural radiance fields (NeRF) aim to train NeRF with unposed
multi-view images and it has achieved very impressive success in recent years.
Most existing works share the pipeline of training a coarse pose estimator with
rendered images at first, followed by a joint optimization of estimated poses
and neural radiance field. However, as the pose estimator is trained with only
rendered images, the pose estimation is usually biased or inaccurate for real
images due to the domain gap between real images and rendered images, leading
to poor robustness for the pose estimation of real images and further local
minima in joint optimization. We design IR-NeRF, an innovative pose-free NeRF
that introduces implicit pose regularization to refine pose estimator with
unposed real images and improve the robustness of the pose estimation for real
images. With a collection of 2D images of a specific scene, IR-NeRF constructs
a scene codebook that stores scene features and captures the scene-specific
pose distribution implicitly as priors. Thus, the robustness of pose estimation
can be promoted with the scene priors according to the rationale that a 2D real
image can be well reconstructed from the scene codebook only when its estimated
pose lies within the pose distribution. Extensive experiments show that IR-NeRF
achieves superior novel view synthesis and outperforms the state-of-the-art
consistently across multiple synthetic and real datasets. |
This paper proposes IR-NeRF, a pose-free Neural Radiance Field (NeRF) that leverages implicit pose regularization to refine pose estimation with unposed real images and improve the robustness of pose estimation. |
Existing pose-free NeRF methods struggle with inaccurate pose estimation due to the domain gap between rendered and real images. This leads to poor robustness and local minima during optimization. IR-NeRF addresses this challenge by incorporating implicit pose regularization. |
IR-NeRF constructs a scene codebook that stores scene features and implicitly captures scene-specific pose distribution. A pose-guided view reconstruction scheme then refines the pose estimator using unposed real images and a view consistency loss. |
IR-NeRF achieves superior novel view synthesis compared to the state-of-the-art GNeRF, as demonstrated by higher PSNR and SSIM and lower LPIPS scores across various synthetic and real datasets.
The proposed implicit pose regularization effectively improves the accuracy of camera pose estimation on real images.
Ablation studies confirm the effectiveness of each component in IR-NeRF, including implicit pose regularization, scene codebook construction, and view consistency loss. |
The training process of IR-NeRF is computationally expensive and time-consuming.
Future work can explore methods to improve training speed, potentially by using more efficient representations. |
neural radiance fields, novel view synthesis, pose estimation, implicit pose regularization, scene codebook |
2308.14761
Report |
Unified Concept Editing in Diffusion Models |
Rohit Gandikota, Hadas Orgad, Yonatan Belinkov, Joanna Materzyńska, David Bau |
Text-to-image models suffer from various safety issues that may limit their
suitability for deployment. Previous methods have separately addressed
individual issues of bias, copyright, and offensive content in text-to-image
models. However, in the real world, all of these issues appear simultaneously
in the same model. We present a method that tackles all issues with a single
approach. Our method, Unified Concept Editing (UCE), edits the model without
training using a closed-form solution, and scales seamlessly to concurrent
edits on text-conditional diffusion models. We demonstrate scalable
simultaneous debiasing, style erasure, and content moderation by editing
text-to-image projections, and we present extensive experiments demonstrating
improved efficacy and scalability over prior work. Our code is available at
https://unified.baulab.info |
This paper introduces Unified Concept Editing (UCE), a closed-form model editing method for addressing safety issues like bias, copyright, and offensive content in text-to-image diffusion models. |
Existing methods address these issues separately, while real-world models exhibit all problems concurrently. UCE offers a single, scalable solution for simultaneous editing, crucial for responsible AI deployment. |
UCE modifies cross-attention weights using a closed-form solution. It identifies concepts via text embeddings, then erases, debiases, or moderates them by steering outputs towards desired targets while preserving unrelated concepts. |
UCE effectively erases artistic styles with minimal interference on unrelated concepts, outperforming fine-tuning-based methods.
It debiases gender and racial representations in professions, achieving distributions closer to desired ratios than previous methods.
UCE moderates sensitive content like nudity, effectively reducing its presence while better preserving image quality and text-image alignment compared to other techniques. |
Debiasing across multiple attributes reveals interdependencies and compounding biases, requiring joint consideration for mitigation.
Excessive erasure of artistic styles, even with preservation, degrades general image generation, indicating limits to removable content. |
diffusion models, model editing, debiasing, content moderation, copyright |
2308.14753
Report |
Efficient Discovery and Effective Evaluation of Visual Perceptual Similarity: A Benchmark and Beyond |
Oren Barkan, Tal Reiss, Jonathan Weill, Ori Katz, Roy Hirsch, Itzik Malkiel, Noam Koenigstein |
Visual similarities discovery (VSD) is an important task with broad
e-commerce applications. Given an image of a certain object, the goal of VSD is
to retrieve images of different objects with high perceptual visual similarity.
Although being a highly addressed problem, the evaluation of proposed methods
for VSD is often based on a proxy of an identification-retrieval task,
evaluating the ability of a model to retrieve different images of the same
object. We posit that evaluating VSD methods based on identification tasks is
limited, and faithful evaluation must rely on expert annotations. In this
paper, we introduce the first large-scale fashion visual similarity benchmark
dataset, consisting of more than 110K expert-annotated image pairs. Besides
this major contribution, we share insight from the challenges we faced while
curating this dataset. Based on these insights, we propose a novel and
efficient labeling procedure that can be applied to any dataset. Our analysis
examines its limitations and inductive biases, and based on these findings, we
propose metrics to mitigate those limitations. Though our primary focus lies on
visual similarity, the methodologies we present have broader applications for
discovering and evaluating perceptual similarity across various domains. |
This paper introduces the first large-scale, expert-annotated benchmark dataset for fashion visual similarity discovery (VSD), addressing limitations of identification-based evaluations. |
Accurate VSD evaluation is crucial for e-commerce applications, but existing methods often rely on flawed identification-based proxies. This dataset enables more reliable assessment of VSD models. |
The authors develop the Efficient Discovery of Similarities (EDS) method to curate the dataset. EDS leverages multiple vision models to propose candidate similar pairs, which are then verified by human experts. |
Expert-annotated dataset with over 110K image pairs for closed-catalog and in-the-wild fashion VSD benchmarks.
Proposed ROC-AUC metric for VSD evaluation shows robustness to model bias inherent in the EDS method.
Supervised finetuning for identification does not necessarily improve VSD performance, highlighting the distinction between the two tasks. |
EDS method, while more efficient than brute force, may not uncover all positive pairs.
Future work could explore alternative training schemes for improving VSD performance beyond supervised finetuning. |
visual similarity, benchmark dataset, fashion, information retrieval, evaluation metrics |
2308.14749
Report |
MagicEdit: High-Fidelity and Temporally Coherent Video Editing |
Jun Hao Liew, Hanshu Yan, Jianfeng Zhang, Zhongcong Xu, Jiashi Feng |
In this report, we present MagicEdit, a surprisingly simple yet effective
solution to the text-guided video editing task. We found that high-fidelity and
temporally coherent video-to-video translation can be achieved by explicitly
disentangling the learning of content, structure and motion signals during
training. This is in contrast to most existing methods, which attempt to
jointly model both the appearance and temporal representation within a single
framework, an approach that, we argue, leads to degradation in per-frame quality.
Despite its simplicity, we show that MagicEdit supports various downstream
video editing tasks, including video stylization, local editing, video-MagicMix
and video outpainting. |
MagicEdit is a surprisingly simple yet effective solution for text-guided video editing that achieves high-fidelity and temporally coherent results by explicitly disentangling the learning of content, structure, and motion signals. |
Existing video editing methods often struggle with maintaining high per-frame quality and temporal consistency. MagicEdit addresses these issues with a novel training approach. |
MagicEdit uses a three-stage training process: (1) Train a base text-to-image diffusion model. (2) Train a structure-conditioned module while freezing the pre-trained UNet. (3) Train a motion module to enforce cross-frame consistency, also while freezing the UNet. |
Successfully performs video stylization with different subjects, backgrounds, and styles.
Enables local editing of videos based on text prompts.
Demonstrates video outpainting capabilities with various ratios and content control. |
Relies on the quality of existing structure extraction methods (e.g., depth, pose).
Large outpainting ratios can sometimes lead to less coherent results. |
video editing, text-guided generation, diffusion models, temporal consistency, video outpainting |
2308.14748
Report |
MagicAvatar: Multimodal Avatar Generation and Animation |
Jianfeng Zhang, Hanshu Yan, Zhongcong Xu, Jiashi Feng, Jun Hao Liew |
This report presents MagicAvatar, a framework for multimodal video generation
and animation of human avatars. Unlike most existing methods that generate
avatar-centric videos directly from multimodal inputs (e.g., text prompts),
MagicAvatar explicitly disentangles avatar video generation into two stages:
(1) multimodal-to-motion and (2) motion-to-video generation. The first stage
translates the multimodal inputs into motion/control signals (e.g., human
pose, depth, DensePose); while the second stage generates avatar-centric video
guided by these motion signals. Additionally, MagicAvatar supports avatar
animation by simply providing a few images of the target person. This
capability enables the animation of the provided human identity according to
the specific motion derived from the first stage. We demonstrate the
flexibility of MagicAvatar through various applications, including text-guided
and video-guided avatar generation, as well as multimodal avatar animation. |
MagicAvatar, a two-stage framework for multimodal avatar generation and animation, enabling the creation of avatar-centric videos from text, video, or audio inputs. |
Addresses the increasing demand for flexible and user-friendly avatar generation tools in virtual reality, gaming, and social media. |
Disentangles avatar video generation into multimodal-to-motion and motion-to-video stages; utilizes off-the-shelf models for multimodal-to-motion conversion and leverages MagicEdit for motion-to-video generation; enables identity personalization via DreamBooth for animating specific subjects. |
Generates realistic and temporally-coherent avatar videos from text prompts.
Creates avatar videos mimicking motions from source videos.
Allows animating specific subjects using various input modalities. |
Relies on the performance of off-the-shelf models for multimodal-to-motion generation.
Limited control over fine-grained details of the generated motion. |
avatar generation, avatar animation, multimodal learning, text-to-video, video-to-video |
2308.14740
Report |
Total Selfie: Generating Full-Body Selfies |
Bowei Chen, Brian Curless, Ira Kemelmacher-Shlizerman, Steven M. Seitz |
We present a method to generate full-body selfies from photographs originally
taken at arm's length. Because self-captured photos are typically taken close
up, they have limited field of view and exaggerated perspective that distorts
facial shapes. We instead seek to generate the photo someone else would take
of you from a few feet away. Our approach takes as input four selfies of your
face and body, a background image, and generates a full-body selfie in a
desired target pose. We introduce a novel diffusion-based approach to combine
all of this information into high-quality, well-composed photos of you with the
desired pose and background. |
Introduces "total selfie", a new type of self-captured photo that captures the entire body in a scene, and proposes a diffusion-based framework to generate it from four selfies, a background image, and a target pose. |
Selfies have limited field of view, distorted perspectives, and pose compositional challenges. Total selfies aim to capture full-body images as if taken by someone else, addressing these limitations. |
Trains a selfie-conditioned inpainting model on a synthetic dataset of selfies and full-body images. At test time, performs face undistortion, automatically selects target pose from user's photo collection, and fine-tunes the model per capture for enhanced fidelity. |
Generates high-quality full-body selfies with accurate poses, expressions, and clothing, even with significant pose differences between input and target.
Outperforms adapted baseline methods, including Paint-By-Example, DisCo, LaDI-VTON, and DreamBooth, in both qualitative and quantitative comparisons.
Demonstrates the trade-offs of using different target pose options (no condition, OpenPose skeleton, Canny Edge) for controlling pose and body shape. |
Shading of the generated body may not perfectly match the real photo.
Struggles to accurately generate hard shadows under strong sunlight due to difficulty in inferring sun direction and scene geometry from the background image alone. |
total selfie, full-body selfie generation, diffusion models, image inpainting, pose control |
2308.14737
Report |
Flexible Techniques for Differentiable Rendering with 3D Gaussians |
Leonid Keselman, Martial Hebert |
Fast, reliable shape reconstruction is an essential ingredient in many
computer vision applications. Neural Radiance Fields demonstrated that
photorealistic novel view synthesis is within reach, but was gated by
performance requirements for fast reconstruction of real scenes and objects.
Several recent approaches have built on alternative shape representations, in
particular, 3D Gaussians. We develop extensions to these renderers, such as
integrating differentiable optical flow, exporting watertight meshes and
rendering per-ray normals. Additionally, we show how two of the recent methods
are interoperable with each other. These reconstructions are quick, robust, and
easily performed on GPU or CPU. For code and visual examples, see
https://leonidk.github.io/fmb-plus |
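This paper develops extensions to differentiable renderers based on 3D Gaussians (differentiable optical flow, watertight mesh export, and per-ray normals) and reports forward-pass rendering runtimes on two datasets, Ficus and CO3D Teddy. |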
The comparison of CPU and GPU runtimes highlights the efficiency of the method, particularly on the GPU. |
The method is evaluated by measuring the time taken for blending and compositing operations on both CPU and GPU. |
The GPU significantly outperforms the CPU for both datasets.
The method demonstrates faster rendering times on the simpler CO3D Teddy dataset compared to the more complex Ficus dataset.
The small differences in GPU runtimes for different datasets suggest a potential memory bottleneck in the current implementation. |
The study is limited to two datasets.
The potential memory bottleneck requires further investigation and optimization. |
rendering, gpu, runtime, neural rendering, gaussian |
2308.14713
Report |
R3D3: Dense 3D Reconstruction of Dynamic Scenes from Multiple Cameras |
Aron Schmied, Tobias Fischer, Martin Danelljan, Marc Pollefeys, Fisher Yu |
Dense 3D reconstruction and ego-motion estimation are key challenges in
autonomous driving and robotics. Compared to the complex, multi-modal systems
deployed today, multi-camera systems provide a simpler, low-cost alternative.
However, camera-based 3D reconstruction of complex dynamic scenes has proven
extremely difficult, as existing solutions often produce incomplete or
incoherent results. We propose R3D3, a multi-camera system for dense 3D
reconstruction and ego-motion estimation. Our approach iterates between
geometric estimation that exploits spatial-temporal information from multiple
cameras, and monocular depth refinement. We integrate multi-camera feature
correlation and dense bundle adjustment operators that yield robust geometric
depth and pose estimates. To improve reconstruction where geometric depth is
unreliable, e.g. for moving objects or low-textured regions, we introduce
learnable scene priors via a depth refinement network. We show that this design
enables a dense, consistent 3D reconstruction of challenging, dynamic outdoor
environments. Consequently, we achieve state-of-the-art dense depth prediction
on the DDAD and NuScenes benchmarks. |
R3D3 is a novel multi-camera system for dense 3D reconstruction and ego-motion estimation that leverages both spatial and temporal information in dynamic outdoor environments. |
Camera-based 3D reconstruction is crucial for applications like autonomous driving and robotics, providing a simpler and low-cost alternative to multi-modal systems. However, existing solutions often fail to produce complete and consistent 3D reconstructions of complex dynamic scenes. |
The proposed system iterates between geometric depth estimation from multi-camera feature correspondences and monocular depth refinement. A multi-camera dense bundle adjustment operator and a multi-camera co-visibility graph are introduced to enable robust depth and pose estimation. A depth refinement network integrates monocular cues with prior geometric depth and uncertainty to improve reconstruction in challenging areas. |
R3D3 achieves state-of-the-art performance on the DDAD and NuScenes multi-camera depth estimation benchmarks.
The multi-camera dense bundle adjustment operator significantly improves depth accuracy and pose estimation robustness compared to a naive implementation.
The proposed co-visibility graph construction reduces system runtime by nearly 10x while maintaining performance. |
The system relies on deep neural networks with downsampling operations, which can lose high-frequency details and make thin structures difficult to reconstruct.
Further research is needed on integrating additional sensor modalities, such as IMU, to improve pose estimation accuracy. |
3d reconstruction, ego-motion estimation, multi-camera systems, dynamic scenes, deep learning |
2308.14616
Report |
VoroMesh: Learning Watertight Surface Meshes with Voronoi Diagrams |
Nissim Maruani, Roman Klokov, Maks Ovsjanikov, Pierre Alliez, Mathieu Desbrun |
In stark contrast to the case of images, finding a concise, learnable
discrete representation of 3D surfaces remains a challenge. In particular,
while polygon meshes are arguably the most common surface representation used
in geometry processing, their irregular and combinatorial structure often makes
them unsuitable for learning-based applications. In this work, we present
VoroMesh, a novel and differentiable Voronoi-based representation of watertight
3D shape surfaces. From a set of 3D points (called generators) and their
associated occupancy, we define our boundary representation through the Voronoi
diagram of the generators as the subset of Voronoi faces whose two associated
(equidistant) generators are of opposite occupancy: the resulting polygon mesh
forms a watertight approximation of the target shape's boundary. To learn the
position of the generators, we propose a novel loss function, dubbed VoroLoss,
that minimizes the distance from ground truth surface samples to the closest
faces of the Voronoi diagram and that does not require an explicit construction of
the entire Voronoi diagram. A direct optimization of the VoroLoss to obtain
generators on the Thingi32 dataset demonstrates the geometric efficiency of our
representation compared to axiomatic meshing algorithms and recent
learning-based mesh representations. We further use VoroMesh in a
learning-based mesh prediction task from input SDF grids on the ABC dataset,
and show comparable performance to state-of-the-art methods while guaranteeing
closed output surfaces free of self-intersections. |
VoroMesh, a novel differentiable Voronoi-based representation for watertight 3D surface meshes, along with a new loss function, VoroLoss, that minimizes the distance from surface samples to Voronoi facets without explicitly constructing the entire Voronoi diagram. |
Finding a concise and learnable representation for 3D surfaces suitable for learning-based applications is challenging, and VoroMesh addresses this by providing a differentiable and efficient way to represent watertight surfaces. |
VoroMesh optimizes the positions of 3D points called generators to fit a target surface. The surface is then extracted as a subset of the Voronoi diagram of these generators, determined by their assigned occupancies. The VoroLoss leverages geometric properties of Voronoi diagrams to efficiently optimize generator positions. |
VoroMesh outperforms Marching Cubes, Dual Contouring, and two recent learning-based methods in terms of geometric fidelity when fitting to ground truth surfaces.
VoroMesh is robust to noise, making it suitable for learning-based applications.
In a learning-based mesh prediction task from SDF grids, VoroMesh achieves comparable performance to state-of-the-art while guaranteeing closed, non-self-intersecting output meshes. |
Small Voronoi faces can create surface artifacts, requiring post-processing.
The initialization of generators could be improved for better efficiency. |
3d shape representation, surface reconstruction, voronoi diagram, differentiable geometry, deep learning |
2308.14267
Report |
Unleash Model Potential: Bootstrapped Meta Self-supervised Learning |
Jingyao Wang, Zeen Song, Wenwen Qiang, Changwen Zheng |
The long-term goal of machine learning is to learn general visual
representations from a small amount of data without supervision, mimicking
three advantages of human cognition: i) no need for labels, ii) robustness to
data scarcity, and iii) learning from experience. Self-supervised learning and
meta-learning are two promising techniques to achieve this goal, but they both
only partially capture the advantages and fail to address all the problems.
Self-supervised learning struggles to overcome the drawbacks of data scarcity,
while ignoring prior knowledge that can facilitate learning and generalization.
Meta-learning relies on supervised information and suffers from a bottleneck of
insufficient learning. To address these issues, we propose a novel Bootstrapped
Meta Self-Supervised Learning (BMSSL) framework that aims to simulate the human
learning process. We first analyze the close relationship between meta-learning
and self-supervised learning. Based on this insight, we reconstruct tasks to
leverage the strengths of both paradigms, achieving advantages i and ii.
Moreover, we employ a bi-level optimization framework that alternates between
solving specific tasks with a learned ability (first level) and improving this
ability (second level), attaining advantage iii. To fully harness its power, we
introduce a bootstrapped target based on meta-gradient to make the model its
own teacher. We validate the effectiveness of our approach with comprehensive
theoretical and empirical study. |
This paper proposes Bootstrapped Meta Self-Supervised Learning (BMSSL), a novel framework that combines self-supervised and meta-learning to learn general visual representations from limited data without supervision, mimicking human-like learning. |
The goal is to overcome limitations of existing self-supervised and meta-learning methods in addressing data scarcity and incorporating prior knowledge for efficient learning and generalization. |
BMSSL reconstructs self-supervised tasks into few-shot classification problems using data augmentation. It employs a bi-level optimization: inner loop for task-specific learning with contrastive loss and outer loop for meta-learning optimal initialization using a bootstrapped target based on meta-gradient. |
BMSSL achieves superior performance on standard and cross-domain few-shot classification benchmarks, outperforming previous unsupervised meta-learning methods.
It exhibits competitive generalization capability compared to supervised meta-learning and self-supervised baselines.
Theoretical analysis provides performance guarantees for BMSSL's task construction and bootstrapped meta-training. |
The evaluation primarily focuses on visual tasks, without exploring its effectiveness in other domains like reinforcement learning or language processing.
Future work includes investigating its applicability to a wider range of tasks beyond classification, such as regression and generation. |
self-supervised learning, meta-learning, few-shot learning, representation learning, computer vision |
2308.14244
Report |
HoloFusion: Towards Photo-realistic 3D Generative Modeling |
Animesh Karnewar, Niloy J. Mitra, Andrea Vedaldi, David Novotny |
Diffusion-based image generators can now produce high-quality and diverse
samples, but their success has yet to fully translate to 3D generation:
existing diffusion methods can either generate low-resolution but 3D consistent
outputs, or detailed 2D views of 3D objects but with potential structural
defects and lacking view consistency or realism. We present HoloFusion, a
method that combines the best of these approaches to produce high-fidelity,
plausible, and diverse 3D samples while learning from a collection of
multi-view 2D images only. The method first generates coarse 3D samples using a
variant of the recently proposed HoloDiffusion generator. Then, it
independently renders and upsamples a large number of views of the coarse 3D
model, super-resolves them to add detail, and distills those into a single,
high-fidelity implicit 3D representation, which also ensures view consistency
of the final renders. The super-resolution network is trained as an integral
part of HoloFusion, end-to-end, and the final distillation uses a new sampling
scheme to capture the space of super-resolved signals. We compare our method
against existing baselines, including DreamFusion, Get3D, EG3D, and
HoloDiffusion, and achieve, to the best of our knowledge, the most realistic
results on the challenging CO3Dv2 dataset. |
This paper proposes HoloFusion, a method that combines a 3D diffusion model (HoloDiffusion) with a jointly trained 2D super-resolution network to generate high-fidelity 3D radiance fields from multi-view 2D images. |
Current 3D generation methods struggle to achieve both high resolution and 3D consistency. HoloFusion addresses these limitations by leveraging the strengths of both 2D and 3D diffusion models. |
HoloFusion first generates coarse 3D models using a modified HoloDiffusion. Then, it renders multiple views, super-resolves them using a 2D diffusion model, and finally distills the super-resolved images into a single high-resolution 3D model using a novel patch-based optimization strategy. |
Achieves state-of-the-art results on the CO3Dv2 dataset, outperforming baselines like DreamFusion, EG3D, and Get3D in terms of realism and view consistency.
Demonstrates the effectiveness of integrating 2D super-resolution with 3D diffusion for high-quality 3D generation.
Proposes a novel patch-based distillation technique that improves the fusion of multiple super-resolved views into a coherent 3D model. |
The generation process is slow due to the distillation step.
The method doesn't explicitly generate a surface representation like a mesh. |
3d generation, diffusion models, super-resolution, neural radiance fields, view consistency |
2308.14078
Report |
Sparse3D: Distilling Multiview-Consistent Diffusion for Object Reconstruction from Sparse Views |
Zi-Xin Zou, Weihao Cheng, Yan-Pei Cao, Shi-Sheng Huang, Ying Shan, Song-Hai Zhang |
Reconstructing 3D objects from extremely sparse views is a long-standing and
challenging problem. While recent techniques employ image diffusion models for
generating plausible images at novel viewpoints or for distilling pre-trained
diffusion priors into 3D representations using score distillation sampling
(SDS), these methods often struggle to simultaneously achieve high-quality,
consistent, and detailed results for both novel-view synthesis (NVS) and
geometry. In this work, we present Sparse3D, a novel 3D reconstruction method
tailored for sparse view inputs. Our approach distills robust priors from a
multiview-consistent diffusion model to refine a neural radiance field.
Specifically, we employ a controller that harnesses epipolar features from
input views, guiding a pre-trained diffusion model, such as Stable Diffusion,
to produce novel-view images that maintain 3D consistency with the input. By
tapping into 2D priors from powerful image diffusion models, our integrated
model consistently delivers high-quality results, even when faced with
open-world objects. To address the blurriness introduced by conventional SDS,
we introduce the category-score distillation sampling (C-SDS) to enhance
detail. We conduct experiments on CO3DV2 which is a multi-view dataset of
real-world objects. Both quantitative and qualitative evaluations demonstrate
that our approach outperforms previous state-of-the-art works on the metrics
regarding NVS and geometry reconstruction. |
This paper presents Sparse3D, a novel 3D reconstruction method that leverages multiview-consistent diffusion models to refine neural radiance fields for high-fidelity 3D object reconstruction from sparse views. |
Reconstructing 3D objects from sparse view images is crucial for applications like AR/VR, but existing methods struggle to generate consistent and detailed results, especially for novel view synthesis and geometry reconstruction. |
The method uses epipolar features from input views to guide a pre-trained diffusion model (Stable Diffusion) to generate consistent novel views. It introduces a category-score distillation sampling (C-SDS) strategy to enhance details in the reconstructed NeRF, addressing blurriness common in existing SDS methods. |
Sparse3D outperforms state-of-the-art methods in novel view synthesis quality and geometry reconstruction on the CO3DV2 dataset.
It exhibits superior generalization, producing high-quality results even for unseen object categories.
The proposed C-SDS strategy effectively enhances details in the reconstructed NeRF compared to traditional SDS methods. |
Limitations include challenges with extremely partial object observations and occasional occurrences of the Janus problem.
Future work may explore more efficient 3D representations or feed-forward model priors for improved computational efficiency. |
3d reconstruction, sparse view synthesis, neural radiance fields, diffusion models, score distillation sampling |
2308.13897
Report |
InsertNeRF: Instilling Generalizability into NeRF with HyperNet Modules |
Yanqi Bao, Tianyu Ding, Jing Huo, Wenbin Li, Yuxin Li, Yang Gao |
Generalizing Neural Radiance Fields (NeRF) to new scenes is a significant
challenge that existing approaches struggle to address without extensive
modifications to vanilla NeRF framework. We introduce InsertNeRF, a method for
INStilling gEneRalizabiliTy into NeRF. By utilizing multiple plug-and-play
HyperNet modules, InsertNeRF dynamically tailors NeRF's weights to specific
reference scenes, transforming multi-scale sampling-aware features into
scene-specific representations. This novel design allows for more accurate and
efficient representations of complex appearances and geometries. Experiments
show that this method not only achieves superior generalization performance but
also provides a flexible pathway for integration with other NeRF-like systems,
even in sparse input settings. Code will be available at
https://github.com/bbbbby-99/InsertNeRF. |
This paper introduces InsertNeRF, a novel method that instills generalizability into Neural Radiance Fields (NeRF) by using plug-and-play HyperNet modules to dynamically adapt NeRF's weights to specific scenes based on reference images. |
Existing methods for generalizing NeRF to new scenes require significant modifications to the original framework or rely on computationally expensive structures like transformers. InsertNeRF offers a more efficient and flexible alternative by preserving the original NeRF architecture. |
InsertNeRF leverages multiple HyperNet modules within the NeRF framework. These modules generate scene-specific weights based on multi-scale features extracted from reference images, aggregated using a novel multi-layer dynamic-static strategy. This strategy effectively captures scene details and models occlusions for accurate view synthesis. |
InsertNeRF achieves state-of-the-art generalization performance on standard benchmarks (NeRF Synthetic, LLFF, DTU) outperforming existing methods in terms of PSNR, SSIM, and LPIPS.
The plug-and-play nature of HyperNet modules allows InsertNeRF to be easily integrated with other NeRF-like systems like mip-NeRF and NeRF++, demonstrating its versatility and effectiveness in various scenarios.
InsertNeRF shows promising results in view synthesis with sparse inputs, suggesting its potential for applications with limited training data. |
The current implementation of InsertNeRF requires consistent sampling point numbers during training and evaluation, potentially limiting its rendering performance compared to methods that use different settings.
Further exploration is needed to optimize InsertNeRF for sparse input scenarios, including developing fine-tuning strategies for improved results. |
neural radiance fields, generalizable nerf, hypernetworks, view synthesis, novel view synthesis |
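The HyperNet idea in the InsertNeRF entry above, where a module predicts scene-specific layer weights from reference-image features, can be illustrated with a minimal NumPy sketch. The class name `HyperLinear`, the feature dimensions, and the single linear hypernetwork are all hypothetical choices for illustration; the paper's actual modules, feature extractor, and dynamic-static aggregation are not reproduced here.

```python
import numpy as np


class HyperLinear:
    """Linear layer whose weights are predicted from a scene feature vector.

    Hypothetical illustration only, not the InsertNeRF implementation.
    """

    def __init__(self, feat_dim, in_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.in_dim, self.out_dim = in_dim, out_dim
        # Hypernetwork parameters: shared across scenes (learned in practice,
        # random here so that the sketch is self-contained).
        self.A = rng.normal(scale=0.1, size=(feat_dim, in_dim * out_dim))
        self.b = rng.normal(scale=0.1, size=(feat_dim, out_dim))

    def weights_for(self, scene_feat):
        """Map one scene feature vector to a (weight, bias) pair."""
        W = (scene_feat @ self.A).reshape(self.in_dim, self.out_dim)
        bias = scene_feat @ self.b
        return W, bias

    def __call__(self, x, scene_feat):
        """Apply the scene-specific layer followed by a ReLU."""
        W, bias = self.weights_for(scene_feat)
        return np.maximum(x @ W + bias, 0.0)


rng = np.random.default_rng(1)
layer = HyperLinear(feat_dim=8, in_dim=16, out_dim=32)
points = rng.normal(size=(5, 16))   # stand-in for encoded sample points
scene_a = rng.normal(size=8)        # stand-in for features of scene A
scene_b = rng.normal(size=8)        # stand-in for features of scene B
out_a = layer(points, scene_a)
W_a, _ = layer.weights_for(scene_a)
W_b, _ = layer.weights_for(scene_b)
```

The same `layer` object yields different weights for `scene_a` and `scene_b`, which is the sense in which the base network stays fixed while its weights are tailored to each scene.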
2308.13812
Report |
Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs |
Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Tat-Seng Chua |
Text-to-video (T2V) synthesis has gained increasing attention in the
community, in which the recently emerged diffusion models (DMs) have
promisingly shown stronger performance than the past approaches. While existing
state-of-the-art DMs are competent to achieve high-resolution video generation,
they may largely suffer from key limitations (e.g., action occurrence
disorders, crude video motions) with respect to the intricate temporal dynamics
modeling, a crux of video synthesis. In this work, we investigate
strengthening the awareness of video dynamics for DMs, for high-quality T2V
generation. Inspired by human intuition, we design an innovative dynamic scene
manager (dubbed Dysen) module, which includes (step-1) extracting from input
text the key actions with proper time-order arrangement, (step-2) transforming
the action schedules into the dynamic scene graph (DSG) representations, and
(step-3) enriching the scenes in the DSG with sufficient and reasonable
details. Taking advantage of the existing powerful LLMs (e.g., ChatGPT) via
in-context learning, Dysen realizes (nearly) human-level temporal dynamics
understanding. Finally, the resulting video DSG with rich action scene details
is encoded as fine-grained spatio-temporal features, integrated into the
backbone T2V DM for video generation. Experiments on popular T2V datasets
suggest that our Dysen-VDM consistently outperforms prior arts with significant
margins, especially in scenarios with complex actions. Codes at
https://haofei.vip/Dysen-VDM |
Presents Dysen-VDM, a dynamics-aware text-to-video diffusion model that leverages LLMs for improved temporal dynamics modeling, addressing issues like action disorders and crude motions in existing methods. |
Existing text-to-video synthesis methods, while achieving high resolution, often overlook the crucial aspect of intricate temporal dynamics modeling, leading to unrealistic and low-quality video generation. |
The Dysen module extracts key actions from text, converts them into Dynamic Scene Graphs (DSGs), and enriches these DSGs with details using ChatGPT (LLM) via in-context learning. A recurrent graph Transformer encodes the enriched DSGs into fine-grained features, integrated into a backbone video diffusion model for enhanced generation. |
Dysen-VDM significantly outperforms prior arts on UCF-101, MSR-VTT, and ActivityNet datasets, especially in action-complex scenarios.
Human evaluation confirms superior performance in action faithfulness, scene richness, and movement fluency.
Ablation studies validate the contributions of the Dysen module, scene imagination, and RGTrm. |
LLM hallucinations can occasionally lead to scene understanding errors, impacting video quality.
DSG-based scene representation may not be suitable for all video styles, such as abstract or cartoon-style content. |
text-to-video synthesis, diffusion models, dynamic scene graphs, large language models, temporal dynamics modeling |
2308.13680
Report |
ACC-UNet: A Completely Convolutional UNet model for the 2020s |
Nabil Ibtehaz, Daisuke Kihara |
This decade is marked by the introduction of Vision Transformer, a radical
paradigm shift in broad computer vision. A similar trend is followed in medical
imaging: UNet, one of the most influential architectures, has been redesigned
with transformers. Recently, the efficacy of convolutional models in vision is
being reinvestigated by seminal works such as ConvNext, which elevates a ResNet
to Swin Transformer level. Deriving inspiration from this, we aim to improve a
purely convolutional UNet model so that it can be on par with the
transformer-based models, e.g., Swin-Unet or UCTransNet. We examined several
advantages of the transformer-based UNet models, primarily long-range
dependencies and cross-level skip connections. We attempted to emulate them
through convolution operations and thus propose ACC-UNet, a completely
convolutional UNet model that brings the best of both worlds, the inherent
inductive biases of convnets with the design decisions of transformers.
ACC-UNet was evaluated on 5 different medical image segmentation benchmarks and
consistently outperformed convnets, transformers, and their hybrids. Notably,
ACC-UNet outperforms state-of-the-art models Swin-Unet and UCTransNet by $2.64
\pm 2.54\%$ and $0.45 \pm 1.61\%$ in terms of dice score, respectively, while
using a fraction of their parameters ($59.26\%$ and $24.24\%$). Our codes are
available at https://github.com/kiharalab/ACC-UNet. |
Presents ACC-UNet, a novel fully convolutional UNet model for medical image segmentation that incorporates design principles from transformers: long-range dependency through hierarchical neighborhood context aggregation (HANC) and multi-level feature combination via multi-level feature compilation (MLFC) in the skip connections. |
Existing UNet models either rely solely on convolutions or incorporate transformers, lacking a solution that effectively integrates the strengths of both. |
The authors designed HANC blocks with inverted bottlenecks and hierarchical neighborhood context aggregation to mimic the long-range dependency achieved by self-attention in transformers. Additionally, they introduced MLFC blocks in the skip connections to fuse feature maps from multiple encoder levels, inspired by transformer-based UNets. |
ACC-UNet consistently outperformed convolutional, transformer-based, and hybrid UNet models on five different medical image segmentation benchmarks.
It surpassed state-of-the-art models like Swin-Unet and UCTransNet in terms of dice score while utilizing significantly fewer parameters.
Qualitative results demonstrate ACC-UNet's ability to accurately segment regions of interest, effectively capturing boundaries and distinguishing between different tissues. |
The reliance on concatenation operations in ACC-UNet leads to computational slowdown, which the authors aim to address in future work through optimized implementations.
Further exploration of transformer-inspired innovations, such as layer normalization, GELU activation, and AdamW optimizer, is planned to further enhance the model's performance. |
unet, medical image segmentation, convolutional neural networks, transformers, deep learning |
2308.13404
Report |
Relighting Neural Radiance Fields with Shadow and Highlight Hints |
Chong Zeng, Guojun Chen, Yue Dong, Pieter Peers, Hongzhi Wu, Xin Tong |
This paper presents a novel neural implicit radiance representation for free
viewpoint relighting from a small set of unstructured photographs of an object
lit by a moving point light source different from the view position. We express
the shape as a signed distance function modeled by a multi layer perceptron. In
contrast to prior relightable implicit neural representations, we do not
disentangle the different reflectance components, but model both the local and
global reflectance at each point by a second multi layer perceptron that, in
addition to density features, the current position, the normal (from the
signed distance function), view direction, and light position, also takes shadow
and highlight hints to aid the network in modeling the corresponding high
frequency light transport effects. These hints are provided as a suggestion,
and we leave it up to the network to decide how to incorporate these in the
final relit result. We demonstrate and validate our neural implicit
representation on synthetic and real scenes exhibiting a wide variety of
shapes, material properties, and global illumination light transport. |
This paper introduces a novel neural implicit radiance representation for free viewpoint relighting of objects and scenes using a small set of unstructured photographs. |
Existing relighting methods for neural implicit representations often require a large number of images, rely on simplified lighting or BRDF models, and have difficulty handling complex light transport effects. |
The method uses two MLPs: one for modeling the SDF (as in NeuS) and another for modeling radiance. It incorporates shadow and highlight hints to guide the radiance MLP in capturing high-frequency light transport effects. The model is trained jointly using an image reconstruction loss and an SDF regularization loss. |
The method achieves high-quality relighting with a small number of input images (~500) compared to prior work.
Shadow and highlight hints are shown to be crucial for accurately reproducing these effects.
The method effectively handles complex shapes, materials, and global illumination effects. |
The method struggles with highly specular surfaces reflecting the scene due to limitations in surface normal accuracy.
Exploring alternative approaches to handle other high-frequency light transport effects beyond shadows and highlights is left for future work. |
relighting, free-viewpoint, neural implicit modeling, neural radiance fields, light transport hints |
2308.13266
Report |
Integrating Boxes and Masks: A Multi-Object Framework for Unified Visual Tracking and Segmentation |
Yuanyou Xu, Zongxin Yang, Yi Yang |
Tracking any given object(s) spatially and temporally is a common purpose in
Visual Object Tracking (VOT) and Video Object Segmentation (VOS). Joint
tracking and segmentation have been attempted in some studies but they often
lack full compatibility of both box and mask in initialization and prediction,
and mainly focus on single-object scenarios. To address these limitations, this
paper proposes a Multi-object Mask-box Integrated framework for unified
Tracking and Segmentation, dubbed MITS. Firstly, the unified identification
module is proposed to support both box and mask reference for initialization,
where detailed object information is inferred from boxes or directly retained
from masks. Additionally, a novel pinpoint box predictor is proposed for
accurate multi-object box prediction, facilitating target-oriented
representation learning. All target objects are processed simultaneously from
encoding to propagation and decoding, as a unified pipeline for VOT and VOS.
Experimental results show MITS achieves state-of-the-art performance on both
VOT and VOS benchmarks. Notably, MITS surpasses the best prior VOT competitor
by around 6% on the GOT-10k test set, and significantly improves the
performance of box initialization on VOS benchmarks. The code is available at
https://github.com/yoxu515/MITS. |
Presents MITS, a multi-object framework integrating boxes and masks for unified visual object tracking and segmentation. |
Prior works lack full compatibility of box and mask representations, and mainly focus on single-object scenarios. This work aims to unify visual object tracking and segmentation in a multi-object framework. |
MITS leverages a unified identification module for box/mask initialization and a pinpoint box predictor for accurate box prediction, achieving simultaneous multi-object processing in an encoding-propagation-decoding pipeline. |
Achieves state-of-the-art performance on VOT benchmarks (LaSOT, TrackingNet, GOT-10k) and VOS benchmark (YouTube-VOS).
Surpasses the best prior VOT competitor by around 6% on GOT-10k test set.
Significantly improves the performance of box initialization on VOS benchmarks. |
The pinpoint box predictor might not generalize well to objects with complex shapes.
The model's efficiency could be further improved for real-time applications with high frame rate videos. |
visual object tracking, video object segmentation, multi-object tracking, deep learning, computer vision |
2308.13252
Report |
Kissing to Find a Match: Efficient Low-Rank Permutation Representation |
Hannah Dröge, Zorah Lähner, Yuval Bahat, Onofre Martorell, Felix Heide, Michael Möller |
Permutation matrices play a key role in matching and assignment problems
across fields, especially in computer vision and robotics. However, memory
for explicitly representing permutation matrices grows quadratically with the
size of the problem, prohibiting large problem instances. In this work, we
propose to tackle the curse of dimensionality of large permutation matrices by
approximating them using low-rank matrix factorization, followed by a
nonlinearity. To this end, we rely on the Kissing number theory to infer the
minimal rank required for representing a permutation matrix of a given size,
which is significantly smaller than the problem size. This leads to a drastic
reduction in computation and memory costs, e.g., up to $3$ orders of magnitude
less memory for a problem of size $n=20000$, represented using $8.4\times10^5$
elements in two small matrices instead of using a single huge matrix with
$4\times 10^8$ elements. The proposed representation allows for accurate
representations of large permutation matrices, which in turn enables handling
large problems that would have been infeasible otherwise. We demonstrate the
applicability and merits of the proposed approach through a series of
experiments on a range of problems that involve predicting permutation
matrices, from linear and quadratic assignment to shape matching problems. |
This paper introduces a memory-efficient representation for permutation matrices, especially beneficial for large-scale matching and assignment problems in computer vision and robotics. |
Explicitly representing permutation matrices incurs quadratic memory growth with problem size, rendering large instances infeasible. The proposed method overcomes this limitation, enabling handling of previously intractable problem sizes. |
The core idea is to approximate permutation matrices using low-rank matrix factorization followed by a nonlinearity (ReLU or Softmax). The Kissing number theory guides the determination of the minimal rank needed, significantly smaller than the problem size. A stochastic optimization strategy further enhances memory efficiency. |
The method allows accurate representation of large permutation matrices with significantly reduced memory footprint (e.g., 3 orders of magnitude reduction for n=20000).
Experiments on point cloud alignment, linear/quadratic assignment problems, and shape matching demonstrate the applicability and effectiveness of the approach.
The approach allows for a trade-off between accuracy and memory usage, enabling handling of high-resolution data. |
Stochastic learning might not be suitable for all problem formulations, such as specific QAP forms.
The method might require non-trivial, problem-specific adaptations for successful application. |
permutation matrix, low-rank representation, kissing number, stochastic optimization, shape matching |
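The memory arithmetic and the "low-rank factors followed by a nonlinearity" idea in the Kissing to Find a Match entry above can be made concrete with a small NumPy sketch. Two assumptions are worth flagging: the rank of 21 is inferred here from the quoted counts (8.4e5 elements / (2 x 20000) = 21) rather than stated in the summary, and the factors are built from random unit vectors as an illustrative stand-in for the learned factors used in the paper. All function names are hypothetical.

```python
import numpy as np


def element_counts(n, rank):
    """Return (dense, factored) element counts for an n x n permutation."""
    return n * n, 2 * n * rank


def factor_permutation(perm, rank, seed=0):
    """Build V, W so that row i of V @ W.T peaks at column perm[i].

    Illustrative construction: rows of W are random unit vectors, and row i
    of V copies row perm[i] of W, so the matching inner product equals 1
    while every other inner product is strictly smaller.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(perm.size, rank))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    V = W[perm]
    return V, W


def decode_rows(V, W, rows, temperature=0.05):
    """Row-wise softmax over selected rows; never forms the full n x n matrix."""
    logits = (V[rows] @ W.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs


dense, factored = element_counts(20000, 21)

n = 500
perm = np.random.default_rng(1).permutation(n)
V, W = factor_permutation(perm, rank=21)
probs = decode_rows(V, W, rows=np.arange(n))
recovered = probs.argmax(axis=1)
```

Because `decode_rows` only touches the requested rows, a subset of rows can be decoded per step, which is one plausible reading of how a stochastic strategy keeps memory low.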
2308.13223
Report |
EfficientDreamer: High-Fidelity and Robust 3D Creation via Orthogonal-view Diffusion Prior |
Zhipeng Hu, Minda Zhao, Chaoyi Zhao, Xinyue Liang, Lincheng Li, Zeng Zhao, Changjie Fan, Xiaowei Zhou, Xin Yu |
While image diffusion models have made significant progress in text-driven 3D
content creation, they often fail to accurately capture the intended meaning of
text prompts, especially for view information. This limitation leads to the
Janus problem, where multi-faced 3D models are generated under the guidance of
such diffusion models. In this paper, we propose a robust high-quality 3D
content generation pipeline by exploiting orthogonal-view image guidance.
First, we introduce a novel 2D diffusion model that generates an image
consisting of four orthogonal-view sub-images based on the given text prompt.
Then, the 3D content is created using this diffusion model. Notably, the
generated orthogonal-view image provides strong geometric structure priors and
thus improves 3D consistency. As a result, it effectively resolves the Janus
problem and significantly enhances the quality of 3D content creation.
Additionally, we present a 3D synthesis fusion network that can further improve
the details of the generated 3D contents. Both quantitative and qualitative
evaluations demonstrate that our method surpasses previous text-to-3D
techniques. Project page: https://efficientdreamer.github.io. |
Presents EfficientDreamer, a method for high-fidelity and stable text-to-3D creation using orthogonal-view diffusion priors to address the Janus problem (inconsistent 3D generation from text prompts, especially view information). |
Existing text-to-3D methods often fail to accurately capture view instructions in text prompts, leading to inconsistent 3D models (e.g., multi-faced). This work aims to improve the stability and quality of 3D content creation. |
Introduces an orthogonal-view diffusion model trained on a large 3D dataset (Objaverse) to generate composite images with consistent orthogonal views. Uses this model as a prior, along with a pre-trained text-to-image diffusion model, in a two-stage coarse-to-fine optimization process to generate 3D models. A 3D synthesis fusion network dynamically balances the guidance from both diffusion models. |
Effectively resolves the Janus problem by enforcing 3D consistency through orthogonal-view supervision.
Achieves superior 3D content quality compared to state-of-the-art methods, as demonstrated by quantitative metrics (CLIP score, FID) and user studies.
Shows the benefits of a two-stage optimization process and the dynamic fusion of orthogonal-view and text-to-image diffusion priors. |
The scale of the 3D dataset used to train the orthogonal-view diffusion model is limited compared to text-image datasets.
Future work could explore alternative view supervision strategies or incorporate more diverse 3D data. |
text-to-3d, diffusion models, orthogonal-view supervision, janus problem, 3d content creation |
2308.13218
Report |
MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning |
Bang Yang, Fenglin Liu, Xian Wu, Yaowei Wang, Xu Sun, Yuexian Zou |
Supervised visual captioning models typically require a large number of images
or videos paired with descriptions in a specific language (i.e., the
vision-caption pairs) for training. However, collecting and labeling
large-scale datasets is time-consuming and expensive for many scenarios and
languages. Therefore, sufficient labeled pairs are usually not available. To
deal with the label shortage problem, we present a simple yet effective
zero-shot approach MultiCapCLIP that can generate visual captions for different
scenarios and languages without any labeled vision-caption pairs of downstream
datasets. In the training stage, MultiCapCLIP only requires text data for
input. Then it conducts two main steps: 1) retrieving concept prompts that
preserve the corresponding domain knowledge of new scenarios; 2) auto-encoding
the prompts to learn writing styles to output captions in a desired language.
In the testing stage, MultiCapCLIP instead takes visual data as input directly
to retrieve the concept prompts to generate the final visual descriptions. The
extensive experiments on image and video captioning across four benchmarks and
four languages (i.e., English, Chinese, German, and French) confirm the
effectiveness of our approach. Compared with state-of-the-art zero-shot and
weakly-supervised methods, our method achieves 4.8% and 21.5% absolute
improvements in terms of BLEU@4 and CIDEr metrics. Our code is available at
https://github.com/yangbang18/MultiCapCLIP. |
Presents MultiCapCLIP, a simple yet effective zero-shot approach for generating visual captions in different scenarios and languages without labeled vision-caption pairs. |
Addresses the challenge of limited labeled data in visual captioning, particularly for non-English languages, by enabling zero-shot multilingual caption generation. |
Utilizes a CLIP-based vision-language model with a prompt-based auto-encoder. It retrieves concept prompts preserving domain knowledge and auto-encodes them to learn writing styles for captioning. During inference, visual input is used to retrieve prompts and generate descriptions. |
Achieves competitive performance on zero-shot multilingual visual captioning across English and Chinese, outperforming previous methods reliant on large datasets.
Significantly outperforms existing zero-shot and weakly-supervised methods in in-domain experiments.
Demonstrates robustness by effectively extending to German and French image captioning. |
Requires independent text data for training, which might be challenging to collect for some low-resource languages.
Relies on CLIP for measuring text similarity, which may not be optimal for intra-modal retrieval and could be improved with better models. |
zero-shot learning, visual captioning, multilingual captioning, clip, prompt-based learning |
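The concept-prompt retrieval step in the MultiCapCLIP entry above can be sketched as a top-k cosine-similarity lookup in a shared embedding space. This is a hedged illustration: the toy vectors below stand in for CLIP embeddings (text embeddings during training, image or video embeddings at test time), and the function name `retrieve_concepts`, the concept list, and `k` are hypothetical.

```python
import numpy as np


def retrieve_concepts(query, concept_embs, concepts, k=3):
    """Return the k concepts whose embeddings are most similar to the query."""
    q = query / np.linalg.norm(query)
    c = concept_embs / np.linalg.norm(concept_embs, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity per concept
    top = np.argsort(-scores)[:k]       # indices of the k highest scores
    return [concepts[i] for i in top]


# Toy 4-dimensional embeddings; real ones would come from a CLIP encoder.
concepts = ["dog", "frisbee", "park", "car"]
concept_embs = np.eye(4)
query = np.array([0.9, 0.5, 0.1, 0.0])  # stand-in for an encoded input
prompts = retrieve_concepts(query, concept_embs, concepts, k=2)
```

Since the query only needs to live in the same space as the concept embeddings, the same function serves text queries at training time and visual queries at test time.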
2308.13175
Report |
GridPull: Towards Scalability in Learning Implicit Representations from 3D Point Clouds |
Chao Chen, Yu-Shen Liu, Zhizhong Han |
Learning implicit representations has been a widely used solution for surface
reconstruction from 3D point clouds. The latest methods infer a distance or
occupancy field by overfitting a neural network on a single point cloud.
However, these methods suffer from slow inference due to the slow convergence
of neural networks and the extensive calculation of distances to surface
points, which limits them to small-scale point clouds. To resolve the scalability
issue in surface reconstruction, we propose GridPull to improve the efficiency
of learning implicit representations from large scale point clouds. Our novelty
lies in the fast inference of a discrete distance field defined on grids
without using any neural components. To remedy the lack of continuousness
brought by neural networks, we introduce a loss function to encourage
continuous distances and consistent gradients in the field during pulling
queries onto the surface in grids near the surface. We use uniform grids for
a fast grid search to localize sampled queries, and organize surface points in
a tree structure to speed up the calculation of distances to the surface. We do
not rely on learning priors or normal supervision during optimization, and
achieve superiority over the latest methods in terms of complexity and
accuracy. We evaluate our method on shape and scene benchmarks, and report
numerical and visual comparisons with the latest methods to justify our
effectiveness and superiority. The code is available at
https://github.com/chenchao15/GridPull. |
Proposes GridPull, a method for reconstructing surfaces from large-scale 3D point clouds by efficiently learning implicit representations without neural networks. |
Addresses the scalability limitations of existing neural implicit representation methods, which struggle with slow inference on large point clouds due to extensive distance calculations and slow neural network convergence. |
Directly infers a discrete distance field on a grid by pulling sampled queries onto the surface. Introduces a loss function encouraging continuous distances and consistent gradients to compensate for the lack of neural network continuity. Uses uniform grids and a tree structure for efficient nearest neighbor search and distance calculation. |
Achieves superior accuracy in surface reconstruction compared to state-of-the-art methods on benchmarks like ShapeNet, FAMOUS, SRB, Thingi10K, D-FAUST, 3DScene, SceneNet, 3D-FRONT, Matterport, and KITTI.
Demonstrates significantly faster inference speed compared to neural network-based approaches, making it suitable for large-scale point clouds.
Shows robustness to noise and ability to handle varying point cloud densities effectively. |
Current implementation uses a fixed grid resolution, which could be improved with adaptive resolution schemes.
Exploring alternative distance field representations beyond grids, such as octrees or hash tables, could further enhance performance. |
surface reconstruction, implicit representations, 3d point clouds, distance fields, scalability |
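The "pulling" operation in the GridPull entry above can be sketched on a discrete distance grid, assuming that pulling moves a query along the normalized field gradient by its stored distance. In this sketch an analytic sphere stands in for the optimized grid values, nearest-cell lookup replaces interpolation, and the paper's loss terms and tree-based nearest-neighbor search are not reproduced. Function names are hypothetical.

```python
import numpy as np


def sphere_distance_grid(res=64, radius=0.5):
    """Signed distance to a sphere, sampled on a uniform grid over [-1, 1]^3."""
    axis = np.linspace(-1.0, 1.0, res)
    x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
    dist = np.sqrt(x**2 + y**2 + z**2) - radius
    return axis, dist


def pull_queries(queries, axis, dist):
    """Move each query by its grid distance along the normalized grid gradient."""
    step = axis[1] - axis[0]
    # Finite-difference gradient of the discrete field, shape (res, res, res, 3).
    grads = np.stack(np.gradient(dist, step), axis=-1)
    # Uniform grid makes localization a constant-time index computation.
    idx = np.rint((queries - axis[0]) / step).astype(int)
    idx = np.clip(idx, 0, axis.size - 1)
    i, j, k = idx[:, 0], idx[:, 1], idx[:, 2]
    d = dist[i, j, k]
    g = grads[i, j, k]
    g = g / (np.linalg.norm(g, axis=1, keepdims=True) + 1e-12)
    return queries - d[:, None] * g


rng = np.random.default_rng(0)
dirs = rng.normal(size=(200, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
radii = rng.uniform(0.3, 0.8, size=(200, 1))
queries = dirs * radii                  # points near, but not on, the sphere

axis, dist = sphere_distance_grid()
pulled = pull_queries(queries, axis, dist)
pulled_radii = np.linalg.norm(pulled, axis=1)
```

After one pull the queries sit close to the sphere of radius 0.5, with a residual error on the order of the grid spacing because of the nearest-cell lookup.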
2308.13164
Report |
Diff-Retinex: Rethinking Low-light Image Enhancement with A Generative Diffusion Model |
Xunpeng Yi, Han Xu, Hao Zhang, Linfeng Tang, Jiayi Ma |
In this paper, we rethink the low-light image enhancement task and propose a
physically explainable and generative diffusion model for low-light image
enhancement, termed as Diff-Retinex. We aim to integrate the advantages of the
physical model and the generative network. Furthermore, we hope to supplement
and even deduce the information missing in the low-light image through the
generative network. Therefore, Diff-Retinex formulates the low-light image
enhancement problem into Retinex decomposition and conditional image
generation. In the Retinex decomposition, we integrate the superiority of
attention in Transformer and meticulously design a Retinex Transformer
decomposition network (TDN) to decompose the image into illumination and
reflectance maps. Then, we design multi-path generative diffusion networks to
reconstruct the normal-light Retinex probability distribution and solve the
various degradations in these components respectively, including dark
illumination, noise, color deviation, loss of scene contents, etc. Owing to the
generative diffusion model, Diff-Retinex puts the restoration of low-light
subtle detail into practice. Extensive experiments conducted on real-world
low-light datasets qualitatively and quantitatively demonstrate the
effectiveness, superiority, and generalization of the proposed method. |
This paper presents Diff-Retinex, a generative diffusion model for low-light image enhancement based on Retinex decomposition, aiming to recover missing information and correct color deviations. |
Existing LLIE methods often struggle to recover missing scene content and suffer from limitations of traditional or GAN-based generative approaches. |
Diff-Retinex uses a Retinex Transformer Decomposition Network (TDN) to decompose images into illumination and reflectance maps. Then, multi-path diffusion models (RDA and IDA) refine these maps by learning the distribution of normal-light components. |
Diff-Retinex demonstrates superior texture completion and more plausible generation of missing scene content compared to state-of-the-art methods.
The method exhibits better illumination and color fidelity, resulting in more visually pleasing enhanced images.
Qualitative and quantitative evaluations on LOL and VE-LOL-L datasets demonstrate the effectiveness and generalization ability of Diff-Retinex. |
While excelling in visual quality, Diff-Retinex may not achieve top performance on pixel-wise error metrics like PSNR.
Future work could explore achieving better pixel-level accuracy with diffusion models for low-light enhancement. |
low-light image enhancement, diffusion models, retinex decomposition, generative models, image restoration |
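The Retinex decomposition that this report refers to can be illustrated with the classical model, in which an image is the element-wise product of a reflectance map and an illumination map. The sketch below is only a toy illustration of that model on a 1-D row of intensities. The local-maximum illumination estimator and all function names are assumptions made for the example; they stand in for the learned TDN, whose details the report does not give.

```python
# Classical Retinex model: image = reflectance * illumination (element-wise).
# Assumption: a crude local-maximum estimate replaces the learned TDN.

def estimate_illumination(row, radius=1, eps=1e-3):
    """Hypothetical estimator: local maximum over a small 1-D window."""
    n = len(row)
    illum = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        # eps keeps the later division well defined in fully dark regions.
        illum.append(max(max(row[lo:hi]), eps))
    return illum


def decompose(row):
    """Split intensities into (reflectance, illumination)."""
    illum = estimate_illumination(row)
    refl = [v / l for v, l in zip(row, illum)]
    return refl, illum


def recompose(refl, illum):
    """Multiply the two maps back together."""
    return [r * l for r, l in zip(refl, illum)]


low_light = [0.02, 0.05, 0.04, 0.10, 0.08]
refl, illum = decompose(low_light)
restored = recompose(refl, illum)
```

In an enhancement pipeline of this kind, the two maps would be refined separately before being multiplied back together; here they are recombined unchanged, so `restored` simply reproduces the input.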
2308.13133
Report |
AccFlow: Backward Accumulation for Long-Range Optical Flow |
Guangyang Wu, Xiaohong Liu, Kunming Luo, Xi Liu, Qingqing Zheng, Shuaicheng Liu, Xinyang Jiang, Guangtao Zhai, Wenyi Wang |
Recent deep learning-based optical flow estimators have exhibited impressive
performance in generating local flows between consecutive frames. However, the
estimation of long-range flows between distant frames, particularly under
complex object deformation and large motion occlusion, remains a challenging
task. One promising solution is to accumulate local flows explicitly or
implicitly to obtain the desired long-range flow. Nevertheless, the
accumulation errors and flow misalignment can hinder the effectiveness of this
approach. This paper proposes a novel recurrent framework called AccFlow, which
recursively backward accumulates local flows using a deformable module called
AccPlus. In addition, an adaptive blending module is designed along with
AccPlus to alleviate the occlusion effect by backward accumulation and rectify
the accumulation error. Notably, we demonstrate the superiority of backward
accumulation over conventional forward accumulation, which to the best of our
knowledge has not been explicitly established before. To train and evaluate the
proposed AccFlow, we have constructed a large-scale high-quality dataset named
CVO, which provides ground-truth optical flow labels between adjacent and
distant frames. Extensive experiments validate the effectiveness of AccFlow in
handling long-range optical flow estimation. Codes are available at
https://github.com/mulns/AccFlow . |
This paper proposes AccFlow, a novel recurrent framework that leverages backward accumulation of local optical flows to estimate long-range optical flows, especially in scenarios with complex object deformation and large motion occlusion. |
Long-range optical flow estimation is crucial for various computer vision tasks, including video editing, action recognition, and object tracking, but remains challenging due to occlusion and accumulation errors. |
AccFlow uses a pretrained optical flow estimator for initial flow estimation, then recursively accumulates local flows backward in the feature domain using a deformable module called AccPlus. An adaptive blending module rectifies accumulation errors using the directly estimated long-range flow as prior information. The paper also introduces a new synthetic dataset, CVO, with ground-truth long-range optical flow annotations for training and evaluation. |
Backward accumulation effectively alleviates occlusion compared to forward accumulation, as demonstrated by quantitative and qualitative results.
The adaptive blending module significantly reduces accumulated error, particularly in non-occluded regions.
AccFlow outperforms previous state-of-the-art methods on CVO and HS-Sintel benchmarks, achieving substantial EPE reduction. |
The current implementation of AccFlow relies on synthetic data and may require further adaptation for real-world scenarios.
Exploring more sophisticated occlusion reasoning and error correction mechanisms within AccFlow could further enhance its performance. |
optical flow, long-range flow estimation, backward accumulation, occlusion handling, synthetic dataset |
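The difference between forward and backward accumulation described in this report can be sketched with plain flow composition on 1-D flow fields. This is a simplified pixel-space illustration, not the feature-domain AccPlus module or the adaptive blending step. The function names, the linear interpolation, and the border clamping are all assumptions made for the example.

```python
def sample(flow, x):
    """Linearly interpolate a 1-D flow field at position x, clamped to the borders."""
    x = min(max(x, 0.0), len(flow) - 1.0)
    i0 = int(x)
    i1 = min(i0 + 1, len(flow) - 1)
    w = x - i0
    return (1.0 - w) * flow[i0] + w * flow[i1]


def compose(flow_ab, flow_bc):
    """Flow a->c: follow a->b, then read b->c at the landing position."""
    return [f + sample(flow_bc, x + f) for x, f in enumerate(flow_ab)]


def accumulate_forward(local_flows):
    """Start at the first frame; each lookup uses the accumulated long-range flow."""
    acc = local_flows[0]
    for nxt in local_flows[1:]:
        acc = compose(acc, nxt)
    return acc


def accumulate_backward(local_flows):
    """Start at the last frame pair; each lookup uses only a local flow."""
    acc = local_flows[-1]
    for prev in reversed(local_flows[:-1]):
        acc = compose(prev, acc)
    return acc


# Three local flows, each a uniform shift of one pixel to the right.
local_flows = [[1.0] * 8 for _ in range(3)]
long_forward = accumulate_forward(local_flows)
long_backward = accumulate_backward(local_flows)
```

With ideal, occlusion-free flows both orders give the same long-range flow. In this sketch the backward order only ever looks up positions through a short local flow, which is one plausible reading of why it would be less exposed to occlusion; the report itself does not spell out the mechanism.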
2308.12968
Report |
Scenimefy: Learning to Craft Anime Scene via Semi-Supervised Image-to-Image Translation |
Yuxin Jiang, Liming Jiang, Shuai Yang, Chen Change Loy |
Automatic high-quality rendering of anime scenes from complex real-world
images is of significant practical value. The challenges of this task lie in
the complexity of the scenes, the unique features of anime style, and the lack
of high-quality datasets to bridge the domain gap. Despite promising attempts,
previous efforts still fall short of achieving satisfactory results with
consistent semantic preservation, evident stylization, and fine details. In
this study, we propose Scenimefy, a novel semi-supervised image-to-image
translation framework that addresses these challenges. Our approach guides the
learning with structure-consistent pseudo paired data, simplifying the pure
unsupervised setting. The pseudo data are derived uniquely from a
semantic-constrained StyleGAN leveraging rich model priors like CLIP. We
further apply segmentation-guided data selection to obtain high-quality pseudo
supervision. A patch-wise contrastive style loss is introduced to improve
stylization and fine details. Besides, we contribute a high-resolution anime
scene dataset to facilitate future research. Our extensive experiments
demonstrate the superiority of our method over state-of-the-art baselines in
terms of both perceptual quality and quantitative performance. |
This paper presents Scenimefy, a novel semi-supervised image-to-image translation framework for converting real-world scenes into high-quality anime style. |
This work addresses the challenges of existing anime stylization methods in preserving semantic content, achieving distinct anime style, and handling fine details, particularly in complex scenes. |
Scenimefy leverages structure-consistent pseudo paired data generated by a semantically-constrained StyleGAN, guided by pre-trained models like CLIP and VGG. It employs a segmentation-guided data selection process for high-quality supervision and introduces a patch-wise contrastive style loss for enhanced stylization. |
Scenimefy outperforms state-of-the-art baselines in both visual quality and quantitative evaluations (FID).
The proposed method effectively captures and transfers unique anime textures and styles, as demonstrated in comparisons.
A new high-resolution anime scene dataset is introduced to facilitate further research in this area. |
The model may not perfectly preserve intricate tiny details like text.
A small number of failure cases exist where semantically distinct objects are translated incorrectly. |
image-to-image translation, anime stylization, scene cartoonization, stylegan, semi-supervised learning |
2308.12966
Report |
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond |
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou |
In this work, we introduce the Qwen-VL series, a set of large-scale
vision-language models (LVLMs) designed to perceive and understand both texts
and images. Starting from the Qwen-LM as a foundation, we endow it with visual
capacity by the meticulously designed (i) visual receptor, (ii) input-output
interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal
cleaned corpus. Beyond the conventional image description and
question-answering, we implement the grounding and text-reading ability of
Qwen-VLs by aligning image-caption-box tuples. The resulting models, including
Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar
model scales on a broad range of visual-centric benchmarks (e.g., image
captioning, question answering, visual grounding) and different settings (e.g.,
zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our
instruction-tuned Qwen-VL-Chat also demonstrates superiority compared to
existing vision-language chatbots. Code, demo and models are available at
https://github.com/QwenLM/Qwen-VL. |
Introduces Qwen-VL, a series of open-source large-scale vision-language models that excel in visual understanding and instruction following. |
Addresses limitations of existing open-source LVLMs in terms of performance, fine-grained perception, and instruction following. |
Employs a 3-stage training pipeline: (1) Pre-training on a massive image-text dataset, (2) Multi-task pre-training on high-quality annotated data, and (3) Instruction fine-tuning for enhanced dialogue abilities. |
Achieves state-of-the-art results on various vision-language benchmarks, including image captioning, visual question answering, and referring expression comprehension.
Demonstrates superior performance in real-world user behavior evaluations, such as TouchStone, SEED-Bench, and MME.
Exhibits strong few-shot learning capabilities, comparable to larger models. |
Current model size and resolution limit handling of more complex multimodal relationships.
Future work focuses on incorporating additional modalities (speech, video) and enhancing multimodal generation capabilities. |
vision-language model, large language model, multimodal learning, instruction following, open-source |
2308.12964
Report |
Dense Text-to-Image Generation with Attention Modulation |
Yunji Kim, Jiyoung Lee, Jin-Hwa Kim, Jung-Woo Ha, Jun-Yan Zhu |
Existing text-to-image diffusion models struggle to synthesize realistic
images given dense captions, where each text prompt provides a detailed
description for a specific image region. To address this, we propose
DenseDiffusion, a training-free method that adapts a pre-trained text-to-image
model to handle such dense captions while offering control over the scene
layout. We first analyze the relationship between generated images' layouts and
the pre-trained model's intermediate attention maps. Next, we develop an
attention modulation method that guides objects to appear in specific regions
according to layout guidance. Without requiring additional fine-tuning or
datasets, we improve image generation performance given dense captions
regarding both automatic and human evaluation scores. In addition, we achieve
visual results of similar quality to models specifically trained with layout
conditions. |
This paper introduces DenseDiffusion, a training-free method that allows pre-trained text-to-image models to generate realistic images from dense captions while offering control over scene layout. |
Existing text-to-image diffusion models struggle to accurately represent images described by dense captions, often omitting or blending objects. Additionally, controlling the layout of generated images using text prompts alone is difficult. |
DenseDiffusion modulates the attention maps of pre-trained models, such as Stable Diffusion, based on both text and layout conditions. It identifies positive and negative query-key pairs in the attention layers and adjusts their scores according to the original value range and the segment size. |
DenseDiffusion improves image generation performance on dense captions compared to other training-free methods, as measured by CLIP-Score, SOA-I score, and IoU.
DenseDiffusion demonstrates superior adherence to layout conditions compared to SD-Pww, a training-free method designed for layout control.
Qualitative results show DenseDiffusion achieves comparable, and in some cases better, layout control than models specifically trained on layout conditions, such as Make-a-Scene and SpaText. |
DenseDiffusion's performance is limited by the capacity of the base text-to-image model (e.g., Stable Diffusion) it modifies.
The method struggles with fine-grained input masks due to the coarse nature of self-attention and cross-attention layers. |
text-to-image generation, diffusion models, attention modulation, layout control, dense captions |
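The attention modulation summarized in this report can be sketched as follows: scores of positive query-key pairs are pushed toward the row maximum, scores of negative pairs toward the row minimum, and the push is weaker for larger segments. This is a simplified reading of the description above, not the authors' implementation. The per-row value range, the `(1 - size)` scaling, and every name in the code are assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def modulate(scores, positive, seg_size, strength=1.0):
    """Shift pre-softmax scores while staying inside each row's original range.

    scores:   one row of query-key scores per query
    positive: same shape, True where the key belongs to the query's segment
    seg_size: per-query segment area as a fraction of the image (assumed)
    """
    out = []
    for row, pos_row, size in zip(scores, positive, seg_size):
        hi, lo = max(row), min(row)
        scale = strength * (1.0 - size)  # smaller segments get a stronger push
        new_row = []
        for s, is_pos in zip(row, pos_row):
            if is_pos:
                new_row.append(s + scale * (hi - s))
            else:
                new_row.append(s - scale * (s - lo))
        out.append(new_row)
    return out


scores = [[1.0, 2.0, 0.5]]
positive = [[True, False, False]]
seg_size = [0.25]
before = [softmax(r) for r in scores]
modulated = modulate(scores, positive, seg_size)
after = [softmax(r) for r in modulated]
```

Because each shift is bounded by the distance to the row maximum or minimum, the modulated scores stay within the original value range as long as `strength` does not exceed 1.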
2308.12956
Report |
DLIP: Distilling Language-Image Pre-training |
Huafeng Kuang, Jie Wu, Xiawu Zheng, Ming Li, Xuefeng Xiao, Rui Wang, Min Zheng, Rongrong Ji |
Vision-Language Pre-training (VLP) has made remarkable progress, but it relies
on extremely large parameter counts, which makes deployment in real
applications challenging. Knowledge distillation is well recognized as an essential
procedure in model compression. However, existing knowledge distillation
techniques lack an in-depth investigation and analysis of VLP, and practical
guidelines for VLP-oriented distillation have not yet been explored. In this
paper, we present DLIP, a simple yet efficient Distilling Language-Image
Pre-training framework, through which we investigate how to distill a light VLP
model. Specifically, we dissect the model distillation from multiple
dimensions, such as the architecture characteristics of different modules and
the information transfer of different modalities. We conduct comprehensive
experiments and provide insights on distilling a light but performant VLP
model. Experimental results reveal that DLIP can achieve a state-of-the-art
accuracy/efficiency trade-off across diverse cross-modal tasks, e.g.,
image-text retrieval, image captioning and visual question answering. For
example, DLIP compresses BLIP by 1.9x, from 213M to 108M parameters, while
achieving comparable or better performance. Furthermore, DLIP succeeds in
retaining more than 95% of the performance with 22.4% of the parameters and
24.8% of the FLOPs of the teacher model, and accelerates inference speed by 2.7x. |
This paper presents DLIP, a simple yet effective distillation framework designed to train lighter Vision-Language Pre-training (VLP) models. |
Large VLP models are computationally expensive and challenging to deploy. DLIP addresses this by compressing these models while maintaining high performance. |
DLIP leverages knowledge distillation from a large teacher VLP model to a smaller student model. It investigates and analyzes various aspects of model distillation, including module architecture choices and multimodal information transfer. |
Image and text encoders are equally important for compression.
Multimodal information transfer is more effective than unimodal information transfer for distillation.
DLIP achieves state-of-the-art accuracy/efficiency trade-off, compressing BLIP by 1.9x while achieving comparable or better performance on various tasks. |
The study primarily focuses on fully transformer-based VLP models.
Future work could explore more efficient module compression strategies. |
vision-language pre-training, knowledge distillation, model compression, multimodal learning, image-text retrieval |
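For readers unfamiliar with knowledge distillation, the generic temperature-scaled objective can be sketched as below. This is the standard logit-matching loss, not DLIP's specific set of distillation objectives, which the report does not detail. The function names and the temperature value are assumptions made for the example. The last line checks the compression ratio implied by the parameter counts in the abstract.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2


teacher = [2.0, 0.5, -1.0]
loss_same = distillation_loss(teacher, teacher)          # student matches teacher
loss_diff = distillation_loss([0.0, 0.0, 0.0], teacher)  # uninformative student

# Parameter counts quoted in the abstract: 213M (BLIP) and 108M (DLIP).
compression = 213 / 108
```

The loss is zero when the student reproduces the teacher's distribution and positive otherwise. The ratio 213/108 is roughly 1.97, in line with the compression of about 1.9x quoted in the abstract.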
2308.12866
Report |
ToonTalker: Cross-Domain Face Reenactment |
Yuan Gong, Yong Zhang, Xiaodong Cun, Fei Yin, Yanbo Fan, Xuan Wang, Baoyuan Wu, Yujiu Yang |
We target cross-domain face reenactment in this paper, i.e., driving a
cartoon image with the video of a real person and vice versa. Recently, many
works have focused on one-shot talking face generation to drive a portrait with
a real video, i.e., within-domain reenactment. Straightforwardly applying those
methods to cross-domain animation will cause inaccurate expression transfer,
blur effects, and even apparent artifacts due to the domain shift between
cartoon and real faces. Only a few works attempt to settle cross-domain face
reenactment. The most related work AnimeCeleb requires constructing a dataset
with pose vector and cartoon image pairs by animating 3D characters, which
makes it inapplicable when no paired data is available. In this paper, we
propose a novel method for cross-domain reenactment without paired data.
Specifically, we propose a transformer-based framework to align the motions
from different domains into a common latent space where motion transfer is
conducted via latent code addition. Two domain-specific motion encoders and two
learnable motion base memories are used to capture domain properties. A source
query transformer and a driving one are exploited to project domain-specific
motion to the canonical space. The edited motion is projected back to the
domain of the source with a transformer. Moreover, since no paired data is
provided, we propose a novel cross-domain training scheme using data from two
domains with the designed analogy constraint. Besides, we contribute a cartoon
dataset in Disney style. Extensive evaluations demonstrate the superiority of
our method over competing methods. |
This paper presents ToonTalker, a novel transformer-based framework for cross-domain face reenactment, enabling animation of cartoon images using real human videos and vice versa. |
Existing face reenactment methods struggle with cross-domain animation due to the significant domain shift between cartoon and real faces, resulting in inaccurate expression transfer and artifacts. ToonTalker addresses this challenge by aligning motions from different domains in a shared latent space, eliminating the need for paired training data. |
ToonTalker utilizes domain-specific motion encoders and learnable motion bases to capture domain-specific motion properties. Source and driving query transformers project these motions into a canonical space where motion transfer occurs via latent code addition. A novel training scheme with an analogy constraint compensates for the lack of paired data by enforcing consistent relative motion between domains. |
ToonTalker outperforms state-of-the-art methods in cross-domain reenactment, demonstrating superior image quality, motion consistency, and identity preservation.
The proposed analogy constraint effectively aligns motions from different domains, as evidenced by qualitative and quantitative results.
ToonTalker generalizes well to animating cartoon characters generated by diffusion models, showcasing its potential for various applications. |
The model faces challenges in accurately handling extreme poses due to their limited presence in training data.
Future work could explore incorporating techniques for handling extreme poses and further enhance the model's generalization capabilities for diverse cartoon styles. |
cross-domain face reenactment, motion transfer, analogy constraint, transformer, cartoon animation |
2308.12605
Report |
APLA: Additional Perturbation for Latent Noise with Adversarial Training Enables Consistency |
Yupu Yao, Shangqi Deng, Zihan Cao, Harry Zhang, Liang-Jian Deng |
Diffusion models have exhibited promising progress in video generation.
However, they often struggle to retain consistent details within local regions
across frames. One underlying cause is that traditional diffusion models
approximate Gaussian noise distribution by utilizing predictive noise, without
fully accounting for the impact of inherent information within the input
itself. Additionally, these models emphasize the distinction between
predictions and references, neglecting information intrinsic to the videos. To
address this limitation, inspired by the self-attention mechanism, we propose a
novel text-to-video (T2V) generation network structure based on diffusion
models, dubbed Additional Perturbation for Latent noise with Adversarial
training (APLA). Our approach only necessitates a single video as input and
builds upon pre-trained stable diffusion networks. Notably, we introduce an
additional compact network, known as the Video Generation Transformer (VGT).
This auxiliary component is designed to extract perturbations from the inherent
information contained within the input, thereby refining inconsistent pixels
during temporal predictions. We leverage a hybrid architecture of transformers
and convolutions to compensate for temporal intricacies, enhancing consistency
between different frames within the video. Experiments demonstrate a noticeable
improvement in the consistency of the generated videos both qualitatively and
quantitatively. |
Proposes APLA, a text-to-video generation network based on diffusion models that improves frame consistency by using a Video Generation Transformer (VGT) to extract and leverage inherent information within input videos. |
Existing diffusion models for video generation struggle to maintain consistent details across frames, particularly in fine-tuned models. |
APLA introduces VGT, an auxiliary network that extracts perturbations from input videos to refine inconsistent pixels during temporal predictions. It leverages a hybrid transformer-convolution architecture for temporal consistency. Adversarial training is used to further enhance generated video quality and consistency. |
APLA demonstrates noticeable improvement in generated video consistency qualitatively and quantitatively.
VGT-Hyper, a variant of VGT using 3D convolutions, exhibits superior performance in reconstruction tasks.
Ablation studies highlight the contribution of each component in APLA, including VGT, adversarial training, and the hyper-loss. |
Limited CUDA memory restricts the use of the more complex VGT-Hyper model.
Excessive training epochs can lead to overfitting and reduced influence of the text prompt. |
text-to-video generation, diffusion models, frame consistency, video generation transformer, adversarial training |
2308.12560
Report |
NOVA: NOvel View Augmentation for Neural Composition of Dynamic Objects |
Dakshit Agrawal, Jiajie Xu, Siva Karthik Mustikovela, Ioannis Gkioulekas, Ashish Shrivastava, Yuning Chai |
We propose a novel-view augmentation (NOVA) strategy to train NeRFs for
photo-realistic 3D composition of dynamic objects in a static scene. Compared
to prior work, our framework significantly reduces blending artifacts when
inserting multiple dynamic objects into a 3D scene at novel views and times;
achieves comparable PSNR without the need for additional ground truth
modalities like optical flow; and overall provides ease, flexibility, and
scalability in neural composition. Our codebase is on GitHub. |
Presents NOVA, a novel-view augmentation strategy for training NeRFs, enabling photo-realistic 3D composition of dynamic objects in static scenes from monocular videos. |
Addresses limitations in existing methods that produce blending artifacts and require additional ground truth data like optical flow. |
Utilizes separate NeRFs for different scene parts, employs novel-view augmentation to reduce blending artifacts, and introduces novel-view losses to ensure high image fidelity. |
Significantly reduces blending artifacts compared to prior work, especially when inserting multiple dynamic objects.
Achieves comparable PSNR to state-of-the-art methods without requiring ground truth optical flow.
Provides a flexible and scalable framework for neural composition of dynamic scenes. |
Current implementation assumes a static camera for capturing the scene.
Future work can explore incorporating techniques to handle dynamic cameras. |
neural radiance fields, nerf, novel view synthesis, scene composition, dynamic scenes |
2308.12538
Report |
Mutual-Guided Dynamic Network for Image Fusion |
Yuanshen Guan, Ruikang Xu, Mingde Yao, Lizhi Wang, Zhiwei Xiong |
Image fusion aims to generate a high-quality image from multiple images
captured under varying conditions. The key problem of this task is to preserve
complementary information while filtering out irrelevant information for the
fused result. However, existing methods address this problem by leveraging
static convolutional neural networks (CNNs), which suffer from two inherent limitations
during feature extraction, i.e., being unable to handle spatial-variant
contents and lacking guidance from multiple inputs. In this paper, we propose a
novel mutual-guided dynamic network (MGDN) for image fusion, which allows for
effective information utilization across different locations and inputs.
Specifically, we design a mutual-guided dynamic filter (MGDF) for adaptive
feature extraction, composed of a mutual-guided cross-attention (MGCA) module
and a dynamic filter predictor, where the former incorporates additional
guidance from different inputs and the latter generates spatial-variant kernels
for different locations. In addition, we introduce a parallel feature fusion
(PFF) module to effectively fuse local and global information of the extracted
features. To further reduce the redundancy among the extracted features while
simultaneously preserving their shared structural information, we devise a
novel loss function that combines the minimization of normalized mutual
information (NMI) with an estimated gradient mask. Experimental results on five
benchmark datasets demonstrate that our proposed method outperforms existing
methods on four image fusion tasks. The code and model are publicly available
at: https://github.com/Guanys-dar/MGDN. |
This paper introduces MGDN, a novel mutual-guided dynamic network for image fusion, which allows for effective information utilization across different locations and inputs. |
Existing image fusion methods rely on static networks, limiting their ability to handle spatial and scene variations crucial for real-world applications. MGDN addresses this by dynamically adapting to content and leveraging information from multiple inputs. |
The core of MGDN is the Mutual-Guided Dynamic Filter (MGDF). It uses Mutual-Guided Cross-Attention (MGCA) to integrate guidance information from multiple inputs and a dynamic filter predictor to estimate spatial-variant filters for adaptive feature extraction. A Parallel Feature Fusion (PFF) module merges local and global information, and a masked MI loss is employed to reduce feature redundancy while preserving structural information. |
MGDN outperforms existing state-of-the-art image fusion methods on five benchmark datasets across four representative image fusion tasks.
MGDN effectively integrates complementary information, preserves texture details, and maintains appropriate exposure levels in multi-exposure and multi-focus image fusion tasks.
The proposed method excels in HDR deghosting, handling challenging scenes with saturations, motion, and significant intensity variations better than previous approaches. |
The computational complexity of dynamic filtering needs further optimization for real-time applications.
Future work can explore extending MGDN to handle more than two input images for complex fusion scenarios. |
image fusion, dynamic filtering, mutual information, deep learning, computer vision |
2308.12510
Report |
Masked Autoencoders are Efficient Class Incremental Learners |
Jiang-Tian Zhai, Xialei Liu, Andrew D. Bagdanov, Ke Li, Ming-Ming Cheng |
Class Incremental Learning (CIL) aims to sequentially learn new classes while
avoiding catastrophic forgetting of previous knowledge. We propose to use
Masked Autoencoders (MAEs) as efficient learners for CIL. MAEs were originally
designed to learn useful representations through reconstructive unsupervised
learning, and they can be easily integrated with a supervised loss for
classification. Moreover, MAEs can reliably reconstruct original input images
from randomly selected patches, which we use to store exemplars from past tasks
more efficiently for CIL. We also propose a bilateral MAE framework to learn
from image-level and embedding-level fusion, which produces better-quality
reconstructed images and more stable representations. Our experiments confirm
that our approach performs better than the state-of-the-art on CIFAR-100,
ImageNet-Subset, and ImageNet-Full. The code is available at
https://github.com/scok30/MAE-CIL . |
This paper introduces a novel bilateral Masked Autoencoder (MAE) framework for efficient Class Incremental Learning (CIL), leveraging the self-supervised reconstruction capabilities of MAEs for enhanced exemplar replay and representation learning. |
Addressing catastrophic forgetting in CIL is crucial for real-world applications where models need to adapt to new information without losing previously acquired knowledge. This work explores using MAEs for efficient exemplar storage and high-quality replay data generation in CIL. |
The proposed approach employs a bilateral MAE architecture with two branches: one for learning global features and another for detailed reconstruction. It utilizes random masking for efficient exemplar storage, reconstructs images from these masked patches, and incorporates a detailed loss to improve reconstruction quality and embedding diversity. |
The bilateral MAE framework achieves state-of-the-art performance on CIFAR-100, ImageNet-Subset, and ImageNet-Full, outperforming existing methods in average accuracy and forgetting rate.
The method demonstrates the effectiveness of using masked image patches as exemplars for efficient storage and high-quality replay data generation.
Ablation studies confirm the contribution of each component, including the bilateral architecture, self-supervised reconstruction, and masking ratio, to the overall performance improvement. |
The impact of varying the number of stored exemplars per class on performance could be further investigated.
Exploring different masking strategies or incorporating additional self-supervision tasks might lead to further performance improvements. |
class incremental learning, catastrophic forgetting, masked autoencoders, exemplar replay, self-supervised learning |
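The storage saving from keeping only randomly selected patches as exemplars can be illustrated with a short sketch. The patch count of 196 (a 14x14 grid) and the masking ratio of 0.75 are assumptions chosen for the example, as are the function names; the report does not state the values the authors use.

```python
import random


def keep_random_patches(num_patches, mask_ratio, seed=0):
    """Return sorted indices of the patches that are stored (the unmasked ones)."""
    rng = random.Random(seed)
    num_keep = round(num_patches * (1.0 - mask_ratio))
    return sorted(rng.sample(range(num_patches), num_keep))


def exemplars_within_budget(full_image_budget, mask_ratio):
    """How many masked exemplars fit in the space of `full_image_budget` full images."""
    return int(full_image_budget / (1.0 - mask_ratio))


# Assumed setup: 196 patches per image, 75% of them masked out.
kept = keep_random_patches(num_patches=196, mask_ratio=0.75)
capacity = exemplars_within_budget(full_image_budget=20, mask_ratio=0.75)
```

Under these assumed numbers each exemplar stores 49 of 196 patches, so a memory budget sized for 20 full images holds 80 masked exemplars. At replay time the MAE would reconstruct the missing patches, a step this sketch leaves out.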
2308.12469
Report |
Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion |
Junjiao Tian, Lavisha Aggarwal, Andrea Colaco, Zsolt Kira, Mar Gonzalez-Franco |
Producing quality segmentation masks for images is a fundamental problem in
computer vision. Recent research has explored large-scale supervised training
to enable zero-shot segmentation on virtually any image style and unsupervised
training to enable segmentation without dense annotations. However,
constructing a model capable of segmenting anything in a zero-shot manner
without any annotations is still challenging. In this paper, we propose to
utilize the self-attention layers in stable diffusion models to achieve this
goal because the pre-trained stable diffusion model has learned inherent
concepts of objects within its attention layers. Specifically, we introduce a
simple yet effective iterative merging process based on measuring KL divergence
among attention maps to merge them into valid segmentation masks. The proposed
method does not require any training or language dependency to extract quality
segmentation for any images. On COCO-Stuff-27, our method surpasses the prior
unsupervised zero-shot SOTA method by an absolute 26% in pixel accuracy and 17%
in mean IoU. The project page is at
https://sites.google.com/view/diffseg/home. |
Presents DiffSeg, an unsupervised and zero-shot segmentation method using a pre-trained stable diffusion model. |
Constructing a model capable of segmenting anything in a zero-shot manner without any annotations remains challenging. This method eliminates the need for annotations and prior knowledge of target images. |
DiffSeg leverages self-attention layers in stable diffusion models, aggregating attention tensors and merging them iteratively based on KL divergence to produce segmentation masks. |
DiffSeg surpasses previous unsupervised zero-shot methods on COCO-Stuff-27 (26% higher pixel accuracy, 17% higher mIoU).
Outperforms prior works on Cityscapes using larger resolution input.
Generalizes well to images of diverse styles, including sketches, paintings, and real-world photographs. |
Performance on specialized datasets like Cityscapes is not satisfactory, potentially due to resolution limitations and limited exposure to such scenes during pre-training.
Computationally demanding, not real-time. |
unsupervised segmentation, zero-shot learning, stable diffusion, self-attention, kl divergence |
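The KL-based merging of attention maps described in this report can be sketched as a single greedy pass: each map joins the first existing proposal whose symmetric KL distance is below a threshold, otherwise it starts a new proposal. This is a simplified illustration, not the DiffSeg procedure itself. The symmetric distance, the running-mean update, the threshold value, and all names are assumptions made for the example.

```python
import math


def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def sym_kl(p, q):
    """Symmetric distance: forward plus reverse KL."""
    return kl(p, q) + kl(q, p)


def merge_maps(maps, threshold):
    """Greedily group flattened attention maps whose distance is below threshold."""
    proposals = []  # each entry: [running mean map, number of merged maps]
    for m in maps:
        for prop in proposals:
            if sym_kl(m, prop[0]) < threshold:
                n = prop[1]
                prop[0] = [(a * n + b) / (n + 1) for a, b in zip(prop[0], m)]
                prop[1] = n + 1
                break
        else:
            # No proposal was close enough, so this map starts a new one.
            proposals.append([list(m), 1])
    return [p[0] for p in proposals]


# Four toy attention maps over four pixels: two attend left, two attend right.
maps = [
    [0.70, 0.20, 0.05, 0.05],
    [0.68, 0.22, 0.05, 0.05],
    [0.05, 0.05, 0.20, 0.70],
    [0.05, 0.05, 0.22, 0.68],
]
merged = merge_maps(maps, threshold=0.1)
```

A segmentation mask could then be read off by assigning each pixel to the merged map with the highest value at that pixel; that step is omitted here.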
2308.12350
Report |
Diffusion-based Image Translation with Label Guidance for Domain Adaptive Semantic Segmentation |
Duo Peng, Ping Hu, Qiuhong Ke, Jun Liu |
Translating images from a source domain to a target domain for learning
target models is one of the most common strategies in domain adaptive semantic
segmentation (DASS). However, existing methods still struggle to preserve
semantically-consistent local details between the original and translated
images. In this work, we present an innovative approach that addresses this
challenge by using source-domain labels as explicit guidance during image
translation. Concretely, we formulate cross-domain image translation as a
denoising diffusion process and utilize a novel Semantic Gradient Guidance
(SGG) method to constrain the translation process, conditioning it on the
pixel-wise source labels. Additionally, a Progressive Translation Learning
(PTL) strategy is devised to enable the SGG method to work reliably across
domains with large gaps. Extensive experiments demonstrate the superiority of
our approach over state-of-the-art methods. |
This paper proposes a novel diffusion-based image translation framework for Domain Adaptive Semantic Segmentation (DASS) that uses source-domain labels as guidance to preserve semantic details. |
Existing image translation methods for DASS, often based on GANs, struggle to preserve local semantic consistency between original and translated images, leading to sub-optimal adaptation performance. |
The approach involves training an unconditional diffusion model on the target domain and then using it for translating source images. A novel Semantic Gradient Guidance (SGG) method, coupled with a Progressive Translation Learning (PTL) strategy, guides the translation process based on pixel-wise source labels, ensuring semantic consistency even across large domain gaps. |
Achieves state-of-the-art performance on GTA5→Cityscapes and SYNTHIA→Cityscapes benchmarks.
Shows significant improvements over existing GAN-based image translation methods (3.2% to 20.1% improvement across different settings and backbones).
Demonstrates more stable training and comparable inference time compared to other state-of-the-art DASS methods. |
The method requires training across multiple intermediate domains, which increases the overall training time.
Future work could explore incorporating other guidance signals, such as image structure or context, to further improve the quality of translated images. |
domain adaptation, semantic segmentation, image translation, diffusion models, label guidance |
2308.12059
Report |
Manipulating Embeddings of Stable Diffusion Prompts |
Niklas Deckers, Julia Peters, Martin Potthast |
Generative text-to-image models such as Stable Diffusion allow users to
generate images based on a textual description, the prompt. Changing the prompt
is still the primary means for the user to change a generated image as desired.
However, changing the image by reformulating the prompt remains a difficult
process of trial and error, which has led to the emergence of prompt
engineering as a new field of research. We propose and analyze methods to
change the embedding of a prompt directly instead of the prompt text. It allows
for more fine-grained and targeted control that takes into account user
intentions. Our approach treats the generative text-to-image model as a
continuous function and passes gradients between the image space and the prompt
embedding space. By addressing different user interaction problems, we can
apply this idea in three scenarios: (1) Optimization of a metric defined in
image space that could measure, for example, image style. (2) Assistance of
users in creative tasks by enabling them to navigate the image space along a
selection of directions of "near" prompt embeddings. (3) Changing the embedding
of the prompt to include information that the user has seen in a particular
seed but finds difficult to describe in the prompt. Our experiments demonstrate
the feasibility of the described methods. |
This paper presents and analyzes three novel methods for manipulating the embeddings of prompts in Stable Diffusion, allowing for more targeted and fine-grained control over image generation compared to traditional prompt engineering. |
Traditional prompt engineering, while effective, can be tedious, unintuitive, and unpredictable due to the inherent ambiguity of language and the black-box nature of text-to-image models. This work aims to address these shortcomings by providing users with more direct control over image generation. |
The proposed methods involve treating Stable Diffusion as a continuous function and using gradient descent to modify prompt embeddings in three ways:
1. **Metric-Based Optimization:** Optimizing embeddings with respect to specific image metrics (e.g., blurriness, sharpness, aesthetics).
2. **Iterative Human Feedback:** Providing users with options generated from slightly modified embeddings, allowing for iterative refinement based on their choices.
3. **Seed-Invariant Embeddings:** Reconstructing preferred image features observed with specific seeds, making image generation more robust to seed variations. |
Modifying prompt embeddings based on image metrics successfully alters image characteristics like blurriness, sharpness, and aesthetics.
User study demonstrates that iterative feedback on modified embeddings provides a more controlled and less tedious experience compared to prompt engineering, especially for creative tasks.
The method for creating seed-invariant prompt embeddings shows promising preliminary results, demonstrating the potential to encode seed-specific information directly into the embedding. |
The effectiveness of metric-based optimization depends on the chosen metric and can lead to overfitting or artifacts if not carefully monitored.
The iterative feedback method might be less effective for users with a specific target image in mind, as it relies on presented options aligning with their envisioned direction. |
stable diffusion, prompt engineering, text-to-image generation, image manipulation, human-computer interaction |
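The metric-based optimization idea in the report above (treat the generator as a continuous function and move the prompt embedding along the gradient of an image metric) can be sketched on a toy problem. Everything here is an assumption for illustration: `toy_generate` and `toy_brightness` are stand-ins for Stable Diffusion and a real image metric, and central finite differences replace the backpropagated gradients the paper describes.

```python
import math

def numerical_grad(f, e, h=1e-5):
    """Central finite-difference gradient of scalar f at embedding e."""
    grad = []
    for i in range(len(e)):
        up = e[:i] + [e[i] + h] + e[i + 1:]
        dn = e[:i] + [e[i] - h] + e[i + 1:]
        grad.append((f(up) - f(dn)) / (2 * h))
    return grad

def optimize_embedding(e, generate, metric, steps=50, lr=0.1):
    """Gradient ascent on metric(generate(e)) with respect to the embedding e."""
    e = list(e)
    score = lambda emb: metric(generate(emb))
    for _ in range(steps):
        g = numerical_grad(score, e)
        e = [ei + lr * gi for ei, gi in zip(e, g)]
    return e

def toy_generate(e):
    """Hypothetical 'generator': maps a 2-d embedding to a 4-pixel image."""
    return [math.tanh(e[0]), math.tanh(e[1]),
            math.tanh(e[0] + e[1]), math.tanh(e[0] - e[1])]

def toy_brightness(img):
    """Hypothetical image metric: mean pixel value."""
    return sum(img) / len(img)

e0 = [0.0, 0.0]
e1 = optimize_embedding(e0, toy_generate, toy_brightness)
before = toy_brightness(toy_generate(e0))
after = toy_brightness(toy_generate(e1))
```

The prompt text never changes in this loop; only the embedding moves, which is the distinction the report draws against prompt engineering.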
2308.11974
Report |
Blending-NeRF: Text-Driven Localized Editing in Neural Radiance Fields |
Hyeonseop Song, Seokhun Choi, Hoseok Do, Chul Lee, Taehyeong Kim |
Text-driven localized editing of 3D objects is particularly difficult as
locally mixing the original 3D object with the intended new object and style
effects without distorting the object's form is not a straightforward process.
To address this issue, we propose a novel NeRF-based model, Blending-NeRF,
which consists of two NeRF networks: pretrained NeRF and editable NeRF.
Additionally, we introduce new blending operations that allow Blending-NeRF to
properly edit target regions which are localized by text. By using a pretrained
vision-language aligned model, CLIP, we guide Blending-NeRF to add new objects
with varying colors and densities, modify textures, and remove parts of the
original object. Our extensive experiments demonstrate that Blending-NeRF
produces naturally and locally edited 3D objects from various text prompts. Our
project page is available at https://seokhunchoi.github.io/Blending-NeRF/ |
Introduces Blending-NeRF, a novel NeRF-based model for text-driven localized editing of 3D objects using a pretrained NeRF and an editable NeRF. |
Addresses the challenge of localized editing in 3D object editing, enabling specific modifications based on text prompts. |
Utilizes a layered NeRF architecture with blending operations to combine a pretrained NeRF with an editable NeRF. It leverages CLIP for text-image alignment and CLIPSeg for target region localization. |
Blending-NeRF successfully edits localized regions of 3D objects based on various text prompts, including color changes, density additions, and removals.
The method outperforms baseline models, particularly in density-based editing tasks, demonstrating its ability for fine-grained control.
It exhibits extensibility by integrating with Instant-NGP for memory efficiency and application to real-world scenes. |
Performance can be influenced by the accuracy of CLIPSeg in segmenting the target region.
Limited patch size input to CLIP's image encoder can impact the sharpness of editing results, particularly with memory-intensive NeRF backbones. |
neural radiance fields, 3d object editing, text-driven editing, clip, localized editing |
2308.11971
Report |
EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE |
Junyi Chen, Longteng Guo, Jia Sun, Shuai Shao, Zehuan Yuan, Liang Lin, Dongyu Zhang |
Building scalable vision-language models to learn from diverse, multimodal
data remains an open challenge. In this paper, we introduce an Efficient
Vision-languagE foundation model, namely EVE, which is one unified multimodal
Transformer pre-trained solely by one unified pre-training task. Specifically,
EVE encodes both vision and language within a shared Transformer network
integrated with modality-aware sparse Mixture-of-Experts (MoE) modules, which
capture modality-specific information by selectively switching to different
experts. To unify pre-training tasks of vision and language, EVE performs
masked signal modeling on image-text pairs to reconstruct masked signals, i.e.,
image pixels and text tokens, given visible signals. This simple yet effective
pre-training objective accelerates training by 3.5x compared to the model
pre-trained with Image-Text Contrastive and Image-Text Matching losses. Owing
to the combination of the unified architecture and pre-training task, EVE is
easy to scale up, enabling better downstream performance with fewer resources
and faster training speed. Despite its simplicity, EVE achieves
state-of-the-art performance on various vision-language downstream tasks,
including visual question answering, visual reasoning, and image-text
retrieval. |
The paper introduces EVE, an efficient vision-language foundation model based on a unified multimodal Transformer with modality-aware sparse Mixture-of-Experts (MoE) modules, pre-trained using a single unified masked signal modeling task. |
Building scalable vision-language models that learn from diverse multimodal data while remaining efficient in training and scaling up is an open challenge. EVE aims to address this by simplifying both the model architecture and the pre-training objective. |
EVE leverages a shared Transformer network for both vision and language, integrating modality-aware MoE modules to capture modality-specific information. It is pre-trained using a unified masked signal modeling task, reconstructing masked image pixels and text tokens given visible signals. |
EVE achieves state-of-the-art performance on various vision-language tasks, including visual question answering, visual reasoning, and image-text retrieval.
The unified architecture and pre-training task enable EVE to be easily scaled up, leading to improved downstream performance with fewer resources and faster training.
Pre-training EVE with masked signal modeling is 3.5 times faster than using Image-Text Contrastive and Image-Text Matching losses. |
The paper mainly focuses on exploring the effectiveness of the unified architecture and pre-training task for image and text modalities.
The impact of using modality-specific MoE modules on model interpretability requires further investigation. |
vision-language pre-training, multimodal learning, mixture-of-experts, masked signal modeling, transformer |
2308.11941
Report |
Boosting Diffusion Models with an Adaptive Momentum Sampler |
Xiyu Wang, Anh-Dung Dinh, Daochang Liu, Chang Xu |
Diffusion probabilistic models (DPMs) have been shown to generate
high-quality images without the need for delicate adversarial training.
However, the current sampling process in DPMs is prone to violent shaking. In
this paper, we present a novel reverse sampler for DPMs inspired by the
widely-used Adam optimizer. Our proposed sampler can be readily applied to a
pre-trained diffusion model, utilizing momentum mechanisms and adaptive
updating to smooth the reverse sampling process and ensure stable generation,
resulting in outputs of enhanced quality. By implicitly reusing update
directions from early steps, our proposed sampler achieves a better balance
between high-level semantics and low-level details. Additionally, this sampler
is flexible and can be easily integrated into pre-trained DPMs regardless of
the sampler used during training. Our experimental results on multiple
benchmarks demonstrate that our proposed reverse sampler yields remarkable
improvements over different baselines. We will make the source code available. |
This paper introduces a novel, training-free reverse sampler for Diffusion Probabilistic Models (DPMs) inspired by the Adam optimizer, which uses momentum and adaptive updating to enhance the quality of generated images. |
Current DPMs' reverse sampling processes suffer from instability, leading to noisy images with missing high-level features. This novel sampler addresses this issue by smoothing the sampling trajectory and balancing high-level and low-level information in generated images. |
The proposed Adaptive Momentum Sampler incorporates a momentum term to accumulate past update directions, smoothing the sampling process. Additionally, it utilizes a moving average of second-order moments to adaptively adjust the denoising step size for each pixel, similar to the RMSProp optimizer. |
The adaptive momentum sampler significantly improves image generation quality over baseline samplers on various datasets, including CIFAR-10, ImageNet, CelebA, LSUN, and CelebA-HQ.
The sampler excels in balancing high-level semantics (shapes, outlines) and low-level details (textures) in generated images.
The proposed method is flexible and can be easily integrated with existing pre-trained DPMs without requiring additional training. |
The improvement of the sampler is less evident when using a small number of sampling steps.
Future work includes incorporating the adaptive momentum strategy into the training process and extending the scheme to continuous settings with solid theoretical foundations. |
diffusion models, generative models, image generation, adaptive momentum, sampling algorithm |
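To make the momentum and adaptive-step idea concrete, the sketch below applies the standard Adam update (first-moment averaging, second-moment rescaling, bias correction) to a sequence of per-pixel update directions. It is an assumption-laden illustration rather than the paper's sampler: the function name, the bias correction, and the default `beta1`, `beta2`, and `lr` values are borrowed from Adam, not from the source.

```python
import math

def adaptive_momentum_updates(directions, beta1=0.9, beta2=0.999, eps=1e-8, lr=1.0):
    """Smooth and rescale raw per-pixel update directions, Adam style."""
    n = len(directions[0])
    m = [0.0] * n  # running mean of directions (momentum)
    v = [0.0] * n  # running mean of squared directions
    updates = []
    for t, d in enumerate(directions, start=1):
        m = [beta1 * mi + (1 - beta1) * di for mi, di in zip(m, d)]
        v = [beta2 * vi + (1 - beta2) * di * di for vi, di in zip(v, d)]
        # Bias correction compensates for the zero initialisation of m and v.
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        updates.append([lr * mh / (math.sqrt(vh) + eps) for mh, vh in zip(m_hat, v_hat)])
    return updates

# A direction that keeps flipping sign is damped after the first step,
# while a consistent direction keeps a steady, normalised step.
shaky = adaptive_momentum_updates([[1.0], [-1.0], [1.0], [-1.0]])
steady = adaptive_momentum_updates([[2.0], [2.0], [2.0]])
```

The oscillating input mimics the unstable trajectory the report mentions: once momentum has history to average over, the flipped steps shrink well below their raw magnitude.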
2308.11917
Report |
LFS-GAN: Lifelong Few-Shot Image Generation |
Juwon Seo, Ji-Su Kang, Gyeong-Moon Park |
We address a challenging lifelong few-shot image generation task for the
first time. In this situation, a generative model learns a sequence of tasks
using only a few samples per task. Consequently, the learned model encounters
both catastrophic forgetting and overfitting problems at a time. Existing
studies on lifelong GANs have proposed modulation-based methods to prevent
catastrophic forgetting. However, they require considerable additional
parameters and cannot generate high-fidelity and diverse images from limited
data. On the other hand, the existing few-shot GANs suffer from severe
catastrophic forgetting when learning multiple tasks. To alleviate these
issues, we propose a framework called Lifelong Few-Shot GAN (LFS-GAN) that can
generate high-quality and diverse images in lifelong few-shot image generation
task. Our proposed framework learns each task using an efficient task-specific
modulator - Learnable Factorized Tensor (LeFT). LeFT is rank-constrained and
has a rich representation ability due to its unique reconstruction technique.
Furthermore, we propose a novel mode seeking loss to improve the diversity of
our model in low-data circumstances. Extensive experiments demonstrate that the
proposed LFS-GAN can generate high-fidelity and diverse images without any
forgetting and mode collapse in various domains, achieving state-of-the-art in
lifelong few-shot image generation task. Surprisingly, we find that our LFS-GAN
even outperforms the existing few-shot GANs in the few-shot image generation
task. The code is available at GitHub. |
This paper addresses the novel and challenging task of lifelong few-shot image generation, where a model needs to learn a sequence of image generation tasks from very limited data without forgetting previous tasks. |
This task is important for real-world scenarios where data is scarce or costly to obtain and adapting a new model for each task is impractical. It combines the challenges of lifelong learning (avoiding catastrophic forgetting) and few-shot learning (generalizing from limited data). |
The paper proposes LFS-GAN, a framework that uses a novel weight modulation technique called Learnable Factorized Tensor (LeFT) to efficiently learn task-specific information without modifying pre-trained weights. Additionally, it introduces a cluster-wise mode seeking loss to enhance generation diversity, especially in low-data regimes. |
LFS-GAN successfully generates high-quality and diverse images in lifelong few-shot settings, outperforming baselines adapted from both lifelong GANs and few-shot GANs.
The proposed LeFT modulation technique is highly efficient in terms of parameter count, using less than 1% of trainable parameters compared to the backbone generator.
LFS-GAN also demonstrates superior performance in the standard few-shot image generation task, indicating its ability to generalize well to different data regimes. |
The paper mainly focuses on StyleGAN2 as a backbone and evaluates on limited datasets. Further exploration with different backbones and diverse datasets is needed.
The impact of task sequence and potential biases in the dataset selection on the performance of LFS-GAN requires further investigation. |
lifelong learning, few-shot learning, image generation, generative adversarial networks (gans), weight modulation |
2308.11793
Report |
Enhancing NeRF akin to Enhancing LLMs: Generalizable NeRF Transformer with Mixture-of-View-Experts |
Wenyan Cong, Hanxue Liang, Peihao Wang, Zhiwen Fan, Tianlong Chen, Mukund Varma, Yi Wang, Zhangyang Wang |
Cross-scene generalizable NeRF models, which can directly synthesize novel
views of unseen scenes, have become a new spotlight of the NeRF field. Several
existing attempts rely on increasingly end-to-end "neuralized" architectures,
i.e., replacing scene representation and/or rendering modules with performant
neural networks such as transformers, and turning novel view synthesis into a
feed-forward inference pipeline. While those feedforward "neuralized"
architectures still do not fit diverse scenes well out of the box, we propose
to bridge them with the powerful Mixture-of-Experts (MoE) idea from large
language models (LLMs), which has demonstrated superior generalization ability
by balancing between larger overall model capacity and flexible per-instance
specialization. Starting from a recent generalizable NeRF architecture called
GNT, we first demonstrate that MoE can be neatly plugged in to enhance the
model. We further customize a shared permanent expert and a geometry-aware
consistency loss to enforce cross-scene consistency and spatial smoothness
respectively, which are essential for generalizable view synthesis. Our
proposed model, dubbed GNT with Mixture-of-View-Experts (GNT-MOVE), has
experimentally shown state-of-the-art results when transferring to unseen
scenes, indicating remarkably better cross-scene generalization in both
zero-shot and few-shot settings. Our codes are available at
https://github.com/VITA-Group/GNT-MOVE. |
This paper proposes GNT-MOVE, an LLM-inspired NeRF framework for generalizable novel view synthesis, by introducing Mixture-of-Experts (MoE) into GNT and customizing it with a permanent expert and a geometry-aware spatial consistency objective. |
Existing cross-scene generalizable NeRF models struggle to balance "generality" (covering diverse scenes) and "specialization" (modeling per-scene details). MoE, inspired by its success in LLMs, offers a potential solution. |
The authors integrate MoE into GNT's view transformer. To address the cross-scene consistency and spatial smoothness requirements of NeRF, they introduce a shared permanent expert and a geometry-aware spatial consistency objective. |
GNT-MOVE achieves state-of-the-art results in zero-shot generalization on LLFF, NeRF Synthetic, Shiny-6, Tanks-and-Temples, and NMR datasets.
GNT-MOVE consistently outperforms previous SOTA methods in few-shot generalization on LLFF and NeRF Synthetic datasets.
Analysis of expert selection reveals that GNT-MOVE effectively captures both cross-scene and cross-view consistency, as well as expert specialization for diverse rendering properties. |
The paper primarily focuses on applying MoE to the view transformer in GNT. Exploring its integration with the ray transformer could be promising.
The impact of MoE on computational cost, while claimed to be low, is not thoroughly analyzed. |
nerf, novel view synthesis, mixture-of-experts, generalization, transformer |
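The combination of routed experts with a shared permanent expert, as summarized in the GNT-MOVE report, can be sketched as a toy layer. This is not the GNT-MOVE architecture: the linear gate, the top-k softmax routing, and the additive way the permanent expert's output is combined are all assumptions, and `moe_with_permanent_expert` is a hypothetical name.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def moe_with_permanent_expert(x, experts, permanent, gate_weights, k=2):
    """Route x to its top-k experts and always add the permanent expert's output."""
    # One gate logit per expert, computed as a dot product with the input.
    logits = [sum(w * xi for w, xi in zip(wv, x)) for wv in gate_weights]
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top])
    out = list(permanent(x))  # the permanent expert sees every input
    for p, i in zip(probs, top):
        out = [o + p * y for o, y in zip(out, experts[i](x))]
    return out, top

# Three toy experts that scale the input, plus an identity permanent expert.
experts = [lambda v, s=s: [s * a for a in v] for s in (1.0, 2.0, 3.0)]
permanent = lambda v: list(v)
gate_weights = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
out, top = moe_with_permanent_expert([1.0, 0.0], experts, permanent, gate_weights)
```

Because the permanent expert bypasses the gate, part of the output is identical across inputs regardless of routing, which is one plausible reading of how it encourages cross-scene consistency.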
2308.11605
Report |
GOPro: Generate and Optimize Prompts in CLIP using Self-Supervised Learning |
Mainak Singha, Ankit Jha, Biplab Banerjee |
Large-scale foundation models, such as CLIP, have demonstrated remarkable
success in visual recognition tasks by embedding images in a semantically rich
space. Self-supervised learning (SSL) has also shown promise in improving
visual recognition by learning invariant features. However, the combination of
CLIP with SSL is found to face challenges due to the multi-task framework that
blends CLIP's contrastive loss and SSL's loss, including difficulties with loss
weighting and inconsistency among different views of images in CLIP's output
space. To overcome these challenges, we propose a prompt learning-based model
called GOPro, which is a unified framework that ensures similarity between
various augmented views of input images in a shared image-text embedding space,
using a pair of learnable image and text projectors atop CLIP, to promote
invariance and generalizability. To automatically learn such prompts, we
leverage the visual content and style primitives extracted from pre-trained
CLIP and adapt them to the target task. In addition to CLIP's cross-domain
contrastive loss, we introduce a visual contrastive loss and a novel prompt
consistency loss, considering the different views of the images. GOPro is
trained end-to-end on all three loss objectives, combining the strengths of
CLIP and SSL in a principled manner. Empirical evaluations demonstrate that
GOPro outperforms the state-of-the-art prompting techniques on three
challenging domain generalization tasks across multiple benchmarks by a
significant margin. Our code is available at
https://github.com/mainaksingha01/GOPro. |
GOPro leverages contrastive SSL and pre-trained CLIP to generate domain and class-agnostic prompts for enhanced generalization and invariance in embedding space against various image transformations. |
Addresses limitations in combining CLIP with SSL, specifically in learning generalizable prompts and ensuring semantic invariance, for improved performance on domain and class generalization tasks. |
GOPro places learnable image and text projectors atop a frozen CLIP. It is trained with three losses: a visual contrastive loss (using MoCo v3 augmentations), CLIP's image-text contrastive loss, and a novel prompt consistency loss (using MoCo v3 and AugMix augmentations). Prompt learning leverages multi-scale visual content and style information from CLIP. |
GOPro outperforms SOTA prompting techniques on B2N class generalization, achieving superior scores and exceeding SLIP by a significant margin.
GOPro effectively mitigates the generalization gap for diverse domains and classes, outperforming competitors in cross-dataset generalization.
In domain generalization, GOPro surpasses other methods in both source and target domains, demonstrating its robustness. |
Limited exploration of prompt context lengths beyond 4 tokens.
Future work can explore applications in specific domains like medical imaging and remote sensing. |
prompt learning, self-supervised learning, domain generalization, clip, vision-language models |
2308.11568
Report |
SPANet: Frequency-balancing Token Mixer using Spectral Pooling Aggregation Modulation |
Guhnoo Yun, Juhan Yoo, Kijung Kim, Jeongho Lee, Dong Hwan Kim |
Recent studies show that self-attentions behave like low-pass filters (as
opposed to convolutions) and enhancing their high-pass filtering capability
improves model performance. Contrary to this idea, we investigate existing
convolution-based models with spectral analysis and observe that improving the
low-pass filtering in convolution operations also leads to performance
improvement. To account for this observation, we hypothesize that utilizing
optimal token mixers that capture balanced representations of both high- and
low-frequency components can enhance the performance of models. We verify this
by decomposing visual features into the frequency domain and combining them in
a balanced manner. To handle this, we replace the balancing problem with a mask
filtering problem in the frequency domain. Then, we introduce a novel
token-mixer named SPAM and leverage it to derive a MetaFormer model termed as
SPANet. Experimental results show that the proposed method provides a way to
achieve this balance, and the balanced representations of both high- and
low-frequency components can improve the performance of models on multiple
computer vision tasks. Our code is available at
https://doranlyong.github.io/projects/spanet/. |
This paper proposes SPANet, a novel MetaFormer model employing SPAM, a frequency-balancing token mixer, to enhance model performance by capturing balanced representations of high- and low-frequency components in visual features. |
Recent studies highlight the importance of balancing high- and low-pass filtering capabilities in token mixers for improved model performance, prompting the exploration of optimal token mixers. |
The authors introduce SPAM, which uses Spectral Pooling Gates (SPG) to decompose features into frequency components and recombine them with learned weights. They build SPANet by integrating SPAM into a MetaFormer architecture. |
SPANets outperform state-of-the-art CNNs and MetaFormers in image classification and semantic segmentation tasks.
SPANets achieve competitive results in object detection and instance segmentation tasks.
Ablation studies confirm the significance of individual SPAM components and design choices. |
SPANets exhibit limited performance improvement in dense prediction tasks due to the pre-trained backbone's bias toward low-frequency components.
Exploration of frequency-balancing token mixers tailored for task-specific characteristics is needed. |
metaformer, token mixer, frequency balancing, spectral pooling, computer vision |
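The frequency-balancing idea behind SPAM (decompose features in the frequency domain with a mask, then recombine the bands with weights) can be shown on a 1-D toy signal. This is a simplified sketch, not the SPAM module: it uses a naive 1-D DFT instead of 2-D feature maps, a hard low-pass mask, and fixed scalars `w_low` and `w_high` where the report describes learned weights. All function names are hypothetical.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse DFT, returning the real part of the reconstruction."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def split_bands(x, cutoff):
    """Split x into a low band (frequency index within cutoff of DC) and the residual."""
    n = len(x)
    X = dft(x)
    low_mask = [1.0 if min(k, n - k) <= cutoff else 0.0 for k in range(n)]
    low = idft([Xk * m for Xk, m in zip(X, low_mask)])
    high = [xi - li for xi, li in zip(x, low)]
    return low, high

def balance(x, cutoff, w_low, w_high):
    """Recombine the low and high bands with separate weights."""
    low, high = split_bands(x, cutoff)
    return [w_low * l + w_high * h for l, h in zip(low, high)]

x = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
identity = balance(x, cutoff=1, w_low=1.0, w_high=1.0)   # equal weights reproduce x
dc_only = balance(x, cutoff=0, w_low=1.0, w_high=0.0)    # keeps only the mean
```

Casting the balance as a choice of mask and per-band weights mirrors the report's framing of the balancing problem as a mask filtering problem in the frequency domain.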
2308.11506
Report |
LCCo: Lending CLIP to Co-Segmentation |
Xin Duan, Yan Yang, Liyuan Pan, Xiabi Liu |
This paper studies co-segmenting the common semantic object in a set of
images. Existing works either rely on carefully engineered networks to mine the
implicit semantic information in visual features or require extra data (i.e.,
classification labels) for training. In this paper, we leverage the contrastive
language-image pre-training framework (CLIP) for the task. With a backbone
segmentation network that independently processes each image from the set, we
introduce semantics from CLIP into the backbone features, refining them in a
coarse-to-fine manner with three key modules: i) an image set feature
correspondence module, encoding global consistent semantic information of the
image set; ii) a CLIP interaction module, using CLIP-mined common semantics of
the image set to refine the backbone feature; iii) a CLIP regularization
module, drawing CLIP towards this co-segmentation task, identifying the best
CLIP semantic and using it to regularize the backbone feature. Experiments on
four standard co-segmentation benchmark datasets show that the performance of
our method outperforms state-of-the-art methods. |
This paper introduces LCCo, a novel framework that leverages the Contrastive Language-Image Pre-training (CLIP) model for the task of image co-segmentation, aiming to identify and segment common semantic objects within a set of images. |
Existing co-segmentation methods often struggle to accurately extract common semantic information, relying on complex network designs or requiring additional labeled data for training. This paper explores the use of CLIP to overcome these limitations. |
LCCo refines multi-scale features from a backbone segmentation network using CLIP. It employs three key modules: (1) Image Set Feature Correspondence Module - Encodes global semantic information of the image set at a coarse level. (2) CLIP Interaction Module - Modulates mid-level features using CLIP embeddings by fusing image and distilled text semantics. (3) CLIP Regularization Module - Identifies the most common semantic class from CLIP and uses it to refine fine-grained features, drawing CLIP towards the co-segmentation task. |
LCCo achieves state-of-the-art performance on four standard co-segmentation benchmarks (MSRC, Internet, iCoseg, and PASCAL), outperforming existing methods.
The method demonstrates consistent performance improvements with an increasing number of input images, unlike some previous approaches.
Ablation studies validate the effectiveness of each proposed module and the contribution of the novel segmentation and classification losses. |
The use of large-scale CLIP models increases computational demands compared to some past methods, though it remains relatively efficient.
Future work could explore extending the framework to handle more complex scenarios, such as co-segmenting multiple objects. |
image co-segmentation, contrastive language-image pre-training (clip), semantic segmentation, zero-shot learning, computer vision |
2308.11473
Report |
IT3D: Improved Text-to-3D Generation with Explicit View Synthesis |
Yiwen Chen, Chi Zhang, Xiaofeng Yang, Zhongang Cai, Gang Yu, Lei Yang, Guosheng Lin |
Recent strides in Text-to-3D techniques have been propelled by distilling
knowledge from powerful large text-to-image diffusion models (LDMs).
Nonetheless, existing Text-to-3D approaches often grapple with challenges such
as over-saturation, inadequate detailing, and unrealistic outputs. This study
presents a novel strategy that leverages explicitly synthesized multi-view
images to address these issues. Our approach involves the utilization of
image-to-image pipelines, empowered by LDMs, to generate posed high-quality
images based on the renderings of coarse 3D models. Although the generated
images mostly alleviate the aforementioned issues, challenges such as view
inconsistency and significant content variance persist due to the inherent
generative nature of large diffusion models, posing extensive difficulties in
leveraging these images effectively. To overcome this hurdle, we advocate
integrating a discriminator alongside a novel Diffusion-GAN dual training
strategy to guide the training of 3D models. For the incorporated
discriminator, the synthesized multi-view images are considered real data,
while the renderings of the optimized 3D models function as fake data. We
conduct a comprehensive set of experiments that demonstrate the effectiveness
of our method over baseline approaches. |
IT3D is a novel plug-and-play refinement method for text-to-3D generation that leverages explicitly synthesized multi-view images and a Diffusion-GAN dual training strategy. |
Existing text-to-3D methods often struggle with issues like over-saturation, lack of detail, and unrealistic outputs. IT3D addresses these limitations by incorporating high-quality 2D image generation techniques. |
1. Generate a coarse 3D model from text. 2. Synthesize a multi-view image dataset using an image-to-image pipeline conditioned on renderings of the coarse model. 3. Refine the 3D model using a Diffusion-GAN dual training strategy that combines diffusion prior with a discriminator trained on the synthesized dataset. |
Significantly enhances texture detail, geometry, and fidelity to text prompts compared to baseline methods.
Demonstrates robustness by successfully refining models even with low-quality coarse models or imperfections in the synthesized dataset.
Achieves a high user preference score (89.92%) compared to the baseline method in a user study. |
Performance is limited by the capabilities of the image-to-image pipeline used for dataset generation.
Future work could explore dataset update strategies for further quality improvement. |
text-to-3d generation, diffusion models, generative adversarial networks (gans), image-to-image translation, 3d model refinement |
2308.11417
Report |
ScanNet++: A High-Fidelity Dataset of 3D Indoor Scenes |
Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, Angela Dai |
We present ScanNet++, a large-scale dataset that couples together capture of
high-quality and commodity-level geometry and color of indoor scenes. Each
scene is captured with a high-end laser scanner at sub-millimeter resolution,
along with registered 33-megapixel images from a DSLR camera, and RGB-D streams
from an iPhone. Scene reconstructions are further annotated with an open
vocabulary of semantics, with label-ambiguous scenarios explicitly annotated
for comprehensive semantic understanding. ScanNet++ enables a new real-world
benchmark for novel view synthesis, both from high-quality RGB capture, and
importantly also from commodity-level images, in addition to a new benchmark
for 3D semantic scene understanding that comprehensively encapsulates diverse
and ambiguous semantic labeling scenarios. Currently, ScanNet++ contains 460
scenes, 280,000 captured DSLR images, and over 3.7M iPhone RGBD frames. |
ScanNet++ is a large-scale dataset of high-fidelity 3D indoor scenes, including high-resolution RGB images, commodity RGB-D videos, registered laser scans, and dense semantic annotations with an open vocabulary and multi-labeling. |
Existing datasets for 3D scene understanding and novel view synthesis lack either scale, high-quality capture, or dense and rich annotations, limiting the development of methods that generalize well. ScanNet++ bridges this divide by providing a large-scale dataset with high-quality data across multiple modalities. |
The authors captured 81 scenes using a high-end laser scanner, a DSLR camera, and an iPhone 13 Pro. They meticulously aligned all three modalities and densely annotated the reconstructions with semantic and instance labels, explicitly accounting for label ambiguities using multi-labeling. |
Novel view synthesis methods, even state-of-the-art ones, still face challenges in reconstructing detailed scenes with varying view-dependent effects.
Training novel view synthesis models on commodity-level iPhone data, while targeting high-quality DSLR ground truth, presents a significant challenge due to motion blur, varying brightness, and limited field-of-view.
The scale and diversity of ScanNet++ enable the training of generalizable priors for novel view synthesis, leading to improved performance compared to traditional single-scene training. |
Diversity is limited by the focus on indoor scenes, and the fixed DSLR settings can lead to overexposure or underexposure in certain areas.
The expensive data collection process hinders the scalability of ScanNet++ compared to 2D datasets. |
3d scene understanding, novel view synthesis, dataset, semantic segmentation, instance segmentation |
2308.11357
Report |
Exemplar-Free Continual Transformer with Convolutions |
Anurag Roy, Vinay Kumar Verma, Sravan Voonna, Kripabandhu Ghosh, Saptarshi Ghosh, Abir Das |
Continual Learning (CL) involves training a machine learning model in a
sequential manner to learn new information while retaining previously learned
tasks without the presence of previous training data. Although there has been
significant interest in CL, most recent CL approaches in computer vision have
focused on convolutional architectures only. However, with the recent success
of vision transformers, there is a need to explore their potential for CL.
Although there have been some recent CL approaches for vision transformers,
they either store training instances of previous tasks or require a task
identifier during test time, which can be limiting. This paper proposes a new
exemplar-free approach for class/task incremental learning called ConTraCon,
which does not require task-id to be explicitly present during inference and
avoids the need for storing previous training instances. The proposed approach
leverages the transformer architecture and involves re-weighting the key,
query, and value weights of the multi-head self-attention layers of a
transformer trained on a similar task. The re-weighting is done using
convolution, which enables the approach to maintain low parameter requirements
per task. Additionally, an image augmentation-based entropic task
identification approach is used to predict tasks without requiring task-ids
during inference. Experiments on four benchmark datasets demonstrate that the
proposed approach outperforms several competitive approaches while requiring
fewer parameters. |
Proposes ConTraCon, a dynamic architecture for continual learning on transformers using task-specific convolutions and skip-gating to adapt pre-trained transformer weights for new tasks, achieving low memory overhead and strong performance in exemplar-free continual learning. |
Addresses the challenge of catastrophic forgetting in continual learning, particularly with vision transformers, by enabling efficient adaptation of learned representations to new tasks without storing past data. |
Leverages convolution operations to re-weight key, query, and value weights of pre-trained transformer encoders for new tasks. Employs learnable skip-gating to balance retaining old knowledge and adapting to new information. Uses an entropy-based task identification approach with image augmentations to infer task identity during inference without requiring explicit task labels. |
Outperforms state-of-the-art continual learning approaches, including exemplar-based methods, on CIFAR-100, TinyImageNet-200/10, ImageNet-100/10, and 5-Datasets benchmarks in both task and class incremental settings.
Achieves superior accuracy with significantly lower memory overhead compared to existing methods, demonstrating efficient parameter use for continual learning.
Shows robustness to different task orders, indicating that the initial task's choice does not significantly impact overall performance. |
The selection of the optimal kernel size for the convolution operation is based on a limited validation set and could be further explored.
While the augmentation-based task prediction is lightweight, exploring alternative task inference strategies that don't rely on augmentations could be beneficial. |
continual learning, vision transformers, convolutional adaptation, exemplar-free learning, task identification |
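The entropy-based task identification described in the ConTraCon entry above can be illustrated with a minimal sketch. It assumes that each task has its own classification head, that several augmented views of the test image are scored by every head, and that the task whose head is most confident (lowest mean entropy) is selected. The function names and the toy logits are hypothetical and are not taken from the paper's code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def predict_task(logits_per_task):
    """logits_per_task[t][a] holds the logits of task t's head for augmented view a.

    Returns the index of the task with the lowest mean entropy, plus all mean entropies.
    """
    mean_entropies = []
    for views in logits_per_task:
        ents = [entropy(softmax(v)) for v in views]
        mean_entropies.append(sum(ents) / len(ents))
    best = min(range(len(mean_entropies)), key=mean_entropies.__getitem__)
    return best, mean_entropies

# Toy example (made-up numbers): two tasks, two augmented views, three classes each.
toy_logits = [
    [[0.1, 0.0, -0.1], [0.0, 0.1, 0.0]],   # task 0 head: nearly uniform, uncertain
    [[4.0, 0.0, -1.0], [3.5, 0.2, -0.5]],  # task 1 head: peaked, confident
]
task_id, task_entropies = predict_task(toy_logits)
print(task_id, task_entropies)
```

Averaging over augmented views is what makes the estimate less sensitive to a single unlucky crop or flip; the actual augmentation set and aggregation rule used by the authors may differ.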
2308.11331
Report |
GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training |
Xinchi Deng, Han Shi, Runhui Huang, Changlin Li, Hang Xu, Jianhua Han, James Kwok, Shen Zhao, Wei Zhang, Xiaodan Liang |
Cross-modal pre-training has shown impressive performance on a wide range of
downstream tasks, benefiting from massive image-text pairs collected from the
Internet. In practice, online data are growing constantly, highlighting the
importance of the ability of pre-trained models to learn from data that is
continuously growing. Existing works on cross-modal pre-training mainly focus
on training a network with fixed architecture. However, it is impractical to
limit the model capacity when considering the continuously growing nature of
pre-training data in real-world applications. On the other hand, it is
important to utilize the knowledge in the current model to obtain efficient
training and better performance. To address the above issues, in this paper, we
propose GrowCLIP, a data-driven automatic model growing algorithm for
contrastive language-image pre-training with continuous image-text pairs as
input. Specifically, we adopt a dynamic growth space and seek out the optimal
architecture at each growth step to adapt to online learning scenarios. A
shared encoder is proposed in our growth space to enhance the degree of
cross-modal fusion. Besides, we explore the effect of growth in different
dimensions, which could provide future references for the design of cross-modal
model architecture. Finally, we employ parameter inheriting with momentum (PIM)
to maintain the previous knowledge and address the issue of the local minimum
dilemma. Compared with existing methods, GrowCLIP improves average top-1
accuracy by 2.3% on zero-shot image classification across 9 downstream tasks.
For zero-shot image retrieval, GrowCLIP improves top-1 image-to-text recall by
1.2% on the Flickr30K dataset. |
This paper proposes GrowCLIP, a data-driven automatic model growing algorithm for contrastive language-image pre-training, designed for scenarios where image-text pair data grows continuously. |
Existing cross-modal pre-training methods primarily use fixed architectures, which are not optimal for continuously growing datasets. This highlights the need for methods that can dynamically adapt model capacity to data size. |
GrowCLIP utilizes a dynamic growth space for neural architecture search, a shared encoder to enhance cross-modal fusion, and parameter inheriting with momentum (PIM) to efficiently transfer knowledge from previous models and avoid local minimum issues. |
GrowCLIP achieves up to 2.3% higher average top-1 accuracy on zero-shot image classification across nine datasets compared to existing methods.
On zero-shot image-text retrieval, GrowCLIP demonstrates a 1.2% improvement in top-1 image-to-text recall on the Flickr30K dataset.
The study also reveals insights into the relationship between model architecture, data size, and training efficiency in cross-modal pre-training. |
The effectiveness of GrowCLIP has only been demonstrated using the CC12M dataset.
Future work will focus on extending GrowCLIP to real-world scenarios with constantly updated data from the web. |
cross-modal pre-training, model growing, online learning, neural architecture search, vision-language pre-training (vlp) |
2308.11199
Report |
ConcatPlexer: Additional Dim1 Batching for Faster ViTs |
Donghoon Han, Seunghyeon Seo, Donghyeon Jeon, Jiho Jang, Chaerin Kong, Nojun Kwak |
Transformers have demonstrated tremendous success not only in the natural
language processing (NLP) domain but also the field of computer vision,
igniting various creative approaches and applications. Yet, the superior
performance and modeling flexibility of transformers came with a severe
increase in computation costs, and hence several works have proposed methods to
reduce this burden. Inspired by a cost-cutting method originally proposed for
language models, Data Multiplexing (DataMUX), we propose a novel approach for
efficient visual recognition that employs additional dim1 batching (i.e.,
concatenation) that greatly improves the throughput with little compromise in
the accuracy. We first introduce a naive adaptation of DataMUX for vision
models, Image Multiplexer, and devise novel components to overcome its
weaknesses, rendering our final model, ConcatPlexer, at the sweet spot between
inference speed and accuracy. ConcatPlexer was trained on the ImageNet1K and
CIFAR100 datasets and requires 23.5% fewer GFLOPs than ViT-B/16 while reaching
69.5% and 83.4% validation accuracy, respectively. |
This paper proposes ConcatPlexer, a novel framework for multiplexing images in the vision domain, aiming to improve computational efficiency by processing multiple images simultaneously. |
Transformer-based models, while powerful, are computationally expensive. Data multiplexing offers a promising way to reduce this cost, but it has been largely unexplored in vision. |
The authors adapt the DataMUX method from NLP to vision by introducing components like a Transformer Encoder Patchifier and a ConcatMultiplexer. They evaluate ConcatPlexer on ImageNet1K and CIFAR100 image classification tasks. |
ConcatPlexer achieves a favorable trade-off between accuracy and inference speed, significantly reducing computational cost compared to conventional ViT models.
Increasing the number of multiplexed images (N_MUX) generally reduces accuracy, highlighting the challenge of this novel task.
The proposed method outperforms a naive adaptation of DataMUX (Image Multiplexer), demonstrating its effectiveness in the vision domain. |
The performance of ConcatPlexer on ImageNet1K, while promising, lags behind conventional ViT models, suggesting room for improvement.
The current Conv-based multiplexing method and hyperparameters could be further optimized to enhance performance. |
vision transformer, data multiplexing, computational efficiency, image classification, concatplexer |
2308.11194
Report |
ViLLA: Fine-Grained Vision-Language Representation Learning from Real-World Data |
Maya Varma, Jean-Benoit Delbrouck, Sarah Hooper, Akshay Chaudhari, Curtis Langlotz |
Vision-language models (VLMs), such as CLIP and ALIGN, are generally trained
on datasets consisting of image-caption pairs obtained from the web. However,
real-world multimodal datasets, such as healthcare data, are significantly more
complex: each image (e.g. X-ray) is often paired with text (e.g. physician
report) that describes many distinct attributes occurring in fine-grained
regions of the image. We refer to these samples as exhibiting high pairwise
complexity, since each image-text pair can be decomposed into a large number of
region-attribute pairings. The extent to which VLMs can capture fine-grained
relationships between image regions and textual attributes when trained on such
data has not been previously evaluated. The first key contribution of this work
is to demonstrate through systematic evaluations that as the pairwise
complexity of the training dataset increases, standard VLMs struggle to learn
region-attribute relationships, exhibiting performance degradations of up to
37% on retrieval tasks. In order to address this issue, we introduce ViLLA as
our second key contribution. ViLLA, which is trained to capture fine-grained
region-attribute relationships from complex datasets, involves two components:
(a) a lightweight, self-supervised mapping model to decompose image-text
samples into region-attribute pairs, and (b) a contrastive VLM to learn
representations from generated region-attribute pairs. We demonstrate with
experiments across four domains (synthetic, product, medical, and natural
images) that ViLLA outperforms comparable VLMs on fine-grained reasoning tasks,
such as zero-shot object detection (up to 3.6 AP50 points on COCO and 0.6 mAP
points on LVIS) and retrieval (up to 14.2 R-Precision points). |
This paper investigates the performance of Vision-Language Models (VLMs) on real-world datasets with high pairwise complexity, where image-text pairs can be decomposed into many region-attribute pairings. The authors introduce ViLLA, a self-supervised approach to improve fine-grained reasoning in VLMs trained on such datasets. |
Standard VLMs, trained on simple image-caption pairs, struggle to capture fine-grained region-attribute relationships present in complex, real-world multimodal datasets. |
ViLLA uses a two-stage pipeline: 1) a lightweight mapping model decomposes image-text samples into region-attribute pairs using self-supervision, 2) a standard VLM is trained on these generated pairs to learn fine-grained representations. |
ViLLA outperforms comparable VLMs on zero-shot object detection (up to 3.6 AP50 points on COCO and 0.6 mAP points on LVIS).
ViLLA achieves significant improvements in text-to-region and region-to-text retrieval tasks (up to 14.2 R-Precision points improvement).
ViLLA's region-attribute mappings are up to 25.8 F1 points more accurate than prior methods. |
Evaluations are currently limited to image-text datasets.
Region-attribute mapping accuracy evaluation is limited on datasets without ground-truth annotations. |
vision-language models, fine-grained reasoning, self-supervised learning, multimodal datasets, region-attribute mapping |
2308.11130
Report |
Efficient View Synthesis with Neural Radiance Distribution Field |
Yushuang Wu, Xiao Li, Jinglu Wang, Xiaoguang Han, Shuguang Cui, Yan Lu |
Recent work on Neural Radiance Fields (NeRF) has demonstrated significant
advances in high-quality view synthesis. A major limitation of NeRF is its low
rendering efficiency due to the need for multiple network forwardings to render
a single pixel. Existing methods to improve NeRF either reduce the number of
required samples or optimize the implementation to accelerate the network
forwarding. Despite these efforts, the problem of multiple sampling persists
due to the intrinsic representation of radiance fields. In contrast, Neural
Light Fields (NeLF) reduce the computation cost of NeRF by querying only one
single network forwarding per pixel. To achieve a close visual quality to NeRF,
existing NeLF methods require significantly larger network capacities which
limits their rendering efficiency in practice. In this work, we propose a new
representation called Neural Radiance Distribution Field (NeRDF) that targets
efficient view synthesis in real-time. Specifically, we use a small network
similar to NeRF while preserving the rendering speed with a single network
forwarding per pixel as in NeLF. The key is to model the radiance distribution
along each ray with frequency basis and predict frequency weights using the
network. Pixel values are then computed via volume rendering on radiance
distributions. Experiments show that our proposed method offers a better
trade-off among speed, quality, and network size than existing methods: we
achieve a ~254x speed-up over NeRF with similar network size, with only a
marginal performance decline. Our project page is at
yushuang-wu.github.io/NeRDF. |
This paper proposes Neural Radiance Distribution Field (NeRDF), a novel neural scene representation for real-time view synthesis. |
Existing view synthesis methods struggle to achieve a balance between high visual quality, fast rendering speed, and low memory cost. NeRDF aims to break this impossible trinity. |
NeRDF predicts the radiance distribution along each ray using a compact neural network. It leverages a knowledge distillation framework with a teacher NeRF and introduces online view sampling and a volume density constraint to enhance learning. |
NeRDF achieves comparable visual quality to NeRF-based methods while being significantly faster.
On the LLFF dataset, NeRDF-8 achieves a rendering speed of ~21 FPS with an 8-layer MLP, outperforming most NeRF-based methods.
With inference optimization, NeRDF achieves up to ~369 FPS, a ~1400x speed-up over an unoptimized NeRF. |
NeRDF needs to be extended to handle 360-degree scenes effectively.
Future work includes improving view synthesis quality and extending NeRDF for dynamic scenes. |
view synthesis, neural radiance fields, neural light fields, knowledge distillation, real-time rendering |
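The core idea in the NeRDF entry above (one set of frequency weights per ray, then volume rendering over the reconstructed distribution) can be sketched roughly as follows. This is a single-channel toy under stated assumptions: the Fourier-style basis, the ReLU on density, the sigmoid on color, and all function names are illustrative choices, and the weights that a network would predict in one forward pass are passed in by hand.

```python
import math

def eval_basis(weights, t):
    """Evaluate sum_k a_k cos(2*pi*k*t) + b_k sin(2*pi*k*t) at t in [0, 1].

    weights is a list of (a_k, b_k) pairs, with k starting at 0.
    """
    return sum(
        a * math.cos(2.0 * math.pi * k * t) + b * math.sin(2.0 * math.pi * k * t)
        for k, (a, b) in enumerate(weights)
    )

def render_ray(density_w, color_w, n_samples=64, near=0.0, far=1.0):
    """Volume-render one ray from frequency weights (assumed to come from one network call)."""
    delta = (far - near) / n_samples
    transmittance = 1.0
    pixel = 0.0
    for i in range(n_samples):
        t = (i + 0.5) / n_samples                         # normalized position along the ray
        sigma = max(0.0, eval_basis(density_w, t))        # ReLU keeps density non-negative
        color = 1.0 / (1.0 + math.exp(-eval_basis(color_w, t)))  # sigmoid maps to [0, 1]
        alpha = 1.0 - math.exp(-sigma * delta)            # opacity of this segment
        pixel += transmittance * alpha * color
        transmittance *= 1.0 - alpha
    return pixel, transmittance

# Empty space: zero density, so nothing is accumulated.
empty_pixel, empty_T = render_ray([(0.0, 0.0)], [(0.0, 0.0)])
# Dense uniform fog with mid-gray color (sigmoid(0) = 0.5).
fog_pixel, fog_T = render_ray([(50.0, 0.0)], [(0.0, 0.0)])
print(empty_pixel, empty_T, fog_pixel, fog_T)
```

The point of the design is that the per-sample loop involves only cheap basis evaluations, while the expensive network is queried once per ray rather than once per sample.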
2308.11093
Report |
Video OWL-ViT: Temporally-consistent open-world localization in video |
Georg Heigold, Matthias Minderer, Alexey Gritsenko, Alex Bewley, Daniel Keysers, Mario Lučić, Fisher Yu, Thomas Kipf |
We present an architecture and a training recipe that adapts pre-trained
open-world image models to localization in videos. Understanding the open
visual world (without being constrained by fixed label spaces) is crucial for
many real-world vision tasks. Contrastive pre-training on large image-text
datasets has recently led to significant improvements for image-level tasks.
For more structured tasks involving object localization, applying pre-trained
models is more challenging. This is particularly true for video tasks, where
task-specific data is limited. We show successful transfer of open-world models
by building on the OWL-ViT open-vocabulary detection model and adapting it to
video by adding a transformer decoder. The decoder propagates object
representations recurrently through time by using the output tokens for one
frame as the object queries for the next. Our model is end-to-end trainable on
video data and enjoys improved temporal consistency compared to
tracking-by-detection baselines, while retaining the open-world capabilities of
the backbone detector. We evaluate our model on the challenging TAO-OW
benchmark and demonstrate that open-world capabilities, learned from
large-scale image-text pre-training, can be transferred successfully to
open-world localization across diverse videos. |
The paper introduces Video OWL-ViT, an end-to-end trainable model for open-world object localization and tracking in videos. |
Open-world object understanding in videos is crucial for real-world applications but challenging due to limitations in labeled video data. |
The model adapts the OWL-ViT image-based open-world detector by adding a transformer decoder for temporal consistency and is trained on a combination of real and pseudo videos. |
Video OWL-ViT achieves competitive performance with tracking-by-detection baselines on the TAO-OW benchmark.
The model shows strong generalization to unseen object classes in both TAO-OW and YT-VIS datasets.
End-to-end learning of temporal associations leads to improved accuracy compared to heuristic matching methods. |
Performance on short object tracks remains a challenge due to limitations in training data and object presence modeling.
Future work includes exploring better object presence indicators and leveraging larger and more diverse video datasets. |
open-world learning, object tracking, video understanding, vision transformer, open-vocabulary detection |
2308.11025
Report |
Coordinate Quantized Neural Implicit Representations for Multi-view Reconstruction |
Sijia Jiang, Jing Hua, Zhizhong Han |
In recent years, huge progress has been made on learning neural implicit
representations from multi-view images for 3D reconstruction. As an additional
input complementing coordinates, using sinusoidal functions as positional
encodings plays a key role in revealing high frequency details with
coordinate-based neural networks. However, high frequency positional encodings
make the optimization unstable, which results in noisy reconstructions and
artifacts in empty space. To resolve this issue in a general sense, we
propose learning neural implicit representations with quantized coordinates,
which reduces the uncertainty and ambiguity in the field during optimization.
Instead of using continuous coordinates directly, we snap them to
discrete coordinates using nearest interpolation among quantized coordinates,
which are obtained by discretizing the field at an extremely high resolution.
We use discrete coordinates and their positional encodings to learn implicit
functions through volume rendering. This significantly reduces the variations
in the sample space, and triggers more multi-view consistency constraints on
intersections of rays from different views, which enables the implicit
function to be inferred more effectively. Our quantized coordinates do not add any
computational burden, and can seamlessly work upon the latest methods. Our
evaluations under the widely used benchmarks show our superiority over the
state-of-the-art. Our code is available at
https://github.com/MachinePerceptionLab/CQ-NIR. |
This paper introduces quantized coordinates to stabilize the optimization process and enhance the accuracy of neural implicit representations learned from multi-view images. |
High-frequency positional encodings, while crucial for capturing detail in neural implicit representations, often lead to unstable optimization, resulting in noisy reconstructions and artifacts. This paper addresses this challenge by reducing uncertainty and ambiguity during optimization. |
The authors discretize continuous 3D coordinates into discrete ones using nearest interpolation based on a high-resolution grid of quantized coordinates. These discrete coordinates, along with their positional encodings, are then used as input for learning implicit functions via volume rendering. |
Quantized coordinates significantly reduce variations in the sample space, leading to more stable optimization.
The approach effectively imposes multi-view consistency constraints, improving the accuracy of the learned implicit functions.
Experiments on DTU, ScanNet, and Replica datasets demonstrate superior performance compared to state-of-the-art methods, showcasing smoother surfaces and finer geometric details. |
A very high resolution of quantized coordinates may degenerate the result due to less overlapped samples along rays.
Future work involves exploring alternative discretization strategies beyond nearest interpolation to potentially further enhance accuracy. |
neural implicit representations, 3d reconstruction, multi-view reconstruction, quantized coordinates, positional encoding |
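The coordinate quantization step described in the entry above can be sketched in a few lines. This is a minimal illustration, assuming a scene normalized to the cube [-1, 1]^3 and a standard sinusoidal positional encoding; the function names, the tiny grid resolution, and the number of frequencies are chosen for readability and are not the paper's settings.

```python
import math

def quantize(point, resolution, lo=-1.0, hi=1.0):
    """Snap each coordinate to the nearest vertex of a regular grid with `resolution` cells per axis."""
    step = (hi - lo) / resolution
    snapped = []
    for x in point:
        idx = round((x - lo) / step)          # nearest grid index
        idx = min(max(idx, 0), resolution)    # clamp to the grid bounds
        snapped.append(lo + idx * step)
    return tuple(snapped)

def positional_encoding(point, n_freqs=4):
    """Sinusoidal encoding of a (quantized) point: sin and cos at octave frequencies."""
    feats = []
    for x in point:
        for k in range(n_freqs):
            feats.append(math.sin((2 ** k) * math.pi * x))
            feats.append(math.cos((2 ** k) * math.pi * x))
    return feats

# Two nearby samples (e.g. from rays of different views) collapse onto the same vertex.
q_a = quantize((0.21, 0.0, 0.95), resolution=10)
q_b = quantize((0.19, 0.0, 0.95), resolution=10)
enc = positional_encoding(q_a)
print(q_a, q_b, len(enc))
```

Because both samples map to the same discrete coordinate, they also receive identical encodings, which is how quantization reduces variation in the sample space and ties together observations from different views.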
2308.10997
Report |
MarkovGen: Structured Prediction for Efficient Text-to-Image Generation |
Sadeep Jayasumana, Daniel Glasner, Srikumar Ramalingam, Andreas Veit, Ayan Chakrabarti, Sanjiv Kumar |
Modern text-to-image generation models produce high-quality images that are
both photorealistic and faithful to the text prompts. However, this quality
comes at significant computational cost: nearly all of these models are
iterative and require running sampling multiple times with large models. This
iterative process is needed to ensure that different regions of the image are
not only aligned with the text prompt, but also compatible with each other. In
this work, we propose a light-weight approach to achieving this compatibility
between different regions of an image, using a Markov Random Field (MRF) model.
We demonstrate the effectiveness of this method on top of the latent
token-based Muse text-to-image model. The MRF richly encodes the compatibility
among image tokens at different spatial locations to improve quality and
significantly reduce the required number of Muse sampling steps. Inference with
the MRF is significantly cheaper, and its parameters can be quickly learned
through back-propagation by modeling MRF inference as a differentiable
neural-network layer. Our full model, MarkovGen, uses this proposed MRF model
to both speed up Muse by 1.5X and produce higher quality images by decreasing
undesirable image artifacts. |
This paper introduces MarkovGen, a text-to-image model that leverages a Markov Random Field (MRF) to improve speed and quality of token-based image generation, focusing on the Muse model. |
Existing text-to-image models, though impressive, often require significant computational resources due to iterative sampling processes. MarkovGen addresses this by using an MRF to efficiently ensure compatibility between image regions, leading to faster and better image generation. |
The authors formulate the token arrangement problem as MAP inference in an MRF, capturing token compatibility via unary (neural network confidence) and pairwise (spatial and label compatibility) terms. They then integrate this MRF into Muse, replacing its later sampling steps to expedite the process. |
MarkovGen achieves a 1.5x speedup compared to the baseline Muse model.
Human evaluation confirms that MarkovGen generates higher quality images than both early exit and full Muse models.
Quantitative evaluation using FID scores on the MS-COCO dataset further validates MarkovGen's improved image quality over Muse. |
The current MRF model doesn't directly incorporate text prompt information, relying solely on unary terms for text guidance.
Future work could explore joint training of the Muse model with the MRF layers for optimal unary generation. |
text-to-image generation, markov random field, structured prediction, muse, image quality |
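To make the unary-plus-pairwise formulation in the MarkovGen entry above concrete, here is a small sketch of approximate MRF inference. It assumes mean-field updates on a 1D chain of token positions with a single label-compatibility matrix; the paper's actual inference procedure, neighborhood structure, and parameterization are not specified in this entry, so treat every name and number below as hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_field(unary, compat, n_iters=5):
    """Approximate MAP labeling of a chain MRF.

    unary[i][l]  : log-confidence of label l at position i (e.g. from the token model).
    compat[l][m] : compatibility score between label l and a neighboring label m.
    """
    n, n_labels = len(unary), len(unary[0])
    q = [softmax(u) for u in unary]            # start from the unary beliefs
    for _ in range(n_iters):
        new_q = []
        for i in range(n):
            neighbors = [j for j in (i - 1, i + 1) if 0 <= j < n]
            scores = []
            for l in range(n_labels):
                # Expected compatibility of label l with the neighbors' current beliefs.
                msg = sum(compat[l][m] * q[j][m] for j in neighbors for m in range(n_labels))
                scores.append(unary[i][l] + msg)
            new_q.append(softmax(scores))
        q = new_q                               # parallel update of all positions
    return [max(range(n_labels), key=qi.__getitem__) for qi in q]

# Toy example: the middle token weakly prefers label 1, but its neighbors strongly prefer label 0.
toy_unary = [[3.0, 0.0], [0.0, 0.5], [3.0, 0.0]]
toy_compat = [[2.0, -2.0], [-2.0, 2.0]]         # equal neighboring labels are rewarded
unary_only = [max(range(2), key=u.__getitem__) for u in toy_unary]
mrf_labels = mean_field(toy_unary, toy_compat)
print(unary_only, mrf_labels)
```

Each update consists only of sums, products, and a softmax, so a fixed number of iterations can be unrolled as a differentiable layer, which is consistent with the entry's remark that the MRF parameters are learned by back-propagation.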
2308.10916
Report |
Diffusion Model as Representation Learner |
Xingyi Yang, Xinchao Wang |
Diffusion Probabilistic Models (DPMs) have recently demonstrated impressive
results on various generative tasks. Despite their promise, the learned
representations of pre-trained DPMs, however, have not been fully understood.
In this paper, we conduct an in-depth investigation of the representation power
of DPMs, and propose a novel knowledge transfer method that leverages the
knowledge acquired by generative DPMs for recognition tasks. Our study begins
by examining the feature space of DPMs, revealing that DPMs are inherently
denoising autoencoders that balance the representation learning with
regularizing model capacity. To this end, we introduce a novel knowledge
transfer paradigm named RepFusion. Our paradigm extracts representations at
different time steps from off-the-shelf DPMs and dynamically employs them as
supervision for student networks, in which the optimal time is determined
through reinforcement learning. We evaluate our approach on several image
classification, semantic segmentation, and landmark detection benchmarks, and
demonstrate that it outperforms state-of-the-art methods. Our results uncover
the potential of DPMs as a powerful tool for representation learning and
provide insights into the usefulness of generative models beyond sample
generation. The code is available at
https://github.com/Adamdad/Repfusion. |
This paper investigates the representation learning capability of pre-trained Diffusion Probabilistic Models (DPMs) and proposes RepFusion, a novel knowledge transfer method for recognition tasks. |
While DPMs excel in generative tasks, their potential for representation learning remains underexplored. This work aims to bridge this gap and leverage pre-trained DPMs for improved recognition performance. |
The authors analyze DPMs as denoising autoencoders, revealing a trade-off between representation learning and regularization. RepFusion utilizes knowledge distillation, dynamically extracting intermediate representations from DPMs at optimal time steps determined via reinforcement learning. |
RepFusion enhances semantic segmentation, exceeding both knowledge distillation and self-supervised learning approaches on CelebAMask-HQ.
It improves face keypoint detection, surpassing self-supervised methods on WFLW, particularly in challenging scenarios with pose variation, occlusion, and poor illumination.
RepFusion boosts image classification accuracy on CIFAR-10 and Tiny-ImageNet, outperforming models distilled from supervised teachers. |
The study primarily focuses on visual recognition tasks, leaving exploration of other domains for future work.
Future work can investigate the computational cost associated with the reinforcement learning component of RepFusion. |
diffusion probabilistic models, representation learning, knowledge distillation, reinforcement learning, visual recognition |
2308.10902
Report |
CamP: Camera Preconditioning for Neural Radiance Fields |
Keunhong Park, Philipp Henzler, Ben Mildenhall, Jonathan T. Barron, Ricardo Martin-Brualla |
Neural Radiance Fields (NeRF) can be optimized to obtain high-fidelity 3D
scene reconstructions of objects and large-scale scenes. However, NeRFs require
accurate camera parameters as input -- inaccurate camera parameters result in
blurry renderings. Extrinsic and intrinsic camera parameters are usually
estimated using Structure-from-Motion (SfM) methods as a pre-processing step to
NeRF, but these techniques rarely yield perfect estimates. Thus, prior works
have proposed jointly optimizing camera parameters alongside a NeRF, but these
methods are prone to local minima in challenging settings. In this work, we
analyze how different camera parameterizations affect this joint optimization
problem, and observe that standard parameterizations exhibit large differences
in magnitude with respect to small perturbations, which can lead to an
ill-conditioned optimization problem. We propose using a proxy problem to
compute a whitening transform that eliminates the correlation between camera
parameters and normalizes their effects, and we propose to use this transform
as a preconditioner for the camera parameters during joint optimization. Our
preconditioned camera optimization significantly improves reconstruction
quality on scenes from the Mip-NeRF 360 dataset: we reduce error rates (RMSE)
by 67% compared to state-of-the-art NeRF approaches that do not optimize for
cameras like Zip-NeRF, and by 29% relative to state-of-the-art joint
optimization approaches using the camera parameterization of SCNeRF. Our
approach is easy to implement, does not significantly increase runtime, can be
applied to a wide variety of camera parameterizations, and can
straightforwardly be incorporated into other NeRF-like models. |
This paper proposes CamP, a preconditioning technique for camera parameters in Neural Radiance Fields (NeRFs) that improves joint optimization of camera parameters and scene reconstruction. |
NeRFs are sensitive to camera parameter accuracy, and existing joint optimization methods struggle with local minima due to the ill-conditioned nature of the problem caused by differing parameter sensitivities. |
The method analyzes camera parameterization effects on point projection using a proxy problem. It computes a whitening transform as a preconditioner, normalizing parameter effects and decorrelating them to improve optimization stability. |
Preconditioned camera optimization significantly improves reconstruction quality on the mip-NeRF 360 dataset, reducing error rates compared to non-optimizing and state-of-the-art joint optimization approaches.
The FocalPose camera parameterization, when preconditioned, outperforms other alternatives on both synthetic and real datasets.
Preconditioning consistently improves results across different camera parameterizations and datasets, including challenging cellphone captures with ARKit poses. |
The preconditioning approach may not always prevent local minima, particularly in challenging cases with significant camera pose errors.
Dynamically updating the preconditioner during optimization could be beneficial but requires further investigation. |
neural radiance fields, camera optimization, 3d reconstruction, preconditioning, novel view synthesis |
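The whitening preconditioner summarized in this report can be illustrated on a toy proxy problem. The sketch below is not the authors' implementation. It assumes a two-parameter pinhole camera (focal length and an x-translation, both hypothetical choices), a handful of made-up 3D points, a finite-difference Jacobian J of the projected pixels, and the common whitening construction P^{-1} = (J^T J)^{-1/2}. It shows that the raw Jacobian columns differ in scale by orders of magnitude, while the Jacobian with respect to the preconditioned parameters has an identity Gram matrix.

```python
import math

# Toy proxy problem (all values are illustrative assumptions).
POINTS = [(0.5, 0.2, 2.0), (-0.3, 0.4, 3.0), (0.1, -0.6, 4.0), (-0.7, -0.1, 2.5)]
PARAMS = [500.0, 0.0]  # [focal length in pixels, camera x-translation]


def project(params, points):
    """Project 3D points with a pinhole model; returns a flat list of pixel coords."""
    focal, tx = params
    pixels = []
    for x, y, z in points:
        pixels.extend([focal * (x + tx) / z, focal * y / z])
    return pixels


def jacobian(params, points, eps=1e-5):
    """Central finite-difference Jacobian d(pixels)/d(params), as a list of rows."""
    columns = []
    for k in range(len(params)):
        plus, minus = list(params), list(params)
        plus[k] += eps
        minus[k] -= eps
        f_plus, f_minus = project(plus, points), project(minus, points)
        columns.append([(a - b) / (2 * eps) for a, b in zip(f_plus, f_minus)])
    return [list(row) for row in zip(*columns)]


def gram(J):
    """Return J^T J for a Jacobian stored as a list of rows."""
    n = len(J[0])
    return [[sum(r[i] * r[j] for r in J) for j in range(n)] for i in range(n)]


def inv_sqrt_2x2(S):
    """Inverse square root of a symmetric positive-definite 2x2 matrix."""
    a, b, d = S[0][0], S[0][1], S[1][1]
    s = math.sqrt(a * d - b * b)       # sqrt(det S) = det of sqrt(S)
    t = math.sqrt(a + d + 2.0 * s)     # trace of sqrt(S)
    r00, r01, r11 = (a + s) / t, b / t, (d + s) / t  # entries of sqrt(S)
    det = r00 * r11 - r01 * r01
    return [[r11 / det, -r01 / det], [-r01 / det, r00 / det]]


def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


J = jacobian(PARAMS, POINTS)
P_inv = inv_sqrt_2x2(gram(J))   # whitening transform for the camera parameters
J_pre = matmul(J, P_inv)        # Jacobian w.r.t. the preconditioned parameters

print("raw J^T J:", gram(J))
print("preconditioned J^T J:", gram(J_pre))
```

In a joint optimization, the optimizer would update the preconditioned parameters and map them back through P_inv before projection; that wiring, and any damping of J^T J, is left out of this sketch.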
2308.10718
Report |
Backdooring Textual Inversion for Concept Censorship |
Yutong Wu, Jie Zhang, Florian Kerschbaum, Tianwei Zhang |
Recent years have witnessed success in AIGC (AI Generated Content). People
can make use of a pre-trained diffusion model to generate images of high
quality or freely modify existing pictures with only prompts in natural
language. More excitingly, the emerging personalization techniques make it
feasible to create specific-desired images with only a few images as
references. However, this induces severe threats if such advanced techniques
are misused by malicious users, such as spreading fake news or defaming
individual reputations. Thus, it is necessary to regulate personalization
models (i.e., concept censorship) for their development and advancement.
In this paper, we focus on the personalization technique dubbed Textual
Inversion (TI), which is becoming prevalent for its lightweight nature and
excellent performance. TI crafts the word embedding that contains detailed
information about a specific object. Users can easily download the word
embedding from public websites like Civitai and add it to their own stable
diffusion model without fine-tuning for personalization. To achieve the concept
censorship of a TI model, we propose leveraging the backdoor technique for good
by injecting backdoors into the Textual Inversion embeddings. Briefly, we
select some sensitive words as triggers during the training of TI, which will
be censored for normal use. In the subsequent generation stage, if the triggers
are combined with personalized embeddings as final prompts, the model will
output a pre-defined target image rather than images including the desired
malicious concept.
To demonstrate the effectiveness of our approach, we conduct extensive
experiments on Stable Diffusion, a prevailing open-sourced text-to-image model.
Our code, data, and results are available at
https://concept-censorship.github.io. |
This paper presents a novel method for concept censorship in Textual Inversion, a technique for personalizing text-to-image models. |
The increasing availability of AI-generated content, particularly through personalized text-to-image models, presents risks of misuse such as spreading misinformation and creating harmful content. This work aims to mitigate these risks by regulating the use of Textual Inversion. |
The authors propose to backdoor Textual Inversion embeddings by incorporating sensitive words as triggers during training. This results in the model generating a pre-defined target image instead of the desired content when these triggers are included in the prompts. |
The proposed method successfully embeds multiple themes into a single word embedding, allowing for censorship of various concepts.
Backdoored Textual Inversion retains its utility for benign prompts, preserving the fidelity and editability of generated images.
The backdoor is robust against potential countermeasures such as word embedding removal, perturbation, and adaptive attacks. |
The current approach requires training Textual Inversion from scratch, limiting its applicability to scenarios where training data is accessible.
The method relies on a set of hyper-parameters, and finding the optimal configuration can be costly. |
textual inversion, concept censorship, backdoor attacks, text-to-image generation, diffusion models |
2308.10648
Report |
EVE: Efficient zero-shot text-based Video Editing with Depth Map Guidance and Temporal Consistency Constraints |
Yutao Chen, Xingning Dong, Tian Gan, Chunluan Zhou, Ming Yang, Qingpei Guo |
Motivated by the superior performance of image diffusion models, more and
more researchers strive to extend these models to the text-based video editing
task. Nevertheless, current video editing tasks mainly suffer from the dilemma
between the high fine-tuning cost and the limited generation capacity. Compared
with images, we conjecture that videos necessitate more constraints to preserve
the temporal consistency during editing. Towards this end, we propose EVE, a
robust and efficient zero-shot video editing method. Under the guidance of
depth maps and temporal consistency constraints, EVE derives satisfactory video
editing results with an affordable computational and time cost. Moreover,
recognizing the absence of a publicly available video editing dataset for fair
comparisons, we construct a new benchmark ZVE-50 dataset. Through comprehensive
experimentation, we validate that EVE could achieve a satisfactory trade-off
between performance and efficiency. We will release our dataset and codebase to
facilitate future researchers. |
This paper proposes EVE, a zero-shot, text-based video editing method that balances generation quality and efficiency by using depth map guidance and temporal consistency constraints. |
Current text-based video editing methods struggle with the trade-off between high fine-tuning costs and limited generation quality, especially in maintaining temporal consistency. |
EVE uses a pre-trained LDM and operates in a zero-shot manner. It incorporates depth map features in the DDIM inversion and denoising process and introduces a frame-aligned attention mechanism to enhance temporal consistency. |
EVE outperforms the baseline FateZero in both temporal and prompt consistency on the ZVE-50 dataset.
Depth map guidance and frame-aligned attention are shown to significantly improve temporal consistency in edited videos.
EVE is significantly faster than tuning-based methods and the zero-shot baseline FateZero. |
The performance gap between zero-shot and tuning-based video editing methods remains.
Further research on enhancing temporal stability and knowledge distillation during the editing process is needed. |
video editing, zero-shot learning, diffusion models, depth maps, temporal consistency |
2308.10608
Report |
FocalDreamer: Text-driven 3D Editing via Focal-fusion Assembly |
Yuhan Li, Yishun Dou, Yue Shi, Yu Lei, Xuanhong Chen, Yi Zhang, Peng Zhou, Bingbing Ni |
While text-3D editing has made significant strides in leveraging score
distillation sampling, emerging approaches still fall short in delivering
separable, precise and consistent outcomes that are vital to content creation.
In response, we introduce FocalDreamer, a framework that merges base shape with
editable parts according to text prompts for fine-grained editing within
desired regions. Specifically, equipped with geometry union and dual-path
rendering, FocalDreamer assembles independent 3D parts into a complete object,
tailored for convenient instance reuse and part-wise control. We propose
geometric focal loss and style consistency regularization, which encourage
focal fusion and congruent overall appearance. Furthermore, FocalDreamer
generates high-fidelity geometry and PBR textures which are compatible with
widely-used graphics engines. Extensive experiments have highlighted the
superior editing capabilities of FocalDreamer in both quantitative and
qualitative evaluations. |
This paper introduces FocalDreamer, a novel framework for text-driven local 3D editing that enables separable, precise, and consistent modifications by assembling base shapes with editable parts. |
Current text-3D editing methods fall short in producing separable, precise, and consistent edits essential for content creation. |
FocalDreamer uses a two-stage training strategy for geometry and appearance, geometry union for merging parts, dual-path rendering for independent texture control, and introduces geometric focal loss and style consistency regularization. |
Outperforms baselines in qualitative and quantitative evaluations, demonstrating superior editing capabilities.
Achieves higher CLIP similarity and direction similarity scores, indicating better prompt alignment and edit direction accuracy.
User study confirms FocalDreamer's effectiveness in preserving base shapes while achieving prompt-relevant edits. |
Limited to single object editing and requires pre-defined base shapes.
Future work includes extending to scene editing and exploring shape generation within focal regions. |
3d editing, text-to-3d, score distillation sampling, geometry processing, deep learning |
2308.10554
Report |
Improving Diversity in Zero-Shot GAN Adaptation with Semantic Variations |
Seogkyu Jeon, Bei Liu, Pilhyeon Lee, Kibeom Hong, Jianlong Fu, Hyeran Byun |
Training deep generative models usually requires a large amount of data. To
alleviate the data collection cost, the task of zero-shot GAN adaptation aims
to reuse well-trained generators to synthesize images of an unseen target
domain without any further training samples. Due to the data absence, the
textual description of the target domain and the vision-language models, e.g.,
CLIP, are utilized to effectively guide the generator. However, with only a
single representative text feature instead of real images, the synthesized
images gradually lose diversity as the model is optimized, which is also known
as mode collapse. To tackle the problem, we propose a novel method to find
semantic variations of the target text in the CLIP space. Specifically, we
explore diverse semantic variations based on the informative text feature of
the target domain while regularizing the uncontrolled deviation of the semantic
information. With the obtained variations, we design a novel directional moment
loss that matches the first and second moments of image and text direction
distributions. Moreover, we introduce elastic weight consolidation and a
relation consistency loss to effectively preserve valuable content information
from the source domain, e.g., appearances. Through extensive experiments, we
demonstrate the efficacy of the proposed methods in ensuring sample diversity
in various scenarios of zero-shot GAN adaptation. We also conduct ablation
studies to validate the effect of each proposed component. Notably, our model
achieves a new state-of-the-art on zero-shot GAN adaptation in terms of both
diversity and quality. |
This paper proposes a novel zero-shot GAN adaptation framework that leverages semantic variations to enhance diversity in generated images of unseen target domains, guided solely by textual descriptions. |
Training GANs requires massive data, limiting their application in data-scarce domains. Zero-shot adaptation, reusing pre-trained GANs for new domains without additional data, offers a solution but often suffers from mode collapse (limited diversity) due to relying on a single textual description. |
The proposed method employs a two-stage approach: 1) **Semantic Variation Learning:** Learnable perturbations are applied to the target text embedding in CLIP space to discover diverse yet semantically consistent variations. 2) **Directional Moment Loss:** A novel loss function encourages the generator to align the distribution of image-updating directions with the augmented text directions, promoting diverse generation. |
Significantly improves diversity over existing zero-shot GAN adaptation methods, as demonstrated by higher intra-cluster LPIPS scores.
Achieves comparable performance to few-shot methods that use a small number of target domain images, highlighting its data efficiency.
Successfully preserves content information from the source domain while adapting to the target domain, ensuring realistic and high-quality generation. |
Achieving complete alignment between image and text directions might be challenging for domains with large semantic gaps.
The adaptation process currently relies on expert intervention for selecting source domain descriptions and optimal training iterations due to the lack of automatic quality assessment. |
generative adversarial networks, zero-shot learning, domain adaptation, clip, image generation |
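The directional moment loss in this report matches the first and second moments of the image and text direction distributions. The sketch below is a minimal illustration, assuming squared Euclidean distance between means and squared Frobenius distance between population covariances; the exact distances and weights used by the authors are not given here, and the 2D direction vectors are made up. It shows how the second-moment term penalizes collapsed image directions even when the means agree.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance(rows):
    """Population covariance matrix of a list of equal-length vectors."""
    mu = mean_vector(rows)
    n, dim = len(rows), len(mu)
    return [
        [sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / n for j in range(dim)]
        for i in range(dim)
    ]


def directional_moment_loss(image_dirs, text_dirs):
    """Squared mean gap plus squared Frobenius gap between covariances (assumed form)."""
    mu_i, mu_t = mean_vector(image_dirs), mean_vector(text_dirs)
    first = sum((a - b) ** 2 for a, b in zip(mu_i, mu_t))
    cov_i, cov_t = covariance(image_dirs), covariance(text_dirs)
    second = sum(
        (a - b) ** 2 for row_i, row_t in zip(cov_i, cov_t) for a, b in zip(row_i, row_t)
    )
    return first + second


# Hypothetical text directions (semantic variations) and two sets of image directions.
text_dirs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
shifted_dirs = [(2.0, 0.0), (1.0, 1.0), (2.0, 1.0)]   # same spread, wrong mean
collapsed_dirs = [(2.0 / 3.0, 2.0 / 3.0)] * 3          # right mean, no diversity

print(directional_moment_loss(text_dirs, text_dirs))
print(directional_moment_loss(shifted_dirs, text_dirs))
print(directional_moment_loss(collapsed_dirs, text_dirs))
```

A loss that matched only the mean direction would score the collapsed set as perfect; the covariance term is what rewards diversity in this formulation.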
2308.10524
Report |
Dataset Quantization |
Daquan Zhou, Kai Wang, Jianyang Gu, Xiangyu Peng, Dongze Lian, Yifan Zhang, Yang You, Jiashi Feng |
State-of-the-art deep neural networks are trained with large amounts
(millions or even billions) of data. The expensive computation and memory costs
make it difficult to train them on limited hardware resources, especially for
recent popular large language models (LLM) and computer vision models (CV).
Recent dataset distillation methods have thus been developed, aiming to
reduce the number of training samples by synthesizing small-scale datasets via
gradient matching. However, as the gradient calculation is coupled with the
specific network architecture, the synthesized dataset is biased and performs
poorly when used for training unseen architectures. To address these
limitations, we present dataset quantization (DQ), a new framework to compress
large-scale datasets into small subsets which can be used for training any
neural network architectures. Extensive experiments demonstrate that DQ is able
to generate condensed small datasets for training unseen network architectures
with state-of-the-art compression ratios for lossless model training. To the
best of our knowledge, DQ is the first method that can successfully distill
large-scale datasets such as ImageNet-1k with a state-of-the-art compression
ratio. Notably, with 60% data from ImageNet and 20% data from Alpaca's
instruction tuning data, the models can be trained with negligible or no
performance drop for both vision tasks (including classification, semantic
segmentation, and object detection) as well as language tasks (including
instruction tuning tasks such as BBH and DROP). |
This paper proposes Dataset Quantization (DQ), a novel framework to compress large-scale datasets into small subsets suitable for training unseen neural network architectures. |
Training state-of-the-art deep neural networks is computationally expensive due to the massive datasets involved. Existing dataset distillation methods have limitations in generalization and scalability, particularly for large datasets like ImageNet. |
DQ recursively divides the dataset into diverse bins based on submodular gains, ensuring representation and diversity. A compact subset is then created by uniformly sampling from these bins. Patch dropping and reconstruction using a pre-trained MAE model further reduce storage requirements. |
DQ achieves state-of-the-art compression ratios, training with negligible or no performance drop on ImageNet-1K with 60% of the data and on the Alpaca instruction dataset with 20% of the data.
The compressed datasets generated by DQ demonstrate excellent cross-architecture generalization, effectively training unseen models from ResNet, ViT, and MobileNetV2 families.
Models pre-trained on DQ-compressed ImageNet data perform competitively on downstream tasks like object detection (COCO) and semantic segmentation (ADE20K). |
The recursive sample selection process in DQ introduces extra computational overhead.
Future work includes exploring more efficient non-recursive selection strategies and extending DQ to other tasks like video understanding and AIGC. |
dataset compression, dataset distillation, coreset selection, cross-architecture generalization, downstream task transfer |
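The bin construction and uniform sampling steps of DQ can be sketched as follows. This is a simplified illustration, not the authors' code: the gain function is an assumed GraphCut-style form (distance to samples already in the current bin minus distance to the remaining pool), the 2D features are random placeholders for real embeddings, and the MAE-based patch dropping step is omitted entirely.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_bins(features, num_bins):
    """Recursively peel non-overlapping bins off the dataset by greedy gain maximization."""
    remaining = list(range(len(features)))
    bin_size = len(features) // num_bins
    bins = []
    for _ in range(num_bins):
        current = []
        while len(current) < bin_size and remaining:
            # Assumed gain: far from the current bin, close to the remaining pool.
            best = max(
                remaining,
                key=lambda k: sum(sq_dist(features[k], features[p]) for p in current)
                - sum(sq_dist(features[k], features[p]) for p in remaining),
            )
            current.append(best)
            remaining.remove(best)
        bins.append(current)
    if remaining:                      # leftovers when the split is uneven
        bins[-1].extend(remaining)
    return bins


def sample_from_bins(bins, ratio, seed=0):
    """Uniformly sample the same fraction of indices from every bin."""
    rng = random.Random(seed)
    subset = []
    for b in bins:
        k = max(1, round(ratio * len(b)))
        subset.extend(rng.sample(b, k))
    return sorted(subset)


rng = random.Random(0)
features = [(rng.random(), rng.random()) for _ in range(12)]  # placeholder embeddings
bins = build_bins(features, num_bins=3)
subset = sample_from_bins(bins, ratio=0.5)
print("bins:", bins)
print("selected subset:", subset)
```

Because each selection scans the whole remaining pool, this naive version is quadratic per pick, which is consistent with the report's note that recursive selection adds computational overhead.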
2308.10490
Report |
Texture Generation on 3D Meshes with Point-UV Diffusion |
Xin Yu, Peng Dai, Wenbo Li, Lan Ma, Zhengzhe Liu, Xiaojuan Qi |
In this work, we focus on synthesizing high-quality textures on 3D meshes. We
present Point-UV diffusion, a coarse-to-fine pipeline that marries the
denoising diffusion model with UV mapping to generate 3D consistent and
high-quality texture images in UV space. We start with introducing a point
diffusion model to synthesize low-frequency texture components with our
tailored style guidance to tackle the biased color distribution. The derived
coarse texture offers global consistency and serves as a condition for the
subsequent UV diffusion stage, aiding in regularizing the model to generate a
3D consistent UV texture image. Then, a UV diffusion model with hybrid
conditions is developed to enhance the texture fidelity in the 2D UV space. Our
method can process meshes of any genus, generating diversified,
geometry-compatible, and high-fidelity textures. Code is available at
https://cvmi-lab.github.io/Point-UV-Diffusion |
This paper introduces Point-UV Diffusion, a novel two-stage coarse-to-fine framework for generating high-quality and consistent textures on 3D meshes using diffusion models and UV mapping. |
Creating realistic textures on 3D surfaces is challenging due to the need for suitable representations that balance 3D consistency, high resolution, and geometric fidelity. |
The method uses a 3D point diffusion model in the coarse stage to colorize sampled surface points, guided by style information. These points are projected to UV space to generate a coarse texture. In the fine stage, a 2D UV diffusion model refines this texture with hybrid conditioning (coarse and smooth texture maps) to enhance detail and consistency. |
The approach generates high-quality textures with fine details while preserving geometric structures, outperforming existing methods in qualitative and quantitative (FID, KID) comparisons.
It handles meshes with arbitrary topology and is adaptable for conditional generation based on text prompts or single-view images.
Style guidance successfully addresses color bias in datasets, enhancing texture diversity. |
The method's performance relies heavily on the quality of UV mapping and may struggle with excessively fragmented UV maps.
Training is limited by the scale and diversity of 3D datasets, potentially hindering the generation of highly complex and realistic textures. |
texture generation, 3d meshes, diffusion models, uv mapping, deep learning |
2308.10273
Report |
Turning Waste into Wealth: Leveraging Low-Quality Samples for Enhancing Continuous Conditional Generative Adversarial Networks |
Xin Ding, Yongwei Wang, Zuheng Xu |
Continuous Conditional Generative Adversarial Networks (CcGANs) enable
generative modeling conditional on continuous scalar variables (termed
regression labels). However, they can produce subpar fake images due to limited
training data. Although Negative Data Augmentation (NDA) effectively enhances
unconditional and class-conditional GANs by introducing anomalies into real
training images, guiding the GANs away from low-quality outputs, its impact on
CcGANs is limited, as it fails to replicate negative samples that may occur
during the CcGAN sampling. We present a novel NDA approach called Dual-NDA
specifically tailored for CcGANs to address this problem. Dual-NDA employs two
types of negative samples: visually unrealistic images generated from a
pre-trained CcGAN and label-inconsistent images created by manipulating real
images' labels. Leveraging these negative samples, we introduce a novel
discriminator objective alongside a modified CcGAN training algorithm.
Empirical analysis on UTKFace and Steering Angle reveals that Dual-NDA
consistently enhances the visual fidelity and label consistency of fake images
generated by CcGANs, exhibiting a substantial performance gain over the vanilla
NDA. Moreover, by applying Dual-NDA, CcGANs demonstrate a remarkable
advancement beyond the capabilities of state-of-the-art conditional GANs and
diffusion models, establishing a new pinnacle of performance. Our codes can be
found at https://github.com/UBCDingXin/Dual-NDA. |
This paper proposes Dual-NDA, a novel Negative Data Augmentation strategy specifically designed for Continuous Conditional Generative Adversarial Networks (CcGANs) to enhance the quality and label consistency of generated images. |
CcGANs, while effective for generative modeling with continuous scalar conditions (regression labels), often struggle to produce high-quality fake images due to limitations like sparse training data. Dual-NDA aims to address this by guiding CcGANs away from generating low-quality outputs. |
Dual-NDA employs two types of negative samples: (1) **Label-Inconsistent Real Images:** Generated by dynamically mismatching image-label pairs during discriminator training. (2) **Visually Unrealistic Fake Images:** Obtained by filtering outputs from a pre-trained CcGAN generator based on Naturalness Image Quality Evaluator (NIQE) scores. A modified CcGAN training mechanism incorporating a new vicinal discriminator loss utilizes these negative samples. |
Dual-NDA consistently enhances the visual fidelity (NIQE) and label consistency (Label Score) of fake images generated by CcGANs, showing substantial gains over baseline CcGANs and vanilla NDA.
Dual-NDA-enhanced CcGANs outperform state-of-the-art class-conditional GANs (ReACGAN, ADCGAN) and diffusion models (ADM-G, CFG) on UTKFace and Steering Angle datasets.
Ablation studies confirm the individual contributions of Type I and Type II negative samples, and the robustness of Dual-NDA to hyperparameter variations. |
The current implementation of Dual-NDA relies on a pre-trained CcGAN generator for creating Type II negative samples. Exploring alternative generation mechanisms could be beneficial.
The paper focuses on image generation. Extending Dual-NDA to other data modalities like text or audio could be a potential research direction. |
generative adversarial networks, continuous conditional generative modeling, negative data augmentation, image generation, label consistency |
2308.10257
Report |
Make-It-4D: Synthesizing a Consistent Long-Term Dynamic Scene Video from a Single Image |
Liao Shen, Xingyi Li, Huiqiang Sun, Juewen Peng, Ke Xian, Zhiguo Cao, Guosheng Lin |
We study the problem of synthesizing a long-term dynamic video from only a
single image. This is challenging since it requires consistent visual content
movements given large camera motions. Existing methods either hallucinate
inconsistent perpetual views or struggle with long camera trajectories. To
address these issues, it is essential to estimate the underlying 4D (including
3D geometry and scene motion) and fill in the occluded regions. To this end, we
present Make-It-4D, a novel method that can generate a consistent long-term
dynamic video from a single image. On the one hand, we utilize layered depth
images (LDIs) to represent a scene, and they are then unprojected to form a
feature point cloud. To animate the visual content, the feature point cloud is
displaced based on the scene flow derived from motion estimation and the
corresponding camera pose. Such 4D representation enables our method to
maintain the global consistency of the generated dynamic video. On the other
hand, we fill in the occluded regions by using a pretrained diffusion model to
inpaint and outpaint the input image. This enables our method to work under
large camera motions. Benefiting from our design, our method can be
training-free which saves a significant amount of training time. Experimental
results demonstrate the effectiveness of our approach, which showcases
compelling rendering results. |
This paper proposes Make-It-4D, a training-free method for synthesizing consistent long-term dynamic videos from a single image, involving both large camera motions and dynamic object animations. |
Existing methods for generating dynamic scenes from single images struggle with either maintaining consistency over long camera trajectories or animating dynamic objects under large camera motions. |
Make-It-4D uses layered depth images (LDIs) for 3D scene representation and a pre-trained diffusion model for inpainting and outpainting. Scene animation is achieved using motion estimation to displace a feature point cloud, and the final video is rendered with a differentiable renderer. |
Make-It-4D outperforms state-of-the-art methods in terms of visual quality and consistency as demonstrated by quantitative and qualitative comparisons.
The method is generalizable to diverse in-the-wild scenes and different image resolutions without requiring training.
User studies confirm that Make-It-4D generates more realistic and immersive results compared to existing alternatives. |
The method may not effectively complement vertical scene information when the camera moves forward.
Inaccurate depth estimation, particularly incorrect layering, can impact the method's performance.
Future work will focus on addressing limitations in handling complex object movements and refining vertical scene completion. |
image animation, novel view synthesis, 3d scene representation, diffusion models, training-free methods |
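The step of unprojecting a layered depth image into a point cloud and displacing it with scene flow can be sketched as below. This is a geometry-only illustration under stated assumptions: a pinhole camera with hypothetical intrinsics (fx, fy, cx, cy), an LDI stored as a dictionary from pixel to a list of depths, and a constant per-point 3D flow applied linearly in time. The learned features, motion estimation, inpainting, and rendering of the actual method are not modeled.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Lift a pixel (u, v) at a given depth to a 3D point in camera coordinates."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def ldi_to_points(ldi, fx, fy, cx, cy):
    """Unproject every depth layer of every pixel in a layered depth image."""
    points = []
    for (u, v), depths in ldi.items():
        for d in depths:
            points.append(unproject(u, v, d, fx, fy, cx, cy))
    return points


def displace(points, flows, t):
    """Move each point along its 3D scene-flow vector for time t (linear motion assumed)."""
    return [
        tuple(p + t * f for p, f in zip(point, flow))
        for point, flow in zip(points, flows)
    ]


# Hypothetical intrinsics and a tiny LDI: one pixel has two layers (foreground + occluded).
FX, FY, CX, CY = 100.0, 100.0, 32.0, 24.0
ldi = {(32, 24): [2.0, 5.0], (42, 24): [2.0]}

points = ldi_to_points(ldi, FX, FY, CX, CY)
flows = [(0.1, 0.0, 0.0)] * len(points)   # made-up flow: drift along +x
moved = displace(points, flows, t=0.5)
print(points)
print(moved)
```

Keeping a second depth layer per pixel is what lets previously occluded content appear when the camera moves, which is why the LDI is unprojected layer by layer here.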
2308.10253
Report |
StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data |
Yanda Li, Chi Zhang, Gang Yu, Zhibin Wang, Bin Fu, Guosheng Lin, Chunhua Shen, Ling Chen, Yunchao Wei |
The remarkable multimodal capabilities demonstrated by OpenAI's GPT-4 have
sparked significant interest in the development of multimodal Large Language
Models (LLMs). A primary research objective of such models is to align visual
and textual modalities effectively while comprehending human instructions.
Current methodologies often rely on annotations derived from benchmark datasets
to construct image-dialogue datasets for training purposes, akin to instruction
tuning in LLMs. However, these datasets often exhibit domain bias, potentially
constraining the generative capabilities of the models. In an effort to
mitigate these limitations, we propose a novel data collection methodology that
synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities
of ChatGPT and text-to-image generative models to yield a diverse and
controllable dataset with varied image content. Additionally, datasets can be
arbitrarily scaled. This not only provides greater flexibility compared to
existing methodologies but also significantly enhances several model
capabilities. Our research includes comprehensive experiments conducted on
various datasets. The results emphasize substantial enhancements in more than
ten commonly assessed capabilities. Additionally, our model achieves
state-of-the-art results across multiple widely recognized multimodal
benchmarks. |
This paper introduces a novel data collection pipeline that uses generative AI models (ChatGPT and StableDiffusion) to synthesize image-dialogue pairs for training multimodal Large Language Models (LLMs). |
Existing methods for training multimodal LLMs often rely on benchmark datasets with limitations such as domain bias and lack of diversity, which restricts the models' capabilities. This new approach offers greater control and flexibility in data generation. |
The pipeline leverages ChatGPT to generate StableDiffusion image prompts and corresponding dialogues tailored to specific LLM capabilities. These prompts are then used to create images, forming image-dialogue pairs for training. This approach enables the creation of diverse datasets, including multi-turn dialogues and multi-image reasoning examples. |
The proposed method enhances performance across various LLM capabilities, including multi-image reasoning and understanding humor in images.
The model trained with the synthesized data outperforms baseline models and achieves state-of-the-art results on multiple multimodal benchmarks.
Qualitative analysis shows the model’s improved ability to follow instructions and generate more accurate and relevant responses compared to baseline models. |
The current pipeline faces limitations in generating text-rich images and tables due to constraints in text-to-image generation models.
Future work aims to incorporate more advanced generative models to further enhance model abilities in areas like spatial comprehension and fine-grained recognition. |
multimodal large language models, visual instruction tuning, data augmentation, generative ai, image-dialogue generation |
2308.10185
Report |
ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights |
Weixian Lei, Yixiao Ge, Jianfeng Zhang, Dylan Sun, Kun Yi, Ying Shan, Mike Zheng Shou |
Despite the success of CLIP-based training recipes in vision-language models,
their scalability to more modalities (e.g., 3D, audio, etc.) is limited by the
need for large-scale data, which is expensive or even inapplicable for rare modalities.
In this paper, we present ViT-Lens that facilitates efficient omni-modal
representation learning by perceiving novel modalities with a pretrained ViT
and aligning to a pre-defined space. Specifically, the modality-specific lens
is tuned to project multimodal signals to the shared embedding space, which are
then processed by a strong ViT that carries pre-trained image knowledge. The
encoded multimodal representations are optimized toward aligning with the
modal-independent space, pre-defined by off-the-shelf foundation models. A
well-trained lens with a ViT backbone has the potential to serve as one of
these foundation models, supervising the learning of subsequent modalities.
ViT-Lens provides a unified solution for representation learning of increasing
modalities with two appealing benefits: (i) Exploiting the pretrained ViT
across tasks and domains effectively with efficient data regime; (ii) Emergent
downstream capabilities of novel modalities are demonstrated due to the
modality alignment space. We evaluate ViT-Lens in the context of 3D as an
initial verification. In zero-shot 3D classification, ViT-Lens achieves
substantial improvements over previous state-of-the-art, showing 52.0% accuracy
on Objaverse-LVIS, 87.4% on ModelNet40, and 60.6% on ScanObjectNN. Furthermore,
we enable zero-shot 3D question-answering by simply integrating the trained 3D
lens into the InstructBLIP model without any adaptation. We will release the
results of ViT-Lens on more modalities in the near future. |
ViT-Lens is a novel method for omni-modal representation learning that leverages a pre-trained vision transformer (ViT) to understand diverse modalities beyond images by introducing modality-specific learnable modules. |
Existing multi-modal learning methods require large-scale datasets for each new modality, which is impractical and resource-intensive. ViT-Lens addresses this by efficiently adapting the knowledge of a pre-trained ViT to new modalities. |
ViT-Lens maps input data from a new modality to the input space of a frozen pre-trained ViT using a modality embedding module and a Perceiver. The encoded representations are then aligned with features from anchor data (images, text, or image-text) from pre-trained foundation models like CLIP via contrastive learning. |
ViT-Lens achieves state-of-the-art zero-shot 3D classification accuracy on ModelNet40, ScanObjectNN, and Objaverse-LVIS, outperforming previous methods by significant margins.
It demonstrates strong generalization and scalability by effectively leveraging the knowledge of pre-trained ViTs and scaling well with larger datasets and model sizes.
By integrating the trained 3D encoder into an MLLM like InstructBLIP, \methodname enables the LLM to understand and interact with 3D data in a zero-shot manner without requiring specific instruction tuning. |
The current implementation focuses on 3D shape understanding as an initial verification.
Future work involves scaling up the training to incorporate more modalities and exploring additional emergent abilities. |
multimodal learning, representation learning, vision transformer, zero-shot learning, 3d shape understanding |
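The contrastive alignment step described for ViT-Lens can be illustrated with a minimal InfoNCE-style loss. This is a sketch under stated assumptions, not the authors' implementation: the function names, the temperature of 0.07, the one-directional (new modality to anchor) form of the loss, and the toy 2-D features are all hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(modality_feats, anchor_feats, temperature=0.07):
    """Mean cross-entropy of matching each new-modality feature to its own anchor.

    Row i of modality_feats is assumed to be paired with row i of anchor_feats;
    all other anchors in the batch act as negatives.
    """
    m = [l2_normalize(v) for v in modality_feats]
    a = [l2_normalize(v) for v in anchor_feats]
    total = 0.0
    for i, mi in enumerate(m):
        logits = [sum(x * y for x, y in zip(mi, aj)) / temperature for aj in a]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum - logits[i]
    return total / len(m)

# Toy example: frozen anchor features and two candidate sets of new-modality features.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each row points towards its own anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each row points towards the wrong anchor
loss_aligned = info_nce(aligned, anchors)
loss_swapped = info_nce(swapped, anchors)
```

In this toy setting the loss is low when each new-modality feature sits close to its paired anchor and high when the pairing is swapped, which is the signal a trainable lens would be optimized against while the anchor encoder stays frozen.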
2308.10174
Report |
Neural Interactive Keypoint Detection |
Jie Yang, Ailing Zeng, Feng Li, Shilong Liu, Ruimao Zhang, Lei Zhang |
This work proposes an end-to-end neural interactive keypoint detection
framework named Click-Pose, which can reduce the labeling cost of 2D keypoint
annotation by more than 10 times compared with manual-only annotation.
Click-Pose explores how user feedback can cooperate with a neural keypoint
detector to correct the predicted keypoints in an interactive way for a faster
and more effective annotation process. Specifically, we design the pose error
modeling strategy that inputs the ground truth pose combined with four typical
pose errors into the decoder and trains the model to reconstruct the correct
poses, which enhances the self-correction ability of the model. Then, we attach
an interactive human-feedback loop that allows receiving users' clicks to
correct one or several predicted keypoints and iteratively utilizes the decoder
to update all other keypoints with a minimum number of clicks (NoC) for
efficient annotation. We validate Click-Pose in in-domain, out-of-domain
scenes, and a new task of keypoint adaptation. For annotation, Click-Pose needs
only 1.97 and 6.45 NoC@95 (at precision 95%) on COCO and Human-Art, which is
31.4% and 36.3% less effort than the SOTA model (ViTPose) with manual correction,
respectively. Besides, without user clicks, Click-Pose surpasses the previous
end-to-end model by 1.4 AP on COCO and 3.0 AP on Human-Art. The code is
available at https://github.com/IDEA-Research/Click-Pose. |
This paper presents Click-Pose, an end-to-end neural interactive keypoint detection framework that significantly reduces 2D keypoint annotation costs. |
Manual keypoint annotation is time-consuming, labor-intensive, and error-prone. Existing model-assisted methods suffer from model bias and performance bottlenecks, especially in out-of-domain scenarios. |
Click-Pose builds upon ED-Pose and introduces: (1) Pose Error Modeling: enhances decoder robustness by training it to reconstruct accurate poses from erroneous ones. (2) Interactive Human-Feedback Loop: incorporates user clicks to correct keypoints and iteratively refines predictions. |
Reduces annotation time by over 10x compared to manual-only annotation and by 5x compared to the SOTA model with manual correction.
Requires 31.4% and 36.3% fewer clicks than ViTPose for 95% precision on COCO and Human-Art, respectively.
Achieves state-of-the-art performance for end-to-end keypoint detection, outperforming ED-Pose by 1.4 AP on COCO and 3.0 AP on Human-Art. |
Current focus is on 2D body keypoints; extending to whole-body (dense) and 3D annotation is crucial.
Exploring multi-task interactive annotation, where correcting one task influences others (e.g., pose, parsing, text). |
keypoint detection, human-in-the-loop, interactive annotation, pose error modeling, human-feedback loop |
2308.10156
Report |
SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-form Layout-to-Image Generation |
Chengyou Jia, Minnan Luo, Zhuohang Dang, Guang Dai, Xiaojun Chang, Mengmeng Wang, Jingdong Wang |
Despite significant progress in Text-to-Image (T2I) generative models, even
lengthy and complex text descriptions still struggle to convey detailed
controls. In contrast, Layout-to-Image (L2I) generation, aiming to generate
realistic and complex scene images from user-specified layouts, has risen to
prominence. However, existing methods transform layout information into tokens
or RGB images for conditional control in the generative process, leading to
insufficient spatial and semantic controllability of individual instances. To
address these limitations, we propose a novel Spatial-Semantic Map Guided
(SSMG) diffusion model that adopts the feature map, derived from the layout, as
guidance. Owing to rich spatial and semantic information encapsulated in
well-designed feature maps, SSMG achieves superior generation quality with
sufficient spatial and semantic controllability compared to previous works.
Additionally, we propose the Relation-Sensitive Attention (RSA) and
Location-Sensitive Attention (LSA) mechanisms. The former aims to model the
relationships among multiple objects within scenes while the latter is designed
to heighten the model's sensitivity to the spatial information embedded in the
guidance. Extensive experiments demonstrate that SSMG achieves highly promising
results, setting a new state-of-the-art across a range of metrics encompassing
fidelity, diversity, and controllability. |
This paper proposes SSMG, a novel Spatial-Semantic Map Guided diffusion model for Layout-to-Image (L2I) generation, which utilizes a feature map derived from layout as guidance to achieve superior generation quality and controllability over individual instances. |
Existing L2I methods, whether token-guided or image-guided, struggle to effectively control both the spatial arrangements and semantic details of generated instances. This new method leverages the richness of feature maps for enhanced control. |
SSMG initializes a spatial-semantic map from layout and text descriptions, enhances it with Relation-Sensitive Attention (RSA) to model relationships among instances, and integrates it into a conditional diffusion model generation process via Location-Sensitive Attention (LSA). |
SSMG achieves state-of-the-art results on benchmark datasets, surpassing previous methods in fidelity, diversity, and controllability metrics.
SSMG demonstrates superior spatial controllability, evidenced by a significant improvement in YOLO scores.
The method supports free-form textual descriptions and diverse layout representations, going beyond bounding boxes and enhancing its flexibility. |
The paper acknowledges potential societal impacts and ethical concerns regarding the misuse of the model for generating harmful content.
Future work can explore further applications of SSMG in other structured image generation tasks. |
layout-to-image generation, diffusion models, spatial control, semantic control, free-form generation |
2308.10122
Report |
HollowNeRF: Pruning Hashgrid-Based NeRFs with Trainable Collision Mitigation |
Xiufeng Xie, Riccardo Gherardi, Zhihong Pan, Stephen Huang |
Neural radiance fields (NeRF) have garnered significant attention, with
recent works such as Instant-NGP accelerating NeRF training and evaluation
through a combination of hashgrid-based positional encoding and neural
networks. However, effectively leveraging the spatial sparsity of 3D scenes
remains a challenge. To cull away unnecessary regions of the feature grid,
existing solutions rely on prior knowledge of object shape or periodically
estimate object shape during training by repeated model evaluations, which are
costly and wasteful.
To address this issue, we propose HollowNeRF, a novel compression solution
for hashgrid-based NeRF which automatically sparsifies the feature grid during
the training phase. Instead of directly compressing dense features, HollowNeRF
trains a coarse 3D saliency mask that guides efficient feature pruning, and
employs an alternating direction method of multipliers (ADMM) pruner to
sparsify the 3D saliency mask during training. By exploiting the sparsity in
the 3D scene to redistribute hash collisions, HollowNeRF improves rendering
quality while using a fraction of the parameters of comparable state-of-the-art
solutions, leading to a better cost-accuracy trade-off. Our method delivers
comparable rendering quality to Instant-NGP, while utilizing just 31% of the
parameters. In addition, our solution can achieve a PSNR accuracy gain of up to
1dB using only 56% of the parameters. |
This paper presents HollowNeRF, a novel NeRF compression solution that uses trainable hash collision mitigation to improve rendering accuracy with fewer parameters. |
Effectively leveraging spatial sparsity in 3D scenes for NeRF remains a challenge. |
HollowNeRF introduces a trainable 3D saliency grid to guide feature pruning, a zero-skipping gate to enhance MLP sparsity, and an ADMM pruner to enforce sparsity in the saliency grid. |
HollowNeRF achieves higher PSNR and lower LPIPS than Instant-NGP with fewer parameters.
Using only 31% of the parameters, HollowNeRF delivers comparable rendering quality to Instant-NGP.
A 1dB PSNR accuracy gain is achieved with only 56% of the parameters compared to Instant-NGP. |
Compression gains rely on scene sparsity; performance may regress for non-sparse scenes.
Current implementation, like Instant-NGP, faces challenges in modeling reflective surfaces. |
neural radiance fields, nerf compression, hash collision mitigation, 3d saliency grid, admm pruner |
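The role of an ADMM pruner in driving saliency values to exact zeros can be illustrated on a toy problem. This is a sketch under stated assumptions rather than HollowNeRF's pruner: the quadratic term stands in for the rendering loss, the L1 penalty is an assumed sparsity objective, and `admm_sparsify`, `lam`, and `rho` are hypothetical names.

```python
def soft_threshold(value, threshold):
    """Proximal operator of the L1 norm: shrink towards zero, clip small values to 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def admm_sparsify(target, lam=0.5, rho=1.0, iters=200):
    """Solve min_x 0.5*||x - target||^2 + lam*||z||_1 subject to x = z with ADMM.

    x plays the role of the trainable mask, z is its sparse copy, and u is the
    scaled dual variable that pulls the two together.
    """
    n = len(target)
    z = [0.0] * n
    u = [0.0] * n
    for _ in range(iters):
        # x-update: closed form for the quadratic stand-in loss.
        x = [(target[i] + rho * (z[i] - u[i])) / (1.0 + rho) for i in range(n)]
        # z-update: soft-thresholding enforces sparsity.
        z = [soft_threshold(x[i] + u[i], lam / rho) for i in range(n)]
        # dual update: accumulate the remaining disagreement between x and z.
        u = [u[i] + x[i] - z[i] for i in range(n)]
    return z

saliency = [0.9, 0.05, 0.3, 1.4]          # toy saliency values
sparse_mask = admm_sparsify(saliency)      # small entries become exactly 0.0
```

For this toy objective the iterates converge to the soft-thresholded target, so entries below `lam` are pruned to exactly zero while larger entries are only shrunk. In a real pipeline the x-update would be a gradient step on the rendering loss instead of a closed form.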
2308.10110
Report |
Robust Mixture-of-Expert Training for Convolutional Neural Networks |
Yihua Zhang, Ruisi Cai, Tianlong Chen, Guanhua Zhang, Huan Zhang, Pin-Yu Chen, Shiyu Chang, Zhangyang Wang, Sijia Liu |
Sparsely-gated Mixture of Expert (MoE), an emerging deep model architecture,
has shown great promise for enabling high-accuracy and ultra-efficient
model inference. Despite the growing popularity of MoE, little work has
investigated its potential to advance convolutional neural networks (CNNs),
especially in terms of adversarial robustness. Since the lack of robustness
has become one of the main hurdles for CNNs, in this paper we ask: How to
adversarially robustify a CNN-based MoE model? Can we robustly train it like an
ordinary CNN model? Our pilot study shows that the conventional adversarial
training (AT) mechanism (developed for vanilla CNNs) no longer remains
effective to robustify an MoE-CNN. To better understand this phenomenon, we
dissect the robustness of an MoE-CNN into two dimensions: Robustness of routers
(i.e., gating functions to select data-specific experts) and robustness of
experts (i.e., the router-guided pathways defined by the subnetworks of the
backbone CNN). Our analyses show that routers and experts are hard to adapt to
each other in the vanilla AT. Thus, we propose a new router-expert alternating
Adversarial training framework for MoE, termed AdvMoE. The effectiveness of our
proposal is justified across 4 commonly-used CNN model architectures over 4
benchmark datasets. We find that AdvMoE achieves 1% ~ 4% adversarial robustness
improvement over the original dense CNN, and retains the efficiency of
sparsely-gated MoE, leading to more than 50% inference cost reduction. Code
is available at https://github.com/OPTML-Group/Robust-MoE-CNN. |
This paper proposes AdvMoE, a novel adversarial training framework for Mixture-of-Expert based Convolutional Neural Networks (MoE-CNNs). |
Conventional adversarial training methods, effective for standard CNNs, fail to robustify MoE-CNNs due to the complex interplay between routers (expert selectors) and experts. |
AdvMoE employs a bi-level optimization approach to alternately train routers and experts, enabling them to adapt to each other and collaboratively enhance robustness. |
AdvMoE significantly improves adversarial robustness over baseline methods, achieving 1% to 5% higher robust accuracy.
AdvMoE outperforms adversarial training on dense CNNs while maintaining over 50% inference cost reduction, demonstrating the effectiveness of combining robustness and MoE efficiency.
Analysis reveals that AdvMoE promotes better router utility, generating more diverse and robust expert pathways compared to conventional methods. |
AdvMoE requires twice the computational cost compared to vanilla adversarial training due to its alternating optimization scheme.
Further investigation into the trade-off between the number of experts, model scale, and training efficiency is needed. |
adversarial robustness, mixture of experts, convolutional neural networks, bi-level optimization, efficient deep learning |
2308.10079
Report |
MeDM: Mediating Image Diffusion Models for Video-to-Video Translation with Temporal Correspondence Guidance |
Ernie Chu, Tzuhsuan Huang, Shuo-Yen Lin, Jun-Cheng Chen |
This study introduces an efficient and effective method, MeDM, that utilizes
pre-trained image Diffusion Models for video-to-video translation with
consistent temporal flow. The proposed framework can render videos from scene
position information, such as a normal G-buffer, or perform text-guided editing
on videos captured in real-world scenarios. We employ explicit optical flows to
construct a practical coding that enforces physical constraints on generated
frames and mediates independent frame-wise scores. By leveraging this coding,
maintaining temporal consistency in the generated videos can be framed as an
optimization problem with a closed-form solution. To ensure compatibility with
Stable Diffusion, we also suggest a workaround for modifying observation-space
scores in latent Diffusion Models. Notably, MeDM does not require fine-tuning
or test-time optimization of the Diffusion Models. Through extensive
qualitative, quantitative, and subjective experiments on various benchmarks,
the study demonstrates the effectiveness and superiority of the proposed
approach. Our project page can be found at https://medm2023.github.io |
The paper presents MeDM, a method that uses pre-trained image Diffusion Models and optical flow for temporally consistent video-to-video translation. |
Generating videos or performing video-to-video translation with temporal consistency is challenging. Existing methods suffer from flickering or are computationally expensive. This method addresses these issues by efficiently leveraging pre-trained image diffusion models for high-quality video generation. |
The method uses optical flow to establish pixel correspondence across frames, creating a global pixel repository. This allows for harmonizing independently generated frames by minimizing temporal inconsistency. A workaround is also proposed to make it compatible with latent Diffusion Models. |
MeDM generates high-quality, temporally consistent videos from 3D assets, outperforming baselines on MPI Sintel and Virtual KITTI 2.
It excels in text-guided video editing, successfully combining conflicting concepts into realistic videos on the DAVIS 2016 dataset.
MeDM achieves effective video anonymization while preserving video content, outperforming DeepPrivacy in realism and identity concealment on CelebV-HQ videos. |
The method currently relies on discretized optical flow for adjacent frames, which may limit its ability to handle occlusions and large motions.
The method does not explicitly address structural changes, which may lead to misalignment when objects undergo significant deformation. |
video generation, video-to-video translation, diffusion models, optical flow, temporal consistency |
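The idea of harmonizing independently generated frames through flow-based pixel correspondence can be sketched as follows. This is an illustration under stated assumptions, not MeDM's actual coding: it assumes a squared-error inconsistency measure, for which the closed-form minimizer is the mean over each group of corresponding pixels, and it represents frames as flat lists of scalar values. The `harmonize` function and the `links` format are hypothetical.

```python
def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def harmonize(frames, links):
    """Replace every pixel by the mean of all pixels linked to it by optical flow.

    frames: list of frames, each a flat list of pixel values.
    links:  list of ((frame, pixel), (frame, pixel)) correspondences.
    """
    # Flatten all pixels into one global repository.
    index, values = {}, []
    for f, frame in enumerate(frames):
        for p, v in enumerate(frame):
            index[(f, p)] = len(values)
            values.append(v)
    # Merge pixels that the flow says depict the same scene point.
    parent = list(range(len(values)))
    for a, b in links:
        ra, rb = find(parent, index[a]), find(parent, index[b])
        if ra != rb:
            parent[rb] = ra
    # Closed-form least-squares solution per group: the group mean.
    sums, counts = {}, {}
    for i, v in enumerate(values):
        r = find(parent, i)
        sums[r] = sums.get(r, 0.0) + v
        counts[r] = counts.get(r, 0) + 1
    result = []
    for f, frame in enumerate(frames):
        roots = [find(parent, index[(f, p)]) for p in range(len(frame))]
        result.append([sums[r] / counts[r] for r in roots])
    return result

# Pixel 0 of frame 0, pixel 0 of frame 1, and pixel 1 of frame 2 track one scene point.
frames = [[1.0, 4.0], [3.0, 8.0], [6.0, 5.0]]
links = [((0, 0), (1, 0)), ((1, 0), (2, 1))]
harmonized = harmonize(frames, links)
```

Linked pixels end up with one shared value across frames, while pixels without correspondences are left unchanged.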
2308.10040
Report |
ControlCom: Controllable Image Composition using Diffusion Model |
Bo Zhang, Yuxuan Duan, Jun Lan, Yan Hong, Huijia Zhu, Weiqiang Wang, Li Niu |
Image composition aims to synthesize a realistic composite image from a
pair of foreground and background images. Recently, generative composition
methods have been built on large pretrained diffusion models to generate composite
images, considering their great potential in image generation. However, they
suffer from lack of controllability on foreground attributes and poor
preservation of foreground identity. To address these challenges, we propose a
controllable image composition method that unifies four tasks in one diffusion
model: image blending, image harmonization, view synthesis, and generative
composition. Meanwhile, we design a self-supervised training framework coupled
with a tailored pipeline of training data preparation. Moreover, we propose a
local enhancement module to enhance the foreground details in the diffusion
model, improving the foreground fidelity of composite images. The proposed
method is evaluated on both a public benchmark and real-world data,
demonstrating that it can generate more faithful and controllable
composite images than existing approaches. The code and model will be available
at https://github.com/bcmi/ControlCom-Image-Composition. |
This paper proposes ControlCom, a controllable image composition method that unifies image blending, image harmonization, view synthesis, and generative composition in one diffusion model. |
Existing generative composition methods built on large pretrained diffusion models offer little control over foreground attributes and preserve foreground identity poorly. |
The authors train the model with a self-supervised framework and a tailored training data preparation pipeline, and add a local enhancement module that improves foreground detail; the supplementary material adds ablation studies, qualitative comparisons with baselines, a user study, and failure-case analysis. |
The proposed method with indicator (1,1) achieves superior performance in generating high-quality composite images with high fidelity compared to baseline methods.
User study results demonstrate that the proposed method outperforms existing approaches in image blending, shows comparable performance in image harmonization, and exhibits advantages in generative composition.
Ablation studies validate the contribution of each model component to the final performance, particularly the global fusion module, local enhancement module, and training data augmentation. |
The model faces challenges in synthesizing novel views for foreground objects when the input view and target view have minimal overlap.
Low-quality input images, such as blurred or dim foregrounds, can lead to the generation of unnatural composite images with artifacts. |
image composition, diffusion model, controllable generation, data augmentation, user study |
2308.10001
Report |
AltNeRF: Learning Robust Neural Radiance Field via Alternating Depth-Pose Optimization |
Kun Wang, Zhiqiang Yan, Huang Tian, Zhenyu Zhang, Xiang Li, Jun Li, Jian Yang |
Neural Radiance Fields (NeRF) have shown promise in generating realistic
novel views from sparse scene images. However, existing NeRF approaches often
encounter challenges due to the lack of explicit 3D supervision and imprecise
camera poses, resulting in suboptimal outcomes. To tackle these issues, we
propose AltNeRF -- a novel framework designed to create resilient NeRF
representations using self-supervised monocular depth estimation (SMDE) from
monocular videos, without relying on known camera poses. SMDE in AltNeRF
masterfully learns depth and pose priors to regulate NeRF training. The depth
prior enriches NeRF's capacity for precise scene geometry depiction, while the
pose prior provides a robust starting point for subsequent pose refinement.
Moreover, we introduce an alternating algorithm that integrates NeRF
outputs into SMDE through a consistency-driven mechanism, thus enhancing the
integrity of depth priors. This alternation empowers AltNeRF to progressively
refine NeRF representations, yielding the synthesis of realistic novel views.
Extensive experiments showcase the compelling capabilities of AltNeRF in
generating high-fidelity and robust novel views that closely resemble reality. |
The paper proposes AltNeRF, a novel framework that leverages self-supervised monocular depth estimation (SMDE) to generate high-fidelity neural radiance fields from monocular videos, addressing the challenges of shape ambiguity and imprecise camera poses in NeRF creation. |
Existing NeRF approaches often struggle with suboptimal outcomes due to the lack of explicit 3D supervision and reliance on accurate camera poses, leading to inaccurate novel view synthesis and distorted scene geometry. AltNeRF addresses these limitations by introducing depth-pose priors learned from monocular videos. |
AltNeRF employs an alternating algorithm with two modules: Scene Prior Module (SPM) pretrained on a large dataset and fine-tuned on target video data to provide depth and pose priors, and Scene Representation Module (SRM) which utilizes these priors to learn 3D scene representation and refine camera poses. |
AltNeRF outperforms existing NeRF methods on novel view synthesis tasks across LLFF, CO3D, and Captures datasets, achieving higher PSNR, SSIM, and lower LPIPS values.
It demonstrates superior geometry reconstruction ability on ScanNet, achieving significant improvements in depth estimation metrics (Abs Rel, Sq Rel, RMSE, etc.) compared to NeRF, DS-NeRF, and NerfingMVS.
AltNeRF effectively estimates camera poses from monocular videos, even in challenging scenarios with complex camera motions, surpassing the performance of BARF and NoPe-NeRF. |
The performance of AltNeRF can be limited by the accuracy of the initial depth prior provided by SMDE, particularly in challenging scenes with textureless or view-limited regions.
The alternating optimization process, while effective, can be computationally expensive, and exploring methods to improve its efficiency could be a potential area for future work. |
neural radiance fields, nerf, self-supervised monocular depth estimation, novel view synthesis, camera pose estimation |
2308.09991
Report |
AltDiffusion: A Multilingual Text-to-Image Diffusion Model |
Fulong Ye, Guang Liu, Xinya Wu, Ledell Wu |
Large Text-to-Image (T2I) diffusion models have shown a remarkable capability
to produce photorealistic and diverse images based on text inputs. However,
existing works only support limited language input, e.g., English, Chinese, and
Japanese, leaving users beyond these languages underserved and blocking the
global expansion of T2I models. Therefore, this paper presents AltDiffusion, a
novel multilingual T2I diffusion model that supports eighteen different
languages. Specifically, we first train a multilingual text encoder via
knowledge distillation. Then we plug it into a pretrained English-only
diffusion model and train the model with a two-stage schema, consisting of a
concept alignment stage and a quality improvement stage on a large-scale
multilingual dataset, to enhance its multilingual capability. Furthermore, we
introduce a new benchmark, which includes the Multilingual-General-18 (MG-18) and
Multilingual-Cultural-18 (MC-18) datasets, to evaluate the capabilities of T2I
diffusion models for generating high-quality images and capturing
culture-specific concepts in different languages. Experimental results on both
MG-18 and MC-18 demonstrate that AltDiffusion outperforms current
state-of-the-art T2I models, e.g., Stable Diffusion in multilingual
understanding, especially with respect to culture-specific concepts, while
still having comparable capability for generating high-quality images. All
source code and checkpoints can be found at
https://github.com/superhero-7/AltDiffuson. |
This paper introduces AltDiffusion, a novel multilingual Text-to-Image diffusion model that supports eighteen languages. |
Existing Text-to-Image models have limited language support, hindering global accessibility and introducing translation errors for non-English users. |
The authors train a multilingual text encoder via knowledge distillation and integrate it into a pre-trained English diffusion model. A two-stage training schema (concept alignment and quality improvement) is employed on a large-scale multilingual dataset. |
AltDiffusion is the first multilingual T2I model supporting eighteen languages.
AltDiffusion outperforms translation-based Stable Diffusion and other multilingual diffusion models in multilingual understanding and image generation quality.
AltDiffusion exhibits strong compatibility with downstream T2I tools such as ControlNet and LoRA, and supports mixed language inputs. |
The current version of AltDiffusion does not support all languages.
Future work will focus on expanding language support and exploring alternative training approaches. |
text-to-image, diffusion models, multilingual, culture-specific concepts, knowledge distillation |
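The knowledge-distillation step for the multilingual text encoder can be illustrated with a toy student that is trained to reproduce a teacher's embedding. This is a minimal sketch, not AltDiffusion's training code: the linear student, the mean-squared-error objective, the learning rate, and the toy vectors are all assumptions made for illustration.

```python
def mse(a, b):
    """Mean squared error between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def student_forward(weights, x):
    """Hypothetical linear student encoder: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def distill_step(weights, x, teacher_out, lr=0.1):
    """One gradient-descent step on the MSE between student and teacher outputs."""
    out = student_forward(weights, x)
    n = len(out)
    updated = []
    for j, row in enumerate(weights):
        grad_out = 2.0 * (out[j] - teacher_out[j]) / n   # d(MSE)/d(out_j)
        updated.append([w - lr * grad_out * xi for w, xi in zip(row, x)])
    return updated

# Toy data: features of a non-English sentence and the frozen teacher's
# embedding of its English translation (both hypothetical).
sentence_feats = [1.0, 2.0]
teacher_embedding = [0.5, -1.0]
weights = [[0.0, 0.0], [0.0, 0.0]]

loss_before = mse(student_forward(weights, sentence_feats), teacher_embedding)
for _ in range(10):
    weights = distill_step(weights, sentence_feats, teacher_embedding)
loss_after = mse(student_forward(weights, sentence_feats), teacher_embedding)
```

Once the student's embeddings match the teacher's space, the student can be plugged into a model that was trained against the teacher, which is the property the plug-in step relies on.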
2308.09951
Report |
Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos |
Rui Qian, Shuangrui Ding, Xian Liu, Dahua Lin |
Self-supervised methods have shown remarkable progress in learning high-level
semantics and low-level temporal correspondence. Building on these results, we
take one step further and explore the possibility of integrating these two
features to enhance object-centric representations. Our preliminary experiments
indicate that query slot attention can extract different semantic components
from the RGB feature map, while random sampling based slot attention can
exploit temporal correspondence cues between frames to assist instance
identification. Motivated by this, we propose a novel semantic-aware masked
slot attention on top of the fused semantic features and correspondence maps.
It comprises two slot attention stages with a set of shared learnable Gaussian
distributions. In the first stage, we use the mean vectors as slot
initialization to decompose potential semantics and generate semantic
segmentation masks through iterative attention. In the second stage, for each
semantics, we randomly sample slots from the corresponding Gaussian
distribution and perform masked feature aggregation within the semantic area to
exploit temporal correspondence patterns for instance identification. We adopt
semantic- and instance-level temporal consistency as self-supervision to
encourage temporally coherent object-centric representations. Our model
effectively identifies multiple object instances with semantic structure,
reaching promising results on unsupervised video object discovery. Furthermore,
we achieve state-of-the-art performance on dense label propagation tasks,
demonstrating the potential for object-centric analysis. The code is released
at https://github.com/shvdiwnkozbw/SMTC. |
This paper proposes SMTC, a self-supervised architecture that leverages semantic and temporal correspondence cues to learn object-centric representations in videos. |
Humans rely on both semantic understanding and temporal correspondence for object-centric analysis. Most existing computational models only focus on one of these aspects, limiting their ability to represent objects effectively. |
The model uses a two-stage semantic-aware masked slot attention mechanism. First, it decomposes scenes into semantic components. Second, it identifies individual instances within each semantic component by leveraging temporal correspondence cues. |
SMTC achieves promising results on unsupervised object discovery in both single and multiple object scenarios.
It reaches state-of-the-art performance on label propagation tasks including semi-supervised video object segmentation, pose tracking, and human part tracking.
Ablation studies validate the importance of both semantic and temporal correspondence cues, as well as the effectiveness of the proposed two-stage slot attention design. |
The model faces challenges in generating precise boundaries for small objects due to the lack of pixel-level annotation.
Incorporating multi-scale feature pyramids for better dense perception is left for future work. |
self-supervised learning, object-centric representation, video understanding, temporal correspondence, slot attention |
2308.09939
Report |
Understanding Self-attention Mechanism via Dynamical System Perspective |
Zhongzhan Huang, Mingfu Liang, Jinghui Qin, Shanshan Zhong, Liang Lin |
The self-attention mechanism (SAM) is widely used in various fields of
artificial intelligence and has successfully boosted the performance of
different models. However, current explanations of this mechanism are mainly
based on intuition and experience, and direct modeling of how the SAM helps
performance is still lacking. To address this issue, in this paper, based
on the dynamical system perspective of the residual neural network, we first
show that the intrinsic stiffness phenomenon (SP) in the high-precision
solution of ordinary differential equations (ODEs) also widely exists in
high-performance neural networks (NN). Thus the ability of NN to measure SP at
the feature level is necessary to obtain high performance and is an important
factor in the difficulty of training NN. Similar to the adaptive step-size
method which is effective in solving stiff ODEs, we show that the SAM is also a
stiffness-aware step size adaptor that can enhance the model's representational
ability to measure intrinsic SP by refining the estimation of stiffness
information and generating adaptive attention values, which provides a new
understanding about why and how the SAM can benefit the model performance. This
novel perspective can also explain the lottery ticket hypothesis in SAM, design
new quantitative metrics of representational ability, and inspire a new
theory-inspired approach, StepNet. Extensive experiments on several popular
benchmarks demonstrate that StepNet can extract fine-grained stiffness
information and measure SP accurately, leading to significant improvements in
various visual tasks. |
This paper proposes a novel understanding of the self-attention mechanism (SAM) by connecting it to the numerical solution of stiff ordinary differential equations (ODEs). It argues that the SAM acts as a stiffness-aware step size adaptor that refines stiffness information and generates adaptive attention values to better measure the intrinsic stiffness phenomenon in neural networks. |
Current explanations of the SAM are largely intuitive and lack direct modeling of its performance impact. This work aims to establish a clearer relationship between the SAM and model performance by analyzing it through the lens of dynamical systems. |
The authors define the stiffness phenomenon (SP) in neural networks at the feature level and introduce the concept of a ground truth trajectory. They theoretically and empirically demonstrate that high-performance networks exhibit SP and that SAM effectively captures and measures this SP, leading to improved representational ability. Inspired by this, they propose StepNet, a novel self-attention network that better estimates stiffness information. |
The stiffness phenomenon, commonly observed in high-precision ODE solutions, is also prevalent in high-performance neural networks.
The self-attention mechanism acts as a stiffness-aware step size adaptor, refining stiffness information and generating adaptive attention values to better measure SP.
StepNet, inspired by this understanding, effectively captures and measures SP, leading to improved performance in image classification and object detection tasks. |
The paper focuses on channel attention networks, with transformer-based models briefly discussed.
The optimal structure of the adaptor in StepNet requires further investigation. |
self-attention mechanism, dynamical systems, stiffness phenomenon, representational ability, stepnet |
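The link between stiffness and step size can be made concrete with a toy stiff ODE, y' = -rate * y, integrated by forward Euler (the same update form as a residual block, x_{t+1} = x_t + step * f(x_t)). This is an illustrative sketch of why a stiffness-aware step size matters, not the StepNet architecture; the `stable_step` rule and all constants are assumptions.

```python
def euler_linear(rate, y0, step, n_steps):
    """Forward Euler on y' = -rate * y; each update multiplies y by (1 - step * rate)."""
    y = y0
    for _ in range(n_steps):
        y += step * (-rate * y)
    return y

def stable_step(rate, safety=0.5):
    """Hypothetical stiffness-aware rule: shrink the step as the local rate grows.

    Forward Euler on this ODE is stable only when step < 2 / rate.
    """
    return safety / abs(rate)

rate = 50.0                                   # large rate makes the problem stiff
coarse = euler_linear(rate, 1.0, 0.05, 20)    # fixed step above the stability limit
step = stable_step(rate)                      # 0.01 for rate = 50
fine = euler_linear(rate, 1.0, step, 100)     # same time horizon, adapted step
```

Both runs integrate to the same end time, but the fixed coarse step oscillates and blows up while the adapted step decays towards zero like the true solution. Under the paper's reading, attention values play the role of such an adaptive step applied per feature.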
2308.09931
Report |
TDG: Text-guided Domain Generalization |
Geng Liu, Yuxi Wang |
Domain generalization (DG) attempts to generalize a model trained on single
or multiple source domains to the unseen target domain. Motivated by the
success of Visual-and-Language Pre-trained models in recent years, we argue
that introducing extra text information is crucial for domain
generalization. In this paper, we develop a novel Text-guided Domain
Generalization (TDG) paradigm, which comprises the following three
aspects. Specifically, we first devise an automatic word generation
method to extend the description of current domains with novel domain-relevant
words. Then, we embed the generated domain information into the text feature
space, by the proposed prompt learning-based text feature generation method,
which shares a common representation space with the image feature. Finally, we
utilize both input image features and generated text features to train a
specially designed classifier that generalizes well on unseen target domains,
while the image encoder is also updated under the supervision of gradients
back-propagated from the classifier. Our experimental results show that the
techniques incorporated by TDG improve performance while remaining easy to
implement. Experimental results on several domain generalization
benchmarks show that our proposed framework achieves superior performance by
effectively leveraging generated text information in domain generalization. |
This document provides author guidelines for the International Conference on Computer Vision (ICCV) proceedings. |
These guidelines ensure uniformity and quality in submissions, aiding the review process and final publication. |
The paper details formatting instructions, including language, length, style, citations, figures, and the submission process. |
Papers should not exceed eight pages excluding references.
A strict double-blind review policy must be followed.
All graphics and figures should be high-resolution and legible in print. |
The document focuses on LaTeX formatting, potentially limiting accessibility for users of other systems.
The guidelines primarily concern formatting, lacking details on the content or structure expected in submissions. |
iccv, conference paper, author guidelines, formatting, latex |
2308.09804
Report |
VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control |
Zi-Yuan Hu, Yanyang Li, Michael R. Lyu, Liwei Wang |
As the model size of pre-trained language models (PLMs) grows rapidly, full
fine-tuning becomes prohibitively expensive for model training and storage. In
vision-and-language (VL), parameter-efficient tuning (PET) techniques are
proposed to integrate modular modifications (e.g., Adapter and LoRA) into
encoder-decoder PLMs. By tuning a small set of trainable parameters, these
techniques perform on par with full fine-tuning. However, excessive modular
modifications and neglecting the functionality gap between the encoders and
decoders can lead to performance degradation, while existing PET techniques
(e.g., VL-Adapter) overlook these critical issues. In this paper, we propose a
Vision-and-Language Parameter-Efficient Tuning (VL-PET) framework to impose
effective control over modular modifications via a novel granularity-controlled
mechanism. Considering different granularity-controlled matrices generated by
this mechanism, a variety of model-agnostic VL-PET modules can be instantiated
from our framework for better efficiency and effectiveness trade-offs. We
further propose lightweight PET module designs to enhance VL alignment and
modeling for the encoders and maintain text generation for the decoders.
Extensive experiments conducted on four image-text tasks and four video-text
tasks demonstrate the efficiency, effectiveness and transferability of our
VL-PET framework. In particular, our VL-PET-large with lightweight PET module
designs significantly outperforms VL-Adapter by 2.92% (3.41%) and LoRA by 3.37%
(7.03%) with BART-base (T5-base) on image-text tasks. Furthermore, we validate
the enhanced effect of employing our VL-PET designs on existing PET techniques,
enabling them to achieve significant performance improvements. Our code is
available at https://github.com/HenryHZY/VL-PET. |
Proposes VL-PET, a Vision-and-Language Parameter-Efficient Tuning framework for encoder-decoder generative PLMs with a novel granularity-controlled mechanism, multi-head modular modifications and lightweight PET module designs. |
To address the critical issues of excessive modular modifications leading to performance degradation and neglecting the functionality gap between the encoders and decoders in VL PET techniques. |
Introduces a granularity-controlled mechanism to regulate modular modifications, proposes a multi-head modular modification, and introduces lightweight PET module designs tailored for encoders (enhance VL alignment) and decoders (maintain text generation). |
VL-PET significantly outperforms state-of-the-art PET techniques on image-text tasks with BART-base and T5-base.
VL-PET achieves comparable performance to full fine-tuning while using significantly fewer trainable parameters.
VL-PET designs effectively enhance existing PET techniques like Compacter and VL-Adapter. |
Video-text experiments are conducted with only one seed, potentially affecting result reliability.
Generalization of VL-PET designs to all VL tasks and other domains requires further investigation. |
parameter-efficient tuning, vision-and-language, generative plms, multi-task learning, granularity control |
2308.09779
Report |
EAVL: Explicitly Align Vision and Language for Referring Image Segmentation |
Yichen Yan, Xingjian He, Wenxuan Wang, Sihan Chen, Jing Liu |
Referring image segmentation aims to segment an object mentioned in natural
language from an image. A main challenge is language-related localization,
which means locating the object with the relevant language. Previous approaches
mainly focus on the fusion of vision and language features without fully
addressing language-related localization. In previous approaches, fused
vision-language features are directly fed into a decoder and pass through a
convolution with a fixed kernel to obtain the result, which follows a similar
pattern as traditional image segmentation. This approach does not explicitly
align language and vision features in the segmentation stage, resulting in a
suboptimal language-related localization. Different from previous methods, we
propose Explicitly Align the Vision and Language for Referring Image
Segmentation (EAVL). Instead of using a fixed convolution kernel, we propose an
Aligner which explicitly aligns the vision and language features in the
segmentation stage. Specifically, a series of unfixed convolution kernels are
generated based on the input language expression and are then used to
explicitly align the vision and language features. To achieve this, we
generate multiple queries that
represent different emphases of the language expression. These queries are
transformed into a series of query-based convolution kernels. Then, we utilize
these kernels to do convolutions in the segmentation stage and obtain a series
of segmentation masks. The final result is obtained through the aggregation of
all masks. Our method can not only fuse vision and language features
effectively but also exploit their potential in the segmentation stage. And
most importantly, we explicitly align language features of different emphases
with the image features to achieve language-related localization. Our method
surpasses previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by
large margins. |
This paper presents EAVL, a novel framework for referring image segmentation that explicitly aligns vision and language features in the segmentation stage to enhance text-to-pixel fine-grained correlation. |
Referring image segmentation, aiming to segment an object referred by natural language from an image, faces a key challenge of text-to-pixel fine-grained correlation, which prior methods fail to address effectively. |
EAVL employs CLIP to extract vision and language features, generates multiple queries representing different emphases of the input sentence, transforms these queries into dynamic convolution kernels, and uses them to produce multiple segmentation masks that are then aggregated based on their importance scores. |
EAVL significantly outperforms previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref datasets.
Explicit alignment of vision and language features in the segmentation stage through query-based convolution kernels proves highly effective.
Utilizing global and fine-grained information from CLIP enhances the model's understanding of both visual and textual inputs. |
The model shows limitations in handling detailed areas, indicating a potential avenue for future research.
The impact of varying query numbers on performance needs further exploration to optimize efficiency. |
referring image segmentation, text-to-pixel fine-grained correlation, vision-language alignment, dynamic convolution kernels, clip |
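The query-to-kernel step described in the EAVL entry above can be illustrated with a minimal sketch. It assumes each query is turned into a 1x1 convolution kernel (a per-pixel dot product) and that the masks are combined with softmax-normalized importance scores; both choices, as well as the names `dynamic_masks` and `aggregate_masks`, are illustrative assumptions and not details confirmed by the summary.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_masks(features, kernels):
    """Apply each query-derived 1x1 kernel to every pixel feature.

    features: H x W x C nested lists (fused vision-language features).
    kernels:  Q x C nested lists (one hypothetical kernel per language query).
    Returns Q mask-logit maps, each H x W.
    """
    return [
        [[sum(k * f for k, f in zip(kernel, pixel)) for pixel in row]
         for row in features]
        for kernel in kernels
    ]


def aggregate_masks(masks, importance_scores):
    """Combine per-query masks using softmax-normalized importance scores."""
    weights = softmax(importance_scores)
    height, width = len(masks[0]), len(masks[0][0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, masks)) for j in range(width)]
        for i in range(height)
    ]


# Toy input: a 1 x 2 image with 2 channels and two query kernels.
features = [[[1.0, 0.0], [0.0, 1.0]]]
kernels = [[1.0, 0.0], [0.0, 1.0]]
masks = dynamic_masks(features, kernels)
final_mask = aggregate_masks(masks, importance_scores=[0.0, 0.0])
```

Because the kernels depend on the language input rather than being fixed weights, each query can highlight a different region before the masks are merged.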
2308.09718
Report |
Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training |
Xiaoyang Wu, Zhuotao Tian, Xin Wen, Bohao Peng, Xihui Liu, Kaicheng Yu, Hengshuang Zhao |
The rapid advancement of deep learning models is often attributed to their
ability to leverage massive training data. In contrast, such privilege has not
yet fully benefited 3D deep learning, mainly due to the limited availability of
large-scale 3D datasets. Merging multiple available data sources and letting
them collaboratively train a single model is a potential solution. However, due
to the large domain gap between 3D point cloud datasets, such mixed supervision
could adversely affect the model's performance and lead to degenerated
performance (i.e., negative transfer) compared to single-dataset training. In
view of this challenge, we introduce Point Prompt Training (PPT), a novel
framework for multi-dataset synergistic learning in the context of 3D
representation learning that supports multiple pre-training paradigms. Based on
this framework, we propose Prompt-driven Normalization, which adapts the model
to different datasets with domain-specific prompts and Language-guided
Categorical Alignment that decently unifies the multiple-dataset label spaces
by leveraging the relationship between label text. Extensive experiments verify
that PPT can overcome the negative transfer associated with synergistic
learning and produce generalizable representations. Notably, it achieves
state-of-the-art performance on each dataset using a single weight-shared model
with supervised multi-dataset training. Moreover, when serving as a pre-training
framework, it outperforms other pre-training approaches regarding
representation quality and attains remarkable state-of-the-art performance
across over ten diverse downstream tasks spanning both indoor and outdoor 3D
scenarios. |
This paper proposes Point Prompt Training (PPT), a novel framework for multi-dataset synergistic learning in 3D representation learning to overcome the negative transfer issue in existing methods. |
Scaling up 3D representation learning with limited data from different domains is crucial for the advancement of 3D deep learning. However, existing methods suffer from negative transfer when naively merging different datasets. |
PPT leverages domain-specific prompts to adapt the model to different datasets and employs a language-guided categorical alignment to unify the label space across datasets. It supports both supervised and unsupervised pre-training. |
PPT successfully mitigates negative transfer and achieves state-of-the-art performance on various indoor and outdoor 3D semantic segmentation benchmarks.
It also shows superior performance in instance segmentation and data-efficient learning settings.
The proposed framework is effective with both small and large-scale backbones and consistently improves performance. |
The exploration of more advanced prompting and pre-training techniques is needed.
Designing more efficient large-scale 3D backbones is crucial for fully leveraging the benefit of PPT. |
3d deep learning, representation learning, multi-dataset training, prompt learning, semantic segmentation |
2308.09710
Report |
SimDA: Simple Diffusion Adapter for Efficient Video Generation |
Zhen Xing, Qi Dai, Han Hu, Zuxuan Wu, Yu-Gang Jiang |
The recent wave of AI-generated content has witnessed the great development
and success of Text-to-Image (T2I) technologies. By contrast, Text-to-Video
(T2V) still falls short of expectations despite attracting increasing interest.
Existing works either train from scratch or adapt large T2I models to videos,
both of which are computationally expensive and resource intensive. In this
work, we propose
a Simple Diffusion Adapter (SimDA) that fine-tunes only 24M out of 1.1B
parameters of a strong T2I model, adapting it to video generation in a
parameter-efficient way. In particular, we turn the T2I model for T2V by
designing light-weight spatial and temporal adapters for transfer learning.
Besides, we change the original spatial attention to the proposed Latent-Shift
Attention (LSA) for temporal consistency. With similar model architecture, we
further train a video super-resolution model to generate high-definition
(1024x1024) videos. In addition to T2V generation in the wild, SimDA could also
be utilized in one-shot video editing with only 2 minutes of tuning. In this way, our
method could minimize the training effort with extremely few tunable parameters
for model adaptation. |
Proposes SimDA, a parameter-efficient video diffusion model based on Stable Diffusion, for text-guided video generation and editing, utilizing lightweight adapters and latent-shift attention for efficient spatial and temporal modeling. |
Addresses the limitations of existing Text-to-Video (T2V) methods that require significant computational resources and training time due to large model sizes, by enabling efficient adaptation of pre-trained Text-to-Image (T2I) models. |
Introduces spatial and temporal adapters to the Stable Diffusion model for transferring knowledge from image to video domain. Employs latent-shift attention for effective and efficient temporal modeling. Trains a separate super-resolution model for generating high-definition videos. |
Achieves state-of-the-art results on text-to-video generation benchmarks, outperforming or being on par with computationally expensive methods.
Demonstrates significant speedup in training and inference time compared to other T2V techniques.
Shows promising results in one-shot text-guided video editing, achieving superior performance with fewer training steps. |
The current model is limited to generating videos at a resolution of 1024x1024; future work could explore higher resolutions.
SimDA currently relies on a two-stage training approach for super-resolution; exploring end-to-end solutions could further improve efficiency. |
text-to-video generation, video diffusion models, parameter-efficient fine-tuning, text-guided video editing, latent-shift attention |
2308.09610
Report |
On the Effectiveness of LayerNorm Tuning for Continual Learning in Vision Transformers |
Thomas De Min, Massimiliano Mancini, Karteek Alahari, Xavier Alameda-Pineda, Elisa Ricci |
State-of-the-art rehearsal-free continual learning methods exploit the
peculiarities of Vision Transformers to learn task-specific prompts,
drastically reducing catastrophic forgetting. However, there is a tradeoff
between the number of learned parameters and the performance, making such
models computationally expensive. In this work, we aim to reduce this cost
while maintaining competitive performance. We achieve this by revisiting and
extending a simple transfer learning idea: learning task-specific normalization
layers. Specifically, we tune the scale and bias parameters of LayerNorm for
each continual learning task, selecting them at inference time based on the
similarity between task-specific keys and the output of the pre-trained model.
To make the classifier robust to incorrect selection of parameters during
inference, we introduce a two-stage training procedure, where we first optimize
the task-specific parameters and then train the classifier with the same
selection procedure of the inference time. Experiments on ImageNet-R and
CIFAR-100 show that our method achieves results that are either superior or on
par with the state of the art while being computationally cheaper. |
This paper proposes Continual LayerNorm (C-LayerNorm), a novel rehearsal-free continual learning method that tunes task-specific LayerNorm parameters in Vision Transformers to mitigate catastrophic forgetting. |
Existing rehearsal-free methods rely on task-specific prompts and suffer from a trade-off between performance and the number of learned parameters, making them computationally expensive. This paper addresses this limitation by exploring a more efficient approach. |
The method involves learning distinct scale and bias parameters for LayerNorm layers for each task. During inference, task-specific keys are used to select the most relevant LayerNorm parameters based on the input. The paper introduces both two-stage (task identification followed by prediction) and single-stage (integrated task identification and prediction) variants of C-LayerNorm. |
C-LayerNorm achieves state-of-the-art accuracy on both CIFAR-100 and ImageNet-R benchmarks, outperforming existing rehearsal-free methods.
The method significantly reduces the number of trainable parameters compared to prompt-based methods while maintaining competitive performance.
The single-stage variant of C-LayerNorm offers faster inference time compared to two-stage methods (including existing prompt-based approaches) without significant performance degradation. |
Despite achieving higher accuracy, C-LayerNorm exhibits a slightly higher forgetting rate compared to prompt-based methods, suggesting room for improvement in parameter isolation.
The single-stage variant, while faster, demonstrates slightly lower accuracy in certain scenarios compared to the two-stage variant, indicating a potential need for strategies to ensure consistent task identification across layers. |
continual learning, vision transformers, layer normalization, parameter-efficient fine-tuning, catastrophic forgetting |
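The selection of task-specific LayerNorm parameters described in the entry above can be sketched as follows. This is a minimal illustration, assuming cosine similarity between the pre-trained feature and one key per task; the names `select_task`, `task_keys`, and `task_params` are hypothetical and the numbers are toy values.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_task(feature, task_keys):
    """Return the index of the task key most similar to the feature."""
    similarities = [cosine_similarity(feature, key) for key in task_keys]
    return max(range(len(task_keys)), key=lambda i: similarities[i])


def layer_norm(x, scale, bias, eps=1e-5):
    """Standard LayerNorm over one vector with per-element scale and bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [s * (v - mean) / math.sqrt(var + eps) + b
            for v, s, b in zip(x, scale, bias)]


# One key and one (scale, bias) pair per continual learning task (toy values).
task_keys = [[1.0, 0.0], [0.0, 1.0]]
task_params = [
    {"scale": [1.0, 1.0], "bias": [0.0, 0.0]},
    {"scale": [2.0, 2.0], "bias": [0.5, 0.5]},
]

# Output of the frozen pre-trained model for the current input (assumed given).
pretrained_feature = [0.1, 0.9]
task_id = select_task(pretrained_feature, task_keys)
params = task_params[task_id]
output = layer_norm([1.0, 3.0], params["scale"], params["bias"])
```

Only the scale and bias vectors differ between tasks here, which is why the number of task-specific parameters stays small.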
2308.09599
Report |
Language-Guided Diffusion Model for Visual Grounding |
Sijia Chen, Baochun Li |
Visual grounding (VG) tasks involve explicit cross-modal alignment, as
semantically corresponding image regions are to be located for the language
phrases provided. Existing approaches complete such visual-text reasoning in a
single-step manner. Their performance places high demands on large-scale
anchors and over-designed multi-modal fusion modules based on human priors,
leading to complicated frameworks that may be difficult to train and overfit to
specific scenarios. Even worse, such once-for-all reasoning mechanisms are
incapable of refining boxes continuously to enhance query-region matching. In
contrast, in this paper, we formulate an iterative reasoning process by
denoising diffusion modeling. Specifically, we propose a language-guided
diffusion framework for visual grounding, LG-DVG, which trains the model to
progressively reason queried object boxes by denoising a set of noisy boxes
with the language guide. To achieve this, LG-DVG gradually perturbs
query-aligned ground truth boxes to noisy ones and reverses this process step
by step, conditional on query semantics. Extensive experiments for our proposed
framework on five widely used datasets validate the superior performance of
solving visual grounding, a cross-modal alignment task, in a generative way.
The source codes are available at
https://github.com/iQua/vgbase/tree/DiffusionVG. |
Proposes LG-DVG, a language-guided diffusion model for visual grounding, which iteratively refines bounding boxes based on text queries using a Markov Chain. |
Addresses limitations of existing visual grounding methods that rely on single-step reasoning, complex architectures, and pre-defined anchors, leading to difficulties in training and overfitting. |
Formulates visual grounding as a generative task where noisy boxes are progressively denoised to target boxes guided by text queries. Employs a novel cross-modal transformer and a query-conditioned predictor within a diffusion model framework. |
Achieves competitive accuracy on phrase localization and referring expression comprehension tasks, outperforming most state-of-the-art methods.
Demonstrates progressive refinement capability, with accuracy increasing as the number of sampling steps increases.
Effectively handles one-to-many scenarios where a single query may correspond to multiple ground-truth boxes. |
Limited ability to fully exploit semantic relationships within text queries for enhanced reasoning.
Future work could explore incorporating semantic parsing or graph-based representations of text queries. |
visual grounding, diffusion models, iterative reasoning, cross-modal alignment, generative models |
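The box-perturbation step in the LG-DVG entry above can be illustrated with the standard DDPM forward process, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise. The sketch below assumes a linear beta schedule and normalized box coordinates; the schedule values and the function names are assumptions for illustration, not the settings reported by the authors.

```python
import math
import random


def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) over the first t steps (linear schedule)."""
    product = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        product *= 1.0 - beta
    return product


def perturb_box(box, t, rng, num_steps=1000):
    """Sample a noisy box at step t from a clean box (standard DDPM forward step)."""
    a = alpha_bar(t, num_steps)
    return [math.sqrt(a) * c + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for c in box]


rng = random.Random(0)
ground_truth_box = [0.5, 0.5, 0.2, 0.3]  # hypothetical (cx, cy, w, h), normalized
slightly_noisy = perturb_box(ground_truth_box, 10, rng)
almost_pure_noise = perturb_box(ground_truth_box, 1000, rng)
```

A denoising model would be trained to reverse this process step by step, conditioned on the text query; that part is not shown here.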
2308.09592
Report |
StableVideo: Text-driven Consistency-aware Diffusion Video Editing |
Wenhao Chai, Xun Guo, Gaoang Wang, Yan Lu |
Diffusion-based methods can generate realistic images and videos, but they
struggle to edit existing objects in a video while preserving their appearance
over time. This prevents diffusion models from being applied to natural video
editing in practical scenarios. In this paper, we tackle this problem by
introducing temporal dependency to existing text-driven diffusion models, which
allows them to generate consistent appearance for the edited objects.
Specifically, we develop a novel inter-frame propagation mechanism for
diffusion video editing, which leverages the concept of layered representations
to propagate the appearance information from one frame to the next. We then
build up a text-driven video editing framework based on this mechanism, namely
StableVideo, which can achieve consistency-aware video editing. Extensive
experiments demonstrate the strong editing capability of our approach. Compared
with state-of-the-art video editing methods, our approach shows superior
qualitative and quantitative results. Our code is available at
https://github.com/rese1f/StableVideo. |
This paper proposes StableVideo, a novel text-driven video editing framework that leverages diffusion models and Neural Layered Atlas (NLA) to enable consistent appearance editing of objects in videos. |
Existing diffusion-based video editing methods struggle to maintain temporal consistency in object appearance, limiting their practical application for natural video editing. |
StableVideo uses NLA to decompose videos into foreground and background layers. For foreground editing, it employs key frame editing with an inter-frame propagation mechanism to ensure geometric and temporal consistency. An aggregation network then generates the final edited foreground atlas from the key frames. The edited foreground and background are finally combined to reconstruct the edited video. |
StableVideo achieves high-quality video editing with consistent object appearance across time, outperforming state-of-the-art methods like Tune-A-Video and Text2LIVE.
The proposed inter-frame propagation mechanism effectively maintains geometric and appearance consistency during key frame editing.
The aggregation network successfully generates coherent atlas representations from the edited key frames, ensuring smooth transitions between edited frames. |
StableVideo's performance depends on the accuracy of NLA, which can be challenged by non-rigid objects or complex motions.
The editing quality is limited by the capabilities of the underlying diffusion model, which may not always generate ideal results, especially for complex scenarios like humans or animals. |
video editing, diffusion models, temporal consistency, neural layered atlas, text-driven generation |
2308.09544
Report |
Adapt Your Teacher: Improving Knowledge Distillation for Exemplar-free Continual Learning |
Filip Szatkowski, Mateusz Pyla, Marcin Przewięźlikowski, Sebastian Cygert, Bartłomiej Twardowski, Tomasz Trzciński |
In this work, we investigate exemplar-free class incremental learning (CIL)
with knowledge distillation (KD) as a regularization strategy, aiming to
prevent forgetting. KD-based methods are successfully used in CIL, but they
often struggle to regularize the model without access to exemplars of the
training data from previous tasks. Our analysis reveals that this issue
originates from substantial representation shifts in the teacher network when
dealing with out-of-distribution data. This causes large errors in the KD loss
component, leading to performance degradation in CIL models. Inspired by recent
test-time adaptation methods, we introduce Teacher Adaptation (TA), a method
that concurrently updates the teacher and the main models during incremental
training. Our method seamlessly integrates with KD-based CIL approaches and
allows for consistent enhancement of their performance across multiple
exemplar-free CIL benchmarks. The source code for our method is available at
https://github.com/fszatkowski/cl-teacher-adaptation. |
This paper proposes Teacher Adaptation (TA), a simple yet effective method to improve knowledge distillation-based methods in exemplar-free class-incremental learning. |
Exemplar-free class-incremental learning (CIL) with knowledge distillation (KD) often struggles to regularize the model effectively due to substantial representation shifts in the teacher network when dealing with out-of-distribution data. This can lead to performance degradation in CIL models. |
TA continuously updates the teacher network by adjusting batch normalization statistics during the learning of a new task for both the current model and the teacher model saved from the previous task. This mitigates changes in the model caused by KD loss due to differing normalization statistics. Further improvement is achieved with a warmup phase that trains a new classification head before finetuning the whole model, ensuring more stable initialization. |
TA consistently improves results for various KD methods across standard CIL benchmarks (CIFAR100, TinyImageNet200, ImageNet100).
TA shows more significant improvements in settings with a larger number of tasks and an equal split of classes, where the initial feature extractor is weaker.
TA demonstrates enhanced performance under severe distribution shifts, tested on DomainNet and corrupted CIFAR100 scenarios. |
The performance of TA may be limited with certain KD loss functions, like MKD, which uses a sigmoid function that may result in insignificant probability differences.
TA's effectiveness may be reduced when a sufficient number of exemplars are available, as they help mitigate normalization statistics divergence. |
continual learning, class-incremental learning, knowledge distillation, teacher adaptation, exemplar-free |
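The idea of letting the teacher's batch normalization statistics follow the new task, as described in the Teacher Adaptation entry above, can be sketched for a single feature channel. This is a simplified illustration: it uses the usual exponential moving average with a biased variance estimate, and the class and function names are hypothetical.

```python
class BatchNormStats:
    """Running mean and variance of one feature channel, as kept by batch norm."""

    def __init__(self, mean=0.0, var=1.0, momentum=0.1):
        self.mean = mean
        self.var = var
        self.momentum = momentum

    def update(self, batch):
        """Move the running statistics toward the statistics of this batch."""
        batch_mean = sum(batch) / len(batch)
        batch_var = sum((x - batch_mean) ** 2 for x in batch) / len(batch)
        self.mean = (1 - self.momentum) * self.mean + self.momentum * batch_mean
        self.var = (1 - self.momentum) * self.var + self.momentum * batch_var


def adapt_on_new_task(student_stats, teacher_stats, student_batches, teacher_batches):
    """Update the statistics of both models on new-task activations.

    The teacher's weights stay frozen (not modeled here); only its
    normalization statistics track the new data, alongside the student's.
    """
    for student_batch, teacher_batch in zip(student_batches, teacher_batches):
        student_stats.update(student_batch)
        teacher_stats.update(teacher_batch)


# Statistics left over from the previous task (toy values).
teacher = BatchNormStats(mean=0.0, var=1.0)
student = BatchNormStats(mean=0.0, var=1.0)

# Hypothetical channel activations on the new task: mean 5.0, variance 1.0.
new_task_batches = [[4.0, 6.0]] * 200
adapt_on_new_task(student, teacher, new_task_batches, new_task_batches)
```

Without the teacher update, its statistics would remain at the old-task values while the student's drift, which is the divergence the entry attributes the distillation errors to.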
2308.09540
Report |
Meta-ZSDETR: Zero-shot DETR with Meta-learning |
Lu Zhang, Chenbo Zhang, Jiajia Zhao, Jihong Guan, Shuigeng Zhou |
Zero-shot object detection aims to localize and recognize objects of unseen
classes. Most existing works face two problems: the low recall of the RPN on
unseen classes and the confusion of unseen classes with background. In this
paper, we present the first method that combines DETR and meta-learning to
perform zero-shot object detection, named Meta-ZSDETR, where model training is
formalized as an individual episode-based meta-learning task. Different from
Faster R-CNN based methods that first generate class-agnostic proposals and
then classify them with a visual-semantic alignment module, Meta-ZSDETR directly
predicts class-specific boxes with class-specific queries and further filters
them with the predicted accuracy from the classification head. The model is
optimized with meta-contrastive learning, which contains a regression head to
generate the coordinates of class-specific boxes, a classification head to
predict the accuracy of generated boxes, and a contrastive head that utilizes
the proposed contrastive-reconstruction loss to further separate different
classes in visual space. We conduct extensive experiments on two benchmark
datasets MS COCO and PASCAL VOC. Experimental results show that our method
outperforms the existing ZSD methods by a large margin. |
Proposes Meta-ZSDETR, the first method combining DETR and meta-learning for zero-shot object detection, addressing limitations of previous Faster R-CNN based approaches. |
Existing zero-shot object detection methods suffer from low recall for unseen classes and confusion with background; Meta-ZSDETR overcomes these by utilizing DETR and meta-learning. |
Meta-ZSDETR formalizes training as an episodic meta-learning task, fusing object queries with semantic vectors to predict class-specific boxes, and utilizes meta-contrastive learning with regression, classification, and contrastive heads for optimization. |
Achieves state-of-the-art performance on PASCAL VOC and MS COCO, surpassing previous methods by a significant margin.
Demonstrates strong generalization ability to unseen classes, improving mAP and recall significantly.
Shows effectiveness of meta-contrastive learning and class-specific query fusion in improving detection accuracy. |
Computational cost is higher due to the large number of queries used in DETR.
Future work includes exploring more efficient architectures and investigating other meta-learning strategies. |
zero-shot object detection, meta-learning, detr, contrastive learning, visual-semantic alignment |
2308.09421
Report |
MonoNeRD: NeRF-like Representations for Monocular 3D Object Detection |
Junkai Xu, Liang Peng, Haoran Cheng, Hao Li, Wei Qian, Ke Li, Wenxiao Wang, Deng Cai |
In the field of monocular 3D detection, it is common practice to utilize
scene geometric clues to enhance the detector's performance. However, many
existing works adopt these clues explicitly such as estimating a depth map and
back-projecting it into 3D space. This explicit methodology induces sparsity in
3D representations due to the increased dimensionality from 2D to 3D, and leads
to substantial information loss, especially for distant and occluded objects.
To alleviate this issue, we propose MonoNeRD, a novel detection framework that
can infer dense 3D geometry and occupancy. Specifically, we model scenes with
Signed Distance Functions (SDF), facilitating the production of dense 3D
representations. We treat these representations as Neural Radiance Fields
(NeRF) and then employ volume rendering to recover RGB images and depth maps.
To the best of our knowledge, this work is the first to introduce volume
rendering for M3D, and demonstrates the potential of implicit reconstruction
for image-based 3D perception. Extensive experiments conducted on the KITTI-3D
benchmark and Waymo Open Dataset demonstrate the effectiveness of MonoNeRD.
Codes are available at https://github.com/cskkxjk/MonoNeRD. |
This paper proposes MonoNeRD, a novel monocular 3D object detection framework that leverages NeRF-like continuous 3D representations for accurate 3D perception from a single image. |
Existing methods using explicit depth information for 3D representations in monocular 3D detection suffer from sparsity and information loss, especially for distant objects. |
The method constructs position-aware frustum features from 2D image features and 3D coordinates, then uses them to generate signed distance fields and radiance fields. Volume rendering is employed to recover RGB images and depth maps, supervised by original images and LiDAR data. Finally, regular 3D voxel features are generated for object detection. |
MonoNeRD achieves state-of-the-art results on the KITTI 3D detection benchmark, especially for moderate and hard difficulty levels.
The method exhibits superior performance in handling distant and occluded objects on both KITTI and Waymo datasets.
Visualization results demonstrate that MonoNeRD produces denser and more continuous 3D representations compared to depth-map-based methods. |
The performance heavily relies on the modeling approach.
Current implementation with bounds modeling might fail to predict 3D occupancy for areas outside the specified bounds, such as the sky. |
monocular 3d object detection, neural radiance fields, signed distance function, volume rendering, 3d representation learning |
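The volume rendering step mentioned in the MonoNeRD entry above can be illustrated with the standard NeRF quadrature for a single ray. The sketch takes per-sample densities as given (the conversion from signed distance values to density is not shown) and the function names are assumptions for illustration.

```python
import math


def render_weights(densities, deltas):
    """Per-sample contribution weights along one ray (standard NeRF quadrature).

    densities: volume density at each sample, ordered from near to far.
    deltas:    distance between consecutive samples along the ray.
    """
    weights = []
    transmittance = 1.0  # fraction of light not yet absorbed
    for sigma, delta in zip(densities, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights


def render_depth(densities, sample_depths, deltas):
    """Expected depth along the ray: weighted sum of the sample depths."""
    weights = render_weights(densities, deltas)
    return sum(w * d for w, d in zip(weights, sample_depths))


# Toy ray: empty space for two samples, then an opaque surface at depth 3.
densities = [0.0, 0.0, 1e9, 5.0]
sample_depths = [1.0, 2.0, 3.0, 4.0]
deltas = [1.0, 1.0, 1.0, 1.0]
weights = render_weights(densities, deltas)
depth = render_depth(densities, sample_depths, deltas)
```

Replacing the sample depths with per-sample colors in the weighted sum gives the rendered RGB value for the same ray, so both outputs can be compared against image and LiDAR supervision.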
2308.09386
Report |
DReg-NeRF: Deep Registration for Neural Radiance Fields |
Yu Chen, Gim Hee Lee |
Although Neural Radiance Fields (NeRF) have recently become popular in the
computer vision community, registering multiple NeRFs has yet to gain much attention.
Unlike the existing work, NeRF2NeRF, which is based on traditional optimization
methods and needs human annotated keypoints, we propose DReg-NeRF to solve the
NeRF registration problem on object-centric scenes without human intervention.
After training NeRF models, our DReg-NeRF first extracts features from the
occupancy grid in NeRF. Subsequently, our DReg-NeRF utilizes a transformer
architecture with self-attention and cross-attention layers to learn the
relations between pairwise NeRF blocks. In contrast to state-of-the-art (SOTA)
point cloud registration methods, the decoupled correspondences are supervised
by surface fields without any ground truth overlapping labels. We construct a
novel view synthesis dataset with 1,700+ 3D objects obtained from Objaverse to
train our network. When evaluated on the test set, our proposed method beats
the SOTA point cloud registration methods by a large margin, with a mean
$\text{RPE}=9.67^{\circ}$ and a mean $\text{RTE}=0.038$.
Our code is available at https://github.com/AIBluefisher/DReg-NeRF. |
This paper introduces DReg-NeRF, a novel deep learning method for registering multiple Neural Radiance Fields (NeRFs) in object-centric scenes without human intervention or initializations. |
Registering multiple NeRFs, trained on data captured in different coordinate frames (e.g., from cameras without absolute pose information), is crucial for consistent novel view synthesis. Existing methods either rely on human annotations or struggle with the implicit nature of NeRF representations. |
DReg-NeRF extracts features from occupancy grids of NeRF models using a 3D Feature Pyramid Network. These features are then processed by a transformer with self-attention and cross-attention layers to learn inter- and intra-feature relations. A decoder then predicts correspondences between point clouds and their confidence scores, supervised by surface fields from the NeRF models. Finally, a weighted Kabsch-Umeyama algorithm estimates the relative transformation. |
DReg-NeRF outperforms state-of-the-art point cloud registration methods (FGR, REGTR) on a novel dataset created from Objaverse, demonstrating its effectiveness for object-centric NeRF registration.
Surface field supervision is shown to be critical for accurate registration compared to using noisy density fields.
The method achieves fast inference times (0.4 seconds), making it practical for real-time applications. |
The current method is limited to object-centric scenes and struggles with unbounded scenes due to noisy geometry estimations in NeRF.
The assumption of consistent scale between the registered NeRFs might not hold in real-world scenarios, necessitating further research. |
nerf, registration, deep learning, transformer, surface fields |
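The final step in the DReg-NeRF entry above, a weighted Kabsch-Umeyama solve over predicted correspondences, can be sketched as follows. This is an illustrative version under stated assumptions: the function name `weighted_kabsch` is hypothetical, the weights stand in for the predicted confidence scores, and scale is fixed to 1 (consistent with the equal-scale assumption listed among the limitations). It is not taken from the authors' code.

```python
import numpy as np

def weighted_kabsch(src, dst, w):
    """Find R, t minimizing sum_i w_i * ||R @ src_i + t - dst_i||^2 (no scale).

    src, dst: (N, 3) corresponding points; w: (N,) non-negative confidences.
    """
    w = w / w.sum()
    # Weighted centroids of both point sets.
    mu_s = (w[:, None] * src).sum(axis=0)
    mu_d = (w[:, None] * dst).sum(axis=0)
    # Weighted cross-covariance of the centered points.
    H = (src - mu_s).T @ (w[:, None] * (dst - mu_d))
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t
```

Low-confidence correspondences contribute little to the centroids and the cross-covariance, which is why a confidence-weighted solve tends to be more robust than an unweighted one when some predicted matches are poor.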
2308.09351
Report |
RLIPv2: Fast Scaling of Relational Language-Image Pre-training |
Hangjie Yuan, Shiwei Zhang, Xiang Wang, Samuel Albanie, Yining Pan, Tao Feng, Jianwen Jiang, Dong Ni, Yingya Zhang, Deli Zhao |
Relational Language-Image Pre-training (RLIP) aims to align vision
representations with relational texts, thereby advancing the capability of
relational reasoning in computer vision tasks. However, hindered by the slow
convergence of RLIPv1 architecture and the limited availability of existing
scene graph data, scaling RLIPv1 is challenging. In this paper, we propose
RLIPv2, a fast converging model that enables the scaling of relational
pre-training to large-scale pseudo-labelled scene graph data. To enable fast
scaling, RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF), a mechanism
that facilitates earlier and deeper gated cross-modal fusion with sparsified
language encoding layers. ALIF leads to comparable or better performance than
RLIPv1 in a fraction of the time for pre-training and fine-tuning. To obtain
scene graph data at scale, we extend object detection datasets with free-form
relation labels by introducing a captioner (e.g., BLIP) and a designed Relation
Tagger. The Relation Tagger assigns BLIP-generated relation texts to region
pairs, thus enabling larger-scale relational pre-training. Through extensive
experiments conducted on Human-Object Interaction Detection and Scene Graph
Generation, RLIPv2 shows state-of-the-art performance on three benchmarks under
fully fine-tuned, few-shot and zero-shot settings. Notably, the largest RLIPv2
achieves 23.29 mAP on HICO-DET without any fine-tuning, yields 32.22 mAP with
just 1% data and yields 45.09 mAP with 100% data. Code and models are publicly
available at https://github.com/JacobYuan7/RLIPv2. |
RLIPv2 is a fast-converging model for relational language-image pre-training that scales to large pseudo-labeled scene graph datasets, enabling improved relational reasoning in computer vision. |
Existing methods struggle to scale relational language-image pre-training due to slow convergence and limited scene graph data, hindering progress in relational reasoning. |
RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF) for faster cross-modal alignment and leverages a Relation Tagger (R-Tagger) to pseudo-label object detection datasets with relation annotations. |
RLIPv2 achieves comparable or better performance than its predecessor RLIPv1 in a fraction of the training time.
RLIPv2 demonstrates state-of-the-art results on HOI detection benchmarks like HICO-DET and V-COCO under various settings, including zero-shot, few-shot, and fully-finetuned.
RLIPv2 excels in Scene Graph Generation (SGG), achieving state-of-the-art performance on the Open Images v6 dataset. |
The performance of the relational pseudo-labeling pipeline depends on the quality of the captions generated by external captioners.
Future work includes exploring more advanced captioning methods and investigating the transferability of RLIPv2 to other relational reasoning tasks. |
vision-language pre-training, relational reasoning, human-object interaction detection, scene graph generation, pseudo-labeling |
2308.09314
Report |
Retro-FPN: Retrospective Feature Pyramid Network for Point Cloud Semantic Segmentation |
Peng Xiang, Xin Wen, Yu-Shen Liu, Hui Zhang, Yi Fang, Zhizhong Han |
Learning per-point semantic features from the hierarchical feature pyramid is
essential for point cloud semantic segmentation. However, most previous methods
suffered from ambiguous region features or failed to refine per-point features
effectively, which leads to information loss and ambiguous semantic
identification. To resolve this, we propose Retro-FPN to model the per-point
feature prediction as an explicit and retrospective refining process, which
goes through all the pyramid layers to extract semantic features explicitly for
each point. Its key novelty is a retro-transformer for summarizing semantic
contexts from the previous layer and accordingly refining the features in the
current stage. In this way, the categorization of each point is conditioned on
its local semantic pattern. Specifically, the retro-transformer consists of a
local cross-attention block and a semantic gate unit. The cross-attention
serves to summarize the semantic pattern retrospectively from the previous
layer. The gate unit then carefully incorporates the summarized contexts and
refines the current semantic features. Retro-FPN is a pluggable neural network
that applies to hierarchical decoders. By integrating Retro-FPN with three
representative backbones, including both point-based and voxel-based methods,
we show that Retro-FPN can significantly improve performance over
state-of-the-art backbones. Comprehensive experiments on widely used benchmarks
justify the effectiveness of our design. The source code is available at
https://github.com/AllenXiangX/Retro-FPN |
Proposes Retro-FPN, a plug-and-play neural network, to enhance point cloud semantic segmentation by refining per-point features from hierarchical feature pyramids. |
Addresses the limitations of existing encoder-decoder frameworks that suffer from ambiguous region features or ineffective per-point feature refinement, leading to information loss and inaccurate semantic identification. |
Introduces a retro-transformer within each pyramid layer of the decoder. This transformer uses local cross-attention to summarize semantic contexts from the previous layer and a semantic gate unit to refine current features, enabling explicit and retrospective refinement of point-level semantic information. |
Achieves state-of-the-art performance on the S3DIS Area 5 benchmark (73.0 mIoU).
Significantly improves performance across various backbones, including point-based and voxel-based methods, on S3DIS, ScanNet, and SemanticKITTI datasets.
Demonstrates the effectiveness of the explicit and retrospective refinement strategy for per-point semantic feature learning. |
Reliance on K-NN search for local semantic contexts might not be optimal for all point cloud distributions.
Future work could explore flexible neighbor searching strategies for more accurate context capturing and reduced computational cost. |
point cloud segmentation, semantic segmentation, feature pyramid network, retrospective refinement, transformer |
2308.09306
Report |
DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability |
Runhui Huang, Jianhua Han, Guansong Lu, Xiaodan Liang, Yihan Zeng, Wei Zhang, Hang Xu |
Recently, large-scale diffusion models, e.g., Stable Diffusion and DALL-E 2,
have shown remarkable results on image synthesis. On the other hand,
large-scale cross-modal pre-trained models (e.g., CLIP, ALIGN, and FILIP) are
competent for various downstream tasks by learning to align vision and language
embeddings. In this paper, we explore the possibility of jointly modeling
generation and discrimination. Specifically, we propose DiffDis to unify the
cross-modal generative and discriminative pretraining into one single framework
under the diffusion process. DiffDis first formulates the image-text
discriminative problem as a generative diffusion process of the text embedding
from the text encoder conditioned on the image. Then, we propose a novel
dual-stream network architecture, which fuses the noisy text embedding with the
knowledge of latent images from different scales for image-text discriminative
learning. Moreover, the generative and discriminative tasks can efficiently
share the image-branch network structure in the multi-modality model.
Benefiting from diffusion-based unified training, DiffDis achieves both better
generation ability and cross-modal semantic alignment in one architecture.
Experimental results show that DiffDis outperforms single-task models on both
the image generation and the image-text discriminative tasks, e.g., 1.65%
improvement on average accuracy of zero-shot classification over 12 datasets
and 2.42 improvement on FID of zero-shot image synthesis. |
Proposes DiffDis, a unified vision-language diffusion model that jointly learns text-conditioned image generation and image-text alignment within a single diffusion framework. |
Aims to bridge the gap between powerful generative diffusion models and cross-modal discriminative models by empowering the former with the ability to understand and discriminate cross-modal data. |
Reformulates image-text discrimination as a generative diffusion process for text embeddings conditioned on images, introduces a dual-stream network architecture for deep fusion of knowledge from latent images and text, and proposes a unified training paradigm that alternates between generative and discriminative tasks. |
Achieves 1.65% average accuracy improvement on zero-shot classification across 12 datasets compared to single-task baseline.
Outperforms CLIP by 4.7% on average zero-shot classification accuracy and by 14.5% on average R@1 of image-text retrieval on Flickr30k and MSCOCO.
Demonstrates comparable text-guided image generation quality to Stable Diffusion, achieving a 1.0 FID improvement. |
Generation quality for specific domains (e.g., humans, animals) can be further improved by incorporating domain-specific training data.
Presence of watermarks in the training dataset can lead to watermarks in generated images. |
diffusion models, cross-modal learning, image generation, zero-shot classification, image-text retrieval |
2308.09294
Report |
Self-Calibrated Cross Attention Network for Few-Shot Segmentation |
Qianxiong Xu, Wenting Zhao, Guosheng Lin, Cheng Long |
The key to the success of few-shot segmentation (FSS) lies in how to
effectively utilize support samples. Most solutions compress support foreground
(FG) features into prototypes, but lose some spatial details. Others instead
use cross attention to fuse query features with uncompressed support FG. Query
FG can be fused with support FG; however, query background (BG) cannot find
matched BG features in support FG and so inevitably integrates dissimilar
features. Besides, as both query FG and BG are combined with support FG, they
get entangled, thereby leading to ineffective segmentation. To cope with these
issues, we design a self-calibrated cross attention (SCCA) block. For efficient
patch-based attention, query and support features are first split into
patches. Then, we design a patch alignment module to align each query patch
with its most similar support patch for better cross attention. Specifically,
SCCA takes a query patch as Q, and groups the patches from the same query image
and the aligned patches from the support image as K&V. In this way, the query
BG features are fused with matched BG features (from query patches), and thus
the aforementioned issues will be mitigated. Moreover, when calculating SCCA,
we design a scaled-cosine mechanism to better utilize the support features for
similarity calculation. Extensive experiments conducted on PASCAL-5^i and
COCO-20^i demonstrate the superiority of our model, e.g., the mIoU score under
5-shot setting on COCO-20^i is more than 5.6% better than the previous state of the art.
The code is available at https://github.com/Sam1224/SCCAN. |
This paper proposes Self-Calibrated Cross Attention Network (SCCAN) for Few-Shot Segmentation (FSS) to enhance the utilization of support samples by tackling the background mismatch and foreground-background entanglement issues in existing cross-attention based methods. |
Existing FSS methods, particularly those based on cross-attention, struggle with effectively utilizing support samples due to mismatched background features and entanglement of foreground and background information, limiting segmentation accuracy. |
SCCAN leverages a Self-Calibrated Cross Attention (SCCA) block that calculates self and cross attentions concurrently, aligning query patches with the most similar support patches. It also employs a Pseudo Mask Aggregation (PMA) module to generate reliable pseudo masks for query images. |
SCCAN achieves state-of-the-art results on PASCAL-5^i and COCO-20^i datasets, significantly outperforming previous methods.
The proposed SCCA block effectively addresses the background mismatch and foreground-background entanglement issues.
The PMA module generates robust pseudo masks that aid in locating query foreground objects. |
The current k-shot strategy, which involves averaging support features for k>1, might not be optimal for cross-attention and needs further investigation.
Future work could explore using support background information more effectively in cross-attention based FSS. |
few-shot segmentation, cross attention, swin transformer, pseudo mask, foreground-background entanglement |
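The patch alignment idea in the SCCAN entry above (matching each query patch to its most similar support patch before cross attention) can be sketched with cosine similarity. This is a simplified illustration, not the paper's module: it assumes each patch has already been summarized as a single descriptor vector, and the function name `align_patches` is hypothetical.

```python
import numpy as np

def align_patches(query, support):
    """Match each query patch to its most similar support patch.

    query:   (Nq, D) one descriptor per query patch (assumed precomputed)
    support: (Ns, D) one descriptor per support patch
    Returns (best_support_index_per_query, cosine_similarity_matrix).
    """
    # L2-normalize so that the dot product equals cosine similarity.
    qn = query / np.linalg.norm(query, axis=1, keepdims=True)
    sn = support / np.linalg.norm(support, axis=1, keepdims=True)
    sim = qn @ sn.T
    return sim.argmax(axis=1), sim
```

With the returned indices, `support[idx]` gives the aligned support patch for every query patch, which could then be grouped with patches from the same query image to form keys and values as the abstract describes.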
2308.09281
Report |
Diverse Cotraining Makes Strong Semi-Supervised Segmentor |
Yijiang Li, Xinjiang Wang, Lihe Yang, Litong Feng, Wayne Zhang, Ying Gao |
Deep co-training has been introduced to semi-supervised segmentation and
achieves impressive results, yet few studies have explored the working
mechanism behind it. In this work, we revisit the core assumption that supports
co-training: multiple compatible and conditionally independent views. By
theoretically deriving the generalization upper bound, we prove the prediction
similarity between two models negatively impacts the model's generalization
ability. However, most current co-training models are tightly coupled together
and violate this assumption. Such coupling leads to the homogenization of
networks and confirmation bias which consequently limits the performance. To
this end, we explore different dimensions of co-training and systematically
increase the diversity from the aspects of input domains, different
augmentations and model architectures to counteract homogenization. Our Diverse
Co-training outperforms the state-of-the-art (SOTA) methods by a large margin
across different evaluation protocols on Pascal and Cityscapes. For
example, we achieve the best mIoU of 76.2%, 77.7% and 80.2% on Pascal with only
92, 183 and 366 labeled images, surpassing the previous best results by more
than 5%. |
This paper investigates the lack of diversity in current deep co-training methods for semi-supervised segmentation and proposes Diverse Co-training, a holistic approach to increase diversity in input domains, augmentations, and model architectures. |
Co-training methods often suffer from homogenization, where the multiple models being trained become too similar, hindering performance. This paper proves theoretically and shows empirically that homogenization negatively impacts generalization ability in co-training. |
The paper theoretically analyzes the generalization upper bound of co-training, linking it to homogenization. It then explores three techniques to increase diversity: using different input domains (RGB and frequency), applying different augmentations to each model, and using different architectures (CNN and Transformer). These techniques are combined to form Diverse Co-training. |
Diverse Co-training significantly outperforms previous state-of-the-art methods on Pascal VOC 2012 and Cityscapes datasets across various partition protocols.
The paper provides empirical evidence that each of the three proposed techniques (diverse input domains, augmentations, and architectures) contributes to improved performance by reducing homogenization.
The proposed method achieves superior performance with fewer parameters compared to some previous SOTA methods, demonstrating its efficiency. |
The paper primarily focuses on two-model and three-model co-training, leaving the exploration of co-training with more models for future work.
While the paper demonstrates the effectiveness of the chosen hyperparameters, a more thorough hyperparameter search for each setting might yield further performance gains. |
semi-supervised segmentation, co-training, diversity, homogenization, deep learning |
2308.09139
Report |
The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation |
Giacomo Zara, Alessandro Conti, Subhankar Roy, Stéphane Lathuilière, Paolo Rota, Elisa Ricci |
The Source-Free Video Unsupervised Domain Adaptation (SFVUDA) task consists of
adapting an action recognition model, trained on a labelled source dataset, to
an unlabelled target dataset, without accessing the actual source data.
Previous approaches have attempted to address SFVUDA by leveraging
self-supervision (e.g., enforcing temporal consistency) derived from the target
data itself. In this work, we take an orthogonal approach by exploiting
"web-supervision" from Large Language-Vision Models (LLVMs), driven by the
rationale that LLVMs contain a rich world prior surprisingly robust to
domain-shift. We showcase the unreasonable effectiveness of integrating LLVMs
for SFVUDA by devising an intuitive and parameter-efficient method, which we
name Domain Adaptation with Large Language-Vision models (DALL-V), that
distills the world prior and complementary source model information into a
student network tailored for the target. Despite its simplicity, DALL-V
achieves significant improvement over state-of-the-art SFVUDA methods. |
This paper presents DALL-V, a novel Source-Free Video Unsupervised Domain Adaptation (SFVUDA) method that leverages Large Language-Vision Models (LLVMs) like CLIP to adapt action recognition models to unlabeled target domains without accessing source data. |
Existing SFVUDA methods often struggle with domain shift and rely heavily on self-supervision from the target data. This paper argues that the rich world prior encoded in LLVMs can effectively bridge the domain gap, exceeding the performance of current sophisticated SFVUDA methods. |
DALL-V works in two stages: (1) Target Adaptation: Uses zero-shot CLIP to pseudo-label target videos and fine-tunes a target-specific adapter. (2) Ensemble Distillation: Distills information from the source model, target adapter, and CLIP into a smaller student network for inference. |
DALL-V outperforms state-of-the-art SFVUDA methods, even exceeding some VUDA methods that use source data, achieving significant improvements on the Daily-DA, UCF-HMDB(full), and Sports-DA benchmarks.
Ablation studies demonstrate the effectiveness of target adaptation, ensemble distillation, and the use of multiple templates for improving performance.
UMAP visualizations show that DALL-V learns a more discriminative feature space compared to using CLIP or source model alone. |
Reliance on CLIP's black-box nature may pose limitations in safety-critical applications.
Lack of theoretical guarantees that LLVMs will always outperform traditional SFVUDA methods. |
source-free domain adaptation, video action recognition, large language-vision models, clip, knowledge distillation |
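The first stage in the DALL-V entry above, pseudo-labeling target videos with a zero-shot vision-language model, can be sketched generically. The snippet works on precomputed embeddings and does not call any real CLIP API; the function name `zero_shot_pseudo_labels`, the temporal mean pooling, and the temperature value are assumptions chosen for illustration.

```python
import numpy as np

def zero_shot_pseudo_labels(frame_emb, text_emb, temperature=0.01):
    """Assign a pseudo-label to one video from precomputed embeddings.

    frame_emb: (T, D) image embeddings of the video's frames
    text_emb:  (K, D) text embeddings of the K class prompts
    Returns (predicted_class_index, confidence).
    """
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    video = f.mean(axis=0)              # average the frame embeddings over time
    logits = (t @ video) / temperature  # scaled similarity to each class prompt
    logits -= logits.max()              # stabilize the softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(probs.argmax()), float(probs.max())
```

The returned confidence could be used to keep only reliable pseudo-labels before fine-tuning an adapter, although the entry above does not state whether such filtering is applied.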
2308.09098
Report |
ImGeoNet: Image-induced Geometry-aware Voxel Representation for Multi-view 3D Object Detection |
Tao Tu, Shun-Po Chuang, Yu-Lun Liu, Cheng Sun, Ke Zhang, Donna Roy, Cheng-Hao Kuo, Min Sun |
We propose ImGeoNet, a multi-view image-based 3D object detection framework
that models a 3D space by an image-induced geometry-aware voxel representation.
Unlike previous methods which aggregate 2D features into 3D voxels without
considering geometry, ImGeoNet learns to induce geometry from multi-view images
to alleviate the confusion arising from voxels of free space, and during the
inference phase, only images from multiple views are required. Besides, a
powerful pre-trained 2D feature extractor can be leveraged by our
representation, leading to a more robust performance. To evaluate the
effectiveness of ImGeoNet, we conduct quantitative and qualitative experiments
on three indoor datasets, namely ARKitScenes, ScanNetV2, and ScanNet200. The
results demonstrate that ImGeoNet outperforms the current state-of-the-art
multi-view image-based method, ImVoxelNet, on all three datasets in terms of
detection accuracy. In addition, ImGeoNet shows great data efficiency by
achieving results comparable to ImVoxelNet with 100 views while utilizing only
40 views. Furthermore, our studies indicate that our proposed image-induced
geometry-aware representation can enable image-based methods to attain higher
detection accuracy than the seminal point cloud-based method, VoteNet, in two
practical scenarios: (1) scenarios where point clouds are sparse and noisy,
such as in ARKitScenes, and (2) scenarios involving diverse object classes,
particularly classes of small objects, as is the case in ScanNet200. |
This paper proposes ImGeoNet, a multi-view image-based 3D object detection framework that utilizes an image-induced geometry-aware voxel representation. |
Existing multi-view image-based methods often overlook geometric information during feature volume construction, limiting their accuracy. ImGeoNet addresses this limitation by incorporating geometry awareness. |
ImGeoNet constructs a 3D voxel feature volume from multi-view images and then performs geometry shaping. This process involves predicting the likelihood of each voxel belonging to a surface and weighting the feature volume accordingly, thus emphasizing object surfaces and reducing the impact of free space. |
ImGeoNet outperforms the state-of-the-art multi-view image-based method, ImVoxelNet, on ARKitScenes, ScanNetV2, and ScanNet200 datasets.
ImGeoNet achieves comparable results to ImVoxelNet with significantly fewer input views, demonstrating data efficiency.
The proposed geometry-aware representation enables ImGeoNet to outperform the point cloud-based method, VoteNet, in scenarios with sparse point clouds (ARKitScenes) or diverse object classes (ScanNet200). |
There is a performance gap between ImGeoNet and using ground-truth depth for geometry shaping, indicating room for improvement in the Geometry Shaping Network.
The inference time of ImGeoNet is slightly higher than ImVoxelNet for the same number of views. |
3d object detection, multi-view images, geometry-aware representation, voxel feature volume, indoor scenes |
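The geometry shaping step in the ImGeoNet entry above (weighting the voxel feature volume by the predicted likelihood that each voxel lies on a surface) reduces to an elementwise product. The sketch below assumes the network outputs one logit per voxel and uses a sigmoid to obtain a probability; both the function name `shape_volume` and that parameterization are illustrative assumptions.

```python
import numpy as np

def shape_volume(features, surface_logits):
    """Down-weight free-space voxels in a feature volume.

    features:       (C, X, Y, Z) voxel features aggregated from the views
    surface_logits: (X, Y, Z) predicted per-voxel surface logits (assumed)
    """
    prob = 1.0 / (1.0 + np.exp(-surface_logits))  # surface probability in (0, 1)
    return features * prob[None]                  # broadcast over the channel axis
```

Voxels with low surface probability are pushed toward zero, so a downstream detection head mostly sees features near object surfaces.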
2308.09091
Report |
Edit Temporal-Consistent Videos with Image Diffusion Model |
Yuanzhi Wang, Yong Li, Xiaoya Zhang, Xin Liu, Anbo Dai, Antoni B. Chan, Zhen Cui |
Large-scale text-to-image (T2I) diffusion models have been extended for
text-guided video editing, yielding impressive zero-shot video editing
performance. Nonetheless, the generated videos usually show spatial
irregularities and temporal inconsistencies as the temporal characteristics of
videos have not been faithfully modeled. In this paper, we propose an elegant
yet effective Temporal-Consistent Video Editing (TCVE) method to mitigate the
temporal inconsistency challenge for robust text-guided video editing. In
addition to the utilization of a pretrained T2I 2D Unet for spatial content
manipulation, we establish a dedicated temporal Unet architecture to faithfully
capture the temporal coherence of the input video sequences. Furthermore, to
establish coherence and interrelation between the spatial-focused and
temporal-focused components, a cohesive spatial-temporal modeling unit is
formulated. This unit effectively interconnects the temporal Unet with the
pretrained 2D Unet, thereby enhancing the temporal consistency of the generated
videos while preserving the capacity for video content manipulation.
Quantitative experimental results and visualization results demonstrate that
TCVE achieves state-of-the-art performance in both video temporal consistency
and video editing capability, surpassing existing benchmarks in the field. |
This paper presents TCVE, a novel text-guided video editing method that leverages a dedicated temporal Unet and a spatial-temporal modeling unit to enhance temporal consistency in edited videos. |
Existing text-guided video editing methods often produce videos with temporal inconsistencies (e.g., flickering) due to inadequate temporal modeling. |
TCVE employs a pretrained 2D Unet for spatial editing and a dedicated temporal Unet to capture temporal coherence. A spatial-temporal modeling unit connects these Unets, fusing spatial and temporal information for improved consistency. |
TCVE outperforms state-of-the-art methods in quantitative metrics for frame consistency, textual alignment, and human preference.
Ablation studies confirm the significant contributions of the temporal Unet and the spatial-temporal modeling unit.
Qualitative results demonstrate TCVE's capability to generate temporally consistent videos with successful style transfer, object editing, background change, and multiple-object editing. |
TCVE may struggle with simultaneous manipulation of style, objects, and backgrounds due to limitations of image-based text embedding.
Future work can explore incorporating video-based text embedding for enhanced video editing capabilities. |
text-guided video editing, temporal consistency, temporal unet, spatial-temporal modeling, diffusion models |
2308.08947
Report |
Watch Your Steps: Local Image and Scene Editing by Text Instructions |
Ashkan Mirzaei, Tristan Aumentado-Armstrong, Marcus A. Brubaker, Jonathan Kelly, Alex Levinshtein, Konstantinos G. Derpanis, Igor Gilitschenski |
Denoising diffusion models have enabled high-quality image generation and
editing. We present a method to localize the desired edit region implicit in a
text instruction. We leverage InstructPix2Pix (IP2P) and identify the
discrepancy between IP2P predictions with and without the instruction. This
discrepancy is referred to as the relevance map. The relevance map conveys the
importance of changing each pixel to achieve the edits, and is used to guide
the modifications. This guidance ensures that the irrelevant pixels remain
unchanged. Relevance maps are further used to enhance the quality of
text-guided editing of 3D scenes in the form of neural radiance fields. A field
is trained on relevance maps of training views, denoted as the relevance field,
defining the 3D region within which modifications should be made. We perform
iterative updates on the training views guided by rendered relevance maps from
the relevance field. Our method achieves state-of-the-art performance on both
image and NeRF editing tasks. Project page:
https://ashmrz.github.io/WatchYourSteps/ |
This paper presents a method for localized image and scene editing using text instructions, leveraging the discrepancy in noise predictions from a diffusion model (InstructPix2Pix) with and without the instruction, termed the 'relevance map'. |
Existing diffusion-based editing methods often lead to over-editing, modifying regions not specified in the instruction. This method addresses this by explicitly localizing edits to relevant areas, enhancing fidelity to the original input. |
The method calculates a 'relevance map' by comparing noise predictions from InstructPix2Pix with and without the text instruction. This map guides the editing process, restricting changes to high-relevance regions. For 3D scene editing, a 'relevance field' is trained on these maps to maintain consistency across views. |
The method achieves state-of-the-art performance on image editing, surpassing baselines in preserving input fidelity while adhering to instructions.
In 3D scene editing, it demonstrates superior performance in view consistency and edit localization compared to existing techniques.
The generated outputs exhibit high quality and sharpness, outperforming baselines in terms of perceptual metrics like NIQE. |
The method's reliance on InstructPix2Pix for relevance prediction means it cannot recover from cases where InstructPix2Pix fails significantly.
Future work could explore alternative diffusion models for improved robustness and generalization to more complex editing scenarios. |
image editing, scene editing, text-guided editing, diffusion models, neural radiance fields |
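The relevance map in the Watch Your Steps entry above is described as the discrepancy between the diffusion model's predictions with and without the instruction. A minimal sketch of that idea follows. The channel averaging, max normalization, hard threshold, and the function names `relevance_map` and `masked_update` are all assumptions for illustration; the paper's exact normalization and guidance procedure may differ.

```python
import numpy as np

def relevance_map(eps_cond, eps_uncond):
    """Per-pixel relevance from two noise predictions of shape (C, H, W)."""
    # Absolute discrepancy, averaged over channels.
    diff = np.abs(eps_cond - eps_uncond).mean(axis=0)
    # Normalize to [0, 1]; the small constant avoids division by zero.
    return diff / (diff.max() + 1e-8)

def masked_update(original, edited, relevance, threshold=0.5):
    """Keep edited content only where relevance is high; elsewhere keep the input."""
    mask = (relevance >= threshold).astype(original.dtype)
    return mask * edited + (1.0 - mask) * original
```

Pixels below the threshold are copied from the input unchanged, which reflects the entry's stated goal of leaving irrelevant pixels untouched.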
2308.08884
Report |
SRMAE: Masked Image Modeling for Scale-Invariant Deep Representations |
Zhiming Wang, Lin Gu, Feng Lu |
Due to the prevalence of scale variance in natural images, we propose to use
image scale as a self-supervised signal for Masked Image Modeling (MIM). Our
method involves selecting random patches from the input image and downsampling
them to a low-resolution format. Our framework utilizes the latest advances in
super-resolution (SR) to design the prediction head, which reconstructs the
input from low-resolution clues and other patches. After 400 epochs of
pre-training, our Super Resolution Masked Autoencoders (SRMAE) achieve an accuracy
of 82.1% on the ImageNet-1K task. The image scale signal also allows SRMAE to
capture scale-invariant representations. For the very low resolution (VLR)
recognition task, our model achieves the best performance, surpassing DeriveNet
by 1.3%. Our method also achieves an accuracy of 74.84% on the task of
recognizing low-resolution facial expressions, surpassing the current
state-of-the-art FMD by 9.48%. |
This paper proposes Super Resolution Masked Autoencoders (SRMAE), a novel Masked Image Modeling (MIM) framework that leverages image scale as a self-supervised signal for learning scale-invariant representations. |
Scale variance is a prevalent characteristic of natural images and poses challenges for neural networks. Achieving scale invariance is crucial for advancing computer vision, particularly in low-resolution image recognition. |
SRMAE modifies the traditional MIM architecture by incorporating downsampled image patches as input to the prediction head alongside encoded high-resolution patches. It utilizes a High Preserving Block (HPB) module and a lightweight Vision Transformer (ViT) for resolution recovery, drawing inspiration from super-resolution techniques. |
SRMAE achieves 82.1% accuracy on ImageNet-1K after 400 epochs, demonstrating its ability to learn scale-invariant representations.
In very low-resolution digit classification on the SVHN dataset, SRMAE surpasses previous state-of-the-art methods by 1.3%, achieving 89.14% accuracy.
For low-resolution facial expression recognition on the ExpW dataset, SRMAE achieves 74.84% accuracy, surpassing the previous state-of-the-art by 9.5%. |
The paper acknowledges that using scale as a self-supervised signal might lead to suboptimal performance in from-scratch and fine-tuning scenarios compared to methods using original pixel intensity.
Future work can explore incorporating additional modules for enhancing super-resolution capabilities to further improve performance. |
masked image modeling, self-supervised learning, scale invariance, super-resolution, low-resolution image recognition |
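The SRMAE entry above describes selecting random patches and downsampling them, so that the prediction head sees low-resolution clues alongside the remaining high-resolution patches. A minimal sketch of that input preparation follows. The average-pooling downsampler, the 50% selection ratio, the patch size, and all function names are illustrative assumptions and are not taken from the paper.

```python
import random


def split_patches(image, patch_size):
    """Split an H x W image (list of rows) into non-overlapping square patches."""
    patches = {}
    for r in range(0, len(image), patch_size):
        for c in range(0, len(image[0]), patch_size):
            key = (r // patch_size, c // patch_size)
            patches[key] = [row[c:c + patch_size] for row in image[r:r + patch_size]]
    return patches


def downsample_patch(patch, factor):
    """Average-pool a square patch by an integer factor (assumed downsampler)."""
    out = []
    for r in range(0, len(patch), factor):
        row = []
        for c in range(0, len(patch), factor):
            block = [patch[r + i][c + j] for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def make_inputs(image, patch_size=4, ratio=0.5, factor=2, seed=0):
    """Randomly pick patches to degrade; keep the others at full resolution."""
    rng = random.Random(seed)
    patches = split_patches(image, patch_size)
    keys = sorted(patches)
    low_keys = set(rng.sample(keys, int(len(keys) * ratio)))
    high_res = {k: patches[k] for k in keys if k not in low_keys}
    low_res = {k: downsample_patch(patches[k], factor) for k in low_keys}
    return high_res, low_res


# Toy 8x8 "image" with 4x4 patches: two stay high-resolution, two become 2x2.
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
high_res, low_res = make_inputs(image)
```

In this sketch the `high_res` patches stand in for what an encoder would consume, while the `low_res` patches stand in for the low-resolution clues passed to a reconstruction head.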
2308.08857
Report |
D-IF: Uncertainty-aware Human Digitization via Implicit Distribution Field |
Xueting Yang, Yihao Luo, Yuliang Xiu, Wei Wang, Hao Xu, Zhaoxin Fan |
Realistic virtual humans play a crucial role in numerous industries, such as
the metaverse, intelligent healthcare, and self-driving simulation. But creating
them on a large scale with high levels of realism remains a challenge. The
use of deep implicit functions has sparked a new era of image-based 3D
clothed human reconstruction, enabling pixel-aligned shape recovery with fine
details. Since then, the vast majority of works have located the surface by
regressing a deterministic implicit value for each point. However, should all
points be treated equally regardless of their proximity to the surface? In this
paper, we propose replacing the implicit value with an adaptive uncertainty
distribution, to differentiate between points based on their distance to the
surface. This simple "value to distribution" transition yields significant
improvements on nearly all the baselines. Furthermore, qualitative results
demonstrate that models trained with our uncertainty distribution loss
can capture more intricate wrinkles and more realistic limbs. Code and models are
available for research purposes at https://github.com/psyai-net/D-IF_release. |
This paper introduces D-IF, a novel method that utilizes implicit distribution fields to capture uncertainty in image-based 3D clothed human reconstruction, leading to improved detail recovery, particularly in challenging poses and loose garments. |
Creating realistic digital humans with intricate clothing is crucial for various industries, but current methods struggle to balance detail accuracy with handling loose garments and diverse poses. |
D-IF leverages a distribution-guided network to estimate point-wise occupancy distributions instead of deterministic values. It incorporates an uncertainty distribution loss to balance distribution sharpness, and an Occupancy Rectifier to refine coarse outputs. |
D-IF achieves state-of-the-art performance on the CAPE dataset, outperforming previous methods in challenging pose reconstruction.
The method effectively recovers intricate geometric features, mitigating artifacts like distorted limbs and missing details common in other approaches.
D-IF acts as a plug-and-play module, demonstrably improving the accuracy of existing implicit-based human reconstruction methods. |
The method focuses on aleatoric uncertainty, with epistemic uncertainty not directly addressed.
Future work could explore applying D-IF to broader shape reconstruction tasks beyond human bodies. |
3d human reconstruction, implicit distribution fields, uncertainty estimation, deep learning, computer vision |
2308.08769
Report |
Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes |
Zehan Wang, Haifeng Huang, Yang Zhao, Ziang Zhang, Zhou Zhao |
3D scene understanding has gained significant attention due to its wide range
of applications. However, existing methods for 3D scene understanding are
limited to specific downstream tasks, which hinders their practicality in
real-world applications. This paper presents Chat-3D, which combines the 3D
visual perceptual ability of pre-trained 3D representations and the impressive
reasoning and conversation capabilities of advanced LLMs to achieve the first
universal dialogue system for 3D scenes. Specifically, we align 3D
representations into the feature space of LLMs, thus enabling LLMs to perceive
the 3D world. Given the scarcity of 3D scene-text data, we propose a
three-stage training strategy to efficiently utilize the available data for
better alignment. To enhance the reasoning ability and develop a user-friendly
interaction scheme, we further construct a high-quality object-centric 3D
instruction dataset and design an associated object-centric prompt. Our
experiments show that Chat-3D achieves an impressive ability to comprehend
diverse instructions for 3D scenes, engage in intricate spatial reasoning, and
incorporate external knowledge into its responses. Chat-3D achieves a 75.6%
relative score compared with GPT-4 on the constructed instruction dataset. |
This paper introduces Chat-3D, the first universal dialogue system for 3D scenes, combining pre-trained 3D representations with the reasoning and conversational abilities of LLMs. |
Existing 3D scene understanding methods are limited to specific downstream tasks, hindering their practicality. Chat-3D enables general dialogue about 3D scenes, crucial for applications like robotics and human-robot interaction. |
A three-stage training scheme is used: 1) aligning 3D object features with word embeddings, 2) learning object relations via 3D scene-text data, and 3) fine-tuning with an object-centric instruction dataset. |
Chat-3D demonstrates impressive ability to comprehend diverse instructions for 3D scenes and engage in spatial reasoning.
A novel three-stage training approach effectively aligns 3D representations with LLMs in low-resource scenarios.
The constructed object-centric instruction dataset and prompt approach enhance Chat-3D's reasoning ability and user-friendliness. |
The reliance on 3D object segmentation, either from models or annotations, can impact performance.
The current implementation focuses on indoor scenes, limiting generalizability to other environments. |
3d scene understanding, universal dialogue system, multi-modal large language model, object-centric instruction dataset, spatial reasoning |
2308.08754
Report |
Fine-grained Text and Image Guided Point Cloud Completion with CLIP Model |
Wei Song, Jun Zhou, Mingjie Wang, Hongchen Tan, Nannan Li, Xiuping Liu |
This paper focuses on the recently popular task of point cloud completion
guided by multimodal information. Although existing methods have achieved
excellent performance by fusing auxiliary images, there are still some
deficiencies, including the poor generalization ability of the model and
insufficient fine-grained semantic information for extracted features. In this
work, we propose a novel multimodal fusion network for point cloud completion,
which can simultaneously fuse visual and textual information to predict the
semantic and geometric characteristics of incomplete shapes effectively.
Specifically, to overcome the lack of prior information caused by the
small-scale dataset, we employ a pre-trained vision-language model that is
trained with a large number of image-text pairs. Therefore, the textual and
visual encoders of this large-scale model have stronger generalization ability.
Then, we propose a multi-stage feature fusion strategy to fuse the textual and
visual features into the backbone network progressively. Meanwhile, to further
explore the effectiveness of fine-grained text descriptions for point cloud
completion, we also build a text corpus with fine-grained descriptions, which
can provide richer geometric details for 3D shapes. The rich text descriptions
can be used for training and evaluating our network. Extensive quantitative and
qualitative experiments demonstrate the superior performance of our method
compared to state-of-the-art point cloud completion networks. |
This paper introduces FTPNet, a novel multimodal point cloud completion network that leverages a pre-trained CLIP model to fuse visual and textual information and predict the complete 3D shape from a partial point cloud. |
Existing point cloud completion methods struggle with limited generalization ability due to small training datasets and insufficient fine-grained semantic information. This work addresses these limitations by incorporating rich multimodal features. |
The method uses a pre-trained CLIP model to extract visual features from rendered images and textual features from fine-grained geometric descriptions. A multi-stage fusion strategy integrates these features into a basic point cloud completion network. Additionally, a new text corpus 'ViPC-Text' is introduced, containing detailed descriptions of 3D shapes. |
FTPNet significantly outperforms state-of-the-art point cloud completion methods on both known and novel object categories from the ShapeNet-ViPC dataset.
The use of a pre-trained CLIP model leads to better generalization ability and improves the quality of reconstructed shapes.
Fine-grained text descriptions significantly enhance the model's ability to understand and reconstruct complex structures and details. |
The model's understanding of fine-grained text information can be further improved.
Future work can explore the development of a fine-grained and controllable text-guided 3D point cloud completion framework. |
point cloud completion, multimodal learning, clip model, fine-grained text descriptions, 3d shape understanding |
2308.08428
Report |
ALIP: Adaptive Language-Image Pre-training with Synthetic Caption |
Kaicheng Yang, Jiankang Deng, Xiang An, Jiawei Li, Ziyong Feng, Jia Guo, Jing Yang, Tongliang Liu |
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the
performance of various vision-language tasks by scaling up the dataset with
image-text pairs collected from the web. However, the presence of intrinsic
noise and unmatched image-text pairs in web data can potentially affect the
performance of representation learning. To address this issue, we first utilize
the OFA model to generate synthetic captions that focus on the image content.
The generated captions contain complementary information that is beneficial for
pre-training. Then, we propose Adaptive Language-Image Pre-training (ALIP),
a bi-path model that integrates supervision from both raw text and synthetic
captions. As the core components of ALIP, the Language Consistency Gate (LCG)
and Description Consistency Gate (DCG) dynamically adjust the weights of
samples and image-text/caption pairs during the training process. Meanwhile,
the adaptive contrastive loss effectively reduces the impact of noisy data
and improves data efficiency during pre-training. We validate ALIP with
experiments on different scales of models and pre-training datasets.
Experimental results show that ALIP achieves state-of-the-art performance on
multiple downstream tasks including zero-shot image-text retrieval and linear
probe. To facilitate future research, the code and pre-trained models are
released at https://github.com/deepglint/ALIP. |
This paper proposes ALIP (Adaptive Language-Image Pre-training), a bi-path model that leverages synthetic captions to enhance image-text representation learning and mitigate the impact of noisy web data. |
Existing web-crawled datasets for contrastive language-image pre-training suffer from noisy and mismatched image-text pairs, which can degrade representation quality. ALIP addresses this by incorporating synthetic captions and adaptive weighting mechanisms. |
ALIP uses OFA to generate synthetic image captions. It then employs two novel gates: Language Consistency Gate (LCG) to weight samples based on raw text and caption similarity and Description Consistency Gate (DCG) to adjust image-text/caption pair weights. These weights are integrated into an adaptive contrastive loss function. |
ALIP achieves state-of-the-art results on zero-shot image-text retrieval benchmarks Flickr30k and MSCOCO.
ALIP significantly outperforms baselines in linear probe evaluation on 10 downstream datasets, demonstrating enhanced representation power.
While showing improvements on CIFAR10 and CIFAR100, ALIP's zero-shot classification accuracy lags slightly behind state-of-the-art, potentially due to the coarse nature of generated captions. |
The current synthetic caption generation model primarily focuses on coarse-grained descriptions, limiting performance on fine-grained tasks.
Future work will explore the integration of hierarchical information and finer-grained caption generation into ALIP. |
image-text representation learning, contrastive learning, synthetic captions, noise-robust learning, vision-language pre-training |
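The ALIP entry above says that gate-produced weights are integrated into an adaptive contrastive loss. A minimal sketch of a per-sample weighted InfoNCE loss (image-to-text direction only) shows how such weights can down-weight noisy pairs. The weight values, the temperature, and the function names are hypothetical; the actual LCG/DCG formulas are not given in this summary and are not reproduced here.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def weighted_info_nce(image_embs, text_embs, weights, temperature=0.07):
    """Image-to-text InfoNCE where pair i contributes with weights[i].

    weights are placeholders for gate outputs (an assumption for illustration).
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        logits = [dot(img, t) / temperature for t in txts]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += weights[i] * (log_denom - logits[i])
    return total / len(imgs)


# Two toy pairs: matched text gives a lower loss than swapped (mismatched) text.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = weighted_info_nce(images, texts, [1.0, 1.0])
loss_swapped = weighted_info_nce(images, texts[::-1], [1.0, 1.0])
loss_swapped_downweighted = weighted_info_nce(images, texts[::-1], [0.1, 0.1])
```

Lowering the weight of a mismatched pair shrinks its contribution to the loss, which is the general mechanism by which a weighting gate can limit the influence of noisy web text.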
2308.08393
Report |
SIGMA: Scale-Invariant Global Sparse Shape Matching |
Maolin Gao, Paul Roetzer, Marvin Eisenberger, Zorah Lähner, Michael Moeller, Daniel Cremers, Florian Bernard |
We propose a novel mixed-integer programming (MIP) formulation for generating
precise sparse correspondences for highly non-rigid shapes. To this end, we
introduce a projected Laplace-Beltrami operator (PLBO) which combines intrinsic
and extrinsic geometric information to measure the deformation quality induced
by predicted correspondences. We integrate the PLBO, together with an
orientation-aware regulariser, into a novel MIP formulation that can be solved
to global optimality for many practical problems. In contrast to previous
methods, our approach is provably invariant to rigid transformations and global
scaling, initialisation-free, has optimality guarantees, and scales to high
resolution meshes in (empirically observed) linear time. We show
state-of-the-art results for sparse non-rigid matching on several challenging
3D datasets, including data with inconsistent meshing, as well as applications
in mesh-to-point-cloud matching. |
This paper proposes SIGMA, a novel mixed-integer programming formulation for generating precise sparse correspondences for highly non-rigid shapes, using a projected Laplace-Beltrami operator (PLBO) and an orientation-aware regulariser. |
The work addresses limitations of previous methods, such as sensitivity to initialisation, lack of global optimality guarantees, and poor scalability to high-resolution meshes. |
The authors develop a PLBO that combines intrinsic and extrinsic geometry to measure deformation quality, integrate the PLBO and an orientation-aware regulariser into a MIP formulation, and solve for correspondences and shape reconstruction. |
SIGMA achieves state-of-the-art accuracy on challenging datasets, including TOSCA, SMAL, SHREC20, and DT4D-M.
The method is provably invariant to rigid transformations and global scaling, eliminating the need for pre-alignment.
It exhibits empirically linear scaling with mesh resolution, enabling application to high-resolution meshes. |
Performance is not yet perfect for partial shapes due to the increased search space.
The method struggles with topological changes, as the mesh of one shape cannot adequately explain the deformation into the other. |
shape matching, non-rigid deformation, mixed-integer programming, laplace-beltrami operator, global optimality |
2308.08361
Report |
KernelWarehouse: Towards Parameter-Efficient Dynamic Convolution |
Chao Li, Anbang Yao |
Dynamic convolution learns a linear mixture of $n$ static kernels weighted
with their sample-dependent attentions, demonstrating superior performance
compared to normal convolution. However, existing designs are
parameter-inefficient: they increase the number of convolutional parameters by
$n$ times. This inefficiency, together with the optimization difficulty, has
prevented progress toward dynamic convolution with a significantly larger $n$
(e.g., $n>100$ instead of the typical setting $n<10$) that could push forward the
performance boundary. In this paper, we propose KernelWarehouse, a more
general form of dynamic convolution, which can strike a favorable trade-off
between parameter efficiency and representation power. Its key idea is to
redefine the basic concepts of "kernels" and "assembling kernels" in
dynamic convolution from the perspective of reducing kernel dimension and
increasing kernel number significantly. In principle, KernelWarehouse enhances
convolutional parameter dependencies within the same layer and across
successive layers via tactful kernel partition and warehouse sharing, yielding
a high degree of freedom to fit a desired parameter budget. We validate our
method on ImageNet and MS-COCO datasets with different ConvNet architectures,
and show that it attains state-of-the-art results. For instance, the
ResNet18|ResNet50|MobileNetV2|ConvNeXt-Tiny model trained with KernelWarehouse
on ImageNet reaches 76.05%|81.05%|75.52%|82.51% top-1 accuracy. Thanks to its
flexible design, KernelWarehouse can even reduce the model size of a ConvNet
while improving the accuracy, e.g., our ResNet18 model with 36.45%|65.10%
parameter reduction relative to the baseline shows 2.89%|2.29% absolute improvement in
top-1 accuracy. |
This paper presents KernelWarehouse, a more general and parameter-efficient form of dynamic convolution that balances parameter efficiency and representation power by leveraging parameter dependencies within and across convolutional layers. |
Existing dynamic convolution methods suffer from parameter inefficiency, hindering their capacity to utilize a large number of kernels for improved performance. |
KernelWarehouse introduces kernel partition and warehouse sharing. It divides kernels into smaller kernel cells, represents them as linear mixtures from a shared warehouse, and assembles them. A novel attention function with a specific initialization strategy facilitates diverse attention allocation for effective kernel cell weighting. |
KernelWarehouse consistently outperforms existing dynamic convolution methods on ImageNet and MS-COCO across various ConvNet architectures.
It demonstrates the ability to significantly reduce model size while improving accuracy.
The proposed attention function and initialization strategy are crucial for achieving optimal performance. |
The runtime speed of models trained with KernelWarehouse is slower than counterparts under similar model size budget due to dense computation of linear mixtures.
The paper explores KernelWarehouse on various ConvNets, but further investigation on deeper and larger architectures is limited by computational resources. |
dynamic convolution, parameter efficiency, kernel partition, warehouse sharing, attention mechanism |
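The KernelWarehouse entry above contrasts plain dynamic convolution (a linear mixture of $n$ full-size kernels, costing $n$ times the parameters) with kernels assembled from small cells drawn from a shared warehouse. The sketch below illustrates only that parameter arithmetic and the mixing step. The flat-list kernel layout, the cell sizes, the hand-written attention values, and the function names are assumptions for illustration; the paper's attention function is not reproduced.

```python
def mix_kernels(kernels, attentions):
    """Return sum_i attentions[i] * kernels[i] for flat kernels of equal length."""
    length = len(kernels[0])
    return [sum(a * k[j] for a, k in zip(attentions, kernels)) for j in range(length)]


def assemble_from_warehouse(warehouse, cell_attentions):
    """Build one kernel by concatenating cells; each cell mixes the shared warehouse."""
    kernel = []
    for attn in cell_attentions:
        kernel.extend(mix_kernels(warehouse, attn))
    return kernel


kernel_size = 36  # weights in one static kernel (toy number)
n = 4

# Plain dynamic convolution: n full kernels are stored and mixed per sample.
full_kernels = [[float(i)] * kernel_size for i in range(n)]
mixed = mix_kernels(full_kernels, [0.1, 0.2, 0.3, 0.4])
params_dynamic = n * kernel_size  # n times the static parameter count

# Warehouse variant: n small cells of 9 weights each are shared, and the
# 36-weight kernel is assembled from 4 cells, each a mixture of the warehouse.
warehouse = [[float(i)] * 9 for i in range(n)]
cell_attentions = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]
assembled = assemble_from_warehouse(warehouse, cell_attentions)
params_warehouse = n * 9  # same budget as one static kernel in this toy setup
```

With these toy sizes, the warehouse stores as many weights as a single static kernel while still mixing $n$ candidates per cell, which is the kind of trade-off the entry attributes to kernel partition and warehouse sharing.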
2308.08321
Report |
Stable and Causal Inference for Discriminative Self-supervised Deep Visual Representations |
Yuewei Yang, Hai Li, Yiran Chen |
In recent years, discriminative self-supervised methods have made significant
strides in advancing various visual tasks. The central idea of learning a data
encoder that is robust to data distortions/augmentations is straightforward yet
highly effective. Although many studies have demonstrated the empirical success
of various learning methods, the resulting learned representations can exhibit
instability and hinder downstream performance. In this study, we analyze
discriminative self-supervised methods from a causal perspective to explain
these unstable behaviors and propose solutions to overcome them. Our approach
draws inspiration from prior works that empirically demonstrate the ability of
discriminative self-supervised methods to demix ground truth causal sources to
some extent. Unlike previous work on causality-empowered representation
learning, we do not apply our solutions during the training process but rather
during the inference process to improve time efficiency. Through experiments on
both controlled image datasets and realistic image datasets, we show that our
proposed solutions, which involve tempering a linear transformation with
controlled synthetic data, are effective in addressing these issues. |
This paper provides a causal perspective on the instability of discriminative self-supervised learning (SSL) methods and proposes solutions to improve the stability of learned representations during inference. |
Existing discriminative SSL methods, while effective, can exhibit unstable behavior when encountering subtle data shifts not encountered during training, hindering downstream task performance. |
Building upon prior work, the authors analyze SSL methods under a causal framework. They demonstrate that learned representations are robust to training augmentations but unstable to unseen data variable shifts. Two solutions, Robust Dimensions and Stable Inference Mapping, are proposed to mitigate this instability during inference. |
Unstable shifts in data variables lead to significant performance drops in downstream tasks, as demonstrated on Causal3DIdent and ImageNet.
Robust Dimensions, leveraging the most important dimensions of stable representations, effectively alleviates deterioration by identifying robust features.
Stable Inference Mapping, learning a linear transformation to absorb unstable shifts, improves accuracy on unseen data, as shown on Causal3DIdent and ObjectNet. |
The proposed solutions assume access to stable-unstable instance pairs or knowledge of specific data variable alterations, limiting their applicability in realistic scenarios.
The effectiveness of Stable Inference Mapping might saturate with longer training, requiring more sophisticated interventions for further improvement. |
self-supervised learning, causal inference, representation learning, domain adaptation, inference stability |
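The entry above describes Stable Inference Mapping as learning a linear transformation that absorbs unstable shifts, given stable-unstable instance pairs. A minimal sketch of that idea on toy 2-D representations follows. The closed-form least-squares fit, the restriction to two dimensions, the synthetic shift, and the function names are assumptions for illustration and are not the authors' procedure.

```python
def fit_linear_map_2d(unstable, stable):
    """Least-squares 2x2 matrix W minimizing sum ||W u - s||^2 over paired vectors."""
    a = [[0.0, 0.0], [0.0, 0.0]]  # accumulates sum of s u^T
    b = [[0.0, 0.0], [0.0, 0.0]]  # accumulates sum of u u^T
    for u, s in zip(unstable, stable):
        for i in range(2):
            for j in range(2):
                a[i][j] += s[i] * u[j]
                b[i][j] += u[i] * u[j]
    det = b[0][0] * b[1][1] - b[0][1] * b[1][0]
    b_inv = [[b[1][1] / det, -b[0][1] / det],
             [-b[1][0] / det, b[0][0] / det]]
    # Normal equations: W = (sum s u^T) (sum u u^T)^{-1}
    return [[sum(a[i][k] * b_inv[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def apply_map(w, u):
    return [w[0][0] * u[0] + w[0][1] * u[1], w[1][0] * u[0] + w[1][1] * u[1]]


# Toy pairs: the "unstable" representation is a per-axis rescaling of the stable one.
stable = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (-1.0, 1.0)]
unstable = [(2.0 * s[0], 0.5 * s[1]) for s in stable]

W = fit_linear_map_2d(unstable, stable)
restored = apply_map(W, (2.0, 0.5))  # unstable version of the stable point (1, 1)
```

Because the mapping is applied only at inference time, the encoder itself stays untouched, which matches the entry's point that the fix avoids retraining.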
2308.08316
Report |
Dual-Stream Diffusion Net for Text-to-Video Generation |
Binhui Liu, Xin Liu, Anbo Dai, Zhiyong Zeng, Dan Wang, Zhen Cui, Jian Yang |
With the emergence of diffusion models, text-to-video generation has recently
attracted increasing attention. An important bottleneck, however, is that
generated videos often exhibit flickers and artifacts. In this
work, we propose a dual-stream diffusion net (DSDN) to improve the consistency
of content variations in generated videos. In particular, the two designed
diffusion streams, the video content and motion branches, not only run
separately in their private spaces to produce personalized video variations
as well as content, but are also well aligned between the content and motion
domains through our cross-transformer interaction module,
which benefits the smoothness of generated videos. In addition, we
introduce a motion decomposer and combiner to facilitate operations on video
motion. Qualitative and quantitative experiments demonstrate that our method
can produce continuous videos with fewer flickers. |
This paper proposes a dual-stream diffusion net (DSDN) for text-to-video generation that improves the consistency of content variations and reduces flickers in generated videos. |
Generating realistic and continuous videos from text is a challenging task, and existing methods often produce videos with flickers and artifacts due to difficulties in modeling video dynamics. |
DSDN uses two diffusion streams – one for video content and one for motion. It leverages a pre-trained text-to-image diffusion model for content and a 3D U-Net for motion. A cross-transformer interaction module aligns the two streams, and motion decomposer/combiner modules facilitate motion processing. |
DSDN generates videos with higher frame consistency and better textual alignment compared to baselines like CogVideo and Text2Video-Zero.
Ablation studies demonstrate the importance of both the content increment unit and motion unit in generating continuous and realistic videos.
DSDN generates diverse videos with consistent content, as evidenced by varying actions, appearances, and subtle background changes in the generated cat videos. |
The content increment unit has limited parameter volume, potentially restricting the diversity of generated content.
Future work could explore improving the content increment unit and investigating alternative motion modeling techniques. |
text-to-video generation, diffusion models, motion modeling, video consistency, deep learning |
2308.08258
Report |
SceNeRFlow: Time-Consistent Reconstruction of General Dynamic Scenes |
Edith Tretschk, Vladislav Golyanik, Michael Zollhoefer, Aljaz Bozic, Christoph Lassner, Christian Theobalt |
Existing methods for the 4D reconstruction of general, non-rigidly deforming
objects focus on novel-view synthesis and neglect correspondences. However,
time consistency enables advanced downstream tasks like 3D editing, motion
analysis, or virtual-asset creation. We propose SceNeRFlow to reconstruct a
general, non-rigid scene in a time-consistent manner. Our dynamic-NeRF method
takes multi-view RGB videos and background images from static cameras with
known camera parameters as input. It then reconstructs the deformations of an
estimated canonical model of the geometry and appearance in an online fashion.
Since this canonical model is time-invariant, we obtain correspondences even
for long-term, long-range motions. We employ neural scene representations to
parametrize the components of our method. Like prior dynamic-NeRF methods, we
use a backwards deformation model. We find non-trivial adaptations of this
model necessary to handle larger motions: We decompose the deformations into a
strongly regularized coarse component and a weakly regularized fine component,
where the coarse component also extends the deformation field into the space
surrounding the object, which enables tracking over time. We show
experimentally that, unlike prior work that only handles small motion, our
method enables the reconstruction of studio-scale motions. |
This paper presents SceNeRFlow, an end-to-end differentiable, time-consistent 4D reconstruction method for general dynamic scenes that takes multi-view RGB input from static cameras. |
Time consistency in 4D reconstruction enables advanced downstream tasks like 3D editing, motion analysis, or virtual-asset creation by providing long-range, long-term dense 3D correspondences. |
The method employs a backward deformation model with a time-invariant canonical model for geometry and appearance, and time-dependent deformations. It uses a coarse-and-fine deformation decomposition and introduces a novel approach to extend the deformation field for handling large motion in online, timestamp-by-timestamp tracking. |
SceNeRFlow achieves time-consistent reconstructions even with large, studio-scale motions, outperforming previous methods in handling complex deformations.
The method effectively establishes stable 3D correspondences over time, unlike previous methods that suffer from drift.
A trade-off exists between time consistency and novel-view synthesis quality, as variants of SceNeRFlow with time-varying canonical models show improved view synthesis but degraded correspondences. |
The current method relies on multi-view input and simplifying assumptions like static background and lack of topology changes.
Future work will focus on reducing the number of cameras required, incorporating a dynamic background, and handling topology changes in a time-consistent manner. |
4d reconstruction, time consistency, neural scene representation, nerf, deformation modeling |
2308.08220
Report |
Low-Light Image Enhancement with Illumination-Aware Gamma Correction and Complete Image Modelling Network |
Yinglong Wang, Zhen Liu, Jianzhuang Liu, Songcen Xu, Shuaicheng Liu |
This paper presents a novel network structure with illumination-aware gamma
correction and complete image modelling to solve the low-light image
enhancement problem. Because low-light environments usually produce large,
uninformative dark areas, directly learning deep representations from low-light
images is insensitive to recovering normal illumination. We propose to
integrate the effectiveness of gamma correction with the strong modelling
capacities of deep networks, which enables the correction factor gamma to be
learned in a coarse-to-fine manner by adaptively perceiving the deviated
illumination. Because the exponential operation introduces high computational
complexity, we propose to use a Taylor series to approximate gamma correction,
accelerating training and inference. Because dark areas usually occupy large
regions of low-light images, common local modelling structures, e.g., CNN and
SwinIR, are insufficient to recover accurate illumination across whole
low-light images. We propose a novel Transformer block to fully model
the dependencies of all pixels across images via a local-to-global hierarchical
attention mechanism, so that dark areas can be inferred by borrowing
information from distant informative regions in a highly effective manner.
Extensive experiments on several benchmark datasets demonstrate that our
approach outperforms state-of-the-art methods. |
This paper proposes IAGC, a novel network for low-light image enhancement by integrating illumination-aware gamma correction and a complete image modelling network. |
Existing methods struggle to effectively recover illumination from low-light images, especially in large-scale dark areas, leading to poor image quality and inaccurate color recovery. |
IAGC utilizes a three-stage coarse-to-fine strategy: 1) GGCM module for global brightness enhancement, 2) COMO-ViT block for learning illumination-recovered representations with a local-to-global self-attention mechanism, and 3) LGCM module for local illumination refinement. Taylor Series approximation is used to accelerate gamma correction. |
IAGC achieves state-of-the-art quantitative results (PSNR, SSIM) on LOL datasets, outperforming existing methods by significant margins.
The proposed method effectively enhances illumination and recovers image details in challenging low-light conditions.
Ablation studies demonstrate the effectiveness of the gamma correction modules (GGCM, LGCM) and the local-to-global self-attention mechanism in COMO-ViT. |
IAGC may exhibit slight local color deviation in extreme low-light cases with severe contrast and hue damage.
Future work includes exploring more advanced techniques to address the remaining color deviation in extremely challenging low-light environments. |
low-light image enhancement, gamma correction, vision transformer, self-attention, deep learning |
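The entry above states that gamma correction is approximated with a Taylor series to avoid a costly exponential. One way to read this is to write $x^{\gamma} = e^{\gamma \ln x}$ and truncate the series of the exponential. The sketch below follows that reading. The expansion point, the truncation order, and the function names are assumptions, and the paper's exact formulation may differ.

```python
import math


def gamma_exact(x, gamma):
    """Reference gamma correction for a pixel intensity x in (0, 1]."""
    return x ** gamma


def gamma_taylor(x, gamma, order=8):
    """Approximate x**gamma = exp(gamma * ln x) by a truncated Taylor series.

    Uses exp(z) ~ sum_{k=0..order} z**k / k! with z = gamma * ln(x); x must be > 0.
    """
    z = gamma * math.log(x)
    term = 1.0
    total = 1.0
    for k in range(1, order + 1):
        term *= z / k  # builds z**k / k! incrementally
        total += term
    return total


# Brightening a dark pixel with gamma < 1: compare a low and a higher order.
x, gamma = 0.05, 0.45
err_low = abs(gamma_taylor(x, gamma, order=2) - gamma_exact(x, gamma))
err_high = abs(gamma_taylor(x, gamma, order=8) - gamma_exact(x, gamma))
```

The truncation error grows as $|\gamma \ln x|$ grows, so very dark pixels need more terms than mid-tones for the same accuracy.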
2308.08157
Report |
Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in Text-to-Image Synthesis |
Minho Park, Jooyeol Yun, Seunghwan Choi, Jaegul Choo |
Existing text-to-image generation approaches have set high standards for
photorealism and text-image correspondence, largely benefiting from web-scale
text-image datasets, which can include up to 5 billion pairs. However,
text-to-image generation models trained on domain-specific datasets, such as
urban scenes, medical images, and faces, still suffer from low text-image
correspondence due to the lack of text-image pairs. Additionally, collecting
billions of text-image pairs for a specific domain can be time-consuming and
costly. Thus, ensuring high text-image correspondence without relying on
web-scale text-image datasets remains a challenging task. In this paper, we
present a novel approach for enhancing text-image correspondence by leveraging
available semantic layouts. Specifically, we propose a Gaussian-categorical
diffusion process that simultaneously generates both images and corresponding
layout pairs. Our experiments reveal that we can guide text-to-image generation
models to be aware of the semantics of different image regions, by training the
model to generate semantic labels for each pixel. We demonstrate that our
approach achieves higher text-image correspondence compared to existing
text-to-image generation approaches on the Multi-Modal CelebA-HQ and
Cityscapes datasets, where text-image pairs are scarce. Code is available at
https://pmh9960.github.io/research/GCDP |
This paper proposes a novel Gaussian-categorical diffusion process for text-to-image synthesis, aiming to improve text-image correspondence in domain-specific datasets where text-image pairs are scarce. |
Existing text-to-image models often struggle with low text-image correspondence when trained on domain-specific datasets due to the limited availability of text-image pairs. Collecting billions of pairs for specific domains is costly and challenging. |
The authors define a Gaussian-categorical diffusion process that models the joint distribution of images and corresponding semantic layouts. This approach allows the model to learn the semantics of different image regions by generating semantic labels for each pixel. |
The proposed method achieves higher text-image correspondence compared to existing text-to-image generation approaches on Multi-Modal CelebA-HQ and Cityscapes datasets.
Analysis reveals that jointly generating image-layout pairs enables the model to be aware of image semantics during generation, improving its ability to match text descriptions with image regions.
The model effectively generates image-layout pairs with high alignment, closely resembling the real distribution, and demonstrates promising results in cross-modal outpainting for semantic image synthesis and segmentation. |
Training the model necessitates semantic layout annotations, which may require additional effort.
The model's performance on highly diverse datasets like MS-COCO needs further investigation and improvement. |
text-to-image synthesis, diffusion models, semantic layouts, text-image correspondence, domain-specific generation |
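The joint noising of an image and its semantic layout described in the report above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: pixels receive Gaussian noise, while each layout label is kept with probability `alpha_bar` and otherwise resampled uniformly over the classes (the multinomial-diffusion form). The function name and arguments are hypothetical.

```python
import math
import random

def forward_noise(pixels, labels, alpha_bar, num_classes, rng):
    """Jointly noise an image (Gaussian) and its layout (categorical) at one timestep."""
    noisy_pixels = [
        math.sqrt(alpha_bar) * p + math.sqrt(1.0 - alpha_bar) * rng.gauss(0.0, 1.0)
        for p in pixels
    ]
    # Keep the label with probability alpha_bar, else draw a uniform class.
    noisy_labels = [
        l if rng.random() < alpha_bar else rng.randrange(num_classes)
        for l in labels
    ]
    return noisy_pixels, noisy_labels

rng = random.Random(0)
noisy_px, noisy_lb = forward_noise([0.5, -0.25, 0.1], [3, 1, 1], 0.7, 5, rng)
```

A denoiser trained on such pairs has to predict both the clean pixels and the clean per-pixel labels, which is the mechanism the report credits for making the model aware of region semantics.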
2308.08089
Report |
DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory |
Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, Nan Duan |
Controllable video generation has gained significant attention in recent
years. However, two main limitations persist: Firstly, most existing works
focus on either text, image, or trajectory-based control, leading to an
inability to achieve fine-grained control in videos. Secondly, trajectory
control research is still in its early stages, with most experiments being
conducted on simple datasets like Human3.6M. This constraint limits the models'
capability to process open-domain images and effectively handle complex curved
trajectories. In this paper, we propose DragNUWA, an open-domain
diffusion-based video generation model. To tackle the issue of insufficient
control granularity in existing works, we simultaneously introduce text, image,
and trajectory information to provide fine-grained control over video content
from semantic, spatial, and temporal perspectives. To resolve the problem of
limited open-domain trajectory control in current research, we propose
trajectory modeling with three aspects: a Trajectory Sampler (TS) to enable
open-domain control of arbitrary trajectories, a Multiscale Fusion (MF) to
control trajectories in different granularities, and an Adaptive Training (AT)
strategy to generate consistent videos following trajectories. Our experiments
validate the effectiveness of DragNUWA, demonstrating its superior performance
in fine-grained control in video generation. The homepage link is
https://www.microsoft.com/en-us/research/project/dragnuwa/ |
This paper introduces DragNUWA, an open-domain diffusion-based video generation model that integrates text, image, and trajectory controls for fine-grained controllability. |
Existing methods for controllable video generation lack fine-grained control and struggle with complex trajectories in open-domain settings. |
DragNUWA introduces three key components: 1) Trajectory Sampler (TS) for sampling arbitrary trajectories from open-domain videos, 2) Multiscale Fusion (MF) for integrating trajectory, text, and image data at different granularities, and 3) Adaptive Training (AT) for generating consistent videos by transitioning from dense optical flow to user-defined trajectories. |
DragNUWA achieves fine-grained control over camera movements, including zooming and panning.
The model effectively handles complex trajectories, including curved paths, variable lengths, and simultaneous control of multiple objects.
DragNUWA demonstrates that text, image, and trajectory controls are all essential for comprehensive control over video generation. |
The model does not explicitly model camera movement, relying instead on learned representations from trajectory data.
Future work could explore incorporating audio or other modalities for enhanced control and realism. |
video generation, controllable generation, diffusion models, trajectory control, multimodal generation |
2308.07926
Report |
CoDeF: Content Deformation Fields for Temporally Consistent Video Processing |
Hao Ouyang, Qiuyu Wang, Yuxi Xiao, Qingyan Bai, Juntao Zhang, Kecheng Zheng, Xiaowei Zhou, Qifeng Chen, Yujun Shen |
We present the content deformation field CoDeF as a new type of video
representation, which consists of a canonical content field aggregating the
static contents in the entire video and a temporal deformation field recording
the transformations from the canonical image (i.e., rendered from the canonical
content field) to each individual frame along the time axis. Given a target
video, these two fields are jointly optimized to reconstruct it through a
carefully tailored rendering pipeline. We deliberately introduce some
regularizations into the optimization process, urging the canonical content
field to inherit semantics (e.g., the object shape) from the video. With such a
design, CoDeF naturally supports lifting image algorithms for video processing,
in the sense that one can apply an image algorithm to the canonical image and
effortlessly propagate the outcomes to the entire video with the aid of the
temporal deformation field. We experimentally show that CoDeF is able to lift
image-to-image translation to video-to-video translation and lift keypoint
detection to keypoint tracking without any training. More importantly, thanks to
our lifting strategy that deploys the algorithms on only one image, we achieve
superior cross-frame consistency in processed videos compared to existing
video-to-video translation approaches, and even manage to track non-rigid
objects like water and smog. The project page can be found at
https://qiuyu96.github.io/CoDeF/. |
Introduces Content Deformation Fields (CoDeF), a novel video representation comprising a canonical content field for static content and a temporal deformation field mapping to individual frames, enabling the lifting of image algorithms to video processing. |
Addresses limitations in video processing quality and temporal consistency compared to image processing by representing video in a manner conducive to leveraging established image algorithms. |
Employs 2D/3D hash tables for efficient representation, annealed hash encoding for semantic correctness, flow-guided consistency for smoothness, and grouped fields for complex motions. |
Achieves superior video reconstruction quality (4.4 dB higher PSNR) and efficiency compared to layered neural atlases.
Demonstrates successful lifting of image algorithms for video-to-video translation, keypoint tracking, object tracking, super-resolution, and user editing with enhanced temporal consistency.
Outperforms existing video processing methods, particularly in temporal consistency and handling complex motions. |
Current method requires per-scene optimization, limiting scalability.
Handling extreme viewpoint changes and large non-rigid deformations poses challenges. |
video representation, video processing, temporal consistency, content deformation field, image algorithm lifting |
2308.07903
Report |
Relightable and Animatable Neural Avatar from Sparse-View Video |
Zhen Xu, Sida Peng, Chen Geng, Linzhan Mou, Zihan Yan, Jiaming Sun, Hujun Bao, Xiaowei Zhou |
This paper tackles the challenge of creating relightable and animatable
neural avatars from sparse-view (or even monocular) videos of dynamic humans
under unknown illumination. Compared to studio environments, this setting is
more practical and accessible but poses an extremely challenging ill-posed
problem. Previous neural human reconstruction methods are able to reconstruct
animatable avatars from sparse views using deformed Signed Distance Fields
(SDF) but cannot recover material parameters for relighting. While
differentiable inverse rendering-based methods have succeeded in material
recovery of static objects, it is not straightforward to extend them to dynamic
humans as it is computationally intensive to compute pixel-surface intersection
and light visibility on deformed SDFs for inverse rendering. To solve this
challenge, we propose a Hierarchical Distance Query (HDQ) algorithm to
approximate the world space distances under arbitrary human poses.
Specifically, we estimate coarse distances based on a parametric human model
and compute fine distances by exploiting the local deformation invariance of
SDF. Based on the HDQ algorithm, we leverage sphere tracing to efficiently
estimate the surface intersection and light visibility. This allows us to
develop the first system to recover animatable and relightable neural avatars
from sparse view (or monocular) inputs. Experiments demonstrate that our
approach is able to produce superior results compared to state-of-the-art
methods. Our code will be released for reproducibility. |
This paper introduces a novel system for reconstructing a relightable and animatable neural avatar from sparse-view or even monocular videos of a human subject under unknown, real-world illumination. |
Creating such avatars from readily available videos, without the need for specialized studios, significantly expands the potential applications in virtual reality, filmmaking, and video games. |
The system leverages neural inverse rendering techniques and introduces a novel Hierarchical Distance Query (HDQ) algorithm to efficiently estimate surface intersections and light visibility for physically based rendering. It achieves this by blending distance approximations from a parametric human model and a canonical neural signed distance field. |
The approach produces superior results compared to existing methods in terms of visual quality and physical accuracy.
The HDQ algorithm proves to be essential for enabling accurate and efficient rendering under novel poses and lighting.
The system successfully captures challenging material properties like skin shininess and specular highlights on clothing. |
The training time of the neural avatar is relatively long (20 hours).
Future work could explore acceleration methods to speed up the training process. |
neural rendering, relighting, human avatar, inverse rendering, signed distance field |
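The sphere tracing step that the report above builds on the HDQ distance queries can be sketched generically. In this minimal illustration an analytic sphere SDF stands in for the avatar's distance query (an assumption; HDQ itself is not reproduced here), and all function names are hypothetical.

```python
import math

def sphere_sdf(p, center=(0.0, 0.0, 3.0), radius=1.0):
    """Signed distance from point p to a sphere (stand-in for the avatar's SDF)."""
    return math.dist(p, center) - radius

def sphere_trace(origin, direction, sdf, max_steps=64, eps=1e-4, far=100.0):
    """March along a unit-length ray, stepping by the queried distance each time."""
    t = 0.0
    for _ in range(max_steps):
        p = tuple(o + t * d for o, d in zip(origin, direction))
        dist = sdf(p)
        if dist < eps:
            return t       # hit: distance travelled along the ray
        t += dist          # safe step: no surface is closer than dist
        if t > far:
            break
    return None            # miss

hit = sphere_trace((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), sphere_sdf)
miss = sphere_trace((0.0, 0.0, 0.0), (0.0, 1.0, 0.0), sphere_sdf)
```

The same routine can serve for light visibility: tracing a ray from a surface point toward the light and checking whether it returns a hit before reaching the light.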
2308.07868
Report |
ObjectSDF++: Improved Object-Compositional Neural Implicit Surfaces |
Qianyi Wu, Kaisiyuan Wang, Kejie Li, Jianmin Zheng, Jianfei Cai |
In recent years, neural implicit surface reconstruction has emerged as a
popular paradigm for multi-view 3D reconstruction. Unlike traditional
multi-view stereo approaches, the neural implicit surface-based methods
leverage neural networks to represent 3D scenes as signed distance functions
(SDFs). However, they tend to disregard the reconstruction of individual
objects within the scene, which limits their performance and practical
applications. To address this issue, previous work ObjectSDF introduced a nice
framework of object-composition neural implicit surfaces, which utilizes 2D
instance masks to supervise individual object SDFs. In this paper, we propose a
new framework called ObjectSDF++ to overcome the limitations of ObjectSDF.
First, in contrast to ObjectSDF whose performance is primarily restricted by
its converted semantic field, the core component of our model is an
occlusion-aware object opacity rendering formulation that directly
volume-renders object opacity to be supervised with instance masks. Second, we
design a novel regularization term for object distinction, which can
effectively mitigate the issue that ObjectSDF may result in unexpected
reconstruction in invisible regions due to the lack of constraint to prevent
collisions. Our extensive experiments demonstrate that our novel framework not
only produces superior object reconstruction results but also significantly
improves the quality of scene reconstruction. Code and more resources can be
found at https://qianyiwu.github.io/objectsdf++ |
This paper presents ObjectSDF++, a novel framework for object-compositional neural implicit surface reconstruction that improves upon ObjectSDF. |
Existing neural implicit surface reconstruction methods often overlook individual object reconstruction, limiting their application in scene editing and understanding. |
ObjectSDF++ introduces an occlusion-aware object opacity rendering scheme and an object distinction regularization term to enhance object and scene reconstruction quality. It leverages a multi-resolution feature grid and monocular geometry cues for faster convergence. |
ObjectSDF++ significantly improves both scene and object reconstruction quality compared to ObjectSDF, as demonstrated on the Replica dataset.
The proposed occlusion-aware object opacity rendering proves crucial in enhancing surface reconstruction.
ObjectSDF++ achieves state-of-the-art scene reconstruction results on the ScanNet dataset, demonstrating the benefits of object-compositional modeling. |
The training time for ObjectSDF++ remains high, requiring further optimization for real-time applications.
The current framework primarily focuses on closed and solid objects, limiting its applicability to other object types. |
neural implicit surface reconstruction, object-compositional representation, occlusion-aware rendering, object distinction regularization, 3d scene understanding |
2308.07863
Report |
StyleDiffusion: Controllable Disentangled Style Transfer via Diffusion Models |
Zhizhong Wang, Lei Zhao, Wei Xing |
Content and style (C-S) disentanglement is a fundamental problem and critical
challenge of style transfer. Existing approaches based on explicit definitions
(e.g., Gram matrix) or implicit learning (e.g., GANs) are neither interpretable
nor easy to control, resulting in entangled representations and less satisfying
results. In this paper, we propose a new C-S disentangled framework for style
transfer without using previous assumptions. The key insight is to explicitly
extract the content information and implicitly learn the complementary style
information, yielding interpretable and controllable C-S disentanglement and
style transfer. A simple yet effective CLIP-based style disentanglement loss
coordinated with a style reconstruction prior is introduced to disentangle C-S
in the CLIP image space. By further leveraging the powerful style removal and
generative ability of diffusion models, our framework achieves results superior
to the state of the art, along with flexible C-S disentanglement and trade-off control.
Our work provides new insights into the C-S disentanglement in style transfer
and demonstrates the potential of diffusion models for learning
well-disentangled C-S characteristics. |
A novel content-style disentangled framework named StyleDiffusion is proposed for artistic style transfer, leveraging diffusion models for explicit content extraction and implicit style learning. |
Existing style transfer methods suffer from entangled representations and a lack of interpretability and controllability, which leads to less satisfying results. |
StyleDiffusion employs a diffusion-based style removal module to extract domain-aligned content information and a diffusion-based style transfer module to learn and transfer disentangled style guided by a CLIP-based style disentanglement loss and a style reconstruction prior. |
StyleDiffusion achieves superior style transfer results with fine details and well-preserved content, especially for challenging styles.
The framework offers controllable content-style disentanglement and trade-off by adjusting the return step of diffusion models.
Quantitative comparisons and user studies demonstrate the effectiveness and superiority of StyleDiffusion over state-of-the-art methods. |
The current model requires fine-tuning for each style, limiting its application to arbitrary style transfer.
The efficiency of the method is hindered by the use of diffusion models, demanding further research on faster diffusion sampling. |
style transfer, diffusion models, content-style disentanglement, clip, deep learning |
2308.07837
Report |
CCD-3DR: Consistent Conditioning in Diffusion for Single-Image 3D Reconstruction |
Yan Di, Chenyangguang Zhang, Pengyuan Wang, Guangyao Zhai, Ruida Zhang, Fabian Manhardt, Benjamin Busam, Xiangyang Ji, Federico Tombari |
In this paper, we present a novel shape reconstruction method leveraging a
diffusion model to generate a sparse 3D point cloud for the object captured in a
single RGB image. Recent methods typically leverage global embedding or local
projection-based features as the condition to guide the diffusion model.
However, such strategies fail to consistently align the denoised point cloud
with the given image, leading to unstable conditioning and inferior
performance. In this paper, we present CCD-3DR, which exploits a novel centered
diffusion probabilistic model for consistent local feature conditioning. We
constrain the noise and sampled point cloud from the diffusion model into a
subspace where the point cloud center remains unchanged during the forward
diffusion process and reverse process. The stable point cloud center further
serves as an anchor to align each point with its corresponding local
projection-based features. Extensive experiments on the synthetic benchmark
ShapeNet-R2N2 demonstrate that CCD-3DR outperforms all competitors by a large
margin, with over 40% improvement. We also provide results on the real-world
dataset Pix3D to thoroughly demonstrate the potential of CCD-3DR in real-world
applications. Code will be released soon. |
This paper presents CCD-3DR, a novel single-image 3D reconstruction method leveraging a centered denoising diffusion probabilistic model (CDPM) for consistent local feature conditioning. |
Existing diffusion-based 3D reconstruction methods suffer from uncontrollable center deviation of the point cloud during the denoising process, leading to inferior performance. |
CCD-3DR introduces CDPM, which constrains the noise and point cloud in diffusion and reverse processes to a subspace where the point cloud center is fixed at the origin, enabling consistent local feature conditioning. |
CCD-3DR significantly outperforms state-of-the-art methods on ShapeNet-R2N2, achieving over 40% improvement in F-Score.
CCD-3DR demonstrates superior performance on the real-world Pix3D dataset, showcasing its potential for real-world applications.
Ablation studies validate the effectiveness of the proposed CDPM and the consistent local feature conditioning scheme. |
The centralization scheme might slightly affect the diversity of generated shapes.
Future work includes exploring advanced ordinary differential equation solvers to enhance inference speed. |
3d reconstruction, diffusion models, single-image reconstruction, point cloud, local feature conditioning |
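The centering constraint described in the report above, where both the noise and the point cloud stay in the subspace with centroid at the origin, can be sketched for a single forward diffusion step. This is a minimal illustration under assumptions: the function names are hypothetical and the standard DDPM forward formula is used.

```python
import math
import random

def center(points):
    """Subtract the centroid so the point cloud is centered at the origin."""
    n = len(points)
    c = [sum(p[i] for p in points) / n for i in range(3)]
    return [[p[i] - c[i] for i in range(3)] for p in points]

def centered_forward_step(x0, alpha_bar, rng):
    """x_t = sqrt(a) * x0 + sqrt(1 - a) * eps, with eps projected to zero centroid."""
    eps = center([[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in x0])
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [[a * p[i] + b * e[i] for i in range(3)] for p, e in zip(x0, eps)]

rng = random.Random(0)
cloud = center([[rng.uniform(-1, 1) for _ in range(3)] for _ in range(16)])
noisy = centered_forward_step(cloud, 0.5, rng)
centroid = [sum(p[i] for p in noisy) / len(noisy) for i in range(3)]
```

Because both terms have zero centroid, their weighted sum does too, so the centroid of `noisy` stays at the origin (up to floating-point error) at every noise level. That fixed center is what allows each point to be projected consistently onto the image features.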
2308.07815
Report |
ImbSAM: A Closer Look at Sharpness-Aware Minimization in Class-Imbalanced Recognition |
Yixuan Zhou, Yi Qu, Xing Xu, Hengtao Shen |
Class imbalance is a common challenge in real-world recognition tasks, where
the majority of classes have few samples, also known as tail classes. We
address this challenge from the perspective of generalization and empirically
find that the promising Sharpness-Aware Minimization (SAM) fails to address
generalization issues under the class-imbalanced setting. Through investigating
this specific type of task, we identify that its generalization bottleneck
primarily lies in the severe overfitting for tail classes with limited training
data. To overcome this bottleneck, we leverage class priors to restrict the
generalization scope of the class-agnostic SAM and propose a class-aware
smoothness optimization algorithm named Imbalanced-SAM (ImbSAM). With the
guidance of class priors, our ImbSAM specifically improves generalization
targeting tail classes. We also verify the efficacy of ImbSAM on two
prototypical applications of class-imbalanced recognition: long-tailed
classification and semi-supervised anomaly detection, where our ImbSAM
demonstrates remarkable performance improvements for tail classes and anomalies.
Our code implementation is available at
https://github.com/cool-xuan/Imbalanced_SAM. |
This paper proposes Imbalanced SAM (ImbSAM), a class-aware smoothness optimization algorithm that leverages class priors to improve generalization for tail classes in class-imbalanced recognition tasks. |
Standard Sharpness-Aware Minimization (SAM), while effective for balanced datasets, fails to address the generalization bottleneck in class-imbalanced settings, specifically the severe overfitting of tail classes with limited training data. |
ImbSAM incorporates class priors into SAM to restrict smoothness optimization to tail classes. It achieves this by dividing the training set into head and tail sub-sets based on data amount and applying SAM optimization only to the tail sub-set. |
ImbSAM demonstrates consistent accuracy improvement over baselines and SOTA methods on long-tailed classification benchmarks like CIFAR100-LT, ImageNet-LT and iNaturalist.
It significantly improves recognition accuracy for tail classes, especially those with limited training data, effectively addressing the overfitting issue.
ImbSAM also shows promising results in semi-supervised anomaly detection, enhancing AUCROC scores and outperforming previous SOTA methods on benchmark datasets. |
The performance of ImbSAM might be slightly affected when the anomaly ratio is extremely low (<1%) due to the overexposure of limited data.
Future work will explore more sophisticated class prior construction methods beyond the simple data amount threshold. |
class-imbalance, long-tailed classification, anomaly detection, generalization, sharpness-aware minimization |
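The head/tail split and tail-only sharpness-aware update described in the report above can be sketched as follows. This is a toy illustration under assumptions: the function names, the list-based parameters, and the way the head and tail gradients are combined are hypothetical, and the authors' implementation may differ.

```python
from collections import Counter
import math

def split_head_tail(labels, threshold):
    """Classes with fewer than `threshold` samples are treated as tail classes."""
    counts = Counter(labels)
    tail = {c for c, n in counts.items() if n < threshold}
    head_idx = [i for i, y in enumerate(labels) if y not in tail]
    tail_idx = [i for i, y in enumerate(labels) if y in tail]
    return head_idx, tail_idx

def imbsam_step(w, grad_head, grad_tail, lr=0.1, rho=0.05):
    """One update: plain gradient on head data, SAM-style gradient on tail data."""
    g_t = grad_tail(w)
    norm = math.sqrt(sum(g * g for g in g_t)) or 1.0  # guard against a zero gradient
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g_t)]  # ascent step on tail loss
    g_tail_sam = grad_tail(w_adv)  # tail gradient at the perturbed weights
    g_head = grad_head(w)          # head gradient at the original weights
    return [wi - lr * (gh + gt) for wi, gh, gt in zip(w, g_head, g_tail_sam)]

head_idx, tail_idx = split_head_tail(["cat", "cat", "cat", "lynx"], threshold=2)
# Toy losses: head loss is constant, tail loss is w**2 (gradient 2w).
new_w = imbsam_step([1.0], lambda w: [0.0], lambda w: [2.0 * w[0]])
```

Restricting the perturbation to the tail loss is what separates this sketch from class-agnostic SAM, which would perturb the weights using the gradient of the full loss.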
2308.07749
Report |
Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model |
Bosheng Qin, Wentao Ye, Qifan Yu, Siliang Tang, Yueting Zhuang |
The rising demand for creating lifelike avatars in the digital realm has led
to an increased need for generating high-quality human videos guided by textual
descriptions and poses. We propose Dancing Avatar, designed to fabricate human
motion videos driven by poses and textual cues. Our approach employs a
pretrained T2I diffusion model to generate each video frame in an
autoregressive fashion. The crux of innovation lies in our adept utilization of
the T2I diffusion model for producing video frames successively while
preserving contextual relevance. We surmount the hurdles posed by maintaining
human character and clothing consistency across varying poses, along with
upholding the background's continuity amidst diverse human movements. To ensure
consistent human appearances across the entire video, we devise an intra-frame
alignment module. This module assimilates text-guided synthesized human
character knowledge into the pretrained T2I diffusion model, synergizing
insights from ChatGPT. For preserving background continuity, we put forth a
background alignment pipeline, amalgamating insights from segment anything and
image inpainting techniques. Furthermore, we propose an inter-frame alignment
module that draws inspiration from an auto-regressive pipeline to augment
temporal consistency between adjacent frames, where the preceding frame guides
the synthesis process of the current frame. Comparisons with state-of-the-art
methods demonstrate that Dancing Avatar generates human videos of markedly
higher quality, in terms of human and background fidelity as well as temporal
coherence. |
This paper introduces Dancing Avatar, a novel pipeline for synthesizing high-quality human motion videos from text descriptions and pose sequences using a pretrained text-to-image diffusion model. |
Existing text-to-video models for human motion synthesis often produce low-quality videos with temporal inconsistencies. This work addresses these limitations by leveraging the power of pretrained text-to-image models. |
The proposed Dancing Avatar pipeline employs a pretrained T2I diffusion model and introduces three key modules: 1) Intra-frame alignment ensures consistent human appearance across frames, 2) Background alignment maintains background consistency, and 3) Inter-frame alignment enhances detail coherence between adjacent frames. |
Dancing Avatar generates human motion videos with superior quality compared to state-of-the-art approaches, as evidenced by lower NIQE and BRISQUE scores.
The method exhibits strong alignment with input text prompts and poses, achieving lower Pose MSE and higher CLIP Text Consistency scores.
Dancing Avatar excels in maintaining temporal consistency across frames, demonstrating lower Frame MSE and L1 scores, and higher CLIP Frame Consistency scores. |
The current implementation relies on multiple T2I diffusion models, which could be streamlined for efficiency.
Future work can explore extending the framework to generate longer and more complex human motion sequences. |
human motion synthesis, text-to-video generation, text-to-image diffusion model, temporal consistency, video quality |
2308.07732
Report |
UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation |
Haiyang Wang, Hao Tang, Shaoshuai Shi, Aoxue Li, Zhenguo Li, Bernt Schiele, Liwei Wang |
Jointly processing information from multiple sensors is crucial to achieving
accurate and robust perception for reliable autonomous driving systems.
However, current 3D perception research follows a modality-specific paradigm,
leading to additional computation overheads and inefficient collaboration
between different sensor data. In this paper, we present an efficient
multi-modal backbone for outdoor 3D perception named UniTR, which processes a
variety of modalities with unified modeling and shared parameters. Unlike
previous works, UniTR introduces a modality-agnostic transformer encoder to
handle these view-discrepant sensor data for parallel modal-wise representation
learning and automatic cross-modal interaction without additional fusion steps.
More importantly, to make full use of these complementary sensor types, we
present a novel multi-modal integration strategy by both considering
semantic-abundant 2D perspective and geometry-aware 3D sparse neighborhood
relations. UniTR is also a fundamentally task-agnostic backbone that naturally
supports different 3D perception tasks. It sets a new state-of-the-art
performance on the nuScenes benchmark, achieving 1.1 higher NDS for 3D object
detection and 12.0 higher mIoU for BEV map segmentation with lower inference
latency. Code will be available at https://github.com/Haiyang-W/UniTR |
This paper introduces UniTR, a unified and efficient multi-modal transformer backbone for outdoor 3D perception that can process both 3D sparse point clouds and 2D multi-view dense images in parallel to learn unified bird's-eye-view (BEV) representations. |
Integrating information from multiple sensors like cameras and LiDARs is crucial for robust and accurate 3D perception in autonomous driving. However, existing methods often rely on modality-specific encoders and sequential processing, leading to computational overheads and inefficiencies. |
UniTR utilizes a modality-agnostic transformer encoder to handle view-discrepant sensor data in parallel. It introduces a novel multi-modal integration strategy by considering both semantic-abundant 2D perspective and geometry-aware 3D sparse neighborhood relations. |
UniTR achieves state-of-the-art performance on the nuScenes benchmark for 3D object detection and BEV map segmentation.
It outperforms previous methods while exhibiting faster inference speed.
The model shows robustness against sensor failures, including LiDAR and camera malfunctions. |
As a single-stride backbone primarily designed for outdoor BEV perception, UniTR's adaptability to tasks like indoor 3D perception is limited.
The model lacks flexibility in switching between different sensor modalities (e.g., LiDAR-only or image-only) during inference. |
autonomous driving, 3d perception, multi-modal fusion, transformer, bird's-eye-view (bev) |
2308.07665
Report |
Inversion-by-Inversion: Exemplar-based Sketch-to-Photo Synthesis via Stochastic Differential Equations without Training |
Ximing Xing, Chuang Wang, Haitao Zhou, Zhihao Hu, Chongxuan Li, Dong Xu, Qian Yu |
Exemplar-based sketch-to-photo synthesis allows users to generate
photo-realistic images based on sketches. Recently, diffusion-based methods
have achieved impressive performance on image generation tasks, enabling
highly-flexible control through text-driven generation or energy functions.
However, generating photo-realistic images with color and texture from sketch
images remains challenging for diffusion models. Sketches typically consist of
only a few strokes, with most regions left blank, making it difficult for
diffusion-based methods to produce photo-realistic images. In this work, we
propose a two-stage method named "Inversion-by-Inversion" for exemplar-based
sketch-to-photo synthesis. This approach includes shape-enhancing inversion and
full-control inversion. During the shape-enhancing inversion process, an
uncolored photo is generated with the guidance of a shape-energy function. This
step is essential to ensure control over the shape of the generated photo. In
the full-control inversion process, we propose an appearance-energy function to
control the color and texture of the final generated photo. Importantly, our
Inversion-by-Inversion pipeline is training-free and can accept different types
of exemplars for color and texture control. We conducted extensive experiments
to evaluate our proposed method, and the results demonstrate its effectiveness.
The code and project can be found at
https://ximinng.github.io/inversion-by-inversion-project/. |
This paper presents "Inversion-by-Inversion", a novel training-free, exemplar-based sketch-to-photo synthesis method using stochastic differential equations (SDE). |
This approach addresses the challenge of diffusion models in generating photo-realistic images from sparse sketches by disentangling shape and appearance control using exemplars. |
The method utilizes a two-stage inversion process: shape-enhancing inversion generates an uncolored photo preserving the sketch's structure, followed by full-control inversion that incorporates the exemplar's color and texture while maintaining shape fidelity. |
Significantly outperforms baseline methods in FID scores, indicating higher visual quality and realism.
Effectively balances shape control from the sketch and appearance control from the exemplar.
Generalizes well to different types of exemplars, including photos, strokes, segmentation maps, and style images. |
The inference time can be further improved.
Future work could explore more sophisticated energy functions for finer control over specific image features. |
sketch-to-photo synthesis, diffusion models, stochastic differential equations, exemplar-based image translation, energy-based models |
2308.07615
Report |
Self-supervised Hypergraphs for Learning Multiple World Interpretations |
Alina Marcu, Mihai Pirvu, Dragos Costea, Emanuela Haller, Emil Slusanschi, Ahmed Nabil Belbachir, Rahul Sukthankar, Marius Leordeanu |
We present a method for learning multiple scene representations given a small
labeled set, by exploiting the relationships between such representations in
the form of a multi-task hypergraph. We also show how we can use the hypergraph
to improve a powerful pretrained VisTransformer model without any additional
labeled data. In our hypergraph, each node is an interpretation layer (e.g.,
depth or segmentation) of the scene. Within each hyperedge, one or several
input nodes predict the layer at the output node. Thus, each node could be an
input node in some hyperedges and an output node in others. In this way,
multiple paths can reach the same node, to form ensembles from which we obtain
robust pseudolabels, which allow self-supervised learning in the hypergraph. We
test different ensemble models and different types of hyperedges and show
superior performance to other multi-task graph models in the field. We also
introduce Dronescapes, a large video dataset captured with UAVs in different
complex real-world scenes, with multiple representations, suitable for
multi-task learning. |
This paper introduces a novel self-supervised hypergraph model for learning multiple scene representations (e.g., segmentation, depth, surface normals) from limited labeled data. |
Learning multiple scene interpretations robustly with minimal human supervision is crucial for real-world applications, especially in complex scenarios like UAV navigation. |
The method constructs a multi-task hypergraph where nodes represent scene interpretations and hyperedges capture their relationships. Multiple paths through the hypergraph form ensembles, generating robust pseudolabels for self-supervised learning. |
Higher-order hyperedges outperform pairwise edges in capturing complex relationships between scene interpretations.
Learned ensemble models for pseudolabel generation significantly improve accuracy compared to non-parametric methods.
The hypergraph effectively improves both accuracy and temporal consistency of predictions during iterative self-supervised learning, even surpassing a state-of-the-art expert model when used for initialization. |
The model's improvement on metric depth estimation, which is highly scene-dependent, is less pronounced than on other tasks.
Future work includes exploring more complex hyperedge structures and extending the approach to incorporate temporal information for video understanding. |
self-supervised learning, multi-task learning, hypergraphs, scene understanding, uav vision |
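As a rough illustration of how several hypergraph paths reaching the same node can be combined into a pseudolabel, the sketch below takes a per-pixel median over the candidate predictions. This is only a simple non-parametric ensemble; the summary above reports that learned ensembles perform better, and those are not reproduced here. The function name and the flat-list layout of predictions are assumptions made for this example.

```python
from statistics import median


def ensemble_pseudolabel(path_predictions):
    """Combine predictions from several paths into one pseudolabel.

    path_predictions: list of equally long lists, one per hypergraph path
    (hypothetical layout: each inner list is a flattened prediction map).
    Returns the per-pixel median, which is robust to a single outlier path.
    """
    return [median(values) for values in zip(*path_predictions)]


# Three paths predict depth for two pixels; the second path is an outlier
# on the second pixel and is ignored by the median.
pseudolabel = ensemble_pseudolabel([[1.0, 2.0], [1.2, 9.0], [0.8, 2.2]])
```

The median is used here only because it needs no training; any learned combiner would replace `ensemble_pseudolabel` with a small model fitted on the labeled set.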
2308.07605
Report |
SGDiff: A Style Guided Diffusion Model for Fashion Synthesis |
Zhengwentai Sun, Yanghong Zhou, Honghong He, P. Y. Mok |
This paper reports on the development of a novel style guided
diffusion model (SGDiff) which overcomes certain weaknesses inherent in
existing models for image synthesis. The proposed SGDiff combines image
modality with a pretrained text-to-image diffusion model to facilitate creative
fashion image synthesis. It addresses the limitations of text-to-image
diffusion models by incorporating supplementary style guidance, substantially
reducing training costs, and overcoming the difficulties of controlling
synthesized styles with text-only inputs. This paper also introduces a new
dataset -- SG-Fashion, specifically designed for fashion image synthesis
applications, offering high-resolution images and an extensive range of garment
categories. By means of a comprehensive ablation study, we examine the
application of classifier-free guidance to a variety of conditions and validate
the effectiveness of the proposed model for generating fashion images of the
desired categories, product attributes, and styles. The contributions of this
paper include a novel classifier-free guidance method for multi-modal feature
fusion, a comprehensive dataset for fashion image synthesis application, a
thorough investigation on conditioned text-to-image synthesis, and valuable
insights for future research in the text-to-image synthesis domain. The code
and dataset are available at: https://github.com/taited/SGDiff. |
This paper presents SGDiff, a novel style-guided diffusion model for fashion synthesis that integrates image modality with a pretrained text-to-image diffusion model. |
Existing text-to-image diffusion models struggle to control synthesized styles with text-only inputs and have high training costs. SGDiff addresses these limitations by incorporating style guidance from images. |
SGDiff uses a pretrained CLIP image encoder to extract style representations and a Skip Cross-Attention module to fuse style and text modalities. It formulates synthesis as image reconstruction, learning from cropped image patches as style guidance. |
SGDiff successfully synthesizes fashion images with desired categories, attributes, and styles, outperforming existing methods qualitatively and quantitatively.
A novel multi-condition classifier-free guidance approach is proposed, enabling flexible control over the generated images.
A new dataset, SG-Fashion, is introduced, featuring high-resolution fashion images and a wide range of garment categories. |
The current implementation focuses on single garment synthesis. Future work will explore generating a complete outfit with multiple garments.
The style guidance is limited to a single image patch. Investigating more sophisticated mechanisms for incorporating style information from multiple sources is planned. |
fashion synthesis, style guidance, text-to-image, diffusion models, clip |
2308.07575
Report |
Story Visualization by Online Text Augmentation with Context Memory |
Daechul Ahn, Daneul Kim, Gwangmo Song, Seung Hwan Kim, Honglak Lee, Dongyeop Kang, Jonghyun Choi |
Story visualization (SV) is a challenging text-to-image generation task: it
requires not only rendering visual details from the text descriptions but also
encoding a long-term context across multiple sentences. While prior
efforts mostly focus on generating a semantically relevant image for each
sentence, encoding a context spread across the given paragraph to generate
contextually convincing images (e.g., with a correct character or with a proper
background of the scene) remains a challenge. To this end, we propose a novel
memory architecture for the Bi-directional Transformer framework with an online
text augmentation that generates multiple pseudo-descriptions as supplementary
supervision during training for better generalization to the language variation
at inference. In extensive experiments on the two popular SV benchmarks, i.e.,
the Pororo-SV and Flintstones-SV, the proposed method significantly outperforms
the state of the art on various metrics including FID, character F1, frame
accuracy, BLEU-2/3, and R-precision with similar or less computational
complexity. |
Presents a novel memory architecture for Bi-directional Transformer with online text augmentation for story visualization, enhancing context encoding and linguistic generalization. |
Addresses the challenge of encoding long-term context across sentences in story visualization for generating contextually consistent images. |
Proposes a context memory module with attentive weighting for dense past information encoding and an online text augmentation scheme generating pseudo-descriptions during training for improved linguistic diversity. |
Significantly outperforms state-of-the-art story visualization methods in FID, character consistency, and semantic matching metrics.
Demonstrates superior performance in preserving character consistency and background context compared to methods without the proposed memory module.
Shows comparable or better performance in certain metrics compared to significantly larger pre-trained models like StoryDALL-E. |
Despite improvements, image quality (FID) still lags behind large pre-trained models due to differences in model size and training data scale.
Further research can explore integrating the proposed method with larger models and investigating its applicability to video generation from long paragraphs. |
story visualization, text-to-image generation, context memory, online text augmentation, transformer |
2308.07415
Report |
Semantify: Simplifying the Control of 3D Morphable Models using CLIP |
Omer Gralnik, Guy Gafni, Ariel Shamir |
We present Semantify: a self-supervised method that utilizes the semantic
power of CLIP language-vision foundation model to simplify the control of 3D
morphable models. Given a parametric model, training data is created by
randomly sampling the model's parameters, creating various shapes and rendering
them. The similarity between the output images and a set of word descriptors is
calculated in CLIP's latent space. Our key idea is first to choose a small set
of semantically meaningful and disentangled descriptors that characterize the
3DMM, and then learn a non-linear mapping from scores across this set to the
parametric coefficients of the given 3DMM. The non-linear mapping is defined by
training a neural network without a human-in-the-loop. We present results on
numerous 3DMMs: body shape models, face shape and expression models, as well as
animal shapes. We demonstrate how our method defines a simple slider interface
for intuitive modeling, and show how the mapping can be used to instantly fit a
3D parametric body shape to in-the-wild images. |
Semantify is a self-supervised method that simplifies 3D morphable model control using CLIP, enabling intuitive modeling with semantically meaningful descriptors. |
Controlling 3DMMs is often difficult due to the uninterpretable nature of their parameters. Semantify addresses this by introducing semantic control using natural language descriptors. |
The method involves: (1) creating a dataset of rendered 3DMM shapes with varying parameters, (2) encoding these images and semantic descriptors into CLIP's latent space, (3) selecting a small set of disentangled descriptors, and (4) training a neural network to map descriptor scores to 3DMM coefficients. |
Semantify defines a simple slider interface for intuitive 3D model manipulation.
It enables zero-shot fitting of 3D body shapes to in-the-wild images, achieving comparable results to state-of-the-art methods.
User studies show Semantify is more user-friendly and efficient than traditional control methods. |
Mapper performance is dependent on the quality and diversity of the training dataset.
While Semantify aims for a self-supervised approach, manual fine-tuning for specific 3DMMs could potentially enhance performance. |
3d morphable models, clip, semantic modeling, zero-shot learning, human-computer interaction |
2308.07391
Report |
PARIS: Part-level Reconstruction and Motion Analysis for Articulated Objects |
Jiayi Liu, Ali Mahdavi-Amiri, Manolis Savva |
We address the task of simultaneous part-level reconstruction and motion
parameter estimation for articulated objects. Given two sets of multi-view
images of an object in two static articulation states, we decouple the movable
part from the static part and reconstruct shape and appearance while predicting
the motion parameters. To tackle this problem, we present PARIS: a
self-supervised, end-to-end architecture that learns part-level implicit shape
and appearance models and optimizes motion parameters jointly without any 3D
supervision, motion, or semantic annotation. Our experiments show that our
method generalizes better across object categories, and outperforms baselines
and prior work that are given 3D point clouds as input. Our approach improves
reconstruction relative to state-of-the-art baselines with a Chamfer-L1
distance reduction of 3.94 (45.2%) for objects and 26.79 (84.5%) for parts, and
achieves 5% error rate for motion estimation across 10 object categories.
Video summary at: https://youtu.be/tDSrROPCgUc |
Presents PARIS, a self-supervised method for part-level reconstruction and motion analysis of articulated objects from multi-view images in two static states. |
Enables understanding and manipulation of articulated objects in areas like robotics, animation, and industrial design, without expensive 3D supervision or category-specific models. |
Uses composite neural radiance fields to represent static and movable parts, with a transformation function to align the movable part to a canonical state. Employs self-supervisory losses based on input RGB images and object masks. |
Outperforms baselines in shape and appearance reconstruction, achieving a significant Chamfer-L1 distance reduction.
Achieves accurate motion parameter estimation, with low errors in joint axis and state prediction.
Demonstrates generalization to unseen object categories, unlike category-specific methods. |
Relies on a sufficient number of multi-view observations and pre-alignment of object states.
Faces challenges with severe occlusions and highly symmetric movable parts. |
articulated objects, part-level reconstruction, motion analysis, self-supervised learning, neural radiance fields |
2308.07314
Report |
Dual Associated Encoder for Face Restoration |
Yu-Ju Tsai, Yu-Lun Liu, Lu Qi, Kelvin C. K. Chan, Ming-Hsuan Yang |
Restoring facial details from low-quality (LQ) images has remained a
challenging problem due to its ill-posedness induced by various degradations in
the wild. The existing codebook prior mitigates the ill-posedness by leveraging
an autoencoder and learned codebook of high-quality (HQ) features, achieving
remarkable quality. However, existing approaches in this paradigm frequently
depend on a single encoder pre-trained on HQ data for restoring HQ images,
disregarding the domain gap between LQ and HQ images. As a result, the encoding
of LQ inputs may be insufficient, resulting in suboptimal performance. To
tackle this problem, we propose a novel dual-branch framework named DAEFR. Our
method introduces an auxiliary LQ branch that extracts crucial information from
the LQ inputs. Additionally, we incorporate association training to promote
effective synergy between the two branches, enhancing code prediction and
output quality. We evaluate the effectiveness of DAEFR on both synthetic and
real-world datasets, demonstrating its superior performance in restoring facial
details. Project page: https://liagm.github.io/DAEFR/ |
This paper introduces DAEFR, a novel dual-branch framework for restoring high-quality facial images from severely degraded ones, addressing limitations in existing codebook prior methods. |
Restoring facial details from low-quality images is crucial for various applications but challenging due to domain gaps and information loss between degraded and high-quality images. |
DAEFR utilizes an auxiliary LQ branch to extract domain-specific information from degraded inputs. It employs association training to align features from HQ and LQ encoders, bridging the domain gap. A multi-head cross-attention module then fuses these features, enhancing code prediction and restoration. |
DAEFR outperforms state-of-the-art methods in perceptual quality metrics (FID, NIQE) on real-world datasets, demonstrating robustness against severe degradation.
On synthetic datasets, DAEFR achieves competitive performance in image quality (FID, LPIPS) and identity preservation (IDA, LMD).
Ablation studies validate the effectiveness of the dual-branch architecture, association stage, and feature fusion module. |
DAEFR's performance may be limited in extreme pose situations due to the limited diversity of training data.
Future work includes exploring alternative feature fusion techniques and extending the approach to handle other image restoration tasks. |
face restoration, codebook prior, dual-branch network, feature association, multi-head cross-attention |
2308.07102
Report |
Temporal Sentence Grounding in Streaming Videos |
Tian Gan, Xiao Wang, Yan Sun, Jianlong Wu, Qingpei Guo, Liqiang Nie |
This paper aims to tackle a novel task - Temporal Sentence Grounding in
Streaming Videos (TSGSV). The goal of TSGSV is to evaluate the relevance
between a video stream and a given sentence query. Unlike regular videos,
streaming videos are acquired continuously from a particular source, and are
always desired to be processed on-the-fly in many applications such as
surveillance and live-stream analysis. Thus, TSGSV is challenging since it
requires the model to infer without future frames and process long historical
frames effectively, neither of which earlier methods address. To specifically
address the above challenges, we propose two novel methods: (1) a TwinNet
structure that enables the model to learn about upcoming events; and (2) a
language-guided feature compressor that eliminates redundant visual frames and
reinforces the frames that are relevant to the query. We conduct extensive
experiments using ActivityNet Captions, TACoS, and MAD datasets. The results
demonstrate the superiority of our proposed methods. A systematic ablation
study also confirms their effectiveness. |
This paper introduces and tackles the novel task of Temporal Sentence Grounding in Streaming Videos (TSGSV), which aims to assess the relevance between a streaming video and a sentence query in an online manner. |
TSGSV is crucial for applications like surveillance and live-stream analysis, where real-time processing of continuous video streams is essential for identifying events of interest. |
The paper proposes a TwinNet architecture with an ordinary and a prophet network. The prophet network, with access to future frames during training, guides the ordinary network to understand upcoming events. Additionally, a language-guided feature compressor efficiently summarizes historical information relevant to the query. |
The proposed method outperforms modified offline temporal sentence grounding methods and online action detection methods on ActivityNet Captions, TACoS, and MAD datasets.
Ablation studies confirm the importance of both the language-guided feature compressor and the prophet decoder for accurate and efficient TSGSV.
The model's optimized implementation achieves real-time performance suitable for online inference. |
The current model relies on offline evaluation metrics due to the lack of established online evaluation protocols for TSGSV.
Future work will explore extending the model for streaming video-text pretraining to enhance its performance further. |
temporal sentence grounding, streaming videos, online inference, twinnet, language-guided feature compression |
2308.07037
Report |
Bayesian Flow Networks |
Alex Graves, Rupesh Kumar Srivastava, Timothy Atkinson, Faustino Gomez |
This paper introduces Bayesian Flow Networks (BFNs), a new class of
generative model in which the parameters of a set of independent distributions
are modified with Bayesian inference in the light of noisy data samples, then
passed as input to a neural network that outputs a second, interdependent
distribution. Starting from a simple prior and iteratively updating the two
distributions yields a generative procedure similar to the reverse process of
diffusion models; however it is conceptually simpler in that no forward process
is required. Discrete and continuous-time loss functions are derived for
continuous, discretised and discrete data, along with sample generation
procedures. Notably, the network inputs for discrete data lie on the
probability simplex, and are therefore natively differentiable, paving the way
for gradient-based sample guidance and few-step generation in discrete domains
such as language modelling. The loss function directly optimises data
compression and places no restrictions on the network architecture. In our
experiments BFNs achieve competitive log-likelihoods for image modelling on
dynamically binarized MNIST and CIFAR-10, and outperform all known discrete
diffusion models on the text8 character-level language modelling task. |
Introduces Bayesian Flow Networks (BFNs), a new class of generative model that modifies the parameters of independent distributions using Bayesian inference based on noisy data samples, then passes these parameters to a neural network to generate a second, interdependent distribution. |
Aims to combine the strengths of Bayesian inference for summarizing information about individual variables with the power of deep learning for integrating information across many variables. Also seeks to enable smooth and differentiable generative processes for discrete data, unlike traditional discrete diffusion models. |
Derives discrete and continuous-time loss functions based on minimizing the KL divergence between sender and receiver distributions. Provides specializations for continuous, discretized, and discrete data, along with algorithms for training, evaluation, and sample generation. |
BFNs achieve competitive log-likelihoods for image modeling on dynamically binarized MNIST and CIFAR-10.
BFNs outperform all known discrete diffusion models on the text8 character-level language modeling task.
Discretized loss function performs better than continuous loss for CIFAR-10 with 16 bins, but continuous loss performs better for 256 bins. |
The accuracy schedule used for binary and continuous data appears suboptimal.
Further investigation is needed to understand why continuous loss performs better for CIFAR-10 with 256 bins. |
generative models, bayesian inference, deep learning, diffusion models, discrete data |
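The Bayesian-inference step described above can be illustrated for a single continuous variable with the standard conjugate Gaussian update, in which precisions add and the mean becomes a precision-weighted average. This is a minimal sketch, not the authors' code: it assumes a scalar Gaussian belief with mean `mu` and precision `rho`, a noisy sample `y` with precision `alpha`, and a standard normal prior. The function names are hypothetical, and the neural network that produces the second, interdependent distribution is omitted.

```python
import random


def bayesian_update(mu, rho, y, alpha):
    """Conjugate Gaussian update of a belief N(mu, 1/rho) given a noisy
    observation y whose noise has precision alpha."""
    rho_new = rho + alpha                      # precisions add
    mu_new = (rho * mu + alpha * y) / rho_new  # precision-weighted mean
    return mu_new, rho_new


def run_updates(x, alphas, seed=0):
    """Start from a standard normal prior and apply one update per noisy
    sample of the true value x (assumed setup for illustration)."""
    rng = random.Random(seed)
    mu, rho = 0.0, 1.0
    for alpha in alphas:
        y = rng.gauss(x, alpha ** -0.5)  # sample with standard deviation alpha^(-1/2)
        mu, rho = bayesian_update(mu, rho, y, alpha)
    return mu, rho
```

After each update the belief is more concentrated: the final precision is the prior precision plus the sum of all sample precisions, regardless of the sampled values.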
2308.07032
Report |
S3IM: Stochastic Structural SIMilarity and Its Unreasonable Effectiveness for Neural Fields |
Zeke Xie, Xindi Yang, Yujie Yang, Qi Sun, Yixiang Jiang, Haoran Wang, Yunfeng Cai, Mingming Sun |
Recently, Neural Radiance Field (NeRF) has shown great success in rendering
novel-view images of a given scene by learning an implicit representation with
only posed RGB images. NeRF and relevant neural field methods (e.g., neural
surface representation) typically optimize a point-wise loss and make
point-wise predictions, where one data point corresponds to one pixel.
Unfortunately, this line of research failed to use the collective supervision
of distant pixels, although it is known that pixels in an image or scene can
provide rich structural information. To the best of our knowledge, we are the
first to design a nonlocal multiplex training paradigm for NeRF and relevant
neural field methods via a novel Stochastic Structural SIMilarity (S3IM) loss
that processes multiple data points as a whole set instead of processing
multiple inputs independently. Our extensive experiments demonstrate the
unreasonable effectiveness of S3IM in improving NeRF and neural surface
representation at almost no extra cost. The improvements in quality metrics can be particularly
significant for those relatively difficult tasks: e.g., the test MSE loss
unexpectedly drops by more than 90% for TensoRF and DVGO over eight novel view
synthesis tasks; a 198% F-score gain and a 64% Chamfer $L_{1}$ distance
reduction for NeuS over eight surface reconstruction tasks. Moreover, S3IM is
consistently robust even with sparse inputs, corrupted images, and dynamic
scenes. |
This paper introduces S3IM, a novel Stochastic Structural SIMilarity index, and a corresponding multiplex training paradigm for Neural Radiance Fields (NeRF) and neural surface representation methods. S3IM captures nonlocal structural similarity information from stochastically sampled pixels and leverages it as a multiplex loss to improve training. |
Existing NeRF methods rely on point-wise losses (e.g., MSE), neglecting the rich structural information among pixels. This limits their performance, especially for challenging tasks such as few-shot learning and handling corrupted images. S3IM addresses this limitation by incorporating nonlocal structural similarity into the training process. |
S3IM computes SSIM on stochastically generated patches from sampled pixels, capturing nonlocal structural information. It then integrates this information into a multiplex loss function, combined with the conventional point-wise loss, to train NeRF and neural surface representation models. |
S3IM significantly improves image quality metrics (PSNR, SSIM, LPIPS) for NeRF variants like DVGO and TensoRF, achieving up to 16.43 and 24.75 PSNR gains on Replica Dataset.
S3IM enhances robustness to sparse inputs and corrupted images, exhibiting even greater improvements with fewer or noisier training images.
S3IM significantly benefits neural surface reconstruction, leading to substantial gains in both image quality metrics and geometric metrics (e.g., 64% Chamfer L1 distance reduction, 198% F-score gain for NeuS). |
The current study mainly focuses on S3IM for RGB image losses and could explore its application to depth or other non-RGB losses.
Future work can investigate the theoretical understanding of how S3IM improves generalization and affects the flatness of the learned minima. |
neural radiance fields, nerf, neural rendering, surface reconstruction, multiplex loss |
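The idea of computing SSIM over stochastically grouped pixels can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: pixels are grayscale values in [0, 1], the "patches" are one-dimensional windows rather than 2D kernels, SSIM uses plain (unweighted) window statistics, and the names `window_ssim` and `s3im_loss` as well as the `window`, `repeats`, and `seed` parameters are hypothetical.

```python
import random


def window_ssim(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM between two equally long lists using unweighted statistics."""
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((v - mu_a) ** 2 for v in a) / n
    var_b = sum((v - mu_b) ** 2 for v in b) / n
    cov = sum((p - mu_a) * (q - mu_b) for p, q in zip(a, b)) / n
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den


def s3im_loss(pred, target, window=4, repeats=3, seed=0):
    """1 minus the mean SSIM over windows of randomly grouped pixels.

    The same random permutation is applied to predicted and ground-truth
    pixels, so each window compares distant pixels as one set.
    """
    rng = random.Random(seed)
    idx = list(range(len(pred)))
    scores = []
    for _ in range(repeats):
        rng.shuffle(idx)
        for start in range(0, len(idx) - window + 1, window):
            chunk = idx[start:start + window]
            scores.append(window_ssim([pred[i] for i in chunk],
                                      [target[i] for i in chunk]))
    return 1.0 - sum(scores) / len(scores)
```

In training, a term like this would be added to the usual point-wise loss with a weighting factor, as the summary describes; the weighting itself is not shown here.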
2308.07026
Report |
AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning |
Ziqi Zhou, Shengshan Hu, Minghui Li, Hangtao Zhang, Yechao Zhang, Hai Jin |
Multimodal contrastive learning aims to train a general-purpose feature
extractor, such as CLIP, on vast amounts of raw, unlabeled paired image-text
data. This can greatly benefit various complex downstream tasks, including
cross-modal image-text retrieval and image classification. Despite its
promising prospects, the security of cross-modal pre-trained encoders has
not been fully explored yet, especially when the pre-trained encoder is
publicly available for commercial use.
In this work, we propose AdvCLIP, the first attack framework for generating
downstream-agnostic adversarial examples based on cross-modal pre-trained
encoders. AdvCLIP aims to construct a universal adversarial patch for a set of
natural images that can fool all the downstream tasks inheriting the victim
cross-modal pre-trained encoder. To address the challenges of heterogeneity
between different modalities and unknown downstream tasks, we first build a
topological graph structure to capture the relevant positions between target
samples and their neighbors. Then, we design a topology-deviation based
generative adversarial network to generate a universal adversarial patch. By
adding the patch to images, we minimize their embedding similarity to the
other modality and perturb the sample distribution in the feature space,
achieving universal non-targeted attacks. Our results demonstrate the excellent
attack performance of AdvCLIP on two types of downstream tasks across eight
datasets. We also tailor three popular defenses to mitigate AdvCLIP,
highlighting the need for new defense mechanisms to defend cross-modal
pre-trained encoders. |
AdvCLIP, a novel attack framework, is proposed to generate downstream-agnostic adversarial examples for multimodal contrastive learning models, exposing security vulnerabilities in these models and their downstream tasks. |
Multimodal pre-trained encoders, despite their promising applications in various downstream tasks, have security risks that haven't been fully explored, potentially impacting commercially available models and services. |
A topology-deviation based generative adversarial network is designed to generate universal adversarial patches. These patches disrupt the similarity between different modality embeddings and their topological relationships, leading to non-targeted attacks on downstream tasks. |
AdvCLIP demonstrates successful attacks on both image-text retrieval and image classification tasks across various datasets and model architectures.
Transformer-based architectures are found to be more vulnerable to these adversarial attacks compared to ResNet-based models.
AdvCLIP remains effective even when common defense mechanisms such as data corruption, pruning, and adversarial training are applied. |
The attack success rate of AdvCLIP can be influenced by the choice of surrogate dataset used for training the adversarial patch generator.
Further research is needed to develop more robust defense mechanisms specifically designed to protect multimodal pre-trained encoders from these types of attacks. |
adversarial patch, pre-trained encoder, cross-modal retrieval, multimodal contrastive learning, security vulnerability |
2308.06962
Report |
Color-NeuS: Reconstructing Neural Implicit Surfaces with Color |
Licheng Zhong, Lixin Yang, Kailin Li, Haoyu Zhen, Mei Han, Cewu Lu |
The reconstruction of object surfaces from multi-view images or monocular
video is a fundamental issue in computer vision. However, much of the recent
research concentrates on reconstructing geometry through implicit or explicit
methods. In this paper, we shift our focus towards reconstructing mesh in
conjunction with color. We remove the view-dependent color from neural volume
rendering while retaining volume rendering performance through a relighting
network. Mesh is extracted from the signed distance function (SDF) network for
the surface, and color for each surface vertex is drawn from the global color
network. To evaluate our approach, we conceived an in-hand object scanning task
featuring numerous occlusions and dramatic shifts in lighting conditions. We've
gathered several videos for this task, and the results surpass those of any
existing methods capable of reconstructing mesh alongside color. Additionally,
our method's performance was assessed using public datasets, including DTU,
BlendedMVS, and OmniObject3D. The results indicated that our method performs
well across all these datasets. Project page:
https://colmar-zlicheng.github.io/color_neus. |
This paper proposes Color-NeuS, a novel method for reconstructing neural implicit surfaces with view-independent color, compatible with NeuS-like models. |
Reconstructing object surfaces with color from images is a fundamental problem. Existing methods struggle to balance accurate geometry reconstruction with view-independent color extraction, especially under challenging real-world conditions like occlusion and varying lighting. |
The method decouples view-dependent color in neural volume rendering by learning a view-independent global color and a view-dependent relighting effect. It uses a global color network for vertex color and a relighting network to maintain volume rendering performance. During inference, only the global color is used for mesh vertex coloring. |
Color-NeuS successfully reconstructs object surfaces with accurate color, outperforming alternative solutions and traditional methods like structured-light scanning and COLMAP.
The method handles challenging real-world scenarios with occlusion and reflection effectively, as demonstrated on the IHO-Video dataset.
Quantitative evaluations on DTU, BlendedMVS, OmniObject3D, and IHO-Video datasets show Color-NeuS achieves high-quality surface reconstruction and color accuracy. |
The relighting network relies on the gradient of the SDF as input, which might limit its performance when the SDF network is sub-optimal.
Future work could explore more sophisticated relighting networks and incorporate techniques like differentiable rendering for improved accuracy. |
neural implicit surface, surface reconstruction, view-independent color, relighting network, neural rendering |
2308.06887
Report |
Robustified ANNs Reveal Wormholes Between Human Category Percepts |
Guy Gaziv, Michael J. Lee, James J. DiCarlo |
The visual object category reports of artificial neural networks (ANNs) are
notoriously sensitive to tiny, adversarial image perturbations. Because human
category reports (aka human percepts) are thought to be insensitive to those
same small-norm perturbations -- and locally stable in general -- this argues
that ANNs are incomplete scientific models of human visual perception.
Consistent with this, we show that when small-norm image perturbations are
generated by standard ANN models, human object category percepts are indeed
highly stable. However, in this very same "human-presumed-stable" regime, we
find that robustified ANNs reliably discover low-norm image perturbations that
strongly disrupt human percepts. These previously undetectable human perceptual
disruptions are massive in amplitude, approaching the same level of sensitivity
seen in robustified ANNs. Further, we show that robustified ANNs support
precise perceptual state interventions: they guide the construction of low-norm
image perturbations that strongly alter human category percepts toward specific
prescribed percepts. These observations suggest that for arbitrary starting
points in image space, there exists a set of nearby "wormholes", each leading
the subject from their current category perceptual state into a semantically
very different state. Moreover, contemporary ANN models of biological visual
processing are now accurate enough to consistently guide us to those portals. |
This paper provides evidence that robustified ANNs can discover low-norm image perturbations that strongly and precisely modulate human object category percepts, challenging the assumption that human categorization is highly robust to such perturbations. |
This finding is significant because it suggests the existence of "wormholes" in image space, where local perturbations can lead to drastic changes in human perception, and demonstrates the increasing accuracy of ANNs as scientific models of ventral visual processing. |
The authors generated image perturbations using robustified and vanilla ANNs in two modes: Disruption Modulation (DM) to induce model errors and Targeted Modulation (TM) to induce specific category judgments. They then measured the effects of these perturbations on human categorization behavior in a nine-way choice task. |
Human category percepts are highly sensitive to low-norm perturbations discovered by robustified ANNs, but not vanilla ANNs.
Robustified ANNs allow for precise targeted modulation of human percepts, guiding them towards specific categories.
These effects persist across different image distributions and even extend to composite category perceptions. |
The study primarily focuses on ResNet50 architecture and L2-norm perturbations, limiting the generalization of findings.
While demonstrating the effectiveness of adversarial training, the study doesn't claim it as the mechanism behind human robustness. |
adversarial robustness, human perception, object categorization, neural networks, visual processing |
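The entry above describes low-norm (L2) perturbations generated in a Disruption Modulation (DM) and a Targeted Modulation (TM) mode. As a rough illustration only, the sketch below shows one common way to keep a perturbation inside an L2 budget: take a normalised gradient step, then project back onto the L2 ball. The function names, the `grad` input, and the step rule are assumptions for illustration and are not taken from the paper.

```python
import math
from typing import List


def l2_norm(v: List[float]) -> float:
    """Euclidean norm of a flattened perturbation or gradient."""
    return math.sqrt(sum(x * x for x in v))


def project_l2(delta: List[float], eps: float) -> List[float]:
    """Rescale delta onto the L2 ball of radius eps if it lies outside it."""
    norm = l2_norm(delta)
    if norm <= eps:
        return list(delta)
    return [x * eps / norm for x in delta]


def modulation_step(delta: List[float], grad: List[float], step_size: float,
                    eps: float, targeted: bool) -> List[float]:
    """One normalised-gradient step followed by projection (assumed update rule).

    grad is a hypothetical gradient of the model loss w.r.t. the image.
    DM-style (targeted=False): ascend the loss on the original label.
    TM-style (targeted=True): descend the loss on the prescribed target label.
    """
    g_norm = l2_norm(grad) or 1.0  # avoid division by zero for a zero gradient
    direction = -1.0 if targeted else 1.0
    moved = [d + direction * step_size * g / g_norm for d, g in zip(delta, grad)]
    return project_l2(moved, eps)
```

Calling `modulation_step` repeatedly with fresh gradients would give a projected-gradient loop; the budget `eps` bounds the perturbation norm regardless of how many steps are taken.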
2308.06749
Report |
FastLLVE: Real-Time Low-Light Video Enhancement with Intensity-Aware Lookup Table |
Wenhao Li, Guangyang Wu, Wenyi Wang, Peiran Ren, Xiaohong Liu |
Low-Light Video Enhancement (LLVE) has received considerable attention in
recent years. One of the critical requirements of LLVE is inter-frame
brightness consistency, which is essential for maintaining the temporal
coherence of the enhanced video. However, most existing single-image-based
methods fail to address this issue, resulting in a flickering effect that
degrades the overall quality after enhancement. Moreover, 3D Convolution Neural
Network (CNN)-based methods, which are designed for video to maintain
inter-frame consistency, are computationally expensive, making them impractical
for real-time applications. To address these issues, we propose an efficient
pipeline named FastLLVE that leverages the Look-Up-Table (LUT) technique to
maintain inter-frame brightness consistency effectively. Specifically, we
design a learnable Intensity-Aware LUT (IA-LUT) module for adaptive
enhancement, which addresses the low-dynamic problem in low-light scenarios.
This enables FastLLVE to perform low-latency and low-complexity enhancement
operations while maintaining high-quality results. Experimental results on
benchmark datasets demonstrate that our method achieves the State-Of-The-Art
(SOTA) performance in terms of both image quality and inter-frame brightness
consistency. More importantly, our FastLLVE can process 1080p videos at
50+ Frames Per Second (FPS), which is $2\times$ faster
than SOTA CNN-based methods in inference time, making it a promising solution
for real-time applications. The code is available at
https://github.com/Wenhao-Li-777/FastLLVE. |
This paper proposes FastLLVE, a novel LUT-based framework for real-time low-light video enhancement, utilizing an Intensity-Aware LUT (IA-LUT) to maintain inter-frame brightness consistency. |
Maintaining brightness consistency in low-light video enhancement is crucial for high perceptual quality, but current methods struggle to balance efficiency and performance. Existing methods either suffer from flickering effects or are computationally expensive, making them impractical for real-time applications. |
The method uses a lightweight encoder-decoder network to extract features from the input video and generate a video-adaptive IA-LUT. The IA-LUT, incorporating enhancement intensity as an additional dimension, facilitates pixel-wise transformation for consistent enhancement and is efficiently implemented via CUDA. |
FastLLVE achieves state-of-the-art performance in terms of both image quality and inter-frame brightness consistency on benchmark datasets.
It maintains superior brightness consistency compared to existing methods, as evidenced by lower AB (Var) and MABD values.
The method achieves real-time processing speed of over 50 FPS for 1080p videos, making it significantly faster than CNN-based methods. |
The dependence on a separate denoising module, while improving visual quality, slightly impacts the overall efficiency.
Future work will explore a denoising strategy specifically designed for LUT-based enhancement to further enhance efficiency. |
low-light video enhancement, lookup table, brightness consistency, real-time, intensity-aware lut |
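For the FastLLVE entry above, the idea of a lookup table with enhancement intensity as an extra dimension can be sketched as a 4D table indexed by (r, g, b, intensity). This is a minimal sketch under stated assumptions: the table here is filled by a hypothetical `toy_enhance` function rather than predicted by a network, and the lookup is nearest-neighbour rather than interpolated.

```python
from typing import Callable, Tuple

RGB = Tuple[float, float, float]


def build_lut(bins: int, fn: Callable[[float, float, float, float], RGB]) -> list:
    """Sample fn on a regular bins^4 grid over [0, 1]^4 (r, g, b, intensity)."""
    grid = [i / (bins - 1) for i in range(bins)]
    return [[[[fn(r, g, b, t) for t in grid] for b in grid] for g in grid] for r in grid]


def lookup(lut: list, pixel: RGB, intensity: float) -> RGB:
    """Nearest-neighbour lookup: quantise each coordinate to its closest grid index."""
    bins = len(lut)

    def idx(v: float) -> int:
        v = min(max(v, 0.0), 1.0)  # clamp to the table's domain
        return int(round(v * (bins - 1)))

    r, g, b = pixel
    return lut[idx(r)][idx(g)][idx(b)][idx(intensity)]


def toy_enhance(r: float, g: float, b: float, t: float) -> RGB:
    # Hypothetical mapping: brighten more when the intensity t is larger.
    gain = 1.0 + t
    return (min(r * gain, 1.0), min(g * gain, 1.0), min(b * gain, 1.0))


lut = build_lut(5, toy_enhance)
dark_pixel = (0.25, 0.25, 0.5)
out_low = lookup(lut, dark_pixel, 0.0)   # intensity 0: pixel unchanged
out_high = lookup(lut, dark_pixel, 1.0)  # intensity 1: pixel doubled, then clipped
```

Because the same table is applied to every pixel of every frame, identical inputs map to identical outputs, which is one intuition for why a LUT-based transform helps inter-frame brightness consistency.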
2308.06739
Report |
Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks |
David Junhao Zhang, Mutian Xu, Chuhui Xue, Wenqing Zhang, Xiaoguang Han, Song Bai, Mike Zheng Shou |
Despite the rapid advancement of unsupervised learning in visual
representation, it requires training on large-scale datasets that demand costly
data collection, and pose additional challenges due to concerns regarding data
privacy. Recently, synthetic images generated by text-to-image diffusion
models have shown great potential for benefiting image recognition. Although
promising, there has been inadequate exploration dedicated to unsupervised
learning on diffusion-generated images. To address this, we start by uncovering
that diffusion models' cross-attention layers inherently provide
annotation-free attention masks aligned with corresponding text inputs on
generated images. We then investigate the problems of three prevalent
unsupervised learning techniques (i.e., contrastive learning, masked modeling,
and vision-language pretraining) and introduce customized solutions by fully
exploiting the aforementioned free attention masks. Our approach is validated
through extensive experiments that show consistent improvements in baseline
models across various downstream tasks, including image classification,
detection, segmentation, and image-text retrieval. By utilizing our method, it
is possible to close the performance gap between unsupervised pretraining on
synthetic data and real-world scenarios. |
This paper presents Free-ATM, a novel method that leverages the freely available attention masks from text-to-image diffusion models to enhance unsupervised learning on synthetic images. |
Unsupervised learning heavily relies on large-scale datasets, which are costly to collect and raise privacy concerns. Synthetic data offers a solution, but current methods for unsupervised learning on such data, particularly diffusion-generated images, are underdeveloped. |
The study leverages the inherent attention masks within diffusion models' cross-attention layers, which align with text inputs to highlight objects in generated images. These masks are then used to address limitations in three unsupervised learning techniques: contrastive learning, masked modeling, and vision-language pretraining. |
Utilizing the attention masks for instance-level contrastive learning improves performance on object detection and segmentation tasks.
Applying the masks to guide the masking process in masked modeling leads to better image classification and semantic segmentation results.
Employing the masks for generating position-aware prompts significantly boosts image-text retrieval performance in vision-language pretraining. |
The quality of synthetic images, while improving, still influences the overall performance gain.
The impact of further increasing the volume of synthetic data used for pretraining remains to be explored. |
unsupervised learning, diffusion models, synthetic data, attention masks, computer vision |
2308.06721
Report |
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models |
Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, Wei Yang |
Recent years have witnessed the strong power of large text-to-image diffusion
models for the impressive generative capability to create high-fidelity images.
However, it is very tricky to generate desired images using only a text prompt, as
this often involves complex prompt engineering. An alternative to the text prompt is
the image prompt; as the saying goes, "an image is worth a thousand words".
Although existing methods of direct fine-tuning from pretrained models are
effective, they require large computing resources and are not compatible with
other base models, text prompt, and structural controls. In this paper, we
present IP-Adapter, an effective and lightweight adapter to achieve image
prompt capability for the pretrained text-to-image diffusion models. The key
design of our IP-Adapter is a decoupled cross-attention mechanism that separates
cross-attention layers for text features and image features. Despite the
simplicity of our method, an IP-Adapter with only 22M parameters can achieve
comparable or even better performance to a fully fine-tuned image prompt model.
As we freeze the pretrained diffusion model, the proposed IP-Adapter can be
generalized not only to other custom models fine-tuned from the same base
model, but also to controllable generation using existing controllable tools.
With the benefit of the decoupled cross-attention strategy, the image prompt
can also work well with the text prompt to achieve multimodal image generation.
The project page is available at https://ip-adapter.github.io. |
Presents IP-Adapter, a lightweight image prompt adapter for text-to-image diffusion models, employing a decoupled cross-attention mechanism for effective integration of image features. |
Addresses the limitations of text prompts in image generation, enabling more intuitive and informative image-based prompts for controlling content generation. |
Leverages a pretrained CLIP image encoder to extract image features, employs a projection network to decompose global image embedding, and introduces decoupled cross-attention layers within the UNet architecture to effectively embed image features. |
Achieves comparable or even better performance than fully fine-tuned image prompt models and existing adapter methods.
Demonstrates strong generalization capabilities by seamlessly integrating with custom models and existing structure control tools like ControlNet.
Enables multimodal image generation by effectively combining image prompts with text prompts for enhanced control and diversity. |
Limited ability to generate highly consistent images with the subject of a given image.
Further research is needed to enhance consistency and explore the use of fine-grained image features for improved control. |
image generation, diffusion models, image prompt, controllable generation, multimodal generation |
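The decoupled cross-attention described in the IP-Adapter entry above can be sketched as two attention branches that share the query but use separate keys and values for text and image features. The sketch below is single-head and omits all learned projections; combining the branches by a weighted sum, and the `image_scale` parameter, are assumptions made for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Scaled dot-product attention for a single head."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        w = softmax(scores)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) for c in range(len(v[0]))])
    return out


def decoupled_cross_attention(q: Matrix, k_text: Matrix, v_text: Matrix,
                              k_img: Matrix, v_img: Matrix,
                              image_scale: float = 1.0) -> Matrix:
    """Text branch plus a separately keyed image branch (assumed weighted sum)."""
    text_out = attention(q, k_text, v_text)
    img_out = attention(q, k_img, v_img)
    return [[t + image_scale * i for t, i in zip(tr, ir)]
            for tr, ir in zip(text_out, img_out)]
```

Setting `image_scale` to 0 reduces the layer to ordinary text cross-attention, which is consistent with the entry's point that the frozen base model stays usable with text prompts alone.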
2308.06713
Report |
LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts |
Binbin Yang, Yi Luo, Ziliang Chen, Guangrun Wang, Xiaodan Liang, Liang Lin |
Thanks to the rapid development of diffusion models, unprecedented progress
has been witnessed in image synthesis. Prior works mostly rely on pre-trained
linguistic models, but a text is often too abstract to properly specify all the
spatial properties of an image, e.g., the layout configuration of a scene,
leading to the sub-optimal results of complex scene generation. In this paper,
we achieve accurate complex scene generation by proposing a semantically
controllable Layout-AWare diffusion model, termed LAW-Diffusion. Distinct from
the previous Layout-to-Image generation (L2I) methods that only explore
category-aware relationships, LAW-Diffusion introduces a spatial dependency
parser to encode the location-aware semantic coherence across objects as a
layout embedding and produces a scene with perceptually harmonious object
styles and contextual relations. To be specific, we delicately instantiate each
object's regional semantics as an object region map and leverage a
location-aware cross-object attention module to capture the spatial
dependencies among those disentangled representations. We further propose an
adaptive guidance schedule for our layout guidance to mitigate the trade-off
between the regional semantic alignment and the texture fidelity of generated
objects. Moreover, LAW-Diffusion allows for instance reconfiguration while
maintaining the other regions in a synthesized image by introducing a
layout-aware latent grafting mechanism to recompose its local regional
semantics. To better verify the plausibility of generated scenes, we propose a
new evaluation metric for the L2I task, dubbed Scene Relation Score (SRS) to
measure how the images preserve the rational and harmonious relations among
contextual objects. Comprehensive experiments demonstrate that our
LAW-Diffusion yields the state-of-the-art generative performance, especially
with coherent object relations. |
This paper proposes LAW-Diffusion, a novel Layout-Aware diffusion model for synthesizing complex scene images with harmonious object relations from layout configurations. |
Existing text-to-image models struggle to accurately capture and maintain spatial relationships between objects in complex scenes, while previous layout-to-image methods often lack overall harmony and style consistency. LAW-Diffusion aims to address these limitations. |
LAW-Diffusion leverages a spatial dependency parser to encode location-aware semantic coherence across objects as a layout embedding. It utilizes a location-aware cross-object attention module to capture spatial dependencies and employs an adaptive guidance schedule for balancing regional semantic alignment and texture fidelity. It also incorporates a layout-aware latent grafting mechanism for instance reconfiguration within generated scenes. |
LAW-Diffusion outperforms state-of-the-art L2I methods in terms of FID, Inception Score, and Classification Accuracy Score, demonstrating its superior image fidelity.
The proposed Scene Relation Score (SRS) metric highlights LAW-Diffusion's ability to generate scenes with plausible and coherent object relations.
The layout-aware latent grafting mechanism enables flexible instance-level reconfiguration (adding, removing, restyling) while preserving overall scene coherence. |
LAW-Diffusion currently focuses on closed-world object categories pre-defined in the datasets.
It lacks the ability to specify scene-level style and semantics through global scene descriptions. |
image generation, diffusion models, layout-to-image generation, scene understanding, computer vision |
2308.06699
Report |
Neural Super-Resolution for Real-time Rendering with Radiance Demodulation |
Jia Li, Ziling Chen, Xiaolong Wu, Lu Wang, Beibei Wang, Lei Zhang |
It is time-consuming to render high-resolution images in applications such as
video games and virtual reality, and thus super-resolution technologies become
increasingly popular for real-time rendering. However, it is challenging to
preserve sharp texture details, keep the temporal stability and avoid the
ghosting artifacts in real-time super-resolution rendering. To address this
issue, we introduce radiance demodulation to separate the rendered image or
radiance into a lighting component and a material component, considering the
fact that the lighting component is smoother than the rendered image so that the
high-resolution material component with detailed textures can be easily
obtained. We perform the super-resolution on the lighting component only and
re-modulate it with the high-resolution material component to obtain the final
super-resolution image with more texture details. A reliable warping module is
proposed by explicitly marking the occluded regions to avoid the ghosting
artifacts. To further enhance the temporal stability, we design a
frame-recurrent neural network and a temporal loss to aggregate the previous
and current frames, which can better capture the spatial-temporal consistency
among reconstructed frames. As a result, our method is able to produce
temporally stable results in real-time rendering with high-quality details,
even in the challenging $4 \times 4$ super-resolution scenarios. |
This paper introduces a novel lightweight super-resolution method for real-time rendering that leverages radiance demodulation, a motion-unreliable region detection approach, and a frame-recurrent neural network. |
Real-time rendering demands both high resolution and low latency. Super-resolution rendering helps but struggles to preserve sharp texture details, maintain temporal stability, and avoid ghosting artifacts. |
The method demodulates the rendered image into lighting and material components, performing super-resolution only on the smoother lighting component. It identifies and mitigates ghosting artifacts using a motion mask and employs a frame-recurrent network with a temporal loss for temporal stability. |
Significantly outperforms state-of-the-art VSR and RRSR methods both qualitatively and quantitatively.
Preserves more texture details and avoids ghosting artifacts compared to other methods.
Achieves real-time performance with significant efficiency improvements over rendering high-resolution images directly. |
Generalization across different scenes comes at the cost of slightly reduced quality.
Future work could explore hardware acceleration and further quality improvements. |
super-resolution, real-time rendering, radiance demodulation, motion mask, frame-recurrent neural network |
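The demodulate, upscale, re-modulate pipeline in the entry above can be sketched on single-channel images as follows. This is a minimal sketch: `upsample_nearest` is a placeholder for the learned super-resolution network, and treating demodulation as an element-wise division with a small `EPS` guard is an assumption for illustration.

```python
from typing import List

Image = List[List[float]]
EPS = 1e-6  # guard against division by zero where the material is black


def demodulate(radiance: Image, material: Image) -> Image:
    """Lighting = radiance / material, element-wise (assumed form)."""
    return [[r / max(m, EPS) for r, m in zip(rr, mr)]
            for rr, mr in zip(radiance, material)]


def upsample_nearest(img: Image, scale: int) -> Image:
    """Placeholder for the super-resolution network: nearest-neighbour upscaling."""
    return [[v for v in row for _ in range(scale)] for row in img for _ in range(scale)]


def remodulate(lighting_hr: Image, material_hr: Image) -> Image:
    """Final image = upscaled lighting * high-resolution material, element-wise."""
    return [[l * m for l, m in zip(lr, mr)] for lr, mr in zip(lighting_hr, material_hr)]


radiance_lr = [[0.5, 0.25]]
material_lr = [[0.5, 0.5]]
material_hr = [[0.25, 0.75, 0.25, 0.75],
               [0.75, 0.25, 0.75, 0.25]]

lighting_lr = demodulate(radiance_lr, material_lr)
lighting_hr = upsample_nearest(lighting_lr, 2)
final_hr = remodulate(lighting_hr, material_hr)
```

The high-frequency checkerboard in `material_hr` survives into `final_hr` untouched, since only the smooth lighting term passes through the upscaler.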
2308.06624
Report |
ADRMX: Additive Disentanglement of Domain Features with Remix Loss |
Berker Demirel, Erchan Aptoula, Huseyin Ozkan |
The common assumption that train and test sets follow similar distributions
is often violated in deployment settings. Given multiple source domains, domain
generalization aims to create robust models capable of generalizing to new
unseen domains. To this end, most existing studies focus on extracting
domain invariant features across the available source domains in order to
mitigate the effects of inter-domain distributional changes. However, this
approach may limit the model's generalization capacity by relying solely on
finding common features among the source domains. It overlooks the potential
presence of domain-specific characteristics that could be prevalent in a subset
of domains, potentially containing valuable information. In this work, a novel
architecture named Additive Disentanglement of Domain Features with Remix Loss
(ADRMX) is presented, which addresses this limitation by incorporating domain
variant features together with the domain invariant ones using an original
additive disentanglement strategy. Moreover, a new data augmentation technique
is introduced to further support the generalization capacity of ADRMX, where
samples from different domains are mixed within the latent space. Through
extensive experiments conducted on DomainBed under fair conditions, ADRMX is
shown to achieve state-of-the-art performance. Code will be made available at
GitHub after the revision process. |
This paper presents ADRMX, a novel architecture for domain generalization that leverages both domain variant and invariant features through an additive disentanglement strategy and a novel data augmentation technique. |
Domain generalization aims to improve the robustness of models when faced with distributional shifts between training (source) and unseen (target) domains, a common challenge in real-world deployments. |
ADRMX uses two backbones to extract label and domain features. It then employs an adversarial learning framework to disentangle domain-invariant features while using a novel remix loss and data augmentation technique to combine features from different domains in the latent space. |
ADRMX achieves state-of-the-art performance on the DomainBed benchmark, surpassing previous approaches.
The additive modeling strategy, incorporating both domain-variant and invariant features, proves beneficial for generalization.
The remix loss, facilitating data augmentation in the latent space, further improves the model's performance. |
The computational cost of ADRMX, particularly for large datasets, can be a limitation.
Exploring alternative backbone architectures and data augmentation strategies could further enhance the performance. |
domain generalization, disentanglement, deep learning, image classification, data augmentation |
2308.06622
Report |
DFM-X: Augmentation by Leveraging Prior Knowledge of Shortcut Learning |
Shunxin Wang, Christoph Brune, Raymond Veldhuis, Nicola Strisciuglio |
Neural networks are prone to learn easy solutions from superficial statistics
in the data, namely shortcut learning, which impairs generalization and
robustness of models. We propose a data augmentation strategy, named DFM-X,
that leverages knowledge about frequency shortcuts, encoded in Dominant
Frequencies Maps computed for image classification models. We randomly select
X% training images of certain classes for augmentation, and process them by
retaining the frequencies included in the DFMs of other classes. This strategy
compels the models to leverage a broader range of frequencies for
classification, rather than relying on specific frequency sets. Thus, the
models learn deeper and more task-related semantics compared to their counterpart
trained with standard setups. Unlike other commonly used augmentation
techniques which focus on increasing the visual variations of training data,
our method targets exploiting the original data efficiently, by distilling
prior knowledge about destructive learning behavior of models from data. Our
experimental results demonstrate that DFM-X improves robustness against common
corruptions and adversarial attacks. It can be seamlessly integrated with other
augmentation techniques to further enhance the robustness of models. |
Proposes DFM-X, a novel data augmentation method leveraging prior knowledge of frequency shortcuts to improve generalization and robustness of image classification models. |
Addresses the issue of shortcut learning in neural networks, where models rely on superficial statistics instead of task-related semantics, hindering generalization and robustness. |
Computes Dominant Frequency Maps (DFMs) for each class, identifying frequency shortcuts. Augments training images by filtering their frequency spectrum using DFMs of other classes, forcing models to utilize a broader range of frequencies. |
DFM-X improves robustness against common corruptions and adversarial attacks without sacrificing accuracy on clean images.
Combining DFM-X with AugMix or AutoAugment further enhances robustness, indicating complementarity.
The percentage of images augmented by DFM-X (X) influences robustness, with lower-capacity models benefiting from higher values. |
Limited investigation into the interplay between model capacity, DFM-X augmentation percentage, and specific augmentation operations.
Further exploration of combining DFM-X with other augmentation techniques beyond AugMix and AutoAugment. |
shortcut learning, data augmentation, frequency analysis, robustness, generalization |
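The DFM-X entry above describes filtering an image's frequency spectrum with the Dominant Frequencies Map of another class. A minimal sketch of that filtering step is given below, using a naive 2D discrete Fourier transform so that it runs without external libraries. The binary mask layout and the function names are assumptions; how the maps are computed and which images are selected is not shown.

```python
import cmath
from typing import List, Sequence


def dft2(img: Sequence[Sequence[complex]], inverse: bool = False) -> List[List[complex]]:
    """Naive 2D DFT (O(N^4)); only suitable for tiny illustrative images."""
    h, w = len(img), len(img[0])
    sign = 1 if inverse else -1
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            acc = 0j
            for y in range(h):
                for x in range(w):
                    angle = sign * 2 * cmath.pi * (u * y / h + v * x / w)
                    acc += img[y][x] * cmath.exp(1j * angle)
            out[u][v] = acc / (h * w) if inverse else acc
    return out


def apply_dfm(img: Sequence[Sequence[float]],
              dfm_mask: Sequence[Sequence[int]]) -> List[List[float]]:
    """Keep only the frequencies where dfm_mask is 1, then return to pixel space.

    The real part is taken at the end; a mask that is not conjugate-symmetric
    would leave an imaginary residue that this sketch simply discards.
    """
    spectrum = dft2(img)
    kept = [[s if m else 0j for s, m in zip(sr, mr)]
            for sr, mr in zip(spectrum, dfm_mask)]
    return [[z.real for z in row] for row in dft2(kept, inverse=True)]
```

As a sanity check, a mask that keeps only the zero-frequency entry turns any image into a constant image equal to its mean, while an all-ones mask returns the image unchanged up to rounding.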
2308.06571
Report |
ModelScope Text-to-Video Technical Report |
Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, Shiwei Zhang |
This paper introduces ModelScopeT2V, a text-to-video synthesis model that
evolves from a text-to-image synthesis model (i.e., Stable Diffusion).
ModelScopeT2V incorporates spatio-temporal blocks to ensure consistent frame
generation and smooth movement transitions. The model could adapt to varying
frame numbers during training and inference, rendering it suitable for both
image-text and video-text datasets. ModelScopeT2V brings together three
components (i.e., VQGAN, a text encoder, and a denoising UNet), comprising
1.7 billion parameters in total, of which 0.5 billion parameters are
dedicated to temporal capabilities. The model demonstrates superior performance
over state-of-the-art methods across three evaluation metrics. The code and an
online demo are available at
https://modelscope.cn/models/damo/text-to-video-synthesis/summary. |
This paper introduces ModelScopeT2V, the first open-source text-to-video synthesis model based on diffusion models. |
Creating a publicly available and effective text-to-video synthesis model can catalyze further research efforts and advancements in video generation. |
ModelScopeT2V incorporates a spatio-temporal block into the diffusion-based UNet architecture to model temporal dependencies and is trained on both image-text and video-text datasets using a multi-frame training strategy. |
ModelScopeT2V demonstrates superior performance over state-of-the-art methods on FID-vid and FVD metrics.
ModelScopeT2V generates videos with diverse and dynamic motion.
The code and online demos are publicly available, fostering community engagement and novel applications. |
The model could be further enhanced by incorporating multi-condition approaches or LoRA techniques.
Future work could focus on generating longer videos with more semantic information. |
text-to-video synthesis, diffusion models, spatio-temporal modeling, multi-frame training, open-source |
2308.06548
Report |
Revisiting Vision Transformer from the View of Path Ensemble |
Shuning Chang, Pichao Wang, Hao Luo, Fan Wang, Mike Zheng Shou |
Vision Transformers (ViTs) are normally regarded as a stack of transformer
layers. In this work, we propose a novel view of ViTs showing that they can be
seen as ensemble networks containing multiple parallel paths with different
lengths. Specifically, we equivalently transform the traditional cascade of
multi-head self-attention (MSA) and feed-forward network (FFN) into three
parallel paths in each transformer layer. Then, we utilize the identity
connection in our new transformer form and further transform the ViT into an
explicit multi-path ensemble network. From the new perspective, these paths
perform two functions: the first is to provide the feature for the classifier
directly, and the second is to provide the lower-level feature representation
for subsequent longer paths. We investigate the influence of each path for the
final prediction and discover that some paths even pull down the performance.
Therefore, we propose the path pruning and EnsembleScale skills for
improvement, which cut out the underperforming paths and re-weight the ensemble
components, respectively, to optimize the path combination and make the short
paths focus on providing high-quality representation for subsequent paths. We
also demonstrate that our path combination strategies can help ViTs go deeper
and act as high-pass filters to filter out partial low-frequency signals. To
further enhance the representation of paths served for subsequent paths,
self-distillation is applied to transfer knowledge from the long paths to the
short paths. This work calls for more future research to explain and design
ViTs from new perspectives. |
This paper presents a novel perspective on Vision Transformers (ViTs), demonstrating that they can be interpreted as ensemble networks comprising multiple parallel paths of varying lengths. |
This ensemble view provides a new framework for understanding and optimizing ViTs by manipulating the contributions of individual paths. |
The authors mathematically decouple the traditional cascade of Multi-Head Self-Attention (MSA) and Feed-Forward Network (FFN) layers into parallel paths, leveraging identity connections to construct an explicit ensemble network. |
The analysis reveals that short paths often contribute minimally to the final prediction accuracy and may even hinder performance.
Path pruning, eliminating underperforming short paths, and EnsembleScale, re-weighting path contributions, are introduced to optimize path combination, leading to improved accuracy.
A self-distillation method is proposed to transfer knowledge from longer to shorter paths, further enhancing representation learning and boosting overall performance. |
The study focuses on image classification tasks, leaving the applicability of the ensemble view to other vision tasks for future investigation.
While the ensemble view provides a new perspective, exploring alternative path manipulation techniques beyond pruning and scaling could yield further insights. |
vision transformers, ensemble networks, path pruning, ensemblescale, self-distillation |
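One plausible reading of the "three parallel paths" in the entry above is that a residual layer computing y = x + MSA(x) followed by z = y + FFN(y) can be rewritten as the sum of an identity path, an MSA path, and an FFN path. The sketch below checks this equivalence numerically; `toy_msa` and `toy_ffn` are hypothetical stand-ins, and layer normalisation is omitted, so this is an illustration of the algebra rather than the paper's exact formulation.

```python
from typing import List

Vec = List[float]


def add(a: Vec, b: Vec) -> Vec:
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]


def toy_msa(x: Vec) -> Vec:
    # Hypothetical stand-in for attention: every token receives half the mean.
    mean = sum(x) / len(x)
    return [0.5 * mean for _ in x]


def toy_ffn(x: Vec) -> Vec:
    # Hypothetical stand-in for the feed-forward network: scaled ReLU.
    return [2.0 * max(v, 0.0) for v in x]


def layer_cascade(x: Vec) -> Vec:
    """Conventional view: two residual sub-blocks applied in sequence."""
    y = add(x, toy_msa(x))
    return add(y, toy_ffn(y))


def layer_three_paths(x: Vec) -> Vec:
    """Ensemble view: identity path + MSA path + FFN path, summed."""
    identity_path = x
    msa_path = toy_msa(x)
    ffn_path = toy_ffn(add(x, toy_msa(x)))
    return add(add(identity_path, msa_path), ffn_path)
```

Writing the layer this way makes each path's contribution an explicit summand, which is what allows a path to be removed or re-weighted independently.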
2308.06531
Report |
SegPrompt: Boosting Open-world Segmentation via Category-level Prompt Learning |
Muzhi Zhu, Hengtao Li, Hao Chen, Chengxiang Fan, Weian Mao, Chenchen Jing, Yifan Liu, Chunhua Shen |
Current closed-set instance segmentation models rely on pre-defined class
labels for each mask during training and evaluation, largely limiting their
ability to detect novel objects. Open-world instance segmentation (OWIS) models
address this challenge by detecting unknown objects in a class-agnostic manner.
However, previous OWIS approaches completely erase category information during
training to keep the model's ability to generalize to unknown objects. In this
work, we propose a novel training mechanism termed SegPrompt that uses category
information to improve the model's class-agnostic segmentation ability for both
known and unknown categories. In addition, the previous OWIS training setting
exposes the unknown classes to the training set and brings information leakage,
which is unreasonable in the real world. Therefore, we provide a new open-world
benchmark closer to a real-world scenario by dividing the dataset classes into
known-seen-unseen parts. For the first time, we focus on the model's ability to
discover objects that never appear in the training set images.
Experiments show that SegPrompt can improve the overall and unseen detection
performance by 5.6% and 6.1% in AR on our new benchmark without affecting the
inference efficiency. We further demonstrate the effectiveness of our method on
existing cross-dataset transfer and strongly supervised settings, leading to
5.5% and 12.3% relative improvement. |
Proposes SegPrompt, a category-level prompt learning method for open-world segmentation that boosts the segmentation performance on unseen categories by leveraging the knowledge from seen classes. |
Addresses the limitations of current open-world segmentation methods that struggle to generalize to novel categories not present in the training data. |
Introduces a new benchmark, LVIS-OW, to evaluate open-world segmentation by dividing categories into known, seen, and unseen sets. Employs category-level prompt learning to transfer knowledge from seen categories to unseen ones during training. |
Demonstrates the effectiveness of category-level prompt learning in improving segmentation performance on unseen categories.
Establishes a new benchmark, LVIS-OW, for evaluating open-world segmentation with a focus on unseen categories.
Highlights the importance of considering semantic overlap between seen and unseen categories in open-world segmentation. |
Limited evaluation of SegPrompt on other query-based models besides Mask2Former.
Reliance on the availability of a sufficient number of seen categories for effective knowledge transfer. |
open-world segmentation, prompt learning, unseen object segmentation, long-tailed recognition, lvis-ow benchmark |
2308.06412
Report |
Taming Self-Training for Open-Vocabulary Object Detection |
Shiyu Zhao, Samuel Schulter, Long Zhao, Zhixing Zhang, Vijay Kumar B. G, Yumin Suh, Manmohan Chandraker, Dimitris N. Metaxas |
Recent studies have shown promising performance in open-vocabulary object
detection (OVD) by utilizing pseudo labels (PLs) from pretrained vision and
language models (VLMs). However, teacher-student self-training, a powerful and
widely used paradigm to leverage PLs, is rarely explored for OVD. This work
identifies two challenges of using self-training in OVD: noisy PLs from VLMs
and frequent distribution changes of PLs. To address these challenges, we
propose SAS-Det that tames self-training for OVD from two key perspectives.
First, we present a split-and-fusion (SAF) head that splits a standard
detection head into an open-branch and a closed-branch. This design can reduce noisy
supervision from pseudo boxes. Moreover, the two branches learn complementary
knowledge from different training data, significantly enhancing performance
when fused together. Second, in our view, unlike in closed-set tasks, the PL
distributions in OVD are solely determined by the teacher model. We introduce a
periodic update strategy to decrease the number of updates to the teacher,
thereby decreasing the frequency of changes in PL distributions, which
stabilizes the training process. Extensive experiments demonstrate SAS-Det is
both efficient and effective. SAS-Det outperforms recent models of the same
scale by a clear margin and achieves 37.4 AP50 and 29.1 APr on novel categories
of the COCO and LVIS benchmarks, respectively. Code is available at
https://github.com/xiaofeng94/SAS-Det. |
This paper proposes SAS-Det, a novel open-vocabulary object detection method leveraging self-training with a split-and-fusion head and periodic teacher updates to address challenges of noisy pseudo labels and distribution shifts from pretrained vision and language models. |
Open-Vocabulary Detection (OVD) aims to detect objects from novel categories without specific training examples, demanding efficient utilization of pseudo labels from pretrained Vision and Language Models (VLMs). This work tackles the challenges of noisy pseudo labels and distribution shifts in self-training for OVD, crucial for accurate and robust detection in open-world scenarios. |
SAS-Det introduces a split-and-fusion head dividing detection into open and closed branches to mitigate noisy supervision from pseudo boxes. It also employs a periodic teacher update strategy to stabilize training by reducing the frequency of pseudo label distribution changes. |
SAS-Det achieves state-of-the-art performance on COCO and LVIS OVD benchmarks.
Ablation studies demonstrate the effectiveness of the split-and-fusion head and periodic updates.
The method shows promising efficiency in pseudo labeling compared to prior art. |
Self-training with a teacher model increases GPU memory consumption.
Online pseudo labeling, although faster than previous methods, still adds overhead to the training process. |
open-vocabulary object detection, self-training, pseudo labels, vision and language models, split-and-fusion head |
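The periodic teacher update described in the SAS-Det entry above can be illustrated with a minimal sketch. It assumes the update is a plain copy of the student weights every `update_every` steps; that rule, the function `train_with_periodic_teacher`, and the toy dict "models" are hypothetical and not taken from the paper's code.

```python
import copy


def train_with_periodic_teacher(student, num_steps, update_every, train_step):
    """Keep the teacher frozen between periodic copies of the student."""
    teacher = copy.deepcopy(student)
    num_teacher_updates = 0
    for step in range(1, num_steps + 1):
        # The frozen teacher would produce pseudo labels inside train_step,
        # so the pseudo-label distribution only changes when the teacher does.
        train_step(student, teacher)
        if step % update_every == 0:
            teacher = copy.deepcopy(student)  # assumed update rule: plain copy
            num_teacher_updates += 1
    return teacher, num_teacher_updates


def toy_step(student, teacher):
    """Hypothetical training step: nudges the single student weight."""
    student["w"] += 1.0


student = {"w": 0.0}
teacher, n_updates = train_with_periodic_teacher(
    student, num_steps=10, update_every=4, train_step=toy_step
)
print(student["w"], teacher["w"], n_updates)
```

With 10 steps and an update period of 4, the teacher changes only twice, whereas a per-step update would change the pseudo-label source ten times.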
2308.06248
Report |
FunnyBirds: A Synthetic Vision Dataset for a Part-Based Analysis of Explainable AI Methods |
Robin Hesse, Simone Schaub-Meyer, Stefan Roth |
The field of explainable artificial intelligence (XAI) aims to uncover the
inner workings of complex deep neural models. While being crucial for
safety-critical domains, XAI inherently lacks ground-truth explanations, making
its automatic evaluation an unsolved problem. We address this challenge by
proposing a novel synthetic vision dataset, named FunnyBirds, and accompanying
automatic evaluation protocols. Our dataset allows performing semantically
meaningful image interventions, e.g., removing individual object parts, which
has three important implications. First, it enables analyzing explanations on a
part level, which is closer to human comprehension than existing methods that
evaluate on a pixel level. Second, by comparing the model output for inputs
with removed parts, we can estimate ground-truth part importances that should
be reflected in the explanations. Third, by mapping individual explanations
into a common space of part importances, we can analyze a variety of different
explanation types in a single common framework. Using our tools, we report
results for 24 different combinations of neural models and XAI methods,
demonstrating the strengths and weaknesses of the assessed methods in a fully
automatic and systematic manner. |
This paper introduces "FunnyBirds," a synthetic vision dataset specifically designed for the quantitative evaluation and analysis of explainable AI (XAI) methods. |
Evaluating XAI methods is challenging due to the lack of ground-truth explanations. Existing automatic evaluation methods often rely on pixel-level interventions, which are not aligned with human perception and can introduce domain shifts. |
The authors create a synthetic dataset of bird images with controllable features (beak, wings, feet, eyes, tail). They propose a multi-dimensional analysis framework (FunnyBirds framework) with six evaluation protocols covering completeness, correctness, and contrastivity of explanations. They also showcase custom analyses for deeper insights into specific XAI methods. |
Methods relying on simpler model structures like BagNet achieve higher explainability scores.
Integrated Gradients performs best among model-agnostic methods across different backbones.
The study reveals weaknesses in the ability of assessed XAI methods to reliably communicate the relative importance of input features, particularly in terms of correctness. |
The synthetic nature of the dataset might not fully represent real-world image complexities.
The framework currently focuses on a subset of explainability dimensions, omitting aspects like compactness and confidence. |
explainable ai, xai evaluation, synthetic datasets, computer vision, deep learning |
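The part-removal idea in the FunnyBirds entry above (estimating ground-truth part importance by comparing model output with and without a part) can be sketched as follows. The scoring rule (drop in the target-class score), the toy model, and the evidence values are assumptions made for illustration; only the five part names come from the entry.

```python
def part_importances(model, parts):
    """Score each part by the drop in model output when that part is removed."""
    full_score = model(parts)
    return {
        removed: full_score - model(parts - {removed})
        for removed in sorted(parts)
    }


# Hypothetical per-part evidence for the target class.
EVIDENCE = {"beak": 0.5, "wings": 0.25, "feet": 0.0, "eyes": 0.125, "tail": 0.125}


def toy_model(present_parts):
    """Toy stand-in for a classifier: sums the evidence of the visible parts."""
    return sum(EVIDENCE[p] for p in present_parts)


importances = part_importances(toy_model, frozenset(EVIDENCE))
most_important = max(importances, key=importances.get)
print(importances, most_important)
```

An explanation method could then be checked against these scores, for example by testing whether it also ranks the beak highest for this toy model.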
2308.06160
Report |
DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models |
Weijia Wu, Yuzhong Zhao, Hao Chen, Yuchao Gu, Rui Zhao, Yefei He, Hong Zhou, Mike Zheng Shou, Chunhua Shen |
Current deep networks are very data-hungry and benefit from training on
large-scale datasets, which are often time-consuming to collect and annotate. By
contrast, synthetic data can be generated infinitely using generative models
such as DALL-E and diffusion models, with minimal effort and cost. In this
paper, we present DatasetDM, a generic dataset generation model that can
produce diverse synthetic images and the corresponding high-quality perception
annotations (e.g., segmentation masks, and depth). Our method builds upon the
pre-trained diffusion model and extends text-guided image synthesis to
perception data generation. We show that the rich latent code of the diffusion
model can be effectively decoded as accurate perception annotations using a
decoder module. Training the decoder requires manually labeling less than 1% of
the images (around 100), enabling the generation of an infinitely large
annotated dataset. Then these synthetic data can be used for training various
perception models for downstream tasks. To showcase the power of the proposed
approach, we generate datasets with rich dense pixel-wise labels for a wide
range of downstream tasks, including semantic segmentation, instance
segmentation, and depth estimation. Notably, it achieves 1) state-of-the-art
results on semantic segmentation and instance segmentation; 2) significantly
more robust domain generalization than using the real data alone, together with
state-of-the-art results in the zero-shot segmentation setting; and 3) flexibility
for efficient application and novel task composition (e.g., image editing). The
project website and code can be found at
https://weijiawu.github.io/DatasetDM_page/ and
https://github.com/showlab/DatasetDM, respectively. |
Presents DatasetDM, a text-to-data generation model that leverages pre-trained diffusion models to produce synthetic images with diverse perception annotations (e.g., segmentation masks, depth) using minimal manually labeled data. |
Addresses the data-hungry nature of deep learning models for perception tasks by enabling the generation of infinitely large annotated datasets with minimal cost and effort. |
Trains a unified perception decoder (P-Decoder) on a small set of real images paired with their latent representations extracted from a pre-trained diffusion model using diffusion inversion. Employs GPT-4 to enhance prompt diversity and guide data generation. |
Achieves state-of-the-art results on semantic and instance segmentation tasks.
Exhibits significantly improved robustness in domain generalization compared to using real data alone.
Offers flexibility for novel task composition, such as image editing. |
The quality and complexity of synthesized data are limited by the capabilities of the base diffusion model.
Further improvements in prompt generation efficiency and domain-specific prompt design are possible. |
synthetic data generation, text-to-image synthesis, perception tasks, diffusion models, domain generalization |
2308.06101
Report |
Taming the Power of Diffusion Models for High-Quality Virtual Try-On with Appearance Flow |
Junhong Gou, Siyu Sun, Jianfu Zhang, Jianlou Si, Chen Qian, Liqing Zhang |
Virtual try-on is a critical image synthesis task that aims to transfer
clothes from one image to another while preserving the details of both humans
and clothes. While many existing methods rely on Generative Adversarial
Networks (GANs) to achieve this, flaws can still occur, particularly at high
resolutions. Recently, the diffusion model has emerged as a promising
alternative for generating high-quality images in various applications.
However, simply using clothes as a condition for guiding the diffusion model to
inpaint is insufficient to maintain the details of the clothes. To overcome
this challenge, we propose an exemplar-based inpainting approach that leverages
a warping module to guide the diffusion model's generation effectively. The
warping module performs initial processing on the clothes, which helps to
preserve the local details of the clothes. We then combine the warped clothes
with the clothes-agnostic person image and add noise to form the input of the diffusion
model. Additionally, the warped clothes are used as local conditions for each
denoising process to ensure that the resulting output retains as much detail as
possible. Our approach, namely Diffusion-based Conditional Inpainting for
Virtual Try-ON (DCI-VTON), effectively utilizes the power of the diffusion
model, and the incorporation of the warping module helps to produce
high-quality and realistic virtual try-on results. Experimental results on
VITON-HD demonstrate the effectiveness and superiority of our method. |
This paper presents DCI-VTON, a novel diffusion model-based virtual try-on framework that utilizes appearance flow for high-quality image synthesis. |
Existing GAN-based virtual try-on methods often struggle to maintain detail, particularly at high resolutions, highlighting the need for more robust approaches. Diffusion models offer an appealing alternative with superior generative capabilities. |
DCI-VTON consists of two main modules: 1) a warping module that predicts an appearance flow field to align clothes to the target person's pose, generating a coarse composite image. 2) a refinement module that leverages a diffusion model to refine the initial result using warped clothes as local conditions during denoising. |
DCI-VTON outperforms previous state-of-the-art virtual try-on methods on standard benchmarks like VITON-HD across various resolutions.
The inclusion of a warping module proves beneficial, particularly in challenging scenarios involving significant pose changes.
Ablation studies demonstrate the complementary nature of global, local, and initial conditions in guiding the diffusion model's generation process. |
The model currently focuses on trying on upper-body garments, leaving the extension to full-body outfits for future exploration.
While DCI-VTON effectively handles various clothes styles, addressing highly intricate designs or extreme poses remains an area for improvement. |
virtual try-on, diffusion models, appearance flow, high-resolution image synthesis, conditional image generation |
2308.06097
Report |
RIGID: Recurrent GAN Inversion and Editing of Real Face Videos |
Yangyang Xu, Shengfeng He, Kwan-Yee K. Wong, Ping Luo |
GAN inversion is indispensable for applying the powerful editability of GAN
to real images. However, existing methods invert video frames individually,
often leading to undesired inconsistent results over time. In this paper, we
propose a unified recurrent framework, named Recurrent vIdeo
GAN Inversion and eDiting (RIGID), to explicitly and
simultaneously enforce temporally coherent GAN inversion and facial editing of
real videos. Our approach models the temporal relations between current and
previous frames from three aspects. To enable a faithful real video
reconstruction, we first maximize the inversion fidelity and consistency by
learning a temporal compensated latent code. Second, we observe that incoherent
noise lies in the high-frequency domain and can be disentangled from the
latent space. Third, to remove the inconsistency after attribute manipulation,
we propose an in-between frame composition constraint such that
any frame must be a direct composite of its neighboring frames. Our
unified framework learns the inherent coherence between input frames in an
end-to-end manner, and therefore it is agnostic to a specific attribute and can
be applied to arbitrary editing of the same video without re-training.
Extensive experiments demonstrate that RIGID outperforms state-of-the-art
methods qualitatively and quantitatively in both inversion and editing tasks.
The deliverables can be found at https://cnnlstm.github.io/RIGID. |
Proposes RIGID, a recurrent framework for temporally coherent GAN inversion and facial editing of real videos. |
Existing methods struggle to maintain temporal consistency when inverting and editing videos using GANs, leading to unrealistic and disjointed results. |
A recurrent encoder learns temporal compensated latent codes and disentangles high-frequency artifacts for coherent inversion. A novel in-between frame composition constraint enforces smoothness in edited videos. |
Achieves comparable or better results in video inversion quality and temporal coherence than optimization-based methods (e.g., STIT) with significantly faster inference times.
Enables attribute-agnostic editing, allowing various edits on a single video without retraining.
Outperforms competitors in maintaining temporal coherence and identity preservation during video editing, as evidenced by quantitative metrics and visual comparisons. |
Limited editing capability for hair portions outside the cropped face region.
Higher GPU memory requirements during training compared to some alternatives. |
gan inversion, video editing, temporal coherence, recurrent neural networks, generative adversarial networks |
2308.06093
Report |
Experts Weights Averaging: A New General Training Scheme for Vision Transformers |
Yongqi Huang, Peng Ye, Xiaoshui Huang, Sheng Li, Tao Chen, Tong He, Wanli Ouyang |
Structural re-parameterization is a general training scheme for Convolutional
Neural Networks (CNNs), which achieves performance improvement without
increasing inference cost. As Vision Transformers (ViTs) are gradually
surpassing CNNs in various visual tasks, one may ask: does a training scheme
specifically for ViTs exist that can also achieve performance improvement
without increasing inference cost? Recently, Mixture-of-Experts (MoE) has
attracted increasing attention, as it can efficiently scale up the capacity of
Transformers at a fixed cost through sparsely activated experts. Considering
that MoE can also be viewed as a multi-branch structure, can we utilize MoE to
implement a ViT training scheme similar to structural re-parameterization? In
this paper, we affirmatively answer these questions, with a new general
training strategy for ViTs. Specifically, we decouple the training and
inference phases of ViTs. During training, we replace some Feed-Forward
Networks (FFNs) of the ViT with specially designed, more efficient MoEs that
assign tokens to experts by random uniform partition, and perform Experts
Weights Averaging (EWA) on these MoEs at the end of each iteration. After
training, we convert each MoE into an FFN by averaging the experts,
transforming the model back into the original ViT for inference. We further provide
a theoretical analysis to show why and how it works. Comprehensive experiments
across various 2D and 3D visual tasks, ViT architectures, and datasets validate
the effectiveness and generalizability of the proposed training scheme.
Besides, our training scheme can also be applied to improve performance when
fine-tuning ViTs. Lastly, but equally important, the proposed EWA technique can
significantly improve the effectiveness of naive MoE in various 2D visual small
datasets and 3D visual tasks. |
This paper introduces Experts Weights Averaging (EWA), a novel training scheme for Vision Transformers (ViTs) that improves performance without increasing inference cost. |
Existing methods for improving model performance, such as structural re-parameterization and Mixture-of-Experts (MoE), are either limited to CNNs or introduce significant computational overhead during inference. EWA aims to overcome these limitations. |
EWA decouples training and inference phases: During training, it replaces some ViT feed-forward networks (FFNs) with specially designed, more efficient MoEs using random uniform partition. It then averages expert weights after each training iteration. During inference, each MoE is converted back into a single FFN by averaging its expert weights. |
EWA training consistently improves the performance of various ViT architectures on diverse 2D and 3D visual tasks and datasets.
EWA fine-tuning further enhances the performance of pre-trained ViT models.
Experts Weights Averaging significantly improves the effectiveness of naive MoE, particularly on small 2D visual datasets and 3D visual tasks where naive MoE struggles. |
The optimal share rate for Experts Weights Averaging needs to be determined for each ViT architecture.
The paper primarily focuses on image classification and semantic segmentation tasks. Exploring EWA's applicability to other vision tasks like object detection and instance segmentation is a potential avenue for future research. |
vision transformer, mixture-of-experts, structural re-parameterization, weight averaging, deep learning |
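The two averaging operations in the EWA entry above (sharing weights among experts after each iteration, and collapsing the experts into one FFN after training) can be sketched on plain nested lists. The specific sharing rule below, which pulls each expert toward the mean of all experts by a `share_rate`, is an assumption chosen for illustration; the paper's exact update may differ. The function names are hypothetical.

```python
def average_experts(experts):
    """Elementwise mean of same-shaped weight matrices (the inference-time merge)."""
    n = len(experts)
    rows, cols = len(experts[0]), len(experts[0][0])
    return [
        [sum(e[r][c] for e in experts) / n for c in range(cols)]
        for r in range(rows)
    ]


def ewa_step(experts, share_rate):
    """Assumed training-time rule: move each expert toward the mean of all experts."""
    mean = average_experts(experts)
    return [
        [
            [(1 - share_rate) * w + share_rate * m for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(expert, mean)
        ]
        for expert in experts
    ]


experts = [[[1.0, 2.0]], [[3.0, 6.0]]]  # two experts, each a 1x2 weight matrix
shared = ewa_step(experts, share_rate=0.5)
merged = average_experts(shared)
print(shared, merged)
```

Under this assumed rule the sharing step leaves the mean of the experts unchanged, so the merged FFN after training is the same average that the experts were being drawn toward.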
2308.06038
Report |
Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning |
Chun-Mei Feng, Kai Yu, Yong Liu, Salman Khan, Wangmeng Zuo |
Benefiting from prompt tuning, recent years have witnessed the promising
performance of pre-trained vision-language models, e.g., CLIP, on versatile
downstream tasks. In this paper, we focus on a particular setting of learning
adaptive prompts on the fly for each test sample from an unseen new domain,
which is known as test-time prompt tuning (TPT). Existing TPT methods typically
rely on data augmentation and confidence selection. However, conventional data
augmentation techniques, e.g., random resized crops, suffer from a lack of
data diversity, while entropy-based confidence selection alone is not
sufficient to guarantee prediction fidelity. To address these issues, we
propose a novel TPT method, named DiffTPT, which leverages pre-trained
diffusion models to generate diverse and informative new data. Specifically, we
incorporate augmented data from both conventional methods and pre-trained Stable
Diffusion to exploit their respective merits, improving the model's ability to
adapt to unknown new test data. Moreover, to ensure the prediction fidelity of
generated data, we introduce a cosine similarity-based filtration technique to
select the generated data with higher similarity to the single test sample. Our
experiments on test datasets with distribution shifts and unseen categories
demonstrate that DiffTPT improves the zero-shot accuracy by an average of
5.13% compared to the state-of-the-art TPT method. Our code and models will be
publicly released. |
This paper proposes DiffTPT, a novel test-time prompt tuning method that leverages pre-trained diffusion models for diverse data augmentation, enhancing the performance of pre-trained vision-language models on unseen domains. |
Existing test-time prompt tuning methods suffer from limited data diversity and insufficient prediction fidelity when adapting to new domains. |
DiffTPT employs Stable Diffusion to generate diverse augmented images from test samples and introduces a cosine similarity-based filtration to remove spurious augmentations, balancing data diversity and prediction fidelity. |
DiffTPT achieves state-of-the-art zero-shot accuracy, outperforming existing test-time prompt tuning methods by an average of 5.13%.
Combining DiffTPT with few-shot prompt tuning methods further improves both in-domain and out-of-distribution performance.
Ablation studies confirm the effectiveness of diffusion-based augmentation, cosine similarity filtration, and the impact of augmented data size and prompt updating steps. |
The inference speed of using diffusion models for data augmentation can be further improved.
Other filtration techniques, or combinations of multiple metrics, could further enhance prediction fidelity. |
test-time prompt tuning, diffusion models, data augmentation, zero-shot learning, vision-language models |
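The cosine similarity-based filtration in the DiffTPT entry above can be sketched as a top-k selection over feature vectors. The feature values, the `keep` parameter, and the function names are hypothetical; in practice the features would come from an image encoder such as CLIP's, which is not modeled here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def filter_augmentations(test_feature, aug_features, keep):
    """Return indices of the `keep` augmentations most similar to the test sample."""
    ranked = sorted(
        range(len(aug_features)),
        key=lambda i: cosine_similarity(test_feature, aug_features[i]),
        reverse=True,
    )
    return sorted(ranked[:keep])


test_feature = [1.0, 0.0]
aug_features = [[0.9, 0.1], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
kept = filter_augmentations(test_feature, aug_features, keep=2)
print(kept)
```

Here the orthogonal and opposite-direction augmentations (indices 1 and 3) are discarded, which mirrors the goal of dropping generated images that drift away from the test sample.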
2308.06027
Report |
Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation |
Yuki Endo |
Text-to-image synthesis has achieved high-quality results with recent
advances in diffusion models. However, text input alone has high spatial
ambiguity and limited user controllability. Most existing methods allow spatial
control through additional visual guidance (e.g., sketches and semantic masks)
but require additional training with annotated images. In this paper, we
propose a method for spatially controlling text-to-image generation without
further training of diffusion models. Our method is based on the insight that
the cross-attention maps reflect the positional relationship between words and
pixels. Our aim is to control the attention maps according to given semantic
masks and text prompts. To this end, we first explore a simple approach of
directly swapping the cross-attention maps with constant maps computed from the
semantic regions. Some prior works also allow training-free spatial control of
text-to-image diffusion models by directly manipulating cross-attention maps.
However, these approaches still suffer from misalignment to given masks because
manipulated attention maps are far from actual ones learned by diffusion
models. To address this issue, we propose masked-attention guidance, which can
generate images more faithful to semantic masks via indirect control of
attention to each word and pixel by manipulating noise images fed to diffusion
models. Masked-attention guidance can be easily integrated into pre-trained
off-the-shelf diffusion models (e.g., Stable Diffusion) and applied to the
tasks of text-guided image editing. Experiments show that our method enables
more accurate spatial control than baselines qualitatively and quantitatively. |
This paper proposes "masked-attention guidance," a training-free method to spatially control text-to-image generation in diffusion models using visual guidance like semantic masks. |
Text-to-image generation lacks controllability due to the spatial ambiguity of text descriptions. Existing methods for spatial control often require costly additional training. |
The method leverages cross-attention maps in diffusion models, which reflect word-pixel relationships. It manipulates noise maps fed to the model to indirectly guide the cross-attention towards user-specified regions in the semantic mask. |
Quantitative evaluation on COCO dataset shows significant improvement in mask alignment (mIoU) compared to training-free baselines.
Qualitative results demonstrate better alignment with semantic masks and ability to handle diverse and challenging inputs.
Analysis of cross-attention maps shows that the method effectively guides attention estimation in the diffusion model. |
The method struggles with small or detailed regions in the semantic mask.
Strong guidance can sometimes lead to unnatural image generation.
Future work includes handling other visual guidance types like scribbles and addressing limitations in handling small regions. |
text-to-image generation, diffusion models, spatial control, semantic masks, cross-attention |
2308.06015
Report |
Enhancing Generalization of Universal Adversarial Perturbation through Gradient Aggregation |
Xuannan Liu, Yaoyao Zhong, Yuhang Zhang, Lixiong Qin, Weihong Deng |
Deep neural networks are vulnerable to universal adversarial perturbation
(UAP), an instance-agnostic perturbation capable of fooling the target model
for most samples. Compared to instance-specific adversarial examples, UAP is
more challenging as it needs to generalize across various samples and models.
In this paper, we examine the serious dilemma of UAP generation methods from a
generalization perspective -- the gradient vanishing problem using small-batch
stochastic gradient optimization and the local optima problem using large-batch
optimization. To address these problems, we propose a simple and effective
method called Stochastic Gradient Aggregation (SGA), which alleviates the
gradient vanishing and escapes from poor local optima at the same time.
Specifically, SGA employs the small-batch training to perform multiple
iterations of inner pre-search. Then, all the inner gradients are aggregated as
a one-step gradient estimation to enhance the gradient stability and reduce
quantization errors. Extensive experiments on the standard ImageNet dataset
demonstrate that our method significantly enhances the generalization ability
of UAP and outperforms other state-of-the-art methods. The code is available at
https://github.com/liuxuannan/Stochastic-Gradient-Aggregation. |
This paper proposes Stochastic Gradient Aggregation (SGA), a novel method to enhance the generalization ability of Universal Adversarial Perturbations (UAPs). |
Existing UAP generation methods suffer from either gradient vanishing with small-batch training or sub-optimal generalization with large-batch training. This paper addresses this dilemma to improve UAP's generalization across diverse samples and models. |
SGA employs inner-outer iterations. It conducts pre-search with multiple inner iterations using small-batch samples. Then, it aggregates all inner gradients to update UAP with a one-step gradient estimation, enhancing gradient stability and reducing quantization errors. |
SGA outperforms state-of-the-art methods in the white-box setting, achieving a higher fooling ratio across five tested models.
SGA also significantly improves the fooling ratio in the black-box setting, demonstrating better cross-model generalization ability.
SGA maintains superior performance in limited-sample settings, effectively crafting UAPs with only 500 training samples. |
The paper primarily focuses on the ImageNet dataset, limiting the generalizability of the findings.
Future work could explore the effectiveness of SGA in conjunction with other advanced gradient optimization techniques. |
universal adversarial perturbation, generalization, gradient vanishing, quantization error, stochastic gradient aggregation |
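The inner-outer scheme in the SGA entry above can be sketched on a toy problem. The assumptions are stated loudly: a fixed linear classifier `W`, a hinge loss, a sign-based step, and clipping to an `eps` box are all illustrative choices, and `sga_outer_step` is a hypothetical name rather than the authors' implementation.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def clip(v, eps):
    """Keep one perturbation coordinate inside [-eps, eps]."""
    return max(-eps, min(eps, v))


W = [1.0, -1.0]  # hypothetical fixed linear classifier


def hinge_grad(delta, batch):
    """Mean gradient of max(0, 1 - y * W.(x + delta)) with respect to delta."""
    grad = [0.0] * len(delta)
    for x, y in batch:
        score = sum(w * (xi + di) for w, xi, di in zip(W, x, delta))
        if 1.0 - y * score > 0.0:  # the loss is flat (zero gradient) otherwise
            grad = [g - y * w for g, w in zip(grad, W)]
    return [g / len(batch) for g in grad]


def sga_outer_step(delta, batches, grad_fn, eps, inner_lr, outer_lr):
    """Inner pre-search on small batches, then one aggregated sign step."""
    inner = list(delta)
    aggregated = [0.0] * len(delta)
    for batch in batches:
        g = grad_fn(inner, batch)
        aggregated = [a + gi for a, gi in zip(aggregated, g)]
        # The pre-search moves a temporary copy, not the real perturbation.
        inner = [clip(d + inner_lr * sign(gi), eps) for d, gi in zip(inner, g)]
    return [clip(d + outer_lr * sign(a), eps) for d, a in zip(delta, aggregated)]


batches = [[([0.5, 0.0], 1)], [([1.25, 0.0], 1)]]
new_delta = sga_outer_step(
    [0.0, 0.0], batches, hinge_grad, eps=0.5, inner_lr=0.25, outer_lr=0.25
)
print(new_delta)
```

In this toy setup the second batch has a zero gradient at the starting perturbation, but contributes a non-zero gradient once the inner pre-search has moved the temporary copy, which is the kind of vanishing-gradient case the entry describes.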
2308.05739
Report |
Zero Grads: Learning Local Surrogate Losses for Non-Differentiable Graphics |
Michael Fischer, Tobias Ritschel |
Gradient-based optimization is now ubiquitous across graphics, but
unfortunately cannot be applied to problems with undefined or zero gradients.
To circumvent this issue, the loss function can be manually replaced by a
"surrogate" that has similar minima but is differentiable. Our proposed
framework, ZeroGrads, automates this process by learning a neural approximation
of the objective function, which in turn can be used to differentiate through
arbitrary black-box graphics pipelines. We train the surrogate on an actively
smoothed version of the objective and encourage locality, focusing the
surrogate's capacity on what matters at the current training episode. The
fitting is performed online, alongside the parameter optimization, and
self-supervised, without pre-computed data or pre-trained models. As sampling
the objective is expensive (it requires a full rendering or simulator run), we
devise an efficient sampling scheme that allows for tractable run-times and
competitive performance at little overhead. We demonstrate optimizing diverse
non-convex, non-differentiable black-box problems in graphics, such as
visibility in rendering, discrete parameter spaces in procedural modelling or
optimal control in physics-driven animation. In contrast to other
derivative-free algorithms, our approach scales well to higher dimensions,
which we demonstrate on problems with up to 35k interlinked variables. |
This paper introduces a novel optimization method that replaces gradient calculations with a learned surrogate model, allowing for gradient-free optimization in various computer graphics applications. |
The method addresses the limitations of traditional gradient-based optimization techniques in scenarios where gradients are unavailable or computationally expensive, expanding the scope of optimizable problems in computer graphics. |
The method trains a neural network to approximate the local behavior of the objective function around a current parameter set. This surrogate model then provides gradient estimates for optimization, avoiding the need for explicit gradient calculations. |
The method demonstrates comparable performance to gradient-based methods on low-dimensional tasks.
It scales effectively to high-dimensional problems involving tens of thousands of parameters, surpassing traditional gradient-free algorithms.
The approach proves successful across various computer graphics applications, including texture optimization, mesh reconstruction, and caustic rendering. |
The method's performance might be sensitive to the choice of hyperparameters, such as the spread of the locality kernel and the number of samples used for gradient estimation.
Further investigation is needed to explore the generalization capabilities of the learned surrogate models across different problem instances. |
gradient-free optimization, surrogate modeling, computer graphics, high-dimensional optimization, differentiable rendering |
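The local-surrogate idea in the ZeroGrads entry above can be illustrated in one dimension. This sketch swaps in a much simpler technique than the paper's: a linear least-squares fit on a fixed symmetric stencil of offsets replaces the neural surrogate and the sampling scheme, so that the run is deterministic. The staircase objective, the step size, and all names are hypothetical.

```python
import math


def objective(theta):
    """Staircase loss: piecewise constant, so its gradient is zero almost everywhere."""
    return abs(math.floor(theta) - 3)


def surrogate_slope(f, theta, offsets):
    """Slope of a local linear least-squares fit to f around theta."""
    values = [f(theta + o) for o in offsets]
    mean_o = sum(offsets) / len(offsets)
    mean_v = sum(values) / len(values)
    cov = sum((o - mean_o) * (v - mean_v) for o, v in zip(offsets, values))
    var = sum((o - mean_o) ** 2 for o in offsets)
    return cov / var


theta = 0.25
offsets = [-1.0, -0.5, 0.0, 0.5, 1.0]  # fixed stencil standing in for sampling
initial_loss = objective(theta)
for _ in range(12):
    # Descend along the surrogate's slope instead of the (zero) true gradient.
    theta -= 0.5 * surrogate_slope(objective, theta, offsets)
final_loss = objective(theta)
print(initial_loss, final_loss)
```

A finite-difference gradient of the staircase at the starting point is exactly zero, so plain gradient descent would not move, while the locally fitted slope carries the parameter onto the zero-loss plateau.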
2308.05733
Report |
FrozenRecon: Pose-free 3D Scene Reconstruction with Frozen Depth Models |
Guangkai Xu, Wei Yin, Hao Chen, Chunhua Shen, Kai Cheng, Feng Zhao |
3D scene reconstruction is a long-standing vision task. Existing approaches
can be categorized into geometry-based and learning-based methods. The former
leverages multi-view geometry but can face catastrophic failures due to the
reliance on accurate pixel correspondence across views. The latter was
proposed to mitigate these issues by learning 2D or 3D representations
directly. However, without large-scale video or 3D training data, it can
hardly generalize to diverse real-world scenarios due to the presence of tens
of millions or even billions of optimization parameters in the deep network.
Recently, robust monocular depth estimation models trained with large-scale
datasets have been shown to possess a weak 3D geometry prior, but they are
insufficient for reconstruction due to the unknown camera parameters, the
affine-invariant property, and inter-frame inconsistency. Here, we propose a
novel test-time optimization approach that can transfer the robustness of
affine-invariant depth models such as LeReS to challenging diverse scenes while
ensuring inter-frame consistency, with only dozens of parameters to optimize
per video frame. Specifically, our approach involves freezing the pre-trained
affine-invariant depth model's depth predictions, rectifying them by optimizing
the unknown scale-shift values with a geometric consistency alignment module,
and employing the resulting scale-consistent depth maps to robustly obtain
camera poses and achieve dense scene reconstruction, even in low-texture
regions. Experiments show that our method achieves state-of-the-art
cross-dataset reconstruction on five zero-shot testing datasets. |
This paper presents FrozenRecon, a novel test-time optimization approach for robust and efficient 3D scene reconstruction from monocular videos. It leverages the robustness of pre-trained affine-invariant depth models while ensuring inter-frame consistency. |
Existing 3D reconstruction methods, whether geometry-based or learning-based, often face limitations such as reliance on accurate pixel correspondence, need for large-scale training data, or susceptibility to low-texture regions. FrozenRecon addresses these limitations by efficiently transferring the robustness of pre-trained depth models to diverse real-world scenes. |
FrozenRecon freezes a pre-trained affine-invariant depth model and optimizes a sparse set of parameters (scale, shift, weight factors) per frame to achieve scale-consistent depth maps. It jointly optimizes camera poses and intrinsic parameters alongside depth, guided by photometric and geometric consistency constraints. |
FrozenRecon achieves state-of-the-art cross-dataset reconstruction performance on five unseen datasets, outperforming previous methods in terms of accuracy and robustness.
The method is efficient, requiring optimization of only dozens of parameters per frame, unlike learning-based methods with millions of parameters.
FrozenRecon demonstrates strong generalization ability, effectively reconstructing diverse scenes without requiring offline-acquired camera parameters. |
The method assumes a pinhole camera model, which may limit its accuracy in scenarios with significant lens distortion.
Future work could explore incorporating semantic information or multi-scale features to further enhance reconstruction quality, especially in challenging low-texture or dynamic environments. |
3d scene reconstruction, monocular video, affine-invariant depth, test-time optimization, geometric consistency |
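The scale-shift rectification described in the record above can be sketched in its simplest form. An affine-invariant depth map is correct only up to an unknown scale and shift, so the two values can be recovered by least squares once some reference signal is available. This sketch assumes a reference depth map as that signal; FrozenRecon instead optimizes these values through geometric consistency across frames, so the code shows the alignment step only, and `fit_scale_shift` is a hypothetical name.

```python
import numpy as np

def fit_scale_shift(pred, ref, mask=None):
    """Solve min over (s, t) of || s * pred + t - ref ||^2 on valid pixels.

    Assumption: `ref` is a reference depth map standing in for the
    geometric-consistency signal used in the paper.
    """
    pred = np.asarray(pred, dtype=float).ravel()
    ref = np.asarray(ref, dtype=float).ravel()
    if mask is not None:
        keep = np.asarray(mask, dtype=bool).ravel()
        pred, ref = pred[keep], ref[keep]
    # Two unknowns per frame: one column for scale, one for shift.
    design = np.stack([pred, np.ones_like(pred)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(design, ref, rcond=None)
    return scale, shift
```

Only two numbers are solved for per frame, which is in the spirit of the "dozens of parameters per frame" the record contrasts with the millions in learning-based methods.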
2308.05695
Report |
Masked Diffusion as Self-supervised Representation Learner |
Zixuan Pan, Jianxu Chen, Yiyu Shi |
Denoising diffusion probabilistic models have recently demonstrated
state-of-the-art generative performance and have been used as strong
pixel-level representation learners. This paper decomposes the interrelation
between the generative capability and representation learning ability inherent
in diffusion models. We present the masked diffusion model (MDM), a scalable
self-supervised representation learner for semantic segmentation, substituting
the conventional additive Gaussian noise of traditional diffusion with a
masking mechanism. Our proposed approach convincingly surpasses prior
benchmarks, demonstrating remarkable advancements in both medical and natural
image semantic segmentation tasks, particularly in few-shot scenarios. |
This paper introduces the Masked Diffusion Model (MDM), a new self-supervised representation learning approach for semantic segmentation that replaces the additive Gaussian noise of traditional diffusion models with a masking mechanism. |
This work addresses the limitations of denoising diffusion probabilistic models (DDPM) for representation learning by decoupling generative capability from representation learning ability and proposing a more efficient alternative. |
MDM utilizes a masking strategy guided by a sampled timestep, reconstructing the original image from a masked version. It also leverages the Structural Similarity Index (SSIM) loss to enhance structural information preservation during reconstruction, improving semantic representation quality. |
MDM outperforms DDPM and other state-of-the-art methods on both medical (GlaS, MoNuSeg) and natural image (FFHQ-34, CelebA-19) segmentation tasks.
It excels in few-shot scenarios, achieving comparable results to full label settings with significantly fewer labels.
Ablation studies demonstrate the effectiveness of the masking strategy, SSIM loss, and the importance of selecting appropriate diffusion timesteps and UNet decoder blocks. |
The current implementation focuses on the U-Net architecture and is evaluated on a limited set of datasets; other architectures and datasets remain to be explored.
While SSIM loss shows promise, investigating alternative optimization objectives tailored for specific tasks and datasets could lead to further improvements. |
self-supervised learning, semantic segmentation, diffusion models, representation learning, few-shot learning |
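The timestep-guided masking in the MDM record above can be sketched as follows. The sketch assumes that the fraction of masked patches grows linearly with the sampled timestep and that masked patches are set to zero; neither detail is stated in the record, and `mask_image`, `num_timesteps`, and `patch` are hypothetical names.

```python
import numpy as np

def mask_image(image, t, num_timesteps=1000, patch=4, rng=None):
    """Zero out a random subset of patches whose size depends on timestep t.

    Assumptions: mask ratio = t / num_timesteps, and the image height and
    width are divisible by `patch`.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    gh, gw = h // patch, w // patch
    n_masked = int(round(t / num_timesteps * gh * gw))
    # Choose which patches to hide, then expand the patch grid to pixel size.
    grid = np.zeros(gh * gw, dtype=bool)
    grid[rng.choice(gh * gw, size=n_masked, replace=False)] = True
    grid = grid.reshape(gh, gw)
    mask = np.repeat(np.repeat(grid, patch, axis=0), patch, axis=1)
    masked = image.copy()
    masked[mask] = 0
    return masked, mask
```

A reconstruction network would then be trained to recover `image` from `masked`, in place of predicting additive Gaussian noise.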
2308.05659
Report |
AD-CLIP: Adapting Domains in Prompt Space Using CLIP |
Mainak Singha, Harsh Pal, Ankit Jha, Biplab Banerjee |
Although deep learning models have shown impressive performance on supervised
learning tasks, they often struggle to generalize well when the training
(source) and test (target) domains differ. Unsupervised domain adaptation (DA)
has emerged as a popular solution to this problem. However, current DA
techniques rely on visual backbones, which may lack semantic richness. Despite
the potential of large-scale vision-language foundation models like CLIP, their
effectiveness for DA has yet to be fully explored. To address this gap, we
introduce AD-CLIP, a domain-agnostic prompt learning strategy for CLIP that
aims to solve the DA problem in the prompt space. We leverage the frozen vision
backbone of CLIP to extract both image style (domain) and content information,
which we apply to learn prompt tokens. Our prompts are designed to be
domain-invariant and class-generalizable, by conditioning prompt learning on
image style and content features simultaneously. We use standard supervised
contrastive learning in the source domain, while proposing an entropy
minimization strategy to align domains in the embedding space given the target
domain data. We also consider a scenario where only target domain samples are
available during testing, without any source domain data, and propose a
cross-domain style mapping network to hallucinate domain-agnostic tokens. Our
extensive experiments on three benchmark DA datasets demonstrate the
effectiveness of AD-CLIP compared to existing literature. |
This paper proposes AD-CLIP, a novel domain adaptation framework leveraging prompt learning with the CLIP model. AD-CLIP learns domain-invariant and class-generic prompt tokens using visual features extracted from CLIP's vision encoder, aiming to improve cross-domain generalization. |
Existing domain adaptation techniques rely heavily on visual backbones, which may lack semantic richness and lead to sub-optimal performance. This work explores the potential of large-scale vision-language models like CLIP for improved domain adaptation. |
AD-CLIP utilizes the frozen vision and text backbones of CLIP. It introduces learnable style and content projectors to enable prompt learning from visual information of different layers from the CLIP vision encoder. The framework learns three types of prompt tokens: domain tokens, image tokens, and class tokens. Additionally, it employs distribution divergence loss and entropy minimization loss for domain alignment. |
AD-CLIP achieves state-of-the-art performance on three benchmark domain adaptation datasets: Office-Home, VisDA-2017, and Mini-DomainNet.
The method demonstrates the effectiveness of learning domain-invariant and class-generic prompt tokens for improving cross-domain generalization.
Ablation studies validate the importance of incorporating multi-scale visual features, the domain-agnostic token, and the proposed loss functions for optimal performance. |
While AD-CLIP demonstrates strong overall performance, it exhibits limitations on certain classes of the VisDA-2017 dataset, indicating scope for further improvement.
Future work will focus on extending AD-CLIP to specific applications like person re-identification and medical imaging, where domain adaptation is crucial. |
domain adaptation, prompt learning, clip, vision-language models, unsupervised learning |
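The entropy minimization term mentioned in the AD-CLIP record above can be illustrated with the standard Shannon entropy of softmax predictions on unlabeled target samples. This is a generic sketch; the exact loss used by the authors is not given in the record, and `entropy_loss` is a hypothetical name.

```python
import numpy as np

def entropy_loss(logits):
    """Mean Shannon entropy of softmax predictions, one row per sample.

    Assumption: the generic entropy objective, not necessarily the
    authors' exact formulation.
    """
    logits = np.asarray(logits, dtype=float)
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    probs = np.exp(log_probs)
    return float(-(probs * log_probs).sum(axis=1).mean())
```

Minimizing this quantity pushes target-domain predictions toward confident class assignments, which is how it helps align the two domains without target labels.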
2308.05128
Report |
High-Level Parallelism and Nested Features for Dynamic Inference Cost and Top-Down Attention |
André Peter Kelm, Niels Hannemann, Bruno Heberle, Lucas Schmidt, Tim Rolff, Christian Wilms, Ehsan Yaghoubi, Simone Frintrop |
This paper introduces a novel network topology that seamlessly integrates
dynamic inference cost with a top-down attention mechanism, addressing two
significant gaps in traditional deep learning models. Drawing inspiration from
human perception, we combine sequential processing of generic low-level
features with parallelism and nesting of high-level features. This design not
only reflects a finding from recent neuroscience research regarding spatially
and contextually distinct neural activations in human cortex, but also
introduces a novel "cutout" technique: the ability to selectively activate
only network segments of task-relevant categories to optimize inference cost
and eliminate the need for
re-training. We believe this paves the way for future network designs that are
lightweight and adaptable, making them suitable for a wide range of
applications, from compact edge devices to large-scale clouds. Our proposed
topology also comes with a built-in top-down attention mechanism, which allows
processing to be directly influenced by either enhancing or inhibiting
category-specific high-level features, drawing parallels to the selective
attention mechanism observed in human cognition. Using targeted external
signals, we experimentally enhanced predictions across all tested models. In
terms of dynamic inference cost, our methodology can exclude up
to 73.48% of parameters and use 84.41% fewer giga-multiply-accumulate
(GMAC) operations; analysis against comparative baselines shows an average
reduction of 40% in parameters and 8% in GMACs across the cases we
evaluated. |
This paper introduces a novel network topology called SeqPar (Sequential-Parallel) for deep learning models, combining sequential processing for low-level features with parallel processing for high-level features to enable dynamic inference cost reduction and top-down attention. |
Current deep learning models lack the ability to dynamically adjust inference costs or incorporate top-down attention, limiting their efficiency and adaptability in tasks where high-level knowledge is available. |
The proposed SeqPar structure separates high-level features into parallel branches, allowing for category-specific feature extraction. This enables a novel "cutout" technique where only relevant branches are activated during inference based on prior knowledge, reducing computation. The structure also inherently allows for top-down attention by amplifying or inhibiting specific branches. |
The SeqPar structure with cutouts achieves up to 73.48% reduction in parameters and 84.41% fewer GMACs, with an average reduction of 40% in parameters and 8% in GMACs compared to baselines.
The built-in top-down attention mechanism, tested by amplifying target category features, consistently improves classification accuracy across various datasets.
Nested SeqPar structures (NHL), grouping categories by similarity, further reduce parameter count and improve accuracy compared to conventional ResNet models on ImageNet100. |
The performance of NHL on ImageNet with 1000 categories is limited, potentially due to the small number of training images per category.
The optimal split point for transitioning from sequential to parallel processing requires further investigation. |
deep learning, dynamic inference, top-down attention, network topology, image classification |
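The "cutout" and top-down attention mechanisms in the SeqPar record above can be sketched with a toy model: a shared sequential stem followed by one parallel branch per category. Everything here (the class name, the random linear layers, the `active` and `gain` arguments) is a hypothetical illustration and not the authors' architecture.

```python
import numpy as np

class SeqParSketch:
    """Toy shared stem with one parallel scoring branch per category."""

    def __init__(self, in_dim, hidden, categories, seed=0):
        rng = np.random.default_rng(seed)
        self.stem = rng.normal(size=(in_dim, hidden))
        self.branches = {c: rng.normal(size=hidden) for c in categories}

    def forward(self, x, active=None, gain=None):
        # Sequential part: generic low-level features shared by all branches.
        shared = np.maximum(x @ self.stem, 0.0)
        # Cutout: evaluate only the branches for task-relevant categories.
        active = list(self.branches) if active is None else active
        # Top-down attention: amplify or inhibit selected branch outputs.
        gain = {} if gain is None else gain
        return {c: float(shared @ self.branches[c]) * gain.get(c, 1.0)
                for c in active}
```

Because inactive branches are never evaluated, their parameters and operations drop out of the inference cost without any retraining.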
2308.05095
Report |
LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation |
Leigang Qu, Shengqiong Wu, Hao Fei, Liqiang Nie, Tat-Seng Chua |
In the text-to-image generation field, recent remarkable progress in Stable
Diffusion makes it possible to generate rich kinds of novel photorealistic
images. However, current models still face misalignment issues (e.g.,
problematic spatial relation understanding and numeration failure) in complex
natural scenes, which impedes the high-faithfulness text-to-image generation.
Although recent efforts have been made to improve controllability by giving
fine-grained guidance (e.g., sketch and scribbles), this issue has not been
fundamentally tackled since users have to provide such guidance information
manually. In this work, we strive to synthesize high-fidelity images that are
semantically aligned with a given textual prompt without any guidance. Toward
this end, we propose a coarse-to-fine paradigm to achieve layout planning and
image generation. Concretely, we first generate the coarse-grained layout
conditioned on a given textual prompt via in-context learning based on Large
Language Models. Afterward, we propose a fine-grained object-interaction
diffusion method to synthesize high-faithfulness images conditioned on the
prompt and the automatically generated layout. Extensive experiments
demonstrate that our proposed method outperforms the state-of-the-art models in
terms of layout and image generation. Our code and settings are available at
https://layoutllm-t2i.github.io. |
This paper proposes LayoutLLM-T2I, a novel approach for text-to-image generation that leverages the layout planning abilities of Large Language Models (LLMs) to enhance the faithfulness of synthesized images, particularly in complex scenes. |
Current text-to-image models struggle with complex scenes, exhibiting issues like spatial relation misunderstanding and numeration errors. This work aims to address this by incorporating layout planning into the generation process. |
The method employs a two-stage process: 1) **Text-to-Layout Induction:** An LLM is used to generate a coarse-grained layout from the text prompt, aided by a feedback-based sampler learning mechanism. 2) **Layout-guided Image Generation:** A layout-aware adapter is integrated into a pre-trained diffusion model, enabling relation-aware object interaction guided by the generated layout. |
LayoutLLM-T2I significantly outperforms existing methods in both layout generation and image synthesis, achieving state-of-the-art results on the COCO dataset.
The feedback-based sampler learning mechanism is shown to effectively activate and improve the layout planning capabilities of LLMs.
The relation-aware image generation module, incorporating object interactions, is crucial for enhancing the faithfulness of images, particularly in complex scenes. |
The performance of layout planning with LLMs is sensitive to the number of in-context examples, highlighting the need for further research on sample efficiency.
The work focuses on layout planning as a single modality; future work could explore the integration of other modalities, such as depth information, to further enhance image faithfulness. |
text-to-image generation, diffusion model, large language model, layout planning, scene understanding |
2308.04868
Report |
InstantAvatar: Efficient 3D Head Reconstruction via Surface Rendering |
Antonio Canela, Pol Caselles, Ibrar Malik, Eduard Ramon, Jaime García, Jordi Sánchez-Riera, Gil Triginer, Francesc Moreno-Noguer |
Recent advances in full-head reconstruction have been obtained by optimizing
a neural field through differentiable surface or volume rendering to represent
a single scene. While these techniques achieve an unprecedented accuracy, they
take several minutes, or even hours, due to the expensive optimization process
required. In this work, we introduce InstantAvatar, a method that recovers
full-head avatars from few images (down to just one) in a few seconds on
commodity hardware. In order to speed up the reconstruction process, we propose
a system that combines, for the first time, a voxel-grid neural field
representation with a surface renderer. Notably, a naive combination of these
two techniques leads to unstable optimizations that do not converge to valid
solutions. In order to overcome this limitation, we present a novel statistical
model that learns a prior distribution over 3D head signed distance functions
using a voxel-grid based architecture. The use of this prior model, in
combination with other design choices, results in a system that achieves 3D
head reconstructions with accuracy comparable to the state of the art with a
100x speed-up. |
This paper introduces InstantAvatar, a method for fast full-head avatar reconstruction from a few images (down to one) that combines a voxel-grid neural field representation with a surface renderer, stabilized by a novel statistical prior over 3D head signed distance functions. |
Existing full-head reconstruction techniques, while accurate, are slow (taking minutes or hours) due to expensive optimization processes. This work aims to achieve comparable accuracy at significantly faster speeds. |
The method leverages: (1) a multi-resolution grid-based neural field trained on a dataset of 3D head scans to represent a prior distribution of head SDFs, (2) differentiable surface rendering for optimization, (3) monocular normal predictions to guide and speed up convergence, and (4) a parallel ray-surface intersection algorithm inspired by volume rendering for efficiency. |
Achieves comparable accuracy to state-of-the-art methods like H3D-Net and SIRA.
Reconstructs full-head avatars in seconds, a 100x speedup over neural-field-based alternatives.
Successfully combines a grid-based architecture with surface rendering for fast and accurate 3D reconstruction. |
Memory concerns arise from the use of dense grids.
The representation capacity is limited by the accuracy of the predicted normals, and the prior architecture could be improved to allow optimization of grid features. |
3d reconstruction, neural fields, surface rendering, avatar generation, statistical shape models |
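Surface rendering of a signed distance function (SDF), as used in the InstantAvatar record above, requires finding where each camera ray meets the zero level set. The sketch below uses classic sphere tracing on an analytic sphere. This is a swapped-in textbook technique, not the parallel ray-surface intersection algorithm the paper proposes, and all names are hypothetical.

```python
import numpy as np

def sphere_sdf(p, radius=1.0):
    """Signed distance from point p to a sphere centred at the origin."""
    return np.linalg.norm(p) - radius

def sphere_trace(sdf, origin, direction, max_steps=64, eps=1e-5, far=10.0):
    """March along the ray by the SDF value until the surface is reached.

    Returns the ray parameter t at the hit, or None if the ray misses.
    """
    direction = direction / np.linalg.norm(direction)
    t = 0.0
    for _ in range(max_steps):
        dist = sdf(origin + t * direction)
        if dist < eps:
            return t
        # The SDF value is a safe step size: no surface lies closer than this.
        t += dist
        if t > far:
            break
    return None
```

Sphere tracing processes each ray with a sequential loop, which hints at why a parallel intersection scheme is attractive when reconstruction must finish in seconds.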
2308.04832
Report |
TSSR: A Truncated and Signed Square Root Activation Function for Neural Networks |
Yuanhao Gong |
Activation functions are essential components of neural networks. In this
paper, we introduce a new activation function called the Truncated and Signed
Square Root (TSSR) function. This function is distinctive because it is odd,
nonlinear, monotone and differentiable. Its gradient is continuous and always
positive. Thanks to these properties, it has the potential to improve the
numerical stability of neural networks. Several experiments confirm that the
proposed TSSR has better performance than other state-of-the-art activation
functions. The proposed function has significant implications for the
development of neural network models and can be applied to a wide range of
applications in fields such as computer vision, natural language processing,
and speech recognition. |
Introduces a novel activation function called Truncated and Signed Square Root (TSSR) for improved neural network performance. |
Existing activation functions lack the ideal combination of mathematical properties for optimal numerical stability and performance. |
Analyzes desired properties of activation functions, proposes TSSR function, compares TSSR to existing functions on CIFAR-10/100 datasets using various network architectures. |
TSSR is odd, monotone, and differentiable, with unbounded values and a bounded continuous gradient.
TSSR outperforms ReLU, Mish, and Serf in accuracy on CIFAR-10/100 benchmarks.
TSSR shows promise for enhancing accuracy and efficiency in a variety of neural network applications. |
Current experiments are limited to CIFAR datasets and a subset of network architectures.
Future work includes testing TSSR on larger datasets and a wider range of applications. |
activation function, tssr, neural network, deep learning, numerical stability |
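The record above lists the properties of TSSR but not its formula. The sketch below assumes the piecewise form f(x) = x for |x| < alpha and f(x) = sign(x) * sqrt(alpha * (2|x| - alpha)) otherwise, which is consistent with every property listed (odd, monotone, unbounded, with a continuous positive gradient) but should be checked against the paper. The default `alpha=1.0` is also an assumption.

```python
import numpy as np

def tssr(x, alpha=1.0):
    """Truncated and Signed Square Root activation (assumed piecewise form).

    Linear near zero; signed square-root growth once |x| >= alpha.
    """
    x = np.asarray(x, dtype=float)
    ax = np.abs(x)
    # Clamp the radicand so the unused branch never produces NaN.
    outer = np.sign(x) * np.sqrt(alpha * np.maximum(2.0 * ax - alpha, 0.0))
    return np.where(ax < alpha, x, outer)
```

Under this form the two pieces meet at |x| = alpha with matching value (alpha) and matching slope (1), so the gradient is continuous and stays in the interval (0, 1].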
2308.04830
Report |
VAST: Vivify Your Talking Avatar via Zero-Shot Expressive Facial Style Transfer |
Liyang Chen, Zhiyong Wu, Runnan Li, Weihong Bao, Jun Ling, Xu Tan, Sheng Zhao |
Current talking face generation methods mainly focus on speech-lip
synchronization. However, insufficient investigation on the facial talking
style leads to a lifeless and monotonous avatar. Most previous works fail to
imitate expressive styles from arbitrary video prompts and ensure the
authenticity of the generated video. This paper proposes an unsupervised
variational style transfer model (VAST) to vivify the neutral photo-realistic
avatars. Our model consists of three key components: a style encoder that
extracts facial style representations from the given video prompts; a hybrid
facial expression decoder to model accurate speech-related movements; a
variational style enhancer that enhances the style space to be highly
expressive and meaningful. With our essential designs on facial style learning,
our model is able to flexibly capture the expressive facial style from
arbitrary video prompts and transfer it onto a personalized image renderer in a
zero-shot manner. Experimental results demonstrate the proposed approach
contributes to a more vivid talking avatar with higher authenticity and richer
expressiveness. |
Proposes VAST, a novel variational style transfer model, to generate vivid talking avatars by transferring expressive facial styles from arbitrary video prompts onto neutral avatars in a zero-shot manner. |
Existing talking face generation methods lack expressiveness and struggle to imitate natural styles from arbitrary videos, limiting the creation of engaging and realistic avatars. |
VAST leverages a style encoder, variational style enhancer, and hybrid decoder. The style encoder extracts style representation from video prompts. The enhancer enriches style space using normalizing flow. The hybrid decoder predicts speech-related and weakly-related expressions separately to ensure authenticity. |
VAST outperforms state-of-the-art methods in generating high-fidelity videos with accurate lip synchronization.
The variational style enhancer significantly improves the expressiveness of generated avatars.
VAST exhibits strong performance in transferring various facial styles, as demonstrated by subjective and objective evaluations. |
The image renderer's performance is limited by the training data, leading to artifacts when transferring highly exaggerated styles.
Exploring alternative rendering techniques and training on more diverse data could further enhance the quality of generated avatars. |
talking face generation, facial style transfer, variational autoencoder, normalizing flow, zero-shot learning |
2308.04829
Report |
MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation |
Kaixin Cai, Pengzhen Ren, Yi Zhu, Hang Xu, Jianzhuang Liu, Changlin Li, Guangrun Wang, Xiaodan Liang |
Recently, semantic segmentation models trained with image-level text
supervision have shown promising results in challenging open-world scenarios.
However, these models still face difficulties in learning fine-grained semantic
alignment at the pixel level and predicting accurate object masks. To address
this issue, we propose MixReorg, a novel and straightforward pre-training
paradigm for semantic segmentation that enhances a model's ability to
reorganize patches mixed across images, exploring both local visual relevance
and global semantic coherence. Our approach involves generating fine-grained
patch-text pairs data by mixing image patches while preserving the
correspondence between patches and text. The model is then trained to minimize
the segmentation loss of the mixed images and the two contrastive losses of the
original and restored features. With MixReorg as a mask learner, conventional
text-supervised semantic segmentation models can achieve highly generalizable
pixel-semantic alignment ability, which is crucial for open-world segmentation.
After training with large-scale image-text data, MixReorg models can be applied
directly to segment visual objects of arbitrary categories, without the need
for further fine-tuning. Our proposed framework demonstrates strong performance
on popular zero-shot semantic segmentation benchmarks, outperforming GroupViT
by significant margins of 5.0%, 6.2%, 2.5%, and 3.4% mIoU on PASCAL VOC2012,
PASCAL Context, MS COCO, and ADE20K, respectively. |
This paper proposes MixReorg, a novel pre-training paradigm for open-world semantic segmentation that leverages mixed patch reorganization to enhance a model's ability to learn fine-grained semantic alignment from image-text pairs. |
Existing text-supervised semantic segmentation models struggle to learn fine-grained semantic alignment at the pixel level, limiting their performance in challenging open-world scenarios. This work addresses this limitation by improving cross-modal alignment using a novel pre-training approach. |
MixReorg generates fine-grained patch-text pairs by mixing image patches while preserving their textual correspondence. The model is then trained to reconstruct the original images and predict segmentation masks from the mixed images, using both segmentation and contrastive losses. The process involves three stages: contextual mixing, progressive mixing, and mixing restoration. |
MixReorg outperforms previous state-of-the-art methods on open-world semantic segmentation benchmarks, including PASCAL VOC2012, PASCAL Context, MS COCO, and ADE20K.
The method effectively learns fine-grained semantic alignment, as demonstrated by its ability to accurately segment mixed images.
MixReorg shows significant improvements in handling complex segmentation examples and segmenting stuff classes compared to previous methods. |
The contextual mixing stage increases computational cost during training.
The constructed patch-text pairs, while fine-grained, are not yet at the pixel level, leaving room for further improvement. |
semantic segmentation, open-world learning, vision-language pre-training, cross-modal alignment, mixed image modeling |
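The patch mixing step in the MixReorg record above can be sketched as follows: each mixed image draws every patch from a randomly chosen image in the batch, and a source map records which image (and hence which caption) each patch came from. The sketch assumes patches keep their spatial position when mixed; that detail and the name `mix_patches` are assumptions.

```python
import numpy as np

def mix_patches(images, patch, rng=None):
    """Mix patches across a batch while tracking each patch's source image.

    Assumptions: patches stay at their original grid position, and image
    height and width are divisible by `patch`.
    """
    rng = np.random.default_rng() if rng is None else rng
    images = np.asarray(images)
    n, h, w = images.shape[:3]
    gh, gw = h // patch, w // patch
    # source[k, i, j] is the batch index supplying patch (i, j) of mixed image k.
    source = rng.integers(0, n, size=(n, gh, gw))
    mixed = np.empty_like(images)
    for k in range(n):
        for i in range(gh):
            for j in range(gw):
                ys, xs = i * patch, j * patch
                mixed[k, ys:ys + patch, xs:xs + patch] = \
                    images[source[k, i, j], ys:ys + patch, xs:xs + patch]
    return mixed, source
```

The `source` map is what provides the patch-level text correspondence: it serves as a segmentation target for the mixed image at patch resolution.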
2308.04826
Report |
WaveNeRF: Wavelet-based Generalizable Neural Radiance Fields |
Muyu Xu, Fangneng Zhan, Jiahui Zhang, Yingchen Yu, Xiaoqin Zhang, Christian Theobalt, Ling Shao, Shijian Lu |
Neural Radiance Field (NeRF) has shown impressive performance in novel view
synthesis via implicit scene representation. However, it usually suffers from
poor scalability, as it requires densely sampled images for each new scene.
Several studies have attempted to mitigate this problem by integrating
Multi-View Stereo (MVS) techniques into NeRF, but they still entail a
cumbersome fine-tuning process for new scenes. Notably, the rendering quality
drops severely without this fine-tuning process, and the errors mainly
appear around the high-frequency features. In light of this observation, we
design WaveNeRF, which integrates wavelet frequency decomposition into MVS and
NeRF to achieve generalizable yet high-quality synthesis without any per-scene
optimization. To preserve high-frequency information when generating 3D feature
volumes, WaveNeRF builds Multi-View Stereo in the Wavelet domain by integrating
the discrete wavelet transform into the classical cascade MVS, which
disentangles high-frequency information explicitly. With that, disentangled
frequency features can be injected into classic NeRF via a novel hybrid neural
renderer to yield faithful high-frequency details, and an intuitive
frequency-guided sampling strategy can be designed to suppress artifacts around
high-frequency regions. Extensive experiments over three widely studied
benchmarks show that WaveNeRF achieves superior generalizable radiance field
modeling when only given three images as input. |
This paper proposes WaveNeRF, a novel generalizable Neural Radiance Field (NeRF) model that leverages wavelet frequency decomposition within a multi-view stereo (MVS) framework to achieve high-quality novel view synthesis without per-scene optimization. |
Existing generalizable NeRF methods often suffer from performance degradation and artifacts when per-scene fine-tuning is not employed, particularly around high-frequency image regions. This work addresses this limitation by explicitly incorporating high-frequency information during training. |
The proposed WaveNeRF introduces a Wavelet Multi-view Stereo (WMVS) module to extract both spatial and frequency domain features. It also employs a Frequency-guided Sampling Strategy (FSS) to concentrate sampling near object surfaces. These components are integrated into a Hybrid Neural Renderer (HNR) that combines spatial and frequency information for enhanced rendering. |
WaveNeRF outperforms existing generalizable NeRF methods on DTU, NeRF Synthetic, and LLFF datasets, demonstrating superior performance with only three input views.
The proposed Frequency-guided Sampling Strategy (FSS) effectively increases the density of samples around object surfaces, leading to improved detail rendering.
Evaluation using the HFIV metric confirms that WaveNeRF effectively reconstructs high-frequency details compared to previous methods. |
The model's performance with a larger number of input views is limited by GPU memory constraints.
The reliance on MVS techniques may lead to artifacts in regions with inaccurate stereo reconstruction. |
neural radiance fields, novel view synthesis, multi-view stereo, wavelet transform, frequency decomposition |
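The wavelet frequency decomposition in the WaveNeRF record above can be illustrated with a single-level 2D discrete wavelet transform. The sketch assumes the Haar wavelet, the simplest choice; the record does not say which wavelet the authors use, and the function and band names are hypothetical.

```python
import numpy as np

def haar_dwt2(image):
    """One level of the orthonormal 2D Haar transform.

    Returns a low-frequency band plus horizontal, vertical, and diagonal
    detail bands, each at half the input resolution. Assumes even height
    and width.
    """
    x = np.asarray(image, dtype=float)
    # The four pixels of every non-overlapping 2x2 block.
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    low = (a + b + c + d) / 2.0
    horiz = (a - b + c - d) / 2.0   # differences between left and right columns
    vert = (a + b - c - d) / 2.0    # differences between top and bottom rows
    diag = (a - b - c + d) / 2.0
    return low, horiz, vert, diag
```

The three detail bands isolate edges and fine texture, which is the high-frequency information the record says is disentangled explicitly and passed to the renderer.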
2308.04758
Report |
Bird's-Eye-View Scene Graph for Vision-Language Navigation |
Rui Liu, Xiaohan Wang, Wenguan Wang, Yi Yang |
Vision-language navigation (VLN), which requires an agent to navigate 3D
environments following human instructions, has shown great advances. However,
current agents are built upon panoramic observations, which hinders their
ability to perceive 3D scene geometry and easily leads to ambiguous selection
of panoramic views. To address these limitations, we present a BEV Scene Graph
(BSG), which leverages multi-step BEV representations to encode scene layouts
and geometric cues of indoor environments under the supervision of 3D detection.
During navigation, BSG builds a local BEV representation at each step and
maintains a BEV-based global scene map, which stores and organizes all the
online collected local BEV representations according to their topological
relations. Based on BSG, the agent predicts a local BEV grid-level decision
score and a global graph-level decision score, combined with a sub-view
selection score on panoramic views, for more accurate action prediction. Our
approach significantly outperforms state-of-the-art methods on REVERIE, R2R,
and R4R, showing the potential of BEV perception in VLN. |
This paper presents BEV Scene Graph (BSG), a novel approach for vision-language navigation (VLN) that leverages Bird’s-Eye-View (BEV) representations to overcome limitations of panoramic-view based methods. |
Current VLN agents based on panoramic views struggle to perceive 3D scene geometry and suffer from ambiguity in action prediction due to multiple candidate nodes mapping to the same view. BEV perception offers a solution by encoding scene layouts and geometric cues effectively. |
BSG builds local BEV representations at each navigation step and maintains a global scene map connecting them topologically. It leverages 3D detection on multi-step BEV representations to encode object-level information. Finally, it predicts actions based on a fused score from local BEV grid-level and global graph-level decision scores. |
BSG significantly outperforms state-of-the-art methods on REVERIE, R2R, and R4R benchmarks.
On REVERIE, BSG surpasses the previous best model by 5.14% on Success Rate and 3.21% on Remote Grounding Success on the val unseen split.
Ablation studies validate the contribution of individual components like BEV updating, neighborhood size for node embeddings, and the importance of 3D object detection. |
The model is trained in static environments, limiting its applicability to dynamic real-world scenarios with moving objects.
Future work could explore the integration of more advanced BEV frameworks and address the challenges of amodal perception for enhanced scene understanding. |
vision-language navigation, bird's-eye-view, 3d object detection, scene graph, embodied ai |
2308.04657
Report |
Which Tokens to Use? Investigating Token Reduction in Vision Transformers |
Joakim Bruslund Haurum, Sergio Escalera, Graham W. Taylor, Thomas B. Moeslund |
Since the introduction of the Vision Transformer (ViT), researchers have
sought to make ViTs more efficient by removing redundant information in the
processed tokens. While different methods have been explored to achieve this
goal, we still lack understanding of the resulting reduction patterns and how
those patterns differ across token reduction methods and datasets. To close
this gap, we set out to understand the reduction patterns of 10 different token
reduction methods using four image classification datasets. By systematically
comparing these methods on the different classification tasks, we find that the
Top-K pruning method is a surprisingly strong baseline. Through in-depth
analysis of the different methods, we determine that: the reduction patterns
are generally not consistent when varying the capacity of the backbone model,
the reduction patterns of pruning-based methods significantly differ from fixed
radial patterns, and the reduction patterns of pruning-based methods are
correlated across classification datasets. Finally we report that the
similarity of reduction patterns is a moderate-to-strong proxy for model
performance. Project page at https://vap.aau.dk/tokens. |
This paper conducts a systematic comparison and analysis of 10 state-of-the-art token reduction methods for Vision Transformers (ViTs) across four image classification datasets. |
Token reduction methods aim to improve ViT efficiency by removing redundant tokens. However, a lack of understanding exists regarding how these methods differ and their reduction patterns across datasets. |
The authors implemented 10 token reduction methods, including pruning and merging based approaches, and evaluated their performance on ImageNet, NABirds, COCO, and NUS-WIDE. They analyzed the consistency of reduction patterns when varying keep rate, backbone capacity, and datasets. |
Top-K pruning and its extension, EViT, consistently perform well across all datasets and backbone capacities.
Reduction patterns of pruning-based methods are not consistent when varying backbone capacity but are consistent when changing the keep rate.
Reduction patterns of pruning-based methods are highly correlated across datasets, suggesting common token usage patterns despite dataset differences. |
The analysis is limited to the image classification task and to an ImageNet pre-trained backbone.
Efficiency evaluation of the methods is not included in the study. |
vision transformer, token reduction, pruning, merging, efficiency |
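As a rough illustration of the Top-K pruning baseline discussed in the report above, the sketch below keeps the highest-scoring tokens at a given keep rate. It assumes each token is scored by the attention it receives from the class token, which is a common choice but is not stated in the summary; the function name `top_k_prune` and the plain-list token representation are hypothetical.

```python
import math


def top_k_prune(tokens, cls_attention, keep_rate):
    """Keep the ceil(keep_rate * N) tokens with the highest scores.

    tokens        : list of token features (any objects)
    cls_attention : one score per token (assumed: attention from the class token)
    keep_rate     : fraction of tokens to keep, in (0, 1]
    Returns the kept tokens in their original order and their indices.
    """
    if len(tokens) != len(cls_attention):
        raise ValueError("tokens and cls_attention must have the same length")
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")

    k = max(1, math.ceil(keep_rate * len(tokens)))
    # Rank token indices by score, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: cls_attention[i], reverse=True)
    # Restore spatial order so later layers see tokens in a stable sequence.
    kept_indices = sorted(ranked[:k])
    return [tokens[i] for i in kept_indices], kept_indices
```

Recording `kept_indices` per image is what makes it possible to compare reduction patterns across keep rates, backbones, and datasets, as the report describes.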
2308.04603
Report |
A Brief Yet In-Depth Survey of Deep Learning-Based Image Watermarking |
Xin Zhong, Arjon Das, Fahad Alrasheedi, Abdullah Tanvir |
This paper presents a comprehensive survey on deep learning-based image
watermarking, a technique that entails the invisible embedding and extraction
of watermarks within a cover image, aiming to offer a seamless blend of
robustness and adaptability. We navigate the complex landscape of this
interdisciplinary domain, linking historical foundations, current innovations,
and prospective developments. Unlike existing literature, our study
concentrates exclusively on image watermarking with deep learning, delivering
an in-depth, yet brief analysis enriched by three fundamental contributions.
First, we introduce a refined categorization, segmenting the field into
Embedder-Extractor, Deep Networks as a Feature Transformation, and Hybrid
Methods. This taxonomy, inspired by the varied roles of deep learning across
studies, is designed to infuse clarity, offering readers technical insights and
directional guidance. Second, our exploration dives into representative
methodologies, encapsulating the diverse research directions and inherent
challenges within each category to provide a consolidated perspective. Lastly,
we venture beyond established boundaries to outline emerging frontiers,
offering a detailed insight into prospective research avenues. |
This paper presents a comprehensive survey of deep learning-based image watermarking techniques, aiming to provide a consolidated perspective on historical foundations, current innovations, and prospective developments. |
The integration of deep learning in image watermarking is crucial due to its potential for enhanced robustness, adaptability, and the ability to learn and adapt to evolving threats. |
The paper categorizes deep learning-based image watermarking into three types: (1) Embedder-Extractor Joint Training, (2) Deep Networks as a Feature Transformation, and (3) Hybrid Methods. The methodologies, challenges, and representative solutions within each category are then analyzed. |
Joint training of embedder and extractor networks has proven effective, leading to numerous variations and innovations.
Using deep networks for feature transformation, particularly with pre-trained models, offers promising results in zero watermarking.
Hybrid methods, combining traditional watermarking calculations with deep learning, leverage the strengths of both approaches for enhanced efficiency. |
Current research primarily focuses on static images, neglecting the dynamism of video content and real-time applications.
There's a need for standardized evaluation metrics and benchmark datasets to enable accurate comparison and evaluation of different deep learning-based image watermarking techniques. |
deep learning, image watermarking, embedder-extractor, feature transformation, hybrid methods |
2308.04553
Report |
From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Bias |
Maan Qraitem, Kate Saenko, Bryan A. Plummer |
Visual recognition models are prone to learning spurious correlations induced
by a biased training set where certain conditions $B$ (e.g., Indoors) are
over-represented in certain classes $Y$ (e.g., Big Dogs). Synthetic data from
generative models offers a promising direction to mitigate this issue by
augmenting underrepresented conditions in the real dataset. However, this
introduces another potential source of bias from generative model artifacts in
the synthetic data. Indeed, as we will show, prior work uses synthetic data to
resolve the model's bias toward $B$, but it doesn't correct the model's bias
toward the pair $(B, G)$ where $G$ denotes whether the sample is real or
synthetic. Thus, the model could simply learn signals based on the pair $(B,
G)$ (e.g., Synthetic Indoors) to make predictions about $Y$ (e.g., Big Dogs). To
address this issue, we propose a two-step training pipeline that we call From
Fake to Real (FFR). The first step of FFR pre-trains a model on balanced
synthetic data to learn robust representations across subgroups. In the second
step, FFR fine-tunes the model on real data using ERM or common loss-based bias
mitigation methods. By training on real and synthetic data separately, FFR
avoids the issue of bias toward signals from the pair $(B, G)$. In other words,
synthetic data in the first step provides effective unbiased representations
that boost performance in the second step. Indeed, our analysis of a high-bias
setting (99.9%) shows that FFR improves performance over the state-of-the-art
by 7-14% over three datasets (CelebA, UTK-Face, and SpuCO Animals). |
This paper proposes 'From Fake to Real' (FFR), a two-step training pipeline using synthetic data to mitigate spurious correlations in visual recognition models. |
Existing methods for mitigating bias with synthetic data introduce new biases due to distributional differences between real and synthetic data. This work aims to address this limitation. |
FFR first pre-trains on balanced synthetic data to learn robust representations. Then, it fine-tunes on real data using ERM or common loss-based bias mitigation methods. |
FFR outperforms prior synthetic data augmentation methods, especially in high-bias settings.
FFR is more data-efficient, achieving better results with less synthetic data.
Qualitative analysis shows FFR focuses on relevant features and disregards spurious background features, unlike other methods. |
The use of pre-trained text-to-image models for synthetic data generation may introduce new biases.
Evaluation datasets, while reflecting recent advancements, are smaller than those used in large-scale systems. |
bias mitigation, synthetic data augmentation, spurious correlations, visual recognition, generative models |
2308.04409
Report |
V-DETR: DETR with Vertex Relative Position Encoding for 3D Object Detection |
Yichao Shen, Zigang Geng, Yuhui Yuan, Yutong Lin, Ze Liu, Chunyu Wang, Han Hu, Nanning Zheng, Baining Guo |
We introduce a highly performant 3D object detector for point clouds using
the DETR framework. The prior attempts all end up with suboptimal results
because they fail to learn accurate inductive biases from the limited scale of
training data. In particular, the queries often attend to points that are far
away from the target objects, violating the locality principle in object
detection. To address the limitation, we introduce a novel 3D Vertex Relative
Position Encoding (3DV-RPE) method which computes position encoding for each
point based on its relative position to the 3D boxes predicted by the queries
in each decoder layer, thus providing clear information to guide the model to
focus on points near the objects, in accordance with the principle of locality.
In addition, we systematically improve the pipeline from various aspects such
as data normalization based on our understanding of the task. We show
exceptional results on the challenging ScanNetV2 benchmark, achieving
significant improvements over the previous 3DETR in
$\mathrm{AP}_{25}$/$\mathrm{AP}_{50}$ from 65.0%/47.0% to 77.8%/66.0%,
respectively. In addition, our method sets a new record on ScanNetV2 and SUN
RGB-D datasets. Code will be released at http://github.com/yichaoshen-MS/V-DETR. |
This paper introduces V-DETR, a novel 3D object detection method for point clouds using the DETR framework, enhanced with a 3D Vertex Relative Position Encoding (3DV-RPE) method for improved object localization. |
Previous DETR-based 3D object detectors struggled to learn accurate inductive biases from limited training data, leading to queries attending to irrelevant points far from target objects. This paper aims to address this limitation and improve 3D object detection accuracy. |
V-DETR introduces 3DV-RPE, which computes position encoding for each point based on its relative offset to the vertices of the predicted 3D boxes. It operates in a canonical object space to ensure consistent encoding regardless of object orientation. The approach also incorporates object-based normalization for box parameterization and leverages advancements in 2D DETR. |
V-DETR with 3DV-RPE significantly outperforms previous DETR-based methods and achieves state-of-the-art results on ScanNetV2 and SUN RGB-D datasets.
3DV-RPE effectively guides the model to focus on points near objects, improving localization accuracy, especially under higher IoU thresholds.
The approach demonstrates advantages over voxel expansion methods by directly utilizing accurate 3D surface information. |
The method's performance with a high number of object queries and one-to-many matching can be sensitive to the number of input points.
Future work includes extending the approach to outdoor 3D object detection and unifying architecture design for indoor and outdoor 3D detection tasks. |
3d object detection, point cloud processing, detr, transformer, position encoding |
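To make the geometric input to 3DV-RPE in the report above more concrete, the sketch below computes the offsets from a point to the eight vertices of a box. It is a simplified illustration under stated assumptions: the box is axis-aligned and given as a center and a size, whereas the report describes working in a canonical object space for oriented boxes, and the learned mapping from offsets to a position encoding is omitted. The name `vertex_offsets` is hypothetical.

```python
from itertools import product


def vertex_offsets(point, center, size):
    """Return the 8 offsets (point - vertex) for an axis-aligned 3D box.

    point, center : (x, y, z) coordinates
    size          : (width, length, height) of the box
    """
    half = [s / 2 for s in size]
    offsets = []
    # Each sign triple selects one of the 8 box corners.
    for signs in product((-1, 1), repeat=3):
        vertex = [c + sign * h for c, sign, h in zip(center, signs, half)]
        offsets.append(tuple(p - v for p, v in zip(point, vertex)))
    return offsets
```

Points near the box produce small offsets to at least some vertices, which is the kind of signal that lets attention favor points close to the predicted object.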
2308.04352
Report |
3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment |
Ziyu Zhu, Xiaojian Ma, Yixin Chen, Zhidong Deng, Siyuan Huang, Qing Li |
3D vision-language grounding (3D-VL) is an emerging field that aims to
connect the 3D physical world with natural language, which is crucial for
achieving embodied intelligence. Current 3D-VL models rely heavily on
sophisticated modules, auxiliary losses, and optimization tricks, which calls
for a simple and unified model. In this paper, we propose 3D-VisTA, a
pre-trained Transformer for 3D Vision and Text Alignment that can be easily
adapted to various downstream tasks. 3D-VisTA simply utilizes self-attention
layers for both single-modal modeling and multi-modal fusion without any
sophisticated task-specific design. To further enhance its performance on 3D-VL
tasks, we construct ScanScribe, the first large-scale 3D scene-text pairs
dataset for 3D-VL pre-training. ScanScribe contains 2,995 RGB-D scans for 1,185
unique indoor scenes originating from ScanNet and 3R-Scan datasets, along with
paired 278K scene descriptions generated from existing 3D-VL tasks, templates,
and GPT-3. 3D-VisTA is pre-trained on ScanScribe via masked language/object
modeling and scene-text matching. It achieves state-of-the-art results on
various 3D-VL tasks, ranging from visual grounding and dense captioning to
question answering and situated reasoning. Moreover, 3D-VisTA demonstrates
superior data efficiency, obtaining strong performance even with limited
annotations during downstream task fine-tuning. |
This paper proposes 3D-VisTA, a pre-trained Transformer for 3D vision and text alignment. |
Current 3D-VL models rely heavily on sophisticated modules, auxiliary losses, and optimization tricks, while 3D-VisTA provides a simple and unified approach. |
The authors construct ScanScribe, a large-scale 3D scene-text pairs dataset, and pre-train 3D-VisTA on it with masked language/object modeling and scene-text matching objectives. |
3D-VisTA achieves state-of-the-art results on various 3D-VL tasks, including visual grounding, dense captioning, question answering, and situated reasoning.
Pre-training on ScanScribe significantly improves the performance of 3D-VisTA.
3D-VisTA demonstrates superior data efficiency, obtaining strong results even with limited annotations. |
The data amount in ScanScribe is still insufficient for large-scale 3D-VL pre-training.
3D-VisTA currently uses an offline 3D object detection module, which may be a bottleneck for further improvement. |
3d vision-language grounding, pre-trained transformer, self-supervised learning, large-scale dataset, 3d scene understanding |
2308.04288
Report |
Cloth2Tex: A Customized Cloth Texture Generation Pipeline for 3D Virtual Try-On |
Daiheng Gao, Xu Chen, Xindi Zhang, Qi Wang, Ke Sun, Bang Zhang, Liefeng Bo, Qixing Huang |
Fabricating and designing 3D garments has become extremely demanding with the
increasing need for synthesizing realistic dressed persons for a variety of
applications, e.g. 3D virtual try-on, digitalization of 2D clothes into 3D
apparel, and cloth animation. It thus necessitates a simple and straightforward
pipeline to obtain high-quality texture from simple input, such as 2D reference
images. Traditional warping-based texture generation methods require a
significant number of control points to be manually selected for each type of
garment, which can be a time-consuming and tedious process. We propose a novel
method, called Cloth2Tex, which eliminates the human burden in this process.
Cloth2Tex is a self-supervised method that generates texture maps with
reasonable layout and structural consistency. Another key feature of Cloth2Tex
is that it can be used to support high-fidelity texture inpainting. This is
done by combining Cloth2Tex with a prevailing latent diffusion model. We
evaluate our approach both qualitatively and quantitatively and demonstrate
that Cloth2Tex can generate high-quality texture maps and achieve the best
visual effects in comparison to other methods. Project page:
tomguluson92.github.io/projects/cloth2tex/ |
The paper proposes Cloth2Tex, a two-stage pipeline for converting 2D clothing images into 3D textured meshes, supporting a wider variety of clothing types than previous methods. |
This is important for applications like virtual try-on, 3D garment design, and digital fashion, where realistic 3D clothes are crucial. |
The method uses neural mesh rendering to obtain coarse textures and a diffusion model-based data simulation approach to train a texture refinement network. |
Cloth2Tex generates high-fidelity 3D textures with sharp details, outperforming previous state-of-the-art methods in terms of visual quality and quantitative metrics.
The proposed method supports 10+ clothing categories, significantly more than previous work.
User studies confirm that Cloth2Tex generates more realistic and consistent results compared to 2D and 3D baselines. |
Cloth2Tex faces challenges with clothes having complex patterns and maintaining uniformity for garments with densely assembled grids.
Future work includes exploring methods for generating homogeneous meshes with uniformly-spaced triangles to address texture uniformity issues. |
3d texture synthesis, virtual try-on, neural mesh rendering, latent diffusion models, texture inpainting |
2308.04206
Report |
Exploring Transformers for Open-world Instance Segmentation |
Jiannan Wu, Yi Jiang, Bin Yan, Huchuan Lu, Zehuan Yuan, Ping Luo |
Open-world instance segmentation is a rising task, which aims to segment all
objects in the image by learning from a limited number of base-category
objects. This task is challenging, as the number of unseen categories could be
hundreds of times larger than that of seen categories. Recently, DETR-like
models have been extensively studied in the closed world but remain unexplored
in the open world. In this paper, we utilize the Transformer for open-world
instance segmentation and present SWORD. First, we attach a
stop-gradient operation before the classification head and further add IoU heads
for discovering novel objects. We demonstrate that a simple stop-gradient
operation not only prevents the novel objects from being suppressed as
background, but also allows the network to enjoy the merit of heuristic label
assignment. Secondly, we propose a novel contrastive learning framework to
enlarge the representations between objects and background. Specifically, we
maintain a universal object queue to obtain the object center, and dynamically
select positive and negative samples from the object queries for contrastive
learning. While the previous works only focus on pursuing average recall and
neglect average precision, we show the prominence of SWORD by giving
consideration to both criteria. Our models achieve state-of-the-art performance
in various open-world cross-category and cross-dataset generalizations.
Particularly, in VOC to non-VOC setup, our method sets new state-of-the-art
results of 40.0% on ARb100 and 34.9% on ARm100. For COCO to UVO generalization,
SWORD significantly outperforms the previous best open-world model by 5.9% on
APm and 8.1% on ARm100. |
This paper presents SWORD, a Transformer-based framework for open-world instance segmentation that uses a stop-gradient operation and contrastive learning to enhance novel object discovery. |
It addresses the limitations of existing open-world instance segmentation models, which struggle to identify unseen objects in images. |
SWORD applies a stop-gradient operation before the classification head and incorporates IoU heads to prevent novel object suppression and enable heuristic label assignment. It also employs contrastive learning with a universal object queue to learn distinct representations for objects and background. |
SWORD achieves state-of-the-art performance on cross-category generalization benchmarks like VOC to non-VOC and COCO to LVIS.
It demonstrates significant improvements in cross-dataset generalization from COCO to UVO and COCO to Objects365.
Using pseudo ground-truths from SWORD further enhances performance, creating a strong model extension. |
Pseudo ground-truth training, while beneficial for recall, can negatively impact average precision due to potential noise.
Further research is needed to address the balance between average precision and recall when incorporating pseudo labels. |
open-world instance segmentation, transformers, contrastive learning, stop-gradient operation, pseudo ground-truth training |
2308.04079
Report |
3D Gaussian Splatting for Real-Time Radiance Field Rendering |
Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis |
Radiance Field methods have recently revolutionized novel-view synthesis of
scenes captured with multiple photos or videos. However, achieving high visual
quality still requires neural networks that are costly to train and render,
while recent faster methods inevitably trade off speed for quality. For
unbounded and complete scenes (rather than isolated objects) and 1080p
resolution rendering, no current method can achieve real-time display rates. We
introduce three key elements that allow us to achieve state-of-the-art visual
quality while maintaining competitive training times and importantly allow
high-quality real-time (>= 30 fps) novel-view synthesis at 1080p resolution.
First, starting from sparse points produced during camera calibration, we
represent the scene with 3D Gaussians that preserve desirable properties of
continuous volumetric radiance fields for scene optimization while avoiding
unnecessary computation in empty space; Second, we perform interleaved
optimization/density control of the 3D Gaussians, notably optimizing
anisotropic covariance to achieve an accurate representation of the scene;
Third, we develop a fast visibility-aware rendering algorithm that supports
anisotropic splatting and both accelerates training and allows real-time
rendering. We demonstrate state-of-the-art visual quality and real-time
rendering on several established datasets. |
This paper introduces a novel method for real-time radiance field rendering using 3D Gaussian splatting, achieving state-of-the-art visual quality at real-time frame rates. |
Existing radiance field methods for novel view synthesis either compromise quality for speed or require long training times and are not capable of real-time rendering. |
The method represents the scene as 3D Gaussians, optimized via an interleaved optimization and density control process. It employs a fast, differentiable tile-based rasterizer for rendering, supporting anisotropic splatting and efficient gradient backpropagation. |
The method achieves state-of-the-art visual quality on par with Mip-NeRF360 while maintaining competitive training times.
It enables real-time rendering (≥ 30 fps) at 1080p resolution for various scenes.
The proposed 3D Gaussian representation is compact and efficiently captures complex geometry. |
Artifacts may appear in regions with sparse view coverage.
The method's memory footprint, while lower than that of previous point-based methods, is larger than that of NeRF-based solutions. |
novel view synthesis, radiance fields, 3d gaussian splatting, real-time rendering, differentiable rendering |
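The rendering step in the report above can be illustrated for a single pixel. The sketch below blends depth-sorted splats front to back, which is the standard alpha-compositing rule for point-based rendering. It assumes each Gaussian has already been projected and evaluated at the pixel, so it arrives as a `(depth, rgb, alpha)` triple; projection of anisotropic covariances, tiling, and gradients are omitted. The function name and the early-stop threshold are hypothetical.

```python
def composite_pixel(splats, stop_transmittance=1e-4):
    """Front-to-back alpha blending of splats covering one pixel.

    splats : iterable of (depth, (r, g, b), alpha), with alpha in [0, 1]
    Returns the blended color and the remaining transmittance.
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    # Nearest splats first, so closer surfaces occlude farther ones.
    for _depth, rgb, alpha in sorted(splats, key=lambda s: s[0]):
        weight = alpha * transmittance
        color = [c + weight * channel for c, channel in zip(color, rgb)]
        transmittance *= 1.0 - alpha
        # Once the pixel is effectively opaque, later splats cannot contribute.
        if transmittance < stop_transmittance:
            break
    return color, transmittance
```

For example, a half-transparent red splat in front of an opaque blue one yields an even mix of the two colors and leaves no remaining transmittance.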
2308.03793
Report |
ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation |
Xuefeng Hu, Ke Zhang, Lu Xia, Albert Chen, Jiajia Luo, Yuyin Sun, Ken Wang, Nan Qiao, Xiao Zeng, Min Sun, Cheng-Hao Kuo, Ram Nevatia |
Large-scale pre-trained vision-language models such as CLIP have demonstrated
outstanding performance in zero-shot classification, e.g. achieving 76.3% top-1
accuracy on ImageNet without seeing any example, which leads to potential
benefits to many tasks that have no labeled data. However, while applying CLIP
to a downstream target domain, the presence of visual and text domain gaps and
cross-modality misalignment can greatly impact the model performance. To
address such challenges, we propose ReCLIP, the first source-free domain
adaptation method for vision-language models, which does not require any source
data or target labeled data. ReCLIP first learns a projection space to mitigate
the misaligned visual-text embeddings and learns pseudo labels, and then
deploys cross-modality self-training with the pseudo labels, to update visual
and text encoders, refine labels and reduce domain gaps and misalignments
iteratively. With extensive experiments, we demonstrate ReCLIP reduces the
average error rate of CLIP from 30.17% to 25.06% on 22 image classification
benchmarks. Code available at https://github.com/michiganleon/ReCLIP_WACV. |
ReCLIP is a novel source-free domain adaptation method for Vision-Language Models (VLMs) like CLIP, addressing performance degradation due to domain gaps and misaligned visual-text embeddings. |
VLMs like CLIP, while powerful, suffer performance drops in target domains due to visual and text domain gaps and misaligned cross-modality embeddings, necessitating adaptation. |
ReCLIP learns a projection space to realign embeddings and generate pseudo labels. It then iteratively refines these labels and updates visual and text encoders via cross-modality self-training. |
ReCLIP significantly outperforms baseline adaptation methods (AaD, POUF) and the original CLIP across 22 image classification benchmarks.
ReCLIP demonstrates consistent improvement on various VLM architectures and pre-training strategies.
The proposed projection space and label propagation method effectively generate accurate pseudo labels for self-training. |
Label propagation accuracy becomes unstable on datasets with more than 500 categories, requiring further investigation.
Exploring the use of augmentation consistency in conjunction with ReCLIP could potentially enhance adaptation performance. |
source-free domain adaptation, vision-language models, clip, cross-modality alignment, self-training |
2308.03772
Report |
Improved Neural Radiance Fields Using Pseudo-depth and Fusion |
Jingliang Li, Qiang Zhou, Chaohui Yu, Zhengda Lu, Jun Xiao, Zhibin Wang, Fan Wang |
Since the advent of Neural Radiance Fields, novel view synthesis has received
tremendous attention. Existing approaches for the generalization of radiance
field reconstruction primarily construct an encoding volume from nearby source
images as additional inputs. However, these approaches cannot efficiently
encode the geometric information of real scenes with various scale
objects/structures. In this work, we propose constructing multi-scale encoding
volumes and providing multi-scale geometry information to NeRF models. To make
the constructed volumes as close as possible to the surfaces of objects in the
scene and the rendered depth more accurate, we propose to perform depth
prediction and radiance field reconstruction simultaneously. The predicted
depth map will be used to supervise the rendered depth, narrow the depth range,
and guide point sampling. Finally, the geometric information contained in
point volume features may be inaccurate due to occlusion, lighting, etc. To
this end, we propose enhancing the point volume feature through depth-guided
neighbor feature fusion. Experiments demonstrate the superior performance of
our method in both novel view synthesis and dense geometry modeling without
per-scene optimization. |
This paper proposes an end-to-end framework for generalizable radiance field reconstruction using multi-scale encoding volumes, an auxiliary depth prediction head, and depth-guided adaptive feature fusion. |
Existing NeRF methods struggle to generalize across scenes with diverse object scales and delicate geometry. This work aims to address these limitations and improve rendering quality. |
The framework constructs pyramid encoding volumes to provide multi-scale geometric information. An auxiliary depth prediction head guides point sampling and refines depth ranges. Depth-guided adaptive feature fusion enhances point volume features. |
The method achieves state-of-the-art results on view synthesis benchmarks, including DTU, NeRF Synthetic, and Real Forward-Facing datasets.
Depth reconstruction is significantly improved, as demonstrated by quantitative metrics and qualitative comparisons.
Ablation studies validate the contribution of each proposed module, particularly the multi-scale approach, depth guidance, and feature fusion. |
The method's performance on scenes with highly complex geometry or significant occlusions could be further investigated.
Exploring more efficient architectures for encoding volumes and feature fusion could reduce computational cost. |
neural radiance fields, novel view synthesis, multi-scale representation, depth prediction, feature fusion |
2308.03757
Report |
3D Motion Magnification: Visualizing Subtle Motions with Time Varying Radiance Fields |
Brandon Y. Feng, Hadi Alzayer, Michael Rubinstein, William T. Freeman, Jia-Bin Huang |
Motion magnification helps us visualize subtle, imperceptible motion.
However, prior methods only work for 2D videos captured with a fixed camera. We
present a 3D motion magnification method that can magnify subtle motions from
scenes captured by a moving camera, while supporting novel view rendering. We
represent the scene with time-varying radiance fields and leverage the Eulerian
principle for motion magnification to extract and amplify the variation of the
embedding of a fixed point over time. We study and validate our proposed
principle for 3D motion magnification using both implicit and tri-plane-based
radiance fields as our underlying 3D scene representation. We evaluate the
effectiveness of our method on both synthetic and real-world scenes captured
under various camera setups. |
This paper presents a 3D motion magnification method using Neural Radiance Fields (NeRF), extending the capabilities of traditional 2D video motion magnification to handle dynamic 3D scenes and novel view synthesis. |
This method allows for the visualization and analysis of subtle, imperceptible 3D motions, overcoming the limitations of prior 2D methods that fail on moving camera footage and lack 3D motion analysis. |
The method leverages the Eulerian principle for motion magnification, applying it to the feature embedding space of NeRF instead of directly to color values. It represents the scene with time-varying radiance fields and amplifies temporal variations in the embedding of 3D points, enabling 3D motion magnified rendering. This is achieved by either modifying the positional encoding or by applying 2D video magnification techniques to tri-plane based learned embedding functions. |
Experiments on synthetic scenes show superior performance of the 3D motion magnification compared to applying 2D methods on rendered videos.
Phase-based Eulerian magnification on tri-plane features exhibits the best performance among the explored methods.
The method generalizes to real-world scenes, successfully magnifying subtle motions from both multi-view and single-camera captures, even supporting handheld videos. |
Performance is limited by the quality of NeRF reconstruction, which can be affected by factors like motion blur and inaccurate camera pose estimation.
Future work could explore alternative embedding functions and magnification methods for better handling of complex motions and challenging capture conditions. |
motion magnification, neural radiance fields, 3d vision, novel view synthesis, eulerian analysis |
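The Eulerian principle summarized in this report (amplify how the embedding of a fixed 3D point varies over time) can be sketched as a linear amplification around a reference time. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function name `magnify_embedding`, the per-point list layout, the choice of reference frame, and the factor `alpha` are all hypothetical, and the phase-based tri-plane variant is not shown.

```python
def magnify_embedding(embeddings, alpha, ref_index=0):
    """Linearly amplify the temporal variation of one point's embedding.

    embeddings: list of per-timestep feature vectors (lists of floats)
                for a single fixed 3D point (assumed layout).
    alpha:      magnification factor; alpha = 0 returns the input values.
    ref_index:  timestep used as the static reference (assumption).
    """
    ref = embeddings[ref_index]
    magnified = []
    for e_t in embeddings:
        # e_mag(t) = e(t) + alpha * (e(t) - e(ref))
        magnified.append([x + alpha * (x - r) for x, r in zip(e_t, ref)])
    return magnified
```

Rendering the radiance field with the magnified embedding in place of the original one would then show the exaggerated motion from any viewpoint.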
2308.03747
Report |
Mask Frozen-DETR: High Quality Instance Segmentation with One GPU |
Zhanhao Liang, Yuhui Yuan |
In this paper, we aim to study how to build a strong instance segmenter with
minimal training time and GPUs, as opposed to the majority of current
approaches that pursue more accurate instance segmenters by building more
advanced frameworks at the cost of longer training time and higher GPU
requirements. To achieve this, we introduce a simple and general framework,
termed Mask Frozen-DETR, which can convert any existing DETR-based object
detection model into a powerful instance segmentation model. Our method only
requires training an additional lightweight mask network that predicts instance
masks within the bounding boxes given by a frozen DETR-based object detector.
Remarkably, our method outperforms the state-of-the-art instance segmentation
method Mask DINO in terms of performance on the COCO test-dev split (55.3% vs.
54.7%) while being over 10x faster to train. Furthermore, all of our
experiments can be trained using only one Tesla V100 GPU with 16 GB of memory,
demonstrating the significant efficiency of our proposed framework. |
This paper proposes Mask Frozen-DETR, a method to convert existing DETR-based object detectors into strong instance segmenters by training an additional lightweight mask network using the output of a frozen DETR detector. |
Training modern instance segmentation models from scratch is resource-intensive and time-consuming. This work explores a more efficient approach by leveraging readily available, powerful object detection models. |
The method uses a frozen DETR-based object detector to generate bounding boxes. It then trains a lightweight mask network, comprising an image feature encoder, a box feature encoder, and a query feature encoder, to predict instance masks within those boxes. |
Mask Frozen-DETR outperforms the state-of-the-art Mask DINO on COCO test-dev (55.3% vs. 54.7%).
The method significantly reduces training time, achieving more than 10x speedup compared to Mask DINO.
All experiments were conducted on a single Tesla V100 GPU with 16 GB memory, demonstrating its efficiency. |
The work primarily focuses on the COCO dataset; further validation on other datasets is needed.
Fine-tuning the frozen DETR detector might yield further improvements, although at the cost of increased training time. |
instance segmentation, object detection, detr, efficient training, mask frozen-detr |
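Because the mask network predicts masks inside detector boxes, a box-level prediction has to be placed back into image coordinates at some point. The sketch below illustrates that generic step only; it is not taken from the paper. The function name `paste_box_mask`, the integer box convention, and the nearest-neighbour resize are assumptions made for illustration.

```python
def paste_box_mask(box_mask, box, height, width):
    """Paste a small box-level mask into a full-size image canvas.

    box_mask: 2D list of 0/1 values predicted for the box (assumed format).
    box:      (x0, y0, x1, y1) integer pixel corners, x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = box
    canvas = [[0] * width for _ in range(height)]
    mask_h, mask_w = len(box_mask), len(box_mask[0])
    # Clip the box to the image so out-of-bounds boxes do not raise errors.
    for y in range(max(y0, 0), min(y1, height)):
        for x in range(max(x0, 0), min(x1, width)):
            # Nearest-neighbour lookup from image pixel to mask cell.
            my = (y - y0) * mask_h // (y1 - y0)
            mx = (x - x0) * mask_w // (x1 - x0)
            canvas[y][x] = box_mask[my][mx]
    return canvas
```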
2308.03610
Report |
AvatarVerse: High-quality & Stable 3D Avatar Creation from Text and Pose |
Huichao Zhang, Bowen Chen, Hao Yang, Liao Qu, Xu Wang, Li Chen, Chao Long, Feida Zhu, Kang Du, Min Zheng |
Creating expressive, diverse and high-quality 3D avatars from highly
customized text descriptions and pose guidance is a challenging task, due to
the intricacy of modeling and texturing in 3D that ensure details and various
styles (realistic, fictional, etc). We present AvatarVerse, a stable pipeline
for generating expressive high-quality 3D avatars from nothing but text
descriptions and pose guidance. Specifically, we introduce a 2D diffusion model
conditioned on DensePose signal to establish 3D pose control of avatars through
2D images, which enhances view consistency from partially observed scenarios.
It addresses the infamous Janus Problem and significantly stabilizes the
generation process. Moreover, we propose a progressive high-resolution 3D
synthesis strategy, which obtains substantial improvement over the quality of
the created 3D avatars. To this end, the proposed AvatarVerse pipeline achieves
zero-shot 3D modeling of 3D avatars that are not only more expressive, but also
in higher quality and fidelity than previous works. Rigorous qualitative
evaluations and user studies showcase AvatarVerse's superiority in synthesizing
high-fidelity 3D avatars, leading to a new standard in high-quality and stable
3D avatar creation. Our project page is: https://avatarverse3d.github.io |
This paper presents AvatarVerse, a pipeline that automatically generates high-quality, stable 3D avatars from text prompts and pose guidance. |
Automating the creation of high-quality 3D avatars can save resources in fields like game production and AR/VR. |
Leverages a DensePose-conditioned ControlNet to optimize an explicit NeRF with a progressive high-resolution generation strategy and avatar surface smoothing. |
Generates higher-quality avatars with more detail than previous methods.
Enables flexible avatar generation, including partial avatars and arbitrary poses.
Outperforms SOTA methods in user studies for both geometry and texture quality. |
The quality of the generated avatars still has room for improvement.
The current framework relies on the pre-trained SMPL model, limiting its generalizability to other 3D objects. |
3d avatar generation, text-to-3d, densepose, controlnet, neural radiance fields |
2308.03463
Report |
DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis |
Zhongjie Duan, Lizhou You, Chengyu Wang, Cen Chen, Ziheng Wu, Weining Qian, Jun Huang |
In recent years, diffusion models have emerged as the most powerful approach
in image synthesis. However, applying these models directly to video synthesis
presents challenges, as it often leads to noticeable flickering contents.
Although recently proposed zero-shot methods can alleviate flicker to some
extent, we still struggle to generate coherent videos. In this paper, we
propose DiffSynth, a novel approach that aims to convert image synthesis
pipelines to video synthesis pipelines. DiffSynth consists of two key
components: a latent in-iteration deflickering framework and a video
deflickering algorithm. The latent in-iteration deflickering framework applies
video deflickering to the latent space of diffusion models, effectively
preventing flicker accumulation in intermediate steps. Additionally, we propose
a video deflickering algorithm, named patch blending algorithm, that remaps
objects in different frames and blends them together to enhance video
consistency. One of the notable advantages of DiffSynth is its general
applicability to various video synthesis tasks, including text-guided video
stylization, fashion video synthesis, image-guided video stylization, video
restoring, and 3D rendering. In the task of text-guided video stylization, we
make it possible to synthesize high-quality videos without cherry-picking. The
experimental results demonstrate the effectiveness of DiffSynth. All videos can
be viewed on our project page. Source codes will also be released. |
This paper introduces DiffSynth, a novel approach for converting image synthesis pipelines to video synthesis pipelines using diffusion models, resulting in coherent and realistic video generation. |
Directly applying image synthesis methods to videos leads to flickering and inconsistencies. DiffSynth addresses these challenges and enables the application of diffusion models to video synthesis with high quality. |
DiffSynth utilizes a latent in-iteration deflickering framework to remove flicker in the latent space during intermediate synthesis steps. It also employs a patch blending algorithm, based on patch matching, to blend objects across frames for enhanced video consistency. |
DiffSynth effectively eliminates flicker and generates coherent videos without cherry-picking.
It outperforms existing methods in quantitative metrics such as Pixel-MSE, CLIP Score, FID, and user studies.
The approach demonstrates general applicability in various video synthesis tasks including stylization, synthesis, restoration, and 3D rendering. |
The computational efficiency of DiffSynth can be further improved.
The blending operator in the patch blending algorithm can be further enhanced for better detail generation. |
video synthesis, diffusion models, deflickering, patch matching, latent space |
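The "in-iteration" idea in this report is that deflickering runs on the latents after every denoising step rather than once on the finished video. A minimal sketch of that control flow follows. Everything named here is hypothetical: `denoise_step` is a placeholder callable supplied by the caller, and `temporal_smooth` is a simple neighbour-averaging stand-in, not the patch blending algorithm the paper proposes.

```python
def temporal_smooth(frames, weight=0.5):
    """Stand-in deflicker: blend each frame with the mean of its neighbours."""
    smoothed = []
    for i, frame in enumerate(frames):
        neighbours = [frames[j] for j in (i - 1, i + 1) if 0 <= j < len(frames)]
        if not neighbours:
            # A single-frame video has nothing to blend with.
            smoothed.append(list(frame))
            continue
        mean = [sum(vals) / len(neighbours) for vals in zip(*neighbours)]
        smoothed.append([(1 - weight) * z + weight * m for z, m in zip(frame, mean)])
    return smoothed


def synthesize_video(latents, denoise_step, num_steps, deflicker=temporal_smooth):
    """Denoise every frame, then deflicker the latents, at each step."""
    for step in range(num_steps):
        latents = [denoise_step(z, step) for z in latents]
        # Deflickering inside the loop keeps flicker from accumulating.
        latents = deflicker(latents)
    return latents
```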
2308.03040
Report |
Learning Fine-Grained Features for Pixel-wise Video Correspondences |
Rui Li, Shenglong Zhou, Dong Liu |
Video analysis tasks rely heavily on identifying the pixels from different
frames that correspond to the same visual target. To tackle this problem,
recent studies have advocated feature learning methods that aim to learn
distinctive representations to match the pixels, especially in a
self-supervised fashion. Unfortunately, these methods have difficulty with
tiny or even single-pixel visual targets. Pixel-wise video correspondences were
traditionally related to optical flows, which however lead to deterministic
correspondences and lack robustness on real-world videos. We address the
problem of learning features for establishing pixel-wise correspondences.
Motivated by optical flows as well as the self-supervised feature learning, we
propose to use not only labeled synthetic videos but also unlabeled real-world
videos for learning fine-grained representations in a holistic framework. We
adopt an adversarial learning scheme to enhance the generalization ability of
the learned features. Moreover, we design a coarse-to-fine framework to pursue
high computational efficiency. Our experimental results on a series of
correspondence-based tasks demonstrate that the proposed method outperforms
state-of-the-art rivals in both accuracy and efficiency. |
This paper proposes a method to learn fine-grained features for establishing pixel-wise correspondences in videos by combining supervised learning on synthetic data with self-supervised and adversarial learning on unlabeled real-world data. |
Accurately identifying pixel-wise correspondences across video frames is crucial for various computer vision tasks, but existing methods struggle to capture fine-grained differences over space and time, especially on real-world videos. |
The proposed approach leverages synthetic videos with optical flow labels to learn an initial feature representation. It then introduces soft labeling to convert deterministic correspondences into probabilistic maps, enhancing the model's robustness. Furthermore, it incorporates self-supervised reconstructive learning on unlabeled real-world videos and employs adversarial training to bridge the domain gap between synthetic and real data. |
The method achieves state-of-the-art results on point tracking benchmarks like BADJA, JHMDB, TAP-Vid-DAVIS, and TAP-Vid-Kinetics, demonstrating its effectiveness in capturing fine-grained motion.
It also surpasses previous methods in semi-supervised video object segmentation on DAVIS-2017, highlighting the benefits of fine-grained features for this task.
A proposed coarse-to-fine framework maintains competitive accuracy while significantly improving computational efficiency. |
The authors acknowledge that the focus on fine-grained features might hinder object-centric learning in some cases, suggesting further exploration.
Future work could investigate leveraging more powerful 2D feature extractors for generating soft labels. |
video correspondences, fine-grained feature learning, self-supervised learning, adversarial training, optical flow |
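The soft-labeling step in this report turns a single deterministic correspondence (one target pixel given by optical flow) into a probability map over candidate pixels. How the paper builds that map is not described here, so the sketch below uses a normalized Gaussian around the flow target purely as a stand-in. The function name `soft_label` and the parameter `sigma` are assumptions.

```python
import math


def soft_label(target_xy, height, width, sigma=1.0):
    """Turn one flow-derived target location into a probability map.

    target_xy: (x, y) location in the target frame given by optical flow.
    sigma:     spread of the Gaussian stand-in (hypothetical parameter).
    """
    tx, ty = target_xy
    weights = [
        [math.exp(-((x - tx) ** 2 + (y - ty) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]
    total = sum(sum(row) for row in weights)
    # Normalize so the map sums to 1 and can supervise a softmax output.
    return [[w / total for w in row] for row in weights]
```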
2308.02935
Report |
Bias Behind the Wheel: Fairness Analysis of Autonomous Driving Systems |
Xinyue Li, Zhenpeng Chen, Jie M. Zhang, Federica Sarro, Ying Zhang, Xuanzhe Liu |
This paper analyzes fairness in automated pedestrian detection, a crucial but
under-explored issue in autonomous driving systems. We evaluate eight
state-of-the-art deep learning-based pedestrian detectors across demographic
groups on large-scale real-world datasets. To enable thorough fairness testing,
we provide extensive annotations for the datasets, resulting in 8,311 images
with 16,070 gender labels, 20,115 age labels, and 3,513 skin tone labels. Our
findings reveal significant fairness issues, particularly related to age. The
undetected proportions for children are 20.14% higher compared to adults.
Furthermore, we explore how various driving scenarios affect the fairness of
pedestrian detectors. We find that pedestrian detectors demonstrate significant
gender biases at night, potentially exacerbating the prevalent
societal issue of women's safety concerns when out at night. Moreover, we
observe that pedestrian detectors can demonstrate both enhanced fairness and
superior performance under specific driving conditions, which challenges the
fairness-performance trade-off theory widely acknowledged in the fairness
literature. We publicly release the code, data, and results to support future
research on fairness in autonomous driving. |
This paper presents the first comprehensive study on fairness issues in pedestrian detection for autonomous driving, evaluating eight state-of-the-art detectors across diverse demographic groups. |
Fairness in autonomous driving systems, crucial for preventing discriminatory outcomes and ensuring equal treatment, remains under-explored. This study aims to uncover and analyze these issues to pave the way for more equitable and unbiased systems. |
The study evaluates eight deep learning-based pedestrian detectors on four real-world datasets enriched with manually annotated demographic labels (gender, age, skin tone). The authors analyze performance disparities (miss rates) across demographic groups under different driving scenarios (brightness, contrast, weather conditions). |
State-of-the-art pedestrian detectors exhibit significant age bias, with a 20.14% higher miss rate for children compared to adults.
Significant gender bias is observed during nighttime, with higher miss rates for females, potentially exacerbating safety concerns.
Contrary to common belief, pedestrian detectors can achieve enhanced fairness and detection performance under specific driving scenarios (e.g., higher brightness). |
The manual labeling process, while mitigated by using two annotators and an arbitrator, poses inherent subjectivity.
The study's focus on eight specific pedestrian detectors and four datasets might introduce selection bias, although the authors carefully chose representative models and widely-used datasets. |
fairness, pedestrian detection, autonomous driving, deep learning, bias |
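The disparity measure used in this report, the gap in miss rate between demographic groups, can be computed as sketched below. The function names are hypothetical, the per-pedestrian boolean layout is an assumption, and the sketch reports an absolute difference in miss rates; the source does not say whether its 20.14% figure is absolute or relative.

```python
def miss_rate(detected_flags):
    """Fraction of labeled pedestrians that the detector failed to find.

    detected_flags: one boolean per labeled pedestrian, True if detected.
    """
    if not detected_flags:
        raise ValueError("group has no labeled pedestrians")
    missed = sum(1 for detected in detected_flags if not detected)
    return missed / len(detected_flags)


def miss_rate_gap(group_a, group_b):
    """Absolute difference in miss rate between two groups (a minus b)."""
    return miss_rate(group_a) - miss_rate(group_b)
```

For example, with made-up flags `[True, False, False, True]` for one group and `[True, True, True, False]` for another, the miss rates are 0.5 and 0.25, so the gap is 0.25.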
2308.02915
Report |
DiffDance: Cascaded Human Motion Diffusion Model for Dance Generation |
Qiaosong Qi, Le Zhuo, Aixi Zhang, Yue Liao, Fei Fang, Si Liu, Shuicheng Yan |
When hearing music, it is natural for people to dance to its rhythm.
Automatic dance generation, however, is a challenging task due to the physical
constraints of human motion and rhythmic alignment with target music.
Conventional autoregressive methods introduce compounding errors during
sampling and struggle to capture the long-term structure of dance sequences. To
address these limitations, we present a novel cascaded motion diffusion model,
DiffDance, designed for high-resolution, long-form dance generation. This model
comprises a music-to-dance diffusion model and a sequence super-resolution
diffusion model. To bridge the gap between music and motion for conditional
generation, DiffDance employs a pretrained audio representation learning model
to extract music embeddings and further align its embedding space to motion via
contrastive loss. While training our cascaded diffusion model, we also
incorporate multiple geometric losses to constrain the model outputs to be
physically plausible and add a dynamic loss weight that adaptively changes over
diffusion timesteps to facilitate sample diversity. Through comprehensive
experiments performed on the benchmark dataset AIST++, we demonstrate that
DiffDance is capable of generating realistic dance sequences that align
effectively with the input music. These results are comparable to those
achieved by state-of-the-art autoregressive methods. |
Presents DiffDance, a cascaded motion diffusion model for generating high-resolution, long-form dance sequences from music. |
Addresses limitations of autoregressive methods in dance generation, which suffer from compounding errors and struggle to capture long-term structure. DiffDance leverages diffusion models for realistic and diverse dance sequence generation. |
Employs a two-stage approach: 1) Music-to-Dance diffusion model generates low-resolution dance. 2) Sequence Super-Resolution diffusion model upscales to high-resolution. Uses Wav2CLIP for music embedding and aligns it with motion embedding via contrastive loss. Incorporates geometric losses and dynamic loss weight for realistic and diverse motion. |
Achieves state-of-the-art performance on FID_k and Beat Align Score, demonstrating superior dance quality and music-dance alignment.
Generates dance sequences with distinct long-term choreographic structures, as observed in user studies.
Shows the effectiveness of the cascaded approach, embedding alignment, and geometric losses through ablation studies. |
Limited diversity in geometric features potentially due to regularization losses.
Beat Align Score, while effective, may not capture all nuances of human dance evaluation. |
diffusion model, music-to-dance generation, conditional generation, multimodal learning, motion synthesis |
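The report states only that music embeddings are aligned to motion "via contrastive loss". One common form of such a loss is InfoNCE over a batch of paired clips, sketched below. This is an assumption about the loss family, not the paper's confirmed formulation; the function names and the `temperature` value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(music_embs, motion_embs, temperature=0.07):
    """Average music-to-motion InfoNCE loss; pair i is (music i, motion i)."""
    n = len(music_embs)
    loss = 0.0
    for i in range(n):
        logits = [cosine(music_embs[i], motion_embs[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp over all candidate motions.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denom - logits[i]
    return loss / n
```

The loss is small when each music clip is most similar to its own motion clip and large when it is more similar to another clip in the batch.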
2308.02874
Report |
Sketch and Text Guided Diffusion Model for Colored Point Cloud Generation |
Zijie Wu, Yaonan Wang, Mingtao Feng, He Xie, Ajmal Mian |
Diffusion probabilistic models have achieved remarkable success in text
guided image generation. However, generating 3D shapes is still challenging due
to the lack of sufficient data containing 3D models along with their
descriptions. Moreover, text based descriptions of 3D shapes are inherently
ambiguous and lack details. In this paper, we propose a sketch and text guided
probabilistic diffusion model for colored point cloud generation that
conditions the denoising process jointly with a hand drawn sketch of the object
and its textual description. We incrementally diffuse the point coordinates and
color values in a joint diffusion process to reach a Gaussian distribution.
Colored point cloud generation thus amounts to learning the reverse diffusion
process, conditioned by the sketch and text, to iteratively recover the desired
shape and color. Specifically, to learn effective sketch-text embedding, our
model adaptively aggregates the joint embedding of text prompt and the sketch
based on a capsule attention network. Our model uses staged diffusion to
generate the shape and then assign colors to different parts conditioned on the
appearance prompt while preserving precise shapes from the first stage. This
gives our model the flexibility to extend to multiple tasks, such as appearance
re-editing and part segmentation. Experimental results demonstrate that our
model outperforms recent state-of-the-art in point cloud generation. |
This paper introduces STPD, a novel sketch and text guided diffusion model for generating colored 3D point clouds. |
Generating 3D shapes from text is challenging due to data scarcity and ambiguity in textual descriptions. STPD addresses this using sketches, which provide unambiguous geometric information. |
STPD uses a capsule attention network to extract sparse features from sketches, fuses them with text embeddings, and employs a staged diffusion process to generate shape and color separately. |
STPD outperforms state-of-the-art methods in colored point cloud generation.
The attention-based capsule network effectively learns from sparse sketch data.
STPD demonstrates strong representation learning ability, applicable to 3D object classification and part segmentation. |
STPD's generalization ability is limited by training data size.
Handling conflicting sketch and text inputs needs further investigation. |
3d point cloud generation, diffusion models, sketch-based modeling, text-to-3d, capsule attention networks |
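The joint diffusion of point coordinates and colors toward a Gaussian distribution can be written in closed form if one assumes the standard DDPM parameterization, which this report does not confirm. The notation below is therefore an assumption: $x_0$ stacks the $N$ points' coordinates $p$ and colors $c$, and $\beta_s$ is the noise schedule.

```latex
% Assumed layout: x_0 = [p, c] \in \mathbb{R}^{N \times 6}
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\, I\right),
\qquad \bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)
```

As $t$ grows, $\bar{\alpha}_t$ approaches 0 and $x_t$ approaches a standard Gaussian; generation learns the reverse process conditioned on the sketch and text.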
2308.02840
Report |
Learning Unified Decompositional and Compositional NeRF for Editable Novel View Synthesis |
Yuxin Wang, Wayne Wu, Dan Xu |
Implicit neural representations have shown powerful capacity in modeling
real-world 3D scenes, offering superior performance in novel view synthesis. In
this paper, we target a more challenging scenario, i.e., joint scene novel view
synthesis and editing based on implicit neural scene representations.
State-of-the-art methods in this direction typically consider building separate
networks for these two tasks (i.e., view synthesis and editing). Thus, the
modeling of interactions and correlations between these two tasks is very
limited, which, however, is critical for learning high-quality scene
representations. To tackle this problem, in this paper, we propose a unified
Neural Radiance Field (NeRF) framework to effectively perform joint scene
decomposition and composition for modeling real-world scenes. The decomposition
aims at learning disentangled 3D representations of different objects and the
background, allowing for scene editing, while scene composition models an
entire scene representation for novel view synthesis. Specifically, with a
two-stage NeRF framework, we learn a coarse stage for predicting a global
radiance field as guidance for point sampling, and in the second fine-grained
stage, we perform scene decomposition by a novel one-hot object radiance field
regularization module and a pseudo supervision via inpainting to handle
ambiguous background regions occluded by objects. The decomposed object-level
radiance fields are further composed by using activations from the
decomposition module. Extensive quantitative and qualitative results show the
effectiveness of our method for scene decomposition and composition,
outperforming state-of-the-art methods for both novel-view synthesis and
editing tasks. |
This paper presents a novel Neural Radiance Field (NeRF) framework that unifies scene decomposition and composition for editable novel view synthesis. |
Existing methods for object-aware scene representation often use separate networks for view synthesis and editing, limiting the modeling of interactions between these tasks crucial for high-quality representations. |
The proposed two-stage framework first learns a global radiance field for point sampling guidance. In the fine-grained stage, it performs decomposition using learnable object codes, one-hot object radiance regularization, and in-painting pseudo-supervision for occluded regions. Composition is achieved by utilizing learned activation weights for object-level radiance fields. |
The proposed method demonstrates superior performance in novel view synthesis compared to state-of-the-art methods like ObjectNeRF and ObjectSDF.
It enables effective scene decomposition, allowing for object manipulations such as removal, addition, duplication, and position changes.
The framework shows clear advantages in background rendering quality, particularly in handling unseen or occluded regions. |
The current method relies on object masks as supervisory signals, which may limit its applicability in scenarios where such annotations are unavailable.
Future work could explore the extension of the framework to handle dynamic scenes with moving objects. |
neural radiance fields, novel view synthesis, scene decomposition, scene composition, object editing |
2308.02669
Report |
ConceptLab: Creative Concept Generation using VLM-Guided Diffusion Prior Constraints |
Elad Richardson, Kfir Goldberg, Yuval Alaluf, Daniel Cohen-Or |
Recent text-to-image generative models have enabled us to transform our words
into vibrant, captivating imagery. The surge of personalization techniques that
has followed has also allowed us to imagine unique concepts in new scenes.
However, an intriguing question remains: How can we generate a new, imaginary
concept that has never been seen before? In this paper, we present the task of
creative text-to-image generation, where we seek to generate new members of a
broad category (e.g., generating a pet that differs from all existing pets). We
leverage the under-studied Diffusion Prior models and show that the creative
generation problem can be formulated as an optimization process over the output
space of the diffusion prior, resulting in a set of "prior constraints". To
keep our generated concept from converging into existing members, we
incorporate a question-answering Vision-Language Model (VLM) that adaptively
adds new constraints to the optimization problem, encouraging the model to
discover increasingly more unique creations. Finally, we show that our prior
constraints can also serve as a strong mixing mechanism allowing us to create
hybrids between generated concepts, introducing even more flexibility into the
creative process. |
This paper presents ConceptLab, a method for generating novel image concepts (e.g., a new type of pet) that belong to a broad category (e.g., pets) but differ from existing members of that category (e.g., cats, dogs). |
Existing text-to-image generation techniques excel at generating existing concepts or personalizing models to specific subjects but lack the ability to creatively imagine entirely new concepts within a category. |
ConceptLab optimizes a token embedding in the text encoder space of a pretrained text-to-image diffusion model. It uses "prior constraints" derived from CLIP similarities between a target category and existing members, leveraging a Diffusion Prior model to guide the optimization process. An iterative feedback loop with a VLM expands the set of negative constraints, fostering greater concept uniqueness. |
ConceptLab successfully generates novel concepts across various categories, like pets, buildings, and even artistic styles.
Generated concepts can be seamlessly integrated into different scenes and artistic renderings through text prompts.
Quantitative and user study evaluations confirm ConceptLab's superiority over baseline methods like negative prompting in creating unique and diverse concepts within target categories. |
Editing generated concepts with text prompts does not always preserve the concept's unique properties.
The success of ConceptLab can be limited by the performance of the VLM used for adaptive negative constraint generation. |
creative generation, text-to-image synthesis, diffusion models, diffusion prior, vision-language models |
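The "prior constraints" in this report pull the learned concept toward the broad category and push it away from existing members. A minimal sketch of such an objective, using cosine similarity between embedding vectors, is given below. The exact loss, its weighting, and the space in which similarities are measured are assumptions here; `prior_constraint_loss` and `neg_weight` are hypothetical names.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def prior_constraint_loss(concept, positive, negatives, neg_weight=1.0):
    """Low when the concept is close to the category, far from known members.

    concept:   embedding of the concept being optimized.
    positive:  embedding of the broad category (e.g., "pet").
    negatives: embeddings of existing members (e.g., "cat", "dog").
    """
    pos_term = 1.0 - cosine(concept, positive)
    if negatives:
        neg_term = sum(cosine(concept, n) for n in negatives) / len(negatives)
    else:
        neg_term = 0.0
    return pos_term + neg_weight * neg_term
```

In the adaptive loop described in the report, each member name suggested by the VLM would add one more embedding to `negatives` before optimization continues.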
2308.02552
Report |
Degeneration-Tuning: Using Scrambled Grid shield Unwanted Concepts from Stable Diffusion |
Zixuan Ni, Longhui Wei, Jiacheng Li, Siliang Tang, Yueting Zhuang, Qi Tian |
Owing to the unrestricted nature of the content in the training data, large
text-to-image diffusion models, such as Stable Diffusion (SD), are capable of
generating images with potentially copyrighted or dangerous content based on
corresponding textual concepts information. This includes specific intellectual
property (IP), human faces, and various artistic styles. However, Negative
Prompt, a widely used method for content removal, frequently fails to conceal
this content due to inherent limitations in its inference logic. In this work,
we propose a novel strategy named Degeneration-Tuning (DT) to shield
contents of unwanted concepts from SD weights. By utilizing Scrambled Grid to
reconstruct the correlation between undesired concepts and their corresponding
image domain, we guide SD to generate meaningless content when such textual
concepts are provided as input. As this adaptation occurs at the level of the
model's weights, the SD, after DT, can be grafted onto other conditional
diffusion frameworks like ControlNet to shield unwanted concepts. In addition
to qualitatively showcasing the effectiveness of our DT method in protecting
various types of concepts, a quantitative comparison of the SD before and after
DT indicates that the DT method does not significantly impact the generative
quality of other contents. The FID and IS scores of the model on COCO-30K
exhibit only minor changes after DT, shifting from 12.61 and 39.20 to 13.04 and
38.25, respectively, which clearly outperforms the previous methods. |
This paper introduces Degeneration-Tuning (DT), a novel technique to prevent Stable Diffusion from generating images of undesired concepts by disrupting the low-frequency visual information associated with these concepts, guiding the model to produce meaningless content instead. |
Large text-to-image diffusion models like Stable Diffusion, trained on unrestricted data, risk generating potentially copyrighted or harmful content. Existing methods for content removal, such as Negative Prompt or Safety Filters, have limitations. DT offers a solution by directly modifying model weights, making it robust to parameter leakage. |
DT employs a Scrambled Grid operation to disrupt the low-frequency visual content of targeted concepts. The model is then fine-tuned on this degraded dataset alongside anchor images generated without the specific concepts, effectively masking the original semantic content. |
DT successfully shields various concepts, including specific IPs, artistic styles, and individuals, without significantly affecting the generation quality of other content.
DT remains effective when grafted onto other conditional diffusion models like ControlNet.
Continual DT, while feasible, presents challenges in maintaining image quality due to potential bias amplification. |
Continual DT requires further investigation to address the observed decline in generated image quality.
The impact of DT on the generation of conceptually related terms needs further exploration. |
stable diffusion, content protection, degeneration-tuning, scrambled grid, continual learning |
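The Scrambled Grid operation in this report destroys an image's large-scale (low-frequency) layout while keeping local detail. One plausible reading is to cut the image into a grid of cells and shuffle them, as sketched below. The cell layout, the function name `scrambled_grid`, and the `seed` parameter are assumptions; the paper's exact procedure may differ.

```python
import random


def scrambled_grid(image, grid, seed=None):
    """Split a 2D image (list of rows) into grid x grid cells and shuffle them."""
    height, width = len(image), len(image[0])
    if height % grid or width % grid:
        raise ValueError("image size must be divisible by grid")
    cell_h, cell_w = height // grid, width // grid
    # Cut the image into cells, stored in row-major order.
    cells = [
        [row[c * cell_w:(c + 1) * cell_w]
         for row in image[r * cell_h:(r + 1) * cell_h]]
        for r in range(grid) for c in range(grid)
    ]
    random.Random(seed).shuffle(cells)
    # Reassemble the shuffled cells into an image of the original size.
    out = []
    for r in range(grid):
        for y in range(cell_h):
            row = []
            for c in range(grid):
                row.extend(cells[r * grid + c][y])
            out.append(row)
    return out
```

Fine-tuning on such scrambled images paired with the unwanted concept's prompt is what, per the report, steers the model toward meaningless output for that concept.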
2308.02535
Report |
Learning to Generate Training Datasets for Robust Semantic Segmentation |
Marwane Hariat, Olivier Laurent, Rémi Kazmierczak, Shihao Zhang, Andrei Bursuc, Angela Yao, Gianni Franchi |
Semantic segmentation methods have advanced significantly. Still, their
robustness to real-world perturbations and object types not seen during
training remains a challenge, particularly in safety-critical applications. We
propose a novel approach to improve the robustness of semantic segmentation
techniques by leveraging the synergy between label-to-image generators and
image-to-label segmentation models. Specifically, we design Robusta, a novel
robust conditional generative adversarial network to generate realistic and
plausible perturbed images that can be used to train reliable segmentation
models. We conduct in-depth studies of the proposed generative model, assess
the performance and robustness of the downstream segmentation network, and
demonstrate that our approach can significantly enhance the robustness in the
face of real-world perturbations, distribution shifts, and out-of-distribution
samples. Our results suggest that this approach could be valuable in
safety-critical applications, where the reliability of perception modules such
as semantic segmentation is of utmost importance and comes with a limited
computational budget in inference. We release our code at
https://github.com/ENSTA-U2IS-AI/robusta. |
This paper presents Robusta, a novel cascaded cGAN architecture that improves the robustness of semantic segmentation models against input perturbations and enables them to detect outlier objects. |
Robustness in semantic segmentation is crucial for safety-critical applications like autonomous driving where unexpected objects or conditions can lead to failures. |
Robusta leverages attention layers and sub-networks to generate realistic images even from corrupted label maps. The generated images are used to train an observer network for anomaly detection. The authors introduce a new framework to evaluate the robustness of label-to-image generators and compare Robusta to SOTA methods. |
Robusta generates images comparable or superior in quality to SOTA label-to-image translation methods.
Robusta exhibits superior robustness to label map perturbations compared to other cGANs.
Training on Robusta-generated data improves the robustness and out-of-distribution detection capabilities of semantic segmentation models. |
The paper primarily focuses on specific types of outliers and may not generalize to all unseen objects.
The two-stage training process of Robusta increases computational cost compared to single-stage methods. |
semantic segmentation, robustness, generative adversarial networks, anomaly detection, out-of-distribution detection |
2308.02487
Report |
Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP |
Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, Liang-Chieh Chen |
Open-vocabulary segmentation is a challenging task requiring segmenting and
recognizing objects from an open set of categories. One way to address this
challenge is to leverage multi-modal models, such as CLIP, to provide image and
text features in a shared embedding space, which bridges the gap between
closed-vocabulary and open-vocabulary recognition. Hence, existing methods
often adopt a two-stage framework to tackle the problem, where the inputs first
go through a mask generator and then through the CLIP model along with the
predicted masks. This process involves extracting features from images multiple
times, which can be ineffective and inefficient. By contrast, we propose to
build everything into a single-stage framework using a shared Frozen
Convolutional CLIP backbone, which not only significantly simplifies the
current two-stage pipeline, but also remarkably yields a better accuracy-cost
trade-off. The proposed FC-CLIP benefits from the following observations: the
frozen CLIP backbone maintains the ability of open-vocabulary classification
and can also serve as a strong mask generator, and the convolutional CLIP
generalizes well to a larger input resolution than the one used during
contrastive image-text pretraining. When training on COCO panoptic data only
and testing in a zero-shot manner, FC-CLIP achieves 26.8 PQ, 16.8 AP, and 34.1
mIoU on ADE20K, 18.2 PQ, 27.9 mIoU on Mapillary Vistas, 44.0 PQ, 26.8 AP, 56.2
mIoU on Cityscapes, outperforming the prior art by +4.2 PQ, +2.4 AP, +4.2 mIoU
on ADE20K, +4.0 PQ on Mapillary Vistas and +20.1 PQ on Cityscapes,
respectively. Additionally, FC-CLIP trains 7.5x faster and tests 6.6x faster
than the same prior art, while using 5.9x fewer parameters. FC-CLIP also sets
a new state of the art across various
open-vocabulary semantic segmentation datasets. Code at
https://github.com/bytedance/fc-clip |
This paper introduces FC-CLIP, a single-stage framework for open-vocabulary segmentation, that builds upon a shared frozen convolutional CLIP backbone. |
Existing open-vocabulary segmentation methods rely on two-stage frameworks that are inefficient and ineffective due to separate feature extraction for mask generation and classification. FC-CLIP addresses these limitations with a unified and efficient approach. |
FC-CLIP leverages a frozen convolutional CLIP backbone for both mask generation and classification. It consists of a class-agnostic mask generator, an in-vocabulary classifier trained on seen classes, and an out-of-vocabulary classifier for novel classes, combined using geometric ensembling. |
FC-CLIP achieves state-of-the-art results on open-vocabulary panoptic segmentation benchmarks, including ADE20K, Cityscapes, and Mapillary Vistas, significantly outperforming prior art such as ODISE.
FC-CLIP demonstrates strong performance in open-vocabulary semantic segmentation, achieving state-of-the-art results on ADE20K-847 and PASCAL-Context-459.
FC-CLIP offers a significantly faster inference speed, running 6.6 times faster than ODISE. |
The paper identifies potential for further research in better utilizing CLIP for mask segmentation and classification.
Addressing potential biases present in the Internet data used for CLIP pre-training is crucial. |
open-vocabulary segmentation, panoptic segmentation, semantic segmentation, clip, single-stage framework |
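The geometric ensembling mentioned in the FC-CLIP method summary above can be illustrated with a minimal sketch. It assumes the fusion is a per-class weighted geometric mean of the in-vocabulary and out-of-vocabulary class probabilities, with one weight for classes seen during training and another for novel classes. The function name, the weight values `alpha` and `beta`, and the final renormalisation are assumptions made for illustration, not details confirmed by the summary.

```python
def geometric_ensemble(p_in, p_out, seen_mask, alpha=0.4, beta=0.8):
    """Fuse two per-class probability lists with a weighted geometric mean.

    p_in      : probabilities from the in-vocabulary classifier
    p_out     : probabilities from the out-of-vocabulary (frozen CLIP) classifier
    seen_mask : True where the class was seen during training
    alpha/beta: hypothetical weights on p_out for seen / novel classes
    """
    fused = []
    for pi, po, seen in zip(p_in, p_out, seen_mask):
        w = alpha if seen else beta
        fused.append((pi ** (1.0 - w)) * (po ** w))
    total = sum(fused)
    # Renormalising to sum to 1 is an assumption of this sketch.
    return [f / total for f in fused]


# Toy example with two seen classes and one novel class.
p_in = [0.7, 0.2, 0.1]
p_out = [0.3, 0.3, 0.4]
seen = [True, True, False]
fused = geometric_ensemble(p_in, p_out, seen)
only_in = geometric_ensemble(p_in, p_out, seen, alpha=0.0, beta=0.0)
```

Setting both weights to zero recovers the in-vocabulary prediction, while a larger `beta` lets the frozen CLIP classifier dominate on novel classes.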
2308.02299
Report |
RegionBLIP: A Unified Multi-modal Pre-training Framework for Holistic and Regional Comprehension |
Qiang Zhou, Chaohui Yu, Shaofeng Zhang, Sitong Wu, Zhibing Wang, Fan Wang |
In this work, we investigate extending the comprehension of Multi-modal Large
Language Models (MLLMs) to regional objects. To this end, we propose to extract
features corresponding to regional objects as soft prompts for LLM, which
provides a straightforward and scalable approach and eliminates the need for
LLM fine-tuning. To effectively extract regional features from regular image
features and irregular point cloud features, we present a novel and unified
position-assisted feature extraction module. Furthermore, training an MLLM from
scratch is highly time-consuming. Thus, we propose incrementally extending
existing pre-trained MLLMs to comprehend more modalities and the regional
objects of those modalities. Specifically, we freeze the Q-Former from BLIP-2,
an impressive MLLM, and optimize the modality-specific LoRA parameters in
Q-Former and LLM for each newly introduced modality. The freezing of the
Q-Former eliminates the need for extensive pre-training on massive image-text
data. The frozen Q-Former pre-trained from massive image-text data is also
beneficial for the pre-training on image-region-text data. We name our
framework RegionBLIP. We pre-train RegionBLIP on image-region-text,
point-cloud-text, and point-cloud-region-text data. Experimental results verify
that RegionBLIP can preserve the image comprehension capability of BLIP-2 and
further gain a comprehension of the newly introduced point cloud modality and
regional objects. The Data, Code, and Pre-trained models will be available at
https://github.com/mightyzau/RegionBLIP. |
This paper presents RegionBLIP, a unified Multi-modal Large Language Model (MLLM) framework that incorporates both holistic and regional object comprehension for image and point cloud modalities. |
Comprehending regional objects is essential in many applications like virtual reality. Existing MLLMs struggle to efficiently incorporate this capability, especially across multiple modalities. |
The authors introduce a position-assisted feature extraction (PaFE) module to extract regional features from both regular image features and irregular point cloud features. They also propose an incremental pre-training scheme that freezes the Q-Former from BLIP-2 and learns modality-specific LoRA parameters, enabling efficient extension to new modalities. |
RegionBLIP preserves the image comprehension capabilities of BLIP-2 while extending it to point cloud and regional object comprehension.
The PaFE module significantly improves regional comprehension performance for both image and point cloud modalities.
The incremental pre-training scheme effectively extends MLLM's comprehension capabilities to new modalities without retraining on massive datasets. |
The performance of point cloud region captioning is somewhat limited due to not utilizing point cloud color information.
Future work will involve increasing the size of the RegionCap dataset to improve the generalization of image-region comprehension for MLLM models. |
multi-modal learning, large language models, region comprehension, incremental pre-training, point cloud understanding |
2308.02236
Report |
FB-BEV: BEV Representation from Forward-Backward View Transformations |
Zhiqi Li, Zhiding Yu, Wenhai Wang, Anima Anandkumar, Tong Lu, Jose M. Alvarez |
View Transformation Module (VTM), where transformations happen between
multi-view image features and Bird-Eye-View (BEV) representation, is a crucial
step in camera-based BEV perception systems. Currently, the two most prominent
VTM paradigms are forward projection and backward projection. Forward
projection, represented by Lift-Splat-Shoot, leads to sparsely projected BEV
features without post-processing. Backward projection, with BEVFormer being an
example, tends to generate false-positive BEV features from incorrect
projections because it does not use depth information. To address the above
limitations, we propose a novel forward-backward view transformation module.
Our approach compensates for the deficiencies in both existing methods,
allowing them to enhance each other and obtain higher-quality BEV
representations. We instantiate the proposed module with FB-BEV, which
achieves a new state-of-the-art result of 62.4% NDS on the nuScenes test set.
Code and models are available at https://github.com/NVlabs/FB-BEV. |
Proposes FB-BEV, a novel forward-backward view transformation module for camera-based 3D object detection that generates dense and accurate Bird's Eye View (BEV) representations. |
Existing view transformation modules for BEV perception either produce sparse BEV features (forward projection) or suffer from false positives due to inaccurate depth utilization (backward projection). |
Combines forward projection (F-VTM) to generate initial sparse BEV and depth-aware backward projection (B-VTM) to refine foreground regions identified by a lightweight Foreground Region Proposal Network (FRPN). Depth consistency ensures accurate feature mapping in B-VTM. |
FB-BEV achieves state-of-the-art 62.4% NDS on the nuScenes test set, outperforming previous methods.
Depth-aware backward projection significantly improves performance compared to standard backward projection, demonstrating effective depth utilization.
FRPN improves both inference efficiency and detection accuracy by focusing refinement on foreground regions. |
The two-stage design, while efficient, could be further optimized for end-to-end training.
Exploration of alternative depth-aware mechanisms for backward projection could yield further performance gains. |
3d object detection, bird's eye view (bev), view transformation module (vtm), forward-backward projection, depth consistency |
2308.02157
Report |
Improved Order Analysis and Design of Exponential Integrator for Diffusion Models Sampling |
Qinsheng Zhang, Jiaming Song, Yongxin Chen |
Efficient differential equation solvers have significantly reduced the
sampling time of diffusion models (DMs) while retaining high sampling quality.
Among these solvers, exponential integrators (EI) have gained prominence by
demonstrating state-of-the-art performance. However, existing high-order
EI-based sampling algorithms rely on degenerate EI solvers, resulting in
inferior error bounds and reduced accuracy in contrast to the theoretically
anticipated results under optimal settings. This situation makes the sampling
quality extremely vulnerable to seemingly innocuous design choices such as
timestep schedules. For example, an inefficient timestep scheduler might
necessitate twice the number of steps to achieve a quality comparable to that
obtained through carefully optimized timesteps. To address this issue, we
reevaluate the design of high-order differential solvers for DMs. Through a
thorough order analysis, we reveal that the degeneration of existing high-order
EI solvers can be attributed to the absence of essential order conditions. By
reformulating the differential equations in DMs and capitalizing on the theory
of exponential integrators, we propose refined EI solvers that fulfill all the
order conditions, which we designate as Refined Exponential Solver (RES).
Utilizing these improved solvers, RES exhibits more favorable error bounds
theoretically and achieves superior sampling efficiency and stability in
practical applications. For instance, a simple switch from the single-step
DPM-Solver++ to our order-satisfied RES solver when Number of Function
Evaluations (NFE) $=9$, results in a reduction of numerical defects by $25.2\%$
and FID improvement of $25.4\%$ (16.77 vs 12.51) on a pre-trained ImageNet
diffusion model. |
This paper proposes Refined Exponential Solver (RES), an improved exponential integrator for diffusion model sampling that addresses the order condition violations in existing methods. |
Existing high-order exponential integrator-based sampling algorithms often lead to suboptimal performance due to the use of degenerate solvers that violate necessary order conditions. This results in worse error bounds and reduced accuracy compared to theoretical expectations. |
The authors perform a thorough order analysis of single-step numerical schemes for the diffusion probability flow ODE, identify the overlooked order conditions, and derive a refined exponential integrator that satisfies these conditions. They also extend the analysis to multistep deterministic and stochastic sampling algorithms. |
RES demonstrates significantly smaller numerical defects and faster convergence compared to existing methods like DDIM, Heun, and DPM-Solver++.
The reduction in numerical defects achieved by RES translates to improved sampling quality, as evidenced by lower (better) FID scores.
RES exhibits enhanced robustness to suboptimal time-step schedules compared to other methods. |
The choice of logarithmic transformation for the noise level, while empirically beneficial, lacks theoretical justification.
Training-free methods, even with RES, are still slower than GANs or distillation-based methods. |
diffusion models, sampling algorithms, exponential integrators, numerical ode solvers, order conditions |
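To make the idea of an exponential integrator in the RES record above concrete, here is a minimal sketch of the generic first-order scheme (exponential Euler) on a toy scalar semilinear ODE, dx/dt = lam * x + N(x, t). This is not the RES solver: the ODE, the constant forcing term, and all function names are assumptions chosen for illustration. The point is only that the linear part is integrated exactly, unlike in explicit Euler.

```python
import math


def phi1(z):
    """phi_1(z) = (e^z - 1) / z, with the limit value 1 at z = 0."""
    if abs(z) < 1e-12:
        return 1.0
    return math.expm1(z) / z


def exponential_euler_step(x, t, h, lam, nonlinear):
    """One exponential Euler step for dx/dt = lam * x + nonlinear(x, t)."""
    return math.exp(lam * h) * x + h * phi1(lam * h) * nonlinear(x, t)


def explicit_euler_step(x, t, h, lam, nonlinear):
    """One classical explicit Euler step for the same ODE."""
    return x + h * (lam * x + nonlinear(x, t))


def integrate(step, x0, t_end, n_steps, lam, nonlinear):
    """Apply `step` n_steps times on a uniform grid from 0 to t_end."""
    x, t, h = x0, 0.0, t_end / n_steps
    for _ in range(n_steps):
        x = step(x, t, h, lam, nonlinear)
        t += h
    return x


lam = -5.0


def forcing(x, t):
    # Toy constant nonlinearity (hypothetical, chosen so an exact solution exists).
    return 2.0


# Closed-form solution at t = 1 for x(0) = 1 with constant forcing.
exact = math.exp(lam) * 1.0 + math.expm1(lam) / lam * 2.0
ei = integrate(exponential_euler_step, 1.0, 1.0, 4, lam, forcing)
ee = integrate(explicit_euler_step, 1.0, 1.0, 4, lam, forcing)
```

With a constant forcing term the exponential Euler result matches the closed form up to rounding even with only four steps, whereas explicit Euler carries a visible discretisation error. Higher-order solvers differ in how they approximate the nonlinear term, which is where order conditions come into play.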
2308.02154
Report |
SDDM: Score-Decomposed Diffusion Models on Manifolds for Unpaired Image-to-Image Translation |
Shikun Sun, Longhui Wei, Junliang Xing, Jia Jia, Qi Tian |
Recent score-based diffusion models (SBDMs) show promising results in
unpaired image-to-image translation (I2I). However, existing methods, either
energy-based or statistically-based, provide no explicit form of the interfered
intermediate generative distributions. This work presents a new
score-decomposed diffusion model (SDDM) on manifolds to explicitly optimize the
tangled distributions during image generation. SDDM derives manifolds to make
the distributions of adjacent time steps separable and decompose the score
function or energy guidance into an image "denoising" part and a content
"refinement" part. To refine the image in the same noise level, we equalize
the refinement parts of the score function and energy guidance, which permits
multi-objective optimization on the manifold. We also leverage the block
adaptive instance normalization module to construct manifolds with lower
dimensions but still concentrated with the perturbed reference image. SDDM
outperforms existing SBDM-based methods with far fewer diffusion steps on
several I2I benchmarks. |
This paper proposes SDDM, a novel score-decomposed diffusion model on manifolds, for unpaired image-to-image translation. |
Existing score-based diffusion models for image translation lack explicit control over intermediate generative distributions, leading to suboptimal results. |
SDDM decomposes score function and energy guidance into "denoising" and "refinement" parts using manifolds. It utilizes statistical guidance to separate adjacent time-step distributions and leverages the BAdaIN module to construct low-dimensional manifolds. Finally, it performs multi-objective optimization on these manifolds. |
SDDM achieves superior performance on I2I benchmarks compared to other SBDM-based methods.
It requires significantly fewer diffusion steps (100) than methods like EGSDE (1000) while achieving better or comparable results.
Ablation studies confirm the effectiveness of score decomposition, BAdaIN-based manifolds, and multi-objective optimization. |
The approach introduces additional computations, albeit negligible compared to neural network inferences.
Future work includes exploring stronger energy functions and applying SDDM to a wider range of image translation tasks. |
image-to-image translation, diffusion models, score-based models, manifold optimization, generative models |
2308.02117
Report |
VQGraph: Rethinking Graph Representation Space for Bridging GNNs and MLPs |
Ling Yang, Ye Tian, Minkai Xu, Zhongyi Liu, Shenda Hong, Wei Qu, Wentao Zhang, Bin Cui, Muhan Zhang, Jure Leskovec |
GNN-to-MLP distillation aims to utilize knowledge distillation (KD) to learn
a computationally efficient multi-layer perceptron (student MLP) on graph data by
mimicking the output representations of a teacher GNN. Existing methods mainly
make the MLP mimic the GNN predictions over a few class labels. However, the
class space may not be expressive enough for covering numerous diverse local
graph structures, thus limiting the performance of knowledge transfer from GNN
to MLP. To address this issue, we propose to learn a new powerful graph
representation space by directly labeling nodes' diverse local structures for
GNN-to-MLP distillation. Specifically, we propose a variant of VQ-VAE to learn
a structure-aware tokenizer on graph data that can encode each node's local
substructure as a discrete code. The discrete codes constitute a codebook as a
new graph representation space that is able to identify different local graph
structures of nodes with the corresponding code indices. Then, based on the
learned codebook, we propose a new distillation target, namely soft code
assignments, to directly transfer the structural knowledge of each node from
GNN to MLP. The resulting framework VQGraph achieves new state-of-the-art
performance on GNN-to-MLP distillation in both transductive and inductive
settings across seven graph datasets. We show that VQGraph runs inference
828x faster than GNNs while also improving accuracy over GNNs and
stand-alone MLPs by 3.90% and 28.05% on average,
respectively. Code: https://github.com/YangLing0818/VQGraph. |
This paper introduces VQGraph, a novel GNN-to-MLP distillation framework that enhances the expressiveness of graph representation space by directly labeling diverse local node structures using a codebook for structure-aware knowledge transfer. |
Existing GNN-to-MLP distillation methods rely on class label space which lacks expressiveness to capture the diverse local graph structures, limiting their performance. |
VQGraph leverages a variant of VQ-VAE to learn a structure-aware tokenizer on graph data, encoding each node's local substructure into a discrete code. These codes constitute a codebook, forming a powerful representation space that can distinguish different local structures. VQGraph then utilizes this codebook to perform structure-aware distillation by minimizing the KL divergence between GNN and MLP predictions over the discrete codes (soft code assignment). |
VQGraph achieves state-of-the-art performance on GNN-to-MLP distillation, outperforming teacher GNNs by 3.90% on average accuracy while being 828x faster in inference.
The learned representation space in VQGraph is more compact and better captures both local and global graph structural information compared to existing methods.
Extensive experiments across seven datasets, including both transductive and inductive settings, show the effectiveness and robustness of VQGraph. |
The codebook size selection is crucial and currently relies on dataset-specific tuning.
Further exploration of different relation modules for computing code assignments could be beneficial. |
graph neural networks, knowledge distillation, graph representation learning, structure-aware distillation, vq-vae |
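The soft code assignment target described in the VQGraph record above can be sketched as follows. The sketch assumes that each node embedding is compared to every codebook entry, that the similarities are turned into a distribution with a temperature softmax, and that the student is trained to match the teacher's distribution under a KL divergence. The use of negative squared Euclidean distance as the similarity, the temperature value, and every name below are assumptions for illustration; the paper's relation module may differ.

```python
import math


def softmax(scores, tau=1.0):
    """Numerically stable softmax with temperature tau."""
    m = max(scores)
    exps = [math.exp((s - m) / tau) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_code_assignment(embedding, codebook, tau=1.0):
    """Distribution over code indices from negative squared distances."""
    scores = [
        -sum((a - b) ** 2 for a, b in zip(embedding, code)) for code in codebook
    ]
    return softmax(scores, tau)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Hypothetical 2-D codebook with three codes, plus one teacher (GNN) and one
# student (MLP) embedding for the same node.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
teacher = soft_code_assignment([0.9, 0.1], codebook, tau=0.5)
student = soft_code_assignment([0.2, 0.7], codebook, tau=0.5)
distill_loss = kl_divergence(teacher, student)
self_loss = kl_divergence(teacher, teacher)
```

The loss vanishes when the student's assignment matches the teacher's and grows as the student's embedding drifts toward a different code.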
2308.02065
Report |
On the Biometric Capacity of Generative Face Models |
Vishnu Naresh Boddeti, Gautam Sreekumar, Arun Ross |
There has been tremendous progress in generating realistic faces with high
fidelity over the past few years. Despite this progress, a crucial question
remains unanswered: "Given a generative face model, how many unique identities
can it generate?" In other words, what is the biometric capacity of the
generative face model? A scientific basis for answering this question will
benefit evaluating and comparing different generative face models and establish
an upper bound on their scalability. This paper proposes a statistical approach
to estimate the biometric capacity of generated face images in a hyperspherical
feature space. We employ our approach on multiple generative models, including
unconditional generators like StyleGAN, Latent Diffusion Model, and "Generated
Photos," as well as DCFace, a class-conditional generator. We also estimate
capacity w.r.t. demographic attributes such as gender and age. Our capacity
estimates indicate that (a) under ArcFace representation at a false acceptance
rate (FAR) of 0.1%, StyleGAN3 and DCFace have a capacity upper bound of
$1.43\times10^6$ and $1.190\times10^4$, respectively; (b) the capacity reduces
drastically as we raise the desired FAR, with estimates of $1.796\times10^4$
and $562$ at FAR of 1% and 10%, respectively, for StyleGAN3; (c) there is no
discernible disparity in the capacity w.r.t gender; and (d) for some generative
models, there is an appreciable disparity in the capacity w.r.t age. Code is
available at https://github.com/human-analysis/capacity-generative-face-models. |
This paper proposes the first statistically robust method for estimating the biometric capacity, or the maximum number of unique identities a generative face model can produce, by analyzing the distribution of generated faces in a hyperspherical feature space. |
Estimating capacity provides an upper bound on the scalability of generative face models without exhaustive empirical evaluation, allowing for informed deployment and comparison of different models based on the uniqueness of generated identities. |
The approach involves representing generated faces in a hyperspherical feature space (using face recognition models like ArcFace, AdaFace), approximating population and class-specific manifolds as hyperspherical caps, and calculating capacity as the ratio of their surface areas. |
StyleGAN3 and DCFace have capacity upper bounds of 1.43 million and 11,900 respectively at a false acceptance rate (FAR) of 0.1%.
Capacity decreases drastically as the FAR is relaxed (e.g., StyleGAN3 capacity drops to 562 at 10% FAR).
While capacity remains consistent across genders, some models show disparity in capacity across different age groups. |
The estimation relies on the assumption that intra-class variance can be approximated from real-face datasets.
The approach provides an upper bound, and relaxing assumptions could lead to tighter capacity estimates in future work. |
generative face models, biometric capacity, hyperspherical feature space, face recognition, diversity and uniqueness |
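The capacity computation summarised in the record above, a ratio of hyperspherical cap surface areas, can be sketched numerically. The sketch relies on the standard fact that the area of a cap of half-angle theta on the unit sphere in R^d is proportional to the integral of sin(phi)^(d-2) from 0 to theta, so the constant cancels in the ratio. The angles below are hypothetical placeholders; in the paper they are estimated from face embeddings and depend on the chosen FAR.

```python
import math


def cap_measure(theta, dim, n_steps=20000):
    """Unnormalised area of a cap of half-angle theta on the unit sphere in R^dim.

    Uses the midpoint rule on the integral of sin(phi)^(dim - 2) over [0, theta].
    """
    h = theta / n_steps
    return h * sum(math.sin((k + 0.5) * h) ** (dim - 2) for k in range(n_steps))


def capacity_ratio(theta_population, theta_class, dim):
    """Ratio of the population cap area to a single-identity cap area."""
    return cap_measure(theta_population, dim) / cap_measure(theta_class, dim)


# Hypothetical half-angles (radians) for the population and for one identity.
theta_pop, theta_cls = 1.2, 0.3

# In 3-D the cap area has the closed form 2*pi*(1 - cos(theta)), which gives a
# reference value for checking the numerical integration.
ratio_3d = capacity_ratio(theta_pop, theta_cls, 3)
expected_3d = (1.0 - math.cos(theta_pop)) / (1.0 - math.cos(theta_cls))
ratio_8d = capacity_ratio(theta_pop, theta_cls, 8)
```

The ratio grows quickly with dimension for fixed angles. For realistic embedding dimensions (for example several hundred), the integrand underflows in plain floating point, so a practical implementation would likely work in log space or use a regularised incomplete beta function.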
2308.01944
Report |
Dynamic Token-Pass Transformers for Semantic Segmentation |
Yuang Liu, Qiang Zhou, Jing Wang, Fan Wang, Jun Wang, Wei Zhang |
Vision transformers (ViT) usually extract features by forwarding all the
tokens through the self-attention layers from first to last. In this paper, we
introduce dynamic token-pass vision transformers (DoViT) for semantic
segmentation, which can adaptively reduce the inference cost for images with
different complexity. DoViT gradually stops partial easy tokens from
self-attention calculation and keeps the hard tokens forwarding until meeting
the stopping criteria. We employ lightweight auxiliary heads to make the
token-pass decision and divide the tokens into keeping/stopping parts. With a
token separate calculation, the self-attention layers are speeded up with
sparse tokens and still work friendly with hardware. A token reconstruction
module is built to collect and reset the grouped tokens to their original
position in the sequence, which is necessary to predict correct semantic masks.
We conduct extensive experiments on two common semantic segmentation tasks, and
demonstrate that our method reduces FLOPs by about 40% $\sim$ 60% while keeping
the mIoU drop within 0.8% for various segmentation transformers. The
throughput and inference speed of ViT-L/B increase to more than 2$\times$
on Cityscapes. |
This paper presents DoViT, a dynamic token-pass vision transformer for semantic segmentation that adaptively reduces inference cost based on image complexity. |
Current vision transformers, though achieving high performance, are computationally expensive, making them prohibitive for real-time applications and resource-constrained devices. |
DoViT uses a semantic early-probe scheme to progressively stop easy tokens from self-attention calculation based on prediction confidence. It employs separate self-attention for remaining tokens and reconstructs the token sequence to ensure correct semantic prediction. |
DoViT reduces FLOPs by 40-60% with less than 0.8% mIoU drop on Cityscapes compared to standard ViT backbones.
Throughput and FPS are improved to over 2x on Cityscapes, demonstrating significant speedup.
The adaptive token-pass allows for image-specific inference cost, leading to varying levels of computation reduction based on complexity. |
Smaller networks show less FLOPs reduction due to lower confidence at early-probe stages, especially on challenging datasets like ADE20K.
Future work includes combining data-aware acceleration with parameter-aware compression techniques and extending it to other dense prediction tasks. |
semantic segmentation, vision transformer, model acceleration, dynamic token pass, early-probe |
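The token-pass and token-reconstruction steps described in the DoViT record above can be sketched with plain lists. The sketch assumes that a per-token confidence is already available (in DoViT it would come from the lightweight auxiliary heads), that tokens at or above a threshold stop being updated, and that the remaining tokens are processed and then scattered back to their original positions. The threshold value, the stand-in `layer_fn`, and all names are assumptions for illustration.

```python
def split_tokens(tokens, confidences, threshold):
    """Return (kept, stopped) lists of (original_index, token) pairs."""
    kept, stopped = [], []
    for idx, (tok, conf) in enumerate(zip(tokens, confidences)):
        if conf >= threshold:
            stopped.append((idx, tok))  # easy token: skip further layers
        else:
            kept.append((idx, tok))     # hard token: keep forwarding
    return kept, stopped


def reconstruct(kept, stopped, length):
    """Scatter both groups back to their original sequence positions."""
    out = [None] * length
    for idx, tok in kept + stopped:
        out[idx] = tok
    return out


def token_pass_layer(tokens, confidences, threshold, layer_fn):
    """Apply layer_fn only to the kept tokens, then rebuild the full sequence."""
    kept, stopped = split_tokens(tokens, confidences, threshold)
    updated = layer_fn([tok for _, tok in kept]) if kept else []
    kept = [(idx, tok) for (idx, _), tok in zip(kept, updated)]
    return reconstruct(kept, stopped, len(tokens))


def toy_layer(batch):
    # Stand-in for a self-attention layer (hypothetical): doubles each token.
    return [2.0 * tok for tok in batch]


tokens = [1.0, 2.0, 3.0, 4.0]
confidences = [0.99, 0.20, 0.97, 0.50]
result = token_pass_layer(tokens, confidences, 0.95, toy_layer)
```

Only the two low-confidence tokens pass through `toy_layer`, so the cost of the layer scales with the number of hard tokens while the output keeps the original ordering needed for dense prediction.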
2308.01904
Report |
DETR Doesn't Need Multi-Scale or Locality Design |
Yutong Lin, Yuhui Yuan, Zheng Zhang, Chen Li, Nanning Zheng, Han Hu |
This paper presents an improved DETR detector that maintains a "plain"
nature: using a single-scale feature map and global cross-attention
calculations without specific locality constraints, in contrast to previous
leading DETR-based detectors that reintroduce architectural inductive biases of
multi-scale and locality into the decoder. We show that two simple technologies
are surprisingly effective within a plain design to compensate for the lack of
multi-scale feature maps and locality constraints. The first is a box-to-pixel
relative position bias (BoxRPB) term added to the cross-attention formulation,
which well guides each query to attend to the corresponding object region while
also providing encoding flexibility. The second is masked image modeling
(MIM)-based backbone pre-training which helps learn representation with
fine-grained localization ability and proves crucial for remedying dependencies
on the multi-scale feature maps. By incorporating these technologies and recent
advancements in training and problem formation, the improved "plain" DETR
showed exceptional improvements over the original DETR detector. By leveraging
the Objects365 dataset for pre-training, it achieved 63.9 mAP accuracy using a
Swin-L backbone, which is highly competitive with state-of-the-art detectors
which all heavily rely on multi-scale feature maps and region-based feature
extraction. Code is available at https://github.com/impiga/Plain-DETR . |
This paper proposes an improved DETR detector that maintains a "plain" nature by using a single-scale feature map and global cross-attention calculations without specific locality constraints, unlike previous DETR-based detectors. |
The paper aims to improve upon the original DETR detector while preserving its simplicity and reducing reliance on domain-specific architectural biases. |
The paper introduces two key technologies: 1) Box-to-pixel relative position bias (BoxRPB) to guide cross-attention computation, and 2) Masked image modeling (MIM) pre-training to enhance feature representation with fine-grained localization. |
BoxRPB significantly improves detection accuracy by +8.9 mAP over the plain DETR baseline.
MIM pre-training further boosts performance by +7.4 mAP and enables the removal of multi-scale feature maps.
The improved plain DETR achieves 63.9 mAP with a Swin-L backbone, making it competitive with state-of-the-art detectors. |
The paper primarily focuses on object detection, and its generalizability to other vision tasks needs further exploration.
Further investigation into the interplay between BoxRPB and MIM pre-training is needed. |
object detection, detr, transformer, relative position bias, masked image modeling |
2308.01779
Report |
Point2Mask: Point-supervised Panoptic Segmentation via Optimal Transport |
Wentong Li, Yuqian Yuan, Song Wang, Jianke Zhu, Jianshu Li, Jian Liu, Lei Zhang |
Weakly-supervised image segmentation has recently attracted increasing
research attention, aiming to avoid expensive pixel-wise labeling. In this
paper, we present an effective method, namely Point2Mask, to achieve
high-quality panoptic prediction using only a single random point annotation
per target for training. Specifically, we formulate the panoptic pseudo-mask
generation as an Optimal Transport (OT) problem, where each ground-truth (gt)
point label and pixel sample are defined as the label supplier and consumer,
respectively. The transportation cost is calculated by the introduced
task-oriented maps, which focus on the category-wise and instance-wise
differences among the various thing and stuff targets. Furthermore, a
centroid-based scheme is proposed to set the accurate unit number for each gt
point supplier. Hence, the pseudo-mask generation is converted into finding the
optimal transport plan at a globally minimal transportation cost, which can be
solved via the Sinkhorn-Knopp Iteration. Experimental results on Pascal VOC and
COCO demonstrate the promising performance of our proposed Point2Mask approach
to point-supervised panoptic segmentation. Source code is available at:
https://github.com/LiWentomng/Point2Mask. |
Presents Point2Mask, a novel weakly supervised panoptic segmentation method that leverages Optimal Transport (OT) to generate pseudo-labels from single point annotations. |
Addresses the limitations of existing weakly supervised methods that struggle to accurately segment objects using only point supervision, particularly in differentiating between nearby instances of the same category. |
1. **Feature Learning:** Employs a two-branch network to extract category- and instance-level representations. 2. **OT-based Pseudo-label Generation:** Formulates an OT problem to assign pixels to ground truth labels based on a cost function that considers semantic and boundary information. 3. **Training:** Trains a panoptic segmentation model using the generated pseudo-labels. |
Achieves state-of-the-art performance on Pascal VOC and COCO datasets using single-point supervision.
Demonstrates superior performance compared to previous weakly supervised methods, particularly in distinguishing nearby instances.
Showcases the effectiveness of OT in assigning pixels for accurate pseudo-label generation. |
May not perform well on dense objects of the same category due to reliance on single-point annotation.
Relies on a relatively simple segmentation architecture which could limit performance on complex scenes. |
panoptic segmentation, weakly supervised learning, optimal transport, point annotation, pseudo-label |
2308.01766
Report |
Neural Poisson Surface Reconstruction: Resolution-Agnostic Shape Reconstruction from Point Clouds |
Hector Andrade-Loarca, Julius Hege, Daniel Cremers, Gitta Kutyniok |
We introduce Neural Poisson Surface Reconstruction (nPSR), an architecture
for shape reconstruction that addresses the challenge of recovering 3D shapes
from points. Traditional deep neural networks face challenges with common 3D
shape discretization techniques due to their computational complexity at higher
resolutions. To overcome this, we leverage Fourier Neural Operators to solve
the Poisson equation and reconstruct a mesh from oriented point cloud
measurements. nPSR exhibits two main advantages: First, it enables efficient
training on low-resolution data while achieving comparable performance at
high-resolution evaluation, thanks to the resolution-agnostic nature of FNOs.
This feature allows for one-shot super-resolution. Second, our method surpasses
existing approaches in reconstruction quality while being differentiable and
robust with respect to point sampling rates. Overall, the neural Poisson
surface reconstruction not only improves upon the limitations of classical deep
neural networks in shape reconstruction but also achieves superior results in
terms of reconstruction quality, running time, and resolution agnosticism. |
Introduces Neural Poisson Surface Reconstruction, a novel architecture using Fourier Neural Operators for reconstructing 3D shapes from oriented point clouds by solving the Poisson equation. |
Addresses limitations of traditional deep learning methods in 3D shape reconstruction, particularly in handling high resolutions and low sampling rates. |
Leverages Fourier Neural Operators to learn a mapping from a point cloud rasterized to a voxel grid representation of the divergence of the normal field, to the reconstructed shape. Employs Otsu's thresholding and marching cubes for post-processing. |
Significantly outperforms existing methods in low sampling scenarios (3,000-25,000 points).
Achieves comparable performance to state-of-the-art in high sampling regimes (250,000 points).
Exhibits resolution agnosticism, enabling training on low-resolution data and evaluating on higher resolutions with similar performance. |
Requires pre-determined resolution for training data, potentially leading to loss of detail.
Further exploration of alternative architectures and regularization techniques for optimization. |
3d shape reconstruction, point cloud processing, fourier neural operator, poisson surface reconstruction, resolution agnostic |
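The entry above mentions Otsu's thresholding as a post-processing step. As a rough illustration, the classical Otsu method picks the threshold that maximizes the between-class variance of a histogram. The sketch below applies it to a flat list of scalar values; how nPSR applies it to the predicted field is not described here, so the bin count, the toy values, and the name `otsu_threshold` are assumptions.

```python
def otsu_threshold(values, n_bins=64):
    """Classical Otsu threshold on a flat list of scalars (histogram-based)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo
    width = (hi - lo) / n_bins
    hist = [0] * n_bins
    for x in values:
        k = min(int((x - lo) / width), n_bins - 1)
        hist[k] += 1
    total = len(values)
    centers = [lo + (k + 0.5) * width for k in range(n_bins)]
    sum_all = sum(h * c for h, c in zip(hist, centers))
    best_var, best_k = -1.0, 0
    w0, sum0 = 0, 0.0
    for k in range(n_bins - 1):
        # Class 0 = bins 0..k, class 1 = bins k+1..n_bins-1
        w0 += hist[k]
        sum0 += hist[k] * centers[k]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_k = var_between, k
    # Upper edge of the last bin assigned to class 0
    return lo + (best_k + 1) * width

# Toy indicator-like values (made up): near 0 outside, near 1 inside
field = [0.02, 0.05, 0.08, 0.1, 0.9, 0.93, 0.95, 0.97]
t = otsu_threshold(field)
inside = [x > t for x in field]
```

The resulting binary occupancy is the kind of input a marching-cubes routine would then turn into a mesh.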
2308.01544
Report |
Multimodal Neurons in Pretrained Text-Only Transformers |
Sarah Schwettmann, Neil Chowdhury, Samuel Klein, David Bau, Antonio Torralba |
Language models demonstrate remarkable capacity to generalize representations
learned in one modality to downstream tasks in other modalities. Can we trace
this ability to individual neurons? We study the case where a frozen text
transformer is augmented with vision using a self-supervised visual encoder and
a single linear projection learned on an image-to-text task. Outputs of the
projection layer are not immediately decodable into language describing image
content; instead, we find that translation between modalities occurs deeper
within the transformer. We introduce a procedure for identifying "multimodal
neurons" that convert visual representations into corresponding text, and
decoding the concepts they inject into the model's residual stream. In a series
of experiments, we show that multimodal neurons operate on specific visual
concepts across inputs, and have a systematic causal effect on image
captioning. |
The paper investigates the emergence of "multimodal neurons" within the MLP layers of a frozen text transformer (GPT-J) augmented with a self-supervised visual encoder (BEIT) for image captioning. |
This work aims to understand how language models, trained solely on text, demonstrate cross-modal generalization abilities when combined with visual encoders. |
The authors introduce a gradient-based attribution method to identify neurons that significantly contribute to predicting specific nouns in image captions. They decode the language contributions of these neurons by analyzing the corresponding output embedding weights. |
Image representations projected into the transformer's embedding space do not directly encode interpretable semantic information, implying that cross-modal translation happens within the transformer.
Multimodal neurons, found predominantly in earlier transformer layers, exhibit selectivity for specific visual concepts and consistently translate them into related text.
Ablating these multimodal neurons significantly alters the generated captions, demonstrating their causal role in translating visual information into language. |
The study focuses on a single vision-language model (LiMBeR-BEIT) with separate vision and language components. Future work should investigate the presence of multimodal neurons in other architectures.
While the authors demonstrate the existence and influence of multimodal neurons, a deeper understanding of their formation and how they assemble concepts from upstream representations is needed. |
multimodal learning, vision-language models, transformer networks, neuron interpretability, cross-modal generalization |
2308.01536
Report |
MFIM: Megapixel Facial Identity Manipulation |
Sanghyeon Na |
Face swapping is a task that changes a facial identity of a given image to
that of another person. In this work, we propose a novel face-swapping
framework called Megapixel Facial Identity Manipulation (MFIM). The
face-swapping model should achieve two goals. First, it should be able to
generate a high-quality image. We argue that a model which is proficient in
generating a megapixel image can achieve this goal. However, generating a
megapixel image is generally difficult without careful model design. Therefore,
our model exploits pretrained StyleGAN in the manner of GAN-inversion to
effectively generate a megapixel image. Second, it should be able to
effectively transform the identity of a given image. Specifically, it should be
able to actively transform ID attributes (e.g., face shape and eyes) of a given
image into those of another person, while preserving ID-irrelevant attributes
(e.g., pose and expression). To achieve this goal, we exploit 3DMM that can
capture various facial attributes. Specifically, we explicitly supervise our
model to generate a face-swapped image with the desirable attributes using
3DMM. We show that our model achieves state-of-the-art performance through
extensive experiments. Furthermore, we propose a new operation called ID
mixing, which creates a new identity by semantically mixing the identities of
several people. It allows the user to customize the new identity. |
This paper presents MFIM, a novel face-swapping framework that generates high-quality megapixel face-swapped images and effectively performs identity transformation. |
Face swapping has applications in entertainment, privacy protection, and the theatrical industry, making high-quality and effective face swapping techniques increasingly important. |
MFIM utilizes a pretrained StyleGAN generator and a facial attribute encoder to generate images. It leverages 3DMM for explicit supervision during training to ensure effective identity transformation, particularly in face shape. It introduces a novel ID mixing operation, creating new identities by combining attributes from multiple source images. |
MFIM achieves state-of-the-art performance on face swapping benchmarks, outperforming baselines in identity, shape, expression, and pose metrics.
The use of style maps in the encoder allows MFIM to preserve details from the target image, addressing limitations of previous StyleGAN-based methods.
The ID mixing operation enables semantic control over identity creation, blending global and local attributes from multiple source images without requiring additional training or labels. |
The disentanglement of ID and ID-irrelevant representations in MFIM can be further improved to prevent attribute leakage.
Investigating the application of MFIM to high-frequency detail reconstruction, potentially through techniques like ROI-only synthesis, is a promising direction for future work. |
face swapping, gan inversion, stylegan, 3dmm, identity mixing |
2308.01532
Report |
Multimodal Adaptation of CLIP for Few-Shot Action Recognition |
Jiazheng Xing, Mengmeng Wang, Xiaojun Hou, Guang Dai, Jingdong Wang, Yong Liu |
Applying large-scale pre-trained visual models like CLIP to few-shot action
recognition tasks can benefit performance and efficiency. Utilizing the
"pre-training, fine-tuning" paradigm makes it possible to avoid training a
network from scratch, which can be time-consuming and resource-intensive.
However, this method has two drawbacks. First, limited labeled samples for
few-shot action recognition necessitate minimizing the number of tunable
parameters to mitigate over-fitting, also leading to inadequate fine-tuning
that increases resource consumption and may disrupt the generalized
representation of models. Second, the video's extra-temporal dimension
challenges few-shot recognition's effective temporal modeling, while
pre-trained visual models are usually image models. This paper proposes a novel
method called Multimodal Adaptation of CLIP (MA-CLIP) to address these issues.
It adapts CLIP for few-shot action recognition by adding lightweight adapters,
which can minimize the number of learnable parameters and enable the model to
transfer across different tasks quickly. The adapters we design can combine
information from video-text multimodal sources for task-oriented spatiotemporal
modeling, which is fast, efficient, and has low training costs. Additionally,
based on the attention mechanism, we design a text-guided prototype
construction module that can fully utilize video-text information to enhance
the representation of video prototypes. Our MA-CLIP is plug-and-play and can
be used with any few-shot action recognition temporal alignment metric. |
This paper proposes MA-CLIP, a novel method that adapts the CLIP model for few-shot action recognition by incorporating lightweight adapters and a text-guided prototype construction module. |
Few-shot action recognition benefits from large pre-trained models but suffers from overfitting with limited data and difficulty in effective temporal modeling. MA-CLIP addresses these issues by using adapters to minimize trainable parameters and enable quick task transfer while enhancing temporal modeling with minimal extra parameters. |
MA-CLIP freezes the pre-trained CLIP encoders and inserts lightweight adapters for task-specific spatiotemporal modeling. These adapters leverage multimodal information from video and text. A text-guided prototype construction module, based on attention, enhances video prototype representations. MA-CLIP is designed to be compatible with any temporal alignment metric used in few-shot action recognition. |
MA-CLIP achieves state-of-the-art performance on five widely used datasets for few-shot action recognition, surpassing previous methods in accuracy.
The use of adapters allows for significant reduction in trainable parameters and training time compared to full fine-tuning of the visual encoder, making it more efficient.
Experiments demonstrate that incorporating text information significantly boosts performance, highlighting the importance of multimodal learning for this task. |
The performance improvement from CLIP pre-training is less significant for datasets where temporal information is crucial.
Future work could explore different adapter architectures or integrate other parameter-efficient fine-tuning techniques. |
few-shot action recognition, clip, multimodal learning, parameter-efficient fine-tuning, adapters |
2308.01499
Report |
TDMD: A Database for Dynamic Color Mesh Subjective and Objective Quality Explorations |
Qi Yang, Joel Jung, Timon Deschamps, Xiaozhong Xu, Shan Liu |
Dynamic colored meshes (DCM) are widely used in various applications;
however, these meshes may undergo different processes, such as compression or
transmission, which can distort them and degrade their quality. To facilitate
the development of objective metrics for DCMs and study the influence of
typical distortions on their perception, we create the Tencent - dynamic
colored mesh database (TDMD) containing eight reference DCM objects with six
typical distortions. Using processed video sequences (PVS) derived from the
DCM, we have conducted a large-scale subjective experiment that resulted in 303
distorted DCM samples with mean opinion scores, making the TDMD the largest
available DCM database to our knowledge. This database enabled us to study the
impact of different types of distortion on human perception and offer
recommendations for DCM compression and related tasks. Additionally, we have
evaluated three types of state-of-the-art objective metrics on the TDMD,
including image-based, point-based, and video-based metrics. Our
experimental results highlight the strengths and weaknesses of each metric, and
we provide suggestions about the selection of metrics in practical DCM
applications. The TDMD will be made publicly available at the following
location: https://multimedia.tencent.com/resources/tdmd. |
This paper introduces TDMD, a new database for Dynamic Colored Mesh (DCM) quality assessment. It contains 8 reference DCMs and 6 types of distortions (color noise, texture map downsampling, geometrical Gaussian noise, mesh decimation, MPEG lossy compression, and texture map compression) at various severity levels, totaling 303 distorted samples with MOS obtained via subjective experiments. |
Existing mesh quality assessment work mainly focuses on static, often non-colored meshes. However, DCMs are increasingly used, demanding dedicated quality assessment tools and an understanding of how distortions impact human perception. |
The researchers applied distortions to reference DCMs, converted them into processed video sequences (PVSs) using a predefined camera path, and conducted subjective experiments to obtain MOS. They then evaluated the performance of three types of objective metrics (image-based, point-based, and video-based) on TDMD. |
The impact of mesh decimation and texture map compression on perceived quality is limited at the tested levels.
Point-based metric \({\rm PCQM}_{\rm p}\) and video-based metric MS-SSIM achieve the best performance in predicting DCM quality.
Sampling resolution and method impact the performance of point-based metrics, with denser sampling generally leading to higher accuracy. |
The study only considers a 2D monitor viewing environment, while VR viewing might yield different results.
Further research is needed to explore optimal camera paths for PVS generation, as different DCM content might have varying regions of interest. |
dynamic mesh quality assessment, subjective experiment, database, objective metric, point cloud |
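Evaluating an objective metric against mean opinion scores, as in the entry above, is commonly done with correlation coefficients. The sketch below computes Pearson (linear) and Spearman (rank) correlation between metric outputs and MOS. Which indicators the TDMD study reports is not stated here, so this choice is an assumption, and the numbers are made-up toy values, not data from the database.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(v):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy values (made up): subjective scores and one metric's predictions
mos = [1.2, 2.5, 3.1, 4.0, 4.8]
metric = [0.30, 0.55, 0.60, 0.82, 0.99]
plcc = pearson(metric, mos)
srocc = spearman(metric, mos)
```

A metric that orders the samples exactly as the subjects do gets a rank correlation of 1 even when its relationship to MOS is not perfectly linear, which is why both numbers are usually reported together.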
2308.01472
Report |
Reverse Stable Diffusion: What prompt was used to generate this image? |
Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, Mubarak Shah |
Text-to-image diffusion models such as Stable Diffusion have recently
attracted the interest of many researchers, and inverting the diffusion process
can play an important role in better understanding the generative process and
how to engineer prompts in order to obtain the desired images. To this end, we
introduce the new task of predicting the text prompt given an image generated
by a generative diffusion model. We combine a series of white-box and black-box
models (with and without access to the weights of the diffusion network) to
deal with the proposed task. We propose a novel learning framework comprising
a joint prompt regression and multi-label vocabulary classification
objective that generates improved prompts. To further improve our method, we
employ a curriculum learning procedure that promotes the learning of
image-prompt pairs with lower labeling noise (i.e. that are better aligned),
and an unsupervised domain-adaptive kernel learning method that uses the
similarities between samples in the source and target domains as extra
features. We conduct experiments on the DiffusionDB data set, predicting text
prompts from images generated by Stable Diffusion. Our novel learning framework
produces excellent results on the aforementioned task, yielding the highest
gains when applied on the white-box model. In addition, we make an interesting
discovery: training a diffusion model on the prompt generation task can make
the model generate images that are much better aligned with the input prompts,
when the model is directly reused for text-to-image generation. |
This paper introduces the novel task of predicting the text prompt embedding given an image generated by a text-to-image diffusion model, aiming to reverse the generative process and better understand prompt engineering. |
Understanding the image-to-text mapping in diffusion models is crucial for improving prompt design, understanding the generative process, and potentially enhancing image generation quality. |
The paper proposes a learning framework that combines white-box and black-box models, incorporating three novel components: a joint prompt regression and multi-label vocabulary classification objective, a curriculum learning procedure for handling noisy labels, and a domain-adaptive kernel learning (DAKL) method for leveraging target domain information. |
The proposed learning framework, particularly the classification head and curriculum learning, consistently improves the performance across different image encoders.
The joint framework, combining embeddings from multiple models, outperforms individual models, with DAKL further enhancing performance.
Training a diffusion model on the prompt generation task leads to generating images better aligned with the input prompts, showcasing a promising application for improving text-to-image generation quality. |
The paper primarily focuses on Stable Diffusion and a single dataset (DiffusionDB), potentially limiting generalizability.
The computational cost of DAKL, despite using k-means for efficiency, can still be a concern for larger datasets. |
diffusion models, text-to-image generation, image-to-text generation, prompt engineering, curriculum learning |
2308.01390
Report |
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models |
Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, Ludwig Schmidt |
We introduce OpenFlamingo, a family of autoregressive vision-language models
ranging from 3B to 9B parameters. OpenFlamingo is an ongoing effort to produce
an open-source replication of DeepMind's Flamingo models. On seven
vision-language datasets, OpenFlamingo models average between 80 - 89% of
corresponding Flamingo performance. This technical report describes our models,
training data, hyperparameters, and evaluation suite. We share our models and
code at https://github.com/mlfoundations/open_flamingo. |
Introduces OpenFlamingo, a family of open-source autoregressive vision-language models with 3B to 9B parameters, replicating DeepMind's Flamingo models. |
Addresses the lack of open-source alternatives to closed-source autoregressive vision-language models, enabling research on their capabilities and safety. |
Trained on LAION-2B and Multimodal C4 datasets, using CLIP as the vision encoder and MPT or RedPajama language models as decoders. |
OpenFlamingo models achieve 80-89% of corresponding Flamingo models' performance on seven vision-language datasets.
Performance generally improves with more in-context examples but at a lower rate than Flamingo.
OpenFlamingo models exhibit limitations in visual question answering, particularly in counting, answer verbosity, and handling non-central objects. |
Limited performance with many in-context examples potentially due to few images in training sequences.
Unexpected performance degradation in 4B models with frozen image and end-of-chunk embeddings. |
vision-language models, open-source, flamingo, in-context learning, multimodal |
2308.01379
Report |
Computational Long Exposure Mobile Photography |
Eric Tabellion, Nikhil Karnad, Noa Glaser, Ben Weiss, David E. Jacobs, Yael Pritch |
Long exposure photography produces stunning imagery, representing moving
elements in a scene with motion-blur. It is generally employed in two
modalities, producing either a foreground or a background blur effect.
Foreground blur images are traditionally captured on a tripod-mounted camera
and portray blurred moving foreground elements, such as silky water or light
trails, over a perfectly sharp background landscape. Background blur images,
also called panning photography, are captured while the camera is tracking a
moving subject, to produce an image of a sharp subject over a background
blurred by relative motion. Both techniques are notoriously challenging and
require additional equipment and advanced skills. In this paper, we describe a
computational burst photography system that operates in a hand-held smartphone
camera app, and achieves these effects fully automatically, at the tap of the
shutter button. Our approach first detects and segments the salient subject. We
track the scene motion over multiple frames and align the images in order to
preserve desired sharpness and to produce aesthetically pleasing motion
streaks. We capture an under-exposed burst and select the subset of input
frames that will produce blur trails of controlled length, regardless of scene
or camera motion velocity. We predict inter-frame motion and synthesize
motion-blur to fill the temporal gaps between the input frames. Finally, we
composite the blurred image with the sharp regular exposure to protect the
sharpness of faces or areas of the scene that are barely moving, and produce a
final high resolution and high dynamic range (HDR) photograph. Our system
democratizes a capability previously reserved to professionals, and makes this
creative style accessible to most casual photographers.
More information and supplementary material can be found on our project
webpage: https://motion-mode.github.io/ |
This paper presents a computational burst photography system for smartphones that automatically produces long exposure effects, with either blurred foregrounds or backgrounds, by compensating for camera and subject motion. |
Long exposure photography, traditionally requiring tripods, filters, and advanced skills, is made accessible to casual photographers through this system. |
The system analyzes scene motion, detects and tracks subjects, aligns images for desired sharpness, predicts motion for blur synthesis, and composites with a sharp exposure for optimal results. |
The system effectively synthesizes long exposure effects in both foreground and background blur modes, as demonstrated by examples.
A novel background blur alignment technique using temporal regularization produces aesthetically pleasing, consistent motion blur trails.
A simplified motion prediction model, designed for mobile efficiency, achieves comparable quality to more complex models. |
Background blur for very small subjects can lead to misalignments due to prediction and tracking errors.
Large motion disparities exceeding the model's receptive field can cause artifacts, limiting the system's ability to handle all motion magnitudes. |
computational photography, long exposure, motion blur, mobile photography, computer vision |
2308.01316
Report |
Patched Denoising Diffusion Models For High-Resolution Image Synthesis |
Zheng Ding, Mengqi Zhang, Jiajun Wu, Zhuowen Tu |
We propose an effective denoising diffusion model for generating
high-resolution images (e.g., 1024$\times$512), trained on small-size image
patches (e.g., 64$\times$64). We name our algorithm Patch-DM, in which a new
feature collage strategy is designed to avoid the boundary artifact when
synthesizing large-size images. Feature collage systematically crops and
combines partial features of the neighboring patches to predict the features of
a shifted image patch, allowing the seamless generation of the entire image due
to the overlap in the patch feature space. Patch-DM produces high-quality image
synthesis results on our newly collected dataset of nature images
(1024$\times$512), as well as on standard benchmarks of smaller sizes
(256$\times$256), including LSUN-Bedroom, LSUN-Church, and FFHQ. We compare our
method with previous patch-based generation methods and achieve
state-of-the-art FID scores on all four datasets. Further, Patch-DM also
reduces memory complexity compared to the classic diffusion models. |
Proposes Patch-DM, a patch-based denoising diffusion model for high-resolution image synthesis, using a novel feature collage strategy to avoid boundary artifacts. |
Addresses limitations of current diffusion models in high-resolution image generation due to high computational costs and memory requirements. |
Trains a patch-level denoising U-Net model with a feature collage strategy, where features from neighboring patches are combined to predict shifted patches, ensuring consistency. |
Achieves state-of-the-art FID scores on a newly collected dataset of 1024x512 natural images and standard benchmarks (LSUN-Bedroom, LSUN-Church, FFHQ).
Generates high-quality images with minimal boundary artifacts despite being patch-based.
Reduces memory complexity compared to classic diffusion models due to the patch-level representation. |
Loss of some detailed image information when downsampling for global condition extraction using pre-trained encoders.
Limited exploration of patch sizes beyond 64x64. |
image synthesis, denoising diffusion models, high-resolution images, patch-based generation, feature collage |
2308.01300
Report |
Revisiting DETR Pre-training for Object Detection |
Yan Ma, Weicong Liang, Bohan Chen, Yiduo Hao, Bojian Hou, Xiangyu Yue, Chao Zhang, Yuhui Yuan |
Motivated by the remarkable achievements of DETR-based approaches on COCO
object detection and segmentation benchmarks, recent endeavors have been
directed towards elevating their performance through self-supervised
pre-training of Transformers while preserving a frozen backbone. Noteworthy
advancements in accuracy have been documented in certain studies. Our
investigation delved deeply into a representative approach, DETReg, and its
performance assessment in the context of emerging models like
$\mathcal{H}$-Deformable-DETR. Regrettably, DETReg proves inadequate in
enhancing the performance of robust DETR-based models under full data
conditions. To dissect the underlying causes, we conduct extensive experiments
on COCO and PASCAL VOC probing elements such as the selection of pre-training
datasets and strategies for pre-training target generation. By contrast, we
employ an optimized approach named Simple Self-training which leads to marked
enhancements through the combination of an improved box predictor and the
Objects$365$ benchmark. The culmination of these endeavors results in a
remarkable AP score of $59.3\%$ on the COCO val set, outperforming
$\mathcal{H}$-Deformable-DETR + Swin-L without pre-training by $1.4\%$.
Moreover, a series of synthetic pre-training datasets, generated by merging
contemporary image-to-text (LLaVA) and text-to-image (SDXL) models,
significantly amplifies object detection capabilities. |
This paper revisits self-supervised pre-training for DETR object detection models, finding existing methods ineffective for stronger DETR variants and proposing a Simple Self-training scheme with improved pre-training targets. |
Pre-training the Transformer components of DETR models is crucial to fully realize their potential and enhance object detection performance. |
The paper investigates the limitations of DETReg, proposes using pseudo-boxes and pseudo-class predictions as pre-training targets, and explores using synthetic datasets generated by text-to-image models. |
Simple Self-training significantly outperforms DETReg and achieves competitive results on COCO (59.3% AP).
Accurate pseudo-box targets are more crucial than classification targets for effective pre-training.
Pre-training with synthetic datasets generated from text-to-image models shows promising results, comparable to using real data (Objects365). |
The study primarily focuses on object detection, leaving extensions to other vision tasks for future work.
Further exploration of larger batch sizes and longer training schedules for pre-training is necessary. |
object detection, detr, self-supervised learning, pre-training, synthetic data |
2308.01140
Report |
Dynamically Scaled Temperature in Self-Supervised Contrastive Learning |
Siladittya Manna, Soumitri Chattopadhyay, Rakesh Dey, Saumik Bhattacharya, Umapada Pal |
In contemporary self-supervised contrastive algorithms like SimCLR, MoCo,
etc., the task of balancing attraction between two semantically similar samples
and repulsion between two samples of different classes is primarily affected by
the presence of hard negative samples. While the InfoNCE loss has been shown to
impose penalties based on hardness, the temperature hyper-parameter is the key
to regulating the penalties and the trade-off between uniformity and tolerance.
In this work, we focus our attention on improving the performance of InfoNCE
loss in self-supervised learning by proposing a novel cosine similarity
dependent temperature scaling function to effectively optimize the distribution
of the samples in the feature space. We also provide mathematical analyses to
support the construction of such a dynamically scaled temperature function.
Experimental evidence shows that the proposed framework outperforms the
contrastive loss-based SSL algorithms. |
The paper proposes DySTreSS, a novel self-supervised contrastive learning framework that dynamically scales the temperature parameter in the InfoNCE loss based on cosine similarity. |
The temperature parameter in InfoNCE loss significantly impacts the trade-off between uniformity and tolerance in feature representation. Dynamically scaling it helps to better optimize this trade-off and improve representation learning. |
The authors theoretically analyze the effect of temperature on local and global feature structures, deriving criteria for a suitable temperature scaling function. They propose a cosine-based function that satisfies these criteria and apply it to the SimCLR framework. |
DySTreSS outperforms state-of-the-art SSL methods like SimCLR, MoCov2, and DCL on linear evaluation benchmarks including ImageNet and CIFAR.
The proposed method also shows superior performance on transfer learning tasks for both image and text modalities.
Ablation studies validate the effectiveness of the chosen temperature function and its impact on uniformity, tolerance, and overall accuracy. |
The paper primarily focuses on cosine similarity-based temperature scaling; its effectiveness with other similarity measures is not explored.
The impact of dynamic temperature scaling on computational overhead, specifically for large-scale datasets and models, is not discussed. |
self-supervised learning, contrastive learning, infonce loss, temperature scaling, representation learning |
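To make the idea in the entry above concrete, the sketch below computes an InfoNCE loss in which each logit is divided by a temperature that depends on the cosine similarity of that pair. The specific cosine-shaped schedule, the bounds `tau_min` and `tau_max`, and the choice to scale the positive pair as well are assumptions for illustration; the paper's exact function and hyper-parameters are not reproduced here.

```python
import math

def dynamic_temperature(sim, tau_min=0.1, tau_max=0.2):
    """Hypothetical cosine-shaped schedule over sim in [-1, 1].

    Returns tau_max at sim = -1 and sim = 1, and tau_min at sim = 0.
    """
    return tau_min + 0.5 * (tau_max - tau_min) * (1.0 + math.cos(math.pi * (1.0 + sim)))

def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature_fn):
    """InfoNCE loss for one anchor with a per-pair temperature."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    logits = [s / temperature_fn(s) for s in sims]
    # Numerically stable log-sum-exp over positive and negative logits
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

# Toy 2-D embeddings (made up)
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
loss_dynamic = info_nce(anchor, positive, negatives, dynamic_temperature)
loss_fixed = info_nce(anchor, positive, negatives, lambda s: 0.1)
```

Passing a constant function as `temperature_fn` recovers the usual fixed-temperature InfoNCE, which makes it easy to compare the two settings on the same batch.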
2308.01045
Report |
Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation |
Quan Tang, Bowen Zhang, Jiajun Liu, Fagui Liu, Yifan Liu |
Vision transformers have achieved leading performance on various visual tasks
yet still suffer from high computational complexity. The situation deteriorates
in dense prediction tasks like semantic segmentation, as high-resolution inputs
and outputs usually imply more tokens involved in computations. Directly
removing the less attentive tokens has been discussed for the image
classification task but cannot be extended to semantic segmentation since a
dense prediction is required for every patch. To this end, this work introduces
a Dynamic Token Pruning (DToP) method based on the early exit of tokens for
semantic segmentation. Motivated by the coarse-to-fine segmentation process by
humans, we naturally split the widely adopted auxiliary-loss-based network
architecture into several stages, where each auxiliary block grades every
token's difficulty level. We can finalize the prediction of easy tokens in
advance without completing the entire forward pass. Moreover, we keep $k$
highest confidence tokens for each semantic category to uphold the
representative context information. Thus, computational complexity will change
with the difficulty of the input, akin to the way humans do segmentation.
Experiments suggest that the proposed DToP architecture reduces on average
$20\% - 35\%$ of computational cost for current semantic segmentation methods
based on plain vision transformers without accuracy degradation. |
This paper introduces Dynamic Token Pruning (DToP), a method for reducing computational cost in vision transformers for semantic segmentation by allowing early exit of easy-to-recognize tokens. |
Vision transformers achieve high performance but suffer from heavy computational overhead, especially in dense prediction tasks like semantic segmentation where high-resolution images generate numerous tokens. |
DToP divides the network into stages using inherent auxiliary blocks. It grades token difficulty at each stage, finalizing predictions for easy tokens and pruning them from further computation, while harder tokens proceed to subsequent stages. |
DToP reduces computational cost by 20-35% on average without sacrificing accuracy on benchmarks like ADE20K, Pascal Context, and COCO-Stuff-10K.
The method effectively allocates computation by pruning more tokens in simple images and fewer in complex ones.
Keeping the 'k' most confident tokens for each semantic category during pruning helps retain contextual information and improves performance. |
DToP, like other dynamic networks, faces limitations in fully utilizing mini-batch computation efficiency.
Future work includes optimizing DToP to further expedite vision transformers. |
semantic segmentation, vision transformer, token pruning, computational efficiency, dynamic network |
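The early-exit rule described in this record can be illustrated with a minimal Python sketch of a single pruning stage. It assumes that an auxiliary head has already produced per-token class probabilities; the function name `prune_tokens`, the 0.95 threshold, and the list-based layout are illustrative assumptions, not the authors' implementation.

```python
def prune_tokens(probs, threshold=0.95, k=1):
    """Split tokens into (exited, kept) index lists for one pruning stage.

    probs: list of per-token class-probability lists from an auxiliary head.
    A token counts as "easy" when its top probability reaches `threshold`
    (the threshold value here is an assumption for illustration).
    The k most confident easy tokens of each predicted class are kept
    anyway, so later stages still see context for that class.
    """
    easy_by_class = {}
    kept = []
    for idx, p in enumerate(probs):
        conf = max(p)
        cls = p.index(conf)
        if conf >= threshold:
            easy_by_class.setdefault(cls, []).append((conf, idx))
        else:
            kept.append(idx)  # hard token: continues through the network
    exited = []
    for items in easy_by_class.values():
        items.sort(reverse=True)  # most confident first
        kept.extend(idx for _, idx in items[:k])
        exited.extend(idx for _, idx in items[k:])
    return sorted(exited), sorted(kept)


if __name__ == "__main__":
    demo = [[0.99, 0.01], [0.97, 0.03], [0.60, 0.40], [0.02, 0.98]]
    print(prune_tokens(demo))
```

In this sketch the number of exited tokens depends on the input, which mirrors the record's point that simple images allow more pruning than complex ones.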
2308.00951
Report |
From Sparse to Soft Mixtures of Experts |
Joan Puigcerver, Carlos Riquelme, Basil Mustafa, Neil Houlsby |
Sparse mixture of expert architectures (MoEs) scale model capacity without
large increases in training or inference costs. Despite their success, MoEs
suffer from a number of issues: training instability, token dropping, inability
to scale the number of experts, or ineffective finetuning. In this work, we
propose Soft MoE, a fully-differentiable sparse Transformer that addresses these
challenges, while maintaining the benefits of MoEs. Soft MoE performs an
implicit soft assignment by passing different weighted combinations of all
input tokens to each expert. As in other MoE works, experts in Soft MoE only
process a subset of the (combined) tokens, enabling larger model capacity at
lower inference cost. In the context of visual recognition, Soft MoE greatly
outperforms standard Transformers (ViTs) and popular MoE variants (Tokens
Choice and Experts Choice). For example, Soft MoE-Base/16 requires 10.5x lower
inference cost (5.7x lower wall-clock time) than ViT-Huge/14 while matching its
performance after similar training. Soft MoE also scales well: Soft MoE Huge/14
with 128 experts in 16 MoE layers has over 40x more parameters than ViT
Huge/14, while inference time cost grows by only 2%, and it performs
substantially better. |
The paper introduces Soft MoE, a fully-differentiable sparse Transformer that addresses the challenges of training instability, token dropping, scalability, and ineffective fine-tuning in existing sparse Mixture of Expert (MoE) architectures. |
Sparse MoEs are crucial for scaling model capacity without excessive computational costs, making them essential for improving performance in various tasks like visual recognition. |
Soft MoE utilizes a soft assignment mechanism, computing weighted averages of input tokens for each expert instead of discrete token-to-expert assignments. This approach simplifies training, avoids token dropping and expert imbalance, and enhances speed. |
Soft MoE consistently outperforms dense Vision Transformers (ViTs) and other sparse MoE variants (Tokens Choice and Experts Choice) in image classification tasks, achieving better performance with lower training costs.
The paper demonstrates Soft MoE's scalability to thousands of experts, enabling the training of large models with improved performance and manageable inference costs.
Experiments on image-language contrastive learning show that representations learned by Soft MoE are beneficial for other tasks like image-text alignment. |
The current design of Soft MoE makes its application in auto-regressive decoders challenging due to the need to preserve causality between tokens.
While Soft MoE maintains computational efficiency, its memory requirements can increase with a large number of experts, especially when using one slot per expert, which is often optimal for performance. |
mixture of experts, transformers, visual recognition, sparse models, image classification |
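The soft assignment described in this record can be sketched in plain Python. The sketch assumes one slot per expert, a learnable projection `phi` with one column per slot, and experts given as ordinary callables; the names `soft_moe_layer`, `phi`, and the helper functions are illustrative assumptions rather than the authors' code.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(matrix):
    return [list(row) for row in zip(*matrix)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def soft_moe_layer(tokens, phi, experts):
    """tokens: m x d, phi: d x s (one column per slot), experts: s callables.

    Assumes one slot per expert for simplicity.
    """
    logits = matmul(tokens, phi)  # m x s token-to-slot scores
    # Dispatch weights: softmax over tokens, so each slot is a weighted
    # average of all input tokens.
    dispatch = transpose([softmax(col) for col in transpose(logits)])
    # Combine weights: softmax over slots, so each output token is a
    # weighted average of all expert outputs.
    combine = [softmax(row) for row in logits]
    slots = matmul(transpose(dispatch), tokens)  # s x d
    slot_outputs = [experts[i](slots[i]) for i in range(len(slots))]
    return matmul(combine, slot_outputs)  # m x d


if __name__ == "__main__":
    toks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    proj = [[0.5, -0.5], [0.1, 0.2]]
    double = lambda v: [2.0 * x for x in v]
    print(soft_moe_layer(toks, proj, [double, double]))
```

Because every step is a softmax-weighted average, no token is dropped and the layer stays differentiable end to end, which is the property the record emphasizes.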
2308.00906
Report |
ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation |
Yasheng Sun, Yifan Yang, Houwen Peng, Yifei Shen, Yuqing Yang, Han Hu, Lili Qiu, Hideki Koike |
While language-guided image manipulation has made remarkable progress, the
challenge of how to instruct the manipulation process faithfully reflecting
human intentions persists. An accurate and comprehensive description of a
manipulation task using natural language is laborious and sometimes even
impossible, primarily due to the inherent uncertainty and ambiguity present in
linguistic expressions. Is it feasible to accomplish image manipulation without
resorting to external cross-modal language information? If this possibility
exists, the inherent modality gap would be effortlessly eliminated. In this
paper, we propose a novel manipulation methodology, dubbed ImageBrush, that
learns visual instructions for more accurate image editing. Our key idea is to
employ a pair of transformation images as visual instructions, which not only
precisely captures human intention but also facilitates accessibility in
real-world scenarios. Capturing visual instructions is particularly challenging
because it involves extracting the underlying intentions solely from visual
demonstrations and then applying this operation to a new image. To address this
challenge, we formulate visual instruction learning as a diffusion-based
inpainting problem, where the contextual information is fully exploited through
an iterative process of generation. A visual prompting encoder is carefully
devised to enhance the model's capacity in uncovering human intent behind the
visual instructions. Extensive experiments show that our method generates
engaging manipulation results conforming to the transformations entailed in
demonstrations. Moreover, our model exhibits robust generalization capabilities
on various downstream tasks such as pose transfer, image translation and video
inpainting. |
This paper introduces ImageBrush, a novel framework for image manipulation that utilizes pairs of exemplar images as visual instructions, eliminating the need for language-based guidance. |
This approach addresses challenges associated with the ambiguity and limitations of language in accurately conveying human intention for image manipulation. |
ImageBrush leverages a diffusion-based inpainting strategy with a grid-like input containing exemplar images and a target image. A visual prompting encoder extracts semantic relationships and a user interface enables bounding box annotations for specifying regions of interest. |
ImageBrush outperforms language-guided methods in qualitative evaluations, demonstrating superior fidelity to provided examples.
Quantitative results on diverse in-the-wild datasets demonstrate ImageBrush's superior performance in tasks like image translation, pose transfer, and video inpainting.
Ablation studies highlight the importance of each component in ImageBrush, including the diffusion process, visual prompting encoder, and region of interest interface. |
ImageBrush may face challenges with significant disparities between instructions and query images.
Handling intricate details like subtle background changes or small object additions remains challenging. |
image manipulation, visual instruction, diffusion models, in-context learning, visual prompting |
2308.00773
Report |
High-Fidelity Eye Animatable Neural Radiance Fields for Human Face |
Hengfei Wang, Zhongqun Zhang, Yihua Cheng, Hyung Jin Chang |
Face rendering using neural radiance fields (NeRF) is a rapidly developing
research area in computer vision. While recent methods primarily focus on
controlling facial attributes such as identity and expression, they often
overlook the crucial aspect of modeling eyeball rotation, which holds
importance for various downstream tasks. In this paper, we aim to learn a face
NeRF model that is sensitive to eye movements from multi-view images. We
address two key challenges in eye-aware face NeRF learning: how to effectively
capture eyeball rotation for training and how to construct a manifold for
representing eyeball rotation. To accomplish this, we first fit FLAME, a
well-established parametric face model, to the multi-view images considering
multi-view consistency. Subsequently, we introduce a new Dynamic Eye-aware NeRF
(DeNeRF). DeNeRF transforms 3D points from different views into a canonical
space to learn a unified face NeRF model. We design an eye deformation field
for the transformation, including rigid transformation, e.g., eyeball rotation,
and non-rigid transformation. Through experiments conducted on the ETH-XGaze
dataset, we demonstrate that our model is capable of generating high-fidelity
images with accurate eyeball rotation and non-rigid periocular deformation,
even under novel viewing angles. Furthermore, we show that utilizing the
rendered images can effectively enhance gaze estimation performance. |
This paper introduces DeNeRF, a novel dynamic eye-aware Neural Radiance Field (NeRF) capable of rendering high-fidelity faces with animatable eyes from multi-view images under novel viewpoints and eye poses. |
Controllable eye movement in face rendering is crucial for realism and downstream tasks like gaze estimation, yet existing face NeRF models often overlook this aspect. |
The proposed DeNeRF leverages multi-view face tracking with FLAME to accurately capture eyeball rotation. It then learns a unified face NeRF in a canonical space, employing an eye deformation field (including rigid and non-rigid transformations) to transform 3D points from observation space to the canonical space. |
DeNeRF generates high-fidelity face images with accurate eyeball rotation and non-rigid periocular deformation, even under novel viewing angles.
Quantitative comparisons demonstrate DeNeRF's superior performance over existing 2D and 3D face rendering methods, particularly in the eye region.
Using DeNeRF's rendered images for data augmentation significantly improves the performance of downstream gaze estimation tasks. |
The model currently requires multi-view images, limiting its applicability to single-view scenarios.
The computational cost of DeNeRF is relatively high, hindering its deployment in real-time applications. |
neural radiance fields, face rendering, eye animation, gaze estimation, computer vision |
2308.00759
Report |
Decomposition Ascribed Synergistic Learning for Unified Image Restoration |
Jinghao Zhang, Feng Zhao |
Learning to restore multiple image degradations within a single model is
quite beneficial for real-world applications. Nevertheless, existing works
typically concentrate on regarding each degradation independently, while their
relationship has been less exploited to ensure the synergistic learning. To
this end, we revisit the diverse degradations through the lens of singular
value decomposition, with the observation that the decomposed singular vectors
and singular values naturally undertake the different types of degradation
information, dividing various restoration tasks into two groups, i.e., singular
vector dominated and singular value dominated. The above analysis renders a
more unified perspective to ascribe the diverse degradations, compared to
previous task-level independent learning. The dedicated optimization of
degraded singular vectors and singular values inherently utilizes the potential
relationship among diverse restoration tasks, attributing to the Decomposition
Ascribed Synergistic Learning (DASL). Specifically, DASL comprises two
effective operators, namely, Singular VEctor Operator (SVEO) and Singular VAlue
Operator (SVAO), to favor the decomposed optimization, which can be easily
integrated into existing image restoration backbones. Moreover, a congruous
decomposition loss is devised as auxiliary supervision. Extensive experiments on
five blended image restoration tasks demonstrate the effectiveness of our
method. |
This paper proposes Decomposition Ascribed Synergistic Learning (DASL), a novel approach to unified image restoration that leverages the relationship between different degradation types through singular value decomposition. |
Existing multi-degradation learning methods often treat each degradation independently, neglecting their potential synergistic relationships. DASL aims to address this limitation by enabling a more unified learning process. |
DASL decomposes image degradations based on singular value decomposition, observing that singular vectors and singular values capture distinct degradation information. It then employs two operators, Singular VEctor Operator (SVEO) and Singular VAlue Operator (SVAO), to optimize degraded singular vectors and values, respectively, alongside a congruous decomposition loss. |
DASL consistently outperforms existing general image restoration and all-in-one methods on five common image restoration tasks.
The method demonstrates reduced computational complexity and faster inference compared to baseline methods.
Ablation studies confirm the contribution of SVEO, SVAO, and decomposition loss to the performance gain. |
Future work could explore more sophisticated correlations beyond decomposed singular vectors and singular values.
Future work could also investigate leveraging the distribution discrepancy of degradations on separate orders of decomposed components. |
image restoration, multi-degradation learning, singular value decomposition, synergistic learning, deep learning |
2308.00755
Report |
The Bias Amplification Paradox in Text-to-Image Generation |
Preethi Seshadri, Sameer Singh, Yanai Elazar |
Bias amplification is a phenomenon in which models exacerbate biases or
stereotypes present in the training data. In this paper, we study bias
amplification in the text-to-image domain using Stable Diffusion by comparing
gender ratios in training vs. generated images. We find that the model appears
to amplify gender-occupation biases found in the training data (LAION)
considerably. However, we discover that amplification can be largely attributed
to discrepancies between training captions and model prompts. For example, an
inherent difference is that captions from the training data often contain
explicit gender information while our prompts do not, which leads to a
distribution shift and consequently inflates bias measures. Once we account for
distributional differences between texts used for training and generation when
evaluating amplification, we observe that amplification decreases drastically.
Our findings illustrate the challenges of comparing biases in models and their
training data, and highlight confounding factors that impact analyses. |
This paper investigates bias amplification in text-to-image models, focusing on how distributional differences between training captions and generation prompts contribute to the phenomenon. |
Understanding bias amplification is crucial as it can exacerbate stereotypes and disparities. The work aims to explain why models amplify biases despite being trained to fit the training data. |
The authors analyze gender-occupation bias in Stable Diffusion and its training dataset (LAION). They compare gender ratios in generated images to those in training images, considering different methods for selecting relevant training captions. |
Naively selecting training captions based on occupation keywords leads to an overestimation of bias amplification.
Excluding captions with explicit gender indicators and using nearest neighbors based on text embeddings to select training captions significantly reduces observed amplification.
Prompting the model with training captions directly results in minimal amplification, suggesting that the model largely reflects the bias present in the training data when distributional differences are minimized. |
The analysis doesn't account for biases stemming from the text embedding model (CLIP).
Gender classification relies on a binary model, neglecting nuances in gender identity and potentially perpetuating stereotypes. |
bias amplification, text-to-image generation, stable diffusion, gender bias, dataset bias |
2308.00729
Report |
Ada-DQA: Adaptive Diverse Quality-aware Feature Acquisition for Video Quality Assessment |
Hongbo Liu, Mingda Wu, Kun Yuan, Ming Sun, Yansong Tang, Chuanchuan Zheng, Xing Wen, Xiu Li |
Video quality assessment (VQA) has attracted growing attention in recent
years. However, the great expense of annotating large-scale VQA datasets has
become the main obstacle for current deep-learning methods. To surmount the
constraint of insufficient training data, in this paper, we first consider the
complete range of video distribution diversity (i.e., content, distortion,
motion) and employ diverse pretrained models (e.g., architecture, pretext task,
pre-training dataset) to benefit quality representation. An Adaptive Diverse
Quality-aware feature Acquisition (Ada-DQA) framework is proposed to capture
desired quality-related features generated by these frozen pretrained models.
By leveraging the Quality-aware Acquisition Module (QAM), the framework is able
to extract more essential and relevant features to represent quality. Finally,
the learned quality representation is utilized as supplementary supervisory
information, along with the supervision of the labeled quality score, to guide
the training of a relatively lightweight VQA model in a knowledge distillation
manner, which largely reduces the computational cost during inference.
Experimental results on three mainstream no-reference VQA benchmarks clearly
show the superior performance of Ada-DQA in comparison with current
state-of-the-art approaches without using extra training data of VQA. |
This paper proposes Ada-DQA, an Adaptive Diverse Quality-aware Feature Acquisition framework for Video Quality Assessment (VQA) that leverages diverse pre-trained models to overcome the scarcity of labeled training data in VQA. |
DNN-based VQA methods suffer from the limited scale of existing VQA datasets, and using only content-aware features from pre-trained models is insufficient to represent quality degradation in videos. |
Ada-DQA constructs a pool of diverse pre-trained models (different architectures, pre-training tasks, datasets) covering various quality-related factors (content, distortion, motion). A Quality-aware Acquisition Module (QAM) dynamically captures desired features from these models, with a sparsity constraint on gating weights to emphasize crucial features. Finally, knowledge distillation transfers learned representations to a lightweight VQA model. |
Ada-DQA achieves state-of-the-art results on three NR-VQA benchmarks (KoNViD-1k, LIVE-VQC, YouTube-UGC) without using external QA training data.
Using diverse pre-trained models outperforms using a single pre-trained model consistently across datasets.
Adding a sparsity constraint to QAM leads to continuous performance improvement as the number of pre-trained models increases. |
The quality-related information provided by adding more pre-trained models plateaus at a certain point.
Future work could explore incorporating more diverse pre-trained models or other techniques beyond knowledge distillation. |
video quality assessment, diverse pretrained model, knowledge distillation, quality-aware representation, sparsity constraint |
2308.00727
Report |
Adaptive Semantic Consistency for Cross-domain Few-shot Classification |
Hengchu Lu, Yuanjie Shao, Xiang Wang, Changxin Gao |
Cross-domain few-shot classification (CD-FSC) aims to identify novel target
classes with a few samples, assuming that there exists a domain shift between
source and target domains. Existing state-of-the-art practices typically
pre-train on source domain and then finetune on the few-shot target data to
yield task-adaptive representations. Despite promising progress, these methods
are prone to overfitting the limited target distribution due to data scarcity
and ignore the transferable knowledge learned in the source domain. To
alleviate this problem, we propose a simple plug-and-play Adaptive Semantic
Consistency (ASC) framework, which improves cross-domain robustness by
preserving source transfer capability during the finetuning stage. Concretely,
we reuse the source images in the pretraining phase and design an adaptive
weight assignment strategy to highlight the samples similar to target domain,
aiming to aggregate informative target-related knowledge from source domain.
Subsequently, a semantic consistency regularization is applied to constrain the
consistency between the semantic features of the source images output by the
source model and target model. In this way, the proposed ASC enables explicit
transfer of source domain knowledge to prevent the model from overfitting the
target domain. Extensive experiments on multiple benchmarks demonstrate the
effectiveness of the proposed ASC, and ASC provides consistent improvements
over the baselines. The source code will be released. |
This paper proposes Adaptive Semantic Consistency (ASC), a plug-and-play framework for cross-domain few-shot classification that mitigates overfitting by preserving transferable knowledge from the source domain during finetuning. |
Existing cross-domain few-shot classification methods are prone to overfitting the limited target data and often neglect valuable knowledge learned from the source domain. |
ASC employs an adaptive weight assignment strategy to emphasize source domain samples similar to the target domain. It also introduces a semantic consistency regularization, constraining the semantic features of source images from the source and target models to be consistent during finetuning. |
ASC consistently improves performance on multiple benchmarks compared to baseline methods.
The adaptive weight assignment strategy effectively highlights transferable knowledge from the source domain.
Regularizing semantic-level features is more effective than mid-level features in preserving transferable knowledge and preventing negative transfer. |
The source image selection strategy relies on source image labels, which may not always be available.
Future work can explore alternative strategies for selecting relevant source images without label dependency. |
cross-domain few-shot learning, semantic consistency, transfer learning, overfitting prevention, few-shot classification |
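The two components described in this record, adaptive weighting of source samples and a consistency penalty between source-model and target-model features, can be sketched as follows. This is a rough illustration under stated assumptions: the weights are a softmax over cosine similarity to a single target prototype, the penalty is a squared L2 distance, and the names `adaptive_weights`, `semantic_consistency_loss`, and `temperature` are hypothetical; the record does not specify the authors' exact formulas.

```python
import math


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def adaptive_weights(source_feats, target_prototype, temperature=0.1):
    """Weight source samples by similarity to the target domain.

    Assumption: similarity is cosine similarity to one target prototype,
    turned into weights with a temperature-scaled softmax.
    """
    scores = [_cosine(f, target_prototype) / temperature for f in source_feats]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def semantic_consistency_loss(feats_source_model, feats_target_model, weights):
    """Weighted squared distance between the two models' features
    for the same source images (squared L2 is an assumption)."""
    loss = 0.0
    for w, fs, ft in zip(weights, feats_source_model, feats_target_model):
        loss += w * sum((a - b) ** 2 for a, b in zip(fs, ft))
    return loss


if __name__ == "__main__":
    src = [[1.0, 0.0], [0.0, 1.0]]
    w = adaptive_weights(src, [1.0, 0.0])
    print(w, semantic_consistency_loss(src, [[0.9, 0.1], [0.0, 1.0]], w))
```

In a finetuning loop this term would be added to the few-shot classification loss, so that drifting away from the source model's features on target-like source images is penalized.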
2308.00692
Report |
LISA: Reasoning Segmentation via Large Language Model |
Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, Jiaya Jia |
Although perception systems have made remarkable advancements in recent
years, they still rely on explicit human instruction or pre-defined categories
to identify the target objects before executing visual recognition tasks. Such
systems cannot actively reason and comprehend implicit user intention. In this
work, we propose a new segmentation task -- reasoning segmentation. The task is
designed to output a segmentation mask given a complex and implicit query text.
Furthermore, we establish a benchmark comprising over one thousand
image-instruction-mask data samples, incorporating intricate reasoning and
world knowledge for evaluation purposes. Finally, we present LISA: large
Language Instructed Segmentation Assistant, which inherits the language
generation capabilities of multimodal Large Language Models (LLMs) while also
possessing the ability to produce segmentation masks. We expand the original
vocabulary with a <SEG> token and propose the embedding-as-mask paradigm to
unlock the segmentation capability. Remarkably, LISA can handle cases involving
complex reasoning and world knowledge. Also, it demonstrates robust zero-shot
capability when trained exclusively on reasoning-free datasets. In addition,
fine-tuning the model with merely 239 reasoning segmentation data samples
results in further performance enhancement. Both quantitative and qualitative
experiments show our method effectively unlocks new reasoning segmentation
capabilities for multimodal LLMs. Code, models, and data are available at
https://github.com/dvlab-research/LISA. |
This paper introduces 'reasoning segmentation', a new task requiring segmentation masks to be generated from implicit text queries involving complex reasoning and world knowledge. |
Current perception systems rely heavily on explicit instructions, highlighting the need for systems capable of understanding implicit user intent, crucial for advancing AI and robotics. |
The authors present LISA, a model that equips multimodal LLMs with segmentation abilities. It leverages an embedding-as-mask paradigm, where a '<SEG>' token's embedding is decoded into a segmentation mask, enabling end-to-end training. |
LISA effectively handles complex reasoning and world knowledge in segmentation tasks.
It exhibits strong zero-shot performance on the ReasonSeg benchmark, even when trained solely on reasoning-free datasets.
Fine-tuning LISA on a mere 239 reasoning segmentation samples considerably boosts its performance. |
The current performance bottleneck might lie in the query text understanding, suggesting the need for stronger multimodal LLMs.
The research highlights the need for more reasoning segmentation training data to further improve performance. |
reasoning segmentation, multimodal large language models, implicit instruction understanding, embedding-as-mask, lisa |
2308.00520
Report |
NormKD: Normalized Logits for Knowledge Distillation |
Zhihao Chi, Tu Zheng, Hengjia Li, Zheng Yang, Boxi Wu, Binbin Lin, Deng Cai |
Logit based knowledge distillation has received less attention in recent years since
feature based methods perform better in most cases. Nevertheless, we find it
still has untapped potential when we re-investigate the temperature, which is a
crucial hyper-parameter to soften the logit outputs. For most of the previous
works, it was set as a fixed value for the entire distillation procedure.
However, as the logits from different samples are distributed quite variously,
it is not feasible to soften all of them to an equal degree by just a single
temperature, which may make the previous work transfer the knowledge of each
sample inadequately. In this paper, we restudy the hyper-parameter temperature
and figure out its incapability to distill the knowledge from each sample
sufficiently when it is a single value. To address this issue, we propose
Normalized Knowledge Distillation (NormKD), with the purpose of customizing the
temperature for each sample according to the characteristic of the sample's
logit distribution. Compared to the vanilla KD, NormKD barely has extra
computation or storage cost but performs significantly better on CIFAR-100 and
ImageNet for image classification. Furthermore, NormKD can be easily applied to
the other logit based methods and achieve better performance which can be
closer to or even better than the feature based method. |
This paper proposes NormKD, a novel knowledge distillation approach that customizes the temperature for each sample based on its logit distribution, enhancing knowledge transfer from teacher to student models. |
Existing logit-based knowledge distillation methods often use a fixed temperature, which inadequately softens logits from different samples with varying distributions, hindering effective knowledge transfer. |
NormKD replaces the fixed temperature with the scaled standard deviation of each sample's logit output. This normalizes the logits, enabling more equal knowledge distillation from individual samples. |
NormKD significantly outperforms vanilla KD on CIFAR-100 and ImageNet datasets, demonstrating its effectiveness.
Combining NormKD with other logit-based methods, such as DKD, further boosts performance, surpassing even some feature-based methods.
NormKD achieves these improvements with minimal computational overhead, making it efficient and easy to implement. |
The assumption of logit distributions as normal distributions may not always hold true, potentially limiting the effectiveness in certain cases.
Future work could explore alternative methods to better characterize and normalize logit distributions for enhanced knowledge distillation. |
knowledge distillation, logit-based distillation, temperature scaling, normalization, deep learning |
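The per-sample temperature described in this record can be sketched in a few lines of Python. The sketch assumes the temperature is a scale factor `t_norm` times the standard deviation of that sample's logits, and that teacher and student are each softened with their own statistics; the function names and the plain KL objective without extra loss weighting are illustrative assumptions, not the authors' implementation.

```python
import math


def normalized_soften(logits, t_norm=1.0):
    """Soften one sample's logits with temperature t_norm * std(logits)."""
    mean = sum(logits) / len(logits)
    std = math.sqrt(sum((z - mean) ** 2 for z in logits) / len(logits))
    temperature = t_norm * max(std, 1e-8)  # guard against constant logits
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, t_norm=1.0):
    """KL(teacher || student) on the normalized, softened distributions."""
    p = normalized_soften(teacher_logits, t_norm)
    q = normalized_soften(student_logits, t_norm)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    print(normalized_soften([1.0, 2.0, 3.0]))
    print(kd_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

Dividing by the standard deviation makes the softened distribution invariant to the overall scale of a sample's logits, so peaked and flat samples are softened to a comparable degree.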
2308.00458
Report |
Center Contrastive Loss for Metric Learning |
Bolun Cai, Pengfei Xiong, Shangxuan Tian |
Contrastive learning is a widely studied topic in metric learning. However,
sampling effective contrastive pairs remains a challenge due to factors such as
limited batch size, imbalanced data distribution, and the risk of overfitting.
In this paper, we propose a novel metric learning function called Center
Contrastive Loss, which maintains a class-wise center bank and compares the
category centers with the query data points using a contrastive loss. The
center bank is updated in real-time to boost model convergence without the need
for well-designed sample mining. The category centers are well-optimized
classification proxies to re-balance the supervisory signal of each class.
Furthermore, the proposed loss combines the advantages of both contrastive and
classification methods by reducing intra-class variations and enhancing
inter-class differences to improve the discriminative power of embeddings. Our
experimental results, as shown in Figure 1, demonstrate that a standard network
(ResNet50) trained with our loss achieves state-of-the-art performance and
faster convergence. |
This paper proposes Center Contrastive Loss (CCL), a novel metric learning loss function that maintains and updates a class-wise center bank, contrasting category centers with query data points using a contrastive loss. |
CCL overcomes limitations of existing contrastive learning methods, addressing challenges in sampling effective pairs due to factors like limited batch size and imbalanced data distribution. |
CCL utilizes a center bank updated in sync with the encoder, contrasting it with data points using a contrastive loss enhanced with a large-margin component. This reduces intra-class variations while enhancing inter-class differences, boosting discriminative power. |
CCL achieves state-of-the-art Recall@1 accuracy on benchmark datasets like SOP, CUB, and Cars196.
It exhibits faster convergence compared to previous methods, achieving superior performance within a fraction of training epochs.
CCL demonstrates robustness to noisy labels, outperforming other robust metric learning methods under various noise settings. |
The impact of hyperparameters like hypersphere radius (s) is not extensively explored.
Future work could investigate extensions of CCL for broader applications such as face recognition, person re-identification, and clustering. |
metric learning, contrastive learning, center loss, image retrieval, deep learning |
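The idea of contrasting a query embedding against class centers can be sketched as a margin-augmented softmax over cosine similarities. This is a minimal sketch under stated assumptions: the center bank is passed in as a fixed list (its update rule is not shown), the margin is subtracted from the positive cosine similarity, and the values `scale=16.0` and `margin=0.1` are placeholders; none of these details are confirmed by the record.

```python
import math


def _normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def center_contrastive_loss(embedding, label, centers, scale=16.0, margin=0.1):
    """Cross-entropy over scaled cosine similarities to class centers.

    embedding: query feature vector; label: index of its class;
    centers: list of class-center vectors (the center bank).
    `scale` plays the role of the hypersphere radius s; subtracting
    `margin` from the positive similarity is an assumed margin form.
    """
    x = _normalize(embedding)
    logits = []
    for cls, center in enumerate(centers):
        cos = sum(a * b for a, b in zip(x, _normalize(center)))
        if cls == label:
            cos -= margin  # make the positive pair harder
        logits.append(scale * cos)
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[label]


if __name__ == "__main__":
    bank = [[1.0, 0.0], [0.0, 1.0]]
    print(center_contrastive_loss([0.9, 0.1], 0, bank))
```

Because every query is compared against all class centers, the supervisory signal does not depend on which pairs happen to fall in the mini-batch, which is the sampling problem the record highlights.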
2308.00261
Report |
Improving Pixel-based MIM by Reducing Wasted Modeling Capability |
Yuan Liu, Songyang Zhang, Jiacheng Chen, Zhaohui Yu, Kai Chen, Dahua Lin |
There has been significant progress in Masked Image Modeling (MIM). Existing
MIM methods can be broadly categorized into two groups based on the
reconstruction target: pixel-based and tokenizer-based approaches. The former
offers a simpler pipeline and lower computational cost, but it is known to be
biased toward high-frequency details. In this paper, we provide a set of
empirical studies to confirm this limitation of pixel-based MIM and propose a
new method that explicitly utilizes low-level features from shallow layers to
aid pixel reconstruction. By incorporating this design into our base method,
MAE, we reduce the wasted modeling capability of pixel-based MIM, improving its
convergence and achieving non-trivial improvements across various downstream
tasks. To the best of our knowledge, we are the first to systematically
investigate multi-level feature fusion for isotropic architectures like the
standard Vision Transformer (ViT). Notably, when applied to a smaller model
(e.g., ViT-S), our method yields significant performance gains, such as 1.2%
on fine-tuning, 2.8% on linear probing, and 2.6% on semantic segmentation.
Code and models are available at https://github.com/open-mmlab/mmpretrain. |
This paper proposes Multi-level Feature Fusion (MFF), a method to improve pixel-based Masked Image Modeling (MIM) by incorporating low-level features from shallow layers into the output layer for pixel reconstruction. |
Pixel-based MIM, while simple and efficient, is biased towards high-frequency details, wasting modeling capacity that could be used to capture low-frequency semantics crucial for downstream tasks. |
MFF extends MAE by fusing features from multiple shallow layers with the output layer. It explores different projection layers (linear, non-linear) and fusion strategies (weighted average pooling, self-attention). |
MFF significantly improves MAE's performance on ImageNet-1K classification, COCO object detection, and ADE20K semantic segmentation.
MFF enhances training efficiency, achieving comparable results to MAE with 5x fewer epochs.
Analysis reveals that MFF reduces high-frequency bias in learned features and flattens the loss landscape, aiding optimization. |
The study primarily focuses on the ViT architecture; its effectiveness on other architectures needs further investigation.
The selection of specific layers for fusion is currently based on empirical analysis, and a more principled approach could be explored. |
masked image modeling, self-supervised learning, multi-level feature fusion, vision transformer, representation learning |
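The weighted-average fusion strategy mentioned for MFF can be illustrated with a minimal sketch. It assumes that the features of each selected layer have already been projected to a common dimension and that the per-layer weights are learnable scalars normalized with a softmax. The names `softmax` and `fuse_layers` and the softmax normalization are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(layer_feats, layer_scores):
    """Fuse per-layer feature vectors into one vector by weighted averaging.

    layer_feats:  list of feature vectors, one per selected layer (equal length).
    layer_scores: one scalar per layer (hypothetical learnable parameters).
    """
    weights = softmax(layer_scores)
    dim = len(layer_feats[0])
    return [
        sum(w * feats[i] for w, feats in zip(weights, layer_feats))
        for i in range(dim)
    ]


# Two layers with equal scores contribute equally to the fused feature.
shallow = [1.0, 2.0]
deep = [3.0, 4.0]
fused = fuse_layers([shallow, deep], [0.0, 0.0])
```

In this sketch the fused vector would then be passed to the decoder for pixel reconstruction in place of the last-layer features alone.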
2308.00255
Report |
LGViT: Dynamic Early Exiting for Accelerating Vision Transformer |
Guanyu Xu, Jiawei Hao, Li Shen, Han Hu, Yong Luo, Hui Lin, Jialie Shen |
Recently, the efficient deployment and acceleration of powerful vision
transformers (ViTs) on resource-limited edge devices for providing multimedia
services have become attractive tasks. Although early exiting is a feasible
solution for accelerating inference, most works focus on convolutional neural
networks (CNNs) and transformer models in natural language processing
(NLP). Moreover, the direct application of early exiting methods to ViTs may
result in substantial performance degradation. To tackle this challenge, we
systematically investigate the efficacy of early exiting in ViTs and point out
that the insufficient feature representations in shallow internal classifiers
and the limited ability to capture target semantic information in deep internal
classifiers restrict the performance of these methods. We then propose an early
exiting framework for general ViTs termed LGViT, which incorporates
heterogeneous exiting heads, namely, local perception head and global
aggregation head, to achieve an efficiency-accuracy trade-off. In particular,
we develop a novel two-stage training scheme, including end-to-end training and
self-distillation with the backbone frozen to generate early exiting ViTs,
which facilitates the fusion of global and local information extracted by the
two types of heads. We conduct extensive experiments using three popular ViT
backbones on three vision datasets. Results demonstrate that our LGViT can
achieve competitive performance with approximately 1.8 $\times$ speed-up. |
This paper proposes LGViT, an early exiting framework for Vision Transformers (ViTs) that uses heterogeneous exiting heads to improve inference speed while maintaining accuracy. |
Deploying powerful ViTs on resource-limited edge devices for real-time multimedia applications is challenging due to their high computational complexity. Early exiting offers a solution but needs to be adapted for ViTs to avoid performance degradation. |
LGViT incorporates local perception heads (based on convolution) at shallow exiting points and global aggregation heads (based on self-attention) at deep exiting points. This leverages the strengths of both convolution and self-attention for better feature representation. A novel two-stage training strategy, including end-to-end training and self-distillation, is used to further improve performance. |
LGViT achieves competitive performance with an average speed-up of 1.8x compared to the original ViT models while sacrificing only 2% accuracy on three vision datasets.
The heterogeneous exiting heads (LPH + GAH) outperform other exiting architectures, such as using only MLP, convolution, or attention, in terms of speed-accuracy trade-off.
The proposed two-stage training strategy is shown to be more effective than other training schemes, like normal, weighted, distillation, and alternating training, for early exiting in ViTs. |
The exiting positions and optimal exiting paths are currently chosen manually.
Future work will explore using Bayesian optimization to automate the exiting decision process. |
vision transformer, early exit, heterogeneous exiting heads, self-distillation, inference acceleration |
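The general idea of early exiting described above can be sketched as follows. This is a generic confidence-threshold exit rule, assumed here for illustration; the exit criterion, the threshold value, and the names `early_exit_predict`, `exit_heads`, and `final_head` are hypothetical and do not reproduce LGViT's actual heads or training scheme.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(x, blocks, exit_heads, final_head, threshold=0.9):
    """Run blocks in order and stop at the first sufficiently confident exit head.

    blocks:     list of callables, each mapping features to features.
    exit_heads: dict {block index: callable mapping features to class logits}.
    final_head: classifier applied if no intermediate head is confident enough.
    Returns (predicted class index, index of the last block that was executed).
    """
    h = x
    for depth, block in enumerate(blocks):
        h = block(h)
        head = exit_heads.get(depth)
        if head is not None:
            probs = softmax(head(h))
            confidence = max(probs)
            if confidence >= threshold:
                return probs.index(confidence), depth
    probs = softmax(final_head(h))
    return probs.index(max(probs)), len(blocks) - 1
```

Easy inputs leave through a shallow head and skip the remaining blocks, which is where the speed-up comes from; hard inputs fall through to the final classifier.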
2308.00135
Report |
InFusion: Inject and Attention Fusion for Multi Concept Zero-Shot Text-based Video Editing |
Anant Khandelwal |
Large text-to-image diffusion models have achieved remarkable success in
generating diverse, high-quality images. Additionally, these models have been
successfully leveraged to edit input images by just changing the text prompt.
But when these models are applied to videos, the main challenge is to ensure
temporal consistency and coherence across frames. In this paper, we propose
InFusion, a framework for zero-shot text-based video editing leveraging large
pre-trained image diffusion models. Our framework specifically supports editing
of multiple concepts with pixel-level control over diverse concepts mentioned
in the editing prompt. Specifically, we inject the difference in features
obtained with source and edit prompts from U-Net residual blocks of decoder
layers. When these are combined with injected attention features, it becomes
feasible to query the source contents and scale edited concepts along with the
injection of unedited parts. The editing is further controlled in a
fine-grained manner with mask extraction and attention fusion, which cut the
edited part from the source and paste it into the denoising pipeline for the
editing prompt. Our framework is a low-cost alternative to one-shot tuned
models for editing since it does not require training. We demonstrated complex
concept editing with a generalised image model (Stable Diffusion v1.5) using
LoRA. Adaptation is compatible with all the existing image diffusion
techniques. Extensive experimental results demonstrate the effectiveness of
our framework over existing methods in rendering high-quality and temporally consistent videos. |
This paper introduces InFusion, a zero-shot text-based video editing framework that leverages pre-trained image diffusion models (specifically Stable Diffusion v1.5) to enable multi-concept editing with pixel-level control. |
The method addresses limitations in existing video editing techniques that struggle to maintain temporal consistency and fine-grained control when modifying multiple concepts within a video. |
InFusion employs a two-part strategy: 1) **Inject**: Injects differences in spatial and attention features from source and edit prompts into the denoising pipeline to guide concept modification. 2) **Attention Fusion**: Uses masks extracted from cross-attention maps to combine source and edit attention, ensuring accurate concept replacement while preserving unedited content. |
Achieves state-of-the-art temporal consistency and editing accuracy in edited videos compared to baseline methods, as evidenced by CLIP metrics and user studies.
Demonstrates successful editing of complex concepts, including object replacement, color changes, style transfer, and scene modifications, while maintaining source video fidelity.
Offers a cost-effective and flexible alternative to one-shot fine-tuned models, as it requires no training and is compatible with existing image diffusion techniques. |
The mask thresholding in Attention Fusion might require case-by-case adjustments for optimal performance.
Future work could explore extending InFusion to incorporate additional control mechanisms, such as motion guidance or user-specified editing regions. |
video editing, text-guided synthesis, diffusion models, zero-shot learning, temporal consistency |
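The attention fusion step described above, where a mask extracted from cross-attention decides which positions take the edited attention and which keep the source attention, can be sketched as follows. The max-normalization, the default threshold, and the function names `attention_mask` and `fuse_attention` are assumptions made for illustration, not details taken from the paper.

```python
def attention_mask(cross_attn, threshold=0.5):
    """Binarize a flattened cross-attention map for the edited concept.

    Values are normalized by the map's maximum (an assumption), then thresholded.
    """
    peak = max(cross_attn)
    if peak <= 0:
        return [0] * len(cross_attn)
    return [1 if value / peak >= threshold else 0 for value in cross_attn]


def fuse_attention(source_attn, edit_attn, mask):
    """Take edited attention where the mask is 1 and source attention elsewhere."""
    return [e if m else s for s, e, m in zip(source_attn, edit_attn, mask)]


# Positions 1 and 2 attend strongly to the edited concept, so they are replaced.
mask = attention_mask([0.1, 0.8, 0.4], threshold=0.5)
fused = fuse_attention([1.0, 2.0, 3.0], [10.0, 20.0, 30.0], mask)
```

The threshold is the quantity the limitations note refers to: too low and unedited regions are overwritten, too high and the edit is only partially applied.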
2307.16867
Report |
Revisiting the Parameter Efficiency of Adapters from the Perspective of Precision Redundancy |
Shibo Jie, Haoqing Wang, Zhi-Hong Deng |
Current state-of-the-art results in computer vision depend in part on
fine-tuning large pre-trained vision models. However, with the exponential
growth of model sizes, the conventional full fine-tuning, which needs to store
an individual network copy for each task, leads to increasingly huge storage
and transmission overhead. Adapter-based Parameter-Efficient Tuning (PET)
methods address this challenge by tuning lightweight adapters inserted into the
frozen pre-trained models. In this paper, we investigate how to make adapters
even more efficient, reaching a new minimum size required to store a
task-specific fine-tuned network. Inspired by the observation that the
parameters of adapters converge at flat local minima, we find that adapters are
resistant to noise in parameter space, which means they are also resistant to
low numerical precision. To train low-precision adapters, we propose a
computationally efficient quantization method which minimizes the quantization
error. Through extensive experiments, we find that low-precision adapters
exhibit minimal performance degradation, and even 1-bit precision is sufficient
for adapters. The experimental results demonstrate that 1-bit adapters
outperform all other PET methods on both the VTAB-1K benchmark and few-shot
FGVC tasks, while requiring the smallest storage size. Our findings show, for
the first time, the significant potential of quantization techniques in PET,
providing a general solution to enhance the parameter efficiency of
adapter-based PET methods. Code: https://github.com/JieShibo/PETL-ViT |
This paper explores precision redundancy in adapter-based parameter-efficient tuning (PET) for vision transformers, proposing a method to train and store adapters in low-bit parameter space, significantly improving their efficiency with minimal performance loss. |
Storing task-specific fine-tuned large vision models incurs prohibitive storage and transmission costs. Adapter-based PET methods, while more efficient than full fine-tuning, still require significant storage, especially for numerous tasks. This work leverages the precision redundancy in adapters to further improve efficiency. |
The authors analyze the loss landscape of adapters and find they converge at flatter minima, implying resilience to noise, including quantization error. They propose an efficient quantization-aware training method based on empirical observations of adapter parameter distributions, minimizing quantization error during training. |
Quantizing adapters to low-bit precision, even 1-bit, results in negligible performance degradation, unlike quantizing entire models.
With a fixed storage budget, 1-bit quantized adapters achieve superior performance compared to higher bit-width settings.
The proposed 1-bit adapter method outperforms previous PET methods, including low-rank factorization methods, while using the smallest storage size on both VTAB-1K and few-shot FGVC tasks. |
The study focuses on ViT backbones, and the optimal bit-width for other architectures may require further investigation.
Exploring more sophisticated quantization strategies beyond the Gaussian distribution assumption could further improve performance. |
parameter-efficient tuning, vision transformers, quantization, adapter-based tuning, low-bit neural networks |
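To make the idea of 1-bit adapter storage concrete, here is a minimal sketch of a generic two-level quantizer. It stores one bit per weight plus two floats (a mean and a scale). For a partition by the sign of the centered weight, setting the scale to the mean absolute deviation minimizes the squared reconstruction error. This is a standard scheme used here as a stand-in: it is not the authors' quantization-aware training method, and the function names are hypothetical.

```python
def quantize_1bit(weights):
    """Quantize a list of floats to one bit each, plus a shared mean and scale.

    Returns (bits, mean, scale), where each bit records whether the weight
    lies at or above the mean.
    """
    n = len(weights)
    mean = sum(weights) / n
    # Mean absolute deviation: the least-squares optimal level for this partition.
    scale = sum(abs(w - mean) for w in weights) / n
    bits = [1 if w >= mean else 0 for w in weights]
    return bits, mean, scale


def dequantize_1bit(bits, mean, scale):
    """Reconstruct approximate weights from the bits and the two stored floats."""
    return [mean + scale if b else mean - scale for b in bits]


bits, mean, scale = quantize_1bit([1.0, -1.0, 3.0, -3.0])
approx = dequantize_1bit(bits, mean, scale)
```

Storage drops from 32 bits per weight to 1 bit per weight plus a constant overhead, which is the source of the storage savings the record describes.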
2307.16813
Report |
Capturing Co-existing Distortions in User-Generated Content for No-reference Video Quality Assessment |
Kun Yuan, Zishang Kong, Chuanchuan Zheng, Ming Sun, Xing Wen |
Video Quality Assessment (VQA), which aims to predict the perceptual quality
of a video, has attracted increasing attention with the rapid development of
streaming media technology, such as Facebook, TikTok, Kwai, and so on. Compared
with other sequence-based visual tasks (e.g., action recognition), VQA
faces two under-estimated challenges unresolved in User Generated Content (UGC)
videos. First, it is not rare that several frames containing serious
distortions (e.g., blocking, blurriness) can determine the perceptual
quality of the whole video, while other sequence-based tasks require more
frames of equal importance for representations. Second, the perceptual
quality of a video exhibits a multi-distortion distribution, due to the
differences in the duration and probability of occurrence for various
distortions. In order to solve the above challenges, we propose Visual
Quality Transformer (VQT) to extract quality-related sparse features more
efficiently. Methodologically, a Sparse Temporal Attention (STA) is proposed to
sample keyframes by analyzing the temporal correlation between frames, which
reduces the computational complexity from $O(T^2)$ to $O(T \log T)$.
Structurally, a Multi-Pathway Temporal Network (MPTN) utilizes multiple STA
modules with different degrees of sparsity in parallel, capturing co-existing
distortions in a video. Experimentally, VQT demonstrates superior performance
to many state-of-the-art methods on three public no-reference VQA
datasets. Furthermore, VQT shows better performance on four full-reference VQA
datasets against widely-adopted industrial algorithms (i.e., VMAF and
AVQT). |
This paper proposes Visual Quality Transformer (VQT), a novel Transformer-based architecture designed for no-reference video quality assessment, specifically targeting the challenges posed by co-existing distortions in user-generated content. |
Accurately assessing the quality of user-generated content (UGC) videos, often characterized by diverse and co-existing distortions, is crucial for various applications like content filtering and video enhancement. |
VQT utilizes two key components: a Sparse Temporal Attention (STA) module for efficiently sampling keyframes containing distortions, and a Multi-Pathway Temporal Network (MPTN) to capture different distortion characteristics simultaneously. |
VQT significantly outperforms state-of-the-art methods on three NR-VQA datasets, demonstrating substantial improvements in prediction accuracy.
It surpasses even widely-adopted industrial algorithms (VMAF and AVQT) on four FR-VQA datasets, highlighting its robust generalization ability.
VQT exhibits good generalization to general video classification tasks, achieving competitive results on Kinetics-400 while being computationally efficient. |
The current keyframe selection in STA relies on pre-defined hyperparameters, which could be improved with adaptive mechanisms.
Future work could investigate further speed-up techniques like knowledge distillation and quantization for real-time applications. |
video quality assessment, user-generated content, video transformer, sparse temporal attention, co-existing distortions |
2307.16686
Report |
Guiding Image Captioning Models Toward More Specific Captions |
Simon Kornblith, Lala Li, Zirui Wang, Thao Nguyen |
Image captioning is conventionally formulated as the task of generating
captions for images that match the distribution of reference image-caption
pairs. However, reference captions in standard captioning datasets are short
and may not uniquely identify the images they describe. These problems are
further exacerbated when models are trained directly on image-alt text pairs
collected from the internet. In this work, we show that it is possible to
generate more specific captions with minimal changes to the training process.
We implement classifier-free guidance for an autoregressive captioning model by
fine-tuning it to estimate both conditional and unconditional distributions
over captions. The guidance scale applied at decoding controls a trade-off
between maximizing $p(\mathrm{caption}|\mathrm{image})$ and
$p(\mathrm{image}|\mathrm{caption})$. Compared to standard greedy decoding,
decoding with a guidance scale of 2 substantially improves reference-free
metrics such as CLIPScore (0.808 vs. 0.775) and caption$\to$image retrieval
performance in the CLIP embedding space (recall@1 44.6% vs. 26.5%), but worsens
standard reference-based captioning metrics (e.g., CIDEr 78.6 vs 126.1). We
further explore the use of language models to guide the decoding process,
obtaining small improvements over the Pareto frontier of reference-free vs.
reference-based captioning metrics that arises from classifier-free guidance,
and substantially improving the quality of captions generated from a model
trained only on minimally curated web data. |
This paper investigates methods for guiding image captioning models to generate more specific captions, focusing on classifier-free guidance (CFG) and language model (LM) guidance. |
Standard image captioning models often produce generic captions. This paper addresses this issue by exploring techniques to enhance caption specificity, aiming to better capture image details. |
The authors employ CFG, a technique originally designed for diffusion models, by fine-tuning an autoregressive captioning model to estimate conditional and unconditional caption distributions. Additionally, they explore using a few-shot prompted LM to guide the caption generation process, influencing caption style and improving quality. |
Applying CFG with a guidance scale of 2 substantially improves reference-free metrics such as CLIPScore and caption-to-image retrieval performance but negatively impacts reference-based metrics like CIDEr.
LM guidance with descriptive prompts slightly outperforms CFG in balancing reference-free and reference-based metrics.
LM guidance significantly enhances captions generated by a model trained on minimally curated web data, demonstrating its potential for zero-shot captioning. |
The study primarily utilizes greedy decoding, which may not be optimal for LM guidance with structured prompts. Exploring beam search could be beneficial.
While CFG enhances caption specificity, it can lead to grammatical errors and nonsensical words at higher guidance scales. Further research on regularizing the estimator of pointwise mutual information could mitigate this. |
image captioning, classifier-free guidance, language model guidance, caption specificity, zero-shot captioning |
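The decoding rule behind classifier-free guidance for an autoregressive captioner can be sketched per token. At each step the guided score is the unconditional score plus the guidance scale times the gap between conditional and unconditional scores, so a scale of 1 recovers ordinary conditional decoding. The function names and the toy scores below are illustrative assumptions; in practice the two score vectors come from the same fine-tuned model run with and without the image.

```python
def cfg_scores(cond, uncond, scale):
    """Combine conditional and unconditional next-token log-probabilities.

    guided = uncond + scale * (cond - uncond); scale = 1 returns cond unchanged.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def greedy_token(scores):
    """Index of the highest-scoring token (greedy decoding)."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Token 0 is generic: likely with or without the image.
# Token 1 is image-specific: its score rises a lot once the image is given.
cond = [2.0, 1.5]
uncond = [2.0, 0.5]

plain = greedy_token(cfg_scores(cond, uncond, 1.0))   # picks the generic token
guided = greedy_token(cfg_scores(cond, uncond, 2.0))  # picks the specific token
```

Raising the scale rewards tokens whose probability depends on the image, which matches the reported trade-off: more specific captions and better retrieval, at the cost of reference-based metrics.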
2307.16601
Report |
Sampling to Distill: Knowledge Transfer from Open-World Data |
Yuzheng Wang, Zhaoyu Chen, Jie Zhang, Dingkang Yang, Zuhao Ge, Yang Liu, Siao Liu, Yunquan Sun, Wenqiang Zhang, Lizhe Qi |
Data-Free Knowledge Distillation (DFKD) is a novel task that aims to train
high-performance student models using only the teacher network without original
training data. Despite encouraging results, existing DFKD methods rely heavily
on generation modules with high computational costs. Meanwhile, they ignore the
fact that domain shifts exist between the generated and original data due to the lack
of supervision information. Moreover, knowledge is transferred through each
example, ignoring the implicit relationship among multiple examples. To this
end, we propose a novel Open-world Data Sampling Distillation (ODSD) method
without a redundant generation process. First, we try to sample open-world data
close to the original data's distribution by an adaptive sampling module. Then,
we introduce a low-noise representation to alleviate the domain shifts and
build a structured relationship of multiple data examples to exploit data
knowledge. Extensive experiments on CIFAR-10, CIFAR-100, NYUv2, and ImageNet
show that our ODSD method achieves state-of-the-art performance. In particular, we
improve accuracy by 1.50%-9.59% on the ImageNet dataset compared with the
existing results. |
This paper proposes Open-world Data Sampling Distillation (ODSD), a novel data-free knowledge distillation method that avoids the computational cost of data generation by effectively utilizing open-world unlabeled data. |
Existing DFKD methods suffer from high computational costs associated with generation modules and domain shifts between generated and original data. Additionally, they often overlook the implicit relationship among multiple data examples, limiting knowledge transfer. |
ODSD employs Adaptive Prototype Sampling (APS) to select unlabeled data resembling the original data distribution. It introduces Denoising Contrastive Relational Distillation (DCRD) with a low-noise representation to mitigate domain shifts and utilizes a contrastive structured relationship to leverage knowledge from both data and the teacher network. |
ODSD achieves state-of-the-art performance on CIFAR-10, CIFAR-100, NYUv2, and ImageNet datasets.
The method shows significant improvements, exceeding existing methods by up to 9.59% accuracy on ImageNet.
Experiments demonstrate the effectiveness of the proposed sampling method, distillation approach, and structured knowledge framework. |
The performance of prototype-based sampling might be sensitive to the choice of clustering algorithm and the number of prototypes.
Future work could explore more effective contrastive learning strategies for structured knowledge distillation. |
knowledge distillation, data-free learning, contrastive learning, domain adaptation, computer vision |
2307.16586
Report |
SAMFlow: Eliminating Any Fragmentation in Optical Flow with Segment Anything Model |
Shili Zhou, Ruian He, Weimin Tan, Bo Yan |
Optical Flow Estimation aims to find the 2D dense motion field between two
frames. Due to the limitation of model structures and training datasets,
existing methods often rely too much on local clues and ignore the integrity of
objects, resulting in fragmented motion estimation. Through theoretical
analysis, we find the pre-trained large vision models are helpful in optical
flow estimation, and we notice that the recently famous Segment Anything Model
(SAM) demonstrates a strong ability to segment complete objects, which is
suitable for solving the fragmentation problem. We thus propose a solution to
embed the frozen SAM image encoder into FlowFormer to enhance object
perception. To address the challenge of fully exploiting SAM in
non-segmentation tasks like optical flow estimation, we propose an Optical Flow
Task-Specific Adaption scheme, including a Context Fusion Module to fuse the
SAM encoder with the optical flow context encoder, and a Context Adaption
Module to adapt the SAM features for the optical flow task with Learned
Task-Specific Embedding. Our proposed SAMFlow model reaches 0.86/2.10
clean/final EPE and 3.55/12.32 EPE/F1-all on the Sintel and KITTI-15 training sets,
surpassing FlowFormer by 8.5%/9.9% and 13.2%/16.3%. Furthermore, our model
achieves state-of-the-art performance on the Sintel and KITTI-15 benchmarks,
ranking #1 among all two-frame methods on Sintel clean pass. |
This paper proposes SAMFlow, a novel approach to enhance the accuracy of optical flow estimation by embedding a frozen Segment Anything Model (SAM) image encoder into FlowFormer, effectively addressing fragmentation issues. |
Existing optical flow estimation methods often produce fragmented results due to the limitations of datasets and model structures, relying too heavily on local clues and ignoring object integrity. SAM's strong object segmentation ability makes it suitable for solving this fragmentation problem. |
The authors integrate the SAM image encoder into FlowFormer and propose an Optical Flow Task-Specific Adaption scheme. This scheme consists of a Context Fusion Module (CFM) to fuse SAM and FlowFormer encoder features, and a Context Adaption Module (CAM) to adapt the fused features for optical flow estimation using Learned Task-Specific Embedding (LTSE). |
SAMFlow achieves state-of-the-art performance on Sintel and KITTI-15 benchmarks, surpassing FlowFormer by a significant margin.
The model exhibits strong robustness against fragmentation attacks, outperforming other methods in scenarios with occlusions and complex textures.
SAMFlow ranks #1 among all two-frame methods on the Sintel clean pass benchmark. |
The model's performance improvement comes with increased computational cost, although this can be mitigated by using smaller SAM encoder scales.
The authors primarily focus on two-frame optical flow estimation and plan to explore the application of SAM in multi-frame settings in future work. |
optical flow estimation, fragmentation, segment anything model (sam), flowformer, task-specific adaptation |
2307.16489
Report |
BAGM: A Backdoor Attack for Manipulating Text-to-Image Generative Models |
Jordan Vice, Naveed Akhtar, Richard Hartley, Ajmal Mian |
The rise in popularity of text-to-image generative artificial intelligence
(AI) has attracted widespread public interest. We demonstrate that this
technology can be attacked to generate content that subtly manipulates its
users. We propose a Backdoor Attack on text-to-image Generative Models (BAGM),
which upon triggering, infuses the generated images with manipulative details
that are naturally blended in the content. Our attack is the first to target
three popular text-to-image generative models across three stages of the
generative process by modifying the behaviour of the embedded tokenizer, the
language model or the image generative model. Based on the penetration level,
BAGM takes the form of a suite of attacks that are referred to as surface,
shallow and deep attacks in this article. Given the existing gap within this
domain, we also contribute a comprehensive set of quantitative metrics designed
specifically for assessing the effectiveness of backdoor attacks on
text-to-image models. The efficacy of BAGM is established by attacking
state-of-the-art generative models, using a marketing scenario as the target
domain. To that end, we contribute a dataset of branded product images. Our
embedded backdoors increase the bias towards the target outputs by more than
five times the usual, without compromising the model robustness or the
generated content utility. By exposing generative AI's vulnerabilities, we
encourage researchers to tackle these challenges and practitioners to exercise
caution when using pre-trained models. Relevant code, input prompts and
supplementary material can be found at https://github.com/JJ-Vice/BAGM, and the
dataset is available at:
https://ieee-dataport.org/documents/marketable-foods-mf-dataset.
Keywords: Generative Artificial Intelligence, Generative Models,
Text-to-Image generation, Backdoor Attacks, Trojan, Stable Diffusion. |
The paper proposes BAGM, a novel backdoor attack framework targeting text-to-image generative AI models, demonstrating manipulation of generated outputs across different stages of the generative process. |
The work exposes vulnerabilities of increasingly popular text-to-image AI models to subtle manipulation, raising important security and ethical concerns as these models can be exploited to influence user sentiments. |
The authors introduce three types of backdoor attacks: (1) Surface attack manipulating tokenization, (2) Shallow attack targeting the language model, and (3) Deep attack targeting the image generative model. They evaluate the attacks on three popular text-to-image pipelines (Stable Diffusion, Kandinsky, DeepFloyd-IF) using the proposed Marketable Foods dataset and novel evaluation metrics. |
Backdoor attacks were successfully injected into all three pipelines at various stages, demonstrating the vulnerability of these systems.
Proposed attacks achieved high attack success rates while maintaining low impact on model utility, making them stealthy and effective.
The paper establishes a benchmark for evaluating backdoor attacks on generative AI models through novel metrics, paving the way for future research on defense mechanisms. |
The paper primarily focuses on a marketing scenario for demonstration. Exploring other domains and attack vectors is crucial for a comprehensive understanding of these vulnerabilities.
While the proposed metrics offer a valuable starting point, further research is needed to develop more robust and comprehensive evaluation standards for generative AI model attacks. |
generative ai, backdoor attacks, text-to-image synthesis, model security, digital marketing |
2307.16441
Report |
Interactive Neural Painting |
Elia Peruzzo, Willi Menapace, Vidit Goel, Federica Arrigoni, Hao Tang, Xingqian Xu, Arman Chopikyan, Nikita Orlov, Yuxiao Hu, Humphrey Shi, Nicu Sebe, Elisa Ricci |
In the last few years, Neural Painting (NP) techniques have become capable of
producing extremely realistic artworks. This paper advances the state of the
art in this emerging research domain by proposing the first approach for
Interactive NP. Considering a setting where a user looks at a scene and tries
to reproduce it on a painting, our objective is to develop a computational
framework to assist the user's creativity by suggesting the next strokes to
paint, which can possibly be used to complete the artwork. To accomplish such a
task, we propose I-Paint, a novel method based on a conditional transformer
Variational AutoEncoder (VAE) architecture with a two-stage decoder. To
evaluate the proposed approach and stimulate research in this area, we also
introduce two novel datasets. Our experiments show that our approach provides
good stroke suggestions and compares favorably to the state of the art.
Additional details, code and examples are available at
https://helia95.github.io/inp-website. |
This paper introduces Interactive Neural Painting (INP), a novel image generation task where a computational tool assists users in painting by suggesting subsequent strokes based on a reference image and user input, aiming to make painting more accessible. |
Current Neural Painting (NP) methods lack interactivity and limit user control over the artistic process. INP addresses this gap by enabling user participation and integrating their style, potentially democratizing artistic expression through painting. |
The proposed method, INP-VAE, uses a conditional transformer VAE architecture. It leverages a context encoder to extract information from the reference image, user-painted canvas, and recent strokes. A two-stage decoder predicts stroke parameters, ensuring coherence with the reference and mimicking human painting styles learned from a synthetic dataset. |
INP-VAE generates stroke suggestions that accurately reflect reference images while adhering to characteristics of human painting demonstrations.
The method exhibits superior performance compared to adapted state-of-the-art NP techniques in terms of stroke sequence similarity, diversity, and adherence to painting style.
A user study confirms a clear preference for INP-VAE over baselines in generating stroke sequences that align with human-like painting processes. |
The reliance on synthetic datasets for training, while demonstrating the framework's capability, highlights the need for real human painting data for further refinement.
Future research can explore incorporating user feedback mechanisms to adapt the model's suggestions and better align with user intentions over time. |
interactive neural painting, image generation, conditional vae, transformer networks, human-computer interaction |
2307.16371
Report |
MobileVidFactory: Automatic Diffusion-Based Social Media Video Generation for Mobile Devices from Text |
Junchen Zhu, Huan Yang, Wenjing Wang, Huiguo He, Zixi Tuo, Yongsheng Yu, Wen-Huang Cheng, Lianli Gao, Jingkuan Song, Jianlong Fu, Jiebo Luo |
Videos for mobile devices have recently become the most popular way to share and acquire
information. For the convenience of users' creation, in this paper, we
present a system, namely MobileVidFactory, to automatically generate vertical
mobile videos where users mainly need to provide only simple texts. Our system
consists of two parts: basic and customized generation. In the basic
generation, we take advantage of the pretrained image diffusion model, and
adapt it to a high-quality open-domain vertical video generator for mobile
devices. As for the audio, by retrieving from our big database, our system
matches a suitable background sound for the video. Additionally to produce
customized content, our system allows users to add specified screen texts to
the video for enriching visual expression, and specify texts for automatic
reading with optional voices as they like. |
Introduces MobileVidFactory, the first automatic system for generating vertical videos for mobile devices from text, incorporating both basic and user-customized content creation. |
Addresses the growing popularity of vertical videos on social media and the need for accessible, easy-to-use video creation tools. |
Combines a pretrained image diffusion model adapted for vertical video generation, an audio retrieval model for background sound, and optional user-specified text overlays and text-to-speech narration. |
Generates high-quality vertical videos with detailed frames and smooth motion.
Enables users to customize videos with text overlays and personalized voiceovers.
Offers a user-friendly way to create engaging content for mobile consumption. |
The current training dataset for vertical video finetuning is limited.
Exploring more sophisticated audio-visual matching techniques is of interest. |
vertical video generation, diffusion model, mobile video, text-to-video, social media |
2307.16275
Report |
Stylized Projected GAN: A Novel Architecture for Fast and Realistic Image Generation |
Md Nurul Muttakin, Malik Shahid Sultan, Robert Hoehndorf, Hernando Ombao |
Generative Adversarial Networks are used for generating data using a
generator and a discriminator. GANs usually produce high-quality images, but
training GANs in an adversarial setting is a difficult task. GANs require high
computation power and hyper-parameter regularization for converging. Projected
GANs tackle the training difficulty of GANs by using transfer learning to
project the generated and real samples into a pre-trained feature space.
Projected GANs improve the training time and convergence but produce artifacts
in the generated images which reduce the quality of the generated samples. We
propose an optimized architecture called Stylized Projected GANs which
integrates the mapping network of the Style GANs with Skip Layer Excitation of
Fast GAN. The integrated modules are incorporated within the generator
architecture of the Fast GAN to mitigate the problem of artifacts in the
generated images. |
The paper proposes Stylized Projected GAN (SPGAN), a novel architecture that integrates the mapping network of StyleGAN with Skip Layer Excitation (SLE) of FastGAN for faster generation of realistic images with fewer training samples. |
Training GANs is challenging, often requiring extensive computational resources and large datasets. Existing methods like Projected GANs, while faster, suffer from artifacts in generated images. This work addresses the need for architectures that balance training speed and image quality. |
The authors experiment with different combinations of architectural components from StyleGAN and FastGAN, focusing on the generator design. They investigate the impact of integrating the mapping network at different resolutions, using deeper mapping networks, and combining the mapping network with SLE. |
SPGAN with stylization in initial layers significantly reduces the number of training samples required compared to Projected GAN, achieving better FID, KID, and precision scores.
Integrating the mapping network with SLE in later layers is ineffective, suggesting artifacts originate in low-resolution layers.
Deeper mapping networks improve image diversity (higher recall) but may slightly reduce image quality (lower precision). |
Despite improvements, artifacts persist in generated images. Future work will focus on refining the discriminator to address this.
Potential solutions include incorporating artifact-aware loss functions, a separate artifact classification head, or an encoder-based approach for artifact detection. |
generative adversarial networks, image generation, transfer learning, stylegan, fastgan |
2307.16204
Report |
Open-Set Domain Adaptation with Visual-Language Foundation Models |
Qing Yu, Go Irie, Kiyoharu Aizawa |
Unsupervised domain adaptation (UDA) has proven to be very effective in
transferring knowledge obtained from a source domain with labeled data to a
target domain with unlabeled data. Owing to the lack of labeled data in the
target domain and the possible presence of unknown classes, open-set domain
adaptation (ODA) has emerged as a potential solution to identify these classes
during the training phase. Although existing ODA approaches aim to solve the
distribution shifts between the source and target domains, most methods
fine-tuned ImageNet pre-trained models on the source domain with the adaptation
on the target domain. Recent visual-language foundation models (VLFM), such as
Contrastive Language-Image Pre-Training (CLIP), are robust to many distribution
shifts and, therefore, should substantially improve the performance of ODA. In
this work, we explore generic ways to adopt CLIP, a popular VLFM, for ODA. We
investigate the performance of zero-shot prediction using CLIP, and then
propose an entropy optimization strategy to assist the ODA models with the
outputs of CLIP. The proposed approach achieves state-of-the-art results on
various benchmarks, demonstrating its effectiveness in addressing the ODA
problem. |
This paper proposes a novel method for Open-Set Domain Adaptation (ODA) that leverages the power of Visual-Language Foundation Models (VLFM), particularly CLIP. |
Existing ODA methods often struggle with unknown classes and distribution shifts between domains. This work leverages the robust zero-shot capabilities and large-scale pre-training of CLIP to enhance ODA performance. |
The method utilizes the zero-shot predictions from CLIP and an entropy optimization strategy. The entropy of CLIP's predictions identifies potential unknown samples. An ODA model is then trained on source data and adapted to the target domain using entropy separation and CLIP's predictions as guidance. |
The proposed method achieves state-of-the-art results on various ODA benchmarks, including Office, Office-Home, VisDA, and DomainNet.
The approach is effective for both ODA and Source-Free ODA (SF-ODA).
The study reveals that CLIP's zero-shot performance is comparable to existing ODA methods, especially on datasets with coarse-grained classes and common domains. |
The current method utilizes a simple strategy for leveraging CLIP. Exploring more sophisticated integration and fine-tuning strategies could further improve performance.
Future work includes investigating computationally efficient methods for adapting CLIP to ODA while preventing overfitting to the source domain. |
open-set domain adaptation, source-free domain adaptation, visual-language foundation models, clip, zero-shot learning |
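The entropy-based identification of unknown samples described in the entry above can be illustrated with a minimal sketch. It assumes CLIP's zero-shot output is available as one list of similarity logits per target image; the function names, the normalization by log(K), and the 0.5 threshold are hypothetical choices for illustration, not the authors' implementation.

```python
import math

def softmax(logits):
    """Convert raw similarity logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_entropy(probs):
    """Shannon entropy divided by log(K), so the result lies in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))

def split_known_unknown(logits_per_sample, threshold=0.5):
    """Mark a sample 'unknown' when its prediction entropy exceeds the threshold.

    The threshold value is an assumption made for this sketch.
    """
    labels = []
    for logits in logits_per_sample:
        ent = normalized_entropy(softmax(logits))
        labels.append("unknown" if ent > threshold else "known")
    return labels

# Hypothetical logits for two target images over three known classes.
confident = [10.0, 0.0, 0.0]  # peaked prediction, low entropy
ambiguous = [1.0, 1.0, 1.0]   # uniform prediction, maximal entropy
print(split_known_unknown([confident, ambiguous]))
```

A peaked prediction yields entropy near 0 and is treated as a known class, while a flat prediction yields entropy near 1 and is flagged as a candidate unknown sample.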
2307.16184
Report |
UnIVAL: Unified Model for Image, Video, Audio and Language Tasks |
Mustafa Shukor, Corentin Dancette, Alexandre Rame, Matthieu Cord |
Large Language Models (LLMs) have made the ambitious quest for generalist
agents significantly far from being a fantasy. A key hurdle for building such
general models is the diversity and heterogeneity of tasks and modalities. A
promising solution is unification, allowing the support of a myriad of tasks
and modalities within one unified framework. While a few large models (e.g.,
Flamingo (Alayrac et al., 2022)), trained on massive datasets, can support more
than two modalities, current small to mid-scale unified models are still
limited to 2 modalities, usually image-text or video-text. The question that we
ask is: is it possible to build efficiently a unified model that can support
all modalities? To answer this, we propose UnIVAL, a step further towards this
ambitious goal. Without relying on fancy dataset sizes or models with billions
of parameters, the ~0.25B parameter UnIVAL model goes beyond two modalities
and unifies text, images, video, and audio into a single model. Our model is
efficiently pretrained on many tasks, based on task balancing and multimodal
curriculum learning. UnIVAL shows competitive performance to existing
state-of-the-art approaches, across image and video-text tasks. The feature
representations learned from image and video-text modalities allow the model
to achieve competitive performance when finetuned on audio-text tasks, despite
not being pretrained on audio. Thanks to the unified model, we propose a novel
study on multimodal model merging via weight interpolation of models trained on
different multimodal tasks, showing their benefits in particular for
out-of-distribution generalization. Finally, we motivate unification by showing
the synergy between tasks. The model weights and code are released here:
https://github.com/mshukor/UnIVAL. |
Proposes UnIVAL, a unified model handling image, video, and audio-text tasks within a single architecture, vocabulary, input/output format, and training objective. |
Overcomes limitations of models focused on one or two modalities by leveraging synergies between diverse tasks and modalities for a more generalist approach. |
Employs a Transformer-based encoder-decoder LM with modality-specific CNN encoders, pretrained on a variety of image/video-text datasets using a multimodal curriculum learning and task balancing strategy. |
Achieves competitive performance on image/video-text tasks, including new SOTA on Visual Grounding.
Shows strong generalization to new modalities, achieving competitive performance on audio-text tasks without pretraining on audio data.
Demonstrates the effectiveness of weight interpolation for merging models finetuned on different multimodal tasks, improving multitask performance without inference overhead. |
Limited performance on complex instructions and tasks requiring intricate reasoning.
Hallucinations and potential biases inherited from training data need further mitigation. |
multimodal learning, unified models, curriculum learning, weight interpolation, generalist agents |
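The model merging by weight interpolation mentioned in the entry above amounts to a parameter-wise convex combination of two checkpoints that share one architecture. The sketch below assumes checkpoints are plain dictionaries mapping parameter names to flat lists of floats; the helper name `interpolate_weights`, the toy task names, and the coefficient `lam` are hypothetical and only illustrate the idea.

```python
def interpolate_weights(weights_a, weights_b, lam):
    """Return lam * weights_a + (1 - lam) * weights_b, parameter by parameter."""
    if weights_a.keys() != weights_b.keys():
        raise ValueError("both models must share the same parameter names")
    merged = {}
    for name in weights_a:
        a, b = weights_a[name], weights_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]
    return merged

# Toy checkpoints: flat lists stand in for tensors, and the task names
# (captioning, VQA) are placeholders chosen for this example.
captioning = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
vqa = {"layer.weight": [3.0, 6.0], "layer.bias": [1.0]}

merged = interpolate_weights(captioning, vqa, lam=0.5)
print(merged)
```

Because the merged model has the same parameter count as either input, it adds no inference overhead, which matches the property highlighted in the summary.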
2307.16183
Report |
HD-Fusion: Detailed Text-to-3D Generation Leveraging Multiple Noise Estimation |
Jinbo Wu, Xiaobo Gao, Xing Liu, Zhengyang Shen, Chen Zhao, Haocheng Feng, Jingtuo Liu, Errui Ding |
In this paper, we study Text-to-3D content generation leveraging 2D diffusion
priors to enhance the quality and detail of the generated 3D models. Recent
progress (Magic3D) in text-to-3D has shown that employing high-resolution
(e.g., 512 x 512) renderings can lead to the production of high-quality 3D
models using latent diffusion priors. To enable rendering at even higher
resolutions, which has the potential to further augment the quality and detail
of the models, we propose a novel approach that combines multiple noise
estimation processes with a pretrained 2D diffusion prior. Distinct from the
Bar-Tal et al.'s study, which binds multiple denoised results to generate images
from texts, our approach integrates the computation of score distillation
losses such as SDS loss and VSD loss, which are essential techniques for the 3D
content generation with 2D diffusion priors. We experimentally evaluated the
proposed approach. The results show that the proposed approach can generate
high-quality details compared to the baselines. |
This paper proposes HD-Fusion, a novel text-to-3D generation method leveraging multiple noise estimation processes with pretrained 2D diffusion priors to produce highly detailed 3D models. |
Generating high-quality, detailed 3D models from text is crucial for applications like the Metaverse, but training 3D diffusion models directly is computationally expensive and data-intensive. HD-Fusion addresses this challenge by using 2D diffusion priors for efficient training and high-quality output. |
The method utilizes a two-stage coarse-to-fine approach. First, a neural field represents the object's shape and color, optimized using SDS loss in the latent space. The second stage uses a DMTet model and a color network, optimized by rendering views at higher resolutions and employing multiple noise estimation for memory efficiency. ControlNet is incorporated for geometric accuracy. |
The proposed multiple noise estimation enables training with high-resolution rendering, leading to finer details compared to baselines.
HD-Fusion outperforms SOTA methods like Magic3D and Fantasia3D in terms of visual quality.
Task-specific guidance, like pose guidance using ControlNet, significantly improves geometric accuracy, as shown in 3D human character generation. |
The impact of varying the number of tiles in the multiple noise estimation process needs further investigation.
Exploring the combination of the proposed method with other advancements in text-to-3D, such as VSD, could lead to even better visual quality.
Future work involves investigating the application of the proposed approach to more challenging tasks, such as 3D scene generation from text. |
text-to-3d generation, diffusion models, multiple noise estimation, high-resolution rendering, controlnet |
2307.16151
Report |
StylePrompter: All Styles Need Is Attention |
Chenyi Zhuang, Pan Gao, Aljosa Smolic |
GAN inversion aims at inverting given images into corresponding latent codes
for Generative Adversarial Networks (GANs), especially StyleGAN, where there exists a
disentangled latent space that allows attribute-based image manipulation at
latent level. As most inversion methods build upon Convolutional Neural
Networks (CNNs), we transfer a hierarchical vision Transformer backbone
innovatively to predict $\mathcal{W}^+$ latent codes at token level. We further
apply a Style-driven Multi-scale Adaptive Refinement Transformer (SMART) in
$\mathcal{F}$ space to refine the intermediate style features of the generator.
By treating style features as queries to retrieve lost identity information
from the encoder's feature maps, SMART can not only produce high-quality
inverted images but also surprisingly adapt to editing tasks. We then prove
that StylePrompter lies in a more disentangled $\mathcal{W}^+$ and show the
controllability of SMART. Finally, quantitative and qualitative experiments
demonstrate that StylePrompter can achieve desirable performance in balancing
reconstruction quality and editability, and is "smart" enough to fit into most
edits, outperforming other $\mathcal{F}$-involved inversion methods. |
This paper introduces StylePrompter, a novel Transformer-based GAN inversion framework for generating high-quality, editable images by mapping real images into StyleGAN's latent space. |
Balancing high-quality image inversion with flexible editing capabilities in GANs remains a challenge. Existing methods often struggle with this trade-off, especially in deeper, more expressive latent spaces like StyleGAN's $\mathcal{F}$ space. |
The authors employ a hierarchical Swin Transformer backbone to predict latent codes ($\mathcal{W}^+$) at a token level, allowing for disentangled attribute learning. Additionally, they introduce a Style-driven Multi-scale Adaptive Refinement Transformer (SMART) block to refine intermediate style features in $\mathcal{F}$ space, enhancing reconstruction quality and enabling flexible editing. |
StylePrompter achieves a better balance between reconstruction quality and editability compared to previous methods.
The study shows that the predicted latent codes in $\mathcal{W}^+$ space exhibit a higher degree of disentanglement, enabling more controlled and meaningful image manipulations.
The proposed SMART block effectively refines style features, enhancing inversion quality while surprisingly preserving editing capabilities in the $\mathcal{F}$ space. |
The model struggles to reconstruct out-of-domain details, potentially due to style feature modification at a shallow layer.
Future work could explore stacking multiple SMART blocks for progressive refinement and improved out-of-domain detail reconstruction. |
gan inversion, transformer, image editing, multi-scale attention, stylegan |
2307.16125
Report |
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension |
Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, Ying Shan |
Based on powerful Large Language Models (LLMs), recent generative Multimodal
Large Language Models (MLLMs) have gained prominence as a pivotal research
area, exhibiting remarkable capability for both comprehension and generation.
In this work, we address the evaluation of generative comprehension in MLLMs as
a preliminary step towards a comprehensive assessment of generative models, by
introducing a benchmark named SEED-Bench. SEED-Bench consists of 19K
multiple-choice questions with accurate human annotations (6x larger than
existing benchmarks), which span 12 evaluation dimensions including the comprehension
of both the image and video modality. We develop an advanced pipeline for
generating multiple-choice questions that target specific evaluation
dimensions, integrating both automatic filtering and manual verification
processes. Multiple-choice questions with ground-truth options derived from
human annotation enable an objective and efficient assessment of model
performance, eliminating the need for human or GPT intervention during
evaluation. We further evaluate the performance of 18 models across all 12
dimensions, covering both the spatial and temporal understanding. By revealing
the limitations of existing MLLMs through evaluation results, we aim for
SEED-Bench to provide insights for motivating future research. We will launch
and consistently maintain a leaderboard to provide a platform for the community
to assess and investigate model capability. |
This paper introduces SEED-Bench, a large-scale benchmark designed to evaluate the generative comprehension abilities of Multimodal Large Language Models (MLLMs). |
Existing benchmarks for evaluating MLLMs are limited in scale, scope, and objectivity. SEED-Bench addresses these limitations by providing a comprehensive and objective evaluation framework for MLLMs. |
SEED-Bench leverages foundation models to extract visual information from images and videos, which is then used by ChatGPT/GPT-4 to generate multiple-choice questions. The generated questions are then filtered automatically and manually to ensure quality and relevance. |
Most MLLMs still exhibit limited performance across all 12 evaluation dimensions, especially in fine-grained temporal understanding and text recognition.
InstructBLIP achieves state-of-the-art results on SEED-Bench, demonstrating superior performance in 8 out of 12 evaluation dimensions.
VideoLLMs, despite being trained on video data, fail to achieve competitive performance on temporal understanding tasks compared to ImageLLMs. |
The current version of SEED-Bench primarily focuses on multiple-choice questions, potentially limiting the diversity of evaluated abilities.
Future work includes expanding the benchmark with additional evaluation dimensions, incorporating more diverse question formats, and exploring automatic generation of video-related questions. |
multimodal large language models, benchmarking, generative comprehension, visual reasoning, temporal understanding |
2307.15860
Report |
What can Discriminator do? Towards Box-free Ownership Verification of Generative Adversarial Network |
Ziheng Huang, Boheng Li, Yan Cai, Run Wang, Shangwei Guo, Liming Fang, Jing Chen, Lina Wang |
In recent decades, Generative Adversarial Network (GAN) and its variants have
achieved unprecedented success in image synthesis. However, well-trained GANs
are under threat of illegal theft or leakage. Prior studies on remote
ownership verification assume a black-box setting where the defender can query
the suspicious model with specific inputs, which we find is not enough for
generation tasks. To this end, in this paper, we propose a novel IP protection
scheme for GANs where ownership verification can be done by checking outputs
only, without choosing the inputs (i.e., box-free setting). Specifically, we
make use of the unexploited potential of the discriminator to learn a
hypersphere that captures the unique distribution learned by the paired
generator. Extensive evaluations on two popular GAN tasks and more than 10 GAN
architectures demonstrate that our proposed scheme effectively verifies
ownership. Our proposed scheme is shown to be immune to popular input-based
removal attacks and robust against other existing attacks. The source code and
models are available at
https://github.com/AbstractTeen/gan_ownership_verification |
This paper proposes a novel, box-free ownership verification scheme for Generative Adversarial Networks (GANs) by leveraging the discriminator's ability to capture the generator's learned data distribution. |
Existing black-box verification methods for GANs are vulnerable to input manipulation and ambiguity attacks, particularly in tasks where deterministic inputs are not feasible. |
The method trains a hypersphere-based classifier using the discriminator's representations. This classifier captures the unique distribution of images generated by the paired generator. A Pearson correlation loss is introduced during training to prevent discriminator degradation and preserve its representational capacity. |
The method effectively distinguishes between GANs with different architectures, training datasets, and even initialization seeds.
The scheme is robust against model pruning and output image transformations while maintaining acceptable image quality.
It is resilient to ambiguity attacks as it relies on the discriminator's unique representation, which is difficult to replicate without the original discriminator. |
The security relies on the secrecy of the discriminator, as its disclosure could enable attacks.
Future work can explore extending the approach to other generative models like diffusion models. |
generative adversarial networks, ownership verification, box-free verification, discriminator representation, ambiguity attack |
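The hypersphere idea in the entry above can be pictured as a distance test in the discriminator's representation space: outputs of the protected generator should land inside a learned sphere, while outputs of other models mostly fall outside. The sketch below assumes the embeddings, center, and radius are already given; the function names and the 0.9 decision fraction are hypothetical, and the training of the sphere (including the Pearson correlation loss) is not shown.

```python
import math

def distance(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def inside_fraction(embeddings, center, radius):
    """Fraction of embeddings that fall inside the hypersphere."""
    hits = sum(1 for z in embeddings if distance(z, center) <= radius)
    return hits / len(embeddings)

def verify_ownership(embeddings, center, radius, min_fraction=0.9):
    """Claim ownership when enough suspect outputs lie inside the sphere.

    min_fraction is an assumed decision rule for this sketch.
    """
    return inside_fraction(embeddings, center, radius) >= min_fraction

# Toy 2-D embeddings standing in for discriminator representations.
center, radius = [0.0, 0.0], 1.0
own_outputs = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]
other_outputs = [[2.0, 2.0], [0.1, 0.1], [3.0, 0.0]]

print(verify_ownership(own_outputs, center, radius))
print(verify_ownership(other_outputs, center, radius))
```

Only the suspect model's outputs are needed for this check, which is what makes the setting box-free: the verifier never chooses the inputs.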
2307.15697
Report |
SimDETR: Simplifying self-supervised pretraining for DETR |
Ioannis Maniadis Metaxas, Adrian Bulat, Ioannis Patras, Brais Martinez, Georgios Tzimiropoulos |
DETR-based object detectors have achieved remarkable performance but are
sample-inefficient and exhibit slow convergence. Unsupervised pretraining has
been found to be helpful to alleviate these impediments, allowing training with
large amounts of unlabeled data to improve the detector's performance. However,
existing methods have their own limitations, like keeping the detector's
backbone frozen in order to avoid performance degradation and utilizing
pretraining objectives misaligned with the downstream task. To overcome these
limitations, we propose a simple pretraining framework for DETR-based detectors
that consists of three simple yet key ingredients: (i) richer, semantics-based
initial proposals derived from high-level feature maps, (ii) discriminative
training using object pseudo-labels produced via clustering, (iii)
self-training to take advantage of the improved object proposals learned by the
detector. We report two main findings: (1) Our pretraining outperforms prior
DETR pretraining works on both the full and low data regimes by significant
margins. (2) We show we can pretrain DETR from scratch (including the backbone)
directly on complex image datasets like COCO, paving the path for unsupervised
representation learning directly using DETR. |
This paper proposes SimDETR, a self-supervised pretraining framework for DETR-based object detectors that improves sample efficiency and convergence speed. |
DETR-based detectors, while achieving high performance, are known for slow convergence and requiring large amounts of labeled data. |
SimDETR uses three key components: 1) semantics-based initial proposals from clustered high-level feature maps, 2) class-aware pretraining using object pseudo-labels derived from clustering, and 3) iterative self-training for refining object proposals and enhancing supervision. |
SimDETR outperforms prior DETR pretraining methods in full data, semi-supervised, and few-shot settings.
SimDETR allows pretraining DETR from scratch, including the backbone, directly on complex datasets like COCO, demonstrating effective unsupervised representation learning.
SimDETR achieves competitive results for self-supervised representation learning on scene-centric images, indicating its potential for general-purpose representation learning. |
The performance of SimDETR, while competitive, is still slightly lower than object-centric pretraining on ImageNet, suggesting further room for improvement.
The paper focuses on DETR-based architectures, leaving the exploration of SimDETR's effectiveness on other detection frameworks for future work. |
object detection, self-supervised learning, detr, unsupervised pretraining, representation learning |
2307.15640
Report |
CLIP Brings Better Features to Visual Aesthetics Learners |
Liwu Xu, Jinjin Xu, Yuzhe Yang, Yijie Huang, Yanchun Xie, Yaqian Li |
The success of pre-training approaches on a variety of downstream tasks has
revitalized the field of computer vision. Image aesthetics assessment (IAA) is
one of the ideal application scenarios for such methods due to its subjective and
expensive labeling procedure. In this work, a unified and flexible two-phase
CLIP-based Semi-supervised Knowledge Distillation paradigm is proposed, namely
CSKD.
Specifically, we first integrate and leverage a multi-source unlabeled dataset
to align rich features between a given visual encoder and an off-the-shelf CLIP
image encoder via feature alignment loss. Notably, the given visual encoder is
not limited by size or structure and, once well-trained, it can seamlessly
serve as a better visual aesthetic learner for both student and teacher. In the
second phase, the unlabeled data is also utilized in semi-supervised IAA
learning to further boost student model performance when applied in
latency-sensitive production scenarios. By analyzing the attention distance and
entropy before and after feature alignment, we notice an alleviation of the feature
collapse issue, which in turn showcases the necessity of feature alignment
instead of training directly based on CLIP image encoder. Extensive experiments
indicate the superiority of CSKD, which achieves state-of-the-art performance
on multiple widely used IAA benchmarks. |
This paper proposes CSKD, a novel CLIP-based two-phase Semi-supervised Knowledge Distillation method for Image Aesthetics Assessment (IAA), which improves both the generalization ability and the knowledge distillation efficiency of IAA algorithms. |
IAA suffers from poor model generalization ability due to the subjective and expensive labeling procedure. Existing DL-based methods have high complexity hindering their deployment on mobile devices, while lightweight models usually suffer from unacceptable performance drop. This paper utilizes the representation ability of CLIP to improve the performance of IAA models. |
The method consists of two phases: 1) CLIP-based Feature Alignment (CFA): aligns the features of a given visual encoder with an off-the-shelf CLIP image encoder using a large unlabeled dataset; 2) Semi-supervised Knowledge Distillation (SKD): fine-tunes a teacher model using labeled IAA data and then trains a student model with both labeled and unlabeled data by minimizing the difference between their predictions and human/pseudo labels. |
CSKD achieves state-of-the-art performance on multiple IAA benchmarks, including AVA, AADB, and PARA.
Analysis of attention maps before and after CFA indicates an alleviation of the feature collapse issue.
Semi-supervised knowledge distillation with unlabeled data significantly boosts student model performance. |
The performance improvement brought by using a larger CLIP model is not thoroughly investigated.
The method is only evaluated on three IAA datasets, and its generalization ability to other datasets needs further validation.
Future work will focus on exploring the impact of different CLIP models and applying the method to other image-related tasks. |
image aesthetics assessment, clip, knowledge distillation, semi-supervised learning, feature alignment |
2307.15353
Report |
Supervised Homography Learning with Realistic Dataset Generation |
Hai Jiang, Haipeng Li, Songchen Han, Haoqiang Fan, Bing Zeng, Shuaicheng Liu |
In this paper, we propose an iterative framework, which consists of two
phases: a generation phase and a training phase, to generate realistic training
data and yield a supervised homography network. In the generation phase, given
an unlabeled image pair, we utilize the pre-estimated dominant plane masks and
homography of the pair, along with another sampled homography that serves as
ground truth to generate a new labeled training pair with realistic motion. In
the training phase, the generated data is used to train the supervised
homography network, in which the training data is refined via a content
consistency module and a quality assessment module. Once an iteration is
finished, the trained network is used in the next data generation phase to
update the pre-estimated homography. Through such an iterative strategy, the
quality of the dataset and the performance of the network can be gradually and
simultaneously improved. Experimental results show that our method achieves
state-of-the-art performance and that existing supervised methods can also be
improved based on the generated dataset. Code and dataset are available at
https://github.com/JianghaiSCU/RealSH. |
This paper proposes an iterative deep framework to generate realistic datasets for supervised homography learning and trains a high-precision homography estimation network. |
Supervised homography learning methods lag behind unsupervised methods due to the lack of qualified training data that simultaneously satisfies both label criteria and realism criteria. |
The framework iteratively generates training data and trains the homography network. It uses pre-estimated dominant plane masks, initial homographies, and a sampled ground truth homography to synthesize realistic image pairs. A content consistency module and a quality assessment module are introduced to refine the generated data during training. |
The method achieves state-of-the-art performance on CA-unsup and GHOF benchmarks, outperforming existing supervised and unsupervised methods.
The generated dataset, CA-sup, significantly improves the performance of existing supervised methods, demonstrating its effectiveness.
Ablation studies validate the contribution of each component in the framework, including the dataset generation strategy, content consistency module, and quality assessment module. |
The method relies on the accuracy of pre-estimated dominant plane masks and initial homographies.
The iterative process requires more computation compared to traditional training methods. |
homography estimation, dataset generation, supervised learning, deep learning, computer vision |
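In the generation phase described in the entry above, a sampled homography serves as the ground-truth label for a synthesized pair. A minimal sketch of what such a label encodes: apply a 3x3 homography to the four image corners and record their displacements. The 4-corner offset parameterization is a common convention assumed here, not something the summary confirms, and the function names are hypothetical.

```python
def apply_homography(h, point):
    """Map a 2-D point through a 3x3 homography given as nested lists."""
    x, y = point
    xp = h[0][0] * x + h[0][1] * y + h[0][2]
    yp = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return (xp / w, yp / w)  # divide by w to leave homogeneous coordinates

def corner_offsets(h, width, height):
    """Displacement of each image corner under h (assumed label format)."""
    corners = [(0.0, 0.0), (width, 0.0), (width, height), (0.0, height)]
    offsets = []
    for x, y in corners:
        u, v = apply_homography(h, (x, y))
        offsets.append((u - x, v - y))
    return offsets

# A pure translation by (5, -3) pixels, written as a homography.
sampled_h = [
    [1.0, 0.0, 5.0],
    [0.0, 1.0, -3.0],
    [0.0, 0.0, 1.0],
]
print(corner_offsets(sampled_h, width=640.0, height=480.0))
```

For a pure translation every corner moves by the same amount; a general homography with a non-trivial bottom row moves each corner differently, which is what lets four offsets encode all eight degrees of freedom.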
2307.15333
Report |
Dynamic PlenOctree for Adaptive Sampling Refinement in Explicit NeRF |
Haotian Bai, Yiqi Lin, Yize Chen, Lin Wang |
The explicit neural radiance field (NeRF) has gained considerable interest
for its efficient training and fast inference capabilities, making it a
promising direction for applications such as virtual reality and gaming. In particular,
PlenOctree (POT)[1], an explicit hierarchical multi-scale octree
representation, has emerged as a structural and influential framework. However,
POT's fixed structure for direct optimization is sub-optimal as the scene
complexity evolves continuously with updates to cached color and density,
necessitating refining the sampling distribution to capture signal complexity
accordingly. To address this issue, we propose the dynamic PlenOctree DOT,
which adaptively refines the sample distribution to adjust to changing scene
complexity. Specifically, DOT proposes a concise yet novel hierarchical feature
fusion strategy during the iterative rendering process. Firstly, it identifies
the regions of interest through training signals to ensure adaptive and
efficient refinement. Next, rather than directly filtering out valueless nodes,
DOT introduces the sampling and pruning operations for octrees to aggregate
features, enabling rapid parameter learning. Compared with POT, our DOT
outperforms it by enhancing visual quality, reducing parameters by over
$55.15\%$/$68.84\%$, and providing 1.7/1.9 times the FPS for NeRF-synthetic and
Tanks $\&$ Temples, respectively. Project homepage: https://vlislab22.github.io/DOT.
[1] Yu, Alex, et al. "Plenoctrees for real-time rendering of neural radiance
fields." Proceedings of the IEEE/CVF International Conference on Computer
Vision. 2021. |
This paper proposes DOT, a dynamic PlenOctree structure that adaptively refines the sample distribution in explicit NeRF based on training signals, improving rendering quality and efficiency. |
Fixed octree structures like POT are sub-optimal as scene complexity changes during training. DOT addresses this by dynamically calibrating the octree structure for better adaptation. |
DOT uses a hierarchical feature fusion strategy. It identifies regions of interest based on training signals like ray weight and then prunes valueless regions while sampling more in complex areas. This process iteratively refines the octree, aggregating features for efficient representation. |
DOT significantly reduces the number of parameters compared to POT (over 55% on synthetic and 68% on Tanks & Temples).
It enhances rendering quality, achieving better PSNR, SSIM, and LPIPS scores than POT on both datasets.
DOT achieves a considerable speedup, reaching 1.7 and 1.9 times the FPS of POT on NeRF-synthetic and Tanks & Temples, respectively. |
DOT relies on pretrained NeRF-SH models, inheriting the limitation of potentially long initial training times.
Future work includes exploring methods to train the model from scratch with signal-guided sample allocation. |
neural radiance fields, plenoctree, adaptive sampling, hierarchical feature fusion, real-time rendering |
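The prune-and-sample refinement described in the DOT report above can be sketched on a flat list of octree leaves. Everything here is an assumption made for illustration: the thresholds, the use of a single accumulated ray weight per leaf, and the rule that pruning averages child features while sampling copies the parent feature to eight children. It is not the authors' implementation.

```python
import numpy as np

def refine_leaves(weights, prune_thresh, sample_thresh):
    """Pick an action per leaf from its accumulated ray weight (the training signal)."""
    weights = np.asarray(weights, dtype=np.float64)
    actions = np.full(weights.shape, "keep", dtype=object)
    actions[weights < prune_thresh] = "prune"        # low-signal region: merge into parent
    actions[weights > sample_thresh] = "subdivide"   # high-signal region: sample more densely
    return actions

def prune(children_feats):
    """Aggregate the features of 8 children, shape (8, C), into one parent feature (C,)."""
    return np.asarray(children_feats, dtype=np.float64).mean(axis=0)

def sample(parent_feat):
    """Subdivide a leaf: its 8 new children start from the parent's feature."""
    return np.tile(np.asarray(parent_feat, dtype=np.float64), (8, 1))
```

Under these assumptions, pruning a freshly subdivided leaf returns the original feature, so repeated refinement rounds do not drift when no training updates happen in between.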
2307.15157
Report |
R-LPIPS: An Adversarially Robust Perceptual Similarity Metric |
Sara Ghazanfari, Siddharth Garg, Prashanth Krishnamurthy, Farshad Khorrami, Alexandre Araujo |
Similarity metrics have played a significant role in computer vision to
capture the underlying semantics of images. In recent years, advanced
similarity metrics, such as the Learned Perceptual Image Patch Similarity
(LPIPS), have emerged. These metrics leverage deep features extracted from
trained neural networks and have demonstrated a remarkable ability to closely
align with human perception when evaluating relative image similarity. However,
it is now well-known that neural networks are susceptible to adversarial
examples, i.e., small perturbations invisible to humans crafted to deliberately
mislead the model. Consequently, the LPIPS metric is also sensitive to such
adversarial examples. This susceptibility introduces significant security
concerns, especially considering the widespread adoption of LPIPS in
large-scale applications. In this paper, we propose the Robust Learned
Perceptual Image Patch Similarity (R-LPIPS) metric, a new metric that leverages
adversarially trained deep features. Through a comprehensive set of
experiments, we demonstrate the superiority of R-LPIPS compared to the
classical LPIPS metric. The code is available at
https://github.com/SaraGhazanfari/R-LPIPS. |
This paper introduces R-LPIPS, an adversarially robust perceptual similarity metric designed to address the vulnerability of the LPIPS metric to adversarial examples. |
The sensitivity of LPIPS to adversarial perturbations poses significant security risks, especially in applications like copyright infringement detection and digital forensics where image similarity assessment is crucial. |
R-LPIPS leverages adversarially trained deep features, incorporating adversarial training into the LPIPS training process to enhance its robustness. |
R-LPIPS demonstrates superior robustness compared to LPIPS when evaluated against adversarial attacks ($\ell_\infty$-PGD and $\ell_2$-PGD) across various data distortions.
The natural 2AFC score of R-LPIPS remains comparable to LPIPS, indicating that robustness is achieved without sacrificing accuracy.
New perceptual attacks (R-PPGA and R-LPA) based on R-LPIPS prove to be more effective than attacks based on LPIPS, successfully breaking the defenses of the perceptually robust model PAT. |
The adversarial training of R-LPIPS currently focuses on x0, with potential for further exploration by applying AT to x1 or both x0 and x1.
While adversarial training provides empirical robustness, R-LPIPS lacks theoretical guarantees. Investigating theoretical foundations for perceptual metrics like R-LPIPS is an important area for future work. |
perceptual similarity metric, adversarial robustness, lpips, adversarial training, perceptual attacks |
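The $\ell_\infty$-PGD attack mentioned in the R-LPIPS report above can be sketched on a toy feature distance. The linear feature map `W`, the squared-distance metric, and all parameter values are stand-ins chosen so the gradient can be written by hand. The real metric uses deep network features, and adversarial training would then fit the metric on such perturbed inputs.

```python
import numpy as np

def feature_distance(x_a, x_b, W):
    """Toy perceptual distance: squared L2 distance between linear features W @ x."""
    diff = W @ (x_a - x_b)
    return float(diff @ diff)

def pgd_linf(x0, W, eps=0.03, step=0.01, iters=10, rng=None):
    """Projected gradient ascent on feature_distance(x0 + delta, x0) with ||delta||_inf <= eps."""
    rng = np.random.default_rng() if rng is None else rng
    # Random start inside the eps-ball (the gradient is zero at delta = 0).
    delta = rng.uniform(-eps, eps, size=x0.shape)
    for _ in range(iters):
        # Analytic gradient of ||W @ delta||^2 with respect to delta.
        grad = 2.0 * W.T @ (W @ delta)
        # Signed ascent step followed by projection back onto the eps-ball.
        delta = np.clip(delta + step * np.sign(grad), -eps, eps)
    return x0 + delta
```

The perturbation stays within `eps` per coordinate, which is what makes it visually negligible while still moving the features.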
2307.15139
Report |
Online Clustered Codebook |
Chuanxia Zheng, Andrea Vedaldi |
Vector Quantisation (VQ) is experiencing a comeback in machine learning,
where it is increasingly used in representation learning. However, optimizing
the codevectors in existing VQ-VAE is not entirely trivial. A problem is
codebook collapse, where only a small subset of codevectors receive gradients
useful for their optimisation, whereas a majority of them simply ``dies off''
and is never updated or used. This limits the effectiveness of VQ for learning
larger codebooks in complex computer vision tasks that require high-capacity
representations. In this paper, we present a simple alternative method for
online codebook learning, Clustering VQ-VAE (CVQ-VAE). Our approach selects
encoded features as anchors to update the ``dead'' codevectors, while
optimising the codebooks which are alive via the original loss. This strategy
brings unused codevectors closer in distribution to the encoded features,
increasing the likelihood of being chosen and optimized. We extensively
validate the generalization capability of our quantiser on various datasets,
tasks (e.g. reconstruction and generation), and architectures (e.g. VQ-VAE,
VQGAN, LDM). Our CVQ-VAE can be easily integrated into the existing models with
just a few lines of code. |
This paper introduces CVQ-VAE, a novel Vector Quantisation (VQ) method addressing codebook collapse in representation learning by dynamically initializing codebooks using online feature clustering. |
Codebook collapse limits the effectiveness of VQ, particularly for large codebooks in complex computer vision tasks requiring high-capacity representations. CVQ-VAE aims to overcome this limitation and improve the utilization of large codebooks. |
CVQ-VAE dynamically initializes unoptimized codevectors by resampling from learned features. Unlike traditional clustering, it employs running averages of encoded features across mini-batches to handle changing feature representations during deep network training. |
CVQ-VAE significantly outperforms previous VQ methods like VQ-VAE and SQ-VAE on various datasets.
It achieves superior reconstruction quality compared to state-of-the-art methods like VQGAN, even under high compression ratios.
The method demonstrates strong generalization capabilities across different tasks, datasets, and architectures, including VQ-VAE, VQGAN, and LDM. |
While CVQ-VAE demonstrates promising results, the exploration of optimal codebook dimensionality remains an open question.
Future work could investigate the application of CVQ-VAE to broader downstream tasks beyond generation and completion. |
vector quantisation, representation learning, codebook collapse, image generation, deep learning |
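The dead-codevector update described in the CVQ-VAE report above can be sketched as follows. This is a simplified illustration under stated assumptions: a hard threshold `dead_thresh` on the running usage and uniformly random anchors replace whatever schedule and anchor selection the paper actually uses, and the function name and parameters are hypothetical.

```python
import numpy as np

def update_codebook(codebook, features, usage_ema, decay=0.99, dead_thresh=1e-3, rng=None):
    """One online step: track running usage per codevector and re-seed the dead ones.

    codebook: (K, D) codevectors, features: (N, D) encoder outputs for a mini-batch,
    usage_ema: (K,) running average of the fraction of features assigned to each code.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Nearest-codevector assignment by squared Euclidean distance.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    assign = dists.argmin(axis=1)
    counts = np.bincount(assign, minlength=len(codebook))
    # Running average across mini-batches, since features change as the encoder trains.
    usage_ema = decay * usage_ema + (1.0 - decay) * counts / len(features)
    # Codevectors that are rarely selected are re-initialised from encoded features (anchors).
    dead = usage_ema < dead_thresh
    anchors = features[rng.integers(0, len(features), size=int(dead.sum()))]
    new_codebook = codebook.copy()
    new_codebook[dead] = anchors
    return new_codebook, usage_ema, assign
```

Codevectors that are still in use would continue to be trained by the usual VQ loss; only the re-seeding step is shown here.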
2307.15131
Report |
Seal-3D: Interactive Pixel-Level Editing for Neural Radiance Fields |
Xiangyu Wang, Jingsen Zhu, Qi Ye, Yuchi Huo, Yunlong Ran, Zhihua Zhong, Jiming Chen |
With the popularity of implicit neural representations, or neural radiance
fields (NeRF), there is a pressing need for editing methods to interact with
the implicit 3D models for tasks like post-processing reconstructed scenes and
3D content creation. While previous works have explored NeRF editing from
various perspectives, they are restricted in editing flexibility, quality, and
speed, failing to offer direct editing response and instant preview. The key
challenge is to conceive a locally editable neural representation that can
directly reflect the editing instructions and update instantly. To bridge the
gap, we propose a new interactive editing method and system for implicit
representations, called Seal-3D, which allows users to edit NeRF models in a
pixel-level and free manner with a wide range of NeRF-like backbones and preview
the editing effects instantly. To achieve the effects, the challenges are
addressed by our proposed proxy function mapping the editing instructions to
the original space of NeRF models in the teacher model and a two-stage training
strategy for the student model with local pretraining and global finetuning. A
NeRF editing system is built to showcase various editing types. Our system can
achieve compelling editing effects with an interactive speed of about 1 second. |
This paper presents Seal-3D, an interactive pixel-level editing method for neural radiance fields that supports instant preview. |
Existing NeRF editing methods are limited in flexibility, quality, and speed, lacking direct editing response and instant preview. |
The method uses a proxy function to map editing instructions to the original NeRF space and a two-stage training strategy (local pretraining for instant preview and global finetuning for refinement) for a student NeRF model. |
Interactive editing with instant preview (≈1s) is achieved.
The method supports various editing types including geometry and color edits.
The student model can generate higher quality results than the teacher model due to multi-view consistency. |
The method does not support complex view-dependent lighting effects.
It cannot handle reconstruction failures in the original NeRF model. |
neural radiance fields, nerf editing, interactive editing, 3d scene editing, instant preview |
2307.15055
Report |
PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking |
Yang Zheng, Adam W. Harley, Bokui Shen, Gordon Wetzstein, Leonidas J. Guibas |
We introduce PointOdyssey, a large-scale synthetic dataset, and data
generation framework, for the training and evaluation of long-term fine-grained
tracking algorithms. Our goal is to advance the state-of-the-art by placing
emphasis on long videos with naturalistic motion. Toward the goal of
naturalism, we animate deformable characters using real-world motion capture
data, we build 3D scenes to match the motion capture environments, and we
render camera viewpoints using trajectories mined via structure-from-motion on
real videos. We create combinatorial diversity by randomizing character
appearance, motion profiles, materials, lighting, 3D assets, and atmospheric
effects. Our dataset currently includes 104 videos, averaging 2,000 frames
long, with orders of magnitude more correspondence annotations than prior work.
We show that existing methods can be trained from scratch on our dataset and
outperform the published variants. Finally, we introduce modifications to the
PIPs point tracking method, greatly widening its temporal receptive field,
which improves its performance on PointOdyssey as well as on two real-world
benchmarks. Our data and code are publicly available at:
https://pointodyssey.com |
This paper introduces PointOdyssey, a large-scale synthetic dataset for training and evaluating long-term fine-grained tracking algorithms, featuring long videos with naturalistic motion and diverse scenes. |
The work addresses the lack of datasets for fine-grained long-range tracking that reflect the complexities and opportunities of real-world video. |
The framework generates synthetic data by animating characters with motion capture data, recreating real-world environments, and randomizing scene attributes, and it provides pixel-perfect annotations for long-range trajectories. |
Existing methods trained on PointOdyssey outperform their publicly available variants.
A modified PIPs method with an extended temporal receptive field and template updates (PIPs++) achieves state-of-the-art performance on PointOdyssey and real-world benchmarks.
PointOdyssey presents a more challenging benchmark than existing real-world datasets like TAP-Vid-DAVIS and CroHD. |
The dataset currently lacks large outdoor scenes with significant camera travel.
Exploration of trackers utilizing scene-level and semantic cues, beyond low-level appearance matching, remains an open challenge. |
point tracking, synthetic dataset, long-term tracking, motion capture, scene understanding |
2307.15049
Report |
Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language Models |
Kecheng Zheng, Wei Wu, Ruili Feng, Kai Zhu, Jiawei Liu, Deli Zhao, Zheng-Jun Zha, Wei Chen, Yujun Shen |
Prompt tuning and adapter tuning have shown great potential in transferring
pre-trained vision-language models (VLMs) to various downstream tasks. In this
work, we design a new type of tuning method, termed as regularized mask tuning,
which masks the network parameters through a learnable selection. Inspired by
neural pathways, we argue that the knowledge required by a downstream task
already exists in the pre-trained weights but just gets concealed in the
upstream pre-training stage. To bring the useful knowledge back into light, we
first identify a set of parameters that are important to a given downstream
task, then attach a binary mask to each parameter, and finally optimize these
masks on the downstream data with the parameters frozen. When updating the
mask, we introduce a novel gradient dropout strategy to regularize the
parameter selection, in order to prevent the model from forgetting old
knowledge and overfitting the downstream data. Experimental results on 11
datasets demonstrate the consistent superiority of our method over previous
alternatives. It is noteworthy that we manage to deliver 18.73% performance
improvement compared to the zero-shot CLIP via masking an average of only 2.56%
parameters. Furthermore, our method is synergistic with most existing
parameter-efficient tuning methods and can boost the performance on top of
them. Project page can be found here (https://wuw2019.github.io/R-AMT/). |
The paper introduces Regularized Mask Tuning (R-MT), a new technique for adapting pre-trained vision-language models (VLMs) to downstream tasks by selectively masking parameters using learnable binary masks. |
Existing efficient tuning methods like prompt tuning and adapter tuning do not fully exploit the potential of pre-trained VLM parameters. R-MT aims to uncover hidden task-specific knowledge within these parameters, inspired by the concept of neural pathways in the brain. |
R-MT identifies key parameters based on gradient changes during downstream task training. Binary masks are attached to these parameters and optimized with gradient dropout regularization. This regularization incorporates general knowledge from the pre-trained VLM to prevent forgetting and overfitting. |
R-MT consistently outperforms existing methods, including prompt tuning and adapter tuning, on 11 image classification datasets.
R-MT achieves 18.73% performance improvement over zero-shot CLIP while masking only 2.56% of parameters on average.
R-MT is synergistic with existing methods and can boost their performance by around 3%. |
R-MT has not been evaluated on open-world detection and segmentation tasks due to computational resource limitations.
Future work will explore applying R-MT to other visual tasks such as segmentation. |
vision-language models, parameter-efficient tuning, mask tuning, few-shot learning, gradient dropout regularization |
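The two ingredients of the R-MT report above, binary masks on frozen weights and gradient dropout, can be sketched as below. The agreement rule is an assumption made for illustration: a task gradient entry is always kept when it has the same sign as a "general knowledge" gradient and is otherwise dropped with probability `drop_prob`. The report only states that the regularization incorporates general knowledge from the pre-trained VLM, so the exact rule, names, and parameters here are hypothetical.

```python
import numpy as np

def masked_weights(weights, scores, threshold=0.0):
    """Apply a binary mask to frozen weights; only the real-valued scores would be trained."""
    mask = (scores > threshold).astype(weights.dtype)
    return weights * mask, mask

def gradient_dropout(grad_task, grad_general, drop_prob=0.5, rng=None):
    """Randomly drop task-gradient entries that conflict with the general-knowledge gradient."""
    rng = np.random.default_rng() if rng is None else rng
    # Entries where both gradients point the same way are always kept.
    agree = grad_task * grad_general >= 0
    # Conflicting entries survive with probability 1 - drop_prob.
    keep_conflict = rng.random(grad_task.shape) >= drop_prob
    return np.where(agree | keep_conflict, grad_task, 0.0)
```

With `drop_prob=0.0` this reduces to plain mask tuning, and with `drop_prob=1.0` every conflicting update is suppressed, which is the strongest form of the regularizer in this sketch.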
2307.15033
Report |
Diverse Inpainting and Editing with GAN Inversion |
Ahmet Burak Yildirim, Hamza Pehlivan, Bahri Batuhan Bilecen, Aysegul Dundar |
Recent inversion methods have shown that real images can be inverted into
StyleGAN's latent space and numerous edits can be achieved on those images
thanks to the semantically rich feature representations of well-trained GAN
models. However, extensive research has also shown that image inversion is
challenging due to the trade-off between high-fidelity reconstruction and
editability. In this paper, we tackle an even more difficult task, inverting
erased images into GAN's latent space for realistic inpaintings and editings.
Furthermore, by augmenting inverted latent codes with different latent samples,
we achieve diverse inpaintings. Specifically, we propose to learn an encoder
and mixing network to combine encoded features from erased images with
StyleGAN's mapped features from random samples. To encourage the mixing network
to utilize both inputs, we train the networks with generated data via a novel
set-up. We also utilize higher-rate features to prevent color inconsistencies
between the inpainted and unerased parts. We run extensive experiments and
compare our method with state-of-the-art inversion and inpainting methods.
Quantitative metrics and visual comparisons show significant improvements. |
This paper introduces a novel framework for diverse image inpainting and editing using GAN inversion. It leverages an encoder and mixing network to combine encoded features from erased images with randomly sampled latent codes from StyleGAN. |
This approach addresses the limitations of existing GAN inversion methods that struggle with the trade-off between high-fidelity reconstruction and editability, particularly in the challenging scenario of inpainting erased images. |
The framework utilizes a two-stage training pipeline. First, it trains an encoder and mixing network with generated data to ensure diversity. Second, it incorporates skip connections to achieve high-fidelity reconstructions and seamless transitions between unerased and erased pixels. |
The proposed method significantly outperforms state-of-the-art models in terms of FID, LPIPS, U-IDS, and P-IDS metrics for image inpainting.
The framework demonstrates robustness across different mask difficulty levels and generalizes well to diverse datasets like FFHQ, AFHQ Cat, and AFHQ Dog.
The model successfully performs diverse inpainting and enables image editing on erased regions using InterfaceGAN directions. |
The diversity of inpainting results, while improved, is still limited by the model's ability to generate semantically consistent pixels.
Future work could explore alternative mixing network architectures or training strategies to further enhance the diversity and realism of inpainted outputs. |
gan inversion, image inpainting, image editing, generative adversarial networks, stylegan |
2307.14971
Report |
Take-A-Photo: 3D-to-2D Generative Pre-training of Point Cloud Models |
Ziyi Wang, Xumin Yu, Yongming Rao, Jie Zhou, Jiwen Lu |
With the overwhelming trend of masked image modeling led by MAE, generative
pre-training has shown a remarkable potential to boost the performance of
fundamental models in 2D vision. However, in 3D vision, the over-reliance on
Transformer-based backbones and the unordered nature of point clouds have
restricted the further development of generative pre-training. In this paper,
we propose a novel 3D-to-2D generative pre-training method that is adaptable to
any point cloud model. We propose to generate view images from different
instructed poses via the cross-attention mechanism as the pre-training scheme.
Generating view images has more precise supervision than its point cloud
counterpart, thus assisting 3D backbones to have a finer comprehension of the
geometrical structure and stereoscopic relations of the point cloud.
Experimental results have proved the superiority of our proposed 3D-to-2D
generative pre-training over previous pre-training methods. Our method is also
effective in boosting the performance of architecture-oriented approaches,
achieving state-of-the-art performance when fine-tuning on ScanObjectNN
classification and ShapeNetPart segmentation tasks. Code is available at
https://github.com/wangzy22/TAP. |
This paper proposes TAP, a novel 3D-to-2D generative pre-training method for point cloud models that enhances geometric structure and stereoscopic relation understanding. |
Existing 3D generative pre-training methods suffer from imprecise supervision and limited backbone adaptability. This work aims to address these limitations. |
TAP generates view images from different poses using a pose-dependent Photograph Module and a 2D generator. The module encodes pose information into queries for cross-attention with 3D features, enabling the model to learn projection relations. The generated images are supervised by rendered ground truth images with MSE loss. |
TAP consistently improves performance across various point cloud backbone architectures.
It outperforms previous generative pre-training methods on ScanObjectNN classification and achieves state-of-the-art results on ShapeNetPart segmentation.
The method demonstrates superior performance in few-shot learning scenarios and shows promising results in scene-level dense prediction tasks. |
The current implementation relies on a relatively simple 2D generator, which could be further improved for generating higher-fidelity images.
Exploring the effectiveness of perceptual loss with more realistic rendered images is an intriguing avenue for future work. |
3d vision, point cloud analysis, generative pre-training, cross-modal learning, self-supervised learning |
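The cross-attention step in the TAP report above, where pose-dependent queries attend to 3D backbone features, can be sketched in a few lines. The query construction (`grid_embed` plus a linear projection of a pose vector) and all shapes are assumptions chosen for illustration. Learned key/value projections, multiple heads, and the 2D generator are omitted.

```python
import numpy as np

def pose_queries(pose, grid_embed, w_pose):
    """Hypothetical query construction: per-pixel grid embeddings shifted by a pose embedding.

    pose: (P,), grid_embed: (Q, D), w_pose: (P, D) -> queries of shape (Q, D).
    """
    return grid_embed + pose @ w_pose

def cross_attention(queries, feats):
    """Single-head scaled dot-product attention from 2D queries (Q, D) to 3D features (N, D)."""
    scores = queries @ feats.T / np.sqrt(queries.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    # Each query gathers a weighted mix of point features; a 2D generator would decode these.
    return attn @ feats, attn
```

Because the queries change with the pose, the same set of point features produces different 2D feature maps for different viewpoints, which is what lets an image-space MSE loss supervise the 3D backbone.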
2307.14918
Report |
GET3D--: Learning GET3D from Unconstrained Image Collections |
Fanghua Yu, Xintao Wang, Zheyuan Li, Yan-Pei Cao, Ying Shan, Chao Dong |
The demand for efficient 3D model generation techniques has grown
exponentially, as manual creation of 3D models is time-consuming and requires
specialized expertise. While generative models have shown potential in creating
3D textured shapes from 2D images, their applicability in 3D industries is
limited due to the lack of a well-defined camera distribution in real-world
scenarios, resulting in low-quality shapes. To overcome this limitation, we
propose GET3D--, the first method that directly generates textured 3D shapes
from 2D images with unknown pose and scale. GET3D-- comprises a 3D shape
generator and a learnable camera sampler that captures the 6D external changes
on the camera. In addition, we propose a novel training schedule to stably
optimize both the shape generator and camera sampler in a unified framework. By
controlling external variations using the learnable camera sampler, our method
can generate aligned shapes with clear textures. Extensive experiments
demonstrate the efficacy of GET3D--, which precisely fits the 6D camera pose
distribution and generates high-quality shapes on both synthetic and realistic
unconstrained datasets. |
GET3D-- generates textured 3D shapes from 2D images with unknown and unconstrained camera poses. |
Existing 3D generation methods often assume fixed or known camera distributions, limiting their applicability to real-world images with unconstrained camera poses. |
GET3D-- employs a 3D shape generator and a learnable 6D camera sampler. It uses a novel training schedule: (1) initializes the shape generator with a fixed camera distribution, (2) initializes the camera sampler with the learned coarse shapes, (3) jointly trains both, and (4) fine-tunes the shape generator. It also uses camera compensation and a shape align loss to decouple object and camera transformations. |
GET3D-- generates higher-quality shapes and textures compared to baseline GET3D on unconstrained datasets.
The learnable camera sampler effectively captures the underlying 6D camera distribution.
Camera compensation and shape align loss are crucial for accurate shape and texture generation. |
The method assumes that the camera components are independent and that the ground-truth distribution is a single Gaussian.
Shape align loss might introduce noise when object shapes vary greatly. |
3d shape generation, camera pose estimation, unconstrained images, generative adversarial networks, differentiable rendering |
2307.14770
Report |
3DPortraitGAN: Learning One-Quarter Headshot 3D GANs from a Single-View Portrait Dataset with Diverse Body Poses |
Yiqian Wu, Hao Xu, Xiangjun Tang, Hongbo Fu, Xiaogang Jin |
3D-aware face generators are typically trained on 2D real-life face image
datasets that primarily consist of near-frontal face data, and as such, they
are unable to construct one-quarter headshot 3D portraits with complete head,
neck, and shoulder geometry. Two reasons account for this issue: First,
existing facial recognition methods struggle with extracting facial data
captured from large camera angles or back views. Second, it is challenging to
learn a distribution of 3D portraits covering the one-quarter headshot region
from single-view data due to significant geometric deformation caused by
diverse body poses. To this end, we first create the dataset
360°-Portrait-HQ (360°PHQ for short), which consists of high-quality
single-view real portraits annotated with a variety of camera parameters (the
yaw angles span the entire 360° range) and body poses. We then propose
3DPortraitGAN, the first 3D-aware one-quarter headshot portrait generator that
learns a canonical 3D avatar distribution from the 360°PHQ dataset with
body pose self-learning. Our model can generate view-consistent portrait images
from all camera angles with a canonical one-quarter headshot 3D representation.
Our experiments show that the proposed framework can accurately predict
portrait body poses and generate view-consistent, realistic portrait images
with complete geometry from all camera angles. |
This paper introduces 3DPortraitGAN, the first 3D-aware one-quarter headshot portrait generator that can learn a canonical 3D avatar distribution from a single-view portrait dataset with diverse body poses. |
Existing 3D-aware face generators are limited to frontal views and lack complete neck and shoulder geometry due to limitations in existing datasets. |
The authors create a new dataset, 360°-Portrait-HQ (360°PHQ), containing single-view portraits with diverse camera angles and body poses. They then propose a 3DPortraitGAN framework with a body pose-aware discriminator and a deformation module to generate view-consistent one-quarter headshot portraits with complete geometry. |
3DPortraitGAN generates high-quality, view-consistent portrait images from 360° camera angles.
The model accurately predicts body poses, surpassing the accuracy of coarse poses obtained from off-the-shelf methods.
Quantitative evaluation shows 3DPortraitGAN outperforms state-of-the-art methods in FID and facial identity consistency metrics. |
The deformation module, solely based on the SMPL model, does not consider the generated geometry, leading to artifacts and high computational cost.
The pose predictor in the generator is prone to collapsing during training, limiting the model's ability to achieve perfectly canonical representations. |
portrait generation, 3d-aware gans, deformable neural radiance fields, single-view reconstruction, body pose estimation |
2307.14735
Report |
Test Time Adaptation for Blind Image Quality Assessment |
Subhadeep Roy, Shankhanil Mitra, Soma Biswas, Rajiv Soundararajan |
While the design of blind image quality assessment (IQA) algorithms has
improved significantly, the distribution shift between the training and testing
scenarios often leads to a poor performance of these methods at inference time.
This motivates the study of test time adaptation (TTA) techniques to improve
their performance at inference time. Existing auxiliary tasks and loss
functions used for TTA may not be relevant for quality-aware adaptation of the
pre-trained model. In this work, we introduce two novel quality-relevant
auxiliary tasks at the batch and sample levels to enable TTA for blind IQA. In
particular, we introduce a group contrastive loss at the batch level and a
relative rank loss at the sample level to make the model quality aware and
adapt to the target data. Our experiments reveal that even using a small batch
of images from the test distribution helps achieve significant improvement in
performance by updating the batch normalization statistics of the source model. |
This paper introduces novel self-supervised test-time adaptation (TTA) techniques for blind image quality assessment (IQA) to address the challenge of distribution shifts between training and testing data. |
Existing IQA algorithms often suffer from poor generalization ability due to distribution shifts between training and testing scenarios. TTA offers a promising solution to adapt pre-trained IQA models to target data distributions at inference time, thereby improving their performance. |
The proposed TTA-IQA method introduces two novel quality-relevant auxiliary tasks at the batch and sample levels to enable TTA for IQA: 1) Group Contrastive (GC) loss: Contrasting groups of low and high-quality images in a batch to capture quality discriminative information. 2) Rank loss: Enforcing the model to rank the image quality of distorted augmentations of each test sample to maintain quality order. |
TTA-IQA significantly improves the performance of four different quality-aware source models (TReS, MUSIQ, HyperIQA, MetaIQA) on four IQA databases (KonIQ-10k, PIPAL, CID2013, LIVE-IQA).
The combination of rank loss and GC loss consistently outperforms using either loss individually, demonstrating their complementary nature.
TTA-IQA effectively adapts to target data even with small batch sizes, highlighting its efficiency in real-world scenarios. |
The choice of distortion types for the rank loss relies on the source model's knowledge, which may be inaccurate for significantly different target distributions.
Future work includes exploring more sophisticated auxiliary tasks and extending TTA-IQA to video quality assessment. |
test-time adaptation, blind image quality assessment, group contrastive learning, rank loss, distribution shift |
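The sample-level rank loss in the TTA-IQA report above can be illustrated with a margin ranking loss over two distorted copies of each test image. The hinge form, the `margin` value, and the Gaussian-noise distortion are assumptions made for this sketch. The report only states that the model is asked to rank distorted augmentations of each test sample.

```python
import numpy as np

def make_ranked_pair(image, sigma_low=0.02, sigma_high=0.1, rng=None):
    """Create a mildly and a heavily distorted copy of an image with values in [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    less = np.clip(image + rng.normal(scale=sigma_low, size=image.shape), 0.0, 1.0)
    more = np.clip(image + rng.normal(scale=sigma_high, size=image.shape), 0.0, 1.0)
    return less, more

def rank_loss(score_less, score_more, margin=0.1):
    """Hinge loss that is zero once the less-distorted copy scores higher by at least `margin`."""
    gap = np.asarray(score_less, dtype=np.float64) - np.asarray(score_more, dtype=np.float64)
    return float(np.maximum(0.0, margin - gap).mean())
```

No quality labels are needed: the ordering comes from the distortion strength alone, which is what makes the loss usable at test time.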
2307.14659
Report |
LLDiffusion: Learning Degradation Representations in Diffusion Models for Low-Light Image Enhancement |
Tao Wang, Kaihao Zhang, Ziqian Shao, Wenhan Luo, Bjorn Stenger, Tae-Kyun Kim, Wei Liu, Hongdong Li |
Current deep learning methods for low-light image enhancement (LLIE)
typically rely on pixel-wise mapping learned from paired data. However, these
methods often overlook the importance of considering degradation
representations, which can lead to sub-optimal outcomes. In this paper, we
address this limitation by proposing a degradation-aware learning scheme for
LLIE using diffusion models, which effectively integrates degradation and image
priors into the diffusion process, resulting in improved image enhancement. Our
proposed degradation-aware learning scheme is based on the understanding that
degradation representations play a crucial role in accurately modeling and
capturing the specific degradation patterns present in low-light images. To
this end, first, a joint learning framework for both image generation and image
enhancement is presented to learn the degradation representations. Second, to
leverage the learned degradation representations, we develop a Low-Light
Diffusion model (LLDiffusion) with a well-designed dynamic diffusion module.
This module takes into account both the color map and the latent degradation
representations to guide the diffusion process. By incorporating these
conditioning factors, the proposed LLDiffusion can effectively enhance
low-light images, considering both the inherent degradation patterns and the
desired color fidelity. Finally, we evaluate our proposed method on several
well-known benchmark datasets, including synthetic and real-world unpaired
datasets. Extensive experiments on public benchmarks demonstrate that our
LLDiffusion outperforms state-of-the-art LLIE methods both quantitatively and
qualitatively. The source code and pre-trained models are available at
https://github.com/TaoWangzj/LLDiffusion. |
This paper introduces LLDiffusion, a novel degradation-aware diffusion model for low-light image enhancement, which integrates degradation representations into the enhancement process. |
Current LLIE methods often overlook degradation representations, leading to sub-optimal results with artifacts or unnatural enhancements. LLDiffusion addresses this by explicitly modeling and utilizing degradation patterns. |
The approach involves a two-stage process: (1) Joint learning of degradation representations through a degradation generation network and an enhancement diffusion module. (2) Enhancement using a dynamic diffusion module conditioned on learned degradation representations and image priors (color maps). |
LLDiffusion outperforms state-of-the-art LLIE methods on benchmark datasets (LOL, LOL-v2, VE-LOL) both quantitatively and qualitatively.
The method exhibits strong generalization ability, effectively enhancing images from unseen datasets (DICM, MEF, NPE).
Ablation studies confirm the contribution of each component, highlighting the importance of degradation representation learning and the dynamic diffusion module. |
The latent map encoder currently has a simple structure and could be improved with increased width and depth for potentially better performance.
Future work will explore extending LLDiffusion for low-light video enhancement. |
low-light image enhancement, diffusion models, degradation representations, deep learning, computer vision |
2307.14638
Report |
EqGAN: Feature Equalization Fusion for Few-shot Image Generation |
Yingbo Zhou, Zhihao Yue, Yutong Ye, Pengyu Zhang, Xian Wei, Mingsong Chen |
Due to the absence of fine structure and texture information, existing
fusion-based few-shot image generation methods suffer from unsatisfactory
generation quality and diversity. To address this problem, we propose a novel
feature Equalization fusion Generative Adversarial Network (EqGAN) for few-shot
image generation. Unlike existing fusion strategies that rely on either deep
features or local representations, we design two separate branches to fuse
structures and textures by disentangling encoded features into shallow and deep
contents. To refine image contents at all feature levels, we equalize the fused
structure and texture semantics at different scales and supplement the decoder
with richer information by skip connections. Since the fused structures and
textures may be inconsistent with each other, we devise a consistent
equalization loss between the equalized features and the intermediate output of
the decoder to further align the semantics. Comprehensive experiments on three
public datasets demonstrate that EqGAN not only significantly improves
generation performance in terms of FID score (by up to 32.7%) and LPIPS score (by up
to 4.19%), but also outperforms state-of-the-art methods in terms of accuracy (by
up to 1.97%) for downstream classification tasks. |
The paper proposes EqGAN, a feature equalization fusion-based generative adversarial network for few-shot image generation. |
Existing fusion-based methods suffer from unsatisfactory generation quality and diversity due to semantic entanglement when fusing image features. |
EqGAN disentangles encoded features into structure and texture branches, performs multi-scale feature equalization fusion, and introduces a consistent equalization loss to align fused semantics. |
EqGAN significantly improves FID and LPIPS scores compared to state-of-the-art methods, demonstrating superior image quality and diversity.
Ablation studies confirm the effectiveness of each component in the feature equalization fusion strategy.
EqGAN boosts the accuracy of downstream classification tasks by providing higher-quality augmented images. |
The model's performance might be further enhanced by exploring more sophisticated fusion strategies.
The computational cost of EqGAN is relatively high due to the multi-scale feature processing. |
few-shot image generation, generative adversarial networks, feature fusion, semantic alignment, image quality |
2307.14620
Report |
NeRF-Det: Learning Geometry-Aware Volumetric Representation for Multi-View 3D Object Detection |
Chenfeng Xu, Bichen Wu, Ji Hou, Sam Tsai, Ruilong Li, Jialiang Wang, Wei Zhan, Zijian He, Peter Vajda, Kurt Keutzer, Masayoshi Tomizuka |
We present NeRF-Det, a novel method for indoor 3D detection with posed RGB
images as input. Unlike existing indoor 3D detection methods that struggle to
model scene geometry, our method makes novel use of NeRF in an end-to-end
manner to explicitly estimate 3D geometry, thereby improving 3D detection
performance. Specifically, to avoid the significant extra latency associated
with per-scene optimization of NeRF, we introduce sufficient geometry priors to
enhance the generalizability of NeRF-MLP. Furthermore, we subtly connect the
detection and NeRF branches through a shared MLP, enabling an efficient
adaptation of NeRF to detection and yielding geometry-aware volumetric
representations for 3D detection. Our method outperforms state-of-the-art methods by
3.9 mAP and 3.1 mAP on the ScanNet and ARKitScenes benchmarks, respectively. We
provide extensive analysis to shed light on how NeRF-Det works. As a result of
our joint-training design, NeRF-Det is able to generalize well to unseen scenes
for object detection, view synthesis, and depth estimation tasks without
requiring per-scene optimization. Code is available at
https://github.com/facebookresearch/NeRF-Det. |
Presents NeRF-Det, a novel method for indoor 3D object detection from posed RGB images, leveraging NeRF to learn geometry-aware volumetric representations. |
Addresses the challenge of ambiguous scene geometry in indoor 3D detection from RGB-only images by explicitly modeling it using NeRF. |
Jointly trains a NeRF branch with the 3D detection pipeline, sharing a geometry MLP and using augmented image features (including variance and color) as priors for NeRF. It estimates an opacity field from density to refine volume features. |
Outperforms state-of-the-art RGB-only methods by 3.9 mAP and 3.1 mAP on ScanNet and ARKitScenes, respectively.
Demonstrates the effectiveness of NeRF over depth maps and cost volume for scene geometry modeling in 3D detection.
Shows generalization ability to novel view synthesis and depth estimation on unseen scenes without per-scene optimization. |
The detection branch might hinder the NeRF branch's performance by potentially erasing low-level details.
Future work includes adapting NeRF-Det for outdoor 3D detection, addressing challenges like dynamic objects and unbounded scenes. |
3d object detection, neural radiance fields (nerf), multi-view geometry, indoor scene understanding, geometry-aware representations |
2307.14611
Report |
TextManiA: Enriching Visual Feature by Text-driven Manifold Augmentation |
Moon Ye-Bin, Jisoo Kim, Hongyeob Kim, Kilho Son, Tae-Hyun Oh |
We propose TextManiA, a text-driven manifold augmentation method that
semantically enriches visual feature spaces, regardless of class distribution.
TextManiA augments visual data with intra-class semantic perturbation by
exploiting easy-to-understand visually mimetic words, i.e., attributes. This
work is built on an interesting hypothesis that general language models, e.g.,
BERT and GPT, encompass visual information to some extent, even without
training on visual training data. Given the hypothesis, TextManiA transfers
pre-trained text representation obtained from a well-established large language
encoder to a target visual feature space being learned. Our extensive analysis
hints that the language encoder indeed encompasses visual information at least
useful to augment visual representation. Our experiments demonstrate that
TextManiA is particularly powerful in scarce samples with class imbalance as
well as even distribution. We also show compatibility with the label mix-based
approaches in evenly distributed scarce data. |
Proposes TextManiA, a method that enriches visual features by transferring attribute information from text embeddings to visual feature spaces, particularly beneficial for long-tailed and scarce data. |
Addresses the challenge of performance degradation in learning models when faced with data distribution shifts, especially in long-tailed distributions and scarce data scenarios. |
Leverages visually mimetic words (attributes) encoded by language models (BERT, GPT-2, CLIP) to augment visual features. Computes difference vectors between text embeddings with and without attributes, projects them onto the target visual feature space, and adds them to the original features. |
TextManiA consistently improves performance on long-tailed classification benchmarks (CIFAR-100-LT, ImageNet-LT), demonstrating its effectiveness in handling skewed class distributions.
Outperforms or complements mix-based augmentation methods in scarce data classification tasks (CIFAR-100-10%, Tiny-ImageNet-10%), highlighting the benefit of intra-class semantic perturbation.
Improves few-shot object detection accuracy (PASCAL VOC, MS-COCO) by enhancing the classification head's performance, especially in low-shot settings. |
The current attribute set is limited to color and size; exploring additional attributes could further enhance performance.
More effective attribute selection methods for specific tasks and datasets could be investigated. |
data augmentation, long-tail classification, scarce data, few-shot learning, vision and language |
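The augmentation step summarized in the TextManiA entry above (difference vectors between text embeddings with and without an attribute, projected and added to visual features) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: `embed_text` is a toy stand-in for a real language encoder, `project` is a hypothetical placeholder for a learned projection, and the dimension and scale are arbitrary.

```python
import hashlib
import random

DIM = 8  # toy embedding size (assumption, chosen only for the example)


def embed_text(text):
    """Toy stand-in for a language encoder: deterministic pseudo-random vector."""
    seed = int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16) % (2 ** 32)
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def attribute_difference(class_name, attribute):
    """Difference between the embeddings with and without the attribute word."""
    with_attr = embed_text(f"a {attribute} {class_name}")
    without_attr = embed_text(f"a {class_name}")
    return [a - b for a, b in zip(with_attr, without_attr)]


def project(vec, scale=0.1):
    """Hypothetical projection into the visual feature space (here: a scaling)."""
    return [scale * v for v in vec]


def augment(visual_feature, class_name, attribute):
    """Add the projected attribute difference vector to a visual feature."""
    delta = project(attribute_difference(class_name, attribute))
    return [f + d for f, d in zip(visual_feature, delta)]


feature = [0.5] * DIM
augmented = augment(feature, "bull", "red")
```

In a real setting the projection would be learned or fitted to the target feature space; the fixed scaling here only keeps the sketch self-contained.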
2307.14489
Report |
SuperInpaint: Learning Detail-Enhanced Attentional Implicit Representation for Super-resolutional Image Inpainting |
Canyu Zhang, Qing Guo, Xiaoguang Li, Renjie Wan, Hongkai Yu, Ivor Tsang, Song Wang |
In this work, we introduce a challenging image restoration task, referred to
as SuperInpaint, which aims to reconstruct missing regions in low-resolution
images and generate completed images with arbitrarily higher resolutions. We
have found that this task cannot be effectively addressed by stacking
state-of-the-art super-resolution and image inpainting methods as they amplify
each other's flaws, leading to noticeable artifacts. To overcome these
limitations, we propose the detail-enhanced attentional implicit representation
(DEAR) that can achieve SuperInpaint with a single model, resulting in
high-quality completed images with arbitrary resolutions. Specifically, we use
a deep convolutional network to extract the latent embedding of an input image
and then enhance the high-frequency components of the latent embedding via an
adaptive high-pass filter. This leads to detail-enhanced semantic embedding. We
further feed the semantic embedding into an unmask-attentional module that
suppresses embeddings from ineffective masked pixels. Additionally, we extract
a pixel-wise importance map that indicates which pixels should be used for
image reconstruction. Given the coordinates of a pixel we want to reconstruct,
we first collect its neighboring pixels in the input image and extract their
detail-enhanced semantic embeddings, unmask-attentional semantic embeddings,
importance values, and spatial distances to the desired pixel. Then, we feed
all the above terms into an implicit representation and generate the color of
the specified pixel. To evaluate our method, we extend three existing datasets
for this new task and build 18 meaningful baselines using SOTA inpainting and
super-resolution methods. Extensive experimental results demonstrate that our
method outperforms all existing methods by a significant margin on four widely
used metrics. |
This paper identifies a novel and challenging image restoration task, termed "SuperInpaint", which focuses on reconstructing missing regions in low-resolution images and generating high-fidelity completed images at any desired higher resolution. |
Existing image inpainting methods can't handle resolution changes, and super-resolution methods struggle with large missing regions. Combining them directly leads to amplified artifacts and unsatisfactory results. |
The authors propose DEAR (Detail-Enhanced Attentional Implicit Representation) for this task. DEAR leverages implicit image representation and incorporates three key modules: 1) Detail-Enhanced Semantic Embedding (DSE) to enhance high-frequency details. 2) Unmask-Attentional Semantic Embedding (USE) to suppress information from ineffective masked pixels. 3) Pixel-wise Importance Map to identify pixels suitable for reconstruction. |
DEAR significantly outperforms all 18 constructed baselines (combinations of SOTA inpainting and super-resolution methods) on three newly created datasets for SuperInpaint.
DEAR achieves superior performance in terms of PSNR, SSIM, L1, and LPIPS across a wide range of upscaling ratios.
Ablation studies confirm the effectiveness of each proposed module (DSE, USE, PIM) in contributing to the overall performance gain. |
The current work primarily focuses on reconstructing images with a single upscale ratio during training.
Exploring the feasibility of training a single DEAR model for arbitrary upscale ratios is an intriguing direction for future work. |
image inpainting, super-resolution, implicit neural representation, detail enhancement, attention mechanism |
2307.14352
Report |
General Image-to-Image Translation with One-Shot Image Guidance |
Bin Cheng, Zuhao Liu, Yunbo Peng, Yue Lin |
Large-scale text-to-image models pre-trained on massive text-image pairs have
recently shown excellent performance in image synthesis. However, images can provide
more intuitive visual concepts than plain text. People may ask: how can we
integrate the desired visual concept into an existing image, such as our
portrait? Current methods are inadequate in meeting this demand as they lack
the ability to preserve content or translate visual concepts effectively.
Inspired by this, we propose a novel framework named visual concept translator
(VCT) with the ability to preserve content in the source image and translate
the visual concepts guided by a single reference image. The proposed VCT
contains a content-concept inversion (CCI) process to extract contents and
concepts, and a content-concept fusion (CCF) process to gather the extracted
information to obtain the target image. Given only one reference image, the
proposed VCT can complete a wide range of general image-to-image translation
tasks with excellent results. Extensive experiments are conducted to prove the
superiority and effectiveness of the proposed methods. Codes are available at
https://github.com/CrystalNeuro/visual-concept-translator. |
This paper proposes Visual Concept Translator (VCT), a novel framework for general image-to-image translation guided by a single reference image. |
Image-guided I2I, integrating visual concepts from a reference image into a source image while preserving content, has broad applications in areas like game production and art creation. Existing methods struggle to effectively translate visual concepts while preserving source content. |
VCT employs a two-step process: (1) Content-Concept Inversion (CCI) extracts content and concept embeddings from the source and reference images respectively using techniques like Pivotal Turning Inversion and Multi-concept Inversion. (2) Content-Concept Fusion (CCF) utilizes a dual-stream denoising architecture with an attention control mechanism to combine extracted information and generate the target image. |
VCT demonstrates superior performance in general I2I tasks compared to GAN-based and existing diffusion-based methods, effectively translating concepts from reference images while preserving source image content.
The method excels in style transfer tasks, outperforming state-of-the-art approaches by effectively transferring artistic styles from reference images to content images.
Ablation studies confirm the efficacy of individual VCT components, including Multi-concept Inversion, Pivotal Turning Inversion, and Attention Control, highlighting their contributions to the framework's performance. |
The paper acknowledges a trade-off between preserving source image structure and incorporating semantic changes from the reference image, suggesting further exploration of this balance.
Future work could investigate extending VCT to incorporate multiple reference images for more complex concept fusion and manipulation. |
image-to-image translation, visual concept, diffusion models, one-shot learning, attention mechanism |
2307.14331
Report |
Visual Instruction Inversion: Image Editing via Visual Prompting |
Thao Nguyen, Yuheng Li, Utkarsh Ojha, Yong Jae Lee |
Text-conditioned image editing has emerged as a powerful tool for editing
images. However, in many situations, language can be ambiguous and ineffective
in describing specific image edits. When faced with such challenges, visual
prompts can be a more informative and intuitive way to convey ideas. We present
a method for image editing via visual prompting. Given pairs of examples that
represent the "before" and "after" images of an edit, our goal is to learn a
text-based editing direction that can be used to perform the same edit on new
images. We leverage the rich, pretrained editing capabilities of text-to-image
diffusion models by inverting visual prompts into editing instructions. Our
results show that with just one example pair, we can achieve competitive
results compared to state-of-the-art text-conditioned image editing frameworks. |
This paper presents a novel framework for image editing that learns specific editing instructions from before-and-after image pairs, enabling intuitive editing with diffusion models. |
Describing desired image edits with text can be challenging due to the ambiguity of language. Visual prompting offers a more intuitive and precise way to convey specific image transformations. |
The proposed method leverages a pretrained text-conditioned image editing diffusion model (InstructPix2Pix). By optimizing a textual instruction to reconstruct the "after" image from the "before" image while aligning with their semantic difference in CLIP embedding space, the method learns an edit direction applicable to new images. |
The method achieves competitive performance against state-of-the-art text-conditioned image editing models, demonstrating its effectiveness in learning and applying edits from visual prompts.
Using identical noise during training and testing helps balance the extent of editing and faithfulness to the input image.
The learned instructions can be combined with user-provided text prompts, allowing for flexible and specific image manipulations. |
The method's reliance on a pretrained model limits its editing scope and might inherit unwanted biases.
Further research is needed to investigate the sensitivity to prompt selection and explore the potential of diffusion models as task solvers for computer vision tasks. |
image editing, visual prompting, diffusion models, text-to-image synthesis, computer vision |
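The objective summarized in the Visual Instruction Inversion entry above (reconstruct the "after" image while aligning the instruction with the before/after difference in an embedding space) can be sketched as a two-term loss. This is a rough sketch under assumptions: the vectors stand in for CLIP embeddings, the reconstruction term is shown as a plain mean squared error placeholder, and the `weight` value and all function names are hypothetical.

```python
import math

EPS = 1e-12  # guards against division by zero for degenerate vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + EPS)


def direction_loss(instruction_emb, before_emb, after_emb):
    """1 - cos(instruction, after - before): zero when the directions agree."""
    edit_direction = [a - b for a, b in zip(after_emb, before_emb)]
    return 1.0 - cosine(instruction_emb, edit_direction)


def mse(pred, target):
    """Placeholder reconstruction term (mean squared error)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(pred, target, instruction_emb, before_emb, after_emb, weight=0.1):
    """Reconstruction term plus a weighted direction-alignment term."""
    return mse(pred, target) + weight * direction_loss(
        instruction_emb, before_emb, after_emb
    )


aligned = direction_loss([1.0, 0.0], [0.0, 0.0], [2.0, 0.0])
orthogonal = direction_loss([1.0, 0.0], [0.0, 0.0], [0.0, 3.0])
```

An instruction embedding that points along the before-to-after difference gives a direction loss near zero; an orthogonal one gives a loss near one.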
2307.14073
Report |
VideoControlNet: A Motion-Guided Video-to-Video Translation Framework by Using Diffusion Model with ControlNet |
Zhihao Hu, Dong Xu |
Recently, diffusion models like StableDiffusion have achieved impressive
image generation results. However, the generation process of such diffusion
models is uncontrollable, which makes it hard to generate videos with
continuous and consistent content. In this work, by using the diffusion model
with ControlNet, we propose a new motion-guided video-to-video translation
framework called VideoControlNet to generate various videos based on the given
prompts and the condition from the input video. Inspired by the video codecs
that use motion information for reducing temporal redundancy, our framework
uses motion information to prevent the regeneration of the redundant areas for
content consistency. Specifically, we generate the first frame (i.e., the
I-frame) by using the diffusion model with ControlNet. Then we generate the other
key frames (i.e., the P-frames) based on the previous I/P-frame by using our
newly proposed motion-guided P-frame generation (MgPG) method, in which the
P-frames are generated based on the motion information and the occlusion areas
are inpainted by using the diffusion model. Finally, the remaining frames (i.e., the
B-frames) are generated by using our motion-guided B-frame interpolation (MgBI)
module. Our experiments demonstrate that our proposed VideoControlNet inherits
the generation capability of the pre-trained large diffusion model and extends
the image diffusion model to the video diffusion model by using motion
information. More results are provided at our project page. |
Proposes VideoControlNet, a motion-guided video-to-video translation framework using a diffusion model with ControlNet, for generating diverse and content-consistent videos from prompts and input video conditions. |
Existing video diffusion models struggle to generate videos with continuous and consistent content due to the uncontrollable nature of the diffusion process. |
Leverages motion information to prevent redundant area regeneration and uses diffusion-model-based inpainting for new content. Employs a motion-guided P-frame generation (MgPG) module for keyframes and a motion-guided B-frame interpolation (MgBI) module for intermediate frames. |
Outperforms state-of-the-art methods in user preference and objective metrics like FVD, IS, FID.
Generates high-quality videos with better content consistency compared to methods like Text2LIVE.
Offers flexibility in controlling video style and enables video editing through masks and prompts. |
Relies on accurate optical flow estimation for optimal performance, which can be challenging for complex motion.
Strong motion guidance necessitates detailed conditions (depth maps, Canny maps), limiting flexibility in condition types (e.g., segmentation maps). |
video generation, diffusion models, video-to-video translation, controlnet, motion guidance |
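The I/P/B frame structure described in the VideoControlNet entry above can be illustrated with a small scheduling sketch. It only assigns frame roles and a generation order; the fixed `key_interval` is an assumption made for the example, and the actual generation, motion estimation, and inpainting steps are not modeled.

```python
def assign_frame_types(num_frames, key_interval=4):
    """Label frame 0 as 'I', every key_interval-th frame as 'P', the rest as 'B'."""
    if num_frames < 1 or key_interval < 1:
        raise ValueError("num_frames and key_interval must be positive")
    types = []
    for index in range(num_frames):
        if index == 0:
            types.append("I")
        elif index % key_interval == 0:
            types.append("P")
        else:
            types.append("B")
    return types


def generation_order(frame_types):
    """Key frames (I, then P in temporal order) first, B-frames afterwards."""
    key_frames = [i for i, t in enumerate(frame_types) if t in ("I", "P")]
    b_frames = [i for i, t in enumerate(frame_types) if t == "B"]
    return key_frames + b_frames


types = assign_frame_types(9, key_interval=4)
order = generation_order(types)
```

With nine frames and a key-frame interval of four, frames 0, 4, and 8 are produced first, and the remaining frames are filled in afterwards.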
2307.14063
Report |
ECO: Ensembling Context Optimization for Vision-Language Models |
Lorenzo Agnolucci, Alberto Baldrati, Francesco Todino, Federico Becattini, Marco Bertini, Alberto Del Bimbo |
Image recognition has recently witnessed a paradigm shift, where
vision-language models are now used to perform few-shot classification based on
textual prompts. Among these, the CLIP model has shown remarkable capabilities
for zero-shot transfer by matching an image and a custom textual prompt in its
latent space. This has paved the way for several works that focus on
engineering or learning textual contexts for maximizing CLIP's classification
capabilities. In this paper, we follow this trend by learning an ensemble of
prompts for image classification. We show that learning diverse and possibly
shorter contexts considerably and consistently improves the results compared with
relying on a single trainable prompt. In particular, we report better few-shot
capabilities with no additional cost at inference time. We demonstrate the
capabilities of our approach on 11 different benchmarks. |
This paper introduces ECO, a method that enhances prompt learning for few-shot image classification in vision-language models by learning an ensemble of diverse and shorter textual prompts instead of a single, longer prompt. |
ECO improves upon existing prompt learning methods, which often focus on optimizing a single textual prompt, by leveraging the power of prompt ensembling to achieve more robust and accurate results, especially in few-shot scenarios. |
ECO learns multiple sets of context tokens (prompts) with a reduced number of tokens per prompt while keeping the total number of trainable parameters the same as single-prompt methods like CoOp. The learned prompts are then combined using prompt ensembling, effectively averaging their textual features for classification. |
ECO consistently outperforms existing methods, including zero-shot CLIP and CoOp, on 11 different image classification benchmarks.
The method proves to be more data-efficient, showing significant improvements even with a limited number of training shots (1 or 2).
ECO maintains computational efficiency at inference time as the learned prompt features can be pre-computed and used as a single prompt. |
The current study focuses on evaluating ECO with CoOp; further research could explore its integration with other prompt learning techniques like CoCoOp and MaPLe.
While ECO effectively balances context length and the number of prompts, determining the optimal configuration for specific datasets or tasks might require further investigation. |
prompt learning, prompt ensembling, few-shot learning, image classification, vision-language models |
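The prompt-ensembling step summarized in the ECO entry above (averaging the textual features of several learned prompts into one feature per class) can be sketched as follows. This is a minimal illustration, not the authors' code: the feature vectors are toy values, and the helper names and the 16-token budget in the usage note are assumptions.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def ensemble_class_features(prompt_features):
    """prompt_features[m][c] is the text feature of class c under prompt m.

    Returns one averaged, re-normalized feature per class, so inference
    costs the same as with a single prompt.
    """
    num_prompts = len(prompt_features)
    num_classes = len(prompt_features[0])
    ensembled = []
    for c in range(num_classes):
        normed = [normalize(prompt_features[m][c]) for m in range(num_prompts)]
        mean = [sum(col) / num_prompts for col in zip(*normed)]
        ensembled.append(normalize(mean))
    return ensembled


def classify(image_feature, class_features):
    """Return the index of the class with the highest cosine similarity."""
    img = normalize(image_feature)
    scores = [sum(a * b for a, b in zip(img, feat)) for feat in class_features]
    return max(range(len(scores)), key=scores.__getitem__)


def tokens_per_prompt(total_context_tokens, num_prompts):
    """Split a fixed context-token budget evenly across prompts."""
    if total_context_tokens % num_prompts:
        raise ValueError("budget must divide evenly across prompts")
    return total_context_tokens // num_prompts


prompt_features = [
    [[1.0, 0.0], [0.0, 1.0]],  # prompt 0: features for class 0 and class 1
    [[0.8, 0.6], [0.6, 0.8]],  # prompt 1: features for class 0 and class 1
]
class_features = ensemble_class_features(prompt_features)
predicted = classify([1.0, 0.2], class_features)
```

Because the averaged features are computed once after training, classification uses a single feature per class; `tokens_per_prompt(16, 4)` shows how a hypothetical 16-token budget would be split into four prompts of four tokens each.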
2307.14030
Report |
Consensus-Adaptive RANSAC |
Luca Cavalli, Daniel Barath, Marc Pollefeys, Viktor Larsson |
RANSAC and its variants are widely used for robust estimation; however, they
commonly follow a greedy approach to finding the highest scoring model while
ignoring other model hypotheses. In contrast, Iteratively Reweighted Least
Squares (IRLS) techniques gradually approach the model by iteratively updating
the weight of each correspondence based on the residuals from previous
iterations. Inspired by these methods, we propose a new RANSAC framework that
learns to explore the parameter space by considering the residuals seen so far
via a novel attention layer. The attention mechanism operates on a batch of
point-to-model residuals, and updates a per-point estimation state to take into
account the consensus found through a lightweight one-step transformer. This
rich state then guides the minimal sampling between iterations as well as the
model refinement. We evaluate the proposed approach on essential and
fundamental matrix estimation on a number of indoor and outdoor datasets. It
outperforms state-of-the-art estimators by a significant margin while adding only a
small runtime overhead. Moreover, we demonstrate good generalization properties
of our trained model, indicating its effectiveness across different datasets
and tasks. The proposed attention mechanism and one-step transformer provide an
adaptive behavior that enhances the performance of RANSAC, making it a more
effective tool for robust estimation. Code is available at
https://github.com/cavalli1234/CA-RANSAC. |
Proposes CA-RANSAC, a novel RANSAC framework that leverages consensus from previous iterations to enhance sampling and model refinement during robust estimation. |
Addresses limitations of traditional RANSAC methods that ignore sub-optimal model hypotheses, leading to improved exploration of the parameter space and better model selection. |
Introduces a consensus-based attention mechanism operating on point-to-model residuals, updating per-point estimation states using a one-step transformer to guide minimal sampling and non-linear model refinement. |
Outperforms state-of-the-art estimators in essential and fundamental matrix estimation tasks on indoor and outdoor datasets.
Demonstrates superior accuracy, particularly in low-error regimes, indicating effective model refinement.
Exhibits good generalization across different datasets, matching strategies, and estimation tasks. |
Current implementation relies on a fixed number of iterations without an early termination criterion.
Exploration of more efficient biased sampling schemes within inlier pools could further enhance performance. |
ransac, robust estimation, attention mechanism, consensus-based learning, minimal sample selection |
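The Iteratively Reweighted Least Squares (IRLS) idea that the CA-RANSAC abstract above cites as inspiration can be illustrated on a one-dimensional location estimate. This sketch shows plain IRLS only, not CA-RANSAC or its attention layer; the Cauchy-style weight function, the `scale` value, and the toy data are assumptions chosen for the example.

```python
def irls_location(values, scale=1.0, iterations=20):
    """Robust location estimate via IRLS with Cauchy-style weights.

    Each iteration down-weights points with large residuals from the
    current estimate, then re-solves the weighted least-squares problem
    (which for a location parameter is just a weighted mean).
    """
    estimate = sum(values) / len(values)  # start from the ordinary mean
    for _ in range(iterations):
        weights = [1.0 / (1.0 + ((v - estimate) / scale) ** 2) for v in values]
        estimate = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return estimate


data = [1.0, 1.1, 0.9, 1.05, 10.0]  # four inliers near 1.0 and one outlier
robust = irls_location(data)
plain_mean = sum(data) / len(data)
```

The ordinary mean is pulled far toward the outlier, while the reweighted estimate settles close to the inlier cluster.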
2307.13974
Report |
Tracking Anything in High Quality |
Jiawen Zhu, Zhenyu Chen, Zeqi Hao, Shijie Chang, Lu Zhang, Dong Wang, Huchuan Lu, Bin Luo, Jun-Yan He, Jin-Peng Lan, Hanyuan Chen, Chenyang Li |
Visual object tracking is a fundamental video task in computer vision.
Recently, the notably increasing power of perception algorithms allows the
unification of single/multiobject and box/mask-based tracking. Among them, the
Segment Anything Model (SAM) attracts much attention. In this report, we
propose HQTrack, a framework for High Quality Tracking anything in videos.
HQTrack mainly consists of a video multi-object segmenter (VMOS) and a mask
refiner (MR). Given the object to be tracked in the initial frame of a video,
VMOS propagates the object masks to the current frame. The mask results at this
stage are not accurate enough since VMOS is trained on several close-set video
object segmentation (VOS) datasets and thus has limited ability to generalize to
complex and corner-case scenes. To further improve the quality of tracking masks, a
pretrained MR model is employed to refine the tracking results. As a compelling
testament to the effectiveness of our paradigm, without employing any tricks
such as test-time data augmentations and model ensemble, HQTrack ranks 2nd
in the Visual Object Tracking and Segmentation (VOTS2023) challenge. Code
and models are available at https://github.com/jiawen-zhu/HQTrack. |
Proposes HQTrack, a framework for High Quality Tracking anything in videos, comprising a video multi-object segmenter (VMOS) and a mask refiner (MR). |
Addresses difficulties posed by the VOTS2023 challenge, such as long-term sequences, disappearing/reappearing targets, and complex scenes. |
VMOS (based on DeAOT) propagates object masks across frames, and MR (using HQ-SAM) refines the masks by leveraging a pre-trained segmentation model. |
Joint tracking outperforms separate tracking for multiple objects.
Multi-scale propagation mechanism and InternImage-T backbone significantly improve VMOS performance.
Selectively refining masks with HQ-SAM based on IoU threshold enhances overall accuracy. |
Limited exploration of the relationship between long-term memory gap and object disappearance/reappearance.
Further investigation on the influence of different mask refiners. |
visual object tracking, video object segmentation, multi-object tracking, mask refinement, hq-sam |
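The selective refinement mentioned in the HQTrack entry above (refining masks based on an IoU threshold) can be sketched as follows. This is a hedged illustration: it assumes the refined mask is kept only when it overlaps sufficiently with the propagated mask, and the threshold value, mask representation, and function names are hypothetical.

```python
def mask_iou(mask_a, mask_b):
    """Intersection over union of two binary masks given as 2D lists of 0/1."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    # Two empty masks have no overlap to measure; treat that case as IoU 0.
    return intersection / union if union else 0.0


def select_mask(propagated_mask, refined_mask, iou_threshold=0.5):
    """Keep the refined mask only if it agrees enough with the propagated one."""
    if mask_iou(propagated_mask, refined_mask) > iou_threshold:
        return refined_mask
    return propagated_mask


propagated = [[1, 1], [0, 0]]
refined_close = [[1, 1], [1, 0]]  # IoU 2/3 with the propagated mask
refined_far = [[0, 0], [0, 1]]    # IoU 0 with the propagated mask

chosen_close = select_mask(propagated, refined_close)
chosen_far = select_mask(propagated, refined_far)
```

A refined mask that drifts far from the propagated one is discarded, which guards against the refiner latching onto a different object.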
2307.13908
Report |
Points-to-3D: Bridging the Gap between Sparse Points and Shape-Controllable Text-to-3D Generation |
Chaohui Yu, Qiang Zhou, Jingliang Li, Zhe Zhang, Zhibin Wang, Fan Wang |
Text-to-3D generation has recently garnered significant attention, fueled by
2D diffusion models trained on billions of image-text pairs. Existing methods
primarily rely on score distillation to leverage the 2D diffusion priors to
supervise the generation of 3D models, e.g., NeRF. However, score distillation
is prone to the view inconsistency problem, and implicit NeRF modeling
can also lead to an arbitrary shape, thus leading to less realistic and
uncontrollable 3D generation. In this work, we propose a flexible framework of
Points-to-3D to bridge the gap between sparse yet freely available 3D points
and realistic shape-controllable 3D generation by distilling the knowledge from
both 2D and 3D diffusion models. The core idea of Points-to-3D is to introduce
controllable sparse 3D points to guide the text-to-3D generation. Specifically,
we use the sparse point cloud generated from the 3D diffusion model, Point-E,
as the geometric prior, conditioned on a single reference image. To better
utilize the sparse 3D points, we propose an efficient point cloud guidance loss
to adaptively drive the NeRF's geometry to align with the shape of the sparse
3D points. In addition to controlling the geometry, we propose to optimize the
NeRF for a more view-consistent appearance. To be specific, we perform score
distillation to the publicly available 2D image diffusion model ControlNet,
conditioned on text as well as depth map of the learned compact geometry.
Qualitative and quantitative comparisons demonstrate that Points-to-3D improves
view consistency and achieves good shape controllability for text-to-3D
generation. Points-to-3D provides users with a new way to improve and control
text-to-3D generation. |
Presents Points-to-3D, a novel text-to-3D generation framework that bridges the gap between sparse 3D points and realistic, shape-controllable 3D generation by leveraging pre-trained 2D and 3D diffusion models. |
Addresses limitations in existing text-to-3D methods, such as view inconsistency (Janus problem) and lack of shape controllability, aiming for more realistic and controllable 3D content generation. |
Utilizes a pre-trained point cloud diffusion model (Point-E) to generate sparse 3D points from a reference image, guides NeRF geometry using an efficient point cloud guidance loss, and optimizes appearance via score distillation from a controllable 2D diffusion model (ControlNet) conditioned on text and learned depth map. |
Significantly alleviates view inconsistency in generated 3D content compared to baselines.
Achieves good controllability over 3D shapes by leveraging reference images and sparse 3D point guidance.
Demonstrates superior performance in terms of CLIP R-precision and user preference for view consistency and prompt relevance. |
Performance can be affected by limitations of the underlying pre-trained 2D and 3D diffusion models.
Currently requires a reference image for shape guidance, limiting spontaneity in content creation. |
text-to-3d, diffusion models, nerf, point cloud, shape controllability |
2307.13856
Report |
On the unreasonable vulnerability of transformers for image restoration -- and an easy fix |
Shashank Agnihotri, Kanchana Vaishnavi Gandikota, Julia Grabinski, Paramanand Chandramouli, Margret Keuper |
Following their success in visual recognition tasks, Vision
Transformers (ViTs) are being increasingly employed for image restoration. As a
few recent works claim that ViTs for image classification also have better
robustness properties, we investigate whether the improved adversarial
robustness of ViTs extends to image restoration. We consider the recently
proposed Restormer model, as well as NAFNet and the "Baseline network" which
are both simplified versions of a Restormer. We use Projected Gradient Descent
(PGD) and CosPGD, a recently proposed adversarial attack tailored to pixel-wise
prediction tasks for our robustness evaluation. Our experiments are performed
on real-world images from the GoPro dataset for image deblurring. Our analysis
indicates that, contrary to what is advocated for ViTs in image classification works,
these models are highly susceptible to adversarial attacks. We attempt to
improve their robustness through adversarial training. While this yields a
significant increase in robustness for Restormer, results on other networks are
less promising. Interestingly, the design choices in NAFNet and Baselines,
which were based on iid performance, and not on robust generalization, seem to
be at odds with the model robustness. Thus, we investigate this further and
find a fix. |
This paper investigates the adversarial robustness of Transformer-based image restoration networks, namely Restormer, Baseline Network, and NAFNet. |
This study is important because while these networks achieve state-of-the-art performance on clean images, their robustness to adversarial attacks is crucial for real-world applications, especially in safety-critical domains. |
The authors evaluate the robustness of these networks using PGD and CosPGD attacks on the GoPro image deblurring dataset. They analyze the effects of adversarial training as a defense mechanism and study the impact of different architectural choices on robustness. |
Transformer-based restoration networks are highly vulnerable to adversarial attacks, exhibiting significant performance drops and distinct spectral artifacts.
Adversarial training effectively improves robustness and reduces spectral artifacts, with Restormer showing the most significant gains.
Design choices in NAFNet and Baseline Network, aimed at simplifying Restormer, negatively impact robustness. Replacing GELU activation with ReLU in the Intermediate network significantly improves robustness. |
While adversarial training and design changes improve robustness, there is still a considerable gap in achieving ideal restoration quality.
Future work could explore alternative methods beyond adversarial training to enhance robustness and image quality. |
adversarial robustness, image restoration, vision transformers, deblurring, adversarial training |
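The PGD attack used in the evaluation above can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the "restoration model" is a toy linear map `W`, the loss is mean squared error, and the step sizes `eps` and `alpha` are arbitrary defaults rather than the paper's settings. CosPGD, which rescales the pixel-wise loss, is not shown.

```python
import numpy as np

def pgd_linf(x, y, grad_fn, eps=8 / 255, alpha=2 / 255, steps=10, rng=None):
    """L-infinity PGD: ascend the loss, then project back into the eps-ball around x."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Random start inside the eps-ball, kept in the valid pixel range.
    x_adv = np.clip(x + rng.uniform(-eps, eps, size=x.shape), 0.0, 1.0)
    for _ in range(steps):
        g = grad_fn(x_adv, y)
        x_adv = x_adv + alpha * np.sign(g)        # signed gradient ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # projection onto the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # valid pixel range
    return x_adv

# Toy stand-in for a restoration network (hypothetical): f(x) = W @ x.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))

def loss_fn(x, y):
    return float(np.mean((W @ x - y) ** 2))

def grad_fn(x, y):
    # Analytic gradient of the mean squared error with respect to the input x.
    return 2.0 * W.T @ (W @ x - y) / y.size

x_clean = rng.uniform(0.2, 0.8, size=16)
y_target = W @ x_clean  # the clean input is restored perfectly, so its loss is 0
x_adv = pgd_linf(x_clean, y_target, grad_fn)
```

With a real network, `grad_fn` would come from automatic differentiation of the restoration loss with respect to the input image.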
2307.13746
Report |
ChildGAN: Large Scale Synthetic Child Facial Data Using Domain Adaptation in StyleGAN |
Muhammad Ali Farooq, Wang Yao, Gabriel Costache, Peter Corcoran |
In this research work, we propose ChildGAN, a novel pair of GAN networks
for generating synthetic facial data of boys and girls, derived from StyleGAN2.
ChildGAN is built by performing smooth domain transfer using transfer learning.
It provides photo-realistic, high-quality data samples. A large-scale dataset
is rendered with a variety of smart facial transformations: facial expressions,
age progression, eye blink effects, head pose, skin and hair color variations,
and variable lighting conditions. The dataset comprises more than 300k distinct
data samples. Further, the uniqueness and characteristics of the rendered
facial features are validated by running different computer vision application
tests which include CNN-based child gender classifier, face localization and
facial landmarks detection test, identity similarity evaluation using ArcFace,
and lastly running eye detection and eye aspect ratio tests. The results
demonstrate that synthetic child facial data of high quality offers an
alternative to the cost and complexity of collecting a large-scale dataset from
real children. |
This paper presents ChildGAN, a pair of GAN networks based on StyleGAN2 for generating large-scale, high-quality synthetic child facial images. |
Large-scale child facial datasets are crucial for various AI applications but are challenging to acquire due to ethical and privacy concerns. Synthetic data offers a viable alternative. |
ChildGAN leverages transfer learning to adapt StyleGAN2, trained on adult faces, to generate child faces. It incorporates smart transformations like facial expressions, aging, and lighting for data diversity. |
ChildGAN generates over 300k unique child face images with diverse attributes.
Validation tests using gender classification, facial landmark detection, and identity similarity confirm the high quality and diversity of the synthetic data.
Eye aspect ratio tests on the synthetic data demonstrate realistic eye blinking effects. |
Quantitative validation of the synthetic data distribution against a real-world ground truth remains challenging.
Expanding ChildGAN to encompass greater ethnic diversity is a potential area for future research. |
synthetic data generation, generative adversarial networks (gans), facial image analysis, child facial recognition, transfer learning |
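The eye aspect ratio (EAR) test mentioned above relies on a simple geometric quantity. The sketch below uses the commonly cited six-landmark definition, EAR = (|p2 - p6| + |p3 - p5|) / (2 |p1 - p4|); the landmark ordering and the example coordinates are assumptions for illustration, not values from the ChildGAN paper.

```python
import numpy as np

def eye_aspect_ratio(eye):
    """EAR for six eye landmarks ordered p1..p6 (corner, two upper, corner, two lower)."""
    eye = np.asarray(eye, dtype=float)
    v1 = np.linalg.norm(eye[1] - eye[5])  # vertical distance |p2 - p6|
    v2 = np.linalg.norm(eye[2] - eye[4])  # vertical distance |p3 - p5|
    h = np.linalg.norm(eye[0] - eye[3])   # horizontal distance |p1 - p4|
    return (v1 + v2) / (2.0 * h)

# Hypothetical landmark sets: a wide-open eye and a nearly closed one.
open_eye = [(0, 0), (2, 2), (4, 2), (6, 0), (4, -2), (2, -2)]
closed_eye = [(0, 0), (2, 0.2), (4, 0.2), (6, 0), (4, -0.2), (2, -0.2)]

ear_open = eye_aspect_ratio(open_eye)
ear_closed = eye_aspect_ratio(closed_eye)
```

A blink shows up as a sharp drop in EAR across consecutive frames, which is why the ratio is a convenient check that rendered blink effects behave plausibly.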
2307.13720
Report |
Composite Diffusion | whole >= Σparts |
Vikram Jamwal, Ramaneswaran S |
For an artist or a graphic designer, the spatial layout of a scene is a
critical design choice. However, existing text-to-image diffusion models
provide limited support for incorporating spatial information. This paper
introduces Composite Diffusion as a means for artists to generate high-quality
images by composing from the sub-scenes. The artists can specify the
arrangement of these sub-scenes through a flexible free-form segment layout.
They can describe the content of each sub-scene primarily using natural text
and additionally by utilizing reference images or control inputs such as line
art, scribbles, human pose, canny edges, and more.
We provide a comprehensive and modular method for Composite Diffusion that
enables alternative ways of generating, composing, and harmonizing sub-scenes.
Further, we wish to evaluate the composite image for effectiveness in both
image quality and achieving the artist's intent. We argue that existing image
quality metrics lack a holistic evaluation of image composites. To address
this, we propose novel quality criteria especially relevant to composite
generation.
We believe that our approach provides an intuitive method of art creation.
Through extensive user surveys, quantitative and qualitative analysis, we show
how it achieves greater spatial, semantic, and creative control over image
generation. In addition, our methods do not need to retrain or modify the
architecture of the base diffusion models and can work in a plug-and-play
manner with the fine-tuned models. |
This paper introduces Composite Diffusion, a novel approach for generating high-quality images by composing sub-scenes arranged by artists in a free-form layout. |
Existing text-to-image models offer limited spatial control, making it difficult for artists to dictate object layout and properties within a scene. This method seeks to grant artists greater creative control. |
The method utilizes pre-trained diffusion models and divides the generation into two stages: (1) Scaffolding: sub-scenes are generated independently using text descriptions, reference images, or control conditions. (2) Harmonization: Sub-scenes are blended and refined in the context of each other, ensuring coherence. |
Composite Diffusion demonstrates superior performance in spatial fidelity and content fidelity compared to text-to-image and serial inpainting baselines.
The method allows for controlled variation in image generation through modifications in segment layout, text descriptions, and the use of fine-tuned models.
Qualitative evaluation through user surveys and artist collaboration confirms the effectiveness of Composite Diffusion in creating high-quality, customizable artwork. |
The current implementation's performance is limited by the granularity of sub-scenes supported by the diffusion model's image space.
Achieving precise object shape conformance in text-only conditioning remains a challenge, often necessitating the use of control condition inputs. |
image generation, diffusion models, spatial control, composite images, generative ai |
2307.13639
Report |
Fake It Without Making It: Conditioned Face Generation for Accurate 3D Face Reconstruction |
Will Rowan, Patrik Huber, Nick Pears, Andrew Keeling |
Accurate 3D face reconstruction from 2D images is an enabling technology with
applications in healthcare, security, and creative industries. However, current
state-of-the-art methods either rely on supervised training with very limited
3D data or self-supervised training with 2D image data. To bridge this gap, we
present a method to generate a large-scale synthesised dataset of 250K
photorealistic images and their corresponding shape parameters and depth maps,
which we call SynthFace. Our synthesis method conditions Stable Diffusion on
depth maps sampled from the FLAME 3D Morphable Model (3DMM) of the human face,
allowing us to generate a diverse set of shape-consistent facial images that is
designed to be balanced in race and gender. We further propose ControlFace, a
deep neural network, trained on SynthFace, which achieves competitive
performance on the NoW benchmark, without requiring 3D supervision or manual 3D
asset creation. The complete SynthFace dataset will be made publicly available
upon publication. |
This paper introduces SynthFace, a large-scale synthetic dataset of 250K photorealistic face images with corresponding 3D shape parameters and depth maps, and ControlFace, a deep neural network trained on SynthFace for 3D face reconstruction. |
Accurate 3D face reconstruction from 2D images is crucial for applications in various fields, but existing methods are limited by the scarcity of paired 2D-to-3D data. SynthFace addresses this by providing a large-scale dataset for supervised training. |
SynthFace is generated by conditioning Stable Diffusion, a text-to-image diffusion model, on depth maps of 3D faces from the FLAME model. This generates photorealistic images with known 3D shape. ControlFace is then trained on SynthFace to regress 3DMM parameters from facial images. |
SynthFace is the largest dataset of its kind, containing 250K photorealistic face images with corresponding 3D shape information, balanced by race and gender.
ControlFace, trained on SynthFace, achieves competitive performance on the NoW benchmark for 3D face reconstruction.
This approach demonstrates the potential of combining 2D and 3D generative models for improving 3D face reconstruction. |
The current iteration of SynthFace does not model facial expressions, limiting the scope of ControlFace to shape prediction.
The use of ArcFace, an identity descriptor network, to extract shape information might introduce errors. Future work could explore networks specifically designed for shape extraction. |
3d face reconstruction, synthetic data, stable diffusion, 3d morphable model, controlnet |
2307.13240
Report |
Fashion Matrix: Editing Photos by Just Talking |
Zheng Chong, Xujie Zhang, Fuwei Zhao, Zhenyu Xie, Xiaodan Liang |
The utilization of Large Language Models (LLMs) for the construction of AI
systems has garnered significant attention across diverse fields. The extension
of LLMs to the domain of fashion holds substantial commercial potential but
also inherent challenges due to the intricate semantic interactions in
fashion-related generation. To address this issue, we developed a hierarchical
AI system called Fashion Matrix dedicated to editing photos by just talking.
This system facilitates diverse prompt-driven tasks, encompassing garment or
accessory replacement, recoloring, addition, and removal. Specifically, Fashion
Matrix employs LLM as its foundational support and engages in iterative
interactions with users. It employs a range of Semantic Segmentation Models
(e.g., Grounded-SAM, MattingAnything, etc.) to delineate the specific editing
masks based on user instructions. Subsequently, Visual Foundation Models (e.g.,
Stable Diffusion, ControlNet, etc.) are leveraged to generate edited images
from text prompts and masks, thereby facilitating the automation of fashion
editing processes. Experiments demonstrate the outstanding ability of Fashion
Matrix to explore the collaborative potential of functionally diverse
pre-trained models in the domain of fashion editing. |
Presents Fashion Matrix, a novel hierarchical AI system that leverages Large Language Models (LLMs) to enable conversational photo editing in the fashion domain. |
Addresses the limitations of existing image editing tools that lack fine-grained control and struggle with the nuanced semantic understanding required in fashion-related applications. |
Integrates LLMs with Semantic Segmentation Models (e.g., Grounded-SAM, MattingAnything) and Visual Foundation Models (e.g., Stable Diffusion, ControlNet) to enable multi-round dialogue-based editing with tasks like garment replacement, recoloring, addition, and removal. |
Introduces an 'AutoMasker' module that combines human parsing, pose estimation, and semantic segmentation for precise editing mask generation.
Outperforms text-based try-on methods (Text2Human, FICE) in terms of image quality (CLIP Score, IS), naturalness, and text-image matching.
Demonstrates the potential of combining functionally diverse pre-trained models for complex fashion editing tasks through extensive zero-shot experiments. |
LLM optimization specifically for the fashion domain is needed for improved performance.
More detailed Semantic Segmentation Models for both humans and fashion items would enhance system capabilities. |
fashion editing, large language models, conversational ai, semantic segmentation, image generation |
2307.13226
Report |
Strivec: Sparse Tri-Vector Radiance Fields |
Quankai Gao, Qiangeng Xu, Hao Su, Ulrich Neumann, Zexiang Xu |
We propose Strivec, a novel neural representation that models a 3D scene as a
radiance field with sparsely distributed and compactly factorized local tensor
feature grids. Our approach leverages tensor decomposition, following the
recent work TensoRF, to model the tensor grids. In contrast to TensoRF, which
uses a global tensor and focuses on its vector-matrix decomposition, we
propose to utilize a cloud of local tensors and apply the classic
CANDECOMP/PARAFAC (CP) decomposition to factorize each tensor into triple
vectors that express local feature distributions along spatial axes and
compactly encode a local neural field. We also apply multi-scale tensor grids
to discover the geometry and appearance commonalities and exploit spatial
coherence with the tri-vector factorization at multiple local scales. The final
radiance field properties are regressed by aggregating neural features from
multiple local tensors across all scales. Our tri-vector tensors are sparsely
distributed around the actual scene surface, discovered by a fast coarse
reconstruction, leveraging the sparsity of a 3D scene. We demonstrate that our
model can achieve better rendering quality while using significantly fewer
parameters than previous methods, including TensoRF and Instant-NGP. |
This paper introduces Strivec, a novel neural scene representation that leverages sparse, multi-scale, tri-vector tensors to represent local radiance fields for high-quality novel view synthesis. |
Existing methods, while achieving progress in compactness and quality, struggle to balance representing intricate local details with efficient use of model capacity. Strivec aims to address this by combining the sparsity of local representations with the efficiency of shared feature encoding. |
Strivec distributes local tensors based on coarse scene geometry. Each tensor uses CP decomposition to factorize its feature grid into tri-vector components. Features are aggregated from neighboring tensors at multiple scales to regress volume density and view-dependent color, enabling efficient and accurate radiance field rendering. |
Strivec achieves state-of-the-art rendering quality on both synthetic (NeRF Synthetic) and real (ScanNet, Tanks and Temples) datasets, outperforming previous methods like TensoRF and Instant-NGP.
Strivec achieves this superior quality with significantly fewer parameters, demonstrating its efficient representation power.
The paper conducts ablation studies showcasing the benefits of multi-scale representation, tri-vector factorization, and robustness to initial geometry choice. |
While achieving high quality and compactness, Strivec's optimization is slower than TensoRF due to the multi-tensor aggregation. Exploring acceleration strategies while maintaining quality could be beneficial.
The paper observes that adding more tensor components yields diminishing returns after a certain point. Investigating techniques to better capture high-frequency details with increased capacity is a potential avenue for future work. |
neural radiance fields, novel view synthesis, tensor decomposition, 3d scene representation, sparse representation |
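The CP (tri-vector) factorization described above can be illustrated for a single local tensor. This is a simplified sketch, not the authors' implementation: it factorizes a scalar-valued grid with `R` rank-one components, and the names `vx`, `vy`, `vz` and all sizes are hypothetical. Feature channels, multiple scales, and aggregation across neighbouring tensors are left out.

```python
import numpy as np

def cp_reconstruct(vx, vy, vz):
    """Dense grid T[x, y, z] = sum_r vx[r, x] * vy[r, y] * vz[r, z]."""
    return np.einsum("rx,ry,rz->xyz", vx, vy, vz)

def cp_query(vx, vy, vz, i, j, k):
    """Evaluate one grid cell directly from the three vector sets, without the dense grid."""
    return float(np.sum(vx[:, i] * vy[:, j] * vz[:, k]))

rng = np.random.default_rng(0)
R, X, Y, Z = 4, 8, 8, 8  # hypothetical number of components and grid resolution
vx = rng.normal(size=(R, X))
vy = rng.normal(size=(R, Y))
vz = rng.normal(size=(R, Z))

dense = cp_reconstruct(vx, vy, vz)
n_dense = X * Y * Z      # parameters of the dense grid
n_cp = R * (X + Y + Z)   # parameters of the tri-vector form
```

In this toy setting the dense grid stores 512 values while the tri-vector form stores 96, which is the kind of saving that makes many small local tensors affordable.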
2307.12981
Report |
3D-LLM: Injecting the 3D World into Large Language Models |
Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, Chuang Gan |
Large language models (LLMs) and Vision-Language Models (VLMs) have been
proven to excel at multiple tasks, such as commonsense reasoning. Powerful as
these models can be, they are not grounded in the 3D physical world, which
involves richer concepts such as spatial relationships, affordances, physics,
layout, and so on. In this work, we propose to inject the 3D world into large
language models and introduce a whole new family of 3D-LLMs. Specifically,
3D-LLMs can take 3D point clouds and their features as input and perform a
diverse set of 3D-related tasks, including captioning, dense captioning, 3D
question answering, task decomposition, 3D grounding, 3D-assisted dialog,
navigation, and so on. Using three types of prompting mechanisms that we
design, we are able to collect over 300k 3D-language data covering these tasks.
To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that
obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as
our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism,
3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show
that our model outperforms state-of-the-art baselines by a large margin (e.g.,
the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore,
experiments on our held-in datasets for 3D captioning, task composition, and
3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative
examples also show that our model could perform more tasks beyond the scope of
existing LLMs and VLMs. Project Page: https://vis-www.cs.umass.edu/3dllm/. |
This paper introduces 3D-LLMs, a new family of large language models that can understand and interact with 3D scenes represented as point clouds with features. |
Existing LLMs and VLMs lack grounding in the 3D physical world, limiting their ability to reason about spatial relationships, affordances, and other 3D concepts. 3D-LLMs bridge this gap. |
The authors generate a 300k 3D-language dataset covering various tasks and train 3D-LLMs using pretrained 2D VLMs (Flamingo, BLIP-2) as backbones. They extract 3D features from multi-view images and incorporate a 3D localization mechanism. |
3D-LLMs outperform state-of-the-art baselines on the ScanQA 3D question answering benchmark.
Held-in experiments demonstrate 3D-LLMs' effectiveness in 3D captioning, task decomposition, and 3D-assisted dialog.
Qualitative examples showcase 3D-LLMs' ability to perform tasks beyond the scope of existing LLMs and VLMs, such as navigation and grounding. |
The current 3D feature extractor relies on rendering 3D scenes into multi-view images, introducing an additional rendering process.
Future work includes exploring end-to-end training with 3D data and expanding 3D-LLMs to more complex 3D reasoning and planning tasks. |
large language models, 3d vision, vision-language models, 3d scene understanding, 3d reasoning |
2307.12972
Report |
DFA3D: 3D Deformable Attention For 2D-to-3D Feature Lifting |
Hongyang Li, Hao Zhang, Zhaoyang Zeng, Shilong Liu, Feng Li, Tianhe Ren, Lei Zhang |
In this paper, we propose a new operator, called 3D DeFormable Attention
(DFA3D), for 2D-to-3D feature lifting, which transforms multi-view 2D image
features into a unified 3D space for 3D object detection. Existing feature
lifting approaches, such as Lift-Splat-based and 2D attention-based, either use
estimated depth to get pseudo LiDAR features and then splat them to a 3D space,
which is a one-pass operation without feature refinement, or ignore depth and
lift features by 2D attention mechanisms, which achieve finer semantics while
suffering from a depth ambiguity problem. In contrast, our DFA3D-based method
first leverages the estimated depth to expand each view's 2D feature map to 3D
and then utilizes DFA3D to aggregate features from the expanded 3D feature
maps. With the help of DFA3D, the depth ambiguity problem can be effectively
alleviated from the root, and the lifted features can be progressively refined
layer by layer, thanks to the Transformer-like architecture. In addition, we
propose a mathematically equivalent implementation of DFA3D which can
significantly improve its memory efficiency and computational speed. We
integrate DFA3D into several methods that use 2D attention-based feature
lifting with only a few modifications in code and evaluate on the nuScenes
dataset. The experiment results show a consistent improvement of +1.41% mAP on
average, and up to +15.1% mAP improvement when high-quality depth information
is available, demonstrating the superiority, applicability, and huge potential
of DFA3D. The code is available at
https://github.com/IDEA-Research/3D-deformable-attention.git. |
This paper introduces 3D Deformable Attention (DFA3D), a novel operator for 2D-to-3D feature lifting in multi-view 3D object detection. It addresses the depth ambiguity issue present in existing 2D attention-based methods. |
Existing feature lifting methods suffer from limitations: Lift-Splat methods lack feature refinement and struggle with depth errors, while 2D attention-based approaches exhibit depth ambiguity due to ignoring depth information. |
DFA3D leverages estimated depth to expand 2D feature maps into 3D. It then uses a depth-weighted 2D deformable attention mechanism for efficient feature aggregation, addressing the memory consumption issue of explicit 3D feature expansion. |
DFA3D effectively alleviates the depth ambiguity problem by sampling features in 3D space.
The Transformer-like architecture with DFA3D allows for progressive feature refinement over multiple layers.
Experiments on the nuScenes dataset show consistent improvements, with an average increase of +1.41% mAP and up to +15.1% mAP with high-quality depth. |
The performance of DFA3D relies on the quality of estimated depth.
Future work includes exploring the integration of temporal information for improved depth estimation. |
3d object detection, multi-view vision, feature lifting, deformable attention, depth ambiguity |
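One way to see why a depth-weighted 2D operation can match sampling from an explicitly expanded 3D feature volume is sketched below. It is an illustration under stated assumptions, not the authors' code: the expanded volume is taken to be the outer product of a per-pixel depth distribution `depth` (D x H x W) and a 2D feature map `feat` (H x W x C), sampling is plain trilinear interpolation at one point, and all function names are hypothetical.

```python
import numpy as np

def lerp_weights(t, size):
    """Two neighbouring integer indices and their linear weights for coordinate t."""
    t0 = int(np.clip(np.floor(t), 0, size - 2))
    w1 = t - t0
    return (t0, t0 + 1), (1.0 - w1, w1)

def sample_explicit(feat, depth, x, y, z):
    """Trilinear sample from the materialised D x H x W x C volume."""
    vol = depth[..., None] * feat[None]  # explicit 3D expansion, memory heavy
    D, H, W, _ = vol.shape
    xs, wx = lerp_weights(x, W)
    ys, wy = lerp_weights(y, H)
    zs, wz = lerp_weights(z, D)
    out = np.zeros(feat.shape[-1])
    for zi, a in zip(zs, wz):
        for yi, b in zip(ys, wy):
            for xi, c in zip(xs, wx):
                out += a * b * c * vol[zi, yi, xi]
    return out

def sample_factored(feat, depth, x, y, z):
    """Same result without building the volume: depth-weighted bilinear sampling."""
    D, H, W = depth.shape
    xs, wx = lerp_weights(x, W)
    ys, wy = lerp_weights(y, H)
    zs, wz = lerp_weights(z, D)
    out = np.zeros(feat.shape[-1])
    for yi, b in zip(ys, wy):
        for xi, c in zip(xs, wx):
            # Interpolate the depth weight along z first, then scale the 2D feature.
            d = wz[0] * depth[zs[0], yi, xi] + wz[1] * depth[zs[1], yi, xi]
            out += b * c * d * feat[yi, xi]
    return out

rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 5, 3))          # H x W x C feature map
depth = rng.random(size=(3, 4, 5))         # D x H x W depth scores
depth /= depth.sum(axis=0, keepdims=True)  # normalise over depth bins
a = sample_explicit(feat, depth, 1.3, 2.6, 0.4)
b = sample_factored(feat, depth, 1.3, 2.6, 0.4)
```

The two samplers agree because the trilinear weights factor into a depth term and a bilinear term, so the D x H x W x C volume never has to be stored.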
2307.12967
Report |
Learning Dense Correspondences between Photos and Sketches |
Xuanchen Lu, Xiaolong Wang, Judith E Fan |
Humans effortlessly grasp the connection between sketches and real-world
objects, even when these sketches are far from realistic. Moreover, human
sketch understanding goes beyond categorization -- critically, it also entails
understanding how individual elements within a sketch correspond to parts of
the physical world it represents. What are the computational ingredients needed
to support this ability? Towards answering this question, we make two
contributions: first, we introduce a new sketch-photo correspondence benchmark,
PSC6k, containing 150K annotations of 6250 sketch-photo pairs across
125 object categories, augmenting the existing Sketchy dataset with
fine-grained correspondence metadata. Second, we propose a self-supervised
method for learning dense correspondences between sketch-photo pairs, building
upon recent advances in correspondence learning for pairs of photos. Our model
uses a spatial transformer network to estimate the warp flow between latent
representations of a sketch and photo extracted by a contrastive learning-based
ConvNet backbone. We found that this approach outperformed several strong
baselines and produced predictions that were quantitatively consistent with
other warp-based methods. However, our benchmark also revealed systematic
differences between predictions of the suite of models we tested and those of
humans. Taken together, our work suggests a promising path towards developing
artificial systems that achieve more human-like understanding of visual images
at different levels of abstraction. Project page:
https://photo-sketch-correspondence.github.io |
This paper introduces PSC6k, a new benchmark for photo-sketch dense correspondence learning, and proposes a self-supervised method for learning dense correspondences between sketch-photo pairs. |
Understanding the link between sketches and real-world objects is crucial for bridging the gap between human and artificial vision systems. This task requires robust image understanding across domains and levels of abstraction, particularly in aligning semantic correspondences between stylized and photorealistic images. |
The PSC6k benchmark augments the Sketchy dataset with 150K keypoint annotations on 6250 sketch-photo pairs. The proposed self-supervised method utilizes a contrastive learning-based ConvNet backbone to extract latent representations and a spatial transformer network to estimate the warp flow between a sketch and a photo, aiming to maximize their feature map similarity. |
The proposed method outperforms existing self-supervised and weakly supervised methods on PSC6k, setting a new state-of-the-art.
Analysis reveals systematic differences between model predictions and human annotations, highlighting areas for future improvement.
The photo-sketch contrastive learning procedure reduces the texture bias in learned representations, leading to a stronger shape bias more aligned with human perception. |
The model exhibits limitations in handling non-continuous transformations and aligning fine structures.
Future work could explore stroke-based keypoints for improved coverage of semantically meaningful sketch regions. |
sketch understanding, dense correspondence learning, self-supervised learning, contrastive learning, spatial transformer network |
2307.12909
Report |
Dyn-E: Local Appearance Editing of Dynamic Neural Radiance Fields |
Shangzhan Zhang, Sida Peng, Yinji ShenTu, Qing Shuai, Tianrun Chen, Kaicheng Yu, Hujun Bao, Xiaowei Zhou |
Recently, the editing of neural radiance fields (NeRFs) has gained
considerable attention, but most prior works focus on static scenes while
research on the appearance editing of dynamic scenes is relatively lacking. In
this paper, we propose a novel framework to edit the local appearance of
dynamic NeRFs by manipulating pixels in a single frame of training video.
Specifically, to locally edit the appearance of dynamic NeRFs while preserving
unedited regions, we introduce a local surface representation of the edited
region, which can be inserted into and rendered along with the original NeRF
and warped to arbitrary other frames through a learned invertible motion
representation network. By employing our method, users without professional
expertise can easily add desired content to the appearance of a dynamic scene.
We extensively evaluate our approach on various scenes and show that our
approach achieves spatially and temporally consistent editing results. Notably,
our approach is versatile and applicable to different variants of dynamic NeRF
representations. |
This paper introduces Dyn-E, a novel framework for local appearance editing of dynamic Neural Radiance Fields (NeRFs) by manipulating pixels in a single training video frame. |
Current NeRF editing methods mainly focus on static scenes, leaving dynamic scene editing, crucial for volumetric video editing, underexplored. |
Dyn-E lifts the edited region to 3D space, forming a textured mesh. It utilizes an invertible network to represent the local surface motion, propagating edits across video frames while preserving unedited areas. |
Dyn-E achieves spatially and temporally consistent editing results, outperforming baselines relying on scene flow or optical flow warping.
The local surface representation effectively handles occlusions between edited content and the original dynamic NeRF.
Dyn-E demonstrates versatility by being applicable to various dynamic NeRF representations like HyperNeRF, DynamicNeRF, and Neural Body. |
The current method assumes the edited region is mostly occlusion-free, which might not hold in complex scenarios.
Future work could explore incorporating semantic information or user interaction for more controllable editing. |
dynamic neural radiance fields, appearance editing, 3d scene editing, volumetric video editing, invertible networks |
2307.12868
Report |
Understanding the Latent Space of Diffusion Models through the Lens of Riemannian Geometry |
Yong-Hyun Park, Mingi Kwon, Jaewoong Choi, Junghyo Jo, Youngjung Uh |
Despite the success of diffusion models (DMs), we still lack a thorough
understanding of their latent space. To understand the latent space
$\mathbf{x}_t \in \mathcal{X}$, we analyze them from a geometrical perspective.
Our approach involves deriving the local latent basis within $\mathcal{X}$ by
leveraging the pullback metric associated with their encoding feature maps.
Remarkably, our discovered local latent basis enables image editing
capabilities by moving $\mathbf{x}_t$, the latent space of DMs, along the basis
vector at specific timesteps. We further analyze how the geometric structure of
DMs evolves over diffusion timesteps and differs across different text
conditions. This confirms the known phenomenon of coarse-to-fine generation, as
well as reveals novel insights such as the discrepancy between $\mathbf{x}_t$
across timesteps, the effect of dataset complexity, and the time-varying
influence of text prompts. To the best of our knowledge, this paper is the
first to present image editing through $\mathbf{x}$-space traversal, editing
only once at specific timestep $t$ without any additional training, and
providing thorough analyses of the latent structure of DMs. The code to
reproduce our experiments can be found at
https://github.com/enkeejunior1/Diffusion-Pullback. |
This paper introduces a novel approach for analyzing and manipulating the latent space of diffusion models (DMs) using a geometrical perspective, leveraging the pullback metric to discover local latent bases. |
Understanding the latent space of DMs is crucial for leveraging their full potential, especially in image editing and manipulation, which existing methods struggle to fully utilize. |
The authors employ the pullback metric to define distances in the latent space based on the local Euclidean metric of the corresponding feature space. They use SVD on the Jacobian of the mapping between these spaces to discover local latent bases. |
Traversing along the discovered latent basis enables semantic image editing at various diffusion timesteps.
The latent space structure evolves from low-frequency to high-frequency components as the generative process progresses, reflecting the coarse-to-fine generation.
Textual prompts in text-to-image DMs influence the latent space structure, with similar prompts yielding similar structures, but this influence diminishes in later generative stages. |
The discovered latent directions can sometimes exhibit entanglement between attributes, likely due to dataset biases.
While effective in many cases, the method occasionally leads to abrupt changes during editing, highlighting the need for further exploration of the complex geometry of the DM latent space. |
diffusion models, latent space, image editing, pullback metric, riemannian geometry |
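The pullback-metric construction summarized in the report above can be illustrated with a small sketch. Under the pullback of the Euclidean metric through a feature map with Jacobian J, the squared length of a latent direction v is v^T J^T J v, so the right singular vectors of J give the directions in latent space that change the features most. The `feature_map` below is a hypothetical toy function, not the diffusion U-Net used in the paper, and the finite-difference Jacobian is an assumption chosen only to keep the example self-contained.

```python
import numpy as np

def feature_map(x):
    """Toy stand-in for the encoder feature map h = f(x) (hypothetical)."""
    A = np.array([[3.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
    return np.tanh(A @ x)

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian of f at x, with shape (dim_h, dim_x)."""
    x = np.asarray(x, dtype=float)
    cols = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        cols.append((f(x + step) - f(x - step)) / (2 * eps))
    return np.stack(cols, axis=1)

def local_latent_basis(f, x):
    """Right singular vectors of the Jacobian, ordered by singular value."""
    J = numerical_jacobian(f, x)
    _, singular_values, vt = np.linalg.svd(J, full_matrices=False)
    return singular_values, vt  # each row of vt is a direction in x-space

x = np.zeros(3)
singular_values, basis = local_latent_basis(feature_map, x)

# Editing by traversal: move the latent along the leading basis vector.
edited = x + 0.1 * basis[0]
```

In this toy case the leading direction is the first coordinate axis, because the feature map is most sensitive to that coordinate at the origin.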
2307.12751
Report |
ICF-SRSR: Invertible scale-Conditional Function for Self-Supervised Real-world Single Image Super-Resolution |
Reyhaneh Neshatavar, Mohsen Yavartanoo, Sanghyun Son, Kyoung Mu Lee |
Single image super-resolution (SISR) is a challenging ill-posed problem that
aims to up-sample a given low-resolution (LR) image to a high-resolution (HR)
counterpart. Due to the difficulty in obtaining real LR-HR training pairs,
recent approaches are trained on simulated LR images degraded by simplified
down-sampling operators, e.g., bicubic. Such an approach can be problematic in
practice because of the large gap between the synthesized and real-world LR
images. To alleviate the issue, we propose a novel Invertible scale-Conditional
Function (ICF), which can scale an input image and then restore the original
input with different scale conditions. By leveraging the proposed ICF, we
construct a novel self-supervised SISR framework (ICF-SRSR) to handle the
real-world SR task without using any paired/unpaired training data.
Furthermore, our ICF-SRSR can generate realistic and feasible LR-HR pairs,
which can make existing supervised SISR networks more robust. Extensive
experiments demonstrate the effectiveness of the proposed method in handling
SISR in a fully self-supervised manner. Our ICF-SRSR demonstrates superior
performance compared to the existing methods trained on synthetic paired images
in real-world scenarios and exhibits comparable performance compared to
state-of-the-art supervised/unsupervised methods on public benchmark datasets. |
This paper proposes ICF-SRSR, a novel self-supervised framework for single image super-resolution (SISR) using an invertible scale-conditional function (ICF). |
ICF-SRSR addresses the issue of poor generalization in real-world SISR tasks, which stems from models being trained on synthetic datasets with simplified down-sampling operators. |
ICF-SRSR leverages a learnable ICF that can up-sample and down-sample an input image based on different scale conditions. The framework is trained in a self-supervised manner by minimizing the distance between the original input and the generated images after consecutive up-down and down-up stages. |
ICF-SRSR outperforms existing self-supervised and some supervised methods on synthetic datasets.
It surpasses methods trained on synthetic datasets when evaluated on real-world datasets.
ICF-SRSR can generate realistic low-resolution and high-resolution image pairs, beneficial for training other SISR models. |
The paper's evaluation on real-world datasets is limited due to the scarcity of aligned low-resolution and high-resolution image pairs.
Future work will focus on creating a large-scale real-world dataset and exploring applications of ICF in other image restoration tasks. |
super-resolution, self-supervised learning, real-world image super-resolution, invertible scale-conditional function, image restoration |
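The self-supervised objective described in the report above (consistency after consecutive up-down and down-up stages) can be sketched as follows. The learned ICF is replaced here by fixed placeholder operators, nearest-neighbour up-sampling and average pooling, which are assumptions made purely for illustration; the L1 distance is likewise an assumed choice.

```python
import numpy as np

def upsample(img, scale):
    """Nearest-neighbour up-sampling (placeholder for the ICF with an up-scale condition)."""
    return np.repeat(np.repeat(img, scale, axis=0), scale, axis=1)

def downsample(img, scale):
    """Average pooling (placeholder for the ICF with a down-scale condition).

    Assumes the image height and width are divisible by `scale`.
    """
    h, w = img.shape
    return img.reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))

def cycle_loss(img, scale=2):
    """L1 distance between the input and its up-down and down-up reconstructions."""
    up_down = downsample(upsample(img, scale), scale)
    down_up = upsample(downsample(img, scale), scale)
    return float(np.abs(up_down - img).mean() + np.abs(down_up - img).mean())

flat_image = np.ones((4, 4))
textured_image = np.arange(16, dtype=float).reshape(4, 4)
flat_loss = cycle_loss(flat_image)
textured_loss = cycle_loss(textured_image)
```

With these fixed operators the down-up path loses detail on textured inputs, which is the kind of error a learned, invertible function would be trained to reduce.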
2307.12732
Report |
CLIP-KD: An Empirical Study of CLIP Model Distillation |
Chuanguang Yang, Zhulin An, Libo Huang, Junyu Bi, Xinqiang Yu, Han Yang, Boyu Diao, Yongjun Xu |
Contrastive Language-Image Pre-training (CLIP) has become a promising
language-supervised visual pre-training framework. This paper aims to distill
small CLIP models supervised by a large teacher CLIP model. We propose several
distillation strategies, including relation, feature, gradient and contrastive
paradigms, to examine the effectiveness of CLIP-Knowledge Distillation (KD). We
show that a simple feature mimicry with Mean Squared Error loss works
surprisingly well. Moreover, interactive contrastive learning across teacher
and student encoders is also effective in performance improvement. We explain
that the success of CLIP-KD can be attributed to maximizing the feature
similarity between teacher and student. The unified method is applied to
distill several student models trained on CC3M+12M. CLIP-KD improves student
CLIP models consistently over zero-shot ImageNet classification and cross-modal
retrieval benchmarks. When using ViT-L/14 pretrained on Laion-400M as the
teacher, CLIP-KD achieves 57.5\% and 55.4\% zero-shot top-1 ImageNet accuracy
over ViT-B/16 and ResNet-50, surpassing the original CLIP without KD by 20.5\%
and 20.1\% margins, respectively. Our code is released on
https://github.com/winycg/CLIP-KD. |
This paper investigates various knowledge distillation (KD) strategies for compressing CLIP models, improving the performance of smaller CLIP models under the supervision of a larger, pretrained teacher model. |
Smaller CLIP models are desirable for resource-constrained applications, but they often suffer from performance degradation compared to larger models. This work aims to bridge this gap using KD. |
The paper proposes and evaluates several KD strategies, including: (1) Contrastive Relational Distillation, (2) Feature Distillation, (3) Masked Feature Distillation, (4) Gradient Distillation, (5) Interactive Contrastive Learning, and (6) Augmented Feature Distillation. These methods are analyzed individually and in combination. |
Feature Distillation with Mean Squared Error loss performs surprisingly well, significantly improving student performance.
Interactive Contrastive Learning, which promotes interaction between teacher and student encoders, also leads to significant gains.
The effectiveness of different KD methods is correlated with their ability to maximize feature similarity between teacher and student models. |
Distilling knowledge from significantly larger teachers to smaller students might not be optimal due to potential capacity gaps.
Exploring more advanced distillation strategies, such as incorporating intermediate layer distillation with architecture-aware mechanisms, could further improve performance. |
knowledge distillation, clip, contrastive learning, multimodal learning, model compression |
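Two of the strategies named in the report above, feature distillation with a Mean Squared Error loss and interactive contrastive learning across teacher and student encoders, can be sketched in a few lines. This is a minimal NumPy illustration, not the released CLIP-KD code: the function names, the temperature of 0.07, and the assumption that student and teacher embeddings already share a dimension are all hypothetical.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def feature_distillation_loss(student, teacher):
    """Mean squared error between student and teacher embeddings."""
    return float(np.mean((student - teacher) ** 2))

def info_nce(anchors, targets, temperature=0.07):
    """Contrastive loss where row i of anchors should match row i of targets."""
    logits = l2_normalize(anchors) @ l2_normalize(targets).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def interactive_contrastive_loss(student_img, teacher_txt,
                                 student_txt, teacher_img, temperature=0.07):
    """Contrast student embeddings of one modality against teacher embeddings of the other."""
    return 0.5 * (info_nce(student_img, teacher_txt, temperature)
                  + info_nce(student_txt, teacher_img, temperature))

teacher = np.eye(4)                      # four toy teacher embeddings
aligned_student = np.eye(4)              # student that matches the teacher
shifted_student = np.roll(np.eye(4), 1, axis=0)  # student matched to the wrong pairs

aligned_loss = interactive_contrastive_loss(aligned_student, teacher,
                                            aligned_student, teacher)
shifted_loss = interactive_contrastive_loss(shifted_student, teacher,
                                            shifted_student, teacher)
```

Both losses reward a student whose embeddings are close to the teacher's, which is consistent with the report's observation that effectiveness tracks teacher-student feature similarity.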
2307.12730
Report |
COCO-O: A Benchmark for Object Detectors under Natural Distribution Shifts |
Xiaofeng Mao, Yuefeng Chen, Yao Zhu, Da Chen, Hang Su, Rong Zhang, Hui Xue |
Practical object detection applications can lose their effectiveness on image
inputs with natural distribution shifts. This problem has led the research
community to pay more attention to the robustness of detectors under
Out-Of-Distribution (OOD) inputs. Existing works construct datasets to
benchmark the detector's OOD robustness for a specific application scenario,
e.g., Autonomous Driving. However, these datasets lack universality and are
hard to benchmark general detectors built on common tasks such as COCO. To give
a more comprehensive robustness assessment, we introduce
COCO-O(ut-of-distribution), a test dataset based on COCO with 6 types of
natural distribution shifts. COCO-O has a large distribution gap with training
data and results in a significant 55.7% relative performance drop on a Faster
R-CNN detector. We leverage COCO-O to conduct experiments on more than 100
modern object detectors to investigate if their improvements are credible or
just over-fitting to the COCO test set. Unfortunately, most classic detectors
from earlier years do not exhibit strong OOD generalization. We further study the
robustness effects of recent breakthroughs in detector architecture design,
augmentation, and pre-training techniques. Some empirical findings are revealed:
1) Compared with detection head or neck, backbone is the most important part
for robustness; 2) An end-to-end detection transformer design brings no
enhancement, and may even reduce robustness; 3) Large-scale foundation models
have made a great leap on robust object detection. We hope our COCO-O could
provide a rich testbed for robustness study of object detection. The dataset
will be available at
https://github.com/alibaba/easyrobust/tree/main/benchmarks/coco_o. |
This paper introduces COCO-O, a new benchmark dataset designed to evaluate the robustness of object detectors when faced with natural distribution shifts. |
Existing robustness benchmarks for object detection either rely on synthetic data or focus on specific scenarios. COCO-O addresses this gap by providing a diverse set of real-world images with natural distribution shifts, enabling a more comprehensive robustness assessment of modern object detectors. |
The authors construct COCO-O by collecting images from six distinct domains: sketch, weather, cartoon, painting, tattoo, and handmake. These domains represent varying degrees of object abstraction and introduce realistic challenges for object detection models. They evaluate a wide range of detectors, including classic architectures and state-of-the-art models, on COCO-O and analyze the impact of factors such as architecture design, data augmentation, and pre-training on robustness. |
Contrary to expectations, most classic detectors and recent architectural advancements in object detection show limited progress in robustness to natural distribution shifts.
The backbone network plays a more crucial role in OOD robustness than other detector components like the neck or head.
Large-scale foundation models, particularly those pre-trained on massive image-language datasets, exhibit significantly improved robustness on COCO-O, highlighting the potential of data scale and external knowledge for robust object detection. |
The reasons behind the poor performance of DETR-based models on COCO-O require further investigation.
Future work will focus on developing novel techniques to enhance the OOD robustness of object detection algorithms, leveraging the challenges and insights provided by COCO-O. |
object detection, robustness, benchmark dataset, distribution shift, out-of-distribution generalization |
2307.12616
Report |
CTVIS: Consistent Training for Online Video Instance Segmentation |
Kaining Ying, Qing Zhong, Weian Mao, Zhenhua Wang, Hao Chen, Lin Yuanbo Wu, Yifan Liu, Chengxiang Fan, Yunzhi Zhuge, Chunhua Shen |
The discrimination of instance embeddings plays a vital role in associating
instances across time for online video instance segmentation (VIS). Instance
embedding learning is directly supervised by the contrastive loss computed upon
the contrastive items (CIs), which are sets of anchor/positive/negative
embeddings. Recent online VIS methods leverage CIs sourced from one reference
frame only, which we argue is insufficient for learning highly discriminative
embeddings. Intuitively, a possible strategy to enhance CIs is replicating the
inference phase during training. To this end, we propose a simple yet effective
training strategy, called Consistent Training for Online VIS (CTVIS), which
is devoted to aligning the training and inference pipelines in terms of building
CIs. Specifically, CTVIS constructs CIs by mirroring the momentum-averaged
embedding and memory bank storage mechanisms used at inference, and by adding
noise to the relevant embeddings. Such an extension allows a reliable
comparison between embeddings of current instances and the stable
representations of historical instances, thereby conferring an advantage in
modeling VIS challenges such as occlusion, re-identification, and deformation.
Empirically, CTVIS outstrips the SOTA VIS models by up to +5.0 points on three
VIS benchmarks, including YTVIS19 (55.1% AP), YTVIS21 (50.1% AP) and OVIS
(35.5% AP). Furthermore, we find that pseudo-videos transformed from images can
train robust models surpassing fully-supervised ones. |
This paper presents CTVIS, a novel training strategy for online video instance segmentation (VIS) that aligns training and inference pipelines to learn highly discriminative instance embeddings, thereby enhancing instance association across video frames. |
Accurate instance association in videos, especially under challenges like occlusion and re-identification, is crucial for VIS and its downstream applications. |
CTVIS leverages a memory bank to store momentum-averaged embeddings and constructs contrastive items by comparing against these stable representations. It further introduces noise during memory bank updates to simulate real-world tracking challenges. |
CTVIS significantly outperforms state-of-the-art VIS methods on YTVIS19, YTVIS21, and OVIS benchmarks.
The method effectively leverages long video sequences during training to improve embedding discrimination.
Training CTVIS solely on pseudo-videos generated from augmented still images achieves competitive performance, surpassing fully-supervised counterparts. |
The reliance on pseudo-videos for training might introduce biases if the augmentation strategies do not fully encapsulate real-world video characteristics.
Future work could explore the integration of CTVIS with other query-based instance segmentation models and evaluate its generalization to other video-related tasks like video panoptic segmentation. |
video instance segmentation, instance embedding learning, contrastive learning, memory bank, data augmentation |
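The memory-bank mechanism in the report above can be sketched as a dictionary of momentum-averaged embeddings from which positives and negatives are drawn. This is a simplified illustration: the fixed momentum of 0.75, the Gaussian noise model, and all class and function names are assumptions, and the actual CTVIS update rule may differ.

```python
import numpy as np

class MomentumMemoryBank:
    """Stores one momentum-averaged embedding per tracked instance."""

    def __init__(self, momentum=0.75):
        self.momentum = momentum
        self.bank = {}

    def update(self, instance_id, embedding):
        """Blend a new embedding into the stored representation."""
        embedding = np.asarray(embedding, dtype=float)
        if instance_id not in self.bank:
            self.bank[instance_id] = embedding.copy()
        else:
            old = self.bank[instance_id]
            self.bank[instance_id] = (self.momentum * old
                                      + (1.0 - self.momentum) * embedding)
        return self.bank[instance_id]

    def contrastive_items(self, instance_id):
        """Return (positive, negatives) for an anchor belonging to instance_id."""
        positive = self.bank[instance_id]
        negatives = [v for k, v in self.bank.items() if k != instance_id]
        return positive, negatives

def add_noise(embedding, rng, scale=0.05):
    """Perturb an embedding to imitate imperfect association during training."""
    return np.asarray(embedding, dtype=float) + rng.normal(0.0, scale, size=np.shape(embedding))

rng = np.random.default_rng(0)
bank = MomentumMemoryBank(momentum=0.75)
bank.update("person_1", [1.0, 0.0])
bank.update("person_1", [0.0, 1.0])           # second frame of the same instance
bank.update("dog_1", add_noise([0.0, -1.0], rng))
positive, negatives = bank.contrastive_items("person_1")
```

The anchor embedding from the current frame would then be pulled toward `positive` and pushed away from each entry in `negatives` by a contrastive loss.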
2307.12612
Report |
Less is More: Focus Attention for Efficient DETR |
Dehua Zheng, Wenhui Dong, Hailin Hu, Xinghao Chen, Yunhe Wang |
DETR-like models have significantly boosted the performance of detectors and
even outperformed classical convolutional models. However, treating all tokens
equally without discrimination brings a redundant computational burden
in the traditional encoder structure. Recent sparsification strategies
exploit a subset of informative tokens to reduce attention complexity
while maintaining performance through a sparse encoder. But these methods tend to
rely on unreliable model statistics. Moreover, simply reducing the token
population hinders the detection performance to a large extent, limiting the
application of these sparse models. We propose Focus-DETR, which focuses
attention on more informative tokens for a better trade-off between computation
efficiency and model accuracy. Specifically, we reconstruct the encoder with
dual attention, which includes a token scoring mechanism that considers both
localization and category semantic information of the objects from multi-scale
feature maps. We efficiently abandon the background queries and enhance the
semantic interaction of the fine-grained object queries based on the scores.
Compared with the state-of-the-art sparse DETR-like detectors under the same
setting, our Focus-DETR has comparable complexity while achieving 50.4 AP
(+2.2) on COCO. The code is available at
https://github.com/huawei-noah/noah-research/tree/master/Focus-DETR and
https://gitee.com/mindspore/models/tree/master/research/cv/Focus-DETR. |
This paper proposes Focus-DETR, a novel DETR-like model that focuses attention on informative tokens using a scoring mechanism incorporating localization and category semantic information, achieving a better computation-accuracy trade-off. |
DETR-like models, while effective, suffer from redundant computation in the encoder due to treating all tokens equally. |
A scoring mechanism with top-down score modulations across multi-scale features identifies foreground and fine-grained object tokens. These tokens are processed through an encoder with dual attention, enhancing semantic information and reducing computation. |
Achieves 50.4 AP (+2.2 AP over Sparse DETR) on COCO with comparable complexity.
Outperforms state-of-the-art sparse DETR-like models with ResNet-50, ResNet-101, and Swin Transformer backbones.
Demonstrates the effectiveness of focusing on informative tokens and enhancing their semantic representation. |
Exploring more hierarchical semantic grading strategies beyond position and category information.
Developing a unified scoring mechanism and feature enhancement algorithm for the entire Transformer. |
object detection, detr, transformer, attention mechanism, efficient computation |
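The idea of restricting encoder computation to high-scoring tokens, as described in the report above, can be sketched as a top-k selection followed by an update of only the kept tokens. The scores, the keep ratio, and the `update_fn` placeholder are hypothetical; the paper's scoring mechanism and dual attention are not reproduced here.

```python
import numpy as np

def select_informative_tokens(tokens, scores, keep_ratio=0.3):
    """Return indices and values of the highest-scoring tokens."""
    num_keep = max(1, int(round(keep_ratio * len(tokens))))
    keep_idx = np.sort(np.argsort(scores)[::-1][:num_keep])
    return keep_idx, tokens[keep_idx]

def sparse_encoder_layer(tokens, scores, update_fn, keep_ratio=0.3):
    """Apply update_fn only to the kept tokens; other tokens pass through unchanged."""
    keep_idx, kept = select_informative_tokens(tokens, scores, keep_ratio)
    out = tokens.copy()
    out[keep_idx] = update_fn(kept)
    return out

tokens = np.arange(10, dtype=float).reshape(5, 2)   # five tokens of dimension two
scores = np.array([0.1, 0.9, 0.2, 0.8, 0.3])        # hypothetical foreground scores
updated = sparse_encoder_layer(tokens, scores, lambda t: t + 100.0, keep_ratio=0.4)
```

Because attention cost grows with the number of tokens attended over, updating only the kept subset is where the computational saving comes from.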
2307.12574
Report |
A Good Student is Cooperative and Reliable: CNN-Transformer Collaborative Learning for Semantic Segmentation |
Jinjing Zhu, Yunhao Luo, Xu Zheng, Hao Wang, Lin Wang |
In this paper, we strive to answer the question "how to collaboratively learn
convolutional neural network (CNN)-based and vision transformer (ViT)-based
models by selecting and exchanging the reliable knowledge between them for
semantic segmentation?" Accordingly, we propose an online knowledge
distillation (KD) framework that can simultaneously learn compact yet effective
CNN-based and ViT-based models with two key technical breakthroughs to take
full advantage of CNNs and ViT while compensating for their limitations. Firstly,
we propose heterogeneous feature distillation (HFD) to improve students'
consistency in low-layer feature space by mimicking heterogeneous features
between CNNs and ViT. Secondly, to facilitate the two students to learn
reliable knowledge from each other, we propose bidirectional selective
distillation (BSD) that can dynamically transfer selective knowledge. This is
achieved by 1) region-wise BSD determining the directions of knowledge
transferred between the corresponding regions in the feature space and 2)
pixel-wise BSD discerning which of the prediction knowledge to be transferred
in the logit space. Extensive experiments on three benchmark datasets
demonstrate that our proposed framework outperforms the state-of-the-art online
distillation methods by a large margin, and shows its efficacy in learning
collaboratively between ViT-based and CNN-based models. |
This supplementary material provides detailed insights into the CNN-Transformer collaborative learning framework for semantic segmentation, focusing on heterogeneous feature distillation (HFD) and region-wise bidirectional selective distillation (BSD). |
The proposed method addresses the challenge of effectively transferring knowledge between CNN and Transformer models for semantic segmentation, leveraging their complementary strengths. |
The method uses HFD to align heterogeneous features from early layers and BSD to selectively transfer knowledge in a region-wise manner based on prediction reliability. |
The method enables CNNs and Transformers to learn collaboratively and improve each other's performance.
BSD facilitates the selection and exchange of reliable knowledge between the models, leading to enhanced segmentation accuracy.
Experimental results demonstrate the effectiveness of the proposed approach compared to vanilla training and other distillation methods. |
The current study focuses on specific CNN and Transformer architectures; exploring other architectures could further enhance the method's applicability.
Investigating the impact of different knowledge distillation strategies within the framework could lead to further performance improvements. |
semantic segmentation, collaborative learning, knowledge distillation, cnn-transformer, heterogeneous feature distillation |
2307.12560
Report |
Interpolating between Images with Diffusion Models |
Clinton J. Wang, Polina Golland |
One little-explored frontier of image generation and editing is the task of
interpolating between two input images, a feature missing from all currently
deployed image generation pipelines. We argue that such a feature can expand
the creative applications of such models, and propose a method for zero-shot
interpolation using latent diffusion models. We apply interpolation in the
latent space at a sequence of decreasing noise levels, then perform denoising
conditioned on interpolated text embeddings derived from textual inversion and
(optionally) subject poses. For greater consistency, or to specify additional
criteria, we can generate several candidates and use CLIP to select the highest
quality image. We obtain convincing interpolations across diverse subject
poses, image styles, and image content, and show that standard quantitative
metrics such as FID are insufficient to measure the quality of an
interpolation. Code and data are available at
https://clintonjwang.github.io/interpolation. |
This paper presents a novel method for generating high-quality interpolations between two real images using pre-trained latent diffusion models, a task not addressed by existing image generation techniques. |
Real image interpolation can broaden the creative applications of image generation models in fields like art, media, and design. |
The method involves interpolating noisy latent representations of the input images at progressively decreasing noise levels, guided by interpolated text embeddings and optionally, subject poses. CLIP is used to select high-quality outputs from multiple generated candidates. |
The proposed method generates convincing interpolations across diverse image styles, content, and subject poses.
Adding noise to parent latent vectors before interpolation leads to more semantically meaningful transformations compared to alternative schemes.
Standard image generation metrics like FID and PPL are insufficient to evaluate the quality of interpolations, as they favor simple alpha composites over creative transformations. |
The method may struggle to interpolate images with significant differences in style, layout, or semantic mapping of objects.
Future work can explore non-uniform interpolation schedules and address limitations in handling large stylistic and semantic gaps between images. |
latent diffusion models, image interpolation, image editing, textual inversion, clip |
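The step of adding noise to the two parent latents before interpolating them, mentioned in the report above, can be sketched with spherical linear interpolation (slerp). Using slerp and sharing one noise sample between both parents are assumptions made for this illustration, and the function names are hypothetical; the denoising step that would follow is omitted.

```python
import numpy as np

def slerp(z0, z1, alpha):
    """Spherical linear interpolation between two latent vectors."""
    z0 = np.asarray(z0, dtype=float)
    z1 = np.asarray(z1, dtype=float)
    u0 = z0.ravel() / np.linalg.norm(z0)
    u1 = z1.ravel() / np.linalg.norm(z1)
    omega = np.arccos(np.clip(np.dot(u0, u1), -1.0, 1.0))
    if omega < 1e-8:
        # Nearly parallel vectors: fall back to linear interpolation.
        return (1.0 - alpha) * z0 + alpha * z1
    return (np.sin((1.0 - alpha) * omega) * z0
            + np.sin(alpha * omega) * z1) / np.sin(omega)

def noisy_interpolate(z0, z1, alpha, noise_std, rng):
    """Add the same Gaussian noise to both parents, then interpolate."""
    noise = noise_std * rng.standard_normal(np.shape(z0))
    return slerp(np.asarray(z0) + noise, np.asarray(z1) + noise, alpha)

rng = np.random.default_rng(0)
midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
noisy_mid = noisy_interpolate(np.ones((4, 4)), -np.ones((4, 4)) + 0.5, 0.5, 0.8, rng)
```

In a full pipeline this would be repeated at decreasing values of `noise_std`, with each interpolated latent denoised before it serves as a parent for finer interpolation steps.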
2307.12493
Report |
TF-ICON: Diffusion-Based Training-Free Cross-Domain Image Composition |
Shilin Lu, Yanzhu Liu, Adams Wai-Kin Kong |
Text-driven diffusion models have exhibited impressive generative
capabilities, enabling various image editing tasks. In this paper, we propose
TF-ICON, a novel Training-Free Image COmpositioN framework that harnesses the
power of text-driven diffusion models for cross-domain image-guided
composition. This task aims to seamlessly integrate user-provided objects into
a specific visual context. Current diffusion-based methods often involve costly
instance-based optimization or finetuning of pretrained models on customized
datasets, which can potentially undermine their rich prior. In contrast,
TF-ICON can leverage off-the-shelf diffusion models to perform cross-domain
image-guided composition without requiring additional training, finetuning, or
optimization. Moreover, we introduce the exceptional prompt, which contains no
information, to facilitate text-driven diffusion models in accurately inverting
real images into latent representations, forming the basis for compositing. Our
experiments show that equipping Stable Diffusion with the exceptional prompt
outperforms state-of-the-art inversion methods on various datasets (CelebA-HQ,
COCO, and ImageNet), and that TF-ICON surpasses prior baselines in versatile
visual domains. Code is available at https://github.com/Shilin-LU/TF-ICON |
This paper introduces TF-ICON, a training-free image composition framework that leverages pre-trained text-to-image diffusion models for cross-domain image composition. |
Existing diffusion-based image composition methods require expensive training or finetuning, potentially harming model priors. This work offers a training-free alternative for diverse visual domains. |
The method uses an 'exceptional prompt' to accurately invert real images into latent codes. It then performs composition by injecting composite self-attention maps during the denoising process, ensuring seamless object integration across domains. |
High-order diffusion ODE solvers are shown to outperform DDIM for real image inversion.
Introducing an exceptional prompt allows for accurate image inversion in text-driven diffusion models, exceeding SOTA methods on CelebA-HQ, COCO, and ImageNet datasets.
TF-ICON surpasses prior baselines in qualitative and quantitative evaluations, demonstrating superior performance in cross-domain image composition. |
TF-ICON's reliance on self-attention maps limits its ability to generate object views significantly different from the reference image.
The approach inherits the limitations and biases of the underlying Stable Diffusion model, potentially leading to artifacts in certain situations. |
image composition, diffusion models, training-free, cross-domain, image inversion |
2307.12392
Report |
Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision |
Menghao Li, Chunlei Wang, Wenquan Feng, Shuchang Lyu, Guangliang Cheng, Xiangtai Li, Binghao Liu, Qi Zhao |
Visual Grounding (VG) aims at localizing target objects from an image based
on given expressions and has made significant progress with the development of
detection and vision transformers. However, existing VG methods tend to generate
false-alarm objects when presented with inaccurate or irrelevant descriptions,
which commonly occur in practical applications. Moreover, existing methods fail
to capture fine-grained features, accurate localization, and sufficient context
comprehension from the whole image and textual descriptions. To address both
issues, we propose an Iterative Robust Visual Grounding (IR-VG) framework with
Masked Reference based Centerpoint Supervision (MRCS). The framework introduces
iterative multi-level vision-language fusion (IMVF) for better alignment. We
use MRCS to achieve more accurate localization with point-wise feature
supervision. Then, to improve the robustness of VG, we also present a
multi-stage false-alarm sensitive decoder (MFSD) to prevent the generation of
false-alarm objects when presented with inaccurate expressions. The proposed
framework is evaluated on five regular VG datasets and two newly constructed
robust VG datasets. Extensive experiments demonstrate that IR-VG achieves new
state-of-the-art (SOTA) results, with improvements of 25\% and 10\% compared to
existing SOTA approaches on the two newly proposed robust VG datasets.
Moreover, the proposed framework is also verified to be effective on five regular VG
datasets. Code and models will be publicly available at
https://github.com/cv516Buaa/IR-VG. |
This paper introduces IR-VG, a novel iterative robust visual grounding framework that tackles the issue of false alarms in visual grounding tasks, where models incorrectly detect objects when presented with inaccurate or irrelevant textual descriptions. |
Current visual grounding methods often fail to accurately detect target objects when provided with irrelevant or inaccurate textual descriptions, a common occurrence in real-world applications. |
IR-VG leverages three key modules: Masked Reference based Centerpoint Supervision (MRCS) for enhanced fine-grained feature representation and localization accuracy, Iterative Multi-Level Vision-Language Fusion (IMVF) for better multi-modal understanding, and Multi-Stage False-Alarm Sensitive Decoder (MFSD) to identify and prevent false alarm predictions. |
IR-VG achieves state-of-the-art performance on five regular visual grounding datasets, demonstrating its effectiveness in general visual grounding tasks.
The framework significantly outperforms existing methods on two newly proposed robust visual grounding datasets (RefCOCOg_F and ReferItGame_F), demonstrating its robustness to irrelevant or inaccurate descriptions.
Ablation studies and qualitative analysis validate the contribution of each module (MRCS, IMVF, MFSD) to the overall performance improvement. |
The paper acknowledges the need for more sophisticated frameworks to handle false alarms in future work.
Future research will explore the issue of irrelevant expressions in foundation models like Grounding DINO. |
visual grounding, robustness, false alarm detection, multi-modal learning, vision-language understanding |
2307.12348
Report |
ResShift: Efficient Diffusion Model for Image Super-resolution by Residual Shifting |
Zongsheng Yue, Jianyi Wang, Chen Change Loy |
Diffusion-based image super-resolution (SR) methods are mainly limited by the
low inference speed due to the requirements of hundreds or even thousands of
sampling steps. Existing acceleration sampling techniques inevitably sacrifice
performance to some extent, leading to over-blurry SR results. To address this
issue, we propose a novel and efficient diffusion model for SR that
significantly reduces the number of diffusion steps, thereby eliminating the
need for post-acceleration during inference and its associated performance
deterioration. Our method constructs a Markov chain that transfers between the
high-resolution image and the low-resolution image by shifting the residual
between them, substantially improving the transition efficiency. Additionally,
an elaborate noise schedule is developed to flexibly control the shifting speed
and the noise strength during the diffusion process. Extensive experiments
demonstrate that the proposed method obtains superior or at least comparable
performance to current state-of-the-art methods on both synthetic and
real-world datasets, even with only 15 sampling steps. Our code and model are
available at https://github.com/zsyOAOA/ResShift. |
This paper proposes ResShift, an efficient diffusion model for image super-resolution that significantly reduces the number of diffusion steps required, achieving superior performance with just 15 steps. |
Existing diffusion-based SR methods suffer from slow inference speed due to hundreds or thousands of sampling steps. Acceleration techniques compromise performance, leading to over-blurry results. |
ResShift constructs a Markov chain that shifts the residual between the high-resolution and low-resolution images, enabling efficient transition. A flexible noise schedule controls shifting speed and noise strength. |
ResShift achieves superior or comparable performance to state-of-the-art methods on synthetic and real-world datasets with only 15 sampling steps.
It offers a better fidelity-realism trade-off compared to existing diffusion-based SR methods.
The proposed noise schedule provides flexibility in controlling the shifting speed and noise level, allowing for a trade-off between fidelity and realism. |
ResShift's inference speed, while faster than existing diffusion-based methods, is still slower than GAN-based approaches due to its iterative nature.
The model, like other SR methods, may struggle with severely degraded real-world images not well-represented by synthetic degradation models used in training. |
image super-resolution, diffusion model, efficient inference, noise schedule, markov chain |
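The residual-shifting idea described in this record can be illustrated with a minimal NumPy sketch. It assumes the forward marginal has the form x_t = x_0 + eta_t * (y_0 - x_0) + kappa * sqrt(eta_t) * noise, where eta_t grows from near 0 to near 1. The function names, the geometric schedule, and the value of kappa are illustrative assumptions, not details confirmed by the summary above.

```python
import numpy as np

def shift_residual(x0, y0, eta_t, kappa, rng):
    """Sample x_t by moving x0 toward y0 by a fraction eta_t and adding noise.

    Assumed marginal (illustrative only):
        x_t ~ N(x0 + eta_t * (y0 - x0), kappa^2 * eta_t * I)
    """
    residual = y0 - x0                      # what separates the HR and LR images
    noise = rng.standard_normal(x0.shape)
    return x0 + eta_t * residual + kappa * np.sqrt(eta_t) * noise

def make_schedule(num_steps=15, eta_min=1e-3, eta_max=0.999):
    """Hypothetical monotone schedule: geometric spacing from eta_min to eta_max."""
    return np.geomspace(eta_min, eta_max, num_steps)

rng = np.random.default_rng(0)
x0 = rng.random((8, 8))   # stand-in for the high-resolution image
y0 = rng.random((8, 8))   # stand-in for the low-resolution image, upsampled to HR size
etas = make_schedule()
x_mid = shift_residual(x0, y0, etas[7], kappa=2.0, rng=rng)
```

In this sketch eta_t plays the role of the shifting speed and kappa scales the noise strength, so the two quantities the abstract mentions can be tuned separately.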
2307.12280
Report |
Downstream-agnostic Adversarial Examples |
Ziqi Zhou, Shengshan Hu, Ruizhi Zhao, Qian Wang, Leo Yu Zhang, Junhui Hou, Hai Jin |
Self-supervised learning usually uses a large amount of unlabeled data to
pre-train an encoder which can be used as a general-purpose feature extractor,
such that downstream users only need to perform fine-tuning operations to enjoy
the benefit of "large model". Despite this promising prospect, the security of
pre-trained encoder has not been thoroughly investigated yet, especially when
the pre-trained encoder is publicly available for commercial use.
In this paper, we propose AdvEncoder, the first framework for generating
downstream-agnostic universal adversarial examples based on the pre-trained
encoder. AdvEncoder aims to construct a universal adversarial perturbation or
patch for a set of natural images that can fool all the downstream tasks
inheriting the victim pre-trained encoder. Unlike traditional adversarial
example works, the pre-trained encoder only outputs feature vectors rather than
classification labels. Therefore, we first exploit the high frequency component
information of the image to guide the generation of adversarial examples. Then
we design a generative attack framework to construct adversarial
perturbations/patches by learning the distribution of the attack surrogate
dataset to improve their attack success rates and transferability. Our results
show that an attacker can successfully attack downstream tasks without knowing
either the pre-training dataset or the downstream dataset. We also tailor four
defenses for pre-trained encoders, the results of which further prove the
attack ability of AdvEncoder. |
This paper proposes AdvEncoder, the first framework for generating downstream-agnostic universal adversarial examples based on pre-trained encoders in self-supervised learning. |
Pre-trained encoders are increasingly used in various downstream tasks, raising concerns about their security as their vulnerabilities could impact numerous applications. |
A frequency-based generative attack framework is employed to construct adversarial perturbations or patches by learning the distribution of an attacker's surrogate dataset. |
AdvEncoder achieves high attack success rates and transferability against different downstream tasks, such as image classification and image retrieval.
The attack remains effective even when the attacker has no knowledge of the pre-training dataset or downstream tasks.
Existing defenses, like data corruption, fine-tuning, pruning, and adversarial training, show limited effectiveness against AdvEncoder. |
The attack performance of AdvEncoder may vary depending on the similarity between the attacker's surrogate dataset and the pre-training/downstream datasets.
Further exploration is needed to develop more robust defenses specifically tailored to protect pre-trained encoders from adversarial attacks.
Future work includes investigating the effectiveness of AdvEncoder on other downstream tasks beyond image classification and retrieval. |
adversarial examples, self-supervised learning, pre-trained encoders, universal adversarial perturbations, universal adversarial patches |
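The record above says high-frequency image components guide the attack but does not say how they are extracted. One common way to isolate them is an FFT-based high-pass filter, sketched below. The function name and the circular cutoff `radius` are assumptions made for illustration and may differ from what AdvEncoder actually uses.

```python
import numpy as np

def high_frequency_component(image, radius):
    """Return the part of a 2D grayscale image that lies outside a centered low-pass disk."""
    h, w = image.shape
    # Move the zero-frequency term to the center of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    spectrum[dist <= radius] = 0            # discard low frequencies
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum)))

rng = np.random.default_rng(0)
img = rng.random((16, 16))
hf = high_frequency_component(img, radius=3)
```

A larger `radius` removes more of the smooth image content and leaves only finer edges and texture.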
2307.12217
Report |
LoLep: Single-View View Synthesis with Locally-Learned Planes and Self-Attention Occlusion Inference |
Cong Wang, Yu-Ping Wang, Dinesh Manocha |
We propose a novel method, LoLep, which regresses Locally-Learned planes from
a single RGB image to represent scenes accurately, thus generating better novel
views. Without the depth information, regressing appropriate plane locations is
a challenging problem. To solve this issue, we pre-partition the disparity
space into bins and design a disparity sampler to regress local offsets for
multiple planes in each bin. However, using such a sampler alone prevents the
network from converging; we therefore propose two optimization strategies that
combine with different disparity distributions of datasets and propose an
occlusion-aware reprojection loss as a simple yet effective geometric
supervision technique. We also introduce a self-attention mechanism to improve
occlusion inference and present a Block-Sampling Self-Attention (BS-SA) module
to address the problem of applying self-attention to large feature maps. We
demonstrate the effectiveness of our approach and generate state-of-the-art
results on different datasets. Compared to MINE, our approach has an LPIPS
reduction of 4.8%-9.0% and an RV reduction of 73.9%-83.5%. We also evaluate the
performance on real-world images and demonstrate the benefits. |
Proposes LoLep, a novel single-view view synthesis method using locally-learned planes to represent scenes and generate better novel views from a single RGB image. |
Existing methods struggle to represent occluded regions well, and while layered representations are suitable, they either require excessive computing power or rely on depth maps for accurate plane locations. |
Utilizes a disparity sampler to regress locally-learned plane locations, introduces two parameter optimization strategies for different disparity distributions, and proposes an occlusion-aware reprojection loss for geometric supervision. A Block-Sampling Self-Attention (BS-SA) module enhances occlusion inference on large feature maps. |
Outperforms MINE on KITTI, RealEstate10K, and Flowers Light Field datasets with improved LPIPS, SSIM, PSNR, and Rendering Variance (RV).
Generates sharper and more realistic novel views with better handling of occlusions and scene geometry compared to previous methods.
Demonstrates significant improvements in depth estimation on NYU-Depth V2 and iBims-1 datasets, highlighting accurate scene representation. |
The locally-learned planes, while effective, represent a suboptimal solution due to their restriction to specific disparity bins.
Future work will focus on developing new techniques to optimize planes across the entire disparity space while preventing clustering, potentially achieving even better results. |
single-view view synthesis, locally-learned planes, occlusion inference, self-attention mechanism, multiplane image (mpi) |
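The bin-and-offset parameterization described in this record can be sketched as follows. The sketch assumes uniform bins over the disparity range and offsets already squashed into [0, 1] (for example by a sigmoid). The function name and the example values are hypothetical, and the paper's actual sampler may differ.

```python
import numpy as np

def plane_disparities(offsets, d_min, d_max):
    """Map per-bin offsets in [0, 1] to plane disparities.

    offsets: array of shape (num_bins, planes_per_bin) with values in [0, 1],
             e.g. sigmoid outputs of a (hypothetical) disparity sampler network.
    """
    num_bins = offsets.shape[0]
    edges = np.linspace(d_min, d_max, num_bins + 1)   # uniform bin edges (assumption)
    starts = edges[:-1, None]
    widths = np.diff(edges)[:, None]
    return starts + offsets * widths                  # each plane stays inside its own bin

offsets = np.array([[0.1, 0.9], [0.5, 0.5], [0.0, 1.0], [0.3, 0.7]])
planes = plane_disparities(offsets, d_min=0.01, d_max=1.0)
```

Because every plane is confined to its bin, the planes cannot all collapse onto one disparity. This is the same restriction the limitations field above calls suboptimal.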
2307.12101
Report |
Spatial Self-Distillation for Object Detection with Inaccurate Bounding Boxes |
Di Wu, Pengfei Chen, Xuehui Yu, Guorong Li, Zhenjun Han, Jianbin Jiao |
Object detection under inaccurate bounding-box supervision has attracted
broad interest due to the expense of high-quality annotation data or the
occasional inevitability of low annotation quality (e.g., tiny objects). The
previous works usually utilize multiple instance learning (MIL), which highly
depends on category information, to select and refine a low-quality box. Those
methods suffer from object drift, group prediction and part domination problems
without exploring spatial information. In this paper, we heuristically propose
a Spatial Self-Distillation based Object Detector (SSD-Det) to mine
spatial information to refine the inaccurate box in a self-distillation
fashion. SSD-Det utilizes a Spatial Position Self-Distillation (SPSD)
module to exploit spatial information and an interactive structure to combine
spatial information and category information, thus constructing a high-quality
proposal bag. To further improve the selection procedure, a Spatial Identity
Self-Distillation (SISD) module is introduced in SSD-Det to obtain
spatial confidence to help select the best proposals. Experiments on MS-COCO
and VOC datasets with noisy box annotation verify our method's effectiveness
and achieve state-of-the-art performance. The code is available at
https://github.com/ucas-vg/PointTinyBenchmark/tree/SSD-Det. |
This paper proposes SSD-Det, a Spatial Self-Distillation based object detector that addresses the challenge of training object detectors with inaccurate bounding box annotations. |
Training object detectors typically requires large amounts of accurately annotated data, which is expensive and time-consuming. Inaccurate annotations are common, especially with automated labeling techniques. This paper addresses this challenge by developing a method robust to such inaccuracies. |
SSD-Det leverages spatial information through two novel modules: Spatial Position Self-Distillation (SPSD) and Spatial Identity Self-Distillation (SISD). SPSD refines proposal bag construction by learning semantic-spatial correspondence, while SISD improves proposal selection by predicting object-aware IoU. |
SSD-Det significantly outperforms state-of-the-art methods on MS-COCO and VOC datasets with various noise levels.
SPSD effectively improves proposal bag quality, leading to a higher upper bound for proposal selection.
SISD successfully integrates object-relevant spatial confidence, improving proposal selection accuracy. |
The current implementation of SSD-Det is limited to two-stage object detectors.
Future work can explore the extension of SSD-Det to other detection frameworks and tasks, such as instance segmentation. |
object detection, noisy annotations, self-distillation, spatial information, proposal refinement |
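The summary above says SISD predicts an object-aware IoU to rank proposals. For reference, the sketch below shows the plain geometric IoU between two boxes and a selection step that keeps the highest-scoring proposal. In SSD-Det the score is predicted by a network, so using a ready-made score list here is an illustrative assumption, and both function names are hypothetical.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def select_best(proposals, scores):
    """Return the proposal with the highest (predicted) spatial confidence."""
    best_index = max(range(len(proposals)), key=lambda i: scores[i])
    return proposals[best_index]

proposals = [(0, 0, 2, 2), (1, 1, 3, 3), (0, 0, 3, 3)]
best = select_best(proposals, scores=[0.4, 0.7, 0.9])
```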
2307.12027
Report |
On the Effectiveness of Spectral Discriminators for Perceptual Quality Improvement |
Xin Luo, Yunan Zhu, Shunxin Xu, Dong Liu |
Several recent studies advocate the use of spectral discriminators, which
evaluate the Fourier spectra of images for generative modeling. However, the
effectiveness of the spectral discriminators is not well interpreted yet. We
tackle this issue by examining the spectral discriminators in the context of
perceptual image super-resolution (i.e., GAN-based SR), as SR image quality is
susceptible to spectral changes. Our analyses reveal that the spectral
discriminator indeed performs better than the ordinary (a.k.a. spatial)
discriminator in identifying the differences in the high-frequency range;
however, the spatial discriminator holds an advantage in the low-frequency
range. Thus, we suggest that the spectral and spatial discriminators shall be
used simultaneously. Moreover, we improve the spectral discriminators by first
calculating the patch-wise Fourier spectrum and then aggregating the spectra by
Transformer. We verify the effectiveness of the proposed method twofold. On the
one hand, thanks to the additional spectral discriminator, our obtained SR
images have their spectra better aligned to those of the real images, which
leads to a better PD tradeoff. On the other hand, our ensembled discriminator
predicts the perceptual quality more accurately, as evidenced in the
no-reference image quality assessment task. |
This paper analyzes the effectiveness of spectral discriminators for improving perceptual quality in GAN-based image super-resolution and proposes using spatial and spectral discriminators in combination. |
Spectral discriminators, which analyze images in the frequency domain, have been proposed to address spectral discrepancies between generated and real images, but their effectiveness remains unclear. |
The authors analyze the robustness of spatial and spectral discriminators under frequency perturbations, revealing their complementary strengths. They propose a Dual Transformer discriminator combining a Spatial Transformer and a Spectral Transformer with a per-patch Fourier Transform. |
Spectral discriminators excel at identifying high-frequency noise, complementing spatial discriminators' strength in detecting low-frequency deficiencies.
Combining spatial and spectral discriminators in a Dual Transformer discriminator leads to better spectral alignment between generated and real images, improving perceptual quality in super-resolution.
The Dual Transformer discriminator demonstrates superior performance in no-reference image quality assessment compared to spatial discriminators alone. |
Better-aligned spectra don't always guarantee improved perceptual quality.
Training spatial and spectral discriminators separately increases computational overhead. |
image super-resolution, generative adversarial networks, spectral discriminators, perceptual quality, frequency analysis |
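The patch-wise Fourier spectrum step described in this record can be sketched with NumPy. The sketch covers only the per-patch transform, which yields one spectrum per patch; the Transformer that aggregates those spectra is omitted. The log-magnitude representation and the function name are assumptions, not details taken from the summary.

```python
import numpy as np

def patchwise_spectrum(image, patch):
    """Split a 2D image into non-overlapping patches and return log-magnitude spectra.

    Returns an array of shape (num_patches, patch, patch): one spectrum per patch.
    """
    h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly into patches"
    # (h, w) -> (rows, cols, patch, patch) -> (num_patches, patch, patch)
    tiles = image.reshape(h // patch, patch, w // patch, patch).swapaxes(1, 2)
    tiles = tiles.reshape(-1, patch, patch)
    # fft2 acts on the last two axes; fftshift centers the zero-frequency term.
    spectra = np.fft.fftshift(np.fft.fft2(tiles), axes=(-2, -1))
    return np.log1p(np.abs(spectra))

tokens = patchwise_spectrum(np.ones((32, 32)), patch=8)
```

For the constant image used here, all of the energy of each patch sits in the centered zero-frequency bin.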
2307.11978
Report |
Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels? |
Cheng-En Wu, Yu Tian, Haichao Yu, Heng Wang, Pedro Morgado, Yu Hen Hu, Linjie Yang |
Vision-language models such as CLIP learn a generic text-image embedding from
large-scale training data. A vision-language model can be adapted to a new
classification task through few-shot prompt tuning. We find that such a prompt
tuning process is highly robust to label noises. This intrigues us to study the
key reasons contributing to the robustness of the prompt tuning paradigm. We
conducted extensive experiments to explore this property and find the key
factors are: 1) the fixed classname tokens provide a strong regularization to
the optimization of the model, reducing gradients induced by the noisy samples;
2) the powerful pre-trained image-text embedding that is learned from diverse
and generic web data provides strong prior knowledge for image classification.
Further, we demonstrate that noisy zero-shot predictions from CLIP can be used
to tune its own prompt, significantly enhancing prediction accuracy in the
unsupervised setting. The code is available at https://github.com/CEWu/PTNL. |
This paper discovers and analyzes the surprising robustness of prompt tuning for vision-language models (e.g., CLIP) against noisy labels, outperforming traditional transfer learning methods like fine-tuning and linear probes. |
Learning with noisy labels is crucial for real-world applications where perfectly annotated data is scarce. This study reveals the robustness of prompt tuning, a data-efficient method, in handling such imperfect data. |
The authors conduct extensive experiments on various datasets, comparing prompt tuning with linear probes and fine-tuning under different noise levels and types. They analyze the impact of different components like class embeddings, learnable prompts, and robust loss functions (GCE) on the model's performance. |
Prompt tuning of CLIP demonstrates significantly higher robustness to noisy labels compared to fine-tuning or linear probing methods.
The fixed classname tokens within the prompt, along with CLIP's pre-trained text encoder, provide strong regularization, preventing overfitting to noisy data.
Leveraging this robustness, a novel unsupervised prompt tuning approach is proposed, utilizing randomly sampled pseudo labels to enhance CLIP zero-shot performance. |
The study primarily focuses on CLIP, leaving the exploration of other vision-language models for future work.
Investigating the impact of varying prompt lengths and exploring a wider range of robust loss functions beyond GCE could further enhance the understanding of noise robustness in prompt tuning. |
prompt tuning, vision-language models, noisy labels, clip, unsupervised learning |
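The summary above names generalized cross-entropy (GCE) as the robust loss examined. As it is commonly defined, GCE for a sample with true-class probability p_y is (1 - p_y^q) / q with q in (0, 1]. The sketch below implements that definition; the default q = 0.7 is an assumed value, not one reported in this record.

```python
import numpy as np

def gce_loss(probs, labels, q=0.7):
    """Generalized cross-entropy: mean over samples of (1 - p_y**q) / q.

    probs:  (num_samples, num_classes) predicted class probabilities.
    labels: (num_samples,) integer class labels, possibly noisy.
    """
    p_y = probs[np.arange(len(labels)), labels]   # probability given to the labeled class
    return np.mean((1.0 - p_y ** q) / q)

probs = np.array([[0.9, 0.1], [0.2, 0.8]])
labels = np.array([0, 1])
loss = gce_loss(probs, labels)
```

Setting q = 1 gives the mean absolute error, and letting q approach 0 recovers ordinary cross-entropy, so q trades noise robustness against ease of fitting.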
2307.11932
Report |
RIC: Rotate-Inpaint-Complete for Generalizable Scene Reconstruction |
Isaac Kasahara, Shubham Agrawal, Selim Engin, Nikhil Chavan-Dafle, Shuran Song, Volkan Isler |
General scene reconstruction refers to the task of estimating the full 3D
geometry and texture of a scene containing previously unseen objects. In many
practical applications such as AR/VR, autonomous navigation, and robotics, only
a single view of the scene may be available, making the scene reconstruction
task challenging. In this paper, we present a method for scene reconstruction
by structurally breaking the problem into two steps: rendering novel views via
inpainting and 2D to 3D scene lifting. Specifically, we leverage the
generalization capability of large visual language models (DALL-E 2) to inpaint
the missing areas of scene color images rendered from different views. Next, we
lift these inpainted images to 3D by predicting normals of the inpainted image
and solving for the missing depth values. By predicting normals instead of
depth directly, our method allows for robustness to changes in depth
distributions and scale. With rigorous quantitative evaluation, we show that
our method outperforms multiple baselines while providing generalization to
novel objects and scenes. |
This paper introduces Rotate-Inpaint-Complete (RIC), a novel method for 3D scene reconstruction from a single RGB-D image, leveraging the inpainting capabilities of large visual language models to handle novel objects and scenes. |
Reconstructing complete 3D scenes from limited viewpoints is crucial for various applications like AR/VR, robotics, and autonomous navigation. Existing methods struggle with novel objects and cluttered scenes. |
RIC generates novel views by rotating the input image, uses DALL-E for inpainting missing regions, predicts surface normals and occlusion boundaries from inpainted images, and optimizes depth using these predictions. Finally, it filters inconsistencies across viewpoints to refine the 3D reconstruction. |
RIC outperforms baselines in 3D scene reconstruction metrics on both in-distribution (YCB-V) and out-of-distribution (HOPE) datasets, demonstrating generalizability.
The method shows robustness to prompt specificity, indicating the effectiveness of view selection in preserving sufficient context for inpainting.
Qualitative results highlight RIC's ability to generate realistic novel views and complete scene geometry, even for heavily occluded objects. |
DALL-E's tendency to generate unrealistic elements can impact reconstruction quality, although mitigated through consistency filtering.
Reconstructing the backside of objects remains challenging due to limited context at large viewpoints. |
3d scene reconstruction, single-view reconstruction, dall-e, inpainting, novel view synthesis |
2307.11828
Report |
Enhancing Your Trained DETRs with Box Refinement |
Yiqun Chen, Qiang Chen, Peize Sun, Shoufa Chen, Jingdong Wang, Jian Cheng |
We present a conceptually simple, efficient, and general framework for
localization problems in DETR-like models. We add plugins to well-trained
models instead of inefficiently designing new models and training them from
scratch. The method, called RefineBox, refines the outputs of DETR-like
detectors by lightweight refinement networks. RefineBox is easy to implement
and train as it only leverages the features and predicted boxes from the
well-trained detection models. Our method is also efficient as we freeze the
trained detectors during training. In addition, we can easily generalize
RefineBox to various trained detection models without any modification. We
conduct experiments on COCO and LVIS 1.0. Experimental results indicate the
effectiveness of our RefineBox for DETR and its representative variants (Figure
1). For example, the performance gains for DETR, Conditional-DETR, DAB-DETR, and
DN-DETR are 2.4 AP, 2.5 AP, 1.9 AP, and 1.6 AP, respectively. We hope our work
will bring the attention of the detection community to the localization
bottleneck of current DETR-like models and highlight the potential of the
RefineBox framework. Code and models will be publicly available at:
https://github.com/YiqunChen1999/RefineBox. |
This paper introduces RefineBox, a novel framework to enhance the localization accuracy of pre-trained DETR-like object detectors by adding a lightweight refinement network. |
The authors identify localization accuracy as the bottleneck in DETR-like detectors, hindering further performance improvement even with perfect classification. |
RefineBox leverages the feature pyramid network from the pre-trained detector's backbone and refines predicted bounding boxes using a series of Refiner modules. Crucially, the original detector's parameters are frozen during training. |
RefineBox consistently improves Average Precision (AP) across various DETR-like models on COCO and LVIS datasets.
Significant gains are observed in AP75 and AP for small objects, highlighting improved localization accuracy.
The framework is lightweight, adding minimal parameters and FLOPs, making it efficient for training and inference. |
The simple design of the current refinement network may limit further performance gains. Exploring more sophisticated architectures is left for future work.
The paper mainly focuses on improving localization. Investigating the impact of refining classification jointly with localization is a promising direction. |
object detection, detr, localization accuracy, refinement network, two-stage detection |
2307.11661
Report |
Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts |
Mayug Maniparambil, Chris Vorster, Derek Molloy, Noel Murphy, Kevin McGuinness, Noel E. O'Connor |
Contrastive pretrained large Vision-Language Models (VLMs) like CLIP have
revolutionized visual representation learning by providing good performance on
downstream datasets. VLMs are 0-shot adapted to a downstream dataset by
designing prompts that are relevant to the dataset. Such prompt engineering
makes use of domain expertise and a validation dataset. Meanwhile, recent
developments in generative pretrained models like GPT-4 mean they can be used
as advanced internet search tools. They can also be manipulated to provide
visual information in any structure. In this work, we show that GPT-4 can be
used to generate text that is visually descriptive and how this can be used to
adapt CLIP to downstream tasks. We show considerable improvements in 0-shot
transfer accuracy on specialized fine-grained datasets like EuroSAT (~7%), DTD
(~7%), SUN397 (~4.6%), and CUB (~3.3%) when compared to CLIP's default prompt.
We also design a simple few-shot adapter that learns to choose the best
possible sentences to construct generalizable classifiers that outperform the
recently proposed CoCoOP by ~2% on average and by over 4% on 4 specialized
fine-grained datasets. The code, prompts, and auxiliary text dataset are
available at https://github.com/mayug/VDT-Adapter. |
This paper proposes a novel method to enhance CLIP's zero-shot and few-shot domain adaptation capabilities by leveraging GPT-4 generated visually descriptive text (VDT) information. |
This approach addresses the limitations of prompt engineering, which relies on domain expertise and often yields inconsistent results due to prompt sensitivity. Using VDT provides richer semantic information for CLIP, leading to more accurate and generalizable classification. |
The methodology involves two main stages: 1) **VDT generation:** GPT-4 is prompted to generate detailed visual descriptions for each class in a dataset. 2) **CLIP adaptation:** For zero-shot transfer, VDT is incorporated into prompt ensembles. For few-shot transfer, a lightweight adapter network (CLIP-A-self) with self-attention is trained to selectively aggregate the most relevant VDT, improving classification accuracy on unseen classes. |
GPT-4 generated VDT significantly improves CLIP's 0-shot performance on 12 diverse datasets, with an average gain of 2% and even larger improvements on fine-grained datasets like EuroSAT (7%), DTD (7%), and CUB (3.3%).
CLIP-A-self, utilizing VDT, outperforms existing few-shot methods like CoCoOp by 3% on average in the Base-to-New setting, demonstrating better generalization ability.
Analysis of attention weights reveals that CLIP-A-self effectively learns to prioritize visually relevant VDT sentences, contributing to its superior performance. |
While GPT-4 provides high-quality VDT, its dependence on a paid API might pose scalability constraints.
Future work can explore incorporating other modalities like object detection or scene graphs to further enrich CLIP's understanding, potentially leading to even better performance. |
vision-language models, clip, gpt-4, prompt engineering, few-shot learning |
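The zero-shot prompt-ensemble step in this record can be sketched as follows: each class is represented by the normalized mean of its sentence embeddings, and an image is assigned to the class with the highest cosine similarity. The embeddings below are random stand-ins; in practice they would come from CLIP's text and image encoders, which are not called here. All function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def ensemble_classifier(text_embs):
    """(num_classes, num_sentences, dim) sentence embeddings -> (num_classes, dim) class weights."""
    return l2_normalize(l2_normalize(text_embs).mean(axis=1))

def zero_shot_predict(image_embs, class_weights):
    """Pick the class whose ensembled text embedding has the highest cosine similarity."""
    return (l2_normalize(image_embs) @ class_weights.T).argmax(axis=1)

rng = np.random.default_rng(0)
# Stand-in data: 3 classes, 5 descriptive sentences per class, 16-dimensional embeddings.
text_embs = np.eye(3, 16)[:, None, :] + 0.05 * rng.standard_normal((3, 5, 16))
image_embs = np.eye(3, 16)
preds = zero_shot_predict(image_embs, ensemble_classifier(text_embs))
```

The few-shot adapter described above replaces the plain mean with learned self-attention weights over sentences. That part is not shown here.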
2307.11558
Report |
Advancing Visual Grounding with Scene Knowledge: Benchmark and Method |
Zhihong Chen, Ruifei Zhang, Yibing Song, Xiang Wan, Guanbin Li |
Visual grounding (VG) aims to establish fine-grained alignment between vision
and language. Ideally, it can be a testbed for vision-and-language models to
evaluate their understanding of the images and texts and their reasoning
abilities over their joint space. However, most existing VG datasets are
constructed using simple description texts, which do not require sufficient
reasoning over the images and texts. This has been demonstrated in a recent
study [luo2022goes], where a simple LSTM-based text encoder without
pretraining can achieve state-of-the-art performance on mainstream VG datasets.
Therefore, in this paper, we propose a novel benchmark of Scene
Knowledge-guided Visual Grounding (SK-VG),
where the image content and referring expressions are not sufficient to ground
the target objects, forcing the models to have a reasoning ability on the
long-form scene knowledge. To perform this task, we propose two approaches to
accept the triple-type input, where the former embeds knowledge into the image
features before the image-query interaction; the latter leverages linguistic
structure to assist in computing the image-text matching. We conduct extensive
experiments to analyze the above methods and show that the proposed approaches
achieve promising results but still leave room for improvement, including
performance and interpretability. The dataset and code are available at
https://github.com/zhjohnchan/SK-VG. |
This paper introduces SK-VG, a new benchmark dataset for visual grounding that requires models to reason over image, scene knowledge, and query triples for accurate object localization. |
Existing visual grounding datasets lack complex language understanding and reasoning challenges, failing to evaluate the full reasoning capabilities of vision-language models. SK-VG addresses this by incorporating scene knowledge, forcing models to reason beyond simple visual descriptions. |
The authors construct SK-VG with real images, manually annotated with scene stories and referring expressions. Two approaches, KeViLI (one-stage knowledge embedding) and LeViLM (two-stage linguistic-enhanced matching), are proposed to address this new task. |
LeViLM significantly outperforms traditional visual grounding models on SK-VG, demonstrating the effectiveness of incorporating scene knowledge.
Fine-tuning LeViLM on SK-VG substantially improves performance compared to zero-shot or linear probing, highlighting the importance of model adaptation for this task.
While LeViLM achieves decent results on easy/medium difficulty levels, it struggles with the hard split, indicating the need for further research in multi-hop reasoning and model interpretability. |
Scene knowledge annotation can be subjective and biased due to the creative nature of the task and annotator differences.
SK-VG is smaller than existing datasets because its annotation process is complex and time-consuming. |
visual grounding, scene knowledge, reasoning, benchmark dataset, vision-language |
2307.11458
Report |
Strip-MLP: Efficient Token Interaction for Vision MLP |
Guiping Cao, Shengda Luo, Wenjian Huang, Xiangyuan Lan, Dongmei Jiang, Yaowei Wang, Jianguo Zhang |
Token interaction operation is one of the core modules in MLP-based models to
exchange and aggregate information between different spatial locations.
However, the power of token interaction on the spatial dimension is highly
dependent on the spatial resolution of the feature maps, which limits the
model's expressive ability, especially in deep layers where the features are
down-sampled to a small spatial size. To address this issue, we present a novel
method called Strip-MLP to enrich the token interaction power in three
ways. Firstly, we introduce a new MLP paradigm called Strip MLP layer that
allows the token to interact with other tokens in a cross-strip manner,
enabling the tokens in a row (or column) to contribute to the information
aggregations in adjacent but different strips of rows (or columns). Secondly, a
Cascade Group Strip Mixing Module
(CGSMM) is proposed to overcome the performance degradation caused by small
spatial feature size. The module allows tokens to interact more effectively in
both within-patch and cross-patch manners, independent of the
feature spatial size. Finally, based on the Strip MLP layer, we propose a novel
Local Strip Mixing Module (LSMM) to boost
the token interaction power in the local region. Extensive experiments
demonstrate that Strip-MLP significantly improves the performance of MLP-based
models on small datasets and obtains comparable or even better results on
ImageNet. In particular, Strip-MLP models achieve higher average Top-1 accuracy
than existing MLP-based models by +2.44% on Caltech-101 and +2.16% on
CIFAR-100. The source code will be available
at https://github.com/Med-Process/Strip_MLP. |
This paper proposes Strip-MLP, an efficient vision MLP model that enriches token interaction power through a novel Strip MLP layer, Cascade Group Strip Mixing Module (CGSMM), and Local Strip Mixing Module (LSMM). |
Existing MLP-based models suffer from degraded token interaction power, especially in deep layers with down-sampled feature maps, limiting their expressive ability. |
Strip MLP layer aggregates adjacent tokens in a cross-strip manner. CGSMM enables effective token interaction within and across channel-wise patches. LSMM enhances local token interactions. |
Strip-MLP significantly outperforms previous MLP-based models on small datasets like Caltech-101 and CIFAR-100.
It achieves comparable or superior results on ImageNet-1K with fewer parameters and FLOPs than other MLP, CNN, and Transformer models.
Ablation studies demonstrate the effectiveness of each proposed component. |
Optimal patch number in CGSMM depends on dataset scale and requires validation.
Future work includes exploring the application of the Strip MLP layer to other vision tasks. |
vision mlp, token interaction, image classification, efficient model, strip mlp |
2307.11418
Report |
FaceCLIPNeRF: Text-driven 3D Face Manipulation using Deformable Neural Radiance Fields |
Sungwon Hwang, Junha Hyung, Daejin Kim, Min-Jung Kim, Jaegul Choo |
As recent advances in Neural Radiance Fields (NeRF) have enabled
high-fidelity 3D face reconstruction and novel view synthesis, its manipulation
also became an essential task in 3D vision. However, existing manipulation
methods require extensive human labor, such as a user-provided semantic mask
and manual attribute search unsuitable for non-expert users. Instead, our
approach is designed to require a single text to manipulate a face
reconstructed with NeRF. To do so, we first train a scene manipulator, a latent
code-conditional deformable NeRF, over a dynamic scene to control a face
deformation using the latent code. However, representing a scene deformation
with a single latent code is unfavorable for compositing local deformations
observed in different instances. Therefore, our proposed Position-conditional
Anchor Compositor (PAC) learns to represent a manipulated scene with spatially
varying latent codes. Their renderings with the scene manipulator are then
optimized to yield high cosine similarity to a target text in CLIP embedding
space for text-driven manipulation. To the best of our knowledge, our approach
is the first to address the text-driven manipulation of a face reconstructed
with NeRF. Extensive results, comparisons, and ablation studies demonstrate the
effectiveness of our approach. |
Presents FaceCLIPNeRF, a method for text-driven 3D face manipulation using deformable neural radiance fields, enabling control of facial expressions using only text prompts. |
Existing 3D face manipulation methods with NeRF are labor-intensive, requiring manual input like semantic masks or attribute adjustments, making them unsuitable for non-expert users. This work addresses this by enabling manipulation with just a single text prompt. |
The method first trains a scene manipulator based on HyperNeRF to control facial deformations with latent codes. To overcome limitations in representing complex expressions, a Position-conditional Anchor Compositor (PAC) is introduced. This PAC learns to combine learned deformation anchors, enabling the representation of a manipulated scene with spatially varying latent codes. Finally, the rendered images are optimized to align with a target text's attributes in CLIP embedding space. |
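The two ideas in the method summary above (spatially varying latent codes composed from anchors, and a CLIP-space cosine objective) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the linear map `W` is a hypothetical stand-in for the PAC network, and the embeddings are assumed to come from some CLIP encoder that is not shown.

```python
import numpy as np

def compose_latent_codes(positions, anchors, W):
    """Blend K anchor codes into one latent code per 3D position.

    positions: (N, 3) sample positions
    anchors:   (K, D) learned anchor latent codes
    W:         (3, K) hypothetical linear stand-in for the PAC network
    """
    logits = positions @ W                                # (N, K)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)  # softmax over anchors
    return weights @ anchors                              # (N, D)

def clip_cosine_loss(image_emb, text_emb):
    """1 - cosine similarity between an image and a text embedding."""
    image_emb = np.asarray(image_emb, dtype=np.float64)
    text_emb = np.asarray(text_emb, dtype=np.float64)
    cos = image_emb @ text_emb / (
        np.linalg.norm(image_emb) * np.linalg.norm(text_emb)
    )
    return 1.0 - cos
```

Because the weights are a softmax, every composed code is a convex combination of the anchors, which keeps it inside the range of deformations seen during training.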
FaceCLIPNeRF successfully manipulates facial expressions using both descriptive and emotional text prompts.
The method outperforms baselines in quantitative metrics such as R-precision and LPIPS, demonstrating superior text reflectivity and visual quality.
User studies confirm that FaceCLIPNeRF effectively reflects target text attributes while preserving visual realism and face identity. |
The method relies on a pre-trained human segmentation network for excluding dynamic scene elements during camera pose estimation, potentially limiting generalizability.
Future work could explore expanding the range of manipulable facial attributes and improving the fine-grained control over facial features. |
3d face manipulation, neural radiance fields, text-driven manipulation, deformable nerf, clip |
2307.11342
Report |
Tuning Pre-trained Model via Moment Probing |
Mingze Gao, Qilong Wang, Zhenyi Lin, Pengfei Zhu, Qinghua Hu, Jingbo Zhou |
Recently, efficient fine-tuning of large-scale pre-trained models has
attracted increasing research interest, where linear probing (LP) as a
fundamental module is involved in exploiting the final representations for
task-dependent classification. However, most of the existing methods focus on
how to effectively introduce a few learnable parameters, and little work
pays attention to the commonly used LP module. In this paper, we propose a
novel Moment Probing (MP) method to further explore the potential of LP.
Distinguished from LP which builds a linear classification head based on the
mean of final features (e.g., word tokens for ViT) or classification tokens,
our MP performs a linear classifier on feature distribution, which provides
stronger representation ability by exploiting richer statistical information
inherent in features. Specifically, we represent feature distribution by its
characteristic function, which is efficiently approximated by using first- and
second-order moments of features. Furthermore, we propose a multi-head
convolutional cross-covariance (MHC$^3$) to compute second-order moments in an
efficient and effective manner. By considering that MP could affect feature
learning, we introduce a partially shared module to learn two recalibrating
parameters (PSRP) for backbones based on MP, namely MP$_{+}$. Extensive
experiments on ten benchmarks using various models show that our MP
significantly outperforms LP and is competitive with counterparts at less
training cost, while our MP$_{+}$ achieves state-of-the-art performance. |
This paper proposes Moment Probing (MP), a novel method for fine-tuning large pre-trained models that outperforms linear probing (LP) by leveraging feature distribution for classification. |
Existing efficient fine-tuning methods primarily focus on introducing learnable parameters while overlooking the potential of the commonly used LP module. This paper addresses this gap by enhancing the representation power of LP for improved performance. |
MP models feature distribution using the characteristic function, approximating it with first and second-order moments. A multi-head convolutional cross-covariance (MHC³) method efficiently computes second-order moments. Furthermore, a partially shared module (PSRP) is introduced to learn recalibrating parameters for the backbone, resulting in MP₊. |
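To make the difference between LP and MP concrete, the sketch below builds a feature vector from both the first- and second-order moments of the final tokens. It uses a plain covariance matrix as the second-order moment, which is an assumption made for illustration; it does not implement the multi-head convolutional cross-covariance (MHC³) or PSRP.

```python
import numpy as np

def moment_features(tokens):
    """Concatenate first- and second-order moments of token features.

    tokens: (N, D) final-layer token features for one image.
    Returns a vector of length D + D*(D+1)/2.
    """
    tokens = np.asarray(tokens, dtype=np.float64)
    mean = tokens.mean(axis=0)                      # first-order moment, (D,)
    centered = tokens - mean
    cov = centered.T @ centered / tokens.shape[0]   # second-order moment, (D, D)
    upper = np.triu_indices(cov.shape[0])           # covariance is symmetric
    return np.concatenate([mean, cov[upper]])

def linear_head(features, weight, bias):
    """Linear classifier applied to the moment features."""
    return features @ weight + bias
```

Linear probing would feed only `mean` to `linear_head`; adding the covariance terms is what gives the classifier access to richer statistics.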
MP consistently outperforms LP and achieves competitive or better performance than existing parameter-efficient methods at a lower training cost.
MP generalizes well across pre-training strategies, few-shot settings, and out-of-distribution datasets.
MP₊, incorporating feature learning, surpasses full fine-tuning and other efficient methods, achieving state-of-the-art performance. |
The paper primarily focuses on classification tasks; future work could explore MP's applicability to other tasks like prompt learning.
Further investigation into the theoretical properties and limitations of MHC³ is warranted. |
fine-tuning, linear probing, parameter-efficient learning, transfer learning, moment probing |
2307.11335
Report |
Tri-MipRF: Tri-Mip Representation for Efficient Anti-Aliasing Neural Radiance Fields |
Wenbo Hu, Yuling Wang, Lin Ma, Bangbang Yang, Lin Gao, Xiao Liu, Yuewen Ma |
Despite the tremendous progress in neural radiance fields (NeRF), we still
face a dilemma of the trade-off between quality and efficiency, e.g., MipNeRF
presents fine-detailed and anti-aliased renderings but takes days for training,
while Instant-ngp can accomplish the reconstruction in a few minutes but
suffers from blurring or aliasing when rendering at various distances or
resolutions due to ignoring the sampling area. To this end, we propose a novel
Tri-Mip encoding that enables both instant reconstruction and anti-aliased
high-fidelity rendering for neural radiance fields. The key is to factorize the
pre-filtered 3D feature spaces in three orthogonal mipmaps. In this way, we can
efficiently perform 3D area sampling by taking advantage of 2D pre-filtered
feature maps, which significantly elevates the rendering quality without
sacrificing efficiency. To cope with the novel Tri-Mip representation, we
propose a cone-casting rendering technique to efficiently sample anti-aliased
3D features with the Tri-Mip encoding considering both pixel imaging and
observing distance. Extensive experiments on both synthetic and real-world
datasets demonstrate our method achieves state-of-the-art rendering quality and
reconstruction speed while maintaining a compact representation that reduces
model size by 25% compared with Instant-ngp. |
This paper proposes Tri-MipRF, a novel neural radiance field representation that enables both instant reconstruction and anti-aliased high-fidelity rendering. |
Existing NeRF methods face a trade-off between quality and efficiency, struggling to achieve both high-quality anti-aliased renderings and fast reconstruction. |
The method introduces a novel Tri-Mip encoding that factorizes pre-filtered 3D feature spaces into three orthogonal mipmaps, allowing efficient 3D area sampling using 2D feature maps. It also employs a cone-casting rendering technique with adaptive sphere sampling based on pixel imaging and distance, coupled with a hybrid volume-surface rendering strategy for real-time performance. |
Tri-MipRF achieves state-of-the-art rendering quality with fine details and reduced aliasing on multi-scale Blender datasets.
The method achieves fast reconstruction within five minutes on a single GPU, comparable to Instant-ngp but with superior rendering quality.
Tri-MipRF maintains a compact representation, reducing the model size by 25% compared to Instant-ngp. |
The reliance on proxy mesh generation for real-time rendering introduces additional steps and may impact performance for complex scenes.
Exploration of alternative rendering strategies beyond hybrid volume-surface rendering could further enhance efficiency. |
neural radiance fields, anti-aliasing, mipmap, cone casting, real-time rendering |
2307.11308
Report |
DPM-OT: A New Diffusion Probabilistic Model Based on Optimal Transport |
Zezeng Li, ShengHao Li, Zhanpeng Wang, Na Lei, Zhongxuan Luo, Xianfeng Gu |
Sampling from diffusion probabilistic models (DPMs) can be viewed as a
piecewise distribution transformation, which generally requires hundreds or
thousands of steps of the inverse diffusion trajectory to get a high-quality
image. Recent progress in designing fast samplers for DPMs achieves a trade-off
between sampling speed and sample quality by knowledge distillation or
adjusting the variance schedule or the denoising equation. However, these samplers cannot be
optimal in both aspects and often suffer from mode mixture in short steps. To
tackle this problem, we innovatively regard inverse diffusion as an optimal
transport (OT) problem between latents at different stages and propose the
DPM-OT, a unified learning framework for fast DPMs with a direct expressway
represented by OT map, which can generate high-quality samples within around 10
function evaluations. By calculating the semi-discrete optimal transport map
between the data latents and the white noise, we obtain an expressway from the
prior distribution to the data distribution, while significantly alleviating
the problem of mode mixture. In addition, we give the error bound of the
proposed method, which theoretically guarantees the stability of the algorithm.
Extensive experiments validate the effectiveness and advantages of DPM-OT in
terms of speed and quality (FID and mode mixture), thus representing an
efficient solution for generative modeling. Source codes are available at
https://github.com/cognaclee/DPM-OT |
This paper introduces DPM-OT, a new diffusion probabilistic model for fast sampling that leverages optimal transport (OT) to build an expressway between latents at different stages of the inverse diffusion process. |
Existing fast DPMs often compromise sample quality or introduce mode mixture due to approximating a continuous diffusion process. DPM-OT addresses these limitations by utilizing OT. |
DPM-OT computes a semi-discrete optimal transport (SDOT) map between white noise and data latents at an intermediate diffusion step. This map acts as an expressway to quickly bring the noise to a near-perfect initial point for subsequent inverse diffusion, significantly reducing sampling steps. |
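A semi-discrete OT map sends each continuous noise sample to exactly one discrete target, which is why it avoids blending modes. The sketch below uses the standard form in which a sample is assigned to the target maximizing an inner product plus a per-target height. The `heights` vector is assumed to be given; the optimization that finds it is not shown, and this is not the authors' code.

```python
import numpy as np

def sdot_map(noise, targets, heights):
    """Assign each noise sample to one discrete target point.

    noise:   (B, D) samples from the prior
    targets: (M, D) discrete target latents
    heights: (M,)   per-target offsets (assumed already optimized)
    """
    scores = noise @ targets.T + heights   # (B, M)
    idx = scores.argmax(axis=1)            # winning target per sample
    return targets[idx], idx
```

The returned latents would then serve as the starting point for the remaining inverse-diffusion steps.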
DPM-OT generates high-quality images with fewer function evaluations compared to state-of-the-art models.
The proposed method effectively mitigates mode mixture, leading to more semantically meaningful samples.
Theoretical analysis proves that DPM-OT can fit the target data distribution no worse than traditional DPMs. |
A limitation is the storage requirement for noisy training samples at the intermediate diffusion step.
Future work includes extending DPM-OT to conditional image generation tasks. |
diffusion probabilistic models, optimal transport, fast sampling, mode mixture, generative modeling |
2307.11086
Report |
PAPR: Proximity Attention Point Rendering |
Yanshu Zhang, Shichong Peng, Alireza Moazeni, Ke Li |
Learning accurate and parsimonious point cloud representations of scene
surfaces from scratch remains a challenge in 3D representation learning.
Existing point-based methods often suffer from the vanishing gradient problem
or require a large number of points to accurately model scene geometry and
texture. To address these limitations, we propose Proximity Attention Point
Rendering (PAPR), a novel method that consists of a point-based scene
representation and a differentiable renderer. Our scene representation uses a
point cloud where each point is characterized by its spatial position,
influence score, and view-independent feature vector. The renderer selects the
relevant points for each ray and produces accurate colours using their
associated features. PAPR effectively learns point cloud positions to represent
the correct scene geometry, even when the initialization drastically differs
from the target geometry. Notably, our method captures fine texture details
while using only a parsimonious set of points. We also demonstrate four
practical applications of our method: zero-shot geometry editing, object
manipulation, texture transfer, and exposure control. More results and code are
available on our project website at https://zvict.github.io/papr/. |
This paper introduces Proximity Attention Point Rendering (PAPR), a novel method for learning and rendering parsimonious 3D point cloud scene representations directly from multi-view RGB images. |
Learning accurate and concise representations of 3D scenes is crucial for various applications in computer vision and graphics, including novel view synthesis, scene editing, and virtual reality. Existing methods often struggle to balance representation capacity and computational complexity. |
PAPR leverages a point cloud representation, where each point is defined by its position, influence score, and a view-independent feature vector. It utilizes a differentiable renderer with a ray-dependent point embedding and proximity attention to select relevant points and combine their features for generating high-quality images. |
PAPR effectively learns accurate geometry and texture details from scratch, even with random point cloud initialization.
It outperforms prior point-based and volumetric methods in terms of image quality while using a parsimonious point set.
PAPR enables several practical applications, including zero-shot geometry editing, object manipulation, texture transfer, and exposure control. |
The current pruning strategy assumes a near-constant background color, limiting its applicability in complex backgrounds.
Future work can explore learning a separate model to handle background variations and extend the method's capability with more points and deeper networks. |
3d scene representation, point cloud rendering, neural rendering, differentiable rendering, proximity attention |
2307.11073
Report |
OBJECT 3DIT: Language-guided 3D-aware Image Editing |
Oscar Michel, Anand Bhattad, Eli VanderBilt, Ranjay Krishna, Aniruddha Kembhavi, Tanmay Gupta |
Existing image editing tools, while powerful, typically disregard the
underlying 3D geometry from which the image is projected. As a result, edits
made using these tools may become detached from the geometry and lighting
conditions that are at the foundation of the image formation process. In this
work, we formulate the new task of language-guided 3D-aware editing, where
objects in an image should be edited according to a language instruction in
the context of the underlying 3D scene. To promote progress towards this goal, we
release OBJECT: a dataset consisting of 400K editing examples created from
procedurally generated 3D scenes. Each example consists of an input image,
editing instruction in language, and the edited image. We also introduce 3DIT:
single and multi-task models for four editing tasks. Our models show impressive
abilities to understand the 3D composition of entire scenes, factoring in
surrounding objects, surfaces, lighting conditions, shadows, and
physically-plausible object configurations. Surprisingly, although the models are trained
only on synthetic scenes from OBJECT, the editing capabilities of 3DIT generalize to
real-world images. |
This work introduces a novel model, 3DIT, for 3D-aware language-guided image editing that considers scene context, including geometry, lighting, and object interactions. |
Existing image editing methods often fall short in maintaining 3D consistency, leading to unrealistic edits. 3DIT aims to address this gap by leveraging language instructions for object manipulation while preserving scene realism. |
The authors created a dataset, Objaverse Editing in Context (OEC), with 400k editing examples generated from 3D scenes. They fine-tuned a diffusion model on OEC for four tasks: object translation, rotation, insertion, and removal. |
3DIT outperforms baselines on quantitative metrics for realism and faithfulness.
Human evaluation shows a strong preference for 3DIT outputs, demonstrating superior geometric and lighting consistency.
The model exhibits promising generalization to real-world images despite being trained solely on synthetic data. |
Limitations: Current model is limited to single object manipulations; Real-world performance can be further improved.
Future work: Extend 3DIT to handle multiple object edits and more complex scene manipulations; Explore fine-tuning on real-world data. |
image editing, 3d-aware, language-guided, diffusion models, synthetic data |
2307.11035
Report |
Cascade-DETR: Delving into High-Quality Universal Object Detection |
Mingqiao Ye, Lei Ke, Siyuan Li, Yu-Wing Tai, Chi-Keung Tang, Martin Danelljan, Fisher Yu |
Object localization in general environments is a fundamental part of vision
systems. While dominating on the COCO benchmark, recent Transformer-based
detection methods are not competitive in diverse domains. Moreover, these
methods still struggle to estimate object bounding boxes accurately in
complex environments.
We introduce Cascade-DETR for high-quality universal object detection. We
jointly tackle the generalization to diverse domains and localization accuracy
by proposing the Cascade Attention layer, which explicitly integrates
object-centric information into the detection decoder by limiting the attention
to the previous box prediction. To further enhance accuracy, we also revisit
the scoring of queries. Instead of relying on classification scores, we predict
the expected IoU of the query, leading to substantially more well-calibrated
confidences. Lastly, we introduce a universal object detection benchmark,
UDB10, that contains 10 datasets from diverse domains. While also advancing the
state-of-the-art on COCO, Cascade-DETR substantially improves DETR-based
detectors on all datasets in UDB10, even by over 10 mAP in some cases. The
improvements under stringent quality requirements are even more pronounced. Our
code and models will be released at https://github.com/SysCV/cascade-detr. |
This paper proposes Cascade-DETR, a new DETR-based object detection model, for high-quality universal object detection. It tackles the generalization to diverse domains and localization accuracy of DETR-based detectors. |
Existing Transformer-based object detection methods, while achieving SOTA performance on COCO, show limitations in generalizing to diverse domains and achieving high accuracy in bounding box estimations. |
Cascade-DETR introduces two main components: (1) Cascade Attention: constrains cross-attention within iteratively refined predicted bounding boxes, injecting local object-centric prior. (2) IoU-aware Query Recalibration: predicts IoU of each query to recalibrate classification scores for better reflecting prediction quality. A new benchmark, UDB10, consisting of 10 datasets from diverse domains, is also introduced. |
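The Cascade Attention idea, restricting a query's cross-attention to the box it predicted in the previous layer, can be sketched for a single query as below. The helper names, the normalized box format, and the single-head dot-product attention are assumptions for illustration; the IoU-aware recalibration is not shown.

```python
import numpy as np

def box_mask(box, height, width):
    """Boolean (H*W,) mask of feature cells whose centers lie inside the box.

    box: (x0, y0, x1, y1) in normalized [0, 1] image coordinates.
    """
    ys = (np.arange(height) + 0.5) / height
    xs = (np.arange(width) + 0.5) / width
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    x0, y0, x1, y1 = box
    inside = (xx >= x0) & (xx <= x1) & (yy >= y0) & (yy <= y1)
    return inside.ravel()

def masked_cross_attention(query, keys, values, mask):
    """Single-query attention limited to positions where mask is True.

    Assumes the mask selects at least one position.
    """
    logits = keys @ query / np.sqrt(query.shape[0])  # (H*W,)
    logits = np.where(mask, logits, -np.inf)         # block positions outside the box
    logits = logits - logits[mask].max()             # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()
    return weights @ values, weights
```

In a cascade, the box predicted from this layer's output would define the mask for the next decoder layer.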
Cascade-DETR outperforms SOTA DETR-based methods on UDB10, improving UniAP by 5.7.
On COCO, Cascade-DETR achieves significant improvements, especially under strict IoU thresholds, indicating better bounding box accuracy.
Cascade-DETR consistently outperforms baselines across various domains in UDB10, even when fine-tuned from COCO pre-trained models, showcasing its generalizability. |
The paper assumes the availability of bounding box annotations for training across all datasets, which might not always be feasible in real-world scenarios.
Further exploration on incorporating different types of weak supervision, such as image-level tags or point annotations, for training Cascade-DETR is left for future work. |
object detection, transformers, detr, bounding box accuracy, generalization |
2307.10984
Report |
Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image |
Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu, Kaixuan Wang, Xiaozhi Chen, Chunhua Shen |
Reconstructing accurate 3D scenes from images is a long-standing vision task.
Due to the ill-posedness of the single-image reconstruction problem, most
well-established methods are built upon multi-view geometry. State-of-the-art
(SOTA) monocular metric depth estimation methods can only handle a single
camera model and are unable to perform mixed-data training due to the metric
ambiguity. Meanwhile, SOTA monocular methods trained on large mixed datasets
achieve zero-shot generalization by learning affine-invariant depths, which
cannot recover real-world metrics. In this work, we show that the key to a
zero-shot single-view metric depth model lies in the combination of large-scale
data training and resolving the metric ambiguity from various camera models. We
propose a canonical camera space transformation module, which explicitly
addresses the ambiguity problems and can be effortlessly plugged into existing
monocular models. Equipped with our module, monocular models can be stably
trained with over 8 million images with thousands of camera models, resulting
in zero-shot generalization to in-the-wild images with unseen camera settings.
Experiments demonstrate SOTA performance of our method on 7 zero-shot
benchmarks. Notably, our method won the championship in the 2nd Monocular Depth
Estimation Challenge. Our method enables the accurate recovery of metric 3D
structures on randomly collected internet images, paving the way for plausible
single-image metrology. The potential benefits extend to downstream tasks,
which can be significantly improved by simply plugging in our model. For
example, our model relieves the scale drift issues of monocular-SLAM (Fig. 1),
leading to high-quality metric scale dense mapping. The code is available at
https://github.com/YvanYin/Metric3D. |
This paper introduces Metric3D, a novel method for zero-shot metric 3D prediction from a single image, enabling accurate metric 3D reconstruction from in-the-wild images. |
Existing methods for single-image 3D reconstruction either rely on object-specific priors or can only predict affine-invariant depths, lacking real-world metric information crucial for applications like metrology and robotics. |
Metric3D addresses the metric ambiguity issue by introducing a canonical camera space transformation module (CSTM). This module transforms training data to a canonical camera space, allowing the model to learn metric depth information across diverse camera settings. Additionally, it leverages a random proposal normalization loss (RPNL) to enhance depth accuracy by emphasizing local geometric details. |
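One way to read the canonical camera space transformation is as a rescaling of depth labels by the ratio of a fixed canonical focal length to the actual focal length, undone at inference time. The sketch below follows that reading; the function names and the default canonical focal length of 1000 pixels are assumptions for illustration, not values taken from the source.

```python
import numpy as np

def to_canonical_depth(depth, focal, canonical_focal=1000.0):
    """Scale metric depth into the canonical camera space (used for training labels)."""
    return np.asarray(depth, dtype=np.float64) * (canonical_focal / focal)

def from_canonical_depth(depth_canonical, focal, canonical_focal=1000.0):
    """Map a canonical-space prediction back to metric depth for the real camera."""
    return np.asarray(depth_canonical, dtype=np.float64) * (focal / canonical_focal)
```

With this convention, images taken with different focal lengths share one consistent depth scale during training, and the true focal length is needed only to convert predictions back.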
Metric3D achieves state-of-the-art performance on 7 zero-shot benchmarks, outperforming existing methods in terms of metric depth accuracy and generalization ability.
The method enables plausible single-image metrology, demonstrated by its ability to accurately measure object sizes in real-world images.
It significantly improves the performance of downstream tasks like monocular SLAM, enabling metric-scale dense mapping by providing accurate depth priors. |
The accuracy of the metric reconstruction relies on the availability and accuracy of camera intrinsic parameters.
Future work includes exploring ways to estimate camera intrinsics directly from images to further enhance the applicability of the method. |
3d reconstruction, metric depth estimation, zero-shot learning, single-image metrology, monocular slam |
2307.10854
Report |
BlendFace: Re-designing Identity Encoders for Face-Swapping |
Kaede Shiohara, Xingchao Yang, Takafumi Taketomi |
The great advancements of generative adversarial networks and face
recognition models in computer vision have made it possible to swap identities
on images from single sources. Although a lot of studies seem to have proposed
almost satisfactory solutions, we notice previous methods still suffer from an
identity-attribute entanglement that causes undesired attributes swapping
because widely used identity encoders, e.g., ArcFace, have some crucial attribute
biases owing to their pretraining on face recognition tasks. To address this
issue, we design BlendFace, a novel identity encoder for face-swapping. The key
idea behind BlendFace is that training face recognition models on blended images
whose attributes are replaced with those of another person mitigates inter-personal
biases such as hairstyles. BlendFace feeds disentangled identity features into
generators and guides generators properly as an identity loss function.
Extensive experiments demonstrate that BlendFace improves the
identity-attribute disentanglement in face-swapping models, maintaining a
comparable quantitative performance to previous methods. |
This paper introduces BlendFace, a novel identity encoder for face-swapping that mitigates identity-attribute entanglement, a common problem in existing models. |
Existing face-swapping models often exhibit identity-attribute entanglement due to biases in pre-trained identity encoders like ArcFace, leading to undesired attribute swapping (e.g., hairstyles). |
BlendFace is trained on blended images with swapped attributes, reducing bias towards specific features. It's then integrated into a face-swapping model as both the source feature extractor and identity loss function. |
BlendFace improves identity-attribute disentanglement in face-swapping, reducing unwanted attribute transfer.
It shows comparable or superior performance to state-of-the-art methods on FaceForensics++ in identity similarity and attribute preservation (expression, pose, gaze).
BlendFace enhances the visual consistency of swapped faces compared to existing models. |
BlendFace may not effectively handle large differences in face shapes between source and target images.
Preserving hard occlusions (e.g., hands) remains challenging due to limited training data. |
face swapping, identity encoder, generative adversarial networks, attribute disentanglement, face recognition |
2307.10829
Report |
Exact Diffusion Inversion via Bi-directional Integration Approximation |
Guoqiang Zhang, J. P. Lewis, W. Bastiaan Kleijn |
Recently, various methods have been proposed to address the inconsistency
issue of DDIM inversion to enable image editing, such as EDICT [36] and
Null-text inversion [22]. However, the above methods introduce considerable
computational overhead. In this paper, we propose a new technique, named
\emph{bi-directional integration approximation} (BDIA), to perform exact
diffusion inversion with negligible computational overhead. Suppose we would like
to estimate the next diffusion state $\boldsymbol{z}_{i-1}$ at timestep $t_i$
with the historical information $(i,\boldsymbol{z}_i)$ and
$(i+1,\boldsymbol{z}_{i+1})$. We first obtain the estimated Gaussian noise
$\hat{\boldsymbol{\epsilon}}(\boldsymbol{z}_i,i)$, and then apply the DDIM
update procedure twice for approximating the ODE integration over the next
time-slot $[t_i, t_{i-1}]$ in the forward manner and the previous time-slot
$[t_i, t_{i+1}]$ in the backward manner. The DDIM step for the previous
time-slot is used to refine the integration approximation made earlier when
computing $\boldsymbol{z}_i$. A nice property of BDIA-DDIM is that the update
expression for $\boldsymbol{z}_{i-1}$ is a linear combination of
$(\boldsymbol{z}_{i+1}, \boldsymbol{z}_i,
\hat{\boldsymbol{\epsilon}}(\boldsymbol{z}_i,i))$. This allows for exact
backward computation of $\boldsymbol{z}_{i+1}$ given $(\boldsymbol{z}_i,
\boldsymbol{z}_{i-1})$, thus leading to exact diffusion inversion. It is
demonstrated with experiments that (round-trip) BDIA-DDIM is particularly
effective for image editing. Our experiments further show that BDIA-DDIM
produces markedly better image sampling qualities than DDIM for text-to-image
generation.
BDIA can also be applied to improve the performance of other ODE solvers in
addition to DDIM. In our work, it is found that applying BDIA to the EDM
sampling procedure produces consistently better performance over four
pre-trained models. |
This paper proposes BDIA (bi-directional integration approximation), a novel technique for achieving exact diffusion inversion with DDIM, reducing computational overhead compared to methods like EDICT. |
Exact diffusion inversion is crucial for image editing applications in diffusion models, but existing methods introduce significant computational overhead. This work aims to address this limitation. |
BDIA-DDIM approximates ODE integration using both forward and backward DDIM updates at each timestep. This allows for expressing each diffusion state as a linear combination of previous states and estimated noise, enabling exact inversion without doubling NFEs. |
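The exact-inversion property follows from the update being linear in the oldest state. If the new state has the form `gamma * z_next + f(z_cur)`, where `f` collects every term that depends only on the current state (including the noise estimate), then the oldest state can be solved for in closed form. The sketch below shows this generic structure only; `f` and `gamma` are hypothetical placeholders, and the actual BDIA-DDIM coefficients are not reproduced here.

```python
import numpy as np

def sample_step(z_next, z_cur, f, gamma):
    """Compute z_{i-1} from z_{i+1} and z_i (sampling direction)."""
    return gamma * z_next + f(z_cur)

def invert_step(z_prev, z_cur, f, gamma):
    """Recover z_{i+1} from z_{i-1} and z_i (inversion direction).

    Requires gamma != 0. f is evaluated at the same z_cur as in
    sample_step, so no extra approximation is introduced.
    """
    return (z_prev - f(z_cur)) / gamma
```

The inversion reuses the same single evaluation of `f` per step, which is consistent with the claim that exactness comes without doubling the number of function evaluations.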
BDIA-DDIM achieves superior image sampling quality (FID score) compared to DDIM and DPM-Solver in text-to-image generation.
Incorporating BDIA into both DDIM and EDM significantly improves FID scores for unconditional image generation.
BDIA-DDIM demonstrates promising results in both text-based and ControlNet-based image editing, achieving comparable quality to EDICT while reducing NFEs by approximately half. |
The paper primarily focuses on DDIM and EDM. Exploring BDIA's application with other ODE solvers (e.g., PLMS, DEIS) is left for future work.
The paper evaluates image editing qualitatively and with FID scores. Further quantitative evaluation using other metrics (e.g., LPIPS) could provide a more comprehensive understanding of BDIA-DDIM's performance. |
diffusion models, image editing, ode solvers, ddim inversion, exact inversion |
2307.10816
Report |
BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion |
Jinheng Xie, Yuexiang Li, Yawen Huang, Haozhe Liu, Wentian Zhang, Yefeng Zheng, Mike Zheng Shou |
Recent text-to-image diffusion models have demonstrated an astonishing
capacity to generate high-quality images. However, researchers have mainly studied
how to synthesize images with only text prompts. While some works have
explored using other modalities as conditions, considerable paired data, e.g.,
box/mask-image pairs, and fine-tuning time are required to train such models.
As such paired data is time-consuming and labor-intensive to acquire and
restricted to a closed set, this potentially becomes the bottleneck for
applications in an open world. This paper focuses on the simplest form of
user-provided conditions, e.g., box or scribble. To mitigate the aforementioned
problem, we propose a training-free method to control objects and contexts in
the synthesized images adhering to the given spatial conditions. Specifically,
three spatial constraints, i.e., Inner-Box, Outer-Box, and Corner Constraints,
are designed and seamlessly integrated into the denoising step of diffusion
models, requiring neither additional training nor massive annotated layout data.
Extensive experimental results demonstrate that the proposed constraints can
control what and where to present in the images while retaining the ability of
Diffusion models to synthesize with high fidelity and diverse concept coverage.
The code is publicly available at https://github.com/showlab/BoxDiff. |
This paper proposes BoxDiff, a training-free method for controlling the location and scale of objects in images synthesized by pre-trained text-to-image diffusion models, using simple spatial constraints like boxes or scribbles. |
Current text-to-image models lack fine-grained spatial control, and existing layout-to-image methods require significant paired training data and are limited to closed-set categories. BoxDiff addresses these limitations by providing a training-free approach for spatial control in open-world settings. |
BoxDiff works by applying spatial constraints (Inner-Box, Outer-Box, and Corner) to the cross-attention maps between text tokens and intermediate features during the denoising step of diffusion models. These constraints guide the synthesis process to adhere to the user-provided spatial conditions. |
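A simplified reading of the Inner-Box and Outer-Box constraints is that the strongest cross-attention responses for a token should sit inside its box, and that responses outside the box should be weak. The sketch below expresses this with top-k averages over a masked attention map. The exact loss form, the choice of `k`, and the function names are assumptions for illustration; the Corner constraint is not shown.

```python
import numpy as np

def topk_mean(values, k):
    """Mean of the k largest entries of an array."""
    flat = np.sort(np.asarray(values, dtype=np.float64).ravel())[::-1]
    return flat[:k].mean()

def inner_box_loss(attn, mask, k):
    """Low when the strongest attention responses fall inside the box.

    attn: (H, W) cross-attention map for one text token, values in [0, 1]
    mask: (H, W) array with 1 inside the user box and 0 outside
    """
    return 1.0 - topk_mean(attn * mask, k)

def outer_box_loss(attn, mask, k):
    """Low when attention outside the box is weak."""
    return topk_mean(attn * (1.0 - mask), k)
```

In a training-free setting, the gradient of such losses with respect to the noisy latent would be used to nudge each denoising step, leaving the model weights untouched.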
BoxDiff successfully controls the location and scale of synthesized objects according to user-provided boxes or scribbles.
The method outperforms existing fully-supervised layout-to-image methods in terms of both semantic accuracy and alignment with spatial conditions.
BoxDiff retains the high fidelity and diverse concept coverage of the underlying diffusion models, enabling the synthesis of novel objects and scenes. |
The precision of spatial control is limited by the resolution of the cross-attention maps used.
BoxDiff may struggle with unusual prompts or combinations of objects that infrequently co-occur. |
text-to-image synthesis, diffusion models, spatial constraints, training-free, open-world |
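As a rough illustration of the BoxDiff entry above, the sketch below scores a single cross-attention map against a user box. The function name `box_constraint_losses`, the top-k averaging, and the exact loss forms are assumptions made for illustration, not the paper's implementation, and the Corner constraint is omitted. In a training-free setup, losses of this kind would typically be used to nudge the latent at each denoising step; that update is not shown.

```python
def box_constraint_losses(attn, box, k=2):
    """Score one attention map against a box (hypothetical helper).

    attn: 2D list of attention values in [0, 1].
    box:  (top, left, bottom, right) with exclusive bottom/right.
    """
    top, left, bottom, right = box
    inside, outside = [], []
    for r, row in enumerate(attn):
        for c, value in enumerate(row):
            if top <= r < bottom and left <= c < right:
                inside.append(value)
            else:
                outside.append(value)

    def topk_mean(values):
        # Mean of the k largest responses in a region (0.0 if the region is empty).
        top_values = sorted(values, reverse=True)[:k]
        return sum(top_values) / len(top_values) if top_values else 0.0

    inner_loss = 1.0 - topk_mean(inside)  # small when strong responses lie inside the box
    outer_loss = topk_mean(outside)       # small when responses outside the box are weak
    return inner_loss, outer_loss


box = (1, 1, 3, 3)
aligned = [[0.0, 0.0, 0.0, 0.0],
           [0.0, 0.9, 0.8, 0.0],
           [0.0, 0.7, 0.9, 0.0],
           [0.0, 0.0, 0.0, 0.0]]
misplaced = [[0.9, 0.8, 0.0, 0.0],
             [0.7, 0.0, 0.0, 0.0],
             [0.0, 0.0, 0.0, 0.0],
             [0.0, 0.0, 0.0, 0.0]]
print(box_constraint_losses(aligned, box))
print(box_constraint_losses(misplaced, box))
```

An attention map concentrated inside the box yields lower values for both terms than one concentrated elsewhere, which is the signal a guidance step would follow.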
2307.10797
Report |
HyperReenact: One-Shot Reenactment via Jointly Learning to Refine and Retarget Faces |
Stella Bounareli, Christos Tzelepis, Vasileios Argyriou, Ioannis Patras, Georgios Tzimiropoulos |
In this paper, we present our method for neural face reenactment, called
HyperReenact, that aims to generate realistic talking head images of a source
identity, driven by a target facial pose. Existing state-of-the-art face
reenactment methods train controllable generative models that learn to
synthesize realistic facial images, yet they produce reenacted faces that are
prone to significant visual artifacts, especially under the challenging
condition of extreme head pose changes, or require expensive few-shot
fine-tuning to better preserve the source identity characteristics. We propose
to address these limitations by leveraging the photorealistic generation
ability and the disentangled properties of a pretrained StyleGAN2 generator, by
first inverting the real images into its latent space and then using a
hypernetwork to perform: (i) refinement of the source identity characteristics
and (ii) facial pose re-targeting, thereby eliminating the dependence on
external editing methods that typically produce artifacts. Our method operates
under the one-shot setting (i.e., using a single source frame) and allows for
cross-subject reenactment, without requiring any subject-specific fine-tuning.
We compare our method both quantitatively and qualitatively against several
state-of-the-art techniques on the standard benchmarks of VoxCeleb1 and
VoxCeleb2, demonstrating the superiority of our approach in producing
artifact-free images, exhibiting remarkable robustness even under extreme head
pose changes. We make the code and the pretrained models publicly available at:
https://github.com/StelaBou/HyperReenact . |
Presents HyperReenact, a neural face reenactment method that refines source identity characteristics and retargets facial pose using a pretrained StyleGAN2 generator and a hypernetwork. |
Existing methods struggle to produce realistic results in one-shot settings or with extreme head pose changes. This method aims to address these limitations by leveraging the strengths of pretrained StyleGAN2 and hypernetworks. |
The method uses a hypernetwork to modify the weights of the StyleGAN2 generator based on appearance features from a source image and pose features from a target image. It operates in the one-shot setting (using a single source frame) and is trained in three phases: real image inversion, self-reenactment, and cross-subject reenactment. |
Outperforms state-of-the-art methods on identity preservation and facial pose transfer, especially on challenging cases with large head pose differences.
Produces artifact-free images, as demonstrated through quantitative metrics (CSIM, LPIPS, FID, FVD, APD, AED) and qualitative comparisons.
Exhibits robustness to extreme head pose variations, outperforming other methods on a specifically designed benchmark. |
Struggles to reconstruct detailed accessories, like hats or eyeglasses, potentially due to underrepresentation in the training dataset.
Does not refine background details. |
face reenactment, hypernetworks, stylegan2, one-shot learning, facial image editing |
2307.10776
Report |
Urban Radiance Field Representation with Deformable Neural Mesh Primitives |
Fan Lu, Yan Xu, Guang Chen, Hongsheng Li, Kwan-Yee Lin, Changjun Jiang |
Neural Radiance Fields (NeRFs) have achieved great success in the past few
years. However, most current methods still require intensive resources due to
ray marching-based rendering. To construct urban-level radiance fields
efficiently, we design Deformable Neural Mesh Primitive (DNMP), and propose to
parameterize the entire scene with such primitives. The DNMP is a flexible and
compact neural variant of classic mesh representation, which enjoys both the
efficiency of rasterization-based rendering and the powerful neural
representation capability for photo-realistic image synthesis. Specifically, a
DNMP consists of a set of connected deformable mesh vertices with paired vertex
features to parameterize the geometry and radiance information of a local area.
To constrain the degree of freedom for optimization and lower the storage
budgets, we enforce the shape of each primitive to be decoded from a relatively
low-dimensional latent space. The rendering colors are decoded from the vertex
features (interpolated with rasterization) by a view-dependent MLP. The DNMP
provides a new paradigm for urban-level scene representation with appealing
properties: (1) High-quality rendering. Our method achieves leading
performance for novel view synthesis in urban scenarios. (2) Low
computational costs. Our representation enables fast rendering (2.07ms/1k
pixels) and low peak memory usage (110MB/1k pixels). We also present a
lightweight version that can run 33× faster than vanilla NeRFs, and
comparable to the highly-optimized Instant-NGP (0.61 vs 0.71ms/1k pixels).
Project page: https://dnmp.github.io/. |
This paper proposes Deformable Neural Mesh Primitive (DNMP), a novel neural scene representation for efficient and high-quality urban view synthesis, leveraging the efficiency of classic meshes and the representation power of neural features. |
Existing neural rendering methods, especially those for large-scale urban environments, suffer from high computational costs and memory footprints due to ray marching-based rendering. They also lack explicit surface constraints, leading to less robust novel view synthesis. |
The proposed method voxelizes the urban scene and assigns each voxel a DNMP, which parameterizes local geometry and radiance. DNMP shapes are decoded from a compact latent space for robust optimization, while radiance features are associated with mesh vertices. The method utilizes efficient rasterization for feature interpolation and rendering. |
The method achieves state-of-the-art novel view synthesis quality on KITTI-360 and Waymo datasets, outperforming baselines in terms of PSNR, SSIM, and LPIPS.
It exhibits strong robustness against viewpoint changes, generating high-quality rendering results even with significant view differences from the training set.
DNMP enables a 5x faster rendering speed and uses only 1/5 of the peak memory compared to Mip-NeRF 360, achieving a speed comparable to the highly optimized Instant-NGP. |
The current framework is based on the static-scene assumption and cannot handle moving objects.
Future work includes extending the method to incorporate dynamic elements for more general application scenarios. |
neural rendering, urban scene representation, deformable mesh, novel view synthesis, efficient rendering |
2307.10711
Report |
AdjointDPM: Adjoint Sensitivity Method for Gradient Backpropagation of Diffusion Probabilistic Models |
Jiachun Pan, Jun Hao Liew, Vincent Y. F. Tan, Jiashi Feng, Hanshu Yan |
Existing customization methods require access to multiple reference examples
to align pre-trained diffusion probabilistic models (DPMs) with user-provided
concepts. This paper aims to address the challenge of DPM customization when
the only available supervision is a differentiable metric defined on the
generated contents. Since the sampling procedure of DPMs involves recursive
calls to the denoising UNet, naïve gradient backpropagation requires storing
the intermediate states of all iterations, resulting in extremely high memory
consumption. To overcome this issue, we propose a novel method AdjointDPM,
which first generates new samples from diffusion models by solving the
corresponding probability-flow ODEs. It then uses the adjoint sensitivity
method to backpropagate the gradients of the loss to the models' parameters
(including conditioning signals, network weights, and initial noises) by
solving another augmented ODE. To reduce numerical errors in both the forward
generation and gradient backpropagation processes, we further reparameterize
the probability-flow ODE and augmented ODE as simple non-stiff ODEs using
exponential integration. Finally, we demonstrate the effectiveness of
AdjointDPM on three interesting tasks: converting visual effects into
identification text embeddings, finetuning DPMs for specific types of
stylization, and optimizing initial noise to generate adversarial samples for
security auditing. |
Proposes AdjointDPM, a novel gradient backpropagation technique for diffusion probabilistic models (DPMs) based on the adjoint sensitivity method, enabling the optimization of DPM parameters like network weights, conditioning signals, and noisy states under differentiable loss functions. |
Addresses the significant memory consumption problem of naive backpropagation in DPMs, especially for tasks like guided generation and model customization, which require optimizing model parameters to achieve specific properties in generated content. |
Leverages the adjoint sensitivity method to compute gradients by solving a backward ODE, eliminating the need to store intermediate states of all iterations. Further reduces numerical errors by reparameterizing the probability-flow ODE and augmented ODE as simple non-stiff ODEs using exponential integration. |
AdjointDPM enables guided sampling, allowing the guidance of Stable Diffusion to synthesize images of specific animal breeds under the supervision of fine-grained vision classifiers.
Reveals potential security issues in DPM-based generation systems by successfully finding initial noise states that lead to the generation of NSFW content capable of bypassing safety filters.
Facilitates stylization via a single reference image, enabling AdjointDPM to fine-tune a Stable Diffusion model for style defined by the Gram matrix of the reference, generalizing stylization capabilities to different objects. |
The guidance of FGVC models does not fully resolve issues with inaccurate details in generated images.
The effectiveness of style transfer depends on the selection of appropriate weights for style and content loss terms. |
diffusion probabilistic models, adjoint sensitivity method, guided generation, model customization, security auditing |
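To make the adjoint idea in the AdjointDPM entry above concrete, here is a minimal sketch on a toy scalar ODE. The dynamics `dx/dt = -theta * x`, the loss `0.5 * x(T)^2`, the plain Euler stepping, and all function names are assumptions chosen so the result can be checked against a closed form. The sketch does not use a probability-flow ODE or the exponential-integration reparameterization described in the entry.

```python
import math


def f(x, theta):
    return -theta * x  # toy dynamics dx/dt = f(x, theta)


def df_dx(x, theta):
    return -theta


def df_dtheta(x, theta):
    return -x


def adjoint_gradient(x0, theta, t_end=1.0, steps=2000):
    """Return (loss, dloss/dtheta) without storing the forward trajectory."""
    h = t_end / steps
    # Forward solve: only the current state is kept in memory.
    x = x0
    for _ in range(steps):
        x = x + h * f(x, theta)
    loss = 0.5 * x * x

    # Backward solve of the augmented system (state, adjoint, gradient).
    a = x        # adjoint a(T) = dL/dx(T)
    grad = 0.0   # accumulates the integral of a * df/dtheta over [0, T]
    for _ in range(steps):
        grad += h * a * df_dtheta(x, theta)
        a_prev = a + h * a * df_dx(x, theta)  # da/dt = -a * df/dx, stepped backward in time
        x = x - h * f(x, theta)               # recover the earlier state by reversing the ODE
        a = a_prev
    return loss, grad


loss, grad = adjoint_gradient(x0=2.0, theta=0.5)
exact_grad = -1.0 * (2.0 * math.exp(-0.5)) ** 2  # closed form: -T * x(T)^2
print(grad, exact_grad)
```

The backward loop needs only the final state as input, which is the memory saving the entry refers to; the price is a second ODE solve and the numerical error of reconstructing earlier states.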
2307.10584
Report |
Reference-based Painterly Inpainting via Diffusion: Crossing the Wild Reference Domain Gap |
Dejia Xu, Xingqian Xu, Wenyan Cong, Humphrey Shi, Zhangyang Wang |
Have you ever imagined how it would look if we placed new objects into
paintings? For example, what would it look like if we placed a basketball into
Claude Monet's "Water Lilies, Evening Effect"? We propose Reference-based
Painterly Inpainting, a novel task that crosses the wild reference domain gap
and implants novel objects into artworks. Although previous works have examined
reference-based inpainting, they are not designed for large domain
discrepancies between the target and the reference, such as inpainting an
artistic image using a photorealistic reference. This paper proposes a novel
diffusion framework, dubbed RefPaint, to "inpaint more wildly" by taking such
references with large domain gaps. Built with an image-conditioned diffusion
model, we introduce a ladder-side branch and a masked fusion mechanism to work
with the inpainting mask. By decomposing the CLIP image embeddings at inference
time, one can manipulate the strength of semantic and style information with
ease. Experiments demonstrate that our proposed RefPaint framework produces
significantly better results than existing methods. Our method enables creative
painterly image inpainting with reference objects that would otherwise be
difficult to achieve. Project page: https://vita-group.github.io/RefPaint/ |
This paper introduces "Reference-based Painterly Inpainting", a novel task that implants new objects into artworks, even with significant domain gaps between the reference object and artistic background, and proposes a novel diffusion-based framework, called RefPaint, to address it. |
This task enables creative painterly image inpainting with reference objects, going beyond the limitations of existing reference-based inpainting methods that struggle with large domain discrepancies and text-based inpainting's ambiguity in specifying desired content. |
The RefPaint framework, built upon an image-conditioned diffusion model, introduces a ladder-side branch for masked image encoding and a masked fusion mechanism to incorporate inpainting masks. It uses PCA-decomposed CLIP image embeddings for disentangled semantic and style fusion via classifier-free guidance, allowing control over the trade-off between reference semantics and background style. |
RefPaint successfully injects new objects into artistic images while preserving the background style, even with challenging domain gaps.
The disentangled semantic and style fusion allows fine-grained control over the inpainted content, balancing fidelity to the reference object and stylistic consistency with the artwork.
Quantitative comparisons using CLIP image distance demonstrate that RefPaint outperforms existing methods in terms of integrating reference objects while maintaining background style. |
The model suffers from slow inference speed, which is a common limitation of diffusion models.
Handling complex cases, such as multiple objects or objects with detailed textures, requires further exploration. |
image inpainting, diffusion models, reference-based inpainting, painterly style transfer, image harmonization |
2307.10504
Report |
Identifying Interpretable Subspaces in Image Representations |
Neha Kalibhat, Shweta Bhardwaj, Bayan Bruss, Hamed Firooz, Maziar Sanjabi, Soheil Feizi |
We propose Automatic Feature Explanation using Contrasting Concepts (FALCON),
an interpretability framework to explain features of image representations. For
a target feature, FALCON captions its highly activating cropped images using a
large captioning dataset (like LAION-400m) and a pre-trained vision-language
model like CLIP. Each word among the captions is scored and ranked leading to a
small number of shared, human-understandable concepts that closely describe the
target feature. FALCON also applies contrastive interpretation using lowly
activating (counterfactual) images, to eliminate spurious concepts. Although
many existing approaches interpret features independently, we observe in
state-of-the-art self-supervised and supervised models, that less than 20% of
the representation space can be explained by individual features. We show that
features in larger spaces become more interpretable when studied in groups and
can be explained with high-order scoring concepts through FALCON. We discuss
how extracted concepts can be used to explain and debug failures in downstream
tasks. Finally, we present a technique to transfer concepts from one
(explainable) representation space to another unseen representation space by
learning a simple linear transformation. Code available at
https://github.com/NehaKalibhat/falcon-explain. |
This paper proposes FALCON, an interpretability framework that automatically identifies human-understandable concepts encoded by features in image representations. |
Understanding what information is encoded in image representations, especially in self-supervised models, is crucial for their deployment and generalization. |
FALCON leverages a probe dataset, a large captioning dataset (LAION-400m), and a pre-trained vision-language model (CLIP). It captions highly activating image crops for a target feature and extracts shared concepts. It also uses contrastive interpretation with lowly activating images to filter out spurious concepts. |
FALCON successfully identifies meaningful concepts for individual features and, more surprisingly, for groups of features, which are shown to be more interpretable.
Human evaluation via Amazon Mechanical Turk demonstrates the high relevance and explainability of FALCON's extracted concepts.
The paper showcases the transferability of concepts across different representation spaces by learning a simple linear transformation. |
Extending FALCON to explain vision-language models and non-image domains remains for future work.
The framework currently relies on a pre-trained vision-language model, which might be limiting for models trained on specialized tasks or data. |
interpretability, image representation, concept extraction, self-supervised learning, contrastive interpretation |
2307.10373
Report |
TokenFlow: Consistent Diffusion Features for Consistent Video Editing |
Michal Geyer, Omer Bar-Tal, Shai Bagon, Tali Dekel |
The generative AI revolution has recently expanded to videos. Nevertheless,
current state-of-the-art video models are still lagging behind image models in
terms of visual quality and user control over the generated content. In this
work, we present a framework that harnesses the power of a text-to-image
diffusion model for the task of text-driven video editing. Specifically, given
a source video and a target text-prompt, our method generates a high-quality
video that adheres to the target text, while preserving the spatial layout and
motion of the input video. Our method is based on a key observation that
consistency in the edited video can be obtained by enforcing consistency in the
diffusion feature space. We achieve this by explicitly propagating diffusion
features based on inter-frame correspondences, readily available in the model.
Thus, our framework does not require any training or fine-tuning, and can work
in conjunction with any off-the-shelf text-to-image editing method. We
demonstrate state-of-the-art editing results on a variety of real-world videos.
Webpage: https://diffusion-tokenflow.github.io/ |
Introduces TokenFlow, a technique that leverages the internal representations of videos in text-to-image diffusion models to enable consistent and high-quality video editing. |
Existing video generation models lag behind image models and struggle to achieve both high visual quality and temporal consistency in edited videos. TokenFlow addresses this gap by harnessing the power of readily available, state-of-the-art image diffusion models for video editing. |
TokenFlow extracts and analyzes diffusion features from a pre-trained image diffusion model (Stable Diffusion). It then enforces consistency by propagating edits based on inter-frame feature correspondences found in the original video. This approach ensures that edits adhere to the target text prompt while maintaining temporal coherence. |
TokenFlow generates high-quality edited videos that exhibit strong adherence to the target text prompts.
Quantitative evaluations, including warping error and user studies, demonstrate that TokenFlow significantly outperforms existing and concurrent video editing methods in terms of temporal consistency.
The method is efficient, reducing per-frame editing time by 20% compared to applying image editing techniques frame-by-frame. |
TokenFlow is currently limited to edits that preserve the original structure of the video, as it relies on the original video's motion and feature correspondences.
The method's success is partially dependent on the accuracy of the underlying image editing technique used in conjunction with TokenFlow. Future work may explore combining TokenFlow with improved decoders to further enhance video quality and minimize flickering. |
video editing, diffusion models, temporal consistency, text-driven editing, stable diffusion |
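The propagation step described in the TokenFlow entry above can be sketched as a nearest-neighbour lookup. This is a simplified illustration under stated assumptions: a single keyframe, cosine similarity on raw token vectors, and the hypothetical function names `cosine` and `propagate`. It is not the authors' implementation.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def propagate(source_key, source_frame, edited_key):
    """Copy edited keyframe tokens to another frame.

    Correspondences are computed between the ORIGINAL frame tokens and the
    ORIGINAL keyframe tokens; the matched EDITED keyframe token is then reused.
    """
    output = []
    for token in source_frame:
        best = max(range(len(source_key)), key=lambda i: cosine(token, source_key[i]))
        output.append(edited_key[best])
    return output


source_key = [[1.0, 0.0], [0.0, 1.0]]    # original keyframe tokens
source_frame = [[0.1, 0.9], [0.8, 0.2]]  # original tokens of a neighbouring frame
edited_key = [[5.0, 5.0], [7.0, 7.0]]    # keyframe tokens after editing
print(propagate(source_key, source_frame, edited_key))
```

Because every frame reuses edited features from the same keyframe, tokens that correspond in the source video receive the same edited content, which is how consistency in feature space carries over to the output.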
2307.10159
Report |
FABRIC: Personalizing Diffusion Models with Iterative Feedback |
Dimitri von Rütte, Elisabetta Fedele, Jonathan Thomm, Lukas Wolf |
In an era where visual content generation is increasingly driven by machine
learning, the integration of human feedback into generative models presents
significant opportunities for enhancing user experience and output quality.
This study explores strategies for incorporating iterative human feedback into
the generative process of diffusion-based text-to-image models. We propose
FABRIC, a training-free approach applicable to a wide range of popular
diffusion models, which exploits the self-attention layer present in the most
widely used architectures to condition the diffusion process on a set of
feedback images. To ensure a rigorous assessment of our approach, we introduce
a comprehensive evaluation methodology, offering a robust mechanism to quantify
the performance of generative visual models that integrate human feedback. Through
exhaustive analysis, we show that generation results improve over multiple rounds
of iterative feedback, implicitly optimizing arbitrary user preferences.
The potential applications of these findings extend to fields such as
personalized content creation and customization. |
Presents FABRIC, a training-free method incorporating iterative user feedback (liked and disliked images) into text-to-image diffusion models for improved image generation aligned with user preferences. |
Addresses the limitations of current text-to-image models, which often require iterative prompt engineering and struggle to capture nuanced user preferences. |
Leverages attention-based reference image conditioning by injecting information from feedback images into the self-attention layer of a diffusion model's U-Net during the denoising process. |
FABRIC effectively guides image generation toward user preferences, evidenced by improved scores from a human preference prediction model.
It successfully steers generation towards a target image when feedback is provided based on similarity to that target.
The method demonstrates orthogonality with other Stable Diffusion enhancements, enabling improvements on top of existing techniques like LoRA and fine-tuned checkpoints. |
FABRIC may struggle to expand the generative distribution beyond the initial text-conditioned output, potentially limiting exploration.
The current feedback mechanism relies on binary preferences (like/dislike), which could be expanded for more nuanced guidance. |
text-to-image generation, diffusion models, human feedback, iterative refinement, attention mechanisms |
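One way to picture the attention-based conditioning in the FABRIC entry above is to let image tokens attend to feedback tokens in addition to themselves. The sketch below assumes identity query/key/value projections and simple concatenation of the feedback tokens; the handling of disliked images and any attention reweighting are left out. All names are hypothetical.

```python
import math


def attention(queries, keys, values):
    """Scaled dot-product attention over small Python lists."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]  # numerically stable softmax
        total = sum(weights)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))
        ])
    return outputs


def attention_with_feedback(tokens, feedback_tokens):
    # Keys and values are extended with tokens from a feedback image,
    # so each query can also pull information from the reference.
    extended = tokens + feedback_tokens
    return attention(tokens, extended, extended)


tokens = [[1.0, 0.0]]
feedback = [[0.0, 1.0]]
print(attention(tokens, tokens, tokens))
print(attention_with_feedback(tokens, feedback))
```

With the feedback token present, the output moves part of the way toward the feedback feature, and no weights are trained, consistent with the training-free framing of the entry.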
2307.09947
Report |
U-CE: Uncertainty-aware Cross-Entropy for Semantic Segmentation |
Steven Landgraf, Markus Hillemann, Kira Wursthorn, Markus Ulrich |
Deep neural networks have shown exceptional performance in various tasks, but
their lack of robustness and reliability and their tendency to be overconfident pose
challenges for their deployment in safety-critical applications like autonomous
driving. In this regard, quantifying the uncertainty inherent to a model's
prediction is a promising endeavour to address these shortcomings. In this
work, we present a novel Uncertainty-aware Cross-Entropy loss (U-CE) that
incorporates dynamic predictive uncertainties into the training process by
pixel-wise weighting of the well-known cross-entropy loss (CE). Through
extensive experimentation, we demonstrate the superiority of U-CE over regular
CE training on two benchmark datasets, Cityscapes and ACDC, using two common
backbone architectures, ResNet-18 and ResNet-101. With U-CE, we manage to train
models that not only improve their segmentation performance but also provide
meaningful uncertainties after training. Consequently, we contribute to the
development of more robust and reliable segmentation models, ultimately
advancing the state-of-the-art in safety-critical applications and beyond. |
This paper proposes U-CE, a novel uncertainty-aware cross-entropy loss function for semantic segmentation that incorporates predictive uncertainties into the training process. |
Quantifying predictive uncertainty is crucial for deploying deep learning models in safety-critical applications, as it provides insights into model reliability. |
U-CE integrates Monte Carlo Dropout during training to compute pixel-wise uncertainties, which are then used to weight the standard cross-entropy loss. |
U-CE consistently outperforms regular cross-entropy training in terms of mIoU across different dropout ratios, backbones, and datasets.
Models trained with U-CE demonstrate the ability to predict meaningful uncertainties, aligning with segmentation performance.
U-CE shows robustness to the choice of hyperparameters such as α and the base learning rate. |
U-CE's effectiveness might be limited when densely annotated ground truth labels are unavailable.
Further investigation is needed to understand the impact of U-CE on generalization performance. |
semantic segmentation, uncertainty quantification, monte carlo dropout, deep learning, computer vision |
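The pixel-wise weighting in the U-CE entry above can be sketched as follows. The weight form `(1 + sigma) ** alpha`, the use of the standard deviation of the predicted-class probability across stochastic passes as `sigma`, and the use of the mean probabilities inside the cross-entropy are assumptions for illustration. The entry itself only states that Monte Carlo Dropout uncertainties weight the cross-entropy loss and that α is a hyperparameter.

```python
import math


def pixel_uncertainty(samples):
    """samples: softmax vectors from repeated dropout forward passes for one pixel."""
    n = len(samples)
    mean = [sum(s[c] for s in samples) / n for c in range(len(samples[0]))]
    predicted = max(range(len(mean)), key=lambda c: mean[c])
    variance = sum((s[predicted] - mean[predicted]) ** 2 for s in samples) / n
    return mean, math.sqrt(variance)


def uncertainty_weighted_ce(pixels, labels, alpha=1.0):
    """Mean cross-entropy where each pixel is scaled by (1 + sigma) ** alpha."""
    total = 0.0
    for samples, label in zip(pixels, labels):
        mean, sigma = pixel_uncertainty(samples)
        cross_entropy = -math.log(mean[label])
        total += (1.0 + sigma) ** alpha * cross_entropy
    return total / len(labels)


pixels = [[[0.6, 0.4], [0.8, 0.2]]]  # one pixel, two stochastic passes, two classes
labels = [0]
print(uncertainty_weighted_ce(pixels, labels, alpha=0.0))  # plain cross-entropy
print(uncertainty_weighted_ce(pixels, labels, alpha=1.0))  # uncertain pixel weighted up
```

Setting `alpha=0.0` recovers ordinary cross-entropy, so under these assumptions the weighting only changes how much uncertain pixels contribute to the loss.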
2307.09906
Report |
Implicit Identity Representation Conditioned Memory Compensation Network for Talking Head video Generation |
Fa-Ting Hong, Dan Xu |
Talking head video generation aims to animate a human face in a still image
with dynamic poses and expressions using motion information derived from a
target-driving video, while maintaining the person's identity in the source
image. However, dramatic and complex motions in the driving video cause
ambiguous generation, because the still source image cannot provide sufficient
appearance information for occluded regions or delicate expression variations,
which produces severe artifacts and significantly degrades the generation
quality. To tackle this problem, we propose to learn a global facial
representation space, and design a novel implicit identity representation
conditioned memory compensation network, coined as MCNet, for high-fidelity
talking head generation. Specifically, we devise a network module to learn a
unified spatial facial meta-memory bank from all training samples, which can
provide rich facial structure and appearance priors to compensate warped source
facial features for the generation. Furthermore, we propose an effective query
mechanism based on implicit identity representations learned from the discrete
keypoints of the source image. It can greatly facilitate the retrieval of more
correlated information from the memory bank for the compensation. Extensive
experiments demonstrate that MCNet can learn representative and complementary
facial memory, and can clearly outperform previous state-of-the-art talking
head generation methods on VoxCeleb1 and CelebV datasets. Please check our
project at https://github.com/harlanhong/ICCV2023-MCNET. |
This paper proposes MCNet, an implicit identity representation conditioned memory compensation network for high-fidelity talking head video generation, addressing the ambiguity issue in existing methods when handling dramatic head motions. |
Existing talking head generation methods, while achieving progress in motion estimation, struggle to generate high-quality videos with large head motions due to limited appearance information in a single source image, leading to artifacts and quality degradation. |
The proposed MCNet learns a global facial meta-memory bank from all training samples to provide rich facial priors. It leverages an implicit identity representation learned from source image keypoints and warped features to query the meta-memory bank, obtaining identity-dependent memory for compensating ambiguous facial details in the warped source feature map. |
MCNet outperforms state-of-the-art methods on VoxCeleb1 and CelebV datasets for both same-identity and cross-identity reenactment.
The learned global facial meta-memory effectively compensates for ambiguous regions in generated faces, especially under large head motions or occlusions.
The proposed method demonstrates strong generalizability, improving the performance when incorporated into other talking head generation frameworks. |
The model's performance in handling unseen identities could be further improved.
The computational cost associated with querying the large meta-memory bank is a limitation. |
talking head generation, memory compensation network, implicit identity representation, global facial meta-memory, deep learning |
2307.09882
Report |
Adversarial Likelihood Estimation With One-Way Flows |
Omri Ben-Dov, Pravir Singh Gupta, Victoria Abrevaya, Michael J. Black, Partha Ghosh |
Generative Adversarial Networks (GANs) can produce high-quality samples, but
do not provide an estimate of the probability density around the samples.
However, it has been noted that maximizing the log-likelihood within an
energy-based setting can lead to an adversarial framework where the
discriminator provides unnormalized density (often called energy). We further
develop this perspective, incorporate importance sampling, and show that 1)
Wasserstein GAN performs a biased estimate of the partition function, and we
propose instead to use an unbiased estimator; and 2) when optimizing for
likelihood, one must maximize generator entropy. This is hypothesized to
provide a better mode coverage. Different from previous works, we explicitly
compute the density of the generated samples. This is the key enabler to
designing an unbiased estimator of the partition function and computation of
the generator entropy term. The generator density is obtained via a new type of
flow network, called one-way flow network, that is less constrained in terms of
architecture, as it does not require a tractable inverse function. Our
experimental results show that our method converges faster, produces comparable
sample quality to GANs with similar architecture, successfully avoids
over-fitting to commonly used datasets and produces smooth low-dimensional
latent representations of the training data. |
This paper proposes a new framework for adversarial generative modeling that combines the advantages of GANs (high-quality samples) with density estimation capabilities. |
Explicit density estimation in GANs allows for quantitative model comparison, likelihood-based training, and potentially mitigates issues like mode collapse. |
The authors leverage the connection between EBMs and GANs, introducing an unbiased estimator of the partition function by explicitly computing the generator density. They achieve this using a novel 'one-way flow' network for the generator. |
The model captures more modes and generates higher-quality samples than previous GANs on 2D datasets.
On real datasets, it demonstrates faster convergence and comparable sample quality to GANs while exhibiting good generalization.
The proposed method allows for practical computation of the partition function with a reasonable number of samples. |
The current implementation relies on an approximate Jacobian determinant computation, which introduces noise. Exploring architectures with closed-form Jacobian determinants is left for future work.
Further investigation is needed to fully leverage the potential of using multiple samples for approximating the normalizing factor. |
generative adversarial networks, density estimation, energy-based models, normalizing flows, one-way flows |
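The role of an explicit generator density in the entry above can be illustrated with a plain importance-sampling estimate of a partition function. The toy energy `0.5 * x**2`, the convention that the unnormalized density is `exp(-energy)`, the Gaussian proposal standing in for the generator, and all names are assumptions chosen so the answer is known in closed form (the square root of 2π). No flow network is involved.

```python
import math
import random


def energy(x):
    return 0.5 * x * x  # toy energy; exp(-energy) is an unnormalized Gaussian


def proposal_density(x, sigma):
    # Density of the sampling distribution, standing in for the generator density.
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def estimate_partition(n_samples=20000, sigma=1.5, seed=0):
    """Estimate Z as the average of exp(-E(x)) / q(x) over samples x drawn from q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, sigma)
        total += math.exp(-energy(x)) / proposal_density(x, sigma)
    return total / n_samples


print(estimate_partition(), math.sqrt(2.0 * math.pi))
```

The estimator is only computable because the density of each drawn sample can be evaluated, which is the property the entry attributes to the one-way flow generator.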
2307.09829
Report |
What do neural networks learn in image classification? A frequency shortcut perspective |
Shunxin Wang, Raymond Veldhuis, Christoph Brune, Nicola Strisciuglio |
Frequency analysis is useful for understanding the mechanisms of
representation learning in neural networks (NNs). Most research in this area
focuses on the learning dynamics of NNs for regression tasks, while little
addresses classification. This study empirically investigates the latter and expands the
understanding of frequency shortcuts. First, we perform experiments on
synthetic datasets, designed to have a bias in different frequency bands. Our
results demonstrate that NNs tend to find simple solutions for classification,
and what they learn first during training depends on the most distinctive
frequency characteristics, which can be either low- or high-frequencies.
Second, we confirm this phenomenon on natural images. We propose a metric to
measure class-wise frequency characteristics and a method to identify frequency
shortcuts. The results show that frequency shortcuts can be texture-based or
shape-based, depending on what best simplifies the objective. Third, we
validate the transferability of frequency shortcuts on out-of-distribution
(OOD) test sets. Our results suggest that frequency shortcuts can be
transferred across datasets and cannot be fully avoided by larger model
capacity and data augmentation. We recommend that future research should focus
on effective training schemes mitigating frequency shortcut learning. |
This paper investigates what neural networks learn during image classification, focusing on their tendency to exploit frequency shortcuts – specific frequency sets leading to accurate but potentially oversimplified classification. |
Understanding how data frequency characteristics and simplicity bias in neural networks can lead to frequency shortcut learning is crucial for addressing the limitations of current models and improving their generalization abilities, especially in out-of-distribution scenarios. |
The authors conduct experiments on synthetic datasets with controlled frequency biases and natural images (ImageNet-10, ImageNet-SCT). They propose a metric (ADCS) to compare class-wise frequency distributions and a frequency culling method to identify frequency shortcuts. They analyze the effects of model capacity and data augmentation on shortcut learning. |
Neural networks for classification tasks can prioritize learning distinctive frequency characteristics over semantic features, leading to frequency shortcut learning, where specific frequency subsets are used for classification.
Frequency shortcuts can be texture-based or shape-based, depending on the dataset characteristics, and can hinder the learning of more meaningful semantic information.
Frequency shortcuts can be transferred across datasets and models and cannot be entirely avoided by increasing model capacity or applying common data augmentation techniques. |
The ADCS metric, while insightful, cannot solely predict shortcut learning; further investigation of the relationship between frequency characteristics and learning dynamics is needed.
Future work should focus on developing data augmentation strategies that explicitly target and mitigate frequency shortcut learning to improve the generalization capabilities of neural networks. |
frequency analysis, shortcut learning, image classification, generalization, data augmentation |
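To make the idea of restricting a classifier to a frequency subset concrete, here is a minimal sketch of a radial band filter in the Fourier domain. It is a generic illustration, not the authors' frequency culling method; the function name `keep_frequency_band` and the radial band definition are assumptions introduced for this example.

```python
import numpy as np

def keep_frequency_band(image, r_min, r_max):
    """Keep only spatial frequencies whose radius lies in [r_min, r_max).

    Hypothetical helper: a plain radial band mask, not the paper's method.
    `image` is a 2D grayscale array.
    """
    h, w = image.shape
    # Move the zero-frequency component to the centre of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    yy, xx = np.mgrid[0:h, 0:w]
    radius = np.hypot(yy - h // 2, xx - w // 2)
    mask = (radius >= r_min) & (radius < r_max)
    # Zero out everything outside the band and transform back.
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    return filtered.real

rng = np.random.default_rng(0)
img = rng.random((32, 32))
low = keep_frequency_band(img, 0, 4)    # low-frequency content only
high = keep_frequency_band(img, 4, 64)  # everything else
```

Feeding `low` or `high` to a trained classifier instead of `img` is one simple way to probe which bands its predictions depend on.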
2307.09781
Report |
Text2Layer: Layered Image Generation using Latent Diffusion Model |
Xinyang Zhang, Wentian Zhao, Xin Lu, Jeff Chien |
Layer compositing is one of the most popular image editing workflows among
both amateurs and professionals. Motivated by the success of diffusion models,
we explore layer compositing from a layered image generation perspective.
Instead of generating an image, we propose to generate background, foreground,
layer mask, and the composed image simultaneously. To achieve layered image
generation, we train an autoencoder that is able to reconstruct layered images
and train diffusion models on the latent representation. One benefit of the
proposed problem is to enable better compositing workflows in addition to the
high-quality image output. Another benefit is producing higher-quality layer
masks compared to masks produced by a separate step of image segmentation.
Experimental results show that the proposed method is able to generate
high-quality layered images and initiates a benchmark for future work. |
This paper proposes Text2Layer, a novel method for generating layered images from text prompts, composed of foreground, background, a layer mask, and a composited image. |
Layered image generation facilitates more controllable and intuitive image editing workflows compared to traditional text-to-image generation or text-guided editing methods. |
The authors create a 57.02M layered-image dataset ("LL2I") and train a Composition-Aware Two-Layer Autoencoder (CaT2I-AE). They then train a diffusion model on the latent representations learned by CaT2I-AE, enabling text-driven layered image generation. |
Text2Layer generates higher-quality layered images than baselines using Stable Diffusion components.
The generated layer masks demonstrate superior accuracy in capturing foreground objects.
The generated images exhibit strong text-image relevance, indicating effective adherence to text prompts. |
The current LL2I dataset, while large, is still smaller than datasets used to train state-of-the-art text-to-image models, potentially limiting generation quality and diversity.
Future work could explore conditional layer generation, enabling the generation of an arbitrary number of layers and more complex image compositions. |
layered image generation, text-to-image synthesis, diffusion models, image editing, computer vision |
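The relationship between the four outputs (foreground, background, layer mask, composed image) can be sketched with standard alpha compositing. The report does not state the exact compositing formula, so the blend below is an assumption, and `composite` is a hypothetical helper name.

```python
import numpy as np

def composite(foreground, background, mask):
    """Blend two RGB layers with a soft mask in [0, 1].

    Assumes standard alpha compositing: mask * fg + (1 - mask) * bg.
    foreground, background: (H, W, 3) arrays; mask: (H, W) array.
    """
    alpha = mask[..., None]  # add a channel axis so it broadcasts over RGB
    return alpha * foreground + (1.0 - alpha) * background

fg = np.ones((4, 4, 3))          # white foreground
bg = np.zeros((4, 4, 3))         # black background
mask = np.full((4, 4), 0.25)     # 25% foreground everywhere
image = composite(fg, bg, mask)
```

Because the layers stay separate, an editor can swap the background or move the foreground and recompose without regenerating the whole image.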
2307.09582
Report |
Guided Linear Upsampling |
Shuangbing Song, Fan Zhong, Tianju Wang, Xueying Qin, Changhe Tu |
Guided upsampling is an effective approach for accelerating high-resolution
image processing. In this paper, we propose a simple yet effective guided
upsampling method. Each pixel in the high-resolution image is represented as a
linear interpolation of two low-resolution pixels, whose indices and weights
are optimized to minimize the upsampling error. The downsampling can be jointly
optimized in order to prevent missing small isolated regions. Our method can be
derived from the color line model and local color transformations. Compared to
previous methods, our method can better preserve detail effects while
suppressing artifacts such as bleeding and blurring. It is efficient, easy to
implement, and free of sensitive parameters. We evaluate the proposed method
with a wide range of image operators, and show its advantages through
quantitative and qualitative analysis. We demonstrate the advantages of our
method for both interactive image editing and real-time high-resolution video
processing. In particular, for interactive editing, the joint optimization can
be precomputed, thus allowing for instant feedback without hardware
acceleration. |
This paper introduces Guided Linear Upsampling (GLU), a novel guided upsampling technique for accelerating high-resolution image processing. |
Efficiently processing high-resolution images is crucial due to the increasing demand for high-quality visuals and the computational constraints of devices. GLU offers a simple yet powerful solution to address this challenge. |
GLU represents each high-resolution pixel as a linear interpolation of two optimized low-resolution pixels. It jointly optimizes downsampling and upsampling to minimize error and preserve details. This approach is inspired by the color line model and local color transformations but without explicit smoothness constraints. |
GLU outperforms previous methods (JBU, BGU) in quantitative and qualitative evaluations across various image processing tasks, especially for large upsampling ratios.
The target-free optimization in GLU allows for pre-computation, enabling interactive editing with instant feedback and real-time video processing.
Downsample optimization in GLU effectively preserves thin structures and small regions often lost in regular downsampling. |
GLU might exhibit limitations when handling new edges or drastic changes in local image structures, which are not present in the source image.
Adapting existing image processing operators for optimal performance at low resolutions is crucial for maximizing GLU's effectiveness. |
guided upsampling, optimized downsampling, image processing, interactive image editing, real-time video processing |
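The core representation described above (each high-resolution pixel as a linear interpolation of two low-resolution pixels, with indices and weights fitted on the source image and then reused on the processed image) can be sketched as follows. This is a simplified grayscale illustration under stated assumptions: a brute-force search over a 3x3 low-resolution neighbourhood, no joint downsampling optimization, and hypothetical function names `fit_interpolation` and `apply_interpolation`.

```python
import numpy as np
from itertools import combinations

def fit_interpolation(src_hr, src_lr, scale):
    """For each HR pixel, pick two nearby LR pixels and a weight that best
    reproduce the HR source value (brute-force search, grayscale only)."""
    H, W = src_hr.shape
    h, w = src_lr.shape
    idx_a = np.zeros((H, W, 2), dtype=int)
    idx_b = np.zeros((H, W, 2), dtype=int)
    weight = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            cy, cx = y // scale, x // scale
            cands = [(j, i)
                     for j in range(max(cy - 1, 0), min(cy + 2, h))
                     for i in range(max(cx - 1, 0), min(cx + 2, w))]
            best = (np.inf, cands[0], cands[0], 1.0)
            for a, b in combinations(cands, 2):
                va, vb = src_lr[a], src_lr[b]
                # Closed-form weight minimising the squared error, clipped to [0, 1].
                wgt = 1.0 if va == vb else np.clip((src_hr[y, x] - vb) / (va - vb), 0.0, 1.0)
                err = abs(wgt * va + (1 - wgt) * vb - src_hr[y, x])
                if err < best[0]:
                    best = (err, a, b, wgt)
            _, idx_a[y, x], idx_b[y, x], weight[y, x] = best
    return idx_a, idx_b, weight

def apply_interpolation(tgt_lr, idx_a, idx_b, weight):
    """Upsample a processed LR image with the indices and weights fitted on the source."""
    a = tgt_lr[idx_a[..., 0], idx_a[..., 1]]
    b = tgt_lr[idx_b[..., 0], idx_b[..., 1]]
    return weight * a + (1.0 - weight) * b

rng = np.random.default_rng(0)
src_lr = rng.random((4, 4))
src_hr = np.kron(src_lr, np.ones((2, 2)))        # toy HR source (nearest-neighbour)
params = fit_interpolation(src_hr, src_lr, scale=2)
tgt_hr = apply_interpolation(2 * src_lr + 1, *params)  # "operator" applied at low resolution
```

Since the fit depends only on the source image, it can be computed once and reused for any operator output, which is what makes precomputation for interactive editing possible.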
2307.09481
Report |
AnyDoor: Zero-shot Object-level Image Customization |
Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, Hengshuang Zhao |
This work presents AnyDoor, a diffusion-based image generator with the power
to teleport target objects to new scenes at user-specified locations in a
harmonious way. Instead of tuning parameters for each object, our model is
trained only once and effortlessly generalizes to diverse object-scene
combinations at the inference stage. Such a challenging zero-shot setting
requires an adequate characterization of a certain object. To this end, we
complement the commonly used identity feature with detail features, which are
carefully designed to maintain texture details yet allow versatile local
variations (e.g., lighting, orientation, posture, etc.), supporting the object
in favorably blending with different surroundings. We further propose to borrow
knowledge from video datasets, where we can observe various forms (i.e., along
the time axis) of a single object, leading to stronger model generalizability
and robustness. Extensive experiments demonstrate the superiority of our
approach over existing alternatives as well as its great potential in
real-world applications, such as virtual try-on and object moving. Project page
is https://damo-vilab.github.io/AnyDoor-Page/. |
This paper proposes AnyDoor, a diffusion-based model that teleports objects from a source image to a target scene at user-specified locations with desired shapes in a zero-shot manner. |
Object teleportation is crucial for various applications like image composition, virtual try-on, and shape editing, but previous methods struggle to generate identity-consistent content, especially for untrained categories. |
AnyDoor uses an ID extractor (DINOv2) to capture object identity and a detail extractor (ControlNet-style UNet) to learn appearance details from a collage of high-frequency object maps and the scene. These features guide a pre-trained text-to-image diffusion model for generation. The model is trained on a dataset incorporating video and image pairs to capture object variations and diverse scenarios. |
AnyDoor outperforms existing reference-based methods in preserving object identity while generating high-quality compositions.
It achieves superior multi-subject composition compared to finetuning-based methods without requiring parameter tuning.
AnyDoor demonstrates strong potential for various applications like virtual try-on, object moving and swapping, and shape editing. |
AnyDoor might struggle with generating fine details like small characters or logos.
Future work could focus on incorporating additional controls and exploring higher-resolution generation. |
image generation, diffusion models, object teleportation, zero-shot learning, image editing |
2307.09361
Report |
MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments |
Spyros Gidaris, Andrei Bursuc, Oriane Simeoni, Antonin Vobecky, Nikos Komodakis, Matthieu Cord, Patrick Pérez |
Self-supervised learning can be used for mitigating the greedy needs of
Vision Transformer networks for very large fully-annotated datasets. Different
classes of self-supervised learning offer representations with either good
contextual reasoning properties, e.g., using masked image modeling strategies,
or invariance to image perturbations, e.g., with contrastive methods. In this
work, we propose a single-stage and standalone method, MOCA, which unifies both
desired properties using novel mask-and-predict objectives defined with
high-level features (instead of pixel-level details). Moreover, we show how to
effectively employ both learning paradigms in a synergistic and
computation-efficient way. Doing so, we achieve new state-of-the-art results on
low-shot settings and strong experimental results in various evaluation
protocols with a training that is at least 3 times faster than prior methods. |
This paper presents MOCA, a self-supervised representation learning method for Vision Transformers that leverages a novel masking strategy for predicting high-level online codebook assignments, thereby unifying the strengths of discriminative and hide-and-predict approaches. |
Vision Transformers typically require extensive annotated training data. MOCA addresses this challenge by effectively learning from unlabeled data, enabling robust representations with enhanced contextual reasoning and perturbation invariance. |
MOCA employs a teacher-student scheme where the teacher network, a momentum-updated version of the student, generates target codebook assignments from unmasked image views. The student network is trained to predict these assignments from masked views using two key objectives: masked same-view token assignment prediction (promoting contextual reasoning) and masked cross-view average assignment prediction (enhancing perturbation invariance). |
MOCA achieves state-of-the-art results in low-shot ImageNet classification, outperforming existing methods by a significant margin.
It demonstrates strong performance in linear probing and fine-tuning evaluations for image classification and semantic segmentation tasks.
MOCA exhibits superior computational efficiency, requiring significantly less training time compared to competing methods, while maintaining competitive performance. |
The paper explores the impact of decoder depth on performance but primarily focuses on ViT-B/16 architecture; investigating other architectures could be beneficial.
While MOCA excels in low-shot learning, exploring its performance on a wider range of downstream tasks and datasets would provide a more comprehensive evaluation of its capabilities. |
self-supervised learning, vision transformers, representation learning, masked image modeling, low-shot learning |
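The teacher in such a teacher-student scheme is a momentum-updated (exponential moving average) copy of the student. A minimal sketch of that update is shown below; the dictionary-of-arrays parameter format, the function name `momentum_update`, and the momentum value are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def momentum_update(teacher, student, momentum=0.99):
    """Return new teacher parameters as an exponential moving average.

    teacher, student: dicts mapping parameter names to arrays (assumed format).
    """
    return {name: momentum * teacher[name] + (1.0 - momentum) * student[name]
            for name in teacher}

teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
teacher = momentum_update(teacher, student, momentum=0.9)
```

The teacher receives no gradients; it only drifts slowly toward the student, which keeps the prediction targets stable during training.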
2307.09283
Report |
RepViT: Revisiting Mobile CNN From ViT Perspective |
Ao Wang, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding |
Recently, lightweight Vision Transformers (ViTs) have demonstrated superior
performance and lower latency, compared with lightweight Convolutional Neural
Networks (CNNs), on resource-constrained mobile devices. Researchers have
discovered many structural connections between lightweight ViTs and lightweight
CNNs. However, the notable architectural disparities in the block structure,
macro, and micro designs between them have not been adequately examined. In
this study, we revisit the efficient design of lightweight CNNs from a ViT
perspective and emphasize their promising prospect for mobile devices.
Specifically, we incrementally enhance the mobile-friendliness of a standard
lightweight CNN, i.e., MobileNetV3, by integrating the efficient architectural
designs of lightweight ViTs. This ends up with a new family of pure lightweight
CNNs, namely RepViT. Extensive experiments show that RepViT outperforms
existing state-of-the-art lightweight ViTs and exhibits favorable latency in
various vision tasks. Notably, on ImageNet, RepViT achieves over 80% top-1
accuracy with 1.0 ms latency on an iPhone 12, which is the first time for a
lightweight model, to the best of our knowledge. Besides, when RepViT meets
SAM, our RepViT-SAM can achieve nearly 10× faster inference than the
advanced MobileSAM. Code and models are available at
https://github.com/THU-MIG/RepViT. |
This paper introduces RepViT, a family of pure lightweight Convolutional Neural Networks (CNNs) designed for mobile devices, achieving state-of-the-art performance by incorporating efficient architectural designs from lightweight Vision Transformers (ViTs). |
Lightweight ViTs, while demonstrating superior performance, face practical challenges due to inadequate hardware and computational library support. Lightweight CNNs, leveraging highly optimized convolution operations, prove advantageous for deployment on edge devices. |
The authors progressively enhance MobileNetV3-L by integrating efficient designs from lightweight ViTs, focusing on block structure, macro architecture (stem, downsampling layers, classifier, stage ratio), and micro design (kernel size, SE layer placement). |
RepViT consistently surpasses existing state-of-the-art lightweight ViTs and CNNs across diverse model sizes on ImageNet-1K, object detection, instance segmentation, and semantic segmentation benchmarks.
RepViT-M1.0 achieves over 80% top-1 accuracy on ImageNet with 1.0 ms latency on an iPhone 12, marking a first for lightweight models.
RepViT-SAM, integrating RepViT as the image encoder in the Segment Anything Model, exhibits exceptional efficiency on mobile devices while maintaining remarkable zero-shot transfer performance for downstream tasks. |
The study primarily focuses on iPhone 12 for latency measurement, potentially limiting generalizability to other mobile platforms.
Future exploration could involve investigating the effectiveness of RepViT's design principles on alternative lightweight CNN architectures beyond MobileNetV3. |
lightweight cnn, vision transformer, mobile devices, efficient architecture design, computer vision |
2307.09165
Report |
Towards Trustworthy Dataset Distillation |
Shijie Ma, Fei Zhu, Zhen Cheng, Xu-Yao Zhang |
Efficiency and trustworthiness are two eternal pursuits when applying deep
learning in real-world applications. With regard to efficiency, dataset
distillation (DD) endeavors to reduce training costs by distilling the large
dataset into a tiny synthetic dataset. However, existing methods merely
concentrate on in-distribution (InD) classification in a closed-world setting,
disregarding out-of-distribution (OOD) samples. On the other hand, OOD
detection aims to enhance models' trustworthiness, which is always
inefficiently achieved in full-data settings. For the first time, we
simultaneously consider both issues and propose a novel paradigm called
Trustworthy Dataset Distillation (TrustDD). By distilling both InD samples and
outliers, the condensed datasets can train models that are competent in both
InD classification and OOD detection. To alleviate the requirement of real
outlier data and make OOD detection more practical, we further propose to
corrupt InD samples to generate pseudo-outliers and introduce Pseudo-Outlier
Exposure (POE). Comprehensive experiments on various settings demonstrate the
effectiveness of TrustDD, and the proposed POE surpasses the state-of-the-art
method Outlier Exposure (OE). Compared with the preceding DD, TrustDD is more
trustworthy and applicable to real open-world scenarios. Our code will be
publicly available. |
The paper proposes Trustworthy Dataset Distillation (TrustDD), a novel paradigm that enhances dataset distillation by incorporating outlier exposure for improved out-of-distribution (OOD) detection. |
Existing dataset distillation methods focus only on in-distribution classification, neglecting OOD detection, which is crucial for real-world deployment where unknown data is expected. |
TrustDD extends the traditional dataset distillation framework by distilling both in-distribution samples and outliers, encouraging models to learn robust representations for both tasks. The authors further introduce Pseudo-Outlier Exposure (POE), a method for generating synthetic outliers from in-distribution data using corruption transformations. |
TrustDD significantly improves OOD detection performance without sacrificing in-distribution classification accuracy.
POE achieves comparable or even superior performance to Outlier Exposure (OE), which relies on curated outlier datasets.
TrustDD generalizes well across various network architectures and OOD detection scores. |
The current corruption transformations in POE are designed for natural images and might require adaptation for other data types.
Further investigation on the optimal ratio of distilled in-distribution samples and outliers for balancing efficiency and trustworthiness is needed. |
dataset distillation, out-of-distribution detection, trustworthy deep learning, pseudo-outlier exposure, open-world learning |
2307.08996
Report |
Towards Authentic Face Restoration with Iterative Diffusion Models and Beyond |
Yang Zhao, Tingbo Hou, Yu-Chuan Su, Xuhui Jia, Yandong Li, Matthias Grundmann |
Authentic face restoration systems are increasingly in demand in
many computer vision applications, e.g., image enhancement, video
communication, and portrait photography. Most of the advanced face restoration
models can recover high-quality faces from low-quality ones but usually fail to
faithfully generate realistic and high-frequency details that are favored by
users. To achieve authentic restoration, we propose IDM, an
Iteratively learned face restoration system based on denoising
Diffusion Models (DDMs). We define the criterion of an
authentic face restoration system, and argue that denoising diffusion models
are naturally endowed with this property from two aspects: intrinsic iterative
refinement and extrinsic iterative enhancement. Intrinsic learning can preserve
the content well and gradually refine the high-quality details, while extrinsic
enhancement helps clean the data and improve the restoration task one step
further. We demonstrate superior performance on blind face restoration tasks.
Beyond restoration, we find that the data authentically cleaned by the proposed
restoration system also helps image generation tasks in terms of
training stabilization and sample quality. Without modifying the models, we
achieve better quality than state-of-the-art on FFHQ and ImageNet generation
using either GANs or diffusion models. |
This paper proposes IDM, an Iteratively learned face restoration system using Denoising Diffusion Models (DDMs) for authentic face restoration. |
Existing face restoration models often fail to generate realistic high-frequency details and struggle to preserve delicate identity features. This paper addresses these challenges by introducing a novel approach using DDMs. |
The proposed IDM leverages intrinsic iterative refinement within DDMs and extrinsic iterative enhancement of training data. Intrinsic learning gradually refines details while preserving content through the DDM's iterative denoising process. Extrinsic learning utilizes the trained DDM to enhance the training data itself, leading to improved restoration quality in the next iteration. |
IDM achieves superior quantitative results on blind face restoration benchmarks, outperforming state-of-the-art methods like GFPGAN and CodeFormer in terms of PSNR, SSIM, LPIPS, and Arcface identity score.
Qualitative results demonstrate IDM's ability to generate more realistic and faithful face restorations, preserving high-frequency details and delicate identity features better than baselines.
Beyond restoration, the enhanced training data from IDM benefits image generation tasks, improving FID, precision, and recall scores for both GANs (StyleGAN2, BigGAN) and DDMs on FFHQ and ImageNet datasets. |
The efficiency of IDM could be a limitation, as it requires multiple diffusion steps during inference, making it slower than single-forward pass methods.
Further exploration of loss functions and optimizer settings for training DDMs could potentially address the observed color faithfulness issues with L2 loss. |
face restoration, denoising diffusion models, authentic restoration, image generation, iterative learning |
2307.08727
Report |
Learning to Count without Annotations |
Lukas Knobel, Tengda Han, Yuki M. Asano |
While recent supervised methods for reference-based object counting continue
to improve the performance on benchmark datasets, they have to rely on small
datasets due to the cost associated with manually annotating dozens of objects
in images. We propose UnCounTR, a model that can learn this task without
requiring any manual annotations. To this end, we construct "Self-Collages",
images with various pasted objects as training samples, that provide a rich
learning signal covering arbitrary object types and counts. Our method builds
on existing unsupervised representations and segmentation techniques to
successfully demonstrate for the first time the ability of reference-based
counting without manual supervision. Our experiments show that our method not
only outperforms simple baselines and generic models such as FasterRCNN and
DETR, but also matches the performance of supervised counting models in some
domains. |
This paper proposes UnCounTR, the first method for reference-based object counting that does not require any manual annotations. |
Manually annotating object counts in images is expensive and limits the size of datasets. This paper explores whether it's possible to learn counting without relying on these annotations. |
The paper introduces Self-Collages, a self-supervised method that generates training data by pasting objects onto background images. It uses an off-the-shelf pretrained DINO ViT backbone to extract features and train the counting model. |
UnCounTR outperforms strong baselines like DETR and achieves comparable performance to supervised methods on CARPK and MSO.
On FSC-147, UnCounTR outperforms the supervised method CounTR on low-count ranges and shows competitive results for medium counts.
The paper demonstrates that UnCounTR can be extended to perform self-supervised semantic counting, where the model identifies exemplars and counts them without any prior. |
UnCounTR's performance degrades for images with counts significantly higher than the ones seen during training, suggesting limits to its generalization abilities.
The paper primarily focuses on counting, leaving the exploration of using Self-Collages for related tasks like semantic instance segmentation as future work. |
object counting, self-supervised learning, few-shot learning, computer vision, unsupervised learning |
2307.08695
Report |
Neural Video Depth Stabilizer |
Yiran Wang, Min Shi, Jiaqi Li, Zihao Huang, Zhiguo Cao, Jianming Zhang, Ke Xian, Guosheng Lin |
Video depth estimation aims to infer temporally consistent depth. Some
methods achieve temporal consistency by finetuning a single-image depth model
during test time using geometry and re-projection constraints, which is
inefficient and not robust. An alternative approach is to learn how to enforce
temporal consistency from data, but this requires well-designed models and
sufficient video depth data. To address these challenges, we propose a
plug-and-play framework called Neural Video Depth Stabilizer (NVDS) that
stabilizes inconsistent depth estimations and can be applied to different
single-image depth models without extra effort. We also introduce a large-scale
dataset, Video Depth in the Wild (VDW), which consists of 14,203 videos with
over two million frames, making it the largest natural-scene video depth
dataset to our knowledge. We evaluate our method on the VDW dataset as well as
two public benchmarks and demonstrate significant improvements in consistency,
accuracy, and efficiency compared to previous approaches. Our work serves as a
solid baseline and provides a data foundation for learning-based video depth
models. We will release our dataset and code for future research. |
This paper introduces NVDS, a plug-and-play framework for improving temporal consistency in video depth estimation, and VDW, a large-scale natural-scene video depth dataset. |
Existing video depth estimation methods suffer from limitations: test-time training methods are computationally expensive and not robust, while learning-based methods lack sufficient training data. |
NVDS uses a Stabilization Network with cross-attention to refine flickering disparity maps from any single-image depth model. VDW provides diverse video data for training robust models. |
NVDS significantly outperforms previous methods in terms of consistency, accuracy, and efficiency.
VDW, with over 2 million frames, serves as the largest natural-scene video depth dataset to date.
Experiments demonstrate the effectiveness of NVDS with different depth predictors and the benefits of VDW for training. |
The current implementation of NVDS focuses on a specific framework.
Future work includes exploring alternative mechanisms and lightweight models for broader application. |
video depth estimation, temporal consistency, plug-and-play framework, dataset, deep learning |
2307.08629
Report |
Deficiency-Aware Masked Transformer for Video Inpainting |
Yongsheng Yu, Heng Fan, Libo Zhang |
Recent video inpainting methods have made remarkable progress by utilizing
explicit guidance, such as optical flow, to propagate cross-frame pixels.
However, there are cases where cross-frame recurrence of the masked video is
not available, resulting in a deficiency. In such situations, instead of
borrowing pixels from other frames, the focus of the model shifts towards
addressing the inverse problem. In this paper, we introduce a
dual-modality-compatible inpainting framework called Deficiency-aware Masked
Transformer (DMT), which offers three key advantages. Firstly, we pretrain an
image inpainting model, DMT_img, to serve as a prior for distilling the video model
DMT_vid, thereby benefiting the hallucination of deficiency cases. Secondly,
the self-attention module selectively incorporates spatiotemporal tokens to
accelerate inference and remove noise signals. Thirdly, a simple yet effective
Receptive Field Contextualizer is integrated into DMT, further improving
performance. Extensive experiments conducted on YouTube-VOS and DAVIS datasets
demonstrate that DMT_vid significantly outperforms previous solutions. The code
and video demonstrations can be found at github.com/yeates/DMT. |
This paper introduces Deficiency-aware Masked Transformer (DMT), a novel video inpainting framework that leverages pre-trained image inpainting models to enhance performance in deficiency cases (where the masked content is absent throughout the video). |
The paper addresses deficiency cases in video inpainting, where existing methods struggle to generate plausible content, and bridges the gap between image and video inpainting by transferring knowledge from a pre-trained image inpainting model. |
The authors propose a dual-modality-compatible framework with: (1) a pre-trained image inpainting model (DMT_img) serving as a prior for the video inpainting model (DMT_vid), (2) a Token Selection mechanism to focus on valid spatiotemporal tokens, (3) a Mask Activation strategy to iteratively hallucinate missing regions, and (4) a Receptive Field Contextualizer (RFC) to enhance spatial feature reconstruction. |
DMT_vid significantly outperforms state-of-the-art video inpainting methods on benchmark datasets, achieving higher PSNR and SSIM scores while reducing VFID.
The proposed framework effectively leverages the pre-trained image inpainting model to handle deficiency cases, demonstrating the benefits of knowledge transfer between image and video domains.
The Token Selection and Mask Activation mechanisms contribute to improved efficiency and performance by reducing computational complexity and enabling the reconstruction of missing tokens. |
The method's reliance on Transformers leads to high memory requirements when processing high-resolution videos.
Training a unified model for both image and video inpainting tasks poses challenges due to the inherent differences in their objectives and requirements. |
video inpainting, image inpainting, transformer, deficiency-aware, receptive field |
2307.08585
Report |
Identity-Preserving Aging of Face Images via Latent Diffusion Models |
Sudipta Banerjee, Govind Mittal, Ameya Joshi, Chinmay Hegde, Nasir Memon |
The performance of automated face recognition systems is inevitably impacted
by the facial aging process. However, high quality datasets of individuals
collected over several years are typically small in scale. In this work, we
propose, train, and validate the use of latent text-to-image diffusion models
for synthetically aging and de-aging face images. Our models succeed with
few-shot training, and have the added benefit of being controllable via
intuitive textual prompting. We observe high degrees of visual realism in the
generated images while maintaining biometric fidelity measured by commonly used
metrics. We evaluate our method on two benchmark datasets (CelebA and AgeDB)
and observe significant reduction (~44%) in the False Non-Match Rate compared
to existing state-of-the-art baselines. |
The paper proposes a novel method for age progression and regression of face images using latent text-to-image diffusion models, focusing on preserving biometric identity. |
Facial aging significantly impacts face recognition systems, and existing methods struggle to balance visual realism with biometric fidelity. This work addresses this gap by leveraging the power of diffusion models and identity-preserving techniques. |
The method adapts DreamBooth, a few-shot fine-tuning technique for latent text-to-image diffusion models, by incorporating biometric and contrastive losses during fine-tuning. This allows the model to learn identity-specific features while leveraging a regularization set of image-caption pairs to learn age progression concepts. |
The method generates visually compelling age-progressed and regressed images while maintaining high biometric fidelity, as demonstrated by user studies and quantitative metrics.
The proposed approach outperforms state-of-the-art methods like IPCGAN, AttGAN, and Talk-to-Edit, showing significant reduction in FNMR.
Fine-tuning face recognition models on the generated images leads to significant performance improvement, suggesting their potential for improving face recognition robustness. |
The method currently relies on fine-tuning for each individual, and exploring zero-shot learning is a future direction.
Further research can investigate the use of composable diffusion models for more fine-grained control over age editing. |
age progression, age regression, face recognition, latent diffusion models, biometric identity preservation |
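The report above measures biometric fidelity by the False Non-Match Rate (FNMR). As a minimal sketch of how that metric is usually computed, the snippet below counts genuine (same-identity) comparisons whose similarity score falls below the decision threshold. The function names, scores, and threshold are hypothetical and for illustration only; they are not taken from the paper.

```python
def false_non_match_rate(genuine_scores, threshold):
    """Fraction of genuine (same-identity) comparisons rejected at `threshold`.

    Assumes higher scores mean more similar faces; scores are hypothetical.
    """
    if not genuine_scores:
        raise ValueError("need at least one genuine comparison score")
    misses = sum(1 for score in genuine_scores if score < threshold)
    return misses / len(genuine_scores)


def relative_reduction(baseline_fnmr, new_fnmr):
    """Relative FNMR reduction of a new method with respect to a baseline."""
    if baseline_fnmr == 0:
        raise ValueError("baseline FNMR must be non-zero")
    return (baseline_fnmr - new_fnmr) / baseline_fnmr


if __name__ == "__main__":
    # Made-up similarity scores between an original face and its aged version.
    scores = [0.9, 0.4, 0.7, 0.2]
    print(false_non_match_rate(scores, threshold=0.5))
```

A relative reduction such as the reported ~44% would correspond to `relative_reduction(baseline, new)` returning roughly 0.44 for the two methods' FNMR values.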
2307.08581
Report |
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs |
Yang Zhao, Zhijie Lin, Daquan Zhou, Zilong Huang, Jiashi Feng, Bingyi Kang |
LLMs have demonstrated remarkable abilities at interacting with humans
through language, especially with the usage of instruction-following data.
Recent advancements in LLMs, such as MiniGPT-4, LLaVA, and X-LLM, further
enlarge their abilities by incorporating multi-modal inputs, including image,
video, and speech. Despite their effectiveness at generating precise and
detailed language understanding of the given modality signal, these LLMs give
up the ability to ground specific parts of inputs, thus only constructing a
coarse-grained mapping. However, explicit and informative correspondence
between text and other modalities will not only improve the user experience but
also help to expand the application scenario of multi-modal LLMs. Therefore, we
propose BuboGPT, a multi-modal LLM with visual grounding that can perform
cross-modal interaction between vision, audio and language, providing
fine-grained understanding of visual objects and other given modalities. As a
result, BuboGPT is able to point out the specific location of an object in the
image when it is generating a response or description for that object. Our
contributions are two-fold: 1) An off-the-shelf visual grounding module based
on SAM that extracts entities in a sentence and finds corresponding masks in the
image. 2) A two-stage training scheme and instruction dataset to endow joint
text-image-audio understanding. Our experiments show that BuboGPT achieves
impressive multi-modality understanding and visual grounding abilities during
interaction with humans. It performs consistently well when provided with
arbitrary modality combinations (either aligned or unaligned). Our code, model
and dataset are available at https://bubo-gpt.github.io . |
This paper presents BuboGPT, a multi-modal large language model (LLM) that incorporates visual grounding for fine-grained understanding of visual objects and their relationships with other modalities like text and audio. |
Existing multi-modal LLMs lack the ability to ground understanding in specific parts of inputs, limiting their interpretability and application scenarios. BuboGPT addresses this by linking visual objects with other modalities for fine-grained understanding. |
A two-stage training scheme is used: 1) Single-modal pre-training aligns vision and audio encoders with the LLM. 2) Multi-modal instruct tuning on a curated dataset with image-text, audio-text, and image-audio-text pairs, including negative pairs for better semantic reasoning. |
BuboGPT achieves impressive visual grounding, accurately associating textual descriptions with image regions.
The model demonstrates strong audio understanding, providing detailed descriptions even for subtle audio cues.
BuboGPT excels in aligned and arbitrary audio-image understanding, identifying sound sources in images and reasoning about the relationship between audio and visual inputs. |
Inherits language hallucination limitations from the underlying LLM, potentially generating non-factual information.
Grounding question answering (QA) capabilities are limited by the text-based connection between grounding results and modalities, requiring further improvement with fine-grained visual grounding datasets and spatial information integration. |
multi-modal learning, large language models, visual grounding, audio understanding, instruction tuning |
2307.08526
Report |
Image Captions are Natural Prompts for Text-to-Image Models |
Shiye Lei, Hao Chen, Sen Zhang, Bo Zhao, Dacheng Tao |
With the rapid development of Artificial Intelligence Generated Content
(AIGC), it has become common practice in many learning tasks to train or
fine-tune large models on synthetic data due to the data-scarcity and privacy
leakage problems. Albeit promising with unlimited data generation, owing to
massive and diverse information conveyed in real images, it is challenging for
text-to-image generative models to synthesize informative training data with
hand-crafted prompts, which usually leads to inferior generalization
performance when training downstream models. In this paper, we theoretically
analyze the relationship between the training effect of synthetic data and the
synthetic data distribution induced by prompts. Then we correspondingly propose
a simple yet effective method that prompts text-to-image generative models to
synthesize more informative and diverse training data. Specifically, we caption
each real image with the advanced captioning model to obtain informative and
faithful prompts that extract class-relevant information and clarify the
polysemy of class names. The image captions and class names are concatenated to
prompt generative models for training image synthesis. Extensive experiments on
ImageNette, ImageNet-100, and ImageNet-1K verify that our method significantly
improves the performance of models trained on synthetic training data, i.e.,
10% classification accuracy improvements on average. |
This paper proposes Caption in Prompt (CiP), a training-free method to synthesize informative training data using large text-to-image (T2I) models. It involves captioning real images and combining these captions with class names to prompt T2I models for generating synthetic training samples. |
Existing methods for generating synthetic training data using T2I models rely on simple prompts, resulting in limited information and diversity in the synthetic data. This leads to inferior generalization performance when training downstream models. CiP addresses this issue by creating more informative prompts based on real data. |
CiP first uses an off-the-shelf image captioning model to generate captions for real images. Then, it concatenates these captions with class names to form prompts for the T2I model. Finally, the T2I model generates synthetic images based on these constructed prompts. |
Guidance scale, a parameter in Stable Diffusion, significantly impacts the training effect of synthetic data, with a suitable range between 1.5 and 2.0.
CiP significantly improves the training effect of synthetic datasets, leading to a substantial increase (around 10%) in classification accuracy compared to basic prompts.
The quality of the captioning model used in CiP affects the performance, with BLIP-2 generating captions that lead to better results than ViT-GPT2. |
Generating training data via large diffusion models requires substantial computational resources, limiting scalability and edge-computing applications.
While CiP is more efficient than methods based on diffusion inversion, image editing, and fine-tuning, reducing synthesis cost remains important for broader adoption. |
synthetic data generation, text-to-image models, image captioning, stable diffusion, deep learning |
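The prompt construction summarized in the CiP report above (class name concatenated with a caption of a real image) can be sketched as follows. This is an illustrative assumption, not the authors' code: the function names, the comma-separated template, and the example strings are hypothetical, and the captioning model and text-to-image model are left out entirely.

```python
def build_cip_prompt(class_name, caption):
    """Join a class name and an image caption into one text-to-image prompt.

    The "class name, caption" template is an assumption for illustration.
    """
    class_name = class_name.strip()
    caption = caption.strip()
    if not caption:
        # Fall back to the basic class-name-only prompt.
        return class_name
    return f"{class_name}, {caption}"


def build_prompts(samples):
    """Build one prompt per (class_name, caption) pair."""
    return [build_cip_prompt(name, caption) for name, caption in samples]


if __name__ == "__main__":
    # Hypothetical captions; in CiP they would come from a captioning model.
    pairs = [
        ("crane", "a large bird standing in shallow water"),
        ("crane", "a yellow machine lifting steel beams at a building site"),
    ]
    for prompt in build_prompts(pairs):
        print(prompt)
```

The two example captions show how a caption can disambiguate a polysemous class name, which is the motivation the report gives for adding captions to the prompt.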
2307.08504
Report |
BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization |
Chaoya Jiang, Haiyang Xu, Wei Ye, Qinghao Ye, Chenliang Li, Ming Yan, Bin Bi, Shikun Zhang, Fei Huang, Songfang Huang |
Vision Transformer (ViT) based Vision-Language Pre-training (VLP) models have
demonstrated impressive performance in various tasks. However, the lengthy
visual token sequences fed into ViT can lead to training inefficiency and
ineffectiveness. Existing efforts address the challenge by either bottom-level
patch extraction in the ViT backbone or top-level patch abstraction outside,
not balancing training efficiency and effectiveness well. Inspired by text
summarization in natural language processing, we propose a Bottom-Up Patch
Summarization approach named BUS, coordinating bottom-level extraction and
top-level abstraction to learn a concise summary of lengthy visual token
sequences efficiently. Specifically, we incorporate a Text-Semantics-Aware
Patch Selector (TSPS) into the ViT backbone to perform a coarse-grained visual
token extraction and then attach a flexible Transformer-based Patch Abstraction
Decoder (PAD) upon the backbone for top-level visual abstraction. This
bottom-up collaboration enables our BUS to yield high training efficiency while
maintaining or even improving effectiveness. We evaluate our approach on
various visual-language understanding and generation tasks and show competitive
downstream task performance while boosting the training efficiency by 50%.
Additionally, our model achieves state-of-the-art performance on many
downstream tasks by increasing input image resolution without increasing
computational costs over baselines. |
This paper introduces BUS, a novel Vision-Language Pre-training (VLP) model that utilizes a bottom-up patch summarization approach for efficient and effective learning. |
Existing ViT-based VLP models suffer from training inefficiency and ineffectiveness due to lengthy visual token sequences. BUS addresses this by summarizing these sequences, balancing efficiency and effectiveness. |
The model employs a two-step process: 1) **Key Patch Extraction (KPE)** within the ViT backbone selects text-relevant patches using a Text-Semantics-Aware Patch Selector (TSPS). 2) **Text-Guided Patch Abstraction (TPA)** utilizes a lightweight Transformer-based Patch Abstraction Decoder (PAD) to further condense the selected patches into a concise visual summary. |
BUS achieves competitive or better performance on downstream tasks like VQA, image captioning, and retrieval while being significantly faster than previous VLP models.
The model can process higher resolution images without increased computational cost, leading to state-of-the-art results on tasks like VQA.
Ablation studies confirm the effectiveness of both KPE and TPA, highlighting their contribution to BUS's efficiency and accuracy. |
The paper primarily focuses on efficiency and effectiveness, leaving further exploration of the learned representations for future work.
The impact of varying the number of selected patches on model performance could be further investigated. |
vision-language pre-training, vision transformer, patch summarization, cross-modal learning, efficiency |
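To make the key patch extraction step in the BUS report above more concrete, here is a minimal sketch of selecting text-relevant patches. It is a stand-in, not the paper's selector: the report describes TSPS as a module inside the ViT backbone, whereas this sketch simply ranks patches by cosine similarity to a text embedding and keeps a fixed fraction. The function names, `keep_ratio`, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_text_relevant_patches(patch_tokens, text_embedding, keep_ratio=0.5):
    """Return indices of the patches most similar to the text, in original order.

    Cosine-similarity scoring is an assumption standing in for a learned selector.
    """
    num_keep = max(1, int(len(patch_tokens) * keep_ratio))
    scores = [cosine(patch, text_embedding) for patch in patch_tokens]
    ranked = sorted(range(len(patch_tokens)), key=lambda i: scores[i], reverse=True)
    # Restore spatial order so downstream layers see patches in sequence.
    return sorted(ranked[:num_keep])


if __name__ == "__main__":
    patches = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    text = [1.0, 0.0]
    print(select_text_relevant_patches(patches, text, keep_ratio=0.5))
```

Shortening the token sequence this way is what reduces the cost of the later Transformer layers; the abstraction decoder described in the report would then condense the kept patches further.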
2307.08500
Report |
Cumulative Spatial Knowledge Distillation for Vision Transformers |
Borui Zhao, Renjie Song, Jiajun Liang |
Distilling knowledge from convolutional neural networks (CNNs) is a
double-edged sword for vision transformers (ViTs). It boosts the performance
since the image-friendly local-inductive bias of CNN helps ViT learn faster and
better, but leads to two problems: (1) Network designs of CNN and ViT are
completely different, which leads to different semantic levels of intermediate
features, making spatial-wise knowledge transfer methods (e.g., feature
mimicking) inefficient. (2) Distilling knowledge from CNN limits the network
convergence in the later training period since ViT's capability of integrating
global information is suppressed by CNN's local-inductive-bias supervision. To
this end, we present Cumulative Spatial Knowledge Distillation (CSKD). CSKD
distills spatial-wise knowledge to all patch tokens of ViT from the
corresponding spatial responses of CNN, without introducing intermediate
features. Furthermore, CSKD exploits a Cumulative Knowledge Fusion (CKF)
module, which introduces the global response of CNN and increasingly emphasizes
its importance during the training. Applying CKF leverages CNN's local
inductive bias in the early training period and gives full play to ViT's global
capability in the later one. Extensive experiments and analysis on ImageNet-1k
and downstream datasets demonstrate the superiority of our CSKD. Code will be
publicly available. |
This paper proposes Cumulative Spatial Knowledge Distillation (CSKD), a novel knowledge distillation technique for Vision Transformers (ViTs) that addresses limitations of distilling from Convolutional Neural Networks (CNNs). |
Distilling knowledge from CNNs to ViTs, while beneficial, presents challenges: 1) misaligned intermediate feature semantics due to architectural differences and 2) hindering ViT's global information integration capabilities in later training stages. |
CSKD transfers spatial knowledge by using dense predictions from CNN's last features to supervise corresponding ViT patch tokens, avoiding intermediate feature alignment issues. It incorporates a Cumulative Knowledge Fusion (CKF) module that progressively emphasizes CNN's global response, balancing local and global knowledge transfer throughout training. |
CSKD consistently outperforms DeiT and DearKD baselines on ImageNet-1k, achieving up to +1.8% top-1 accuracy improvement.
The method demonstrates superior transfer learning performance on downstream datasets like CIFAR, Cars, and iNat19, indicating improved generalization.
Visualizations of attention distances and heatmaps confirm that CSKD effectively leverages ViT's global modeling capacity. |
The study primarily focuses on image classification tasks, leaving its application to other vision tasks for future exploration.
The current implementation relies on a pre-trained CNN teacher; exploring student-teacher co-training could further enhance performance. |
knowledge distillation, vision transformer, convolutional neural network, spatial knowledge transfer, cumulative knowledge fusion |
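The Cumulative Knowledge Fusion idea in the CSKD report above (the CNN's global response gains weight as training proceeds) can be illustrated with a small sketch. The linear schedule, the convex combination, and all names here are assumptions chosen for clarity; the report does not specify the exact fusion rule.

```python
def ckf_weight(step, total_steps):
    """Weight on the global response, growing from 0 to 1 over training.

    A linear schedule is assumed here purely for illustration.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(max(step / total_steps, 0.0), 1.0)


def fuse_responses(local_response, global_response, step, total_steps):
    """Blend a per-patch (local) teacher response with the global teacher response."""
    alpha = ckf_weight(step, total_steps)
    return [
        (1.0 - alpha) * local + alpha * glob
        for local, glob in zip(local_response, global_response)
    ]


if __name__ == "__main__":
    local = [2.0, 0.0]    # hypothetical CNN response at one spatial location
    global_ = [4.0, 2.0]  # hypothetical CNN response after global pooling
    for step in (0, 50, 100):
        print(step, fuse_responses(local, global_, step, total_steps=100))
```

Early in training the fused target equals the local response, matching the report's point that the CNN's local inductive bias helps at first; by the end it equals the global response, which leaves room for the ViT's global modeling.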
2307.08448
Report |
Not All Steps are Created Equal: Selective Diffusion Distillation for Image Manipulation |
Luozhou Wang, Shuai Yang, Shu Liu, Ying-cong Chen |
Conditional diffusion models have demonstrated impressive performance in
image manipulation tasks. The general pipeline involves adding noise to the
image and then denoising it. However, this method faces a trade-off problem:
adding too much noise affects the fidelity of the image while adding too little
affects its editability. This largely limits their practical applicability. In
this paper, we propose a novel framework, Selective Diffusion Distillation
(SDD), that ensures both the fidelity and editability of images. Instead of
directly editing images with a diffusion model, we train a feedforward image
manipulation network under the guidance of the diffusion model. Besides, we
propose an effective indicator to select the semantic-related timestep to
obtain the correct semantic guidance from the diffusion model. This approach
successfully avoids the dilemma caused by the diffusion process. Our extensive
experiments demonstrate the advantages of our framework. Code is released at
https://github.com/AndysonYs/Selective-Diffusion-Distillation. |
This paper proposes Selective Diffusion Distillation (SDD), a novel image manipulation framework that leverages a pre-trained text-guided diffusion model to supervise an efficient feedforward image manipulator, avoiding the editability-fidelity trade-off common in direct diffusion-based editing. |
Existing diffusion-based image manipulation methods suffer from a trade-off between editability and fidelity, limiting their practicality. This paper aims to overcome this limitation by introducing a new framework that separates the manipulation process from the diffusion process. |
The proposed SDD framework utilizes a pre-trained diffusion model as a supervisor to train a feedforward image manipulator (e.g., StyleGAN with a latent mapper). It introduces the Hybrid Quality Score (HQS) to select semantically relevant diffusion timesteps, ensuring the manipulator receives optimal guidance from the diffusion model. |
SDD successfully performs various image manipulations across different domains (faces, cats, cars) while preserving high fidelity to the input image.
Compared to other diffusion-based methods, SDD achieves higher CLIP similarity (better semantic alignment with the text prompt) and lower FID (better image quality).
SDD demonstrates superior efficiency compared to diffusion-based methods when manipulating a large number of images. |
The HQS selection strategy relies on empirical observations and might require further investigation for optimal performance.
Future work can explore different architectures for the image manipulator and extend the method to a wider range of image manipulation tasks. |
image manipulation, diffusion models, knowledge distillation, text-guided image editing, hybrid quality score |
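The fidelity-editability trade-off described in the SDD report above follows from the standard forward diffusion step, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. The sketch below computes the two coefficients at different timesteps. The linear beta schedule and its default values are an assumption (a commonly used DDPM setting), and the snippet does not implement SDD's timestep indicator.

```python
import math


def linear_betas(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances; the default values are an assumption."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def signal_and_noise_scales(betas, t):
    """Return (sqrt(alpha_bar_t), sqrt(1 - alpha_bar_t)) for timestep index t."""
    alpha_bar = 1.0
    for beta in betas[: t + 1]:
        alpha_bar *= 1.0 - beta
    return math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)


if __name__ == "__main__":
    betas = linear_betas()
    for t in (50, 400, 800):
        signal, noise = signal_and_noise_scales(betas, t)
        print(f"t={t}: signal scale={signal:.3f}, noise scale={noise:.3f}")
```

At small t most of the input image survives (high fidelity, little room to edit); at large t the noise term dominates (more editability, less fidelity). This is the dilemma that motivates choosing the timestep carefully.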
2307.08436
Report |
DOT: A Distillation-Oriented Trainer |
Borui Zhao, Quan Cui, Renjie Song, Jiajun Liang |
Knowledge distillation transfers knowledge from a large model to a small one
via task and distillation losses. In this paper, we observe a trade-off between
task and distillation losses, i.e., introducing distillation loss limits the
convergence of task loss. We believe that the trade-off results from the
insufficient optimization of distillation loss. The reasoning is as follows: the
teacher has a lower task loss than the student, and a lower distillation loss makes
the student more similar to the teacher, so a better-converged task loss could be
obtained. To break the trade-off, we propose the Distillation-Oriented Trainer
(DOT). DOT separately considers gradients of task and distillation losses, then
applies a larger momentum to distillation loss to accelerate its optimization.
We empirically prove that DOT breaks the trade-off, i.e., both losses are
sufficiently optimized. Extensive experiments validate the superiority of DOT.
Notably, DOT achieves a +2.59% accuracy improvement on ImageNet-1k for the
ResNet50-MobileNetV1 pair. Conclusively, DOT greatly benefits the student's
optimization properties in terms of loss convergence and model generalization.
Code will be made publicly available. |
This paper proposes Distillation-Oriented Trainer (DOT) to address the trade-off between task and distillation losses in knowledge distillation, aiming to improve student model convergence and generalization. |
Knowledge distillation, while effective in transferring knowledge, often encounters a trade-off where improving distillation loss hinders task loss convergence, limiting the student model's performance. |
DOT tackles this trade-off by employing separate momentum buffers for task and distillation losses during optimization. It assigns a larger momentum to the distillation loss, ensuring its gradients dominate the training process and lead to better knowledge transfer. |
DOT successfully breaks the task-distillation loss trade-off, achieving lower values for both losses simultaneously.
The method guides the student model towards flatter minima in the loss landscape, empirically demonstrating improved generalization ability.
DOT consistently enhances the performance of various distillation methods across benchmarks like CIFAR-100, Tiny-ImageNet, and ImageNet-1k, achieving new state-of-the-art results. |
The paper primarily focuses on image classification tasks, and further investigation is needed to assess DOT's effectiveness in other domains like natural language processing.
Future work could explore the optimal balance between task and distillation losses during different training stages for potential further improvements. |
knowledge distillation, optimization, deep learning, model compression, loss landscape |
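The DOT report above describes keeping separate momentum buffers for the task and distillation gradients, with a larger momentum on the distillation side. A minimal sketch of one such update on plain Python lists follows. The function name, the symmetric `momentum ± delta` split, and the default values are assumptions for illustration; the report only states that the distillation loss receives the larger momentum.

```python
def dot_sgd_step(params, grad_task, grad_kd, buf_task, buf_kd,
                 lr=0.1, momentum=0.9, delta=0.075):
    """One SGD step with separate momentum buffers per loss.

    `momentum - delta` is used for the task loss and `momentum + delta` for the
    distillation loss; this symmetric split is an illustrative assumption.
    """
    m_task = momentum - delta
    m_kd = momentum + delta
    new_params, new_buf_task, new_buf_kd = [], [], []
    for p, g_t, g_k, b_t, b_k in zip(params, grad_task, grad_kd, buf_task, buf_kd):
        b_t = m_task * b_t + g_t  # task-loss momentum buffer
        b_k = m_kd * b_k + g_k    # distillation-loss momentum buffer
        new_params.append(p - lr * (b_t + b_k))
        new_buf_task.append(b_t)
        new_buf_kd.append(b_k)
    return new_params, new_buf_task, new_buf_kd


if __name__ == "__main__":
    params, buf_t, buf_k = [1.0], [0.0], [0.0]
    for _ in range(3):
        # Constant unit gradients, just to show how the two buffers diverge.
        params, buf_t, buf_k = dot_sgd_step(params, [1.0], [1.0], buf_t, buf_k)
        print(params, buf_t, buf_k)
```

With identical gradients, the distillation buffer accumulates faster than the task buffer, so distillation gradients carry more weight in each update. Setting `delta=0` recovers ordinary momentum SGD on the summed loss.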
2307.08397
Report |
CLIP-Guided StyleGAN Inversion for Text-Driven Real Image Editing |
Ahmet Canberk Baykal, Abdul Basit Anees, Duygu Ceylan, Erkut Erdem, Aykut Erdem, Deniz Yuret |
Researchers have recently begun exploring the use of StyleGAN-based models
for real image editing. One particularly interesting application is using
natural language descriptions to guide the editing process. Existing approaches
for editing images using language either resort to instance-level latent code
optimization or map predefined text prompts to some editing directions in the
latent space. However, these approaches have inherent limitations. The former
is not very efficient, while the latter often struggles to effectively handle
multi-attribute changes. To address these weaknesses, we present CLIPInverter,
a new text-driven image editing approach that is able to efficiently and
reliably perform multi-attribute changes. The core of our method is the use of
novel, lightweight text-conditioned adapter layers integrated into pretrained
GAN-inversion networks. We demonstrate that by conditioning the initial
inversion step on the CLIP embedding of the target description, we are able to
obtain more successful edit directions. Additionally, we use a CLIP-guided
refinement step to make corrections in the resulting residual latent codes,
which further improves the alignment with the text prompt. Our method
outperforms competing approaches in terms of manipulation accuracy and
photo-realism on various domains including human faces, cats, and birds, as
shown by our qualitative and quantitative results. |
Presents CLIPInverter, a novel text-driven image editing approach that uses CLIP-guided adapter layers within pretrained GAN-inversion networks for efficient and reliable multi-attribute image manipulation. |
Existing methods for language-guided image editing are either inefficient (instance-level optimization) or struggle with multi-attribute changes (predefined text prompts). |
Integrates lightweight text-conditioned adapter layers (CLIPAdapter) into pretrained GAN-inversion networks. The initial inversion is conditioned on the CLIP embedding of the target description, and a CLIP-guided refinement step (CLIPRemapper) further improves alignment with the text prompt. |
Outperforms competing approaches in manipulation accuracy and photo-realism across various domains (faces, cats, birds).
Enables smooth image manipulations through latent code interpolation, offering user control over the editing process.
Demonstrates zero-shot capabilities by handling novel descriptions and using reference images as conditioning input. |
Inherits limitations of the underlying GAN inversion network, such as potential struggles with unusual poses or challenging lighting conditions.
May be affected by biases present in the training data, which can lead to undesired manipulations. This can be mitigated by using more comprehensive textual descriptions. |
image manipulation, text-guided editing, stylegan, clip, gan inversion |
2307.08199
Report |
Unbiased Image Synthesis via Manifold Guidance in Diffusion Models |
Xingzhe Su, Daixi Jia, Fengge Wu, Junsuo Zhao, Changwen Zheng, Wenwen Qiang |
Diffusion Models are a potent class of generative models capable of producing
high-quality images. However, they often inadvertently favor certain data
attributes, undermining the diversity of generated images. This issue is
starkly apparent in skewed datasets like CelebA, where the initial dataset
disproportionately favors females over males by 57.9%, and this bias is amplified
in generated data, where female representation outstrips males by 148%. In
response, we propose a plug-and-play method named Manifold Guidance Sampling,
which is also the first unsupervised method to mitigate the bias issue in DDPMs.
Leveraging the inherent structure of the data manifold, this method steers the
sampling process towards a more uniform distribution, effectively dispersing
the clustering of biased data. Without the need for modifying the existing
model or additional training, it significantly mitigates data bias and enhances
the quality and unbiasedness of the generated images. |
This paper proposes Manifold Guidance Sampling (MGS), a plug-and-play, unsupervised method to mitigate bias in Denoising Diffusion Probabilistic Models (DDPMs) by guiding the sampling process towards a more uniform distribution on the data manifold. |
DDPMs, despite their success in image synthesis, inherit and often amplify biases present in training data, leading to skewed and unrepresentative generated images. This underscores the need for methods like MGS to ensure fairness and diversity in generated data. |
MGS operates in two stages: (1) It evaluates the data manifold by learning an efficient mapping from high-dimensional image space to a low-dimensional feature space, capturing the intrinsic data structure. (2) It incorporates manifold constraints into the DDPM sampling process, promoting a uniform distribution of generated samples across the learned manifold. |
MGS effectively reduces bias in generated images, demonstrated through analysis of attribute distributions on the CelebA dataset.
MGS enhances both the quality and diversity of generated images compared to standard DDPM sampling, evidenced by improved FID and sFID scores across multiple datasets.
MGS is a versatile and adaptable method, compatible with various DDPM architectures and sampling schedules, and does not require model retraining or label information. |
While significantly mitigating bias, MGS doesn't completely eliminate it, suggesting room for further improvement.
The effectiveness of MGS may vary across different datasets and bias types, necessitating further investigation into its generalizability and potential limitations. |
diffusion models, image synthesis, data bias, manifold learning, unsupervised learning |
2307.08093
Report |
Cross-Ray Neural Radiance Fields for Novel-view Synthesis from Unconstrained Image Collections |
Yifan Yang, Shuhai Zhang, Zixiong Huang, Yubing Zhang, Mingkui Tan |
Neural Radiance Fields (NeRF) is a revolutionary approach for rendering
scenes by sampling a single ray per pixel and it has demonstrated impressive
capabilities in novel-view synthesis from static scene images. However, in
practice, we usually need to recover NeRF from unconstrained image collections,
which poses two challenges: 1) the images often have dynamic changes in
appearance because of different capturing time and camera settings; 2) the
images may contain transient objects such as humans and cars, leading to
occlusion and ghosting artifacts. Conventional approaches seek to address these
challenges by locally utilizing a single ray to synthesize a color of a pixel.
In contrast, humans typically perceive appearance and objects by globally
utilizing information across multiple pixels. To mimic the perception process
of humans, in this paper, we propose Cross-Ray NeRF (CR-NeRF) that leverages
interactive information across multiple rays to synthesize occlusion-free novel
views with the same appearances as the images. Specifically, to model varying
appearances, we first propose to represent multiple rays with a novel cross-ray
feature and then recover the appearance by fusing global statistics, i.e.,
feature covariance of the rays and the image appearance. Moreover, to avoid
occlusion introduced by transient objects, we propose a transient objects
handler and introduce a grid sampling strategy for masking out the transient
objects. We theoretically find that leveraging correlation across multiple rays
promotes capturing more global information. Moreover, extensive experimental
results on large real-world datasets verify the effectiveness of CR-NeRF. |
This paper proposes Cross-Ray NeRF (CR-NeRF), a novel method for synthesizing novel views from unconstrained image collections by leveraging interactions among multiple rays to address varying appearances and transient occlusions. |
Existing NeRF methods struggle with unconstrained image collections due to their static scene assumption, leading to inaccurate reconstructions with over-smoothing and ghosting artifacts. CR-NeRF aims to overcome these limitations and enable realistic novel view synthesis from diverse and dynamic scenes. |
CR-NeRF introduces a cross-ray paradigm with two key components: (1) Cross-ray appearance modeling: representing multiple rays with a cross-ray feature, fusing it with an appearance embedding using global statistics (feature covariance), and decoding it to obtain pixel colors simultaneously. (2) Cross-ray transient object handling: employing a segmentation network to generate a visibility map for transient objects and using grid sampling to pair the map with the input rays. |
CR-NeRF outperforms state-of-the-art methods like NeRF-W and Ha-NeRF on benchmark datasets, achieving higher PSNR, SSIM, and lower LPIPS values.
CR-NeRF demonstrates superior ability in modeling varying appearances, especially for images with high-frequency information, and effectively removes transient objects like tourists and cars.
CR-NeRF is markedly more efficient at inference than Ha-NeRF when handling multiple images with varying appearances but fixed camera positions. |
The definition of transient objects needs further exploration for more robust handling.
The current method focuses on synthesizing static scenes and could be extended to dynamic scenes with moving objects in the future. |
novel view synthesis, neural radiance fields, unconstrained image collections, appearance modeling, transient object handling |
2307.08076
Report |
Diffusion to Confusion: Naturalistic Adversarial Patch Generation Based on Diffusion Model for Object Detector |
Shuo-Yen Lin, Ernie Chu, Che-Hsien Lin, Jun-Cheng Chen, Jia-Ching Wang |
Many physical adversarial patch generation methods are widely proposed to
protect personal privacy from malicious monitoring using object detectors.
However, they usually fail to generate satisfactory patch images in terms of
both stealthiness and attack performance without making huge efforts on careful
hyperparameter tuning. To address this issue, we propose a novel naturalistic
adversarial patch generation method based on the diffusion models (DM). Through
sampling the optimal image from the DM model pretrained upon natural images, it
allows us to stably craft high-quality physical adversarial patches that look
naturalistic to humans, without suffering from the serious mode collapse problems
of other deep generative models. To the best of our knowledge, we are the first to
propose DM-based naturalistic adversarial patch generation for object
detectors. With extensive quantitative, qualitative, and subjective
experiments, the results demonstrate the effectiveness of the proposed approach
to generate better-quality and more naturalistic adversarial patches than other
state-of-the-art patch generation methods while achieving acceptable attack
performance. We also show various generation trade-offs under different
conditions. |
This paper proposes a novel method for generating naturalistic adversarial patches for object detectors, leveraging diffusion models (DM) pre-trained on natural images. |
Existing adversarial patch generation methods often create visually conspicuous patterns or require extensive hyperparameter tuning to balance attack performance and natural appearance. This work addresses these limitations by utilizing the power of diffusion models in generating high-quality and diverse images. |
The method introduces a novel Adversarial Patch Sampling (APS) technique based on DDIM. It optimizes an initial patch generated from a text-conditioned LDM by backpropagating the object detector's loss into the LDM's sampling process. Text conditioning and strategic noise injection during optimization contribute to maintaining the naturalism of the generated patches. |
The proposed method maintains acceptable attack performance relative to previous state-of-the-art patch generation methods on various object detectors.
Subjective evaluations through user studies confirm that the generated patches are perceived as more natural than those from previous methods, often even surpassing real-world images.
The method exhibits robustness to existing defenses like SAC, showcasing its effectiveness in real-world scenarios. |
The generalization of adversarial patches across different datasets requires further investigation and potential improvement.
The computational cost associated with the diffusion model sampling process, although mitigated by DDIM and LDM, remains a consideration for future work. |
adversarial patch, object detection, diffusion models, naturalistic patch, physical adversarial examples |
2307.08041
Report |
Planting a SEED of Vision in Large Language Model |
Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, Ying Shan |
We present SEED, an elaborate image tokenizer that empowers Large Language
Models (LLMs) with the emergent ability to SEE and Draw at the same time.
Research on image tokenizers has previously reached an impasse, as frameworks
employing quantized visual tokens have lost prominence due to subpar
performance and convergence in multimodal comprehension (compared to BLIP-2,
etc.) or generation (compared to Stable Diffusion, etc.). Despite the
limitations, we remain confident in its natural capacity to unify visual and
textual representations, facilitating scalable multimodal training with LLM's
original recipe. In this study, we identify two crucial principles for the
architecture and training of SEED that effectively ease subsequent alignment
with LLMs. (1) Image tokens should be independent of 2D physical patch
positions and instead be produced with a 1D causal dependency, exhibiting
intrinsic interdependence that aligns with the left-to-right autoregressive
prediction mechanism in LLMs. (2) Image tokens should capture high-level
semantics consistent with the degree of semantic abstraction in words, and be
optimized for both discriminativeness and reconstruction during the tokenizer
training phase. As a result, the off-the-shelf LLM is able to perform both
image-to-text and text-to-image generation by incorporating our SEED through
efficient LoRA tuning. Comprehensive multimodal pretraining and instruction
tuning, which may yield improved results, are reserved for future
investigation. This version of SEED was trained in 5.7 days using only 64 V100
GPUs and 5M publicly available image-text pairs. Our preliminary study
emphasizes the great potential of discrete visual tokens in versatile
multimodal LLMs and the importance of proper image tokenizers in broader
research. |
This paper introduces SEED, a novel image tokenizer designed to equip Large Language Models (LLMs) with the capacity for both visual understanding (seeing) and generation (drawing). |
Existing image tokenizers struggle to effectively bridge the gap between visual and textual representations, hindering the development of truly versatile multimodal LLMs. |
SEED leverages a VQ-based approach, employing a Causal Q-Former to produce discrete visual tokens with 1D causal dependency and high-level semantic information. It further incorporates a Reverse Q-Former to align visual tokens with the latent space of Stable Diffusion for image generation. |
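The vector-quantization step mentioned above can be illustrated with a generic nearest-neighbor codebook lookup. This is a minimal sketch of standard VQ, not the SEED implementation: the function name `quantize`, the toy codebook, and the 2-D features are assumptions made for illustration, and the Causal Q-Former that would produce the features is not modeled.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    # Pairwise squared Euclidean distances, shape (num_features, num_codes).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    ids = dists.argmin(axis=1)
    # Return the discrete token ids and the quantized vectors they stand for.
    return ids, codebook[ids]

# Hypothetical 4-entry codebook and three continuous features.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
features = np.array([[0.9, 0.1], [0.2, 0.8], [0.1, 0.1]])

token_ids, quantized = quantize(features, codebook)
print(token_ids)  # discrete visual tokens, one per feature
```

The resulting integer ids are what an LLM would consume as discrete visual tokens in place of the continuous features.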
SEED tokens demonstrate competitive performance in text-image retrieval tasks compared to BLIP-2.
SEED facilitates efficient alignment with LLMs through LoRA tuning, enabling text-to-image and image-to-text generation.
Preliminary experiments with SEED-OPT$_{2.7B}$ show promising results in zero-shot image captioning, visual QA, and image generation. |
The current SEED implementation is limited by the scale of training data (5M image-text pairs) and the size of the LLM used (OPT$_{2.7B}$).
Future work will explore more comprehensive multimodal pretraining and instruction tuning to further enhance SEED’s capabilities. |
image tokenization, multimodal llms, visual comprehension, image generation, causal dependency |
2307.08012
Report |
Householder Projector for Unsupervised Latent Semantics Discovery |
Yue Song, Jichao Zhang, Nicu Sebe, Wei Wang |
Generative Adversarial Networks (GANs), especially the recent style-based
generators (StyleGANs), have versatile semantics in the structured latent
space. Latent semantics discovery methods emerge to move around the latent code
such that only one factor varies during the traversal. Recently, an
unsupervised method proposed a promising direction to directly use the
eigenvectors of the projection matrix that maps latent codes to features as the
interpretable directions. However, one overlooked fact is that the projection
matrix is non-orthogonal and the number of eigenvectors is too large. The
non-orthogonality would entangle semantic attributes in the top few
eigenvectors, and the large dimensionality might result in meaningless
variations among the directions even if the matrix is orthogonal. To avoid
these issues, we propose Householder Projector, a flexible and general low-rank
orthogonal matrix representation based on Householder transformations, to
parameterize the projection matrix. The orthogonality guarantees that the
eigenvectors correspond to disentangled interpretable semantics, while the
low-rank property encourages that each identified direction has meaningful
variations. We integrate our projector into pre-trained StyleGAN2/StyleGAN3 and
evaluate the models on several benchmarks. Within only $1\%$ of the original
training steps for fine-tuning, our projector helps StyleGANs to discover more
disentangled and precise semantic attributes without sacrificing image
fidelity. |
This paper introduces Householder Projector, a low-rank orthogonal matrix representation based on Householder transformations, to parameterize the projection matrix in StyleGANs for enhanced unsupervised latent semantics discovery. |
Existing unsupervised methods for discovering interpretable directions in StyleGANs suffer from entangled semantics due to imbalanced eigenvalues in the projection matrix. Additionally, enforcing vanilla orthogonality can lead to meaningless variations due to the high dimensionality of the projector. |
The proposed method decomposes the projection matrix into its SVD form and represents the orthogonal singular vectors using Householder reflectors. A low-rank identity matrix is employed for singular values, enabling control over the number of semantic concepts. The method leverages pre-trained weights for initialization and employs acceleration techniques for efficient computation. |
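The construction described above can be sketched numerically: a product of Householder reflectors is orthogonal, and pairing two such matrices with a low-rank identity of singular values gives a low-rank projector. This is a minimal NumPy sketch under stated assumptions; the function names, the dimensions, and the random reflector vectors are illustrative and are not taken from the paper's code.

```python
import numpy as np

def householder_orthogonal(vectors):
    """Multiply reflectors H_i = I - 2 v_i v_i^T / (v_i^T v_i) into one orthogonal matrix."""
    dim = vectors.shape[1]
    q = np.eye(dim)
    for v in vectors:
        v = v / np.linalg.norm(v)
        q = q @ (np.eye(dim) - 2.0 * np.outer(v, v))
    return q

def low_rank_projector(u_vectors, v_vectors, rank):
    """Assemble U @ S @ V^T where S has ones on its first `rank` diagonal entries."""
    u = householder_orthogonal(u_vectors)
    v = householder_orthogonal(v_vectors)
    s = np.zeros((u.shape[0], v.shape[0]))
    s[np.arange(rank), np.arange(rank)] = 1.0
    return u @ s @ v.T

rng = np.random.default_rng(0)
u_vecs = rng.normal(size=(8, 8))  # hypothetical: 8 reflectors in R^8
v_vecs = rng.normal(size=(6, 6))  # hypothetical: 6 reflectors in R^6
projector = low_rank_projector(u_vecs, v_vecs, rank=3)

# The non-zero singular values are all one, so no direction dominates.
print(np.round(np.linalg.svd(projector, compute_uv=False), 6))
```

In a trainable setting the reflector vectors would be the learned parameters, so orthogonality holds by construction rather than through a penalty term.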
Householder Projector significantly improves latent semantics discovery in StyleGANs, leading to more precise attribute control without compromising image fidelity.
The method outperforms other unsupervised baselines in terms of latent space smoothness (PPL and PIPL) and maintains competitive image quality (FID).
Householder Projector enables the discovery of diverse and semantically consistent interpretable directions across different layers and datasets. |
Current experiments focus on fine-tuning pre-trained StyleGANs, and training from scratch could potentially further enhance performance.
The number of semantics per layer is currently pre-defined, and exploring adaptive schemes for automatic semantic mining is an area for future work. |
generative adversarial networks, latent semantics discovery, stylegan, householder transformations, disentanglement |
2307.07790
Report |
Adaptive Nonlinear Latent Transformation for Conditional Face Editing |
Zhizhong Huang, Siteng Ma, Junping Zhang, Hongming Shan |
Recent works for face editing usually manipulate the latent space of StyleGAN
via the linear semantic directions. However, they usually suffer from the
entanglement of facial attributes, need to tune the optimal editing strength,
and are limited to binary attributes with strong supervision signals. This
paper proposes a novel adaptive nonlinear latent transformation for
disentangled and conditional face editing, termed AdaTrans. Specifically, our
AdaTrans divides the manipulation process into several finer steps; i.e., the
direction and size at each step are conditioned on both the facial attributes
and the latent codes. In this way, AdaTrans describes an adaptive nonlinear
transformation trajectory to manipulate the faces into target attributes while
keeping other attributes unchanged. Then, AdaTrans leverages a predefined
density model to constrain the learned trajectory in the distribution of latent
codes by maximizing the likelihood of transformed latent code. Moreover, we
also propose a disentangled learning strategy under a mutual information
framework to eliminate the entanglement among attributes, which can further
relax the need for labeled data. Consequently, AdaTrans enables a controllable
face editing with the advantages of disentanglement, flexibility with
non-binary attributes, and high fidelity. Extensive experimental results on
various facial attributes demonstrate the qualitative and quantitative
effectiveness of the proposed AdaTrans over existing state-of-the-art methods,
especially in the most challenging scenarios with a large age gap and few
labeled examples. The source code is available at
https://github.com/Hzzone/AdaTrans. |
Proposes AdaTrans, an adaptive nonlinear latent transformation method for disentangled and conditional face editing in StyleGAN, addressing limitations of linear interpolation methods. |
Existing linear methods suffer from attribute entanglement, require manual strength tuning, and are limited to binary attributes. AdaTrans aims to achieve disentanglement, flexibility with non-binary attributes, and high fidelity. |
Divides manipulation into finer steps with direction and size conditioned on attributes and latent codes, describing an adaptive nonlinear trajectory. Leverages a density model to constrain the trajectory within the latent space distribution. |
Achieves disentangled and controllable face editing, preserving unrelated attributes even with large age gaps.
Outperforms state-of-the-art methods in terms of editing accuracy, attribute preservation, and identity preservation.
Demonstrates flexibility by handling multi-attribute editing and maintaining performance with limited labeled data. |
Background preservation during editing is not addressed.
Further exploration of intermediate StyleGAN features for background preservation is planned as future work. |
face editing, stylegan, disentanglement, nonlinear transformation, latent space |
2307.07710
Report |
ExposureDiffusion: Learning to Expose for Low-light Image Enhancement |
Yufei Wang, Yi Yu, Wenhan Yang, Lanqing Guo, Lap-Pui Chau, Alex C. Kot, Bihan Wen |
Previous raw image-based low-light image enhancement methods predominantly
relied on feed-forward neural networks to learn deterministic mappings from
low-light to normally-exposed images. However, they failed to capture critical
distribution information, leading to visually undesirable results. This work
addresses the issue by seamlessly integrating a diffusion model with a
physics-based exposure model. Different from a vanilla diffusion model that has
to perform Gaussian denoising, with the injected physics-based exposure model,
our restoration process can directly start from a noisy image instead of pure
noise. As such, our method obtains significantly improved performance and
reduced inference time compared with vanilla diffusion models. To make full use
of the advantages of different intermediate steps, we further propose an
adaptive residual layer that effectively screens out the side-effect in the
iterative refinement when the intermediate results have been already
well-exposed. The proposed framework is compatible with real-paired datasets,
real/synthetic noise models, and different backbone networks. We evaluate the
proposed method on various
public benchmarks, achieving promising results with consistent improvements
using different exposure models and backbones. Besides, the proposed method
achieves better generalization capacity for unseen amplifying ratios and better
performance than a larger feedforward neural model when few parameters are
adopted. |
This paper proposes ExposureDiffusion, a method for low-light image enhancement in raw image space that integrates a diffusion model with a physics-based exposure model. |
Existing raw image enhancement methods rely on deterministic mappings, failing to capture distribution information and effectively incorporate noise models, leading to suboptimal results. |
The method simulates the exposure process using a progressive shared network, minimizing the divergence between the simulated process and the real physics-based exposure process. An adaptive residual layer dynamically fuses denoising strategies for areas with different noise levels. |
ExposureDiffusion achieves significant performance improvements over baseline methods on SID and ELD datasets.
The method demonstrates compatibility with different noise models and backbone networks, consistently enhancing results.
ExposureDiffusion exhibits better generalization ability for unseen amplification ratios compared to feedforward networks. |
The determination of optimal inference steps for varying noise levels needs further investigation.
Future work could explore adaptive algorithms for automatically determining the number of inference steps. |
low-light image enhancement, diffusion models, raw image processing, physics-based modeling, adaptive residual learning |
2307.07678
Report |
Both Spatial and Frequency Cues Contribute to High-Fidelity Image Inpainting |
Ze Lu, Yalei Lv, Wenqi Wang, Pengfei Xiong |
Deep generative approaches have obtained great success in image inpainting
recently. However, most generative inpainting networks suffer from either
over-smooth results or aliasing artifacts. The former lacks high-frequency
details, while the latter lacks semantic structure. To address this issue, we
propose an effective Frequency-Spatial Complementary Network (FSCN) by
exploiting rich semantic information in both spatial and frequency domains.
Specifically, we introduce an extra Frequency Branch and Frequency Loss on the
spatial-based network to impose direct supervision on the frequency
information, and propose a Frequency-Spatial Cross-Attention Block (FSCAB) to
fuse multi-domain features and combine the corresponding characteristics. With
our FSCAB, the inpainting network is capable of capturing frequency information
and preserving visual consistency simultaneously. Extensive quantitative and
qualitative experiments demonstrate that our inpainting network can effectively
achieve superior results, outperforming previous state-of-the-art approaches
with significantly fewer parameters and less computation cost. The code will be
released soon. |
This paper proposes a Frequency-Spatial Complementary Network (FSCN) for high-fidelity image inpainting. |
Most existing image inpainting networks suffer from either over-smooth results (lacking high-frequency details) or aliasing artifacts (lacking semantic structure) because they focus solely on either the spatial or the frequency domain. |
FSCN utilizes a Frequency Branch and Frequency Loss to capture high-frequency details and a spatial branch for semantic structures. It employs a Frequency-Spatial Cross-Attention Block (FSCAB) to effectively fuse features from both domains. |
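To make the idea of supervising frequency information concrete, the sketch below compares two images in the Fourier domain. It is one common form of frequency loss (mean absolute difference between 2-D spectra) and is an assumption: the summary does not give FSCN's exact loss, and `frequency_loss` and the toy images are hypothetical.

```python
import numpy as np

def frequency_loss(pred, target):
    """Mean absolute difference between the 2-D Fourier spectra of two images."""
    pred_f = np.fft.fft2(pred, axes=(-2, -1))
    target_f = np.fft.fft2(target, axes=(-2, -1))
    return float(np.mean(np.abs(pred_f - target_f)))

rng = np.random.default_rng(0)
target = rng.random((32, 32))
# A crude over-smoothed prediction: average each pixel with two shifted copies.
blurred = (target + np.roll(target, 1, axis=0) + np.roll(target, 1, axis=1)) / 3.0

print(frequency_loss(target, target))   # identical images give zero loss
print(frequency_loss(blurred, target))  # smoothing is penalized in the spectrum
```

A loss of this kind would typically be added to a spatial reconstruction loss, so that missing high-frequency detail is penalized directly.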
FSCN achieves state-of-the-art results on CelebA-HQ and Places datasets, outperforming previous methods in terms of FID, LPIPS, and SSIM.
It recovers fine-grained details and preserves semantic consistency effectively.
FSCN achieves superior results with significantly fewer parameters and less computational cost compared to previous SOTA methods. |
Performance on thick masks can be further improved, potentially by exploring more sophisticated mask-aware strategies.
The network's generalization ability across diverse datasets and inpainting scenarios could be further enhanced. |
image inpainting, frequency domain, spatial domain, cross-attention, deep learning |
2307.07663
Report |
INVE: Interactive Neural Video Editing |
Jiahui Huang, Leonid Sigal, Kwang Moo Yi, Oliver Wang, Joon-Young Lee |
We present Interactive Neural Video Editing (INVE), a real-time video editing
solution, which can assist the video editing process by consistently
propagating sparse frame edits to the entire video clip. Our method is inspired
by the recent work on Layered Neural Atlas (LNA). LNA, however, suffers from
two major drawbacks: (1) the method is too slow for interactive editing, and
(2) it offers insufficient support for some editing use cases, including direct
frame editing and rigid texture tracking. To address these challenges we
leverage and adopt highly efficient network architectures, powered by
hash-grids encoding, to substantially improve processing speed. In addition, we
learn bi-directional functions between image-atlas and introduce vectorized
editing, which collectively enables a much greater variety of edits in both the
atlas and the frames directly. Compared to LNA, our INVE reduces the learning
and inference time by a factor of 5, and supports various video editing
operations that LNA cannot. We showcase the superiority of INVE over LNA in
interactive video editing through a comprehensive quantitative and qualitative
analysis, highlighting its numerous advantages and improved performance. For
video results, please see https://gabriel-huang.github.io/inve/ |
This paper presents INVE, an interactive video editing tool that consistently propagates sparse frame edits to the entire video clip using a layered neural atlas representation. |
Interactive video editing remains a challenging task due to the need for temporally consistent edits, robust object tracking, and real-time performance. Existing methods often fall short in one or more of these areas. |
INVE builds upon Layered Neural Atlases (LNA) but introduces several key innovations:
- **Boosted Training & Inference Speed:** Employs multi-resolution hash grids and a GPU-optimized MLP architecture for faster computation.
- **Inverse Mapping:** Learns bi-directional mappings between frames and atlases to enable rigid texture tracking.
- **Layered Editing:** Supports independent editing of sketches, textures, and local adjustments through separate layers.
- **Vectorized Sketching:** Represents sketches as continuous vectorized strokes for artifact-free editing at the frame level. |
INVE reduces training and inference time by a factor of 5 compared to LNA.
It introduces inverse mapping for more accurate and intuitive texture tracking.
Layered editing and vectorized sketching enable a wider range of editing possibilities with improved consistency and reduced artifacts. |
The method's performance relies heavily on the quality of pre-computed optical flow and object masks.
Future work could explore extending INVE to handle longer video sequences and more complex editing operations, such as object removal or insertion. |
video editing, neural atlas, interactive editing, texture tracking, deep learning |
2307.07653
Report |
RFLA: A Stealthy Reflected Light Adversarial Attack in the Physical World |
Donghua Wang, Wen Yao, Tingsong Jiang, Chao Li, Xiaoqian Chen |
Physical adversarial attacks against deep neural networks (DNNs) have
recently gained increasing attention. The current mainstream physical attacks
use printed adversarial patches or camouflage to alter the appearance of the
target object. However, these approaches generate conspicuous adversarial
patterns that show poor stealthiness. Another physical deployable attack is the
optical attack, featuring stealthiness while exhibiting weakly in the daytime
with sunlight. In this paper, we propose a novel Reflected Light Attack (RFLA),
featuring effective and stealthy in both the digital and physical world, which
is implemented by placing the color transparent plastic sheet and a paper cut
of a specific shape in front of the mirror to create different colored
geometries on the target object. To achieve these goals, we devise a general
framework based on the circle to model the reflected light on the target
object. Specifically, we optimize a circle (composed of a coordinate and
radius) to carry various geometrical shapes determined by the optimized angle.
The fill color of the geometry shape and its corresponding transparency are
also optimized. We extensively evaluate the effectiveness of RFLA on different
datasets and models. Experiment results suggest that the proposed method
achieves over 99% success rate on different datasets and models in the digital
world. Additionally, we verify the effectiveness of the proposed method in
different physical environments by using sunlight or a flashlight. |
This paper presents RFLA, a novel physical adversarial attack that exploits reflected light to mislead Deep Neural Networks (DNNs) in both digital and physical environments. |
Existing physical attacks lack stealth or struggle in strong light conditions. RFLA addresses these limitations by utilizing natural sunlight or artificial light sources like flashlights. |
RFLA uses a mirror, colored transparent plastic sheets, and paper cut-outs to manipulate reflected light. A circle-based framework optimizes the position, geometry, and color of the reflected light for maximum attack effectiveness. The optimization process leverages the Particle Swarm Optimization (PSO) algorithm. |
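The optimizer named above, Particle Swarm Optimization, can be sketched in a few lines. This is a generic PSO under stated assumptions, not RFLA's code: the real objective would be a model's loss on the image with the rendered reflection, whereas `toy_loss`, `TARGET`, the parameter bounds, and the hyperparameters here are hypothetical stand-ins over a circle's (x, y, radius).

```python
import numpy as np

def pso_minimize(objective, lower, upper, n_particles=30, n_iters=100, seed=0,
                 inertia=0.7, c_personal=1.5, c_global=1.5):
    """Minimize `objective` over a box with a basic particle swarm."""
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    pos = rng.uniform(lower, upper, size=(n_particles, lower.size))
    vel = np.zeros_like(pos)
    best_pos = pos.copy()                       # per-particle best positions
    best_val = np.array([objective(p) for p in pos])
    g = best_pos[best_val.argmin()].copy()      # swarm-wide best position
    for _ in range(n_iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = (inertia * vel
               + c_personal * r1 * (best_pos - pos)
               + c_global * r2 * (g - pos))
        pos = np.clip(pos + vel, lower, upper)  # keep particles inside the box
        val = np.array([objective(p) for p in pos])
        improved = val < best_val
        best_pos[improved] = pos[improved]
        best_val[improved] = val[improved]
        g = best_pos[best_val.argmin()].copy()
    return g, float(best_val.min())

# Hypothetical stand-in for the attack loss: distance to a "best" circle.
TARGET = np.array([0.3, 0.6, 0.2])

def toy_loss(params):
    return float(np.sum((params - TARGET) ** 2))

best_params, best_loss = pso_minimize(toy_loss, lower=[0.0, 0.0, 0.05],
                                      upper=[1.0, 1.0, 0.5])
print(best_params, best_loss)
```

PSO needs only objective values, not gradients, which suits a setting where the attacked model is queried as a black box.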
RFLA achieves high attack success rates (over 99%) against various image classification models in the digital world, significantly outperforming existing patch-based and line-based attacks.
The attack demonstrates strong transferability across different DNN models, even in physical settings.
Physical experiments using reflected sunlight and flashlight confirm RFLA's efficacy in real-world scenarios, successfully attacking image classification and traffic sign recognition models. |
RFLA's effectiveness might be compromised in adverse weather conditions like fog or rain.
Future work will focus on enhancing RFLA's robustness against various environmental factors and exploring more sophisticated defenses against such attacks. |
adversarial attack, physical attack, reflected light, deep neural networks, particle swarm optimization |
2307.07635
Report |
CoTracker: It is Better to Track Together |
Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, Christian Rupprecht |
We introduce CoTracker, a transformer-based model that tracks dense points in
a frame jointly across a video sequence. This differs from most existing
state-of-the-art approaches that track points independently, ignoring their
correlation. We show that joint tracking results in a significantly higher
tracking accuracy and robustness. We also provide several technical
innovations, including the concept of virtual tracks, which allows CoTracker to
track 70k points jointly and simultaneously. Furthermore, CoTracker operates
causally on short windows (hence, it is suitable for online tasks), but is
trained by unrolling the windows across longer video sequences, which enables
and significantly improves long-term tracking. We demonstrate qualitatively
impressive tracking results, where points can be tracked for a long time even
when they are occluded or leave the field of view. Quantitatively, CoTracker
outperforms all recent trackers on standard benchmarks, often by a substantial
margin. |
Introduces CoTracker, a transformer-based model for jointly tracking dense points in videos, significantly improving accuracy and robustness by considering point correlations. |
Existing point trackers largely ignore correlations between points, leading to suboptimal performance, especially in challenging scenarios like occlusions. |
CoTracker utilizes a transformer architecture with novel virtual track tokens for efficiency, operates causally on short windows for online tasks, but leverages unrolled training on longer sequences to enhance long-term tracking. |
Achieves state-of-the-art results on multiple benchmarks (TAP-Vid-DAVIS, PointOdyssey, DynamicReplica), surpassing previous methods by significant margins.
Demonstrates the importance of joint tracking, with performance gains observed even when tracking a single target point supported by additional points.
Shows the effectiveness of virtual track tokens, allowing for near-dense point tracking on a single GPU. |
Despite improvements, tracking errors that humans easily avoid can still occur.
Limited window size poses challenges for points occluded over long durations, suggesting potential benefits from incorporating global context or offline processing. |
point tracking, joint tracking, transformer, virtual tracks, unrolled training |
2307.07397
Report |
Improving Zero-Shot Generalization for CLIP with Synthesized Prompts |
Zhengbo Wang, Jian Liang, Ran He, Nan Xu, Zilei Wang, Tieniu Tan |
With the growing interest in pretrained vision-language models like CLIP,
recent research has focused on adapting these models to downstream tasks.
Despite achieving promising results, most existing methods require labeled data
for all classes, which may not hold in real-world applications due to the long
tail and Zipf's law. For example, some classes may lack labeled data entirely,
such as emerging concepts. To address this problem, we propose a plug-and-play
generative approach called SyntHesIzed Prompts (SHIP) to improve existing
fine-tuning methods.
Specifically, we follow variational autoencoders to introduce a generator that
reconstructs the visual features by inputting the synthesized prompts and the
corresponding class names to the textual encoder of CLIP. In this manner, we
easily obtain the synthesized features for the remaining label-only classes.
Thereafter, we fine-tune CLIP with off-the-shelf methods by combining labeled
and synthesized features. Extensive experiments on base-to-new generalization,
cross-dataset transfer learning, and generalized zero-shot learning demonstrate
the superiority of our approach. The code is available at
https://github.com/mrflogs/SHIP. |
This paper introduces SHIP, a plug-and-play generative method, to enhance CLIP's performance in few-shot learning scenarios where some classes lack labeled data. |
Existing CLIP fine-tuning methods often falter when dealing with novel classes without labeled data, hindering their applicability in real-world settings. |
SHIP employs a VAE-based generator to synthesize prompts for novel classes by leveraging the pre-trained CLIP's language encoder. These synthesized prompts are then used to generate features for novel classes, enabling the use of off-the-shelf fine-tuning methods on both base and novel classes. |
SHIP consistently improves the performance of existing methods (CoOp, CLIP-Adapter, Tip-Adapter) in base-to-new generalization tasks across various datasets.
In cross-dataset transfer learning, SHIP enhances CoOp's accuracy, demonstrating its effectiveness in transferring knowledge to new datasets.
For generalized zero-shot learning, SHIP effectively handles the challenge of mixed base and novel classes during testing, surpassing previous methods in unseen class accuracy. |
SHIP requires additional training, leading to increased computational cost compared to zero-shot CLIP.
The effectiveness of SHIP in dense prediction tasks remains unexplored. |
few-shot learning, vision-language models, clip, generative models, prompt learning |
2307.06948
Report |
Self-regulating Prompts: Foundational Model Adaptation without Forgetting |
Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan |
Prompt learning has emerged as an efficient alternative for fine-tuning
foundational models, such as CLIP, for various downstream tasks. Conventionally
trained using the task-specific objective, i.e., cross-entropy loss, prompts
tend to overfit downstream data distributions and find it challenging to
capture task-agnostic general features from the frozen CLIP. This leads to the
loss of the model's original generalization capability. To address this issue,
our work introduces a self-regularization framework for prompting called
PromptSRC (Prompting with Self-regulating Constraints). PromptSRC guides the
prompts to optimize for both task-specific and task-agnostic general
representations using a three-pronged approach by: (a) regulating prompted
representations via mutual agreement maximization with the frozen model, (b)
regulating with self-ensemble of prompts over the training trajectory to encode
their complementary strengths, and (c) regulating with textual diversity to
mitigate sample diversity imbalance with the visual branch. To the best of our
knowledge, this is the first regularization framework for prompt learning that
avoids overfitting by jointly attending to pre-trained model features, the
training trajectory during prompting, and the textual diversity. PromptSRC
explicitly steers the prompts to learn a representation space that maximizes
performance on downstream tasks without compromising CLIP generalization. We
perform extensive experiments on 4 benchmarks where PromptSRC overall performs
favorably well compared to the existing methods. Our code and pre-trained
models are publicly available at: https://github.com/muzairkhattak/PromptSRC. |
The paper introduces PromptSRC, a self-regularization framework for prompt learning that prevents overfitting and improves generalization in foundational vision-language models like CLIP. |
Existing prompt learning methods, while effective, tend to overfit to downstream task data, sacrificing the generalization ability of the pre-trained model. PromptSRC aims to address this issue by retaining task-agnostic knowledge while adapting to downstream tasks. |
PromptSRC utilizes a three-pronged approach: (a) maximizing mutual agreement between prompted and frozen model features, (b) employing a Gaussian-weighted self-ensemble of prompts learned across epochs, and (c) incorporating textual diversity by using multiple text augmentations for pre-trained features. |
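Component (b), the Gaussian-weighted self-ensemble, can be sketched as a weighted average of the prompt vectors saved at each epoch. This is a minimal illustration under stated assumptions: the function names, the tensor shapes, and the chosen mean and standard deviation are hypothetical and are not taken from the PromptSRC code.

```python
import numpy as np

def gaussian_weights(n_epochs, mean, std):
    """Normalized Gaussian weights over epochs 1..n_epochs."""
    epochs = np.arange(1, n_epochs + 1)
    w = np.exp(-((epochs - mean) ** 2) / (2.0 * std ** 2))
    return w / w.sum()

def ensemble_prompts(prompts_per_epoch, mean, std):
    """Weighted average of per-epoch prompts, shape (epochs, tokens, dim) -> (tokens, dim)."""
    prompts = np.asarray(prompts_per_epoch, dtype=float)
    w = gaussian_weights(prompts.shape[0], mean, std)
    return np.tensordot(w, prompts, axes=1)

rng = np.random.default_rng(0)
prompts = rng.normal(size=(20, 4, 8))  # hypothetical: 20 epochs, 4 tokens, dim 8
weights = gaussian_weights(20, mean=15, std=1)
final_prompt = ensemble_prompts(prompts, mean=15, std=1)
print(weights.argmax() + 1, final_prompt.shape)
```

Centering the Gaussian on later epochs gives little weight to early, under-trained prompts while still averaging over several checkpoints instead of keeping only the last one.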
PromptSRC significantly outperforms existing methods in base-to-novel generalization, particularly on novel classes.
It shows consistent improvements in few-shot learning, especially in extremely low-data regimes.
PromptSRC demonstrates superior performance in domain generalization tasks, indicating its robustness to domain shifts. |
The paper focuses primarily on image recognition; evaluating PromptSRC on other downstream tasks such as image captioning or visual question answering is left for future work.
Exploring alternate regularization techniques or more sophisticated prompt aggregation strategies could further enhance performance. |
prompt learning, clip, regularization, generalization, vision-language models |
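The Gaussian-weighted self-ensemble of prompts described in the PromptSRC entry above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the prompt shape, and the `mu` and `sigma` values are assumptions chosen for illustration only.

```python
import numpy as np

def gaussian_weights(num_epochs, mu, sigma):
    """Normalized Gaussian weights over epochs 1..num_epochs (mu, sigma are assumed hyperparameters)."""
    epochs = np.arange(1, num_epochs + 1, dtype=np.float64)
    w = np.exp(-((epochs - mu) ** 2) / (2.0 * sigma ** 2))
    return w / w.sum()

def ensemble_prompts(prompts_per_epoch, mu, sigma):
    """Weighted average of the prompt tensors saved at the end of each epoch."""
    stacked = np.stack(prompts_per_epoch)              # (epochs, n_tokens, dim)
    w = gaussian_weights(len(prompts_per_epoch), mu, sigma)
    return np.tensordot(w, stacked, axes=1)            # (n_tokens, dim)

# Toy data: 10 epochs of hypothetical prompts with 4 tokens of width 8.
rng = np.random.default_rng(0)
prompts = [rng.normal(size=(4, 8)) for _ in range(10)]
weights = gaussian_weights(10, mu=8.0, sigma=2.0)
final_prompt = ensemble_prompts(prompts, mu=8.0, sigma=2.0)
```

Centering the Gaussian on a late epoch, as in this toy setup, gives more weight to prompts that have adapted to the task while still mixing in earlier ones.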
2307.06940
Report |
Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation |
Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, Jinbo Xing, Yong Zhang, Xintao Wang, Chao Weng, Ying Shan, Qifeng Chen |
Generating videos for visual storytelling can be a tedious and complex
process that typically requires either live-action filming or graphics
animation rendering. To bypass these challenges, our key idea is to utilize the
abundance of existing video clips and synthesize a coherent storytelling video
by customizing their appearances. We achieve this by developing a framework
comprised of two functional modules: (i) Motion Structure Retrieval, which
provides video candidates with desired scene or motion context described by
query texts, and (ii) Structure-Guided Text-to-Video Synthesis, which generates
plot-aligned videos under the guidance of motion structure and text prompts.
For the first module, we leverage an off-the-shelf video retrieval system and
extract video depths as motion structure. For the second module, we propose a
controllable video generation model that offers flexible controls over
structure and characters. The videos are synthesized by following the
structural guidance and appearance instruction. To ensure visual consistency
across clips, we propose an effective concept personalization approach, which
allows the specification of the desired character identities through text
prompts. Extensive experiments demonstrate that our approach exhibits
significant advantages over various existing baselines. |
Introduces a novel retrieval-based pipeline for storytelling video synthesis, enabling better quality, layout/motion control, and character personalization for character-consistent storytelling videos. |
Addresses the limitations of current text-to-video generation techniques in creating engaging and coherent storytelling videos. |
Leverages existing video content for structure guidance in a text-to-video generation model. Employs a new personalization method, TimeInv, for consistent character rendering across video clips, and tackles the conflict between structure and character generation through adjustable depth control. |
Retrieval-augmented video generation significantly improves quality and controllability compared to text-only generation.
TimeInv outperforms baseline personalization approaches in achieving consistent character appearance and compositionality.
Adjustable depth control effectively mitigates the conflict between motion guidance and character fidelity. |
Exploration of a general character control mechanism without fine-tuning is needed.
Further research on better cooperation strategies between character and structure control is crucial. |
story visualization, video diffusion models, retrieval-augmented generation, personalized generation, text-to-video synthesis |
2307.06925
Report |
Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models |
Moab Arar, Rinon Gal, Yuval Atzmon, Gal Chechik, Daniel Cohen-Or, Ariel Shamir, Amit H. Bermano |
Text-to-image (T2I) personalization allows users to guide the creative image
generation process by combining their own visual concepts in natural language
prompts. Recently, encoder-based techniques have emerged as a new effective
approach for T2I personalization, reducing the need for multiple images and
long training times. However, most existing encoders are limited to a
single-class domain, which hinders their ability to handle diverse concepts. In
this work, we propose a domain-agnostic method that does not require any
specialized dataset or prior information about the personalized concepts. We
introduce a novel contrastive-based regularization technique to maintain high
fidelity to the target concept characteristics while keeping the predicted
embeddings close to editable regions of the latent space, by pushing the
predicted tokens toward their nearest existing CLIP tokens. Our experimental
results demonstrate the effectiveness of our approach and show how the learned
tokens are more semantic than tokens predicted by unregularized models. This
leads to a better representation that achieves state-of-the-art performance
while being more flexible than previous methods. |
This paper presents a domain-agnostic tuning-encoder for fast personalization of text-to-image models, enabling one-shot inference-time tuning for diverse concepts. |
Existing encoder-based text-to-image personalization methods are limited to single-class domains, hindering their applicability to diverse concepts. |
The method leverages contrastive-based regularization to predict embeddings near semantically related words and employs a hyper-network to capture concept-specific features. A dual-path adaptation approach using hard and soft prompts is used during a brief inference-time tuning phase. |
The contrastive regularization improves embedding quality and prevents overfitting.
The method achieves comparable quality to state-of-the-art methods using only a single image and fewer training steps.
Ablation studies highlight the importance of regularization, fine-tuning, and the hyper-network. |
The method's performance is limited by the training data, potentially struggling with domains poorly represented in the dataset.
While reduced, a tuning step is still required to enhance downstream similarity. |
text-to-image synthesis, personalization, domain-agnostic, tuning-encoder, contrastive learning |
2307.06526
Report |
AvatarFusion: Zero-shot Generation of Clothing-Decoupled 3D Avatars Using 2D Diffusion |
Shuo Huang, Zongxin Yang, Liangting Li, Yi Yang, Jia Jia |
Large-scale pre-trained vision-language models allow for the zero-shot
text-based generation of 3D avatars. The previous state-of-the-art method
utilized CLIP to supervise neural implicit models that reconstructed a human
body mesh. However, this approach has two limitations. Firstly, the lack of
avatar-specific models can cause facial distortion and unrealistic clothing in
the generated avatars. Secondly, CLIP only provides optimization direction for
the overall appearance, resulting in less impressive results. To address these
limitations, we propose AvatarFusion, the first framework to use a latent
diffusion model to provide pixel-level guidance for generating human-realistic
avatars while simultaneously segmenting clothing from the avatar's body.
AvatarFusion includes the first clothing-decoupled neural implicit avatar model
that employs a novel Dual Volume Rendering strategy to render the decoupled
skin and clothing sub-models in one space. We also introduce a novel
optimization method, called Pixel-Semantics Difference-Sampling (PS-DS), which
semantically separates the generation of body and clothes, and generates a
variety of clothing styles. Moreover, we establish the first benchmark for
zero-shot text-to-avatar generation. Our experimental results demonstrate that
our framework outperforms previous approaches, with significant improvements
observed in all metrics. Additionally, since our model is clothing-decoupled,
we can exchange the clothes of avatars. Code is available on our project page:
https://hansenhuang0823.github.io/AvatarFusion. |
AvatarFusion is the first zero-shot text-to-3D-avatar generation framework that decouples clothing from the avatar model, allowing for more realistic avatars and clothing exchange between avatars. |
Existing methods for generating 3D avatars from text suffer from facial distortion, unrealistic clothing, and limited detail due to the lack of avatar-specific models and the limitations of using CLIP for optimization. |
AvatarFusion leverages a clothing-decoupled neural implicit avatar model with a dual volume rendering strategy and a novel optimization method called Pixel-Semantics Difference-Sampling (PS-DS), which utilizes a latent diffusion model for pixel-level guidance. |
AvatarFusion outperforms baselines in both quantitative and qualitative evaluations on the newly proposed Famous-Character-50 benchmark.
The generated avatars exhibit superior facial details, more realistic clothing, and better alignment with text prompts.
The clothing-decoupled model enables clothing exchange between different avatars. |
Currently, the method cannot generate a realistic backside due to the limited responsiveness of vision-language models.
Future work may focus on addressing the backside generation issue and improving the generation of loose clothing. |
3d avatar generation, zero-shot learning, diffusion models, clothing decoupling, neural implicit surfaces |
2307.06304
Report |
Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution |
Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim Alabdulmohsin, Avital Oliver, Piotr Padlewski, Alexey Gritsenko, Mario Lučić, Neil Houlsby |
The ubiquitous and demonstrably suboptimal choice of resizing images to a
fixed resolution before processing them with computer vision models has not yet
been successfully challenged. However, models such as the Vision Transformer
(ViT) offer flexible sequence-based modeling, and hence varying input sequence
lengths. We take advantage of this with NaViT (Native Resolution ViT) which
uses sequence packing during training to process inputs of arbitrary
resolutions and aspect ratios. Alongside flexible model usage, we demonstrate
improved training efficiency for large-scale supervised and contrastive
image-text pretraining. NaViT can be efficiently transferred to standard tasks
such as image and video classification, object detection, and semantic
segmentation and leads to improved results on robustness and fairness
benchmarks. At inference time, the input resolution flexibility can be used to
smoothly navigate the test-time cost-performance trade-off. We believe that
NaViT marks a departure from the standard, CNN-designed, input and modelling
pipeline used by most computer vision models, and represents a promising
direction for ViTs. |
The paper introduces NaViT, a Vision Transformer that processes images at their native resolution using a technique called "Patch n' Pack" which allows packing patches from multiple images into a single sequence, improving training efficiency and enabling flexible input resolutions and aspect ratios. |
Current computer vision models resize images to a fixed resolution before processing, which can harm performance and is computationally inefficient. NaViT addresses these limitations by allowing for variable input sizes. |
NaViT leverages the sequence-based nature of Vision Transformers and introduces masked self-attention, masked pooling, and factorized positional embeddings to handle variable resolution and aspect ratios. This allows packing patches from multiple images into a single sequence, significantly accelerating training. |
NaViT achieves superior training efficiency, matching the performance of top-performing ViT models with 4 times less compute.
The model allows for variable-resolution finetuning, achieving comparable performance to fixed-resolution finetuning while providing greater flexibility.
NaViT demonstrates improved out-of-distribution generalization, particularly on datasets with extreme aspect ratios. |
The paper mainly focuses on image classification and acknowledges the need for further exploration of NaViT's capabilities in downstream tasks like object detection and semantic segmentation.
While NaViT demonstrates promising results, more research is needed to fully explore the potential of Patch n' Pack in other Vision Transformer architectures and applications. |
vision transformers, native resolution, sequence packing, variable input size, training efficiency |
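The sequence packing and masked self-attention mentioned in the NaViT entry above can be sketched as follows. This is a simplified illustration under assumed names (`pack_sequences`, `attention_mask`) and toy shapes, not the paper's code: patches from several images share one sequence, and a mask stops tokens from attending across image boundaries or to padding.

```python
import numpy as np

def pack_sequences(token_seqs, max_len):
    """Concatenate per-image token arrays into one padded sequence.

    Returns the packed tokens (max_len, dim) and per-token image ids,
    where -1 marks padding positions.
    """
    dim = token_seqs[0].shape[1]
    packed = np.zeros((max_len, dim))
    image_ids = np.full(max_len, -1)
    pos = 0
    for idx, seq in enumerate(token_seqs):
        n = seq.shape[0]
        if pos + n > max_len:
            raise ValueError("sequences do not fit in max_len")
        packed[pos:pos + n] = seq
        image_ids[pos:pos + n] = idx
        pos += n
    return packed, image_ids

def attention_mask(image_ids):
    """Boolean mask: token i may attend to token j only within the same image."""
    same_image = image_ids[:, None] == image_ids[None, :]
    valid = image_ids >= 0
    return same_image & valid[:, None] & valid[None, :]

# Two images with different patch counts (e.g. a 2x3 grid and a 4x1 grid).
seqs = [np.ones((6, 16)), 2.0 * np.ones((4, 16))]
packed, ids = pack_sequences(seqs, max_len=12)
mask = attention_mask(ids)
```

The mask is block-diagonal, so each image is processed as if it were alone in the sequence while still sharing one batch element.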
2307.06281
Report |
MMBench: Is Your Multi-modal Model an All-around Player? |
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, Dahua Lin |
Large vision-language models have recently achieved remarkable progress,
exhibiting great perception and reasoning abilities concerning visual
information. However, how to effectively evaluate these large vision-language
models remains a major obstacle, hindering future model development.
Traditional benchmarks like VQAv2 or COCO Caption provide quantitative
performance measurements but suffer from a lack of fine-grained ability
assessment and non-robust evaluation metrics. Recent subjective benchmarks,
such as OwlEval, offer comprehensive evaluations of a model's abilities by
incorporating human labor, but they are not scalable and display significant
bias. In response to these challenges, we propose MMBench, a novel
multi-modality benchmark. MMBench methodically develops a comprehensive
evaluation pipeline, primarily comprised of two elements. The first element is
a meticulously curated dataset that surpasses existing similar benchmarks in
terms of the number and variety of evaluation questions and abilities. The
second element introduces a novel CircularEval strategy and incorporates the
use of ChatGPT. This implementation is designed to convert free-form
predictions into pre-defined choices, thereby facilitating a more robust
evaluation of the model's predictions. MMBench is a systematically-designed
objective benchmark for robustly evaluating the various abilities of
vision-language models. We hope MMBench will assist the research community in
better evaluating their models and encourage future advancements in this
domain. Project page: https://opencompass.org.cn/mmbench. |
This paper introduces MMBench, a bilingual benchmark designed for robust and holistic evaluation of multi-modal capabilities of large vision-language models (VLMs). |
Evaluating VLMs effectively is crucial for further development, but existing benchmarks lack either fine-grained ability assessment or scalability and suffer from bias. |
MMBench utilizes a hierarchical ability taxonomy, rigorous quality control, and a novel circular evaluation strategy (CircularEval) with LLM-assisted choice extraction. |
MMBench surpasses existing benchmarks in the number and variety of evaluation questions and abilities, covering 20 fine-grained skills.
GPT-4 achieves a 91.5% alignment rate with human evaluation in choice extraction, demonstrating its robustness in handling free-form VLM outputs.
Comprehensive evaluation of various VLMs on MMBench reveals performance gaps and provides insights for future optimization, especially highlighting challenges in understanding low-level visual features, structuralized inputs, and spatial relationships. |
The paper acknowledges potential bias in the initial English-centric data collection of MMBench.
Future work may involve expanding the benchmark with more challenging scenarios, such as incorporating video understanding or interactive tasks. |
vision-language models, multi-modal benchmark, evaluation, circulareval, llm-assisted choice extraction |
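The CircularEval strategy named in the MMBench entry above can be sketched in a few lines. The sketch assumes the usual description of the strategy: a multiple-choice question is asked once per circular shift of its choices, and the model is credited only if it is correct every time. The `predict` callable and the toy question are hypothetical.

```python
def circular_eval(question, choices, answer_idx, predict):
    """Return True only if `predict` picks the correct choice under every circular shift.

    `predict(question, choices)` is an assumed interface returning the index of the chosen option.
    """
    n = len(choices)
    for shift in range(n):
        rotated = choices[shift:] + choices[:shift]
        # Index of the correct choice after rotating the list by `shift`.
        target = (answer_idx - shift) % n
        if predict(question, rotated) != target:
            return False
    return True

choices = ["Paris", "Rome", "Oslo", "Bern"]
always_first = lambda q, c: 0                 # position-biased toy model
knows_answer = lambda q, c: c.index("Paris")  # content-based toy model

biased_pass = circular_eval("Capital of France?", choices, 0, always_first)
robust_pass = circular_eval("Capital of France?", choices, 0, knows_answer)
```

A model that always answers with the first option passes the unshifted question here but fails the circular check, which is the kind of position bias the strategy is meant to expose.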
2307.05977
Report |
Towards Safe Self-Distillation of Internet-Scale Text-to-Image Diffusion Models |
Sanghyun Kim, Seohyeon Jung, Balhae Kim, Moonseok Choi, Jinwoo Shin, Juho Lee |
Large-scale image generation models, with impressive quality made possible by
the vast amount of data available on the Internet, raise social concerns that
these models may generate harmful or copyrighted content. The biases and
harmfulness arise throughout the entire training process and are hard to
completely remove, which have become significant hurdles to the safe deployment
of these models. In this paper, we propose a method called SDD to prevent
problematic content generation in text-to-image diffusion models. We
self-distill the diffusion model to guide the noise estimate conditioned on the
target removal concept to match the unconditional one. Compared to the previous
methods, our method eliminates a much greater proportion of harmful content
from the generated images without degrading the overall image quality.
Furthermore, our method allows the removal of multiple concepts at once,
whereas previous works are limited to removing a single concept at a time. |
This paper introduces SDD, a self-distillation method for text-to-image diffusion models, to prevent the generation of harmful or copyrighted content. |
Large-scale image generation models, trained on vast internet data, risk generating harmful or copyrighted content, posing a significant challenge to their safe deployment. Existing detoxification methods are often insufficient and can degrade image quality. |
SDD fine-tunes the diffusion model using self-distillation, guiding the noise estimate conditioned on the target removal concept to match the unconditional one. An EMA teacher model is employed to mitigate catastrophic forgetting during fine-tuning. |
SDD effectively removes a greater proportion of harmful content from generated images compared to previous methods, as demonstrated by experiments on NSFW and artist concept removal.
SDD exhibits minimal interference with other concepts in the generated images, preserving the overall image quality and user intent.
The use of an EMA teacher model in SDD helps maintain image quality and details more effectively compared to directly fine-tuning the student model. |
The method may not completely remove all problematic content and could still have a minor impact on image quality.
The research primarily focuses on NSFW and artist concept removal, with limited exploration of other harmful content types. |
text-to-image generation, diffusion models, safe ai, content moderation, self-distillation |
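Two ingredients of the SDD entry above, matching the concept-conditioned noise estimate to the unconditional one and keeping an EMA copy of the weights, can be sketched with NumPy. This is a toy illustration, not the paper's training loop: the noise estimates are random arrays, and the function names, parameter dictionaries, and decay value are assumptions.

```python
import numpy as np

def distill_loss(eps_cond, eps_uncond):
    """Mean squared error pulling the concept-conditioned noise estimate
    toward the unconditional one, which is treated as a fixed target."""
    return float(np.mean((eps_cond - eps_uncond) ** 2))

def ema_update(ema_params, params, decay=0.999):
    """Exponential moving average of parameters (assumed dict of arrays)."""
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k] for k in params}

# Toy noise estimates standing in for the outputs of a diffusion model.
rng = np.random.default_rng(0)
eps_uncond = rng.normal(size=(4, 4))
eps_cond = eps_uncond + 0.1
loss = distill_loss(eps_cond, eps_uncond)

# Toy parameters: the EMA copy moves slowly toward the fine-tuned weights.
student = {"w": np.ones(3)}
teacher = {"w": np.zeros(3)}
teacher = ema_update(teacher, student, decay=0.9)
```

Because the EMA copy changes slowly, it retains more of the original model's behavior than the directly fine-tuned weights, which is the motivation the entry gives for using it against catastrophic forgetting.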
2307.05892
Report |
SC-NeuS: Consistent Neural Surface Reconstruction from Sparse and Noisy Views |
Shi-Sheng Huang, Zi-Xin Zou, Yi-Chi Zhang, Hua Huang |
Recent approaches to neural surface reconstruction by volume rendering have
made much progress by achieving impressive surface reconstruction quality, but
are still limited to dense views with highly accurate camera poses. To overcome such
drawbacks, this paper pays special attention to consistent surface
reconstruction from sparse views with noisy camera poses. Unlike previous
approaches, the key difference of this paper is to exploit the multi-view
constraints directly from the explicit geometry of the neural surface, which
can be used as effective regularization to jointly learn the neural surface and
refine the camera poses. To build effective multi-view constraints, we
introduce a fast differentiable on-surface intersection to generate on-surface
points, and propose view-consistent losses based on such differentiable points
to regularize the neural surface learning. Building on this, we propose a
joint learning strategy for the neural surface and camera poses, named SC-NeuS,
to perform geometry-consistent surface reconstruction in an end-to-end manner.
With extensive evaluation on public datasets, our SC-NeuS can achieve
consistently better surface reconstruction results with fine-grained details
than previous state-of-the-art neural surface reconstruction approaches,
especially from sparse and noisy camera views. |
This paper presents SC-NeuS, a novel learning framework for geometry-consistent neural surface reconstruction from sparse views with noisy camera poses, leveraging multi-view constraints derived directly from the explicit geometry of the neural surface. |
Existing neural surface reconstruction methods often struggle with sparse and noisy input, limiting their applicability in real-world scenarios where dense, high-quality data acquisition is challenging. |
The method introduces a fast differentiable on-surface intersection to sample points on the neural surface. These points are then used to define view-consistent losses, regularizing the joint learning of the neural surface representation and camera poses in an end-to-end manner. A coarse-to-fine learning strategy further enhances reconstruction accuracy. |
SC-NeuS achieves state-of-the-art surface reconstruction quality from sparse and noisy views, outperforming existing methods like BARF, IDR, and NeuS-BARF on public datasets like DTU and BlendedMVS.
The proposed method demonstrates superior accuracy in both camera pose estimation and surface reconstruction compared to baselines.
Ablation studies confirm the effectiveness of the view-consistent re-projection and patch-warping losses in improving both the geometric accuracy and fine-grained detail of the reconstructed surfaces. |
The method's performance depends on the quality of 2D feature matching, which can be challenging in low-texture or illumination-varying scenes.
Large camera pose variations between sparse views may hinder effective joint optimization. |
neural surface reconstruction, sparse view reconstruction, camera pose estimation, multi-view constraints, differentiable rendering |
2307.05707
Report |
MoP-CLIP: A Mixture of Prompt-Tuned CLIP Models for Domain Incremental Learning |
Julien Nicolas, Florent Chiaroni, Imtiaz Ziko, Ola Ahmad, Christian Desrosiers, Jose Dolz |
Despite the recent progress in incremental learning, addressing catastrophic
forgetting under distributional drift is still an open and important problem.
Indeed, while state-of-the-art domain incremental learning (DIL) methods
perform satisfactorily within known domains, their performance largely degrades
in the presence of novel domains. This limitation hampers their
generalizability, and restricts their scalability to more realistic settings
where train and test data are drawn from different distributions. To address
these limitations, we present a novel DIL approach based on a mixture of
prompt-tuned CLIP models (MoP-CLIP), which generalizes the paradigm of
S-Prompting to handle both in-distribution and out-of-distribution data at
inference. In particular, at the training stage we model the feature
distribution of every class in each domain, learning individual text and visual
prompts to adapt to a given domain. At inference, the learned distributions
allow us to identify whether a given test sample belongs to a known domain,
selecting the correct prompt for the classification task, or from an unseen
domain, leveraging a mixture of the prompt-tuned CLIP models. Our empirical
evaluation reveals the poor performance of existing DIL methods under domain
shift, and suggests that the proposed MoP-CLIP performs competitively in the
standard DIL settings while outperforming state-of-the-art methods in OOD
scenarios. These results demonstrate the superiority of MoP-CLIP, offering a
robust and general solution to the problem of domain incremental learning. |
This paper introduces MoP-CLIP, an exemplar-free domain incremental learning (DIL) approach based on a mixture of prompt-tuned CLIP models, addressing the limitations of existing methods in handling distributional drift and generalizing to unseen domains. |
Current DIL methods struggle with performance degradation under distributional shifts between training and testing data, limiting their applicability in real-world scenarios where such shifts are common. |
MoP-CLIP learns class-wise feature distributions for each domain during training, enabling it to identify whether a test sample belongs to a known domain and select the appropriate prompt, or to an unseen domain, triggering a mixture of prompts for prediction. |
MoP-CLIP achieves competitive performance on known domains compared to state-of-the-art DIL methods.
It significantly outperforms existing exemplar-free methods in scenarios with domain distributional shifts.
The paper provides empirical evidence of the limitations of existing DIL methods under domain shift and demonstrates the effectiveness of the proposed approach through extensive experiments. |
The assumption of isotropic Gaussian distribution for features around prototypes, while simplifying the model, might not hold for all datasets and could be explored further.
Further investigation into alternative distributions beyond Gaussian for modeling distances to prototypes, such as Weibull or Generalized Pareto, could be beneficial. |
domain incremental learning, prompt learning, distributional drift, out-of-distribution generalization, clip |
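The inference-time routing described in the MoP-CLIP entry above (use one domain's prompts for in-distribution samples, fall back to a mixture for unseen domains) can be sketched as a distance test against per-domain class prototypes. The Euclidean distance, the single `threshold`, and the fallback to all domains are assumptions made for this illustration; the paper's actual criterion and mixture weights may differ.

```python
import numpy as np

def route(feature, domain_prototypes, threshold):
    """Return the indices of the domains whose prompt-tuned models should be used.

    domain_prototypes: list of (n_classes, dim) arrays, one per known domain.
    threshold: assumed distance cutoff separating known from unseen domains.
    """
    dists = np.array([
        np.min(np.linalg.norm(protos - feature, axis=1))
        for protos in domain_prototypes
    ])
    best = int(np.argmin(dists))
    if dists[best] <= threshold:
        return [best]                            # in-distribution: single domain
    return list(range(len(domain_prototypes)))   # unseen domain: mixture of all

# Two toy domains with two class prototypes each.
prototypes = [
    np.array([[0.0, 0.0], [1.0, 0.0]]),
    np.array([[10.0, 10.0], [11.0, 10.0]]),
]
known = route(np.array([10.2, 10.1]), prototypes, threshold=1.0)
unseen = route(np.array([5.0, 5.0]), prototypes, threshold=1.0)
```

In this toy setup the first query lies near a prototype of the second domain and is routed to it alone, while the second query is far from every prototype and triggers the mixture.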
2307.05473
Report |
Differentiable Blocks World: Qualitative 3D Decomposition by Rendering Primitives |
Tom Monnier, Jake Austin, Angjoo Kanazawa, Alexei A. Efros, Mathieu Aubry |
Given a set of calibrated images of a scene, we present an approach that
produces a simple, compact, and actionable 3D world representation by means of
3D primitives. While many approaches focus on recovering high-fidelity 3D
scenes, we focus on parsing a scene into mid-level 3D representations made of a
small set of textured primitives. Such representations are interpretable, easy
to manipulate and suited for physics-based simulations. Moreover, unlike
existing primitive decomposition methods that rely on 3D input data, our
approach operates directly on images through differentiable rendering.
Specifically, we model primitives as textured superquadric meshes and optimize
their parameters from scratch with an image rendering loss. We highlight the
importance of modeling transparency for each primitive, which is critical for
optimization and also enables handling varying numbers of primitives. We show
that the resulting textured primitives faithfully reconstruct the input images
and accurately model the visible 3D points, while providing amodal shape
completions of unseen object regions. We compare our approach to the state of
the art on diverse scenes from DTU, and demonstrate its robustness on real-life
captures from BlendedMVS and Nerfstudio. We also showcase how our results can
be used to effortlessly edit a scene or perform physical simulations. Code and
video results are available at https://www.tmonnier.com/DBW . |
This paper introduces Differentiable Blocks World (DBW), an end-to-end method for reconstructing 3D scenes from calibrated images using a set of textured superquadric primitives. |
Existing multi-view modeling approaches, while highly accurate, often produce dense, uninterpretable representations. DBW addresses this by providing a compact, interpretable, and manipulable scene representation suitable for tasks like physics-based simulations and scene editing. |
DBW optimizes the parameters of superquadric meshes and their UV textures directly from images by minimizing a rendering loss. A key innovation is the modeling of primitive transparency, facilitating handling of varying primitive numbers and occlusions. |
DBW accurately reconstructs visible 3D points and faithfully reconstructs input images on the DTU benchmark.
It outperforms state-of-the-art 3D decomposition methods (EMS, MonteBoxFinder) applied on ground-truth point clouds in terms of interpretability and accuracy.
It demonstrates robustness on real-life captures (Nerfstudio, BlendedMVS), enabling applications like amodal scene completion, scene editing, and physics-based simulations. |
DBW can sometimes converge to suboptimal solutions, missing parts or yielding unnatural decompositions.
Automatic selection among multiple runs mitigates this but increases computational cost. |
3d reconstruction, primitive-based representation, differentiable rendering, multi-view stereo, scene understanding |
2307.05468
Report |
My3DGen: A Scalable Personalized 3D Generative Model |
Luchao Qi, Jiaye Wu, Annie N. Wang, Shengze Wang, Roni Sengupta |
In recent years, generative 3D face models (e.g., EG3D) have been developed
to tackle the problem of synthesizing photo-realistic faces. However, these
models are often unable to capture facial features unique to each individual,
highlighting the importance of personalization. Some prior works have shown
promise in personalizing generative face models, but these studies primarily
focus on 2D settings. Also, these methods require both fine-tuning and storing
a large number of parameters for each user, posing a hindrance to achieving
scalable personalization. Another challenge of personalization is the limited
number of training images available for each individual, which often leads to
overfitting when using full fine-tuning methods. Our proposed approach,
My3DGen, generates a personalized 3D prior of an individual using as few as 50
training images. My3DGen allows for novel view synthesis, semantic editing of a
given face (e.g. adding a smile), and synthesizing novel appearances, all while
preserving the original person's identity. We decouple the 3D facial features
into global features and personalized features by freezing the pre-trained EG3D
and training additional personalized weights through low-rank decomposition. As
a result, My3DGen introduces only $\textbf{240K}$ personalized parameters per
individual, leading to a $\textbf{127}\times$ reduction in trainable parameters
compared to the $\textbf{30.6M}$ required for fine-tuning the entire parameter
space. Despite this significant reduction in storage, our model preserves
identity features without compromising the quality of downstream applications. |
My3DGen is a novel approach for creating personalized 3D generative priors for individuals using as few as 50 training images. This allows for novel view synthesis, semantic editing, and novel appearance synthesis while preserving identity. |
Current 3D generative face models struggle to capture and manipulate individual facial features without distorting identity. Personalizing these models is crucial for enhancing realism in various applications but often faces scalability issues due to large parameter storage requirements. |
My3DGen uses a pre-trained EG3D model for global facial features and learns personalized features via low-rank adaptation (LoRA). This method decomposes convolutional and fully-connected layer weights, drastically reducing the number of trainable parameters compared to full fine-tuning. |
My3DGen outperforms pre-trained EG3D in 3D reconstruction, novel appearance synthesis, image enhancement, and semantic editing while preserving identity.
Despite using significantly fewer trainable parameters, My3DGen achieves comparable results to fully fine-tuning a pre-trained model.
Analysis shows that personalizing earlier layers of StyleGAN2, responsible for coarse facial features, has the most impact on quality and identity preservation. |
My3DGen faces difficulties reconstructing faces heavily obscured by objects.
The model struggles with heavily cropped faces where boundaries are filled with padded values. |
personalization, 3d-gan, 3d face, lora, generative models |
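The low-rank adaptation idea in the My3DGen entry above can be illustrated for a single fully-connected layer. This is a generic LoRA sketch, not the My3DGen code: the layer sizes and rank are arbitrary assumptions, and only the 240K and 30.6M figures come from the abstract.

```python
import numpy as np

def lora_param_count(d_in, d_out, rank):
    """Trainable parameters of a rank-`rank` update B @ A for a d_out x d_in weight."""
    return rank * (d_in + d_out)

def lora_forward(x, W_frozen, A, B):
    """Apply the frozen weight plus the low-rank update: y = x (W + B A)^T."""
    return x @ (W_frozen + B @ A).T

d_in, d_out, rank = 512, 512, 4                # assumed sizes for illustration
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))             # frozen pre-trained weight
A = rng.normal(size=(rank, d_in)) * 0.01       # trainable
B = np.zeros((d_out, rank))                    # trainable, zero-initialized
x = rng.normal(size=(2, d_in))
y = lora_forward(x, W, A, B)

full_params = d_in * d_out
lora_params = lora_param_count(d_in, d_out, rank)

# Figures from the abstract: 30.6M full fine-tuning vs 240K personalized parameters,
# i.e. roughly the reported 127x reduction.
ratio = 30.6e6 / 240e3
```

With `B` initialized to zero, the adapted layer starts out identical to the frozen one, so personalization begins from the pre-trained prior.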
2307.05462
Report |
Efficient 3D Articulated Human Generation with Layered Surface Volumes |
Yinghao Xu, Wang Yifan, Alexander W. Bergman, Menglei Chai, Bolei Zhou, Gordon Wetzstein |
Access to high-quality and diverse 3D articulated digital human assets is
crucial in various applications, ranging from virtual reality to social
platforms. Generative approaches, such as 3D generative adversarial networks
(GANs), are rapidly replacing laborious manual content creation tools. However,
existing 3D GAN frameworks typically rely on scene representations that
leverage either template meshes, which are fast but offer limited quality, or
volumes, which offer high capacity but are slow to render, thereby limiting the
3D fidelity in GAN settings. In this work, we introduce layered surface volumes
(LSVs) as a new 3D object representation for articulated digital humans. LSVs
represent a human body using multiple textured mesh layers around a
conventional template. These layers are rendered using alpha compositing with
fast differentiable rasterization, and they can be interpreted as a volumetric
representation that allocates its capacity to a manifold of finite thickness
around the template. Unlike conventional single-layer templates that struggle
with representing fine off-surface details like hair or accessories, our
surface volumes naturally capture such details. LSVs can be articulated, and
they exhibit exceptional efficiency in GAN settings, where a 2D generator
learns to synthesize the RGBA textures for the individual layers. Trained on
unstructured, single-view 2D image datasets, our LSV-GAN generates high-quality
and view-consistent 3D articulated digital humans without the need for
view-inconsistent 2D upsampling networks. |
This paper introduces Layered Surface Volumes (LSVs), a novel 3D representation for articulated digital humans, and uses it in a GAN framework (LSV-GAN) to generate high-quality, animatable human bodies from single-view images. |
High-quality 3D human assets are important for various applications, but existing generation methods struggle to balance realism, efficiency, and the ability to capture fine details like hair. This work aims to address these limitations. |
LSVs represent a human body using multiple textured mesh layers around a template mesh (SMPL). These layers, textured with color and transparency, are efficiently rendered using alpha compositing and differentiable rasterization. A 2D GAN generator learns to synthesize these textures from single-view images. |
LSV-GAN achieves state-of-the-art quality and diversity in generated 3D humans, outperforming baselines in FID and PCK metrics on multiple datasets.
The method maintains excellent multi-view consistency, thanks to the use of rasterization and the absence of view-inconsistent upsampling networks.
LSV-GAN is computationally efficient, achieving fast training and rendering times due to the use of LSVs and differentiable rasterization. |
The level of detail in generated results is limited by the image resolution.
Realistic motion of hair and clothes is limited by the use of linear blend skinning. |
3d human generation, generative adversarial networks (gans), layered surface volumes (lsvs), differentiable rasterization, articulated human body |
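The alpha compositing step described in this report can be illustrated with a minimal sketch. It assumes that, for one pixel, rasterization has already produced one RGBA sample per layer, ordered from the nearest layer to the farthest; the function name and the sample values are hypothetical and do not come from the paper.

```python
from typing import List, Tuple

RGBA = Tuple[float, float, float, float]  # (r, g, b, alpha), each in [0, 1]


def composite_front_to_back(samples: List[RGBA]) -> RGBA:
    """Alpha-composite one pixel's layer samples, ordered nearest first."""
    r_out = g_out = b_out = 0.0
    transmittance = 1.0  # fraction of light not yet absorbed by nearer layers
    for r, g, b, alpha in samples:
        weight = transmittance * alpha  # contribution of this layer
        r_out += weight * r
        g_out += weight * g
        b_out += weight * b
        transmittance *= 1.0 - alpha
    # Accumulated opacity is whatever the layers did not let through.
    return (r_out, g_out, b_out, 1.0 - transmittance)


# Hypothetical example: a half-transparent red layer in front of an opaque blue one.
pixel = composite_front_to_back([(1.0, 0.0, 0.0, 0.5), (0.0, 0.0, 1.0, 1.0)])
```

Because each layer's weight depends only on the opacity of the layers in front of it, a thin shell of a few layers can represent semi-transparent off-surface detail such as hair without sampling a full volume.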
2307.05445
Report |
AutoDecoding Latent 3D Diffusion Models |
Evangelos Ntavelis, Aliaksandr Siarohin, Kyle Olszewski, Chaoyang Wang, Luc Van Gool, Sergey Tulyakov |
We present a novel approach to the generation of static and articulated 3D
assets that has a 3D autodecoder at its core. The 3D autodecoder framework
embeds properties learned from the target dataset in the latent space, which
can then be decoded into a volumetric representation for rendering
view-consistent appearance and geometry. We then identify the appropriate
intermediate volumetric latent space, and introduce robust normalization and
de-normalization operations to learn a 3D diffusion from 2D images or monocular
videos of rigid or articulated objects. Our approach is flexible enough to use
either existing camera supervision or no camera information at all -- instead
efficiently learning it during training. Our evaluations demonstrate that our
generation results outperform state-of-the-art alternatives on various
benchmark datasets and metrics, including multi-view image datasets of
synthetic objects, real in-the-wild videos of moving people, and a large-scale,
real video dataset of static objects. |
This paper introduces 3DVADER, a novel two-stage approach for generating static and articulated 3D assets using a 3D autodecoder and a latent 3D diffusion model. |
Existing 3D generative methods struggle with the limitations of 2D training data and the lack of standard representations for 3D geometry. 3DVADER overcomes these by using a volumetric autodecoder to learn 3D representations from 2D images or videos, enabling generation of diverse and realistic 3D objects. |
The first stage trains a volumetric autodecoder to learn latent representations of objects from multi-view images or monocular videos. The second stage trains a 3D diffusion model in the compact latent space of the autodecoder, enabling efficient generation of diverse 3D content. |
3DVADER outperforms state-of-the-art methods on benchmark datasets, including multi-view images of synthetic objects and real in-the-wild videos of humans.
The method is scalable to large, multi-category datasets, exceeding the capacity of previous 3D diffusion models.
Robust normalization and denormalization operations are introduced to identify and operate within the appropriate latent space of the autodecoder for optimal diffusion. |
The method currently focuses on single-object scenes, limiting its application to more complex multi-object scenarios.
It requires multi-view images or video sequences for training, restricting its use with single-image datasets. |
3d generation, diffusion models, autodecoders, volumetric rendering, neural rendering |
2307.05222
Report |
Emu: Generative Pretraining in Multimodality |
Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, Xinlong Wang |
We present Emu, a Transformer-based multimodal foundation model, which can
seamlessly generate images and texts in multimodal context. This omnivore model
can take in any single-modality or multimodal data input indiscriminately
(e.g., interleaved image, text and video) through a one-model-for-all
autoregressive training process. First, visual signals are encoded into
embeddings, and together with text tokens form an interleaved input sequence.
Emu is then end-to-end trained with a unified objective of classifying the next
text token or regressing the next visual embedding in the multimodal sequence.
This versatile multimodality empowers the exploration of diverse pretraining
data sources at scale, such as videos with interleaved frames and text,
webpages with interleaved images and text, as well as web-scale image-text
pairs and video-text pairs. Emu can serve as a generalist multimodal interface
for both image-to-text and text-to-image tasks, and supports in-context image
and text generation. Across a broad range of zero-shot/few-shot tasks including
image captioning, visual question answering, video question answering and
text-to-image generation, Emu demonstrates superb performance compared to
state-of-the-art large multimodal models. Extended capabilities such as
multimodal assistants via instruction tuning are also demonstrated with
impressive performance. |
Introducing Emu, a large multimodal model trained to predict the next element in interleaved visual and textual sequences, enabling it to perform diverse multimodal tasks like image captioning, visual question answering, and text-to-image generation. |
Emu leverages the power of LLMs and diverse web-scale data, including a novel video-text interleaved dataset, to achieve strong performance in zero-shot and few-shot settings on various tasks, advancing the capabilities of multimodal models. |
Emu utilizes a unified autoregressive training objective with a visual encoder (EVA-CLIP), causal transformer for visual sequence modeling, multimodal modeling LLM (LLaMA), and a visual decoder (Stable Diffusion) for image generation. |
Emu demonstrates state-of-the-art performance on multiple zero-shot and few-shot benchmarks, outperforming existing large multimodal models.
The model exhibits strong in-context learning abilities, improving performance with more in-context examples.
Emu showcases impressive qualitative capabilities like image blending, in-context text and image generation, and real-world knowledge grounding. |
Emu is primarily trained on English-language data, limiting its proficiency in other languages.
Like other LLMs and LMMs, Emu is susceptible to hallucinations, slow inference speed, and potential biases from training data. |
multimodal learning, large language models, image captioning, visual question answering, text-to-image generation |
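The unified objective in this report (classify the next text token, regress the next visual embedding) can be sketched as follows. This is a toy illustration only: the squared-error regression loss, the equal weighting of the two terms, and all function names are assumptions, not details confirmed by the report.

```python
import math
from typing import List, Sequence, Union

Target = Union[int, Sequence[float]]  # int: text token id; floats: visual embedding


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-likelihood of the target token under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predicted and target embeddings (assumed loss)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def unified_loss(predictions: List[Sequence[float]], targets: List[Target]) -> float:
    """Average per-position loss over an interleaved text/visual sequence."""
    total = 0.0
    for pred, tgt in zip(predictions, targets):
        if isinstance(tgt, int):
            total += cross_entropy(pred, tgt)   # next element is a text token
        else:
            total += squared_error(pred, tgt)   # next element is a visual embedding
    return total / len(targets)


# Hypothetical two-position sequence: one text token, then one visual embedding.
loss = unified_loss([[0.0, 0.0], [1.0, 2.0]], [0, [1.0, 2.0]])
```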
2307.05134
Report |
TIAM -- A Metric for Evaluating Alignment in Text-to-Image Generation |
Paul Grimal, Hervé Le Borgne, Olivier Ferret, Julien Tourille |
The progress in the generation of synthetic images has made it crucial to
assess their quality. While several metrics have been proposed to assess the
rendering of images, it is crucial for Text-to-Image (T2I) models, which
generate images based on a prompt, to consider additional aspects such as the
extent to which the generated image matches the important content of the prompt.
Moreover, although the generated images usually result from a random starting
point, its influence is generally not considered. In this article,
we propose a new metric based on prompt templates to study the alignment
between the content specified in the prompt and the corresponding generated
images. It allows us to better characterize the alignment in terms of the type
of the specified objects, their number, and their color. We conducted a study
on several recent T2I models about various aspects. An additional interesting
result we obtained with our approach is that image quality can vary drastically
depending on the noise used as a seed for the images. We also quantify the
influence of the number of concepts in the prompt, their order as well as their
(color) attributes. Finally, our method allows us to identify some seeds that
produce better images than others, opening novel directions of research on this
understudied topic. |
The paper introduces TIAM, a novel metric to quantify the alignment between generated images and text prompts in text-to-image synthesis. |
Existing image quality metrics fail to adequately assess the alignment between generated content and textual descriptions, particularly in complex scenarios involving multiple objects and attributes. |
TIAM utilizes prompt templates to systematically analyze the success rate of generating images containing specific objects and their attributes (e.g., color). It leverages object detection and segmentation to compare generated images with ground truth labels derived from the prompt. |
The alignment performance of text-to-image models significantly declines as the number of objects in the prompt increases.
The initial objects mentioned in the prompt are more likely to be present and correctly attributed in the generated image.
The study reveals the significant influence of the random seed used during image generation, indicating that certain seed values consistently produce higher-quality results. |
TIAM's computational cost increases with the number of objects and attributes, potentially limiting its scalability.
The current implementation focuses on a limited set of attributes (primarily color) and object labels derived from the COCO dataset, requiring further work to extend its applicability to a broader range of attributes and open-vocabulary settings. |
text-to-image synthesis, image quality assessment, prompt engineering, semantic alignment, attribute binding |
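A rough sketch of the success-rate idea behind this metric, under simplifying assumptions: each generated image has already been passed through a detector that returns a set of labels, and a generation counts as aligned only if every object named in the prompt was detected. The record layout and function names are hypothetical, and attribute (color) checks are omitted.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# (objects named in the prompt, seed used for generation, labels detected in the image)
Record = Tuple[Tuple[str, ...], int, Set[str]]


def is_aligned(expected: Tuple[str, ...], detected: Set[str]) -> bool:
    """A generation succeeds only if all prompt objects were detected."""
    return all(obj in detected for obj in expected)


def alignment_score(records: Iterable[Record]) -> float:
    """Fraction of generations that contain every requested object."""
    records = list(records)
    return sum(is_aligned(exp, det) for exp, _, det in records) / len(records)


def score_per_seed(records: Iterable[Record]) -> Dict[int, float]:
    """Same success rate, grouped by seed, to expose seed-dependent behavior."""
    hits: Dict[int, List[bool]] = defaultdict(list)
    for expected, seed, detected in records:
        hits[seed].append(is_aligned(expected, detected))
    return {seed: sum(v) / len(v) for seed, v in hits.items()}


# Hypothetical detections for two prompts generated with two seeds each.
records = [
    (("cat", "dog"), 0, {"cat", "dog"}),
    (("cat", "dog"), 1, {"cat"}),
    (("car", "bench"), 0, {"car", "bench"}),
    (("car", "bench"), 1, {"car", "bench", "person"}),
]
overall = alignment_score(records)
by_seed = score_per_seed(records)
```

Grouping by seed is what makes it possible to compare seeds against one another, as in the seed-influence finding reported above.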
2307.05000
Report |
Neural Point-based Volumetric Avatar: Surface-guided Neural Points for Efficient and Photorealistic Volumetric Head Avatar |
Cong Wang, Di Kang, Yan-Pei Cao, Linchao Bao, Ying Shan, Song-Hai Zhang |
Rendering photorealistic and dynamically moving human heads is crucial for
ensuring a pleasant and immersive experience in AR/VR and video conferencing
applications. However, existing methods often struggle to model challenging
facial regions (e.g., mouth interior, eyes, hair/beard), resulting in
unrealistic and blurry results. In this paper, we propose Neural Point-based
Volumetric Avatar (NPVA), a method that adopts the neural point representation as well as the
neural volume rendering process and discards the predefined connectivity and
hard correspondence imposed by mesh-based approaches. Specifically, the neural
points are strategically constrained around the surface of the target
expression via a high-resolution UV displacement map, achieving increased
modeling capacity and more accurate control. We introduce three technical
innovations to improve the rendering and training efficiency: a patch-wise
depth-guided (shading point) sampling strategy, a lightweight radiance decoding
process, and a Grid-Error-Patch (GEP) ray sampling strategy during training. By
design, NPVA is better equipped to handle topologically changing regions
and thin structures while also ensuring accurate expression control when
animating avatars. Experiments conducted on three subjects from the Multiface
dataset demonstrate the effectiveness of our designs, outperforming previous
state-of-the-art methods, especially in handling challenging facial regions. |
Proposes NPVA, a neural point-based volumetric representation for animatable head avatar creation that uses neural points constrained around a target expression's surface for efficient and photorealistic rendering. |
Existing mesh-based methods struggle to model challenging facial regions like mouths and beards, leading to unrealistic results. NPVA addresses this by using flexible neural points and neural volume rendering. |
NPVA uses a UV displacement map to guide neural points around a coarse target expression geometry. It introduces a patch-wise depth-guided sampling, lightweight radiance decoding, and Grid-Error-Patch training for efficiency. |
Outperforms state-of-the-art methods in rendering quality on novel expressions and views, especially in challenging regions.
Achieves ~70x faster rendering speed than NeRF while producing comparable high-fidelity results.
Demonstrates through ablation studies the effectiveness of its technical innovations like lightweight decoding and GEP training. |
Reliance on coarse mesh tracking limits handling of complex hairstyles.
Relaxing displacement map constraints for unseen hairstyles can lead to blurry renderings. |
neural representation, volume rendering, head avatar, facial animation, point cloud |
2307.04859
Report |
Articulated 3D Head Avatar Generation using Text-to-Image Diffusion Models |
Alexander W. Bergman, Wang Yifan, Gordon Wetzstein |
The ability to generate diverse 3D articulated head avatars is vital to a
plethora of applications, including augmented reality, cinematography, and
education. Recent work on text-guided 3D object generation has shown great
promise in addressing these needs. These methods directly leverage pre-trained
2D text-to-image diffusion models to generate 3D-multi-view-consistent radiance
fields of generic objects. However, due to the lack of geometry and texture
priors, these methods have limited control over the generated 3D objects,
making it difficult to operate inside a specific domain, e.g., human heads. In
this work, we develop a new approach to text-guided 3D head avatar generation
to address this limitation. Our framework directly operates on the geometry and
texture of an articulable 3D morphable model (3DMM) of a head, and introduces
novel optimization procedures to update the geometry and texture while keeping
the 2D and 3D facial features aligned. The result is a 3D head avatar that is
consistent with the text description and can be readily articulated using the
deformation model of the 3DMM. We show that our diffusion-based articulated
head avatars outperform state-of-the-art approaches for this task. The latter
are typically based on CLIP, which is known to provide limited diversity of
generation and accuracy for 3D object generation. |
Presents a novel method for generating 3D-view-consistent and articulable human head avatars from text prompts using pre-trained 2D text-to-image diffusion models. |
Addresses limitations of existing methods that struggle with control, diversity, and animation in text-guided 3D head avatar generation. |
Leverages score distillation loss to optimize shape and appearance of a 3D morphable model (3DMM) with a novel dual optimization procedure for geometry and texture, ensuring alignment and realism. |
Generates high-quality head avatars with diverse features, including fictional humanoids.
Exhibits superior geometry-text consistency compared to baselines, capturing unique geometric attributes from prompts.
Demonstrates realistic animation due to geometry-aware texture optimization and alignment with 3D facial landmarks. |
Generated images may exhibit cartoon-ish stylization and high color saturation, impacting realism.
Inconsistency in upsampling across camera views can lead to flickering during animation. |
3d head avatar generation, text-guided synthesis, diffusion models, 3d morphable models, articulated animation |
2307.04787
Report |
Collaborative Score Distillation for Consistent Visual Synthesis |
Subin Kim, Kyungmin Lee, June Suk Choi, Jongheon Jeong, Kihyuk Sohn, Jinwoo Shin |
Generative priors of large-scale text-to-image diffusion models enable a wide
range of new generation and editing applications on diverse visual modalities.
However, when adapting these priors to complex visual modalities, often
represented as multiple images (e.g., video), achieving consistency across a
set of images is challenging. In this paper, we address this challenge with a
novel method, Collaborative Score Distillation (CSD). CSD is based on the Stein
Variational Gradient Descent (SVGD). Specifically, we propose to consider
multiple samples as "particles" in the SVGD update and combine their score
functions to distill generative priors over a set of images synchronously.
Thus, CSD facilitates seamless integration of information across 2D images,
leading to a consistent visual synthesis across multiple samples. We show the
effectiveness of CSD in a variety of tasks, encompassing the visual editing of
panorama images, videos, and 3D scenes. Our results underline the competency of
CSD as a versatile method for enhancing inter-sample consistency, thereby
broadening the applicability of text-to-image diffusion models. |
This paper proposes Collaborative Score Distillation (CSD), a novel method that extends text-to-image diffusion models for consistent visual synthesis and editing of complex visual data represented as a set of images. |
Existing text-to-image diffusion models struggle to maintain consistency across multiple images, limiting their application to complex visual modalities like videos and 3D scenes. |
CSD leverages Stein Variational Gradient Descent (SVGD) to distill generative priors over a set of images synchronously, ensuring consistency by sharing information among multiple samples during optimization. |
CSD enables spatially consistent panorama image editing, achieving a better balance between source-target consistency and instruction fidelity compared to baselines.
CSD facilitates temporally consistent video editing, outperforming zero-shot methods and demonstrating comparable performance to a state-of-the-art video editing model trained on a large-scale dataset.
CSD enhances 3D scene editing by encouraging multi-view consistency, leading to higher-quality edits and better preservation of source scene semantics compared to existing methods. |
The method inherits limitations from pre-trained text-to-image diffusion models, such as potential biases and difficulty in handling certain editing tasks (e.g., viewpoint changes).
Patch-wise processing of high-resolution images can sometimes lead to artifacts at patch boundaries. |
text-to-image synthesis, score distillation sampling, stein variational gradient descent, video editing, 3d scene editing |
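The SVGD update that CSD builds on can be sketched on a toy problem. Here the particles are scalars and the score function is that of a standard Gaussian, standing in for the diffusion-model score over images; the RBF kernel, bandwidth, and step size are illustrative assumptions rather than the paper's settings.

```python
import math
from typing import Callable, List


def rbf(xj: float, xi: float, h: float) -> float:
    """RBF kernel k(xj, xi) = exp(-(xj - xi)^2 / h)."""
    return math.exp(-((xj - xi) ** 2) / h)


def svgd_step(particles: List[float], score: Callable[[float], float],
              step: float = 0.1, h: float = 1.0) -> List[float]:
    """One SVGD update: each particle mixes the scores of all particles."""
    n = len(particles)
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = rbf(xj, xi, h)
            grad_k = -2.0 * (xj - xi) / h * k  # derivative of k with respect to xj
            # The first term pulls toward high density, the second repels neighbors.
            phi += k * score(xj) + grad_k
        updated.append(xi + step * phi / n)
    return updated


def gaussian_score(x: float) -> float:
    """Score (gradient of log-density) of a standard normal: -x."""
    return -x


particles = [2.9, 3.0, 3.1, 3.2]
moved = svgd_step(particles, gaussian_score)
```

The kernel-weighted sum is the part that shares information across samples: every particle's update depends on the scores of the others, which is the mechanism the report credits for inter-sample consistency.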
2307.04767
Report |
Semantic-SAM: Segment and Recognize Anything at Any Granularity |
Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Jianwei Yang, Chunyuan Li, Lei Zhang, Jianfeng Gao |
In this paper, we introduce Semantic-SAM, a universal image segmentation
model that can segment and recognize anything at any desired granularity. Our
model offers two key advantages: semantic-awareness and granularity-abundance.
To achieve semantic-awareness, we consolidate multiple datasets across three
granularities and introduce decoupled classification for objects and parts.
This allows our model to capture rich semantic information. For the
multi-granularity capability, we propose a multi-choice learning scheme during
training, enabling each click to generate masks at multiple levels that
correspond to multiple ground-truth masks. Notably, this work represents the
first attempt to jointly train a model on SA-1B, generic, and part segmentation
datasets. Experimental results and visualizations demonstrate that our model
successfully achieves semantic-awareness and granularity-abundance.
Furthermore, combining SA-1B training with other segmentation tasks, such as
panoptic and part segmentation, leads to performance improvements. We will
provide code and a demo for further exploration and evaluation. |
This paper presents Semantic-SAM, a universal image segmentation model capable of segmenting and recognizing objects at any desired granularity with semantic awareness. |
A universal segmentation model is crucial for achieving human-level image understanding in various applications, going beyond the limitations of existing models with single-input-single-output pipelines and restricted training data. |
Semantic-SAM leverages a multi-choice learning design with multiple queries per click, enabling the prediction of multi-granularity masks. It uses a shared text encoder for decoupled object and part classification, trained on a unified data format from seven datasets with different semantic and granularity levels, including SA-1B, COCO, ADE20k, Pascal Part, PACO, PartImageNet, and Objects365. |
Semantic-SAM achieves state-of-the-art performance on various segmentation tasks, including generic, part, and interactive segmentation.
Joint training with SA-1B significantly improves performance on COCO panoptic segmentation, demonstrating the benefit of multi-granularity learning.
The model exhibits superior granularity completeness compared to SAM, generating more meaningful and higher-quality masks at multiple levels. |
The model currently relies on a fixed number of prompts (6), potentially limiting its ability to capture even finer granularities.
Future work could explore incorporating a dynamic prompt generation mechanism based on image content and user intent. |
image segmentation, multi-granularity, semantic awareness, interactive segmentation, open-vocabulary segmentation |
2307.04749
Report |
Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback |
Jaskirat Singh, Liang Zheng |
The field of text-conditioned image generation has made unparalleled progress
with the recent advent of latent diffusion models. While remarkable, as the
complexity of given text input increases, the state-of-the-art diffusion models
may still fail in generating images which accurately convey the semantics of
the given prompt. Furthermore, it has been observed that such misalignments are
often left undetected by pretrained multi-modal models such as CLIP. To address
these problems, in this paper we explore a simple yet effective decompositional
approach towards both evaluation and improvement of text-to-image alignment. In
particular, we first introduce a Decompositional-Alignment-Score which given a
complex prompt decomposes it into a set of disjoint assertions. The alignment
of each assertion with generated images is then measured using a VQA model.
Finally, alignment scores for different assertions are combined a posteriori to
give the final text-to-image alignment score. Experimental analysis reveals
that the proposed alignment metric shows significantly higher correlation with
human ratings than traditional CLIP and BLIP scores. Furthermore, we also
find that the assertion-level alignment scores provide useful feedback which
can then be used in a simple iterative procedure to gradually increase the
expression of different assertions in the final image outputs. Human user
studies indicate that the proposed approach surpasses previous state-of-the-art
by 8.7% in overall text-to-image alignment accuracy. Project page for our paper
is available at https://1jsingh.github.io/divide-evaluate-and-refine |
This paper introduces a novel decompositional framework for evaluating and refining text-to-image alignment in text-conditioned image generation models. |
Existing text-to-image generation models often fail to accurately convey the semantics of complex text prompts, and existing evaluation metrics like CLIP and BLIP scores often fail to detect these misalignments. |
The proposed framework, called Decompositional-Alignment-Score (DA-Score), decomposes complex prompts into disjoint assertions, evaluates the alignment of each assertion with the generated image using a VQA model, and then combines these scores to generate an overall text-to-image alignment score. This feedback is then used in an iterative refinement process to improve the generated image by increasing the expressiveness of the least aligned assertion. |
DA-Score shows significantly higher correlation with human ratings for text-to-image alignment compared to traditional metrics like CLIP, BLIP, and BLIP2.
The iterative refinement process, guided by DA-Score, generates images with improved alignment to complex prompts, outperforming prior works in terms of alignment accuracy.
Despite the iterative process, the proposed method maintains comparable inference times to other state-of-the-art techniques. |
The reliance on a pretrained BLIP-VQA model for assertion alignment evaluation introduces potential weaknesses based on the VQA model's limitations.
The current approach treats all assertions as equally important, neglecting potential variations in user priorities and the visual verifiability of certain assertions. |
text-to-image generation, text-image alignment, vqa, iterative refinement, diffusion models |
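The decompose, evaluate, and combine loop in this report can be sketched as below. The sketch assumes the prompt has already been split into assertions and that some VQA model returns a score in [0, 1] for each one; the plain average used to combine scores is an assumption (the report only says the scores are combined), and the stub scorer is hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def da_score(assertions: List[str],
             vqa_score: Callable[[str], float]) -> Tuple[float, Dict[str, float], str]:
    """Score each assertion, combine the scores, and find the weakest assertion."""
    per_assertion = {a: vqa_score(a) for a in assertions}
    overall = sum(per_assertion.values()) / len(per_assertion)  # assumed: simple mean
    weakest = min(per_assertion, key=per_assertion.get)
    return overall, per_assertion, weakest


# Hypothetical stub standing in for a VQA model queried on one generated image.
stub_scores = {
    "there is a dog": 0.9,
    "the dog wears a red hat": 0.2,
    "the background is a beach": 0.7,
}
overall, per_assertion, weakest = da_score(list(stub_scores), stub_scores.__getitem__)
```

In the iterative refinement described above, the weakest assertion is the one whose expression would be strengthened in the next generation round.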
2307.04725
Report |
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning |
Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, Bo Dai |
With the advance of text-to-image (T2I) diffusion models (e.g., Stable
Diffusion) and corresponding personalization techniques such as DreamBooth and
LoRA, everyone can manifest their imagination into high-quality images at an
affordable cost. However, adding motion dynamics to existing high-quality
personalized T2Is and enabling them to generate animations remains an open
challenge. In this paper, we present AnimateDiff, a practical framework for
animating personalized T2I models without requiring model-specific tuning. At
the core of our framework is a plug-and-play motion module that can be trained
once and seamlessly integrated into any personalized T2Is originating from the
same base T2I. Through our proposed training strategy, the motion module
effectively learns transferable motion priors from real-world videos. Once
trained, the motion module can be inserted into a personalized T2I model to
form a personalized animation generator. We further propose MotionLoRA, a
lightweight fine-tuning technique for AnimateDiff that enables a pre-trained
motion module to adapt to new motion patterns, such as different shot types, at
a low training and data collection cost. We evaluate AnimateDiff and MotionLoRA
on several public representative personalized T2I models collected from the
community. The results demonstrate that our approaches help these models
generate temporally smooth animation clips while preserving the visual quality
and motion diversity. Codes and pre-trained weights are available at
https://github.com/guoyww/AnimateDiff. |
Presents AnimateDiff, a practical framework for animating personalized text-to-image models without requiring model-specific tuning. |
Enables users to generate animations from personalized text-to-image models, which is desirable in various industries and for creative applications. |
Trains a plug-and-play motion module on real-world videos and integrates it into personalized T2I models. Also introduces MotionLoRA for adapting the module to new motion patterns with few reference videos. |
Generates temporally smooth animations while preserving the visual quality and motion diversity of personalized T2I models.
Demonstrates that a Transformer architecture effectively captures motion priors.
Shows that MotionLoRA successfully adapts pre-trained motion modules to new motion patterns with limited data and computation. |
Limited evaluation on controllable generation.
Potential misuse for generating inappropriate content, although the paper proposes adding a content safety checker to mitigate this risk. |
text-to-image synthesis, animation generation, motion modeling, personalization, diffusion models |
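The low-rank adaptation idea behind the LoRA-style fine-tuning mentioned in this report can be sketched generically. This is the standard LoRA weight update, W' = W + (alpha / r) * B A, written in plain Python; which layers of the motion module are adapted and which rank is used are not stated in the report, so the shapes and values below are purely hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_weight(w: Matrix, b: Matrix, a: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A, where r is the rank (number of rows of A)."""
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)  # low-rank update with the same shape as W
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


# Hypothetical 2x2 frozen weight with a rank-1 adapter.
frozen = [[1.0, 0.0], [0.0, 1.0]]
b_mat = [[1.0], [2.0]]   # 2 x 1, trainable
a_mat = [[3.0, 4.0]]     # 1 x 2, trainable
adapted = lora_weight(frozen, b_mat, a_mat, alpha=1.0)
```

Only the two small factors are trained while the original weight stays frozen, which is why adapting to a new motion pattern needs little data and compute.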
2307.04684
Report |
FreeDrag: Feature Dragging for Reliable Point-based Image Editing |
Pengyang Ling, Lin Chen, Pan Zhang, Huaian Chen, Yi Jin, Jinjin Zheng |
To serve the intricate and varied demands of image editing, precise and
flexible manipulation of image content is indispensable. Recently, drag-based
editing methods have achieved impressive performance. However, these methods
predominantly center on point dragging, resulting in two noteworthy drawbacks,
namely "miss tracking", where difficulties arise in accurately tracking the
predetermined handle points, and "ambiguous tracking", where tracked points are
potentially positioned in wrong regions that closely resemble the handle
points. To address the above issues, we propose FreeDrag, a feature dragging
methodology designed to relieve the burden of point tracking. FreeDrag
incorporates two key designs: template features with adaptive updating and
line search with backtracking. The former improves stability against
drastic content changes by carefully controlling the feature updating scale after
each drag, while the latter alleviates misguidance from similar points
by actively restricting the search area to a line. These two techniques
together contribute to a more stable semantic dragging with higher efficiency.
Comprehensive experimental results substantiate that our approach significantly
outperforms pre-existing methodologies, offering reliable point-based editing
even in various complex scenarios. |
This paper introduces FreeDrag, a novel feature-dragging framework designed for robust and precise point-based image editing. |
Existing point-dragging methods suffer from limitations like 'miss tracking' and 'ambiguous tracking,' leading to inaccurate and unreliable editing outcomes. FreeDrag aims to address these issues and improve the quality of interactive image editing. |
FreeDrag utilizes two key mechanisms: 1) Adaptive template features, which dynamically adjust updating scales based on dragging quality, enhancing stability. 2) Line search with backtracking, constraining movements along a line to minimize ambiguity and employing backtracking for course correction. |
FreeDrag successfully mitigates point disappearance and content distortion, enabling precise detail editing.
It exhibits robustness against similar points, leading to more reliable and accurate dragging outcomes.
Quantitative evaluations demonstrate FreeDrag's superiority in achieving high editing accuracy while preserving image fidelity. |
The performance of FreeDrag is subject to the chosen parameters, requiring careful tuning for optimal results.
Future work will explore the integration of FreeDrag with other generative models beyond StyleGAN2 and diffusion models, potentially expanding its applicability. |
image editing, generative models, point-based editing, feature dragging, interactive editing |
2307.04455
Report |
SAM-IQA: Can Segment Anything Boost Image Quality Assessment? |
Xinpeng Li, Ting Jiang, Haoqiang Fan, Shuaicheng Liu |
Image Quality Assessment (IQA) is a challenging task that requires training
on massive datasets to achieve accurate predictions. However, due to the lack
of IQA data, deep learning-based IQA methods typically rely on pre-trained
networks trained on massive datasets as feature extractors to enhance their
generalization ability, such as the ResNet network trained on ImageNet. In this
paper, we utilize the encoder of Segment Anything, a recently proposed
segmentation model trained on a massive dataset, for high-level semantic
feature extraction. Most IQA methods are limited to extracting spatial-domain
features, while frequency-domain features have been shown to better represent
noise and blur. Therefore, we leverage both spatial-domain and frequency-domain
features by applying Fourier and standard convolutions on the extracted
features, respectively. Extensive experiments are conducted to demonstrate the
effectiveness of all the proposed components, and results show that our
approach outperforms the state-of-the-art (SOTA) in four representative
datasets, both qualitatively and quantitatively. Our experiments confirm the
powerful feature extraction capabilities of Segment Anything and highlight the
value of combining spatial-domain and frequency-domain features in IQA tasks.
Code: https://github.com/Hedlen/SAM-IQA |
This paper introduces a novel IQA method leveraging the Segment Anything (SAM) model for feature extraction, incorporating both spatial and frequency domain features through a spatial-frequency feature extraction module (SFEM). |
Accurate IQA is crucial for various image processing tasks, but existing methods suffer from limited training data. This paper addresses this by utilizing the robust feature extraction capabilities of SAM, trained on a massive dataset. |
The method extracts features using the SAM encoder and then employs SFEM to capture both spatial and frequency domain information using regular and Fourier convolutions. For FR-IQA, L1 distance is used to compare features, while for NR-IQA, features are directly fed into a regression block. |
The method outperforms state-of-the-art approaches in both FR-IQA and NR-IQA tasks on various benchmark datasets.
Ablation studies confirm the effectiveness of the SAM encoder, the Fourier convolution in SFEM, and the L1 distance metric.
The method achieves strong generalization ability and superior performance in image quality assessment. |
The method's reliance on a pre-trained SAM encoder limits its applicability in scenarios where SAM's performance is compromised.
Further exploration of advanced distance metric learning techniques could potentially enhance the model's accuracy. |
image quality assessment, segment anything, fourier convolution, spatial-frequency feature extraction, deep learning |
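For the SAM-IQA entry above, the idea of comparing a reference and a distorted image on both spatial and frequency features with an L1 distance can be illustrated with a minimal sketch. This is not the SFEM module itself, which applies learned regular and Fourier convolutions to SAM encoder features. Here a plain 1-D list stands in for a feature map, a naive DFT magnitude stands in for the frequency branch, and all function names are hypothetical.

```python
import cmath


def dft_magnitude(x):
    """Magnitude spectrum of a 1-D signal via a naive discrete Fourier transform."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]


def spatial_frequency_features(x):
    """Concatenate the raw (spatial) values with their DFT magnitudes (frequency)."""
    return list(x) + dft_magnitude(x)


def l1_distance(a, b):
    """Mean absolute difference between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


# Toy stand-ins for features of a reference image and a distorted copy (assumed data).
reference = [1.0, 1.0, 1.0, 1.0]
distorted = [1.0, 0.0, 1.0, 0.0]

score = l1_distance(
    spatial_frequency_features(reference),
    spatial_frequency_features(distorted),
)
same = l1_distance(
    spatial_frequency_features(reference),
    spatial_frequency_features(reference),
)
```

In this toy setup a larger `score` means the distorted features sit further from the reference. In the full-reference setting described above, such a distance would feed a quality prediction rather than be reported directly.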
2307.04157
Report |
DIFF-NST: Diffusion Interleaving For deFormable Neural Style Transfer |
Dan Ruta, Gemma Canet Tarrés, Andrew Gilbert, Eli Shechtman, Nicholas Kolkin, John Collomosse |
Neural Style Transfer (NST) is the field of study applying neural techniques
to modify the artistic appearance of a content image to match the style of a
reference style image. Traditionally, NST methods have focused on texture-based
image edits, affecting mostly low level information and keeping most image
structures the same. However, style-based deformation of the content is
desirable for some styles, especially in cases where the style is abstract or
the primary concept of the style is in its deformed rendition of some content.
With the recent introduction of diffusion models, such as Stable Diffusion, we
can access far more powerful image generation techniques, enabling new
possibilities. In our work, we propose using this new class of models to
perform style transfer while enabling deformable style transfer, an elusive
capability in previous models. We show how leveraging the priors of these
models can expose new artistic controls at inference time, and we document our
findings in exploring this new direction for the field of style transfer. |
This paper proposes DIFF-NST, a novel Neural Style Transfer (NST) method leveraging diffusion models to enable deformable style transfer, going beyond texture-based edits to alter content shapes and structures according to the style image. |
Traditional NST methods primarily focus on texture transfer, neglecting style-based content deformation. This work explores the potential of diffusion models for achieving deformable style transfer, a capability previously elusive in NST. |
DIFF-NST freezes pre-trained diffusion model weights and trains MLPs within the UNet self-attention blocks. It interleaves content noise and style attention values during reverse diffusion to generate stylized images, enabling control over content deformation and stylization strength. |
DIFF-NST achieves deformable style transfer, successfully altering content shapes and structures based on the style image.
User studies demonstrate a strong preference for DIFF-NST in terms of style transfer quality compared to traditional NST methods and the closest related work, PARASOL.
The method offers inference-time control over the degree of content deformation and stylization strength. |
The method currently doesn't match textures to the style image with the same fidelity as traditional NST approaches.
There are occasional instances where structure from the style image unintentionally influences the stylized image. |
neural style transfer, diffusion models, deformable style transfer, image generation, artistic style |
2307.04028
Report |
Measuring the Success of Diffusion Models at Imitating Human Artists |
Stephen Casper, Zifan Guo, Shreya Mogulothu, Zachary Marinov, Chinmay Deshpande, Rui-Jie Yew, Zheng Dai, Dylan Hadfield-Menell |
Modern diffusion models have set the state-of-the-art in AI image generation.
Their success is due, in part, to training on Internet-scale data which often
includes copyrighted work. This prompts questions about the extent to which
these models learn from, imitate, or copy the work of human artists. This work
suggests that tying copyright liability to the capabilities of the model may be
useful given the evolving ecosystem of generative models. Specifically, much of
the legal analysis of copyright and generative systems focuses on the use of
protected data for training. As a result, the connections between data,
training, and the system are often obscured. In our approach, we consider
simple image classification techniques to measure a model's ability to imitate
specific artists. Specifically, we use Contrastive Language-Image Pretrained
(CLIP) encoders to classify images in a zero-shot fashion. Our process first
prompts a model to imitate a specific artist. Then, we test whether CLIP can be
used to reclassify the artist (or the artist's work) from the imitation. If
these tests match the imitation back to the original artist, this suggests the
model can imitate that artist's expression. Our approach is simple and
quantitative. Furthermore, it uses standard techniques and does not require
additional training. We demonstrate our approach with an audit of Stable
Diffusion's capacity to imitate 70 professional digital artists with
copyrighted work online. When Stable Diffusion is prompted to imitate an artist
from this set, we find that the artist can be identified from the imitation
with an average accuracy of 81.0%. Finally, we also show that a sample of the
artist's work can be matched to these imitation images with a high degree of
statistical reliability. Overall, these results suggest that Stable Diffusion
is broadly successful at imitating individual human artists. |
This paper presents a method to quantify a diffusion model's capacity to imitate human artists by employing CLIP-based image classification, which could inform discussions on copyright liability tied to model capabilities. |
As generative AI models, often trained on copyrighted works, become increasingly capable, it is crucial to develop objective measures of their ability to imitate specific artists, which could be relevant for copyright considerations. Current legal frameworks primarily focus on training data rather than model capabilities. |
The authors use CLIP encoders to classify images generated by Stable Diffusion. They prompt the model to imitate a specific artist's style and then assess if CLIP can correctly identify the artist from the generated image. They also compare the similarity of real artwork to generated imitations using CLIP embeddings. |
Stable Diffusion successfully imitates the style of 70 professional digital artists, as demonstrated by an average classification accuracy of 81.0% using their names as prompts.
These results remain consistent when tested on a larger set of 250 artists, indicating the generalizability of the findings.
Artwork generated by Stable Diffusion is significantly more similar to real artwork by the artist it was prompted to imitate than to work by other artists, further confirming its imitation capabilities. |
The study focuses on Stable Diffusion and a specific set of digital artists, potentially limiting the generalizability of the findings to other models or artistic domains.
Future work could investigate the effectiveness of different image classification techniques and explore potential defenses against AI imitation of copyrighted works. |
diffusion models, copyright law, image classification, clip, ai art |
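The zero-shot reclassification step in the entry above (match an imitation image back to an artist by embedding similarity) can be sketched as follows. This is a minimal illustration under stated assumptions: the three-dimensional vectors are toy stand-ins for CLIP image and text embeddings, the artist names are placeholders, and no real CLIP model is called.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_label(image_emb, text_embs):
    """Return the label whose text embedding is most similar to the image embedding."""
    return max(text_embs, key=lambda name: cosine(image_emb, text_embs[name]))


def accuracy(samples, text_embs):
    """Fraction of (prompted_artist, image_embedding) pairs classified as that artist."""
    hits = sum(zero_shot_label(emb, text_embs) == artist for artist, emb in samples)
    return hits / len(samples)


# Hypothetical text embeddings, one per artist name.
text_embs = {
    "artist_a": [1.0, 0.0, 0.0],
    "artist_b": [0.0, 1.0, 0.0],
    "artist_c": [0.0, 0.0, 1.0],
}

# Hypothetical embeddings of generated imitations, tagged with the prompted artist.
samples = [
    ("artist_a", [0.9, 0.1, 0.0]),
    ("artist_b", [0.2, 0.8, 0.1]),
    ("artist_c", [0.6, 0.1, 0.5]),  # lands closer to artist_a, so it counts as a miss
    ("artist_c", [0.1, 0.2, 0.9]),
]

acc = accuracy(samples, text_embs)
```

The reported 81.0% figure is an average of this kind of hit rate over the 70 artists; the toy data here is unrelated to that result.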
2307.03869
Report |
Sketch-A-Shape: Zero-Shot Sketch-to-3D Shape Generation |
Aditya Sanghi, Pradeep Kumar Jayaraman, Arianna Rampini, Joseph Lambourne, Hooman Shayani, Evan Atherton, Saeid Asgari Taghanaki |
Significant progress has recently been made in creative applications of large
pre-trained models for downstream tasks in 3D vision, such as text-to-shape
generation. This motivates our investigation of how these pre-trained models
can be used effectively to generate 3D shapes from sketches, which has largely
remained an open challenge due to the limited sketch-shape paired datasets and
the varying level of abstraction in the sketches. We discover that conditioning
a 3D generative model on the features (obtained from a frozen large pre-trained
vision model) of synthetic renderings during training enables us to effectively
generate 3D shapes from sketches at inference time. This suggests that the
large pre-trained vision model features carry semantic signals that are
resilient to domain shifts, i.e., allowing us to use only RGB renderings, but
generalizing to sketches at inference time. We conduct a comprehensive set of
experiments investigating different design factors and demonstrate the
effectiveness of our straightforward approach for generation of multiple 3D
shapes per each input sketch regardless of their level of abstraction without
requiring any paired datasets during training. |
The paper proposes Sketch-A-Shape, a zero-shot approach for sketch-to-3D shape generation using pre-trained vision models. |
Sketch-to-3D shape generation is challenging due to the limited availability of paired datasets and varying levels of abstraction in sketches. This method addresses these challenges by leveraging the semantic knowledge captured in pre-trained vision models. |
The method uses a two-stage training process: (1) train a VQ-VAE to obtain shape embeddings, (2) train a masked transformer conditioned on local features from pre-trained vision models applied to synthetic renderings of the 3D shapes. During inference, the transformer is conditioned on features extracted from the input sketch to generate the 3D shape. |
The method generates multiple plausible 3D shapes per sketch, even for highly abstract sketches.
It generalizes well across different sketch datasets and 3D representations (voxels, implicit, CAD).
The method outperforms existing supervised sketch-to-3D approaches on shape classification accuracy and achieves promising results in human evaluation. |
The method is limited by the diversity and quality of the 3D shape dataset used for training.
It struggles to generate shapes with complex local details. |
sketch-to-3d, zero-shot learning, generative models, pre-trained vision models, clip |
2307.03798
Report |
Fooling Contrastive Language-Image Pre-trained Models with CLIPMasterPrints |
Matthias Freiberger, Peter Kun, Christian Igel, Anders Sundnes Løvlie, Sebastian Risi |
Models leveraging both visual and textual data such as Contrastive
Language-Image Pre-training (CLIP), are the backbone of many recent advances in
artificial intelligence. In this work, we show that despite their versatility,
such models are vulnerable to what we refer to as fooling master images.
Fooling master images are capable of maximizing the confidence score of a CLIP
model for a significant number of widely varying prompts, while being either
unrecognizable or unrelated to the attacked prompts for humans. The existence
of such images is problematic as it could be used by bad actors to maliciously
interfere with CLIP-trained image retrieval models in production with
comparably small effort as a single image can attack many different prompts. We
demonstrate how fooling master images for CLIP (CLIPMasterPrints) can be mined
using stochastic gradient descent, projected gradient descent, or blackbox
optimization. Contrary to many common adversarial attacks, the blackbox
optimization approach allows us to mine CLIPMasterPrints even when the weights
of the model are not accessible. We investigate the properties of the mined
images, and find that images trained on a small number of image captions
generalize to a much larger number of semantically related captions. We
evaluate possible mitigation strategies, where we increase the robustness of
the model and introduce an approach to automatically detect CLIPMasterPrints to
sanitize the input of vulnerable models. Finally, we find that vulnerability to
CLIPMasterPrints is related to a modality gap in contrastive pre-trained
multi-modal networks. Code available at
https://github.com/matfrei/CLIPMasterPrints. |
This paper introduces fooling master images (CLIPMasterPrints), which are images capable of maximizing the confidence score of a CLIP model for a wide range of prompts while appearing meaningless or unrelated to humans. |
The existence of CLIPMasterPrints poses a security risk as they can be used to manipulate CLIP-trained image retrieval systems in production, enabling attacks like censorship and adversarial marketing. |
The authors exploit the modality gap in CLIP models and mine CLIPMasterPrints using stochastic gradient descent (SGD), black-box optimization based on Latent Variable Evolution (LVE), and projected gradient descent (PGD). |
CLIPMasterPrints can successfully fool CLIP models for a variety of prompts, achieving higher cosine similarity scores than actual images matching the prompts.
The fooling effect generalizes to semantically related prompts that were not directly targeted during optimization.
Mitigation strategies include bridging the modality gap in CLIP models and sanitizing model inputs by training classifiers to detect CLIPMasterPrints. |
The mitigation strategy of adding fooling images to the training set is only partially successful, as the model remains vulnerable to newly mined CLIPMasterPrints.
Future work should focus on developing more effective mitigation strategies and investigating the vulnerability of other related models. |
clip, fooling images, adversarial attacks, multi-modal networks, modality gap |
2307.03441
Report |
NOFA: NeRF-based One-shot Facial Avatar Reconstruction |
Wangbo Yu, Yanbo Fan, Yong Zhang, Xuan Wang, Fei Yin, Yunpeng Bai, Yan-Pei Cao, Ying Shan, Yang Wu, Zhongqian Sun, Baoyuan Wu |
3D facial avatar reconstruction has been a significant research topic in
computer graphics and computer vision, where photo-realistic rendering and
flexible controls over poses and expressions are necessary for many related
applications. Recently, its performance has been greatly improved with the
development of neural radiance fields (NeRF). However, most existing NeRF-based
facial avatars focus on subject-specific reconstruction and reenactment,
requiring multi-shot images containing different views of the specific subject
for training, and the learned model cannot generalize to new identities,
limiting its further applications. In this work, we propose a one-shot 3D
facial avatar reconstruction framework that only requires a single source image
to reconstruct a high-fidelity 3D facial avatar. For the challenges of lacking
generalization ability and missing multi-view information, we leverage the
generative prior of 3D GAN and develop an efficient encoder-decoder network to
reconstruct the canonical neural volume of the source image, and further
propose a compensation network to complement facial details. To enable
fine-grained control over facial dynamics, we propose a deformation field to
warp the canonical volume into driven expressions. Through extensive
experimental comparisons, we achieve superior synthesis results compared to
several state-of-the-art methods. |
The paper introduces NOFA, a novel one-shot 3D facial avatar reconstruction framework using NeRF, enabling high-fidelity reconstruction and reenactment from a single image. |
Existing NeRF-based facial avatars are subject-specific, requiring extensive multi-shot data and lacking generalization ability. This work addresses these limitations by proposing a generalizable one-shot approach. |
The method leverages a 3D GAN's generative prior to synthesize neural volumes, trains an encoder-decoder network for mapping images to canonical volumes, and employs a deformation field for facial dynamics control guided by 3DMM parameters. |
Outperforms state-of-the-art 2D and NeRF-based methods in novel view synthesis and reenactment tasks.
Achieves comparable performance to multi-shot methods while only requiring a single image.
Demonstrates superior identity preservation and detail rendering. |
Background rotation is coupled with head rotation due to camera pose modeling.
Potential misuse for creating deep-fakes. |
facial avatar reconstruction, neural radiance fields (nerf), one-shot learning, 3d generative adversarial networks (gans), facial reenactment |
2307.03190
Report |
Text-Guided Synthesis of Eulerian Cinemagraphs |
Aniruddha Mahapatra, Aliaksandr Siarohin, Hsin-Ying Lee, Sergey Tulyakov, Jun-Yan Zhu |
We introduce Text2Cinemagraph, a fully automated method for creating
cinemagraphs from text descriptions - an especially challenging task when
prompts feature imaginary elements and artistic styles, given the complexity of
interpreting the semantics and motions of these images. We focus on
cinemagraphs of fluid elements, such as flowing rivers, and drifting clouds,
which exhibit continuous motion and repetitive textures. Existing single-image
animation methods fall short on artistic inputs, and recent text-based video
methods frequently introduce temporal inconsistencies, struggling to keep
certain regions static. To address these challenges, we propose an idea of
synthesizing image twins from a single text prompt - a pair of an artistic
image and its pixel-aligned corresponding natural-looking twin. While the
artistic image depicts the style and appearance detailed in our text prompt,
the realistic counterpart greatly simplifies layout and motion analysis.
Leveraging existing natural image and video datasets, we can accurately segment
the realistic image and predict plausible motion given the semantic
information. The predicted motion can then be transferred to the artistic image
to create the final cinemagraph. Our method outperforms existing approaches in
creating cinemagraphs for natural landscapes as well as artistic and
other-worldly scenes, as validated by automated metrics and user studies.
Finally, we demonstrate two extensions: animating existing paintings and
controlling motion directions using text. |
This paper introduces Text2Cinemagraph, the first fully automated method for generating cinemagraphs from text descriptions, capable of handling both artistic and natural scenes. |
This method allows content creators to easily generate cinemagraphs with a variety of styles and compositions, including those with imaginative elements, which are challenging to create with existing methods. |
The method leverages a twin image synthesis approach, generating an artistic image and a corresponding realistic image with similar semantic layout. A motion prediction model trained on real videos is applied to the realistic image, and the resulting motion is transferred to the artistic image to create the final cinemagraph. |
The method outperforms existing single-image animation techniques on both artistic and natural images, as measured by FVD scores and user studies.
Text and mask conditioning in the flow prediction network are shown to be crucial for generating plausible motions.
The method is extended to animate existing paintings and control motion directions using text. |
Limitations include potential inconsistencies between the artistic and realistic images and challenges in segmenting complex natural images.
Future work involves exploring more fine-grained text-guided direction control and addressing artifacts in the generated videos. |
cinemagraph generation, text-to-video synthesis, single image animation, twin image synthesis, artistic style transfer |
2307.03108
Report |
DIAGNOSIS: Detecting Unauthorized Data Usages in Text-to-image Diffusion Models |
Zhenting Wang, Chen Chen, Lingjuan Lyu, Dimitris N. Metaxas, Shiqing Ma |
Recent text-to-image diffusion models have shown surprising performance in
generating high-quality images. However, concerns have arisen regarding the
unauthorized data usage during the training or fine-tuning process. One example
is when a model trainer collects a set of images created by a particular artist
and attempts to train a model capable of generating similar images without
obtaining permission and giving credit to the artist. To address this issue, we
propose a method for detecting such unauthorized data usage by planting the
injected memorization into the text-to-image diffusion models trained on the
protected dataset. Specifically, we modify the protected images by adding
unique contents on these images using stealthy image warping functions that are
nearly imperceptible to humans but can be captured and memorized by diffusion
models. By analyzing whether the model has memorized the injected content
(i.e., whether the generated images are processed by the injected
post-processing function), we can detect models that had illegally utilized the
unauthorized data. Experiments on Stable Diffusion and VQ Diffusion with
different model training or fine-tuning methods (i.e., LoRA, DreamBooth, and
standard training) demonstrate the effectiveness of our proposed method in
detecting unauthorized data usages. Code:
https://github.com/ZhentingWang/DIAGNOSIS. |
This paper proposes DIAGNOSIS, a method for detecting unauthorized data usage in text-to-image diffusion models by injecting element-level memorizations into models trained on protected datasets. |
The increasing use of text-to-image diffusion models raises concerns about unauthorized data usage, necessitating techniques to detect and prevent such misuse. |
DIAGNOSIS modifies protected images using stealthy warping functions (signal functions) before release. A signal classifier is trained to detect the presence of this warping in generated images. By analyzing the memorization strength of a given model on the signal function, unauthorized usage can be determined. |
DIAGNOSIS achieves 100.0% detection accuracy on various text-to-image diffusion models and training methods.
The method has minimal impact on the generation quality of models trained on protected datasets.
DIAGNOSIS is robust against adaptive infringers employing strong image augmentations. |
The current signal function focuses on image warping; exploring other stealthy modifications could be beneficial.
Investigating the impact of infringers utilizing a portion of the dataset for training is an area for future work. |
unauthorized data usage, text-to-image diffusion models, memorization, data protection, copyright infringement |
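The detection step in the DIAGNOSIS entry above (measure how strongly a model has memorized the injected signal, then decide) can be sketched roughly as below. This is an assumption-laden simplification: the signal classifier is replaced by a precomputed list of boolean outputs, and the fixed `threshold` is a hypothetical decision rule, not the criterion used in the paper.

```python
def memorization_strength(signal_flags):
    """Fraction of generated images in which the signal classifier detects the warp."""
    if not signal_flags:
        raise ValueError("need at least one classified image")
    return sum(1 for flag in signal_flags if flag) / len(signal_flags)


def is_flagged(signal_flags, threshold=0.5):
    """Flag a model when its memorization strength reaches the (assumed) threshold."""
    return memorization_strength(signal_flags) >= threshold


# Hypothetical classifier outputs on images sampled from two inspected models.
suspect_outputs = [True] * 9 + [False]   # most samples carry the warping signal
benign_outputs = [False] * 9 + [True]    # an occasional false positive only

suspect_strength = memorization_strength(suspect_outputs)
benign_strength = memorization_strength(benign_outputs)
```

With these toy inputs, `is_flagged(suspect_outputs)` is true and `is_flagged(benign_outputs)` is false. A real deployment would need a principled way to set the decision boundary given the classifier's error rate.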
2307.02953
Report |
SegNetr: Rethinking the local-global interactions and skip connections in U-shaped networks |
Junlong Cheng, Chengrui Gao, Fengjie Wang, Min Zhu |
Recently, U-shaped networks have dominated the field of medical image
segmentation due to their simple and easily tuned structure. However, existing
U-shaped segmentation networks: 1) mostly focus on designing complex
self-attention modules to compensate for the lack of long-term dependence based
on convolution operation, which increases the overall number of parameters and
computational complexity of the network; 2) simply fuse the features of encoder
and decoder, ignoring the connection between their spatial locations. In this
paper, we rethink the above problem and build a lightweight medical image
segmentation network, called SegNetr. Specifically, we introduce a novel
SegNetr block that can perform local-global interactions dynamically at any
stage and with only linear complexity. At the same time, we design a general
information retention skip connection (IRSC) to preserve the spatial location
information of encoder features and achieve accurate fusion with the decoder
features. We validate the effectiveness of SegNetr on four mainstream medical
image segmentation datasets, with 59% and 76% fewer parameters and GFLOPs
than vanilla U-Net, while achieving segmentation performance comparable to
state-of-the-art methods. Notably, the components proposed in this paper can be
applied to other U-shaped networks to improve their segmentation performance. |
This paper presents SegNetr, a lightweight U-shaped medical image segmentation network that improves local-global interaction and skip connections. |
Existing U-shaped networks often rely on computationally expensive self-attention modules or simplistic feature fusion, limiting their efficiency and performance. |
SegNetr introduces: (1) SegNetr blocks for dynamic local-global interaction with linear complexity using parallel processing and window displacement. (2) Information retention skip connections (IRSC) to preserve spatial information from the encoder and enhance feature fusion with the decoder. |
SegNetr achieves comparable or superior segmentation performance to state-of-the-art methods on four medical image datasets (ISIC2017, PH2, TNSCUI, ACDC).
It significantly reduces computational cost, with 59% fewer parameters and 76% fewer GFLOPs than vanilla U-Net.
Ablation studies demonstrate the effectiveness of the proposed SegNetr blocks and IRSC, indicating their potential applicability in other U-shaped networks. |
The paper primarily focuses on 2D medical image segmentation, leaving extensions to 3D data for future exploration.
Further research could investigate the optimal patch size configuration for different datasets and segmentation tasks. |
medical image segmentation, u-net, local-global interaction, skip connections, deep learning |
2307.02609
Report |
MRecGen: Multimodal Appropriate Reaction Generator |
Jiaqi Xu, Cheng Luo, Weicheng Xie, Linlin Shen, Xiaofeng Liu, Lu Liu, Hatice Gunes, Siyang Song |
Verbal and non-verbal human reaction generation is a challenging task, as
different reactions could be appropriate for responding to the same behaviour.
This paper proposes the first multiple and multimodal (verbal and nonverbal)
appropriate human reaction generation framework that can generate appropriate
and realistic human-style reactions (displayed in the form of synchronised
text, audio and video streams) in response to an input user behaviour. This
novel technique can be applied to various human-computer interaction scenarios
by generating appropriate virtual agent/robot behaviours. Our demo is available
at https://github.com/SSYSteve/MRecGen. |
This paper proposes MRecGen, the first multiple and multimodal appropriate human reaction generation framework that produces synchronized text, audio, and video streams of realistic reactions to user behavior. |
Generating realistic and appropriate reactions to human behavior is challenging due to the 'one-to-many mapping' problem where the same behavior can elicit various valid responses. Existing methods struggle with this and lack multimodal capabilities, limiting their applicability in human-computer interaction. |
MRecGen uses a four-module deep learning approach: (1) User behavior encoding (UBE) module encodes multimodal user input. (2) Appropriate reaction prediction (ARP) module predicts a distribution of suitable multimodal reactions. (3) Behaviour synchronisation (BS) module aligns and synchronizes user and reaction representations. (4) Reaction display (RD) module generates the final text, audio, and facial video output. |
According to user studies, MRecGen generates appropriate textual, audio, and facial reactions.
The framework achieves high lip sync quality in the generated videos.
User study results demonstrate the effectiveness of MRecGen in generating appropriate and realistic reactions. |
The current demo primarily focuses on generating reactions from a single identity, limiting its generalizability.
Future work will investigate incorporating personality traits into the reaction generation process to enhance personalization. |
human-computer interaction, multimodal generation, deep learning, reaction generation, virtual agents |
2307.02321
Report |
MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers |
Jakob Drachmann Havtorn, Amelie Royer, Tijmen Blankevoort, Babak Ehteshami Bejnordi |
The input tokens to Vision Transformers carry little semantic meaning as they
are defined as regular equal-sized patches of the input image, regardless of
its content. However, processing uniform background areas of an image should
not necessitate as much compute as dense, cluttered areas. To address this
issue, we propose a dynamic mixed-scale tokenization scheme for ViT, MSViT. Our
method introduces a conditional gating mechanism that selects the optimal token
scale for every image region, such that the number of tokens is dynamically
determined per input. In addition, to enhance the conditional behavior of the
gate during training, we introduce a novel generalization of the batch-shaping
loss. We show that our gating module is able to learn meaningful semantics
despite operating locally at the coarse patch-level. The proposed gating module
is lightweight, agnostic to the choice of transformer backbone, and trained
within a few epochs with little training overhead. Furthermore, in contrast to
token pruning, MSViT does not lose information about the input, thus can be
readily applied for dense tasks. We validate MSViT on the tasks of
classification and segmentation where it leads to improved accuracy-complexity
trade-off. |
This paper proposes MSViT, a Vision Transformer that dynamically selects the optimal token scale for different image regions, thereby reducing the number of input tokens and computational cost. |
Standard ViTs process images with a fixed token size, leading to computational redundancy, especially in uniform background areas. Dynamically adjusting token scale based on image content can improve efficiency. |
A lightweight gating MLP is introduced to select between coarse and fine token scales for each image region. To optimize this conditional gating mechanism, a novel Generalized Batch-Shaping (GBaS) loss is proposed, and an adaptive trimming strategy reduces training overhead. |
MSViT consistently improves the accuracy-complexity trade-off compared to standard ViTs across different backbones, pretraining methods, and input sizes.
The learned gating mechanism effectively captures meaningful semantics to distinguish background from foreground even with local information.
The pretrained MSViT gate transfers well to other tasks like semantic segmentation and can be combined with token pruning in hierarchical ViTs for further efficiency gains. |
The current design explores two token scales. Investigating more scales might further improve performance but add complexity.
The interaction between the gate's coarse scale, the base patch scale, and the attention window size in hierarchical ViTs requires further investigation to optimize token scale selection across layers. |
vision transformer, tokenization, mixed-scale, efficiency, conditional computing |
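As a rough illustration of the MSViT mixed-scale idea summarized above, the following sketch counts how many tokens remain once a gate has chosen a scale for each coarse region. It is not the authors' implementation: the gate outputs are given directly as probabilities rather than produced by an MLP, and the 0.5 threshold, the factor of 4 fine tokens per coarse region, and the function name are all assumptions made for the example.

```python
def mixed_scale_token_count(gate_probs, fine_per_coarse=4, threshold=0.5):
    """Decide a scale per coarse region and count the resulting tokens.

    gate_probs: one gate output in [0, 1] per coarse region (hypothetical values).
    fine_per_coarse: assumed number of fine tokens covering one coarse region.
    threshold: assumed cut-off above which a region is tokenized finely.
    """
    decisions = [p >= threshold for p in gate_probs]
    # A finely tokenized region contributes several tokens, a coarse one only one.
    tokens = sum(fine_per_coarse if fine else 1 for fine in decisions)
    return decisions, tokens


# Four coarse regions: two judged uniform (coarse), two judged detailed (fine).
probs = [0.1, 0.9, 0.2, 0.7]
decisions, tokens = mixed_scale_token_count(probs)
full_tokens = len(probs) * 4  # token count if every region used the fine scale
print(decisions, tokens, full_tokens)
```

Because every region still contributes at least one token, no part of the input is dropped, which is the property the abstract contrasts with token pruning.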
2307.01831
Report |
DiT-3D: Exploring Plain Diffusion Transformers for 3D Shape Generation |
Shentong Mo, Enze Xie, Ruihang Chu, Lewei Yao, Lanqing Hong, Matthias Nießner, Zhenguo Li |
Recent Diffusion Transformers (e.g., DiT) have demonstrated their powerful
effectiveness in generating high-quality 2D images. However, it remains
unclear whether the Transformer architecture performs equally well in 3D
shape generation, as previous 3D diffusion methods mostly adopted the U-Net
architecture. To bridge this gap, we propose a novel Diffusion Transformer for
3D shape generation, namely DiT-3D, which can directly operate the denoising
process on voxelized point clouds using plain Transformers. Compared to
existing U-Net approaches, our DiT-3D is more scalable in model size and
produces much higher quality generations. Specifically, the DiT-3D adopts the
design philosophy of DiT but modifies it by incorporating 3D positional and
patch embeddings to adaptively aggregate input from voxelized point clouds. To
reduce the computational cost of self-attention in 3D shape generation, we
incorporate 3D window attention into Transformer blocks, as the increased 3D
token length resulting from the additional dimension of voxels can lead to high
computation. Finally, linear and devoxelization layers are used to predict the
denoised point clouds. In addition, our transformer architecture supports
efficient fine-tuning from 2D to 3D, where the pre-trained DiT-2D checkpoint on
ImageNet can significantly improve DiT-3D on ShapeNet. Experimental results on
the ShapeNet dataset demonstrate that the proposed DiT-3D achieves
state-of-the-art performance in high-fidelity and diverse 3D point cloud
generation. In particular, our DiT-3D decreases the 1-Nearest Neighbor Accuracy
of the state-of-the-art method by 4.59 and increases the Coverage metric by
3.51 when evaluated on Chamfer Distance. |
This paper introduces DiT-3D, a novel diffusion transformer architecture designed for 3D shape generation that leverages the denoising process of DDPM on 3D point clouds. |
Generating high-fidelity point clouds for 3D shape generation is a challenging and significant problem, and existing methods have limitations in terms of architecture and performance. |
The proposed DiT-3D model adapts the DiT framework by incorporating 3D positional and patch embeddings, 3D window attention, and devoxelized prediction to handle the unique characteristics of 3D point clouds. It also supports efficient fine-tuning from 2D to 3D using pre-trained DiT-2D weights. |
DiT-3D achieves state-of-the-art performance on the ShapeNet dataset, surpassing previous non-DDPM and DDPM-based 3D shape generation methods.
The proposed 3D adaptations, including voxel diffusion, 3D positional embeddings, and 3D window attention, significantly contribute to improving the quality and diversity of generated shapes.
DiT-3D exhibits strong scalability, allowing for flexible adjustments to patch sizes, voxel sizes, and model sizes. |
The model has yet to be explored on other 3D modalities like SDFs and meshes.
Scaling DiT-3D to large-scale training on more extensive 3D shape datasets is a potential area for future work. |
3d shape generation, diffusion models, transformers, point clouds, denoising |
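The DiT-3D abstract above motivates 3D window attention by the growth in token length when moving from 2D patches to 3D voxel patches. The arithmetic can be sketched as follows. The grid size of 32, patch size of 4, and window of 4 tokens per side are illustrative assumptions, not values taken from the source, and the helper names are hypothetical.

```python
def num_tokens(grid_size, patch_size, dims):
    """Number of non-overlapping patches in a cubic (or square) grid."""
    if grid_size % patch_size:
        raise ValueError("grid_size must be divisible by patch_size")
    return (grid_size // patch_size) ** dims


def attention_pairs(n_tokens, tokens_per_window=None):
    """Query-key pairs scored by attention.

    Global attention compares every token with every token; windowed
    attention compares each token only with the tokens in its own window.
    """
    if tokens_per_window is None:
        return n_tokens ** 2
    return n_tokens * tokens_per_window


tokens_2d = num_tokens(32, 4, dims=2)   # 8 x 8 patches
tokens_3d = num_tokens(32, 4, dims=3)   # 8 x 8 x 8 patches
global_pairs = attention_pairs(tokens_3d)
window_pairs = attention_pairs(tokens_3d, tokens_per_window=4 ** 3)
print(tokens_2d, tokens_3d, global_pairs, window_pairs)
```

Under these assumed sizes the 3D grid has eight times as many tokens as the 2D one, and restricting attention to windows cuts the number of scored pairs by the same factor of eight.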
2307.01425
Report |
Consistent Multimodal Generation via A Unified GAN Framework |
Zhen Zhu, Yijun Li, Weijie Lyu, Krishna Kumar Singh, Zhixin Shu, Soeren Pirk, Derek Hoiem |
We investigate how to generate multimodal image outputs, such as RGB, depth,
and surface normals, with a single generative model. The challenge is to
produce outputs that are realistic, and also consistent with each other. Our
solution builds on the StyleGAN3 architecture, with a shared backbone and
modality-specific branches in the last layers of the synthesis network, and we
propose per-modality fidelity discriminators and a cross-modality consistency
discriminator. In experiments on the Stanford2D3D dataset, we demonstrate
realistic and consistent generation of RGB, depth, and normal images. We also
show a training recipe to easily extend our pretrained model to a new domain,
even with only a few paired samples. We further evaluate the use of synthetically
generated RGB and depth pairs for training or fine-tuning depth estimators.
Code will be available at https://github.com/jessemelpolio/MultimodalGAN. |
This paper presents MultimodalGAN, a unified GAN framework for generating consistent multi-modal images (e.g., RGB, depth, surface normals) using a shared representation. |
Generating realistic and consistent multi-modal data is crucial for training vision models, especially when real data is scarce. |
The method builds on StyleGAN3, employing a shared backbone and modality-specific branches. It introduces fidelity discriminators (per modality) and a consistency discriminator to ensure realism and cross-modal consistency. A simple, unified data augmentation strategy is used across modalities. |
MultimodalGAN generates realistic and consistent RGB, depth, and normal images, outperforming previous methods on the Stanford2D3D dataset.
The model can be effectively fine-tuned for new domains with limited paired data, enabling generation of missing modalities.
Synthetic RGB and depth pairs generated by the model improve the performance of downstream depth estimation tasks. |
The current work focuses on three specific modalities (RGB, depth, normals).
Future work could explore the generation of more diverse modalities or leverage other generative models like diffusion models. |
generative adversarial networks, multimodal generation, data augmentation, depth estimation, surface normal estimation |
2307.01197
Report |
Segment Anything Meets Point Tracking |
Frano Rajič, Lei Ke, Yu-Wing Tai, Chi-Keung Tang, Martin Danelljan, Fisher Yu |
The Segment Anything Model (SAM) has established itself as a powerful
zero-shot image segmentation model, enabled by efficient point-centric
annotation and prompt-based models. While click and brush interactions are both
well explored in interactive image segmentation, the existing methods on videos
focus on mask annotation and propagation. This paper presents SAM-PT, a novel
method for point-centric interactive video segmentation, empowered by SAM and
long-term point tracking. SAM-PT leverages robust and sparse point selection
and propagation techniques for mask generation. Compared to traditional
object-centric mask propagation strategies, we uniquely use point propagation
to exploit local structure information agnostic to object semantics. We
highlight the merits of point-based tracking through direct evaluation on the
zero-shot open-world Unidentified Video Objects (UVO) benchmark. Our
experiments on popular video object segmentation and multi-object segmentation
tracking benchmarks, including DAVIS, YouTube-VOS, and BDD100K, suggest that a
point-based segmentation tracker yields better zero-shot performance and
efficient interactions. We release our code that integrates different point
trackers and video segmentation benchmarks at https://github.com/SysCV/sam-pt. |
This paper introduces SAM-PT, a novel method for interactive video segmentation that combines the Segment Anything Model (SAM) with long-term point tracking, enabling zero-shot video object segmentation. |
Existing interactive video segmentation methods struggle with unseen objects and rely on laborious mask annotations. SAM-PT addresses these limitations by leveraging SAM's generalization ability and the efficiency of point-based tracking. |
SAM-PT selects query points on the first frame, tracks them throughout the video using point trackers like CoTracker, and prompts SAM with the tracked points to generate per-frame segmentation masks. An optional point reinitialization strategy improves tracking accuracy over time. |
SAM-PT achieves state-of-the-art zero-shot video object segmentation performance, outperforming previous methods on DAVIS, YouTube-VOS, and BDD100K datasets.
It also surpasses a fully-supervised method on the open-world UVO dataset, demonstrating its strong generalization ability.
In interactive settings, SAM-PT significantly reduces annotation effort, approaching the performance of fully supervised methods. |
SAM-PT's performance depends on the accuracy of the underlying point tracker, which can be challenged by occlusions and fast-moving objects.
The current implementation relies on high-resolution inputs for both SAM and point trackers, limiting real-time performance. |
video segmentation, interactive segmentation, zero-shot learning, point tracking, segment anything model |
2307.01187
Report |
SAMAug: Point Prompt Augmentation for Segment Anything Model |
Haixing Dai, Chong Ma, Zhiling Yan, Zhengliang Liu, Enze Shi, Yiwei Li, Peng Shu, Xiaozheng Wei, Lin Zhao, Zihao Wu, Fang Zeng, Dajiang Zhu, Wei Liu, Quanzheng Li, Lichao Sun, Shu Zhang, Tianming Liu, Xiang Li |
This paper introduces SAMAug, a novel visual point augmentation method for
the Segment Anything Model (SAM) that enhances interactive image segmentation
performance. SAMAug generates augmented point prompts to provide more
information about the user's intention to SAM. Starting with an initial point
prompt, SAM produces an initial mask, which is then fed into our proposed
SAMAug to generate augmented point prompts. By incorporating these extra
points, SAM can generate augmented segmentation masks based on both the
augmented point prompts and the initial prompt, resulting in improved
segmentation performance. We conducted evaluations using four different point
augmentation strategies: random sampling, sampling based on maximum difference
entropy, maximum distance, and saliency. Experimental results on the COCO,
Fundus, COVID QU-Ex, and ISIC2018 datasets show that SAMAug can boost SAM's
segmentation results, especially using the maximum distance and saliency.
SAMAug demonstrates the potential of visual prompt augmentation for computer
vision. Code for SAMAug is available at github.com/yhydhx/SAMAug. |
This paper introduces SAMAug, a visual point augmentation method for the Segment Anything Model (SAM) to enhance interactive image segmentation. |
SAM, while powerful, can be ambiguous with limited prompt information like a single point. SAMAug addresses this by generating augmented point prompts to better guide the model. |
SAMAug leverages an initial point prompt and the resulting SAM mask to generate additional point prompts using four strategies: random sampling, maximum difference entropy, maximum distance, and saliency. |
SAMAug consistently improves SAM's segmentation performance across COCO, Fundus, COVID QU-Ex, and ISIC2018 datasets.
Maximum distance and saliency-based augmentation strategies demonstrate superior performance.
Bounding box prompts generally outperform point prompts, but their augmentation proves less effective. |
The optimal augmentation strategy appears dataset-dependent, requiring further investigation.
Adding multiple augmented points does not necessarily yield better results, indicating a need for refined multi-point strategies. |
prompt augmentation, segment anything model, visual prompting, interactive segmentation, image segmentation |
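One of the SAMAug strategies listed above, maximum distance, can be sketched in a few lines. The reading used here, picking the pixel inside the initial mask that lies farthest (in Euclidean distance) from the initial point prompt, is an assumption about the strategy, and the function name, mask layout, and tie-breaking rule are hypothetical choices for the example rather than the authors' code.

```python
def max_distance_point(mask, init_point):
    """Return the (row, col) inside the mask farthest from init_point.

    mask: list of rows containing 0/1 values (1 marks the initial SAM mask).
    init_point: (row, col) of the user's initial point prompt.
    Ties keep the first candidate found in row-major order.
    """
    r0, c0 = init_point
    best, best_d2 = None, -1
    for r, row in enumerate(mask):
        for c, inside in enumerate(row):
            if inside:
                d2 = (r - r0) ** 2 + (c - c0) ** 2  # squared Euclidean distance
                if d2 > best_d2:
                    best, best_d2 = (r, c), d2
    if best is None:
        raise ValueError("mask contains no foreground pixels")
    return best


initial_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
]
augmented_point = max_distance_point(initial_mask, (1, 1))
print(augmented_point)
```

The initial point and the returned point would then be passed to SAM together as the augmented prompt.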
2307.00997
Report |
RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation |
Yonglin Li, Jing Zhang, Xiao Teng, Long Lan |
The Segment Anything Model (SAM) has gained significant attention for its
impressive performance in image segmentation. However, it lacks proficiency in
referring video object segmentation (RVOS) due to the need for precise
user-interactive prompts and a limited understanding of different modalities,
such as language and vision. This paper presents the RefSAM model, which
explores the potential of SAM for RVOS by incorporating multi-view information
from diverse modalities and successive frames at different timestamps in an
online manner. Our proposed approach adapts the original SAM model to enhance
cross-modality learning by employing a lightweight Cross-Modal MLP that
projects the text embedding of the referring expression into sparse and dense
embeddings, serving as user-interactive prompts. Additionally, we have
introduced the hierarchical dense attention module to fuse hierarchical visual
semantic information with sparse embeddings in order to obtain fine-grained
dense embeddings, and an implicit tracking module to generate a track token and
provide historical information for the mask decoder. Furthermore, we employ a
parameter-efficient tuning strategy to effectively align and fuse the language
and vision features. Through comprehensive ablation studies, we demonstrate the
practical and effective design choices of our model. Extensive experiments
conducted on Ref-Youtube-VOS, Ref-DAVIS17, and three referring image segmentation
datasets validate the superiority and effectiveness of our RefSAM model over
existing methods. The code and models will be made publicly available at
https://github.com/LancasterLi/RefSAM. |
This paper proposes RefSAM, an end-to-end framework adapting the Segment Anything Model (SAM) for referring video object segmentation (RVOS) by incorporating multi-view information from different modalities and video frames. |
SAM, while powerful in image segmentation, lacks proficiency in RVOS due to limitations in handling user prompts and understanding different modalities like language. |
RefSAM adapts SAM by employing a Cross-Modal MLP to project text embeddings into prompts, a hierarchical dense attention module to fuse visual and textual features, and an implicit tracking module for temporal consistency. |
RefSAM outperforms state-of-the-art methods on Ref-DAVIS17.
RefSAM achieves competitive performance on Ref-Youtube-VOS while being more parameter-efficient.
Ablation studies demonstrate the effectiveness of key modules and the parameter-efficient tuning strategy. |
RefSAM's performance on Ref-Youtube-VOS, while competitive, falls slightly short of the state-of-the-art.
Future work could explore more advanced designs for enhanced cross-modal fusion. |
video object segmentation, vision transformer, language and vision, segment anything, referring video object segmentation |
2307.00910
Report |
CoPL: Contextual Prompt Learning for Vision-Language Understanding |
Koustava Goswami, Srikrishna Karanam, Prateksha Udhayanan, K J Joseph, Balaji Vasan Srinivasan |
Recent advances in multimodal learning have resulted in powerful
vision-language models, whose representations are generalizable across a
variety of downstream tasks. Recently, their generalization ability has been
further extended by incorporating trainable prompts, borrowed from the natural
language processing literature. While such prompt learning techniques have
shown impressive results, we observe that these prompts are trained on
global image features, which limits them in two respects: First, by using
global features, these prompts could be focusing less on the discriminative
foreground image, resulting in poor generalization to various
out-of-distribution test cases. Second, existing work weights all prompts
equally, whereas intuitively prompts should be reweighted according to the
semantics of the image. We address these as part of our proposed Contextual
Prompt Learning (CoPL) framework, capable of aligning the prompts to the
localized features of the image. Our key innovations over earlier works include
using local image features as part of the prompt learning process, and more
crucially, learning to weight these prompts based on local features that are
appropriate for the task at hand. This gives us dynamic prompts that are both
aligned to local image features and aware of local contextual
relationships. Our extensive experiments on a variety of standard and
few-shot datasets show that our method produces substantially improved
performance when compared to the current state-of-the-art methods. We also
demonstrate both few-shot and out-of-distribution performance to establish the
utility of learning dynamic prompts that are aligned to local image features. |
This paper introduces CoPL (Contextual Prompt Learning), a novel method for image classification that enhances the generalization of pre-trained vision-language models by aligning prompts with local image features and dynamically weighting them based on semantic relevance. |
Existing prompt-based methods often rely on global image features, neglecting discriminative local information and treating all prompts equally, limiting their adaptability to diverse tasks and datasets. |
CoPL utilizes local image features to determine semantically meaningful prompts. It employs an attention mechanism to generate context representations by comparing learnable prompt tokens with patch representations. These context vectors dynamically weight and update the prompts, making them contextually aware. |
CoPL consistently outperforms baselines, including CLIP, CoOp, and CoCoOp, on 11 image classification datasets, demonstrating superior generalization to unseen classes and few-shot scenarios.
CoPL achieves state-of-the-art zero-shot performance, surpassing CLIP by 1.4% in accuracy on average across datasets.
The method exhibits strong inter-dataset transferability, effectively classifying images from unseen datasets after training on a different dataset. |
CoPL's performance may be limited on datasets lacking salient local features, such as EuroSAT, where global context is more critical.
Future work includes extending CoPL to incorporate user intents for local image editing tasks, leveraging its ability to understand and manipulate local image content. |
prompt learning, image classification, vision-language models, few-shot learning, zero-shot learning |
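The CoPL summary above describes comparing learnable prompt tokens with patch representations through attention to obtain context vectors that update the prompts. A minimal sketch of that pattern follows. It assumes plain dot-product attention and an additive update, neither of which is confirmed by the source, and all names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contextualize_prompts(prompts, patches):
    """Update each prompt token with an attention-weighted patch summary.

    prompts: list of prompt token vectors.
    patches: list of local patch feature vectors of the same dimension.
    """
    updated = []
    for prompt in prompts:
        # Attention weights: how relevant each patch is to this prompt token.
        weights = softmax([dot(prompt, patch) for patch in patches])
        # Context vector: weighted average of the patch features.
        context = [
            sum(w * patch[i] for w, patch in zip(weights, patches))
            for i in range(len(prompt))
        ]
        # Assumed update rule: add the context to the prompt token.
        updated.append([p + c for p, c in zip(prompt, context)])
    return updated


prompts = [[1.0, 0.0]]
patches = [[2.0, 0.0], [2.0, 0.0]]
updated_prompts = contextualize_prompts(prompts, patches)
print(updated_prompts)
```

With two identical patches the attention weights are equal, so the context vector is simply that patch feature and the prompt shifts toward it.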
2307.00764
Report |
Hierarchical Open-vocabulary Universal Image Segmentation |
Xudong Wang, Shufan Li, Konstantinos Kallidromitis, Yusuke Kato, Kazuki Kozuka, Trevor Darrell |
Open-vocabulary image segmentation aims to partition an image into semantic
regions according to arbitrary text descriptions. However, complex visual
scenes can be naturally decomposed into simpler parts and abstracted at
multiple levels of granularity, introducing inherent segmentation ambiguity.
Unlike existing methods that typically sidestep this ambiguity and treat it as
an external factor, our approach actively incorporates a hierarchical
representation encompassing different semantic-levels into the learning
process. We propose a decoupled text-image fusion mechanism and representation
learning modules for both "things" and "stuff". Additionally, we systematically
examine the differences that exist in the textual and visual features between
these types of categories. Our resulting model, named HIPIE, tackles
HIerarchical, oPen-vocabulary, and unIvErsal segmentation tasks within a
unified framework. Benchmarked on over 40 datasets, e.g., ADE20K, COCO,
Pascal-VOC Part, RefCOCO/RefCOCOg, ODinW and SeginW, HIPIE achieves the
state-of-the-art results at various levels of image comprehension, including
semantic-level (e.g., semantic segmentation), instance-level (e.g.,
panoptic/referring segmentation and object detection), as well as part-level
(e.g., part/subpart segmentation) tasks. Our code is released at
https://github.com/berkeley-hipie/HIPIE. |
HIPIE, a novel hierarchical, open-vocabulary, and universal image segmentation and detection model, effectively addresses inherent segmentation ambiguity using a hierarchical representation encompassing different semantic levels. |
Existing open-vocabulary image segmentation methods often treat segmentation ambiguity as an external factor. This work directly tackles this issue by incorporating a hierarchical representation into the learning process, allowing for more comprehensive and nuanced image analysis. |
The model uses decoupled text-image fusion and representation learning modules for "things" (countable objects) and "stuff" (uncountable regions) based on observed discrepancies in their visual and textual features. It employs a pretrained BERT for text features and ResNet-50 or ViT for image features. It utilizes early fusion for things and late fusion for stuff during mask generation. For hierarchical segmentation, class names from different granularity levels are concatenated as prompts during training and inference. |
HIPIE achieves state-of-the-art performance on over 40 datasets across various segmentation tasks, including semantic, instance, panoptic, referring, and part segmentation.
The decoupled representation learning and text-image fusion for things and stuff significantly improve performance compared to unified approaches.
The model effectively generalizes to novel part classes, demonstrating its open-vocabulary hierarchical segmentation capability. |
Future work will focus on applying HIPIE to video-related tasks and further evaluating its performance on video object tracking and segmentation.
Exploring the impact of additional pretraining of the vision encoder on large-scale datasets like SA-1B and incorporating supplementary hierarchical datasets will be beneficial. |
open-vocabulary segmentation, hierarchical segmentation, universal segmentation, text-image fusion, representation learning |
2307.00716
Report |
JourneyDB: A Benchmark for Generative Image Understanding |
Keqiang Sun, Junting Pan, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, Jifeng Dai, Yu Qiao, Limin Wang, Hongsheng Li |
While recent advancements in vision-language models have had a transformative
impact on multi-modal comprehension, the extent to which these models possess
the ability to comprehend generated images remains uncertain. Synthetic images,
in comparison to real data, encompass a higher level of diversity in terms of
both content and style, thereby presenting significant challenges for the
models to fully grasp. In light of this challenge, we introduce a comprehensive
dataset, referred to as JourneyDB, that caters to the domain of generative
images within the context of multi-modal visual understanding. Our meticulously
curated dataset comprises 4 million distinct and high-quality generated images,
each paired with the corresponding text prompts that were employed in their
creation. Furthermore, we additionally introduce an external subset with
results of another 22 text-to-image generative models, which makes JourneyDB a
comprehensive benchmark for evaluating the comprehension of generated images.
On our dataset, we have devised four benchmarks to assess the performance of
generated image comprehension in relation to both content and style
interpretation. These benchmarks encompass prompt inversion, style retrieval,
image captioning, and visual question answering. Lastly, we evaluate the
performance of state-of-the-art multi-modal models when applied to the
JourneyDB dataset, providing a comprehensive analysis of their strengths and
limitations in comprehending generated content. We anticipate that the proposed
dataset and benchmarks will facilitate further research in the field of
generative content understanding. The dataset is publicly available at
https://journeydb.github.io. |
This paper introduces JourneyDB, a large-scale dataset of 4 million generated image-prompt pairs designed for evaluating multi-modal visual understanding in the context of AI-generated images. |
Existing vision-language models are primarily trained on real images and struggle to comprehend the unique characteristics of generated images, which often depict fictional scenes and complex styles. |
The dataset was created by collecting image-prompt pairs from Midjourney, a text-to-image generation platform. GPT-3.5 was used to generate captions, separate prompts into content and style categories, and create visual question answering annotations. |
Existing multi-modal models perform poorly on JourneyDB benchmarks compared to real image datasets, highlighting their limitations in understanding generated content.
Fine-tuning models on JourneyDB significantly improves their performance, indicating the dataset's value for training models on generative content.
Prompt inversion, style retrieval, and visual question answering tasks on JourneyDB reveal specific challenges for models in understanding content and style nuances within generated images. |
Potential misalignment between some images and prompts may introduce noise into the annotations.
Future work could explore expanding JourneyDB with images from other text-to-image generation models and incorporating human feedback to refine annotations. |
generated images, multi-modal understanding, vision-language models, text-to-image generation, benchmark dataset |
2307.00619
Report |
Solving Linear Inverse Problems Provably via Posterior Sampling with Latent Diffusion Models |
Litu Rout, Negin Raoof, Giannis Daras, Constantine Caramanis, Alexandros G. Dimakis, Sanjay Shakkottai |
We present the first framework to solve linear inverse problems leveraging
pre-trained latent diffusion models. Previously proposed algorithms (such as
DPS and DDRM) only apply to pixel-space diffusion models. We theoretically
analyze our algorithm showing provable sample recovery in a linear model
setting. The algorithmic insight obtained from our analysis extends to more
general settings often considered in practice. Experimentally, we outperform
previously proposed posterior sampling algorithms in a wide variety of problems
including random inpainting, block inpainting, denoising, deblurring,
destriping, and super-resolution. |
This paper presents the first framework to solve linear inverse problems using pre-trained latent diffusion models, enabling the use of foundation models like Stable Diffusion without finetuning. |
Existing algorithms for inverse problems are limited to pixel-space diffusion models, preventing the utilization of powerful latent-based foundation models. |
The method extends Diffusion Posterior Sampling (DPS) with a "gluing" objective that guides the diffusion process towards latents consistent with both measurements and the decoder-encoder mapping. Theoretical analysis proves sample recovery in a linear model setting with a two-step diffusion process. |
The proposed Posterior Sampling with Latent Diffusion (PSLD) algorithm achieves state-of-the-art results on inpainting, block inpainting, denoising, deblurring, destriping, and super-resolution tasks.
PSLD outperforms DPS on both in-distribution (FFHQ 256) and out-of-distribution (ImageNet 256) datasets using Stable Diffusion.
Theoretical analysis demonstrates PSLD's advantage in avoiding the curse of ambient dimension associated with pixel-space diffusion models. |
The evaluation is based on Stable Diffusion, inheriting potential biases from the LAION dataset.
The paper focuses on linear inverse problems, leaving extension to non-linear problems for future work. |
latent diffusion models, inverse problems, posterior sampling, stable diffusion, image restoration |
2307.00522
Report |
LEDITS: Real Image Editing with DDPM Inversion and Semantic Guidance |
Linoy Tsaban, Apolinário Passos |
Recent large-scale text-guided diffusion models provide powerful
image-generation capabilities. Currently, significant effort is devoted to
enabling the modification of these images using text only, as a means of offering
intuitive and versatile editing. However, editing proves to be difficult for
these generative models due to the inherent nature of editing techniques, which
involves preserving certain content from the original image. Conversely, in
text-based models, even minor modifications to the text prompt frequently
result in an entirely distinct result, making attaining one-shot generation
that accurately corresponds to the user's intent exceedingly challenging. In
addition, to edit a real image using these state-of-the-art tools, one must
first invert the image into the pre-trained model's domain, adding another
factor affecting the edit quality, as well as latency. In this exploratory
report, we propose LEDITS - a combined lightweight approach for real-image
editing, incorporating the Edit Friendly DDPM inversion technique with Semantic
Guidance, thus extending Semantic Guidance to real image editing, while
harnessing the editing capabilities of DDPM inversion as well. This approach
achieves versatile edits, both subtle and extensive, as well as alterations in
composition and style, while requiring neither optimization nor extensions to the
architecture. |
The paper proposes LEDITS, a novel approach for editing real images by combining Edit Friendly DDPM inversion with Semantic Guidance (SEGA). |
LEDITS enables intuitive and versatile editing of real images within the latent space of text-guided diffusion models, addressing the limitations of existing methods that struggle with preserving content and achieving fine-grained control. |
LEDITS first inverts a real image into the latent space using Edit Friendly DDPM inversion. Then, it applies SEGA during the denoising process, utilizing pre-computed noise vectors from the inversion step to guide the generation towards the desired edits specified by text prompts and semantic concepts. |
LEDITS achieves a balance between fidelity to the original image and the creativity of the edit, allowing for both subtle and extensive modifications.
The method offers flexibility and versatility by enabling independent editing operations with DDPM inversion and SEGA, leading to more diverse outputs.
LEDITS retains the strengths of both DDPM inversion and SEGA, such as preserving image semantics, achieving high fidelity to editing prompts, and demonstrating robustness and monotonicity in semantic guidance. |
The paper primarily focuses on qualitative analysis and leaves quantitative evaluations for future work.
Further exploration of the interplay between DDPM inversion parameters and SEGA guidance scales is suggested. |
image editing, diffusion models, ddpm inversion, semantic guidance, text-guided image manipulation |
2307.00430
Report |
WaveMixSR: A Resource-efficient Neural Network for Image Super-resolution |
Pranav Jeevan, Akella Srinidhi, Pasunuri Prathiba, Amit Sethi |
Image super-resolution research has recently been dominated by transformer models
which need higher computational resources than CNNs due to the quadratic
complexity of self-attention. We propose a new neural network -- WaveMixSR --
for image super-resolution based on the WaveMix architecture which uses a
2D-discrete wavelet transform for spatial token-mixing. Unlike
transformer-based models, WaveMixSR does not unroll the image as a sequence of
pixels/patches. It uses the inductive bias of convolutions along with the
lossless token-mixing property of wavelet transform to achieve higher
performance while requiring fewer resources and training data. We compare the
performance of our network with other state-of-the-art methods for image
super-resolution. Our experiments show that WaveMixSR achieves competitive
performance on all datasets and reaches state-of-the-art performance on the
BSD100 dataset on multiple super-resolution tasks. Our model is able to achieve
this performance using less training data and computational resources while
maintaining high parameter efficiency compared to current state-of-the-art
models. |
Proposes WaveMixSR, a novel wavelet-based neural network architecture for image super-resolution, employing 2D discrete wavelet transform (DWT) for efficient spatial token mixing. |
Addresses limitations of transformer-based SR models, such as high computational cost and data requirements, by leveraging the efficiency and inductive bias of DWT and CNNs. |
Constructs a two-path network where the luminance (Y) channel undergoes upsampling, feature extraction using multiple WaveMix blocks (containing DWT, convolutions, and MLPs), and reconstruction, while the chrominance (CbCr) channels are upsampled separately. |
Achieves state-of-the-art performance on the BSD100 dataset for multiple SR tasks, outperforming transformer-based methods.
Demonstrates high parameter efficiency and reduced computational complexity compared to transformer-based models, requiring fewer resources and training data.
Successfully reconstructs high-frequency details and sharp images, as evidenced by visual results and quantitative metrics (PSNR, SSIM). |
Performance could be improved further by training on larger datasets (DF2K) and using pre-training techniques.
The benefits of adversarial training for further enhancement remain to be investigated. |
image super-resolution, wavelet transform, token mixing, deep learning, computer vision |
2307.00407
Report |
WavePaint: Resource-efficient Token-mixer for Self-supervised Inpainting |
Pranav Jeevan, Dharshan Sampath Kumar, Amit Sethi |
Image inpainting, which refers to the synthesis of missing regions in an
image, can help restore occluded or degraded areas and also serve as a
precursor task for self-supervision. The current state-of-the-art models for
image inpainting are computationally heavy as they are based on transformer or
CNN backbones that are trained in adversarial or diffusion settings. This paper
diverges from vision transformers by using a computationally-efficient
WaveMix-based fully convolutional architecture -- WavePaint. It uses a
2D-discrete wavelet transform (DWT) for spatial and multi-resolution
token-mixing along with convolutional layers. The proposed model outperforms
the current state-of-the-art models for image inpainting on reconstruction
quality while also using less than half the parameter count and considerably
lower training and evaluation times. Our model even outperforms current
GAN-based architectures on the CelebA-HQ dataset without using an adversarially
trainable discriminator. Our work suggests that neural architectures that are
modeled after natural image priors require fewer parameters and computations to
achieve generalization comparable to transformers. |
This paper presents WavePaint, a computationally-efficient, fully convolutional model based on WaveMix for high-quality image inpainting. |
Current state-of-the-art inpainting models heavily rely on computationally expensive transformers or CNNs, requiring substantial resources and training time. |
WavePaint leverages the power of 2D discrete wavelet transform (DWT) for spatial and multi-resolution token mixing, enabling efficient global context understanding. |
WavePaint achieves comparable, and in some cases superior, results to SOTA models on CelebA-HQ using fewer parameters and significantly faster training and inference.
It outperforms larger models like LaMa in terms of FID score, parameter count, GPU memory usage, and speed.
The model demonstrates the effectiveness of wavelet token mixing for realistic image generation from masked images without requiring adversarial or diffusion-based training. |
The study primarily focuses on large mask inpainting and doesn't address blind mask inpainting.
Future work includes exploring WavePaint's potential for resource-efficient image generation in adversarial or diffusion settings. |
image inpainting, wavelet transform, token mixing, wavemix, image generation |
2307.00398
Report |
ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models |
Uddeshya Upadhyay, Shyamgopal Karthik, Massimiliano Mancini, Zeynep Akata |
Large-scale vision-language models (VLMs) like CLIP successfully find
correspondences between images and text. Through the standard deterministic
mapping process, an image or a text sample is mapped to a single vector in the
embedding space. This is problematic: as multiple samples (images or text) can
abstract the same concept in the physical world, deterministic embeddings do
not reflect the inherent ambiguity in the embedding space. We propose ProbVLM,
a probabilistic adapter that estimates probability distributions for the
embeddings of pre-trained VLMs via inter/intra-modal alignment in a post-hoc
manner without needing large-scale datasets or computing. On four challenging
datasets, i.e., COCO, Flickr, CUB, and Oxford-flowers, we estimate the
multi-modal embedding uncertainties for two VLMs, i.e., CLIP and BLIP, quantify
the calibration of embedding uncertainties in retrieval tasks and show that
ProbVLM outperforms other methods. Furthermore, we propose active learning and
model selection as two real-world downstream tasks for VLMs and show that the
estimated uncertainty aids both tasks. Lastly, we present a novel technique for
visualizing the embedding distributions using a large-scale pre-trained latent
diffusion model. Code is available at https://github.com/ExplainableML/ProbVLM. |
This paper introduces ProbVLM, a post-hoc probabilistic adapter that converts deterministic embeddings from pre-trained vision-language models (VLMs) into probabilistic embeddings. |
Existing large-scale VLMs provide deterministic embeddings that do not capture the inherent ambiguity in image-text mappings, limiting their ability to model uncertainty in downstream tasks. |
ProbVLM leverages pre-trained VLM encoders to predict parameters of a heteroscedastic generalized Gaussian distribution for each embedding. It is trained using a combination of intra-modal and cross-modal alignment objectives. |
ProbVLM provides well-calibrated uncertainties, with higher uncertainties correlating with lower performance on retrieval tasks.
Uncertainty estimates from ProbVLM enable effective model selection from a set of fine-tuned VLMs on unlabeled target datasets.
The uncertainties facilitate active learning by selecting the most informative samples for fine-tuning, leading to improved performance with limited labeled data. |
Exploration of more complex probability distributions beyond the generalized Gaussian distribution.
Investigating the integration of ProbVLM into the training process of VLMs, rather than as a post-hoc adaptation. |
vision-language models, probabilistic embeddings, uncertainty estimation, active learning, model selection |
2307.00300
Report |
DreamIdentity: Improved Editability for Efficient Face-identity Preserved Image Generation |
Zhuowei Chen, Shancheng Fang, Wei Liu, Qian He, Mengqi Huang, Yongdong Zhang, Zhendong Mao |
While large-scale pre-trained text-to-image models can synthesize diverse and
high-quality human-centric images, an intractable problem is how to preserve
the face identity for conditioned face images. Existing methods either require
time-consuming optimization for each face-identity or learning an efficient
encoder at the cost of harming the editability of models. In this work, we
present a method that needs no per-identity optimization while preserving
the editability of text-to-image models. Specifically, we propose a novel
face-identity encoder to learn an accurate representation of human faces, which
applies multi-scale face features followed by a multi-embedding projector to
directly generate the pseudo words in the text embedding space. Besides, we
propose self-augmented editability learning to enhance the editability of
models, which is achieved by constructing paired generated face and edited face
images using celebrity names, aiming to transfer the mature ability of
off-the-shelf text-to-image models on celebrity faces to unseen faces.
Extensive experiments show that our methods can generate identity-preserved
images under different scenes at a much faster speed. |
This paper proposes DreamIdentity, an optimization-free method for preserving face identity in text-to-image models, enabling identity-preserved image generation under different scenes at a much faster speed. |
Existing methods for preserving face identity in text-to-image synthesis either require time-consuming optimization per identity or compromise model editability. This paper addresses these limitations by introducing an efficient and effective approach. |
DreamIdentity utilizes a novel Multi-word Multi-scale ID encoder (M2 ID encoder) to learn accurate representations of human faces by extracting multi-scale features and projecting them into multiple word embeddings. It also introduces Self-Augmented Editability Learning to enhance editability by training the encoder using a self-augmented dataset of celebrity faces and their edited versions. |
DreamIdentity outperforms existing optimization-based and efficient methods in terms of text-alignment, face similarity, and encoding speed.
The M2 ID encoder, with its multi-scale features and multi-word embedding projection, significantly improves identity preservation compared to using a standard CLIP encoder.
Self-Augmented Editability Learning effectively enhances the model's ability to generate images that adhere to editing prompts while preserving identity. |
The model's performance may be limited when presented with poor-quality or out-of-domain face images.
Editability might be hindered when generating scenes that significantly deviate from the input identity's gender characteristics. |
text-to-image synthesis, face identity preservation, personalized image generation, multi-word embedding, self-augmented editability learning |
2307.00154
Report |
Stitched ViTs are Flexible Vision Backbones |
Zizheng Pan, Jing Liu, Haoyu He, Jianfei Cai, Bohan Zhuang |
Large pretrained plain vision Transformers (ViTs) have been the workhorse for
many downstream tasks. However, existing works utilizing off-the-shelf ViTs are
inefficient in terms of training and deployment, because adopting ViTs with
individual sizes requires separate trainings and is restricted by fixed
performance-efficiency trade-offs. In this paper, we are inspired by stitchable
neural networks (SN-Net), which is a new framework that cheaply produces a
single model that covers rich subnetworks by stitching pretrained model
families, supporting diverse performance-efficiency trade-offs at runtime.
Building upon this foundation, we introduce SN-Netv2, a systematically improved
model stitching framework to facilitate downstream task adaptation.
Specifically, we first propose a two-way stitching scheme to enlarge the
stitching space. We then design a resource-constrained sampling strategy that
takes into account the underlying FLOPs distributions in the space for better
sampling. Finally, we observe that learning stitching layers as a low-rank
update plays an essential role on downstream tasks to stabilize training and
ensure a good Pareto frontier. With extensive experiments on ImageNet-1K,
ADE20K, COCO-Stuff-10K and NYUv2, SN-Netv2 demonstrates superior performance
over SN-Netv1 on downstream dense predictions and shows strong ability as a
flexible vision backbone, achieving great advantages in both training
efficiency and deployment flexibility. Code is available at
https://github.com/ziplab/SN-Netv2. |
This paper introduces SN-Netv2, an improved framework for adapting large pretrained vision transformers (ViTs) to downstream tasks like semantic segmentation and depth estimation. SN-Netv2 creates a single model encompassing numerous subnetworks with varying performance-efficiency trade-offs by stitching together pretrained ViTs of different sizes. |
Existing methods for adapting pretrained ViTs to downstream tasks are inefficient for training and deployment as they require separate training for each ViT size and lack flexibility in performance-efficiency trade-offs. |
SN-Netv2 introduces three key improvements: 1) Two-way Stitching (TWS) for a larger, more optimal stitching space, 2) Resource-constrained Sampling (ROS) for balanced training across varying FLOPs constraints, 3) Low-Rank Adaptation of Stitching Layers (LoRA SL) for stabilizing training and achieving smoother performance curves. |
SN-Netv2 outperforms its predecessor SN-Netv1 and achieves competitive performance with individually trained ViTs across benchmarks like ADE20K, COCO-Stuff-10K, and NYUv2.
The framework offers significant training efficiency advantages, requiring fewer GPU hours than training individual ViT backbones separately.
SN-Netv2 enables flexible deployment as a single model can adapt to various resource constraints at runtime. |
Exploration of parameter-efficient approaches within SN-Netv2 is left for future work.
Future work can investigate better training strategies to further improve the performance of stitches at the Pareto frontier. |
vision transformers, model stitching, downstream task adaptation, semantic segmentation, depth estimation |
2307.00040
Report |
DisCo: Disentangled Control for Realistic Human Dance Generation |
Tan Wang, Linjie Li, Kevin Lin, Yuanhao Zhai, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang |
Generative AI has made significant strides in computer vision, particularly
in text-driven image/video synthesis (T2I/T2V). Despite the notable
advancements, human-centric content synthesis such as realistic dance
generation remains challenging. Current methodologies, primarily tailored for human
motion transfer, encounter difficulties when confronted with real-world dance
scenarios (e.g., social media dance), which require generalizing across a wide
spectrum of poses and intricate human details. In this paper, we depart from
the traditional paradigm of human motion transfer and emphasize two additional
critical attributes for the synthesis of human dance content in social media
contexts: (i) Generalizability: the model should be able to generalize beyond
generic human viewpoints as well as unseen human subjects, backgrounds, and
poses; (ii) Compositionality: it should allow for the seamless composition of
seen/unseen subjects, backgrounds, and poses from different sources. To address
these challenges, we introduce DisCo, which includes a novel model architecture
with disentangled control to improve the compositionality of dance synthesis,
and an effective human attribute pre-training for better generalizability to
unseen humans. Extensive qualitative and quantitative results demonstrate that
DisCo can generate high-quality human dance images and videos with diverse
appearances and flexible motions. Code is available at
https://disco-dance.github.io/. |
This paper introduces DisCo, a novel approach for generating realistic human dance videos from a single image, particularly focusing on social media scenarios like TikTok. |
Existing methods for human motion transfer struggle with generalizability to unseen subjects, backgrounds, and poses, and lack compositionality for creating novel combinations. |
DisCo employs a disentangled control architecture with separate ControlNet branches for background and pose, and incorporates CLIP image embeddings for human foreground. It also utilizes a human attribute pre-training strategy on large-scale image datasets to enhance generalizability. |
DisCo demonstrates superior quantitative results on FID, FID-VID, and FVD metrics compared to state-of-the-art methods like DreamPose.
It exhibits strong generalizability, successfully generating dance videos with unseen human subjects, backgrounds, and poses.
Qualitative results and a user study confirm the generation of high-quality, faithful, and composable human dance videos with diverse appearances and motions. |
The model currently struggles with hand posture accuracy without fine-grained hand pose control.
Extending the approach to multi-person scenarios and human-object interactions presents future challenges. Future work could explore motion pre-training via video data alongside attribute pre-training. |
human dance generation, disentangled control, diffusion models, controlnet, human attribute pre-training |
2307.00038
Report |
Training-free Object Counting with Prompts |
Zenglin Shi, Ying Sun, Mengmi Zhang |
This paper tackles the problem of object counting in images. Existing
approaches rely on extensive training data with point annotations for each
object, making data collection labor-intensive and time-consuming. To overcome
this, we propose a training-free object counter that treats the counting task
as a segmentation problem. Our approach leverages the Segment Anything Model
(SAM), known for its high-quality masks and zero-shot segmentation capability.
However, the vanilla mask generation method of SAM lacks class-specific
information in the masks, resulting in inferior counting accuracy. To overcome
this limitation, we introduce a prior-guided mask generation method that
incorporates three types of priors into the segmentation process, enhancing
efficiency and accuracy. Additionally, we tackle the issue of counting objects
specified through text by proposing a two-stage approach that combines
reference object selection and prior-guided mask generation. Extensive
experiments on standard datasets demonstrate the competitive performance of our
training-free counter compared to learning-based approaches. This paper
presents a promising solution for counting objects in various scenarios without
the need for extensive data collection and counting-specific training. Code is
available at https://github.com/shizenglin/training-free-object-counter. |
This paper presents a training-free object counting model that leverages the Segment Anything Model (SAM) and incorporates prior information for accurate and efficient object counting using prompts like points, boxes, or text. |
Existing object counting methods rely heavily on extensive training data with point annotations, which is labor-intensive and time-consuming. This work addresses this limitation by proposing a training-free approach, making object counting more accessible and flexible. |
The method formulates counting as a segmentation problem. It leverages SAM for segmentation and introduces a prior-guided mask generation approach incorporating three priors: similarity prior, segment prior, and semantic prior. For text-based counting, a two-stage approach combining reference object selection and prior-guided mask generation is proposed. |
The training-free counter achieves competitive performance compared to learning-based approaches on standard datasets like FSC-147 and CARPK.
The prior-guided mask generation method significantly improves counting efficiency and accuracy by effectively differentiating target objects from non-target objects.
The reference object selection algorithm enhances text-specified counting by refining the similarity maps obtained from CLIP-Surgery. |
The model faces challenges in counting extremely small, occluded, or densely clustered objects.
Future work will focus on addressing these limitations by developing more advanced adaptive thresholding methods or fine-tuning SAM with limited annotated data. |
object counting, training-free, segment anything model (sam), prior-guided segmentation, text-specified counting |
2306.17843
Report |
Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors |
Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, Bernard Ghanem |
We present Magic123, a two-stage coarse-to-fine approach for high-quality,
textured 3D mesh generation from a single unposed image in the wild using
both 2D and 3D priors. In the first stage, we optimize a neural radiance field
to produce a coarse geometry. In the second stage, we adopt a memory-efficient
differentiable mesh representation to yield a high-resolution mesh with a
visually appealing texture. In both stages, the 3D content is learned through
reference view supervision and novel views guided by a combination of 2D and 3D
diffusion priors. We introduce a single trade-off parameter between the 2D and
3D priors to control exploration (more imaginative) and exploitation (more
precise) of the generated geometry. Additionally, we employ textual inversion
and monocular depth regularization to encourage consistent appearances across
views and to prevent degenerate solutions, respectively. Magic123 demonstrates
a significant improvement over previous image-to-3D techniques, as validated
through extensive experiments on synthetic benchmarks and diverse real-world
images. Our code, models, and generated 3D assets are available at
https://github.com/guochengqian/Magic123. |
Magic123: a novel two-stage coarse-to-fine approach for high-quality textured 3D mesh generation from a single unposed image, using both 2D and 3D diffusion priors. |
Single-image 3D reconstruction is a challenging, ill-posed problem in computer vision, with existing methods often limited in quality, generalization, or computational cost. This work combines the advantages of both 2D and 3D priors to generate high-fidelity 3D content with detailed geometry and appealing textures. |
The method utilizes a two-stage optimization: 1) A coarse stage optimizes a neural radiance field (NeRF) for initial geometry. 2) A fine stage employs a memory-efficient differentiable mesh (DMTet) to refine geometry and texture at high resolution. Both stages leverage 2D and 3D diffusion priors for novel view guidance, controlled by a trade-off parameter for exploration/exploitation. |
Magic123 demonstrates significant improvement over existing image-to-3D techniques on both synthetic and real-world images, achieving state-of-the-art performance in quantitative metrics (PSNR, LPIPS, CLIP-similarity).
The method successfully balances geometry exploration and exploitation, generating faithful 3D reconstructions with high generalizability to diverse objects.
The two-stage coarse-to-fine approach enables high-resolution (1K) output with disentangled geometry and texture. |
The current method assumes the reference image is captured from a front view, limiting its applicability to unposed images with significant viewpoint variations.
The reliance on pre-processing steps like segmentation and depth estimation introduces potential error propagation to the 3D generation. |
3d reconstruction, single image to 3d, diffusion models, neural radiance fields (nerf), deep learning |
2306.17842
Report |
SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs |
Lijun Yu, Yong Cheng, Zhiruo Wang, Vivek Kumar, Wolfgang Macherey, Yanping Huang, David A. Ross, Irfan Essa, Yonatan Bisk, Ming-Hsuan Yang, Kevin Murphy, Alexander G. Hauptmann, Lu Jiang |
In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling
frozen LLMs to perform both understanding and generation tasks involving
non-linguistic modalities such as images or videos. SPAE converts between raw
pixels and interpretable lexical tokens (or words) extracted from the LLM's
vocabulary. The resulting tokens capture both the semantic meaning and the
fine-grained details needed for visual reconstruction, effectively translating
the visual content into a language comprehensible to the LLM, and empowering it
to perform a wide array of multimodal tasks. Our approach is validated through
in-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set
of image understanding and generation tasks. Our method marks the first
successful attempt to enable a frozen LLM to generate image content while
surpassing state-of-the-art performance in image understanding tasks, under the
same setting, by over 25%. |
This paper introduces Semantic Pyramid AutoEncoder (SPAE), which enables frozen LLMs to perform visual understanding and generation tasks through conversion between visual content and interpretable lexical tokens. |
This approach leverages the knowledge and generative capabilities of LLMs for multimodal tasks without requiring training on image-text pairs. |
SPAE uses a frozen language codebook and a pyramid token structure to capture semantic concepts and fine-grained details. It employs a semantic loss to encourage conceptually relevant tokens and utilizes in-context learning with a progressive denoising method for image generation. |
SPAE with PaLM 2 outperforms the best-published few-shot image classification accuracy by over 25%.
The pyramid structure allows for representing semantic concepts with fewer tokens, improving efficiency.
Frozen LLMs, when paired with SPAE, are capable of performing tasks like image captioning, visual question answering, and conditional image denoising. |
Reconstructing images with a frozen language codebook requires more tokens compared to models with learned codebooks.
The in-context learning approach is limited by the acceptable sequence length, impacting image resolution and quality. |
multimodal learning, large language models, image generation, image understanding, in-context learning |
2306.17723
Report |
FlipNeRF: Flipped Reflection Rays for Few-shot Novel View Synthesis |
Seunghyeon Seo, Yeonjin Chang, Nojun Kwak |
Neural Radiance Field (NeRF) has become a mainstream approach to novel view
synthesis owing to the remarkable quality of its rendered images and its simple
architecture. Although NeRF has been extended in various directions that
continuously improve its performance, the need for a dense set of multi-view
images remains a stumbling block for practical application. In this work, we
propose FlipNeRF, a novel regularization method for few-shot novel view
synthesis by utilizing our proposed flipped reflection rays. The flipped
reflection rays are explicitly derived from the input ray directions and
estimated normal vectors, and serve as effective additional training rays
while enabling the model to estimate more accurate surface normals and learn the 3D
geometry effectively. Since the surface normal and the scene depth are both
derived from the estimated densities along a ray, the accurate surface normal
leads to more exact depth estimation, which is a key factor for few-shot novel
view synthesis. Furthermore, with our proposed Uncertainty-aware Emptiness Loss
and Bottleneck Feature Consistency Loss, FlipNeRF is able to estimate more
reliable outputs while effectively reducing floating artifacts across
different scene structures, and to enhance the feature-level consistency between
the pair of rays cast toward photo-consistent pixels without any
additional feature extractor, respectively. FlipNeRF achieves SOTA
performance on multiple benchmarks across all scenarios. |
FlipNeRF, a novel regularization method for few-shot novel view synthesis, utilizes flipped reflection rays as additional training cues. |
NeRF struggles with performance degradation when trained on sparse views. FlipNeRF addresses this by generating effective reflection rays, enabling more accurate surface normal and depth estimation. |
FlipNeRF generates flipped reflection rays based on input ray directions and estimated surface normals. It uses a masking strategy to filter ineffective rays and introduces Uncertainty-aware Emptiness Loss (UE Loss) and Bottleneck Feature Consistency Loss (BFC Loss) to improve the reliability and consistency of the model. |
Achieves state-of-the-art performance on Realistic Synthetic 360°, DTU, and LLFF benchmarks.
Significantly outperforms baselines under extremely sparse settings (e.g., 3/4-view).
Demonstrates the importance of accurate surface normal estimation in few-shot novel view synthesis. |
The improvement is less significant on LLFF due to its less dynamic camera poses.
Future work could combine FlipNeRF with the Ref-NeRF representation or further study view-dependent appearance for few-shot novel view synthesis. |
novel view synthesis, neural radiance fields (nerf), few-shot learning, surface normal estimation, regularization |
2306.17643
Report |
Neural 3D Scene Reconstruction from Multiple 2D Images without 3D Supervision |
Yi Guo, Che Sun, Yunde Jia, Yuwei Wu |
Neural 3D scene reconstruction methods have achieved impressive performance
when reconstructing complex geometry and low-textured regions in indoor scenes.
However, these methods heavily rely on 3D data which is costly and
time-consuming to obtain in the real world. In this paper, we propose a novel
neural reconstruction method that reconstructs scenes using sparse depth under
the plane constraints without 3D supervision. We introduce a signed distance
function field, a color field, and a probability field to represent a scene. We
optimize these fields to reconstruct the scene by using differentiable ray
marching with accessible 2D images as supervision. We improve the
reconstruction quality of complex geometry scene regions with sparse depth
obtained by using the geometric constraints. The geometric constraints project
3D points on the surface to similar-looking regions with similar features in
different 2D images. We impose the plane constraints to make large planes
parallel or vertical to the indoor floor. Both constraints help reconstruct
accurate and smooth geometry structures of the scene. Without 3D supervision,
our method achieves competitive performance compared with existing methods that
use 3D supervision on the ScanNet dataset. |
This paper proposes a novel neural 3D scene reconstruction method that reconstructs indoor scenes from 2D images without 3D supervision by using sparse depth under plane constraints. |
Existing neural 3D scene reconstruction methods heavily rely on 3D supervision, which is costly and time-consuming to obtain. This paper aims to address this challenge by reconstructing scenes using only 2D images. |
The method represents a scene as a signed distance function field, a color field, and a plane probability field. It uses differentiable volume rendering with 2D images to optimize these fields. The method utilizes geometry constraints to obtain sparse depth for reconstructing regions with complex geometry. It also imposes plane constraints to improve the reconstruction quality of large low-textured regions. |
The method achieves comparable results to Manhattan-SDF with dense depth while using only sparse depth.
The method outperforms Manhattan-SDF when both use dense depth.
The method achieves comparable results to existing methods that use 3D supervision on the ScanNet dataset. |
The method relies on the assumption that large planes in the scene are parallel or vertical to the floor, which may not hold for all indoor scenes.
The plane estimation method used may not be robust to complex scenes with many small planes. |
3d scene reconstruction, neural implicit representation, volume rendering, unsupervised learning, plane constraints |
2306.17567
Report |
Counting Guidance for High Fidelity Text-to-Image Synthesis |
Wonjun Kang, Kevin Galim, Hyung Il Koo |
Recently, the quality and performance of text-to-image generation
have advanced significantly due to the impressive results of diffusion models.
However, text-to-image diffusion models still fail to generate high fidelity
content with respect to the input prompt. One problem where text-to-image diffusion
models struggle is generating the exact number of objects specified in the text
prompt. For example, given a prompt "five apples and ten lemons on a table",
diffusion-generated images usually contain the wrong number of objects. In this
paper, we propose a method to improve diffusion models to focus on producing
the correct object count given the input prompt. We adopt a counting network
that performs reference-less class-agnostic counting for any given image. We
calculate the gradients of the counting network and refine the predicted noise
for each step. To handle multiple types of objects in the prompt, we use novel
attention map guidance to obtain high-fidelity masks for each object. Finally,
we guide the denoising process by the calculated gradients for each object.
Through extensive experiments and evaluation, we demonstrate that our proposed
guidance method greatly improves the fidelity of diffusion models to object
count. |
This paper proposes counting guidance, a novel method leveraging a counting network to guide Stable Diffusion in generating images with the precise number of objects specified in the text prompt. |
Current text-to-image diffusion models struggle to accurately depict the correct object count as per user instructions, limiting their ability to fulfill specific image generation requests. |
The method employs a pre-trained counting network (RCC) and uses its gradients to refine the noise prediction during the Stable Diffusion denoising process. For multiple object types, attention map guidance is introduced to prevent semantic information mixing and generate accurate object masks, enabling masked counting guidance for each object. |
The proposed method successfully generates the specified number of objects for both single and multiple object type prompts.
Attention map guidance effectively mitigates the semantic information mixing problem in Stable Diffusion, leading to more accurate object representation.
The approach demonstrates efficacy in handling a large number of objects, improving upon the limitations of the base Stable Diffusion model. |
Tuning the scale parameters of the counting network guidance is often required for different text prompts.
Generating the exact number of complex objects remains challenging due to the early determination of image structure in the denoising process. |
text-to-image generation, diffusion models, stable diffusion, object counting, attention map guidance |
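The counting-guidance step described in the entry above (using gradients of a counting network to refine the predicted noise) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `soft_count` is a hypothetical differentiable stand-in for the counting network, the squared-error loss and the additive update with `scale` follow the general classifier-guidance pattern, and none of it is taken from the paper's implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_count(x):
    # Hypothetical stand-in for a counting network: each activation
    # contributes a value in (0, 1) to the total count.
    return sum(sigmoid(v) for v in x)


def count_loss_grad(x, target):
    # Analytic gradient of (soft_count(x) - target)^2 with respect to x.
    diff = soft_count(x) - target
    return [2.0 * diff * sigmoid(v) * (1.0 - sigmoid(v)) for v in x]


def guided_noise(eps, x, target, scale):
    # Assumed guidance rule: shift the predicted noise along the loss
    # gradient, so the denoised sample moves toward the target count.
    grad = count_loss_grad(x, target)
    return [e + scale * g for e, g in zip(eps, grad)]


x = [4.0, 4.0, -4.0]          # toy "image" with roughly two objects
eps = [0.1, -0.2, 0.3]        # toy predicted noise
refined = guided_noise(eps, x, target=1.0, scale=0.5)
```

The `scale` argument plays the role of the guidance scale that, according to the limitations listed above, often needs tuning per prompt.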
2306.17560
Report |
Class-Incremental Learning using Diffusion Model for Distillation and Replay |
Quentin Jodelet, Xin Liu, Yin Jun Phua, Tsuyoshi Murata |
Class-incremental learning aims to learn new classes in an incremental
fashion without forgetting the previously learned ones. Several research works
have shown how additional data can be used by incremental models to help
mitigate catastrophic forgetting. In this work, following the recent
breakthrough in text-to-image generative models and their wide distribution, we
propose the use of a pretrained Stable Diffusion model as a source of
additional data for class-incremental learning. Compared to competitive methods
that rely on external, often unlabeled, datasets of real images, our approach
can generate synthetic samples belonging to the same classes as the previously
encountered images. This allows us to use those additional data samples not
only in the distillation loss but also for replay in the classification loss.
Experiments on the competitive benchmarks CIFAR100, ImageNet-Subset, and
ImageNet demonstrate how this new approach can be used to further improve the
performance of state-of-the-art methods for class-incremental learning on large
scale datasets. |
This paper proposes SDDR, a novel class-incremental learning method leveraging a pre-trained Stable Diffusion model to generate synthetic images for both knowledge distillation and replay. |
Existing CIL methods using additional data rely on external datasets of real images, limiting their use to distillation. SDDR overcomes this by generating labeled synthetic images of previously learned classes, allowing their use for both distillation and replay, leading to improved performance. |
SDDR generates synthetic images using class names and descriptions as prompts for Stable Diffusion. During training, it combines these images with real data for both classification and distillation losses. The approach is designed to be complementary and can be integrated with other CIL methods. |
SDDR significantly improves the average incremental accuracy of baselines like iCaRL and LUCIR on CIFAR100, ImageNet-Subset, and ImageNet.
Combining SDDR with FOSTER achieves state-of-the-art performance on several benchmarks.
SDDR shows significant improvements, especially in challenging scenarios with limited memory and a large number of incremental steps. |
The quality and diversity of synthetic images are limited by the pre-trained Stable Diffusion model.
Future work includes exploring fine-tuning of the generative model during training and modifying losses to bridge the gap between synthetic and real data. |
class-incremental learning, stable diffusion, synthetic data, knowledge distillation, catastrophic forgetting |
2306.17391
Report |
EyeBAG: Accurate Control of Eye Blink and Gaze Based on Data Augmentation Leveraging Style Mixing |
Bryan S. Kim, Jeong Young Jeong, Wonjong Ryu |
Recent developments in generative models have enabled the generation of
photo-realistic human face images, and downstream tasks utilizing face
generation technology have advanced accordingly. However, models for downstream
tasks still fall short at eye control (e.g., eye blink, gaze redirection). To
overcome such eye control problems, we introduce a novel framework consisting
of two distinct modules: a blink control module and a gaze redirection module.
We also propose a novel data augmentation method to train each module,
leveraging style mixing to obtain images with desired features. We show that
our framework produces eye-controlled images of high quality, and demonstrate
how it can be used to improve the performance of downstream tasks. |
This paper introduces EyeBAG, a novel framework for accurate control of eye blinks and gaze in face images using generative models. |
Current generative models struggle with realistic eye control, leading to awkwardness and a sense of alienation in generated images, particularly impacting downstream tasks like face swapping. |
Presents a two-module approach: 1) Blink control module: regulates eye blink degree using a U-Net architecture trained on paired open/closed eye images generated through a novel style mixing data augmentation technique. 2) Gaze redirection module: controls gaze direction by manipulating iris position, trained on a dataset augmented with diverse gaze directions also generated via style mixing. |
EyeBAG generates high-quality, photorealistic images of blinking and gaze-redirected faces.
The framework's discriminator doubles as a highly accurate blink detection network for images and videos.
Data augmentation using EyeBAG significantly improves the performance of downstream tasks, such as face swapping, particularly in scenarios with closed eyes or varying gazes. |
The current implementation focuses solely on eye control and does not address other facial expressions or head movements.
Future work could explore the generalization of the style mixing data augmentation technique to other facial features and expressions, further enhancing the realism of generated faces. |
generative models, data augmentation, style mixing, eye blink control, gaze redirection |
2306.17321
Report |
Training-Free Neural Matte Extraction for Visual Effects |
Sharif Elcott, J. P. Lewis, Nori Kanazawa, Christoph Bregler |
Alpha matting is widely used in video conferencing as well as in movies,
television, and social media sites. Deep learning approaches to the matte
extraction problem are well suited to video conferencing due to the consistent
subject matter (front-facing humans); however, training-based approaches are
somewhat pointless for entertainment videos where varied subjects (spaceships,
monsters, etc.) may appear only a few times in a single movie -- if a method of
creating ground truth for training exists, just use that method to produce the
desired mattes. We introduce a training-free high quality neural matte
extraction approach that specifically targets the assumptions of visual effects
production. Our approach is based on the deep image prior, which optimizes a
deep neural network to fit a single image, thereby providing a deep encoding of
the particular image. We make use of the representations in the penultimate
layer to interpolate coarse and incomplete "trimap" constraints. Videos
processed with this approach are temporally consistent. The algorithm is both
very simple and surprisingly effective. |
This paper introduces a training-free neural matte extraction approach for visual effects using the deep image prior (DIP). |
This approach is specifically designed for visual effects production, where subject matter is diverse, training data is often impractical, and clean plates are undesirable or infeasible. |
The method utilizes a DIP network to reconstruct the target image and simultaneously inpaint the alpha matte in the trimap's unknown region, constrained by known regions. Separate networks reconstruct foreground and background, further coupled with the alpha output via the alpha-compositing equation. |
The method produces high-quality alpha mattes comparable to ground truth data.
It effectively handles challenging cases like hair and objects with similar colors to the background.
Temporal consistency is achieved by warm-starting optimization from previous frames in videos. |
The computational cost is high, limiting its use to offline applications.
Objects with holes can pose challenges and require further investigation for robust handling. |
alpha matting, deep learning, visual effects, deep image prior, training-free |
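The alpha-compositing equation that couples the foreground, background, and alpha outputs in the entry above is the standard relation I = alpha * F + (1 - alpha) * B. A minimal sketch follows, assuming flat lists of per-pixel values. The function names are hypothetical, and the mean-squared reconstruction error is only one plausible choice of coupling term, not necessarily the one used in the paper.

```python
def composite(alpha, fg, bg):
    # Standard alpha compositing per pixel: I = alpha * F + (1 - alpha) * B.
    return [a * f + (1.0 - a) * b for a, f, b in zip(alpha, fg, bg)]


def reconstruction_error(image, alpha, fg, bg):
    # Mean squared error between the target image and the composite
    # (an assumed form of the coupling between the three outputs).
    recon = composite(alpha, fg, bg)
    return sum((i - r) ** 2 for i, r in zip(image, recon)) / len(image)


alpha = [0.0, 0.5, 1.0]
fg = [1.0, 1.0, 1.0]   # white foreground
bg = [0.0, 0.0, 0.0]   # black background
image = composite(alpha, fg, bg)
```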
2306.17319
Report |
ReMaX: Relaxing for Better Training on Efficient Panoptic Segmentation |
Shuyang Sun, Weijun Wang, Qihang Yu, Andrew Howard, Philip Torr, Liang-Chieh Chen |
This paper presents a new mechanism to facilitate the training of mask
transformers for efficient panoptic segmentation, democratizing its deployment.
We observe that due to its high complexity, the training objective of panoptic
segmentation will inevitably lead to much higher false positive penalization.
Such unbalanced loss makes the training process of the end-to-end
mask-transformer based architectures difficult, especially for efficient
models. In this paper, we present ReMaX that adds relaxation to mask
predictions and class predictions during training for panoptic segmentation. We
demonstrate that via these simple relaxation techniques during training, our
model can be consistently improved by a clear margin without any extra
computational cost on inference. By combining our method with efficient
backbones like MobileNetV3-Small, our method achieves new state-of-the-art
results for efficient panoptic segmentation on COCO, ADE20K and Cityscapes.
Code and pre-trained checkpoints will be available at
https://github.com/google-research/deeplab2. |
This paper introduces ReMaX, a novel mechanism to facilitate the training of mask transformers for efficient panoptic segmentation by adding relaxation to mask predictions and class predictions during training. |
The training objective of panoptic segmentation often leads to unbalanced loss with high false positive penalization, making it difficult to train efficient models. ReMaX aims to stabilize training and improve performance without incurring extra computational cost during inference. |
ReMaX consists of two relaxation techniques: (1) ReMask utilizes an auxiliary semantic segmentation branch during training to guide and calibrate panoptic predictions, suppressing false positive predictions. (2) ReClass softens one-hot class labels by considering the overlap between predicted masks and ground truth masks, accounting for potential multi-class regions in predictions. |
ReMaX significantly improves training convergence, achieving up to 3x faster training speed compared to baselines.
ReMaX achieves state-of-the-art results for efficient panoptic segmentation on COCO, ADE20K, and Cityscapes datasets, outperforming previous methods in terms of accuracy and speed.
Ablation studies validate the effectiveness of both ReMask and ReClass, showing their contribution to improved performance and stable training. |
The current implementation is limited to TensorFlow, which restricts the choice of baselines.
The class weighting scheme in ReClass, based on mask size, might not be optimal and requires further investigation. |
panoptic segmentation, mask transformers, efficient training, relaxation techniques, computer vision |
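The ReClass idea from the ReMaX entry above (softening one-hot class labels according to the overlap between a predicted mask and the ground-truth masks) can be sketched roughly as follows. This is an illustrative assumption rather than the paper's formulation: the name `soft_class_target`, the flat 0/1 mask lists, and the normalization by total overlap are all hypothetical.

```python
def soft_class_target(pred_mask, gt_masks, gt_classes, num_classes):
    # Accumulate, per class, the number of pixels where the predicted
    # mask overlaps a ground-truth mask of that class.
    weights = [0.0] * num_classes
    for mask, cls in zip(gt_masks, gt_classes):
        weights[cls] += sum(p * g for p, g in zip(pred_mask, mask))
    total = sum(weights)
    if total == 0:
        return weights  # no overlap with any ground-truth mask
    # Normalize so the soft label sums to one (an assumed choice).
    return [w / total for w in weights]


pred_mask = [1, 1, 1, 0]                  # predicted binary mask, flattened
gt_masks = [[1, 1, 0, 0], [0, 0, 1, 1]]   # two ground-truth masks
gt_classes = [0, 2]                       # their class indices
target = soft_class_target(pred_mask, gt_masks, gt_classes, num_classes=3)
```

Here the predicted mask covers two pixels of class 0 and one pixel of class 2, so the soft label places about two thirds of its weight on class 0 instead of all of it.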
2306.17123
Report |
PVP: Personalized Video Prior for Editable Dynamic Portraits using StyleGAN |
Kai-En Lin, Alex Trevithick, Keli Cheng, Michel Sarkis, Mohsen Ghafoorian, Ning Bi, Gerhard Reitmayr, Ravi Ramamoorthi |
Portrait synthesis creates realistic digital avatars which enable users to
interact with others in a compelling way. Recent advances in StyleGAN and its
extensions have shown promising results in photorealistic synthesis and
accurate reconstruction of human faces. However, previous methods often focus
on frontal face synthesis and most methods are not able to handle large head
rotations due to the training data distribution of StyleGAN. In this work, our
goal is to take as input a monocular video of a face, and create an editable
dynamic portrait able to handle extreme head poses. The user can create novel
viewpoints, edit the appearance, and animate the face. Our method utilizes
pivotal tuning inversion (PTI) to learn a personalized video prior from a
monocular video sequence. Then we can input pose and expression coefficients to
MLPs and manipulate the latent vectors to synthesize different viewpoints and
expressions of the subject. We also propose novel loss functions to further
disentangle pose and expression in the latent space. Our algorithm shows much
better performance over previous approaches on monocular video datasets, and it
is also capable of running in real-time at 54 FPS on an RTX 3080. |
This paper presents a novel algorithm for creating editable dynamic portraits from monocular portrait videos using StyleGAN, allowing for manipulation of pose, expression, and appearance. |
Current methods for portrait synthesis either struggle with extreme head poses, lack editability, or require extensive multi-view input. This work aims to overcome these limitations and provide a comprehensive solution for creating interactive and personalized digital avatars. |
The method involves two stages: 1) Learning a personalized video prior by fine-tuning a StyleGAN generator on selected frames from the input video. 2) Training pose and expression mapping networks to control the rendering within the personalized manifold using pose and expression parameters. |
The method achieves state-of-the-art visual quality on monocular video datasets, outperforming existing 2D and 3D methods in terms of reconstruction accuracy and detail.
It allows for direct control over head poses, enabling the synthesis of extreme viewpoints not achievable by previous 2D methods.
The approach supports real-time rendering at 54 FPS on an RTX 3080 GPU, making it suitable for interactive applications. |
The current method is limited to the facial region and does not handle the back of the head or upper body.
The personalization process requires a time-consuming optimization stage for each subject. Future work could explore meta-learning for faster adaptation. |
digital avatars, stylegan, personalized video prior, facial reenactment, portrait editing |
2306.17115
Report |
Michelangelo: Conditional 3D Shape Generation based on Shape-Image-Text Aligned Latent Representation |
Zibo Zhao, Wen Liu, Xin Chen, Xianfang Zeng, Rui Wang, Pei Cheng, Bin Fu, Tao Chen, Gang Yu, Shenghua Gao |
We present a novel alignment-before-generation approach to tackle the
challenging task of generating general 3D shapes based on 2D images or texts.
Directly learning a conditional generative model from images or texts to 3D
shapes is prone to producing inconsistent results with the conditions because
3D shapes have an additional dimension whose distribution significantly differs
from that of 2D images and texts. To bridge the domain gap among the three
modalities and facilitate multi-modal-conditioned 3D shape generation, we
explore representing 3D shapes in a shape-image-text-aligned space. Our
framework comprises two models: a Shape-Image-Text-Aligned Variational
Auto-Encoder (SITA-VAE) and a conditional Aligned Shape Latent Diffusion Model
(ASLDM). The former model encodes the 3D shapes into the shape latent space
aligned to the image and text and reconstructs the fine-grained 3D neural
fields corresponding to given shape embeddings via the transformer-based
decoder. The latter model learns a probabilistic mapping function from the
image or text space to the latent shape space. Our extensive experiments
demonstrate that our proposed approach can generate higher-quality and more
diverse 3D shapes that better semantically conform to the visual or textual
conditional inputs, validating the effectiveness of the
shape-image-text-aligned space for cross-modality 3D shape generation. |
This paper introduces an innovative "alignment-before-generation" approach for generating 3D shapes from 2D images or text descriptions, aiming to enhance the consistency between generated 3D shapes and their corresponding conditions. |
Generating 3D shapes from 2D images or text is challenging due to the inherent domain gap between these modalities. Existing methods often struggle to produce consistent and high-quality results due to this gap. |
The proposed framework utilizes two key components: a Shape-Image-Text-Aligned Variational Auto-Encoder (SITA-VAE) and an Aligned Shape Latent Diffusion Model (ASLDM). The SITA-VAE learns a shared representation space for 3D shapes, images, and texts using contrastive learning. The ASLDM, operating in this aligned space, learns a probabilistic mapping from images or texts to 3D shape embeddings. |
The proposed method outperforms baseline methods in terms of reconstruction accuracy and generation quality, as evidenced by metrics like IoU, shape-image score (SI-S), and shape-text score (ST-S).
The generated 3D shapes demonstrate a high degree of fidelity to the input conditions, exhibiting smoother surfaces, finer details, and better semantic consistency.
The framework exhibits robustness in handling out-of-domain images and complex text descriptions, showcasing its generalization capabilities. |
The method's reliance on ground-truth 3D shapes during training poses a limitation, as 3D data is often scarce. Exploring unsupervised or weakly-supervised learning approaches could mitigate this issue.
Representing 3D shapes as occupancy fields necessitates converting meshes into watertight ones, potentially leading to a loss of geometric detail in the original mesh. Investigating alternative shape representations could address this limitation. |
3d shape generation, cross-modal learning, contrastive learning, latent diffusion model, shape-image-text alignment |
2306.16928
Report |
One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization |
Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, Hao Su |
Single image 3D reconstruction is an important but challenging task that
requires extensive knowledge of our natural world. Many existing methods solve
this problem by optimizing a neural radiance field under the guidance of 2D
diffusion models but suffer from lengthy optimization time, 3D-inconsistent
results, and poor geometry. In this work, we propose a novel method that takes
a single image of any object as input and generates a full 360-degree 3D
textured mesh in a single feed-forward pass. Given a single image, we first use
a view-conditioned 2D diffusion model, Zero123, to generate multi-view images
for the input view, and then aim to lift them up to 3D space. Since traditional
reconstruction methods struggle with inconsistent multi-view predictions, we
build our 3D reconstruction module upon an SDF-based generalizable neural
surface reconstruction method and propose several critical training strategies
to enable the reconstruction of 360-degree meshes. Without costly
optimizations, our method reconstructs 3D shapes in significantly less time
than existing methods. Moreover, our method favors better geometry, generates
more 3D consistent results, and adheres more closely to the input image. We
evaluate our approach on both synthetic data and in-the-wild images and
demonstrate its superiority in terms of both mesh quality and runtime. In
addition, our approach can seamlessly support the text-to-3D task by
integrating with off-the-shelf text-to-image diffusion models. |
This paper proposes One-2-3-45, a novel method that reconstructs a full 360-degree textured 3D mesh from a single image in a feed-forward manner. |
Existing optimization-based methods for single image 3D reconstruction are time-consuming, memory intensive, and often produce 3D inconsistent results with poor geometry. This work aims to address these limitations. |
The method leverages a view-conditioned 2D diffusion model (Zero123) to generate multi-view images. It then estimates the elevation of the input view and utilizes a cost-volume-based neural surface reconstruction module trained on inconsistent multi-view predictions to reconstruct the 3D mesh. |
Reconstructs a 3D shape in about 45 seconds, significantly faster than existing optimization-based methods.
Produces higher quality geometry and more 3D consistent results due to the use of SDF representation and camera-conditioned multi-view predictions.
Exhibits better adherence to the input image compared to existing methods. |
The method's performance depends on the quality of multi-view images generated by Zero123, which can be inconsistent in cases of limited input information or ambiguous structures.
Minor artifacts on the backside of generated results suggest room for improvement in reconstruction techniques and regularization. |
3d reconstruction, single image 3d reconstruction, diffusion models, neural surface reconstruction, zero-shot learning |
2306.16894
Report |
PFB-Diff: Progressive Feature Blending Diffusion for Text-driven Image Editing |
Wenjing Huang, Shikui Tu, Lei Xu |
Diffusion models have showcased their remarkable capability to synthesize
diverse and high-quality images, sparking interest in their application for
real image editing. However, existing diffusion-based approaches for local
image editing often suffer from undesired artifacts due to the pixel-level
blending of the noised target images and diffusion latent variables, which lack
the necessary semantics for maintaining image consistency. To address these
issues, we propose PFB-Diff, a Progressive Feature Blending method for
Diffusion-based image editing. Unlike previous methods, PFB-Diff seamlessly
integrates text-guided generated content into the target image through
multi-level feature blending. The rich semantics encoded in deep features and
the progressive blending scheme from high to low levels ensure semantic
coherence and high quality in edited images. Additionally, we introduce an
attention masking mechanism in the cross-attention layers to confine the impact
of specific words to desired regions, further improving the performance of
background editing. PFB-Diff can effectively address various editing tasks,
including object/background replacement and object attribute editing. Our
method demonstrates its superior performance in terms of image fidelity,
editing accuracy, efficiency, and faithfulness to the original image, without
the need for fine-tuning or training. |
This paper introduces PFB-Diff, a novel method for text-driven image editing using diffusion models, which leverages progressive feature blending and attention masking to enable seamless and consistent edits. |
Existing diffusion-based image editing methods often suffer from artifacts and inconsistencies due to pixel-level blending, especially when handling complex edits or rough masks. This method aims to address these issues and achieve more natural and accurate results. |
PFB-Diff operates by progressively blending features of the input image with generated features at different layers of the diffusion model's U-Net. It also employs an attention masking mechanism to restrict the influence of specific words to the target regions, ensuring semantic consistency. |
PFB-Diff demonstrates superior performance compared to existing state-of-the-art methods, achieving higher CLIP scores and Local CLIP scores, indicating better image-text alignment and accurate local editing.
The method effectively tackles various editing tasks, including object/background replacement and object property changes, while maintaining high image quality and faithfulness to the original image.
User studies confirm that PFB-Diff produces more favorable results compared to other methods, indicating higher user satisfaction in terms of editing accuracy, realism, and faithfulness. |
PFB-Diff currently requires user-provided masks, which can be a limitation in certain scenarios compared to mask-free methods.
The method is not currently applicable to style transfer tasks. |
image editing, diffusion models, text-to-image synthesis, feature blending, attention masking |
2306.15876
Report |
Hybrid Distillation: Connecting Masked Autoencoders with Contrastive Learners |
Bowen Shi, Xiaopeng Zhang, Yaoming Wang, Jin Li, Wenrui Dai, Junni Zou, Hongkai Xiong, Qi Tian |
Representation learning has been evolving from traditional supervised
training to Contrastive Learning (CL) and Masked Image Modeling (MIM). Previous
works have demonstrated their pros and cons in specific scenarios, i.e., CL and
supervised pre-training excel at capturing longer-range global patterns and
enabling better feature discrimination, while MIM can introduce more local and
diverse attention across all transformer layers. In this paper, we explore how
to obtain a model that combines their strengths. We start by examining previous
feature distillation and mask feature reconstruction methods and identify their
limitations. We find that their increasing diversity mainly derives from the
asymmetric designs, but these designs may in turn compromise the discrimination
ability. In order to better obtain both discrimination and diversity, we
propose a simple but effective Hybrid Distillation strategy, which utilizes
both the supervised/CL teacher and the MIM teacher to jointly guide the student
model. Hybrid Distill imitates the token relations of the MIM teacher to
alleviate attention collapse, as well as distills the feature maps of the
supervised/CL teacher to enable discrimination. Furthermore, a progressive
redundant token masking strategy is also utilized to reduce the distilling
costs and avoid falling into local optima. Experimental results show that Hybrid
Distill can achieve superior performance on different benchmarks. |
This paper proposes Hybrid Distill, a novel framework for representation learning that combines the strengths of Contrastive Learning (CL) and Masked Image Modeling (MIM) by distilling knowledge from both CL/supervised and MIM pre-trained teachers into a student model. |
Discrimination and diversity are both crucial for downstream adaptation of representation learning models. However, existing methods for combining CL and MIM, such as feature distillation and mask feature reconstruction, have limitations in effectively incorporating both properties. Hybrid Distill addresses these limitations by leveraging the strengths of both CL and MIM teachers. |
Hybrid Distill utilizes a supervised/CL teacher (e.g., DeiT, CLIP) and an MIM teacher (e.g., MAE). It distills token relations from the MIM teacher in later layers to enhance diversity and feature maps from the supervised/CL teacher in the final layer to enhance discrimination. Additionally, a progressive redundant token masking strategy is employed to reduce computational cost and prevent local optima. |
Hybrid Distill effectively combines discrimination from supervised/CL models with the diversity of MIM models, as demonstrated through property analysis using average head distance, normalized mutual information, and attention visualization.
Hybrid Distill outperforms single-teacher distillation baselines and previous methods using asymmetric designs on various downstream tasks, including image classification, object detection, instance segmentation, and semantic segmentation.
The progressive redundant token masking strategy successfully reduces computational cost while maintaining performance and even provides regularization benefits, preventing the model from falling into local optima. |
The use of two teacher models introduces additional overhead, although the increase in training time is relatively small (around 1.2 times).
The performance improvement when using CLIP as a teacher is less significant than with DeiT, possibly due to the gap in pre-training capacity between CLIP and the MIM teacher (MAE). |
representation learning, knowledge distillation, contrastive learning, masked image modeling, vision transformer |
2306.15832
Report |
Easing Color Shifts in Score-Based Diffusion Models |
Katherine Deck, Tobias Bischoff |
Generated images of score-based models can suffer from errors in their
spatial means, an effect referred to as a color shift, which grows for larger
images. This paper investigates a previously-introduced approach to mitigate
color shifts in score-based diffusion models. We quantify the performance of a
nonlinear bypass connection in the score network, designed to process the
spatial mean of the input and to predict the mean of the score function. We
show that this network architecture substantially improves the resulting
quality of the generated images, and that this improvement is approximately
independent of the size of the generated images. As a result, this modified
architecture offers a simple solution for the color shift problem across image
sizes. We additionally discuss the origin of color shifts in an idealized
setting in order to motivate the approach. |
This paper investigates and quantifies the performance of a nonlinear bypass connection in the score network, which processes the spatial mean of the input and predicts the mean of the score function, to mitigate color shifts (errors in spatial means) in score-based diffusion models. |
Color shifts are a common problem in score-based diffusion models, especially for large images, and this paper offers a simple and effective solution to address this issue. |
The authors employ a modified score network architecture with a mean-bypass layer that predicts the spatial mean of the score independently from the spatial variations. They compare this approach to a baseline U-net model with and without exponential moving average (EMA) smoothing on FashionMNIST and 2D turbulence datasets. |
The mean-bypass layer significantly reduces color shifts across different image sizes compared to the baseline model with or without EMA.
The modified network architecture achieves superior optimization of the spatial mean loss term, leading to more accurate spatial mean predictions.
EMA smoothing alone is insufficient to mitigate color shifts in large images, particularly when training data is limited. |
The mean-bypass layer architecture does not leverage potential correlations between image means and spatial variations.
Future work could explore incorporating information about spatial variations into the mean-bypass layer to capture potential correlations. |
score-based diffusion models, color shift, image generation, mean-bypass layer, spatial mean prediction |
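The mean-bypass idea in the entry above (predicting the spatial mean of the score separately from the spatial variations) can be illustrated with a small sketch. The decomposition below is an assumption about how such a bypass might be wired: `spatial_net` and `mean_net` are hypothetical callables standing in for the U-Net branch and the bypass branch, and removing the mean from the spatial branch output is a design choice made here for clarity, not a detail confirmed by the source.

```python
def split_mean(image):
    # Split a 2D image into its spatial mean and the zero-mean remainder.
    flat = [v for row in image for v in row]
    mean = sum(flat) / len(flat)
    return mean, [[v - mean for v in row] for row in image]


def score_with_bypass(image, spatial_net, mean_net):
    mean, fluct = split_mean(image)
    # The spatial branch sees only the zero-mean variations, and any mean
    # it produces is removed so it cannot shift the overall color.
    _, spatial_score = split_mean(spatial_net(fluct))
    # The bypass branch alone sets the spatial mean of the score.
    mean_score = mean_net(mean)
    return [[v + mean_score for v in row] for row in spatial_score]


image = [[1.0, 3.0], [5.0, 7.0]]
score = score_with_bypass(
    image,
    spatial_net=lambda f: f,   # toy identity branch
    mean_net=lambda m: -m,     # toy bypass: pull the mean toward zero
)
```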
2306.15769
Report |
What Makes ImageNet Look Unlike LAION |
Ali Shirali, Moritz Hardt |
ImageNet was famously created from Flickr image search results. What if we
recreated ImageNet instead by searching the massive LAION dataset based on
image captions alone? In this work, we carry out this counterfactual
investigation. We find that the resulting ImageNet recreation, which we call
LAIONet, looks distinctly unlike the original. Specifically, the intra-class
similarity of images in the original ImageNet is dramatically higher than it is
for LAIONet. Consequently, models trained on ImageNet perform significantly
worse on LAIONet. We propose a rigorous explanation for the discrepancy in
terms of a subtle, yet important, difference in two plausible causal
data-generating processes for the respective datasets, which we support with
systematic experimentation. In a nutshell, searching based on an image caption
alone creates an information bottleneck that mitigates the selection bias
otherwise present in image-based filtering. Our explanation formalizes a
long-held intuition in the community that ImageNet images are stereotypical,
unnatural, and overly simple representations of the class category. At the same
time, it provides a simple and actionable takeaway for future dataset creation
efforts. |
This paper introduces LAIONet, a recreation of ImageNet using the LAION dataset and text-based image selection, and investigates the differences between LAIONet and ImageNet. |
The research aims to understand the impact of different data collection methodologies on dataset bias and model performance. |
The authors created LAIONet by searching LAION for images matching ImageNet synsets based on text descriptions. They then compared LAIONet and ImageNet in terms of CLIP zero-shot accuracy, model performance, and intra-class similarity. |
LAIONet images are more diverse than ImageNet images, exhibiting lower intra-class similarity.
ImageNet-trained models experience a significant performance drop on LAIONet, particularly on more frequent classes.
The authors provide evidence that the image-to-selection link in ImageNet's creation process is responsible for its lower diversity and the observed performance drop. |
The study is limited by the availability of accurate captions for only a subset of ImageNet.
Scaling the analysis to LAION-5B could potentially provide a more comprehensive comparison. |
imagenet, laion, dataset bias, intra-class similarity, information bottleneck |
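Intra-class similarity, the quantity compared between ImageNet and LAIONet above, can be illustrated as the mean pairwise cosine similarity of image embeddings within one class. This is a minimal sketch assuming embeddings from some image encoder are already available; the synthetic vectors below are placeholders, not data from the paper.

```python
import numpy as np

def intra_class_similarity(embeddings):
    """Mean pairwise cosine similarity among embeddings of one class.

    embeddings: array of shape (n_images, dim) with n_images >= 2.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(normed)
    # Average over off-diagonal entries only (exclude self-similarity).
    return (sims.sum() - np.trace(sims)) / (n * (n - 1))

rng = np.random.default_rng(0)
center = rng.normal(size=64)
tight_class = center + 0.1 * rng.normal(size=(50, 64))  # stereotypical images
loose_class = center + 1.0 * rng.normal(size=(50, 64))  # more varied images
tight = intra_class_similarity(tight_class)
loose = intra_class_similarity(loose_class)
```

In this toy setup the tightly clustered class scores higher than the spread-out one, which mirrors the direction of the reported ImageNet versus LAIONet gap.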
2306.15706
Report |
Approximated Prompt Tuning for Vision-Language Pre-trained Models |
Qiong Wu, Shubin Huang, Yiyi Zhou, Pingyang Dai, Annan Shu, Guannan Jiang, Rongrong Ji |
Prompt tuning is a parameter-efficient way to deploy large-scale pre-trained
models to downstream tasks by adding task-specific tokens. In terms of
vision-language pre-trained (VLP) models, prompt tuning often requires a large
number of learnable tokens to bridge the gap between the pre-training and
downstream tasks, which greatly exacerbates the already high computational
overhead. In this paper, we revisit the principle of prompt tuning for
Transformer-based VLP models, and reveal that the impact of soft prompt tokens
can be actually approximated via independent information diffusion steps,
thereby avoiding the expensive global attention modeling and reducing the
computational complexity to a large extent. Based on this finding, we propose a
novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer
learning. To validate APT, we apply it to two representative VLP models, namely
ViLT and METER, and conduct extensive experiments on a bunch of downstream
tasks. Meanwhile, the generalization of APT is also validated on CLIP for image
classification and StableDiffusion for text-to-image generation. The
experimental results not only show the superior performance gains and
computation efficiency of APT against the conventional prompt tuning methods,
e.g., +7.01% accuracy and -82.30% additional computation overhead on METER, but
also confirm its merits over other parameter-efficient transfer learning
approaches. |
This paper proposes Approximated Prompt Tuning (APT), a novel method for parameter- and computation-efficient adaptation of Vision-Language Pre-trained (VLP) models to downstream tasks. |
Existing prompt tuning methods applied to VLP models often suffer from high computational overhead and inefficient adaptation due to the large gap between pre-training and downstream tasks. |
APT approximates the influence of prompt tokens on the input sequence by separating them from the expensive global self-attention mechanism and aggregating them with low-rank transformations. |
APT achieves superior performance gains over conventional prompt tuning methods on VLP models, with up to +8.30% accuracy improvement on VQA2.0 for METER.
APT significantly reduces computational overhead compared to existing prompt tuning methods, saving up to 82.30% of the additional computation on METER.
APT demonstrates better performance than other Parameter Efficient Transfer Learning (PETL) approaches on various VLP models and VL tasks, and its generalization is validated on CLIP for image classification and StableDiffusion for text-to-image generation. |
The performance of APT is still slightly inferior to full fine-tuning, indicating room for further improvement.
Future work includes exploring more effective information diffusion strategies and extending APT to other multimodal pre-trained models. |
prompt tuning, vision-language pre-training, parameter efficient transfer learning, multimodal learning, approximation methods |
2306.15658
Report |
CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a \$10,000 Budget; An Extra \$4,000 Unlocks 81.8% Accuracy |
Xianhang Li, Zeyu Wang, Cihang Xie |
The recent work CLIPA presents an inverse scaling law for CLIP training --
whereby the larger the image/text encoders used, the shorter the sequence
length of image/text tokens that can be applied in training. This finding
enables us to train high-performance CLIP models with significantly reduced
computations. Building upon this work, we hereby present CLIPA-v2 with two key
contributions. Technically, we find this inverse scaling law is also applicable
in the finetuning stage, enabling further reduction in computational needs.
Empirically, we explore CLIPA at scale, extending the experiments up to the
H/14 model with ~13B image-text pairs seen during training.
Our results are exciting -- by only allocating a budget of \$10,000, our CLIP
model achieves an impressive zero-shot ImageNet accuracy of 81.1%, surpassing
the prior best CLIP model (from OpenCLIP, 80.1%) by 1.0% and meanwhile reducing
the computational cost by ~39X. Moreover, with an additional investment of
\$4,000, we can further elevate the zero-shot ImageNet accuracy to 81.8%. Our
code and models are available at https://github.com/UCSC-VLAA/CLIPA. |
This paper introduces CLIPA-v2, building on CLIPA, which leverages an inverse scaling law to train high-performance CLIP models efficiently. CLIPA-v2 demonstrates that this law also applies to the fine-tuning stage, further reducing computation needs. |
Training CLIP models is computationally expensive. CLIPA-v2 provides a solution to reduce training costs while achieving state-of-the-art zero-shot performance. |
The authors scale CLIPA to larger models (up to H/14), datasets (LAION-2B, DataComp-1B), and training schedules (13B samples). They explore the inverse scaling law in fine-tuning and analyze different masking strategies. |
CLIPA-v2 achieves 81.1% zero-shot ImageNet accuracy within a $10,000 budget, outperforming the previous best model (OpenCLIP, 80.1%) by 1.0% while reducing the computational cost by roughly 39x.
An additional $4,000 investment further boosts the accuracy to 81.8%, setting a new performance record.
The inverse scaling law, allowing for training with fewer tokens for larger models, also proves effective during fine-tuning. |
CLIPA-v2 lags behind in zero-shot retrieval tasks on COCO and Flickr30k compared to OpenCLIP's best model.
The impact of different pre-training datasets on downstream tasks needs further investigation. |
clip, clipa, zero-shot learning, vision-language model, efficient training |
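The saving from shorter image token sequences can be made concrete with a little arithmetic. For a ViT with patch size 14 (as in the H/14 model above), the number of patch tokens grows with the square of the input resolution. The two resolutions below are illustrative assumptions, not settings quoted from this summary.

```python
def num_image_tokens(resolution, patch_size):
    """Number of non-overlapping patches in a square image (class token excluded)."""
    return (resolution // patch_size) ** 2

# Illustrative resolutions (assumptions): a reduced one and a standard one.
low_res_tokens = num_image_tokens(84, 14)
full_res_tokens = num_image_tokens(224, 14)
token_ratio = full_res_tokens / low_res_tokens
```

Since self-attention cost grows faster than linearly in sequence length, training mostly at the reduced length and only briefly at the full length is one way to read the reported cost reduction.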
2306.15419
Report |
Freestyle 3D-Aware Portrait Synthesis Based on Compositional Generative Priors |
Tianxiang Ma, Kang Zhao, Jianxin Sun, Yingya Zhang, Jing Dong |
Efficiently generating a freestyle 3D portrait with high quality and
3D-consistency is a promising yet challenging task. The portrait styles
generated by most existing methods are usually restricted by their 3D
generators, which are learned in specific facial datasets, such as FFHQ. To get
the diverse 3D portraits, one can build a large-scale multi-style database to
retrain a 3D-aware generator, or use an off-the-shelf tool to do the style
translation. However, the former is time-consuming due to data collection and
training process, while the latter may destroy the multi-view consistency. To tackle
this problem, we propose a novel text-driven 3D-aware portrait synthesis
framework that can generate out-of-distribution portrait styles. Specifically,
for a given portrait style prompt, we first composite two generative priors, a
3D-aware GAN generator and a text-guided image editor, to quickly construct a
few-shot stylized portrait set. Then we map the special style domain of this
set to our proposed 3D latent feature generator and obtain a 3D representation
containing the given style information. Finally we use a pre-trained 3D
renderer to generate view-consistent stylized portraits from the 3D
representation. Extensive experimental results show that our method is capable
of synthesizing high-quality 3D portraits with specified styles in a few
minutes, outperforming the state-of-the-art. |
This paper proposes a novel freestyle 3D-aware portrait synthesis framework based on compositional generative priors to efficiently generate high-quality 3D portraits with specified styles. |
Existing 3D portrait synthesis methods are usually restricted by the training data, limiting their ability to generate diverse freestyle 3D portraits. |
This work composites a 3D-aware GAN generator (EG3D) and a text-guided image editor (InstructPix2Pix) to construct a few-shot stylized portrait dataset. Then, a proposed 3D latent feature generator maps the style information from this dataset to a 3D representation, which is used by a pre-trained 3D renderer to synthesize the final stylized 3D portrait. |
The method can generate high-quality and 3D-consistent portraits with diverse styles specified by text prompts.
The approach outperforms baselines in both qualitative and quantitative comparisons, demonstrating its superiority in generating freestyle 3D portraits.
The framework is efficient, enabling the generation of a stylized 3D portrait model in approximately 3 minutes. |
The method relies on two pre-trained generative priors, which may limit its performance when synthesizing styles that significantly deviate from human portrait shapes.
Achieving perfect 3D-consistent portrait stylization across different viewpoints remains a challenge due to limitations of the text-guided image editor. |
3d portrait synthesis, generative adversarial networks, text-guided image editing, few-shot learning, neural rendering |
2306.15111
Report |
Self-Supervised Image Captioning with CLIP |
Chuanyang Jin |
Image captioning, a fundamental task in vision-language understanding, seeks
to generate accurate natural language descriptions for provided images. Current
image captioning approaches heavily rely on high-quality image-caption pairs,
which can be hard to obtain for many domains. To address this, we introduce a
self-supervised image captioning method. After learning an initial signal from
a small labeled dataset, our method transitions to self-supervised learning on
unlabeled data, leveraging the auxiliary task of enhancing the CLIP relevance
between images and generated captions. Remarkably, despite utilizing less than
2% of the labeled COCO dataset, our method delivers a performance comparable to
state-of-the-art models trained on the complete dataset. Human evaluations
further reveal that our method produces captions with greater distinctiveness
and informativeness, two attributes inherently challenging to achieve through
supervised learning. |
This paper introduces a self-supervised image captioning method that leverages CLIP relevance to generate captions, reducing the reliance on labeled image-caption pairs. |
Current image captioning approaches depend on large, high-quality labeled datasets, which are difficult and expensive to create. Additionally, relying on reference captions limits the quality of generated captions. |
The method employs a two-stage approach: 1) **Supervised Training:** Train on a small labeled dataset to establish an initial signal. 2) **Self-Supervised Training:** Utilize unlabeled data and train the model to maximize CLIP relevance between generated captions and corresponding images. |
The method achieves comparable performance to state-of-the-art models on standard metrics while using significantly less labeled data.
The generated captions are found to be more distinctive and informative than those from supervised methods based on human evaluation.
The proposed RefCompare Score, based on CLIP relevance, reveals that the generated captions are often better than the reference captions. |
The model's performance depends on the quality of the initial signal obtained during supervised training.
Further improvements might be achieved by exploring different language models or alternative self-supervised objectives. |
image captioning, self-supervised learning, clip, vision-language understanding, natural language generation |
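The CLIP relevance signal used in the self-supervised stage can be sketched as a cosine similarity between an image embedding and a caption embedding. The example below also subtracts a mean baseline from the rewards of several sampled captions, a common policy-gradient device; whether the paper uses exactly this baseline is an assumption. The vectors are placeholders, not real CLIP outputs.

```python
import numpy as np

def clip_relevance(image_emb, text_emb):
    """Cosine similarity between one image embedding and one caption embedding."""
    denom = np.linalg.norm(image_emb) * np.linalg.norm(text_emb)
    return float(image_emb @ text_emb / denom)

def advantages(image_emb, sampled_caption_embs):
    """Reward of each sampled caption minus the mean reward (a simple baseline)."""
    rewards = np.array([clip_relevance(image_emb, t) for t in sampled_caption_embs])
    return rewards - rewards.mean()

# Placeholder vectors standing in for encoder outputs.
image = np.array([1.0, 0.0, 0.0])
captions = np.array([
    [0.9, 0.1, 0.0],  # closely related caption
    [0.0, 1.0, 0.0],  # unrelated caption
])
adv = advantages(image, captions)
```

Captions with positive advantage would be reinforced and those with negative advantage discouraged, which needs no reference captions at all.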
2306.14644
Report |
PTVD: A Large-Scale Plot-Oriented Multimodal Dataset Based on Television Dramas |
Chen Li, Xutan Peng, Teng Wang, Yixiao Ge, Mengyang Liu, Xuyuan Xu, Yexin Wang, Ying Shan |
Art forms such as movies and television (TV) dramas are reflections of the
real world, which have attracted much attention from the multimodal learning
community recently. However, existing corpora in this domain share three
limitations: (1) annotated in a scene-oriented fashion, they ignore the
coherence within plots; (2) their text lacks empathy and seldom mentions
situational context; (3) their video clips fail to cover long-form relationship
due to short duration. To address these fundamental issues, using 1,106 TV
drama episodes and 24,875 informative plot-focused sentences written by
professionals, with the help of 449 human annotators, we constructed PTVD, the
first plot-oriented multimodal dataset in the TV domain. It is also the first
non-English dataset of its kind. Additionally, PTVD contains more than 26
million bullet screen comments (BSCs), powering large-scale pre-training. Next,
aiming to open-source a strong baseline for follow-up works, we developed the
multimodal algorithm that attacks different cinema/TV modelling problems with a
unified architecture. Extensive experiments on three cognitive-inspired tasks
yielded a number of novel observations (some of them being quite
counter-intuitive), further validating the value of PTVD in promoting
multimodal research. The dataset and code are released at
https://ptvd.github.io/. |
This paper introduces PTVD, a novel plot-oriented multimodal dataset for TV dramas, addressing limitations of existing scene-oriented datasets. |
Existing movie/TV datasets are limited by scene-oriented annotations, lack of empathy in text, and short clip durations, hindering research on modeling complex narratives and long-form relationships. PTVD tackles these limitations, enabling research on higher cognitive tasks in multimodal learning. |
Researchers constructed PTVD using 1,106 TV drama episodes, 24,875 plot-focused sentences, and 26M+ Bullet Screen Comments (BSCs). 449 annotators aligned clips with plot descriptions, resulting in a dataset rich in contextual and emotional information, suitable for tasks beyond scene understanding. |
Multimodal data, especially plot text, significantly improves Genre Classification, with a bias towards frequent genres observed.
Plot Retrieval performance is enhanced by fine-tuning with plot text and pre-training with BSCs. Video input consistently outperforms image input for plot retrieval, demonstrating the dataset's ability to assess models' capacity to capture long-form relationships.
Pre-training with BSCs surprisingly hinders BSC generation while benefiting plot text generation, suggesting potential differences in text distribution and complexity. |
PTVD currently includes 83 TV dramas, potentially limiting diversity and generalizability. The dataset is in Chinese, potentially introducing cultural and linguistic biases.
The proposed framework, while scalable, utilizes established techniques and lacks manual evaluation for Plot Text Generation, potentially overlooking nuanced insights. |
multimodal learning, tv drama analysis, plot understanding, dataset creation, bullet screen comments |
2306.14544
Report |
A-STAR: Test-time Attention Segregation and Retention for Text-to-image Synthesis |
Aishwarya Agarwal, Srikrishna Karanam, K J Joseph, Apoorv Saxena, Koustava Goswami, Balaji Vasan Srinivasan |
While recent developments in text-to-image generative models have led to a
suite of high-performing methods capable of producing creative imagery from
free-form text, there are several limitations. By analyzing the cross-attention
representations of these models, we notice two key issues. First, for text
prompts that contain multiple concepts, there is a significant amount of
pixel-space overlap (i.e., same spatial regions) among pairs of different
concepts. This eventually leads to the model being unable to distinguish
between the two concepts and one of them being ignored in the final generation.
Next, while these models attempt to capture all such concepts during the
beginning of denoising (e.g., first few steps) as evidenced by cross-attention
maps, this knowledge is not retained by the end of denoising (e.g., last few
steps). Such loss of knowledge eventually leads to inaccurate generation
outputs. To address these issues, our key innovations include two test-time
attention-based loss functions that substantially improve the performance of
pretrained baseline text-to-image diffusion models. First, our attention
segregation loss reduces the cross-attention overlap between attention maps of
different concepts in the text prompt, thereby reducing the confusion/conflict
among various concepts and enabling the eventual capture of all concepts in the
generated output. Next, our attention retention loss explicitly forces
text-to-image diffusion models to retain cross-attention information for all
concepts across all denoising time steps, thereby leading to reduced
information loss and the preservation of all concepts in the generated output. |
This paper proposes A-STAR, a training-free method using two new attention-based loss functions during inference to improve the semantic accuracy of pretrained text-to-image diffusion models. |
Existing text-to-image diffusion models often fail to accurately represent all concepts from the input text prompt in the generated images. |
The method introduces attention segregation loss to minimize overlap between cross-attention maps of different concepts and attention retention loss to enforce information retention across denoising steps. |
A-STAR successfully reduces attention overlap and decay, leading to generated images that better capture all concepts in the input prompt.
Quantitative evaluation using CLIP similarity and text-text similarity demonstrates significant improvement over baseline models and existing methods.
User study confirms that A-STAR generates images that are semantically more faithful to the input text. |
A-STAR currently does not explicitly model relationships between concepts, which can limit its ability to generate images with complex compositions.
Integrating A-STAR with techniques for controlling camera pose and viewpoint could further enhance the quality of generated images. |
text-to-image synthesis, diffusion models, attention mechanism, semantic accuracy, image generation |
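The two test-time losses described above can be sketched on toy attention maps. The overlap measure used here (sum of element-wise minima divided by the sum of both maps) and the thresholded mask from the previous step are assumptions chosen for illustration; the paper's exact formulas may differ.

```python
import numpy as np

def soft_overlap(a, b):
    """Overlap ratio of two non-negative maps: sum(min(a, b)) / sum(a + b)."""
    return np.minimum(a, b).sum() / (a + b).sum()

def segregation_loss(attn_maps):
    """Sum of pairwise overlaps between attention maps of different concepts."""
    total = 0.0
    for i in range(len(attn_maps)):
        for j in range(i + 1, len(attn_maps)):
            total += soft_overlap(attn_maps[i], attn_maps[j])
    return total

def retention_loss(attn_maps, prev_attn_maps, threshold=0.5):
    """Penalize attention that drifts from the region it covered one step earlier."""
    total = 0.0
    for cur, prev in zip(attn_maps, prev_attn_maps):
        mask = (prev >= threshold * prev.max()).astype(float)  # binarized earlier map
        total += 1.0 - soft_overlap(cur, mask)
    return total

# Toy 4x4 maps: one concept attends to the left half, the other to the right half.
left = np.zeros((4, 4))
left[:, :2] = 1.0
right = np.zeros((4, 4))
right[:, 2:] = 1.0

disjoint = segregation_loss([left, right])
identical = segregation_loss([left, left])
kept = retention_loss([left], [left])
lost = retention_loss([right], [left])
```

At inference, gradients of such losses with respect to the latent would be used to nudge each denoising step; that update is omitted here.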
2306.14408
Report |
Decompose and Realign: Tackling Condition Misalignment in Text-to-Image Diffusion Models |
Luozhou Wang, Guibao Shen, Wenhang Ge, Guangyong Chen, Yijun Li, Ying-cong Chen |
Text-to-image diffusion models have advanced towards more controllable
generation via supporting various additional conditions (e.g., depth map,
bounding box) beyond text. However, these models are learned based on the
premise of perfect alignment between the text and extra conditions. If this
alignment is not satisfied, the final output could be either dominated by one
condition, or ambiguity may arise, failing to meet user expectations. To address
this issue, we present a training-free approach called "Decompose and
Realign" to further improve the controllability of existing models when
provided with partially aligned conditions. The "Decompose" phase separates
conditions based on pair relationships, computing the result individually for
each pair. This ensures that each pair no longer has conflicting conditions.
The "Realign" phase aligns these independently calculated results via a
cross-attention mechanism to avoid new conflicts when combining them back. Both
qualitative and quantitative results demonstrate the effectiveness of our
approach in handling unaligned conditions, which performs favorably against
recent methods and more importantly adds flexibility to the controllable image
generation process. Our code will be available at:
https://github.com/EnVision-Research/Decompose-and-Realign. |
This paper presents "Decompose and Realign," a training-free approach that addresses misalignment between text and image conditions in multi-condition controllable image generation, adding flexibility to the generation process. |
Existing controllable generation models struggle with misaligned conditions, resulting in either one condition dominating the output or ambiguity in object correspondence. |
The "Decompose" phase separates conditions into aligned pairs to compute individual scores. The "Realign" phase aligns these scores with the unified text score via cross-attention to avoid conflicts during merging. |
Effectively handles unaligned conditions, generating all objects from the text while respecting image guidance.
Outperforms baselines in qualitative comparisons, achieving better object correspondence and reducing dominance/ambiguity.
Quantitative evaluation demonstrates improved image-text similarity and better adherence to image conditions compared to other methods. |
Effectiveness of "Realign" relies on the model's cross-attention control capability, which might be affected by model bias.
Data bias in training data can lead to unexpected disentanglement or entanglement in specific object combinations. |
image generation, controllable generation, diffusion models, cross-attention, condition misalignment |
2306.14153
Report |
DomainStudio: Fine-Tuning Diffusion Models for Domain-Driven Image Generation using Limited Data |
Jingyuan Zhu, Huimin Ma, Jiansheng Chen, Jian Yuan |
Denoising diffusion probabilistic models (DDPMs) have been proven capable of
synthesizing high-quality images with remarkable diversity when trained on
large amounts of data. Typical diffusion models and modern large-scale
conditional generative models like text-to-image generative models are
vulnerable to overfitting when fine-tuned on extremely limited data. Existing
works have explored subject-driven generation using a reference set containing
a few images. However, few prior works explore DDPM-based domain-driven
generation, which aims to learn the common features of target domains while
maintaining diversity. This paper proposes a novel DomainStudio approach to
adapt DDPMs pre-trained on large-scale source datasets to target domains using
limited data. It is designed to keep the diversity of subjects provided by
source domains and get high-quality and diverse adapted samples in target
domains. We propose to keep the relative distances between adapted samples to
achieve considerable generation diversity. In addition, we further enhance the
learning of high-frequency details for better generation quality. Our approach
is compatible with both unconditional and conditional diffusion models. This
work makes the first attempt to realize unconditional few-shot image generation
with diffusion models, achieving better quality and greater diversity than
current state-of-the-art GAN-based approaches. Moreover, this work also
significantly relieves overfitting for conditional generation and realizes
high-quality domain-driven generation, further expanding the applicable
scenarios of modern large-scale text-to-image models. |
This paper introduces DomainStudio, a novel approach to achieve few-shot domain-driven image generation with diffusion models, by preserving relative distances between generated samples and enhancing high-frequency details. |
Existing diffusion models and large-scale conditional generative models often overfit when fine-tuned on limited data, resulting in poor quality and limited diversity. This work addresses this challenge for both unconditional and conditional image generation. |
DomainStudio adapts pre-trained diffusion models to target domains using: 1) a pairwise similarity loss to maintain relative distances between generated samples, and 2) techniques to enhance high-frequency details by preserving details from the source model and learning from limited target data. |
DomainStudio achieves better generation quality and diversity than state-of-the-art unconditional GAN-based approaches.
It successfully adapts pre-trained text-to-image diffusion models to generate diverse samples in target domains with different subjects and contexts, outperforming existing subject-driven methods.
Quantitative evaluations using Intra-LPIPS and FID demonstrate superior diversity and quality compared to baselines. |
The current implementation faces challenges in scaling to higher image resolutions due to memory constraints.
While the high-frequency details enhancement shows promising results, there is room for improvement, particularly when target domains contain significantly more high-frequency components than source domains. |
image generation, diffusion models, few-shot learning, domain adaptation, text-to-image generation |
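Keeping the relative distances between adapted samples, as described above, can be sketched as matching similarity distributions between the source model and the adapted model. This is a minimal NumPy illustration under assumptions: features are plain vectors, similarities are cosine, and the KL direction is an arbitrary choice made here rather than taken from the paper.

```python
import numpy as np

def similarity_distribution(features):
    """For each sample, a softmax over its cosine similarities to the other samples."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(sims)
    # Drop the diagonal so each row holds similarities to the other n - 1 samples.
    off_diag = sims[~np.eye(n, dtype=bool)].reshape(n, n - 1)
    exp = np.exp(off_diag - off_diag.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

def pairwise_similarity_loss(source_features, adapted_features):
    """Mean KL divergence between source and adapted similarity distributions."""
    p = similarity_distribution(source_features)
    q = similarity_distribution(adapted_features)
    return float(np.mean(np.sum(p * np.log(p / q), axis=1)))

rng = np.random.default_rng(0)
source = rng.normal(size=(6, 16))
collapsed = np.tile(source[:1], (6, 1))  # every adapted sample is identical
same_loss = pairwise_similarity_loss(source, source)
collapsed_loss = pairwise_similarity_loss(source, collapsed)
```

The loss is zero when the adapted samples preserve the source similarity structure and grows when they collapse onto one another, which is the overfitting symptom the method targets.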
2306.13776
Report |
Swin-Free: Achieving Better Cross-Window Attention and Efficiency with Size-varying Window |
Jinkyu Koo, John Yang, Le An, Gwenaelle Cunha Sergio, Su Inn Park |
Transformer models have shown great potential in computer vision, following
their success in language tasks. Swin Transformer is one of them that
outperforms convolution-based architectures in terms of accuracy, while
improving efficiency when compared to Vision Transformer (ViT) and its
variants, which have quadratic complexity with respect to the input size. Swin
Transformer features shifting windows that allow cross-window connection while
limiting self-attention computation to non-overlapping local windows. However,
shifting windows introduces memory copy operations, which account for a
significant portion of its runtime. To mitigate this issue, we propose
Swin-Free in which we apply size-varying windows across stages, instead of
shifting windows, to achieve cross-connection among local windows. With this
simple design change, Swin-Free runs faster than the Swin Transformer at
inference with better accuracy. Furthermore, we also propose a few Swin-Free
variants that are faster than their Swin Transformer counterparts. |
This paper proposes Swin-Free, a Transformer-based vision model that improves both latency and accuracy over the Swin Transformer by replacing the memory-intensive shifted window scheme with size-varying windows across stages. |
Swin Transformer's shifted window scheme, while effective, introduces significant memory movement overhead, impacting its efficiency, especially on GPUs. |
Swin-Free utilizes size-varying windows across stages to achieve cross-window connections without shifting windows. The authors also explored replacing LayerNorm and GELU layers with BatchNorm and ReLU and reducing the model depth to further enhance latency. |
Swin-Free consistently achieves better accuracy than Swin Transformer on ImageNet-1K classification tasks.
Swin-Free demonstrates reduced latency compared to Swin Transformer, thanks to less memory movement and better GPU utilization with larger matrix multiplications.
Further optimizations like BatchNorm/ReLU replacement and depth reduction lead to even faster variants of Swin-Free with competitive accuracy. |
The paper mainly focuses on image classification, leaving its application to other vision tasks and larger input resolutions for future work.
Exploring dynamic window size adjustment across stages for further GPU utilization improvement is another potential direction. |
transformer, computer vision, model efficiency, swin transformer, image classification |
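The effect of size-varying windows can be seen in a small window-partition sketch. The feature-map size and the two window sizes below are illustrative assumptions; the point is only that two positions separated by a window boundary at one window size fall into the same window at a larger size, without any shifting or copying.

```python
import numpy as np

def window_partition(x, window_size):
    """Split an (H, W, C) feature map into non-overlapping square windows.

    Returns an array of shape (num_windows, window_size * window_size, C).
    """
    h, w, c = x.shape
    assert h % window_size == 0 and w % window_size == 0
    x = x.reshape(h // window_size, window_size, w // window_size, window_size, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window_size * window_size, c)

def window_index(row, col, width, window_size):
    """Index of the window that contains position (row, col)."""
    return (row // window_size) * (width // window_size) + (col // window_size)

feature_map = np.arange(8 * 8, dtype=float).reshape(8, 8, 1)
small = window_partition(feature_map, window_size=4)  # 4 windows of 16 tokens
large = window_partition(feature_map, window_size=8)  # 1 window of 64 tokens
```

Self-attention would then run independently inside each window; larger windows also mean larger matrix multiplications, which matches the GPU-utilization argument in the summary.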
2306.13653
Report |
ProRes: Exploring Degradation-aware Visual Prompt for Universal Image Restoration |
Jiaqi Ma, Tianheng Cheng, Guoli Wang, Qian Zhang, Xinggang Wang, Lefei Zhang |
Image restoration aims to reconstruct degraded images, e.g., denoising or
deblurring. Existing works focus on designing task-specific methods and there
are inadequate attempts at universal methods. However, simply unifying multiple
tasks into one universal architecture suffers from uncontrollable and undesired
predictions. To address those issues, we explore prompt learning in universal
architectures for image restoration tasks. In this paper, we present
Degradation-aware Visual Prompts, which encode various types of image
degradation, e.g., noise and blur, into unified visual prompts. These
degradation-aware prompts provide control over image processing and allow
weighted combinations for customized image restoration. We then leverage
degradation-aware visual prompts to establish a controllable and universal
model for image restoration, called ProRes, which is applicable to an extensive
range of image restoration tasks. ProRes leverages the vanilla Vision
Transformer (ViT) without any task-specific designs. Furthermore, the
pre-trained ProRes can easily adapt to new tasks through efficient prompt
tuning with only a few images. Without bells and whistles, ProRes achieves
competitive performance compared to task-specific methods, and experiments
demonstrate its ability for controllable restoration and adaptation to new
tasks. The code and models will be released at
https://github.com/leonmakise/ProRes. |
This paper introduces ProRes, a universal image restoration framework based on degradation-aware visual prompts. These prompts, encoding specific degradation types, provide control over image processing within a unified architecture, eliminating the need for task-specific designs. |
Existing image restoration methods are often task-specific, limiting their applicability to multiple degradation types. ProRes addresses this by offering a universal approach that handles diverse image restoration tasks within a single model, simplifying training and improving efficiency. |
ProRes employs a vanilla Vision Transformer (ViT) as its backbone and incorporates degradation-aware visual prompts. These image-like prompts, added to degraded images, guide the restoration process. The model is trained on a joint dataset encompassing denoising, low-light enhancement, deraining, and deblurring tasks. |
ProRes achieves competitive performance compared to task-specific methods on various benchmarks.
The degradation-aware prompts enable controllable restoration by combining prompts for different degradation types.
ProRes exhibits strong transferability, adapting effectively to new tasks or datasets via prompt tuning. |
The performance of ProRes on certain tasks may benefit from further optimization compared to highly specialized methods.
Future work can explore the impact of larger and more diverse datasets on ProRes's capabilities, particularly for complex or unseen degradation types. |
image restoration, universal model, visual prompt learning, prompt tuning, vision transformer |
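The way image-like prompts are added to a degraded image, and combined with weights for customized restoration, can be sketched as follows. The prompt names and the random arrays standing in for learned prompts are hypothetical; the backbone itself is omitted.

```python
import numpy as np

def combine_prompts(prompts, weights):
    """Weighted sum of image-like prompts, one per degradation type."""
    combined = np.zeros_like(next(iter(prompts.values())))
    for name, weight in weights.items():
        combined = combined + weight * prompts[name]
    return combined

def prompted_input(degraded_image, prompt):
    """Add the prompt to the degraded image before it is passed to the backbone."""
    return degraded_image + prompt

rng = np.random.default_rng(0)
shape = (3, 16, 16)
# Random arrays standing in for learned prompts (hypothetical names and values).
prompts = {"denoise": rng.normal(size=shape), "derain": rng.normal(size=shape)}
image = rng.uniform(size=shape)
mixed = combine_prompts(prompts, {"denoise": 0.5, "derain": 0.5})
model_input = prompted_input(image, mixed)
```

Changing the weights shifts the restoration behavior between tasks, and adapting to a new task would amount to learning one additional prompt array while the backbone stays fixed.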
2306.13455
Report |
DreamEditor: Text-Driven 3D Scene Editing with Neural Fields |
Jingyu Zhuang, Chen Wang, Lingjie Liu, Liang Lin, Guanbin Li |
Neural fields have achieved impressive advancements in view synthesis and
scene reconstruction. However, editing these neural fields remains challenging
due to the implicit encoding of geometry and texture information. In this
paper, we propose DreamEditor, a novel framework that enables users to perform
controlled editing of neural fields using text prompts. By representing scenes
as mesh-based neural fields, DreamEditor allows localized editing within
specific regions. DreamEditor utilizes the text encoder of a pretrained
text-to-image diffusion model to automatically identify the regions to be
edited based on the semantics of the text prompts. Subsequently, DreamEditor
optimizes the editing region and aligns its geometry and texture with the text
prompts through score distillation sampling [29]. Extensive experiments have
demonstrated that DreamEditor can accurately edit neural fields of real-world
scenes according to the given text prompts while ensuring consistency in
irrelevant areas. DreamEditor generates highly realistic textures and geometry,
significantly surpassing previous works in both quantitative and qualitative
evaluations. |
DreamEditor is a novel framework for text-driven 3D scene editing using neural fields, enabling localized modifications based on text prompts while preserving consistency in irrelevant areas. |
Editing neural fields is challenging due to the implicit encoding of geometry and texture. DreamEditor offers an intuitive and precise way to modify 3D scenes using simple text descriptions. |
DreamEditor represents scenes as mesh-based neural fields for localized editing. It utilizes a pretrained text-to-image diffusion model to automatically identify editing regions based on text prompts, then optimizes geometry and texture through score distillation sampling. |
DreamEditor achieves accurate and high-quality editing of real-world neural fields based on text prompts.
The method preserves irrelevant regions unchanged, ensuring consistency between original and edited scenes.
Quantitative and qualitative evaluations demonstrate DreamEditor's superiority over existing methods in editing precision, visual fidelity, and user satisfaction. |
DreamEditor inherits the Janus problem from DreamFusion, where objects may appear as front views from different viewpoints.
The method currently focuses on object-centric editing in the foreground, limited by the challenges of reconstructing backgrounds in unbounded scenes. |
neural fields, 3d scene editing, text-guided editing, score distillation sampling, mesh-based neural fields |
2306.13078
Report |
Continuous Layout Editing of Single Images with Diffusion Models |
Zhiyuan Zhang, Zhitong Huang, Jing Liao |
Recent advancements in large-scale text-to-image diffusion models have
enabled many applications in image editing. However, none of these methods have
been able to edit the layout of single existing images. To address this gap, we
propose the first framework for layout editing of a single image while
preserving its visual properties, thus allowing for continuous editing on a
single image. Our approach is achieved through two key modules. First, to
preserve the characteristics of multiple objects within an image, we
disentangle the concepts of different objects and embed them into separate
textual tokens using a novel method called masked textual inversion. Next, we
propose a training-free optimization method to perform layout control for a
pre-trained diffusion model, which allows us to regenerate images with learned
concepts and align them with user-specified layouts. As the first framework to
edit the layout of existing images, we demonstrate that our method is effective
and outperforms other baselines that were modified to support this task. Our
code will be freely available for public use upon acceptance. |
This paper presents the first framework for continuous layout editing of single images using diffusion models, allowing users to rearrange object positions while preserving visual properties. |
Existing layout control methods for image generation cannot edit the layout of existing images, limiting users' ability to experiment with different object arrangements within a given scene. |
The framework employs two key modules: (1) Masked Textual Inversion, which disentangles and embeds concepts of multiple objects within a single image into separate tokens, and (2) Training-free Layout Editing, which optimizes cross-attention during the diffusion process to align objects with user-specified layouts. |
The proposed method effectively edits image layouts while preserving visual fidelity, outperforming baselines in qualitative and quantitative comparisons.
A user study confirms the superiority of the method in terms of visual similarity, layout alignment, image quality, and overall quality.
The framework enables continuous layout editing, allowing users to experiment with various object arrangements within a single image. |
The method may struggle to preserve visual details when object sizes differ significantly between input and edited images, and with recovering the full body of heavily occluded objects.
The layout editing process is not real-time due to the iterative nature of diffusion models. |
layout editing, diffusion models, textual inversion, image manipulation, single image editing |
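The training-free layout step in the entry above optimizes cross-attention so that each object lands in its user-specified region. A minimal sketch of one possible objective follows; it assumes a single non-negative attention map per object token and a rectangular target box. The loss form and the names `box_mask` and `layout_loss` are assumptions for illustration, and the paper's actual objective may differ.

```python
import numpy as np

def box_mask(height, width, box):
    """Binary mask for a box given as (top, left, bottom, right), end-exclusive."""
    top, left, bottom, right = box
    mask = np.zeros((height, width))
    mask[top:bottom, left:right] = 1.0
    return mask

def layout_loss(attention, mask, eps=1e-8):
    """One minus the fraction of attention mass that falls inside the target region."""
    inside = (attention * mask).sum()
    return 1.0 - inside / (attention.sum() + eps)

# Toy 4x4 attention map with all of its mass in the top-left corner.
attention = np.zeros((4, 4))
attention[0, 0] = 1.0

loss_aligned = layout_loss(attention, box_mask(4, 4, (0, 0, 2, 2)))     # box covers the mass
loss_misaligned = layout_loss(attention, box_mask(4, 4, (2, 2, 4, 4)))  # box misses the mass
```

Minimizing such a loss with respect to the latent at each denoising step would push attention toward the box without updating model weights, which is what "training-free" suggests here.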
2306.12929
Report |
Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing |
Yelysei Bondarenko, Markus Nagel, Tijmen Blankevoort |
Transformer models have been widely adopted in various domains over the last
years, and especially large language models have advanced the field of AI
significantly. Due to their size, the capability of these networks has
increased tremendously, but this has come at the cost of a significant increase
in necessary compute. Quantization is one of the most effective ways to reduce
the computational time and memory consumption of neural networks. Many studies
have shown, however, that modern transformer models tend to learn strong
outliers in their activations, making them difficult to quantize. To retain
acceptable performance, the existence of these outliers requires activations to
be in higher bitwidth or the use of different numeric formats, extra
fine-tuning, or other workarounds. We show that strong outliers are related to
very specific behavior of attention heads that try to learn a "no-op" or just a
partial update of the residual. To achieve the exact zeros needed in the
attention matrix for a no-update, the input to the softmax is pushed to be
larger and larger during training, causing outliers in other parts of the
network. Based on these observations, we propose two simple (independent)
modifications to the attention mechanism - clipped softmax and gated attention.
We empirically show that models pre-trained using our methods learn
significantly smaller outliers while maintaining and sometimes even improving
the floating-point task performance. This enables us to quantize transformers
to full INT8 quantization of the activations without any additional effort. We
demonstrate the effectiveness of our methods on both language models (BERT,
OPT) and vision transformers. |
This paper proposes two simple modifications to the attention mechanism of transformer models: clipped softmax and gated attention, to address the problem of outliers in activations that hinder quantization. |
Quantization is crucial for reducing the computational time and memory consumption of transformer models, especially large language models. However, outliers in activations make it difficult to quantize these models effectively without significant performance degradation. |
The authors analyze the outlier problem and find that it stems from attention heads trying to learn a "no-op" or partial update of the residual. To achieve this, the input to the softmax is pushed to extreme values during training, causing outliers. The proposed methods, clipped softmax and gated attention, modify the attention mechanism to allow for small or zero attention outputs without requiring extreme softmax inputs. |
Clipped softmax and gated attention significantly reduce the magnitude of outliers in activations and the kurtosis of activation distributions.
The proposed methods enable full INT8 quantization of transformer activations without additional effort, with performance close to that of the original FP16/32 models.
In some cases, the proposed methods even improve the floating-point performance of the models, potentially by facilitating the learning of "no-op" updates. |
The scalability of the methods to very large transformers trained for extended periods needs further investigation.
The methods introduce additional hyperparameters, although they demonstrate robustness to these parameters. |
quantization, transformers, outliers, attention mechanism, clipped softmax, gated attention |
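The two attention modifications in the entry above can be sketched in a few lines. The clipped softmax below stretches the softmax output from (0, 1) to (gamma, zeta) and clips it back to [0, 1], so exact zeros are reachable with moderate logits. The default `gamma=-0.03` is an illustrative value, and passing `gate_logits` in directly (rather than computing them from the input with a small learned layer) is a simplification assumed for this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def clipped_softmax(x, zeta=1.0, gamma=-0.03, axis=-1):
    """Stretch softmax to the range (gamma, zeta), then clip to [0, 1]."""
    return np.clip((zeta - gamma) * softmax(x, axis) + gamma, 0.0, 1.0)

def gated_attention(q, k, v, gate_logits):
    """Single-head attention whose output is scaled by a per-token sigmoid gate."""
    d = q.shape[-1]
    probs = softmax(q @ k.T / np.sqrt(d))
    gate = 1.0 / (1.0 + np.exp(-gate_logits))
    return gate[:, None] * (probs @ v)

logits = np.array([5.0, 0.0, 0.0, 0.0])
plain = softmax(logits)            # small but non-zero probabilities on the last three
clipped = clipped_softmax(logits)  # those small probabilities are clipped to exactly zero

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
closed = gated_attention(q, k, v, gate_logits=np.full(4, -50.0))  # gate ~ 0: a "no-op" head
```

Both variants let a head produce a near-zero update without driving the softmax input to extreme values, which is the behavior the entry links to activation outliers.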
2306.12624
Report |
DreamEdit: Subject-driven Image Editing |
Tianle Li, Max Ku, Cong Wei, Wenhu Chen |
Subject-driven image generation aims at generating images containing
customized subjects, which has recently drawn enormous attention from the
research community. However, the previous works cannot precisely control the
background and position of the target subject. In this work, we aspire to fill
the void and propose two novel subject-driven sub-tasks, i.e., Subject
Replacement and Subject Addition. The new tasks are challenging in multiple
aspects: replacing a subject with a customized one can change its shape,
texture, and color, while adding a target subject to a designated position in a
provided scene necessitates a context-aware posture. To conquer these two novel
tasks, we first manually curate a new dataset DreamEditBench containing 22
different types of subjects, and 440 source images with different difficulty
levels. We plan to host DreamEditBench as a platform and hire trained
evaluators for standard human evaluation. We also devise an innovative method
DreamEditor to resolve these tasks by performing iterative generation, which
enables a smooth adaptation to the customized subject. In this project, we
conduct automatic and human evaluations to understand the performance of
DreamEditor and baselines on DreamEditBench. For Subject Replacement, we found
that the existing models are sensitive to the shape and color of the original
subject. The model failure rate will dramatically increase when the source and
target subjects are highly different. For Subject Addition, we found that the
existing models cannot easily blend the customized subjects into the background
smoothly, leading to noticeable artifacts in the generated image. We hope
DreamEditBench can become a standard platform to enable future investigations
toward building more controllable subject-driven image editing. Our project
homepage is https://dreameditbenchteam.github.io/. |
This paper introduces two novel subject-driven image editing tasks: Subject Replacement and Subject Addition, aiming to replace or add a customized subject to an image while maintaining background integrity and subject realism. |
Existing subject-driven image generation methods lack control over subject placement and background, while image editing methods struggle with subject fidelity. This work aims to bridge this gap. |
A new dataset called DreamEditBench is curated with various subjects and backgrounds for the tasks. A novel iterative method, DreamEdit, is proposed. This method fine-tunes a text-to-image model with target subject images, then iteratively in-paints the target subject onto the source image, guided by segmentation masks and text prompts. |
DreamEdit achieves better overall scores compared to baselines in both automatic and human evaluations.
The proposed tasks pose significant challenges to existing models, especially when source and target subjects differ significantly or require complex contextual interaction.
Human evaluation reveals significant discrepancies with automatic metrics, highlighting the need for rigorous human assessment in this field. |
DreamEdit struggles with large discrepancies between source and target subjects.
Iterative generation can lead to blurry backgrounds, and the model's success relies heavily on the performance of the segmentation and in-painting models. |
image editing, subject-driven generation, iterative generation, dreambooth, human evaluation |
2306.12570
Report |
Local 3D Editing via 3D Distillation of CLIP Knowledge |
Junha Hyung, Sungwon Hwang, Daejin Kim, Hyunji Lee, Jaegul Choo |
3D content manipulation is an important computer vision task with many
real-world applications (e.g., product design, cartoon generation, and 3D
Avatar editing). Recently proposed 3D GANs can generate diverse photorealistic
3D-aware contents using Neural Radiance fields (NeRF). However, manipulation of
NeRF still remains a challenging problem since the visual quality tends to
degrade after manipulation and suboptimal control handles such as 2D semantic
maps are used for manipulations. While text-guided manipulations have shown
potential in 3D editing, such approaches often lack locality. To overcome these
problems, we propose Local Editing NeRF (LENeRF), which only requires text
inputs for fine-grained and localized manipulation. Specifically, we present
three add-on modules of LENeRF, the Latent Residual Mapper, the Attention Field
Network, and the Deformation Network, which are jointly used for local
manipulations of 3D features by estimating a 3D attention field. The 3D
attention field is learned in an unsupervised way, by distilling the zero-shot
mask generation capability of CLIP to the 3D space with multi-view guidance. We
conduct diverse experiments and thorough evaluations both quantitatively and
qualitatively. |
Proposes LENeRF, a framework for localized editing of 3D scenes using text prompts for manipulation and region specification, enabling real-time, high-fidelity edits. |
Addresses limitations of existing 3D editing methods that lack locality, rely on suboptimal 2D guidance, or struggle with photorealism and multi-view consistency. |
Combines a pretrained NeRF generator with three trainable modules: Latent Residual Mapper for generating target features, Attention Field Network for estimating 3D masks, and Deformation Network for handling geometric changes. Trained with CLIP guidance and pseudo-labels from CLIP-generated relevance maps. |
Achieves localized editing with minimal unintended changes, as demonstrated by quantitative metrics and qualitative comparisons.
Exhibits robustness to out-of-distribution editing scenarios.
Enables sequential editing while preserving identity and content quality. |
Relies on pretrained models (EG3D, CLIP) and may be limited by their capabilities.
Generation of accurate 3D masks from 2D relevance maps remains challenging, potentially leading to artifacts. |
3d editing, nerf, clip, text-guided editing, 3d mask generation |
2306.12511
Report |
Semi-Implicit Denoising Diffusion Models (SIDDMs) |
Yanwu Xu, Mingming Gong, Shaoan Xie, Wei Wei, Matthias Grundmann, Kayhan Batmanghelich, Tingbo Hou |
Despite the proliferation of generative models, achieving fast sampling
during inference without compromising sample diversity and quality remains
challenging. Existing models such as Denoising Diffusion Probabilistic Models
(DDPM) deliver high-quality, diverse samples but are slowed by an inherently
high number of iterative steps. The Denoising Diffusion Generative Adversarial
Networks (DDGAN) attempted to circumvent this limitation by integrating a GAN
model for larger jumps in the diffusion process. However, DDGAN encountered
scalability limitations when applied to large datasets. To address these
limitations, we introduce a novel approach that tackles the problem by matching
implicit and explicit factors. More specifically, our approach involves
utilizing an implicit model to match the marginal distributions of noisy data
and the explicit conditional distribution of the forward diffusion. This
combination allows us to effectively match the joint denoising distributions.
Unlike DDPM but similar to DDGAN, we do not enforce a parametric distribution
for the reverse step, enabling us to take large steps during inference. Similar
to the DDPM but unlike DDGAN, we take advantage of the exact form of the
diffusion process. We demonstrate that our proposed method obtains comparable
generative performance to diffusion-based models and vastly superior results to
models with a small number of sampling steps. |
This paper introduces Semi-Implicit Denoising Diffusion Models (SIDDMs), a novel approach for fast sampling in generative models without compromising sample quality and diversity, addressing limitations in existing DDPM and DDGAN models. |
Achieving fast sampling, high-quality samples, and mode coverage simultaneously in generative models is challenging. Existing methods struggle to address all three aspects effectively, especially for large-scale datasets. |
SIDDMs decompose the denoising distribution into marginal and conditional distributions, leveraging both implicit GAN objectives for marginal matching and explicit L2 reconstruction loss for conditional matching (Auxiliary Forward Diffusion, AFD). This approach enables fast sampling similar to DDGANs while maintaining high generation quality comparable to DDPMs. Additionally, the paper introduces a novel discriminator regularization technique using an auxiliary denoising task. |
SIDDMs demonstrate superior quantitative results over DDGANs on CIFAR10, CelebA-HQ, and ImageNet datasets.
The proposed method achieves comparable generative performance to DDPMs while requiring significantly fewer sampling steps.
Ablation studies confirm the effectiveness of the proposed decomposition and the discriminator regularization technique. |
While SIDDMs show promising results, there is still a small quality gap compared to state-of-the-art diffusion-based models.
Future work could explore further improvements in the discriminator regularization and investigate the application of SIDDMs to other generative tasks. |
generative models, diffusion models, fast sampling, gans, image generation |
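The explicit half of the SIDDM decomposition above relies on the forward diffusion conditional, which for standard DDPM noise schedules is the Gaussian q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I). The sketch below samples from that conditional and scores a candidate x_{t-1} with a scaled L2 residual. The function names and the 1/(2 beta) scaling are assumptions for illustration; the paper's exact AFD objective may be formulated differently.

```python
import numpy as np

def forward_step(x_prev, beta, rng):
    """Sample x_t ~ N(sqrt(1 - beta) * x_prev, beta * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta) * x_prev + np.sqrt(beta) * noise

def conditional_l2(x_t, x_prev_hat, beta):
    """Squared error between x_t and the forward-diffusion mean of a candidate x_{t-1}.

    Up to an additive constant, this is the negative log-density of x_t
    under q(x_t | x_prev_hat).
    """
    residual = x_t - np.sqrt(1.0 - beta) * x_prev_hat
    return float((residual ** 2).sum() / (2.0 * beta))

rng = np.random.default_rng(0)
beta = 0.1
x_prev = np.ones(16)
x_t = forward_step(x_prev, beta, rng)

loss_true = conditional_l2(x_t, x_prev, beta)         # candidate equals the true x_{t-1}
loss_far = conditional_l2(x_t, x_prev + 10.0, beta)   # candidate far from the true x_{t-1}
```

A candidate that is consistent with the observed x_t scores much lower than an inconsistent one; the implicit (adversarial) term would handle the marginal of x_{t-1} separately.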
2306.12423
Report |
Benchmarking and Analyzing 3D-aware Image Synthesis with a Modularized Codebase |
Qiuyu Wang, Zifan Shi, Kecheng Zheng, Yinghao Xu, Sida Peng, Yujun Shen |
Despite the rapid advance of 3D-aware image synthesis, existing studies
usually adopt a mixture of techniques and tricks, leaving it unclear how each
part contributes to the final performance in terms of generality. Following the
most popular and effective paradigm in this field, which incorporates a neural
radiance field (NeRF) into the generator of a generative adversarial network
(GAN), we build a well-structured codebase, dubbed Carver, through modularizing
the generation process. Such a design allows researchers to develop and replace
each module independently, and hence offers an opportunity to fairly compare
various approaches and recognize their contributions from the module
perspective. The reproduction of a range of cutting-edge algorithms
demonstrates the availability of our modularized codebase. We also perform a
variety of in-depth analyses, such as the comparison across different types of
point feature, the necessity of the tailing upsampler in the generator, the
reliance on the camera pose prior, etc., which deepen our understanding of
existing methods and point out some further directions of the research work. We
release code and models at https://github.com/qiuyu96/Carver to facilitate the
development and evaluation of this field. |
This paper introduces Carver, a modular codebase for 3D-aware image synthesis, enabling researchers to readily develop and replace individual modules within the generation pipeline. |
Existing 3D-aware image synthesis methods often rely on entangled implementations, making it challenging to isolate and compare the contributions of different techniques. Carver addresses this limitation by offering a modular framework. |
Carver decomposes the generation process into independent modules: pose sampler, stochasticity mapper, point sampler, point embedder, feature decoder, volume renderer, and upsampler. This modular design allows for flexible configuration and integration of various techniques. |
Different point embedders (MLP, volume, tri-plane) show comparable performance when combined with an upsampler.
SIREN-based MLPs excel without an upsampler, while ReLU-based MLPs benefit from upsampling.
Exploiting SDF-based geometric representations generally yields inferior results compared to density-based representations. |
Training 3D GANs remains computationally expensive, especially with MLP and volume-based point embedders.
The paper primarily focuses on object-level datasets, and future work should explore extending 3D GANs to more diverse and complex scenes. |
3d-aware image synthesis, generative adversarial networks (gans), neural radiance fields (nerfs), modular codebase, 3d representation learning |
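Of the modules listed in the Carver entry above, the volume renderer is the easiest to show in isolation. The sketch below implements the standard NeRF-style compositing rule along a single ray; it is a generic illustration with a hypothetical function name, not code taken from the Carver repository.

```python
import numpy as np

def volume_render(densities, colors, deltas):
    """Composite per-point colors along one ray.

    densities: (N,) non-negative densities at the sampled points
    colors:    (N, 3) RGB values at the sampled points
    deltas:    (N,) distances between consecutive samples
    Returns the rendered RGB value and the per-point weights.
    """
    alpha = 1.0 - np.exp(-densities * deltas)
    # Transmittance: fraction of light that reaches sample i unabsorbed.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    return weights @ colors, weights

# Empty space, then an opaque green surface, then a blue point hidden behind it.
densities = np.array([0.0, 50.0, 5.0])
colors = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
deltas = np.ones(3)

rgb, weights = volume_render(densities, colors, deltas)
```

Because each module has a narrow interface like this one (arrays in, arrays out), swapping one implementation for another does not affect the rest of the pipeline, which is the point of the modular design.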
2306.12321
Report |
Dynamic Implicit Image Function for Efficient Arbitrary-Scale Image Representation |
Zongyao He, Zhi Jin |
Recent years have witnessed the remarkable success of implicit neural
representation methods. The recent work Local Implicit Image Function (LIIF)
has achieved satisfactory performance for continuous image representation,
where pixel values are inferred from a neural network in a continuous spatial
domain. However, the computational cost of such implicit arbitrary-scale
super-resolution (SR) methods increases rapidly as the scale factor increases,
which makes arbitrary-scale SR time-consuming. In this paper, we propose
Dynamic Implicit Image Function (DIIF), which is a fast and efficient method to
represent images with arbitrary resolution. Instead of taking an image
coordinate and the nearest 2D deep features as inputs to predict its pixel
value, we propose a coordinate grouping and slicing strategy, which enables the
neural network to perform decoding from coordinate slices to pixel value
slices. We further propose a Coarse-to-Fine Multilayer Perceptron (C2F-MLP) to
perform decoding with dynamic coordinate slicing, where the number of
coordinates in each slice varies as the scale factor varies. With dynamic
coordinate slicing, DIIF significantly reduces the computational cost when
encountering arbitrary-scale SR. Experimental results demonstrate that DIIF can
be integrated with implicit arbitrary-scale SR methods and achieves SOTA SR
performance with significantly superior computational efficiency, thereby
opening a path for real-time arbitrary-scale image representation. Our code can
be found at https://github.com/HeZongyao/DIIF. |
Proposes Dynamic Implicit Image Function (DIIF), a fast and efficient arbitrary-resolution image representation method, significantly reducing computational cost in arbitrary-scale super-resolution. |
Arbitrary-scale super-resolution methods based on implicit neural representations are computationally expensive, limiting their practical use despite offering continuous image representation. |
Introduces coordinate grouping and slicing strategy for efficient pixel value prediction, and a Coarse-to-Fine Multilayer Perceptron (C2F-MLP) for dynamic coordinate slicing based on scale factor. |
DIIF significantly reduces the computational cost of arbitrary-scale super-resolution, achieving up to 87% lower cost compared to previous methods.
DIIF, when integrated with existing implicit methods like LIIF and LTE, improves their super-resolution performance while enhancing efficiency.
DIIF demonstrates state-of-the-art super-resolution performance with superior computational efficiency, enabling faster and higher-quality results. |
Limited effectiveness in reducing the number of parameters, though most originate from the encoder.
Future work to focus on improving image representation and exploring more efficient decoding function architectures. |
image representation, super-resolution, implicit neural representation, arbitrary-scale, computational efficiency |
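The coordinate grouping idea in the DIIF entry above can be sketched for the simple case of an integer scale factor: every low-resolution (LR) pixel owns a slice of scale × scale high-resolution (HR) coordinates, so a decoder can map one latent plus one coordinate slice to a slice of pixel values instead of decoding each coordinate separately. The integer-scale restriction and the function name are assumptions of this sketch; the paper's dynamic slicing handles arbitrary scales.

```python
import numpy as np

def group_coordinates(lr_h, lr_w, scale):
    """Group HR pixel centers by the LR pixel that contains them.

    Returns an array of shape (lr_h * lr_w, scale * scale, 2) holding the
    (row, col) offset of each HR pixel center from its LR pixel center,
    measured in LR pixel units.
    """
    hr_rows = (np.arange(lr_h * scale) + 0.5) / scale
    hr_cols = (np.arange(lr_w * scale) + 0.5) / scale
    rows, cols = np.meshgrid(hr_rows, hr_cols, indexing="ij")
    offsets = np.stack([rows - (np.floor(rows) + 0.5),
                        cols - (np.floor(cols) + 0.5)], axis=-1)
    # (H, W, 2) -> (lr_h, scale, lr_w, scale, 2) -> one slice per LR pixel.
    grouped = offsets.reshape(lr_h, scale, lr_w, scale, 2).transpose(0, 2, 1, 3, 4)
    return grouped.reshape(lr_h * lr_w, scale * scale, 2)

groups = group_coordinates(lr_h=2, lr_w=3, scale=2)
```

With grouping, the number of decoder calls scales with the number of LR pixels rather than HR pixels, which is where the reported cost reduction would come from.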
2306.11719
Report |
Diffusion with Forward Models: Solving Stochastic Inverse Problems Without Direct Supervision |
Ayush Tewari, Tianwei Yin, George Cazenavette, Semon Rezchikov, Joshua B. Tenenbaum, Frédo Durand, William T. Freeman, Vincent Sitzmann |
Denoising diffusion models are a powerful type of generative models used to
capture complex distributions of real-world signals. However, their
applicability is limited to scenarios where training samples are readily
available, which is not always the case in real-world applications. For
example, in inverse graphics, the goal is to generate samples from a
distribution of 3D scenes that align with a given image, but ground-truth 3D
scenes are unavailable and only 2D images are accessible. To address this
limitation, we propose a novel class of denoising diffusion probabilistic
models that learn to sample from distributions of signals that are never
directly observed. Instead, these signals are measured indirectly through a
known differentiable forward model, which produces partial observations of the
unknown signal. Our approach involves integrating the forward model directly
into the denoising process. This integration effectively connects the
generative modeling of observations with the generative modeling of the
underlying signals, allowing for end-to-end training of a conditional
generative model over signals. During inference, our approach enables sampling
from the distribution of underlying signals that are consistent with a given
partial observation. We demonstrate the effectiveness of our method on three
challenging computer vision tasks. For instance, in the context of inverse
graphics, our model enables direct sampling from the distribution of 3D scenes
that align with a single 2D input image. |
This paper introduces a novel method that integrates differentiable forward models with conditional denoising diffusion models, enabling the sampling of distributions of signals never observed directly, but only through partial observations generated by a known forward model. |
This approach addresses the limitation of existing generative models that require direct access to training samples from the output distribution, which is often not feasible for real-world tasks like inverse graphics. |
The method integrates the forward model into the denoising process of a conditional diffusion model. It trains on pairs of partial observations of the same signal, using one as context and the other as the target for denoising. By iteratively denoising the target observation conditioned on the context and forward model, the model learns to sample underlying signals consistent with the observations. |
The paper provides a formal proof demonstrating that the proposed model asymptotically learns the true conditional distribution over signals as the number of observations per signal increases.
The method is successfully applied to three challenging computer vision tasks: inverse graphics, single-image motion prediction, and GAN inversion, demonstrating its efficacy in generating diverse and plausible samples consistent with partial observations.
In inverse graphics, the method enables direct sampling from the distribution of 3D scenes consistent with a single 2D image, outperforming previous state-of-the-art approaches in generating realistic and diverse 3D scenes. |
The method can be computationally expensive, particularly for tasks like inverse graphics that involve computationally intensive forward models (e.g., volume rendering).
The current implementation requires multi-view observations for training in the inverse graphics application, limiting its applicability to scenarios with single-view data. |
generative modeling, diffusion models, inverse problems, computer vision, inverse graphics |
2306.11510
Report |
Pushing the Limits of 3D Shape Generation at Scale |
Yu Wang, Xuelin Qian, Jingyang Huo, Tiejun Huang, Bo Zhao, Yanwei Fu |
We present a significant breakthrough in 3D shape generation by scaling it to
unprecedented dimensions. Through the adaptation of the Auto-Regressive model
and the utilization of large language models, we have developed a remarkable
model with an astounding 3.6 billion trainable parameters, establishing it as
the largest 3D shape generation model to date, named Argus-3D. Our approach
addresses the limitations of existing methods by enhancing the quality and
diversity of generated 3D shapes. To tackle the challenges of high-resolution
3D shape generation, our model incorporates tri-plane features as latent
representations, effectively reducing computational complexity. Additionally,
we introduce a discrete codebook for efficient quantization of these
representations. Leveraging the power of transformers, we enable multi-modal
conditional generation, facilitating the production of diverse and visually
impressive 3D shapes. To train our expansive model, we leverage an ensemble of
publicly-available 3D datasets, consisting of a comprehensive collection of
approximately 900,000 objects from renowned repositories such as ModelNet40,
ShapeNet, Pix3D, 3D-Future, and Objaverse. This diverse dataset empowers our
model to learn from a wide range of object variations, bolstering its ability
to generate high-quality and diverse 3D shapes. Extensive experimentation
demonstrates the remarkable efficacy of our approach in significantly improving
the visual quality of generated 3D shapes. By pushing the boundaries of 3D
generation, introducing novel methods for latent representation learning, and
harnessing the power of transformers for multi-modal conditional generation,
our contributions pave the way for substantial advancements in the field. Our
work unlocks new possibilities for applications in gaming, virtual reality,
product design, and other domains that demand high-quality and diverse 3D
objects. |
This paper presents Argus-3D, a novel 3D shape generation model boasting 3.6 billion trainable parameters, making it the largest of its kind. This model leverages auto-regressive techniques and large language models to enhance the quality and diversity of generated shapes, outperforming previous state-of-the-art methods. |
Existing 3D shape generation methods struggle to produce high-resolution shapes with both quality and diversity. This work addresses these limitations, pushing the boundaries of 3D generation and opening doors for applications in gaming, VR, and product design. |
The methodology involves a two-stage process: 1) Learning discrete representations by encoding point clouds into tri-plane features and quantizing them with a discrete codebook. 2) Training a transformer to generate these quantized representations autoregressively, allowing for multi-modal conditional generation based on inputs like text or images. |
Argus-3D significantly improves visual quality of generated 3D shapes, evidenced by quantitative metrics like IoU, MMD, and FPD.
The model exhibits high diversity in generated shapes, surpassing previous methods in metrics like TMD and COV.
Argus-3D demonstrates strong capability in multi-modal conditional generation, successfully producing 3D shapes guided by class labels, images, and even text prompts. |
The model's effectiveness relies heavily on the availability of large-scale 3D datasets, which can be costly and complex to create.
The transformer architecture demands significant computational resources, limiting accessibility and inference speed. |
3d shape generation, auto-regressive model, large language models, multi-modal generation, deep learning |
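The first stage in the Argus-3D entry above quantizes latent features with a discrete codebook. A minimal nearest-neighbor lookup, the usual core of vector quantization, is sketched below; the toy shapes and the function name `quantize` are illustrative assumptions, and the model's actual tri-plane encoder and codebook training are not shown.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry (Euclidean distance).

    features: (N, D) latent vectors, e.g. flattened tri-plane features
    codebook: (K, D) learned code vectors
    Returns the chosen indices (N,) and the quantized vectors (N, D).
    """
    dist2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dist2.argmin(axis=1)
    return indices, codebook[indices]

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
features = np.array([[0.1, -0.1], [4.0, 6.0], [0.9, 1.2]])

indices, quantized = quantize(features, codebook)
```

The resulting index sequence is what a second-stage transformer would model autoregressively, one token at a time.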
2306.11363
Report |
Masked Diffusion Models Are Fast Distribution Learners |
Jiachen Lei, Qinglong Wang, Peng Cheng, Zhongjie Ba, Zhan Qin, Zhibo Wang, Zhenguang Liu, Kui Ren |
The diffusion model has emerged as the de facto model for image
generation, yet the heavy training overhead hinders its broader adoption in the
research community. We observe that diffusion models are commonly trained to
learn all fine-grained visual information from scratch. This paradigm may cause
unnecessary training costs and hence requires in-depth investigation. In this
work, we show that it suffices to train a strong diffusion model by first
pre-training the model to learn some primer distribution that loosely
characterizes the unknown real image distribution. Then the pre-trained model
can be fine-tuned for various generation tasks efficiently. In the pre-training
stage, we propose to mask a high proportion (e.g., up to 90%) of input images
to approximately represent the primer distribution and introduce a masked
denoising score matching objective to train a model to denoise visible areas.
In the subsequent fine-tuning stage, we efficiently train the diffusion model without
masking. Utilizing the two-stage training framework, we achieve significant
training acceleration and a new FID score record of 6.27 on CelebA-HQ 256 × 256
for ViT-based diffusion models. The generalizability of a
pre-trained model further helps build models that perform better than ones
trained from scratch on different downstream datasets. For instance, a
diffusion model pre-trained on VGGFace2 attains a 46% quality improvement when
fine-tuned on a different dataset that contains only 3000 images. Our code is
available at https://github.com/jiachenlei/maskdm. |
This paper proposes Masked Diffusion Models (MaskDM), a two-stage training framework for diffusion models that significantly reduces training time and improves performance in image generation. |
Training diffusion models for image generation is computationally expensive, hindering broader research adoption. This work aims to improve the efficiency of the training process. |
The authors employ masked pre-training, where a model learns from masked images to approximate a "primer" distribution that captures salient image features. This pre-trained model is then fine-tuned on full images using a standard denoising score matching objective. |
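The masked pre-training idea can be sketched as follows: hide a large fraction of image patches and compute the denoising loss only over the visible ones. This is a simplified illustration under stated assumptions (a uniform random patch mask, a plain mean-squared error on predicted noise, and hypothetical function names), not the MaskDM code.

```python
import numpy as np

def random_patch_mask(grid_h, grid_w, mask_ratio, rng):
    """Return a boolean (grid_h, grid_w) array; True marks a visible patch."""
    num_patches = grid_h * grid_w
    num_visible = max(1, int(round(num_patches * (1.0 - mask_ratio))))
    visible = np.zeros(num_patches, dtype=bool)
    visible[rng.choice(num_patches, size=num_visible, replace=False)] = True
    return visible.reshape(grid_h, grid_w)

def masked_denoising_loss(pred_noise, true_noise, visible, patch_size):
    """Mean squared error restricted to the pixels of visible patches."""
    # Expand the patch-level mask to pixel resolution.
    pixel_mask = np.repeat(np.repeat(visible, patch_size, axis=0), patch_size, axis=1)
    squared_error = (pred_noise - true_noise) ** 2
    return float(squared_error[pixel_mask].mean())

rng = np.random.default_rng(0)
visible = random_patch_mask(16, 16, mask_ratio=0.9, rng=rng)   # 64x64 image, 4x4 patches
true_noise = rng.normal(size=(64, 64, 3))
pred_noise = np.zeros_like(true_noise)                          # stand-in for a model output
loss = masked_denoising_loss(pred_noise, true_noise, visible, patch_size=4)
```

In the fine-tuning stage described above, the same loss would simply be computed with every patch visible.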
MaskDM achieves a new FID score record of 6.27 on CelebA-HQ 256x256 for ViT-based diffusion models.
Masked pre-training accelerates training across various datasets and shows superior performance even when fine-tuned with limited data.
The training efficiency gains of MaskDM become increasingly significant as image resolution increases. |
The study primarily focuses on U-ViT architecture; further exploration of other ViT variants is needed.
While manual adjustment of mask rates during training demonstrates improved performance, future work could explore automated dynamic training schedules. |
diffusion models, image generation, vision transformer (vit), masked pre-training, training efficiency |
2306.10959
Report |
RaViTT: Random Vision Transformer Tokens |
Felipe A. Quezada, Carlos F. Navarro, Cristian Muñoz, Manuel Zamorano, Jorge Jara-Wilde, Violeta Chang, Cristóbal A. Navarro, Mauricio Cerda |
Vision Transformers (ViTs) have successfully been applied to image
classification problems where large annotated datasets are available. On the
other hand, when fewer annotations are available, such as in biomedical
applications, image augmentation techniques like introducing image variations
or combinations have been proposed. However, regarding ViT patch sampling, less
has been explored outside grid-based strategies. In this work, we propose
Random Vision Transformer Tokens (RaViTT), a random patch sampling strategy
that can be incorporated into existing ViTs. We experimentally evaluated RaViTT
for image classification, comparing it with a baseline ViT and state-of-the-art
(SOTA) augmentation techniques in 4 datasets, including ImageNet-1k and
CIFAR-100. Results show that RaViTT increases the accuracy of the baseline in
all datasets and outperforms the SOTA augmentation techniques in 3 out of 4
datasets by a significant margin of +1.23% to +4.32%. Interestingly, RaViTT
accuracy improvements can be achieved even with fewer tokens, thus reducing the
computational load of any ViT model for a given accuracy value. |
This paper introduces Random Vision Transformer Tokens (RaViTT), a random patch sampling strategy for Vision Transformer (ViT) models that enhances image classification performance, especially for datasets with limited training samples. |
In image classification, especially biomedical applications, limited annotated datasets hinder the training of deep learning models. While augmentation techniques exist, exploring patch sampling beyond grid-based methods in ViTs is limited, creating a need for alternative approaches. |
Instead of using the standard regular grid-like patch sampling, RaViTT randomly selects patches from the input image, potentially increasing the diversity of training samples and improving feature extraction. This random sampling allows for overlapping patches and employs a sampling factor (r) to control the number of patches extracted. |
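A minimal sketch of random patch sampling is given below. It assumes that patch corners are drawn uniformly at random and that the sampling factor r scales the number of patches relative to the regular grid count; both choices, along with the function name `sample_random_patches`, are assumptions for illustration rather than details confirmed by the summary.

```python
import numpy as np

def sample_random_patches(image, patch_size, r, rng):
    """Sample possibly overlapping square patches at random positions."""
    h, w = image.shape[:2]
    # Number of patches a regular, non-overlapping grid would produce.
    grid_count = (h // patch_size) * (w // patch_size)
    num_patches = max(1, int(round(r * grid_count)))
    # Top-left corners drawn uniformly so every patch lies inside the image.
    tops = rng.integers(0, h - patch_size + 1, size=num_patches)
    lefts = rng.integers(0, w - patch_size + 1, size=num_patches)
    patches = np.stack([
        image[t:t + patch_size, l:l + patch_size] for t, l in zip(tops, lefts)
    ])
    return patches, np.stack([tops, lefts], axis=1)

rng = np.random.default_rng(0)
image = np.arange(32 * 32 * 3).reshape(32, 32, 3)
# r = 0.5 yields half as many tokens as the 4x4 grid of 8x8 patches.
patches, positions = sample_random_patches(image, patch_size=8, r=0.5, rng=rng)
```

The returned positions could be used to build positional embeddings for the sampled tokens, since the patches no longer sit on a fixed grid.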
RaViTT increases the accuracy of the baseline ViT model in all four evaluated datasets (ImageNet-1k, CIFAR-100, G. CANCER-3, and DeFungi).
RaViTT outperforms state-of-the-art (SOTA) augmentation techniques (RandAugment and MixUp) in three out of four datasets.
RaViTT can achieve accuracy improvements even with fewer tokens than the baseline, indicating the potential for reducing computational load without sacrificing accuracy. |
The performance gain of RaViTT is limited on the CIFAR-100 dataset, potentially due to the small image size and the resulting high overlap between randomly sampled patches.
Future work can explore optimizing random distributions for patch sampling to further enhance RaViTT's efficiency, especially when the sampling factor is high (r>1). |
vision transformers, image classification, random patch sampling, data augmentation, computational efficiency |
2306.10730
Report |
UniG3D: A Unified 3D Object Generation Dataset |
Qinghong Sun, Yangguang Li, ZeXiang Liu, Xiaoshui Huang, Fenggang Liu, Xihui Liu, Wanli Ouyang, Jing Shao |
The field of generative AI has a transformative impact on various areas,
including virtual reality, autonomous driving, the metaverse, gaming, and
robotics. Among these applications, 3D object generation techniques are of
utmost importance. This technique has unlocked fresh avenues in the realm of
creating, customizing, and exploring 3D objects. However, the quality and
diversity of existing 3D object generation methods are constrained by the
inadequacies of existing 3D object datasets, including issues related to text
quality, the incompleteness of multi-modal data representation encompassing 2D
rendered images and 3D assets, as well as the size of the dataset. In order to
resolve these issues, we present UniG3D, a unified 3D object generation dataset
constructed by employing a universal data transformation pipeline on Objaverse
and ShapeNet datasets. This pipeline converts each raw 3D model into
comprehensive multi-modal data representation
by employing rendering engines and multi-modal models. These modules ensure the
richness of textual information and the comprehensiveness of data
representation. Remarkably, the universality of our pipeline refers to its
ability to be applied to any 3D dataset, as it only requires raw 3D data. The
selection of data sources for our dataset is based on their scale and quality.
Subsequently, we assess the effectiveness of our dataset by employing Point-E
and SDFusion, two widely recognized methods for object generation, tailored to
the prevalent 3D representations of point clouds and signed distance functions.
Our dataset is available at: https://unig3d.github.io. |
This paper introduces UniG3D, a unified large-scale 3D object generation dataset with rich textual descriptions and comprehensive multi-modal data (mesh, point cloud, image). |
Existing 3D object generation methods are limited by the inadequacies of current datasets, including issues related to text quality, the lack of multi-modal data (e.g., 2D rendered images, 3D assets), and dataset size. |
The authors construct UniG3D by developing a universal data transformation pipeline that converts raw 3D models from ShapeNet and Objaverse into the unified multi-modal representation using a rendering engine (Blender) and multi-modal models (CLIP and BLIP). |
Using both text and images as conditioning inputs leads to better 3D object generation than using either modality alone.
Increasing data sources and incorporating multi-view data improve the diversity and quality of generated 3D objects.
Generating and leveraging richer textual descriptions beyond object categories significantly improves the controllability and quality of text-conditioned 3D object generation. |
The experiments are limited by computational resources, preventing the use of the full UniG3D-Objaverse dataset.
Future work will explore a wider range of 3D generation methods and incorporate 3D understanding tasks. |
3d object generation, dataset, multi-modal, text-to-3d, image-to-3d |
2306.10533
Report |
Point-Cloud Completion with Pretrained Text-to-image Diffusion Models |
Yoni Kasten, Ohad Rahamim, Gal Chechik |
Point-cloud data collected in real-world applications are often incomplete.
Data is typically missing due to objects being observed from partial
viewpoints, which only capture a specific perspective or angle. Additionally,
data can be incomplete due to occlusion and low-resolution sampling. Existing
completion approaches rely on datasets of predefined objects to guide the
completion of noisy and incomplete point clouds. However, these approaches
perform poorly when tested on Out-Of-Distribution (OOD) objects that are
poorly represented in the training dataset. Here we leverage recent advances in
text-guided image generation, which lead to major breakthroughs in text-guided
shape generation. We describe an approach called SDS-Complete that uses a
pre-trained text-to-image diffusion model and leverages the text semantics of a
given incomplete point cloud of an object, to obtain a complete surface
representation. SDS-Complete can complete a variety of objects using test-time
optimization without expensive collection of 3D information. We evaluate
SDS-Complete on incomplete scanned objects, captured by real-world depth sensors
and LiDAR scanners. We find that it effectively reconstructs objects that are
absent from common datasets, reducing Chamfer loss by 50% on average compared
with current methods. Project page: https://sds-complete.github.io/ |
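For readers unfamiliar with the Chamfer metric mentioned in the abstract, the sketch below computes one common variant: the sum of mean nearest-neighbour Euclidean distances in both directions. Conventions differ (squared versus unsquared distances, sums versus means), and the summary does not say which one the authors use, so this form is an assumption.

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point sets p (N, 3) and q (M, 3)."""
    # Pairwise Euclidean distances, shape (N, M).
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    # Mean distance from each point to its nearest neighbour in the other set.
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

completed = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
reference = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
cd = chamfer_distance(completed, reference)
```

A lower value means the completed point set lies closer to the reference surface samples.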
Presents SDS-Complete, a method for completing point clouds into complete surface representations using pre-trained text-to-image diffusion models and test-time optimization. |
Addresses limitations of existing point cloud completion methods that struggle with out-of-distribution (OOD) objects not well-represented in training datasets. |
Leverages the semantic prior of pre-trained text-to-image diffusion models through the SDS loss, combined with an SDF surface representation and constraints to enforce consistency with input points and sensor observations. |
Achieves state-of-the-art completion results for OOD objects.
Demonstrates robustness to variations in text prompts.
Maintains comparable performance to existing methods on in-domain objects. |
Limited by low-resolution image rendering for SDS loss due to GPU memory constraints.
Struggles with objects containing components with disc topology due to SDF initialization. |
point cloud completion, diffusion models, text-to-image synthesis, signed distance function, out-of-distribution generalization |
2306.10441
Report |
Image Harmonization with Diffusion Model |
Jiajie Li, Jian Wang, Chen Wang, Jinjun Xiong |
Image composition in image editing involves merging a foreground image with a
background image to create a composite. Inconsistent lighting conditions
between the foreground and background often result in unrealistic composites.
Image harmonization addresses this challenge by adjusting illumination and
color to achieve visually appealing and consistent outputs. In this paper, we
present a novel approach for image harmonization by leveraging diffusion
models. We conduct a comparative analysis of two conditional diffusion models,
namely Classifier-Guidance and Classifier-Free. Our focus is on addressing the
challenge of adjusting illumination and color in foreground images to create
visually appealing outputs that seamlessly blend with the background. Through
this research, we establish a solid groundwork for future investigations in the
realm of diffusion model-based image harmonization. |
This paper presents a novel image harmonization method utilizing diffusion models, focusing on adjusting foreground illumination and color for realistic integration with the background. |
Image composition often suffers from inconsistent lighting between foreground and background, leading to unrealistic composites. This method leverages diffusion models to address this challenge and create visually appealing, consistent outputs. |
The approach utilizes both classifier-guided and classifier-free conditional diffusion models, including DDPM and LDM. It introduces an appearance consistency discriminator and a color transfer method to maintain visual coherence throughout the harmonization process. |
The method achieves superior performance compared to existing state-of-the-art approaches on the iHarmony4 dataset.
Experiments on real composite images from Open Image Dataset V6 and Flick Dataset also demonstrate its effectiveness.
The inherent stochasticity of the diffusion model allows generating multiple diverse harmonization results for a single input, providing users with flexibility and control. |
The discrepancy between the iHarmony4 dataset's synthesized composite images and real-world scenarios might limit the model's generalizability.
Future work will focus on addressing real-world image harmonization challenges with more complex and diverse lighting conditions. |
image harmonization, diffusion models, image editing, deep learning, computer vision |
2306.10128
Report |
Systematic Architectural Design of Scale Transformed Attention Condenser DNNs via Multi-Scale Class Representational Response Similarity Analysis |
Andre Hryniowski, Alexander Wong |
Self-attention mechanisms are commonly included in convolutional neural
networks to achieve an improved efficiency-performance balance. However, adding
self-attention mechanisms adds additional hyperparameters to tune for the
application at hand. In this work we propose a novel type of DNN analysis
called Multi-Scale Class Representational Response Similarity Analysis
(ClassRepSim) which can be used to identify specific design interventions that
lead to more efficient self-attention convolutional neural network
architectures. Using insights gained from ClassRepSim we propose the Spatial
Transformed Attention Condenser (STAC) module, a novel attention-condenser
based self-attention module. We show that adding STAC modules to ResNet style
architectures can result in up to a 1.6% increase in top-1 accuracy compared to
vanilla ResNet models and up to a 0.5% increase in top-1 accuracy compared to
SENet models on the ImageNet64x64 dataset, at the cost of up to 1.7% increase
in FLOPs and 2x the number of parameters. In addition, we demonstrate that
results from ClassRepSim analysis can be used to select an effective
parameterization of the STAC module resulting in competitive performance
compared to an extensive parameter search. |
This paper proposes the Spatial Transformed Attention Condenser (STAC) module, an efficient self-attention module for convolutional neural networks informed by a novel analysis method called Multi-Scale Class Representational Response Similarity Analysis (ClassRepSim). |
Adding self-attention mechanisms can improve deep neural network efficiency and performance, but it introduces additional hyperparameters. This work aims to guide the design of more efficient self-attention architectures. |
The authors introduce ClassRepSim, which analyzes the class-wise similarity of data representations at different spatial scales within a DNN. They use insights from this analysis to design the STAC module. Experiments are conducted on CIFAR10, ImageNet64x64-50, and ImageNet64x64 datasets using ResNet architectures. |
Adding STAC modules to ResNet architectures consistently improves top-1 accuracy compared to vanilla ResNet models.
STAC modules outperform existing self-attention modules like SENet and BAM in most cases, achieving higher accuracy with a smaller increase in computational cost.
ClassRepSim analysis effectively guides the selection of STAC module parameters, leading to competitive performance compared to extensive parameter search. |
Further experiments are needed to assess the generalizability of STAC modules across a wider range of datasets and model architectures.
Future work could explore the relationship between ClassRepSim and other representational response metrics, such as intrinsic dimensionality. |
self-attention, deep neural networks, computer vision, image classification, representational similarity analysis |
2306.10012
Report |
MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing |
Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, Yu Su |
Text-guided image editing is widely needed in daily life, ranging from
personal use to professional applications such as Photoshop. However, existing
methods are either zero-shot or trained on an automatically synthesized
dataset, which contains a high volume of noise. Thus, they still require lots
of manual tuning to produce desirable outcomes in practice. To address this
issue, we introduce MagicBrush (https://osu-nlp-group.github.io/MagicBrush/),
the first large-scale, manually annotated dataset for instruction-guided real
image editing that covers diverse scenarios: single-turn, multi-turn,
mask-provided, and mask-free editing. MagicBrush comprises over 10K manually
annotated triplets (source image, instruction, target image), which supports
training large-scale text-guided image editing models. We fine-tune
InstructPix2Pix on MagicBrush and show that the new model can produce much
better images according to human evaluation. We further conduct extensive
experiments to evaluate current image editing baselines from multiple
dimensions including quantitative, qualitative, and human evaluations. The
results reveal the challenging nature of our dataset and the gap between
current baselines and real-world editing needs. |
This paper introduces MagicBrush, the first large-scale, manually annotated dataset specifically designed for instruction-guided real image editing, covering diverse scenarios like single-turn, multi-turn, mask-provided, and mask-free editing. |
Existing methods rely on zero-shot learning or training on synthetic data with noise, limiting their effectiveness for real-world editing. MagicBrush addresses this gap by providing high-quality, human-annotated data to facilitate the development and evaluation of more robust and user-friendly image editing models. |
The dataset was created using a rigorous crowdsourcing process involving qualified workers on Amazon Mechanical Turk. Workers proposed edit instructions and utilized the DALL-E 2 image editing platform to interactively synthesize target images. The process involved single and multi-turn edits with and without mask guidance, ensuring diversity in editing scenarios. |
MagicBrush comprises over 10K manually annotated triplets (source image, instruction, target image), making it suitable for training large-scale text-guided image editing models.
Fine-tuning InstructPix2Pix on MagicBrush significantly improved its performance compared to the original model and other baselines, as demonstrated by quantitative, qualitative, and human evaluations.
Existing image editing models, even with additional guidance like masks, struggle to match the quality and consistency of human-annotated edits in MagicBrush, highlighting the challenging nature of the dataset and the need for more advanced models. |
While efforts were made to ensure diversity, MagicBrush's reliance on DALL-E 2 for ground truth generation may introduce inherent biases.
The dataset primarily focuses on local editing tasks and does not cover global editing operations like style transfer, which could be explored in future work. |
image editing, text-guided image editing, dataset, instruction following, human evaluation |
2306.09864
Report |
AvatarBooth: High-Quality and Customizable 3D Human Avatar Generation |
Yifei Zeng, Yuanxun Lu, Xinya Ji, Yao Yao, Hao Zhu, Xun Cao |
We introduce AvatarBooth, a novel method for generating high-quality 3D
avatars using text prompts or specific images. Unlike previous approaches that
can only synthesize avatars based on simple text descriptions, our method
enables the creation of personalized avatars from casually captured face or
body images, while still supporting text-based model generation and editing.
Our key contribution is the precise avatar generation control by using dual
fine-tuned diffusion models separately for the human face and body. This
enables us to capture intricate details of facial appearance, clothing, and
accessories, resulting in highly realistic avatar generations. Furthermore, we
introduce a pose-consistent constraint to the optimization process to enhance the
multi-view consistency of synthesized head images from the diffusion model and
thus eliminate interference from uncontrolled human poses. In addition, we
present a multi-resolution rendering strategy that facilitates coarse-to-fine
supervision of 3D avatar generation, thereby enhancing the performance of the
proposed system. The resulting avatar model can be further edited using
additional text descriptions and driven by motion sequences. Experiments show
that AvatarBooth outperforms previous text-to-3D methods in terms of rendering
and geometric quality from either text prompts or specific images. Please check
our project website at https://zeng-yifei.github.io/avatarbooth_page/. |
This paper introduces AvatarBooth, a novel method for generating high-quality, customizable 3D avatars from text prompts or specific images, enabling personalized avatar creation with intricate details of facial appearance, clothing, and accessories. |
Creating 3D human avatars from text or images is crucial for various applications, but existing methods struggle to synthesize high-quality shapes and appearances, especially for personalized avatars. |
The method uses dual fine-tuned diffusion models for the face and body, a pose-consistent constraint for multi-view consistency, and a multi-resolution rendering strategy for coarse-to-fine supervision of 3D avatar generation. |
Generates high-quality 3D avatars matching text prompts or specific images.
Enables personalized avatar creation with detailed facial features, clothing, and accessories.
Outperforms previous text-to-3D methods in rendering and geometric quality. |
Accuracy and speed of model generation can be further improved.
Leveraging existing 3D human datasets could enhance avatar quality. |
avatar creation, diffusion model, neural implicit field, model fine-tuning, 3d human avatar generation |
2306.09683
Report |
Scaling Open-Vocabulary Object Detection |
Matthias Minderer, Alexey Gritsenko, Neil Houlsby |
Open-vocabulary object detection has benefited greatly from pretrained
vision-language models, but is still limited by the amount of available
detection training data. While detection training data can be expanded by using
Web image-text pairs as weak supervision, this has not been done at scales
comparable to image-level pretraining. Here, we scale up detection data with
self-training, which uses an existing detector to generate pseudo-box
annotations on image-text pairs. Major challenges in scaling self-training are
the choice of label space, pseudo-annotation filtering, and training
efficiency. We present the OWLv2 model and OWL-ST self-training recipe, which
address these challenges. OWLv2 surpasses the performance of previous
state-of-the-art open-vocabulary detectors already at comparable training
scales (~10M examples). However, with OWL-ST, we can scale to over 1B examples,
yielding further large improvement: With an L/14 architecture, OWL-ST improves
AP on LVIS rare classes, for which the model has seen no human box annotations,
from 31.2% to 44.6% (43% relative improvement). OWL-ST unlocks Web-scale
training for open-world localization, similar to what has been seen for image
classification and language modelling. |
This paper introduces OWL-ST, a self-training approach for open-vocabulary object detection that leverages web-scale image-text pairs, and OWLv2, an architecture optimized for efficient training. |
Open-vocabulary object detection, despite benefiting from pre-trained vision-language models, is still hampered by limited detection training data. This work aims to address this bottleneck using self-training on a massive scale, comparable to image-level pre-training. |
The authors utilize OWL-ViT for generating bounding box pseudo-annotations on the WebLI dataset. They introduce a simple yet effective self-training recipe focusing on: (1) employing all N-grams from image captions as detection prompts, (2) utilizing weak confidence filtering for pseudo-labels, and (3) enhancing training efficiency through token dropping, instance selection, and image mosaics. |
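Step (1) of the recipe, turning a caption into detection prompts, can be illustrated with a plain n-gram extractor. The whitespace tokenization, lowercasing, and the `max_n` cutoff used here are assumptions for the example; the summary does not specify how the authors tokenize or filter captions.

```python
def caption_ngrams(caption, max_n):
    """Return all word n-grams of a caption, for n = 1 .. max_n."""
    words = caption.lower().split()
    ngrams = []
    for n in range(1, max_n + 1):
        # Slide a window of n words across the caption.
        for start in range(len(words) - n + 1):
            ngrams.append(" ".join(words[start:start + n]))
    return ngrams

prompts = caption_ngrams("A red car", max_n=3)
print(prompts)
# ['a', 'red', 'car', 'a red', 'red car', 'a red car']
```

Each resulting string would serve as a text query for the existing detector, whose confident boxes then become pseudo-annotations.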
OWLv2 coupled with OWL-ST surpasses previous state-of-the-art open-vocabulary detectors even at moderate training scales.
Scaling self-training to over 1B examples further boosts performance, significantly improving AP on LVIS rare classes (for which the model has seen no human box annotations).
The authors demonstrate a trade-off between fine-tuned and open-vocabulary performance, suggesting potential for future research in robust generalization. |
The self-training process demands significant compute and data resources, making further scaling increasingly challenging.
The trade-off between performance on fine-tuned classes and open-vocabulary robustness needs further investigation for improved generalization. |
open-vocabulary object detection, self-training, weak supervision, web-scale data, vision-language models |
2306.09551
Report |
Edit-DiffNeRF: Editing 3D Neural Radiance Fields using 2D Diffusion Model |
Lu Yu, Wei Xiang, Kang Han |
Recent research has demonstrated that the combination of pretrained diffusion
models with neural radiance fields (NeRFs) has emerged as a promising approach
for text-to-3D generation. Simply coupling NeRF with diffusion models will
result in cross-view inconsistency and degradation of stylized view syntheses.
To address this challenge, we propose the Edit-DiffNeRF framework, which is
composed of a frozen diffusion model, a proposed delta module to edit the
latent semantic space of the diffusion model, and a NeRF. Instead of training
the entire diffusion for each scene, our method focuses on editing the latent
semantic space in frozen pretrained diffusion models by the delta module. This
fundamental change to the standard diffusion framework enables us to make
fine-grained modifications to the rendered views and effectively consolidate
these instructions in a 3D scene via NeRF training. As a result, we are able to
produce an edited 3D scene that faithfully aligns to input text instructions.
Furthermore, to ensure semantic consistency across different viewpoints, we
propose a novel multi-view semantic consistency loss that extracts a latent
semantic embedding from the input view as a prior, and aim to reconstruct it in
different views. Our proposed method has been shown to effectively edit
real-world 3D scenes, resulting in 25% improvement in the alignment of the
performed 3D edits with text instructions compared to prior work. |
This paper proposes Edit-DiffNeRF, a framework for editing pretrained NeRF scenes using text instructions by manipulating the latent space of frozen, pretrained diffusion models. |
Existing methods for editing NeRFs with text instructions often result in inconsistencies across views and struggle to faithfully apply edits to the 3D scene. |
Edit-DiffNeRF uses a delta module to learn edits in the latent space of a frozen diffusion model, guided by text instructions. A multi-view semantic consistency loss ensures consistent edits across different viewpoints during NeRF training. |
Edit-DiffNeRF achieves 25% better alignment of 3D edits with text instructions compared to prior work.
The method exhibits improved CLIP Direction Consistency, indicating better temporal stability of edits across multiple views.
Edit-DiffNeRF maintains high visual fidelity after editing, as evidenced by FID scores comparable to pre-edit scenes. |
The model's performance depends on the quality of the initial NeRF reconstruction and the diffusion model's generalization ability.
Editing results can be negatively impacted by low-resolution or blurry input images. |
neural radiance fields (nerfs), diffusion models, text-to-3d generation, 3d scene editing, multi-view consistency |
2306.09349
Report |
UrbanIR: Large-Scale Urban Scene Inverse Rendering from a Single Video |
Zhi-Hao Lin, Bohan Liu, Yi-Ting Chen, David Forsyth, Jia-Bin Huang, Anand Bhattad, Shenlong Wang |
We show how to build a model that allows realistic, free-viewpoint renderings
of a scene under novel lighting conditions from video. Our method -- UrbanIR:
Urban Scene Inverse Rendering -- computes an inverse graphics representation
from the video. UrbanIR jointly infers shape, albedo, visibility, and sun and
sky illumination from a single video of unbounded outdoor scenes with unknown
lighting. UrbanIR uses videos from cameras mounted on cars (in contrast to many
views of the same points in typical NeRF-style estimation). As a result,
standard methods produce poor geometry estimates (for example, roofs), and
there are numerous "floaters". Errors in inverse graphics inference can
result in strong rendering artifacts. UrbanIR uses novel losses to control
these and other sources of error. UrbanIR uses a novel loss to make very good
estimates of shadow volumes in the original scene. The resulting
representations facilitate controllable editing, delivering photorealistic
free-viewpoint renderings of relit scenes and inserted objects. Qualitative
evaluation demonstrates strong improvements over the state-of-the-art. |
UrbanIR enables realistic, free-viewpoint renderings of large-scale urban scenes under novel lighting conditions from a single video. |
Existing methods struggle with poor geometry estimates and rendering artifacts when applied to unbounded outdoor scenes from car-mounted cameras. |
UrbanIR combines monocular intrinsic decomposition and inverse rendering with a neural scene model. It uses novel losses to ensure consistency between scene geometry, detected shadows, and deshadowed images. |
Significantly improved geometry estimates and shadow rendering compared to baselines.
Enables realistic relighting effects, including changes in sun position and nighttime simulations.
Facilitates accurate object insertion with realistic shadow casting. |
Relies on multiple 2D priors during optimization, leading to occasional shadow removal imperfections.
Large changes in sun direction can lead to inaccurate shadows due to limitations in geometry refinement. |
inverse rendering, neural rendering, scene relighting, shadow modeling, urban scenes |
2306.09344
Report |
DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data |
Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, Phillip Isola |
Current perceptual similarity metrics operate at the level of pixels and
patches. These metrics compare images in terms of their low-level colors and
textures, but fail to capture mid-level similarities and differences in image
layout, object pose, and semantic content. In this paper, we develop a
perceptual metric that assesses images holistically. Our first step is to
collect a new dataset of human similarity judgments over image pairs that are
alike in diverse ways. Critical to this dataset is that judgments are nearly
automatic and shared by all observers. To achieve this we use recent
text-to-image models to create synthetic pairs that are perturbed along various
dimensions. We observe that popular perceptual metrics fall short of explaining
our new data, and we introduce a new metric, DreamSim, tuned to better align
with human perception. We analyze how our metric is affected by different
visual attributes, and find that it focuses heavily on foreground objects and
semantic content while also being sensitive to color and layout. Notably,
despite being trained on synthetic data, our metric generalizes to real images,
giving strong results on retrieval and reconstruction tasks. Furthermore, our
metric outperforms both prior learned metrics and recent large vision models on
these tasks. |
This paper introduces a new perceptual metric, DreamSim, trained on a novel dataset of synthetic image triplets (NIGHTS), designed to capture mid-level visual similarities. |
Existing perceptual similarity metrics fail to capture mid-level similarities like object pose, layout, and semantic content, which are crucial for human perception. |
The authors collect human similarity judgments on synthetic image triplets generated with Stable Diffusion, ensuring cognitive impenetrability. They then train DreamSim by ensembling and fine-tuning large vision models (DINO, CLIP, OpenCLIP) on NIGHTS. |
DreamSim achieves high agreement with human judgments on NIGHTS (96.16%) and generalizes well to real images, outperforming existing metrics in retrieval and reconstruction tasks.
Analysis reveals DreamSim's sensitivity to foreground objects, color, and layout, surpassing prior metrics in capturing mid-level similarities.
Despite being trained on synthetic data, DreamSim demonstrates improved performance on low-level similarity benchmarks (BAPPS, TID2013, KADID-10k) compared to base models. |
The dataset predominantly focuses on object-centric domains, limiting the generalizability of DreamSim to other aspects of human similarity perception.
The model might inherit biases from the pretrained backbones (Stable Diffusion, CLIP, OpenCLIP, DINO) and the generative process. |
perceptual similarity, metric learning, synthetic data, image retrieval, feature inversion |
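As a rough illustration of the DreamSim entry above, agreement with human judgments on triplets can be scored as a two-alternative forced choice: the metric "votes" for whichever of two candidate images is closer to the reference, and the vote is compared with the human choice. This is a minimal sketch, not the authors' code. It assumes the metric is a cosine distance between image embeddings; the function names and array layout are hypothetical.

```python
import numpy as np

def cosine_distance(a, b):
    """Row-wise cosine distance (1 - cosine similarity) between two batches."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return 1.0 - np.sum(a * b, axis=-1)

def two_afc_agreement(ref, img_a, img_b, human_prefers_b):
    """Fraction of triplets where the metric picks the same image as humans.

    ref, img_a, img_b: (N, D) embeddings (assumed to come from some backbone).
    human_prefers_b:   (N,) booleans, True if annotators chose image B.
    """
    d_a = cosine_distance(ref, img_a)
    d_b = cosine_distance(ref, img_b)
    model_prefers_b = d_b < d_a  # the closer image is the predicted choice
    return float(np.mean(model_prefers_b == human_prefers_b))
```

A score of 1.0 means the metric agrees with the annotators on every triplet; 0.5 is chance level for balanced data.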
2306.09341
Report |
Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis |
Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, Hongsheng Li |
Recent text-to-image generative models can generate high-fidelity images from
text inputs, but the quality of these generated images cannot be accurately
evaluated by existing evaluation metrics. To address this issue, we introduce
Human Preference Dataset v2 (HPD v2), a large-scale dataset that captures human
preferences on images from a wide range of sources. HPD v2 comprises 798,090
human preference choices on 433,760 pairs of images, making it the largest
dataset of its kind. The text prompts and images are deliberately collected to
eliminate potential bias, which is a common issue in previous datasets. By
fine-tuning CLIP on HPD v2, we obtain Human Preference Score v2 (HPS v2), a
scoring model that can more accurately predict human preferences on generated
images. Our experiments demonstrate that HPS v2 generalizes better than
previous metrics across various image distributions and is responsive to
algorithmic improvements of text-to-image generative models, making it a
preferable evaluation metric for these models. We also investigate the design
of the evaluation prompts for text-to-image generative models, to make the
evaluation stable, fair and easy-to-use. Finally, we establish a benchmark for
text-to-image generative models using HPS v2, which includes a set of recent
text-to-image models from academia, the community, and industry. The code and
dataset are available at https://github.com/tgxs002/HPSv2 . |
This paper introduces Human Preference Dataset v2 (HPD v2), a large-scale dataset for evaluating human preferences in text-to-image generation, and Human Preference Score v2 (HPS v2), a model fine-tuned on HPD v2 to predict human preferences. |
Existing evaluation metrics fail to accurately assess the quality of images generated by text-to-image models, necessitating a method aligned with human perception. |
HPD v2 is built by collecting prompts, generating images from various models, and gathering human preference annotations. HPS v2 is then trained by fine-tuning a CLIP model on HPD v2. |
HPD v2 is larger and less biased than previous datasets, containing 798k human preference comparisons.
HPS v2 outperforms previous preference prediction models, achieving 83.3% accuracy on the HPD v2 test set.
The authors establish a benchmark for text-to-image models using HPS v2, comparing models across different styles. |
The prompts and images in HPD v2 are sourced from specific databases (DiffusionDB, COCO Captions) which may not cover all aspects of image generation.
The use of ChatGPT for prompt cleaning, while mitigating bias, could potentially introduce new biases. |
text-to-image generation, human preference, evaluation metrics, benchmark, clip |
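To make the HPS v2 entry above more concrete, the sketch below shows one plausible way a CLIP-style model can be trained on pairwise preferences: the prompt-image similarities of the candidate images are turned into a softmax distribution, and the loss is the negative log-probability of the image the annotator chose. The report only states that CLIP is fine-tuned on HPD v2, so the exact objective, the temperature value, and all names here are assumptions for illustration.

```python
import numpy as np

def preference_probs(text_emb, img_embs, temperature=100.0):
    """Softmax over scaled cosine similarities between one prompt and K images.

    text_emb: (D,) prompt embedding; img_embs: (K, D) image embeddings.
    temperature is a hypothetical logit scale, not a value from the paper.
    """
    t = text_emb / np.linalg.norm(text_emb)
    x = img_embs / np.linalg.norm(img_embs, axis=-1, keepdims=True)
    logits = temperature * (x @ t)
    logits = logits - logits.max()  # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def preference_loss(text_emb, img_embs, chosen):
    """Cross-entropy between predicted preferences and the human choice."""
    return float(-np.log(preference_probs(text_emb, img_embs)[chosen]))
```

Minimizing this loss pushes the similarity score of the preferred image above that of the rejected one for the same prompt.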
2306.09329
Report |
DreamHuman: Animatable 3D Avatars from Text |
Nikos Kolotouros, Thiemo Alldieck, Andrei Zanfir, Eduard Gabriel Bazavan, Mihai Fieraru, Cristian Sminchisescu |
We present DreamHuman, a method to generate realistic animatable 3D human
avatar models solely from textual descriptions. Recent text-to-3D methods have
made considerable strides in generation, but are still lacking in important
aspects. Control and often spatial resolution remain limited, existing methods
produce fixed rather than animated 3D human models, and anthropometric
consistency for complex structures like people remains a challenge. DreamHuman
connects large text-to-image synthesis models, neural radiance fields, and
statistical human body models in a novel modeling and optimization framework.
This makes it possible to generate dynamic 3D human avatars with high-quality
textures and learned, instance-specific, surface deformations. We demonstrate
that our method is capable of generating a wide variety of animatable, realistic
3D human models from text. Our 3D models have diverse appearance, clothing,
skin tones and body shapes, and significantly outperform both generic
text-to-3D approaches and previous text-based 3D avatar generators in visual
fidelity. For more results and animations please check our website at
https://dream-human.github.io. |
Presents DreamHuman, a method to generate realistic, animatable 3D human avatars solely from textual descriptions, by combining large text-to-image models, neural radiance fields, and statistical human body models. |
Existing text-to-3D methods lack control, spatial resolution, animation capabilities, and anthropometric consistency, particularly for complex structures like human bodies. DreamHuman addresses these limitations. |
Combines text-to-image diffusion models, neural radiance fields (using mip-NeRF 360 architecture), and the imGHUM statistical human body model. Employs a novel modeling and optimization framework with semantic zooming and refining prompts for detail, and incorporates multiple losses to ensure quality in structure, appearance, and deformation. |
Generates high-quality, animatable 3D human avatars with diverse appearances, clothing, skin tones, and body shapes from text prompts.
Learns instance-specific, pose-dependent geometric deformations, enabling realistic clothing representation, including loose garments.
Outperforms generic text-to-3D approaches and previous text-based 3D avatar generators in visual fidelity, as demonstrated by qualitative comparisons and CLIP-based evaluation. |
Fine details like wrinkles are sometimes drawn using the albedo map instead of geometry due to the lack of 3D training data.
Occasional disentanglement issues between albedo and shading can result in baked reflections and shadows. |
text-to-3d, 3d human avatar generation, neural radiance fields, diffusion models, human body model |
2306.09316
Report |
Diffusion Models for Zero-Shot Open-Vocabulary Segmentation |
Laurynas Karazija, Iro Laina, Andrea Vedaldi, Christian Rupprecht |
The variety of objects in the real world is nearly unlimited and is thus
impossible to capture using models trained on a fixed set of categories. As a
result, in recent years, open-vocabulary methods have attracted the interest of
the community. This paper proposes a new method for zero-shot open-vocabulary
segmentation. Prior work largely relies on contrastive training using
image-text pairs, leveraging grouping mechanisms to learn image features that
are both aligned with language and well-localised. This, however, can introduce
ambiguity as the visual appearance of images with similar captions often
varies. Instead, we leverage the generative properties of large-scale
text-to-image diffusion models to sample a set of support images for a given
textual category. This provides a distribution of appearances for a given text
circumventing the ambiguity problem. We further propose a mechanism that
considers the contextual background of the sampled images to better localise
objects and segment the background directly. We show that our method can be
used to ground several existing pre-trained self-supervised feature extractors
in natural language and provide explainable predictions by mapping back to
regions in the support set. Our proposal is training-free, relying on
pre-trained components only, yet shows strong performance on a range of
open-vocabulary segmentation benchmarks, obtaining a lead of more than 10% on
the Pascal VOC benchmark. |
This paper introduces OVdiff, a training-free method for zero-shot open-vocabulary segmentation that leverages text-to-image diffusion models to generate visual prototypes for grounding pre-trained feature extractors. |
Existing open-vocabulary segmentation methods rely on extensive training with image-text pairs or labeled datasets, which can introduce ambiguity and limit scalability to new categories. OVdiff overcomes these limitations by utilizing the generative power of diffusion models and pre-trained feature extractors. |
OVdiff samples a support set of images for a given textual category using a text-to-image diffusion model. It then extracts visual prototypes at class, instance, and part levels from these images using an off-the-shelf feature extractor. These prototypes are used in a nearest-neighbor lookup scheme for segmenting any image. |
OVdiff achieves state-of-the-art performance on open-vocabulary segmentation benchmarks, outperforming existing methods by a significant margin.
The method effectively leverages contextual priors by encoding background prototypes, leading to improved object localization and boundary delineation.
OVdiff provides a degree of explainability by mapping back segmentation decisions to specific regions in the support set images. |
The resolution of segmentation masks is limited by the resolution of the employed feature extractor.
Sampling support images for a large number of categories can be computationally expensive, although this cost can be amortized over multiple images. |
open-vocabulary segmentation, zero-shot learning, diffusion models, feature grounding, explainable ai |
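The nearest-neighbor lookup described in the OVdiff entry above can be sketched as follows: every pixel feature is compared against a bank of prototypes, and the pixel receives the class label of its most similar prototype. This is a simplified illustration under stated assumptions: cosine similarity is assumed as the matching score, prototype extraction from the sampled support images is not shown, and the function and argument names are hypothetical.

```python
import numpy as np

def segment_by_prototypes(pixel_feats, prototypes, proto_labels):
    """Label each pixel with the class of its most similar prototype.

    pixel_feats:  (H, W, D) dense features from a frozen feature extractor.
    prototypes:   (K, D) prototype vectors (several per class are allowed,
                  including background prototypes).
    proto_labels: (K,) integer class id of each prototype.
    """
    f = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=-1, keepdims=True)
    sim = f @ p.T                   # (H, W, K) cosine similarities
    nearest = sim.argmax(axis=-1)   # index of the best-matching prototype
    return proto_labels[nearest]    # (H, W) segmentation map
```

Because each pixel's decision is tied to one prototype index, that index can be traced back to the support-image region the prototype came from, which is the kind of explainability the entry mentions.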
2306.09305
Report |
Fast Training of Diffusion Models with Masked Transformers |
Hongkai Zheng, Weili Nie, Arash Vahdat, Anima Anandkumar |
We propose an efficient approach to train large diffusion models with masked
transformers. While masked transformers have been extensively explored for
representation learning, their application to generative learning is less
explored in the vision domain. Our work is the first to exploit masked training
to reduce the training cost of diffusion models significantly. Specifically, we
randomly mask out a high proportion (e.g., 50%) of patches in diffused input
images during training. For masked training, we introduce an asymmetric
encoder-decoder architecture consisting of a transformer encoder that operates
only on unmasked patches and a lightweight transformer decoder on full patches.
To promote a long-range understanding of full patches, we add an auxiliary task
of reconstructing masked patches to the denoising score matching objective that
learns the score of unmasked patches. Experiments on ImageNet-256x256 and
ImageNet-512x512 show that our approach achieves competitive and even better
generative performance than the state-of-the-art Diffusion Transformer (DiT)
model, using only around 30% of its original training time. Thus, our method
shows a promising way of efficiently training large transformer-based diffusion
models without sacrificing the generative performance. |
This paper introduces MaskDiT, a novel approach for training diffusion models efficiently using masked transformers. |
Training large diffusion models is computationally expensive. This work aims to significantly reduce the training cost without compromising image generation quality. |
The authors propose an asymmetric encoder-decoder architecture where the encoder processes only unmasked patches while the lightweight decoder handles all patches. They also introduce a new training objective combining denoising score matching on unmasked tokens and an auxiliary masked patch reconstruction task. |
MaskDiT achieves competitive image generation quality compared to state-of-the-art models on ImageNet 256x256 and 512x512 benchmarks.
The method significantly reduces training time and memory consumption compared to previous transformer-based diffusion models (DiT and MDT).
An ablation study reveals that the success of MaskDiT comes from the interplay of image masking, the asymmetric architecture, and the dual training objective. |
The current method requires a few steps of unmasking tuning to achieve the best FID scores with classifier-free guidance.
Future work could focus on improving unconditional image generation performance with masked training. |
diffusion models, masked image modeling, transformers, image generation, efficient training |
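The training recipe in the MaskDiT entry above combines random patch masking with a two-part objective. The sketch below illustrates that structure only: a denoising score matching term on the unmasked patches plus an auxiliary reconstruction term on the masked ones. The mean-squared-error form of both terms, the `recon_weight` value, and all names are assumptions made for illustration, and the network that produces the predictions is omitted.

```python
import numpy as np

def random_patch_mask(num_patches, mask_ratio, rng):
    """Boolean mask with True for patches hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    perm = rng.permutation(num_patches)
    mask = np.zeros(num_patches, dtype=bool)
    mask[perm[:num_masked]] = True
    return mask

def masked_training_loss(score_pred, score_target, recon_pred, patches,
                         mask, recon_weight=0.1):
    """Score matching on visible patches + reconstruction on masked patches.

    All arrays have shape (num_patches, patch_dim); mask has shape
    (num_patches,). recon_weight is a hypothetical balancing coefficient.
    """
    dsm = np.mean((score_pred[~mask] - score_target[~mask]) ** 2)
    recon = np.mean((recon_pred[mask] - patches[mask]) ** 2)
    return float(dsm + recon_weight * recon)
```

With `mask_ratio=0.5`, the encoder would see only half of the patches per image, which is where the reported training savings come from.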
2306.09117
Report |
UniOcc: Unifying Vision-Centric 3D Occupancy Prediction with Geometric and Semantic Rendering |
Mingjie Pan, Li Liu, Jiaming Liu, Peixiang Huang, Longlong Wang, Shanghang Zhang, Shaoqing Xu, Zhiyi Lai, Kuiyuan Yang |
In this technical report, we present our solution, named UniOcc, for the
Vision-Centric 3D occupancy prediction track in the nuScenes Open Dataset
Challenge at CVPR 2023. Existing methods for occupancy prediction primarily
focus on optimizing projected features on 3D volume space using 3D occupancy
labels. However, the generation process of these labels is complex and
expensive (relying on 3D semantic annotations), and limited by voxel
resolution, they cannot provide fine-grained spatial semantics. To address this
limitation, we propose a novel Unifying Occupancy (UniOcc) prediction method,
explicitly imposing spatial geometry constraints and complementing fine-grained
semantic supervision through volume ray rendering. Our method significantly
enhances model performance and demonstrates promising potential in reducing
human annotation costs. Given the laborious nature of annotating 3D occupancy,
we further introduce a Depth-aware Teacher Student (DTS) framework to enhance
prediction accuracy using unlabeled data. Our solution achieves 51.27% mIoU on
the official leaderboard with a single model, placing 3rd in this challenge. |
Introduces UniOcc, a novel method for unifying 2D and 3D representation supervision in multi-camera occupancy prediction by leveraging volume rendering to generate 2D semantic and depth maps for fine-grained supervision. |
Addresses the limitations of existing 3D occupancy prediction methods that rely on expensive and complex 3D annotations by utilizing readily available 2D annotations and potentially reducing annotation costs. |
Employs volume rendering to generate 2D semantic and depth maps from 3D occupancy predictions, enabling fine-grained supervision with 2D pixels and enforcing geometric and semantic consistency through explicit occlusion relationships. |
Achieves comparable performance to methods using 3D labels without relying on them, highlighting the potential to reduce annotation costs.
Integrating temporal frames as supplementary perspectives significantly enhances rendering supervision by considering occlusion relationships between voxels.
The proposed Depth-aware Teacher Student (DTS) framework, utilizing unlabeled data and LiDAR information, effectively improves prediction accuracy. |
Limited overlap between surrounding cameras hinders multi-view consistency in rendering supervision.
Reliance on visibility masks during training, while improving evaluation metrics, may lead to overlooking occluded areas and affect visualization quality. |
occupancy prediction, volume rendering, autonomous driving, semi-supervised learning, multi-view consistency |
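The rendering-based supervision in the UniOcc entry above relies on turning per-voxel predictions along a camera ray into a 2D depth and semantic value. A minimal sketch of that step is given below, using the standard NeRF-style quadrature as an assumption (the report does not spell out the exact formulation). Uniform sample spacing and all names are hypothetical.

```python
import numpy as np

def render_ray(densities, semantics, depths, step):
    """Composite samples along one ray into a depth and a semantic vector.

    densities: (S,) non-negative density at each sample.
    semantics: (S, C) per-sample class scores.
    depths:    (S,) distance of each sample from the camera.
    step:      spacing between consecutive samples (assumed uniform).
    """
    alpha = 1.0 - np.exp(-densities * step)      # opacity of each sample
    # Transmittance: probability the ray reaches sample i unoccluded.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = trans * alpha                      # contribution of each sample
    depth = float(weights @ depths)              # expected ray depth
    sem = weights @ semantics                    # (C,) rendered semantics
    return depth, sem, weights
```

Samples behind an opaque voxel receive near-zero weight, so the rendered 2D maps respect occlusion and can be compared against 2D depth and semantic labels.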
2306.09109
Report |
NAVI: Category-Agnostic Image Collections with High-Quality 3D Shape and Pose Annotations |
Varun Jampani, Kevis-Kokitsi Maninis, Andreas Engelhardt, Arjun Karpur, Karen Truong, Kyle Sargent, Stefan Popov, André Araujo, Ricardo Martin-Brualla, Kaushal Patel, Daniel Vlasic, Vittorio Ferrari, Ameesh Makadia, Ce Liu, Yuanzhen Li, Howard Zhou |
Recent advances in neural reconstruction enable high-quality 3D object
reconstruction from casually captured image collections. Current techniques
mostly analyze their progress on relatively simple image collections where
Structure-from-Motion (SfM) techniques can provide ground-truth (GT) camera
poses. We note that SfM techniques tend to fail on in-the-wild image
collections such as image search results with varying backgrounds and
illuminations. To enable systematic research progress on 3D reconstruction from
casual image captures, we propose NAVI: a new dataset of category-agnostic
image collections of objects with high-quality 3D scans along with per-image
2D-3D alignments providing near-perfect GT camera parameters. These 2D-3D
alignments allow us to extract accurate derivative annotations such as dense
pixel correspondences, depth and segmentation maps. We demonstrate the use of
NAVI image collections on different problem settings and show that NAVI enables
more thorough evaluations that were not possible with existing datasets. We
believe NAVI is beneficial for systematic research progress on 3D
reconstruction and correspondence estimation. Project page:
https://navidataset.github.io |
The paper introduces NAVI, a novel dataset of image collections featuring precise 2D-3D alignments with high-quality 3D scans, aimed at advancing 3D reconstruction from casual captures, including in-the-wild scenarios. |
Existing datasets often rely on SfM for camera poses, limiting capture setups and hindering research on in-the-wild reconstruction where SfM struggles. NAVI addresses this by providing accurate ground truth for challenging scenarios. |
NAVI uses high-quality 3D scanners for ground truth shapes and employs a rigorous manual 2D-3D alignment process with interactive tools and expert verification, ensuring near-perfect annotations. |
NAVI's ground truth poses significantly improve multiview reconstruction quality compared to using SfM (COLMAP).
For in-the-wild reconstruction, NAVI enables camera pose analysis, revealing performance differences among techniques (SAMURAI, NeRS, NeROIC) under varying noise levels.
NAVI's dense correspondence annotations highlight the limitations of existing methods in achieving comprehensive coverage, particularly in in-the-wild scenarios. |
NAVI's main limitation is its scale, consisting of 36 objects and ~10K images due to the meticulous annotation process.
Future work involves expanding NAVI to include video sequences. |
3d reconstruction, dataset, in-the-wild, correspondence estimation, camera pose estimation |
2306.08904
Report |
Enhancing Neural Rendering Methods with Image Augmentations |
Juan C. Pérez, Sara Rojas, Jesus Zarzar, Bernard Ghanem |
Faithfully reconstructing 3D geometry and generating novel views of scenes
are critical tasks in 3D computer vision. Despite the widespread use of image
augmentations across computer vision applications, their potential remains
underexplored when learning neural rendering methods (NRMs) for 3D scenes. This
paper presents a comprehensive analysis of the use of image augmentations in
NRMs, where we explore different augmentation strategies. We found that
introducing image augmentations during training presents challenges such as
geometric and photometric inconsistencies for learning NRMs from images.
Specifically, geometric inconsistencies arise from alterations in shapes,
positions, and orientations from the augmentations, disrupting spatial cues
necessary for accurate 3D reconstruction. On the other hand, photometric
inconsistencies arise from changes in pixel intensities introduced by the
augmentations, affecting the ability to capture the underlying 3D structures of
the scene. We alleviate these issues by focusing on color manipulations and
introducing learnable appearance embeddings that allow NRMs to explain away
photometric variations. Our experiments demonstrate the benefits of
incorporating augmentations when learning NRMs, including improved photometric
quality and surface reconstruction, as well as enhanced robustness against data
quality issues, such as reduced training data and image degradations. |
This paper presents a comprehensive analysis of the use of image augmentations in neural rendering methods for 3D scenes, examining the challenges and benefits of both static and dynamic augmentation strategies. |
Image augmentations are widely used in computer vision but their potential in neural rendering is underexplored. This work investigates how to effectively incorporate them and analyzes their impact on performance and robustness. |
The authors propose two methods: Static Image Augmentations (SIA) and Dynamic Image Augmentations (DIA). They avoid geometric inconsistencies by restricting augmentations to color manipulations, and they handle photometric inconsistencies by introducing learnable appearance embeddings. Experiments use NeRF and NGP on the Blender dataset (photometric quality) and NeuS on the DTU dataset (surface reconstruction). |
Both SIA and DIA, particularly SIA, improve photometric quality (PSNR, SSIM, LPIPS) on the Blender dataset.
SIA consistently outperforms other setups in surface reconstruction quality (Chamfer distance) on the DTU dataset.
SIA enhances robustness against reduced training data and image degradations for both photometric and geometric quality. |
The focus on color manipulations as augmentations might limit diversity and robustness.
Reliance on geometry-preserving augmentations restricts applicability to complex transformations involving shape or viewpoint changes. |
neural rendering, image augmentation, 3d scene reconstruction, novel view synthesis, data augmentation |
2306.08768
Report |
Generalizable One-shot Neural Head Avatar |
Xueting Li, Shalini De Mello, Sifei Liu, Koki Nagano, Umar Iqbal, Jan Kautz |
We present a method that reconstructs and animates a 3D head avatar from a
single-view portrait image. Existing methods either involve time-consuming
optimization for a specific person with multiple images, or they struggle to
synthesize intricate appearance details beyond the facial region. To address
these limitations, we propose a framework that not only generalizes to unseen
identities based on a single-view image without requiring person-specific
optimization, but also captures characteristic details within and beyond the
face area (e.g. hairstyle, accessories, etc.). At the core of our method are
three branches that produce three tri-planes representing the coarse 3D
geometry, detailed appearance of a source image, as well as the expression of a
target image. By applying volumetric rendering to the combination of the three
tri-planes followed by a super-resolution module, our method yields a high
fidelity image of the desired identity, expression and pose. Once trained, our
model enables efficient 3D head avatar reconstruction and animation via a
single forward pass through a network. Experiments show that the proposed
approach generalizes well to unseen validation datasets, surpassing SOTA
baseline methods by a large margin on head avatar reconstruction and animation. |
This paper proposes a novel framework for reconstructing and animating 3D head avatars from single-view portrait images, capturing intricate details while generalizing to unseen identities without test-time optimization. |
Existing methods for head avatar animation are either inefficient, requiring per-person optimization, or lack fidelity in synthesizing detailed appearances beyond the face. This work addresses these limitations with a practical and efficient solution for high-quality avatar creation. |
The framework uses three branches: 1) a canonical branch reconstructs coarse 3D geometry with a neutral expression, 2) an appearance branch captures detailed texture by mapping image pixels to the canonical 3D space, and 3) an expression branch modifies the reconstruction to match the target expression using a 3DMM rendering. A super-resolution module enhances the final output. |
The method achieves state-of-the-art performance on 3D portrait reconstruction, surpassing baselines on fidelity metrics.
It exhibits superior performance in cross-identity reenactment, accurately transferring expressions and head poses while preserving identity and details.
The framework is highly efficient, reconstructing and animating avatars with a single forward pass, significantly faster than optimization-based methods. |
The model currently struggles to accurately reconstruct teeth and pupils, often relying on hallucination which can lead to discrepancies with the source image.
Future work includes addressing these limitations by developing mechanisms to better handle open/closed mouth and eye states during reconstruction. |
3d head avatar, neural rendering, one-shot learning, facial animation, generative adversarial networks |
2306.08757
Report |
InfoDiffusion: Representation Learning Using Information Maximizing Diffusion Models |
Yingheng Wang, Yair Schiff, Aaron Gokaslan, Weishen Pan, Fei Wang, Christopher De Sa, Volodymyr Kuleshov |
While diffusion models excel at generating high-quality samples, their latent
variables typically lack semantic meaning and are not suitable for
representation learning. Here, we propose InfoDiffusion, an algorithm that
augments diffusion models with low-dimensional latent variables that capture
high-level factors of variation in the data. InfoDiffusion relies on a learning
objective regularized with the mutual information between observed and hidden
variables, which improves latent space quality and prevents the latents from
being ignored by expressive diffusion-based decoders. Empirically, we find that
InfoDiffusion learns disentangled and human-interpretable latent
representations that are competitive with state-of-the-art generative and
contrastive methods, while retaining the high sample quality of diffusion
models. Our method enables manipulating the attributes of generated images and
has the potential to assist tasks that require exploring a learned latent space
to generate quality samples, e.g., generative design. |
This paper introduces InfoDiffusion, an algorithm that augments diffusion models with low-dimensional latent variables to capture high-level factors of variation in the data. |
Diffusion models, while excellent at generating high-quality samples, typically lack semantic meaning in their latent variables, making them unsuitable for representation learning. |
InfoDiffusion uses variational inference and maximizes the mutual information between the observed data and the hidden variables. It also incorporates a prior regularization term to prevent the latent space from being ignored by the decoder. |
InfoDiffusion learns disentangled and human-interpretable latent representations.
The latent representations are competitive with state-of-the-art generative and contrastive methods.
InfoDiffusion retains the high sample quality of diffusion models. |
The impact of different divergence measures on the prior regularization term has yet to be investigated.
Alternative architectures for the encoder and decoder networks remain to be explored. |
diffusion models, representation learning, mutual information, variational inference, disentanglement |
2306.08707
Report |
VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing |
Paul Couairon, Clément Rambour, Jean-Emmanuel Haugeard, Nicolas Thome |
Recently, diffusion-based generative models have achieved remarkable success
for image generation and editing. However, existing diffusion-based video
editing approaches lack the ability to offer precise control over generated
content that maintains temporal consistency in long-term videos. On the other
hand, atlas-based methods provide strong temporal consistency but make video
editing costly and lack spatial control. In this work, we introduce VidEdit, a
novel method for zero-shot text-based video editing that guarantees robust
temporal and spatial consistency. In particular, we combine an atlas-based
video representation with a pre-trained text-to-image diffusion model to
provide a training-free and efficient video editing method, which by design
fulfills temporal smoothness. To grant precise user control over generated
content, we utilize conditional information extracted from off-the-shelf
panoptic segmenters and edge detectors which guides the diffusion sampling
process. This method ensures a fine spatial control on targeted regions while
strictly preserving the structure of the original video. Our quantitative and
qualitative experiments show that VidEdit outperforms state-of-the-art methods
on DAVIS dataset, regarding semantic faithfulness, image preservation, and
temporal consistency metrics. With this framework, processing a single video
only takes approximately one minute, and it can generate multiple compatible
edits based on a single text prompt. Project web-page at
https://videdit.github.io |
VidEdit is a novel, lightweight, zero-shot text-based video editing method that leverages the power of pre-trained text-to-image diffusion models and the temporal consistency of atlas-based video representations. |
Existing diffusion-based video editing approaches struggle with precise control over generated content and maintaining temporal consistency in long videos. Atlas-based methods, while offering strong temporal consistency, are computationally expensive and lack spatial editing control. VidEdit bridges this gap by combining the strengths of both approaches. |
VidEdit decomposes a video into 2D atlas representations using Neural Layered Atlases (NLA). It then utilizes a pre-trained text-to-image diffusion model, guided by conditional information from a panoptic segmenter and edge detector, to perform spatially controlled edits on the atlases. These edits are then mapped back onto the original video frames, ensuring temporal consistency. |
VidEdit demonstrates superior performance compared to state-of-the-art video editing methods in terms of semantic faithfulness to the target text prompt, preservation of original video content, and temporal consistency.
The method offers a significant speed-up in editing time, capable of processing a full video in approximately one minute.
By leveraging the probabilistic nature of diffusion models, VidEdit enables the generation of diverse and creative edits from a single text prompt. |
The performance of VidEdit is reliant on the quality of the underlying atlas representations, which can be limited for videos with complex motions or long durations.
Future work could focus on enhancing the robustness of atlas construction methods to broaden the applicability of VidEdit to a wider range of videos. |
video editing, text-driven editing, diffusion models, neural layered atlases, zero-shot learning |
2306.08687
Report |
Norm-guided latent space exploration for text-to-image generation |
Dvir Samuel, Rami Ben-Ari, Nir Darshan, Haggai Maron, Gal Chechik |
Text-to-image diffusion models show great potential in synthesizing a large
variety of concepts in new compositions and scenarios. However, the latent
space of initial seeds is still not well understood and its structure was shown
to impact the generation of various concepts. Specifically, simple operations
like interpolation and finding the centroid of a set of seeds perform poorly
when using standard Euclidean or spherical metrics in the latent space. This
paper makes the observation that, in current training procedures, diffusion
models observed inputs with a narrow range of norm values. This has strong
implications for methods that rely on seed manipulation for image generation,
with applications to few-shot and long-tail learning tasks. To address this
issue, we propose a novel method for interpolating between two seeds and
demonstrate that it defines a new non-Euclidean metric that takes into account
a norm-based prior on seeds. We describe a simple yet efficient algorithm for
approximating this interpolation procedure and use it to further define
centroids in the latent seed space. We show that our new interpolation and
centroid techniques significantly enhance the generation of rare concept
images. This further leads to state-of-the-art performance on few-shot and
long-tail benchmarks, improving prior approaches in terms of generation speed,
image quality, and semantic content. |
This paper proposes Norm-Aware Optimization (NAO), a novel method for interpolating between seeds and finding centroids in the latent space of text-to-image diffusion models, by leveraging a norm-based prior derived from the Chi distribution. |
Current diffusion models exhibit poor performance in latent space interpolation and centroid finding due to a training bias towards specific seed norm values. This limits their ability to generate rare concepts and perform well in few-shot and long-tail learning tasks. |
NAO defines a new distance metric based on the likelihood of a seed under a Chi distribution prior. It then leverages this metric to find optimal interpolation paths and centroids by minimizing the total distance between points in latent space. |
NAO generates higher-quality images with better semantic content compared to baseline interpolation and centroid methods.
Using NAO for seed initialization significantly improves the performance of SeedSelect in rare concept generation, achieving state-of-the-art results on few-shot and long-tail learning benchmarks.
NAO significantly reduces the runtime of SeedSelect by providing a better starting point for optimization. |
NAO involves an additional optimization step compared to standard interpolation and centroid calculation.
While NAO improves seed initialization, it might still require further optimization using methods like SeedSelect for optimal results. |
diffusion models, latent space exploration, rare concept generation, few-shot learning, long-tail learning |
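The norm-based prior described in the NAO report above can be illustrated with a small numerical sketch. It is not the NAO algorithm itself: it only shows why a plain Euclidean midpoint of two Gaussian seeds is unlikely under a Chi distribution on the seed norm. The latent size `d` and the helper `chi_log_pdf` are assumptions made for illustration.

```python
import math
import numpy as np

def chi_log_pdf(r: float, k: int) -> float:
    """Log-density of a Chi distribution with k degrees of freedom at radius r."""
    return ((k - 1) * math.log(r) - 0.5 * r * r
            - (0.5 * k - 1.0) * math.log(2.0) - math.lgamma(0.5 * k))

rng = np.random.default_rng(0)
d = 4 * 64 * 64  # assumed latent size (a 4x64x64 seed), for illustration only

# Two seeds drawn the way initial noise is usually drawn: i.i.d. standard normal.
seed_a = rng.standard_normal(d)
seed_b = rng.standard_normal(d)

# Plain Euclidean midpoint of the two seeds.
midpoint = 0.5 * (seed_a + seed_b)

norm_a = float(np.linalg.norm(seed_a))
norm_mid = float(np.linalg.norm(midpoint))

# The norm of a d-dimensional standard normal vector follows a Chi(d) distribution,
# so the log-density of each norm measures how typical the seed is.
logp_a = chi_log_pdf(norm_a, d)
logp_mid = chi_log_pdf(norm_mid, d)

print(f"seed norm     = {norm_a:.1f}, log-density = {logp_a:.1f}")
print(f"midpoint norm = {norm_mid:.1f}, log-density = {logp_mid:.1f}")
```

The midpoint has a visibly smaller norm than either seed and a far lower log-density under the prior, which is the kind of mismatch a norm-aware interpolation is meant to avoid.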
2306.08659
Report |
Explore In-Context Learning for 3D Point Cloud Understanding |
Zhongbin Fang, Xiangtai Li, Xia Li, Joachim M. Buhmann, Chen Change Loy, Mengyuan Liu |
With the rise of large-scale models trained on broad data, in-context
learning has become a new learning paradigm that has demonstrated significant
potential in natural language processing and computer vision tasks. Meanwhile,
in-context learning is still largely unexplored in the 3D point cloud domain.
Although masked modeling has been successfully applied for in-context learning
in 2D vision, directly extending it to 3D point clouds remains a formidable
challenge. In the case of point clouds, the tokens themselves are the point
cloud positions (coordinates) that are masked during inference. Moreover,
position embedding in previous works may inadvertently introduce information
leakage. To address these challenges, we introduce a novel framework, named
Point-In-Context, designed especially for in-context learning in 3D point
clouds, where both inputs and outputs are modeled as coordinates for each task.
Additionally, we propose the Joint Sampling module, carefully designed to work
in tandem with the general point sampling operator, effectively resolving the
aforementioned technical issues. We conduct extensive experiments to validate
the versatility and adaptability of our proposed methods in handling a wide
range of tasks. |
This paper presents Point-In-Context (PIC), the first framework to explore in-context learning for 3D point cloud understanding. |
In-context learning has shown promise in NLP and 2D vision but remains largely unexplored for 3D point clouds. This work establishes a baseline for this novel research direction. |
The authors create a new benchmark dataset with four tasks: reconstruction, denoising, registration, and part segmentation. They propose PIC with a Joint Sampling module to address information leakage issues inherent in adapting existing methods. |
PIC achieves state-of-the-art performance on the benchmark, outperforming multitask models.
The method generalizes to out-of-distribution data and unseen tasks.
Prompt selection significantly impacts performance, suggesting future research directions. |
The model struggles to reconstruct fine details in complex point clouds.
Future work includes exploring higher-quality prompts for improved performance. |
in-context learning, 3d point cloud, masked point modeling, joint sampling, prompt engineering |
2306.08645
Report |
Training-free Diffusion Model Adaptation for Variable-Sized Text-to-Image Synthesis |
Zhiyu Jin, Xuli Shen, Bin Li, Xiangyang Xue |
Diffusion models (DMs) have recently gained attention with state-of-the-art
performance in text-to-image synthesis. Abiding by the tradition in deep
learning, DMs are trained and evaluated on images with fixed sizes.
However, users demand images with specific sizes and various
aspect ratios. This paper focuses on adapting text-to-image diffusion models to
handle such variety while maintaining visual fidelity. First we observe that,
during the synthesis, lower resolution images suffer from incomplete object
portrayal, while higher resolution images exhibit repetitively disordered
presentation. Next, we establish a statistical relationship indicating that
attention entropy changes with token quantity, suggesting that models aggregate
spatial information in proportion to image resolution. Our interpretation
of these observations is that objects are incompletely depicted
due to limited spatial information for low resolutions, while repetitively
disorganized presentation arises from redundant spatial information for high
resolutions. From this perspective, we propose a scaling factor to alleviate
the change of attention entropy and mitigate the defective pattern observed.
Extensive experimental results validate the efficacy of the proposed scaling
factor, enabling models to achieve better visual effects, image quality, and
text alignment. Notably, these improvements are achieved without additional
training or fine-tuning techniques. |
This paper proposes a novel scaling factor for visual attention layers in text-to-image diffusion models, enabling them to synthesize high-fidelity images of varying sizes without additional training. |
Existing diffusion models struggle to maintain visual fidelity when synthesizing images at resolutions different from their training resolution. This limits their practical application and requires costly training of specialized models. |
The authors establish a statistical relationship between attention entropy and token quantity, demonstrating that attention entropy changes proportionally to the logarithm of the token number. They then propose a scaling factor to mitigate these entropy fluctuations during image synthesis. |
The proposed scaling factor significantly improves FID scores across various resolutions, indicating improved image quality and diversity.
It enhances the semantic alignment between generated images and text prompts, resulting in higher CLIP scores.
Qualitative results demonstrate that the method alleviates issues of incomplete objects in low-resolution images and repetitive patterns in high-resolution images. |
The paper lacks a dedicated metric for evaluating image fidelity across different resolutions.
Further investigation is needed to assess the generalizability of the proposed scaling factor to other diffusion-based models. |
diffusion models, text-to-image synthesis, attention mechanism, entropy, variable-sized image generation |
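The relationship between attention entropy and token count in the report above can be sketched on random features. This is a toy illustration, not the authors' code: the token counts are hypothetical, and the specific correction used here (multiplying the usual 1/sqrt(d) scale by sqrt(log_T N), with T the training token count and N the inference token count) is an assumed form of the scaling factor.

```python
import numpy as np

def mean_attention_entropy(q: np.ndarray, k: np.ndarray, scale: float) -> float:
    """Average Shannon entropy of softmax(scale * q k^T) over all query rows."""
    logits = scale * (q @ k.T)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return float(entropy.mean())

rng = np.random.default_rng(0)
d = 64               # per-head feature dimension (assumed)
train_tokens = 1024  # hypothetical: 32x32 latent grid at the training resolution
test_tokens = 4096   # hypothetical: 64x64 latent grid at a larger resolution

queries = rng.standard_normal((64, d))
keys_train = rng.standard_normal((train_tokens, d))
keys_test = rng.standard_normal((test_tokens, d))

base_scale = 1.0 / np.sqrt(d)
# Assumed correction: sqrt(log_T N), which exceeds 1 when N > T and sharpens attention.
adjust = np.sqrt(np.log(test_tokens) / np.log(train_tokens))

h_train = mean_attention_entropy(queries, keys_train, base_scale)
h_test = mean_attention_entropy(queries, keys_test, base_scale)
h_test_scaled = mean_attention_entropy(queries, keys_test, base_scale * adjust)

print(f"entropy at {train_tokens} tokens:           {h_train:.3f}")
print(f"entropy at {test_tokens} tokens:           {h_test:.3f}")
print(f"entropy at {test_tokens} tokens (rescaled): {h_test_scaled:.3f}")
```

With more tokens the unscaled entropy rises, and the rescaled attention pulls it back toward the training-resolution value. On random features the correction is only partial; the sketch is meant to show the direction of the effect, not its size in a real model.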
2306.08637
Report |
TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement |
Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, Andrew Zisserman |
We present a novel model for Tracking Any Point (TAP) that effectively tracks
any queried point on any physical surface throughout a video sequence. Our
approach employs two stages: (1) a matching stage, which independently locates
a suitable candidate point match for the query point on every other frame, and
(2) a refinement stage, which updates both the trajectory and query features
based on local correlations. The resulting model surpasses all baseline methods
by a significant margin on the TAP-Vid benchmark, as demonstrated by an
approximate 20% absolute average Jaccard (AJ) improvement on DAVIS. Our model
facilitates fast inference on long and high-resolution video sequences. On a
modern GPU, our implementation has the capacity to track points faster than
real-time, and can be flexibly extended to higher-resolution videos. Given the
high-quality trajectories extracted from a large dataset, we demonstrate a
proof-of-concept diffusion model which generates trajectories from static
images, enabling plausible animations. Visualizations, source code, and
pretrained models can be found on our project webpage. |
Presents TAPIR, a novel model for long-term point tracking that combines per-frame matching with temporal refinement, significantly improving performance on the TAP-Vid benchmark. |
Addresses limitations of prior methods in handling occlusions and leveraging temporal continuity for accurate and robust point tracking in videos. |
Combines a TAP-Net-like matching stage for robust initialization with a PIPs-inspired refinement stage using depthwise convolutional networks for efficient temporal smoothing. |
Achieves state-of-the-art results on the TAP-Vid benchmark, with a 10.6% absolute improvement on Kinetics and 19.3% on DAVIS over previous best methods.
Demonstrates robust performance even on high-resolution videos by employing an image pyramid approach.
Enables a proof-of-concept diffusion model for animating still images by generating plausible motion trajectories. |
Performance on the RGB-Stacking dataset, while improved, suggests further research is needed for tracking points on textureless objects.
Exploring more sophisticated temporal integration methods beyond RNNs could further enhance performance. |
point tracking, video understanding, deep learning, computer vision, motion analysis |
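The per-frame matching stage in the TAPIR report above can be sketched as an independent nearest-feature search in every frame. This is a simplified illustration under stated assumptions: the feature maps are random stand-ins, cosine similarity with a hard argmax replaces whatever matching head the model actually uses, and the refinement stage is omitted. The function name `match_per_frame` is hypothetical.

```python
import numpy as np

def match_per_frame(query_feat: np.ndarray, frame_feats: np.ndarray) -> np.ndarray:
    """Return the (row, col) of the best cosine match in every frame independently.

    query_feat:  (C,) feature vector sampled at the query point.
    frame_feats: (T, H, W, C) feature maps, one per frame.
    """
    q = query_feat / np.linalg.norm(query_feat)
    f = frame_feats / np.linalg.norm(frame_feats, axis=-1, keepdims=True)
    cost = f @ q                                  # (T, H, W) similarity volume
    num_frames, height, width = cost.shape
    flat = cost.reshape(num_frames, -1).argmax(axis=1)
    rows, cols = np.unravel_index(flat, (height, width))
    return np.stack([rows, cols], axis=1)         # (T, 2)

rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 8, 8, 16))        # 3 frames of 8x8 features, 16 channels
query = rng.standard_normal(16)

# Plant the query feature at a known location in each frame to simulate a moving point.
true_positions = [(1, 2), (3, 3), (6, 5)]
for t, (r, c) in enumerate(true_positions):
    feats[t, r, c] = query

matches = match_per_frame(query, feats)
print(matches)
```

Because each frame is searched on its own, a point that is occluded in some frames can still be found again later. The price is a noisier trajectory, which the refinement stage described above is meant to correct.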
2306.08571
Report |
GenImage: A Million-Scale Benchmark for Detecting AI-Generated Image |
Mingjian Zhu, Hanting Chen, Qiangyu Yan, Xudong Huang, Guanyu Lin, Wei Li, Zhijun Tu, Hailin Hu, Jie Hu, Yunhe Wang |
The extraordinary ability of generative models to generate photographic
images has intensified concerns about the spread of disinformation, thereby
leading to the demand for detectors capable of distinguishing between
AI-generated fake images and real images. However, the lack of large datasets
containing images from the most advanced image generators poses an obstacle to
the development of such detectors. In this paper, we introduce the GenImage
dataset, which has the following advantages: 1) Plenty of Images, including
over one million pairs of AI-generated fake images and collected real images.
2) Rich Image Content, encompassing a broad range of image classes. 3)
State-of-the-art Generators, synthesizing images with advanced diffusion models
and GANs. The aforementioned advantages allow the detectors trained on GenImage
to undergo a thorough evaluation and demonstrate strong applicability to
diverse images. We conduct a comprehensive analysis of the dataset and propose
two tasks for evaluating detection methods under conditions resembling real-world
scenarios. The cross-generator image classification task measures the
performance of a detector trained on one generator when tested on the others.
The degraded image classification task assesses the capability of the detectors
in handling degraded images such as low-resolution, blurred, and compressed
images. With the GenImage dataset, researchers can effectively expedite the
development and evaluation of superior AI-generated image detectors in
comparison to prevailing methodologies. |
This paper introduces GenImage, a large-scale dataset designed for detecting fake images generated by both diffusion models and GANs. |
The proliferation of highly realistic AI-generated images necessitates robust detectors, and existing datasets are limited in scale, content diversity, or the use of advanced generators, hindering detector development. |
The authors generate over one million fake images across 1000 ImageNet classes using eight state-of-the-art diffusion models and GANs, paired with real ImageNet images. |
The dataset enables cross-generator image classification, showing that detectors struggle to generalize to unseen generators.
Detectors are evaluated on degraded images (low resolution, compression, blur), revealing performance drops under real-world conditions.
Analysis shows that diffusion models pose a greater challenge for detection than GANs due to fewer spectral artifacts. |
The study primarily focuses on ResNet-based detectors, leaving room to explore more specialized architectures.
Future work can investigate the impact of different prompts and generation parameters on detector performance. |
ai-generated image detection, fake image detection, diffusion models, generative adversarial networks, dataset |
2306.08498
Report |
Extending CLIP's Image-Text Alignment to Referring Image Segmentation |
Seoyeon Kim, Minguk Kang, Dongwon Kim, Jaesik Park, Suha Kwak |
Referring Image Segmentation (RIS) is a cross-modal task that aims to segment
an instance described by a natural language expression. Recent methods leverage
large-scale pretrained unimodal models as backbones along with fusion
techniques for joint reasoning across modalities. However, the inherent
cross-modal nature of RIS raises questions about the effectiveness of unimodal
backbones. We propose RISCLIP, a novel framework that effectively leverages the
cross-modal nature of CLIP for RIS. Observing CLIP's inherent alignment between
image and text features, we capitalize on this starting point and introduce
simple but strong modules that enhance unimodal feature extraction and leverage
rich alignment knowledge in CLIP's image-text shared-embedding space. RISCLIP
exhibits outstanding results on all three major RIS benchmarks and also
outperforms previous CLIP-based methods, demonstrating the efficacy of our
strategy in extending CLIP's image-text alignment to RIS. |
This paper proposes RISCLIP, a novel framework that leverages the cross-modal alignment capabilities of CLIP for Referring Image Segmentation (RIS). |
Existing RIS methods often rely on unimodal backbones and fusion techniques, but the inherent cross-modal nature of RIS suggests that a model like CLIP, pretrained on a massive dataset of image-text pairs, could be more effective. |
The method freezes the CLIP backbone and introduces three key components: (1) Adapters to refine CLIP features for segmentation; (2) Cross-modal Feature Extraction (CFE) modules to align image and text features at candidate regions; (3) Shared-space Knowledge Exploitation (SKE) modules to leverage the rich alignment knowledge in CLIP's shared embedding space for target discernment. Finally, a decoder transforms the patch-level grounding into a pixel-wise segmentation. |
RISCLIP achieves state-of-the-art performance on three major RIS benchmarks: RefCOCO, RefCOCO+, and RefCOCOg.
The ablation study demonstrates that freezing CLIP and adapting its features with the proposed modules is crucial for optimal performance.
The method excels in handling complex referring expressions, particularly on the challenging RefCOCOg dataset. |
RISCLIP currently exhibits limitations in recognizing alphanumeric characters and comprehending expressions that describe target objects based on the absence of specific attributes.
Future work will explore the adaptation of other image-text alignment backbones like ALIGN and Florence to RIS. |
referring image segmentation, cross-modal learning, clip, image-text alignment, deep learning |
2306.08276
Report |
TryOnDiffusion: A Tale of Two UNets |
Luyang Zhu, Dawei Yang, Tyler Zhu, Fitsum Reda, William Chan, Chitwan Saharia, Mohammad Norouzi, Ira Kemelmacher-Shlizerman |
Given two images depicting a person and a garment worn by another person, our
goal is to generate a visualization of how the garment might look on the input
person. A key challenge is to synthesize a photorealistic detail-preserving
visualization of the garment, while warping the garment to accommodate a
significant body pose and shape change across the subjects. Previous methods
either focus on garment detail preservation without effective pose and shape
variation, or allow try-on with the desired shape and pose but lack garment
details. In this paper, we propose a diffusion-based architecture that unifies
two UNets (referred to as Parallel-UNet), which allows us to preserve garment
details and warp the garment for significant pose and body change in a single
network. The key ideas behind Parallel-UNet include: 1) garment is warped
implicitly via a cross attention mechanism, 2) garment warp and person blend
happen as part of a unified process as opposed to a sequence of two separate
tasks. Experimental results indicate that TryOnDiffusion achieves
state-of-the-art performance both qualitatively and quantitatively. |
TryOnDiffusion, a diffusion-based model using a novel architecture called Parallel-UNet, synthesizes high-resolution (1024x1024) virtual try-on images, realistically warping garments onto target individuals while preserving intricate garment details even with significant pose and shape variations. |
Virtual try-on enhances online shopping experiences, but existing methods struggle to balance garment detail preservation with accurate warping across different body shapes and poses, particularly in high resolution. This work tackles this challenge, aiming to improve realism and detail in virtual try-on. |
The method uses cascaded diffusion models with a novel architecture called Parallel-UNet, which consists of two sub-UNets, one handling the person and the other the garment. It implicitly warps the garment onto the target person using cross-attention between their features. The model is trained on a massive dataset of 4 million paired images and further enhanced by super-resolution diffusion models for high-quality output. |
TryOnDiffusion achieves state-of-the-art performance, quantitatively outperforming baselines like TryOnGAN, SDAFN, and HR-VITON in FID and KID metrics.
Extensive user studies confirm TryOnDiffusion's superiority, with participants consistently ranking its results as the most realistic.
The method excels in preserving garment details like patterns, text, and textures, even under challenging conditions of occlusion and pose variation, surpassing the capabilities of existing techniques. |
Limitations include potential garment leaking artifacts due to errors in preprocessing steps like segmentation and pose estimation, and challenges in fully representing individual identity using clothing-agnostic RGB.
Future work will focus on addressing limitations, extending the model to full-body try-on, incorporating more complex backgrounds, and exploring its application to videos and general image editing. |
virtual try-on, diffusion models, image synthesis, deep learning, computer vision |
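The implicit warping via cross-attention mentioned in the TryOnDiffusion report above can be sketched with a single attention layer. This is a generic cross-attention illustration, not the Parallel-UNet itself: the token counts, the random projection matrices, and the choice of person features as queries with garment features as keys and values are assumptions.

```python
import numpy as np

def cross_attention(person_tokens, garment_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: person tokens attend over garment tokens."""
    q = person_tokens @ w_q                     # (P, d) queries from the person stream
    k = garment_tokens @ w_k                    # (G, d) keys from the garment stream
    v = garment_tokens @ w_v                    # (G, d) values from the garment stream
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    # Each person location receives a weighted mix of garment features,
    # which acts as a soft, learned correspondence instead of an explicit warp.
    return weights @ v, weights

rng = np.random.default_rng(0)
n_person, n_garment, d_model = 16 * 16, 16 * 16, 32   # hypothetical sizes
person = rng.standard_normal((n_person, d_model))
garment = rng.standard_normal((n_garment, d_model))
w_q, w_k, w_v = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                 for _ in range(3))

warped, attn = cross_attention(person, garment, w_q, w_k, w_v)
print(warped.shape, attn.shape)
```

Each row of `attn` says which garment locations feed a given person location, so the correspondence is learned jointly with the blending, not computed as a separate warping step.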
2306.08257
Report |
On the Robustness of Latent Diffusion Models |
Jianping Zhang, Zhuoer Xu, Shiwen Cui, Changhua Meng, Weibin Wu, Michael R. Lyu |
Latent diffusion models achieve state-of-the-art performance on a variety of
generative tasks, such as image synthesis and image editing. However, the
robustness of latent diffusion models is not well studied. Previous works only
focus on the adversarial attacks against the encoder or the output image under
white-box settings, regardless of the denoising process. Therefore, in this
paper, we aim to analyze the robustness of latent diffusion models more
thoroughly. We first study the influence of the components inside latent
diffusion models on their white-box robustness. In addition to white-box
scenarios, we evaluate the black-box robustness of latent diffusion models via
transfer attacks, where we consider both prompt-transfer and model-transfer
settings and possible defense mechanisms. However, all these explorations need
a comprehensive benchmark dataset, which is missing in the literature.
Therefore, to facilitate the research of the robustness of latent diffusion
models, we propose two automatic dataset construction pipelines for two kinds
of image editing models and release the whole dataset. Our code and dataset are
available at https://github.com/jpzhang1810/LDM-Robustness. |
This paper investigates the robustness of latent diffusion models, particularly in image editing, against adversarial attacks. |
Assessing the robustness of latent diffusion models is crucial for ensuring their reliable deployment in real-world applications, especially given their increasing use in image editing. |
The authors propose two automatic dataset construction pipelines for image variation and inpainting models. They then evaluate the models' robustness by launching adversarial attacks under both white-box and black-box settings, analyzing the effects of attacks on different model components. |
The denoising process, especially the ResNet module, is identified as the most vulnerable component in latent diffusion models.
InstructPix2Pix demonstrates greater robustness compared to standard Stable Diffusion models.
Adversarial examples exhibit transferability across different prompts (prompt-transfer) and models (model-transfer), raising concerns about the vulnerability of newer diffusion model versions. |
The attacking strategy, which destroys all internal features of the target module in the denoising process, may not be optimal.
Future work could explore attacking specific steps in the denoising process or developing more robust defense mechanisms. |
adversarial attacks, latent diffusion models, image editing, robustness, transfer attacks |
2306.08247
Report |
Diffusion in Diffusion: Cyclic One-Way Diffusion for Text-Vision-Conditioned Generation |
Ruoyu Wang, Yongqi Yang, Zhihao Qian, Ye Zhu, Yu Wu |
Originating from the diffusion phenomenon in physics that describes particle
movement, the diffusion generative models inherit the characteristics of
stochastic random walk in the data space along the denoising trajectory.
However, the intrinsic mutual interference among image regions contradicts the
need for practical downstream application scenarios where the preservation of
low-level pixel information from given conditioning is desired (e.g.,
customization tasks like personalized generation and inpainting based on a
user-provided single image). In this work, we investigate the diffusion
(physics) in diffusion (machine learning) properties and propose our Cyclic
One-Way Diffusion (COW) method to control the direction of diffusion phenomenon
given a pre-trained frozen diffusion model for versatile customization
application scenarios, where the low-level pixel information from the
conditioning needs to be preserved. Notably, unlike most current methods that
incorporate additional conditions by fine-tuning the base text-to-image
diffusion model or learning auxiliary networks, our method provides a novel
perspective to understand the task needs and is applicable to a wider range of
customization scenarios in a learning-free manner. Extensive experiment results
show that our proposed COW can achieve more flexible customization based on
strict visual conditions in different application settings. Project page:
https://wangruoyu02.github.io/cow.github.io/. |
This paper proposes Cyclic One-Way Diffusion (COW), a training-free method that controls information diffusion in pre-trained diffusion models for image customization. |
Existing methods for customizing diffusion models with visual conditions often rely on computationally expensive fine-tuning and struggle to balance fidelity to both visual and textual inputs. This work aims to address these limitations by controlling the direction of information diffusion during generation. |
COW employs three main components: (1) Seed Initialization, where the visual condition is embedded in a neutral background, (2) Cyclic One-Way Diffusion, where the visual condition's latent representation is gradually injected during generation to promote unidirectional information flow, and (3) Visual Condition Preservation, where the visual condition is re-introduced at a later stage to maintain fidelity. |
COW achieves high fidelity to both textual and visual conditions, outperforming baselines in quantitative metrics and human evaluations.
The cyclic one-way diffusion strategy effectively propagates information from the visual condition while adapting to the textual prompt.
COW is efficient, generating images in 6 seconds compared to minutes or more for fine-tuning based approaches. |
The model can struggle with extreme conflicts between visual and textual conditions.
Future work could explore extending COW to handle multiple visual conditions more robustly. |
diffusion models, image generation, customization, training-free, visual conditioning |
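The injection of the visual condition described in the COW report above can be sketched as overwriting a masked region of the latent with a noised copy of the condition. This is a minimal sketch under stated assumptions, not the authors' implementation: the standard forward-diffusion formula is used for the noising, the latent shape and mask are hypothetical, and the cyclic schedule and the denoiser are left out.

```python
import numpy as np

def inject_condition(latent, cond_latent, mask, alpha_bar, rng):
    """Overwrite the masked region of `latent` with the condition noised to level alpha_bar.

    alpha_bar is the cumulative signal fraction of the current step
    (1.0 means no noise, values near 0.0 mean almost pure noise).
    """
    noise = rng.standard_normal(cond_latent.shape)
    noised_cond = np.sqrt(alpha_bar) * cond_latent + np.sqrt(1.0 - alpha_bar) * noise
    # Inside the mask the condition wins; outside, the current latent is kept.
    return np.where(mask, noised_cond, latent)

rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 64, 64))        # hypothetical latent being denoised
cond = rng.standard_normal((4, 64, 64))          # hypothetical latent of the visual condition

mask = np.zeros((64, 64), dtype=bool)
mask[16:48, 16:48] = True                        # region occupied by the visual condition

out = inject_condition(latent, cond, mask, alpha_bar=0.5, rng=rng)
print(out.shape)
```

Repeating this overwrite at every step keeps the conditioned region anchored to the given image while the rest of the latent is free to adapt to it, which is one simple way to make information flow mostly outward from the condition.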
2306.08226
Report |
CLIPXPlore: Coupled CLIP and Shape Spaces for 3D Shape Exploration |
Jingyu Hu, Ka-Hei Hui, Zhengzhe Liu, Hao Zhang, Chi-Wing Fu |
This paper presents CLIPXPlore, a new framework that leverages a
vision-language model to guide the exploration of the 3D shape space. Many
recent methods have been developed to encode 3D shapes into a learned latent
shape space to enable generative design and modeling. Yet, existing methods
lack effective exploration mechanisms, despite the rich information. To this
end, we propose to leverage CLIP, a powerful pre-trained vision-language model,
to aid the shape-space exploration. Our idea is threefold. First, we couple the
CLIP and shape spaces by generating paired CLIP and shape codes through sketch
images and training a mapper network to connect the two spaces. Second, to
explore the space around a given shape, we formulate a co-optimization strategy
to search for the CLIP code that better matches the geometry of the shape.
Third, we design three exploration modes, binary-attribute-guided, text-guided,
and sketch-guided, to locate suitable exploration trajectories in shape space
and induce meaningful changes to the shape. We perform a series of experiments
to quantitatively and visually compare CLIPXPlore with different baselines in
each of the three exploration modes, showing that CLIPXPlore can produce many
meaningful exploration results that cannot be achieved by the existing
solutions. |
This paper presents CLIPXPlore, a framework that leverages the CLIP vision-language model to guide the exploration of a pre-trained 3D shape latent space. |
Existing shape exploration methods lack fine-grained semantic control and struggle to connect to user-friendly interfaces like language or sketching. |
The framework connects CLIP and shape spaces by training a mapper network on paired CLIP and shape codes generated from sketch images. It then co-optimizes these codes for accurate shape representation and provides three exploration modes: binary-attribute-guided, text-guided, and sketch-guided, to locate suitable exploration trajectories. |
CLIPXPlore produces meaningful shape variations based on different conditions.
Quantitative and qualitative evaluations show CLIPXPlore outperforms existing methods in shape exploration.
Model analysis confirms the effectiveness of the space connection and the co-optimization strategy. |
Exploring the latent space may lead to unexpected shape changes beyond the given condition.
Identifying the optimal step size along the exploration trajectory remains a challenge. |
3d shape exploration, clip, vision-language model, latent space exploration, multi-modal shape modeling |
2306.07969
Report |
GeneCIS: A Benchmark for General Conditional Image Similarity |
Sagar Vaze, Nicolas Carion, Ishan Misra |
We argue that there are many notions of 'similarity' and that models, like
humans, should be able to adapt to these dynamically. This contrasts with most
representation learning methods, supervised or self-supervised, which learn a
fixed embedding function and hence implicitly assume a single notion of
similarity. For instance, models trained on ImageNet are biased towards object
categories, while a user might prefer the model to focus on colors, textures or
specific elements in the scene. In this paper, we propose the GeneCIS
('genesis') benchmark, which measures models' ability to adapt to a range of
similarity conditions. Extending prior work, our benchmark is designed for
zero-shot evaluation only, and hence considers an open-set of similarity
conditions. We find that baselines from powerful CLIP models struggle on
GeneCIS and that performance on the benchmark is only weakly correlated with
ImageNet accuracy, suggesting that simply scaling existing methods is not
fruitful. We further propose a simple, scalable solution based on automatically
mining information from existing image-caption datasets. We find our method
offers a substantial boost over the baselines on GeneCIS, and further improves
zero-shot performance on related image retrieval benchmarks. In fact, though
evaluated zero-shot, our model surpasses state-of-the-art supervised models on
MIT-States. Project page at https://sgvaze.github.io/genecis/. |
The paper introduces GeneCIS, a benchmark designed to measure a model's ability to adapt to various notions of image similarity given explicit conditions. |
Most existing representation learning methods, supervised or self-supervised, learn a fixed embedding function and implicitly assume a single notion of similarity, which is insufficient for real-world applications. |
GeneCIS is constructed by re-purposing existing datasets (VAW, COCO) to create four retrieval tasks: Focus on an Attribute, Change an Attribute, Focus on an Object, and Change an Object. The authors propose a method to automatically mine training data for conditional image similarity from large-scale image-caption datasets by extracting Subject-Predicate-Object relationships. |
Baselines using only image or text information struggle on GeneCIS, indicating the benchmark effectively evaluates conditional similarity.
The proposed method, trained on mined triplets, outperforms CLIP-only baselines and even surpasses a model trained on manually annotated data (CIRR).
Performance on GeneCIS is weakly correlated with ImageNet accuracy, suggesting that the benchmark measures different aspects of model capability compared to traditional vision tasks. |
The benchmark currently relies on potentially noisy annotations from source datasets and requires manual verification.
Future work could explore mining triplets from even larger image-caption datasets like LAION-5B. |
image similarity, conditional similarity, benchmarking, representation learning, zero-shot learning |
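The triplet-mining idea in the GeneCIS record above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes captions have already been parsed into (image, subject, predicate, object) tuples, and the pairing rule used here (reference and target share a subject, and the condition is the target's object) is an assumption. The function name `mine_triplets` and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def mine_triplets(relations):
    """Build (reference, target, condition) triplets from parsed captions.

    relations: list of (image_id, subject, predicate, object) tuples.
    Assumed pairing rule: two images that share a subject form a pair,
    and the condition text is the object found in the target image.
    """
    by_subject = defaultdict(list)
    for image_id, subj, _pred, obj in relations:
        by_subject[subj].append((image_id, obj))

    triplets = []
    for entries in by_subject.values():
        # Ordered pairs, so each image can act as reference and as target.
        for (ref_id, ref_obj), (tgt_id, tgt_obj) in permutations(entries, 2):
            if ref_id != tgt_id and ref_obj != tgt_obj:
                triplets.append((ref_id, tgt_id, tgt_obj))
    return triplets


# Toy, hand-written relations (hypothetical data).
relations = [
    ("img1", "dog", "on", "sofa"),
    ("img2", "dog", "chasing", "ball"),
    ("img3", "cat", "on", "sofa"),
]
triplets = mine_triplets(relations)
print(triplets)
```

The cat image yields no triplet because no other image shares its subject; a real pipeline would also need to filter noisy parses.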
2306.07967
Report |
One-for-All: Generalized LoRA for Parameter-Efficient Fine-tuning |
Arnav Chavan, Zhuang Liu, Deepak Gupta, Eric Xing, Zhiqiang Shen |
We present Generalized LoRA (GLoRA), an advanced approach for universal
parameter-efficient fine-tuning tasks. Enhancing Low-Rank Adaptation (LoRA),
GLoRA employs a generalized prompt module to optimize pre-trained model weights
and adjust intermediate activations, providing more flexibility and capability
across diverse tasks and datasets. Moreover, GLoRA facilitates efficient
parameter adaptation by employing a scalable, modular, layer-wise structure
search that learns an individual adapter for each layer. Originating from a unified
mathematical formulation, GLoRA exhibits strong transfer learning, few-shot
learning and domain generalization abilities, as it adapts to new tasks through
not only weights but also additional dimensions like activations. Comprehensive
experiments demonstrate that GLoRA outperforms all previous methods in natural,
specialized, and structured vision benchmarks, achieving superior accuracy with
fewer parameters and computations. The proposed method on LLaMA-1 and LLaMA-2
also shows considerable enhancements compared to the original LoRA in the
language domain. Furthermore, our structural re-parameterization design ensures
that GLoRA incurs no extra inference cost, rendering it a practical solution
for resource-limited applications. Code and models are available at:
https://github.com/Arnav0400/ViT-Slim/tree/master/GLoRA. |
This paper introduces Generalized LoRA (GLoRA), a universal parameter-efficient fine-tuning method that improves upon LoRA by optimizing pre-trained model weights and adjusting intermediate activations. |
GLoRA addresses limitations of existing parameter-efficient fine-tuning methods, offering more flexibility and capability across diverse tasks and datasets while avoiding extra inference costs. |
GLoRA employs a generalized prompt module and facilitates efficient parameter adaptation using a scalable, modular, layer-wise structure search with a unified mathematical formulation. |
GLoRA outperforms previous PEFT methods on VTAB-1K, achieving state-of-the-art accuracy with fewer parameters.
It shows superior few-shot learning abilities on fine-grained visual recognition datasets.
GLoRA exhibits strong domain generalization capabilities, outperforming existing methods on out-of-domain datasets. |
The search process in GLoRA, while automated, can increase training time compared to methods requiring manual hyperparameter tuning.
The paper primarily focuses on vision tasks, with limited exploration of GLoRA's potential in other domains like NLP. |
parameter-efficient fine-tuning, low-rank adaptation (lora), transfer learning, few-shot learning, domain generalization |
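The claim in the GLoRA record that structural re-parameterization adds no inference cost can be illustrated with plain LoRA, which is a simpler special case and not GLoRA's full formulation. The sketch below assumes a frozen weight `W0` and a low-rank update `B @ A`; all matrices and names are hypothetical toy values. It checks that folding the update into the weight gives the same output as running the adapter as a separate branch.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def add(a, b):
    """Element-wise sum of two matrices of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


W0 = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]  # frozen pre-trained weight (2x3)
B = [[0.5], [-1.0]]                        # low-rank factor (2x1), trainable
A = [[1.0, 0.0, 2.0]]                      # low-rank factor (1x3), trainable
x = [1.0, 2.0, 3.0]

# Training-time view: frozen branch plus a separate low-rank branch.
adapter_out = [p + q for p, q in zip(matvec(W0, x), matvec(B, matvec(A, x)))]

# Inference-time view: fold the update into a single weight matrix.
W_merged = add(W0, matmul(B, A))
merged_out = matvec(W_merged, x)

print(adapter_out, merged_out)
```

After merging, inference uses one matrix of the original shape, so the adapter costs nothing extra at run time.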
2306.07954
Report |
Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation |
Shuai Yang, Yifan Zhou, Ziwei Liu, Chen Change Loy |
Large text-to-image diffusion models have exhibited impressive proficiency in
generating high-quality images. However, when applying these models to the video
domain, ensuring temporal consistency across video frames remains a formidable
challenge. This paper proposes a novel zero-shot text-guided video-to-video
translation framework to adapt image models to videos. The framework includes
two parts: key frame translation and full video translation. The first part
uses an adapted diffusion model to generate key frames, with hierarchical
cross-frame constraints applied to enforce coherence in shapes, textures and
colors. The second part propagates the key frames to other frames with
temporal-aware patch matching and frame blending. Our framework achieves global
style and local texture temporal consistency at a low cost (without re-training
or optimization). The adaptation is compatible with existing image diffusion
techniques, allowing our framework to take advantage of them, such as
customizing a specific subject with LoRA, and introducing extra spatial
guidance with ControlNet. Extensive experimental results demonstrate the
effectiveness of our proposed framework over existing methods in rendering
high-quality and temporally-coherent videos. |
This paper introduces a novel zero-shot framework for text-guided video-to-video translation, capable of rendering temporally consistent videos by adapting pre-trained image diffusion models to videos. |
Existing text-to-image diffusion models struggle to maintain temporal consistency when applied to videos. This work addresses this challenge by proposing a method that leverages the strengths of both diffusion models and frame interpolation for high-quality, efficient, and temporally consistent video translation. |
The framework comprises two stages: key frame translation and full video translation. Key frame translation utilizes hierarchical cross-frame constraints, including style-aware cross-frame attention, shape-aware latent fusion, pixel-aware latent fusion with a novel fidelity-oriented image encoding method, and color-aware adaptive latent adjustment, to ensure temporal consistency at different levels. Full video translation propagates the rendered key frames to other frames using temporal-aware patch matching and frame blending (adapted from EbSynth). |
The proposed framework outperforms existing zero-shot video translation methods in terms of visual quality and temporal consistency, as demonstrated by both qualitative and quantitative evaluations.
The hierarchical cross-frame constraints effectively enforce temporal consistency at different levels, from global style to local texture.
The fidelity-oriented image encoding significantly reduces error accumulation during iterative encoding and decoding, crucial for preserving details in the pixel-aware latent fusion. |
The framework relies on accurate optical flow estimation, which may be challenging for large motions or significant appearance changes.
Uniform key frame sampling may not be optimal for all videos, and future work could explore content-aware key frame selection or user-interactive translation. |
video-to-video translation, text-guided video editing, diffusion models, temporal consistency, zero-shot learning |
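The second stage of the Rerender A Video record above propagates key frames to the frames between them. The sketch below covers only the scheduling side, under simplifying assumptions: key frames are sampled uniformly, the last frame is always added as a key frame, and each in-between frame gets linear distance-based weights over its two neighboring key frames. The paper's actual propagation uses temporal-aware patch matching and blending adapted from EbSynth, which this does not implement; the function names are hypothetical.

```python
def key_frame_indices(num_frames, interval):
    """Uniformly sample key frames; the last frame is appended as a key
    frame (an assumption) so every other frame has two neighbors."""
    indices = list(range(0, num_frames, interval))
    if indices[-1] != num_frames - 1:
        indices.append(num_frames - 1)
    return indices


def blend_weights(frame, keys):
    """Return {key_frame_index: weight} for one frame.

    Linear distance weighting is a stand-in for the patch-based
    blending used in the paper.
    """
    if frame in keys:
        return {frame: 1.0}
    left = max(k for k in keys if k < frame)
    right = min(k for k in keys if k > frame)
    w_right = (frame - left) / (right - left)
    return {left: 1.0 - w_right, right: w_right}


keys = key_frame_indices(num_frames=11, interval=5)
weights = blend_weights(2, keys)
print(keys, weights)
```

A content-aware selector, which the record names as future work, would replace `key_frame_indices` and leave the blending interface unchanged.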
2306.07881
Report |
Viewset Diffusion: (0-)Image-Conditioned 3D Generative Models from 2D Data |
Stanislaw Szymanowicz, Christian Rupprecht, Andrea Vedaldi |
We present Viewset Diffusion, a diffusion-based generator that outputs 3D
objects while only using multi-view 2D data for supervision. We note that there
exists a one-to-one mapping between viewsets, i.e., collections of several 2D
views of an object, and 3D models. Hence, we train a diffusion model to
generate viewsets, but design the neural network generator to reconstruct
internally corresponding 3D models, thus generating those too. We fit a
diffusion model to a large number of viewsets for a given category of objects.
The resulting generator can be conditioned on zero, one or more input views.
Conditioned on a single view, it performs 3D reconstruction accounting for the
ambiguity of the task and allowing one to sample multiple solutions compatible with
the input. The model performs reconstruction efficiently, in a feed-forward
manner, and is trained using only rendering losses with as few as three views
per viewset. Project page: szymanowiczs.github.io/viewset-diffusion. |
Introduces Viewset Diffusion, a diffusion-based generative model for 3D objects that learns from multi-view 2D data, enabling probabilistic single-view 3D reconstruction and unconditional generation. |
Addresses the limitations of deterministic 3D reconstruction methods in handling ambiguity and leverages the abundance of 2D data for learning 3D object priors. |
Trains a DDPM to generate viewsets by reconstructing a 3D radiance field internally, allowing for conditional generation based on varying noise levels across input views. |
Achieves state-of-the-art single-view reconstruction results on ShapeNet-SRN Cars in terms of PSNR.
Demonstrates superior perceptual quality and sharpness in reconstructions compared to deterministic baselines, particularly in ambiguous settings.
Enables unconditional 3D generation with higher visual detail than previous diffusion-based methods trained on single views. |
Reconstruction quality on complex objects with high ambiguity can be further improved, potentially by exploring larger sample sizes during inference.
Exploring alternative 3D representations beyond radiance fields could enhance the model's efficiency and expressiveness. |
3d reconstruction, diffusion models, generative models, single-view reconstruction, computer vision |
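The Viewset Diffusion record above describes conditioning through different noise levels across the views of a viewset. A minimal sketch of that idea follows, using the standard DDPM forward process x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps. The linear beta schedule, its endpoints, and the flat-list image representation are assumptions for illustration only.

```python
import math
import random


def alpha_bar(t, num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t on a linear schedule."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def noise_view(view, t, num_steps, rng):
    """Apply the DDPM forward process to one view at timestep t."""
    ab = alpha_bar(t, num_steps)
    return [math.sqrt(ab) * p + math.sqrt(1.0 - ab) * rng.gauss(0.0, 1.0) for p in view]


def noise_viewset(views, timesteps, num_steps=1000, seed=0):
    """Noise each view with its own timestep; t = 0 leaves a view clean."""
    rng = random.Random(seed)
    return [noise_view(v, t, num_steps, rng) for v, t in zip(views, timesteps)]


# Three toy "views", each a flat list of pixel values.
views = [[0.1, 0.5, 0.9], [0.2, 0.4, 0.6], [0.3, 0.3, 0.3]]

# Single-view reconstruction: first view clean (t = 0), the rest fully noised.
noisy = noise_viewset(views, timesteps=[0, 1000, 1000])
print(noisy[0])
```

Setting every timestep to the maximum corresponds to unconditional generation, and setting several to zero corresponds to conditioning on more than one view.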
2306.07754
Report |
Generative Watermarking Against Unauthorized Subject-Driven Image Synthesis |
Yihan Ma, Zhengyu Zhao, Xinlei He, Zheng Li, Michael Backes, Yang Zhang |
Large text-to-image models have shown remarkable performance in synthesizing
high-quality images. In particular, the subject-driven model makes it possible
to personalize the image synthesis for a specific subject, e.g., a human face
or an artistic style, by fine-tuning the generic text-to-image model with a few
images from that subject. Nevertheless, misuse of subject-driven image
synthesis may violate the authority of subject owners. For example, malicious
users may use subject-driven synthesis to mimic specific artistic styles or to
create fake facial images without authorization. To protect subject owners
against such misuse, recent attempts have commonly relied on adversarial
examples to indiscriminately disrupt subject-driven image synthesis. However,
this essentially prevents any benign use of subject-driven synthesis based on
protected images.
In this paper, we take a different angle and aim at protection without
sacrificing the utility of protected images for general synthesis purposes.
Specifically, we propose GenWatermark, a novel watermark system based on
jointly learning a watermark generator and a detector. In particular, to help
the watermark survive the subject-driven synthesis, we incorporate the
synthesis process in learning GenWatermark by fine-tuning the detector with
synthesized images for a specific subject. This operation is shown to largely
improve the watermark detection accuracy and also ensure the uniqueness of the
watermark for each individual subject. Extensive experiments validate the
effectiveness of GenWatermark, especially in practical scenarios with unknown
models and text prompts (74% Acc.), as well as partial data watermarking (80%
Acc. for 1/4 watermarking). We also demonstrate the robustness of GenWatermark
to two potential countermeasures that substantially degrade the synthesis
quality. |
This paper introduces GenWatermark, a novel generative watermarking method designed to safeguard images from unauthorized subject-driven synthesis while preserving their usability for authorized purposes. |
The rise of subject-driven image synthesis models raises concerns about the potential misuse of personal images, such as replicating an artist's style or generating fake facial images without consent. Existing protection methods often disrupt both malicious and benign uses, hindering authorized applications. |
GenWatermark employs a two-phase learning approach. The first phase involves jointly training a watermark generator and detector on a large-scale dataset. In the second phase, the detector is fine-tuned for each subject using images synthesized from both clean and watermarked versions of their images. |
GenWatermark achieves high detection accuracy (above 98%) in scenarios with known models and prompts, and maintains reasonable accuracy (around 74%) even with unknown models and prompts.
Injecting watermarks has minimal impact on image synthesis quality, with FID scores changing by less than 1%.
GenWatermark demonstrates robustness against partial watermarking, watermark forgery with random noise, and watermark removal attempts using image transformations. |
The cross-model transferability of GenWatermark could be further enhanced, potentially by incorporating model-specific properties during detector fine-tuning.
While GenWatermark exhibits substantial watermark uniqueness, there is potential for improvement by fine-tuning both the generator and detector based on subject-specific images. |
image watermarking, subject-driven synthesis, generative models, image protection, digital copyright |
2306.07716
Report |
Dynamically Masked Discriminator for Generative Adversarial Networks |
Wentian Zhang, Haozhe Liu, Bing Li, Jinheng Xie, Yawen Huang, Yuexiang Li, Yefeng Zheng, Bernard Ghanem |
Training Generative Adversarial Networks (GANs) remains a challenging
problem. The discriminator trains the generator by learning the distribution of
real/generated data. However, the distribution of generated data changes
throughout the training process, which is difficult for the discriminator to
learn. In this paper, we propose a novel method for GANs from the viewpoint of
online continual learning. We observe that the discriminator model, trained on
historically generated data, often slows down its adaptation to the changes in
the newly arriving generated data, which accordingly decreases the quality of
generated results. By treating the generated data in training as a stream, we
propose to detect whether the discriminator slows down the learning of new
knowledge in generated data. Therefore, we can explicitly enforce the
discriminator to learn new knowledge fast. Particularly, we propose a new
discriminator, which automatically detects its retardation and then dynamically
masks its features, such that the discriminator can adaptively learn the
temporally varying distribution of generated data. Experimental results show our
method outperforms the state-of-the-art approaches. |
This paper proposes DMD, a novel method for training GANs that tackles the challenge of time-varying generated data distributions by viewing it as an online continual learning problem. |
Training GANs is difficult due to the discriminator's struggle to adapt to the evolving distribution of generated data throughout the training process, leading to subpar generated results. |
DMD employs two key modules: (1) discriminator retardation detection, which identifies when the discriminator relies too heavily on past data, and (2) dynamic discriminator adjustment, which utilizes dynamic feature masking to force the discriminator to learn new knowledge rapidly. |
DMD achieves state-of-the-art FID scores on FFHQ, AFHQ-V2, and LSUN-Church datasets, outperforming both traditional GAN methods and diffusion models.
Ablation studies demonstrate that DMD's dynamic masking strategy is superior to fixed interval masking or dropout.
Analysis reveals that DMD effectively reduces the discriminator's reliance on historical knowledge while improving its adaptation to new data distributions. |
The paper lacks theoretical analysis of the proposed method.
Further investigation is needed to explore the integration of DMD with Transformer-based GAN models. |
generative adversarial networks (gans), online continual learning, dynamic feature masking, discriminator regularization, image generation |
2306.07684
Report |
Lookaround Optimizer: $k$ steps around, 1 step average |
Jiangtao Zhang, Shunyu Liu, Jie Song, Tongtian Zhu, Zhengqi Xu, Mingli Song |
Weight Average (WA) is an active research topic due to its simplicity in
ensembling deep networks and the effectiveness in promoting generalization.
Existing weight average approaches, however, are often carried out along only
one training trajectory in a post-hoc manner (i.e., the weights are averaged
after the entire training process is finished), which significantly degrades
the diversity between networks and thus impairs the effectiveness. In this
paper, inspired by weight average, we propose Lookaround, a straightforward yet
effective SGD-based optimizer leading to flatter minima with better
generalization. Specifically, Lookaround iterates two steps during the whole
training period: the around step and the average step. In each iteration, 1)
the around step starts from a common point and trains multiple networks
simultaneously, each on transformed data by a different data augmentation, and
2) the average step averages these trained networks to get the averaged
network, which serves as the starting point for the next iteration. The around
step improves the functionality diversity while the average step guarantees the
weight locality of these networks during the whole training, which is essential
for WA to work. We theoretically explain the superiority of Lookaround by
convergence analysis, and conduct extensive experiments to evaluate Lookaround on
popular benchmarks including CIFAR and ImageNet with both CNNs and ViTs,
demonstrating clear superiority over state-of-the-art methods. Our code is available
at https://github.com/Ardcy/Lookaround. |
This paper proposes Lookaround, an SGD-based optimizer that leverages data augmentation and iterative weight averaging throughout training to find flatter minima, enhancing generalization performance in deep neural networks. |
Existing weight averaging techniques for deep network ensembling often struggle to balance model diversity and weight locality, limiting their effectiveness in finding flat minima and improving generalization. |
Lookaround iterates two steps: 1) "around" trains multiple networks from a common starting point using diverse data augmentations for higher functional diversity, and 2) "average" averages these network weights to maintain weight locality and guide training towards flatter minima. |
Theoretical analysis demonstrates Lookaround achieves lower variance and faster convergence than SGD and Lookahead.
Empirical evaluations on CIFAR and ImageNet datasets, with both CNNs and ViTs, show consistent performance improvements and improved training stability compared to baselines and ensemble methods.
Lookaround's performance is robust across varying data augmentation counts and around step sizes. |
Training time increases in proportion to the number of networks trained, due to the multiple forward and backward passes.
Future work could explore learning rate schedulers tailored to Lookaround's iterative averaging for optimal performance. |
deep learning, optimization, weight averaging, generalization, data augmentation |
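The two alternating steps in the Lookaround record above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the released implementation: the model is a single scalar weight fit to y = 3x, the "augmentations" are simple transforms that keep that relation intact, plain gradient descent replaces SGD, and the branches run sequentially rather than simultaneously. All names and hyperparameters are hypothetical.

```python
def grad(w, data):
    """Gradient of the mean squared error of the model y = w * x."""
    return sum(2.0 * x * (w * x - y) for x, y in data) / len(data)


def lookaround(w, data, augmentations, k=5, lr=0.01, rounds=50):
    for _ in range(rounds):
        branch_weights = []
        for aug in augmentations:
            # Around step: every branch starts from the common point w and
            # takes k gradient steps on its own augmented copy of the data.
            w_branch = w
            aug_data = [aug(x, y) for x, y in data]
            for _ in range(k):
                w_branch -= lr * grad(w_branch, aug_data)
            branch_weights.append(w_branch)
        # Average step: the mean becomes the next common starting point.
        w = sum(branch_weights) / len(branch_weights)
    return w


data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # samples of y = 3x
augmentations = [
    lambda x, y: (x, y),              # identity
    lambda x, y: (-x, -y),            # sign flip keeps y = 3x
    lambda x, y: (2.0 * x, 2.0 * y),  # rescaling keeps y = 3x
]
w_final = lookaround(0.0, data, augmentations)
print(w_final)
```

Because each branch restarts from the averaged weight every round, the branches never drift far apart, which is the weight locality the record describes.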
2306.07596
Report |
Paste, Inpaint and Harmonize via Denoising: Subject-Driven Image Editing with Pre-Trained Diffusion Model |
Xin Zhang, Jiaxian Guo, Paul Yoo, Yutaka Matsuo, Yusuke Iwasawa |
Text-to-image generative models have attracted rising attention for flexible
image editing via user-specified descriptions. However, text descriptions alone
are not enough to elaborate the details of subjects, often compromising the
subjects' identity or requiring additional per-subject fine-tuning. We
introduce a new framework called Paste, Inpaint and Harmonize via
Denoising (PhD), which leverages an exemplar image in addition to text
descriptions to specify user intentions. In the pasting step, an off-the-shelf
segmentation model is employed to identify a user-specified subject within an
exemplar image which is subsequently inserted into a background image to serve
as an initialization capturing both scene context and subject identity in one.
To guarantee the visual coherence of the generated or edited image, we
introduce an inpainting and harmonizing module to guide the pre-trained
diffusion model to seamlessly blend the inserted subject into the scene.
As we keep the pre-trained diffusion model frozen, we preserve its
strong image synthesis ability and text-driven ability, thus achieving
high-quality results and flexible editing with diverse texts. In our
experiments, we apply PhD to both subject-driven image editing tasks and
explore text-driven scene generation given a reference subject. Both
quantitative and qualitative comparisons with baseline methods demonstrate that
our approach achieves state-of-the-art performance in both tasks. More
qualitative results can be found at
https://sites.google.com/view/phd-demo-page. |
This paper introduces PhD, a novel framework that leverages pre-trained diffusion models and an inpainting and harmonizing module for subject-driven image editing and generation. |
Existing text-to-image methods struggle to accurately portray user-specific subjects, and subject-driven editing methods often compromise subject identity or model flexibility. |
PhD uses a two-step process: 1) pasting a segmented subject from an exemplar image onto a background scene, 2) using an inpainting and harmonizing module to guide a frozen pre-trained diffusion model in generating context-consistent and photorealistic images. |
PhD achieves state-of-the-art performance on subject-driven image editing tasks, surpassing baselines in metrics like FID and CLIP scores.
The method effectively preserves both subject identity and background scene quality, as shown in quantitative and qualitative evaluations.
PhD demonstrates promising results for subject-driven scene generation and style transfer by leveraging the text-guided capabilities of the frozen diffusion model. |
While PhD excels in harmonizing subjects into scenes, it can struggle with generating unseen regions of the subject.
Future work will explore incorporating 3D information to enhance the generation of complete and consistent subjects. |
image editing, image generation, diffusion models, subject-driven synthesis, image harmonization |
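The pasting step in the PhD record above amounts to compositing a segmented subject onto a background using its mask. The sketch below shows only that compositing, on tiny grayscale grids stored as nested lists; the segmentation model and the inpainting and harmonizing module are out of scope. The function name `paste` and its argument layout are hypothetical.

```python
def paste(background, subject, mask, top, left):
    """Composite `subject` onto a copy of `background`.

    mask holds 1 where the segmentation marks the subject and 0 elsewhere;
    (top, left) is where the subject's top-left corner is placed.
    """
    out = [row[:] for row in background]
    for i, (subject_row, mask_row) in enumerate(zip(subject, mask)):
        for j, (s, m) in enumerate(zip(subject_row, mask_row)):
            out[top + i][left + j] = m * s + (1 - m) * out[top + i][left + j]
    return out


background = [[0, 0, 0, 0] for _ in range(3)]
subject = [[9, 9], [9, 9]]
mask = [[1, 0], [1, 1]]  # the top-right pixel is not part of the subject

initial = paste(background, subject, mask, top=1, left=1)
for row in initial:
    print(row)
```

The composite is only an initialization: its seams and lighting mismatches are what the frozen diffusion model is then guided to blend into the scene.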
2306.07470
Report |
Reviving Shift Equivariance in Vision Transformers |
Peijian Ding, Davit Soselia, Thomas Armstrong, Jiahao Su, Furong Huang |
Shift equivariance is a fundamental principle that governs how we perceive
the world - our recognition of an object remains invariant with respect to
shifts. Transformers have gained immense popularity due to their effectiveness
in both language and vision tasks. While the self-attention operator in vision
transformers (ViT) is permutation-equivariant and thus shift-equivariant, patch
embedding, positional encoding, and subsampled attention in ViT variants can
disrupt this property, resulting in inconsistent predictions even under small
shift perturbations. Although there is a growing trend in incorporating the
inductive bias of convolutional neural networks (CNNs) into vision
transformers, it does not fully address the issue. We propose an adaptive
polyphase anchoring algorithm that can be seamlessly integrated into vision
transformer models to ensure shift-equivariance in patch embedding and
subsampled attention modules, such as window attention and global subsampled
attention. Furthermore, we utilize depth-wise convolution to encode positional
information. Our algorithms enable ViT and its variants, such as Twins, to
achieve 100% consistency with respect to input shift, demonstrate robustness to
cropping, flipping, and affine transformations, and maintain consistent
predictions even where the original models lose 20 percentage points on average
when shifted by just a few pixels, with Twins' accuracy dropping from 80.57% to
62.40%. |
This paper introduces an adaptive polyphase anchoring algorithm to address the lack of shift equivariance in vision transformers (ViT), improving their robustness to image translations. |
Shift equivariance is a crucial property for consistent object recognition regardless of its position. Existing ViT models often lack this, leading to inconsistent predictions even with small shifts in the input image. |
The authors propose replacing non-shift-equivariant modules in ViTs, like patch embedding and subsampled attention, with their polyphase anchoring counterparts. Additionally, they utilize depth-wise convolution with circular padding for positional encoding, further enhancing shift equivariance. |
The modified ViT models achieve 100% consistency in image classification under shift perturbations.
The approach provides significant improvements in accuracy under challenging transformations like cropping, flipping, and affine transformations.
The proposed method leads to a substantial gain in robustness, as demonstrated by an average improvement of 20 percentage points under a worst-of-30 shift attack. |
Due to computational limitations, this work focuses on controlled experiments rather than achieving state-of-the-art accuracy.
Future work will explore using larger-scale computing resources to optimize the proposed method for state-of-the-art performance on ViT models. |
vision transformers, shift equivariance, polyphase anchoring, robustness, image classification |
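The polyphase anchoring idea in the record above can be illustrated in one dimension. The sketch assumes circular shifts, a patch size equal to the stride, and a patch "embedding" that simply sums the values in each patch; it picks the polyphase component with the largest energy and rolls the signal so that component sits at the anchor position. The paper's 2D algorithm may differ in detail, and the function names are hypothetical.

```python
def anchor(signal, stride):
    """Roll the signal so its highest-energy polyphase component starts at 0."""
    energies = [sum(v * v for v in signal[i::stride]) for i in range(stride)]
    best = max(range(stride), key=lambda i: energies[i])
    return signal[best:] + signal[:best]


def patch_tokens(signal, stride):
    """Anchor, then embed non-overlapping patches (here: sum each patch)."""
    anchored = anchor(signal, stride)
    return [sum(anchored[j:j + stride]) for j in range(0, len(anchored), stride)]


def circular_shift(signal, d):
    """Shift the signal right by d positions with wrap-around."""
    d %= len(signal)
    return signal[-d:] + signal[:-d] if d else list(signal)


def is_rotation(a, b):
    """True if list a is a circular rotation of list b."""
    return len(a) == len(b) and any(a == b[r:] + b[:r] for r in range(len(b)))


x = [1, 5, 2, 0, 3, 9, 4, 1]
base = patch_tokens(x, stride=4)
shifted_tokens = [patch_tokens(circular_shift(x, d), stride=4) for d in range(len(x))]
print(base, shifted_tokens)
```

For every shift the tokens are the same values, at most rotated as a sequence, so the patch grid follows the content instead of the image origin. The argument relies on the maximum-energy component being unique; ties would need a deterministic tie-break.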
2306.07346
Report |
Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training |
Lorenzo Baraldi, Roberto Amoroso, Marcella Cornia, Lorenzo Baraldi, Andrea Pilzer, Rita Cucchiara |
The use of self-supervised pre-training has emerged as a promising approach
to enhance the performance of visual tasks such as image classification. In
this context, recent approaches have employed the Masked Image Modeling
paradigm, which pre-trains a backbone by reconstructing visual tokens
associated with randomly masked image patches. This masking approach, however,
introduces noise into the input data during pre-training, leading to
discrepancies that can impair performance during the fine-tuning phase.
Furthermore, input masking neglects the dependencies between corrupted patches,
increasing the inconsistencies observed in downstream fine-tuning tasks. To
overcome these issues, we propose a new self-supervised pre-training approach,
named Masked and Permuted Vision Transformer (MaPeT), that employs
autoregressive and permuted predictions to capture intra-patch dependencies. In
addition, MaPeT employs auxiliary positional information to reduce the
disparity between the pre-training and fine-tuning phases. In our experiments,
we employ a fair setting to ensure reliable and meaningful comparisons and
conduct investigations on multiple visual tokenizers, including our proposed
$k$-CLIP which directly employs discretized CLIP features. Our results
demonstrate that MaPeT achieves competitive performance on ImageNet, compared
to baselines and competitors under the same model setting. Source code and
trained models are publicly available at: https://github.com/aimagelab/MaPeT. |
The paper introduces Masked and Permuted Vision Transformer (MaPeT), a self-supervised pre-training approach for vision tasks, and k-CLIP, a novel visual tokenizer employing discretized CLIP features. |
Addresses limitations of Masked Image Modeling (MIM) in self-supervised pre-training for vision tasks, aiming to improve performance by capturing inter-token dependencies and mitigating pre-training/fine-tuning discrepancies. |
Combines masked and permuted image modeling strategies, incorporating auxiliary position embeddings to provide full patch position information. Leverages two-stream self-attention mechanism and attention masking for capturing bidirectional context. k-CLIP tokenizer directly utilizes discretized CLIP features for visual token generation. |
MaPeT consistently outperforms MIM and permutation-based image pre-training (PIM) in image classification tasks.
k-CLIP tokenizer demonstrates superior performance compared to VQ-KD and DALL-E, especially with smaller models, due to its ability to capture rich semantic information.
MaPeT exhibits strong cross-domain transfer learning capabilities, showcasing its potential for real-world applications. |
High computational requirements during training pose challenges for widespread adoption.
Further research is needed to assess scalability and adaptability to more diverse and complex domains. |
self-supervised learning, vision transformers, masked image modeling, visual tokenizers, clip |
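The permuted, autoregressive prediction described in the MaPeT record above relies on an attention mask derived from a random factorization order. The sketch below builds such a mask in the XLNet style: a token may attend only to tokens that come earlier in the sampled order. Treating this as MaPeT's exact masking rule is an assumption; the two-stream attention and auxiliary position embeddings are not modeled, and the function name is hypothetical.

```python
import random


def permutation_attention_mask(num_tokens, seed=0):
    """Sample a factorization order and derive the attention mask.

    mask[i][j] is True when token i may attend to token j, i.e. when j
    comes strictly earlier than i in the sampled order.
    """
    rng = random.Random(seed)
    order = list(range(num_tokens))
    rng.shuffle(order)  # order[r] is the token predicted at step r
    rank = {token: r for r, token in enumerate(order)}
    mask = [[rank[j] < rank[i] for j in range(num_tokens)] for i in range(num_tokens)]
    return order, mask


order, mask = permutation_attention_mask(num_tokens=5)
visible_counts = [sum(row) for row in mask]
print(order, visible_counts)
```

The first token in the order sees no context and the last sees every other token, so across many sampled orders each patch is predicted from varied subsets of the others.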
2306.07280
Report |
Controlling Text-to-Image Diffusion by Orthogonal Finetuning |
Zeju Qiu, Weiyang Liu, Haiwen Feng, Yuxuan Xue, Yao Feng, Zhen Liu, Dan Zhang, Adrian Weller, Bernhard Schölkopf |
Large text-to-image diffusion models have impressive capabilities in
generating photorealistic images from text prompts. How to effectively guide or
control these powerful models to perform different downstream tasks becomes an
important open problem. To tackle this challenge, we introduce a principled
finetuning method -- Orthogonal Finetuning (OFT), for adapting text-to-image
diffusion models to downstream tasks. Unlike existing methods, OFT can provably
preserve hyperspherical energy which characterizes the pairwise neuron
relationship on the unit hypersphere. We find that this property is crucial for
preserving the semantic generation ability of text-to-image diffusion models.
To improve finetuning stability, we further propose Constrained Orthogonal
Finetuning (COFT) which imposes an additional radius constraint to the
hypersphere. Specifically, we consider two important finetuning text-to-image
tasks: subject-driven generation where the goal is to generate subject-specific
images given a few images of a subject and a text prompt, and controllable
generation where the goal is to enable the model to take in additional control
signals. We empirically show that our OFT framework outperforms existing
methods in generation quality and convergence speed. |
This paper presents Orthogonal Finetuning (OFT), a novel method for adapting large text-to-image diffusion models to downstream tasks while preserving their generative capabilities. |
Fine-tuning diffusion models for tasks like subject-driven and controllable generation is crucial for extending their utility, but existing methods often fail to preserve the pretrained model's generative ability or struggle to balance generation quality with task-specific control. |
OFT learns layer-wise orthogonal transformations of neuron weights, provably preserving hyperspherical energy, which is argued to be key to retaining the semantic knowledge of the pretrained model. A constrained variant (COFT) further enhances stability by limiting deviation from pretrained weights. |
OFT demonstrates significantly improved stability and convergence speed over DreamBooth and LoRA in subject-driven generation.
For controllable generation, OFT achieves superior control accuracy compared to ControlNet, T2I-Adapter, and LoRA, often with fewer training data and parameters.
Experiments on various control tasks (Canny edges, segmentation maps, landmarks) validate OFT's effectiveness, showcasing its ability to generate high-fidelity images with accurate control. |
The reliance on Cayley parametrization for orthogonality introduces a matrix inversion step that can be computationally expensive for large models.
While block-diagonal parametrization improves efficiency, it introduces biases and limits flexibility; exploring alternative efficient parametrizations is crucial. |
text-to-image generation, diffusion models, fine-tuning, controllable generation, subject-driven generation |
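The OFT record above mentions the Cayley parametrization and the preservation of pairwise neuron relationships. A small worked sketch follows for the 2x2 case, where the inverse has a closed form: it builds R = (I + S)(I - S)^{-1} from a skew-symmetric S and checks that applying R to two neuron vectors leaves their norms and the cosine between them unchanged. The toy neurons and the parameter value are hypothetical, and real layers would use larger (often block-diagonal) matrices.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def cayley_2x2(a):
    """R = (I + S)(I - S)^{-1} for the skew-symmetric S = [[0, a], [-a, 0]]."""
    i_plus_s = [[1.0, a], [-a, 1.0]]
    det = 1.0 + a * a  # determinant of I - S = [[1, -a], [a, 1]]
    inv_i_minus_s = [[1.0 / det, a / det], [-a / det, 1.0 / det]]
    return matmul(i_plus_s, inv_i_minus_s)


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


neurons = [[1.0, 0.0], [1.0, 1.0]]  # two pretrained neuron vectors (toy values)
R = cayley_2x2(0.7)                 # 0.7 stands in for a learned parameter
rotated = [matvec(R, w) for w in neurons]

before = cosine(neurons[0], neurons[1])
after = cosine(rotated[0], rotated[1])
print(before, after)
```

Since hyperspherical energy depends only on the angles between normalized neurons, any orthogonal R leaves it unchanged; the inversion inside the Cayley map is the cost the limitations line refers to.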
2306.07257
Report |
MovieFactory: Automatic Movie Creation from Text using Large Generative Models for Language and Images |
Junchen Zhu, Huan Yang, Huiguo He, Wenjing Wang, Zixi Tuo, Wen-Huang Cheng, Lianli Gao, Jingkuan Song, Jianlong Fu |
In this paper, we present MovieFactory, a powerful framework to generate
cinematic-picture (3072$\times$1280), film-style (multi-scene), and
multi-modality (sounding) movies on the demand of natural languages. As the
first fully automated movie generation model to the best of our knowledge, our
approach empowers users to create captivating movies with smooth transitions
using simple text inputs, surpassing existing methods that produce soundless
videos limited to a single scene of modest quality. To facilitate this
distinctive functionality, we leverage ChatGPT to expand user-provided text
into detailed sequential scripts for movie generation. Then we bring scripts to
life visually and acoustically through vision generation and audio retrieval.
To generate videos, we extend the capabilities of a pretrained text-to-image
diffusion model through a two-stage process. Firstly, we employ spatial
finetuning to bridge the gap between the pretrained image model and the new
video dataset. Subsequently, we introduce temporal learning to capture object
motion. In terms of audio, we leverage sophisticated retrieval models to select
and align audio elements that correspond to the plot and visual content of the
movie. Extensive experiments demonstrate that our MovieFactory produces movies
with realistic visuals, diverse scenes, and seamlessly fitting audio, offering
users a novel and immersive experience. Generated samples can be found on
YouTube or Bilibili (1080P). |
MovieFactory: a novel framework for generating high-definition, cinematic-style (ultrawide format), multi-scene, and sounding movies from text inputs. |
Automatic movie generation is a challenging task with the potential to democratize filmmaking and empower individuals to bring their stories to life. |
The framework leverages ChatGPT for script generation, a two-stage fine-tuned diffusion model for video generation, and retrieval models for synchronized audio. |
MovieFactory produces high-quality videos with clear visuals and smooth object motion.
The two-stage training strategy effectively addresses the domain shift between image and video datasets.
The framework successfully combines large-scale AI models to create engaging and immersive movie experiences. |
Current limitations include reliance on retrieval-based audio and the potential for further enhancing the quality of generated content.
Future work may explore end-to-end audio generation, improved temporal consistency, and the incorporation of user feedback for interactive movie creation. |
movie generation, diffusion model, text-to-video, chatgpt, multi-modal generation |
2306.07200
Report |
Fill-Up: Balancing Long-Tailed Data with Generative Models |
Joonghyuk Shin, Minguk Kang, Jaesik Park |
Modern text-to-image synthesis models have achieved an exceptional level of
photorealism, generating high-quality images from arbitrary text descriptions.
In light of the impressive synthesis ability, several studies have exhibited
promising results in exploiting generated data for image recognition. However,
directly supplementing data-hungry situations in the real world (e.g., few-shot
or long-tailed scenarios) with existing approaches results in marginal
performance gains, as they struggle to fully reflect the distribution of the
real data. Through extensive experiments, this paper proposes a new image
synthesis pipeline for long-tailed situations using Textual Inversion. The
study demonstrates that generated images from textual-inverted text tokens
effectively align with the real domain, significantly enhancing the
recognition ability of a standard ResNet50 backbone. We also show that
real-world data imbalance scenarios can be successfully mitigated by filling up
the imbalanced data with synthetic images. In conjunction with techniques in
the area of long-tailed recognition, our method achieves state-of-the-art
results on standard long-tailed benchmarks when trained from scratch. |
This paper proposes a new image synthesis pipeline for long-tailed image recognition using Textual Inversion, which fills up imbalanced datasets with synthetic images aligned with the real domain, improving recognition accuracy. |
Real-world data often exhibits imbalanced distributions, posing challenges for long-tailed recognition tasks, and existing synthetic data generation methods struggle to reflect real data distributions effectively. |
The authors evaluate various image generation strategies, finding Textual Inversion most effective. They optimize per-class text tokens to generate images that align with the real domain and adopt a two-stage training procedure with Balanced Softmax loss. |
Textual Inversion-based image generation outperforms other methods in terms of diversity and alignment with real data.
The proposed pipeline achieves state-of-the-art results on standard long-tailed benchmarks (ImageNet-LT, Places-LT, iNaturalist2018) when trained from scratch.
The method demonstrates significant improvements in few-shot scenarios, particularly for classes with fewer than 20 samples. |
Generating synthetic images through diffusion models demands extensive computational resources.
Despite the efficiency of the approach, generating images with features on par with real samples remains a challenge. |
long-tailed recognition, text-to-image synthesis, textual inversion, synthetic data generation, data imbalance |
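The Balanced Softmax loss mentioned in the Fill-Up methodology above can be sketched as follows. This is a generic NumPy version of the standard formulation (cross-entropy on logits shifted by the log of the per-class training counts), not code from the paper. The class counts and logits are made-up illustration values.

```python
import numpy as np

def balanced_softmax_loss(logits, labels, class_counts):
    """Cross-entropy on logits shifted by log(class frequency)."""
    adjusted = logits + np.log(class_counts)               # broadcast over the batch
    adjusted = adjusted - adjusted.max(axis=1, keepdims=True)  # numerical stability
    log_probs = adjusted - np.log(np.exp(adjusted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Hypothetical two-class problem: a head class with 1000 images, a tail class with 10.
logits = np.zeros((1, 2))                 # the model is undecided
labels = np.array([1])                    # the sample belongs to the tail class
tail_loss = balanced_softmax_loss(logits, labels, np.array([1000.0, 10.0]))
plain_loss = balanced_softmax_loss(logits, labels, np.array([1.0, 1.0]))
```

With equal counts the loss reduces to ordinary cross-entropy (`log 2` here), while the imbalanced counts give the tail sample a much larger loss (`log 101`), so the model must produce a clearly higher tail logit to compensate for the class prior.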
2306.07180
Report |
Diffusion Models for Black-Box Optimization |
Siddarth Krishnamoorthy, Satvik Mehul Mashkaria, Aditya Grover |
The goal of offline black-box optimization (BBO) is to optimize an expensive
black-box function using a fixed dataset of function evaluations. Prior works
consider forward approaches that learn surrogates to the black-box function and
inverse approaches that directly map function values to corresponding points in
the input domain of the black-box function. These approaches are limited by the
quality of the offline dataset and the difficulty in learning one-to-many
mappings in high dimensions, respectively. We propose Denoising Diffusion
Optimization Models (DDOM), a new inverse approach for offline black-box
optimization based on diffusion models. Given an offline dataset, DDOM learns a
conditional generative model over the domain of the black-box function
conditioned on the function values. We investigate several design choices in
DDOM, such as re-weighting the dataset to focus on high function values and the
use of classifier-free guidance at test-time to enable generalization to
function values that can even exceed the dataset maxima. Empirically, we
conduct experiments on the Design-Bench benchmark and show that DDOM achieves
results competitive with state-of-the-art baselines. |
Presents Denoising Diffusion Optimization Models (DDOM), a novel inverse method for offline black-box optimization that leverages conditional diffusion models to learn a mapping from function values to input points. |
Addresses limitations of existing forward (surrogate-based) and inverse approaches for offline black-box optimization, particularly in handling limited dataset coverage and challenges in learning one-to-many mappings in high-dimensional spaces. |
Trains a conditional diffusion model on an offline dataset of input-value pairs, employing loss reweighting to prioritize high function values and classifier-free guidance during sampling to enhance conditioning and enable generalization beyond dataset maxima. |
DDOM successfully learns the inverse mapping, generating points with function values closely matching the conditioned values.
Outperforms existing forward and inverse baselines on the Design-Bench suite, achieving the best average rank and demonstrating robustness to initialization compared to alternatives.
Effectiveness of loss reweighting and classifier-free guidance is validated through ablation studies, highlighting their contribution to DDOM's performance. |
Sampling speed in diffusion models can be a limitation for some real-time applications.
Potential for misuse in optimizing for undesirable outcomes necessitates careful consideration during real-world deployment. |
black-box optimization, diffusion models, offline optimization, generative models, conditional generation |
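The classifier-free guidance step mentioned in the DDOM methodology above can be sketched in its commonly used form, where the conditional and unconditional noise predictions are linearly extrapolated. This is an assumption about the general technique. The exact parametrization and guidance weight used by DDOM may differ, and the arrays below are toy stand-ins for network outputs.

```python
import numpy as np

def guided_noise(eps_cond, eps_uncond, guidance_weight):
    """Classifier-free guidance: push the prediction away from the unconditional one."""
    return (1.0 + guidance_weight) * eps_cond - guidance_weight * eps_uncond

# Toy stand-ins for the denoiser outputs at one sampling step.
eps_cond = np.array([0.5, -1.0])     # prediction conditioned on a target function value
eps_uncond = np.array([0.1, -0.2])   # prediction with the conditioning dropped

no_guidance = guided_noise(eps_cond, eps_uncond, 0.0)   # equals eps_cond
strong = guided_noise(eps_cond, eps_uncond, 1.0)        # equals 2 * eps_cond - eps_uncond
```

A weight of zero recovers plain conditional sampling. Larger weights emphasize the conditioning signal, which is the mechanism the report credits with reaching function values beyond the dataset maxima.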
2306.06991
Report |
Fast Diffusion Model |
Zike Wu, Pan Zhou, Kenji Kawaguchi, Hanwang Zhang |
Diffusion models (DMs) have been adopted across diverse fields owing to their
remarkable ability to capture intricate data distributions. In this paper,
we propose a Fast Diffusion Model (FDM) to significantly speed up DMs from a
stochastic optimization perspective for both faster training and sampling. We
first find that the diffusion process of DMs accords with the stochastic
optimization process of stochastic gradient descent (SGD) on a stochastic
time-variant problem. Then, inspired by momentum SGD that uses both gradient
and an extra momentum to achieve faster and more stable convergence than SGD,
we integrate momentum into the diffusion process of DMs. This comes with a
unique challenge of deriving the noise perturbation kernel from the
momentum-based diffusion process. To this end, we frame the process as a Damped
Oscillation system whose critically damped state -- the kernel solution --
avoids oscillation and yields a faster convergence speed of the diffusion
process. Empirical results show that our FDM can be applied to several popular
DM frameworks, e.g., VP, VE, and EDM, and reduces their training cost by about
50% with comparable image synthesis performance on CIFAR-10, FFHQ, and AFHQv2
datasets. Moreover, FDM decreases their sampling steps by about 3x to achieve
similar performance under the same samplers. The code is available at
https://github.com/sail-sg/FDM. |
The paper proposes Fast Diffusion Model (FDM) which integrates momentum into the diffusion process of Diffusion Models (DMs) to accelerate both training and sampling. |
DMs, while powerful in generative tasks, suffer from slow and costly training and sampling, hindering broader applications. FDM tackles this limitation by fundamentally improving the diffusion process. |
The authors establish a connection between DMs' diffusion process and stochastic gradient descent (SGD). Leveraging the faster convergence of momentum SGD, they incorporate momentum into the diffusion process and derive a tractable perturbation kernel for efficient training and sampling. |
FDM reduces training cost by about 50% compared to popular DM frameworks (VP, VE, EDM) while maintaining comparable image synthesis performance.
FDM achieves similar image generation quality with about 3x fewer sampling steps than the baselines under the same samplers.
The momentum-based diffusion process shows stable and faster convergence towards the target distribution both theoretically and empirically. |
Verification is limited to three popular DMs (VP, VE, EDM). Further validation on a wider range of DMs is needed.
Evaluation is conducted on a limited set of datasets. Testing on diverse tasks is necessary to fully understand FDM's potential. |
diffusion models, generative models, momentum sgd, fast sampling, efficient training |
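The critically damped state referred to in the FDM abstract above is a textbook property of the damped oscillator x'' + 2*zeta*omega*x' + omega^2*x = 0. The sketch below only illustrates that physical concept with closed-form solutions (started at rest from x0). It does not reproduce FDM's perturbation kernel, and the values of `omega` and `zeta` are arbitrary choices.

```python
import numpy as np

def critically_damped(t, x0, omega):
    """Solution for zeta = 1 with x(0) = x0 and x'(0) = 0."""
    return x0 * (1.0 + omega * t) * np.exp(-omega * t)

def underdamped(t, x0, omega, zeta):
    """Solution for 0 < zeta < 1 with x(0) = x0 and x'(0) = 0."""
    omega_d = omega * np.sqrt(1.0 - zeta ** 2)      # damped angular frequency
    envelope = np.exp(-zeta * omega * t)
    return x0 * envelope * (np.cos(omega_d * t)
                            + (zeta * omega / omega_d) * np.sin(omega_d * t))

t = np.linspace(0.0, 10.0, 1001)
crit = critically_damped(t, x0=1.0, omega=1.0)
under = underdamped(t, x0=1.0, omega=1.0, zeta=0.2)
```

The critically damped trajectory decays to zero without ever changing sign, whereas the underdamped one overshoots and oscillates around zero. This is the sense in which critical damping "avoids oscillation".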
2306.06899
Report |
Augmenting Zero-Shot Detection Training with Image Labels |
Katharina Kornmeier, Ulla Scheler, Pascal Herrmann |
Zero-shot detection (ZSD), i.e., detection on classes not seen during
training, is essential for real world detection use-cases, but remains a
difficult task. Recent research attempts ZSD with detection models that output
embeddings instead of direct class labels. To this end, the output of the
detection model must be aligned to a learned embedding space such as CLIP.
However, this alignment is hindered by detection data sets which are expensive
to produce compared to image classification annotations, and the resulting lack
of category diversity in the training data. We address this challenge by
leveraging the CLIP embedding space in combination with image labels from
ImageNet. Our results show that image labels are able to better align the
detector output to the embedding space and thus have a high potential for ZSD.
Compared to training only on detection data, adding image label data yields a
significant gain of 3.3 mAP on the unseen classes for the 65/15 split on COCO,
i.e., we more than double the gain of related work. |
This paper proposes a method to improve zero-shot detection (ZSD) performance by augmenting the training of object detectors with image labels from a large-scale image classification dataset (ImageNet). |
ZSD is crucial for real-world applications as it allows detectors to identify objects not present in the training data. Existing ZSD methods suffer from limited category diversity due to the expensive nature of object detection annotations. |
The authors modify a single-stage detector (YOLOX) to predict embedding vectors instead of class probabilities. They align the model to the CLIP embedding space using both object detection data (COCO) and image classification data (ImageNet). A key aspect is the filtering and selection of appropriate bounding box predictions from ImageNet data for backpropagation. |
Adding ImageNet image labels significantly improves ZSD performance, more than doubling the gain of previous work using image embeddings.
This approach outperforms methods aligning to both text and image embeddings from COCO, highlighting the benefit of diverse category information from ImageNet.
The authors identify a new failure mode related to the underlying structure of label embeddings in embedding spaces. |
The study uses a smaller YOLOX model and fewer data augmentations due to resource constraints, potentially limiting performance.
Future work could explore alternative training losses or embedding spaces for improved alignment. |
zero-shot detection, object detection, clip embedding space, image labels, data augmentation |
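The idea of a detector that outputs embeddings instead of class probabilities, as described in the zero-shot detection report above, can be sketched as nearest-neighbor matching in the embedding space. This is a minimal illustration under stated assumptions: the random vectors stand in for CLIP class embeddings and detector outputs, and `classify_by_embedding` is a hypothetical helper, not part of YOLOX or CLIP.

```python
import numpy as np

def classify_by_embedding(box_embeddings, class_embeddings):
    """Assign each predicted box to the class with the highest cosine similarity."""
    b = box_embeddings / np.linalg.norm(box_embeddings, axis=1, keepdims=True)
    c = class_embeddings / np.linalg.norm(class_embeddings, axis=1, keepdims=True)
    sims = b @ c.T                         # (num_boxes, num_classes)
    return sims.argmax(axis=1), sims.max(axis=1)

rng = np.random.default_rng(0)
# Stand-ins for class embeddings, e.g. one vector per class name (seen or unseen).
class_embeddings = rng.normal(size=(4, 64))
# Stand-ins for detector outputs: noisy copies of the embeddings of classes 2 and 0.
box_embeddings = class_embeddings[[2, 0]] + 0.05 * rng.normal(size=(2, 64))

pred_classes, pred_scores = classify_by_embedding(box_embeddings, class_embeddings)
```

Because classification is a lookup against class embeddings, a class unseen during detector training can be added at test time by appending its embedding as a new row, provided the detector output is well aligned with the embedding space.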
2306.06684
Report |
Happy People -- Image Synthesis as Black-Box Optimization Problem in the Discrete Latent Space of Deep Generative Models |
Steffen Jung, Jan Christian Schwedhelm, Claudia Schillings, Margret Keuper |
In recent years, optimization in the learned latent space of deep generative
models has been successfully applied to black-box optimization problems such as
drug design, image generation or neural architecture search. Existing models
thereby leverage the ability of neural models to learn the data distribution
from a limited amount of samples such that new samples from the distribution
can be drawn. In this work, we propose a novel image generative approach that
optimizes the generated sample with respect to a continuously quantifiable
property. While we anticipate absolutely no practically meaningful application
for the proposed framework, it is theoretically principled and allows one to
quickly propose samples at the mere boundary of the training data distribution.
Specifically, we propose to use tree-based ensemble models as mathematical
programs over the discrete latent space of vector quantized VAEs, which can be
globally solved. Subsequent weighted retraining on these queries allows us to
induce a distribution shift. Lacking a practically relevant problem, we
consider a visually appealing application: the generation of happily smiling
faces (where the training distribution only contains less happy people) - and
show the principled behavior of our approach in terms of improved FID and
higher smile degree over baseline approaches. |
This paper presents a novel method for black-box optimization in the discrete latent space of VQ-VAEs, aiming to generate high-quality images with desired properties by optimizing a continuously quantifiable objective. |
This approach addresses the limitations of existing latent space optimization (LSO) methods, particularly in situations where the global optimum lies far from the training data distribution, by enabling efficient optimization in discrete latent spaces and inducing distribution shifts via weighted retraining. |
The methodology involves training a tree-based ensemble model as a surrogate for the black-box objective function in the discrete latent space. This model's predictions are then encoded as a mixed-integer optimization problem, solved globally to determine the next query point for image generation. The VQ-VAE is iteratively fine-tuned on the weighted data acquired during optimization, inducing a distribution shift towards the desired properties. |
The proposed method significantly outperforms continuous LSO with VAEs in a smiling face generation task, achieving higher objective function values.
Weighted retraining is shown to effectively induce a distribution shift, leading to improved results compared to optimization without retraining.
The use of a discrete latent space through VQ-VAE allows for the generation of higher-quality images compared to standard VAEs. |
The method is computationally more expensive than continuous LSO approaches due to the need for global optimization in the discrete latent space.
The quality of generated images, although better than those from VAEs, can still be further improved, potentially by exploring more sophisticated generative models or optimization strategies.
Future work could explore the application of this method to other domains beyond image synthesis, such as drug discovery or neural architecture search. |
black-box optimization, latent space optimization, vq-vae, image synthesis, distribution shift |
2306.06638
Report |
Face0: Instantaneously Conditioning a Text-to-Image Model on a Face |
Dani Valevski, Danny Wasserman, Yossi Matias, Yaniv Leviathan |
We present Face0, a novel way to instantaneously condition a text-to-image
generation model on a face at sampling time, without any optimization procedures
such as fine-tuning or inversions. We augment a dataset of annotated images
with embeddings of the included faces and train an image generation model on
the augmented dataset. Once trained, our system is practically identical at
inference time to the underlying base model, and is therefore able to generate
images, given a user-supplied face image and a prompt, in just a couple of
seconds. Our method achieves pleasing results, is remarkably simple, extremely
fast, and equips the underlying model with new capabilities, like controlling
the generated images both via text and via direct manipulation of the input face
embeddings. In addition, when using a fixed random vector instead of a face
embedding from a user-supplied image, our method essentially solves the problem
of consistent character generation across images. Finally, while requiring
further research, we hope that our method, which decouples the model's textual
biases from its biases on faces, might be a step towards some mitigation of
biases in future text-to-image models. |
Presents Face0, a novel method for instantaneously conditioning a text-to-image model on a face without fine-tuning or inversions. |
Addresses the challenge of generating images depicting a person from a user-supplied image instantly and efficiently. |
Augments a dataset with face embeddings, trains a projection module to map embeddings to CLIP space, and jointly fine-tunes a diffusion model (Stable Diffusion) on text and projected embeddings. |
Enables instant generation of images resembling a person from a single photo.
Allows control over generated faces through text prompts and direct manipulation of face embeddings.
Facilitates consistent character generation across multiple images using fixed random embedding vectors. |
May not perfectly preserve provided identities, sometimes generating "look-alike" characters.
Relies on a face embedding mechanism that primarily fixes pose and expression, limiting control over these aspects. |
text-to-image synthesis, face embedding, diffusion models, stable diffusion, personalized image generation |
2306.06577
Report |
Semantically-aware Mask CycleGAN for Translating Artistic Portraits to Photo-realistic Visualizations |
Zhuohao Yin |
Image-to-image translation (I2I) is defined as a computer vision task where
the aim is to transfer images in a source domain to a target domain with
minimal loss or alteration of the content representations. Major progress has
been made since I2I was proposed with the invention of a variety of
revolutionary generative models. Among them, GAN-based models perform
exceptionally well as they are mostly tailor-made for specific domains or
tasks. However, few works have proposed a tailor-made method for the artistic
domain. In this project, I propose the Semantic-aware Mask CycleGAN
(SMCycleGAN) architecture which can translate artistic portraits to
photo-realistic visualizations. This model can generate realistic human
portraits by feeding the discriminators semantically masked fake samples, thus
forcing them to make discriminative decisions with partial information so
that the generators can be optimized to synthesize more realistic human
portraits instead of increasing the similarity of other irrelevant components,
such as the background. Experiments have shown that SMCycleGAN generates
images with significantly increased realism and minimal loss of content
representations. |
This paper proposes Semantic-aware Mask CycleGAN (SMCycleGAN) to translate artistic portraits to photo-realistic visualizations. |
This work aims to restore the realistic appearances of subjects in art portraits, bridging the gap between painted and photorealistic representations. |
SMCycleGAN utilizes semantic segmentation to mask generated images, focusing the discriminator on human subjects and improving realism. |
SMCycleGAN generates portraits with high realism, adjusting skin color and smoothing textures.
It reduces artifacts and maintains facial details better than baseline models like CycleGAN and Art2Real.
Quantitative evaluation using Fréchet Inception Distance shows SMCycleGAN generates images closest to realistic portraits. |
The model struggles with diverse ethnicities due to training data imbalance.
Highly abstract or artifact-heavy artworks pose challenges for realistic image generation. |
image-to-image translation, generative adversarial networks, cyclegan, semantic segmentation, art portrait |
2306.06513
Report |
Learning Image-Adaptive Codebooks for Class-Agnostic Image Restoration |
Kechun Liu, Yitong Jiang, Inchang Choi, Jinwei Gu |
Recent work on discrete generative priors, in the form of codebooks, has
shown exciting performance for image reconstruction and restoration, as the
discrete prior space spanned by the codebooks increases the robustness against
diverse image degradations. Nevertheless, these methods require separate
training of codebooks for different image categories, which limits their use to
specific image categories only (e.g. face, architecture, etc.), and fail to
handle arbitrary natural images. In this paper, we propose AdaCode for learning
image-adaptive codebooks for class-agnostic image restoration. Instead of
learning a single codebook for each image category, we learn a set of basis
codebooks. For a given input image, AdaCode learns a weight map with which we
compute a weighted combination of these basis codebooks for adaptive image
restoration. Intuitively, AdaCode is a more flexible and expressive discrete
generative prior than previous work. Experimental results demonstrate that
AdaCode achieves state-of-the-art performance on image reconstruction and
restoration tasks, including image super-resolution and inpainting. |
This paper presents AdaCode, a novel image-adaptive codebook learning method for class-agnostic image restoration. |
Existing methods using discrete generative priors (codebooks) often require separate training for different image categories, limiting their applicability to arbitrary natural images. |
AdaCode learns a set of basis codebooks, each trained on a specific image category. For a given image, it then learns a weight map to combine these basis codebooks into an image-adaptive representation for improved restoration. |
AdaCode achieves state-of-the-art performance on image reconstruction, outperforming methods using single general codebooks or merged codebooks.
AdaCode demonstrates superior performance in super-resolution tasks compared to existing methods, showing better detail preservation and fewer artifacts.
AdaCode exhibits state-of-the-art results in image inpainting, effectively recovering missing regions with high fidelity across various scenes. |
The optimal number of basis codebooks and code entries per codebook requires further investigation.
The current method could benefit from incorporating high-level semantic information for improved restoration. |
image restoration, generative priors, codebook learning, class-agnostic, image super-resolution, image inpainting |
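The weighted combination of basis codebooks described in the AdaCode report above can be sketched as follows. This is one plausible reading rather than the authors' architecture: features are quantized against each basis codebook by nearest-neighbor lookup, and a per-position weight map (here a softmax over hypothetical logits) blends the results. All shapes, names, and random values are assumptions for illustration.

```python
import numpy as np

def quantize(features, codebook):
    """Replace each feature vector by its nearest code. features: (H, W, D), codebook: (N, D)."""
    dists = ((features[..., None, :] - codebook) ** 2).sum(axis=-1)   # (H, W, N)
    return codebook[dists.argmin(axis=-1)]                            # (H, W, D)

def adaptive_quantize(features, codebooks, weight_logits):
    """Blend per-codebook quantizations with a per-position weight map of shape (H, W, K)."""
    weights = np.exp(weight_logits - weight_logits.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)           # softmax over K
    quantized = np.stack([quantize(features, cb) for cb in codebooks], axis=-1)  # (H, W, D, K)
    return (quantized * weights[..., None, :]).sum(axis=-1)           # (H, W, D)

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 4, 8))                       # toy encoder output
codebooks = [rng.normal(size=(16, 8)) for _ in range(3)]    # three basis codebooks

uniform = adaptive_quantize(features, codebooks, np.zeros((4, 4, 3)))
one_hot_logits = np.zeros((4, 4, 3))
one_hot_logits[..., 0] = 50.0                                # almost all weight on codebook 0
single = adaptive_quantize(features, codebooks, one_hot_logits)
```

With uniform weights the output is the average of the three quantizations. With weights concentrated on one codebook the result collapses to ordinary single-codebook quantization, so the weight map interpolates between category-specific priors per image region.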
2306.06189
Report |
FasterViT: Fast Vision Transformers with Hierarchical Attention |
Ali Hatamizadeh, Greg Heinrich, Hongxu Yin, Andrew Tao, Jose M. Alvarez, Jan Kautz, Pavlo Molchanov |
We design a new family of hybrid CNN-ViT neural networks, named FasterViT,
with a focus on high image throughput for computer vision (CV) applications.
FasterViT combines the benefits of fast local representation learning in CNNs
and global modeling properties in ViT. Our newly introduced Hierarchical
Attention (HAT) approach decomposes global self-attention with quadratic
complexity into a multi-level attention with reduced computational costs. We
benefit from efficient window-based self-attention. Each window has access to
dedicated carrier tokens that participate in local and global representation
learning. At a high level, global self-attentions enable the efficient
cross-window communication at lower costs. FasterViT achieves a SOTA
Pareto-front in terms of accuracy and image throughput. We have extensively
validated its effectiveness on various CV tasks including classification,
object detection and segmentation. We also show that HAT can be used as a
plug-and-play module for existing networks and enhance them. We further
demonstrate significantly faster and more accurate performance than competitive
counterparts for images with high resolution. Code is available at
https://github.com/NVlabs/FasterViT. |
This paper proposes FasterViT, a novel hybrid CNN-ViT architecture designed for high image throughput in computer vision tasks. It leverages a new Hierarchical Attention (HAT) approach to reduce the computational cost of global self-attention while maintaining accuracy. |
FasterViT addresses the limitations of ViTs in terms of computational complexity and the need for efficient global modeling, particularly for high-resolution images, making it suitable for real-world applications requiring fast image processing. |
The methodology involves combining CNNs for early-stage feature extraction with HAT-based transformer blocks in later stages. HAT decomposes global self-attention into a multi-level approach using learnable carrier tokens to summarize local windows and facilitate efficient cross-window communication. |
FasterViT achieves state-of-the-art performance in terms of image throughput and accuracy trade-off on ImageNet-1k classification, outperforming both convolutional and transformer-based counterparts.
It demonstrates competitive performance on dense prediction tasks like object detection, instance segmentation (MS COCO), and semantic segmentation (ADE20K).
The effectiveness of HAT as a plug-and-play module is demonstrated by its ability to enhance the performance of existing architectures like Swin-T on various tasks with minimal overhead. |
The paper acknowledges the potential for further exploration of joint optimization with acceleration methods like compression.
Further investigation into the scalability of HAT to even higher image resolutions and its impact on performance is left for future work. |
vision transformer, cnn, hierarchical attention, image throughput, computer vision |
2306.06093
Report |
HyP-NeRF: Learning Improved NeRF Priors using a HyperNetwork |
Bipasha Sen, Gaurav Singh, Aditya Agarwal, Rohith Agaram, K Madhava Krishna, Srinath Sridhar |
Neural Radiance Fields (NeRF) have become an increasingly popular
representation to capture high-quality appearance and shape of scenes and
objects. However, learning generalizable NeRF priors over categories of scenes
or objects has been challenging due to the high dimensionality of network
weight space. To address the limitations of existing work on generalization
and multi-view consistency, and to improve quality, we propose HyP-NeRF, a latent
conditioning method for learning generalizable category-level NeRF priors using
hypernetworks. Rather than using hypernetworks to estimate only the weights of
a NeRF, we estimate both the weights and the multi-resolution hash encodings
resulting in significant quality gains. To improve quality even further, we
incorporate a denoise and finetune strategy that denoises images rendered from
NeRFs estimated by the hypernetwork and finetunes it while retaining multiview
consistency. These improvements enable us to use HyP-NeRF as a generalizable
prior for multiple downstream tasks including NeRF reconstruction from
single-view or cluttered scenes and text-to-NeRF. We provide qualitative
comparisons and evaluate HyP-NeRF on three tasks: generalization, compression,
and retrieval, demonstrating state-of-the-art results. |
HyP-NeRF is a latent conditioning method that learns generalizable, category-level NeRF priors using hypernetworks. It generates both instance-specific multi-resolution hash encodings and neural network weights, significantly improving quality. It also employs a denoising and finetuning strategy for further improvement, enabling applications like single-view NeRF reconstruction, text-to-NeRF, and reconstruction from cluttered scenes. |
Existing methods struggle to learn generalizable NeRF priors due to the high dimensionality of network weight space, often resulting in lower quality or inconsistent representations. This work addresses these limitations by combining the advantages of instance-specific representations with the generalization capabilities of hypernetworks. |
HyP-NeRF utilizes a two-step process: 1) A hypernetwork is trained to predict both the multi-resolution hash encodings and weights of a NeRF model, conditioned on an instance code. 2) A denoising network improves the rendered views, and the NeRF is finetuned using these enhanced images to achieve higher quality while retaining multiview consistency. |
HyP-NeRF significantly outperforms baselines like PixelNeRF in single-view novel NeRF generation, achieving higher PSNR, SSIM, and lower LPIPS and FID scores.
It demonstrates effective compression by learning from thousands of NeRF instances, achieving 60x compression gain compared to instance-specific methods with minimal quality degradation.
The learned prior enables retrieval of novel NeRFs from various modalities like single-view images, segmented images, and text prompts, showcasing its generalizability. |
Test-time optimization requires known poses, limiting its application in unposed image scenarios.
The learned prior is non-standard, making unconditional generation challenging and requiring mapping from known distributions. |
neural radiance fields, nerf, hypernetworks, generative models, 3d reconstruction |
2306.06092
Report |
Realistic Saliency Guided Image Enhancement |
S. Mahdi H. Miangoleh, Zoya Bylinskii, Eric Kee, Eli Shechtman, Yağız Aksoy |
Common editing operations performed by professional photographers include the
cleanup operations: de-emphasizing distracting elements and enhancing subjects.
These edits are challenging, requiring a delicate balance between manipulating
the viewer's attention and maintaining photo realism. While recent approaches
can boast successful examples of attention attenuation or amplification, most
of them also suffer from frequent unrealistic edits. We propose a realism loss
for saliency-guided image enhancement to maintain high realism across varying
image types, while attenuating distractors and amplifying objects of interest.
Evaluations with professional photographers confirm that we achieve the dual
objective of realism and effectiveness, and outperform the recent approaches on
their own datasets, while requiring a smaller memory footprint and runtime. We
thus offer a viable solution for automating image enhancement and photo cleanup
operations. |
This paper introduces a novel saliency-guided image enhancement method that leverages a realism loss to maintain photorealism while attenuating distracting elements or enhancing subjects in an image. |
This work addresses the limitations of existing saliency-guided image editing techniques, which often produce unrealistic edits. It offers a viable solution for automating image enhancement and photo cleanup operations while preserving realism. |
The method utilizes a realism network trained on a dataset of realistic and unrealistic edits. This network learns to estimate the realism of local image edits. This realism score is then incorporated into a saliency-guided image editing pipeline that optimizes for both saliency and realism. |
The method outperforms state-of-the-art approaches in terms of realism and effectiveness, as confirmed by evaluations with professional photographers.
The realism network effectively learns a continuous measure of realism for various editing operations despite being trained on binary data (realistic vs. unrealistic).
The approach generalizes well to multiple image regions and masks, allowing for iterative editing. |
The reliance on global edits within a mask can lead to artifacts at mask boundaries, especially with imperfect masks.
Future work could explore incorporating pixel-wise optimization to address boundary artifacts and further enhance realism. |
image enhancement, saliency detection, realism estimation, photo editing, deep learning |
2306.05720
Report |
Beyond Surface Statistics: Scene Representations in a Latent Diffusion Model |
Yida Chen, Fernanda Viégas, Martin Wattenberg |
Latent diffusion models (LDMs) exhibit an impressive ability to produce
realistic images, yet the inner workings of these models remain mysterious.
Even when trained purely on images without explicit depth information, they
typically output coherent pictures of 3D scenes. In this work, we investigate a
basic interpretability question: does an LDM create and use an internal
representation of simple scene geometry? Using linear probes, we find evidence
that the internal activations of the LDM encode linear representations of both
3D depth data and a salient-object / background distinction. These
representations appear surprisingly early in the denoising process, well
before a human can easily make sense of the noisy images. Intervention
experiments further indicate these representations play a causal role in image
synthesis, and may be used for simple high-level editing of an LDM's output.
Project page: https://yc015.github.io/scene-representation-diffusion-model/ |
This paper investigates whether Latent Diffusion Models (LDMs) develop internal representations of 3D scene geometry despite being trained only on 2D images. |
Understanding how LDMs generate realistic images, particularly the emergence of 3D understanding, is crucial for interpretability and potential applications in image editing. |
The authors use linear probes to analyze the internal activations of a pre-trained LDM (Stable Diffusion) and conduct intervention experiments to study the causal role of these representations. |
LDMs encode linear representations of both continuous depth maps and a salient object/background distinction.
These representations appear early in the denoising process, well before a human can perceive coherent structures in the noisy images.
Intervention experiments demonstrate a causal relationship between these internal representations and the final output image, enabling simple high-level editing. |
The study primarily focuses on a single LDM (Stable Diffusion) and a limited set of scene attributes.
Future work could explore the representation of other scene attributes like lighting, texture, and semantic aspects. |
latent diffusion models, interpretability, 3d scene understanding, linear probing, causal intervention |
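The linear-probing step in the record above (2306.05720) can be illustrated with a minimal sketch. It is not the authors' code: the synthetic `features` stand in for LDM activations, `targets` stand in for per-pixel depth values, and the function names and the gradient-descent fitting are all assumptions chosen for illustration. It only shows the idea that a probe is a linear map fitted on frozen features.

```python
import random


def fit_linear_probe(features, targets, lr=0.1, epochs=2000):
    """Fit targets ~ w.x + b by full-batch gradient descent on mean squared error."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, targets):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for i in range(dim):
                grad_w[i] += 2.0 * err * x[i] / n
            grad_b += 2.0 * err / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Apply the linear probe to one feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# Synthetic stand-in data (assumption): 3-dimensional "activations" whose
# "depth" target is an exactly linear function of the features.
rng = random.Random(0)
true_w, true_b = [0.5, -1.0, 2.0], 0.3
features = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(200)]
targets = [predict(true_w, true_b, x) for x in features]

w, b = fit_linear_probe(features, targets)
```

If such a probe predicts the target well on held-out data, the property is linearly decodable from the activations. That is the kind of evidence the record describes, though it does not by itself establish a causal role.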
2306.05544
Report |
BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping |
Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Lingjie Liu, Josh Susskind |
Diffusion models have demonstrated excellent potential for generating diverse
images. However, their performance often suffers from slow generation due to
iterative denoising. Knowledge distillation has been recently proposed as a
remedy that can reduce the number of inference steps to one or a few without
significant quality degradation. However, existing distillation methods either
require significant amounts of offline computation for generating synthetic
training data from the teacher model or need to perform expensive online
learning with the help of real data. In this work, we present a novel technique
called BOOT, that overcomes these limitations with an efficient data-free
distillation algorithm. The core idea is to learn a time-conditioned model that
predicts the output of a pre-trained diffusion model teacher given any time
step. Such a model can be efficiently trained based on bootstrapping from two
consecutive sampled steps. Furthermore, our method can be easily adapted to
large-scale text-to-image diffusion models, which are challenging for
conventional methods given the fact that the training sets are often large and
difficult to access. We demonstrate the effectiveness of our approach on
several benchmark datasets in the DDIM setting, achieving comparable generation
quality while being orders of magnitude faster than the diffusion teacher. The
text-to-image results show that the proposed approach is able to handle highly
complex distributions, shedding light on more efficient generative modeling. |
This paper proposes BOOT, a data-free knowledge distillation method to boost the inference speed of diffusion models by distilling them into single-step models. |
Diffusion models excel in generating diverse images but suffer from slow generation speed due to iterative denoising. Existing distillation methods either demand extensive offline computation or require real data for online learning, making them impractical for large models. |
BOOT learns a time-conditioned model to predict the output of a pre-trained diffusion model for any given time-step. It utilizes a novel signal-ODE derived from the original probability-flow ODE for efficient training based on bootstrapping from consecutive sampled steps. This eliminates the reliance on real data during distillation. |
BOOT achieves comparable image generation quality to multi-step diffusion models (around 10 steps) with a 10x speedup on standard benchmarks (FFHQ, LSUN, ImageNet).
It successfully distills large-scale text-to-image models like Stable Diffusion and DeepFloyd IF, maintaining good generation quality with significant speed improvements.
The method enables controllable generation by interpolating in the learned latent space or modifying text prompts while keeping the noise input fixed. |
BOOT's sampling quality depends on the pre-trained teacher model and might be lower than data-dependent distillation methods.
The current implementation uses a similar architecture for the student and teacher models. Exploring different architectures could further improve performance. |
knowledge distillation, diffusion models, generative models, text-to-image generation, fast inference |
2306.05493
Report |
Multi-Modal Classifiers for Open-Vocabulary Object Detection |
Prannay Kaul, Weidi Xie, Andrew Zisserman |
The goal of this paper is open-vocabulary object detection (OVOD)
– building a model that can detect objects beyond the set of
categories seen at training, thus enabling the user to specify categories of
interest at inference without the need for model retraining. We adopt a
standard two-stage object detector architecture, and explore three ways for
specifying novel categories: via language descriptions, via image exemplars, or
via a combination of the two. We make three contributions: first, we prompt a
large language model (LLM) to generate informative language descriptions for
object classes, and construct powerful text-based classifiers; second, we
employ a visual aggregator on image exemplars that can ingest any number of
images as input, forming vision-based classifiers; and third, we provide a
simple method to fuse information from language descriptions and image
exemplars, yielding a multi-modal classifier. When evaluating on the
challenging LVIS open-vocabulary benchmark we demonstrate that: (i) our
text-based classifiers outperform all previous OVOD works; (ii) our
vision-based classifiers perform as well as text-based classifiers in prior
work; (iii) multi-modal classifiers perform better than either modality
alone; and finally, (iv) our text-based and multi-modal classifiers yield
better performance than a fully-supervised detector. |
This paper presents a multi-modal open-vocabulary object detector that uses language descriptions, image exemplars, or a combination of both to specify novel categories. |
This is important because it enables users to specify categories of interest at inference time without retraining the model, overcoming limitations of previous methods that rely solely on class names. |
The authors propose three methods: 1) prompting an LLM for rich category descriptions to generate text-based classifiers, 2) using a visual aggregator on image exemplars to form vision-based classifiers, and 3) fusing both language and image information for multi-modal classifiers. |
Text-based classifiers using LLM descriptions outperform previous OVOD methods.
Vision-based classifiers achieve comparable performance to prior text-based methods.
Multi-modal classifiers outperform single-modality methods and even a fully-supervised detector on LVIS. |
Vision-based classifiers still lag behind text-based classifiers, requiring further research.
Exploration of more sophisticated multi-modal fusion techniques could further enhance performance. |
open-vocabulary object detection, multi-modal learning, large language models, vision-language models, zero-shot learning |
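The fusion of text-based and vision-based classifiers described in the record above (2306.05493) can be sketched as follows. The record only says the fusion method is "simple", so the averaging of L2-normalized embeddings, the function names, and the toy 3-dimensional vectors are all assumptions made for illustration, not the paper's implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def fuse_classifiers(text_emb, image_emb):
    """Average two L2-normalized class embeddings, then renormalize.

    Hypothetical fusion rule: it fails if the two normalized embeddings
    cancel out exactly, which this sketch does not handle.
    """
    t = l2_normalize(text_emb)
    i = l2_normalize(image_emb)
    return l2_normalize([(a + b) / 2.0 for a, b in zip(t, i)])


def classify(region_feature, class_weights):
    """Return the class whose weight vector has the highest cosine similarity."""
    r = l2_normalize(region_feature)
    scores = {
        name: sum(a * b for a, b in zip(r, l2_normalize(w)))
        for name, w in class_weights.items()
    }
    return max(scores, key=scores.get)


# Toy embeddings (assumption): one text and one image embedding per class.
class_weights = {
    "cat": fuse_classifiers([1.0, 0.0, 0.0], [0.8, 0.2, 0.0]),
    "dog": fuse_classifiers([0.0, 1.0, 0.0], [0.1, 0.9, 0.0]),
}
predicted = classify([0.9, 0.1, 0.0], class_weights)
```

Because a classifier here is just a weight vector per class, new categories can be added at inference time by computing a new fused vector, with no retraining of the detector.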
2306.05427
Report |
Grounded Text-to-Image Synthesis with Attention Refocusing |
Quynh Phung, Songwei Ge, Jia-Bin Huang |
Driven by the scalable diffusion models trained on large-scale datasets,
text-to-image synthesis methods have shown compelling results. However, these
models still fail to precisely follow the text prompt involving multiple
objects, attributes, or spatial compositions. In this paper, we reveal the
potential causes in the diffusion model's cross-attention and self-attention
layers. We propose two novel losses to refocus attention maps according to a
given spatial layout during sampling. Creating the layouts manually requires
additional effort and can be tedious. Therefore, we explore using large
language models (LLM) to produce these layouts for our method. We conduct
extensive experiments on the DrawBench, HRS, and TIFA benchmarks to evaluate
our proposed method. We show that our proposed attention refocusing effectively
improves the controllability of existing approaches. |
This paper introduces an attention-refocusing approach to enhance the controllability of layout-conditioned text-to-image synthesis using diffusion models by regulating both cross- and self-attention layers during sampling, guided by explicit layout representations. |
Existing text-to-image models often struggle to accurately represent the spatial relationships, quantities, and attributes of multiple objects described in text prompts. This work aims to improve the controllability of these models, allowing for more accurate and user-intended image generation. |
The proposed method uses a two-stage pipeline: 1) text-to-layout: utilize a Large Language Model (LLM) like GPT-4 to generate object bounding boxes from the text prompt. 2) grounded text-to-image generation: introduce Cross-Attention Refocusing (CAR) and Self-Attention Refocusing (SAR) losses to guide the diffusion model's attention towards the desired regions within the generated layout during the sampling process. |
The proposed attention-refocusing method consistently improves performance across various text-to-image models and benchmarks, including HRS, DrawBench, and TIFA, particularly in spatial accuracy and object counting.
Utilizing LLMs like GPT-4 for layout generation demonstrates strong spatial reasoning ability and allows for flexible integration with existing text-to-image models without requiring retraining.
The framework enables novel capabilities such as ChatGPT-based iterative image refinement by instructing layout modifications. |
The LLM-based layout generation may still struggle with complex prompts involving a large number of objects or unusual spatial compositions.
The grounded text-to-image model may not always perfectly adhere to out-of-distribution layouts generated by the LLM, requiring further research in layout-conditional generation. |
text-to-image synthesis, diffusion models, attention mechanisms, layout generation, large language models |
2306.05422
Report |
Tracking Everything Everywhere All at Once |
Qianqian Wang, Yen-Yu Chang, Ruojin Cai, Zhengqi Li, Bharath Hariharan, Aleksander Holynski, Noah Snavely |
We present a new test-time optimization method for estimating dense and
long-range motion from a video sequence. Prior optical flow or particle video
tracking algorithms typically operate within limited temporal windows,
struggling to track through occlusions and maintain global consistency of
estimated motion trajectories. We propose a complete and globally consistent
motion representation, dubbed OmniMotion, that allows for accurate, full-length
motion estimation of every pixel in a video. OmniMotion represents a video
using a quasi-3D canonical volume and performs pixel-wise tracking via
bijections between local and canonical space. This representation allows us to
ensure global consistency, track through occlusions, and model any combination
of camera and object motion. Extensive evaluations on the TAP-Vid benchmark and
real-world footage show that our approach outperforms prior state-of-the-art
methods by a large margin both quantitatively and qualitatively. See our
project page for more results: http://omnimotion.github.io/ |
Proposes OmniMotion, a test-time optimization method using a quasi-3D representation for estimating dense, long-range, globally consistent motion trajectories in videos, even through occlusions. |
Existing methods struggle to estimate both dense and long-range pixel trajectories accurately and consistently, especially through occlusions. |
Represents a video as a canonical 3D volume and per-frame local volumes, connected by learned bijections. Optimizes the representation using noisy pairwise correspondences (e.g., optical flow) and photometric consistency. |
Achieves state-of-the-art performance on the TAP-Vid benchmark, outperforming previous methods in position accuracy, occlusion accuracy, and temporal coherence.
Successfully tracks points through long occlusions and provides plausible locations even during occlusion.
Demonstrates robustness to varying camera setups and scene dynamics. |
Struggles with rapid, highly non-rigid motions and thin structures due to reliance on reliable pairwise correspondence input.
Can be computationally expensive, particularly in the flow collection and optimization stages. |
motion estimation, dense correspondence, occlusion handling, video representation, test-time optimization |
2306.05414
Report |
Improving Tuning-Free Real Image Editing with Proximal Guidance |
Ligong Han, Song Wen, Qi Chen, Zhixing Zhang, Kunpeng Song, Mengwei Ren, Ruijiang Gao, Anastasis Stathopoulos, Xiaoxiao He, Yuxiao Chen, Di Liu, Qilong Zhangli, Jindong Jiang, Zhaoyang Xia, Akash Srivastava, Dimitris Metaxas |
DDIM inversion has revealed the remarkable potential of real image editing
within diffusion-based methods. However, the accuracy of DDIM reconstruction
degrades as larger classifier-free guidance (CFG) scales are used for
enhanced editing. Null-text inversion (NTI) optimizes null embeddings to align
the reconstruction and inversion trajectories with larger CFG scales, enabling
real image editing with cross-attention control. Negative-prompt inversion
(NPI) further offers a training-free closed-form solution of NTI. However, it
may introduce artifacts and is still constrained by DDIM reconstruction
quality. To overcome these limitations, we propose proximal guidance and
incorporate it into NPI with cross-attention control. We enhance NPI with a
regularization term and reconstruction guidance, which reduces artifacts while
capitalizing on its training-free nature. Additionally, we extend the concepts
to incorporate mutual self-attention control, enabling geometry and layout
alterations in the editing process. Our method provides an efficient and
straightforward approach, effectively addressing real image editing tasks with
minimal computational overhead. |
This paper introduces "proximal guidance," a technique for improving tuning-free real image editing in diffusion models. It enhances both Negative-Prompt Inversion (NPI) and Mutual Self-Attention Control by regularizing the editing process and optionally aligning with inversion latents. |
Existing methods for real image editing with diffusion models often struggle with identity preservation or require time-consuming per-image optimization. This method addresses these limitations, enabling high-quality edits with minimal computational overhead. |
The proposed method incorporates a proximal function, akin to proximal gradient methods, to constrain the noise difference between target and source prompts during image generation. It optionally uses inversion guidance, performing a gradient descent step towards the inversion latent to further refine the editing process. |
Proximal guidance enhances NPI, achieving better reconstruction and editing quality compared to NTI and baseline NPI.
When applied to Mutual Self-Attention Control, it improves stability and preserves desired details, addressing limitations of direct NPI integration.
The method allows for simultaneous editing of both texture and geometry by sequentially applying proximal guidance to NPI and MasaCtrl. |
The performance of proximal guidance can be sensitive to hyperparameters like threshold and step size, necessitating careful tuning.
Future work could explore heuristics or automated methods for hyperparameter selection, improving usability and generalizability. |
image editing, diffusion models, negative-prompt inversion, mutual self-attention control, proximal guidance |
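The idea in the record above (2306.05414) of constraining the noise difference between target and source prompts with a proximal function can be sketched as follows. This is a sketch under stated assumptions, not the paper's implementation: soft and hard thresholding are the standard proximal operators of the L1 and L0 penalties, but the names `guided_noise`, `scale`, and `lam`, and the way the thresholded difference is added back to the source prediction, are illustrative choices.

```python
def soft_threshold(x, lam):
    """Proximal operator of lam * |x|: shrink x toward zero by lam."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def hard_threshold(x, lam):
    """Keep x only if its magnitude exceeds lam, otherwise zero it."""
    return x if abs(x) > lam else 0.0


def guided_noise(eps_src, eps_tgt, scale, lam, prox=soft_threshold):
    """Combine source and target noise predictions element-wise.

    The difference (target - source) is passed through a proximal
    operator before being scaled, so small differences are suppressed
    and only the larger edits survive.
    """
    return [s + scale * prox(t - s, lam) for s, t in zip(eps_src, eps_tgt)]


# Toy noise predictions (assumption): three scalar "pixels".
eps_src = [0.0, 1.0, -1.0]
eps_tgt = [0.05, 2.0, -3.0]
soft_result = guided_noise(eps_src, eps_tgt, scale=2.0, lam=0.5)
hard_result = guided_noise(eps_src, eps_tgt, scale=2.0, lam=0.5, prox=hard_threshold)
```

The threshold `lam` plays the role of the hyperparameter the record flags as sensitive: a larger value leaves more of the source prediction untouched, while a smaller value lets more of the edit through.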
2306.05410
Report |
LU-NeRF: Scene and Pose Estimation by Synchronizing Local Unposed NeRFs |
Zezhou Cheng, Carlos Esteves, Varun Jampani, Abhishek Kar, Subhransu Maji, Ameesh Makadia |
A critical obstacle preventing NeRF models from being deployed broadly in the
wild is their reliance on accurate camera poses. Consequently, there is growing
interest in extending NeRF models to jointly optimize camera poses and scene
representation, which offers an alternative to off-the-shelf SfM pipelines
which have well-understood failure modes. Existing approaches for unposed NeRF
operate under limited assumptions, such as a prior pose distribution or coarse
pose initialization, making them less effective in a general setting. In this
work, we propose a novel approach, LU-NeRF, that jointly estimates camera poses
and neural radiance fields with relaxed assumptions on pose configuration. Our
approach operates in a local-to-global manner, where we first optimize over
local subsets of the data, dubbed mini-scenes. LU-NeRF estimates local pose and
geometry for this challenging few-shot task. The mini-scene poses are brought
into a global reference frame through a robust pose synchronization step, where
a final global optimization of pose and scene can be performed. We show our
LU-NeRF pipeline outperforms prior attempts at unposed NeRF without making
restrictive assumptions on the pose prior. This allows us to operate in the
general SE(3) pose setting, unlike the baselines. Our results also indicate our
model can be complementary to feature-based SfM pipelines as it compares
favorably to COLMAP on low-texture and low-resolution images. |
LU-NeRF: a novel local-to-global pipeline that jointly estimates camera poses in general configurations and neural radiance fields from unposed image sets. |
Existing NeRF models often rely on accurate camera poses from SfM pipelines, which can fail in challenging conditions like low-texture scenes. LU-NeRF aims to address this limitation by directly optimizing poses within the NeRF framework. |
LU-NeRF partitions the scene into mini-scenes, optimizing local poses and geometry for each. It utilizes a novel two-stage training process to address the mirror symmetry ambiguity inherent in few-shot unposed settings. These local estimations are then aligned into a global frame via robust pose synchronization, followed by joint refinement of global poses and scene representation. |
Outperforms prior unposed NeRF methods without relying on pose priors, operating in the general SE(3) pose setting.
Demonstrates robustness to outliers in mini-scene construction.
Shows complementarity to feature-based SfM (COLMAP) by achieving better pose estimation in low-texture or low-resolution settings. |
Computational cost is high, but could potentially be addressed by recent advances in neural rendering.
Building reliable graphs for unordered image collections remains challenging and requires further investigation. |
neural radiance fields, pose estimation, structure from motion, 3d scene reconstruction, few-shot learning |
2306.05399
Report |
Matting Anything |
Jiachen Li, Jitesh Jain, Humphrey Shi |
In this paper, we propose the Matting Anything Model (MAM), an efficient and
versatile framework for estimating the alpha matte of any instance in an image
with flexible and interactive visual or linguistic user prompt guidance. MAM
offers several significant advantages over previous specialized image matting
networks: (i) MAM is capable of dealing with various types of image matting,
including semantic, instance, and referring image matting with only a single
model; (ii) MAM leverages the feature maps from the Segment Anything Model
(SAM) and adopts a lightweight Mask-to-Matte (M2M) module to predict the alpha
matte through iterative refinement, which has only 2.7 million trainable
parameters; (iii) by incorporating SAM, MAM simplifies the user intervention
required for the interactive use of image matting from the trimap to the box,
point, or text prompt. We evaluate the performance of MAM on various image
matting benchmarks, and the experimental results demonstrate that MAM achieves
comparable performance to the state-of-the-art specialized image matting models
under different metrics on each benchmark. Overall, MAM shows superior
generalization ability and can effectively handle various image matting tasks
with fewer parameters, making it a practical solution for unified image
matting. Our code and models are open-sourced at
https://github.com/SHI-Labs/Matting-Anything. |
This paper introduces Matting Anything Model (MAM), a versatile framework for estimating alpha mattes of any instance in an image, utilizing flexible and interactive visual or linguistic user prompts. |
Existing image matting methods are often specialized for specific tasks and lack the flexibility to handle various scenarios with a single model. MAM addresses this limitation by providing a unified and efficient solution for different matting types. |
MAM leverages the Segment Anything Model (SAM) for instance segmentation and incorporates a lightweight Mask-to-Matte (M2M) module to refine SAM's binary masks into high-quality alpha mattes through multi-scale predictions and iterative refinement. |
MAM achieves comparable performance to state-of-the-art specialized models on various benchmarks for semantic, instance, and referring image matting.
The model exhibits superior generalization ability and effectively handles different matting tasks with fewer parameters compared to previous approaches.
MAM demonstrates significant improvements in refining alpha matte predictions, particularly in transition areas, without requiring trimap guidance. |
The performance of referring image matting using text prompts is significantly lower than using bounding box prompts.
Future work could focus on improving text-guided matting performance and exploring more complex prompt engineering techniques. |
image matting, interactive segmentation, instance segmentation, referring image matting, segment anything model (sam) |
2306.05390
Report |
HQ-50K: A Large-scale, High-quality Dataset for Image Restoration |
Qinhong Yang, Dongdong Chen, Zhentao Tan, Qiankun Liu, Qi Chu, Jianmin Bao, Lu Yuan, Gang Hua, Nenghai Yu |
This paper introduces a new large-scale image restoration dataset, called
HQ-50K, which contains 50,000 high-quality images with rich texture details and
semantic diversity. We analyze existing image restoration datasets from five
different perspectives, including data scale, resolution, compression rates,
texture details, and semantic coverage. However, we find that all of these
datasets are deficient in some aspects. In contrast, HQ-50K considers all of
these five aspects during the data curation process and meets all requirements.
We also present a new Degradation-Aware Mixture of Expert (DAMoE) model, which
enables a single model to handle multiple corruption types and unknown levels.
Our extensive experiments demonstrate that HQ-50K consistently improves the
performance on various image restoration tasks, such as super-resolution,
denoising, dejpeg, and deraining. Furthermore, our proposed DAMoE, trained on
our HQ-50K, outperforms existing state-of-the-art unified models designed for
multiple restoration tasks and levels. The dataset and code are available at
https://github.com/littleYaang/HQ-50K. |
This paper introduces HQ-50K, a large-scale, high-quality dataset for image restoration containing 50,000 high-quality images, and proposes DAMoE, a Degradation-Aware Mixture of Expert model for unified image restoration. |
Existing image restoration datasets are limited in scale, resolution, compression rates, texture details, or semantic coverage, hindering the development of robust and generalizable restoration models. |
HQ-50K is curated by selecting high-quality images from the internet and existing datasets, considering five key aspects: scale, resolution, compression, texture details, and semantic coverage. DAMoE leverages Mixture of Expert (MoE) layers within a transformer-based architecture to handle various restoration tasks and degradation levels with shared modules and task/degradation-specific experts. |
HQ-50K consistently improves performance on various image restoration tasks, including super-resolution, denoising, dejpeg, and deraining.
Models trained on HQ-50K demonstrate better generalization across different semantic categories of images.
DAMoE, trained on HQ-50K, outperforms existing state-of-the-art unified models designed for multiple restoration tasks and levels. |
While larger than existing restoration datasets, HQ-50K is still smaller than datasets for high-level vision tasks, limiting its potential for training even larger models.
Further research is needed to explore different MoE block designs and routing strategies for optimal performance in DAMoE. |
image restoration, dataset, deep learning, mixture of experts, unified model |
2306.05382
Report |
Image Blending Algorithm with Automatic Mask Generation |
Haochen Xue, Mingyu Jin, Chong Zhang, Yuxuan Huang, Qian Weng, Xiaobo Jin |
In recent years, image blending has gained popularity for its ability to
create visually stunning content. However, the current image blending
algorithms mainly have the following problems: manually creating image blending
masks requires a lot of manpower and material resources; image blending
algorithms cannot effectively solve the problems of brightness distortion and
low resolution. To this end, we propose a new image blending method with
automatic mask generation: it combines semantic object detection and
segmentation with mask generation to achieve deep blended images based on our
proposed new saturation loss and two-stage iteration of the PAN algorithm to
fix brightness distortion and low-resolution issues. Results on publicly
available datasets show that our method outperforms other classical image
blending algorithms on various performance metrics, including PSNR and SSIM. |
This paper introduces an automatic two-stage image blending method utilizing DINO and SAM for mask generation and a novel saturation loss with PAN for enhanced image quality. |
Existing image blending algorithms suffer from manual mask creation effort, brightness distortion, and low resolution in blended images. This work aims to automate the process and address these quality issues. |
The method employs DINO for object detection and SAM for accurate mask generation. Erosion and dilation refine the mask. A two-stage blending process uses gradient, content, style, and a novel saturation loss, further enhanced by PAN for high-resolution output. |
The proposed automatic mask generation surpasses traditional RCNN in accuracy and generalizability.
The introduction of saturation loss effectively mitigates brightness discrepancies at blended image seams.
The method outperforms classic algorithms like GP-GAN and Poisson Blending in PSNR and SSIM, demonstrating superior visual quality. |
The evaluation currently relies on standard metrics (PSNR, SSIM, MSE) which may not perfectly capture human perception of image quality.
Future research could focus on addressing challenges posed by object occlusion in image blending scenarios. |
image blending, mask generation, image segmentation, object detection, deep learning |
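The record above (2306.05382) evaluates blending quality with MSE and PSNR. As a reminder of what those metrics measure, here is a minimal sketch of the standard definitions on flat lists of pixel values. The function names and the 8-bit `max_value` default are assumptions, and this is not the paper's evaluation code.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length pixel sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def psnr(a, b, max_value=255.0):
    """Peak signal-to-noise ratio in decibels.

    PSNR = 10 * log10(max_value^2 / MSE); identical inputs give infinity.
    """
    err = mse(a, b)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


# Toy 4-pixel "images" (assumption): values in the 8-bit range 0..255.
reference = [10.0, 20.0, 30.0, 40.0]
blended = [12.0, 18.0, 33.0, 39.0]
score = psnr(reference, blended)
```

Higher PSNR means lower pixel-wise error, but as the record's limitations note, such metrics may not track how humans perceive the quality of a blend.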
2306.05356
Report |
ReliableSwap: Boosting General Face Swapping Via Reliable Supervision |
Ge Yuan, Maomao Li, Yong Zhang, Huicheng Zheng |
Almost all advanced face swapping approaches use reconstruction as the proxy
task, i.e., supervision only exists when the target and source belong to the
same person. Otherwise, lacking pixel-level supervision, these methods struggle
for source identity preservation. This paper proposes to construct reliable
supervision, dubbed cycle triplets, which serves as the image-level guidance
when the source identity differs from the target one during training.
Specifically, we use face reenactment and blending techniques to synthesize the
swapped face from real images in advance, where the synthetic face preserves
source identity and target attributes. However, there may be some artifacts in
such a synthetic face. To avoid the potential artifacts and drive the
distribution of the network output close to the natural one, we reversely take
synthetic images as input while the real face as reliable supervision during
the training stage of face swapping. Besides, we empirically find that the
existing methods tend to lose lower-face details like face shape and mouth from
the source. This paper additionally designs a FixerNet, providing
discriminative embeddings of lower faces as an enhancement. Our face swapping
framework, named ReliableSwap, can boost the performance of any existing face
swapping network with negligible overhead. Extensive experiments demonstrate
the efficacy of our ReliableSwap, especially in identity preservation. The
project page is https://reliable-swap.github.io/. |
The paper proposes ReliableSwap, a general face swapping framework that improves identity preservation in existing methods by using synthetically generated "cycle triplets" as reliable supervision during training and introducing a FixerNet to enhance lower face details. |
Existing face swapping methods often struggle to maintain the source identity, resulting in swapped faces that appear as an interpolation between the source and target. This is due to the lack of pixel-level supervision when the source and target identities differ. |
The method involves 1) synthesizing "naive triplets" of images using face reenactment and blending to have controlled identity and attributes, 2) constructing "cycle triplets" from these naive triplets to provide reliable supervision during training, and 3) incorporating a FixerNet that extracts discriminative features of the lower face to enhance detail preservation. |
ReliableSwap enhances identity preservation in swapped faces, as demonstrated by quantitative metrics and qualitative comparisons on datasets like FaceForensics++ and CelebA-HQ.
The proposed cycle triplets effectively address the issue of lacking supervision for different identities during training.
The FixerNet successfully improves the consistency of lower face details like mouth and face shape in swapped results. |
The face reenactment model used in constructing cycle triplets may not perfectly transfer pose, leading to potential artifacts.
Limited evaluation on high-resolution images (512×512 or 1024×1024) due to the scarcity of publicly available training code for such methods. |
face swapping, identity preservation, cycle triplet, fixernet, reliable supervision |
2306.05178
Report |
SyncDiffusion: Coherent Montage via Synchronized Joint Diffusions |
Yuseung Lee, Kunho Kim, Hyunjin Kim, Minhyuk Sung |
The remarkable capabilities of pretrained image diffusion models have been
utilized not only for generating fixed-size images but also for creating
panoramas. However, naive stitching of multiple images often results in visible
seams. Recent techniques have attempted to address this issue by performing
joint diffusions in multiple windows and averaging latent features in
overlapping regions. However, these approaches, which focus on seamless montage
generation, often yield incoherent outputs by blending different scenes within
a single image. To overcome this limitation, we propose SyncDiffusion, a
plug-and-play module that synchronizes multiple diffusions through gradient
descent from a perceptual similarity loss. Specifically, we compute the
gradient of the perceptual loss using the predicted denoised images at each
denoising step, providing meaningful guidance for achieving coherent montages.
Our experimental results demonstrate that our method produces significantly
more coherent outputs compared to previous methods (66.35% vs. 33.65% in our
user study) while still maintaining fidelity (as assessed by GIQA) and
compatibility with the input prompt (as measured by CLIP score). We further
demonstrate the versatility of our method across three plug-and-play
applications: layout-guided image generation, conditional image generation and
360-degree panorama generation. Our project page is at
https://syncdiffusion.github.io. |
This paper proposes SyncDiffusion, a plug-and-play module for synchronizing multiple diffusions to enhance global coherence in image montage generation, particularly panoramas. |
Existing panorama generation methods using diffusion models often result in either visible seams or incoherent outputs, lacking global semantic consistency. |
SyncDiffusion utilizes gradient descent from a perceptual similarity loss (LPIPS or Style Loss) calculated between predicted denoised images at each denoising step to synchronize multiple diffusions across different image regions. |
SyncDiffusion produces significantly more coherent panoramas compared to previous methods, as demonstrated by lower Intra-LPIPS and Intra-Style-L scores.
The method maintains fidelity to the input prompt (Mean-CLIP-S) and image quality (Mean-GIQA), comparable to single-image generations.
User studies confirm a strong preference for SyncDiffusion (66.35%) over previous methods (33.65%) in terms of coherence, image quality, and prompt compatibility. |
Generating realistic panoramas relies on suitable input prompts.
The gradient descent computation in SyncDiffusion introduces additional computational overhead. |
image generation, diffusion models, panorama generation, coherence, perceptual similarity |
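The synchronization idea in the SyncDiffusion entry above (a gradient-descent step on a similarity loss computed between predicted denoised images) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `toy_denoise` is a hypothetical linear stand-in for the diffusion model's denoised prediction, and a squared L2 distance replaces the perceptual loss (LPIPS or Style Loss) that the paper uses. It is not the authors' implementation.

```python
import numpy as np

def toy_denoise(x_t, scale=0.5):
    # Hypothetical stand-in for the model's predicted denoised image.
    return scale * x_t

def coherence_loss(x_t, anchor_idx=0, scale=0.5):
    # Squared L2 distance of every window's prediction to the anchor window's.
    # (Stand-in for a perceptual loss; an assumption of this sketch.)
    x0 = toy_denoise(x_t, scale)
    diff = x0 - x0[anchor_idx]
    return 0.5 * float(np.sum(diff ** 2))

def sync_step(x_t, anchor_idx=0, weight=1.0, scale=0.5):
    # One gradient-descent step on the noisy windows, with the anchor
    # prediction treated as a fixed target.
    x0 = toy_denoise(x_t, scale)
    anchor = x0[anchor_idx]
    grad = scale * (x0 - anchor)  # chain rule through toy_denoise
    return x_t - weight * grad

rng = np.random.default_rng(0)
windows = rng.normal(size=(4, 8, 8))  # four toy windows of one panorama
synced = sync_step(windows)
loss_before = coherence_loss(windows)
loss_after = coherence_loss(synced)
```

In this toy setting the anchor window is left unchanged and the other windows move toward it, so the loss can only shrink; in a real pipeline such a step would be interleaved with the ordinary denoising update at each timestep.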
2306.04988
Report |
StreetSurf: Extending Multi-view Implicit Surface Reconstruction to Street Views |
Jianfei Guo, Nianchen Deng, Xinyang Li, Yeqi Bai, Botian Shi, Chiyu Wang, Chenjing Ding, Dongliang Wang, Yikang Li |
We present a novel multi-view implicit surface reconstruction technique,
termed StreetSurf, that is readily applicable to street view images in
widely-used autonomous driving datasets, such as Waymo-perception sequences,
without necessarily requiring LiDAR data. As neural rendering research expands
rapidly, its integration into street views has started to draw interest.
Existing approaches on street views either mainly focus on novel view synthesis
with little exploration of the scene geometry, or rely heavily on dense LiDAR
data when investigating reconstruction. Neither of them investigates multi-view
implicit surface reconstruction, especially under settings without LiDAR data.
Our method extends prior object-centric neural surface reconstruction
techniques to address the unique challenges posed by the unbounded street views
that are captured with non-object-centric, long and narrow camera trajectories.
We delimit the unbounded space into three parts, close-range, distant-view and
sky, with aligned cuboid boundaries, and adapt cuboid/hyper-cuboid hash-grids
along with road-surface initialization scheme for finer and disentangled
representation. To further address the geometric errors arising from
textureless regions and insufficient viewing angles, we adopt geometric priors
that are estimated using general purpose monocular models. Coupled with our
implementation of efficient and fine-grained multi-stage ray marching strategy,
we achieve state of the art reconstruction quality in both geometry and
appearance within only one to two hours of training time with a single RTX3090
GPU for each street view sequence. Furthermore, we demonstrate that the
reconstructed implicit surfaces have rich potential for various downstream
tasks, including ray tracing and LiDAR simulation. |
This paper presents StreetSurf, a novel multi-view implicit surface reconstruction framework for street views that achieves state-of-the-art geometry and appearance quality within a short training time without requiring LiDAR data. |
Existing methods for street view reconstruction either focus on novel view synthesis without exploring scene geometry or rely heavily on LiDAR data. StreetSurf addresses these limitations, enabling accurate surface reconstruction from camera images alone. |
The method divides the scene into close-range, distant-view, and sky regions, each modeled by a dedicated neural network. It uses aligned cuboid boundaries and hash-grids for efficient representation and employs road-surface initialization and entropy regularization for disentangling close-range and distant-view models. Geometric priors from monocular estimations further enhance reconstruction accuracy. |
StreetSurf reconstructs high-quality surfaces from street view images without LiDAR data.
The disentanglement of close-range and distant-view models improves reconstruction quality and enables accurate LiDAR simulation.
The reconstructed implicit surfaces can be used for various downstream tasks, such as ray tracing and occupancy grid extraction. |
StreetSurf currently ignores dynamic foreground objects in street views.
The method faces challenges in handling complex lighting conditions and long-tail environmental variations. |
neural rendering, implicit surface reconstruction, street views, multi-view reconstruction, autonomous driving |
2306.04865
Report |
MyStyle++: A Controllable Personalized Generative Prior |
Libing Zeng, Lele Chen, Yi Xu, Nima Kalantari |
In this paper, we propose an approach to obtain a personalized generative
prior with explicit control over a set of attributes. We build upon MyStyle, a
recently introduced method, that tunes the weights of a pre-trained StyleGAN
face generator on a few images of an individual. This system allows
synthesizing, editing, and enhancing images of the target individual with high
fidelity to their facial features. However, MyStyle does not demonstrate
precise control over the attributes of the generated images. We propose to
address this problem through a novel optimization system that organizes the
latent space in addition to tuning the generator. Our key contribution is to
formulate a loss that arranges the latent codes, corresponding to the input
images, along a set of specific directions according to their attributes. We
demonstrate that our approach, dubbed MyStyle++, is able to synthesize, edit,
and enhance images of an individual with great control over the attributes,
while preserving the unique facial characteristics of that individual. |
This paper introduces MyStyle++, a novel optimization system that enhances personalized generative priors, like MyStyle, by enabling explicit control over attributes in synthesized images while preserving individual facial characteristics. |
Existing methods for personalized image synthesis often lack precise control over attributes or fail to maintain identity during editing. MyStyle++ aims to address these issues. |
MyStyle++ builds upon StyleGAN and employs a two-pronged approach: 1) It organizes the latent space by optimizing anchor latent codes based on their attributes, and 2) It tunes the generator to ensure fidelity to the target individual. |
MyStyle++ demonstrates superior control over attributes like expression, yaw, pitch, and age compared to baseline methods.
Quantitative evaluations reveal lower standard deviation in desired attributes and better preservation of identity during editing.
The method proves effective for controllable image enhancement tasks such as inpainting and super-resolution. |
The number of images required for MyStyle++ grows with the number of attributes, posing a limitation for highly controllable synthesis.
While attribute control is precise, reconstructions for attributes like view lack physical accuracy, suggesting an area for future improvement. |
generative adversarial networks, personalized image synthesis, controllable gans, few-shot learning, semantic image editing |
2306.04849
Report |
ScaleDet: A Scalable Multi-Dataset Object Detector |
Yanbei Chen, Manchen Wang, Abhay Mittal, Zhenlin Xu, Paolo Favaro, Joseph Tighe, Davide Modolo |
Multi-dataset training provides a viable solution for exploiting
heterogeneous large-scale datasets without extra annotation cost. In this work,
we propose a scalable multi-dataset detector (ScaleDet) that can scale up its
generalization across datasets when increasing the number of training datasets.
Unlike existing multi-dataset learners that mostly rely on manual relabelling
efforts or sophisticated optimizations to unify labels across datasets, we
introduce a simple yet scalable formulation to derive a unified semantic label
space for multi-dataset training. ScaleDet is trained by visual-textual
alignment to learn the label assignment with label semantic similarities across
datasets. Once trained, ScaleDet can generalize well on any given upstream and
downstream datasets with seen and unseen classes. We conduct extensive
experiments using LVIS, COCO, Objects365, OpenImages as upstream datasets, and
13 datasets from Object Detection in the Wild (ODinW) as downstream datasets.
Our results show that ScaleDet achieves compellingly strong model performance
with an mAP of 50.7 on LVIS, 58.8 on COCO, 46.8 on Objects365, 76.2 on
OpenImages, and 71.8 on ODinW, surpassing state-of-the-art detectors with the
same backbone. |
ScaleDet is a novel multi-dataset object detector that effectively scales its generalization as the number of training datasets increases. |
Training on multiple datasets is crucial for building robust and generalizable object detectors, but unifying diverse label spaces across datasets is challenging. |
ScaleDet unifies labels from different datasets into a single semantic space using text embeddings and learns via hard and soft label assignments for visual-textual alignment. |
Scaling up the number of training datasets significantly improves performance on both upstream and downstream datasets.
ScaleDet outperforms state-of-the-art multi-dataset detectors like UniDet and Detic, even when trained on fewer datasets.
The method achieves strong transferability, as demonstrated by its state-of-the-art results on the challenging ODinW benchmark. |
The reliance on text embeddings might limit performance if the text encoder doesn't adequately capture certain visual concepts.
Future work can explore incorporating weakly-supervised or semi-supervised learning techniques to further leverage the large amounts of available unlabeled or partially labeled data. |
object detection, multi-dataset learning, visual-textual alignment, zero-shot detection, generalization |
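The label-unification step in the ScaleDet entry above (one semantic label space built from text embeddings, with hard and soft label assignments) can be sketched roughly as follows. The three-dimensional embeddings are hand-made toy vectors, not the output of any real text encoder, and the dataset and class names are illustrative assumptions. The sketch shows only how the two kinds of targets could be formed, not how the detector is trained.

```python
import numpy as np

# Toy label sets from two hypothetical datasets, concatenated into one space.
datasets = {"A": ["car", "dog"], "B": ["automobile"]}
unified = [(name, label) for name, labels in datasets.items() for label in labels]

# Hand-made stand-ins for text embeddings of each unified label (assumption).
toy_embeddings = {
    "car": [1.0, 0.1, 0.0],
    "dog": [0.0, 0.0, 1.0],
    "automobile": [0.9, 0.2, 0.0],
}
emb = np.array([toy_embeddings[label] for _, label in unified])

def cosine_similarity_matrix(vectors):
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return normed @ normed.T

def label_targets(index, sim):
    # Hard target: one-hot on the annotated label.
    hard = np.zeros(sim.shape[0])
    hard[index] = 1.0
    # Soft target: semantic similarity of the annotated label to every label.
    soft = sim[index]
    return hard, soft

sim = cosine_similarity_matrix(emb)
car_index = unified.index(("A", "car"))
hard, soft = label_targets(car_index, sim)
```

Here the soft target for "car" stays close to 1 on "automobile" and drops to 0 on "dog", so near-synonymous labels from different datasets are related without any manual relabelling.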
2306.04848
Report |
Interpreting and Improving Diffusion Models Using the Euclidean Distance Function |
Frank Permenter, Chenyang Yuan |
Denoising is intuitively related to projection. Indeed, under the manifold
hypothesis, adding random noise is approximately equivalent to orthogonal
perturbation. Hence, learning to denoise is approximately learning to project.
In this paper, we use this observation to reinterpret denoising diffusion
models as approximate gradient descent applied to the Euclidean distance
function. We then provide straightforward convergence analysis of the DDIM
sampler under simple assumptions on the projection-error of the denoiser.
Finally, we propose a new sampler based on two simple modifications to DDIM
using insights from our theoretical results. In as few as 5-10 function
evaluations, our sampler achieves state-of-the-art FID scores on pretrained
CIFAR-10 and CelebA models and can generate high quality samples on latent
diffusion models. |
This paper presents a novel interpretation of denoising diffusion models as performing approximate gradient descent on the Euclidean distance function to the data manifold, providing theoretical analysis and a new improved sampler. |
Diffusion models achieve state-of-the-art results in generative tasks, but their understanding is mainly probabilistic. This work offers a deterministic analysis, enabling new insights and algorithmic improvements. |
The authors analyze DDIM sampling under a relative-error model, showing its equivalence to gradient descent with error. They leverage this to design a second-order sampler that reduces error in denoiser output by combining previous estimates. |
The paper validates the proposed relative error model both theoretically and empirically on image datasets.
It provides convergence analysis of DDIM under the error model, linking error parameters to the noise schedule.
The proposed second-order sampler achieves state-of-the-art FID scores on pretrained CIFAR-10 and CelebA models, outperforming existing samplers. |
The analysis assumes existence of admissible noise schedules, which are characterized but their optimality remains unexplored.
The relative error model, while empirically validated, is not guaranteed to hold in all cases, suggesting future work on tighter bounds or alternative models. |
denoising diffusion models, generative models, distance functions, sampling algorithms, gradient descent |
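The "denoising as projection" reading in the entry above can be checked on a toy manifold. The sketch assumes the data manifold is the unit circle, an ideal denoiser that returns the exact projection, and a deterministic DDIM-style update written in the noise-level parametrization, x_next = x - (1 - sigma_next / sigma_cur) * (x - denoised). Since x - proj(x) is the gradient of half the squared distance to the manifold, each update is a gradient step with step size 1 - sigma_next / sigma_cur. The schedule and starting point are arbitrary choices for illustration.

```python
import numpy as np

def project_to_circle(x):
    # Ideal "denoiser" for this toy: exact projection onto the unit circle.
    return x / np.linalg.norm(x)

def distance_to_circle(x):
    return abs(np.linalg.norm(x) - 1.0)

def ddim_as_gradient_descent(x, sigmas):
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        grad = x - project_to_circle(x)  # gradient of 0.5 * dist(x)^2
        step = 1.0 - s_next / s_cur      # step size implied by the schedule
        x = x - step * grad
    return x

sigmas = np.geomspace(1.0, 0.01, num=10)  # toy decreasing noise schedule
x_init = np.array([3.0, 4.0])             # starts at distance 4 from the circle
x_final = ddim_as_gradient_descent(x_init, sigmas)
dist_init = distance_to_circle(x_init)
dist_final = distance_to_circle(x_final)
```

With an exact projection, each step shrinks the distance by the factor sigma_next / sigma_cur, so the final distance equals the initial one scaled by sigmas[-1] / sigmas[0]. A learned denoiser only approximates the projection, which is the gap the paper's relative-error model is meant to capture.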
2306.04744
Report |
WOUAF: Weight Modulation for User Attribution and Fingerprinting in Text-to-Image Diffusion Models |
Changhoon Kim, Kyle Min, Maitreya Patel, Sheng Cheng, Yezhou Yang |
The rapid advancement of generative models, facilitating the creation of
hyper-realistic images from textual descriptions, has concurrently escalated
critical societal concerns such as misinformation. Although providing some
mitigation, traditional fingerprinting mechanisms fall short in attributing
responsibility for the malicious use of synthetic images. This paper introduces
a novel approach to model fingerprinting that assigns responsibility for the
generated images, thereby serving as a potential countermeasure to model
misuse. Our method modifies generative models based on each user's unique
digital fingerprint, imprinting a unique identifier onto the resultant content
that can be traced back to the user. This approach, incorporating fine-tuning
into Text-to-Image (T2I) tasks using the Stable Diffusion Model, demonstrates
near-perfect attribution accuracy with a minimal impact on output quality.
Through extensive evaluation, we show that our method outperforms baseline
methods with an average improvement of 11% in handling image post-processes.
Our method presents a promising and novel avenue for accountable model
distribution and responsible use. Our code is available at
https://github.com/kylemin/WOUAF. |
This paper introduces WOUAF, a novel distributor-centered weight modulation method for fingerprinting text-to-image diffusion models, enabling user attribution for generated images. |
The rise of hyper-realistic image generation raises concerns about misuse, like misinformation. WOUAF provides a way to attribute generated images to their source, combating malicious use. |
WOUAF embeds user-specific fingerprints directly into the model weights of the Stable Diffusion decoder via a mapping network and affine transformations during the fine-tuning process. |
WOUAF achieves near-perfect attribution accuracy while minimally impacting the quality of generated images.
The method demonstrates robustness against various image post-processing techniques, outperforming baseline methods.
WOUAF proves resilient to deliberate fingerprint removal attempts like auto-encoder obfuscation and model purification. |
The fingerprint capacity, while supporting over 4 billion users, shows a trade-off with increasing fingerprint dimensions.
Future work aims to extend the methodology beyond image data to encompass diverse data types like text, audio, and video. |
model fingerprinting, user attribution, text-to-image synthesis, diffusion models, stable diffusion |
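The weight-modulation mechanism in the WOUAF entry above (a fingerprint passed through a mapping network and an affine transformation to modulate decoder weights) can be pictured with a small sketch. All shapes, the single-layer `mapping_network`, and the per-input-channel scaling are assumptions in the style of StyleGAN weight modulation, and the weights are random rather than learned. The paper's actual architecture and training procedure are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
FP_BITS, EMB_DIM, IN_CH, OUT_CH, K = 32, 16, 4, 8, 3  # illustrative sizes

# Random stand-ins for parameters that would be learned during fine-tuning.
map_w = rng.normal(size=(EMB_DIM, FP_BITS)) / np.sqrt(FP_BITS)
affine_w = rng.normal(size=(IN_CH, EMB_DIM)) / np.sqrt(EMB_DIM)
affine_b = np.ones(IN_CH)
conv_weight = rng.normal(size=(OUT_CH, IN_CH, K, K))

def mapping_network(bits):
    # Hypothetical one-layer mapping from fingerprint bits to an embedding.
    return np.tanh(map_w @ (2.0 * bits - 1.0))

def modulate(weight, bits):
    # Affine transform of the embedding gives one scale per input channel,
    # which rescales the shared convolution weights for this user.
    scale = affine_w @ mapping_network(bits) + affine_b
    return weight * scale[None, :, None, None]

user_a = rng.integers(0, 2, size=FP_BITS).astype(float)
user_b = 1.0 - user_a  # a different fingerprint
w_a = modulate(conv_weight, user_a)
w_b = modulate(conv_weight, user_b)
```

Each fingerprint yields its own deterministic copy of the layer weights, so images decoded with `w_a` and `w_b` would carry different imprints; a separate decoder network, not shown, would be needed to read the fingerprint back from an image.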
2306.04695
Report |
ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models |
Maitreya Patel, Tejas Gokhale, Chitta Baral, Yezhou Yang |
The ability to understand visual concepts and replicate and compose these
concepts from images is a central goal for computer vision. Recent advances in
text-to-image (T2I) models have led to high-definition and realistic image
quality generation by learning from large databases of images and their
descriptions. However, the evaluation of T2I models has focused on photorealism
and limited qualitative measures of visual understanding. To quantify the
ability of T2I models in learning and synthesizing novel visual concepts
(a.k.a. personalized T2I), we introduce ConceptBed, a large-scale dataset that
consists of 284 unique visual concepts, and 33K composite text prompts. Along
with the dataset, we propose an evaluation metric, Concept Confidence Deviation
(CCD), that uses the confidence of oracle concept classifiers to measure the
alignment between concepts generated by T2I generators and concepts contained
in target images. We evaluate visual concepts that are either objects,
attributes, or styles, and also evaluate four dimensions of compositionality:
counting, attributes, relations, and actions. Our human study shows that CCD is
highly correlated with human understanding of concepts. Our results point to a
trade-off between learning the concepts and preserving the compositionality
which existing approaches struggle to overcome. The data, code, and interactive
demo are available at: https://conceptbed.github.io/ |
The paper introduces ConceptBed, a large-scale dataset and evaluation framework for assessing the ability of text-to-image models to learn and synthesize novel visual concepts. |
Existing evaluation methods for text-to-image models primarily focus on photorealism and lack robust measures for visual understanding, particularly in concept learning (personalized T2I). |
ConceptBed consists of 284 unique visual concepts and 33K composite text prompts. The authors propose Concept Confidence Deviation (CCD), a metric leveraging oracle concept classifiers to measure the alignment between generated and target images. |
There's a trade-off between concept alignment and composition alignment: methods excelling at one often struggle with the other.
CCD strongly correlates with human preferences for concept and composition alignment, outperforming prior metrics.
Using a pre-trained CLIP textual encoder helps maintain compositionality but hinders learning complex concepts. |
While large-scale, ConceptBed doesn't encompass all possible concepts; future work should combine it with qualitative examples.
The focus is on Stable Diffusion-based models; extending to other text-conditioned concept learners is important. |
concept learning, text-to-image synthesis, personalized t2i, evaluation metrics, diffusion models |
2306.04654
Report |
DenseDINO: Boosting Dense Self-Supervised Learning with Token-Based Point-Level Consistency |
Yike Yuan, Xinghe Fu, Yunlong Yu, Xi Li |
In this paper, we propose a simple yet effective transformer framework for
self-supervised learning called DenseDINO to learn dense visual
representations. To exploit the spatial information that dense prediction
tasks require but that existing self-supervised transformers neglect, we
introduce point-level supervision across views in a novel token-based way.
Specifically, DenseDINO introduces some extra input tokens called reference
tokens to match the point-level features with the position prior. With the
reference token, the model could maintain spatial consistency and deal with
multi-object complex scene images, thus generalizing better on dense prediction
tasks. Compared with the vanilla DINO, our approach obtains competitive
performance when evaluated on classification in ImageNet and achieves a large
margin (+7.2% mIoU) improvement in semantic segmentation on PascalVOC under the
linear probing protocol for segmentation. |
The paper proposes DenseDINO, a simple yet effective transformer framework for self-supervised learning of dense visual representations, introducing point-level supervision across views in a novel token-based way. |
Existing self-supervised transformers struggle to produce high-quality vision representations that generalize well to diverse downstream tasks, particularly dense prediction tasks like segmentation that require spatial information neglected by image-level approaches. |
DenseDINO introduces extra input tokens called reference tokens, defined as positional embeddings of randomly sampled point pairs across views, enabling the model to maintain spatial consistency and attend to multiple objects in complex scenes. The framework minimizes both image-level and point-level distillation losses using a modified masked-attention module to disentangle reference tokens. |
DenseDINO achieves competitive performance on ImageNet classification compared to the state-of-the-art DINO.
DenseDINO significantly surpasses DINO on PascalVOC semantic segmentation, demonstrating superior dense prediction capabilities.
Analysis reveals that multi-crop augmentation, while beneficial for classification, can hurt segmentation due to object misalignment between views, an issue mitigated by DenseDINO's point-level consistency and modified view generation. |
The selection and generation of reference tokens can be further optimized for improved object localization and supervision accuracy.
The framework's applicability to other dense prediction tasks beyond segmentation remains to be explored. |
self-supervised learning, transformer, dense visual representation, point-level supervision, object misalignment |
2306.04642
Report |
DiffusionShield: A Watermark for Copyright Protection against Generative Diffusion Models |
Yingqian Cui, Jie Ren, Han Xu, Pengfei He, Hui Liu, Lichao Sun, Yue Xing, Jiliang Tang |
Recently, Generative Diffusion Models (GDMs) have showcased their remarkable
capabilities in learning and generating images. A large community of GDMs has
naturally emerged, further promoting the diversified applications of GDMs in
various fields. However, this unrestricted proliferation has raised serious
concerns about copyright protection. For example, artists including painters
and photographers are becoming increasingly concerned that GDMs could
effortlessly replicate their unique creative works without authorization. In
response to these challenges, we introduce a novel watermarking scheme,
DiffusionShield, tailored for GDMs. DiffusionShield protects images from
copyright infringement by GDMs through encoding the ownership information into
an imperceptible watermark and injecting it into the images. Its watermark can
be easily learned by GDMs and will be reproduced in their generated images. By
detecting the watermark from generated images, copyright infringement can be
exposed with evidence. Benefiting from the uniformity of the watermarks and the
joint optimization method, DiffusionShield ensures low distortion of the
original image, high watermark detection performance, and the ability to embed
lengthy messages. We conduct rigorous and comprehensive experiments to show the
effectiveness of DiffusionShield in defending against infringement by GDMs and
its superiority over traditional watermarking methods. The code for
DiffusionShield is accessible at
https://github.com/Yingqiancui/DiffusionShield. |
This paper proposes DiffusionShield, a novel watermarking scheme designed to protect image copyright against infringement by Generative Diffusion Models (GDMs). |
The rise of GDMs raises concerns about copyright infringement, as these models can easily replicate creative works without permission. Existing watermarking techniques are not designed for GDMs and often fail to be reproduced in generated images. |
DiffusionShield enhances "pattern uniformity" by employing a blockwise watermarking approach, where the same watermark pattern is applied to all images from the same owner. It also uses a joint optimization method to optimize the watermark patterns and the watermark detector simultaneously. |
DiffusionShield achieves high bit accuracy (close to 100%) in watermark detection on generated images, even with very small watermark budgets.
It demonstrates robustness against image corruptions and variations in GDM training hyperparameters.
The method is flexible for multiple-user scenarios, allowing new users to adopt the scheme without retraining. |
The current version of DiffusionShield focuses on protecting image data and may need further adaptation for other data modalities.
Future work could explore more advanced encoding and decoding techniques to further improve the capacity and robustness of the watermarking scheme. |
watermark, copyright protection, generative diffusion models, pattern uniformity, joint optimization |
2306.04619
Report |
ARTIC3D: Learning Robust Articulated 3D Shapes from Noisy Web Image Collections |
Chun-Han Yao, Amit Raj, Wei-Chih Hung, Yuanzhen Li, Michael Rubinstein, Ming-Hsuan Yang, Varun Jampani |
Estimating 3D articulated shapes like animal bodies from monocular images is
inherently challenging due to the ambiguities of camera viewpoint, pose,
texture, lighting, etc. We propose ARTIC3D, a self-supervised framework to
reconstruct per-instance 3D shapes from a sparse image collection in-the-wild.
Specifically, ARTIC3D is built upon a skeleton-based surface representation and
is further guided by 2D diffusion priors from Stable Diffusion. First, we
enhance the input images with occlusions/truncation via 2D diffusion to obtain
cleaner mask estimates and semantic features. Second, we perform
diffusion-guided 3D optimization to estimate shape and texture that are of
high-fidelity and faithful to input images. We also propose a novel technique
to calculate more stable image-level gradients via diffusion models compared to
existing alternatives. Finally, we produce realistic animations by fine-tuning
the rendered shape and texture under rigid part transformations. Extensive
evaluations on multiple existing datasets as well as newly introduced noisy web
image collections with occlusions and truncation demonstrate that ARTIC3D
outputs are more robust to noisy images, higher quality in terms of shape and
texture details, and more realistic when animated. Project page:
https://chhankyao.github.io/artic3d/ |
This paper presents ARTIC3D, a self-supervised framework for reconstructing 3D articulated animal shapes from sparse, noisy in-the-wild images, guided by 2D diffusion priors from Stable Diffusion. |
Creating articulated animal models is crucial for various applications, but existing methods struggle with noisy, real-world images. This work leverages the power of 2D diffusion priors to reconstruct high-quality, animatable 3D animals from limited, imperfect data. |
ARTIC3D uses a skeleton-based surface representation. It preprocesses images with diffusion to improve mask and feature estimates. It employs a novel Decoder-based Accumulative Score Sampling (DASS) for stable gradient calculation during 3D optimization. Finally, it refines animations using a temporal consistency loss. |
ARTIC3D demonstrates robustness to occlusions and truncation in images, outperforming baselines on keypoint transfer accuracy, particularly on the newly introduced noisy E-LASSIE dataset.
Qualitative results showcase detailed, realistic 3D shapes and textures, faithful to input images from both input and novel viewpoints.
The framework allows for realistic animation and applications like texture transfer due to its explicit part representation. |
ARTIC3D's reliance on accurate skeleton initialization can be limiting for heavily occluded images or animals with ambiguous skeletal structures.
The front-facing bias inherent in diffusion models can sometimes lead to unrealistic textures. |
3d reconstruction, diffusion models, articulated shapes, animal modeling, sparse image optimization |
2306.04396
Report |
Improving Diffusion-based Image Translation using Asymmetric Gradient Guidance |
Gihyun Kwon, Jong Chul Ye |
Diffusion models have shown significant progress in image translation tasks
recently. However, due to their stochastic nature, there's often a trade-off
between style transformation and content preservation. Current strategies aim
to disentangle style and content, preserving the source image's structure while
successfully transitioning from a source to a target domain under text or
one-shot image conditions. Yet, these methods often require computationally
intense fine-tuning of diffusion models or additional neural networks. To
address these challenges, here we present an approach that guides the reverse
process of diffusion sampling by applying asymmetric gradient guidance. This
results in quicker and more stable image manipulation for both text-guided and
image-guided image translation. Our model's adaptability allows it to be
implemented with both image- and latent-diffusion models. Experiments show that
our method outperforms various state-of-the-art models in image translation
tasks. |
This paper presents Asymmetric Gradient Guidance (AGG), a novel sampling approach for efficient and flexible image translation in both image- and latent-diffusion models. |
Existing image translation methods using diffusion models often struggle to balance style transformation with content preservation and can be computationally expensive. This work aims to address these limitations. |
AGG combines the strengths of MCG and DDS methods. It guides the reverse diffusion sampling process by first applying a single step of MCG for initial update, followed by computationally efficient DDS update using the Adam optimizer. A simpler structural regularization term based on intermediate products of the DDIM forward step helps preserve source image structure. |
AGG outperforms state-of-the-art models in text-guided image translation on Animals and Landscapes datasets, achieving better image quality (SFID, CSFID) and comparable content preservation (LPIPS) with faster sampling.
For image-guided translation, AGG demonstrates superior perceptual quality compared to existing appearance and style transfer methods.
AGG effectively adapts to latent diffusion models, enabling fast and accurate semantic image manipulation with better content preservation compared to methods like P2P and PnP. |
The method's performance can be limited when there's a large semantic gap between the source image and target text in the CLIP space (e.g., lion to building).
Future work could explore integrating better text embedding models to address this limitation. |
image translation, diffusion models, gradient guidance, text-to-image synthesis, image manipulation |
2306.04356
Report |
Fine-Grained Visual Prompting |
Lingfeng Yang, Yueze Wang, Xiang Li, Xinlong Wang, Jian Yang |
Vision-Language Models (VLMs), such as CLIP, have demonstrated impressive
zero-shot transfer capabilities in image-level visual perception. However,
these models have shown limited performance in instance-level tasks that demand
precise localization and recognition. Previous works have suggested that
incorporating visual prompts, such as colorful boxes or circles, can improve
the ability of models to recognize objects of interest. Nonetheless, compared
to language prompting, visual prompting designs are rarely explored. Existing
approaches, which employ coarse visual cues such as colorful boxes or circles,
often result in sub-optimal performance due to the inclusion of irrelevant and
noisy pixels. In this paper, we carefully study the visual prompting designs by
exploring more fine-grained markings, such as segmentation masks and their
variations. In addition, we introduce a new zero-shot framework that leverages
pixel-level annotations acquired from a generalist segmentation model for
fine-grained visual prompting. Consequently, our investigation reveals that a
straightforward application of blur outside the target mask, referred to as the
Blur Reverse Mask, exhibits exceptional effectiveness. This proposed prompting
strategy leverages the precise mask annotations to reduce focus on weakly
related regions while retaining spatial coherence between the target and the
surrounding background. Our Fine-Grained Visual Prompting (FGVP) demonstrates
superior performance in zero-shot comprehension of referring expressions on the
RefCOCO, RefCOCO+, and RefCOCOg benchmarks. It outperforms prior methods by an
average margin of 3.0% to 4.6%, with a maximum improvement of 12.5% on the
RefCOCO+ testA subset. Code is available at https://github.com/ylingfeng/FGVP. |
This paper proposes Fine-Grained Visual Prompting (FGVP), which uses precise semantic masks to guide Vision-Language Models (VLMs) for improved zero-shot instance-level understanding. |
Existing VLMs struggle with precise localization in tasks like referring expression comprehension and part detection, often relying on coarse visual prompts (e.g., boxes, circles). FGVP aims to overcome this limitation by providing fine-grained guidance. |
FGVP leverages a robust segmentation model (Segment Anything Model, SAM) to generate accurate semantic masks. These masks are then used to prompt VLMs, particularly focusing on a 'Blur Reverse Mask' strategy where the background is blurred to highlight the target. |
Blur Reverse Mask prompting consistently outperforms other visual prompting methods across multiple datasets.
FGVP achieves state-of-the-art zero-shot results on referring expression comprehension benchmarks (RefCOCO, RefCOCO+, RefCOCOg), surpassing previous methods by significant margins.
The proposed zero-shot pipeline also demonstrates superior performance in part detection on the PACO dataset compared to existing techniques. |
FGVP's reliance on a segmentation model increases inference time compared to methods using simpler prompts.
The current implementation does not yet explore joint optimization of visual and language prompts for potentially enhanced performance. |
visual prompting, vision-language models, referring expression comprehension, part detection, zero-shot learning |
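The Blur Reverse Mask strategy described for FGVP (blur everything outside the target mask, keep the target sharp) can be illustrated with a minimal sketch. It assumes a grayscale image stored as a list of rows and a binary mask of the same size, and it uses a simple mean filter as a stand-in for whatever blur the authors apply. The names `box_blur` and `blur_reverse_mask` are hypothetical and do not come from the FGVP code base.

```python
def box_blur(image, radius):
    """Blur a 2D grayscale image with a (2*radius+1) x (2*radius+1) mean filter."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            # Average over the window, clipped at the image border.
            for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                    total += image[yy][xx]
                    count += 1
            out[y][x] = total / count
    return out


def blur_reverse_mask(image, mask, radius=1):
    """Keep pixels inside the mask sharp and blur every pixel outside it."""
    blurred = box_blur(image, radius)
    h, w = len(image), len(image[0])
    return [
        [image[y][x] if mask[y][x] else blurred[y][x] for x in range(w)]
        for y in range(h)
    ]


# Toy 4x4 example: a bright 2x2 target on a dark background.
image = [[0, 0, 0, 0], [0, 9, 9, 0], [0, 9, 9, 0], [0, 0, 0, 0]]
mask = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
prompted = blur_reverse_mask(image, mask)
```

In a full pipeline the mask would come from a segmentation model and the prompted image would be passed to the VLM for scoring; both steps are omitted here.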
2306.04180
Report |
FusedRF: Fusing Multiple Radiance Fields |
Rahul Goel, Dhawal Sirikonda, Rajvi Shah, PJ Narayanan |
Radiance Fields (RFs) have shown great potential to represent scenes from
casually captured discrete views. Compositing parts or the whole of multiple
captured scenes is of great interest to several XR applications. Prior works can
generate new views of such scenes by tracing each scene in parallel. This
increases the render times and memory requirements with the number of
components. In this work, we provide a method to create a single, compact,
fused RF representation for a scene composited using multiple RFs. The fused RF
has the same render times and memory utilizations as a single RF. Our method
distills information from multiple teacher RFs into a single student RF while
also facilitating further manipulations like addition and deletion into the
fused representation. |
This paper introduces FusedRF, a method to fuse multiple Radiance Fields (RFs) into a single, compact RF representation for efficient rendering of composited scenes. |
Compositing scenes from multiple RFs currently requires parallel tracing, which increases rendering time and memory proportionally to the number of scenes. FusedRF addresses this by creating a single representation with the same efficiency as a single RF. |
The method distills information from multiple teacher RFs into a single student RF. It iteratively fuses source RFs with affine composition, using supervised losses on density and color values at sampled points. The process is sped up by pruning low-density points and initializing with the dominant scene's weights. |
FusedRF significantly reduces rendering time and memory consumption compared to rendering composited scenes with existing methods.
Quantitative results demonstrate that FusedRF maintains comparable visual quality to naive composition.
The method is applicable to various RF representations using explicit 3D lattices, including TensoRF, InstantNGP, DVGO, and Plenoxels. |
The paper primarily demonstrates fusion with TensoRF; further evaluation with other RF representations is needed.
Exploration of more complex compositions beyond affine transformations is a potential avenue for future work. |
radiance fields, scene composition, 3d reconstruction, neural rendering, distillation |
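The FusedRF distillation step (supervising a student RF on density and color at sampled points, with low-density points pruned) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the fused target is assumed to be the sum of teacher densities with a density-weighted color, the affine placement of each teacher is assumed to have been applied already, and `fused_targets`, `distill_loss`, and `min_density` are hypothetical names.

```python
def fused_targets(densities, colors):
    """Combine teacher samples at one 3D point.

    densities: one float per teacher; colors: one (r, g, b) tuple per teacher.
    Returns the summed density and the density-weighted color (an assumed rule).
    """
    sigma = sum(densities)
    if sigma == 0.0:
        return 0.0, (0.0, 0.0, 0.0)
    rgb = tuple(
        sum(d * c[k] for d, c in zip(densities, colors)) / sigma for k in range(3)
    )
    return sigma, rgb


def distill_loss(student, targets, min_density=1e-3):
    """Squared error on density and color, averaged over points that survive pruning."""
    total, kept = 0.0, 0
    for (s_sigma, s_rgb), (t_sigma, t_rgb) in zip(student, targets):
        if t_sigma < min_density:  # prune points where the teachers are empty
            continue
        total += (s_sigma - t_sigma) ** 2
        total += sum((a - b) ** 2 for a, b in zip(s_rgb, t_rgb))
        kept += 1
    return total / kept if kept else 0.0


# Two teachers queried at the same point: a faint red one and a denser green one.
target = fused_targets([1.0, 3.0], [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)])
loss_at_optimum = distill_loss([target], [target])
```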
2306.03881
Report |
Emergent Correspondence from Image Diffusion |
Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, Bharath Hariharan |
Finding correspondences between images is a fundamental problem in computer
vision. In this paper, we show that correspondence emerges in image diffusion
models without any explicit supervision. We propose a simple strategy to
extract this implicit knowledge out of diffusion networks as image features,
namely DIffusion FeaTures (DIFT), and use them to establish correspondences
between real images. Without any additional fine-tuning or supervision on the
task-specific data or annotations, DIFT is able to outperform both
weakly-supervised methods and competitive off-the-shelf features in identifying
semantic, geometric, and temporal correspondences. Particularly for semantic
correspondence, DIFT from Stable Diffusion is able to outperform DINO and
OpenCLIP by 19 and 14 accuracy points respectively on the challenging SPair-71k
benchmark. It even outperforms the state-of-the-art supervised methods on 9 out
of 18 categories while remaining on par for the overall performance. Project
page: https://diffusionfeatures.github.io |
This paper discovers and leverages the implicit correspondence learning capability within image diffusion models, introducing a novel feature extractor called DIFT (DIffusion FeaTures). |
Discovering implicit correspondence learning in diffusion models offers a new path towards robust and accurate correspondence estimation without the need for explicit supervision, which is crucial for tasks like 3D reconstruction, object tracking, and image editing. |
DIFT extracts correspondence information from pre-trained diffusion models by adding noise to real images to simulate the forward diffusion process. Then, intermediate layer activations from the model's U-Net are used as feature maps for correspondence matching. |
Without explicit supervision, DIFT outperforms weakly-supervised methods and other self-supervised features on semantic, geometric, and temporal correspondence benchmarks.
DIFT achieves state-of-the-art performance on semantic correspondence, even rivaling supervised methods on PF-WILLOW and certain SPair-71k categories.
The choice of time step during feature extraction significantly influences the type of correspondence captured, with larger time steps favoring semantic relationships. |
The reliance on potentially biased datasets like LAION for training diffusion models might lead to uneven performance across different image types.
DIFT's performance could be further enhanced through sophisticated adaptation mechanisms, such as combining features from various time steps and layers or fine-tuning with task-specific supervision. |
diffusion models, correspondence learning, self-supervision, feature extraction, image editing |
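The two ingredients of the DIFT recipe (noising a real image as in the forward diffusion process, then matching locations by feature similarity) can be sketched in a few lines. The sketch assumes the standard forward-process sample x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε and uses cosine-similarity nearest-neighbour matching. The U-Net feature extraction itself is not shown, so the feature vectors below are made-up stand-ins, and the function names are hypothetical.

```python
import math
import random


def add_noise(x0, alpha_bar_t, rng):
    """Forward diffusion sample: sqrt(a) * x0 + sqrt(1 - a) * eps with eps ~ N(0, 1)."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * v + b * rng.gauss(0.0, 1.0) for v in x0]


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(p * q for p, q in zip(u, v))
    norm_u = math.sqrt(sum(p * p for p in u))
    norm_v = math.sqrt(sum(q * q for q in v))
    return dot / (norm_u * norm_v + 1e-12)


def match(query_feature, target_features):
    """Index of the target location whose feature is most similar to the query."""
    return max(
        range(len(target_features)),
        key=lambda i: cosine(query_feature, target_features[i]),
    )


rng = random.Random(0)
noisy = add_noise([0.5, -0.5], alpha_bar_t=0.9, rng=rng)

# Stand-in features for three target locations and one query location.
target_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
best = match([0.1, 0.9], target_features)
```

In practice each location of the intermediate U-Net feature map would supply one feature vector, and the matching would run over all locations of the target image.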
2306.03436
Report |
Intellectual Property Protection of Diffusion Models via the Watermark Diffusion Process |
Sen Peng, Yufei Chen, Cong Wang, Xiaohua Jia |
Diffusion models have rapidly become a vital part of deep generative
architectures, given today's increasing demands. Obtaining large,
high-performance diffusion models demands significant resources, highlighting
their importance as intellectual property worth protecting. However, existing
watermarking techniques for ownership verification are insufficient when
applied to diffusion models. Very recent research in watermarking diffusion
models either exposes watermarks during task generation, which harms the
imperceptibility, or is developed for conditional diffusion models that require
prompts to trigger the watermark. This paper introduces WDM, a novel
watermarking solution for diffusion models without imprinting the watermark
during task generation. It involves training a model to concurrently learn a
Watermark Diffusion Process (WDP) for embedding watermarks alongside the
standard diffusion process for task generation. We provide a detailed
theoretical analysis of WDP training and sampling, relating it to a shifted
Gaussian diffusion process via the same reverse noise. Extensive experiments
are conducted to validate the effectiveness and robustness of our approach in
various trigger and watermark data configurations. |
This paper introduces WDM, a novel watermarking solution for diffusion models that embeds watermarks without affecting the task generation process. |
Protecting intellectual property of large diffusion models is crucial due to the significant resources required to train them and the potential for misuse, such as generating disinformation. |
WDM trains a model to learn a Watermark Diffusion Process (WDP) for embedding watermarks alongside the standard diffusion process for task generation. The WDP utilizes a trigger to generate distinct data distributions, enabling watermark extraction and verification. |
WDM achieves high watermark fidelity, allowing effective extraction and verification of embedded watermarks.
WDM demonstrates robustness against model compression and weight perturbation attacks.
The watermark remains detectable even when using DDIM sampling or varying the watermark extraction timesteps. |
WDM's robustness against model fine-tuning attacks is limited, especially when a large amount of data is used during fine-tuning.
Future work can explore methods to improve the robustness of WDM against more sophisticated watermark removal attacks. |
watermarking, diffusion models, intellectual property protection, deep generative models, watermark diffusion process |
2306.03253
Report |
Zero-Shot 3D Shape Correspondence |
Ahmed Abdelreheem, Abdelrahman Eldesokey, Maks Ovsjanikov, Peter Wonka |
We propose a novel zero-shot approach to computing correspondences between 3D
shapes. Existing approaches mainly focus on isometric and near-isometric shape
pairs (e.g., human vs. human), but less attention has been given to strongly
non-isometric and inter-class shape matching (e.g., human vs. cow). To this
end, we introduce a fully automatic method that exploits the exceptional
reasoning capabilities of recent foundation models in language and vision to
tackle difficult shape correspondence problems. Our approach comprises multiple
stages. First, we classify the 3D shapes in a zero-shot manner by feeding
rendered shape views to a language-vision model (e.g., BLIP2) to generate a
list of class proposals per shape. These proposals are unified into a single
class per shape by employing the reasoning capabilities of ChatGPT. Second, we
attempt to segment the two shapes in a zero-shot manner, but in contrast to the
co-segmentation problem, we do not require a mutual set of semantic regions.
Instead, we propose to exploit the in-context learning capabilities of ChatGPT
to generate two different sets of semantic regions for each shape and a
semantic mapping between them. This enables our approach to match strongly
non-isometric shapes with significant differences in geometric structure.
Finally, we employ the generated semantic mapping to produce coarse
correspondences that can further be refined by the functional maps framework to
produce dense point-to-point maps. Our approach, despite its simplicity,
produces highly plausible results in a zero-shot manner, especially between
strongly non-isometric shapes. Project webpage:
https://samir55.github.io/3dshapematch/. |
This paper proposes a novel zero-shot approach for computing correspondences between 3D shapes, particularly targeting strongly non-isometric and inter-class shape matching. |
Existing methods struggle with matching shapes across different classes and with significant geometric variations, limiting their application to understanding relationships between diverse 3D shapes. |
The method leverages foundation models in language and vision (BLIP2, ChatGPT, DINO, Segment-Anything) for:
- Zero-shot 3D shape classification
- Generating shape-specific semantic regions and mappings between them
- Zero-shot 3D semantic segmentation (SAM-3D)
- Dense correspondence refinement using functional maps. |
Achieves high zero-shot 3D object classification accuracy using BLIP2 and ChatGPT reasoning.
Generates accurate semantic regions and mappings using ChatGPT in-context learning, outperforming BLIP2.
Proposed SAM-3D outperforms existing zero-shot 3D segmentation methods in terms of region IoU and keypoint label matching. |
Current method focuses on coarse semantic regions, with finer-grained segmentation being a challenge for future work.
Developing foundation models specifically for 3D shapes and adapting functional maps for strongly non-isometric cases are potential future directions. |
3d shape correspondence, zero-shot learning, semantic segmentation, foundation models, non-isometric shape matching |
2306.03092
Report |
Neuralangelo: High-Fidelity Neural Surface Reconstruction |
Zhaoshuo Li, Thomas Müller, Alex Evans, Russell H. Taylor, Mathias Unberath, Ming-Yu Liu, Chen-Hsuan Lin |
Neural surface reconstruction has been shown to be powerful for recovering
dense 3D surfaces via image-based neural rendering. However, current methods
struggle to recover detailed structures of real-world scenes. To address the
issue, we present Neuralangelo, which combines the representation power of
multi-resolution 3D hash grids with neural surface rendering. Two key
ingredients enable our approach: (1) numerical gradients for computing
higher-order derivatives as a smoothing operation and (2) coarse-to-fine
optimization on the hash grids controlling different levels of details. Even
without auxiliary inputs such as depth, Neuralangelo can effectively recover
dense 3D surface structures from multi-view images with fidelity significantly
surpassing previous methods, enabling detailed large-scale scene reconstruction
from RGB video captures. |
This paper presents Neuralangelo, a framework for high-fidelity 3D surface reconstruction from multi-view images that works even without auxiliary data such as depth or segmentation. |
Current neural surface reconstruction methods struggle to recover fine details. Neuralangelo addresses this by combining multi-resolution 3D hash grids with neural surface rendering. |
Neuralangelo leverages the representation power of multi-resolution hash grids with two key components: (1) numerical gradients for computing higher-order derivatives, which act as a smoothing operation, and (2) coarse-to-fine optimization on the hash grids to control different levels of detail. |
Significantly surpasses previous methods in fidelity on DTU and Tanks and Temples benchmarks.
Enables detailed large-scale scene reconstruction from RGB video captures.
Progressively recovers more details as hash grid resolution increases during optimization. |
Sampling strategy could be improved for faster training.
Robustness for highly reflective scenes can be improved. |
3d reconstruction, neural rendering, hash grids, surface reconstruction, neuralangelo |
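The idea of numerical gradients mentioned for Neuralangelo can be illustrated with central finite differences on a signed distance function. This is a minimal sketch that uses an analytic sphere SDF as a stand-in for the learned network; `sphere_sdf`, `numerical_gradient`, and the step size `eps` are hypothetical names chosen for illustration. A larger `eps` makes each gradient depend on samples that are further apart, which is what gives the smoothing effect.

```python
import math


def sphere_sdf(p, radius=1.0):
    """Signed distance from point p to a sphere centred at the origin."""
    return math.sqrt(sum(c * c for c in p)) - radius


def numerical_gradient(f, p, eps):
    """Central-difference gradient of f at p, using step size eps along each axis."""
    grad = []
    for axis in range(len(p)):
        forward = list(p)
        backward = list(p)
        forward[axis] += eps
        backward[axis] -= eps
        grad.append((f(forward) - f(backward)) / (2.0 * eps))
    return grad


# The gradient of an SDF approximates the surface normal direction.
normal = numerical_gradient(sphere_sdf, [2.0, 0.0, 0.0], eps=0.1)
```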
2306.02949
Report |
INDigo: An INN-Guided Probabilistic Diffusion Algorithm for Inverse Problems |
Di You, Andreas Floros, Pier Luigi Dragotti |
Recently it has been shown that using diffusion models for inverse problems
can lead to remarkable results. However, these approaches require a closed-form
expression of the degradation model and cannot support complex degradations.
To overcome this limitation, we propose a method (INDigo) that combines
invertible neural networks (INN) and diffusion models for general inverse
problems. Specifically, we train the forward process of INN to simulate an
arbitrary degradation process and use the inverse as a reconstruction process.
During the diffusion sampling process, we impose an additional data-consistency
step that minimizes the distance between the intermediate result and the
INN-optimized result at every iteration, where the INN-optimized image is
composed of the coarse information given by the observed degraded image and the
details generated by the diffusion process. With the help of INN, our algorithm
effectively estimates the details lost in the degradation process and is no
longer limited by the requirement of knowing the closed-form expression of the
degradation model. Experiments demonstrate that our algorithm obtains
competitive results compared with recently leading methods both quantitatively
and visually. Moreover, our algorithm performs well on more complex degradation
models and real-world low-quality images. |
This paper introduces INDigo, an algorithm for inverse problems that combines Invertible Neural Networks (INN) and diffusion models. INDigo leverages INN to simulate the degradation process and uses a diffusion model to generate detailed reconstructions, effectively handling complex degradations without requiring a closed-form expression of the degradation model. |
Existing diffusion-based methods for inverse problems often struggle with complex or unknown degradation processes and can blur reconstructed details. This new method addresses these limitations, enabling high-quality reconstruction in challenging scenarios. |
The method trains a Wavelet-inspired INN (WINN) to decompose images into a coarse representation similar to degraded observations and lost details. During diffusion sampling, WINN guides the process by replacing intermediate coarse representations with observed data, ensuring data consistency while the diffusion model generates missing details. |
INDigo achieves state-of-the-art results compared to recent diffusion-based methods on super-resolution tasks, both with and without noise.
The method effectively handles complex degradation models, such as combined downsampling and JPEG compression, producing high-quality reconstructions with realistic details.
INDigo demonstrates strong performance on real-world image restoration tasks, reconstructing high-quality images from real degraded images with unknown degradation processes. |
The current implementation of INDigo focuses on image restoration tasks, and its extension to other inverse problems requires further investigation.
The computational complexity of INDigo remains relatively high due to the iterative diffusion process, and exploring optimization strategies for faster inference is a promising direction. |
inverse problems, diffusion models, invertible neural networks, image restoration, deep learning |
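The data-consistency idea described for INDigo (keep the details produced by the diffusion process, take the coarse part from the observation) can be sketched with a fixed one-level Haar transform standing in for the trained invertible network. This substitution is an assumption made for illustration only: the actual method learns the decomposition, and the function names below are hypothetical.

```python
def haar_forward(signal):
    """Split an even-length signal into coarse averages and detail differences."""
    coarse = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return coarse, detail


def haar_inverse(coarse, detail):
    """Exact inverse of haar_forward."""
    out = []
    for c, d in zip(coarse, detail):
        out.extend([c + d, c - d])
    return out


def data_consistency(intermediate, observed_coarse):
    """Keep the details of the intermediate sample, swap in the observed coarse part."""
    _, detail = haar_forward(intermediate)
    return haar_inverse(observed_coarse, detail)


intermediate = [4.0, 2.0, 1.0, 3.0]   # current diffusion estimate (toy 1D signal)
observed = [5.0, 0.0]                 # degraded observation, same size as the coarse part
consistent = data_consistency(intermediate, observed)
```

After this step the coarse part of `consistent` equals the observation exactly, while its detail coefficients are those of the intermediate estimate.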
2306.02903
Report |
Instruct-Video2Avatar: Video-to-Avatar Generation with Instructions |
Shaoxu Li |
We propose a method for synthesizing edited photo-realistic digital avatars
with text instructions. Given a short monocular RGB video and text
instructions, our method uses an image-conditioned diffusion model to edit one
head image and uses the video stylization method to accomplish the editing of
other head images. Through iterative training and update (three times or more),
our method synthesizes edited photo-realistic animatable 3D neural head avatars
with a deformable neural radiance field head synthesis method. In quantitative
and qualitative studies on various subjects, our method outperforms
state-of-the-art methods. |
This paper introduces Instruct-Video2Avatar, a novel approach for generating customizable, photorealistic, and animatable 3D head avatars from a short RGB video and text instructions. |
The method addresses the growing demand for personalized and stylized avatars for various applications, including VR/AR, by simplifying the avatar creation process and enabling users to customize avatars with text instructions. |
The method employs a three-step process: (1) edits an exemplar head image using an image-conditioned diffusion model (InstructPix2Pix) guided by text instructions, (2) propagates the edits to other frames in the video using a video stylization technique (EbSynth), (3) iteratively trains and updates a 3D neural head avatar (using INSTA) based on the edited images. |
The method generates high-quality, stylized avatars that preserve facial expressions and outperform existing techniques in terms of visual fidelity and temporal consistency.
The iterative dataset update strategy effectively minimizes inconsistencies and artifacts in the final rendered avatar.
A perceptual study confirms the superiority of Instruct-Video2Avatar compared to baseline approaches. |
Limitations include potential expression inconsistencies when applying large spatial manipulations and difficulties handling edits that introduce new objects.
Future work involves exploring techniques to address these limitations and improve the method's robustness and versatility. |
3d head avatar, text-guided editing, neural radiance fields, diffusion models, video stylization |
2306.02854
Report |
Asymmetric Patch Sampling for Contrastive Learning |
Chengchao Shen, Jianzhong Chen, Shu Wang, Hulin Kuang, Jin Liu, Jianxin Wang |
Asymmetric appearance between positive pairs effectively reduces the risk of
representation degradation in contrastive learning. However, positive pairs
constructed by existing methods still share a large amount of appearance
similarity, which inhibits further improvement of the representations. In
this paper, we propose a novel asymmetric patch sampling strategy for
contrastive learning, to further boost the appearance asymmetry for better
representations. Specifically, dual patch sampling strategies are applied to
the given image, to obtain asymmetric positive pairs. First, sparse patch
sampling is conducted to obtain the first view, which reduces spatial
redundancy of image and allows a more asymmetric view. Second, a selective
patch sampling is proposed to construct another view with large appearance
discrepancy relative to the first one. Because the positive pair shares
negligible appearance similarity, the trained model is encouraged to capture
semantic similarity rather than low-level similarity. Experimental results
demonstrate that our proposed method significantly outperforms the existing
self-supervised methods on both the ImageNet-1K and CIFAR datasets, e.g., a 2.5%
fine-tuning accuracy improvement on CIFAR100. Furthermore, our method achieves
state-of-the-art performance on downstream tasks, namely object detection and instance
segmentation on COCO. Additionally, compared to other self-supervised methods,
our method is more efficient in both memory and computation during training.
The source code is available at https://github.com/visresearch/aps. |
This paper proposes Asymmetric Patch Sampling (APS), a novel strategy for contrastive learning that constructs positive pairs with significant appearance differences but consistent semantics. |
Existing contrastive learning methods suffer from appearance similarities in positive pairs, hindering representation learning. This paper addresses this by maximizing appearance asymmetry while preserving semantic consistency. |
APS employs dual patch sampling strategies: sparse sampling for reduced spatial redundancy and selective sampling for minimizing overlapping patches between views. This encourages the model to learn semantic representations to minimize the contrastive objective. |
APS significantly outperforms previous state-of-the-art self-supervised methods on ImageNet-1K and CIFAR datasets.
The method achieves state-of-the-art performance on downstream tasks like object detection and instance segmentation on COCO.
APS demonstrates greater efficiency in memory and computation compared to other self-supervised methods. |
The paper primarily focuses on spatial asymmetry and could explore other forms of asymmetry.
Future work can investigate extending APS to other self-supervised learning paradigms beyond contrastive learning. |
contrastive learning, self-supervised learning, asymmetric patch sampling, image representation learning, computer vision |
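The dual sampling idea behind APS (a sparse first view, then a second view chosen to overlap with it as little as possible) can be sketched on patch indices. The sketch makes a simplifying assumption that both views are drawn from the same patch grid of one image, whereas the paper works with views of different appearance; `asymmetric_patch_views`, `keep_ratio`, and `max_overlap` are hypothetical names.

```python
import random


def asymmetric_patch_views(num_patches, keep_ratio, max_overlap, rng):
    """Return two lists of patch indices whose shared fraction is at most max_overlap.

    If too few unused patches remain, the overlap grows only as much as needed.
    """
    k = max(1, int(num_patches * keep_ratio))
    indices = list(range(num_patches))

    # Sparse sampling: the first view keeps only k of the patches.
    view_a = rng.sample(indices, k)
    taken = set(view_a)
    unused = [i for i in indices if i not in taken]

    # Selective sampling: fill the second view from unused patches first.
    n_shared = min(int(k * max_overlap), k)
    n_fresh = min(k - n_shared, len(unused))
    n_shared = k - n_fresh
    view_b = rng.sample(unused, n_fresh) + rng.sample(view_a, n_shared)
    return view_a, view_b


rng = random.Random(0)
# 14 x 14 = 196 patches, keep 25% per view, allow no shared patches.
view_a, view_b = asymmetric_patch_views(196, 0.25, 0.0, rng)
```

Because each view holds only a fraction of the patches, the encoder processes fewer tokens per image, which is consistent with the memory and compute savings reported in the summary.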
2306.02850
Report |
TRACE: 5D Temporal Regression of Avatars with Dynamic Cameras in 3D Environments |
Yu Sun, Qian Bao, Wu Liu, Tao Mei, Michael J. Black |
Although the estimation of 3D human pose and shape (HPS) is rapidly
progressing, current methods still cannot reliably estimate moving humans in
global coordinates, which is critical for many applications. This is
particularly challenging when the camera is also moving, entangling human and
camera motion. To address these issues, we adopt a novel 5D representation
(space, time, and identity) that enables end-to-end reasoning about people in
scenes. Our method, called TRACE, introduces several novel architectural
components. Most importantly, it uses two new "maps" to reason about the 3D
trajectory of people over time in both camera and world coordinates. An additional
memory unit enables persistent tracking of people even during long occlusions.
TRACE is the first one-stage method to jointly recover and track 3D humans in
global coordinates from dynamic cameras. By training it end-to-end, and using
full image information, TRACE achieves state-of-the-art performance on tracking
and HPS benchmarks. The code and dataset are released for research purposes. |
TRACE is a novel one-stage method for tracking and recovering 3D human motion from videos captured by moving cameras. |
Recovering the 3D motion of humans in a global coordinate frame is critical for applications like computer graphics, sports analysis, and XR. |
TRACE introduces a holistic 5D representation (space, time, identity) and leverages novel "maps" to reason about human trajectories across time in both camera and world coordinates. A memory unit is incorporated to handle long-term occlusions. |
TRACE outperforms previous methods in estimating global 3D human trajectories from videos with dynamic cameras.
It achieves state-of-the-art results in tracking people, particularly under long-term occlusions.
TRACE demonstrates the effectiveness of learning a holistic 5D representation for this task. |
The synthetic camera motion used to generate the DynaCam dataset may not fully capture the complexity of real-world camera movement.
Future work should investigate explicitly estimating camera motion for improved global trajectory recovery. |
3d human pose estimation, human motion tracking, dynamic cameras, 5d representation, temporal reasoning |
2306.02741
Report |
ZIGNeRF: Zero-shot 3D Scene Representation with Invertible Generative Neural Radiance Fields |
Kanghyeok Ko, Minhyeok Lee |
Generative Neural Radiance Fields (NeRFs) have demonstrated remarkable
proficiency in synthesizing multi-view images by learning the distribution of a
set of unposed images. Despite the aptitude of existing generative NeRFs in
generating 3D-consistent high-quality random samples within data distribution,
the creation of a 3D representation of a singular input image remains a
formidable challenge. In this manuscript, we introduce ZIGNeRF, an innovative
model that executes zero-shot Generative Adversarial Network (GAN) inversion
for the generation of multi-view images from a single out-of-domain image. The
model is underpinned by a novel inverter that maps out-of-domain images into
the latent code of the generator manifold. Notably, ZIGNeRF is capable of
disentangling the object from the background and executing 3D operations such
as 360-degree rotation or depth and horizontal translation. The efficacy of our
model is validated using multiple real-image datasets: Cats, AFHQ, CelebA,
CelebA-HQ, and CompCars. |
This paper presents ZIGNeRF, a novel approach for generating multi-view images from single, out-of-domain images using a 3D-aware zero-shot GAN inversion technique. |
Existing generative NeRF models struggle to create 3D representations of single, out-of-domain images without computationally expensive fine-tuning. |
Combines a 3D generation module (based on GIRAFFE with enhancements) with a 3D-aware GAN inversion module trained on synthesized images to map input images to the generator's latent space. |
Successfully generates multi-view consistent images from out-of-domain images across various datasets (Cats, AFHQ, CelebA, CelebA-HQ, CompCars).
Demonstrates 3D controllability, including 360-degree rotation and object disentanglement.
Shows robust adaptation capabilities, generating plausible multi-view images from FFHQ images using a model trained on CelebA-HQ. |
Generation of multiple objects in a single scene was explored only on the CompCars dataset.
Future work includes enabling image editing by manipulating the inverted latent code. |
generative neural radiance fields, gan inversion, multi-view synthesis, zero-shot learning, 3d image representation |
2306.02583
Report |
Stable Diffusion is Unstable |
Chengbin Du, Yanxi Li, Zhongwei Qiu, Chang Xu |
Recently, text-to-image models have been thriving. Despite their powerful
generative capacity, our research has uncovered a lack of robustness in this
generation process. Specifically, the introduction of small perturbations to
the text prompts can result in the blending of primary subjects with other
categories or their complete disappearance in the generated images. In this
paper, we propose Auto-attack on Text-to-image Models (ATM), a gradient-based
approach, to effectively and efficiently generate such perturbations. By
learning a Gumbel Softmax distribution, we can make the discrete process of
word replacement or extension continuous, thus ensuring the differentiability
of the perturbation generation. Once the distribution is learned, ATM can
sample multiple attack samples simultaneously. These attack samples can prevent
the generative model from generating the desired subjects without compromising
image quality. ATM has achieved a 91.1% success rate in short-text attacks and
an 81.2% success rate in long-text attacks. Further empirical analysis revealed
four attack patterns based on: 1) the variability in generation speed, 2) the
similarity of coarse-grained characteristics, 3) the polysemy of words, and 4)
the positioning of words. |
The paper proposes ATM (Auto-attack on Text-to-image Models), a gradient-based approach to generate attack prompts against text-to-image models, causing them to fail in generating desired subjects. |
This is important for revealing vulnerabilities in text-to-image generation pipelines and inspiring research on attack/defense mechanisms to improve their robustness. |
ATM uses a Gumbel Softmax distribution to enable differentiable word replacements or extensions in text prompts. It incorporates fluency and semantic similarity constraints during optimization and utilizes a margin loss to minimize the classifier's confidence in the true class. |
ATM achieves a 91.1% success rate in short-text attacks and 81.2% in long-text attacks.
Four attack patterns are identified: variability in generation speed, similarity of coarse-grained characteristics, polysemy of words, and positioning of words.
Generated attack prompts are transferable to other models like DALL·E2 and Midjourney (black-box attacks). |
The paper mainly focuses on attacking Stable Diffusion; applying ATM to other text-to-image models is left for future work.
The impact of different decoding strategies on attack performance could be further explored. |
text-to-image generation, adversarial attacks, stable diffusion, vulnerability analysis, robustness |
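The Gumbel Softmax relaxation mentioned in the ATM summary above can be illustrated with a minimal sketch. This is not the authors' implementation: the candidate words, logits, and temperatures below are hypothetical, and the fluency constraint, semantic similarity constraint, and margin loss are omitted. It only shows how adding Gumbel noise and applying a temperature-scaled softmax turns a discrete word choice into a continuous weight vector.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Draw one relaxed (continuous) sample over candidate words."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise via inverse transform: g = -log(-log(u)).
        # u is clamped away from 0 and 1 so both logarithms are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    # Numerically stable softmax over the perturbed, scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate replacements and learnable logits (illustrative only).
candidates = ["cat", "car", "cap", "cab"]
logits = [2.0, 0.5, 0.1, -1.0]
rng = random.Random(0)

# A higher temperature gives a smoother sample; a lower one is closer to one-hot.
soft = gumbel_softmax(logits, temperature=1.0, rng=rng)
sharp = gumbel_softmax(logits, temperature=0.05, rng=rng)
chosen = candidates[max(range(len(sharp)), key=sharp.__getitem__)]
```

In a gradient-based setting the logits would be the learned parameters, and repeated calls would yield multiple attack samples from the same learned distribution.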
2306.02245
Report |
SAM3D: Zero-Shot 3D Object Detection via Segment Anything Model |
Dingyuan Zhang, Dingkang Liang, Hongcheng Yang, Zhikang Zou, Xiaoqing Ye, Zhe Liu, Xiang Bai |
With the development of large language models, many remarkable linguistic
systems like ChatGPT have thrived and achieved astonishing success on many
tasks, showing the incredible power of foundation models. In the spirit of
unleashing the capability of foundation models on vision tasks, the Segment
Anything Model (SAM), a vision foundation model for image segmentation, has
been proposed recently and presents strong zero-shot ability on many downstream
2D tasks. However, whether SAM can be adapted to 3D vision tasks has yet to be
explored, especially 3D object detection. With this inspiration, we explore
adapting the zero-shot ability of SAM to 3D object detection in this paper. We
propose a SAM-powered BEV processing pipeline to detect objects and get
promising results on the large-scale Waymo open dataset. As an early attempt,
our method takes a step toward 3D object detection with vision foundation
models and presents the opportunity to unleash their power on 3D vision tasks.
The code is released at https://github.com/DYZhang09/SAM3D. |
This paper presents SAM3D, a method for zero-shot 3D object detection using the Segment Anything Model (SAM) by leveraging Bird's Eye View (BEV) representations of LiDAR data. |
Exploring zero-shot 3D object detection is crucial for practical applications due to the high cost of 3D data annotation. This work investigates the potential of powerful vision foundation models like SAM for 3D vision tasks. |
SAM3D projects LiDAR points to BEV images, enhances them to better fit SAM's training domain, employs SAM for segmentation with mesh grid prompts, applies rule-based post-processing to filter noisy masks, and finally predicts 3D bounding boxes by leveraging depth information from BEV and LiDAR points. |
SAM3D demonstrates the capability of SAM to segment objects in BEV images without any 3D training data, showcasing promising zero-shot detection ability.
Using reflection intensity and a predefined color palette for BEV generation significantly improves the performance compared to binary or grayscale BEV representations.
Post-processing techniques, including morphological dilation for BEV and area/aspect ratio filtering for masks, are crucial for bridging the domain gap and enhancing detection results. |
SAM3D's reliance on BEV might limit its applicability in indoor scenes with vertical object stacking.
The inference speed, while improved, is still limited by SAM's complexity, especially with a large number of prompts. |
zero-shot learning, 3d object detection, lidar, segment anything model (sam), bird's eye view (bev) |
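The first step of the SAM3D pipeline summarized above, projecting LiDAR points to a BEV image, can be sketched as a simple rasterization. This is an illustrative sketch under stated assumptions, not the released code: the function name, the region of interest, the cell size, and the choice to keep the strongest reflection intensity per cell are all hypothetical, and the color palette mapping and SAM segmentation steps are left out.

```python
def lidar_to_bev(points, x_range, y_range, cell_size):
    """Rasterize (x, y, z, intensity) points into a top-down intensity grid."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    cols = int(round((x_max - x_min) / cell_size))
    rows = int(round((y_max - y_min) / cell_size))
    grid = [[0.0] * cols for _ in range(rows)]
    for x, y, _z, intensity in points:
        if not (x_min <= x < x_max and y_min <= y < y_max):
            continue  # drop points outside the region of interest
        # Clamp indices to guard against floating-point edge cases.
        col = min(int((x - x_min) / cell_size), cols - 1)
        row = min(int((y - y_min) / cell_size), rows - 1)
        # Assumption: keep the strongest return that falls in each cell.
        grid[row][col] = max(grid[row][col], intensity)
    return grid


# Hypothetical points: (x, y, z, reflection intensity).
points = [
    (0.5, 0.5, 0.0, 0.2),
    (0.6, 0.4, 1.0, 0.9),   # same cell as the first point, stronger return
    (3.5, 1.5, 0.0, 0.4),
    (10.0, 0.0, 0.0, 1.0),  # outside the x range, so it is ignored
]
bev = lidar_to_bev(points, x_range=(0.0, 4.0), y_range=(0.0, 2.0), cell_size=1.0)
```

The resulting grid would then be colorized and passed to a segmenter; the height dimension is discarded here, which is why vertically stacked objects are hard to separate in a BEV representation.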
2306.02236
Report |
Detector Guidance for Multi-Object Text-to-Image Generation |
Luping Liu, Zijian Zhang, Yi Ren, Rongjie Huang, Xiang Yin, Zhou Zhao |
Diffusion models have demonstrated impressive performance in text-to-image
generation. They utilize a text encoder and cross-attention blocks to infuse
textual information into images at a pixel level. However, their capability to
generate images with text containing multiple objects is still restricted.
Previous works identify the problem of information mixing in the CLIP text
encoder and introduce the T5 text encoder or incorporate strong prior knowledge
to assist with the alignment. We find that mixing problems also occur on the
image side and in the cross-attention blocks. The noisy images can cause
different objects to appear similar, and the cross-attention blocks inject
information at a pixel level, leading to leakage of global object understanding
and resulting in object mixing. In this paper, we introduce Detector Guidance
(DG), which integrates a latent object detection model to separate different
objects during the generation process. DG first performs latent object
detection on cross-attention maps (CAMs) to obtain object information. Based on
this information, DG then masks conflicting prompts and enhances related
prompts by manipulating the following CAMs. We evaluate the effectiveness of DG
using Stable Diffusion on COCO, CC, and a novel multi-related object benchmark,
MRO. Human evaluations demonstrate that DG provides an 8-22% advantage in
preventing the amalgamation of conflicting concepts and ensuring that each
object possesses its unique region without any human involvement and additional
iterations. Our implementation is available at
https://github.com/luping-liu/Detector-Guidance. |
This paper introduces Detector Guidance (DG), a method that integrates a latent object detection model into pre-trained diffusion models to improve the generation of images with multiple objects. |
Existing text-to-image diffusion models struggle with generating images containing multiple objects due to information mixing problems, leading to attribute mixing, object mixing, and object disappearance. |
DG uses a latent object detection model trained on cross-attention maps (CAMs) to identify objects during the image generation process. It then leverages object information to correct CAMs by masking conflicting prompts and enhancing related prompts, improving object separation and attribute alignment. |
DG achieves 8-22% improvement in preventing the mixing of conflicting concepts and ensures each object has its unique region.
The latent object detection model, trained on COCO, exhibits good generalization to unseen categories.
DG shows improvement in both FID and CLIP-score when the guidance scale is larger than 3. |
Limited improvement in practice despite the theoretical importance of Smooth Involvement.
Reliance on language parsers, which can sometimes introduce errors. |
diffusion models, text-to-image generation, object detection, cross-attention, multi-object generation |
2306.02083
Report |
Efficient Text-Guided 3D-Aware Portrait Generation with Score Distillation Sampling on Distribution |
Yiji Cheng, Fei Yin, Xiaoke Huang, Xintong Yu, Jiaxiang Liu, Shikun Feng, Yujiu Yang, Yansong Tang |
Text-to-3D is an emerging task that allows users to create 3D content with
infinite possibilities. Existing works tackle the problem by optimizing a 3D
representation with guidance from pre-trained diffusion models. An apparent
drawback is that they need to optimize from scratch for each prompt, which is
computationally expensive and often yields poor visual fidelity. In this paper,
we propose DreamPortrait, which aims to generate text-guided 3D-aware portraits
in a single-forward pass for efficiency. To achieve this, we extend Score
Distillation Sampling from datapoint to distribution formulation, which injects
semantic prior into a 3D distribution. However, the direct extension will lead
to the mode collapse problem since the objective only pursues semantic
alignment. Hence, we propose to optimize a distribution with hierarchical
condition adapters and GAN loss regularization. For better 3D modeling, we
further design a 3D-aware gated cross-attention mechanism to explicitly let the
model perceive the correspondence between the text and the 3D-aware space.
These elaborated designs enable our model to generate portraits with robust
multi-view semantic consistency, eliminating the need for optimization-based
methods. Extensive experiments demonstrate our model's highly competitive
performance and significant speed boost against existing methods. |
Proposes DreamPortrait, a method for efficient text-guided 3D-aware portrait generation by extending Score Distillation Sampling (SDS) to a distribution formulation. |
Existing text-to-3D methods are computationally expensive, requiring optimization from scratch for each prompt, and often yield poor visual fidelity, especially in multi-view consistency. |
Extends SDS to optimize a 3D-aware distribution using hierarchical condition adapters to inject textual information and GAN loss regularization to prevent mode collapse. Employs a 3D-aware gated cross-attention mechanism to enhance multi-view consistency. |
Significantly faster than optimization-based methods, achieving ~15 FPS generation speed.
Generates higher-quality 3D portraits with better multi-view semantic consistency compared to two-stage methods.
Demonstrates robust generalization ability by effectively handling out-of-distribution text prompts. |
Limited to generating avatars and cannot handle general 3D scenes or objects.
Future work includes expanding modeling capabilities to broader applications beyond avatars. |
text-to-3d, score distillation sampling, 3d-aware portrait generation, multi-view consistency, generative adversarial networks |
2306.02080
Report |
Benchmarking Robustness of Adaptation Methods on Pre-trained Vision-Language Models |
Shuo Chen, Jindong Gu, Zhen Han, Yunpu Ma, Philip Torr, Volker Tresp |
Various adaptation methods, such as LoRA, prompts, and adapters, have been
proposed to enhance the performance of pre-trained vision-language models in
specific domains. The robustness of these adaptation methods against
distribution shifts has not been studied. In this study, we assess the
robustness of 11 widely-used adaptation methods across 4 vision-language
datasets under multimodal corruptions. Concretely, we introduce 7 benchmark
datasets, including 96 visual and 87 textual corruptions, to investigate the
robustness of different adaptation methods, the impact of available adaptation
examples, and the influence of trainable parameter size during adaptation. Our
analysis reveals that: 1) Adaptation methods are more sensitive to text
corruptions than visual corruptions. 2) Full fine-tuning does not consistently
provide the highest robustness; instead, adapters can achieve better robustness
with comparable clean performance. 3) Contrary to expectations, our findings
indicate that increasing the number of adaptation data and parameters does not
guarantee enhanced robustness; instead it results in even lower robustness. We
hope this study could benefit future research in the development of robust
multimodal adaptation methods. The benchmark, code, and dataset used in this
study can be accessed at https://adarobustness.github.io . |
This paper introduces a large-scale benchmark to evaluate the robustness of different adaptation methods for pre-trained vision-language models under multimodal corruptions, including variations in lighting conditions in images and typos in texts. |
Robustness against distribution shifts in vision-language models is crucial for real-world applications, especially in safety-critical domains like self-driving systems and clinical diagnostics. |
The authors introduce a benchmark with 96 visual and 87 textual corruptions across 4 VL datasets. They evaluate 11 widely-used adaptation methods, analyzing the impact of adaptation examples and trainable parameter size. |
Adaptation methods are more sensitive to text corruptions than visual corruptions.
Full fine-tuning doesn't guarantee the highest robustness; adapters can achieve better robustness with comparable clean performance.
Increasing adaptation data and parameters doesn't guarantee enhanced robustness; it can even lead to lower robustness. |
The analysis covers only a small number of multimodal models because usable code and model weights are scarce.
Future work includes investigating more diverse VL models and designing more robust adaptation methods. |
robustness, vision-language models, adaptation methods, multimodal corruptions, benchmarking |
2306.02064
Report |
Flew Over Learning Trap: Learn Unlearnable Samples by Progressive Staged Training |
Pucheng Dang, Xing Hu, Kaidi Xu, Jinhao Duan, Di Huang, Husheng Han, Rui Zhang, Zidong Du, Qi Guo, Yunji Chen |
Unlearning techniques have been proposed to prevent third parties from
exploiting unauthorized data; they generate unlearnable samples by adding
imperceptible perturbations to data before public release. These unlearnable
samples effectively misguide model training into learning perturbation features
while ignoring image semantic features. We conduct an in-depth analysis and
observe that models can learn both image features and perturbation features of
unlearnable samples at an early stage, but quickly enter an overfitting stage
because the shallow layers tend to overfit on perturbation features.
Based on the observations, we propose Progressive Staged
Training to effectively prevent models from overfitting in learning
perturbation features. We evaluated our method on multiple model architectures
over diverse datasets, e.g., CIFAR-10, CIFAR-100, and ImageNet-mini. Our method
circumvents the unlearnability of all state-of-the-art methods in the
literature and provides a reliable baseline for further evaluation of
unlearnable techniques. |
This paper proposes a novel training framework called Progressive Staged Training (ST) to defeat unlearnable samples, a data protection method. |
Unlearnable samples, by injecting imperceptible perturbations into training data, mislead models into learning perturbation features instead of valuable semantic features, limiting their practical use. This work aims to circumvent this protection and provide a reliable baseline for evaluating unlearnable techniques. |
ST utilizes an Activation Cluster Measurement (ACM) to identify model overfitting on perturbation features. It then adjusts learning rates progressively, slowing down shallow layer learning to resist overfitting. The authors also investigate the effectiveness of color-jitter and gray-scale augmentation (CG). |
ST significantly improves model accuracy on unlearnable samples across various datasets (CIFAR-10, CIFAR-100, ImageNet-mini) and model architectures (ResNet, VGG, DenseNet, WideResNet).
The augmentation CG further enhances ST's performance, making perturbations less effective even when CG is used during their generation.
Analysis of activation patterns and loss landscape demonstrates that ST effectively prevents overfitting on perturbation features. |
The paper lacks a clear explanation of why CG augmentation weakens the protection of unlearnable perturbations.
Further investigation on the effectiveness of ST on different mixing ratios of unlearnable samples and clean data is required. |
unlearnable samples, data protection, overfitting, staged training, data augmentation |
2306.02000
Report |
Context-PIPs: Persistent Independent Particles Demands Spatial Context Features |
Weikang Bian, Zhaoyang Huang, Xiaoyu Shi, Yitong Dong, Yijin Li, Hongsheng Li |
We tackle the problem of Persistent Independent Particles (PIPs), also called
Tracking Any Point (TAP), in videos, which specifically aims at estimating
persistent long-term trajectories of query points in videos. Previous methods
attempted to estimate these trajectories independently to incorporate longer
image sequences, therefore, ignoring the potential benefits of incorporating
spatial context features. We argue that independent video point tracking also
demands spatial context features. To this end, we propose a novel framework
Context-PIPs, which effectively improves point trajectory accuracy by
aggregating spatial context features in videos. Context-PIPs contains two main
modules: 1) a SOurce Feature Enhancement (SOFE) module, and 2) a TArget Feature
Aggregation (TAFA) module. Context-PIPs significantly improves on PIPs across
the board, reducing the Average Trajectory Error of Occluded Points (ATE-Occ)
on CroHD by 11.4% and increasing the Average Percentage of Correct Keypoint
(A-PCK) on TAP-Vid-Kinetics by 11.8%. Demos are available at
https://wkbian.github.io/Projects/Context-PIPs/. |
The paper introduces Context-PIPs, a novel framework for enhancing Persistent Independent Particles (PIPs) in video point tracking by incorporating spatial context features from both source and target frames. |
Existing methods for video point tracking, like PIPs, primarily focus on temporal information while neglecting valuable spatial context, limiting accuracy and robustness, especially in challenging scenarios like occlusions or texture-less regions. |
Context-PIPs extends PIPs with two key modules: SOFE (Source Feature Enhancement) and TAFA (Target Feature Aggregation). SOFE leverages self-similarity in the source frame to guide the sampling of auxiliary features, enriching the representation of the query point. TAFA utilizes cross-attention between augmented correlation features and target frame features to aggregate relevant context, further improving point trajectory refinement. |
Context-PIPs achieves state-of-the-art performance on four benchmarks: FlyingThings++, CroHD, TAP-Vid-DAVIS, and TAP-Vid-Kinetics, demonstrating significant improvements over previous methods like PIPs and TAP-Net.
The ablation study confirms the effectiveness of both SOFE and TAFA modules in enhancing point tracking accuracy.
Context-PIPs exhibits high efficiency, achieving superior results even with fewer parameters and computations compared to PIPs. |
Context-PIPs currently relies on a sliding window approach for tracking, limiting its ability to re-identify points that become lost.
Future work will explore methods for re-identifying lost points when they reappear in the video. |
video point tracking, persistent independent particles (pips), spatial context, source feature enhancement, target feature aggregation |
2306.01923
Report |
The Surprising Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation |
Saurabh Saxena, Charles Herrmann, Junhwa Hur, Abhishek Kar, Mohammad Norouzi, Deqing Sun, David J. Fleet |
Denoising diffusion probabilistic models have transformed image generation
with their impressive fidelity and diversity. We show that they also excel in
estimating optical flow and monocular depth, surprisingly, without
task-specific architectures and loss functions that are predominant for these
tasks. Compared to the point estimates of conventional regression-based
methods, diffusion models also enable Monte Carlo inference, e.g., capturing
uncertainty and ambiguity in flow and depth. With self-supervised pre-training,
the combined use of synthetic and real data for supervised training, and
technical innovations (infilling and step-unrolled denoising diffusion
training) to handle noisy-incomplete training data, and a simple form of
coarse-to-fine refinement, one can train state-of-the-art diffusion models for
depth and optical flow estimation. Extensive experiments focus on quantitative
performance against benchmarks, ablations, and the model's ability to capture
uncertainty and multimodality, and impute missing values. Our model, DDVM
(Denoising Diffusion Vision Model), obtains a state-of-the-art relative depth
error of 0.074 on the indoor NYU benchmark and an Fl-all outlier rate of 3.26%
on the KITTI optical flow benchmark, about 25% better than the best published
method. For an overview see https://diffusion-vision.github.io. |
The paper introduces a denoising diffusion model for optical flow and monocular depth estimation, using an image-to-image translation framework without task-specific architectures or loss functions. |
This approach offers several advantages over traditional regression-based methods, including the ability to capture uncertainty and multimodality, and impute missing values. |
The model is trained using a combination of self-supervised pretraining and supervised training on synthetic and real data. Key technical innovations include infilling, step-unrolled denoising diffusion training, and coarse-to-fine refinement. |
The model achieves state-of-the-art results on optical flow benchmarks, with a 3.26% Fl-all outlier rate on KITTI, significantly outperforming the best published method.
For monocular depth estimation, the model obtains a state-of-the-art relative depth error of 0.074 on the indoor NYU benchmark.
The diffusion model effectively captures multimodality and uncertainty in both depth and optical flow, enabling it to handle challenging cases such as transparent, reflective, and occluded regions. |
Diffusion models are computationally expensive compared to traditional methods, requiring many denoising steps during inference, which leads to longer inference times.
While achieving state-of-the-art on zero-shot optical flow estimation for both Sintel and KITTI, the model falls behind FlowFormer on Sintel after finetuning. Possible reasons include a fine-tuning procedure better suited for KITTI, and a significant domain gap between training and testing data for Sintel compared to KITTI. |
diffusion models, optical flow estimation, monocular depth estimation, image-to-image translation, generative models |
2306.01900
Report |
Conditional Generation from Unconditional Diffusion Models using Denoiser Representations |
Alexandros Graikos, Srikar Yellapragada, Dimitris Samaras |
Denoising diffusion models have gained popularity as a generative modeling
technique for producing high-quality and diverse images. Applying these models
to downstream tasks requires conditioning, which can take the form of text,
class labels, or other forms of guidance. However, providing conditioning
information to these models can be challenging, particularly when annotations
are scarce or imprecise. In this paper, we propose adapting pre-trained
unconditional diffusion models to new conditions using the learned internal
representations of the denoiser network. We demonstrate the effectiveness of
our approach on various conditional generation tasks, including
attribute-conditioned generation and mask-conditioned generation. Additionally,
we show that augmenting the Tiny ImageNet training set with synthetic images
generated by our approach improves the classification accuracy of ResNet
baselines by up to 8%. Our approach provides a powerful and flexible way to
adapt diffusion models to new conditions and generate high-quality augmented
data for various conditional generation tasks. |
This paper proposes a method to adapt pre-trained unconditional diffusion models to new conditions (e.g., attributes, masks) using the learned internal representations of the denoiser network. |
This is important because it allows for conditional image generation even when annotations are scarce or imprecise, eliminating the need for extensive labeled data for training guidance classifiers. |
The method leverages the denoiser's intermediate features to train a guidance network, exploiting the denoiser's robustness to noisy inputs and ability to learn from limited data. This guidance network then modifies the diffusion process to generate images aligned with the desired conditions. For larger datasets, the method combines this guidance with fine-tuning and rejection sampling to further enhance image quality. |
The method achieves comparable FID scores to state-of-the-art methods for few-shot attribute-conditioned generation on CelebA-64.
It outperforms baseline approaches in few-shot segmentation-conditioned generation on CelebA-Mask, achieving better mIoU and FID scores.
Augmenting Tiny ImageNet with synthetic images generated by this method significantly improves classification accuracy (up to 8%) over ResNet baselines, demonstrating its potential for data augmentation. |
The method relies on the assumption that the denoiser's estimates of the final image become accurate relatively early in the denoising process.
Future work could explore methods for controlling the guidance strength during sampling to better balance image quality and diversity. |
diffusion models, conditional image generation, few-shot learning, data augmentation, image classification |
2306.01721
Report |
Denoising Diffusion Semantic Segmentation with Mask Prior Modeling |
Zeqiang Lai, Yuchen Duan, Jifeng Dai, Ziheng Li, Ying Fu, Hongsheng Li, Yu Qiao, Wenhai Wang |
The evolution of semantic segmentation has long been dominated by learning
more discriminative image representations for classifying each pixel. Despite
the prominent advancements, the priors of segmentation masks themselves, e.g.,
geometric and semantic constraints, are still under-explored. In this paper, we
propose to ameliorate the semantic segmentation quality of existing
discriminative approaches with a mask prior modeled by a recently-developed
denoising diffusion generative model. Beginning with a unified architecture
that adapts diffusion models for mask prior modeling, we focus this work on a
specific instantiation with discrete diffusion and identify a variety of key
design choices for its successful application. Our exploratory analysis
revealed several important findings, including: (1) a simple integration of
diffusion models into semantic segmentation is not sufficient, and a
poorly-designed diffusion process might lead to degradation in segmentation
performance; (2) during the training, the object to which noise is added is
more important than the type of noise; (3) during the inference, the strict
diffusion denoising scheme may not be essential and can be relaxed to a simpler
scheme that even works better. We evaluate the proposed prior modeling with
several off-the-shelf segmentors, and our experimental results on ADE20K and
Cityscapes demonstrate that our approach could achieve competitive
quantitative performance and more appealing visual quality. |
This paper proposes DDPS, a novel framework utilizing denoising diffusion generative models to enhance semantic segmentation by modeling segmentation mask priors, such as geometric and semantic constraints. |
Existing semantic segmentation methods primarily focus on discriminative feature learning, often overlooking the intrinsic properties and priors of segmentation masks themselves, which limits their performance. |
DDPS employs a two-stage pipeline. First, an off-the-shelf segmentation model generates initial predictions. Then, a denoising diffusion model, specifically a discrete diffusion model in this work, refines these predictions by aligning them with the learned mask prior distribution. |
DDPS consistently enhances the performance of various base segmentation models, including DeepLabV3+ and Segformer, on ADE20K and Cityscapes datasets.
The method demonstrates significant gains in boundary IoU, indicating its effectiveness in modeling geometric constraints.
Key design choices, such as noise applied to the first prediction and free re-noising during inference, are crucial for DDPS's success. |
The impact of DDPS on datasets with less inherent structure, like Cityscapes, is less pronounced compared to datasets like ADE20K.
Exploration of more sophisticated mask representation codecs and alternative diffusion models beyond discrete diffusion could be interesting future directions. |
semantic segmentation, denoising diffusion models, mask prior modeling, generative models for segmentation, deep learning |
2306.01667
Report |
Towards In-context Scene Understanding |
Ivana Balažević, David Steiner, Nikhil Parthasarathy, Relja Arandjelović, Olivier J. Hénaff |
In-context learning, the ability to configure a model's
behavior with different prompts, has revolutionized the field of
natural language processing, alleviating the need for task-specific models and
paving the way for generalist models capable of assisting with any query.
Computer vision, in contrast, has largely stayed in the former regime:
specialized decoders and finetuning protocols are generally required to perform
dense tasks such as semantic segmentation and depth estimation. In this work we
explore a simple mechanism for in-context learning of such scene understanding
tasks: nearest neighbor retrieval from a prompt of annotated features. We
propose a new pretraining protocol, leveraging attention within
and across images, which yields representations particularly
useful in this regime. The resulting Hummingbird model, suitably prompted,
performs various scene understanding tasks without modification while
approaching the performance of specialists that have been finetuned for each
task. Moreover, Hummingbird can be configured to perform new tasks much more
efficiently than finetuned models, raising the possibility of scene
understanding in the interactive assistant regime. |
This paper explores in-context learning for scene understanding tasks like semantic segmentation and depth estimation using a simple nearest neighbor retrieval mechanism from annotated features. |
This approach is significant because it eliminates the need for task-specific decoders and finetuning, paving the way for general-purpose vision models. |
The authors propose Hummingbird, a pretraining method that uses attention across and within images. The model retrieves nearest neighbors from a prompt of annotated features to make predictions on new images. |
Hummingbird representations perform well on semantic segmentation and depth estimation using NN retrieval without modification.
The approach achieves performance comparable to fully finetuned specialist models on some tasks.
Hummingbird with NN retrieval adapts to new tasks faster and more data-efficiently than finetuned models. |
The absolute performance in the low-data regime (less than 100 examples) needs improvement.
Expanding the evaluation to other tasks like object detection is left for future work. |
in-context learning, scene understanding, nearest neighbor retrieval, self-supervised learning, computer vision |
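The nearest neighbor retrieval mechanism described in the Hummingbird summary above can be sketched in a few lines. This is a simplified illustration, not the paper's decoder: the feature vectors and labels are made up, and combining neighbors by majority vote is an assumption made here for brevity. The "prompt" is a memory of (feature, label) pairs, and a query feature is labeled by its most similar entries.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_label(query, memory, k):
    """Label a query feature by majority vote over its k nearest prompt entries."""
    ranked = sorted(memory, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical prompt: annotated patch features with their semantic labels.
memory = [
    ([1.0, 0.0], "sky"),
    ([0.9, 0.1], "sky"),
    ([0.0, 1.0], "road"),
    ([0.1, 0.9], "road"),
    ([0.7, 0.7], "tree"),
]

label_a = knn_label([0.95, 0.05], memory, k=3)
label_b = knn_label([0.05, 0.95], memory, k=3)
```

Because the task is defined entirely by the labels stored in the memory, swapping the prompt changes the task without retraining the encoder, which is the property the summary highlights.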
2306.01293
Report |
LoCoOp: Few-Shot Out-of-Distribution Detection via Prompt Learning |
Atsuyuki Miyai, Qing Yu, Go Irie, Kiyoharu Aizawa |
We present a novel vision-language prompt learning approach for few-shot
out-of-distribution (OOD) detection. Few-shot OOD detection aims to detect OOD
images from classes that are unseen during training using only a few labeled
in-distribution (ID) images. While prompt learning methods such as CoOp have
shown effectiveness and efficiency in few-shot ID classification, they still
face limitations in OOD detection due to the potential presence of
ID-irrelevant information in text embeddings. To address this issue, we
introduce a new approach called Local regularized Context Optimization
(LoCoOp), which performs OOD regularization that utilizes the portions of CLIP
local features as OOD features during training. CLIP's local features have a
lot of ID-irrelevant nuisances (e.g., backgrounds), and by learning to push
them away from the ID class text embeddings, we can remove the nuisances in the
ID class text embeddings and enhance the separation between ID and OOD.
Experiments on the large-scale ImageNet OOD detection benchmarks demonstrate
the superiority of our LoCoOp over zero-shot, fully supervised detection
methods and prompt learning methods. Notably, even in a one-shot setting
(just one label per class), LoCoOp outperforms existing zero-shot and fully
supervised detection methods. The code will be available via
https://github.com/AtsuMiyai/LoCoOp. |
This paper tackles few-shot out-of-distribution (OOD) detection with vision-language models and proposes a novel prompt learning method called LoCoOp. |
Existing OOD detection methods for vision-language models are limited to zero-shot or fully supervised settings, which either suffer from domain gaps or require large training costs. Few-shot OOD detection offers a balanced solution. |
LoCoOp leverages CLIP's local features to identify ID-irrelevant regions (treated as OOD) and pushes them away from ID class text embeddings during training, effectively removing irrelevant information from text embeddings. |
LoCoOp outperforms existing zero-shot, few-shot, and fully supervised OOD detection methods on ImageNet benchmarks.
Remarkably, LoCoOp surpasses existing methods with only one label per class (one-shot setting).
Experiments show LoCoOp's effectiveness with both ViT and ResNet architectures. |
The application of LoCoOp to other light-weight tuning methods (e.g., Tip-Adapter, visual prompt methods) is left for future work.
LoCoOp requires models with strong local visual-text alignment and may not be readily applicable to models lacking such capabilities. |
out-of-distribution detection, few-shot learning, prompt learning, vision-language models, clip |
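The OOD regularization summarized in the LoCoOp entry above can be illustrated with a small sketch. It is not the authors' implementation. Two details are assumptions made for illustration: a local region is treated as ID-irrelevant when the ground-truth class is absent from its top-K most similar classes, and "pushing away" is expressed as maximizing the entropy of that region's class distribution. The function names and the toy similarity scores are hypothetical, standing in for CLIP local-feature-to-text similarities.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def ood_regularization(region_logits, label, top_k):
    """Assumed selection rule: a region is ID-irrelevant when `label`
    is not among its top_k highest-scoring classes.
    Returns (loss, selected_indices); the loss is the negative mean
    entropy of the selected regions, so minimizing it flattens their
    class distributions."""
    selected = []
    for idx, logits in enumerate(region_logits):
        ranked = sorted(range(len(logits)), key=lambda c: logits[c], reverse=True)
        if label not in ranked[:top_k]:
            selected.append(idx)
    if not selected:
        return 0.0, selected
    mean_entropy = sum(entropy(softmax(region_logits[i])) for i in selected) / len(selected)
    return -mean_entropy, selected

# Toy region-to-class similarity scores for an image of class 0 (hypothetical values).
regions = [
    [5.0, 1.0, 0.5],  # foreground-like region: class 0 ranks first
    [0.2, 3.0, 2.5],  # background-like region: class 0 ranks last
    [0.9, 1.0, 1.1],  # ambiguous region: class 0 ranks last
]
loss, picked = ood_regularization(regions, label=0, top_k=1)
```

In this toy case only the second and third regions are selected, and the regularizer is added to the usual classification loss with a weighting factor that would need tuning.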
2306.01272
Report |
DeepfakeArt Challenge: A Benchmark Dataset for Generative AI Art Forgery and Data Poisoning Detection |
Hossein Aboutalebi, Dayou Mao, Rongqi Fan, Carol Xu, Chris He, Alexander Wong |
The tremendous recent advances in generative artificial intelligence
techniques have led to significant successes and promise in a wide range of
different applications ranging from conversational agents and textual content
generation to voice and visual synthesis. Amid the rise in generative AI and
its increasing widespread adoption, there has been significant growing concern
over the use of generative AI for malicious purposes. In the realm of visual
content synthesis using generative AI, key areas of significant concern have
been image forgery (e.g., generation of images containing or derived from
copyrighted content), and data poisoning (i.e., generation of adversarially
contaminated images). Motivated to address these key concerns to encourage
responsible generative AI, we introduce the DeepfakeArt Challenge, a
large-scale challenge benchmark dataset designed specifically to aid in the
building of machine learning algorithms for generative AI art forgery and data
poisoning detection. The dataset comprises over 32,000 records across a variety
of generative forgery and data poisoning techniques; each entry consists of a
pair of images that are either forgeries / adversarially contaminated or not.
Each of the generated images in the DeepfakeArt Challenge benchmark dataset
(dataset link: http://anon_for_review.com) has been quality checked in a
comprehensive manner. |
Introduces DeepfakeArt Challenge, a large-scale benchmark dataset for detecting art forgery and data poisoning in generative AI. |
Addresses growing concerns of copyright infringement and adversarial data poisoning in AI-generated visual content to encourage responsible generative AI. |
Creates over 32,000 image pairs using four generative forgery and data poisoning techniques: Inpainting, Style Transfer, Adversarial Data Poisoning, and Cutmix, based on modifications of source images from the WikiArt dataset. |
The DINO-v2 ViT-L/14 model achieves the best overall performance for detecting similar and dissimilar image pairs.
Models generally show high precision but low recall, indicating a high rate of false negatives.
The high false negative rate highlights the need for more robust detection tools to identify and mitigate copyright infringements in generative AI models. |
The dataset currently focuses on four specific generative techniques and may not encompass the full spectrum of potential forgery methods.
Future work could expand the dataset with additional techniques and explore more sophisticated detection algorithms. |
generative ai, copyright infringement, data poisoning, deep learning, computer vision |
2306.00987
Report |
StyleGAN knows Normal, Depth, Albedo, and More |
Anand Bhattad, Daniel McKee, Derek Hoiem, D. A. Forsyth |
Intrinsic images, in the original sense, are image-like maps of scene
properties like depth, normal, albedo or shading. This paper demonstrates that
StyleGAN can easily be induced to produce intrinsic images. The procedure is
straightforward. We show that, if StyleGAN produces $G({w})$ from latents
${w}$, then for each type of intrinsic image, there is a fixed offset ${d}_c$
so that $G({w}+{d}_c)$ is that type of intrinsic image for $G({w})$. Here
${d}_c$ is {\em independent of ${w}$}. The StyleGAN we used was pretrained by
others, so this property is not some accident of our training regime. We show
that there are image transformations StyleGAN will {\em not} produce in this
fashion, so StyleGAN is not a generic image regression engine.
It is conceptually exciting that an image generator should ``know'' and
represent intrinsic images. There may also be practical advantages to using a
generative model to produce intrinsic images. The intrinsic images obtained
from StyleGAN compare well both qualitatively and quantitatively with those
obtained by using SOTA image regression techniques; but StyleGAN's intrinsic
images are robust to relighting effects, unlike SOTA methods. |
This paper reveals that StyleGAN, despite not being trained on intrinsic images, can be prompted to generate them by applying specific offsets to its latent codes. |
This finding is significant as it suggests that intrinsic image representations might be inherently encoded within StyleGAN, indicating a natural alignment with these scene properties. |
The authors search for fixed offsets in the StyleGAN latent space that correspond to different intrinsic images (normals, depth, albedo, shading, segmentation). This is achieved by minimizing the L1 distance between StyleGAN outputs and predictions from pre-trained, state-of-the-art intrinsic image prediction models. |
StyleGAN generates intrinsic images comparable in quality to those produced by state-of-the-art supervised methods, even though it was never explicitly trained on such data.
StyleGAN-derived intrinsic images exhibit remarkable robustness to lighting variations, outperforming current leading methods which show sensitivity to such changes.
A control experiment demonstrates that StyleGAN is not simply a generic image processing engine, as it cannot perform tasks unrelated to intrinsic image manipulation, like swapping image halves. |
The reliance on accurate GAN inversion methods to apply this technique to real images is currently a limiting factor.
Further investigation into whether this capability extends to other generative models and exploration of potentially undiscovered intrinsic representations within StyleGAN is warranted. |
stylegan, intrinsic images, generative models, image editing, representation learning |
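The offset search described in the StyleGAN entry above can be written as a single optimization problem. This is a sketch in the abstract's notation ($G$, $w$, $d_c$); the symbol $P_c$ for the pre-trained predictor of intrinsic type $c$ and the expectation over sampled latents are notational assumptions, not taken from the source.

```latex
\[
  d_c^{\ast} \;=\; \arg\min_{d_c}\;
  \mathbb{E}_{w}\,\bigl\lVert\, G(w + d_c) \;-\; P_c\bigl(G(w)\bigr) \,\bigr\rVert_{1}
\]
```

Because $d_c$ is shared across all latents $w$, only the offset is optimized; the generator $G$ and the predictor $P_c$ stay fixed.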
2306.00984
Report |
StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners |
Yonglong Tian, Lijie Fan, Phillip Isola, Huiwen Chang, Dilip Krishnan |
We investigate the potential of learning visual representations using
synthetic images generated by text-to-image models. This is a natural question
in the light of the excellent performance of such models in generating
high-quality images. We specifically consider Stable Diffusion, one of the
leading open source text-to-image models. We show that (1) when the generative
model is configured with proper classifier-free guidance scale, training
self-supervised methods on synthetic images can match or beat the real image
counterpart; (2) by treating the multiple images generated from the same text
prompt as positives for each other, we develop a multi-positive contrastive
learning method, which we call StableRep. With solely synthetic images, the
representations learned by StableRep surpass the performance of representations
learned by SimCLR and CLIP using the same set of text prompts and corresponding
real images, on large scale datasets. When we further add language supervision,
StableRep trained with 20M synthetic images achieves better accuracy than CLIP
trained with 50M real images. |
This paper explores using synthetic images from text-to-image models, especially Stable Diffusion, for visual representation learning. |
Collecting and curating large real-world datasets for training AI models is costly and prone to bias. Leveraging generative models as data sources offers a promising alternative. |
The authors investigate training standard self-supervised methods (SimCLR, MAE) on Stable Diffusion images, finding optimal guidance scales for image generation. They further propose StableRep, a novel multi-positive contrastive learning approach leveraging the unique property of generative models to create diverse positive samples from a single text prompt. |
Training self-supervised methods on synthetic images from Stable Diffusion with an appropriate guidance scale often surpasses the performance achieved by training on an equivalent amount of real data.
StableRep, solely trained on synthetic data, outperforms state-of-the-art methods like CLIP trained on real data, achieving 76.7% linear accuracy on ImageNet with ViT-B/16.
Adding language supervision to StableRep yields a 5x improvement in caption efficiency compared to CLIP trained on real images. |
Current image generation speed is slow, hindering online image synthesis during training.
Semantic mismatch between prompts and generated images remains an open problem, impacting data quality. |
representation learning, synthetic data, text-to-image generation, stable diffusion, contrastive learning |
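The multi-positive contrastive idea in the StableRep entry above (images generated from the same prompt act as positives for each other) can be sketched as a cross-entropy between a uniform target over an anchor's positives and a softmax over its similarities to all other images in the batch. This is a minimal illustration rather than the authors' code; the function names, the temperature value, and the toy 2-D embeddings are assumptions.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def multi_positive_loss(embeddings, caption_ids, temperature=0.1):
    """Average cross-entropy between a uniform distribution over each
    anchor's positives (same caption id) and the softmax of its cosine
    similarities to every other sample in the batch."""
    z = [normalize(v) for v in embeddings]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if caption_ids[j] == caption_ids[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        logits = [sum(a * b for a, b in zip(z[i], z[j])) / temperature for j in others]
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        for j, l in zip(others, logits):
            if caption_ids[j] == caption_ids[i]:
                total -= (l - log_norm) / len(positives)
        counted += 1
    return total / counted if counted else 0.0

# Toy batch: two images per caption (hypothetical embeddings).
batch = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
matched = multi_positive_loss(batch, caption_ids=[0, 0, 1, 1])
mismatched = multi_positive_loss(batch, caption_ids=[0, 1, 0, 1])
```

When the caption grouping agrees with the embedding geometry (`matched`), the loss is close to zero; with a grouping that contradicts it (`mismatched`), the loss is much larger.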
2306.00977
Report |
AGILE3D: Attention Guided Interactive Multi-object 3D Segmentation |
Yuanwen Yue, Sabarinath Mahadevan, Jonas Schult, Francis Engelmann, Bastian Leibe, Konrad Schindler, Theodora Kontogianni |
During interactive segmentation, a model and a user work together to
delineate objects of interest in a 3D point cloud. In an iterative process, the
model assigns each data point to an object (or the background), while the user
corrects errors in the resulting segmentation and feeds them back into the
model. The current best practice formulates the problem as binary
classification and segments objects one at a time. The model expects the user
to provide positive clicks to indicate regions wrongly assigned to the
background and negative clicks on regions wrongly assigned to the object.
Sequentially visiting objects is wasteful since it disregards synergies between
objects: a positive click for a given object can, by definition, serve as a
negative click for nearby objects. Moreover, a direct competition between
adjacent objects can speed up the identification of their common boundary. We
introduce AGILE3D, an efficient, attention-based model that (1) supports
simultaneous segmentation of multiple 3D objects, (2) yields more accurate
segmentation masks with fewer user clicks, and (3) offers faster inference. Our
core idea is to encode user clicks as spatial-temporal queries and enable
explicit interactions between click queries as well as between them and the 3D
scene through a click attention module. Every time new clicks are added, we
only need to run a lightweight decoder that produces updated segmentation
masks. In experiments with four different 3D point cloud datasets, AGILE3D sets
a new state-of-the-art. Moreover, we also verify its practicality in real-world
setups with real user studies. |
This paper introduces AGILE3D, an attention-based deep learning model for interactive segmentation of multiple objects in 3D point clouds, overcoming the limitations of sequential, single-object segmentation. |
Existing methods for interactive 3D segmentation are inefficient: they segment one object at a time and thereby disregard synergies between objects. |
AGILE3D encodes user clicks as spatial-temporal queries, enabling interaction between clicks and the 3D scene via a click attention module. It employs a pre-computed backbone for efficiency and trains using an iterative strategy simulating user behavior. |
AGILE3D outperforms state-of-the-art methods in both single- and multi-object segmentation benchmarks.
The model effectively segments multiple objects simultaneously, requiring fewer clicks for higher-quality masks.
Real-user studies confirm AGILE3D's efficiency and the effectiveness of the iterative training strategy. |
AGILE3D may require more clicks to accurately segment fine-grained object parts.
The model currently doesn't provide semantic labels along with the segmented masks. |
3d point clouds, interactive segmentation, multi-object segmentation, attention mechanism, deep learning |
2306.00973
Report |
Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion Models |
Chang Liu, Haoning Wu, Yujie Zhong, Xiaoyun Zhang, Yanfeng Wang, Weidi Xie |
Generative models have recently exhibited exceptional capabilities in
text-to-image generation, but still struggle to generate image sequences
coherently. In this work, we focus on a novel, yet challenging task of
generating a coherent image sequence based on a given storyline, denoted as
open-ended visual storytelling. We make the following three contributions: (i)
to fulfill the task of visual storytelling, we propose a learning-based
auto-regressive image generation model, termed as StoryGen, with a novel
vision-language context module that enables generating the current frame by
conditioning on the corresponding text prompt and preceding image-caption
pairs; (ii) to address the data shortage of visual storytelling, we collect
paired image-text sequences by sourcing from online videos and open-source
E-books, establishing a data processing pipeline for constructing a large-scale
dataset with diverse characters, storylines, and artistic styles, named
StorySalon; (iii) Quantitative experiments and human evaluations have validated
the superiority of our StoryGen, where we show StoryGen can generalize to
unseen characters without any optimization, and generate image sequences with
coherent content and consistent character. Code, dataset, and models are
available at https://haoningwu3639.github.io/StoryGen_Webpage/ |
This paper proposes StoryGen, a novel learning-based auto-regressive image generation model for open-ended visual storytelling, enabling the generation of coherent image sequences from given storylines, even with unseen characters. |
Open-ended visual storytelling holds significant potential in education by offering an engaging way for children to learn visual concepts and fostering imagination, creativity, and language skills. |
StoryGen builds upon a pre-trained stable diffusion model and incorporates a novel vision-language context module to condition image generation on both text prompts and preceding image-caption pairs, ensuring both content coherence and character consistency. The model is trained on StorySalon, a newly constructed large-scale dataset of storybooks with diverse characters, storylines, and artistic styles. |
StoryGen generates visually coherent stories with unseen characters without requiring character-specific optimization.
Quantitative experiments show StoryGen outperforms baselines in terms of image quality and text-image alignment, validated by FID and CLIP scores.
Human evaluations confirm StoryGen's superiority in generating coherent and engaging visual stories, as evidenced by higher scores in consistency, quality, and user preference. |
StoryGen inherits limitations from the underlying stable diffusion model, such as inaccuracies in limb counts and reduced quality with multiple objects.
Future work will explore more robust architectures like DALL-E 3 or consistency models to address these limitations. |
visual storytelling, latent diffusion models, open-ended generation, storysalon dataset, character consistency |
2306.00968
Report |
GRES: Generalized Referring Expression Segmentation |
Chang Liu, Henghui Ding, Xudong Jiang |
Referring Expression Segmentation (RES) aims to generate a segmentation mask
for the object described by a given language expression. Existing classic RES
datasets and methods commonly support single-target expressions only, i.e., one
expression refers to one target object. Multi-target and no-target expressions
are not considered. This limits the usage of RES in practice. In this paper, we
introduce a new benchmark called Generalized Referring Expression Segmentation
(GRES), which extends the classic RES to allow expressions to refer to an
arbitrary number of target objects. Towards this, we construct the first
large-scale GRES dataset called gRefCOCO that contains multi-target, no-target,
and single-target expressions. GRES and gRefCOCO are designed to be
well-compatible with RES, facilitating extensive experiments to study the
performance gap of the existing RES methods on the GRES task. In the
experimental study, we find that one of the big challenges of GRES is complex
relationship modeling. Based on this, we propose a region-based GRES baseline
ReLA that adaptively divides the image into regions with sub-instance clues,
and explicitly models the region-region and region-language dependencies. The
proposed approach ReLA achieves new state-of-the-art performance on both the
newly proposed GRES and classic RES tasks. The proposed gRefCOCO dataset and
method are available at https://henghuiding.github.io/GRES. |
This paper introduces Generalized Referring Expression Segmentation (GRES), a new benchmark extending classic Referring Expression Segmentation (RES) to handle expressions referring to an arbitrary number of target objects, including multi-target and no-target expressions. |
Classic RES suffers from limitations as it only supports single-target expressions, hindering its practical usage in scenarios with multiple or no target objects. GRES addresses this by supporting a wider range of expressions, enabling greater flexibility and robustness in real-world applications. |
The authors create gRefCOCO, a large-scale dataset for GRES, by augmenting RefCOCO with multi-target and no-target expressions. They also propose ReLA, a region-based GRES baseline method that leverages sub-instance clues to explicitly model region-region and region-language dependencies. |
Models trained solely on single-target RES datasets generalize poorly to GRES, highlighting the need for gRefCOCO.
Explicit modeling of region-region and region-language interactions significantly improves performance on GRES.
ReLA achieves state-of-the-art results on both classic RES and the newly proposed GRES tasks. |
No-target expression identification, while improved, still presents challenges due to the deceptive nature of some expressions.
Future research should focus on addressing complex relationships, such as possession and fine-grained attribute understanding. |
referring expression segmentation, multi-target segmentation, no-target identification, relationship modeling, computer vision |
2306.00965
Report |
BUOL: A Bottom-Up Framework with Occupancy-aware Lifting for Panoptic 3D Scene Reconstruction From A Single Image |
Tao Chu, Pan Zhang, Qiong Liu, Jiaqi Wang |
Understanding and modeling the 3D scene from a single image is a practical
problem. A recent advance proposes a panoptic 3D scene reconstruction task that
performs both 3D reconstruction and 3D panoptic segmentation from a single
image. Although having made substantial progress, recent works only focus on
top-down approaches that fill 2D instances into 3D voxels according to
estimated depth, which hinders their performance by two ambiguities. (1)
instance-channel ambiguity: The variable ids of instances in each scene lead to
ambiguity during filling voxel channels with 2D information, confusing the
following 3D refinement. (2) voxel-reconstruction ambiguity: 2D-to-3D lifting
with estimated single view depth only propagates 2D information onto the
surface of 3D regions, leading to ambiguity during the reconstruction of
regions behind the frontal view surface. In this paper, we propose BUOL, a
Bottom-Up framework with Occupancy-aware Lifting to address the two issues for
panoptic 3D scene reconstruction from a single image. For instance-channel
ambiguity, a bottom-up framework lifts 2D information to 3D voxels based on
deterministic semantic assignments rather than arbitrary instance id
assignments. The 3D voxels are then refined and grouped into 3D instances
according to the predicted 2D instance centers. For voxel-reconstruction
ambiguity, the estimated multi-plane occupancy is leveraged together with depth
to fill the whole regions of things and stuff. Our method shows a tremendous
performance advantage over state-of-the-art methods on the synthetic dataset
3D-Front and the real-world dataset Matterport3D. Code and models are available at
https://github.com/chtsy/buol. |
This paper presents BUOL, a novel bottom-up framework with occupancy-aware lifting, for panoptic 3D scene reconstruction from a single RGB image. |
Existing top-down methods suffer from instance-channel ambiguity (inconsistent instance ID assignment) and voxel-reconstruction ambiguity (limited 2D information propagation to 3D). |
BUOL uses a 2D model to predict semantic maps, instance centers, depth, and multi-plane occupancy. It then lifts 2D semantics to 3D using occupancy-aware lifting and refines them with a 3D model, predicting occupancy, semantics, and offsets for 3D instance grouping. |
BUOL outperforms previous state-of-the-art methods on 3D-Front and Matterport3D datasets by significant margins (+11.81% and +7.46% in PRQ).
The bottom-up framework effectively addresses instance-channel ambiguity by utilizing semantic information for lifting.
Occupancy-aware lifting alleviates voxel-reconstruction ambiguity, enabling accurate 3D reconstruction. |
The performance on the Matterport3D dataset is relatively low due to noisy ground truth data.
The computational cost of 3D models limits the use of complex architectures like 3D UNet. |
panoptic 3d scene reconstruction, single image 3d reconstruction, occupancy-aware lifting, bottom-up framework, 3d instance segmentation |
2306.00956
Report |
The ObjectFolder Benchmark: Multisensory Learning with Neural and Real Objects |
Ruohan Gao, Yiming Dou, Hao Li, Tanmay Agarwal, Jeannette Bohg, Yunzhu Li, Li Fei-Fei, Jiajun Wu |
We introduce the ObjectFolder Benchmark, a benchmark suite of 10 tasks for
multisensory object-centric learning, centered around object recognition,
reconstruction, and manipulation with sight, sound, and touch. We also
introduce the ObjectFolder Real dataset, including the multisensory
measurements for 100 real-world household objects, building upon a newly
designed pipeline for collecting the 3D meshes, videos, impact sounds, and
tactile readings of real-world objects. We conduct systematic benchmarking on
both the 1,000 multisensory neural objects from ObjectFolder, and the real
multisensory data from ObjectFolder Real. Our results demonstrate the
importance of multisensory perception and reveal the respective roles of
vision, audio, and touch for different object-centric learning tasks. By
publicly releasing our dataset and benchmark suite, we hope to catalyze and
enable new research in multisensory object-centric learning in computer vision,
robotics, and beyond. Project page: https://objectfolder.stanford.edu |
This paper introduces the ObjectFolder Benchmark, a suite of 10 tasks for multisensory object-centric learning, and the ObjectFolder Real dataset, which includes multisensory measurements for 100 real-world household objects. |
Modeling the complete multisensory profile of objects is important for applications in computer vision, robotics, graphics, and VR/AR, but existing datasets and benchmarks are limited. |
The authors designed a data collection pipeline for capturing 3D meshes, videos, impact sounds, and tactile readings of real objects. They also standardized 10 tasks and developed baseline approaches for each. |
Vision and audio are more reliable than touch for object recognition.
Fusing multiple sensory modalities achieves the best results for object reconstruction.
Vision and touch are both crucial for robotic manipulation tasks. |
Sim-to-real transfer remains challenging for some tasks.
Future work includes exploring more robust sim-to-real calibration methods. |
multisensory learning, object-centric learning, benchmarking, dataset, robotics |
2306.00943
Report |
Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance |
Jinbo Xing, Menghan Xia, Yuxin Liu, Yuechen Zhang, Yong Zhang, Yingqing He, Hanyuan Liu, Haoxin Chen, Xiaodong Cun, Xintao Wang, Ying Shan, Tien-Tsin Wong |
Creating a vivid video from the event or scenario in our imagination is a
truly fascinating experience. Recent advancements in text-to-video synthesis
have unveiled the potential to achieve this with prompts only. While text is
convenient in conveying the overall scene context, it may be insufficient for
precise control. In this paper, we explore customized video generation by
utilizing text as context description and motion structure (e.g. frame-wise
depth) as concrete guidance. Our method, dubbed Make-Your-Video, involves
joint-conditional video generation using a Latent Diffusion Model that is
pre-trained for still image synthesis and then promoted for video generation
with the introduction of temporal modules. This two-stage learning scheme not
only reduces the computing resources required, but also improves the
performance by transferring the rich concepts available in image datasets
solely into video generation. Moreover, we use a simple yet effective causal
attention mask strategy to enable longer video synthesis, which mitigates the
potential quality degradation effectively. Experimental results show the
superiority of our method over existing baselines, particularly in terms of
temporal coherence and fidelity to users' guidance. In addition, our model
enables several intriguing applications that demonstrate potential for
practical usage. |
This paper introduces Make-Your-Video, an efficient approach for customized video generation using both textual descriptions and motion structures (e.g., frame-wise depth) as guidance. |
The method aims to address the limitations of text-only video generation, where precise control over video content can be challenging. By incorporating motion structures, the model offers enhanced controllability and enables users to create videos that closely align with their specific vision. |
The method leverages a Latent Diffusion Model (LDM) pre-trained for still image synthesis and adapts it for video generation. It introduces temporal modules while keeping the pre-trained spatial modules frozen to maintain visual richness. A causal attention mask strategy is also employed to enhance temporal coherence, especially in longer video synthesis. |
Make-Your-Video outperforms existing text-to-video generation baselines in terms of temporal coherence and fidelity to user guidance, as demonstrated by quantitative metrics like FVD and KVD.
The method enables various applications, including generating videos from real-life scene setups, 3D scene modeling, and video re-rendering with different styles.
Ablation studies confirm the importance of the proposed adapting strategy and the causal attention mask for improving performance. |
The model currently lacks precise control over visual details, such as synthesizing videos featuring specific individuals or objects.
Relying on frame-wise depth guidance can be demanding; exploring sparse keyframe guidance could broaden applicability. |
video generation, text-to-video synthesis, latent diffusion models, motion structures, conditional video generation |
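The causal attention mask mentioned in the Make-Your-Video entry above can be illustrated with a generic temporal mask in which each frame attends only to itself and earlier frames. The entry does not spell out the exact masking variant, so this sketch assumes the standard lower-triangular form; the function names and the uniform toy scores are hypothetical.

```python
import math

def causal_mask(num_frames):
    """mask[i][j] is True when frame i may attend to frame j (j <= i)."""
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]

def masked_attention_weights(scores, mask):
    """Row-wise softmax over attention scores, with masked positions
    forced to zero weight. Each row must keep at least one position,
    which the causal mask guarantees via the diagonal."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        kept = [s for s, keep in zip(row_scores, row_mask) if keep]
        m = max(kept)
        exps = [math.exp(s - m) if keep else 0.0
                for s, keep in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Three frames with uniform raw scores (toy values).
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = masked_attention_weights(scores, causal_mask(3))
```

With uniform scores, the first frame attends only to itself, the second splits its weight over two frames, and the third over all three, so no frame depends on frames that come after it.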
2306.00926
Report |
Inserting Anybody in Diffusion Models via Celeb Basis |
Ge Yuan, Xiaodong Cun, Yong Zhang, Maomao Li, Chenyang Qi, Xintao Wang, Ying Shan, Huicheng Zheng |
Exquisite demand exists for customizing the pretrained large text-to-image
model, $\textit{e.g.}$, Stable Diffusion, to generate innovative concepts, such
as the users themselves. However, the newly-added concept from previous
customization methods often shows weaker combination abilities than the
original ones even given several images during training. We thus propose a new
personalization method that allows for the seamless integration of a unique
individual into the pre-trained diffusion model using just $\textbf{one facial
photograph}$ and only $\textbf{1024 learnable parameters}$ under $\textbf{3
minutes}$. As a result, we can effortlessly generate stunning images of this person in
any pose or position, interacting with anyone and doing anything imaginable
from text prompts. To achieve this, we first analyze and build a well-defined
celeb basis from the embedding space of the pre-trained large text encoder.
Then, given one facial photo as the target identity, we generate its own
embedding by optimizing the weight of this basis and locking all other
parameters. Empowered by the proposed celeb basis, the new identity in our
customized model showcases a better concept combination ability than previous
personalization methods. Besides, our model can also learn several new
identities at once and let them interact with each other, where previous
customization models fail. The code will be released. |
This paper presents a novel personalization method for pre-trained text-to-image models, enabling the seamless integration of a unique individual into the model using only one facial photograph and 1024 learnable parameters. |
Existing methods for customizing pre-trained text-to-image models often struggle to generate text description-aligned images with newly learned concepts, especially for fine-grained concepts like human identities. This limits users' ability to generate images of themselves or others in diverse scenarios. |
The method leverages a 'celeb basis' constructed from the embedding space of celebrity names in the pre-trained model. Given a facial photo, the method optimizes coefficients for this basis to represent the new identity. This personalized embedding then drives image generation, preserving the model's original composition abilities. |
The method produces high-quality images of new identities that maintain consistency with text prompts.
It surpasses previous personalization methods in terms of identity preservation and concept combination abilities.
The approach is efficient, requiring only 1024 learnable parameters and 3 minutes of training time per identity. |
The quality of generated images is limited by the pre-trained model's inherent biases and artifacts.
The current work focuses on human faces, and exploring the applicability to other concept classes remains for future work. |
text-to-image generation, personalization, diffusion models, identity representation, celeb basis |
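The identity representation in the Celeb Basis entry above can be summarized as a weighted combination of basis vectors in the text encoder's embedding space. The formula below is a sketch: the symbols $b_k$ (basis vectors), $\alpha_k$ (learnable coefficients), $K$ (basis size), and the optional mean term $\bar{c}$ are notational assumptions, not taken from the source.

```latex
\[
  v_{\text{id}} \;=\; \bar{c} \;+\; \sum_{k=1}^{K} \alpha_k \, b_k ,
  \qquad \text{only } \alpha_1, \dots, \alpha_K \text{ are optimized.}
\]
```

Restricting the new embedding to the span of the basis keeps it close to the region of embedding space occupied by names the model already handles, which is one plausible reading of why composition ability is preserved.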
2306.00905
Report |
T2IAT: Measuring Valence and Stereotypical Biases in Text-to-Image Generation |
Jialu Wang, Xinyue Gabby Liu, Zonglin Di, Yang Liu, Xin Eric Wang |
Warning: This paper contains content that may be toxic, harmful, or
offensive.
In the last few years, text-to-image generative models have gained remarkable
success in generating images with unprecedented quality accompanied by a
breakthrough of inference speed. Despite their rapid progress, human biases
that manifest in the training examples, particularly with regard to common
stereotypical biases, like gender and skin tone, still have been found in these
generative models. In this work, we seek to measure more complex human biases
that exist in the task of text-to-image generation. Inspired by the well-known
Implicit Association Test (IAT) from social psychology, we propose a novel
Text-to-Image Association Test (T2IAT) framework that quantifies the implicit
stereotypes between concepts and valence, and those in the images. We replicate
the previously documented bias tests on generative models, including morally
neutral tests on flowers and insects as well as demographic stereotypical tests
on diverse social attributes. The results of these experiments demonstrate the
presence of complex stereotypical behaviors in image generations. |
The paper proposes Text-to-Image Association Test (T2IAT), a novel framework to quantify implicit stereotypes in text-to-image generation models, going beyond simple demographic biases. |
Text-to-image models, trained on massive datasets, can perpetuate harmful stereotypes. Existing bias detection methods are limited in capturing nuanced associations between visual concepts and attributes. |
T2IAT adapts the Implicit Association Test from social psychology. It measures the distance between images generated with neutral prompts and those generated with attribute-guided prompts (e.g., gender, valence). Statistical tests determine the significance of observed biases. |
Generative models exhibit human-like biases even for non-demographic concepts (e.g., flowers are associated with pleasantness, insects with unpleasantness).
Significant biases were found in areas like race, sexuality, and gender roles, aligning with documented societal biases.
The model amplifies implicit stereotypes present in textual prompts, exacerbating existing biases in generated images. |
The verbal stimuli used, while aligned with prior IAT tests, might not fully represent all nuances of a concept.
The image encoder used to measure distance between images might introduce its own biases. |
bias detection, text-to-image generation, implicit association test, stereotype amplification, ai ethics |
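The measurement described in the T2IAT entry above (distance between images from neutral prompts and images from attribute-guided prompts, followed by a significance test) can be sketched as follows. This is a simplified illustration under stated assumptions: images are represented by already-computed embedding vectors, cosine distance is the metric, and the permutation test shuffles the two attribute groups. The exact statistic used in the paper may differ.

```python
import math
import random


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)


def mean_distance(xs, ys):
    """Average pairwise cosine distance between two sets of embeddings."""
    total = sum(cosine_distance(x, y) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def differential_association(neutral, attr_a, attr_b):
    """Negative: neutral images sit closer to attribute A than to B."""
    return mean_distance(neutral, attr_a) - mean_distance(neutral, attr_b)


def permutation_p_value(neutral, attr_a, attr_b, n_perm=1000, seed=0):
    """Two-sided permutation test over the A/B group assignment."""
    rng = random.Random(seed)
    observed = abs(differential_association(neutral, attr_a, attr_b))
    pooled = list(attr_a) + list(attr_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a, b = pooled[:len(attr_a)], pooled[len(attr_a):]
        if abs(differential_association(neutral, a, b)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy 2-D embeddings (hypothetical values, not real image features).
neutral = [[1.0, 0.1], [0.9, 0.2]]
attr_a = [[1.0, 0.0], [0.95, 0.05]]
attr_b = [[0.0, 1.0], [0.1, 0.9]]
score = differential_association(neutral, attr_a, attr_b)
p_value = permutation_p_value(neutral, attr_a, attr_b)
```

With only two images per group the permutation test has very little power; a real evaluation would use many generated images per prompt.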
2306.00783
Report |
FaceDNeRF: Semantics-Driven Face Reconstruction, Prompt Editing and Relighting with Diffusion Models |
Hao Zhang, Yanbo Xu, Tianyuan Dai, Yu-Wing Tai, Chi-Keung Tang |
The ability to create high-quality 3D faces from a single image has become
increasingly important with wide applications in video conferencing, AR/VR, and
advanced video editing in movie industries. In this paper, we propose Face
Diffusion NeRF (FaceDNeRF), a new generative method to reconstruct high-quality
Face NeRFs from single images, complete with semantic editing and relighting
capabilities. FaceDNeRF utilizes high-resolution 3D GAN inversion and an expertly
trained 2D latent-diffusion model, allowing users to manipulate and construct
Face NeRFs in zero-shot learning without the need for explicit 3D data. With
carefully designed illumination and identity preserving loss, as well as
multi-modal pre-training, FaceDNeRF offers users unparalleled control over the
editing process enabling them to create and edit face NeRFs using just
single-view images, text prompts, and explicit target lighting. The advanced
features of FaceDNeRF have been designed to produce more impressive results
than existing 2D editing approaches that rely on 2D segmentation maps for
editable attributes. Experiments show that our FaceDNeRF achieves exceptionally
realistic results and unprecedented flexibility in editing compared with
state-of-the-art 3D face reconstruction and editing methods. Our code will be
available at https://github.com/BillyXYB/FaceDNeRF. |
Proposes FaceDNeRF, a novel method for reconstructing high-quality 3D face NeRFs from single images, enabling semantic editing and relighting. |
Addresses limitations of existing 3D face generation and editing methods, which lack photorealism, flexibility, and ease of control. |
Leverages a pre-trained EG3D generator and a stable diffusion model. Optimizes latent codes in EG3D's latent space via a combination of reconstruction, identity, diffusion, and illumination losses. |
Achieves high-fidelity 3D face reconstruction and editing from single images using text prompts.
Enables explicit and view-consistent control over illumination.
Demonstrates generalizability across different data domains (faces, cats, cars) and backbone architectures (EG3D, PanoHead). |
Performance is limited by the capabilities of the chosen GAN and diffusion models.
Generation from rare-sampled latent codes can produce unrealistic results. |
3d face reconstruction, nerf, semantic editing, relighting, diffusion models |
2306.00738
Report |
ReFACT: Updating Text-to-Image Models by Editing the Text Encoder |
Dana Arad, Hadas Orgad, Yonatan Belinkov |
Our world is marked by unprecedented technological, global, and
socio-political transformations, posing a significant challenge to
text-to-image generative models. These models encode factual associations
within their parameters that can quickly become outdated, diminishing their
utility for end-users. To that end, we introduce ReFACT, a novel approach for
editing factual associations in text-to-image models without relying on
explicit input from end-users or costly re-training. ReFACT updates the weights
of a specific layer in the text encoder, modifying only a tiny portion of the
model's parameters and leaving the rest of the model unaffected. We empirically
evaluate ReFACT on an existing benchmark, alongside a newly curated dataset.
Compared to other methods, ReFACT achieves superior performance in both
generalization to related concepts and preservation of unrelated concepts.
Furthermore, ReFACT maintains image generation quality, making it a practical
tool for updating and correcting factual information in text-to-image models. |
This paper introduces ReFACT, a novel method for revising factual knowledge in text-to-image models without retraining or explicit user input. |
Text-to-image models can encode outdated or incorrect factual associations, limiting their utility. ReFACT provides an efficient way to update these models and keep them current. |
ReFACT modifies weights in the text encoder's MLP layer, viewing it as a key-value store. It optimizes a vector to align the representation of an edit prompt with a target prompt while contrasting against negative examples. |
ReFACT effectively edits various factual associations, including implicit model assumptions and object appearances.
It outperforms previous methods in efficacy, generalization to related concepts, and specificity, minimizing impact on unrelated concepts.
ReFACT maintains the model's image generation quality, as demonstrated by comparable FID and CLIP scores to the unedited model. |
ReFACT is slower than the compared editing method (TIME), requiring an optimization process.
The method exhibits limitations in editing facial features and occasional specificity failures, prompting further investigation into knowledge encoding and layer-specific editing effects. |
text-to-image generation, knowledge editing, model updating, factual consistency, clip |
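The key-value view of an MLP layer mentioned in the ReFACT entry above can be illustrated with a rank-one weight edit that makes a chosen key map to a new target value. This sketch is a simplification and an assumption on several points: it uses the plain least-change update W' = W + (v* - W k) k^T / (k^T k), it omits any key-covariance weighting, and it takes the target value as given rather than optimizing it with a contrastive objective as the entry describes.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rank_one_edit(W, key, target_value):
    """Return W' such that W' @ key == target_value.

    Inputs orthogonal to `key` are mapped exactly as before, which is the
    sense in which the edit leaves unrelated associations untouched.
    """
    kk = sum(k * k for k in key)
    if kk == 0:
        raise ValueError("key must be non-zero")
    residual = [t - c for t, c in zip(target_value, matvec(W, key))]
    return [[w + r * k / kk for w, k in zip(row, key)]
            for row, r in zip(W, residual)]


# Toy 2x2 layer: the key [1, 0] is remapped, the key [0, 1] is preserved.
W = [[1.0, 0.0], [0.0, 1.0]]
W_edited = rank_one_edit(W, key=[1.0, 0.0], target_value=[0.0, 2.0])
```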
2306.00693
Report |
GPT4Image: Can Large Pre-trained Models Help Vision Models on Perception Tasks? |
Ning Ding, Yehui Tang, Zhongqian Fu, Chao Xu, Kai Han, Yunhe Wang |
The recent upsurge in pre-trained large models (e.g. GPT-4) has swept across
the entire deep learning community. Such powerful large language models (LLMs)
demonstrate advanced generative ability and multimodal understanding
capability, which quickly achieve new state-of-the-art performances on a
variety of benchmarks. The pre-trained LLM usually plays the role of a
universal AI model that can conduct various tasks, including context reasoning,
article analysis and image content comprehension. However, considering the
prohibitively high memory and computational cost for implementing such a large
model, the conventional models (such as CNN and ViT), are still essential for
many visual perception tasks. In this paper, we propose to enhance the
representation ability of ordinary vision models for perception tasks (e.g.
image classification) by taking advantage of large pre-trained models. We
present a new learning paradigm in which the knowledge extracted from large
pre-trained models is utilized to help models like CNN and ViT learn enhanced
representations and achieve better performance. Firstly, we curate a high
quality description set by prompting a multimodal LLM to generate descriptive
text for all training images. Furthermore, we feed these detailed descriptions
into a pre-trained encoder to extract text embeddings with rich semantic
information that encodes the content of images. During training, text
embeddings will serve as extra supervising signals and be aligned with image
representations learned by vision models. The alignment process helps vision
models learn better and achieve higher accuracy with the assistance of
pre-trained LLMs. We conduct extensive experiments to verify that the proposed
algorithm consistently improves the performance for various vision models with
heterogeneous architectures. |
This paper proposes GPT4Image, a novel supervised learning framework where conventional vision models learn enhanced representations by leveraging the knowledge and multimodal capabilities of large pre-trained models (LLMs) for improved performance in perception tasks like image classification. |
This approach allows smaller companies with limited resources to benefit from the power of LLMs without the need for the high computational cost of training and deploying these models themselves. |
The method involves curating a text description set for training images using a pre-trained multimodal LLM. Embeddings of these descriptions are then extracted with a text encoder and aligned with image representations learned by the vision models through a distance loss minimization process. |
GPT4Image consistently improved performance across various vision models (ResNet, ViT, ConvNeXt) on CIFAR and ImageNet-1K benchmarks.
The framework utilizes cross-modality knowledge from LLMs as a supervisory signal to enhance the training of vision models.
Short image descriptions focusing on salient objects proved more effective than long descriptions. |
The reliance on pre-generated descriptions limits flexibility during training.
The effectiveness depends on the quality and relevance of the LLM generated descriptions. |
image classification, large language models, multimodal learning, representation learning, supervised learning |
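The training signal described in the GPT4Image entry above (a supervised loss plus a distance term that pulls image representations toward text embeddings of the descriptions) can be sketched as below. The choice of mean squared error as the distance, the weight name align_weight, and the assumption that both vectors already share one dimension are illustrative, not taken from the paper.

```python
import math


def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for a single example."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def mse(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def total_loss(logits, label, image_feat, text_emb, align_weight=1.0):
    """Classification loss plus a weighted image-text alignment term."""
    return cross_entropy(logits, label) + align_weight * mse(image_feat, text_emb)


# Toy example: two classes, 3-dimensional features (hypothetical values).
loss = total_loss(logits=[2.0, 0.5], label=0,
                  image_feat=[0.1, 0.4, -0.2],
                  text_emb=[0.0, 0.5, -0.1],
                  align_weight=0.5)
```

Since the text embeddings are computed once before training, the alignment term adds little cost per step, which matches the entry's note that descriptions are pre-generated.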
2306.00637
Report |
Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models |
Pablo Pernias, Dominic Rampas, Mats L. Richter, Christopher J. Pal, Marc Aubreville |
We introduce Würstchen, a novel architecture for text-to-image synthesis
that combines competitive performance with unprecedented cost-effectiveness for
large-scale text-to-image diffusion models. A key contribution of our work is
to develop a latent diffusion technique in which we learn a detailed but
extremely compact semantic image representation used to guide the diffusion
process. This highly compressed representation of an image provides much more
detailed guidance compared to latent representations of language and this
significantly reduces the computational requirements to achieve
state-of-the-art results. Our approach also improves the quality of
text-conditioned image generation based on our user preference study. The
training requirements of our approach consist of 24,602 A100-GPU hours -
compared to Stable Diffusion 2.1's 200,000 GPU hours. Our approach also
requires less training data to achieve these results. Furthermore, our compact
latent representations allow us to perform inference over twice as fast,
slashing the usual costs and carbon footprint of a state-of-the-art (SOTA)
diffusion model significantly, without compromising the end performance. In a
broader comparison against SOTA models our approach is substantially more
efficient and compares favorably in terms of image quality. We believe that
this work motivates more emphasis on the prioritization of both performance and
computational accessibility. |
This paper introduces "Würstchen", a novel three-stage text-to-image synthesis architecture that achieves competitive performance with significantly reduced computational cost compared to existing large-scale diffusion models. |
State-of-the-art text-to-image models, while impressive, are computationally demanding and expensive to train. Würstchen addresses this limitation by achieving high-quality image synthesis with a fraction of the computational resources. |
The method employs a three-stage architecture: (1) A VQGAN compresses images into a latent space. (2) A latent diffusion model (LDM) operates on this compressed space, guided by a "Semantic Compressor" that provides highly compressed semantic image representations. (3) A final text-conditional LDM generates images in the compressed latent space, guided by text embeddings. |
Würstchen achieves a comparable performance to Stable Diffusion 2.1, while requiring 8x less training compute.
Human evaluation and PickScore metrics show that Würstchen consistently outperforms existing models of similar computational cost and even surpasses some larger models in image quality.
The architecture allows for fast inference, significantly reducing the cost and carbon footprint associated with large-scale image generation. |
While computationally efficient, Würstchen's FID score, although exceeding some larger models, is lower compared to other state-of-the-art models. This is attributed to smoother image features compared to other models.
The paper acknowledges the potential for further optimization, such as removing text conditioning from Stage B in future iterations. |
text-to-image synthesis, latent diffusion models, vqgan, efficient ai, computational efficiency |
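The "8x less training compute" figure in the Würstchen entry above follows directly from the two GPU-hour numbers quoted in the abstract. A short check, with the function name chosen here for illustration:

```python
WUERSTCHEN_GPU_HOURS = 24_602   # A100-GPU hours, from the abstract
SD21_GPU_HOURS = 200_000        # Stable Diffusion 2.1, from the abstract


def compute_reduction(baseline_hours, new_hours):
    """Return (how many times less compute, fraction of hours saved)."""
    if baseline_hours <= 0 or new_hours <= 0:
        raise ValueError("GPU hours must be positive")
    ratio = baseline_hours / new_hours
    saved_fraction = 1.0 - new_hours / baseline_hours
    return ratio, saved_fraction


ratio, saved = compute_reduction(SD21_GPU_HOURS, WUERSTCHEN_GPU_HOURS)
print(f"{ratio:.2f}x less compute, {saved:.1%} of GPU hours saved")
# prints "8.13x less compute, 87.7% of GPU hours saved"
```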
2306.00547
Report |
AvatarStudio: Text-driven Editing of 3D Dynamic Human Head Avatars |
Mohit Mendiratta, Xingang Pan, Mohamed Elgharib, Kartik Teotia, Mallikarjun B R, Ayush Tewari, Vladislav Golyanik, Adam Kortylewski, Christian Theobalt |
Capturing and editing full head performances enables the creation of virtual
characters with various applications such as extended reality and media
production. The past few years witnessed a steep rise in the photorealism of
human head avatars. Such avatars can be controlled through different input data
modalities, including RGB, audio, depth, IMUs and others. While these data
modalities provide effective means of control, they mostly focus on editing the
head movements such as the facial expressions, head pose and/or camera
viewpoint. In this paper, we propose AvatarStudio, a text-based method for
editing the appearance of a dynamic full head avatar. Our approach builds on
existing work to capture dynamic performances of human heads using neural
radiance field (NeRF) and edits this representation with a text-to-image
diffusion model. Specifically, we introduce an optimization strategy for
incorporating multiple keyframes representing different camera viewpoints and
time stamps of a video performance into a single diffusion model. Using this
personalized diffusion model, we edit the dynamic NeRF by introducing
view-and-time-aware Score Distillation Sampling (VT-SDS) following a
model-based guidance approach. Our method edits the full head in a canonical
space, and then propagates these edits to remaining time steps via a pretrained
deformation network. We evaluate our method visually and numerically via a user
study, and results show that our method outperforms existing approaches. Our
experiments validate the design choices of our method and highlight that our
edits are genuine, personalized, as well as 3D- and time-consistent. |
AvatarStudio is the first text-driven method for editing the appearance of dynamic 3D human head avatars represented as dynamic NeRFs, enabling a wide range of personalized, 3D- and time-consistent edits. |
Existing methods for editing digital faces mainly focus on motion (e.g., facial expressions, head pose), while appearance editing is limited to relighting or non-photorealistic edits. Text-driven editing offers a user-friendly way to control and personalize dynamic avatars. |
The method fine-tunes a pre-trained text-to-image diffusion model on multiple keyframes from a multi-view video, capturing the identity from various viewpoints and time stamps. It then introduces a view- and time-aware Score Distillation Sampling (VT-SDS) approach to edit the dynamic NeRF based on the target text prompt while preserving identity and coherence. |
Generates a diverse range of photorealistic and non-photorealistic text-based edits, including changes to appearance and geometry.
Maintains the integrity of the input identity while adhering to the text prompt.
Produces 3D-consistent edits viewable from arbitrary camera angles and ensures temporal coherence for smooth video editing. |
Requires multi-view data captured in uniform illumination, limiting its application to controlled environments.
Computationally expensive, taking about 60 minutes to train on a single A100 GPU. |
text-driven editing, neural rendering, 3d dynamic human head avatar, diffusion model, nerf |
2306.00450
Report |
Exploring Open-Vocabulary Semantic Segmentation without Human Labels |
Jun Chen, Deyao Zhu, Guocheng Qian, Bernard Ghanem, Zhicheng Yan, Chenchen Zhu, Fanyi Xiao, Mohamed Elhoseiny, Sean Chang Culatana |
Semantic segmentation is a crucial task in computer vision that involves
segmenting images into semantically meaningful regions at the pixel level.
However, existing approaches often rely on expensive human annotations as
supervision for model training, limiting their scalability to large, unlabeled
datasets. To address this challenge, we present ZeroSeg, a novel method that
leverages the existing pretrained vision-language (VL) model (e.g. CLIP) to
train open-vocabulary zero-shot semantic segmentation models. Although VL models
have acquired extensive knowledge of visual concepts, it is non-trivial to
transfer this knowledge to the task of semantic segmentation, as they are usually
trained at an image level. ZeroSeg overcomes this by distilling the visual
concepts learned by VL models into a set of segment tokens, each summarizing a
localized region of the target image. We evaluate ZeroSeg on multiple popular
segmentation benchmarks, including PASCAL VOC 2012, PASCAL Context, and COCO,
in a zero-shot manner (i.e., no training or adaptation on target segmentation
datasets). Our approach achieves state-of-the-art performance when compared to
other zero-shot segmentation methods under the same training data, while also
performing competitively compared to strongly supervised methods. Finally, we
also demonstrated the effectiveness of ZeroSeg on open-vocabulary segmentation,
through both human studies and qualitative visualizations. |
Introduces ZeroSeg, a model for open-vocabulary zero-shot semantic segmentation that eliminates the need for human annotations by distilling knowledge from pre-trained vision-language models. |
Addresses the limitations of traditional supervised methods, which are expensive, time-consuming, and struggle to generalize to new visual concepts. This enables more flexible and efficient semantic segmentation learning. |
Utilizes a masked encoder-decoder architecture and a segmentation head that groups pixels into semantically meaningful segments. Employs multi-scale image feature distillation and a segment matching loss to transfer knowledge from a pre-trained CLIP visual encoder without relying on text annotations. |
Achieves competitive performance compared to supervised and weakly-supervised methods on PASCAL VOC 2012, PASCAL Context, and COCO datasets despite not using any segmentation labels during training.
Outperforms existing zero-shot segmentation methods, even those trained on significantly larger datasets.
Demonstrates superior performance in open-vocabulary segmentation tasks, as evidenced by human studies and qualitative visualizations. |
Potential biases present in the pre-trained vision-language models may perpetuate in ZeroSeg, requiring mitigation strategies.
Limited exploration of performance scaling with even larger datasets due to the inaccessibility of certain datasets like YFCC100M. |
semantic segmentation, zero-shot learning, open-vocabulary, vision-language models, knowledge distillation |
2306.00354
Report |
Addressing Negative Transfer in Diffusion Models |
Hyojun Go, JinYoung Kim, Yunsung Lee, Seunghyun Lee, Shinhyeok Oh, Hyeongdon Moon, Seungtaek Choi |
Diffusion-based generative models have achieved remarkable success in various
domains. They train a shared model on denoising tasks that encompass different
noise levels simultaneously, representing a form of multi-task learning (MTL).
However, analyzing and improving diffusion models from an MTL perspective
remains under-explored. In particular, MTL can sometimes lead to the well-known
phenomenon of negative transfer, which results in the performance degradation
of certain tasks due to conflicts between tasks. In this paper, we first aim to
analyze diffusion training from an MTL standpoint, presenting two key
observations: (O1) the task affinity between denoising tasks diminishes as the
gap between noise levels widens, and (O2) negative transfer can arise even in
diffusion training. Building upon these observations, we aim to enhance
diffusion training by mitigating negative transfer. To achieve this, we propose
leveraging existing MTL methods, but the presence of a huge number of denoising
tasks makes it computationally expensive to calculate the necessary per-task
loss or gradient. To address this challenge, we propose clustering the
denoising tasks into small task clusters and applying MTL methods to them.
Specifically, based on (O2), we employ interval clustering to enforce temporal
proximity among denoising tasks within clusters. We show that interval
clustering can be solved using dynamic programming, utilizing signal-to-noise
ratio, timestep, and task affinity for clustering objectives. Through this, our
approach addresses the issue of negative transfer in diffusion models by
allowing for efficient computation of MTL methods. We validate the efficacy of
proposed clustering and its integration with MTL methods through various
experiments, demonstrating 1) improved generation quality and 2) faster
training convergence of diffusion models. |
This paper investigates the presence of negative transfer in diffusion model training, where learning denoising tasks at different noise levels can negatively impact each other, and proposes a strategy using interval clustering and multi-task learning methods to mitigate it. |
Diffusion models, while successful, are inherently multi-task learners, and understanding and addressing the potential negative transfer between denoising tasks can lead to significant improvements in generation quality and training efficiency. |
The paper analyzes task affinity between different denoising tasks and observes negative transfer by comparing models trained on specific timestep intervals to models trained on all tasks. It then proposes to cluster denoising tasks into intervals based on timesteps, SNR, or task affinity, and applies MTL methods (PCGrad, NashMTL, Uncertainty Weighting) to these task clusters to reduce negative transfer. |
Incorporating MTL methods with interval clustering significantly improves image generation quality (FID, precision) compared to vanilla diffusion training across different datasets (FFHQ, CelebA-HQ, ImageNet) and architectures (ADM, LDM, DiT).
The proposed method achieves faster convergence compared to vanilla training.
Uncertainty Weighting (UW) generally achieves better sample quality, while NashMTL shows better distribution coverage, and PCGrad presents a balanced performance. |
Negative transfer is not completely resolved, suggesting room for further improvement by enabling the model to learn entire denoising tasks more harmoniously.
The study does not explore architectural designs specific to MTL for diffusion models, which could be a promising direction for future work. |
diffusion models, multi-task learning, negative transfer, interval clustering, image generation |
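The interval clustering step in the negative-transfer entry above (grouping denoising tasks into contiguous timestep intervals, solved by dynamic programming) can be sketched as a classic optimal 1-D segmentation. The clustering objective used here, the within-interval sum of squared deviations of one scalar per timestep such as a log-SNR value, is an assumption for illustration; the paper lists timestep, SNR, and task affinity as possible objectives.

```python
def interval_clustering(values, k):
    """Split `values` into k contiguous intervals minimizing the total
    within-interval sum of squared deviations. Returns half-open
    (start, end) index pairs."""
    n = len(values)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(values)")
    # Prefix sums give the cost of any interval in O(1).
    s1, s2 = [0.0], [0.0]
    for v in values:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def cost(i, j):
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[c][j]: minimal cost of splitting values[:j] into c intervals.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                cand = best[c - 1][i] + cost(i, j)
                if cand < best[c][j]:
                    best[c][j], cut[c][j] = cand, i
    # Walk the stored cut points back from the end.
    bounds, j = [], n
    for c in range(k, 0, -1):
        i = cut[c][j]
        bounds.append((i, j))
        j = i
    return bounds[::-1]


# Toy per-timestep scalars with three visible regimes (hypothetical values).
per_timestep = [0.0, 0.1, 0.2, 5.0, 5.1, 9.9, 10.0]
clusters = interval_clustering(per_timestep, k=3)
```

Each resulting interval would then be treated as one task by the chosen MTL method, so per-task losses or gradients are needed for k clusters instead of for every timestep.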
2306.00241
Report |
Balancing Reconstruction and Editing Quality of GAN Inversion for Real Image Editing with StyleGAN Prior Latent Space |
Kai Katsumata, Duc Minh Vo, Bei Liu, Hideki Nakayama |
The exploration of the latent space in StyleGANs and GAN inversion exemplify
impressive real-world image editing, yet the trade-off between reconstruction
quality and editing quality remains an open problem. In this study, we revisit
StyleGANs' hyperspherical prior $\mathcal{Z}$ and $\mathcal{Z}^+$ and integrate
them into seminal GAN inversion methods to improve editing quality. Besides
faithful reconstruction, our extensions achieve sophisticated editing quality
with the aid of the StyleGAN prior. We project the real images into the
proposed space to obtain the inverted codes, by which we then move along
$\mathcal{Z}^{+}$, enabling semantic editing without sacrificing image quality.
Comprehensive experiments show that $\mathcal{Z}^{+}$ can replace the most
commonly-used $\mathcal{W}$, $\mathcal{W}^{+}$, and $\mathcal{S}$ spaces while
preserving reconstruction quality, resulting in reduced distortion of edited
images. |
This paper revisits the use of StyleGAN's hyperspherical prior spaces, $\mathcal{Z}$ and $\mathcal{Z}^{+}$, for GAN inversion to enhance editing quality without sacrificing reconstruction quality. |
Existing GAN inversion methods struggle to balance high-fidelity reconstruction with the ability to perform semantic image edits without introducing artifacts. This work addresses this trade-off by leveraging the desirable properties of $\mathcal{Z}$ and $\mathcal{Z}^{+}$. |
The authors integrate $\mathcal{Z}^{+}$ into established GAN inversion techniques like BDInvert, SAM, and PTI, replacing unbounded latent spaces like $\mathcal{W}^{+}$ with the bounded $\mathcal{Z}^{+}$. They introduce the $\mathcal{F}/\mathcal{Z}^{+}$ space, combining $\mathcal{Z}^{+}$ with a feature space ($\mathcal{F}$) for improved reconstruction. Optimization retracts latent codes to the hypersphere surface during each iteration. |
$\mathcal{F}/\mathcal{Z}^{+}$ achieves reconstruction quality comparable to state-of-the-art methods like $\mathcal{F}/\mathcal{W}^{+}$ using both qualitative and quantitative metrics (LPIPS, MSE, SSIM).
Editing operations in $\mathcal{F}/\mathcal{Z}^{+}$ preserve image quality and identity significantly better than methods relying on $\mathcal{W}^{+}$, as shown with GANSpace and InterFaceGAN directions.
Integrating $\mathcal{Z}^{+}$ into other GAN inversion methods like PTI and SAM demonstrates consistent improvement in editing quality without hindering reconstruction. |
The authors primarily focus on evaluating their approach on face datasets, leaving exploration of other domains for future work.
Further investigation into the impact of different editing techniques and their compatibility with $\mathcal{Z}^{+}$ is warranted. |
gan inversion, stylegan, image editing, latent space, hyperspherical prior |
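The retraction step mentioned in the GAN inversion entry above (latent codes are pulled back to the hypersphere surface during each optimization iteration) amounts to rescaling the code to a fixed norm after every gradient update. In this sketch the default radius sqrt(d) is an assumption, chosen because it matches a unit root-mean-square normalization of a d-dimensional code; the gradient values are placeholders.

```python
import math


def retract_to_hypersphere(z, radius=None):
    """Rescale z so that its Euclidean norm equals `radius`
    (default: sqrt(len(z)), an assumed convention)."""
    r = math.sqrt(len(z)) if radius is None else radius
    norm = math.sqrt(sum(x * x for x in z))
    if norm == 0:
        raise ValueError("cannot retract the zero vector")
    return [r * x / norm for x in z]


def projected_step(z, grad, lr):
    """One gradient step followed by retraction onto the sphere."""
    stepped = [x - lr * g for x, g in zip(z, grad)]
    return retract_to_hypersphere(stepped)


z = retract_to_hypersphere([3.0, 4.0])
z_next = projected_step(z, grad=[0.5, -0.5], lr=0.1)
```

Keeping the code on the sphere is what makes the space bounded, in contrast to the unbounded spaces the entry says it replaces.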
2306.00219
Report |
Diffusion Brush: A Latent Diffusion Model-based Editing Tool for AI-generated Images |
Peyman Gholami, Robert Xiao |
Text-to-image generative models have made remarkable advancements in
generating high-quality images. However, generated images often contain
undesirable artifacts or other errors due to model limitations. Existing
techniques to fine-tune generated images are time-consuming (manual editing),
produce poorly-integrated results (inpainting), or result in unexpected changes
across the entire image (variation selection and prompt fine-tuning). In this
work, we present Diffusion Brush, a Latent Diffusion Model-based (LDM) tool to
efficiently fine-tune desired regions within an AI-synthesized image. Our
method introduces new random noise patterns at targeted regions during the
reverse diffusion process, enabling the model to efficiently make changes to
the specified regions while preserving the original context for the rest of the
image. We evaluate our method's usability and effectiveness through a user
study with artists, comparing our technique against other state-of-the-art
image inpainting techniques and editing software for fine-tuning AI-generated
imagery. |
This paper presents Diffusion Brush, a Latent Diffusion Model-based (LDM) tool for efficiently fine-tuning selected regions of an AI-generated image. |
Generated images often contain artifacts or other errors, and existing fixes are time-consuming (manual editing), poorly integrated (inpainting), or change the entire image unexpectedly (variation selection and prompt fine-tuning). |
The method introduces new random noise patterns at targeted regions during the reverse diffusion process, so the model changes only the specified regions while preserving the original context in the rest of the image. |
Edits are confined to user-specified regions while the rest of the image keeps its original context.
Usability and effectiveness are evaluated through a user study with artists.
The study compares Diffusion Brush against state-of-the-art image inpainting techniques and editing software for fine-tuning AI-generated imagery. |
The tool targets AI-synthesized images, since edits are applied during the reverse diffusion process.
Evaluation is based on a user study with artists. |
image editing, latent diffusion models, text-to-image generation, inpainting, user study |
2306.00180
Report |
FlowCam: Training Generalizable 3D Radiance Fields without Camera Poses via Pixel-Aligned Scene Flow |
Cameron Smith, Yilun Du, Ayush Tewari, Vincent Sitzmann |
Reconstruction of 3D neural fields from posed images has emerged as a
promising method for self-supervised representation learning. The key challenge
preventing the deployment of these 3D scene learners on large-scale video data
is their dependence on precise camera poses from structure-from-motion, which
is prohibitively expensive to run at scale. We propose a method that jointly
reconstructs camera poses and 3D neural scene representations online and in a
single forward pass. We estimate poses by first lifting frame-to-frame optical
flow to 3D scene flow via differentiable rendering, preserving locality and
shift-equivariance of the image processing backbone. SE(3) camera pose
estimation is then performed via a weighted least-squares fit to the scene flow
field. This formulation enables us to jointly supervise pose estimation and a
generalizable neural scene representation via re-rendering the input video, and
thus, train end-to-end and fully self-supervised on real-world video datasets.
We demonstrate that our method performs robustly on diverse, real-world video,
notably on sequences traditionally challenging to optimization-based pose
estimation techniques. |
Presents FlowCam, a method for jointly training a feed-forward generalizable 3D neural scene representation and camera trajectory estimation, self-supervised by re-rendering losses on video frames, without ground-truth camera poses or depth maps. |
Unlocks orders of magnitude more training data for 3D scene learners by removing dependence on expensive structure-from-motion for camera pose estimation, paving the way for large-scale 3D representation learning. |
Leverages single-image neural scene representations and differentiable rendering to lift frame-to-frame optical flow to 3D scene flow. Estimates SE(3) camera poses via a robust, weighted least-squares solver on the scene flow field. Jointly supervises pose estimation and neural scene representation via re-rendering the input video with RGB and flow losses. |
Outperforms state-of-the-art unposed methods on novel view synthesis.
Demonstrates robust pose estimation on sequences challenging for conventional SLAM approaches (e.g., ORB-SLAM3).
Generalizes to out-of-distribution scenes and supports fine-tuning for improved accuracy. |
As an odometry method, it suffers from drift and lacks loop closure.
Currently does not model scene dynamics. |
3d scene representation, camera pose estimation, self-supervised learning, differentiable rendering, neural radiance fields |
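The pose step in the FlowCam entry above (a weighted least-squares rigid fit to a flow field) has a closed form. The sketch below shows the 2-D analogue, fitting a rotation angle and translation in SE(2) to weighted point correspondences, because that case needs no SVD. It is not the paper's SE(3) solver, and the point sets and weights are made-up values.

```python
import math


def fit_rigid_2d(src, dst, weights):
    """Weighted least-squares fit of dst ~ R(theta) @ src + t in 2-D.

    Returns (theta, (tx, ty))."""
    wsum = sum(weights)
    if wsum <= 0:
        raise ValueError("weights must have a positive sum")
    # Weighted centroids of both point sets.
    sx = sum(w * p[0] for w, p in zip(weights, src)) / wsum
    sy = sum(w * p[1] for w, p in zip(weights, src)) / wsum
    dx = sum(w * q[0] for w, q in zip(weights, dst)) / wsum
    dy = sum(w * q[1] for w, q in zip(weights, dst)) / wsum
    # Accumulate weighted dot and cross products of centered points.
    s_dot = s_cross = 0.0
    for w, p, q in zip(weights, src, dst):
        ax, ay = p[0] - sx, p[1] - sy
        bx, by = q[0] - dx, q[1] - dy
        s_dot += w * (ax * bx + ay * by)
        s_cross += w * (ax * by - ay * bx)
    theta = math.atan2(s_cross, s_dot)
    c, s = math.cos(theta), math.sin(theta)
    # Translation maps the rotated source centroid onto the target centroid.
    return theta, (dx - (c * sx - s * sy), dy - (s * sx + c * sy))


# Synthetic correspondences: rotate by 30 degrees, translate by (1, 2).
true_theta, true_t = math.pi / 6, (1.0, 2.0)
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 1.0)]
c, s = math.cos(true_theta), math.sin(true_theta)
dst = [(c * x - s * y + true_t[0], s * x + c * y + true_t[1]) for x, y in src]
theta, t = fit_rigid_2d(src, dst, weights=[1.0, 2.0, 1.0, 0.5])
```

Per-point weights let unreliable correspondences contribute less to the fit, which is the role the entry assigns to the robust, weighted solver.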
2305.20091
Report |
Humans in 4D: Reconstructing and Tracking Humans with Transformers |
Shubham Goel, Georgios Pavlakos, Jathushan Rajasegaran, Angjoo Kanazawa, Jitendra Malik |
We present an approach to reconstruct humans and track them over time. At the
core of our approach, we propose a fully "transformerized" version of a network
for human mesh recovery. This network, HMR 2.0, advances the state of the art
and shows the capability to analyze unusual poses that have in the past been
difficult to reconstruct from single images. To analyze video, we use 3D
reconstructions from HMR 2.0 as input to a tracking system that operates in 3D.
This enables us to deal with multiple people and maintain identities through
occlusion events. Our complete approach, 4DHumans, achieves state-of-the-art
results for tracking people from monocular video. Furthermore, we demonstrate
the effectiveness of HMR 2.0 on the downstream task of action recognition,
achieving significant improvements over previous pose-based action recognition
approaches. Our code and models are available on the project website:
https://shubham-goel.github.io/4dhumans/. |
This paper presents a novel approach, 4DHumans, for reconstructing and tracking humans in videos using a fully transformer-based architecture, HMR 2.0, for 3D human mesh recovery. |
This work pushes the limits of analyzable videos with 3D human reconstruction techniques and achieves state-of-the-art results for human mesh recovery and tracking. |
The authors introduce HMR 2.0, a transformer-based network for reconstructing 3D human meshes from single images, and integrate it into a modified PHALP tracking system that operates in 3D. They train their models on a combination of datasets, leveraging pseudo-ground truth annotations for unlabeled data. |
HMR 2.0 surpasses previous methods in 2D and 3D pose accuracy metrics, particularly for unusual and challenging poses.
4DHumans achieves state-of-the-art tracking performance on the PoseTrack dataset, showing robustness to occlusions and complex scenes.
The accuracy of HMR 2.0's 3D pose estimations translates to superior performance on the downstream task of action recognition on the AVA dataset. |
The reliance on the SMPL model limits the system's ability to capture finer details such as hand poses, facial expressions, and variations in age and body shape.
Reconstructions are performed in the camera frame, neglecting a common world coordinate frame which is essential for comprehensive action understanding in videos. Future work could address camera motion and multi-person interactions. |
human mesh recovery, 3d human pose estimation, tracking, transformers, action recognition |
2305.20087
Report |
Too Large; Data Reduction for Vision-Language Pre-Training |
Alex Jinpeng Wang, Kevin Qinghong Lin, David Junhao Zhang, Stan Weixian Lei, Mike Zheng Shou |
This paper examines the problems of severe image-text misalignment and high
redundancy in the widely-used large-scale Vision-Language Pre-Training (VLP)
datasets. To address these issues, we propose an efficient and straightforward
Vision-Language learning algorithm called TL;DR, which aims to compress the
existing large VLP data into a small, high-quality set. Our approach consists
of two major steps. First, a codebook-based encoder-decoder captioner is
developed to select representative samples. Second, a new caption is generated
to complement the original captions for selected samples, mitigating the
text-image misalignment problem while maintaining uniqueness. As a result,
TL;DR enables us to reduce the large dataset into a small set of high-quality
data, which can serve as an alternative pre-training dataset. This algorithm
significantly speeds up the time-consuming pretraining process. Specifically,
TL;DR can compress the mainstream VLP datasets at a high ratio, e.g., reducing
the well-cleaned CC3M dataset from 2.82M to 0.67M (~24%) and the noisy YFCC15M
from 15M to 2.5M (~16.7%). Extensive experiments with three popular VLP
models over seven downstream tasks show that a VLP model trained on the
compressed dataset provided by TL;DR can achieve similar or even better results
than one trained on the full-scale dataset. The code will be made
available at https://github.com/showlab/datacentric.vlp. |
This paper introduces TL;DR, a novel algorithm designed to compress large-scale, noisy Vision-Language Pre-training (VLP) datasets into smaller, high-quality datasets. |
Large VLP datasets are computationally expensive to train on and often contain significant image-text misalignment and redundancy. TL;DR addresses these issues by creating smaller, more efficient datasets without sacrificing performance. |
TL;DR uses a two-stage approach: 1) Training a codebook-based captioner to cluster and select representative image-text pairs, and 2) Refining the captions of selected samples to reduce misalignment. |
Training VLP models on TL;DR-compressed datasets (10%-25% of the original size) achieves comparable or better performance than training on the full datasets.
The codebook-based clustering effectively groups semantically similar image-text pairs.
TL;DR successfully reduces image-text misalignment, as evidenced by improved Image-Text Matching (ITM) scores. |
The current implementation relies on manually selecting the compression ratio.
Further exploration is needed to achieve even higher compression ratios, potentially leveraging text-to-image generation models. |
vision-language pre-training, data reduction, dataset compression, image-text misalignment, codebook learning |
2305.20082
Report |
Control4D: Efficient 4D Portrait Editing with Text |
Ruizhi Shao, Jingxiang Sun, Cheng Peng, Zerong Zheng, Boyao Zhou, Hongwen Zhang, Yebin Liu |
We introduce Control4D, an innovative framework for editing dynamic 4D
portraits using text instructions. Our method addresses the prevalent
challenges in 4D editing, notably the inefficiencies of existing 4D
representations and the inconsistent editing effect caused by diffusion-based
editors. We first propose GaussianPlanes, a novel 4D representation that makes
Gaussian Splatting more structured by applying plane-based decomposition in 3D
space and time. This enhances both efficiency and robustness in 4D editing.
Furthermore, we propose to leverage a 4D generator to learn a more continuous
generation space from inconsistent edited images produced by the
diffusion-based editor, which effectively improves the consistency and quality
of 4D editing. Comprehensive evaluation demonstrates the superiority of
Control4D, including significantly reduced training time, high-quality
rendering, and spatial-temporal consistency in 4D portrait editing. The link to
our project website is https://control4darxiv.github.io. |
Control4D is a novel framework for efficient, high-quality, and temporally consistent editing of dynamic 4D portraits using text instructions. |
Existing 4D editing techniques lack interactivity and struggle with spatial-temporal consistency and quality. Control4D addresses these limitations by introducing an efficient 4D representation and a novel editing framework. |
The method utilizes GaussianPlanes, a 4D representation built upon Gaussian Splatting with plane-based decomposition for efficiency and robustness. It integrates a 4D generator with a 2D diffusion-based editor to learn a continuous generation space and mitigate inconsistencies in edited images. |
Significantly reduced training time compared to previous methods.
Achieves high-quality rendering of dynamic portraits with intricate details.
Ensures spatiotemporal consistency in editing, resulting in coherent and realistic 4D edits. |
Challenges in handling rapid and extensive non-rigid movements due to reliance on flow learning.
Limited editing granularity due to ControlNet constraints, preventing precise expression or action edits. |
4d portrait editing, text-guided editing, gaussian splatting, generative adversarial networks (gans), diffusion models |
2305.20049
Report |
A Unified Conditional Framework for Diffusion-based Image Restoration |
Yi Zhang, Xiaoyu Shi, Dasong Li, Xiaogang Wang, Jian Wang, Hongsheng Li |
Diffusion Probabilistic Models (DPMs) have recently shown remarkable
performance in image generation tasks, which are capable of generating highly
realistic images. When adopting DPMs for image restoration tasks, the crucial
aspect lies in how to integrate the conditional information to guide the DPMs
to generate accurate and natural output, which has been largely overlooked in
existing works. In this paper, we present a unified conditional framework based
on diffusion models for image restoration. We leverage a lightweight UNet to
predict initial guidance and the diffusion model to learn the residual of the
guidance. By carefully designing the basic module and integration module for
the diffusion model block, we integrate the guidance and other auxiliary
conditional information into every block of the diffusion model to achieve
spatially-adaptive generation conditioning. To handle high-resolution images,
we propose a simple yet effective inter-step patch-splitting strategy to
produce arbitrary-resolution images without grid artifacts. We evaluate our
conditional framework on three challenging tasks: extreme low-light denoising,
deblurring, and JPEG restoration, demonstrating its significant improvements in
perceptual quality and the generalization to restoration tasks. |
This paper presents a unified conditional framework based on diffusion models for image restoration, focusing on effectively integrating conditional information (e.g., degraded image, noise level) into the diffusion process. |
Existing image restoration methods using diffusion models often lack effective integration of conditional information, limiting their ability to generate accurate and natural outputs. |
The framework utilizes a lightweight UNet to predict an initial guidance image and employs a diffusion model to learn the residual details. It introduces an Adaptive Kernel Guidance Block (AKGB) to adaptively fuse conditional information into each diffusion block for spatially-adaptive generation. An inter-step patch-splitting strategy is proposed for high-resolution image generation without grid artifacts. |
The method achieves state-of-the-art perceptual quality on extreme low-light denoising (SID dataset), outperforming existing regression and diffusion-based methods.
It demonstrates superior performance on image deblurring (GoPro dataset), surpassing previous deblurring methods in perceptual metrics.
The framework generalizes well to JPEG restoration, showing significant improvements over regression-based methods and previous diffusion-based approaches. |
The current implementation uses a simple uniform noise schedule for faster sampling, which could be further optimized with advanced sampling techniques.
While the method excels in generating realistic textures, it may occasionally produce unnatural details like incorrect characters, requiring further exploration of generation control mechanisms. |
image restoration, diffusion models, conditional image generation, adaptive kernel guidance, high-resolution image synthesis |
2305.19858
Report |
Enhancing image quality prediction with self-supervised visual masking |
Uğur Çoğalan, Mojtaba Bemana, Hans-Peter Seidel, Karol Myszkowski |
Full-reference image quality metrics (FR-IQMs) aim to measure the visual
differences between a pair of reference and distorted images, with the goal of
accurately predicting human judgments. However, existing FR-IQMs, including
traditional ones like PSNR and SSIM and even perceptual ones such as HDR-VDP,
LPIPS, and DISTS, still fall short in capturing the complexities and nuances of
human perception. In this work, rather than devising a novel IQM model, we seek
to improve upon the perceptual quality of existing FR-IQM methods. We achieve
this by considering visual masking, an important characteristic of the human
visual system that changes its sensitivity to distortions as a function of
local image content. Specifically, for a given FR-IQM metric, we propose to
predict a visual masking model that modulates reference and distorted images in
a way that penalizes the visual errors based on their visibility. Since the
ground truth visual masks are difficult to obtain, we demonstrate how they can
be derived in a self-supervised manner solely based on mean opinion scores
(MOS) collected from an FR-IQM dataset. Our approach results in enhanced FR-IQM
metrics that are more in line with human prediction both visually and
quantitatively. |
This paper introduces a self-supervised visual masking approach to enhance the perceptual quality prediction of existing full-reference image quality metrics (FR-IQMs). |
Existing FR-IQMs, both traditional and learning-based, often fail to accurately capture the complexities of human perception, particularly the phenomenon of visual masking. |
A lightweight CNN is trained to predict a visual mask for a given reference and distorted image pair. This mask, learned in a self-supervised manner using MOS data, modulates the input images or features to emphasize perceptually important distortions. |
The proposed method significantly improves the correlation of various classic and learning-based FR-IQMs with human judgments on standard benchmark datasets.
The generated error maps better align with human perception of distortion visibility compared to the original metrics.
The enhanced metrics show promise as loss functions for image restoration tasks, improving perceptual quality in denoising and deblurring. |
The effectiveness of the visual masking model is limited to the specific viewing conditions and display setup of the training dataset.
Integrating the masking model into complex end-to-end deep learning-based metrics might be challenging. |
image quality assessment, visual masking, perceptual metrics, deep learning, self-supervised learning |
2305.19599
Report |
RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment |
Guian Fang, Zutao Jiang, Jianhua Han, Guansong Lu, Hang Xu, Shengcai Liao, Xiaodan Liang |
Recent advances in text-to-image diffusion models have achieved remarkable
success in generating high-quality, realistic images from textual descriptions.
However, these approaches have faced challenges in precisely aligning the
generated visual content with the textual concepts described in the prompts. In
this paper, we propose a two-stage coarse-to-fine semantic re-alignment method,
named RealignDiff, aimed at improving the alignment between text and images in
text-to-image diffusion models. In the coarse semantic re-alignment phase, a
novel caption reward, leveraging the BLIP-2 model, is proposed to evaluate the
semantic discrepancy between the generated image caption and the given text
prompt. Subsequently, the fine semantic re-alignment stage employs a local
dense caption generation module and a re-weighting attention modulation module
to refine the previously generated images from a local semantic view.
Experimental results on the MS-COCO benchmark demonstrate that the proposed
two-stage coarse-to-fine semantic re-alignment method outperforms other
baseline re-alignment techniques by a substantial margin in both visual quality
and semantic similarity with the input prompt. |
This paper proposes RealignDiff, a two-stage coarse-to-fine semantic re-alignment method for text-to-image diffusion models, aiming to improve the alignment between generated images and input textual descriptions. |
Existing text-to-image diffusion models often struggle to precisely align the generated visual content with the concepts described in textual prompts, particularly in capturing object attributes and relationships. |
The method consists of: (1) **Coarse Semantic Re-alignment:** Fine-tuning the diffusion model using a novel caption reward, which leverages the BLIP-2 model to evaluate the semantic similarity between the generated image caption and the input prompt. (2) **Fine Semantic Re-alignment:** Refining the generated image from a local semantic view using a local dense caption generation module and a re-weighting attention modulation module. |
RealignDiff significantly outperforms other baseline re-alignment techniques on the MS-COCO benchmark in terms of visual quality and semantic similarity with the input prompt.
The proposed caption reward proves to be more effective than traditional reward functions like CLIP reward, BLIP reward, and ImageReward.
Ablation studies demonstrate that both the coarse and fine semantic re-alignment modules contribute to the improved performance. |
The fine semantic re-alignment stage may fail if the large language model fails to provide accurate intermediate results, requiring further research to address this limitation.
Future work will explore the dynamic learning from multiple reward functions, incorporating both semantic and aesthetic considerations, within the diffusion model. |
text-to-image generation, diffusion models, semantic alignment, caption reward, attention modulation |
2305.19412
Report |
Are Large Kernels Better Teachers than Transformers for ConvNets? |
Tianjin Huang, Lu Yin, Zhenyu Zhang, Li Shen, Meng Fang, Mykola Pechenizkiy, Zhangyang Wang, Shiwei Liu |
This paper reveals a new appeal of the recently emerged large-kernel
Convolutional Neural Networks (ConvNets): as the teacher in Knowledge
Distillation (KD) for small-kernel ConvNets. While Transformers have led
state-of-the-art (SOTA) performance in various fields with ever-larger models
and labeled data, small-kernel ConvNets are considered more suitable for
resource-limited applications due to the efficient convolution operation and
compact weight sharing. KD is widely used to boost the performance of
small-kernel ConvNets. However, previous research shows that it is not quite
effective to distill knowledge (e.g., global information) from Transformers to
small-kernel ConvNets, presumably due to their disparate architectures. We
hereby carry out a first-of-its-kind study unveiling that modern large-kernel
ConvNets, a compelling competitor to Vision Transformers, are remarkably more
effective teachers for small-kernel ConvNets, due to more similar
architectures. Our findings are backed up by extensive experiments on both
logit-level and feature-level KD "out of the box", with no dedicated
architectural nor training recipe modifications. Notably, we obtain the
best-ever pure ConvNet under 30M parameters with 83.1% top-1
accuracy on ImageNet, outperforming current SOTA methods including ConvNeXt V2
and Swin V2. We also find that beneficial characteristics of large-kernel
ConvNets, e.g., larger effective receptive fields, can be seamlessly
transferred to students through this large-to-small kernel distillation. Code
is available at: https://github.com/VITA-Group/SLaK. |
This paper investigates the effectiveness of using large-kernel Convolutional Neural Networks (ConvNets) as teachers for knowledge distillation into small-kernel ConvNets, finding them superior to Vision Transformers in this role. |
Small-kernel ConvNets are preferable for resource-constrained applications, but struggle to match the performance of large-scale Vision Transformers. Knowledge distillation offers a path to improve their performance without increasing model size, but previous work showed limited effectiveness when distilling from Transformers to ConvNets. |
The authors conduct systematic experiments on ImageNet, distilling various large-kernel ConvNets (ConvNeXt, SLaK) and Vision Transformers (ViT, Swin, CSWin) into small-kernel ConvNets (ResNet-50, ConvNeXt-T), using both logit-level (KD, NKD) and feature-level (FD) distillation methods. They analyze performance gains, effective receptive fields (ERF), and robustness of the distilled models. |
Large-kernel ConvNets consistently outperform Vision Transformers as teachers for small-kernel ConvNets across different distillation methods.
Students distilled from larger kernel teachers achieve better performance than those trained on smaller kernels, indicating successful transfer of the benefits of large kernels.
Students distilled from large-kernel ConvNets inherit their advantageous properties, exhibiting larger and denser ERF and improved robustness compared to those distilled from Transformers. |
The study primarily focuses on ImageNet classification; further investigation is needed for other tasks and datasets.
Future work could explore optimal distillation techniques and training recipes tailored for large-to-small kernel knowledge transfer. |
knowledge distillation, convolutional neural networks, vision transformers, large kernels, robustness |
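The logit-level distillation mentioned in the record above can be illustrated with the classic temperature-scaled KD loss. This is a minimal NumPy sketch, not the paper's training code: the record also lists NKD and feature-level distillation, which are not shown, and the default `temperature=4.0` and the function names are assumptions made for illustration.

```python
import numpy as np


def log_softmax(z, axis=-1):
    """Numerically stable log-softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Mean KL(teacher || student) over softened distributions, scaled by T^2.

    Both inputs are (batch, num_classes) arrays of raw logits.
    The temperature default is illustrative, not taken from the source.
    """
    log_p_teacher = log_softmax(teacher_logits / temperature)
    log_p_student = log_softmax(student_logits / temperature)
    kl = (np.exp(log_p_teacher) * (log_p_teacher - log_p_student)).sum(axis=-1)
    return float(temperature ** 2 * kl.mean())


# Usage: the loss vanishes when the student matches the teacher exactly.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 10))
student = rng.normal(size=(8, 10))
loss_same = kd_loss(teacher, teacher)
loss_diff = kd_loss(student, teacher)
```

In practice this term is usually combined with the ordinary cross-entropy on ground-truth labels. The weighting between the two is a training-recipe choice that the record does not specify.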
2305.19327
Report |
Cones 2: Customizable Image Synthesis with Multiple Subjects |
Zhiheng Liu, Yifei Zhang, Yujun Shen, Kecheng Zheng, Kai Zhu, Ruili Feng, Yu Liu, Deli Zhao, Jingren Zhou, Yang Cao |
Synthesizing images with user-specified subjects has received growing
attention due to its practical applications. Despite the recent success in
single subject customization, existing algorithms suffer from high training
cost and low success rate along with increased number of subjects. Towards
controllable image synthesis with multiple subjects as the constraints, this
work studies how to efficiently represent a particular subject as well as how
to appropriately compose different subjects. We find that the text embedding
regarding the subject token already serves as a simple yet effective
representation that supports arbitrary combinations without any model tuning.
Through learning a residual on top of the base embedding, we manage to robustly
shift the raw subject to the customized subject given various text conditions.
We then propose to employ layout, a very abstract and easy-to-obtain prior, as
the spatial guidance for subject arrangement. By rectifying the activations in
the cross-attention map, the layout appoints and separates the location of
different subjects in the image, significantly alleviating the interference
across them. Both qualitative and quantitative experimental results demonstrate
our superiority over state-of-the-art alternatives under a variety of settings
for multi-subject customization. |
This paper proposes Cones 2, a novel approach for multi-subject customization using a pre-trained text-to-image diffusion model that utilizes a simple yet effective representation to register a subject and enables the arbitrary composition of various subjects without requiring any model retraining. |
Existing algorithms for user-specified subject image synthesis, despite success in single subject customization, suffer from high training cost and low success rate when multiple subjects are introduced. This work addresses the need for controllable image synthesis with multiple subjects as constraints. |
The approach decomposes the task into two components: 1) efficiently representing a subject, achieved by fine-tuning the text encoder with subject-specific images and deriving residual token embeddings, and 2) effectively combining different subjects, addressed by a layout guidance method that controls the generation process by rectifying activations in cross-attention maps based on a user-defined layout. |
The method demonstrates superior performance over existing baselines in multi-subject customization, particularly with three or more subjects, as evidenced by quantitative metrics and user studies.
It effectively mitigates attribute confusion among subjects with high semantic similarity, a challenge faced by other methods.
The approach allows for the customization of image synthesis with a relatively large number of subjects (e.g., six subjects). |
The approach may not consistently generate satisfactory results when combining more than six subjects.
The user-provided layout needs to be roughly consistent with the textual description to achieve desired generation results. |
image synthesis, text-to-image generation, multi-subject customization, diffusion models, layout guidance |
2305.19270
Report |
Learning without Forgetting for Vision-Language Models |
Da-Wei Zhou, Yuanhan Zhang, Jingyi Ning, Han-Jia Ye, De-Chuan Zhan, Ziwei Liu |
Class-Incremental Learning (CIL) or continual learning is a desired
capability in the real world, which requires a learning system to adapt to new
tasks without forgetting former ones. While traditional CIL methods focus on
visual information to grasp core features, recent advances in Vision-Language
Models (VLM) have shown promising capabilities in learning generalizable
representations with the aid of textual information. However, when continually
trained with new classes, VLMs often suffer from catastrophic forgetting of
former knowledge. Applying VLMs to CIL poses two major challenges: 1) how to
adapt the model without forgetting; and 2) how to make full use of the
multi-modal information. To this end, we propose PROjectiOn Fusion (PROOF) that
enables VLMs to learn without forgetting. To handle the first challenge, we
propose training task-specific projections based on the frozen image/text
encoders. When facing new tasks, new projections are expanded and former
projections are fixed, alleviating the forgetting of old concepts. For the
second challenge, we propose the fusion module to better utilize the
cross-modality information. By jointly adjusting visual and textual features,
the model can capture semantic information with stronger representation
ability. Extensive experiments on nine benchmark datasets validate PROOF
achieves state-of-the-art performance. |
This paper presents PROjectiOn Fusion (PROOF), a novel approach to address catastrophic forgetting in Vision-Language Models (VLMs) for Class-Incremental Learning (CIL). |
The paper addresses the limitations of existing CIL methods that either rely solely on visual information or suffer from catastrophic forgetting when adapting VLMs incrementally. VLMs offer the potential for learning generalizable representations by leveraging textual information, making them suitable for CIL. |
PROOF utilizes a two-fold strategy: 1) **Expandable Feature Projection:** Freezing pre-trained image/text backbones and appending task-specific linear projections to capture new concepts without overwriting old ones. 2) **Contextualizing Projections with Projection Fusion:** Employing self-attention to fuse and adapt query instance embeddings with visual and textual context, including prototypes and learnable prompts, for robust classification. |
PROOF achieves state-of-the-art performance on nine benchmark CIL datasets, consistently outperforming existing methods.
The ablation study validates the contribution of both expandable projections and cross-modal fusion to the model's performance.
A variation of PROOF is proposed to address the zero-shot performance degradation in CIL, striking a balance between adapting to new tasks and preserving generalization ability. |
The current implementation of PROOF relies on exemplars for rehearsal, which may raise storage and privacy concerns.
Future work includes extending PROOF to exemplar-free scenarios and exploring its application in other VLMs and vision-language tasks. |
class-incremental learning, vision-language models, catastrophic forgetting, projection fusion, cross-modal learning |
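The expandable projections described in the PROOF record above (a new projection per task, with earlier projections fixed) can be sketched as follows. This is a hedged illustration rather than the authors' code: the class name `ExpandableProjection`, the aggregation of task projections by summation, the initialization scale, and the use of NumPy's read-only flag as a stand-in for freezing parameters are all assumptions of this sketch.

```python
import numpy as np


class ExpandableProjection:
    """Sum of per-task linear projections applied to frozen encoder features."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self.rng = np.random.default_rng(seed)
        self.projections = []  # one (dim, dim) matrix per task seen so far

    def expand(self):
        """Freeze all existing projections and append a trainable one for a new task."""
        for p in self.projections:
            p.flags.writeable = False  # stand-in for disabling gradient updates
        new = self.rng.normal(scale=0.02, size=(self.dim, self.dim))
        self.projections.append(new)
        return new

    def __call__(self, features):
        """Project (N, dim) features with every task projection and sum the results."""
        return sum(features @ p for p in self.projections)


# Usage: two incremental tasks; the first projection can no longer be modified.
proj = ExpandableProjection(dim=8)
proj.expand()  # task 1
proj.expand()  # task 2: the task-1 projection is now frozen
features = np.ones((4, 8))  # placeholder for frozen image/text encoder outputs
out = proj(features)
```

Only the most recent projection remains writable, which mirrors the idea that old concepts are not overwritten when new classes arrive. The fusion module from the record is not covered here.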
2305.19193
Report |
Video ControlNet: Towards Temporally Consistent Synthetic-to-Real Video Translation Using Conditional Image Diffusion Models |
Ernie Chu, Shuo-Yen Lin, Jun-Cheng Chen |
In this study, we present an efficient and effective approach for achieving
temporally consistent synthetic-to-real video translation in videos of varying
lengths. Our method leverages off-the-shelf conditional image diffusion models,
allowing us to perform multiple synthetic-to-real image generations in
parallel. By utilizing the available optical flow information from the
synthetic videos, our approach seamlessly enforces temporal consistency among
corresponding pixels across frames. This is achieved through joint noise
optimization, effectively minimizing spatial and temporal discrepancies. To the
best of our knowledge, our proposed method is the first to accomplish diverse
and temporally consistent synthetic-to-real video translation using conditional
image diffusion models. Furthermore, our approach does not require any training
or fine-tuning of the diffusion models. Extensive experiments conducted on
various benchmarks for synthetic-to-real video translation demonstrate the
effectiveness of our approach, both quantitatively and qualitatively. Finally,
we show that our method outperforms other baseline methods in terms of both
temporal consistency and visual quality. |
This paper presents Video ControlNet, a novel approach that leverages pre-trained conditional image diffusion models, like ControlNet, to achieve temporally consistent synthetic-to-real video translation. |
Existing image-to-image translation methods often produce temporally inconsistent videos with flickering artifacts. Video ControlNet addresses this issue by enforcing temporal consistency through a novel optimization process. |
The method uses optical flow information from synthetic videos to guide a joint noise optimization process, minimizing discrepancies between corresponding pixels across frames. This ensures both spatial and temporal consistency in the generated videos. |
Video ControlNet significantly reduces temporal inconsistency compared to vanilla ControlNet, as evidenced by lower average endpoint errors in optical flow estimation.
It demonstrates improved instance-level temporal consistency, resulting in fewer ID switches and fragmentation in multi-object tracking.
The paper introduces effective acceleration techniques that significantly speed up the optimization process without compromising temporal consistency. |
The optimization process might lead to slightly blurry videos due to the emphasis on temporal consistency.
Further exploration of different interpolation techniques for in-between frame generation could enhance overall quality and efficiency. |
video generation, diffusion models, temporal consistency, synthetic-to-real, controlnet |
2305.19129
Report |
Key-Value Transformer |
Ali Borji |
Transformers have emerged as the prevailing standard solution for various AI
tasks, including computer vision and natural language processing. The widely
adopted Query, Key, and Value formulation (QKV) has played a significant role
in this. Nevertheless, no research has examined the essentiality of these three
components for transformer performance. Therefore, we conducted an evaluation
of the key-value formulation (KV), which generates symmetric attention maps,
along with an asymmetric version that incorporates a 2D positional encoding
into the attention matrix. Remarkably, this transformer requires fewer
parameters and computation than the original one. Through experiments
encompassing three task types -- synthetics (such as reversing or sorting a
list), vision (MNIST or CIFAR classification), and NLP (character generation
and translation) -- we discovered that the KV transformer occasionally
outperforms the QKV transformer. However, it also exhibits instances of
underperformance compared to QKV, making it challenging to draw a definitive
conclusion. Nonetheless, we consider the reported results to be encouraging and
anticipate that they may pave the way for more efficient transformers in the
future. |
This paper explores the necessity of the Query, Key, and Value (QKV) formulation in transformers, proposing two alternative models: Key-Value (KV) and KV with 2D positional encoding (KV+Pos). |
Investigating the essentiality of QKV can lead to more efficient transformer architectures with reduced computational complexity and parameters. |
The authors empirically evaluate KV and KV+Pos against QKV on 13 tasks spanning synthetics (list manipulation), vision (classification, anomaly detection), and NLP (character/number generation, translation). |
KV and KV+Pos demonstrate competitive performance, occasionally outperforming QKV, particularly in synthetic and vision tasks.
KV+Pos often surpasses KV, suggesting the importance of asymmetry in attention for certain tasks.
The effectiveness of KV attention varies across tasks, indicating a need for further investigation into the role of symmetric attention. |
The study primarily relies on empirical evaluation without delving into the theoretical underpinnings of why KV might succeed or fail.
Future work could explore the impact of KV attention on larger, more complex tasks and datasets. |
transformers, attention mechanism, key-value attention, symmetric attention, model efficiency |
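The key-value formulation summarized above can be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes a single head, reuses the key projection in place of the query projection, and uses hypothetical names (`kv_attention`, `w_k`, `w_v`). It shows that the pre-softmax score matrix is symmetric by construction.

```python
import math
import random


def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Numerically stable row-wise softmax."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def kv_attention(x, w_k, w_v):
    """Single-head attention where keys also play the role of queries.

    Assumption: scores = K K^T / sqrt(d), so no query projection is needed.
    """
    k = matmul(x, w_k)
    v = matmul(x, w_v)
    d = len(k[0])
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(k, transpose(k))]
    return scores, matmul(softmax_rows(scores), v)


random.seed(0)
n, d_model, d_head = 4, 6, 3
x = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n)]
w_k = [[random.gauss(0, 1) for _ in range(d_head)] for _ in range(d_model)]
w_v = [[random.gauss(0, 1) for _ in range(d_head)] for _ in range(d_model)]
scores, out = kv_attention(x, w_k, w_v)
```

Dropping the query projection removes one weight matrix per head, which is where the parameter saving comes from. The row-wise softmax is applied after the symmetric scores are formed, so the normalized attention weights need not be symmetric themselves.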
2305.19094
Report |
Diffusion Model for Dense Matching |
Jisu Nam, Gyuseong Lee, Sunwoo Kim, Hyeonsu Kim, Hyoungwon Cho, Seyeon Kim, Seungryong Kim |
The objective for establishing dense correspondence between paired images
consists of two terms: a data term and a prior term. While conventional
techniques focused on defining hand-designed prior terms, which are difficult
to formulate, recent approaches have focused on learning the data term with
deep neural networks without explicitly modeling the prior, assuming that the
model itself has the capacity to learn an optimal prior from a large-scale
dataset. Although the performance improvement was obvious, these approaches often fail to
address inherent ambiguities of matching, such as textureless regions,
repetitive patterns, and large displacements. To address this, we propose
DiffMatch, a novel conditional diffusion-based framework designed to explicitly
model both the data and prior terms. Unlike previous approaches, this is
accomplished by leveraging a conditional denoising diffusion model. DiffMatch
consists of two main components: a conditional denoising diffusion module and
a cost injection module. We stabilize the training process and reduce memory
usage with a stage-wise training strategy. Furthermore, to boost performance,
we introduce an inference technique that finds a better path to the accurate
matching field. Our experimental results demonstrate significant performance
improvements of our method over existing approaches, and the ablation studies
validate our design choices along with the effectiveness of each component.
Project page is available at https://ku-cvlab.github.io/DiffMatch/. |
This paper introduces DiffMatch, a novel diffusion-based framework for dense correspondence that explicitly learns both data and prior terms of matching field distribution. |
Existing methods for dense correspondence often struggle with inherent ambiguities like textureless regions and repetitive patterns, as they mainly focus on the data term without explicitly modeling the matching prior. |
The proposed method leverages a conditional denoising diffusion model conditioned on initial correspondence and local matching cost. Additionally, a cascaded pipeline with a super-resolution diffusion model is used for upsampling the matching field. |
DiffMatch achieves state-of-the-art performance on standard benchmarks like HPatches and ETH3D.
The method shows robustness to image corruptions, outperforming previous approaches on ImageNet-C corrupted benchmarks.
Analysis demonstrates the efficacy of the generative prior in capturing the matching field manifold and handling challenging matching scenarios. |
The performance on ETH3D with small displacements is slightly lower, potentially due to lower input resolution compared to some prior works.
Future work includes exploring higher resolution, advanced feature extractors beyond VGG-16, and incorporating techniques like zoom-in and patch-match for detailed matching. |
dense correspondence, diffusion models, generative prior, matching field, image corruptions |
2305.19066
Report |
Nested Diffusion Processes for Anytime Image Generation |
Noam Elata, Bahjat Kawar, Tomer Michaeli, Michael Elad |
Diffusion models are the current state-of-the-art in image generation,
synthesizing high-quality images by breaking down the generation process into
many fine-grained denoising steps. Despite their good performance, diffusion
models are computationally expensive, requiring many neural function
evaluations (NFEs). In this work, we propose an anytime diffusion-based method
that can generate viable images when stopped at arbitrary times before
completion. Using existing pretrained diffusion models, we show that the
generation scheme can be recomposed as two nested diffusion processes, enabling
fast iterative refinement of a generated image. In experiments on ImageNet and
Stable Diffusion-based text-to-image generation, we show, both qualitatively
and quantitatively, that our method's intermediate generation quality greatly
exceeds that of the original diffusion model, while the final generation result
remains comparable. We illustrate the applicability of Nested Diffusion in
several settings, including for solving inverse problems, and for rapid
text-based content creation by allowing user intervention throughout the
sampling process. |
This paper introduces Nested Diffusion, an anytime sampling algorithm for pre-trained diffusion models, enabling the generation of plausible images even with early termination. |
Diffusion models (DMs) excel in image generation but are computationally expensive. Existing methods struggle to produce high-quality intermediate images during sampling, hindering user intervention and real-time applications. |
Nested Diffusion embeds an inner diffusion process within each step of an outer diffusion process. The inner diffusion generates plausible images iteratively, providing progressively refined intermediate outputs. |
Nested Diffusion generates superior intermediate images compared to vanilla DMs, as demonstrated by FID scores on ImageNet.
Text-to-image generation using Stable Diffusion shows that Nested Diffusion provides semantically meaningful intermediate outputs, unlike vanilla Stable Diffusion.
Nested Diffusion successfully generalizes to inverse problems, enabling anytime solutions for tasks like inpainting, super-resolution, and colorization. |
Nested Diffusion requires careful tuning of the ratio between outer and inner diffusion steps.
Further exploration of dynamic allocation for the number of inner steps per outer step is left for future work. |
diffusion models, anytime algorithms, image generation, inverse problems, human-in-the-loop learning |
2305.19012
Report |
StyleAvatar3D: Leveraging Image-Text Diffusion Models for High-Fidelity 3D Avatar Generation |
Chi Zhang, Yiwen Chen, Yijun Fu, Zhenglin Zhou, Gang YU, Billzb Wang, Bin Fu, Tao Chen, Guosheng Lin, Chunhua Shen |
The recent advancements in image-text diffusion models have stimulated
research interest in large-scale 3D generative models. Nevertheless, the
limited availability of diverse 3D resources presents significant challenges to
learning. In this paper, we present a novel method for generating high-quality,
stylized 3D avatars that utilizes pre-trained image-text diffusion models for
data generation and a Generative Adversarial Network (GAN)-based 3D generation
network for training. Our method leverages the comprehensive priors of
appearance and geometry offered by image-text diffusion models to generate
multi-view images of avatars in various styles. During data generation, we
employ poses extracted from existing 3D models to guide the generation of
multi-view images. To address the misalignment between poses and images in
data, we investigate view-specific prompts and develop a coarse-to-fine
discriminator for GAN training. We also delve into attribute-related prompts to
increase the diversity of the generated avatars. Additionally, we develop a
latent diffusion model within the style space of StyleGAN to enable the
generation of avatars based on image inputs. Our approach demonstrates superior
performance over current state-of-the-art methods in terms of visual quality
and diversity of the produced avatars. |
This paper proposes a novel framework for generating high-fidelity 3D avatars by leveraging pre-trained text-to-image diffusion models for data generation and training a 3D GAN. |
Existing 3D generative models are limited by the scarcity and lack of diversity in 3D training data. This work leverages the rich priors of image-text diffusion models to address this challenge. |
The proposed method uses ControlNet with Stable Diffusion to generate multi-view stylized avatar images guided by poses and text prompts. A coarse-to-fine discriminator is introduced to handle image-pose misalignment during 3D GAN training. Finally, a latent diffusion model in the StyleGAN latent space enables image-conditioned 3D avatar generation. |
The coarse-to-fine discriminator significantly outperforms existing methods, achieving a FID of 5.6 compared to 7.8 for EG3D and 20.9 for PoF3D.
The framework successfully generates diverse and high-quality 3D avatars with various styles defined by text prompts or example images.
The latent diffusion model effectively captures facial features and allows for conditional 3D avatar generation even with large pose angles or out-of-domain input images. |
The reliance on pre-trained pose estimators for guidance can introduce inaccuracies in synthesized images, especially for complex styles.
The current implementation focuses on avatar generation and may require further exploration for general 3D object generation. |
3d avatar generation, text-to-3d, image-to-3d, diffusion models, generative adversarial networks |
2305.18980
Report |
Multi-modal Queried Object Detection in the Wild |
Yifan Xu, Mengdan Zhang, Chaoyou Fu, Peixian Chen, Xiaoshan Yang, Ke Li, Changsheng Xu |
We introduce MQ-Det, an efficient architecture and pre-training strategy
design to utilize both textual description with open-set generalization and
visual exemplars with rich description granularity as category queries, namely,
Multi-modal Queried object Detection, for real-world detection with both
open-vocabulary categories and various granularity. MQ-Det incorporates vision
queries into existing well-established language-queried-only detectors. A
plug-and-play gated class-scalable perceiver module upon the frozen detector is
proposed to augment category text with class-wise visual information. To
address the learning inertia problem brought by the frozen detector, a vision
conditioned masked language prediction strategy is proposed. MQ-Det's simple
yet effective architecture and training strategy design is compatible with most
language-queried object detectors, thus yielding versatile applications.
Experimental results demonstrate that multi-modal queries largely boost
open-world detection. For instance, MQ-Det significantly improves the
state-of-the-art open-set detector GLIP by +7.8% AP on the LVIS benchmark via
multi-modal queries without any downstream finetuning, and by +6.3% AP on average
on 13 few-shot downstream tasks, with merely 3% additional modulating time
required by GLIP. Code is available at https://github.com/YifanXu74/MQ-Det. |
This paper introduces MQ-Det, a novel approach for object detection that utilizes both textual descriptions and visual exemplars as category queries. |
Existing object detectors often struggle with insufficient description granularity when using only textual queries. MQ-Det addresses this limitation by incorporating visual information, leading to more accurate and versatile detection, especially for fine-grained categories. |
MQ-Det proposes a plug-and-play Gated Class-scalable Perceiver (GCP) module that augments textual category queries with class-wise visual information extracted from exemplars. It employs a vision-conditioned masked language prediction strategy to overcome the learning inertia caused by the frozen pre-trained detector. |
MQ-Det significantly boosts open-world detection performance, achieving state-of-the-art results on LVIS and ODinW benchmarks.
It demonstrates strong finetuning-free transferability, enabling detection of customized objects without any finetuning.
MQ-Det requires significantly less training time than previous leading detectors while exhibiting strong few-shot learning capabilities. |
The contribution of multi-modal queries diminishes with sufficient training data per category.
The application of MQ-Det to other dense prediction tasks like segmentation needs further exploration. |
object detection, multi-modal learning, open-vocabulary detection, few-shot learning, vision-language models |
2305.18832
Report |
ReTR: Modeling Rendering Via Transformer for Generalizable Neural Surface Reconstruction |
Yixun Liang, Hao He, Ying-cong Chen |
Generalizable neural surface reconstruction techniques have attracted great
attention in recent years. However, they encounter limitations of low
confidence depth distribution and inaccurate surface reasoning due to the
oversimplified volume rendering process employed. In this paper, we present
Reconstruction TRansformer (ReTR), a novel framework that leverages the
transformer architecture to redesign the rendering process, enabling complex
render interaction modeling. It introduces a learnable meta-ray
token and utilizes the cross-attention mechanism to simulate the interaction
of rendering process with sampled points and render the observed color.
Meanwhile, by operating within a high-dimensional feature space rather than the
color space, ReTR mitigates sensitivity to projected colors in source views.
Such improvements result in accurate surface assessment with high confidence.
We demonstrate the effectiveness of our approach on various datasets,
showcasing how our method outperforms the current state-of-the-art approaches
in terms of reconstruction quality and generalization ability. Our
code is available at https://github.com/YixunLiang/ReTR. |
This paper proposes Reconstruction Transformer (ReTR), a novel framework for generalizable neural surface reconstruction that leverages the transformer architecture to redesign the rendering process for improved surface modeling. |
Existing generalizable neural surface reconstruction techniques suffer from limitations of low confidence depth distribution and inaccurate surface reasoning due to the oversimplified volume rendering process. |
ReTR introduces a learnable meta-ray token and utilizes the cross-attention mechanism to simulate the interaction of the rendering process with sampled points. It operates within a high-dimensional feature space rather than the color space and introduces a unidirectional transformer and continuous positional encoding to simulate photon-medium interaction. |
ReTR outperforms state-of-the-art approaches in terms of reconstruction quality and generalization ability on various datasets including DTU, BlendedMVS, ETH3D, and Tanks & Temples.
The method achieves more accurate surface assessment with higher confidence, resulting in sharper depth distribution and reduced noise.
ReTR demonstrates robustness to different sampling strategies and can provide reliable depth estimations even with fewer samples. |
The method has limitations in terms of efficiency, requiring around 30 seconds to render a depth map and image with a resolution of 600x800.
While learning-based rendering enhances capabilities, it introduces additional training parameters compared to traditional volume rendering, increasing training time. |
neural surface reconstruction, transformer, volume rendering, computer vision, 3d reconstruction |
2305.18766
Report |
HiFA: High-fidelity Text-to-3D Generation with Advanced Diffusion Guidance |
Junzhe Zhu, Peiye Zhuang, Sanmi Koyejo |
The advancements in automatic text-to-3D generation have been remarkable.
Most existing methods use pre-trained text-to-image diffusion models to
optimize 3D representations like Neural Radiance Fields (NeRFs) via
latent-space denoising score matching. Yet, these methods often result in
artifacts and inconsistencies across different views due to their suboptimal
optimization approaches and limited understanding of 3D geometry. Moreover, the
inherent constraints of NeRFs in rendering crisp geometry and stable textures
usually lead to a two-stage optimization to attain high-resolution details.
This work proposes holistic sampling and smoothing approaches to achieve
high-quality text-to-3D generation, all in a single-stage optimization. We
compute denoising scores in the text-to-image diffusion model's latent and
image spaces. Instead of randomly sampling timesteps (also referred to as noise
levels in denoising score matching), we introduce a novel timestep annealing
approach that progressively reduces the sampled timestep throughout
optimization. To generate high-quality renderings in a single-stage
optimization, we propose regularization for the variance of z-coordinates along
NeRF rays. To address texture flickering issues in NeRFs, we introduce a kernel
smoothing technique that refines importance sampling weights coarse-to-fine,
ensuring accurate and thorough sampling in high-density regions. Extensive
experiments demonstrate the superiority of our method over previous approaches,
enabling the generation of highly detailed and view-consistent 3D assets
through a single-stage training process. |
This paper introduces a novel approach for generating high-quality, view-consistent 3D assets from text prompts using a single-stage optimization process, leveraging pre-trained text-to-image diffusion models. |
Existing text-to-3D generation methods often suffer from artifacts, inconsistencies across views, and require multi-stage optimization for high-resolution details. This work aims to overcome these limitations by improving both the optimization process and the 3D representation. |
The method combines score distillation from the latent and image spaces of a pre-trained Stable Diffusion model with a novel timestep annealing strategy for improved optimization. Additionally, it introduces a variance regularization loss for sharper geometry and a kernel smoothing technique for coarse-to-fine importance sampling to mitigate flickering artifacts in NeRFs. |
The proposed method generates 3D assets with superior photorealism, detailed textures, and more natural lighting compared to existing methods.
A novel timestep annealing approach effectively addresses divergence issues and enhances the guidance provided by the text-to-image diffusion prior, leading to improved generation quality.
The introduction of z-variance regularization and kernel smoothing techniques significantly enhances the quality of NeRF representations, ensuring both sharp geometry and view-consistent appearance. |
The current implementation relies on low-resolution guidance (64×64) from the DeepFloyd IF model; future work will explore utilizing the full model for high-resolution guidance.
The method currently focuses on generating 3D assets from text prompts; future research will explore extending it to handle other input modalities, such as images or sketches. |
text-to-3d generation, neural radiance fields (nerfs), diffusion models, score distillation sampling (sds), timestep annealing |
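Two ideas from the HiFA record above, timestep annealing and z-variance regularization, can be sketched in a few lines. This is an illustrative sketch only: the square-root schedule, the bounds `t_max` and `t_min`, and the function names are assumptions rather than the paper's confirmed settings.

```python
import math


def annealed_timestep(step, total_steps, t_max=980, t_min=20):
    """Return a diffusion timestep that shrinks as optimization proceeds.

    Assumption: a square-root schedule between hypothetical bounds
    t_max and t_min; the paper's exact schedule may differ.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return int(round(t_max - (t_max - t_min) * math.sqrt(frac)))


def z_variance(weights, depths, eps=1e-8):
    """Variance of sample depths along one ray under normalized rendering weights.

    A low value means the density is concentrated near a single surface.
    """
    total = sum(weights) + eps
    probs = [w / total for w in weights]
    mean = sum(p * z for p, z in zip(probs, depths))
    return sum(p * (z - mean) ** 2 for p, z in zip(probs, depths))


schedule = [annealed_timestep(s, 100) for s in range(101)]
sharp = z_variance([0.0, 1.0, 0.0], [1.0, 2.0, 3.0])
blurry = z_variance([1.0, 1.0, 1.0], [1.0, 2.0, 3.0])
```

In this sketch `sharp` is close to zero while `blurry` is not, which is why penalizing the variance pushes each ray toward a crisp surface. The schedule starts at high noise levels and ends at low ones instead of sampling timesteps uniformly at random.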
2305.18729
Report |
Real-World Image Variation by Aligning Diffusion Inversion Chain |
Yuechen Zhang, Jinbo Xing, Eric Lo, Jiaya Jia |
Recent diffusion model advancements have enabled high-fidelity images to be
generated using text prompts. However, a domain gap exists between generated
images and real-world images, which poses a challenge in generating
high-quality variations of real-world images. Our investigation uncovers that
this domain gap originates from a gap between latent distributions in different
diffusion processes. To address this issue, we propose a novel inference
pipeline called Real-world Image Variation by ALignment (RIVAL) that utilizes
diffusion models to generate image variations from a single image exemplar. Our
pipeline enhances the generation quality of image variations by aligning the
image generation process to the source image's inversion chain. Specifically,
we demonstrate that step-wise latent distribution alignment is essential for
generating high-quality variations. To attain this, we design a cross-image
self-attention injection for feature interaction and a step-wise distribution
normalization to align the latent features. Incorporating these alignment
processes into a diffusion model allows RIVAL to generate high-quality image
variations without further parameter optimization. Our experimental results
demonstrate that our proposed approach outperforms existing methods concerning
semantic similarity and perceptual quality. This generalized inference pipeline
can be easily applied to other diffusion-based generation tasks, such as
image-conditioned text-to-image generation and stylization. |
This paper proposes RIVAL, a training-free inference pipeline for generating high-quality image variations from a single real-world image exemplar using diffusion models. |
Bridging the domain gap between generated and real-world images for high-quality image variation generation with diffusion models is crucial but challenging. |
RIVAL aligns the image generation process to the source image's inversion chain using two key components: (i) cross-image self-attention injection for feature interaction and (ii) step-wise latent normalization for latent distribution alignment. |
RIVAL generates high-quality image variations maintaining semantic and style consistency with the exemplar image.
RIVAL outperforms existing methods in terms of semantic similarity and perceptual quality.
RIVAL can be applied to other image generation tasks like text-driven generation with image conditions and inpainting. |
RIVAL relies on text prompts, potentially introducing semantic biases.
Generating complex scenes with RIVAL can be challenging due to limitations of the base diffusion model.
Future work could explore refining diffusion models and novel input modalities beyond text prompts. |
image variation generation, diffusion models, latent space alignment, cross-image attention, real-world image editing |
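The step-wise latent normalization described in the RIVAL record above can be illustrated with a small statistic-matching routine. This is a sketch under assumptions: it treats a latent as a flat list of floats and matches only the global mean and standard deviation, and the names `mean_std` and `align_latent` are hypothetical.

```python
import math


def mean_std(values):
    """Population mean and standard deviation of a flat list of floats."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, math.sqrt(var)


def align_latent(generated, reference, eps=1e-8):
    """Rescale `generated` so its mean and std match those of `reference`.

    Assumption: `reference` is the latent from the source image's inversion
    chain at the same denoising step.
    """
    g_mean, g_std = mean_std(generated)
    r_mean, r_std = mean_std(reference)
    return [(v - g_mean) / (g_std + eps) * r_std + r_mean for v in generated]


generated = [0.0, 1.0, 2.0, 3.0]
reference = [10.0, 10.5, 11.0, 11.5]
aligned = align_latent(generated, reference)
aligned_mean, aligned_std = mean_std(aligned)
ref_mean, ref_std = mean_std(reference)
```

In a real pipeline this alignment would be applied at each denoising step, so the generated latent keeps tracking the statistics of the inverted real-image latent rather than drifting toward those of a purely generated image.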
2305.18726
Report |
Diffusion-Stego: Training-free Diffusion Generative Steganography via Message Projection |
Daegyu Kim, Chaehun Shin, Jooyoung Choi, Dahuin Jung, Sungroh Yoon |
Generative steganography is the process of hiding secret messages in
generated images instead of cover images. Existing studies on generative
steganography use GAN or Flow models to obtain high hiding message capacity and
anti-detection ability over cover images. However, they create relatively
unrealistic stego images because of the inherent limitations of generative
models. We propose Diffusion-Stego, a generative steganography approach based
on diffusion models which outperform other generative models in image
generation. Diffusion-Stego projects secret messages into latent noise of
diffusion models and generates stego images with an iterative denoising
process. Since the naive hiding of secret messages into noise boosts visual
degradation and decreases extracted message accuracy, we introduce message
projection, which hides messages into noise space while addressing these
issues. We suggest three options for message projection to adjust the trade-off
between extracted message accuracy, anti-detection ability, and image quality.
Diffusion-Stego is a training-free approach, so we can apply it to pre-trained
diffusion models which generate high-quality images, or even large-scale
text-to-image models, such as Stable Diffusion. Diffusion-Stego achieved a high
capacity of messages (3.0 bpp of binary messages with 98% accuracy, and 6.0 bpp
with 90% accuracy) as well as high quality (with a FID score of 2.77 for 1.0
bpp on the FFHQ 64×64 dataset) that makes it challenging to distinguish
from real images in the PNG format. |
This paper presents Diffusion-Stego, a novel generative steganography approach based on diffusion models and deterministic samplers for hiding messages within generated images, achieving high message capacity and quality. |
Generative steganography enhances traditional steganography by hiding messages within generated images instead of cover images, making it more resistant to steganalysis. |
Diffusion-Stego leverages the invertible property of deterministic samplers in diffusion models to embed secret messages into the noise of the generative process. The authors introduce message projection techniques to address challenges like image collapse and extraction errors. |
Achieves high message capacity, hiding up to 3.0 bpp of binary messages with 98% accuracy, and 6.0 bpp with 90% accuracy.
Generates high-quality stego images, achieving a FID score of 2.77 for 1.0 bpp on the FFHQ 64×64 dataset, making it difficult to distinguish from real PNG images.
Demonstrates applicability to large-scale text-to-image models like Stable Diffusion, allowing for message hiding based on text prompts. |
Trade-off exists between image quality, anti-detection ability, and extracted message accuracy, requiring further optimization.
Reliance on pre-trained diffusion models raises concerns about potential misuse for malicious purposes, necessitating research on safeguards and steganalysis techniques. |
generative steganography, diffusion models, deterministic sampling, message projection, image steganalysis |
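The idea of projecting a message into the initial noise, as described in the Diffusion-Stego record above, can be illustrated with a toy sign-based scheme. This is not one of the paper's three projection options: it is a hypothetical 1-bit-per-element variant, and it skips the diffusion sampler and its inversion entirely.

```python
import random


def project_bits_to_noise(bits, rng):
    """Encode each bit as the sign of a half-normal sample.

    If the bits are uniformly random, each element is still marginally
    standard normal, so the noise looks like ordinary sampler input.
    """
    return [abs(rng.gauss(0.0, 1.0)) * (1.0 if b else -1.0) for b in bits]


def extract_bits(noise):
    """Recover the message from the sign of each (recovered) noise element."""
    return [1 if z > 0 else 0 for z in noise]


rng = random.Random(0)
message = [rng.randint(0, 1) for _ in range(64)]
noise = project_bits_to_noise(message, rng)
recovered = extract_bits(noise)
```

In the full method the receiver would first invert the deterministic sampler to estimate the noise from the stego image, and inversion error is what lowers extraction accuracy at higher payloads. Here the round trip is exact because no image is generated.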
2305.18676
Report |
LayerDiffusion: Layered Controlled Image Editing with Diffusion Models |
Pengzhi Li, QInxuan Huang, Yikang Ding, Zhiheng Li |
Text-guided image editing has recently experienced rapid development.
However, simultaneously performing multiple editing actions on a single image,
such as background replacement and specific subject attribute changes, while
maintaining consistency between the subject and the background remains
challenging. In this paper, we propose LayerDiffusion, a semantic-based layered
controlled image editing method. Our method enables non-rigid editing and
attribute modification of specific subjects while preserving their unique
characteristics and seamlessly integrating them into new backgrounds. We
leverage a large-scale text-to-image model and employ a layered controlled
optimization strategy combined with layered diffusion training. During the
diffusion process, an iterative guidance strategy is used to generate a final
image that aligns with the textual description. Experimental results
demonstrate the effectiveness of our method in generating highly coherent
images that closely align with the given textual description. The edited images
maintain a high similarity to the features of the input image and surpass the
performance of current leading image editing methods. LayerDiffusion opens up
new possibilities for controllable image editing. |
This paper introduces LayerDiffusion, a novel semantic-based layered controlled image editing method that enables simultaneous editing of both the background and specific subjects within an image using a single input image. |
Current text-guided image editing methods struggle to maintain consistency between edited subjects and backgrounds, especially when performing multiple editing actions simultaneously. This new method aims to address these limitations and enhance controllable image editing. |
The method leverages a large-scale text-to-image model and employs a layered controlled optimization strategy to refine text embeddings. It then uses a layered diffusion training strategy to fine-tune the model and an iterative guidance strategy during inference to generate images consistent with the textual description. |
LayerDiffusion enables non-rigid editing and attribute modification of specific subjects while preserving their unique characteristics and seamlessly integrating them into new backgrounds.
The method generates images with highly similar features to the input images, surpassing the performance of current leading image editing methods.
User studies confirm that LayerDiffusion's output aligns more closely with human perception compared to other methods. |
The method faces challenges in dealing with fine-grained tasks, such as preserving intricate textures or facial features, due to potential overfitting during model fine-tuning.
Significant disparities in camera angles between the input reference image and the desired edited image can lead to visually inconsistent scenes. |
image editing, text-guided synthesis, diffusion models, layered control, semantic editing |
2305.18670
Report |
SAVE: Spectral-Shift-Aware Adaptation of Image Diffusion Models for Text-driven Video Editing |
Nazmul Karim, Umar Khalid, Mohsen Joneidi, Chen Chen, Nazanin Rahnavard |
Text-to-Image (T2I) diffusion models have achieved remarkable success in
synthesizing high-quality images conditioned on text prompts. Recent methods
have tried to replicate the success by either training text-to-video (T2V)
models on a very large number of text-video pairs or adapting T2I models on
text-video pairs independently. Although the latter is computationally less
expensive, it still takes a significant amount of time for per-video adaptation.
To address this issue, we propose SAVE, a novel spectral-shift-aware adaptation
framework, in which we fine-tune the spectral shift of the parameter space
instead of the parameters themselves. Specifically, we take the spectral
decomposition of the pre-trained T2I weights and only update the singular
values while freezing the corresponding singular vectors. In addition, we
introduce a spectral shift regularizer aimed at placing tighter constraints on
larger singular values compared to smaller ones. This form of regularization
enables the model to grasp finer details within the video that align with the
provided textual descriptions. We also offer theoretical justification for our
proposed regularization technique. Since we are only dealing with spectral
shifts, the proposed method reduces the adaptation time significantly (approx.
10 times) and has fewer resource constraints for training. Such attributes
make SAVE more suitable for real-world applications, e.g., editing
undesirable content during video streaming. We validate the effectiveness of
SAVE with an extensive experimental evaluation under different settings, e.g.
style transfer, object replacement, privacy preservation, etc. |
This paper proposes SAVE, a novel spectral-shift-aware adaptation framework for text-guided video editing that fine-tunes the spectral shift of a pre-trained T2I diffusion model instead of its parameters for efficient adaptation. |
Existing text-to-video generation methods are computationally expensive, lack temporal awareness, and require large-scale datasets, while this method leverages existing T2I models for efficiency and addresses temporal modeling for improved video editing. |
The method leverages a pre-trained T2I model and fine-tunes its spectral shifts by updating the singular values of weight matrices while freezing the singular vectors. It incorporates a spectral shift regularizer to prioritize learning finer video details and explores different spatiotemporal attention mechanisms for temporal coherence. |
Significantly reduces adaptation time (approximately 10x faster) compared to traditional fine-tuning methods.
Achieves state-of-the-art performance in text-guided video editing tasks, including style transfer, object replacement, and local attribute editing, as demonstrated through quantitative and qualitative evaluations.
Shows promising results in zero-shot text-to-video generation by incorporating pre-trained T2I adapters for motion modeling and frame attention for temporal consistency. |
Struggles with editing long video sequences with irregular actions, indicating potential for further exploration of temporal modeling techniques.
Relies on pre-trained T2I models, which might limit its capacity to learn novel concepts beyond the knowledge captured in the pre-trained models, suggesting investigation into incorporating external knowledge sources or few-shot learning strategies. |
video editing, diffusion models, text-to-video generation, spectral shift, temporal modeling |
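The spectral-shift idea in the SAVE entry above (update singular values, freeze singular vectors) can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function name `spectral_shift_weight`, the toy matrix sizes, and the clamp that keeps shifted singular values non-negative are assumptions made for illustration, and the spectral shift regularizer is omitted.

```python
import numpy as np

def spectral_shift_weight(W, delta):
    """Rebuild a weight matrix from frozen singular vectors and shifted singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)  # U and Vt stay frozen
    shifted = np.maximum(s + delta, 0.0)              # assumption: clamp to keep values non-negative
    return U @ np.diag(shifted) @ Vt

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))           # stand-in for one pre-trained weight matrix

# The only trainable quantity is delta: one scalar per singular value (3 here, versus 12 weights).
W_same = spectral_shift_weight(W, np.zeros(3))                 # zero shift reproduces W
W_new = spectral_shift_weight(W, np.array([0.5, 0.0, -0.1]))   # a hypothetical learned shift
```

In this toy case the adapted matrix is controlled by 3 numbers instead of 12, which hints at why adapting only the spectrum can be cheaper than updating every parameter.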
2305.18439
Report |
Alteration-free and Model-agnostic Origin Attribution of Generated Images |
Zhenting Wang, Chen Chen, Yi Zeng, Lingjuan Lyu, Shiqing Ma |
Recently, there has been growing attention to image generation models.
However, concerns have emerged regarding potential misuse and intellectual
property (IP) infringement associated with these models. Therefore, it is
necessary to analyze the origin of images by inferring if a specific image was
generated by a particular model, i.e., origin attribution. Existing methods are
limited in their applicability to specific types of generative models and
require additional steps during training or generation. This restricts their
use with pre-trained models that lack these specific operations and may
compromise the quality of image generation. To overcome this problem, we first
develop an alteration-free and model-agnostic origin attribution method via
input reverse-engineering on image generation models, i.e., inverting the input
of a particular model for a specific image. Given a particular model, we first
analyze the differences in the hardness of reverse-engineering tasks for the
generated images of the given model and other images. Based on our analysis, we
propose a method that utilizes the reconstruction loss of reverse-engineering
to infer the origin. Our proposed method effectively distinguishes between
generated images from a specific generative model and other images, including
those generated by different models and real images. |
This paper introduces a novel, alteration-free, and model-agnostic method for attributing the origin of AI-generated images, determining if a specific image was generated by a particular model. |
With the increasing concerns about misuse and intellectual property infringement related to AI-generated images, verifying the origin of these images is crucial for copyright protection, tracing malicious content, and ensuring fairness. |
The proposed method leverages the concept of input reverse-engineering on image generation models. It analyzes the reconstruction loss during reverse-engineering, comparing the loss for the examined image to the distribution of losses observed in images genuinely generated by the model in question. To mitigate the influence of inherent image complexities, the method calibrates the reconstruction loss using a reference model trained on a different dataset. |
The method effectively distinguishes between images generated by a specific model and real images, achieving an average accuracy of 94.2%.
It successfully differentiates between images generated by a particular model and those generated by other models, regardless of architectural differences or variations in training datasets, with an average accuracy exceeding 95%.
The method demonstrates robustness against adaptive attacks, such as image editing, maintaining an accuracy above 90% even when malicious modifications are applied. |
The computational cost of the method, primarily due to the reverse-engineering process, is acknowledged as a limitation compared to watermarking or classifier-based approaches. Future work aims to explore techniques for accelerating this process.
The current focus of the method is on image generation models. Expanding its applicability to other domains, such as video, language, and graph generation models, is identified as a direction for future research. |
origin attribution, ai-generated images, reverse-engineering, image generation models, intellectual property protection |
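The reconstruction-loss test described in the origin attribution entry above can be sketched on a toy problem. Everything here is an assumption for illustration: the "generator" is a random linear map, the input inversion uses closed-form least squares instead of iterative optimization, the decision rule (mean plus three standard deviations, plus a small numerical floor) is hypothetical, and the calibration with a reference model is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))          # toy "generator": G(z) = A @ z

def reconstruction_loss(x):
    """Reverse-engineer the generator input for x and return the remaining squared error."""
    z, *_ = np.linalg.lstsq(A, x, rcond=None)   # closed-form stand-in for iterative inversion
    return float(np.sum((A @ z - x) ** 2))

# Losses of images the model really generated define the reference distribution.
own_losses = np.array(
    [reconstruction_loss(A @ rng.standard_normal(3)) for _ in range(100)]
)
# Hypothetical decision rule; the 1e-9 floor only guards against floating-point noise.
threshold = own_losses.mean() + 3.0 * own_losses.std() + 1e-9

def belongs_to_model(x):
    """Attribute x to the model if its reconstruction loss looks like the model's own."""
    return reconstruction_loss(x) <= threshold

generated = A @ rng.standard_normal(3)   # produced by the toy generator
foreign = rng.standard_normal(8)         # stands in for a real image or another model's output
```

Images from the generator can be reconstructed almost exactly, while the foreign sample leaves a clear residual, which is the gap the attribution decision relies on.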
2305.18295
Report |
RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths |
Zeyue Xue, Guanglu Song, Qiushan Guo, Boxiao Liu, Zhuofan Zong, Yu Liu, Ping Luo |
Text-to-image generation has recently witnessed remarkable achievements. We
introduce a text-conditional image diffusion model, termed RAPHAEL, to generate
highly artistic images, which accurately portray the text prompts, encompassing
multiple nouns, adjectives, and verbs. This is achieved by stacking tens of
mixture-of-experts (MoEs) layers, i.e., space-MoE and time-MoE layers, enabling
billions of diffusion paths (routes) from the network input to the output. Each
path intuitively functions as a "painter" for depicting a particular textual
concept onto a specified image region at a diffusion timestep. Comprehensive
experiments reveal that RAPHAEL outperforms recent cutting-edge models, such as
Stable Diffusion, ERNIE-ViLG 2.0, DeepFloyd, and DALL-E 2, in terms of both
image quality and aesthetic appeal. Firstly, RAPHAEL exhibits superior
performance in switching images across diverse styles, such as Japanese comics,
realism, cyberpunk, and ink illustration. Secondly, a single model with three
billion parameters, trained on 1,000 A100 GPUs for two months, achieves a
state-of-the-art zero-shot FID score of 6.61 on the COCO dataset. Furthermore,
RAPHAEL significantly surpasses its counterparts in human evaluation on the
ViLG-300 benchmark. We believe that RAPHAEL holds the potential to propel the
frontiers of image generation research in both academia and industry, paving
the way for future breakthroughs in this rapidly evolving field. More details
can be found on a webpage: https://raphael-painter.github.io/. |
RAPHAEL is a novel text-conditional image diffusion model that leverages a large-scale mixture of diffusion paths to generate highly artistic and text-aligned images. |
Existing text-to-image models often fail to accurately preserve all textual concepts within generated images due to limitations in the cross-attention mechanism for text-image integration. |
RAPHAEL employs a U-Net architecture with stacked space-MoE and time-MoE layers to enable billions of diffusion paths, each acting as a 'painter' for specific concepts and image regions. It also incorporates edge-supervised learning to enhance image quality and aesthetics. |
RAPHAEL exhibits superior performance in generating images across diverse artistic styles, surpassing models like Stable Diffusion and DALL-E 2.
It achieves a state-of-the-art zero-shot FID-30k score of 6.61 on the COCO dataset, demonstrating high image quality and diversity.
RAPHAEL significantly outperforms competitors in human evaluations on the ViLG-300 benchmark for both image-text alignment and aesthetic quality. |
Potential misuse for creating misleading or false information, requiring prompt filtering and ethical considerations.
Computational complexity increases with the number of experts, necessitating a trade-off between image fidelity and inference speed. |
text-to-image generation, diffusion models, mixture-of-experts (moe), edge-supervised learning, artistic image synthesis |
2305.18292
Report |
Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models |
Yuchao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yunpeng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, Yixiao Ge, Ying Shan, Mike Zheng Shou |
Public large-scale text-to-image diffusion models, such as Stable Diffusion,
have gained significant attention from the community. These models can be
easily customized for new concepts using low-rank adaptations (LoRAs). However,
the utilization of multiple concept LoRAs to jointly support multiple
customized concepts presents a challenge. We refer to this scenario as
decentralized multi-concept customization, which involves single-client concept
tuning and center-node concept fusion. In this paper, we propose a new
framework called Mix-of-Show that addresses the challenges of decentralized
multi-concept customization, including concept conflicts resulting from
existing single-client LoRA tuning and identity loss during model fusion.
Mix-of-Show adopts an embedding-decomposed LoRA (ED-LoRA) for single-client
tuning and gradient fusion for the center node to preserve the in-domain
essence of single concepts and support theoretically limitless concept fusion.
Additionally, we introduce regionally controllable sampling, which extends
spatially controllable sampling (e.g., ControlNet and T2I-Adapter) to address
attribute binding and missing object problems in multi-concept sampling.
Extensive experiments demonstrate that Mix-of-Show is capable of composing
multiple customized concepts with high fidelity, including characters, objects,
and scenes. |
Introduces Mix-of-Show, a novel framework for decentralized multi-concept customization in text-to-image diffusion models, enabling the merging of multiple independently fine-tuned concept models while preserving individual concept identity and fidelity. |
Existing methods struggle to combine multiple customized concepts effectively due to concept conflicts and identity loss during model fusion, limiting the potential of large-scale text-to-image models for complex compositions. |
Mix-of-Show utilizes ED-LoRA for single-client concept tuning, which enhances embedding expressiveness to preserve concept essence, and employs gradient fusion at the center node to align inference behavior of individual concepts, minimizing identity loss. |
ED-LoRA effectively captures concept identity while mitigating concept conflicts observed in vanilla LoRA.
Gradient fusion outperforms weight fusion in preserving individual concept fidelity after model merging.
Regionally controllable sampling, introduced for multi-concept generation, addresses attribute binding issues and enables complex compositions with accurate attribute assignment. |
Regionally controllable sampling may exhibit attribute leakage between regions.
Center-node fusion using gradient descent can be time-consuming, especially for large spatial features in U-Net layers. |
text-to-image generation, diffusion models, concept customization, decentralized learning, low-rank adaptation |
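One way to read "aligning inference behavior" in the Mix-of-Show entry above is as a least-squares problem: find a single fused weight whose outputs match each concept model's outputs on that concept's own inputs. The sketch below is an assumption-laden illustration, not the paper's procedure: the rank-1 updates, the random activations, and the closed-form pseudo-inverse solve (used here in place of gradient-based optimization) are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n = 4, 6, 3
W0 = rng.standard_normal((d_out, d_in))            # shared pre-trained layer weight

# Two single-concept models, each W0 plus a rank-1 (LoRA-style) update.
concept_weights = [
    W0 + rng.standard_normal((d_out, 1)) @ rng.standard_normal((1, d_in))
    for _ in range(2)
]
# Hypothetical input activations X_i recorded when sampling each concept.
concept_inputs = [rng.standard_normal((d_in, n)) for _ in range(2)]

def alignment_error(W):
    """Sum of squared differences between fused and per-concept layer outputs."""
    return sum(
        float(np.sum((Wi @ Xi - W @ Xi) ** 2))
        for Wi, Xi in zip(concept_weights, concept_inputs)
    )

# Baseline: weight fusion, i.e. plain averaging of the concept weights.
W_avg = sum(concept_weights) / len(concept_weights)

# Output-matching fusion: solve W @ [X_1 X_2] = [W_1 X_1, W_2 X_2] in the least-squares sense.
X = np.hstack(concept_inputs)
Y = np.hstack([Wi @ Xi for Wi, Xi in zip(concept_weights, concept_inputs)])
W_fused = Y @ np.linalg.pinv(X)
```

On this toy layer, `alignment_error(W_fused)` is no larger than `alignment_error(W_avg)`, which mirrors the reported finding that matching outputs preserves concepts better than averaging weights.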
2305.18286
Report |
Photoswap: Personalized Subject Swapping in Images |
Jing Gu, Yilin Wang, Nanxuan Zhao, Tsu-Jui Fu, Wei Xiong, Qing Liu, Zhifei Zhang, He Zhang, Jianming Zhang, HyunJoon Jung, Xin Eric Wang |
In an era where images and visual content dominate our digital landscape, the
ability to manipulate and personalize these images has become a necessity.
Envision seamlessly substituting a tabby cat lounging on a sunlit window sill
in a photograph with your own playful puppy, all while preserving the original
charm and composition of the image. We present Photoswap, a novel approach that
enables this immersive image editing experience through personalized subject
swapping in existing images. Photoswap first learns the visual concept of the
subject from reference images and then swaps it into the target image using
pre-trained diffusion models in a training-free manner. We establish that a
well-conceptualized visual subject can be seamlessly transferred to any image
with appropriate self-attention and cross-attention manipulation, maintaining
the pose of the swapped subject and the overall coherence of the image.
Comprehensive experiments underscore the efficacy and controllability of
Photoswap in personalized subject swapping. Furthermore, Photoswap
significantly outperforms baseline methods in human ratings across subject
swapping, background preservation, and overall quality, revealing its vast
application potential, from entertainment to professional editing. |
Presents *Photoswap*, a novel, training-free method for personalized subject swapping in images using pre-trained diffusion models. It allows users to replace subjects in an image with a user-specified subject, while maintaining the original pose and composition. |
Personalized subject swapping has broad applications in entertainment, advertising, and professional editing. Existing methods lack the capability to seamlessly integrate new subjects into existing images while preserving their pose and the image composition. |
Photoswap first learns the visual concept of the target subject from reference images using techniques like DreamBooth. Then, it leverages a training-free attention swapping mechanism that manipulates the self-attention and cross-attention maps and outputs during the target image generation process. |
Photoswap demonstrates impressive capabilities in swapping subjects in various images while preserving the original composition and subject pose.
It significantly outperforms baseline methods in human evaluations for subject identity preservation, background preservation, and overall quality.
The method provides control over the subject's appearance by adjusting the attention swapping steps. |
The model sometimes struggles with accurately reproducing hands and complex background information.
Future work aims to address limitations and enhance performance for intricate hand gestures or complex abstract information. |
image editing, subject swapping, diffusion models, attention mechanisms, personalized image manipulation |
2305.18264
Report |
Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising |
Fu-Yun Wang, Wenshuo Chen, Guanglu Song, Han-Jia Ye, Yu Liu, Hongsheng Li |
Leveraging large-scale image-text datasets and advancements in diffusion
models, text-driven generative models have made remarkable strides in the field
of image generation and editing. This study explores the potential of extending
the text-driven ability to the generation and editing of multi-text conditioned
long videos. Current methodologies for video generation and editing, while
innovative, are often confined to extremely short videos (typically less than
24 frames) and are limited to a single text condition. These constraints
significantly limit their applications given that real-world videos usually
consist of multiple segments, each bearing different semantic information. To
address this challenge, we introduce a novel paradigm dubbed Gen-L-Video,
capable of extending off-the-shelf short video diffusion models for generating
and editing videos comprising hundreds of frames with diverse semantic segments
without introducing additional training, all while preserving content
consistency. We have implemented three mainstream text-driven video generation
and editing methodologies and extended them to accommodate longer videos imbued
with a variety of semantic segments with our proposed paradigm. Our
experimental outcomes reveal that our approach significantly broadens the
generative and editing capabilities of video diffusion models, offering new
possibilities for future research and applications. The code is available at
https://github.com/G-U-N/Gen-L-Video. |
Introduces Gen-L-Video, a novel paradigm that extends off-the-shelf short video diffusion models to generate and edit long videos with multiple semantic segments, all without additional training. |
Addresses limitations of current text-driven video generation and editing methods that struggle with long durations (typically under 24 frames) and single text conditions, hindering their real-world applicability. |
Treats long videos as overlapping short clips, jointly denoising them with existing models while ensuring consistency and coherence via a weighted merging process. Integrates with pretrained, tuning-free, and one-shot-tuning paradigms, and incorporates techniques like bidirectional cross-frame attention. |
Significantly enhances frame consistency and textual alignment compared to isolated denoising.
Successfully generates and edits videos with hundreds of frames and diverse semantic segments, as demonstrated qualitatively and quantitatively.
Demonstrates versatility by integrating with personalized diffusion models, layout control mechanisms, and open-set detection/segmentation for arbitrary object editing. |
The framework’s potential to integrate different video diffusion models with varying lengths remains unexplored.
Further research can explore using diverse short video diffusion models concurrently. |
video generation, video editing, diffusion models, long video synthesis, multi-text conditioned generation |
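The overlapping-clip idea in the Gen-L-Video entry above can be sketched as one merging step. This is a simplified illustration under stated assumptions: `denoise_clip` is a hypothetical stand-in for a short-video diffusion model, the clip length and stride are arbitrary, and overlapping frames are merged with uniform weights although other weightings are possible.

```python
import numpy as np

def clip_windows(num_frames, clip_len, stride):
    """Return (start, end) indices of overlapping clips that cover every frame."""
    starts = list(range(0, max(num_frames - clip_len, 0) + 1, stride))
    if starts[-1] + clip_len < num_frames:      # add a final clip so the tail is covered
        starts.append(num_frames - clip_len)
    return [(s, s + clip_len) for s in starts]

def co_denoise_step(latents, denoise_clip, clip_len=4, stride=2):
    """One step: denoise each short clip, then average the frames where clips overlap."""
    merged = np.zeros_like(latents)
    weight = np.zeros((latents.shape[0],) + (1,) * (latents.ndim - 1))
    for s, e in clip_windows(latents.shape[0], clip_len, stride):
        merged[s:e] += denoise_clip(latents[s:e])
        weight[s:e] += 1.0                      # uniform merge weights (an assumption)
    return merged / weight

latents = np.arange(10, dtype=float).reshape(10, 1)   # 10 frames, 1 value per frame
out = co_denoise_step(latents, lambda clip: 0.5 * clip)
```

Because every frame receives contributions from all clips that contain it, neighbouring clips are pulled toward a shared result, which is the mechanism the entry credits for cross-clip consistency.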
2305.18203
Report |
Concept Decomposition for Visual Exploration and Inspiration |
Yael Vinker, Andrey Voynov, Daniel Cohen-Or, Ariel Shamir |
A creative idea is often born from transforming, combining, and modifying
ideas from existing visual examples capturing various concepts. However, one
cannot simply copy the concept as a whole, and inspiration is achieved by
examining certain aspects of the concept. Hence, it is often necessary to
separate a concept into different aspects to provide new perspectives. In this
paper, we propose a method to decompose a visual concept, represented as a set
of images, into different visual aspects encoded in a hierarchical tree
structure. We utilize large vision-language models and their rich latent space
for concept decomposition and generation. Each node in the tree represents a
sub-concept using a learned vector embedding injected into the latent space of
a pretrained text-to-image model. We use a set of regularizations to guide the
optimization of the embedding vectors encoded in the nodes to follow the
hierarchical structure of the tree. Our method allows users to explore and discover
new concepts derived from the original one. The tree provides the possibility
of endless visual sampling at each node, allowing the user to explore the
hidden sub-concepts of the object of interest. The learned aspects in each node
can be combined within and across trees to create new visual ideas, and can be
used in natural language sentences to apply such aspects to new designs. |
This paper introduces a method for decomposing visual concepts into distinct aspects, creating a hierarchical tree structure for exploration and inspiration. |
This approach supports creative design by enabling the exploration of nuanced aspects within a concept and facilitating the generation of novel ideas through combination. |
Leveraging large vision-language models, the method constructs a binary tree where each node represents a learned vector embedding of a sub-concept. This learning process is guided by a binary reconstruction loss and a coherency constraint ensuring meaningful and distinct aspect representation. |
The method successfully decomposes complex visual concepts into coherent sub-concepts, as demonstrated through qualitative examples and a perceptual study.
The generated tree structure facilitates the exploration and combination of aspects, both within a concept (intra-tree) and across different concepts (inter-tree), to foster new design ideas.
The learned aspects can be effectively integrated into natural language sentences, enabling aspect-based image generation using pre-trained text-to-image models. |
The method may struggle with specific image conditions (e.g., background leakage, dominant sub-concepts) impacting decomposition quality.
Generating deeper trees with multiple levels remains challenging due to potential drift towards out-of-distribution embeddings, requiring further investigation. |
concept decomposition, visual exploration, design inspiration, text-to-image generation, vision-language models |
2305.18009
Report |
Multi-Modal Face Stylization with a Generative Prior |
Mengtian Li, Yi Dong, Minxuan Lin, Haibin Huang, Pengfei Wan, Chongyang Ma |
In this work, we introduce a new approach for face stylization. Despite
existing methods achieving impressive results in this task, there is still room
for improvement in generating high-quality artistic faces with diverse styles
and accurate facial reconstruction. Our proposed framework, MMFS, supports
multi-modal face stylization by leveraging the strengths of StyleGAN and
integrating it into an encoder-decoder architecture. Specifically, we use the
mid-resolution and high-resolution layers of StyleGAN as the decoder to
generate high-quality faces, while aligning its low-resolution layer with the
encoder to extract and preserve input facial details. We also introduce a
two-stage training strategy, where we train the encoder in the first stage to
align the feature maps with StyleGAN and enable a faithful reconstruction of
input faces. In the second stage, the entire network is fine-tuned with
artistic data for stylized face generation. To enable the fine-tuned model to
be applied in zero-shot and one-shot stylization tasks, we train an additional
mapping network from the large-scale Contrastive-Language-Image-Pre-training
(CLIP) space to a latent $w+$ space of fine-tuned StyleGAN. Qualitative and
quantitative experiments show that our framework achieves superior performance
in both one-shot and zero-shot face stylization tasks, outperforming
state-of-the-art methods by a large margin. |
This paper introduces MMFS, a novel framework for multi-modal face stylization that leverages StyleGAN2 within an encoder-decoder architecture for high-quality stylized face generation. |
Existing face stylization methods struggle to balance high-quality artistic generation with diverse style support, accurate facial reconstruction, and flexible control mechanisms (one-shot, zero-shot). |
The framework uses a two-stage training strategy. Stage I aligns a convolution-based encoder with StyleGAN2 for accurate reconstruction. Stage II fine-tunes the entire network on artistic data for stylization. A mapping network from CLIP feature space to StyleGAN2's latent space enables guided stylization. |
MMFS achieves state-of-the-art performance on random stylization, outperforming baselines in quality, diversity, and identity preservation.
The method demonstrates superior visual quality in both one-shot and zero-shot settings, effectively transferring styles from reference images or text prompts while preserving facial details.
Ablation studies validate the effectiveness of the two-stage training, projection loss, fine-tuning step, and CLIP feature integration. |
The current implementation has limitations in handling significant geometric deformations (e.g., caricatures).
The generated images are limited to the cropped region of FFHQ and struggle with out-of-distribution inputs with large pose variations or occlusions. |
face stylization, generative adversarial networks, stylegan2, clip, image-to-image translation |
2305.17929
Report |
Factored-NeuS: Reconstructing Surfaces, Illumination, and Materials of Possibly Glossy Objects |
Yue Fan, Ivan Skorokhodov, Oleg Voynov, Savva Ignatyev, Evgeny Burnaev, Peter Wonka, Yiqun Wang |
We develop a method that recovers the surface, materials, and illumination of
a scene from its posed multi-view images. In contrast to prior work, it does
not require any additional data and can handle glossy objects or bright
lighting. It is a progressive inverse rendering approach, which consists of
three stages. First, we reconstruct the scene radiance and signed distance
function (SDF) with our novel regularization strategy for specular reflections.
Our approach considers both the diffuse and specular colors, which allows for
handling complex view-dependent lighting effects for surface reconstruction.
Second, we distill light visibility and indirect illumination from the learned
SDF and radiance field using learnable mapping functions. Third, we design a
method for estimating the ratio of incoming direct light represented via
Spherical Gaussians reflected in a specular manner and then reconstruct the
materials and direct illumination of the scene. Experimental results
demonstrate that the proposed method outperforms the current state-of-the-art
in recovering surfaces, materials, and lighting without relying on any
additional data. |
This paper presents Factored-NeuS, a novel method for reconstructing surfaces, materials, and illumination from posed multi-view images, even for scenes with glossy objects and complex lighting. |
Existing methods struggle to accurately reconstruct glossy surfaces and disentangle specular reflections from diffuse color, leading to inaccurate geometry and material estimations. This work addresses these limitations, particularly for real-world data. |
The method employs a three-stage progressive inverse rendering approach: (1) Joint reconstruction of surface SDF and radiance with diffuse and specular color decomposition. (2) Learning direct lighting visibility and indirect illumination from the SDF and radiance. (3) Recovering BRDF and direct illumination using a novel specular albedo network and continuous light visibility. |
Outperforms state-of-the-art methods in surface reconstruction quality, particularly for glossy objects, as demonstrated on DTU, SK3D, and Shiny datasets.
Achieves superior material and lighting decomposition compared to existing techniques, evidenced by improved PSNR metrics and visual fidelity on the IndiSG dataset.
Effectiveness of the proposed components, including specular albedo network and continuous light visibility, is validated through ablation studies, showing improvements in both quantitative metrics and qualitative results. |
Challenges remain in reconstructing fine geometric details and materials for objects with complex structures.
Future work includes extending the method to dynamic scenes and incorporating additional data modalities. |
inverse rendering, surface reconstruction, material reconstruction, illumination reconstruction, glossy surfaces |
2305.17916
Report |
Volume Feature Rendering for Fast Neural Radiance Field Reconstruction |
Kang Han, Wei Xiang, Lu Yu |
Neural radiance fields (NeRFs) are able to synthesize realistic novel views
from multi-view images captured from distinct positions and perspectives. In
NeRF's rendering pipeline, neural networks are used to represent a scene
independently or transform the queried learnable feature vector of a point to the
expected color or density. With the aid of geometry guides either in occupancy
grids or proposal networks, the number of neural network evaluations can be
reduced from hundreds to dozens in the standard volume rendering framework.
Instead of rendering yielded color after neural network evaluation, we propose
to render the queried feature vectors of a ray first and then transform the
rendered feature vector to the final pixel color by a neural network. This
fundamental change to the standard volume rendering framework requires only one
single neural network evaluation to render a pixel, which substantially lowers
the high computational complexity of the rendering framework attributed to a
large number of neural network evaluations. Consequently, we can use a
comparably larger neural network to achieve a better rendering quality while
maintaining the same training and rendering time costs. Our model achieves the
state-of-the-art rendering quality on both synthetic and real-world datasets
while requiring a training time of several minutes. |
This paper proposes Volume Feature Rendering (VFR), a novel method that achieves state-of-the-art view synthesis quality with significantly reduced training time compared to standard volume rendering techniques. |
Existing neural rendering methods, while capable of high-fidelity view synthesis, suffer from high computational complexity due to numerous neural network evaluations per pixel. This limits the use of larger networks for better quality and increases training time. VFR addresses this limitation by enabling the use of larger networks without sacrificing training speed. |
Instead of rendering colors directly, VFR renders queried feature vectors of sample points along a ray. These vectors are then integrated and transformed into the final pixel color using a single neural network evaluation. This reduces computational complexity and allows for larger, more expressive networks. |
VFR achieves state-of-the-art rendering quality on both synthetic (NeRF synthetic dataset) and real-world (360 dataset) benchmarks.
The method significantly reduces training time compared to existing fast methods, achieving comparable quality in just minutes.
Ablation studies demonstrate the contribution of individual components like GELU activation and SH feature encoding to the improved performance. |
VFR currently requires per-scene optimization, limiting its applicability in real-time 3D video applications.
While offering high quality, VFR's rendering speed needs further improvement to compete with real-time rendering methods like BakedSDF and MobileNeRF, potentially by reducing the number of feature queries. |
neural rendering, view synthesis, neural radiance fields (nerf), volume rendering, feature integration |
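The difference between standard volume rendering and the feature-first ordering in the VFR entry above can be shown in a few lines. This is a minimal sketch, not the paper's model: the tiny two-layer `mlp`, the random features and densities, and the sample count of 32 are assumptions chosen for illustration.

```python
import numpy as np

def render_weights(sigma, delta):
    """Standard volume rendering weights: transmittance times per-sample opacity."""
    alpha = 1.0 - np.exp(-sigma * delta)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    return trans * alpha

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((8, 16)), rng.standard_normal((16, 3))

def mlp(f):
    """Tiny stand-in colour network: feature vector(s) -> RGB in (0, 1)."""
    h = np.maximum(f @ W1, 0.0)
    return 1.0 / (1.0 + np.exp(-(h @ W2)))

sigma = rng.uniform(0.0, 5.0, size=32)      # densities of 32 samples along one ray
delta = np.full(32, 0.1)                    # distances between consecutive samples
features = rng.standard_normal((32, 8))     # queried feature vector per sample
w = render_weights(sigma, delta)

color_standard = w @ mlp(features)   # 32 network evaluations, then blend the colours
color_vfr = mlp(w @ features)        # blend the features first, then 1 network evaluation
```

The two colours generally differ because the network is nonlinear; the point of the reordering is that the per-pixel network cost no longer scales with the number of samples, leaving room for a larger network.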
2305.17624
Report |
SimpSON: Simplifying Photo Cleanup with Single-Click Distracting Object Segmentation Network |
Chuong Huynh, Yuqian Zhou, Zhe Lin, Connelly Barnes, Eli Shechtman, Sohrab Amirghodsi, Abhinav Shrivastava |
In photo editing, it is common practice to remove visual distractions to
improve the overall image quality and highlight the primary subject. However,
manually selecting and removing these small and dense distracting regions can
be a laborious and time-consuming task. In this paper, we propose an
interactive distractor selection method that is optimized to achieve the task
with just a single click. Our method surpasses the precision and recall
achieved by the traditional method of running panoptic segmentation and then
selecting the segments containing the clicks. We also showcase how a
transformer-based module can be used to identify more distracting regions
similar to the user's click position. Our experiments demonstrate that the
model can effectively and accurately segment unknown distracting objects
interactively and in groups. By significantly simplifying the photo cleaning
and retouching process, our proposed model provides inspiration for exploring
rare object segmentation and group selection with a single click. |
This paper introduces SimpSON, an interactive single-click distractor segmentation network for simplifying photo cleanup. |
Removing small and dense distracting objects from photos is a common yet time-consuming task in photo editing. SimpSON allows users to select and remove these distractions with a single click, potentially reducing editing time from hours to minutes. |
The method utilizes a three-stage pipeline: 1) a single-click Distractor Segmentation Network (1C-DSN) segments objects based on a single click, 2) a Click Proposal Network (CPN) identifies similar distractors and proposes their click positions, and 3) a Proposal Verification Module (PVM) verifies the similarity of proposed clicks to reduce false positives. This process can be run iteratively for more thorough selection. |
The 1C-DSN outperforms existing interactive segmentation methods in segmenting small and medium objects with a single click.
The CPN effectively identifies similar distractors within an image, enabling group selection.
The iterative selection process with PVM significantly improves group selection accuracy. |
The group selection pipeline relies on synthetic data due to the lack of labeled datasets with repeated distractors.
Further exploration of image harmonization techniques could improve the realism of synthetic data and potentially enhance performance. |
interactive segmentation, distractor removal, photo retouching, single-click segmentation, group selection |
2305.17431
Report |
Towards Consistent Video Editing with Text-to-Image Diffusion Models |
Zicheng Zhang, Bonan Li, Xuecheng Nie, Congying Han, Tiande Guo, Luoqi Liu |
Existing works have advanced Text-to-Image (TTI) diffusion models for video
editing in a one-shot learning manner. Despite their low requirements of data
and computation, these methods may produce results with unsatisfactory consistency
with the text prompt and across the temporal sequence, limiting their applications in
the real world. In this paper, we propose to address the above issues with a
novel EI$^2$ model towards **E**nhancing v**I**deo **E**diting
cons**I**stency of TTI-based frameworks. Specifically, we analyze and find
that the inconsistent problem is caused by newly added modules into TTI models
for learning temporal information. These modules lead to covariate shift in the
feature space, which harms the editing capability. Thus, we design EI$^2$ to
tackle the above drawbacks with two classical modules: Shift-restricted
Temporal Attention Module (STAM) and Fine-coarse Frame Attention Module (FFAM).
First, through theoretical analysis, we demonstrate that covariate shift is
highly related to Layer Normalization, thus STAM replaces it with an \textit{Instance
Centering} layer to preserve the distribution of temporal
features. In addition, {STAM} employs an attention layer with normalized
mapping to transform temporal features while constraining the variance shift.
As the second part, we incorporate {STAM} with a novel {FFAM}, which
efficiently leverages fine-coarse spatial information of overall frames to
further enhance temporal consistency. Extensive experiments demonstrate the
superiority of the proposed EI$^2$ model for text-driven video editing. |
This paper presents EI$^2$, a novel approach that enhances the consistency of text-driven video editing using pre-trained Text-to-Image (TTI) diffusion models. |
Existing methods for adapting TTI models to video editing often suffer from temporal inconsistencies (e.g., flickering) and semantic disparity (inconsistency between edits and text prompts), limiting their real-world applicability. |
EI$^2$ introduces two novel modules: (1) **Shift-restricted Temporal Attention Module (STAM)**, theoretically grounded to address semantic disparity by mitigating covariate shift in feature space caused by temporal attention. (2) **Fine-coarse Frame Attention Module (FFAM)**, which enhances temporal consistency by efficiently incorporating global spatial-temporal information. |
EI$^2$ effectively addresses semantic disparity, leading to edits that better align with text prompts.
EI$^2$ enhances temporal consistency, producing smoother and more coherent video edits.
Extensive experiments demonstrate EI$^2$'s superiority over state-of-the-art methods in terms of visual quality, user preference, and resource consumption. |
The theoretical analysis relies on a Gaussian assumption for feature distributions, which may not hold perfectly in practice.
While demonstrating strong editing capabilities, EI$^2$ may still exhibit temporal inconsistencies in certain challenging scenarios (e.g., object replacement with dissimilar attributes). |
video editing, diffusion models, text-to-image synthesis, temporal consistency, semantic alignment |
2305.17423
Report |
Accelerating Text-to-Image Editing via Cache-Enabled Sparse Diffusion Inference |
Zihao Yu, Haoyang Li, Fangcheng Fu, Xupeng Miao, Bin Cui |
Due to the recent success of diffusion models, text-to-image generation is
becoming increasingly popular and achieves a wide range of applications. Among
them, text-to-image editing, or continuous text-to-image generation, attracts
lots of attention and can potentially improve the quality of generated images.
It's common to see that users may want to slightly edit the generated image by
making minor modifications to their input textual descriptions for several
rounds of diffusion inference. However, such an image editing process suffers
from the low inference efficiency of many existing diffusion models even using
GPU accelerators. To solve this problem, we introduce Fast Image Semantically
Edit (FISEdit), a cache-enabled sparse diffusion model inference engine for
efficient text-to-image editing. The key intuition behind our approach is to
utilize the semantic mapping between the minor modifications on the input text
and the affected regions on the output image. For each text editing step,
FISEdit can automatically identify the affected image regions and utilize the
cached unchanged regions' feature map to accelerate the inference process.
Extensive empirical results show that FISEdit can be $3.4\times$ and
$4.4\times$ faster than existing methods on NVIDIA TITAN RTX and A100 GPUs
respectively, and even generates more satisfactory images. |
This paper introduces Fast Image Semantically Edit (FISEdit), a cache-enabled sparse diffusion model inference engine designed for efficient text-to-image editing. |
The method addresses the inefficiency of existing text-to-image editing techniques that regenerate the entire image even when only minor modifications are desired. |
FISEdit leverages the semantic mapping between textual changes and affected image regions to enable sparse computation. It involves a mask generation algorithm to identify affected areas, sparse computation techniques for efficient feature map updates, and a cache-based editing pipeline for managing intermediate data. |
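The cache-and-recompute idea can be illustrated with a minimal sketch, assuming a feature map stored as a 2D grid and a binary mask marking the locations affected by the text edit. The names `expensive_layer` and `sparse_update` are hypothetical and are not FISEdit's API; the actual engine works on diffusion feature maps with dedicated sparse kernels.

```python
def expensive_layer(value, prompt_scale):
    # Hypothetical stand-in for a costly per-location computation.
    return value * prompt_scale + 1.0


def sparse_update(inputs, cached_output, mask, prompt_scale):
    """Recompute only masked locations and reuse cached results elsewhere."""
    output = [row[:] for row in cached_output]  # copy so the cache stays intact
    recomputed = 0
    for i, mask_row in enumerate(mask):
        for j, affected in enumerate(mask_row):
            if affected:
                output[i][j] = expensive_layer(inputs[i][j], prompt_scale)
                recomputed += 1
    return output, recomputed


inputs = [[1.0, 2.0], [3.0, 4.0]]
# First round: a dense pass fills the cache.
cache = [[expensive_layer(v, 2.0) for v in row] for row in inputs]
# Second round: assume the edit only affects the top-right location.
mask = [[0, 1], [0, 0]]
edited, n_recomputed = sparse_update(inputs, cache, mask, prompt_scale=3.0)
```

In this toy case only one of four locations is recomputed; the saving grows with the fraction of the image left untouched by the edit.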
FISEdit achieves up to 4.9x reduction in computational cost compared to baselines.
It offers a speedup of up to 3.4x on NVIDIA TITAN RTX and 4.4x on NVIDIA A100 GPUs.
The method generates high-quality edited images comparable to existing approaches while being significantly faster. |
FISEdit's performance degrades when editing low-resolution images due to limited sparsity.
Future work includes extending the caching mechanism to support real-world text-to-image services for improved throughput. |
text-to-image editing, diffusion models, sparse computation, cache-enabled inference, semantic image editing |
2305.17235
Report |
COMCAT: Towards Efficient Compression and Customization of Attention-Based Vision Models |
Jinqi Xiao, Miao Yin, Yu Gong, Xiao Zang, Jian Ren, Bo Yuan |
Attention-based vision models, such as Vision Transformer (ViT) and its
variants, have shown promising performance in various computer vision tasks.
However, these emerging architectures suffer from large model sizes and high
computational costs, calling for efficient model compression solutions. To
date, pruning ViTs has been well studied, while other compression strategies
that have been widely applied in CNN compression, e.g., model factorization, are
little explored in the context of ViT compression. This paper explores an
efficient method for compressing vision transformers to enrich the toolset for
obtaining compact attention-based vision models. Based on the new insight on
the multi-head attention layer, we develop a highly efficient ViT compression
solution, which outperforms the state-of-the-art pruning methods. For
compressing DeiT-small and DeiT-base models on ImageNet, our proposed approach
can achieve 0.45% and 0.76% higher top-1 accuracy even with fewer parameters.
Our finding can also be applied to improve the customization efficiency of
text-to-image diffusion models, with much faster training (up to $2.6\times$
speedup) and lower extra storage cost (up to $1927.5\times$ reduction) than the
existing works. |
This paper presents ComCAT, a novel model compression technique for attention-based vision models like Vision Transformer (ViT) by effectively exploring the inherent low-rankness within the multi-head attention mechanism. |
Large model sizes and high computational costs of attention-based models necessitate efficient compression techniques, and exploring low-rankness offers an alternative to existing pruning methods. |
The authors analyze singular value distributions in ViT layers and propose a head-level low-rank approximation strategy. They further introduce an automatic rank selection method, leveraging differentiable neural architecture search (NAS) to find optimal rank combinations for compression. |
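To see why the selected rank matters, the sketch below counts parameters when one projection matrix is replaced by two low-rank factors. It assumes a per-head slice of shape 384 x 64 (DeiT-small uses an embedding width of 384 with 6 heads); applying the factorization to exactly this slice is an illustrative assumption, and the helper names are hypothetical rather than taken from ComCAT.

```python
def dense_params(d_in, d_out):
    # Parameters in a full d_in x d_out projection.
    return d_in * d_out


def low_rank_params(d_in, d_out, rank):
    # W (d_in x d_out) is approximated by U (d_in x rank) @ V (rank x d_out).
    return rank * (d_in + d_out)


def max_saving_rank(d_in, d_out):
    # Largest integer rank whose factorization has strictly fewer parameters.
    return (d_in * d_out - 1) // (d_in + d_out)


d_model, d_head = 384, 64
full = dense_params(d_model, d_head)
rank16 = low_rank_params(d_model, d_head, 16)
break_even = max_saving_rank(d_model, d_head)
```

Any rank above `break_even` makes the factorized layer larger than the dense one, which is one reason the rank has to be chosen per layer rather than fixed globally.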
ComCAT outperforms state-of-the-art pruning methods, achieving 0.45% and 0.76% higher top-1 accuracy with fewer parameters for DeiT-small and DeiT-base on ImageNet.
ComCAT demonstrates significant practical speedups on various hardware platforms, including GPUs, mobile processors, ASIC accelerators, and FPGAs.
Applied to text-to-image diffusion model customization, ComCAT improves efficiency with faster training (up to 2.6x speedup) and lower storage costs (up to 1927.5x reduction). |
Exploration of alternative low-rank decomposition methods beyond SVD for specific layers or tasks could be beneficial.
Further investigation into the trade-off between compression ratio, accuracy, and hardware efficiency is crucial for practical deployment. |
model compression, vision transformer, low-rank approximation, text-to-image diffusion, neural architecture search |
2305.17223
Report |
Do We Really Need a Large Number of Visual Prompts? |
Youngeun Kim, Yuhang Li, Abhishek Moitra, Ruokai Yin, Priyadarshini Panda |
Due to increasing interest in adapting models on resource-constrained edges,
parameter-efficient transfer learning has been widely explored. Among various
methods, Visual Prompt Tuning (VPT), prepending learnable prompts to input
space, shows competitive fine-tuning performance compared to training of full
network parameters. However, VPT increases the number of input tokens,
resulting in additional computational overhead. In this paper, we analyze the
impact of the number of prompts on fine-tuning performance and self-attention
operation in a vision transformer architecture. Through theoretical and
empirical analysis we show that adding more prompts does not lead to linear
performance improvement. Further, we propose a Prompt Condensation (PC)
technique that aims to prevent performance degradation from using a small
number of prompts. We validate our methods on FGVC and VTAB-1k tasks and show
that our approach reduces the number of prompts by ~70% while maintaining
accuracy. |
This paper analyzes the impact of the number of visual prompts on the performance of Visual Prompt Tuning (VPT) and proposes a Prompt Condensation (PC) technique to reduce the number of prompts while maintaining accuracy. |
VPT, while memory-efficient, can lead to increased computational cost due to the use of additional prompts. This paper investigates this trade-off to improve the efficiency of VPT. |
The paper provides empirical analysis on the correlation between the number of prompts and accuracy. It mathematically analyzes the impact of prompts on self-attention operation. It proposes Prompt Condensation (PC) which involves calculating the importance score of prompts and selecting the most important ones for fine-tuning. |
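The selection step of Prompt Condensation can be sketched as a top-k filter over per-prompt importance scores. The scores below are made-up values and `condense_prompts` is a hypothetical helper; how the importance score is actually computed is not specified here and is left outside the sketch.

```python
def condense_prompts(scores, num_keep):
    """Return the indices of the num_keep highest-scoring prompts, in order."""
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:num_keep])


# Ten prompts with illustrative importance scores (assumed values).
scores = [0.10, 0.90, 0.05, 0.40, 0.70, 0.20, 0.30, 0.01, 0.60, 0.15]
kept = condense_prompts(scores, num_keep=3)  # keeps 3 of 10, a 70% reduction
```

Only the kept prompts would then be fine-tuned, which shortens the token sequence seen by self-attention.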
Reducing the number of prompts by 50% does not lead to a significant drop in accuracy.
The self-attention matrix remains low-rank even with the addition of prompts.
Proposed Prompt Condensation (PC) can reduce the number of prompts by ~70% while maintaining accuracy. |
The analysis primarily focuses on VPT-Deep and might not be directly applicable to other VPT variants.
Further investigation into more efficient and accurate prompt scoring methods is needed. |
visual prompt tuning, parameter-efficient transfer learning, prompt condensation, vision transformers, self-attention |
2305.16965
Report |
Accelerating Diffusion Models for Inverse Problems through Shortcut Sampling |
Gongye Liu, Haoze Sun, Jiayi Li, Fei Yin, Yujiu Yang |
Diffusion models have recently demonstrated an impressive ability to address
inverse problems in an unsupervised manner. While existing methods primarily
focus on modifying the posterior sampling process, the potential of the forward
process remains largely unexplored. In this work, we propose Shortcut Sampling
for Diffusion (SSD), a novel approach for solving inverse problems in a
zero-shot manner. Instead of initiating from random noise, the core concept of
SSD is to find a specific transitional state that bridges the measurement image
y and the restored image x. By utilizing the shortcut path of "input -
transitional state - output", SSD can achieve precise restoration with fewer
steps. To derive the transitional state during the forward process, we
introduce Distortion Adaptive Inversion. Moreover, we apply back projection as
additional consistency constraints during the generation process.
Experimentally, we demonstrate SSD's effectiveness on multiple representative
IR tasks. Our method achieves competitive results with only 30 NFEs compared to
state-of-the-art zero-shot methods (100 NFEs) and outperforms them with 100 NFEs
in certain tasks. Code is available at https://github.com/GongyeLiu/SSD |
This paper proposes Shortcut Sampling for Diffusion (SSD), a novel approach for solving inverse problems in a zero-shot manner by finding a specific transitional state that bridges the input and restored images, enabling faster and more accurate restoration with fewer steps. |
Existing diffusion-based inverse problem solvers are slow due to their reliance on lengthy sampling processes starting from random noise, neglecting the potential of modifying the forward process. |
SSD uses Distortion Adaptive Inversion (DA Inversion) to find the transitional state by adding controllable random perturbations during the inversion process. It then applies back projection during the generation process to ensure faithfulness to the input image. |
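Back projection as a consistency constraint can be sketched in one dimension. The sketch assumes the common form x <- x - A^+(Ax - y), with A taken to be 2x average pooling and A^+ its pseudo-inverse (value replication); both the update form and the choice of operator are assumptions for illustration, not details confirmed by the summary above.

```python
def downsample(x):
    # A: average adjacent pairs, a simple 2x degradation operator.
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]


def upsample(y):
    # A^+: pseudo-inverse of pair averaging, replicating each value twice.
    return [v for v in y for _ in range(2)]


def back_project(x, y):
    # x <- x - A^+(A x - y): remove the part of x that disagrees with y.
    residual = [a - b for a, b in zip(downsample(x), y)]
    correction = upsample(residual)
    return [xi - ci for xi, ci in zip(x, correction)]


y = [1.0, 4.0]            # measurement
x = [0.0, 4.0, 3.0, 3.0]  # current estimate from the generative process
x_bp = back_project(x, y)
```

After the update, degrading `x_bp` reproduces `y` exactly, while the detail within each pair (the part A cannot see) is left as the generative model produced it.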
SSD achieves competitive results with only 30 NFEs compared to state-of-the-art zero-shot methods using 100 NFEs.
SSD outperforms existing methods in certain IR tasks when using 100 NFEs.
SSD demonstrates strong performance on various inverse problems, including super-resolution, colorization, inpainting, and deblurring, on both CelebA and ImageNet datasets. |
The reliance on back projection may limit the performance when the degradation operator estimation is inaccurate.
The paper mainly focuses on simple degradation operators; handling more complex real-world degradation remains unexplored. |
diffusion models, inverse problems, zero-shot learning, image restoration, shortcut sampling |
2305.16936
Report |
CRoSS: Diffusion Model Makes Controllable, Robust and Secure Image Steganography |
Jiwen Yu, Xuanyu Zhang, Youmin Xu, Jian Zhang |
Current image steganography techniques are mainly focused on cover-based
methods, which commonly have the risk of leaking secret images and poor
robustness against degraded container images. Inspired by recent developments
in diffusion models, we discovered that two properties of diffusion models, the
ability to achieve translation between two images without training, and
robustness to noisy data, can be used to improve security and natural
robustness in image steganography tasks. For the choice of diffusion model, we
selected Stable Diffusion, a type of conditional diffusion model, and fully
utilized the latest tools from open-source communities, such as LoRAs and
ControlNets, to improve the controllability and diversity of container images.
In summary, we propose a novel image steganography framework, named
Controllable, Robust and Secure Image Steganography (CRoSS), which has
significant advantages in controllability, robustness, and security compared to
cover-based image steganography methods. These benefits are obtained without
additional training. To our knowledge, this is the first work to introduce
diffusion models to the field of image steganography. In the experimental
section, we conducted detailed experiments to demonstrate the advantages of our
proposed CRoSS framework in controllability, robustness, and security. |
Proposes CRoSS, a novel coverless image steganography framework leveraging diffusion models for enhanced security, controllability, and robustness. |
Addresses limitations of existing steganography methods, which often leak secret image information, lack robustness to degradation, and offer limited control over container image content. |
Utilizes DDIM Inversion with conditional diffusion models (Stable Diffusion) to enable invertible image translation between secret and container images. Different conditions (prompts, LoRAs, ControlNets) act as keys for hiding and revealing. |
CRoSS demonstrates higher security against steganalysis attacks and visual suspicion compared to existing methods.
Offers flexible control over container image content while maintaining high visual quality.
Exhibits strong robustness to various image degradations, including real-world scenarios like transmission via messaging apps and phone captures. |
While subjectively acceptable, the pixel-wise fidelity of revealed images is lower than cover-based methods.
Current implementation focuses on single-subject modifications within the secret image, limiting its ability to hide global image content. |
image steganography, diffusion models, ddim inversion, coverless steganography, stable diffusion |
2305.16835
Report |
OpenVIS: Open-vocabulary Video Instance Segmentation |
Pinxue Guo, Tony Huang, Peiyang He, Xuefeng Liu, Tianjun Xiao, Zhaoyu Chen, Wenqiang Zhang |
Open-vocabulary Video Instance Segmentation (OpenVIS) can simultaneously
detect, segment, and track arbitrary object categories in a video, without
being constrained to categories seen during training. In this work, we propose
an OpenVIS framework called InstFormer that achieves powerful open vocabulary
capability through lightweight fine-tuning on a limited-category labeled
dataset. Specifically, InstFormer consists of three steps: a) Open-world Mask
Proposal: we utilize a query-based transformer, which is encouraged to propose
all potential object instances, to obtain class-agnostic instance masks; b)
Open-vocabulary Instance Representation and Classification: we propose
InstCLIP, adapted from pre-trained CLIP with Instance Guidance Attention.
InstCLIP generates the instance token capable of representing each
open-vocabulary instance. These instance tokens not only enable open-vocabulary
classification for multiple instances with a single CLIP forward pass but have
also been proven effective for subsequent open-vocabulary instance tracking. c)
Rollout Association: we introduce a class-agnostic rollout tracker to predict
rollout tokens from the tracking tokens of previous frames to enable
open-vocabulary instance association across frames in the video. The
experimental results demonstrate that the proposed InstFormer achieves
state-of-the-art capabilities on a comprehensive OpenVIS evaluation benchmark,
while also achieving competitive performance in the fully supervised VIS task. |
This paper presents InstFormer, a novel open-vocabulary video instance segmentation (OpenVIS) framework that segments, detects, and tracks arbitrary object categories in a video without being limited to categories seen during training. |
Current video instance segmentation models are limited to identifying objects from categories present in their training data, hindering their ability to understand target videos comprehensively. This necessitates retraining with new data for novel categories, which is time-consuming and resource-intensive. OpenVIS addresses this limitation by enabling the identification of objects from arbitrary categories, even unseen during training. |
InstFormer utilizes a three-step approach: 1) Open-world Mask Proposal using a query-based transformer to generate class-agnostic instance masks. 2) Open-vocabulary Instance Representation and Classification through InstCLIP, an enhanced version of CLIP with Instance Guidance Attention, to embed each instance with an instance token for classification and tracking. 3) Rollout Association leveraging a class-independent rollout tracker with temporal contrastive learning to associate instances across frames. |
InstFormer achieves state-of-the-art OpenVIS performance, surpassing existing open-vocabulary methods even when it sees fewer categories during training.
The proposed InstCLIP effectively maintains the zero-shot capability of the pre-trained CLIP model, demonstrating strong performance in zero-shot instance classification.
InstFormer demonstrates competitive performance in fully supervised VIS tasks, highlighting its ability to excel in both open-set and closed-set scenarios. |
The reliance on pre-trained VLMs like CLIP introduces a dependence on the capabilities and limitations of these models.
Future work could explore improving the rollout tracker by incorporating more advanced temporal modeling techniques for enhanced instance association. |
open-vocabulary, video instance segmentation, openvis, instance guidance clip, contrastive learning |
2305.16807
Report |
Negative-prompt Inversion: Fast Image Inversion for Editing with Text-guided Diffusion Models |
Daiki Miyake, Akihiro Iohara, Yu Saito, Toshiyuki Tanaka |
In image editing employing diffusion models, it is crucial to preserve the
reconstruction quality of the original image while changing its style. Although
existing methods ensure reconstruction quality through optimization, a drawback
of these is the significant amount of time required for optimization. In this
paper, we propose negative-prompt inversion, a method capable of achieving
equivalent reconstruction solely through forward propagation without
optimization, thereby enabling much faster editing processes. We experimentally
demonstrate that the reconstruction quality of our method is comparable to that
of existing methods, allowing for inversion at a resolution of 512 pixels and
with 50 sampling steps within approximately 5 seconds, which is more than 30
times faster than null-text inversion. Reduction of the computation time by the
proposed method further allows us to use a larger number of sampling steps in
diffusion models to improve the reconstruction quality with a moderate increase
in computation time. |
The paper proposes "negative-prompt inversion," a novel method for fast reconstruction of real images with diffusion models without requiring optimization. |
Existing image editing methods using diffusion models rely on optimization for reconstruction quality, leading to high computational costs and slow processing. This new method significantly accelerates the editing process. |
The method leverages the observation that the optimal null-text embedding in null-text inversion can be approximated by the embedding of the input text prompt. It replaces the iterative optimization process of null-text inversion with a single forward pass using the prompt embedding. |
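One way to see why this substitution helps reconstruction is through the standard classifier-free guidance formula. In the sketch below, `toy_eps` is a hypothetical stand-in for the noise-prediction network and the embeddings are scalars; when the unconditional branch receives the prompt embedding, the guidance term vanishes for any guidance scale, so sampling follows the same conditional prediction that was used during inversion.

```python
def cfg(eps_cond, eps_uncond, w):
    # Classifier-free guidance: extrapolate from the unconditional prediction.
    return eps_uncond + w * (eps_cond - eps_uncond)


def toy_eps(x, emb):
    # Hypothetical noise predictor; a real model would be a U-Net.
    return 0.5 * x + emb


x_t = 2.0
prompt_emb, empty_emb = 0.25, 0.0
w = 7.5

# Usual setup: the unconditional branch uses the empty-text embedding.
standard = cfg(toy_eps(x_t, prompt_emb), toy_eps(x_t, empty_emb), w)
# Negative-prompt setup: the unconditional branch uses the prompt embedding.
negative_prompt = cfg(toy_eps(x_t, prompt_emb), toy_eps(x_t, prompt_emb), w)
```

Here `negative_prompt` equals the plain conditional prediction, whereas `standard` is pushed away from it by the guidance scale.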
Negative-prompt inversion achieves reconstruction quality comparable to null-text inversion but is more than 30 times faster.
The method enables real-time image editing when combined with existing editing techniques like prompt-to-prompt.
Increasing the number of sampling steps in negative-prompt inversion further improves reconstruction quality while remaining computationally faster than optimization-based methods. |
The average reconstruction quality of negative-prompt inversion, while visually similar, does not fully match the quality of null-text inversion.
The method may exhibit failures, particularly in reconstructing human figures, potentially due to limitations in the employed AutoEncoder. |
diffusion models, image editing, image reconstruction, negative-prompt inversion, real-time editing |
2305.16804
Report |
Towards Open-World Segmentation of Parts |
Tai-Yu Pan, Qing Liu, Wei-Lun Chao, Brian Price |
Segmenting object parts such as cup handles and animal bodies is important in
many real-world applications but requires more annotation effort. The largest
dataset nowadays contains merely two hundred object categories, implying the
difficulty of scaling up part segmentation to an unconstrained setting. To
address this, we propose to explore a seemingly simplified but empirically
useful and scalable task, class-agnostic part segmentation. In this problem, we
disregard the part class labels in training and instead treat all of them as a
single part class. We argue and demonstrate that models trained without part
classes can better localize parts and segment them on objects unseen in
training. We then present two further improvements. First, we propose to make
the model object-aware, leveraging the fact that parts are "compositions",
whose extents are bounded by the corresponding objects and whose appearances
are by nature not independent but bundled. Second, we introduce a novel
approach to improve part segmentation on unseen objects, inspired by an
interesting finding -- for unseen objects, the pixel-wise features extracted by
the model often reveal high-quality part segments. To this end, we propose a
novel self-supervised procedure that iterates between pixel clustering and
supervised contrastive learning that pulls pixels closer or pushes them away.
Via extensive experiments on PartImageNet and Pascal-Part, we show notable and
consistent gains by our approach, essentially a critical step towards
open-world part segmentation. |
This paper presents Open Part Segmenter (OPS), a novel approach for open-world part instance segmentation, enabling the segmentation of parts for objects unseen during training. |
Existing part segmentation methods struggle in open-world settings due to the limited coverage of part classes in training data. OPS aims to address this limitation by enabling part segmentation for unseen objects. |
OPS leverages class-agnostic training, object-aware learning (using object masks), and self-supervised fine-tuning with unlabeled data. This approach removes the reliance on specific part class labels and allows the model to learn more general part representations. |
Class-agnostic training proves effective for open-world part segmentation, outperforming class-aware training.
Object-aware learning significantly improves part segmentation, especially for unseen objects, by leveraging the object-part relationship.
Self-supervised fine-tuning with unlabeled data further enhances the model's generalizability to unseen parts and objects. |
The evaluation metric for unlabeled part discovery requires further exploration.
While multiple rounds of self-training show promise, further investigation is needed to optimize pseudo-label generation. |
part segmentation, open-world learning, class-agnostic training, object-aware learning, self-supervised learning |
2305.16759
Report |
StyleHumanCLIP: Text-guided Garment Manipulation for StyleGAN-Human |
Takato Yoshikawa, Yuki Endo, Yoshihiro Kanamori |
This paper tackles text-guided control of StyleGAN for editing garments in
full-body human images. Existing StyleGAN-based methods struggle to handle
the rich diversity of garments and body shapes and poses. We propose a
framework for text-guided full-body human image synthesis via an
attention-based latent code mapper, which enables more disentangled control of
StyleGAN than existing mappers. Our latent code mapper adopts an attention
mechanism that adaptively manipulates individual latent codes on different
StyleGAN layers under text guidance. In addition, we introduce feature-space
masking at inference time to avoid unwanted changes caused by text inputs. Our
quantitative and qualitative evaluations reveal that our method can control
generated images more faithfully to given texts than existing methods. |
This paper introduces a novel framework that leverages text guidance for manipulating garments in full-body human images generated by StyleGAN-Human. |
Existing methods for text-guided StyleGAN image editing struggle with the diversity of garments and body shapes/poses present in full-body human images, often neglecting garment details or altering the person's identity. |
The proposed framework utilizes an attention-based latent code mapper to effectively capture the correspondence between text descriptions and individual latent codes controlling different StyleGAN layers. It also employs feature-space masking at inference to prevent unwanted changes in image areas unrelated to the text input. |
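Feature-space masking can be pictured as a per-location blend between the features of the original and the edited generation. The blend rule and the helper name `mask_features` below are assumptions made for illustration; the summary above does not state the exact operation or the layers at which it is applied.

```python
def mask_features(original, edited, mask):
    """Keep edited features where mask is 1 and original features elsewhere."""
    return [
        [m * e + (1 - m) * o for o, e, m in zip(row_o, row_e, row_m)]
        for row_o, row_e, row_m in zip(original, edited, mask)
    ]


original = [[1.0, 1.0], [1.0, 1.0]]
edited = [[5.0, 5.0], [5.0, 5.0]]
# Assume only the top-left location belongs to the garment being edited.
mask = [[1, 0], [0, 0]]
blended = mask_features(original, edited, mask)
```

Locations outside the mask keep their original features, which is how unrelated regions such as the background or face would be protected from the text-driven edit.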
The proposed method demonstrates superior performance in accurately reflecting text semantics in edited images compared to existing StyleGAN-based methods (StyleCLIP and HairCLIP) and diffusion model-based approaches (SD Inpainting and DiffEdit).
Quantitative evaluations reveal that the proposed method achieves higher CLIP accuracy and better preserves background regions (lower BG LPIPS) compared to existing methods.
Subjective user studies confirm the effectiveness of the proposed method, indicating higher scores for text alignment and competitive scores for image realism. |
Currently, separate mapper networks are trained for the upper and lower body, requiring users to manually select the appropriate network based on the target text.
The method faces limitations in handling full-body garments like dresses and is sensitive to the accuracy of the human parsing model used for mask generation. |
image editing, stylegan, text-guided image manipulation, virtual try-on, full-body human image synthesis |
2305.16681
Report |
CAILA: Concept-Aware Intra-Layer Adapters for Compositional Zero-Shot Learning |
Zhaoheng Zheng, Haidong Zhu, Ram Nevatia |
In this paper, we study the problem of Compositional Zero-Shot Learning
(CZSL), which is to recognize novel attribute-object combinations with
pre-existing concepts. Recent researchers focus on applying large-scale
Vision-Language Pre-trained (VLP) models like CLIP with strong generalization
ability. However, these methods treat the pre-trained model as a black box and
focus on pre- and post-CLIP operations, which do not inherently mine the
semantic concept between the layers inside CLIP. We propose to dive deep into
the architecture and insert adapters, a parameter-efficient technique proven to
be effective among large language models, into each CLIP encoder layer. We
further equip adapters with concept awareness so that concept-specific features
of "object", "attribute", and "composition" can be extracted. We assess our
method on four popular CZSL datasets, MIT-States, C-GQA, UT-Zappos, and
VAW-CZSL, which shows state-of-the-art performance compared to existing methods
on all of them. |
This paper proposes CAILA (Concept-Aware Intra-Layer Adapters), a novel method that enhances Compositional Zero-Shot Learning (CZSL) by integrating concept-aware adapters into each layer of pre-trained vision-language models like CLIP. |
Existing CZSL methods often treat large-scale VLP models as black boxes, failing to fully exploit the semantic knowledge embedded within their layers. This work aims to overcome this limitation by directly modifying the model architecture for better knowledge transfer and generalization. |
The proposed CAILA method inserts concept-specific adapters into each CLIP encoder layer to extract features related to attributes, objects, and compositions. It then employs a Mixture-of-Adapters (MoA) mechanism to fuse these features and enhance knowledge aggregation. Additionally, a Primitive Concept Shift strategy is introduced to generate augmented training data by combining primitive features. |
CAILA achieves state-of-the-art performance on four popular CZSL benchmarks: MIT-States, C-GQA, UT-Zappos, and VAW-CZSL, under both closed and open world settings.
The method exhibits significant improvements, especially on C-GQA, where it surpasses baselines by a large margin in the challenging open world scenario.
Ablation studies validate the effectiveness of individual components, including adapters, MoA, concept shift, and choice of mixture functions. |
CAILA's performance can degrade in open world settings when the number of possible compositions significantly increases, highlighting the need for more robust methods to handle large search spaces.
Future work could explore alternative adapter architectures and MoA strategies to further enhance knowledge transfer and generalization in CZSL. |
compositional zero-shot learning, vision-language pre-training, clip, adapters, concept-aware learning |
2305.16411
Report |
ZeroAvatar: Zero-shot 3D Avatar Generation from a Single Image |
Zhenzhen Weng, Zeyu Wang, Serena Yeung |
Recent advancements in text-to-image generation have enabled significant
progress in zero-shot 3D shape generation. This is achieved by score
distillation, a methodology that uses pre-trained text-to-image diffusion
models to optimize the parameters of a 3D neural representation, e.g. Neural
Radiance Field (NeRF). While showing promising results, existing methods are
often not able to preserve the geometry of complex shapes, such as human
bodies. To address this challenge, we present ZeroAvatar, a method that
introduces an explicit 3D human body prior into the optimization process.
Specifically, we first estimate and refine the parameters of a parametric human
body from a single image. Then during optimization, we use the posed parametric
body as an additional geometry constraint to regularize the diffusion model as
well as the underlying density field. Lastly, we propose a UV-guided texture
regularization term to further guide the completion of texture on invisible
body parts. We show that ZeroAvatar significantly enhances the robustness and
3D consistency of optimization-based image-to-3D avatar generation,
outperforming existing zero-shot image-to-3D methods. |
Proposes ZeroAvatar, a zero-shot 3D human avatar generation method from a single image using a pre-trained text-to-image diffusion model as a prior. It leverages a parametric human body model (SMPL) for initialization and depth-guided optimization, enhancing geometry preservation, and incorporates UV-guided texture completion for improved appearance, surpassing existing zero-shot methods. |
Extracting accurate 3D information from single images is crucial for content creation, AR/VR, robotics, and scene understanding, but existing methods struggle with preserving complex human geometry. |
1. **Initialization:** Estimate body pose and shape from the input image using SMPL, refining it against image features for accurate alignment. 2. **Depth-guided Optimization:** Optimize NeRF parameters using a depth-conditioned score distillation loss derived from a pre-trained text-to-image diffusion model, guided by SMPL depth. 3. **UV-guided Texture Completion:** Regularize the appearance of invisible body parts using a UV-guided texture prior, leveraging texture symmetry. |
ZeroAvatar significantly improves geometry and appearance fidelity of generated avatars, outperforming existing zero-shot 3D generation methods.
It effectively preserves human structure, achieving higher detection scores on novel views compared to baselines.
The method demonstrates strong generalization ability, handling both real-world humans and virtual avatars. |
Limitation: Relies on SMPL, limiting accuracy for body proportions deviating significantly from the average human shape. Future work: Enhance the generalizability of the human body prior.
Limitation: Extracted meshes from the density field can be coarse. Future work: Integrate techniques for geometry and texture refinement. |
3d avatar generation, zero-shot learning, diffusion models, score distillation sampling, human body prior |
2305.16310
Report |
Securing Deep Generative Models with Universal Adversarial Signature |
Yu Zeng, Mo Zhou, Yuan Xue, Vishal M. Patel |
Recent advances in deep generative models have led to the development of
methods capable of synthesizing high-quality, realistic images. These models
pose threats to society due to their potential misuse. Prior research attempted
to mitigate these threats by detecting generated images, but the varying traces
left by different generative models make it challenging to create a universal
detector capable of generalizing to new, unseen generative models. In this
paper, we propose to inject a universal adversarial signature into an arbitrary
pre-trained generative model, in order to make its generated contents more
detectable and traceable. First, the imperceptible optimal signature for each
image can be found by a signature injector through adversarial training.
Subsequently, the signature can be incorporated into an arbitrary generator by
fine-tuning it with the images processed by the signature injector. In this
way, the detector corresponding to the signature can be reused for any
fine-tuned generator for tracking the generator identity. The proposed method
is validated on the FFHQ and ImageNet datasets with various state-of-the-art
generative models, consistently showing a promising detection rate. Code will
be made publicly available at https://github.com/zengxianyu/genwm. |
This work proposes a method for securing deep generative models by embedding imperceptible signatures into generated images. These signatures are designed to be robust to various image manipulations, enabling the source of generated images to be tracked. |
The proliferation of high-quality image generation models raises concerns about potential misuse, including the spread of misinformation. This method aims to address this by providing a way to verify the origin of generated images. |
The method involves fine-tuning a pre-trained generative model with an additional signature injector network. The injector embeds the signature while minimizing the perceptual difference between the original and signed images. The presence of the signature is then verified by a separate classifier network. |
The embedded signatures are nearly imperceptible, with minimal impact on visual quality.
The signatures are robust to various image manipulations, including compression, resizing, and noise addition.
The method achieves high classification accuracy in distinguishing between signed and unsigned images. |
The current method requires fine-tuning a pre-trained generative model to embed the signature, limiting its practicality.
Future work could explore training-free frameworks for securing deep generative models. |
deep generative models, image security, watermarking, source tracking, misinformation mitigation |
2305.16295
Report |
HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning |
Chia-Wen Kuo, Zsolt Kira |
A great deal of progress has been made in image captioning, driven by
research into how to encode the image using pre-trained models. This includes
visual encodings (e.g. image grid features or detected objects) and more
recently textual encodings (e.g. image tags or text descriptions of image
regions). As more advanced encodings are available and incorporated, it is
natural to ask: how to efficiently and effectively leverage the heterogeneous
set of encodings? In this paper, we propose to regard the encodings as
augmented views of the input image. The image captioning model encodes each
view independently with a shared encoder efficiently, and a contrastive loss is
incorporated across the encoded views in a novel way to improve their
representation quality and the model's data efficiency. Our proposed
hierarchical decoder then adaptively weighs the encoded views according to
their effectiveness for caption generation by first aggregating within each
view at the token level, and then across views at the view level. We
demonstrate significant performance improvements of +5.6% CIDEr on MS-COCO and
+12.9% CIDEr on Flickr30k compared to the state of the art, and conduct rigorous
analyses to demonstrate the importance of each part of our design. |
This paper proposes HAAV, a hierarchical aggregation method for augmented views in image captioning, enabling efficient and effective utilization of heterogeneous image encodings. |
Existing methods for leveraging heterogeneous image encodings in captioning are either computationally expensive (concatenation) or parameter inefficient (separate models per view). HAAV offers a solution that is both efficient and effective. |
HAAV treats heterogeneous views as image augmentations, encoding them independently with a shared transformer encoder. A contrastive loss improves representation learning. A hierarchical decoder then combines information within and across views, adaptively weighting their contributions for each generated word. |
HAAV achieves state-of-the-art performance on MS-COCO (+5.6% CIDEr) and Flickr30K (+12.9% CIDEr) without relying on large-scale pre-training.
The method demonstrates superior computation, parameter, and label efficiency compared to alternative approaches.
Analysis of attention weights confirms the hierarchical decoder's ability to adaptively leverage different views based on their relevance to the generated caption. |
The study primarily focuses on the trained-from-scratch setting, with potential benefits from large-scale pre-training yet to be explored.
Future work could investigate the impact of incorporating more diverse augmented views beyond the ones considered. |
image captioning, multi-view learning, data augmentation, contrastive learning, hierarchical attention |
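The two-level aggregation described in the HAAV summary above (first within each view at the token level, then across views) can be sketched as follows. This is a minimal illustration, assuming scaled dot-product attention with a single query vector standing in for the decoder state; the function names and the toy views are hypothetical and not taken from the paper's code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, vectors):
    """Return the attention-weighted average of vectors and the weights used."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, v) / scale for v in vectors])
    pooled = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(len(query))]
    return pooled, weights

def hierarchical_aggregate(query, views):
    """Token-level pooling inside each view, then view-level pooling across views."""
    summaries = [attend(query, tokens)[0] for tokens in views.values()]
    fused, view_weights = attend(query, summaries)
    return fused, dict(zip(views.keys(), view_weights))

# Toy example: two views, each with two 2-dimensional token encodings.
views = {
    "grid": [[1.0, 0.0], [0.0, 1.0]],
    "tags": [[1.0, 0.0], [1.0, 0.0]],
}
query = [1.0, 0.0]
fused, view_weights = hierarchical_aggregate(query, views)
```

In this toy case the "tags" view matches the query more closely than the "grid" view, so it receives the larger view-level weight.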
2305.16289
Report |
Diversify Your Vision Datasets with Automatic Diffusion-Based Augmentation |
Lisa Dunlap, Alyssa Umino, Han Zhang, Jiezhi Yang, Joseph E. Gonzalez, Trevor Darrell |
Many fine-grained classification tasks, like rare animal identification, have
limited training data and consequently classifiers trained on these datasets
often fail to generalize to variations in the domain like changes in weather or
location. As such, we explore how natural language descriptions of the domains
seen in training data can be used with large vision models trained on diverse
pretraining datasets to generate useful variations of the training data. We
introduce ALIA (Automated Language-guided Image Augmentation), a method which
utilizes large vision and language models to automatically generate natural
language descriptions of a dataset's domains and augment the training data via
language-guided image editing. To maintain data integrity, a model trained on
the original dataset filters out minimal image edits and those which corrupt
class-relevant information. The resulting dataset is visually consistent with
the original training data and offers significantly enhanced diversity. We show
that ALIA surpasses traditional data augmentation and text-to-image
generated data on fine-grained classification tasks, including cases of domain
generalization and contextual bias. Code is available at
https://github.com/lisadunlap/ALIA. |
This paper introduces ALIA (Automated Language-guided Image Augmentation), a method using vision and language models to automatically generate natural language descriptions of domains within image datasets and leverage them for language-guided image editing to augment the data. |
ALIA addresses the challenge of limited training data in fine-grained classification tasks, particularly domain shifts, by creating visually consistent, diverse augmentations grounded in the original data. |
ALIA generates domain descriptions from image captions summarized by an LLM. Then, it uses these descriptions for text-guided image editing, applying filtering techniques to ensure data quality and preserve task-relevant information. |
ALIA outperforms traditional data augmentation and text-to-image generation methods, even exceeding real data performance on the iWildCam dataset.
The study shows that ALIA-generated domain descriptions are more effective than user-provided prompts, highlighting the method's ability to capture key domain-specific features.
The choice of image editing method significantly impacts ALIA's performance, with Img2Img being more suitable for certain datasets like iWildCam and InstructPix2Pix for others. |
ALIA's performance depends on the quality of the captioning model, LLM, and image editing method, which can limit its effectiveness.
Determining the optimal amount of augmented data to include in training is an open question for future research. |
data augmentation, language-guided image editing, fine-grained classification, domain generalization, contextual bias |
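The filtering step in the ALIA summary above (a model trained on the original data discards edits that are too minimal or that corrupt class-relevant information) could look roughly like the sketch below. It assumes one plausible reading: a confident correct prediction signals a minimal edit, and a confident wrong prediction signals class corruption. The per-class thresholds, the function names, and the example probabilities are hypothetical.

```python
def classify_edit(probs, label, thresholds):
    """Label an edited image as 'keep', 'minimal_edit', or 'class_corrupted'.

    probs: class probabilities from a classifier trained on the original data.
    label: the class index of the source image.
    thresholds: per-class confidence thresholds (assumed to be given).
    """
    predicted = max(range(len(probs)), key=lambda i: probs[i])
    confidence = probs[predicted]
    if confidence < thresholds[predicted]:
        return "keep"  # the classifier is unsure, so the edit is treated as a useful variation
    return "minimal_edit" if predicted == label else "class_corrupted"

def filter_edits(edits, thresholds):
    """Keep only the edits that pass the confidence-based check."""
    return [e for e in edits if classify_edit(e["probs"], e["label"], thresholds) == "keep"]

thresholds = [0.9, 0.9]
edits = [
    {"name": "barely_changed", "probs": [0.95, 0.05], "label": 0},
    {"name": "class_flipped", "probs": [0.05, 0.95], "label": 0},
    {"name": "useful_variation", "probs": [0.60, 0.40], "label": 0},
]
kept = filter_edits(edits, thresholds)
```

Only `useful_variation` survives in this example; how the thresholds are chosen is left open here.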
2305.16233
Report |
Interactive Segment Anything NeRF with Feature Imitation |
Xiaokang Chen, Jiaxiang Tang, Diwen Wan, Jingbo Wang, Gang Zeng |
This paper investigates the potential of enhancing Neural Radiance Fields
(NeRF) with semantics to expand their applications. Although NeRF has been
proven useful in real-world applications like VR and digital creation, the lack
of semantics hinders interaction with objects in complex scenes. We propose to
imitate the backbone feature of off-the-shelf perception models to achieve
zero-shot semantic segmentation with NeRF. Our framework reformulates the
segmentation process by directly rendering semantic features and only applying
the decoder from perception models. This eliminates the need for expensive
backbones and benefits 3D consistency. Furthermore, we can project the learned
semantics onto extracted mesh surfaces for real-time interaction. With the
state-of-the-art Segment Anything Model (SAM), our framework accelerates
segmentation by 16 times with comparable mask quality. The experimental results
demonstrate the efficacy and computational advantages of our approach. Project
page: https://me.kiui.moe/san/. |
Presents a novel feature imitation method to enable real-time interactive 3D segmentation in Neural Radiance Fields (NeRF) by leveraging pre-trained 2D perception models. |
NeRF lacks explicit semantic information, limiting its interactive applications. This work aims to bridge this gap and enhance NeRF with semantic understanding for real-time user interaction in 3D scenes. |
Imitates the backbone features of off-the-shelf 2D perception models (e.g., SAM, X-Decoder) to directly render semantic features within the NeRF framework. Employs camera augmentation and caching mechanisms to improve training efficiency and feature imitation quality. |
Achieves real-time 3D click-based segmentation (24.39 FPS) with SAM, a 16x speedup compared to directly applying SAM on rendered images.
Demonstrates comparable segmentation quality to the original 2D models on various challenging scenes.
Enables mesh segmentation by projecting 2D masks onto 3D surfaces, facilitating downstream applications like texture editing and model composition. |
Performance relies on the capabilities of the underlying perception models, which can sometimes lead to imperfect segmentation masks.
Future work includes exploring more powerful perception models and extending the method to support more complex 3D interactions beyond segmentation. |
nerf, interactive segmentation, 3d semantic understanding, feature imitation, real-time |
2305.16133
Report |
OVO: Open-Vocabulary Occupancy |
Zhiyu Tan, Zichao Dong, Cheng Zhang, Weikun Zhang, Hang Ji, Hao Li |
Semantic occupancy prediction aims to infer dense geometry and semantics of
surroundings for an autonomous agent to operate safely in the 3D environment.
Existing occupancy prediction methods are almost entirely trained on
human-annotated volumetric data. Although of high quality, the generation of
such 3D annotations is laborious and costly, restricting them to a few specific
object categories in the training dataset. To address this limitation, this
paper proposes Open Vocabulary Occupancy (OVO), a novel approach that allows
semantic occupancy prediction of arbitrary classes but without the need for 3D
annotations during training. Keys to our approach are (1) knowledge
distillation from a pre-trained 2D open-vocabulary segmentation model to the 3D
occupancy network, and (2) pixel-voxel filtering for high-quality training data
generation. The resulting framework is simple, compact, and compatible with
most state-of-the-art semantic occupancy prediction models. On NYUv2 and
SemanticKITTI datasets, OVO achieves competitive performance compared to
supervised semantic occupancy prediction approaches. Furthermore, we conduct
extensive analyses and ablation studies to offer insights into the design of
the proposed framework. Our code is publicly available at
https://github.com/dzcgaara/OVO. |
This paper proposes Open Vocabulary Occupancy (OVO), a novel approach for semantic occupancy prediction that allows inference of arbitrary classes without requiring 3D annotations during training. |
Existing methods for semantic occupancy prediction rely heavily on laborious and costly 3D annotations, limiting their scalability and applicability to a restricted set of object categories. |
OVO leverages knowledge distillation from a pre-trained 2D open-vocabulary segmentation model to a 3D occupancy network and employs pixel-voxel filtering for high-quality training data generation. |
OVO achieves competitive performance compared to supervised semantic occupancy prediction approaches on NYUv2 and SemanticKITTI datasets.
The effectiveness of the proposed feature alignment and voxel filtering techniques is demonstrated through ablation studies.
OVO introduces a minor computational overhead compared to the baseline occupancy network. |
OVO's reliance on voxel-wise prediction without instance-level optimization can lead to inconsistencies within a single object.
Future work will explore voxel grouping techniques to enhance prediction consistency at the instance level. |
semantic occupancy prediction, open vocabulary learning, knowledge distillation, 3d scene understanding, zero-shot learning |
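The distillation step in the OVO summary above (aligning 3D voxel features with features from a 2D open-vocabulary model, using only filtered pixel-voxel pairs) can be sketched with a masked cosine-distance loss. This is an illustration under stated assumptions: the cosine form of the loss is assumed, and the boolean `keep_mask` merely stands in for the paper's pixel-voxel filtering, whose actual criteria are not described in the summary. All names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors; returns 0.0 if either has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def distillation_loss(voxel_feats, pixel_feats, keep_mask):
    """Mean (1 - cosine similarity) over the pixel-voxel pairs that pass filtering."""
    pairs = [(v, p) for v, p, keep in zip(voxel_feats, pixel_feats, keep_mask) if keep]
    if not pairs:
        return 0.0
    return sum(1.0 - cosine(v, p) for v, p in pairs) / len(pairs)

voxel_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pixel_feats = [[2.0, 0.0], [1.0, 0.0], [-1.0, -1.0]]
keep_mask = [True, True, False]  # the third pair is filtered out as unreliable
loss = distillation_loss(voxel_feats, pixel_feats, keep_mask)
```

The first pair is perfectly aligned and the second is orthogonal, so the loss over the two kept pairs is 0.5; the filtered third pair does not contribute.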
2305.15779
Report |
Custom-Edit: Text-Guided Image Editing with Customized Diffusion Models |
Jooyoung Choi, Yunjey Choi, Yunji Kim, Junho Kim, Sungroh Yoon |
Text-to-image diffusion models can generate diverse, high-fidelity images
based on user-provided text prompts. Recent research has extended these models
to support text-guided image editing. While text guidance is an intuitive
editing interface for users, it often fails to ensure the precise concept
conveyed by users. To address this issue, we propose Custom-Edit, in which we
(i) customize a diffusion model with a few reference images and then (ii)
perform text-guided editing. Our key discovery is that customizing only
language-relevant parameters with augmented prompts improves reference
similarity significantly while maintaining source similarity. Moreover, we
provide our recipe for each customization and editing process. We compare
popular customization methods and validate our findings on two editing methods
using various datasets. |
The paper introduces *Custom-Edit*, a two-step approach for precise text-guided image editing using customized diffusion models. |
Existing text-to-image models struggle to capture unique user concepts or appearances not encountered during training, making precise editing with textual prompts challenging. |
1. **Customization:** Fine-tune language-relevant parameters (cross-attention keys/values, rare token) of a pre-trained diffusion model on a few reference images with augmented prompts. 2. **Editing:** Utilize text-guided editing methods like Prompt-to-Prompt (P2P) or SDEdit on the customized model to edit images based on user prompts. |
Customizing language-relevant parameters with augmented prompts significantly improves reference similarity while maintaining source similarity.
Custom-Edit effectively transfers fine-grained appearance details from references to source images while preserving the overall structure.
The paper provides insights into the source-reference trade-off in diffusion-based editing, showing how adjusting strengths in P2P and SDEdit can control this balance. |
Custom-Edit sometimes struggles with editing complex backgrounds or may modify undesired regions due to limitations in attention map accuracy and text input controllability.
Future work could explore using larger text encoders, incorporating grounding inputs, or leveraging models with enhanced controllability to address these limitations. |
image editing, diffusion models, text-to-image, customization, prompt-to-prompt |
2305.15712
Report |
Knowledge Diffusion for Distillation |
Tao Huang, Yuan Zhang, Mingkai Zheng, Shan You, Fei Wang, Chen Qian, Chang Xu |
The representation gap between teacher and student is an emerging topic in
knowledge distillation (KD). To reduce the gap and improve the performance,
current methods often resort to complicated training schemes, loss functions,
and feature alignments, which are task-specific and feature-specific. In this
paper, we state that the essence of these methods is to discard the noisy
information and distill the valuable information in the feature, and propose a
novel KD method dubbed DiffKD, to explicitly denoise and match features using
diffusion models. Our approach is based on the observation that student
features typically contain more noise than teacher features due to the smaller
capacity of the student model. To address this, we propose to denoise student
features using a diffusion model trained by teacher features. This allows us to
perform better distillation between the refined clean feature and teacher
feature. Additionally, we introduce a light-weight diffusion model with a
linear autoencoder to reduce the computation cost and an adaptive noise
matching module to improve the denoising performance. Extensive experiments
demonstrate that DiffKD is effective across various types of features and
achieves state-of-the-art performance consistently on image classification,
object detection, and semantic segmentation tasks. Code is available at
https://github.com/hunto/DiffKD. |
Presents DiffKD, a novel knowledge distillation (KD) method that utilizes diffusion models to explicitly denoise student features, thereby reducing the representation gap between teacher and student models. |
Addresses the challenge of representation gap in KD, particularly when distilling knowledge from stronger, more complex teacher models to smaller student models. |
Trains a diffusion model on teacher features to learn a denoising process. Employs this model to denoise student features, subsequently used for distillation. Introduces a lightweight diffusion model with a linear autoencoder for efficiency and an adaptive noise matching module for optimal denoising. |
DiffKD consistently outperforms state-of-the-art KD methods across various benchmarks including image classification, object detection, and semantic segmentation.
Demonstrates significant performance gains, particularly when distilling from stronger teacher models, highlighting its effectiveness in bridging the representation gap.
Shows the generic applicability of DiffKD across various tasks and feature types, including intermediate features and classification outputs. |
The current implementation relies on simple convolutional diffusion models and traditional loss functions; exploring more advanced diffusion techniques and loss functions could yield further improvements.
Computational cost, although comparable to other feature-based KD methods, is higher than that of simple logits distillation methods, presenting an area for future optimization. |
knowledge distillation, diffusion models, representation learning, model compression, computer vision |
2305.15542
Report |
TOAST: Transfer Learning via Attention Steering |
Baifeng Shi, Siyu Gai, Trevor Darrell, Xin Wang |
Transfer learning involves adapting a pre-trained model to novel downstream
tasks. However, we observe that current transfer learning methods often fail to
focus on task-relevant features. In this work, we explore refocusing model
attention for transfer learning. We introduce Top-Down Attention Steering
(TOAST), a novel transfer learning algorithm that keeps the pre-trained
backbone frozen, selects task-relevant features in the output, and feeds those
features back to the model to steer the attention to the task-specific
features. By refocusing the attention only, TOAST achieves state-of-the-art
results on a number of transfer learning benchmarks, while having a small
number of tunable parameters. Compared to full fine-tuning, LoRA, and prompt
tuning, TOAST substantially improves performance across a range of fine-grained
visual classification datasets (e.g., 81.1% -> 86.2% on FGVC). TOAST also
outperforms the fully fine-tuned Alpaca and Vicuna models on
instruction-following language generation. Code is available at
https://github.com/bfshi/TOAST. |
This paper introduces TOAST (Top-Down Attention Steering), a transfer learning algorithm that enhances downstream task performance by refocusing the pre-trained model's attention onto task-relevant features. |
Existing transfer learning techniques often struggle to concentrate on task-specific features, limiting their effectiveness. |
TOAST freezes the pre-trained backbone and incorporates a top-down attention module. This module identifies task-relevant features in the output, feeds them back to guide attention during a second feedforward pass, effectively highlighting essential features. |
TOAST achieves state-of-the-art results on various benchmarks, including FGVC for fine-grained classification and VTAB-1k for broader image understanding.
It outperforms methods like fine-tuning, LoRA, and VPT, demonstrating the significance of attention refocusing.
TOAST also excels in instruction-following language generation, surpassing fine-tuned Alpaca and Vicuna models by providing more detailed and relevant responses. |
TOAST incurs higher computational cost due to the second feedforward pass.
While adaptable to diverse architectures and tasks, its performance on dense prediction tasks like semantic segmentation lags behind full fine-tuning. |
transfer learning, top-down attention, attention refocusing, parameter-efficient fine-tuning, computer vision, natural language processing |
2305.15399
Report |
Sin3DM: Learning a Diffusion Model from a Single 3D Textured Shape |
Rundi Wu, Ruoshi Liu, Carl Vondrick, Changxi Zheng |
Synthesizing novel 3D models that resemble the input example has long been
pursued by graphics artists and machine learning researchers. In this paper, we
present Sin3DM, a diffusion model that learns the internal patch distribution
from a single 3D textured shape and generates high-quality variations with fine
geometry and texture details. Training a diffusion model directly in 3D would
induce large memory and computational cost. Therefore, we first compress the
input into a lower-dimensional latent space and then train a diffusion model on
it. Specifically, we encode the input 3D textured shape into triplane feature
maps that represent the signed distance and texture fields of the input. The
denoising network of our diffusion model has a limited receptive field to avoid
overfitting, and uses triplane-aware 2D convolution blocks to improve the
result quality. Aside from randomly generating new samples, our model also
facilitates applications such as retargeting, outpainting and local editing.
Through extensive qualitative and quantitative evaluation, we show that our
method outperforms prior methods in generation quality of 3D shapes. |
Sin3DM: a diffusion model that learns the internal patch distribution from a single 3D textured shape and generates high-quality variations with fine geometry and texture details. |
Collecting large, diverse 3D datasets is challenging, limiting the applicability of data-driven 3D generation methods. Sin3DM addresses this by enabling high-quality 3D shape generation from a single example. |
The input 3D shape is compressed into a triplane feature representation using an autoencoder. A diffusion model with a limited receptive field and triplane-aware convolutions is trained on this latent space to learn local patch distributions. |
Sin3DM generates high-fidelity 3D shapes with diverse local variations while preserving global structure.
Quantitative evaluation shows Sin3DM outperforms prior single-instance 3D generation methods in terms of geometry and texture quality.
The method supports controlled generation, including retargeting, outpainting, and patch duplication. |
The generated variations primarily occur along the three axis directions due to the triplane representation.
Exploring the trade-off between generation quality and diversity is limited to adjusting the receptive field size. |
3d shape generation, diffusion models, single instance learning, triplane representation, controlled generation |
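The triplane representation mentioned in the Sin3DM summary above stores features on three axis-aligned 2D planes, and a 3D point is described by the features found at its three projections. The sketch below shows such a lookup under simplifying assumptions: nearest-neighbor sampling instead of interpolation, coordinates normalized to [-1, 1], and summation to combine the three plane features. The names and the toy planes are hypothetical.

```python
def to_index(coord, resolution):
    """Map a coordinate in [-1, 1] to the nearest grid index in [0, resolution - 1]."""
    clamped = min(1.0, max(-1.0, coord))
    return round((clamped + 1.0) / 2.0 * (resolution - 1))

def sample_plane(plane, u, v):
    """Nearest-neighbor lookup of the feature vector stored at (u, v) on a square plane."""
    resolution = len(plane)
    return plane[to_index(u, resolution)][to_index(v, resolution)]

def query_triplane(planes, point):
    """Project the point onto the XY, XZ, and YZ planes and sum the sampled features."""
    x, y, z = point
    feats = [
        sample_plane(planes["xy"], x, y),
        sample_plane(planes["xz"], x, z),
        sample_plane(planes["yz"], y, z),
    ]
    return [sum(vals) for vals in zip(*feats)]

# Toy 2x2 planes with one feature channel each.
planes = {
    "xy": [[[0.0], [1.0]], [[2.0], [3.0]]],
    "xz": [[[10.0], [20.0]], [[30.0], [40.0]]],
    "yz": [[[100.0], [200.0]], [[300.0], [400.0]]],
}
feature = query_triplane(planes, (1.0, -1.0, 1.0))
```

Because every feature lives on an axis-aligned plane, edits to the latent naturally move content along the three axes, which is consistent with the limitation noted in the summary.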
2305.15328
Report |
Visual Programming for Text-to-Image Generation and Evaluation |
Jaemin Cho, Abhay Zala, Mohit Bansal |
As large language models have demonstrated impressive performance in many
domains, recent works have adopted language models (LMs) as controllers of
visual modules for vision-and-language tasks. While existing work focuses on
equipping LMs with visual understanding, we propose two novel
interpretable/explainable visual programming frameworks for text-to-image (T2I)
generation and evaluation. First, we introduce VPGen, an interpretable
step-by-step T2I generation framework that decomposes T2I generation into three
steps: object/count generation, layout generation, and image generation. We
employ an LM to handle the first two steps (object/count generation and layout
generation), by finetuning it on text-layout pairs. Our step-by-step T2I
generation framework provides stronger spatial control than end-to-end models,
the dominant approach for this task. Furthermore, we leverage the world
knowledge of pretrained LMs, overcoming the limitation of previous
layout-guided T2I works that can only handle predefined object classes. We
demonstrate that VPGen offers better control over the counts, spatial
relations, and scales of objects than state-of-the-art T2I generation models.
Second, we introduce VPEval, an interpretable and explainable evaluation
framework for T2I generation based on visual programming. Unlike previous T2I
evaluations with a single scoring model that is accurate in some skills but
unreliable in others, VPEval produces evaluation programs that invoke a set of
visual modules that are experts in different skills, and also provides
visual+textual explanations of the evaluation results. Our analysis shows that
VPEval provides a more human-correlated evaluation for skill-specific and
open-ended prompts than widely used single model-based evaluation. We hope that
our work encourages future progress on interpretable/explainable generation and
evaluation for T2I models. |
This paper introduces two novel visual programming frameworks for text-to-image (T2I) generation and evaluation: VPGen and VPEval. |
Existing T2I generation lacks interpretable spatial control, and current evaluation methods rely on single models, lacking interpretability and struggling to accurately assess all skills. |
VPGen decomposes T2I generation into interpretable steps (object/count generation, layout generation, image generation) leveraging a fine-tuned large language model (LLM) and layout-to-image models. VPEval employs evaluation programs invoking diverse visual modules specialized for different skills, providing visual and textual explanations. |
VPGen demonstrates improved adherence to text prompts regarding object counts, spatial relationships, and object scales compared to baseline T2I models.
VPEval exhibits stronger alignment with human evaluation than existing single model-based T2I evaluation methods for both skill-specific and open-ended prompts.
Analysis reveals that while count, spatial, scale, and text rendering skills pose challenges for T2I models, VPGen excels in the first three due to its strong layout control. |
The reliance on English-heavy datasets and natural image training data may limit the generalizability of the LLMs and generation/evaluation modules to other languages or image domains.
Generating evaluation programs with LLMs can be expensive; however, the authors plan to release pre-generated programs and a locally runnable LM for this purpose. |
text-to-image generation, visual programming, interpretable ai, explainable ai, image generation evaluation |
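The evaluation-program idea in the VPEval summary above (a program that invokes skill-specific modules and returns both a score and an explanation) can be illustrated with a toy sketch. It is not the authors' API: the module names, the program format, and the detection dictionaries are hypothetical, and object detections are assumed to be supplied by some external detector.

```python
def center_x(detection):
    """Horizontal center of a box given as (x_min, y_min, x_max, y_max)."""
    return (detection["box"][0] + detection["box"][2]) / 2

def count_eval(detections, obj, expected):
    """Check that exactly `expected` instances of `obj` were detected."""
    found = sum(1 for d in detections if d["label"] == obj)
    return found == expected, f"expected {expected} {obj}(s), found {found}"

def spatial_eval(detections, obj_a, obj_b, relation):
    """Check a left/right relation between the first detected obj_a and obj_b."""
    if relation not in ("left of", "right of"):
        raise ValueError(f"unsupported relation: {relation}")
    a = next((d for d in detections if d["label"] == obj_a), None)
    b = next((d for d in detections if d["label"] == obj_b), None)
    if a is None or b is None:
        return False, f"missing {obj_a} or {obj_b}"
    ok = center_x(a) < center_x(b) if relation == "left of" else center_x(a) > center_x(b)
    return ok, f"{obj_a} is {'' if ok else 'not '}{relation} {obj_b}"

MODULES = {"count": count_eval, "spatial": spatial_eval}

def run_program(program, detections):
    """Run each (module_name, args) step; return the pass rate and the explanations."""
    results = [MODULES[name](detections, *args) for name, args in program]
    score = sum(ok for ok, _ in results) / len(results)
    return score, [message for _, message in results]

detections = [
    {"label": "dog", "box": (0, 0, 10, 10)},
    {"label": "cat", "box": (20, 0, 30, 10)},
    {"label": "dog", "box": (40, 0, 50, 10)},
]
program = [
    ("count", ("dog", 2)),
    ("spatial", ("dog", "cat", "left of")),
    ("count", ("cat", 2)),
]
score, explanations = run_program(program, detections)
```

Here two of the three checks pass, and the textual explanations show which skill failed, which is the kind of per-skill feedback a single scoring model does not provide.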
2305.15316
Report |
Training on Thin Air: Improve Image Classification with Generated Data |
Yongchao Zhou, Hshmat Sahak, Jimmy Ba |
Acquiring high-quality data for training discriminative models is a crucial
yet challenging aspect of building effective predictive systems. In this paper,
we present Diffusion Inversion, a simple yet effective method that leverages
the pre-trained generative model, Stable Diffusion, to generate diverse,
high-quality training data for image classification. Our approach captures the
original data distribution and ensures data coverage by inverting images to the
latent space of Stable Diffusion, and generates diverse novel training images
by conditioning the generative model on noisy versions of these vectors. We
identify three key components that allow our generated images to successfully
supplant the original dataset, leading to a 2-3x enhancement in sample
complexity and a 6.5x decrease in sampling time. Moreover, our approach
consistently outperforms generic prompt-based steering methods and KNN
retrieval baseline across a wide range of datasets. Additionally, we
demonstrate the compatibility of our approach with widely-used data
augmentation techniques, as well as the reliability of the generated data in
supporting various neural architectures and enhancing few-shot learning. |
This paper presents Diffusion Inversion, a novel method leveraging pre-trained generative models (specifically Stable Diffusion) to produce diverse, high-quality training data for image classification, thereby enhancing sample complexity and reducing sampling time. |
Acquiring high-quality training data is crucial for effective predictive systems but can be complex, costly, and time-consuming. This method addresses the limitations of traditional data collection and existing synthetic data generation approaches. |
The two-stage method first maps each training image to the latent space of Stable Diffusion, creating embedding vectors. Subsequently, it generates novel training images by conditioning the model on perturbed versions of these vectors. |
Diffusion Inversion achieves 2-3x improvement in sample complexity and a 6.5x reduction in sampling time compared to training on original data.
The method surpasses generic prompt-based steering methods and KNN retrieval baselines by effectively addressing data distribution shifts and ensuring data coverage.
The generated data is compatible with various neural architectures, improves few-shot learning performance, and complements traditional data augmentation techniques. |
Scaling the method to large datasets like ImageNet is challenging due to storage requirements and sampling efficiency of current diffusion models.
Potential for bias in generated data inherited from the generative model necessitates further research on bias mitigation strategies. |
data augmentation, synthetic data generation, diffusion models, image classification, stable diffusion |
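For the Diffusion Inversion report above, the second stage (conditioning the generator on perturbed versions of the per-image embedding vectors) can be illustrated with a minimal NumPy sketch. The Gaussian noise model and the names `perturb_embeddings` and `noise_scale` are assumptions made for illustration; the report only says the vectors are "noisy" or "perturbed", and the call into Stable Diffusion itself is omitted.

```python
import numpy as np

def perturb_embeddings(embeddings, num_variants, noise_scale, seed=0):
    """Return noisy copies of each embedding, shape (N, num_variants, D).

    Assumption: isotropic Gaussian noise. The report does not state the
    noise distribution or its scale.
    """
    rng = np.random.default_rng(seed)
    embeddings = np.asarray(embeddings, dtype=np.float64)
    n, d = embeddings.shape
    noise = rng.normal(0.0, noise_scale, size=(n, num_variants, d))
    # Broadcast each original embedding across its variants, then add noise.
    return embeddings[:, None, :] + noise

# Toy stand-in for embeddings obtained by inverting 4 training images.
learned = np.zeros((4, 8))
variants = perturb_embeddings(learned, num_variants=3, noise_scale=0.1)
```

Each row of `variants` would then serve as one conditioning vector, so a single training image yields several generated images.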
2305.15194
Report |
DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models |
Sungnyun Kim, Junsoo Lee, Kibeom Hong, Daesik Kim, Namhyuk Ahn |
In this study, we aim to extend the capabilities of diffusion-based
text-to-image (T2I) generation models by incorporating diverse modalities
beyond textual description, such as sketch, box, color palette, and style
embedding, within a single model. We thus design a multimodal T2I diffusion
model, coined as DiffBlender, by separating the channels of conditions into
three types, i.e., image forms, spatial tokens, and non-spatial tokens. The
unique architecture of DiffBlender facilitates adding new input modalities,
pioneering a scalable framework for conditional image generation. Notably, we
achieve this without altering the parameters of the existing generative model,
Stable Diffusion, updating only partial components. Our study establishes
new benchmarks in multimodal generation through quantitative and qualitative
comparisons with existing conditional generation methods. We demonstrate that
DiffBlender faithfully blends all the provided information and showcase its
various applications in detailed image synthesis. |
DiffBlender is a novel multimodal text-to-image diffusion model that effectively incorporates diverse conditioning modalities, such as sketch, box, color palette, and style embedding, within a single model. |
Existing text-to-image generation models struggle to incorporate diverse modalities beyond textual descriptions. This limits the user's ability to provide fine-grained details and control over the generated image. |
DiffBlender categorizes input modalities into three types: image forms, spatial tokens, and non-spatial tokens. Each type is handled by a specific conditioning module attached to the Stable Diffusion backbone. This modular design allows for the seamless integration and extension of new modalities. |
DiffBlender achieves state-of-the-art performance in multi-conditional image generation, as evidenced by high scores in quantitative metrics (YOLO, SSIM, Depth) and qualitative comparisons.
The model allows for mode-specific guidance, providing fine-grained control over the influence of each modality on the generated image.
The modular design of DiffBlender enables easy extension to new modalities with minimal computational cost. |
The model may struggle to generate coherent images when provided with conflicting conditions.
As DiffBlender is built upon Stable Diffusion, it inherits its limitations, such as difficulty in representing intricate details like human hands. |
text-to-image generation, diffusion models, multimodal conditioning, stable diffusion, mode-specific guidance |
2305.15094
Report |
InNeRF360: Text-Guided 3D-Consistent Object Inpainting on 360-degree Neural Radiance Fields |
Dongqing Wang, Tong Zhang, Alaa Abboud, Sabine Süsstrunk |
We propose InNeRF360, an automatic system that accurately removes
text-specified objects from 360-degree Neural Radiance Fields (NeRF). The
challenge is to effectively remove objects while inpainting perceptually
consistent content for the missing regions, which is particularly demanding for
existing NeRF models due to their implicit volumetric representation. Moreover,
unbounded scenes are more prone to floater artifacts in the inpainted region
than frontal-facing scenes, as the change of object appearance and background
across views is more sensitive to inaccurate segmentations and inconsistent
inpainting. With a trained NeRF and a text description, our method efficiently
removes specified objects and inpaints visually consistent content without
artifacts. We apply depth-space warping to enforce consistency across multiview
text-encoded segmentations, and then refine the inpainted NeRF model using
perceptual priors and 3D diffusion-based geometric priors to ensure visual
plausibility. Through extensive experiments in segmentation and inpainting on
360-degree and frontal-facing NeRFs, we show that our approach is effective and
enhances NeRF's editability. Project page: https://ivrl.github.io/InNeRF360. |
InNeRF360 is the first system for text-guided object removal and inpainting in 360-degree Neural Radiance Fields (NeRF), enabling object-level editing with perceptually consistent results. |
Existing methods struggle with 360-degree scenes due to limitations in multi-view consistency for both segmentation and inpainting, especially under object occlusion and geometry deformation across viewpoints. InNeRF360 addresses this by combining accurate segmentation with 3D-aware inpainting. |
1. **Multiview Consistent Segmentation:** Leverages Segment Anything Model (SAM) with depth-warped prompts for accurate object masks across views. 2. **Inpainting 360-degree NeRF:** Uses 2D inpainted images to initialize a new NeRF, then refines it with a 3D diffusion-based geometric prior to eliminate floaters and a perceptual prior for consistent texture. |
Achieves accurate and consistent object segmentation in 360-degree scenes, even for challenging cases like transparent objects.
Generates high-quality inpainted NeRFs without floaters, seamlessly blending the modifications into the original scene.
Quantitative evaluation shows superior performance over per-frame inpainting and baseline methods in terms of visual consistency and inpainting quality. |
Performance depends on the accuracy of the initial 2D object detection, which can be limited for ambiguous text instructions.
Future work includes exploring more sophisticated text-guided 3D editing and addressing limitations of current vision-language models. |
neural radiance fields, nerf inpainting, 3d scene editing, text-guided image editing, multiview segmentation |
2305.14849
Report |
DuDGAN: Improving Class-Conditional GANs via Dual-Diffusion |
Taesun Yeom, Minhyeok Lee |
Class-conditional image generation using generative adversarial networks
(GANs) has been investigated through various techniques; however, it continues
to face challenges such as mode collapse, training instability, and low-quality
output in cases of datasets with high intra-class variation. Furthermore, most
GANs need many iterations to converge, resulting in poor iteration efficiency
during training. While Diffusion-GAN has shown potential in generating
realistic samples, it has a critical limitation in generating class-conditional
samples. To overcome these limitations, we propose a novel approach for
class-conditional image generation using GANs called DuDGAN, which incorporates
a dual diffusion-based noise injection process. Our method consists of three
unique networks: a discriminator, a generator, and a classifier. During the
training process, Gaussian-mixture noises are injected into the two noise-aware
networks, the discriminator and the classifier, in distinct ways. This noisy
data helps to prevent overfitting by gradually introducing more challenging
tasks, leading to improved model performance. As a result, our method
outperforms state-of-the-art conditional GAN models for image generation in
terms of performance. We evaluated our method using the AFHQ, Food-101, and
CIFAR-10 datasets and observed superior results across metrics such as FID,
KID, Precision, and Recall score compared with comparison models, highlighting
the effectiveness of our approach. |
DuDGAN, a novel approach for class-conditional image generation using GANs, incorporates a dual diffusion-based noise injection process to improve quality and iteration efficiency. |
Conditional image generation with GANs often suffers from issues like mode collapse, training instability, and low-quality output, particularly with limited data and high intra-class variation. Existing methods often require extensive training iterations, making them inefficient. |
DuDGAN utilizes three networks: a generator, a discriminator, and a classifier. Gaussian-mixture noises are injected into the discriminator and classifier during training. The classifier, trained only on real images, provides high-dimensional class information and class logits, aiding the generator in producing diverse and high-fidelity images. |
DuDGAN outperforms state-of-the-art conditional GAN models in terms of FID and KID on AFHQ and CIFAR-10 datasets, indicating superior generation quality.
It achieves faster convergence within a smaller number of iterations compared to other models.
The generated images demonstrate high visual quality with fine details, accurate colors, and clear textures. |
The model's performance on the Food-101 dataset, while improved, suggests a need for further exploration in handling highly diverse datasets.
Future work could involve investigating the impact of varying noise schedules and exploring alternative augmentation techniques. |
generative adversarial networks, image generation, conditional image synthesis, diffusion models, noise injection |
2305.14840
Report |
Predicting Token Impact Towards Efficient Vision Transformer |
Hong Wang, Su Yang, Xiaoke Huang, Weishan Zhang |
Token filtering to reduce irrelevant tokens prior to self-attention is a
straightforward way to enable efficient vision Transformers. This is the first
work to view token filtering from a feature selection perspective, where we
weigh the importance of a token according to how much it can change the loss
once masked. If the loss changes greatly after masking a token of interest, it
means that such a token has a significant impact on the final decision and is
thus relevant. Otherwise, the token is less important for the final decision,
so it can be filtered out. After applying the token filtering module
generalized from the whole training data, the number of tokens fed to the
self-attention module can be markedly reduced in the inference phase, leading
to far fewer computations in all subsequent self-attention layers. The
token filter can be realized using a very simple network; we use a
multi-layer perceptron. Besides performing token filtering
only once, at the very beginning prior to self-attention, the other core
feature that distinguishes our method from other token filters is the
predictability of token impact from a feature selection point of view. The
experiments show that the proposed method offers an efficient route to
a lightweight model by fine-tuning together with a backbone,
which is easier to deploy than existing methods based on
training from scratch. |
This paper proposes DL-ViT, an efficient vision Transformer that predicts token impact from a feature selection perspective to filter irrelevant tokens before self-attention, resulting in a lighter model without significant accuracy loss. |
Vision Transformers, while powerful, suffer from heavy computational loads, hindering their application in edge computing. Existing token filtering methods are often heuristic-based, lack explainability, and require gradual token reduction throughout the model, making them less efficient. |
The method involves two phases: (1) It uses a novel metric called 'delta loss' (DL) to measure a token's impact on the classification loss when masked. Tokens with large DL values are labeled as important. This data is used to train an MLP-based binary classifier for token filtering. (2) The trained token filter is applied before the Transformer backbone, and the entire pipeline is fine-tuned end-to-end. |
DL-ViT achieves state-of-the-art performance in terms of both efficiency and accuracy compared to existing lightweight ViT models.
The method leads to a significant reduction (up to 46%) in FLOPs compared to the DeiT backbone while maintaining comparable accuracy.
The study demonstrates that incorporating global image features into the token selection module enhances performance. |
The method relies on a single hyperparameter (ρ) to control the significance of token importance during the labeling process, requiring careful tuning.
Future work will explore token relevance at middle layers to further enhance efficiency. |
vision transformer, token filtering, efficient deep learning, feature selection, delta loss |
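For the DL-ViT report above, the delta-loss labeling step (mask one token, measure how much the classification loss changes, label tokens with a large change as important) can be sketched on a toy model. Everything concrete here is an assumption for illustration: the mean-pool-plus-linear classifier stands in for the ViT backbone, masking is implemented as zeroing a token, the signed loss increase is used as the delta, and `rho` is applied as a simple threshold.

```python
import numpy as np

def cross_entropy(logits, label):
    # Numerically stable softmax cross-entropy for a single example.
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def toy_loss(tokens, weights, label):
    # Hypothetical stand-in for a ViT: mean-pool tokens, then a linear classifier.
    return cross_entropy(tokens.mean(axis=0) @ weights, label)

def delta_losses(tokens, weights, label):
    """Loss change caused by masking each token in turn."""
    base = toy_loss(tokens, weights, label)
    deltas = np.empty(len(tokens))
    for i in range(len(tokens)):
        masked = tokens.copy()
        masked[i] = 0.0  # assumption: masking means zeroing the token
        deltas[i] = toy_loss(masked, weights, label) - base
    return deltas

def label_tokens(deltas, rho):
    # Assumption: a token is "important" when its delta loss exceeds rho.
    return deltas > rho

# Token 0 carries the class evidence; tokens 1 and 2 are empty.
tokens = np.zeros((3, 2))
tokens[0] = [3.0, 0.0]
weights = np.eye(2)
deltas = delta_losses(tokens, weights, label=0)
important = label_tokens(deltas, rho=0.1)
```

The resulting boolean labels are the kind of supervision the report describes for training the MLP-based token filter.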
2305.14831
Report |
OD-NeRF: Efficient Training of On-the-Fly Dynamic Neural Radiance Fields |
Zhiwen Yan, Chen Li, Gim Hee Lee |
Dynamic neural radiance fields (dynamic NeRFs) have demonstrated impressive
results in novel view synthesis on 3D dynamic scenes. However, they often
require complete video sequences for training followed by novel view synthesis,
which is similar to playing back the recording of a dynamic 3D scene. In
contrast, we propose OD-NeRF to efficiently train and render dynamic NeRFs
on-the-fly which instead is capable of streaming the dynamic scene. When
training on-the-fly, the training frames become available sequentially and the
model is trained and rendered frame-by-frame. The key challenge of efficient
on-the-fly training is how to utilize the radiance field estimated from the
previous frames effectively. To tackle this challenge, we propose: 1) a NeRF
model conditioned on the multi-view projected colors to implicitly track
correspondence between the current and previous frames, and 2) a transition and
update algorithm that leverages the occupancy grid from the last frame to
sample efficiently at the current frame. Our algorithm can achieve an
interactive speed of 6FPS training and rendering on synthetic dynamic scenes
on-the-fly, and a significant speed-up compared to the state-of-the-art on
real-world dynamic scenes. |
This paper introduces OD-NeRF, a new method for efficiently training and rendering dynamic neural radiance fields (NeRFs) on-the-fly, enabling real-time streaming of dynamic 3D scenes. |
Existing dynamic NeRFs typically require complete video sequences for training, limiting their use in real-time applications like streaming. On-the-fly training allows for the reconstruction and rendering of dynamic scenes as they happen. |
The authors propose two key techniques: 1) a NeRF model conditioned on multi-view projected colors to implicitly track point correspondence across frames and 2) a transition and update algorithm for the occupancy grid, leveraging information from previous frames for efficient sampling. |
OD-NeRF achieves an interactive speed of 6 FPS for on-the-fly training and rendering on synthetic dynamic scenes.
The method demonstrates significant speed-up compared to state-of-the-art techniques on real-world dynamic scenes.
OD-NeRF maintains comparable rendering quality to existing methods while achieving faster on-the-fly training. |
The implicit correspondence of the projected color-guided NeRF relies on the relative invariance of projected colors, which can be affected by specular surfaces and occlusions.
Future work could explore techniques to filter outlier projected colors or explicitly detect occlusions to improve the robustness of the method. |
neural radiance fields, dynamic scene reconstruction, on-the-fly training, novel view synthesis, 3d streaming |
2305.14777
Report |
Generative Modeling through the Semi-dual Formulation of Unbalanced Optimal Transport |
Jaemoo Choi, Jaewoong Choi, Myungjoo Kang |
The Optimal Transport (OT) problem seeks a transport map that bridges two
distributions while minimizing a given cost function. In this regard, OT
between a tractable prior distribution and data has been utilized for generative
modeling tasks. However, OT-based methods are susceptible to outliers and face
optimization challenges during training. In this paper, we propose a novel
generative model based on the semi-dual formulation of Unbalanced Optimal
Transport (UOT). Unlike OT, UOT relaxes the hard constraint on distribution
matching. This approach provides better robustness against outliers, stability
during training, and faster convergence. We validate these properties
empirically through experiments. Moreover, we study the theoretical upper-bound
of divergence between distributions in UOT. Our model outperforms existing
OT-based generative models, achieving FID scores of 2.97 on CIFAR-10 and 6.36
on CelebA-HQ-256. The code is available at
https://github.com/Jae-Moo/UOTM. |
This paper proposes UOTM, a novel generative model based on the semi-dual formulation of Unbalanced Optimal Transport (UOT) that relaxes the hard constraint on distribution matching in OT. |
OT-based generative models, while effective, suffer from sensitivity to outliers and optimization challenges. UOT offers a solution by enabling outlier robustness and stable training. |
The authors leverage the semi-dual formulation of UOT to derive a new objective function. They then parameterize the potential and transport map using neural networks and optimize them through an adversarial training procedure similar to GANs. |
UOTM exhibits strong robustness against outliers, outperforming OT-based methods on datasets with injected outliers.
Despite the relaxed constraints, UOTM achieves superior target distribution matching compared to OT-based counterparts.
UOTM demonstrates faster and more stable convergence, requiring significantly fewer training epochs to reach comparable performance. |
The hyperparameter tau, controlling the trade-off between cost and marginal matching, requires careful tuning for optimal performance.
Further investigation into the theoretical properties of UOTM, particularly regarding the role of the auxiliary variable and regularization, is necessary. |
generative models, optimal transport, unbalanced optimal transport, outlier robustness, stable training |
2305.14742
Report |
ChatFace: Chat-Guided Real Face Editing via Diffusion Latent Space Manipulation |
Dongxu Yue, Qin Guo, Munan Ning, Jiaxi Cui, Yuesheng Zhu, Li Yuan |
Editing real facial images is a crucial task in computer vision with
significant demand in various real-world applications. While GAN-based methods
have shown potential in manipulating images, especially when combined with
CLIP, these methods are limited in their ability to reconstruct real images due
to the difficulty of GAN inversion. Despite the successful image
reconstruction achieved by diffusion-based methods, there are still challenges
in effectively manipulating fine-grained facial attributes with textual
instructions. To address these issues and facilitate convenient manipulation of
real facial images, we propose a novel approach that conducts text-driven image
editing in the semantic latent space of a diffusion model. By aligning the
temporal feature of the diffusion model with the semantic condition during the
generative process, we introduce a stable manipulation strategy, which performs
precise zero-shot manipulation effectively. Furthermore, we develop an
interactive system named ChatFace, which combines the zero-shot reasoning
ability of large language models to perform efficient manipulations in
diffusion semantic latent space. This system enables users to perform complex
multi-attribute manipulations through dialogue, opening up new possibilities
for interactive image editing. Extensive experiments confirmed that our
approach outperforms previous methods and enables precise editing of real
facial images, making it a promising candidate for real-world applications.
Project page: https://dongxuyue.github.io/chatface/ |
This paper presents ChatFace, an interactive system for high-quality real facial image editing using text instructions in the semantic latent space of a diffusion model. |
Existing GAN-based methods struggle with real image reconstruction, while diffusion models face challenges in fine-grained facial attribute manipulation with text. |
An LLM parses user requests and controls editing attributes in the diffusion model's semantic latent space. A mapping network infers manipulation directions, and a Stable Manipulation Strategy (SMS) ensures precise zero-shot editing. |
Outperforms SOTA methods in quantitative metrics (directional CLIP similarity, segmentation-consistency, face identity similarity).
Human evaluation confirms superior performance in semantic relevance, visual realism, and identity consistency.
Enables fine-grained control over various facial attributes, including multi-attribute editing. |
Limited to the domain of the pre-trained diffusion autoencoder.
Generalization to visually diverse datasets requires further investigation. |
image editing, diffusion models, large language models, semantic manipulation, interactive system |
2305.14720
Report |
BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing |
Dongxu Li, Junnan Li, Steven C. H. Hoi |
Subject-driven text-to-image generation models create novel renditions of an
input subject based on text prompts. Existing models suffer from lengthy
fine-tuning and difficulties preserving the subject fidelity. To overcome these
limitations, we introduce BLIP-Diffusion, a new subject-driven image generation
model that supports multimodal control which consumes inputs of subject images
and text prompts. Unlike other subject-driven generation models, BLIP-Diffusion
introduces a new multimodal encoder which is pre-trained to provide subject
representation. We first pre-train the multimodal encoder following BLIP-2 to
produce visual representation aligned with the text. Then we design a subject
representation learning task which enables a diffusion model to leverage such
visual representation and generate new subject renditions. Compared with
previous methods such as DreamBooth, our model enables zero-shot subject-driven
generation, and efficient fine-tuning for customized subject with up to 20x
speedup. We also demonstrate that BLIP-Diffusion can be flexibly combined with
existing techniques such as ControlNet and prompt-to-prompt to enable novel
subject-driven generation and editing applications. Code and models will be
released at
https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion. Project
page at https://dxli94.github.io/BLIP-Diffusion-website/. |
This paper introduces BLIP-Diffusion, a novel subject-driven text-to-image generation model that leverages pre-trained generic subject representation for efficient and high-fidelity image synthesis. |
Existing subject-driven generation models suffer from lengthy fine-tuning processes and difficulties in preserving subject fidelity. BLIP-Diffusion addresses these limitations by introducing pre-trained subject representation, enabling zero-shot generation or efficient fine-tuning with significant speedups. |
BLIP-Diffusion employs a two-stage pre-training strategy: (1) Multimodal representation learning with BLIP-2 to produce text-aligned visual features. (2) Subject representation learning using a novel prompted context generation task, where the model learns to generate subject renditions based on synthesized images with random backgrounds. |
BLIP-Diffusion achieves promising zero-shot subject-driven generation results.
It enables efficient fine-tuning for customized subjects with up to 20x speedup compared to previous methods like DreamBooth.
The model can be seamlessly integrated with existing techniques like ControlNet and prompt-to-prompt for enhanced control and editing capabilities. |
BLIP-Diffusion can still exhibit failures common to subject-driven generation models, such as inaccurate context synthesis and overfitting to the training set.
The model may inherit limitations from the underlying diffusion model, impacting its ability to fully comprehend complex text prompts and compositional relationships. |
text-to-image generation, subject-driven generation, diffusion models, multimodal learning, blip-2 |
2305.14677
Report |
Optimal Linear Subspace Search: Learning to Construct Fast and High-Quality Schedulers for Diffusion Models |
Zhongjie Duan, Chengyu Wang, Cen Chen, Jun Huang, Weining Qian |
In recent years, diffusion models have become the most popular and powerful
methods in the field of image synthesis, even rivaling human artists in
artistic creativity. However, the key issue currently limiting the application
of diffusion models is their extremely slow generation process. Although several
methods were proposed to speed up the generation process, there still exists a
trade-off between efficiency and quality. In this paper, we first provide a
detailed theoretical and empirical analysis of the generation process of the
diffusion models based on schedulers. We transform the designing problem of
schedulers into the determination of several parameters, and further transform
the accelerated generation process into an expansion process of the linear
subspace. Based on these analyses, we consequently propose a novel method
called Optimal Linear Subspace Search (OLSS), which accelerates the generation
process by searching for the optimal approximation process of the complete
generation process in the linear subspaces spanned by latent variables. OLSS is
able to generate high-quality images with a very small number of steps. To
demonstrate the effectiveness of our method, we conduct extensive comparative
experiments on open-source diffusion models. Experimental results show that
with a given number of steps, OLSS can significantly improve the quality of
generated images. Using an NVIDIA A100 GPU, we make it possible to generate a
high-quality image by Stable Diffusion within only one second without other
optimization techniques. |
This paper proposes OLSS (Optimal Linear Subspace Search), a novel diffusion scheduler that accelerates image generation by searching for the optimal approximation of the complete generation process within linear subspaces spanned by latent variables. |
Diffusion models, despite their prowess in image synthesis, suffer from slow generation speed. OLSS addresses this limitation by significantly reducing the number of inference steps while maintaining high image quality. |
The paper analyzes the diffusion model generation process, modeling it as a linear subspace expansion. OLSS replaces iterative formula coefficients with trainable parameters, solved using least squares methods, to control subspace expansion. A path optimization algorithm further enhances performance by tuning sampling steps. |
OLSS achieves superior image quality compared to state-of-the-art schedulers with the same number of steps.
The path optimization algorithm in OLSS further improves performance compared to uniform step selection.
OLSS demonstrates effectiveness in both open-domain and closed-domain image synthesis tasks. |
The current path optimization algorithm in OLSS could be further improved for even better efficiency.
Exploration of improving generative quality based on modifications in the latent space is a potential future direction. |
diffusion models, image synthesis, computational efficiency, schedulers, path optimization |
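For the OLSS report above, the core idea of solving for step coefficients by least squares within a linear subspace can be sketched as follows. This is an illustrative toy, not the authors' scheduler: the rows of `basis` stand in for the latent variables that span the subspace, `target` stands in for the latent reached by the complete generation process, and the function name `fit_step_coefficients` is hypothetical.

```python
import numpy as np

def fit_step_coefficients(basis, target):
    """Least-squares coefficients w minimizing ||basis.T @ w - target||.

    basis:  (k, d) array whose rows span the subspace.
    target: (d,) vector to approximate inside that subspace.
    """
    coeffs, *_ = np.linalg.lstsq(basis.T, target, rcond=None)
    return coeffs

# Toy data: a target that lies exactly in the span of three random vectors.
rng = np.random.default_rng(0)
basis = rng.normal(size=(3, 16))
true_coeffs = np.array([0.5, -1.0, 2.0])
target = true_coeffs @ basis

coeffs = fit_step_coefficients(basis, target)
approx = coeffs @ basis  # best approximation of target within the subspace
```

When the target does not lie in the span, `approx` is its orthogonal projection onto the subspace, which is the sense in which the fit is optimal under a squared-error criterion.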
2305.14674
Report |
T1: Scaling Diffusion Probabilistic Fields to High-Resolution on Unified Visual Modalities |
Kangfu Mei, Mo Zhou, Vishal M. Patel |
Diffusion Probabilistic Field (DPF) models the distribution of continuous
functions defined over metric spaces. While DPF shows great potential for
unifying data generation of various modalities including images, videos, and 3D
geometry, it does not scale to a higher data resolution. This can be attributed
to the "scaling property", where it is difficult for the model to capture
local structures through uniform sampling. To this end, we propose a new model
comprising a view-wise sampling algorithm to focus on local structure
learning, and incorporating additional guidance, e.g., text description, to
complement the global geometry. The model can be scaled to generate
high-resolution data while unifying multiple modalities. Experimental results
on data generation in various modalities demonstrate the effectiveness of our
model, as well as its potential as a foundation framework for scalable
modality-unified visual content generation. |
This paper proposes T1, a new diffusion-based field model for scalable, modality-unified visual content generation. T1 leverages a novel view-wise sampling algorithm and incorporates text descriptions as inductive biases to preserve both local structure and global geometry of the data. |
Existing diffusion-based field models struggle to scale to high-resolution data due to limitations in capturing local structures through uniform sampling and lack of global geometry guidance. |
T1 uses a view-wise sampling algorithm that extracts local, high-resolution coordinate-signal pairs. It also incorporates text descriptions as inductive bias to guide the generation process and preserve global geometry. |
T1 outperforms previous domain-agnostic methods and achieves competitive results against domain-specific approaches on image, video, and 3D viewpoint generation tasks.
T1 is able to generate high-resolution videos under affordable computational resources.
Ablation studies validate the contribution of the proposed sampling algorithm and text conditioning. |
The scaling property is only resolved for spatial dimensions, and generating extremely long videos with complex dynamics remains challenging.
The method is only applicable to visual modalities interpretable by views. |
diffusion models, generative models, field models, text-to-video generation, novel view synthesis |
2305.14345
Report |
NCHO: Unsupervised Learning for Neural 3D Composition of Humans and Objects |
Taeksoo Kim, Shunsuke Saito, Hanbyul Joo |
Deep generative models have been recently extended to synthesizing 3D digital
humans. However, previous approaches treat clothed humans as a single chunk of
geometry without considering the compositionality of clothing and accessories.
As a result, individual items cannot be naturally composed into novel
identities, leading to limited expressiveness and controllability of generative
3D avatars. While several methods attempt to address this by leveraging
synthetic data, the interaction between humans and objects is not authentic due
to the domain gap, and manual asset creation is difficult to scale for a wide
variety of objects. In this work, we present a novel framework for learning a
compositional generative model of humans and objects (backpacks, coats,
scarves, and more) from real-world 3D scans. Our compositional model is
interaction-aware, meaning that the spatial relationship between humans and objects
and the mutual shape change caused by physical contact are fully incorporated. The key
challenge is that, since humans and objects are in contact, their 3D scans are
merged into a single piece. To decompose them without manual annotations, we
propose to leverage two sets of 3D scans of a single person with and without
objects. Our approach learns to decompose objects and naturally compose them
back into a generative human model in an unsupervised manner. Despite our
simple setup requiring only the capture of a single subject with objects, our
experiments demonstrate the strong generalization of our model by enabling the
natural composition of objects to diverse identities in various poses and the
composition of multiple objects, which is unseen in training data.
https://taeksuu.github.io/ncho/ |
This paper presents NCHO, a novel framework for learning a compositional generative model of humans and objects from real-world 3D scans, enabling separate control over human identity and attached objects like backpacks and coats. |
Existing 3D human generative models often treat clothing and accessories as entangled geometry, limiting controllability and expressiveness for tasks like virtual try-on or avatar creation. |
The method leverages paired 3D scans of a source person with and without objects to decompose object geometry. It trains separate human and object modules, combining them with a neural composition module for realistic interactions. |
NCHO demonstrates superior generation quality and disentanglement compared to baselines, as evidenced by FID scores and user studies.
The model generalizes to unseen identities and object instances, enabling diverse and controllable avatar generation.
It allows object removal from 3D scans and composition of multiple objects, showcasing capabilities beyond training data. |
Decomposing thin clothing layers remains challenging due to 3D scan limitations.
Future work includes extending the approach to handle RGB images as input. |
3d human modeling, generative models, compositional modeling, unsupervised learning, 3d object decomposition |
2305.14334
Report |
Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence |
Grace Luo, Lisa Dunlap, Dong Huk Park, Aleksander Holynski, Trevor Darrell |
Diffusion models have been shown to be capable of generating high-quality
images, suggesting that they could contain meaningful internal representations.
Unfortunately, the feature maps that encode a diffusion model's internal
information are spread not only over layers of the network, but also over
diffusion timesteps, making it challenging to extract useful descriptors. We
propose Diffusion Hyperfeatures, a framework for consolidating multi-scale and
multi-timestep feature maps into per-pixel feature descriptors that can be used
for downstream tasks. These descriptors can be extracted for both synthetic and
real images using the generation and inversion processes. We evaluate the
utility of our Diffusion Hyperfeatures on the task of semantic keypoint
correspondence: our method achieves superior performance on the SPair-71k real
image benchmark. We also demonstrate that our method is flexible and
transferable: our feature aggregation network trained on the inversion features
of real image pairs can be used on the generation features of synthetic image
pairs with unseen objects and compositions. Our code is available at
https://diffusion-hyperfeatures.github.io. |
This paper introduces Diffusion Hyperfeatures, a method to extract per-pixel feature descriptors from diffusion models by consolidating multi-scale and multi-timestep feature maps. |
Diffusion models appear to contain meaningful internal representations, but extracting useful features is challenging because the relevant feature maps are spread across both network layers and diffusion timesteps. This work offers a way to leverage these representations for downstream tasks. |
The method uses an aggregation network to combine intermediate feature maps from the diffusion process, learning mixing weights to identify the most meaningful features for a specific task (e.g., semantic correspondence). |
Diffusion Hyperfeatures outperform DINOv2 and CATS++ on semantic keypoint correspondence for real images (SPair-71k, CUB).
The aggregation network successfully transfers to unseen synthetic images, enabling the creation of datasets with pseudo-ground truth semantic correspondences.
Analysis of mixing weights shows that different model variants (SDv1-5 vs. SDv2-1) require different feature map prioritization for optimal performance. |
The current method is limited by memory constraints when aggregating many timesteps.
Future work could explore more efficient architectures or incorporate attention mechanisms to reduce memory footprint. |
diffusion models, feature representation, semantic correspondence, keypoint matching, feature aggregation |
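The aggregation idea in the Diffusion Hyperfeatures entry (learned mixing weights over feature maps from different layers and timesteps) can be sketched as a softmax-weighted sum. This is a minimal illustration, assuming the feature maps have already been resized and projected to a common shape; the names `aggregate` and `mixing_logits` are hypothetical, and the actual aggregation network also includes components not shown here.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def aggregate(feature_maps, mixing_logits):
    """Blend K feature maps of identical shape with softmax-normalized weights.

    feature_maps: array of shape (K, C, H, W), one entry per (layer, timestep)
                  pair, assumed already brought to a common shape.
    mixing_logits: array of shape (K,), standing in for the learned scalars.
    """
    weights = softmax(mixing_logits)
    # Contract the K axis; the result has shape (C, H, W).
    return np.tensordot(weights, feature_maps, axes=1), weights

rng = np.random.default_rng(0)
maps = rng.normal(size=(6, 8, 4, 4))  # hypothetical: 3 layers x 2 timesteps
logits = np.zeros(6)                  # untrained logits give a uniform blend
descriptor, weights = aggregate(maps, logits)
```

After training, inspecting `weights` would show which layer and timestep combinations the task relies on, which is the kind of analysis the entry reports for SDv1-5 versus SDv2-1.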
2305.14330
Report |
DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation |
Susung Hong, Junyoung Seo, Heeseong Shin, Sunghwan Hong, Seungryong Kim |
In the paradigm of AI-generated content (AIGC), there has been increasing
attention to transferring knowledge from pre-trained text-to-image (T2I) models
to text-to-video (T2V) generation. Despite their effectiveness, these
frameworks face challenges in maintaining consistent narratives and handling
shifts in scene composition or object placement from a single abstract user
prompt. Exploring the ability of large language models (LLMs) to generate
time-dependent, frame-by-frame prompts, this paper introduces a new framework,
dubbed DirecT2V. DirecT2V leverages instruction-tuned LLMs as directors,
enabling the inclusion of time-varying content and facilitating consistent
video generation. To maintain temporal consistency and prevent mapping the
value to a different object, we equip a diffusion model with a novel value
mapping method and dual-softmax filtering, which do not require any additional
training. The experimental results validate the effectiveness of our framework
in producing visually coherent and storyful videos from abstract user prompts,
successfully addressing the challenges of zero-shot video generation. |
This paper introduces DirecT2V, a novel framework for zero-shot text-to-video generation that uses large language models (LLMs) as frame-level directors to enhance narrative consistency and handle time-varying content in videos. |
Existing zero-shot text-to-video generation methods struggle to maintain narrative consistency and handle complex actions or scene changes over time due to relying on a single user prompt for all frames. |
The method leverages instruction-tuned LLMs (e.g., GPT-4) to generate frame-by-frame descriptions from a single abstract user prompt. It applies rotational value mapping and dual-softmax filtering within the text-to-image diffusion model to improve temporal coherence and flexibility. |
DirecT2V successfully generates videos with consistent narratives and time-varying content, outperforming existing zero-shot methods.
Rotational value mapping in DirecT2V enables diverse context integration across frames while maintaining temporal consistency.
Dual softmax filtering effectively reduces inaccurate matching during value mapping, leading to more coherent video generation. |
The performance of DirecT2V is dependent on the capabilities and limitations of the chosen LLM for frame-level directing.
DirecT2V relies on pre-trained text-to-image diffusion models, which may inherit limitations in accurate object counting and positioning. |
text-to-video generation, large language models, zero-shot learning, diffusion models, temporal consistency |
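The dual-softmax filtering mentioned for DirecT2V can be illustrated with the generic dual-softmax confidence used in feature matching: a match is kept only when it is strong along both the rows and the columns of a similarity matrix. This is a sketch under that assumption; the threshold, the temperature, and the function names are hypothetical, and the paper's exact formulation inside the attention layers may differ.

```python
import numpy as np

def dual_softmax(sim, temperature=1.0):
    """Product of row-wise and column-wise softmax of a similarity matrix."""
    s = sim / temperature
    row = np.exp(s - s.max(axis=1, keepdims=True))
    row /= row.sum(axis=1, keepdims=True)
    col = np.exp(s - s.max(axis=0, keepdims=True))
    col /= col.sum(axis=0, keepdims=True)
    return row * col

def filter_matches(sim, threshold=0.5):
    """Boolean mask of matches whose dual-softmax confidence exceeds threshold."""
    conf = dual_softmax(sim)
    return conf > threshold, conf

# Rows: positions in the current frame; columns: positions in the anchor frame.
sim = np.array([[8.0, 0.0, 0.0],
                [0.0, 8.0, 0.0],
                [1.0, 1.0, 1.0]])  # the last row is ambiguous
mask, conf = filter_matches(sim)
```

In this toy case the two unambiguous rows keep their matches, while the ambiguous third row is filtered out entirely, which is the behaviour that would prevent a value from being mapped to the wrong object.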
2305.14312
Report |
Text-guided 3D Human Generation from 2D Collections |
Tsu-Jui Fu, Wenhan Xiong, Yixin Nie, Jingyu Liu, Barlas Oğuz, William Yang Wang |
3D human modeling has been widely used for engaging interaction in gaming,
film, and animation. The customization of these characters is crucial for
creativity and scalability, which highlights the importance of controllability.
In this work, we introduce Text-guided 3D Human Generation (T3H),
where a model generates a 3D human guided by a fashion description.
There are two goals: 1) the 3D human should render articulately, and 2) its
outfit is controlled by the given text. To address this T3H task, we
propose Compositional Cross-modal Human (CCH). CCH adopts cross-modal attention
to fuse compositional human rendering with the extracted fashion semantics.
Each human body part perceives relevant textual guidance as its visual
patterns. We incorporate the human prior and semantic discrimination to enhance
3D geometry transformation and fine-grained consistency, enabling it to learn
from 2D collections for data efficiency. We conduct evaluations on DeepFashion
and SHHQ with diverse fashion attributes covering the shape, fabric, and color
of upper and lower clothing. Extensive experiments demonstrate that CCH
achieves superior results for T3H with high efficiency. |
This paper introduces Text-guided 3D Human Generation (T3H), aiming to generate controllable 3D human models with customized outfits from fashion descriptions. |
This work addresses the limitation of previous 3D human modeling approaches that rely on multi-view videos or lack language controllability. It enables efficient and customizable generation of 3D humans for various applications like gaming and animation. |
The authors propose Compositional Cross-modal Human (CCH), which leverages cross-modal attention to fuse compositional human rendering with extracted fashion semantics. It incorporates the human prior (SMPL) for robust geometry transformation and semantic discrimination for fine-grained consistency with descriptions. |
CCH achieves superior results for T3H with high efficiency compared to baselines like Latent-NeRF, TEXTure, and CLIP-O.
CCH exhibits comprehensive superiority across metrics like FID, Depth, PCK, CLIP-S, and FA, indicating its effectiveness in generating realistic and textually aligned 3D humans.
The ablation study shows the importance of textual guidance, cross-modal attention, and semantic discrimination for effective T3H. |
The reliance on SMPL parameters can cause quality degradation if the estimation is inaccurate.
Datasets used for training have limited viewing angles, leading to artifacts in 3D consistency. Future work can explore diverse datasets and improve handling of challenging poses. |
3d human generation, text-guided synthesis, cross-modal attention, neural rendering, compositional modeling |
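The cross-modal attention described for CCH, in which each body part attends to the relevant parts of the fashion description, follows the general pattern of scaled dot-product cross-attention. The sketch below shows only that generic pattern; it omits the learned query, key, and value projections, and the shapes and names (`part_feats`, `text_feats`) are assumptions rather than the paper's interface.

```python
import numpy as np

def cross_attention(part_feats, text_feats):
    """Scaled dot-product cross-attention without learned projections.

    part_feats: (P, D) features of P body parts, used as queries.
    text_feats: (N, D) features of N text tokens, used as keys and values.
    Returns the attended features (P, D) and the attention weights (P, N).
    """
    d = part_feats.shape[-1]
    scores = part_feats @ text_feats.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ text_feats, weights

rng = np.random.default_rng(0)
parts = rng.normal(size=(5, 16))   # hypothetical: 5 body parts
tokens = rng.normal(size=(7, 16))  # hypothetical: 7 description tokens
fused, attn = cross_attention(parts, tokens)
```

Each row of `attn` is a distribution over text tokens, so a part such as the upper body can draw mostly on the tokens describing the upper clothing.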
2305.14207
Report |
SAD: Segment Any RGBD |
Jun Cen, Yizheng Wu, Kewei Wang, Xingyi Li, Jingkang Yang, Yixuan Pei, Lingdong Kong, Ziwei Liu, Qifeng Chen |
The Segment Anything Model (SAM) has demonstrated its effectiveness in
segmenting any part of 2D RGB images. However, SAM exhibits a stronger emphasis
on texture information while paying less attention to geometry information when
segmenting RGB images. To address this limitation, we propose the Segment Any
RGBD (SAD) model, which is specifically designed to extract geometry
information directly from images. Inspired by the natural ability of humans to
identify objects through the visualization of depth maps, SAD utilizes SAM to
segment the rendered depth map, thus providing cues with enhanced geometry
information and mitigating the issue of over-segmentation. We further include
the open-vocabulary semantic segmentation in our framework, so that the 3D
panoptic segmentation is fulfilled. The project is available on
https://github.com/Jun-CEN/SegmentAnyRGBD. |
This paper introduces Segment Any RGBD (SAD), a novel model that leverages Segment Anything Model (SAM) and Open-Vocabulary Semantic Segmentation (OVSeg) to perform semantic segmentation by incorporating geometric information from depth maps. |
This work addresses the limitations of SAM, which primarily relies on texture information and often leads to over-segmentation, by incorporating depth information to improve segmentation accuracy. |
SAD renders depth maps to RGB space and uses them as input for SAM, generating initial masks. These masks are then refined using coarse semantic masks from OVSeg. Finally, a clustering process groups adjacent segments of the same class. |
SAD effectively reduces over-segmentation compared to using RGB images directly with SAM.
The incorporation of depth information leads to more accurate segmentation results, particularly in distinguishing objects with similar textures.
SAD demonstrates its effectiveness in generating geometrically sound semantic segmentation results on both Sailvos3D and ScanNet datasets. |
The model may struggle to distinguish between objects in close proximity when they lack distinct geometric features in the depth map.
Future work could focus on improving the model's ability to handle challenging scenarios, such as scenes with significant occlusions or varying lighting conditions. |
semantic segmentation, depth maps, segment anything model (sam), open-vocabulary segmentation, 3d vision |
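The first step of SAD, rendering a depth map into RGB space so that an RGB segmenter can consume it, can be sketched as a normalization followed by a color ramp. The red-to-blue ramp and the treatment of zero as "invalid depth" are assumptions made for illustration; the project may use a different colormap, and the SAM and OVSeg stages are not shown.

```python
import numpy as np

def render_depth_to_rgb(depth, invalid=0.0):
    """Map a depth map of shape (H, W) to a uint8 RGB image of shape (H, W, 3).

    Valid depths are normalized to [0, 1]; near pixels come out red and far
    pixels blue. Pixels equal to `invalid` stay black.
    """
    valid = depth != invalid
    out = np.zeros(depth.shape + (3,), dtype=np.uint8)
    if not valid.any():
        return out
    lo, hi = depth[valid].min(), depth[valid].max()
    t = np.clip((depth - lo) / max(hi - lo, 1e-8), 0.0, 1.0)
    rgb = np.stack([1.0 - t, np.zeros_like(t), t], axis=-1)
    out[valid] = (rgb[valid] * 255).round().astype(np.uint8)
    return out

depth = np.array([[0.0, 1.0],
                  [2.0, 3.0]])  # 0.0 marks a missing measurement
image = render_depth_to_rgb(depth)
```

Because the rendered image carries no texture, a segmenter applied to it is driven by geometric boundaries, which is the effect the entry credits for reduced over-segmentation.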
2305.14022
Report |
Realistic Noise Synthesis with Diffusion Models |
Qi Wu, Mingyan Han, Ting Jiang, Haoqiang Fan, Bing Zeng, Shuaicheng Liu |
Deep image denoising models often rely on large amounts of training data for
high-quality performance. However, it is challenging to obtain a sufficient
amount of data in real-world scenarios for supervised training. As such,
synthesizing realistic noise becomes an important solution. However, existing
techniques have limitations in modeling complex noise distributions, resulting
in residual noise and edge artifacts in denoising methods relying on synthetic
data. To overcome these challenges, we propose a novel method that synthesizes
realistic noise using diffusion models, namely Realistic Noise Synthesize
Diffusor (RNSD). In particular, the proposed time-aware controlling module can
simulate various environmental conditions under given camera settings. RNSD can
incorporate guided multiscale content, such that more realistic noise with
spatial correlations can be generated at multiple frequencies. In addition, we
construct an inversion mechanism to predict the unknown camera setting, which
enables the extension of RNSD to datasets without setting information.
Extensive experiments demonstrate that our RNSD method significantly
outperforms existing methods not only in the realism of the synthesized noise
under multiple metrics, but also in single-image denoising performance. |
This paper introduces RNSD, a novel diffusion model-based approach for synthesizing realistic noise in images, which significantly outperforms existing methods in terms of realism and improves the performance of denoising models. |
Collecting real-world noisy/clean image pairs for training denoising models is challenging. Existing synthetic noise generation techniques often fail to capture the complexity of real noise, leading to suboptimal denoising results. |
RNSD leverages a time-aware camera setting module (CamSampler) to simulate diverse noise distributions based on camera parameters. It also employs a multi-scale content guided UNet (MCG-UNet) to generate spatially correlated noise. Additionally, a camera setting prediction module (CamPredictor) enables noise synthesis on datasets without camera setting information. |
RNSD achieves state-of-the-art results on noise realism benchmarks, surpassing existing methods in metrics like PGAP and AKLD.
Denoising models trained with RNSD's synthetic noise demonstrate significant performance improvements, achieving up to 0.6 dB PSNR gain.
Ablation studies confirm the efficacy of individual RNSD components, including CamSampler, MCG-UNet, and CamPredictor. |
The computational cost of diffusion models for noise synthesis is higher than some simpler methods.
Further exploration of incorporating more complex camera ISP pipelines could further enhance realism. |
image denoising, noise synthesis, diffusion models, camera settings, data augmentation |
2305.13921
Report |
Compositional Text-to-Image Synthesis with Attention Map Control of Diffusion Models |
Ruichen Wang, Zekang Chen, Chen Chen, Jian Ma, Haonan Lu, Xiaodong Lin |
Recent text-to-image (T2I) diffusion models show outstanding performance in
generating high-quality images conditioned on textual prompts. However, they
fail to semantically align the generated images with the prompts due to their
limited compositional capabilities, leading to attribute leakage, entity
leakage, and missing entities. In this paper, we propose a novel attention mask
control strategy based on predicted object boxes to address these issues. In
particular, we first train a BoxNet to predict a box for each entity that
possesses the attribute specified in the prompt. Then, depending on the
predicted boxes, a unique mask control is applied to the cross- and
self-attention maps. Our approach produces a more semantically accurate
synthesis by constraining the attention regions of each token in the prompt to
the image. In addition, the proposed method is straightforward and effective
and can be readily integrated into existing cross-attention-based T2I
generators. We compare our approach to competing methods and demonstrate that
it can faithfully convey the semantics of the original text to the generated
content and achieve high availability as a ready-to-use plugin. Please refer to
https://github.com/OPPOMente-Lab/attention-mask-control. |
This paper introduces a novel attention mask control strategy for text-to-image synthesis using diffusion models. The method leverages a BoxNet, trained to predict object boxes for entities within text prompts, to guide cross- and self-attention maps during image generation. This constraint ensures semantic accuracy by aligning textual elements with corresponding image regions. |
Existing text-to-image diffusion models struggle with accurately representing complex textual descriptions involving multiple entities and attributes, often leading to issues like attribute leakage, entity leakage, and missing entities. This work addresses these problems to improve the fidelity and faithfulness of generated images. |
The proposed approach employs a two-stage process. First, a BoxNet is trained on the COCO dataset to predict object boxes for entities at each timestep of the diffusion process. Second, during image generation, unique masks derived from these predicted boxes control the cross- and self-attention maps, ensuring entities and attributes are rendered within their designated image regions. |
The method significantly improves the semantic alignment between generated images and text prompts, effectively addressing attribute leakage, entity leakage, and missing entities.
Quantitative analysis using metrics like DINO similarity scores and subjective fidelity scores demonstrates the effectiveness of the approach in generating more accurate and faithful images.
The proposed strategy is flexible and can be easily integrated into existing diffusion-based image generators as a plugin to enhance their compositional generation capabilities. |
While the method demonstrates promising results, there is potential for a slight decrease in image quality in some cases.
Future work could focus on mitigating potential quality degradation and exploring more sophisticated text parsing techniques to further enhance the model’s understanding of complex prompts. |
text-to-image synthesis, diffusion models, compositional generation, attention mask control, object detection |
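The box-based attention control in the entry above can be illustrated by turning a predicted box into a binary mask and suppressing a token's cross-attention outside that box. This is a simplified sketch: the boxes are supplied by hand rather than by a trained BoxNet, coordinates are assumed to be normalized to [0, 1], and the renormalization step is an illustrative choice that may not match the released implementation.

```python
import numpy as np

def box_to_mask(box, height, width):
    """Binary mask for box = (x0, y0, x1, y1) in normalized [0, 1] coordinates."""
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=bool)
    r0, r1 = int(np.floor(y0 * height)), int(np.ceil(y1 * height))
    c0, c1 = int(np.floor(x0 * width)), int(np.ceil(x1 * width))
    mask[r0:r1, c0:c1] = True
    return mask

def mask_cross_attention(attn, token_boxes):
    """Zero each boxed token's attention outside its box, then renormalize.

    attn: (H, W, T) attention probabilities over T prompt tokens per pixel.
    token_boxes: dict mapping token index -> box; other tokens are untouched.
    """
    h, w, _ = attn.shape
    out = attn.copy()
    for tok, box in token_boxes.items():
        out[..., tok] *= box_to_mask(box, h, w)
    total = out.sum(axis=-1, keepdims=True)
    return out / np.maximum(total, 1e-12)

attn = np.full((4, 4, 3), 1.0 / 3.0)  # uniform toy attention over 3 tokens
controlled = mask_cross_attention(attn, {1: (0.0, 0.0, 0.5, 0.5)})
```

Here token 1 can only influence the top-left quadrant, which is the mechanism that would keep an attribute from leaking onto another entity's region.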
2305.13840
Report |
Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models |
Weifeng Chen, Yatai Ji, Jie Wu, Hefeng Wu, Pan Xie, Jiashi Li, Xin Xia, Xuefeng Xiao, Liang Lin |
Recent advancements in diffusion models have unlocked unprecedented abilities
in visual creation. However, current text-to-video generation models struggle
with the trade-off among movement range, action coherence and object
consistency. To mitigate this issue, we present a controllable text-to-video
(T2V) diffusion model, called Control-A-Video, capable of maintaining
consistency while supporting customizable video synthesis. Based on a pre-trained
conditional text-to-image (T2I) diffusion model, our model aims to generate
videos conditioned on a sequence of control signals, such as edge or depth
maps. For the purpose of improving object consistency, Control-A-Video
integrates motion priors and content priors into video generation. We propose
two motion-adaptive noise initialization strategies, which are based on pixel
residual and optical flow, to introduce motion priors from input videos,
producing more coherent videos. Moreover, a first-frame conditioned controller
is proposed to generate videos from content priors of the first frame, which
facilitates the semantic alignment with text and allows longer video generation
in an auto-regressive manner. With the proposed architecture and strategies,
our model achieves resource-efficient convergence and generates consistent and
coherent videos with fine-grained control. Extensive experiments demonstrate
its success in various video generative tasks such as video editing and video
style transfer, outperforming previous methods in terms of consistency and
quality. |
This paper presents Control-A-Video, a controllable text-to-video (T2V) diffusion model that enhances object consistency and coherence in customizable video synthesis using control signals like edge or depth maps. |
Current T2V models struggle with maintaining consistency (object appearance across frames) and coherence (smooth action transitions) when generating videos with a large range of motion. Control-A-Video addresses this trade-off. |
The model integrates motion and content priors. It leverages motion-adaptive noise initialization (pixel residual and optical flow based) and a first-frame conditioned controller (generating videos based on the first frame's content). |
Control-A-Video generates consistent and coherent videos with fine-grained control from text prompts and control maps (depth, edge).
Motion-adaptive noise initialization improves consistency by preserving latent space similarity between frames, reducing flickering.
First-frame conditioning enhances text alignment and allows auto-regressive generation of longer videos. |
The model currently relies on a T2I model, inheriting its limitations.
Future work includes exploring the stability and controllability of video generation models. |
text-to-video generation, diffusion models, controllable video synthesis, motion priors, content priors |
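The pixel-residual variant of motion-adaptive noise initialization described for Control-A-Video can be sketched as follows: a frame reuses the previous frame's noise wherever the pixels barely changed and draws fresh noise where they changed. This is an illustration under simplifying assumptions (grayscale frames, noise at pixel resolution rather than in a latent space, a hypothetical `threshold`); the paper's exact rule is not reproduced here.

```python
import numpy as np

def residual_noise_init(frames, threshold=0.1, seed=0):
    """Build per-frame initial noise that is shared across static regions.

    frames: (T, H, W) grayscale frames with values in [0, 1].
    Returns noise of shape (T, H, W). Where the absolute pixel residual
    between consecutive frames is at most `threshold`, the previous frame's
    noise is copied; elsewhere new Gaussian noise is drawn.
    """
    rng = np.random.default_rng(seed)
    noise = np.empty(frames.shape, dtype=np.float64)
    noise[0] = rng.standard_normal(frames.shape[1:])
    for t in range(1, len(frames)):
        changed = np.abs(frames[t] - frames[t - 1]) > threshold
        fresh = rng.standard_normal(frames.shape[1:])
        noise[t] = np.where(changed, fresh, noise[t - 1])
    return noise

frames = np.zeros((3, 4, 4))
frames[1:, 0, 0] = 1.0  # one pixel changes between frame 0 and frame 1
noise = residual_noise_init(frames)
```

Static regions therefore start the denoising process from identical noise in every frame, which is one plausible reason such an initialization reduces flickering.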
2305.13777
Report |
VisorGPT: Learning Visual Prior via Generative Pre-Training |
Jinheng Xie, Kai Ye, Yudong Li, Yuexiang Li, Kevin Qinghong Lin, Yefeng Zheng, Linlin Shen, Mike Zheng Shou |
Various stuff and things in visual data possess specific traits, which can be
learned by deep neural networks and are implicitly represented as the visual
prior, e.g., object location and shape, in the model. Such a prior potentially
impacts many vision tasks. For example, in conditional image synthesis, spatial
conditions failing to adhere to the prior can result in visually inaccurate
synthetic results. This work aims to explicitly learn the visual prior and
enable the customization of sampling. Inspired by advances in language
modeling, we propose to learn Visual prior via Generative Pre-Training, dubbed
VisorGPT. By discretizing visual locations of objects, e.g., bounding boxes,
human pose, and instance masks, into sequences, VisorGPT can model visual prior
through likelihood maximization. Besides, prompt engineering is investigated to
unify various visual locations and enable customized sampling of sequential
outputs from the learned prior. Experimental results demonstrate that VisorGPT
can effectively model the visual prior, which can be employed for many vision
tasks, such as customizing accurate human pose for conditional image synthesis
models like ControlNet. Code will be released at
https://github.com/Sierkinhane/VisorGPT. |
This paper presents VisorGPT, a novel approach to explicitly learning the probabilistic prior of visual data, such as object location and shape, using generative pre-training. |
Explicitly learning the visual prior is important for various vision tasks because it captures common-sense knowledge about the visual world, leading to more realistic and accurate results in applications like image synthesis. |
VisorGPT discretizes visual annotations (e.g., bounding boxes, human poses) into sequences and leverages a GPT-style transformer to learn the probabilistic prior by maximizing the likelihood of training sequences. |
VisorGPT effectively models the visual prior, demonstrated by its ability to generate realistic and customized spatial conditions for image synthesis models like ControlNet and GLIGEN.
The learned prior aligns well with real-world data, evidenced by the close similarity in location, shape, and relation priors between generated sequences and real-world datasets.
VisorGPT enables customizable sampling of visual data by leveraging prompt engineering, allowing for control over factors like object size, number of instances, and categories. |
VisorGPT is currently limited to closed-set inference due to the limited number of labeled classes in training data.
The maximum token length in the model restricts the number of instances that can be included in each sequence, posing challenges for complex scenes. |
visual prior, generative pre-training, conditional image synthesis, prompt engineering, language modeling |
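The discretization step described for VisorGPT, turning continuous annotations into token sequences, can be sketched for bounding boxes. The sequence layout below (`category [x0 y0 x1 y1]` joined by semicolons) and the bin count of 512 are hypothetical choices for illustration, not the prompt template used in the paper.

```python
def quantize(value, extent, bins):
    """Map a pixel coordinate in [0, extent] to an integer bin in [0, bins - 1]."""
    return min(bins - 1, max(0, int(value / extent * bins)))

def boxes_to_sequence(instances, image_size, bins=512):
    """Serialize (category, box) pairs into one text sequence.

    instances: list of (category, (x0, y0, x1, y1)) with pixel coordinates.
    image_size: (width, height) of the source image.
    """
    width, height = image_size
    parts = []
    for category, (x0, y0, x1, y1) in instances:
        coords = [
            quantize(x0, width, bins), quantize(y0, height, bins),
            quantize(x1, width, bins), quantize(y1, height, bins),
        ]
        parts.append(f"{category} [" + " ".join(map(str, coords)) + "]")
    return " ; ".join(parts)

sequence = boxes_to_sequence(
    [("dog", (0, 0, 320, 240)), ("ball", (320, 240, 640, 480))],
    image_size=(640, 480),
)
```

A sequence like this can be tokenized and fed to a GPT-style model trained with next-token likelihood, and sampling from the model then yields new layouts in the same format.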
2305.13738
Report |
i-Code Studio: A Configurable and Composable Framework for Integrative AI |
Yuwei Fang, Mahmoud Khademi, Chenguang Zhu, Ziyi Yang, Reid Pryzant, Yichong Xu, Yao Qian, Takuya Yoshioka, Lu Yuan, Michael Zeng, Xuedong Huang |
Artificial General Intelligence (AGI) requires comprehensive understanding
and generation capabilities for a variety of tasks spanning different
modalities and functionalities. Integrative AI is one important direction to
approach AGI, through combining multiple models to tackle complex multimodal
tasks. However, there is a lack of a flexible and composable platform to
facilitate efficient and effective model composition and coordination. In this
paper, we propose the i-Code Studio, a configurable and composable framework
for Integrative AI. The i-Code Studio orchestrates multiple pre-trained models
in a finetuning-free fashion to conduct complex multimodal tasks. Instead of
simple model composition, the i-Code Studio provides an integrative, flexible,
and composable setting for developers to quickly and easily compose
cutting-edge services and technologies tailored to their specific requirements.
The i-Code Studio achieves impressive results on a variety of zero-shot
multimodal tasks, such as video-to-text retrieval, speech-to-speech
translation, and visual question answering. We also demonstrate how to quickly
build a multimodal agent based on the i-Code Studio that can communicate and
personalize for users. |
The paper proposes i-Code Studio, a configurable and composable framework for Integrative AI that orchestrates multiple pre-trained models to conduct complex multimodal tasks without finetuning. |
Integrative AI is an important direction towards AGI, and the proposed framework addresses the lack of flexible and composable platforms for efficient model composition and coordination. |
The i-Code Studio uses a directed acyclic graph (DAG) to configure the flow of input data through pre-trained models and services from Azure Cognitive Services and OpenAI, enabling complex multimodal tasks. |
i-Code Studio achieves state-of-the-art performance on zero-shot video-to-text retrieval.
It significantly outperforms baseline methods on visual question answering, even without access to supporting facts.
It demonstrates strong performance on speech-to-speech translation, surpassing previous state-of-the-art methods by a large margin. |
The framework currently relies on a limited number of pre-trained models and services.
Further research is needed to apply the framework to more complex multimodal tasks. |
integrative ai, multimodal learning, artificial general intelligence, composable framework, large pre-trained models |
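The DAG-based configuration described for i-Code Studio can be illustrated with a small executor that runs each node once its predecessors have produced outputs. This sketch uses only the Python standard library; the node functions are stand-ins for real services, no Azure or OpenAI call is made, and `run_dag` is a hypothetical helper, not the framework's API.

```python
from graphlib import TopologicalSorter

def run_dag(nodes, edges, inputs):
    """Execute callables in dependency order.

    nodes: dict name -> callable that takes its predecessors' outputs as
           keyword arguments.
    edges: dict name -> list of predecessor names (node or input names).
    inputs: dict name -> externally supplied value.
    """
    results = dict(inputs)
    for name in TopologicalSorter(edges).static_order():
        if name in results:  # external inputs need no computation
            continue
        kwargs = {dep: results[dep] for dep in edges.get(name, [])}
        results[name] = nodes[name](**kwargs)
    return results

# Stand-in services; real deployments would call pre-trained models here.
nodes = {
    "transcript": lambda audio: f"text({audio})",
    "caption": lambda image: f"caption({image})",
    "answer": lambda transcript, caption: f"answer({transcript}, {caption})",
}
edges = {
    "transcript": ["audio"],
    "caption": ["image"],
    "answer": ["transcript", "caption"],
}
outputs = run_dag(nodes, edges, {"audio": "a.wav", "image": "b.png"})
```

Swapping one service for another only changes an entry in `nodes`, which reflects the composability the entry emphasizes.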
2305.13655
Report |
LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models |
Long Lian, Boyi Li, Adam Yala, Trevor Darrell |
Recent advancements in text-to-image diffusion models have yielded impressive
results in generating realistic and diverse images. However, these models still
struggle with complex prompts, such as those that involve numeracy and spatial
reasoning. This work proposes to enhance prompt understanding capabilities in
diffusion models. Our method leverages a pretrained large language model (LLM)
for grounded generation in a novel two-stage process. In the first stage, the
LLM generates a scene layout that comprises captioned bounding boxes from a
given prompt describing the desired image. In the second stage, a novel
controller guides an off-the-shelf diffusion model for layout-grounded image
generation. Both stages utilize existing pretrained models without additional
model parameter optimization. Our method significantly outperforms the base
diffusion model and several strong baselines in accurately generating images
according to prompts that require various capabilities, doubling the generation
accuracy across four tasks on average. Furthermore, our method enables
instruction-based multi-round scene specification and can handle prompts in
languages not supported by the underlying diffusion model. We anticipate that
our method will unleash users' creativity by accurately following more complex
prompts. Our code, demo, and benchmark are available at:
https://llm-grounded-diffusion.github.io |
This paper proposes LMD, a training-free, two-stage method to enhance the prompt understanding capabilities of text-to-image diffusion models. |
Existing diffusion models struggle to accurately follow complex prompts requiring numeracy, spatial reasoning, and attribute binding. |
LMD uses a pre-trained LLM to generate a scene layout with captioned bounding boxes from a text prompt. Then, a novel controller guides an off-the-shelf diffusion model to generate the image based on this layout. |
LMD significantly outperforms the base diffusion model and other baselines in accurately generating images from complex prompts, doubling generation accuracy across four tasks.
LMD enables instruction-based multi-round scene specification, allowing users to refine the generation through dialogue.
LMD supports prompts in languages not supported by the base diffusion model by using English layouts generated by the LLM. |
Ambiguity in LLM-generated layouts can sometimes lead to inaccuracies in image generation.
LMD inherits potential biases from the base diffusion model. |
text-to-image generation, diffusion models, large language models, layout generation, prompt understanding |
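The intermediate representation in LMD, a scene layout of captioned bounding boxes, can be sketched as a small parsing and validation step between the two stages. The textual format assumed here (a Python-style list of `(caption, [x, y, width, height])` pairs on a 512-pixel canvas) is an assumption for illustration; the prompt template and the layout-grounded controller are not shown.

```python
import ast

def parse_layout(text, canvas=512):
    """Parse an LLM layout response into clipped, captioned boxes.

    text: string such as "[('a cat', [x, y, w, h]), ...]" (assumed format).
    Returns a list of (caption, (x0, y0, x1, y1)) with boxes clipped to the
    canvas; boxes that end up empty after clipping are dropped.
    """
    layout = []
    for caption, (x, y, w, h) in ast.literal_eval(text):
        x0, y0 = max(0, x), max(0, y)
        x1, y1 = min(canvas, x + w), min(canvas, y + h)
        if x1 > x0 and y1 > y0:
            layout.append((caption, (x0, y0, x1, y1)))
    return layout

response = "[('a gray cat', [60, 200, 200, 200]), ('a red ball', [400, 400, 200, 100])]"
layout = parse_layout(response)
```

Because the layout is explicit, properties such as object count can be checked before any image is generated, for example by comparing `len(layout)` with the number requested in the prompt.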
2305.13625
Report |
DiffProtect: Generate Adversarial Examples with Diffusion Models for Facial Privacy Protection |
Jiang Liu, Chun Pong Lau, Rama Chellappa |
The increasingly pervasive facial recognition (FR) systems raise serious
concerns about personal privacy, especially for billions of users who have
publicly shared their photos on social media. Several attempts have been made
to protect individuals from being identified by unauthorized FR systems
utilizing adversarial attacks to generate encrypted face images. However,
existing methods suffer from poor visual quality or low attack success rates,
which limit their utility. Recently, diffusion models have achieved tremendous
success in image generation. In this work, we ask: can diffusion models be used
to generate adversarial examples to improve both visual quality and attack
performance? We propose DiffProtect, which utilizes a diffusion autoencoder to
generate semantically meaningful perturbations on FR systems. Extensive
experiments demonstrate that DiffProtect produces more natural-looking
encrypted images than state-of-the-art methods while achieving significantly
higher attack success rates, e.g., 24.5% and 25.1% absolute improvements on the
CelebA-HQ and FFHQ datasets. |
This paper proposes DiffProtect, a novel diffusion model-based adversarial attack method for facial privacy protection. It generates natural and inconspicuous adversarial examples on face recognition systems by perturbing the semantic code of an input image and using a conditional DDIM decoding process to create a protected image. |
Existing methods for protecting against unauthorized facial recognition often produce low-quality images or have low attack success rates. This work aims to improve both visual quality and attack performance using diffusion models. |
DiffProtect uses a pre-trained diffusion autoencoder to encode an input face image into semantic and noise codes. It then optimizes the semantic code to create a protected image that fools the face recognition model while preserving visual quality. The method also includes a face semantics regularization module and an attack acceleration strategy. |
DiffProtect achieves significantly higher attack success rates (ASR) than previous state-of-the-art methods on CelebA-HQ and FFHQ datasets.
DiffProtect generates more natural-looking encrypted images with lower FID scores compared to baselines.
The accelerated version, DiffProtect-fast, maintains competitive attack performance while significantly reducing computation time. |
The attack generation process can be further optimized to improve efficiency.
Investigating the effectiveness of DiffProtect on other privacy-protection tasks beyond facial recognition. |
facial privacy protection, adversarial attack, diffusion models, face recognition, generative models |
2305.13620
Report |
A Dive into SAM Prior in Image Restoration |
Zeyu Xiao, Jiawang Bai, Zhihe Lu, Zhiwei Xiong |
The goal of image restoration (IR), a fundamental issue in computer vision,
is to restore a high-quality (HQ) image from its degraded low-quality (LQ)
observation. Multiple HQ solutions may correspond to an LQ input in this poorly
posed problem, creating an ambiguous solution space. This motivates the
investigation and incorporation of prior knowledge in order to effectively
constrain the solution space and enhance the quality of the restored images. In
spite of the pervasive use of hand-crafted and learned priors in IR, limited
attention has been paid to the incorporation of knowledge from large-scale
foundation models. In this paper, we for the first time leverage the prior
knowledge of the state-of-the-art segment anything model (SAM) to boost the
performance of existing IR networks in a parameter-efficient tuning manner. In
particular, the choice of SAM is based on its robustness to image degradations,
such that HQ semantic masks can be extracted from it. In order to leverage
semantic priors and enhance restoration quality, we propose a lightweight SAM
prior tuning (SPT) unit. This plug-and-play component allows us to effectively
integrate semantic priors into existing IR networks, resulting in significant
improvements in restoration quality. As the only trainable module in our
method, the SPT unit has the potential to improve both efficiency and
scalability. We demonstrate the effectiveness of the proposed method in
enhancing a variety of methods across multiple tasks, such as image
super-resolution and color image denoising. |
This paper proposes leveraging the semantic prior knowledge from Segment Anything Model (SAM) to enhance the performance of existing Image Restoration (IR) networks in a parameter-efficient tuning manner. |
IR is an ill-posed problem with an ambiguous solution space. This work explores the use of large-scale foundation models to provide richer priors and improve restoration quality. |
The method extracts semantic masks from SAM and incorporates them into a lightweight SAM Prior Tuning (SPT) unit. The SPT unit is integrated into existing IR networks, and only its parameters are tuned during training. |
The proposed method significantly improves the performance of various CNN-based and Transformer-based IR methods.
Experiments on image super-resolution and color image denoising demonstrate consistent performance gains over baseline methods.
Ablation studies validate the effectiveness of the SPT unit, the efficient tuning scheme, and the impact of SAM mask granularity. |
The use of SAM masks as semantic priors might introduce unrealistic fine-grained structures.
Future work could explore more effective methods for incorporating semantic priors to improve fidelity. |
image restoration, semantic prior, segment anything model, parameter-efficient tuning, foundation models |
2305.13501
Report |
LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On |
Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Marcella Cornia, Marco Bertini, Rita Cucchiara |
The rapidly evolving fields of e-commerce and metaverse continue to seek
innovative approaches to enhance the consumer experience. At the same time,
recent advancements in the development of diffusion models have enabled
generative networks to create remarkably realistic images. In this context,
image-based virtual try-on, which consists in generating a novel image of a
target model wearing a given in-shop garment, has yet to capitalize on the
potential of these powerful generative solutions. This work introduces
LaDI-VTON, the first Latent Diffusion textual Inversion-enhanced model for the
Virtual Try-ON task. The proposed architecture relies on a latent diffusion
model extended with a novel additional autoencoder module that exploits
learnable skip connections to enhance the generation process preserving the
model's characteristics. To effectively maintain the texture and details of the
in-shop garment, we propose a textual inversion component that can map the
visual features of the garment to the CLIP token embedding space and thus
generate a set of pseudo-word token embeddings capable of conditioning the
generation process. Experimental results on Dress Code and VITON-HD datasets
demonstrate that our approach outperforms the competitors by a consistent
margin, achieving a significant milestone for the task. Source code and trained
models are publicly available at: https://github.com/miccunifi/ladi-vton. |
This paper introduces LaDI-VTON, the first virtual try-on model that utilizes Latent Diffusion Models enhanced with textual inversion for improved garment transfer and detail preservation. |
This approach leverages the superior image generation capabilities of diffusion models to advance the realism and user experience in virtual try-on applications within e-commerce and the metaverse. |
LaDI-VTON extends Stable Diffusion with a novel autoencoder module (EMASC) for preserving details and a textual inversion component to accurately represent the input garment's visual features in the CLIP embedding space, conditioning the generation process. |
LaDI-VTON outperforms state-of-the-art methods on Dress Code and VITON-HD benchmarks, achieving significantly better FID and KID scores.
The introduced EMASC modules demonstrably reduce autoencoder compression loss, leading to better reconstruction of high-frequency human body details.
The textual inversion component effectively preserves the texture and details of the original in-shop garments during the virtual try-on process. |
LaDI-VTON, while excelling in realism, may not always perfectly synthesize textual details (logos, words) on garments due to its reliance on Stable Diffusion.
Future work could explore non-latent diffusion approaches for enhanced textual detail reproduction, acknowledging potential computational trade-offs. |
virtual try-on, latent diffusion models, textual inversion, generative architectures, e-commerce |
2305.13460
Report |
'Tax-free' 3DMM Conditional Face Generation |
Yiwen Huang, Zhiqiu Yu, Xinjie Yi, Yue Wang, James Tompkin |
3DMM conditioned face generation has gained traction due to its well-defined
controllability; however, the trade-off is lower sample quality: Previous works
such as DiscoFaceGAN and 3D-FM GAN show a significant FID gap compared to the
unconditional StyleGAN, suggesting that there is a quality tax to pay for
controllability. In this paper, we challenge the assumption that quality and
controllability cannot coexist. To pinpoint the previous issues, we
mathematically formalize the problem of 3DMM conditioned face generation. Then,
we devise simple solutions to the problem under our proposed framework. This
results in a new model that effectively removes the quality tax between 3DMM
conditioned face GANs and the unconditional StyleGAN. |
This paper introduces a novel 3DMM-conditioned GAN model for face generation that maintains high image quality comparable to unconditional StyleGAN while offering fine-grained control over facial attributes. |
Existing 3DMM-conditioned GANs suffer from a 'quality tax', exhibiting reduced image quality compared to unconditional models due to constraints imposed by the 3DMM conditioning. This work aims to remove this quality tax by addressing overconstraint issues. |
The authors propose a mathematical framework for 3DMM-conditioned face generation, optimizing for both consistency (generated image aligns with the input 3DMM parameters) and disentanglement (modifying one attribute doesn't affect others). They achieve this through a novel 3DMM representation, progressive blending for consistent training, and a structurally disentangled conditioning mechanism. |
The model generates high-quality images with FID scores close to unconditional StyleGAN2, outperforming existing 3DMM-conditioned GANs.
It demonstrates superior disentanglement capabilities compared to baselines, as evidenced by higher Disentanglement Scores (DS).
The model enables reference-based generation, allowing for transferring facial attributes like expression, illumination, and pose from real images to generated ones. |
The model inherits limitations from the pretrained face reconstruction (FR) and differentiable renderer (RDR), such as inaccurate skin tone prediction for darker skin tones.
It lacks explicit control over attributes not included in the 3DMM parameter space, like hair and eyeglasses. |
generative adversarial networks, 3d morphable models, face generation, disentanglement, conditional image synthesis |
2305.13311
Report |
VDT: General-purpose Video Diffusion Transformers via Mask Modeling |
Haoyu Lu, Guoxing Yang, Nanyi Fei, Yuqi Huo, Zhiwu Lu, Ping Luo, Mingyu Ding |
This work introduces Video Diffusion Transformer (VDT), which pioneers the
use of transformers in diffusion-based video generation. It features
transformer blocks with modularized temporal and spatial attention modules to
leverage the rich spatial-temporal representation inherent in transformers. We
also propose a unified spatial-temporal mask modeling mechanism, seamlessly
integrated with the model, to cater to diverse video generation scenarios. VDT
offers several appealing benefits. 1) It excels at capturing temporal
dependencies to produce temporally consistent video frames and even simulate
the physics and dynamics of 3D objects over time. 2) It facilitates flexible
conditioning information, e.g., simple concatenation in the token space,
effectively unifying different token lengths and modalities. 3) Pairing with
our proposed spatial-temporal mask modeling mechanism, it becomes a
general-purpose video diffuser for harnessing a range of tasks, including
unconditional generation, video prediction, interpolation, animation, and
completion. Extensive experiments on these tasks spanning various
scenarios, including autonomous driving, natural weather, human action, and
physics-based simulation, demonstrate the effectiveness of VDT. Additionally,
we present comprehensive studies on how VDT handles conditioning information
with the mask modeling mechanism, which we believe will benefit future research
and advance the field. Project page: https://VDT-2023.github.io |
The paper introduces Video Diffusion Transformer (VDT), a novel approach for video generation utilizing transformers in a diffusion-based framework. |
Existing video generation methods struggle with capturing temporal dependencies for consistent videos, handling diverse conditioning information, and unifying different video generation tasks. VDT leverages transformers' strengths to address these challenges. |
VDT employs transformer blocks with temporal and spatial attention modules to capture spatiotemporal dependencies. It utilizes a pre-trained VAE tokenizer for efficient processing and incorporates a unified spatial-temporal mask modeling mechanism for versatility. |
VDT excels in capturing temporal dependencies, generating high-quality, consistent videos, and simulating object dynamics.
It flexibly handles conditioning information via token concatenation, unifying tasks like unconditional generation, prediction, interpolation, and animation.
VDT demonstrates state-of-the-art performance on various datasets, including UCF101, Cityscapes, and Physion, outperforming previous GAN-based and diffusion-based methods. |
Limited pretraining due to computational constraints restricts the model's potential.
Future work includes pretraining on larger datasets and incorporating other modalities like text. |
video generation, diffusion models, transformers, spatiotemporal attention, mask modeling |
2305.13310
Report |
Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching |
Yang Liu, Muzhi Zhu, Hengtao Li, Hao Chen, Xinlong Wang, Chunhua Shen |
Powered by large-scale pre-training, vision foundation models exhibit
significant potential in open-world image understanding. However, unlike large
language models that excel at directly tackling various language tasks, vision
foundation models require a task-specific model structure followed by
fine-tuning on specific tasks. In this work, we present Matcher, a novel
perception paradigm that utilizes off-the-shelf vision foundation models to
address various perception tasks. Matcher can segment anything by using an
in-context example without training. Additionally, we design three effective
components within the Matcher framework to collaborate with these foundation
models and unleash their full potential in diverse perception tasks. Matcher
demonstrates impressive generalization performance across various segmentation
tasks, all without training. For example, it achieves 52.7% mIoU on COCO-20$^i$
with one example, surpassing the state-of-the-art specialist model by 1.6%. In
addition, Matcher achieves 33.0% mIoU on the proposed LVIS-92$^i$ for one-shot
semantic segmentation, outperforming the state-of-the-art generalist model by
14.4%. Our visualization results further showcase the open-world generality and
flexibility of Matcher when applied to images in the wild. Our code can be
found at https://github.com/aim-uofa/Matcher. |
This paper introduces Matcher, a novel training-free perception framework that leverages pre-trained vision foundation models (VFMs) to solve a variety of perception tasks using in-context learning with a single example. |
Existing VFMs often require task-specific fine-tuning and struggle to generalize across diverse perception tasks. Matcher addresses this limitation by enabling VFMs to perform well on a wide range of tasks without training. |
Matcher leverages an all-purpose feature extractor (DINOv2) and a class-agnostic segmentation model (SAM). It employs bidirectional matching for accurate semantic correspondence, a robust prompt sampler for generating diverse mask proposals, and instance-level matching for selecting high-quality masks. |
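Bidirectional matching can be illustrated with a toy example. The sketch below assumes patch features are plain vectors and uses nearest-neighbor cosine similarity in both directions; this is a simplification chosen for brevity, not the procedure used in the paper, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Set, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query: Sequence[float], bank: List[Sequence[float]]) -> int:
    """Index of the vector in `bank` most similar to `query`."""
    return max(range(len(bank)), key=lambda j: cosine(query, bank[j]))


def bidirectional_match(
    ref_feats: List[Sequence[float]],
    tgt_feats: List[Sequence[float]],
    ref_mask: Set[int],
) -> List[Tuple[int, int]]:
    """Keep (ref_idx, tgt_idx) pairs whose reverse match lands inside the reference mask."""
    kept = []
    for i in sorted(ref_mask):
        j = nearest(ref_feats[i], tgt_feats)     # forward: reference -> target
        back = nearest(tgt_feats[j], ref_feats)  # reverse: target -> reference
        if back in ref_mask:                     # reject matches that drift outside the mask
            kept.append((i, j))
    return kept


# Toy data: reference patches 0 and 1 are inside the mask, patch 2 is outside.
ref_feats = [(1.0, 0.0), (0.7, 0.7), (0.6, 0.8)]
tgt_feats = [(1.0, 0.0), (0.6, 0.8)]
matches = bidirectional_match(ref_feats, tgt_feats, ref_mask={0, 1})
```

In this toy case, reference patch 1 matches target patch 1, but the reverse match lands on reference patch 2, which is outside the mask, so that pair is discarded. Only the pair for reference patch 0 survives.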
Matcher achieves state-of-the-art performance on one-shot semantic segmentation, outperforming specialized methods on COCO-20i and demonstrating strong generalization on FSS-1000 and LVIS-92i.
Matcher excels in one-shot object part segmentation, significantly surpassing previous methods on PASCAL-Part and PACO-Part benchmarks.
Matcher demonstrates competitive results in video object segmentation on DAVIS datasets, showcasing its ability to handle temporal information without training. |
Matcher's performance on instance segmentation is currently limited by the instance-level matching capabilities of the image encoder.
Future work will focus on improving instance-level segmentation and further exploring Matcher's potential for evaluating and advancing VFMs. |
vision foundation models, few-shot learning, semantic segmentation, object part segmentation, video object segmentation |
2305.13308
Report |
If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection |
Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, Zeynep Akata |
Despite their impressive capabilities, diffusion-based text-to-image (T2I)
models can lack faithfulness to the text prompt, where generated images may not
contain all the mentioned objects, attributes or relations. To alleviate these
issues, recent works proposed post-hoc methods to improve model faithfulness
without costly retraining, by modifying how the model utilizes the input
prompt. In this work, we take a step back and show that large T2I diffusion
models are more faithful than usually assumed, and can generate images faithful
to even complex prompts without the need to manipulate the generative process.
Based on that, we show how faithfulness can be simply treated as a candidate
selection problem instead, and introduce a straightforward pipeline that
generates candidate images for a text prompt and picks the best one according
to an automatic scoring system that can leverage already existing T2I
evaluation metrics. Quantitative comparisons alongside user studies on diverse
benchmarks show consistently improved faithfulness over post-hoc enhancement
methods, with comparable or lower computational cost. Code is available at
https://github.com/ExplainableML/ImageSelect. |
This paper proposes ImageSelect, a simple but effective pipeline that improves the faithfulness of text-to-image diffusion models by generating multiple candidate images and automatically selecting the most faithful one. |
Existing text-to-image (T2I) models, while impressive, often struggle to faithfully represent all details of a text prompt in the generated image. Recent methods trying to address this are computationally expensive and often tailored to specific prompt types. |
ImageSelect generates several candidate images for a given prompt using different random seeds. Then, it leverages existing T2I evaluation metrics like TIFA and ImageReward to automatically select the most faithful image from the candidates. |
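The generate-then-select step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate` and `score` are hypothetical callables standing in for a text-to-image model and a faithfulness metric such as TIFA or ImageReward, and the stub functions exist only to make the sketch runnable.

```python
from typing import Callable, List, Tuple, TypeVar

Image = TypeVar("Image")


def select_most_faithful(
    prompt: str,
    generate: Callable[[str, int], Image],  # hypothetical: (prompt, seed) -> image
    score: Callable[[str, Image], float],   # hypothetical: (prompt, image) -> faithfulness
    seeds: List[int],
) -> Tuple[Image, float, int]:
    """Generate one candidate per seed and return the best (image, score, seed)."""
    if not seeds:
        raise ValueError("at least one seed is required")
    best = None
    for seed in seeds:
        image = generate(prompt, seed)
        s = score(prompt, image)
        # Strict comparison keeps the first candidate on ties.
        if best is None or s > best[1]:
            best = (image, s, seed)
    return best


# Stub model and metric, used only to make the sketch runnable.
def fake_generate(prompt: str, seed: int) -> str:
    return f"{prompt}#{seed}"


def fake_score(prompt: str, image: str) -> float:
    # Arbitrary toy score that peaks at seed 3.
    seed = int(image.rsplit("#", 1)[1])
    return -abs(seed - 3)


image, s, seed = select_most_faithful(
    "two red cubes", fake_generate, fake_score, seeds=[0, 1, 2, 3, 4]
)
```

The cost of this scheme grows linearly with the number of seeds, since each candidate needs one generation and one scoring pass.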
ImageSelect consistently outperforms baseline methods, including model version upgrades (e.g., SD1.4 to SD2.1), in terms of faithfulness on diverse benchmarks.
Quantitative analysis shows substantial improvements across various faithfulness categories, such as counting and spatial relations, addressing known T2I model limitations.
Extensive human evaluation strongly favors ImageSelect outputs, demonstrating better alignment with human perception of faithfulness. |
Despite significant improvements, ImageSelect is still limited by the capabilities of the underlying T2I model, particularly for complex challenges like rendering text or handling very long prompts.
The reliance on pre-trained scoring models (TIFA, ImageReward) may introduce biases or limitations based on their training data. |
text-to-image generation, faithfulness, diffusion models, candidate selection, evaluation metrics |
2305.13307
Report |
NeRFuser: Large-Scale Scene Representation by NeRF Fusion |
Jiading Fang, Shengjie Lin, Igor Vasiljevic, Vitor Guizilini, Rares Ambrus, Adrien Gaidon, Gregory Shakhnarovich, Matthew R. Walter |
A practical benefit of implicit visual representations like Neural Radiance
Fields (NeRFs) is their memory efficiency: large scenes can be efficiently
stored and shared as small neural nets instead of collections of images.
However, operating on these implicit visual data structures requires extending
classical image-based vision techniques (e.g., registration, blending) from
image sets to neural fields. Towards this goal, we propose NeRFuser, a novel
architecture for NeRF registration and blending that assumes only access to
pre-generated NeRFs, and not the potentially large sets of images used to
generate them. We propose registration from re-rendering, a technique to infer
the transformation between NeRFs based on images synthesized from individual
NeRFs. For blending, we propose sample-based inverse distance weighting to
blend visual information at the ray-sample level. We evaluate NeRFuser on
public benchmarks and a self-collected object-centric indoor dataset, showing
the robustness of our method, including to views that are challenging to render
from the individual source NeRFs. |
This paper presents NeRFuser, a novel architecture for registering and blending pre-trained neural radiance fields (NeRFs) without access to the original training images. |
NeRFs offer a memory-efficient way to represent 3D scenes. This work introduces tools to operate directly on NeRFs as data, expanding their utility in 3D vision applications. |
The method involves two steps: 1) **Registration from re-rendering**: inferring the relative transformation between NeRFs by applying structure-from-motion to images rendered from novel viewpoints. 2) **Sample-based inverse distance weighting**: blending visual information at the ray-sample level during volumetric rendering. |
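Inverse distance weighting at the sample level can be sketched roughly as below. The sketch assumes each NeRF contributes a scalar value (for example one color channel) for the same 3D sample, and that distance is measured from the sample to a per-NeRF reference point. The exponent `power`, the `eps` guard, and the choice of reference point are illustrative assumptions rather than values taken from the paper.

```python
import math
from typing import List, Sequence


def idw_weights(
    sample: Sequence[float],
    centers: List[Sequence[float]],
    power: float = 2.0,   # assumed exponent; larger values favor the nearest NeRF more
    eps: float = 1e-8,    # avoids division by zero when a sample sits on a center
) -> List[float]:
    """Normalized weights proportional to distance ** (-power)."""
    raw = [(math.dist(sample, c) + eps) ** (-power) for c in centers]
    total = sum(raw)
    return [w / total for w in raw]


def blend(values: List[float], weights: List[float]) -> float:
    """Weighted sum of per-NeRF values for one ray sample."""
    return sum(v * w for v, w in zip(values, weights))


# Two hypothetical NeRF reference points and one ray sample between them.
centers = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
sample = (1.0, 0.0, 0.0)
weights = idw_weights(sample, centers)
blended = blend([0.2, 0.8], weights)
```

Here the sample is three times closer to the first reference point, so with `power = 2.0` the first NeRF receives roughly nine times the weight of the second.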
NeRFuser accurately registers NeRFs, achieving low rotation, translation, and scale errors.
The proposed sample-based blending method produces higher quality novel view synthesis than existing image- or pixel-based blending techniques.
The method is robust to errors in pose estimation during registration and exhibits superior performance compared to point-cloud registration baselines. |
The method inherits limitations of the input NeRFs, such as potential artifacts or inaccuracies in scene representation.
Future work includes exploring the integration of structured priors for improved robustness and handling dynamic scenes. |
neural radiance fields, nerf, 3d scene representation, registration, blending |
2305.13292
Report |
VideoLLM: Modeling Video Sequence with Large Language Models |
Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei Huang, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, Limin Wang |
With the exponential growth of video data, there is an urgent need for
automated technology to analyze and comprehend video content. However, existing
video understanding models are often task-specific and lack a comprehensive
capability of handling diverse tasks. The success of large language models
(LLMs) like GPT has demonstrated their impressive abilities in sequence causal
reasoning. Building upon this insight, we propose a novel framework called
VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs
from natural language processing (NLP) for video sequence understanding.
VideoLLM incorporates a carefully designed Modality Encoder and Semantic
Translator, which convert inputs from various modalities into a unified token
sequence. This token sequence is then fed into a decoder-only LLM.
Subsequently, with the aid of a simple task head, our VideoLLM yields an
effective unified framework for different kinds of video understanding tasks.
To evaluate the efficacy of VideoLLM, we conduct extensive experiments using
multiple LLMs and fine-tuning methods. We evaluate our VideoLLM on eight tasks
sourced from four different datasets. The experimental results demonstrate that
the understanding and reasoning capabilities of LLMs can be effectively
transferred to video understanding tasks. We release the code at
https://github.com/cg1177/VideoLLM. |
This paper proposes VideoLLM, a novel framework leveraging pre-trained LLMs for diverse video understanding tasks by converting video data into token sequences. |
Existing video understanding models are often task-specific and struggle with the increasing volume and complexity of video data. LLMs offer strong sequence reasoning abilities learned from large-scale text data. |
VideoLLM utilizes a Modality Encoder to process visual and textual data, a Semantic Translator to align visual and textual semantics, and a decoder-only LLM as a generalist video sequence reasoner. |
Different LLMs exhibit varying strengths on different video understanding tasks, with GPT-2 generally performing well.
The framework demonstrates strong performance on 8 diverse video understanding tasks across 4 datasets, achieving state-of-the-art or comparable results with fewer trainable parameters.
Scaling up LLM size generally improves performance up to a certain point, suggesting potential for further improvement with larger models and improved semantic translation. |
Performance of very large LLMs starts to decline, potentially due to overfitting in the semantic translator.
Future work will explore incorporating spatiotemporal information about video frames for more comprehensive understanding. |
video understanding, large language models, multimodal learning, sequence reasoning, parameter-efficient fine-tuning |
2305.13173
Report |
Semantic-Promoted Debiasing and Background Disambiguation for Zero-Shot Instance Segmentation |
Shuting He, Henghui Ding, Wei Jiang |
Zero-shot instance segmentation aims to detect and precisely segment objects
of unseen categories without any training samples. Since the model is trained
on seen categories, there is a strong bias that the model tends to classify all
the objects into seen categories. Besides, there is a natural confusion between
background and novel objects that have never shown up in training. These two
challenges make it hard for novel objects to surface in the final instance
segmentation results. It is desirable to rescue novel objects from the background and
the dominant seen categories. To this end, we propose D$^2$Zero with
Semantic-Promoted Debiasing and Background Disambiguation to enhance the
performance of Zero-shot instance segmentation. Semantic-promoted debiasing
utilizes inter-class semantic relationships to involve unseen categories in
visual feature training and learns an input-conditional classifier to conduct
dynamic classification based on the input image. Background disambiguation
produces image-adaptive background representation to avoid mistaking novel
objects for background. Extensive experiments show that we significantly
outperform previous state-of-the-art methods by a large margin, e.g., 16.86%
improvement on COCO. Project page: https://henghuiding.github.io/D2Zero/ |
This paper proposes D2Zero, a novel zero-shot instance segmentation approach, addressing the bias and background ambiguity challenges by employing semantic-promoted debiasing and image-adaptive background disambiguation. |
Existing instance segmentation models struggle to generalize to novel object categories unseen during training, hindering their applicability to real-world scenarios. D2Zero addresses this limitation by enabling the model to segment objects of unseen categories. |
D2Zero leverages semantic information (e.g., CLIP embeddings) to guide visual feature learning with an unseen-constrained training objective and employs an input-conditional classifier for dynamic classification. It further utilizes an image-adaptive background representation to enhance the distinction between novel objects and background. |
D2Zero achieves state-of-the-art performance on zero-shot instance segmentation benchmarks, significantly outperforming prior work such as ZSI.
The proposed input-conditional classifier effectively mitigates bias towards seen categories and reduces the domain gap between visual and semantic features.
The image-adaptive background prototype significantly improves the model's ability to distinguish novel objects from the background. |
The current approach focuses on single-modal visual features and could benefit from incorporating multi-modal features for richer representation.
Future work could explore joint optimization of both the instance segmentation and background disambiguation tasks within a unified framework. |
zero-shot learning, instance segmentation, debiasing, background disambiguation, computer vision |
2305.13093
Report |
Restore Anything Pipeline: Segment Anything Meets Image Restoration |
Jiaxi Jiang, Christian Holz |
Recent image restoration methods have produced significant advancements using
deep learning. However, existing methods tend to treat the whole image as a
single entity, failing to account for the distinct objects in the image that
exhibit individual texture properties. Existing methods also typically generate
a single result, which may not suit the preferences of different users. In this
paper, we introduce the Restore Anything Pipeline (RAP), a novel interactive
and per-object level image restoration approach that incorporates a
controllable model to generate different results that users may choose from.
RAP incorporates image segmentation through the recent Segment Anything Model
(SAM) into a controllable image restoration model to create a user-friendly
pipeline for several image restoration tasks. We demonstrate the versatility of
RAP by applying it to three common image restoration tasks: image deblurring,
image denoising, and JPEG artifact removal. Our experiments show that RAP
produces superior visual results compared to state-of-the-art methods. RAP
represents a promising direction for image restoration, providing users with
greater control, and enabling image restoration at an object level. |
This paper introduces the Restore Anything Pipeline (RAP), an interactive, object-level image restoration approach that uses the Segment Anything Model (SAM) for segmentation and a controllable model for user-specific results. |
RAP addresses limitations of existing methods, which treat an image as a single entity and produce a single, potentially suboptimal result, failing to account for diverse object textures and user preferences. |
RAP integrates SAM for object segmentation, a flexible blind restoration framework (adapted FBCNN) for automatic or controlled restoration based on predicted or user-adjusted degradation parameters, and optional object-level enhancement. |
RAP produces superior visual results compared to state-of-the-art methods in deblurring, denoising, and JPEG artifact removal.
Object-level processing allows different restoration levels for objects with varying textures and degradation.
Users can control the restoration level by adjusting predicted degradation parameters (e.g., noise level, blur kernel, JPEG quality factor). |
Limited to specific degradation types considered during training.
Relies on the accuracy of SAM for segmentation, which can be affected by image quality. |
image restoration, interactive image editing, segment anything model, object-level processing, controllable image restoration |
2305.13077
Report |
ControlVideo: Training-free Controllable Text-to-Video Generation |
Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, Qi Tian |
Text-driven diffusion models have unlocked unprecedented abilities in image
generation, whereas their video counterpart still lags behind due to the
excessive training cost of temporal modeling. Besides the training burden, the
generated videos also suffer from appearance inconsistency and structural
flickers, especially in long video synthesis. To address these challenges, we
design a training-free framework called ControlVideo to enable
natural and efficient text-to-video generation. ControlVideo, adapted from
ControlNet, leverages coarse structural consistency from input motion
sequences, and introduces three modules to improve video generation. Firstly,
to ensure appearance coherence between frames, ControlVideo adds fully
cross-frame interaction in self-attention modules. Secondly, to mitigate the
flicker effect, it introduces an interleaved-frame smoother that employs frame
interpolation on alternated frames. Finally, to produce long videos
efficiently, it utilizes a hierarchical sampler that separately synthesizes
each short clip with holistic coherency. Empowered with these modules,
ControlVideo outperforms the state of the art on extensive motion-prompt pairs
quantitatively and qualitatively. Notably, thanks to the efficient designs, it
generates both short and long videos within several minutes using one NVIDIA
2080Ti. Code is available at https://github.com/YBYBZhang/ControlVideo. |
This paper proposes ControlVideo, a training-free framework for controllable text-to-video generation, enabling high-quality video synthesis with temporal consistency. |
Training text-to-video diffusion models is computationally expensive and resource-intensive. ControlVideo leverages pre-trained text-to-image models and motion sequences, offering a more efficient alternative. |
ControlVideo adapts ControlNet with three key modules: 1) Fully cross-frame interaction for appearance coherence, 2) Interleaved-frame smoother for reducing structural flickers, 3) Hierarchical sampler for efficient long-video generation. |
Outperforms state-of-the-art methods in qualitative and quantitative comparisons on motion-prompt pairs.
Demonstrates superior appearance consistency and video quality, effectively mitigating flickers and artifacts.
Enables efficient long-video generation (100+ frames) within minutes on a single NVIDIA 2080Ti. |
ControlVideo struggles to generate videos beyond the provided motion sequences.
Future work will focus on adapting motion sequences based on text prompts for more diverse video generation. |
text-to-video generation, diffusion models, controlnet, temporal consistency, hierarchical sampling |
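The interleaved-frame smoother in the ControlVideo entry above can be illustrated with a toy sketch. It assumes frames are flat lists of pixel values and uses the midpoint of the two neighbouring frames as a stand-in for the frame-interpolation model; the function name `interleaved_smooth` and the `start` parameter are hypothetical and do not come from the paper or its repository. Calling it with `start=1` and `start=2` in turn mimics smoothing alternated frames.

```python
from typing import List

Frame = List[float]  # a frame flattened to a list of pixel values (assumption)


def interleaved_smooth(frames: List[Frame], start: int) -> List[Frame]:
    """Replace every second interior frame, beginning at index `start`
    (1 or 2), with the midpoint of its two neighbours.

    Plain averaging is only a stand-in for a learned interpolation model.
    """
    if start not in (1, 2):
        raise ValueError("start must be 1 or 2")
    out = [list(f) for f in frames]
    # Stop before the last frame so every smoothed frame has two neighbours.
    for i in range(start, len(frames) - 1, 2):
        prev_f, next_f = frames[i - 1], frames[i + 1]
        out[i] = [(a + b) / 2.0 for a, b in zip(prev_f, next_f)]
    return out


# Toy clip: one "pixel" per frame, with flicker on the odd frames.
clip = [[0.0], [10.0], [2.0], [10.0], [4.0]]
smoothed_odd = interleaved_smooth(clip, start=1)
smoothed_even = interleaved_smooth(clip, start=2)
```

In this toy clip the flicker sits on the odd frames, so the `start=1` pass removes it while the `start=2` pass leaves it in place.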
2305.13035
Report |
Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design |
Ibrahim Alabdulmohsin, Xiaohua Zhai, Alexander Kolesnikov, Lucas Beyer |
Scaling laws have been recently employed to derive compute-optimal model size
(number of parameters) for a given compute duration. We advance and refine such
methods to infer compute-optimal model shapes, such as width and depth, and
successfully implement this in vision transformers. Our shape-optimized vision
transformer, SoViT, achieves results competitive with models that exceed twice
its size, despite being pre-trained with an equivalent amount of compute. For
example, SoViT-400m/14 achieves 90.3% fine-tuning accuracy on ILSVRC2012,
surpassing the much larger ViT-g/14 and approaching ViT-G/14 under identical
settings, with also less than half the inference cost. We conduct a thorough
evaluation across multiple tasks, such as image classification, captioning, VQA
and zero-shot transfer, demonstrating the effectiveness of our model across a
broad range of domains and identifying limitations. Overall, our findings
challenge the prevailing approach of blindly scaling up vision models and pave
a path for a more informed scaling. |
This paper presents SoViT, a shape-optimized vision transformer that achieves comparable performance to much larger models when pre-trained with the same amount of compute. |
Scaling model size alone is not the most compute-efficient approach. Optimizing model shape (e.g., width, depth) for a given compute budget can lead to smaller, faster, and equally performant models. |
The authors introduce a novel scaling strategy that leverages a joint functional form for model size and compute to estimate optimal scaling exponents for different shape dimensions. They use a "star sweep" and a "grid sweep" to efficiently explore the design space and identify compute-optimal model shapes. |
MLP dimension should be scaled faster than depth, and depth faster than width in vision transformers.
Compute-optimal ViTs are smaller than previously used, with parameter count scaling more slowly than the allocated compute.
SoViT-400m/14, a model optimized for the compute-equivalent of ViT-g/14, achieves 90.3% fine-tuning accuracy on ILSVRC2012, surpassing the much larger ViT-g/14 while being significantly smaller. |
The proposed optimal model shape might not be ideal for all vision tasks, as indicated by the panoptic segmentation results.
The study primarily focuses on optimizing three shape dimensions (width, depth, MLP size) and fixing the patch size. Further investigation is needed to explore the impact of including patch size in the optimization process. |
vision transformers, scaling laws, model shape optimization, compute efficiency, image classification |
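The first finding in the SoViT entry above (MLP dimension scaled faster than depth, depth faster than width) can be made concrete with a small power-law sketch. The exponents and the base shape below are hypothetical placeholders chosen only to respect that ordering; they are not the values estimated in the paper, and `scale_shape` is an illustrative helper.

```python
from typing import Dict


def scale_shape(base: Dict[str, int],
                compute_ratio: float,
                exponents: Dict[str, float]) -> Dict[str, int]:
    """Grow each shape dimension as base * compute_ratio ** exponent."""
    return {
        name: max(1, round(size * compute_ratio ** exponents[name]))
        for name, size in base.items()
    }


# Hypothetical base shape and exponents (assumptions, not paper values).
BASE = {"width": 768, "depth": 12, "mlp_dim": 3072}
EXPONENTS = {"width": 0.1, "depth": 0.2, "mlp_dim": 0.3}

scaled = scale_shape(BASE, compute_ratio=10.0, exponents=EXPONENTS)
# Relative growth of each dimension for a 10x compute increase.
growth = {name: scaled[name] / BASE[name] for name in BASE}
```

With any exponents ordered this way, `growth["mlp_dim"]` exceeds `growth["depth"]`, which in turn exceeds `growth["width"]`.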
2305.12998
Report |
MFT: Long-Term Tracking of Every Pixel |
Michal Neoral, Jonáš Šerých, Jiří Matas |
We propose MFT -- Multi-Flow dense Tracker -- a novel method for dense,
pixel-level, long-term tracking. The approach exploits optical flows estimated
not only between consecutive frames, but also for pairs of frames at
logarithmically spaced intervals. It selects the most reliable sequence of
flows on the basis of estimates of its geometric accuracy and the probability
of occlusion, both provided by a pre-trained CNN. We show that MFT achieves
competitive performance on the TAP-Vid benchmark, outperforming baselines by a
significant margin, and tracking densely orders of magnitude faster than the
state-of-the-art point-tracking methods. The method is insensitive to
medium-length occlusions and it is robustified by estimating flow with respect
to the reference frame, which reduces drift. |
Presents MFT, a novel method for dense, pixel-level, long-term tracking in videos, by combining optical flows from multiple time intervals and leveraging occlusion and uncertainty estimations. |
Addresses the limitations of existing dense tracking approaches like error accumulation and occlusion sensitivity, aiming for robust and accurate long-term tracking essential for applications like video editing and augmented reality. |
Utilizes pre-computed optical flows at logarithmically spaced time intervals, and employs two small CNNs to estimate occlusion and uncertainty maps from optical flow cost volumes. A selection mechanism then identifies the most reliable flow chain for each pixel based on these maps. |
Achieves competitive performance on the TAP-Vid benchmark, outperforming most baselines and demonstrating a good balance between speed and accuracy for dense tracking.
Significantly outperforms other methods in terms of speed when tracking densely, achieving 2.4 FPS compared to 0.04 FPS for state-of-the-art point trackers, and reaching over 100 FPS with pre-computed flows.
Shows robustness against moderate occlusions and adapts to appearance changes by dynamically switching between different flow intervals. |
Exhibits occasional spurious re-detections when out-of-view template regions are incorrectly matched to visually similar areas in the current frame.
Future work could explore techniques to mitigate these spurious re-detections, potentially by incorporating temporal consistency constraints or object-level reasoning. |
dense tracking, long-term tracking, optical flow, occlusion handling, uncertainty estimation |
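The selection step in the MFT entry above can be sketched for a single pixel. The sketch assumes power-of-two frame gaps as the logarithmically spaced intervals and treats each flow chain as a candidate with an accumulated uncertainty and an occlusion flag; the `Candidate` class, the `max_delta` cap, and both function names are hypothetical and not taken from the authors' code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    delta: int           # frame gap of the last flow in the chain
    uncertainty: float   # accumulated uncertainty, lower is better
    occluded: bool       # predicted occlusion for this pixel


def log_spaced_deltas(t: int, max_delta: int = 32) -> List[int]:
    """Power-of-two frame gaps that fit before frame t (cap is an assumption)."""
    deltas = []
    d = 1
    while d <= min(t, max_delta):
        deltas.append(d)
        d *= 2
    return deltas


def select_chain(candidates: List[Candidate]) -> Optional[Candidate]:
    """Pick the least uncertain non-occluded chain, or None if all are occluded."""
    visible = [c for c in candidates if not c.occluded]
    if not visible:
        return None
    return min(visible, key=lambda c: c.uncertainty)


candidates = [
    Candidate(delta=1, uncertainty=0.9, occluded=False),
    Candidate(delta=4, uncertainty=0.2, occluded=True),
    Candidate(delta=8, uncertainty=0.4, occluded=False),
]
best = select_chain(candidates)
```

Returning `None` when every chain is occluded lets the caller mark the pixel as occluded in the current frame instead of forcing a position.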
2305.12972
Report |
VanillaNet: the Power of Minimalism in Deep Learning |
Hanting Chen, Yunhe Wang, Jianyuan Guo, Dacheng Tao |
At the heart of foundation models is the philosophy of "more is different",
exemplified by the astonishing success in computer vision and natural language
processing. However, the challenges of optimization and inherent complexity of
transformer models call for a paradigm shift towards simplicity. In this study,
we introduce VanillaNet, a neural network architecture that embraces elegance
in design. By avoiding high depth, shortcuts, and intricate operations like
self-attention, VanillaNet is refreshingly concise yet remarkably powerful.
Each layer is carefully crafted to be compact and straightforward, with
nonlinear activation functions pruned after training to restore the original
architecture. VanillaNet overcomes the challenges of inherent complexity,
making it ideal for resource-constrained environments. Its easy-to-understand
and highly simplified architecture opens new possibilities for efficient
deployment. Extensive experimentation demonstrates that VanillaNet delivers
performance on par with renowned deep neural networks and vision transformers,
showcasing the power of minimalism in deep learning. This visionary journey of
VanillaNet has significant potential to redefine the landscape and challenge
the status quo of foundation models, setting a new path for elegant and
effective model design. Pre-trained models and codes are available at
https://github.com/huawei-noah/VanillaNet and
https://gitee.com/mindspore/models/tree/master/research/cv/vanillanet. |
This paper introduces VanillaNet, a simple neural network architecture for computer vision that avoids complex components like shortcuts, excessive depth, and self-attention, while still achieving competitive performance. |
Existing deep learning models, while powerful, are becoming increasingly complex, posing challenges for deployment, especially in resource-constrained environments. VanillaNet addresses this by offering a simpler alternative without sacrificing performance. |
VanillaNet employs a streamlined architecture built upon convolutional layers and introduces a "deep training" strategy for improved performance. This involves starting with additional non-linear activation functions and progressively pruning them during training to maintain inference speed. It also incorporates a novel series-based activation function for enhanced non-linearity. |
VanillaNet achieves image classification accuracy comparable to well-known deep networks and vision transformers, even surpassing them in inference speed on GPUs.
Ablation studies demonstrate the effectiveness of the proposed deep training strategy and series activation function in boosting the performance of simple architectures.
Visualization of attention maps provides insights into VanillaNet's learning process, suggesting its strength in thoroughly extracting information from images. |
Future work includes exploring better parameter allocation strategies for VanillaNet to further improve its efficiency.
Further investigation into the trade-off between non-linearity and depth in extremely simple architectures is also warranted. |
neural network architecture, deep learning, computer vision, model efficiency, convolutional neural networks |
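The "deep training" idea in the VanillaNet entry above, where extra activations are phased out over training, can be sketched as a blend between a nonlinearity and the identity. The linear schedule, the choice of ReLU, and the name `blended_activation` are assumptions made for illustration; the entry does not specify the schedule.

```python
def relu(x: float) -> float:
    return max(0.0, x)


def blended_activation(x: float, epoch: int, total_epochs: int) -> float:
    """Interpolate between ReLU (start of training) and identity (end).

    lam grows linearly from 0 to 1 over training (assumed schedule).
    """
    lam = min(1.0, max(0.0, epoch / total_epochs))
    return (1.0 - lam) * relu(x) + lam * x


start = blended_activation(-2.0, epoch=0, total_epochs=100)
middle = blended_activation(-2.0, epoch=50, total_epochs=100)
end = blended_activation(-2.0, epoch=100, total_epochs=100)
```

Once `lam` reaches 1 the function is exactly the identity, so the activation can be removed from the network without changing its output.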
2305.12966
Report |
Hierarchical Integration Diffusion Model for Realistic Image Deblurring |
Zheng Chen, Yulun Zhang, Ding Liu, Bin Xia, Jinjin Gu, Linghe Kong, Xin Yuan |
Diffusion models (DMs) have recently been introduced in image deblurring and
exhibited promising performance, particularly in terms of details
reconstruction. However, the diffusion model requires a large number of
inference iterations to recover the clean image from pure Gaussian noise, which
consumes massive computational resources. Moreover, the distribution
synthesized by the diffusion model is often misaligned with the target results,
leading to restrictions in distortion-based metrics. To address the above
issues, we propose the Hierarchical Integration Diffusion Model (HI-Diff), for
realistic image deblurring. Specifically, we perform the DM in a highly
compacted latent space to generate the prior feature for the deblurring
process. The deblurring process is implemented by a regression-based method to
obtain better distortion accuracy. Meanwhile, the highly compact latent space
ensures the efficiency of the DM. Furthermore, we design the hierarchical
integration module to fuse the prior into the regression-based model from
multiple scales, enabling better generalization in complex blurry scenarios.
Comprehensive experiments on synthetic and real-world blur datasets demonstrate
that our HI-Diff outperforms state-of-the-art methods. Code and trained models
are available at https://github.com/zhengchen1999/HI-Diff. |
This paper proposes HI-Diff, a novel Hierarchical Integration Diffusion Model, for realistic image deblurring. |
Diffusion models (DMs) show promise in image deblurring for detailed reconstruction but suffer from high computational cost and potential distortion due to distribution misalignment. |
HI-Diff uses a two-stage training approach: 1) compressing ground truth images into a compact latent representation as prior feature and integrating it into a Transformer through a hierarchical integration module (HIM), and 2) training a latent diffusion model to generate the prior feature, which further guides the Transformer during deblurring. |
HI-Diff outperforms state-of-the-art methods on benchmark datasets, including GoPro, HIDE, RealBlur, and RWBI.
Hierarchical integration with multi-scale prior features outperforms using single-scale features.
Jointly training the diffusion model and Transformer in stage two significantly improves the deblurring performance. |
The study mainly focuses on image deblurring, and its applicability to other image restoration tasks needs further investigation.
Exploring more advanced diffusion model architectures and training strategies could potentially further enhance the deblurring performance. |
image deblurring, diffusion models, hierarchical integration, latent space, prior feature |
2305.12799
Report |
Interactive Data Synthesis for Systematic Vision Adaptation via LLMs-AIGCs Collaboration |
Qifan Yu, Juncheng Li, Wentao Ye, Siliang Tang, Yueting Zhuang |
Recent text-to-image generation models have shown promising results in
generating high-fidelity photo-realistic images. In parallel, the problem of
data scarcity has brought a growing interest in employing AIGC technology for
high-quality data expansion. However, this paradigm requires well-designed
prompt engineering, and low-cost data expansion and labeling remain
under-explored. Inspired by the powerful task-guidance capability of LLMs, we
propose a new paradigm of annotated data expansion named ChatGenImage. The
core idea behind it is to leverage the complementary strengths of diverse
models to establish a highly effective and user-friendly pipeline for
interactive data augmentation. In this work, we extensively study how LLMs
communicate with AIGC models to achieve more controllable image generation and
make the first attempt to combine them for automatic data augmentation for
a variety of downstream tasks. Finally, we present fascinating results obtained
from our ChatGenImage framework and demonstrate the powerful potential of our
synthetic data for systematic vision adaptation. Our codes are available at
https://github.com/Yuqifan1117/Labal-Anything-Pipeline. |
Presents ChatGenImage, a novel framework for interactive data augmentation that leverages the collaborative capabilities of LLMs, AIGC models, and label foundation toolkits to generate high-quality synthetic images with fine-grained annotations. |
Addresses the limitations of existing data augmentation methods by enabling more controllable and diverse image generation with detailed annotations, which is crucial for improving the generalization and robustness of vision models, especially in data-scarce scenarios. |
Utilizes a two-stage process: 1) Language Enhancement Image Initialization: LLMs generate descriptive prompts to guide AIGC models in creating initial images. 2) Iteratively Local Refinement and Labeling: LLMs analyze annotations from label foundation toolkits and provide local editing prompts to AIGC models for iterative refinement, resulting in images that align with complex annotations. |
ChatGenImage effectively generates controllable and diverse images, even for unfamiliar or rare concepts, by leveraging the knowledge and reasoning capabilities of LLMs.
The framework excels in creating images depicting intricate scenes with multiple objects and backgrounds through iterative local refinement guided by LLMs.
Image filtering rules based on pixel and semantic checking ensure the generation of high-quality synthetic data suitable for downstream tasks. |
Current experiments primarily focus on qualitative analysis, with quantitative evaluations for downstream task performance left for future work.
The framework's computational cost, particularly for iterative refinement, presents a challenge for large-scale data generation. |
data augmentation, synthetic data generation, large language models (llms), text-to-image synthesis, vision adaptation |
2305.12716
Report |
The CLIP Model is Secretly an Image-to-Prompt Converter |
Yuxuan Ding, Chunna Tian, Haoxuan Ding, Lingqiao Liu |
The Stable Diffusion model is a prominent text-to-image generation model that
relies on a text prompt as its input, which is encoded using the Contrastive
Language-Image Pre-Training (CLIP). However, text prompts have limitations when
it comes to incorporating implicit information from reference images. Existing
methods have attempted to address this limitation by employing expensive
training procedures involving millions of training samples for image-to-image
generation. In contrast, this paper demonstrates that the CLIP model, as
utilized in Stable Diffusion, inherently possesses the ability to
instantaneously convert images into text prompts. Such an image-to-prompt
conversion can be achieved by utilizing a linear projection matrix that is
calculated in a closed form. Moreover, the paper showcases that this capability
can be further enhanced by either utilizing a small amount of similar-domain
training data (approximately 100 images) or incorporating several online
training steps (around 30 iterations) on the reference images. By leveraging
these approaches, the proposed method offers a simple and flexible solution to
bridge the gap between images and text prompts. This methodology can be applied
to various tasks such as image variation and image editing, facilitating more
effective and seamless interaction between images and textual prompts. |
This paper introduces Stable Diffusion Image-to-Prompt Conversion (SD-IPC), a method for converting images into text prompts for use with Stable Diffusion, eliminating the need for expensive retraining procedures. |
Text prompts in text-to-image generation models struggle to capture implicit information from reference images, limiting their effectiveness for tasks like image variation. |
SD-IPC leverages the inherent relationship between CLIP's visual and textual embeddings by deriving a closed-form projection matrix. This matrix converts visual embeddings into textual prompts, enabling image-guided generation with Stable Diffusion. Additionally, the paper proposes fine-tuning approaches to improve content preservation and customize generation. |
SD-IPC effectively captures semantic information from reference images, enabling image variation without extensive training.
Fine-tuning SD-IPC on specific datasets enhances content preservation and editing capabilities, outperforming existing methods like SD-R.
SD-IPC facilitates fast adaptation for customized generation, requiring significantly fewer updates than methods like Custom Diffusion. |
Editing text needs to be contextually appropriate to avoid generating nonsensical images.
The current method only supports single image input. |
image variation, text-to-image generation, stable diffusion, clip, image-to-prompt conversion |
2305.12659
Report |
UVOSAM: A Mask-free Paradigm for Unsupervised Video Object Segmentation via Segment Anything Model |
Zhenghao Zhang, Zhichao Wei, Shengfan Zhang, Zuozhuo Dai, Siyu Zhu |
Unsupervised video object segmentation has made significant progress in
recent years, but the manual annotation of video mask datasets is expensive and
limits the diversity of available datasets. The Segment Anything Model (SAM)
has introduced a new prompt-driven paradigm for image segmentation, unlocking a
range of previously unexplored capabilities. In this paper, we propose a novel
paradigm called UVOSAM, which leverages SAM for unsupervised video object
segmentation without requiring video mask labels. To address SAM's limitations
in instance discovery and identity association, we introduce a video salient
object tracking network that automatically generates trajectories for prominent
foreground objects. These trajectories then serve as prompts for SAM to produce
video masks on a frame-by-frame basis. Our experimental results demonstrate
that UVOSAM significantly outperforms current mask-supervised methods. These
findings suggest that UVOSAM has the potential to improve unsupervised video
object segmentation and reduce the cost of manual annotation. |
This paper introduces UVOSAM, a novel paradigm for unsupervised video object segmentation using the Segment Anything Model (SAM) without requiring video mask labels. |
Manually annotating video mask datasets is expensive and limits diversity. UVOSAM aims to address this by leveraging SAM for mask-free unsupervised video object segmentation. |
UVOSAM consists of two stages: 1) a video salient object tracking (VSOT) network detects prominent objects and generates trajectories, and 2) SAM utilizes these trajectories as prompts to produce video masks frame-by-frame. |
UVOSAM significantly outperforms current mask-supervised methods on the DAVIS2017-unsupervised and YouTube-VIS 2019 datasets.
Providing accurate bounding box prompts to SAM leads to near-perfect segmentation results, highlighting its potential.
Ablation studies demonstrate the importance of the tracking framework, prompt types, and combining box and point prompts for optimal performance. |
UVOSAM struggles with detecting slender objects and experiences trajectory drift in cases of occlusion or significant scale changes.
Future work will focus on improving VSOT's robustness to address these limitations. |
unsupervised video object segmentation, segment anything model (sam), mask-free training, video salient object tracking, prompt-driven segmentation |
2305.12529
Report |
DreamWaltz: Make a Scene with Complex 3D Animatable Avatars |
Yukun Huang, Jianan Wang, Ailing Zeng, He Cao, Xianbiao Qi, Yukai Shi, Zheng-Jun Zha, Lei Zhang |
We present DreamWaltz, a novel framework for generating and animating complex
3D avatars given text guidance and parametric human body prior. While recent
methods have shown encouraging results for text-to-3D generation of common
objects, creating high-quality and animatable 3D avatars remains challenging.
To create high-quality 3D avatars, DreamWaltz proposes 3D-consistent
occlusion-aware Score Distillation Sampling (SDS) to optimize implicit neural
representations with canonical poses. It provides view-aligned supervision via
3D-aware skeleton conditioning which enables complex avatar generation without
artifacts and multiple faces. For animation, our method learns an animatable 3D
avatar representation from abundant image priors of a diffusion model conditioned
on various poses, which could animate complex non-rigged avatars given
arbitrary poses without retraining. Extensive evaluations demonstrate that
DreamWaltz is an effective and robust approach for creating 3D avatars that can
take on complex shapes and appearances as well as novel poses for animation.
The proposed framework further enables the creation of complex scenes with
diverse compositions, including avatar-avatar, avatar-object and avatar-scene
interactions. See https://dreamwaltz3d.github.io/ for more vivid 3D avatar and
animation results. |
DreamWaltz is a novel framework for generating and animating complex 3D avatars from text descriptions, leveraging human body priors. |
Creating high-quality, animatable 3D avatars from text is challenging due to the complexity of avatar appearances, articulated structures, and pose-dependent shape changes. |
DreamWaltz utilizes a trainable NeRF for 3D representation, a pre-trained text-and-skeleton-conditional diffusion model for supervision, and SMPL models for 3D-aware skeletons. It introduces 3D-consistent occlusion-aware Score Distillation Sampling for high-quality avatar creation and learns an animatable NeRF representation from diffusion and pose priors. |
DreamWaltz generates high-quality 3D avatars with complex shapes and appearances from text prompts.
The learned animatable NeRF enables realistic avatar animation with arbitrary motion sequences without retraining.
The framework allows for scene composition with diverse avatar-avatar, avatar-object, and avatar-scene interactions. |
The visual quality can be further improved with higher resolution training and dedicated optimization for details like face and hand.
The model may inherit societal biases present in the training data of the underlying diffusion model. |
text-to-3d, avatar generation, 3d animation, nerf, diffusion models |
2305.12476
Report |
Zero-shot Visual Relation Detection via Composite Visual Cues from Large Language Models |
Lin Li, Jun Xiao, Guikun Chen, Jian Shao, Yueting Zhuang, Long Chen |
Pretrained vision-language models, such as CLIP, have demonstrated strong
generalization capabilities, making them promising tools in the realm of
zero-shot visual recognition. Visual relation detection (VRD) is a typical task
that identifies relationship (or interaction) types between object pairs within
an image. However, naively utilizing CLIP with prevalent class-based prompts
for zero-shot VRD has several weaknesses, e.g., it struggles to distinguish
between different fine-grained relation types and it neglects essential spatial
information of two objects. To this end, we propose a novel method for
zero-shot VRD: RECODE, which solves RElation detection via COmposite
DEscription prompts. Specifically, RECODE first decomposes each predicate
category into subject, object, and spatial components. Then, it leverages large
language models (LLMs) to generate description-based prompts (or visual cues)
for each component. Different visual cues enhance the discriminability of
similar relation categories from different perspectives, which significantly
boosts performance in VRD. To dynamically fuse different cues, we further
introduce a chain-of-thought method that prompts LLMs to generate reasonable
weights for different visual cues. Extensive experiments on four VRD benchmarks
have demonstrated the effectiveness and interpretability of RECODE. |
This paper introduces RECODE, the first training-free framework for zero-shot visual relation detection using large language models (LLMs). |
Existing VRD methods require extensive training data and struggle with unseen relations. RECODE addresses these limitations by leveraging the knowledge and reasoning capabilities of LLMs. |
RECODE utilizes LLMs to generate descriptions of visual cues for various relation categories. These descriptions are then used to compute similarities between image regions and relation categories, enabling zero-shot relation detection. |
RECODE achieves competitive results on the Visual Genome (VG) dataset without any training, demonstrating its potential for zero-shot VRD.
Ablation studies show that RECODE's description-based prompts are more effective than a baseline using only class-based prompts.
Qualitative analysis showcases the interpretability of RECODE's predictions, revealing its ability to identify relevant visual cues for accurate relation classification. |
RECODE's current evaluation doesn't explicitly cover spatial or ownership relation categories.
The framework assumes access to perfect bounding boxes and object categories, which might not be realistic in real-world applications. |
visual relation detection, zero-shot learning, large language models, computer vision, artificial intelligence |
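The cue fusion in the RECODE entry above can be sketched as a weighted sum of per-cue similarities. The sketch assumes the similarities for the subject, object, and spatial cues have already been computed (for example with CLIP) and that an LLM has supplied the weights. All numbers are made up for illustration, and normalising by the weight total is an assumption rather than the paper's stated rule.

```python
from typing import Dict


def fuse_cues(similarities: Dict[str, float],
              weights: Dict[str, float]) -> float:
    """Weighted average of per-cue similarities for one relation category."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[c] * similarities[c] for c in weights) / total


def predict_relation(sims: Dict[str, Dict[str, float]],
                     weights: Dict[str, Dict[str, float]]) -> str:
    """Return the relation category with the highest fused score."""
    scores = {rel: fuse_cues(sims[rel], weights[rel]) for rel in sims}
    return max(scores, key=scores.get)


# Illustrative, made-up similarities and weights (not experimental results).
sims = {
    "riding": {"subject": 0.30, "object": 0.28, "spatial": 0.40},
    "holding": {"subject": 0.31, "object": 0.27, "spatial": 0.10},
}
weights = {
    "riding": {"subject": 0.3, "object": 0.3, "spatial": 0.4},
    "holding": {"subject": 0.4, "object": 0.4, "spatial": 0.2},
}
prediction = predict_relation(sims, weights)
```

Here the two relations have near-identical subject and object similarities, so the spatial cue decides the prediction.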
2305.12452
Report |
Advancing Referring Expression Segmentation Beyond Single Image |
Yixuan Wu, Zhao Zhang, Xie Chi, Feng Zhu, Rui Zhao |
Referring Expression Segmentation (RES) is a widely explored multi-modal
task, which endeavors to segment the pre-existing object within a single image
with a given linguistic expression. However, in broader real-world scenarios,
it is not always possible to determine if the described object exists in a
specific image. Typically, we have a collection of images, some of which may
contain the described objects. The current RES setting curbs its practicality
in such situations. To overcome this limitation, we propose a more realistic
and general setting, named Group-wise Referring Expression Segmentation (GRES),
which expands RES to a collection of related images, allowing the described
objects to be present in a subset of input images. To support this new setting,
we introduce an elaborately compiled dataset named Grouped Referring Dataset
(GRD), containing complete group-wise annotations of target objects described
by given expressions. We also present a baseline method named Grouped Referring
Segmenter (GRSer), which explicitly captures the language-vision and
intra-group vision-vision interactions to achieve state-of-the-art results on
the proposed GRES and related tasks, such as Co-Salient Object Detection and
RES. Our dataset and code will be publicly released at
https://github.com/yixuan730/group-res. |
This paper proposes Group-wise Referring Expression Segmentation (GRES), a new setting that extends Referring Expression Segmentation (RES) to a collection of related images, allowing for more realistic scenarios where the target object might not be present in all images. |
Current RES methods are limited to single images with confirmed target presence, hindering their practicality in real-world applications such as image retrieval or multi-monitor event discovery. GRES addresses this limitation by considering groups of related images, some of which may not contain the described object. |
The authors introduce GRSer, a baseline method for GRES, which leverages language and intra-group visual cues. GRSer utilizes a Triphasic Query Module (TQM) to generate heatmaps based on both linguistic and visual features, and a Heatmap Hierarchizer to rank these heatmaps for improved object localization. Additionally, a new dataset called GRD is presented, featuring complete group-wise annotations of target objects, including negative samples. |
GRSer significantly outperforms existing RES methods on both GRES and conventional RES benchmarks.
The proposed TQM and Heatmap Hierarchizer are shown to effectively capture language-vision and vision-vision interactions, contributing to improved object localization.
The GRD dataset, with its complete group-wise annotations and negative samples, provides a more realistic and challenging benchmark for evaluating GRES methods. |
The study primarily focuses on a fixed group size, leaving exploration of dynamic group sizes for future work.
The impact of different language and vision encoders on GRES performance could be investigated further. |
referring expression segmentation, group-wise segmentation, multi-modal learning, computer vision, natural language processing |
2305.12328
Report |
InstructVid2Vid: Controllable Video Editing with Natural Language Instructions |
Bosheng Qin, Juncheng Li, Siliang Tang, Tat-Seng Chua, Yueting Zhuang |
We present an end-to-end diffusion-based method for editing videos with human
language instructions, namely $\textbf{InstructVid2Vid}$. Our approach enables
the editing of input videos based on natural language instructions without any
per-example fine-tuning or inversion. The proposed InstructVid2Vid model
combines a pretrained image generation model, Stable Diffusion, with a
conditional 3D U-Net architecture to generate a time-dependent sequence of video
frames. To obtain the training data, we incorporate the knowledge and expertise
of different models, including ChatGPT, BLIP, and Tune-a-Video, to synthesize
video-instruction triplets, which is a more cost-efficient alternative to
collecting data in real-world scenarios. To improve the consistency between
adjacent frames of generated videos, we propose the Frame Difference Loss,
which is incorporated during the training process. During inference, we extend
the classifier-free guidance to text-video input to guide the generated
results, making them more related to both the input video and instruction.
Experiments demonstrate that InstructVid2Vid is able to generate high-quality,
temporally coherent videos and perform diverse edits, including attribute
editing, change of background, and style transfer. These results highlight the
versatility and effectiveness of our proposed method. Code is released at
$\href{https://github.com/BrightQin/InstructVid2Vid}{InstructVid2Vid}$. |
Introduces InstructVid2Vid, an end-to-end diffusion-based video editing method using human language instructions without per-example fine-tuning. |
Addresses limitations of existing video editing methods that require computationally expensive fine-tuning for each input video. |
Combines a pretrained Stable Diffusion model with a 3D U-Net, trained on a synthetic dataset generated using ChatGPT, BLIP, and Tune-a-Video. Introduces Frame Difference Loss to enhance temporal consistency in generated videos. |
Achieves attribute modification, background change, and style transfer in videos while maintaining temporal consistency.
Demonstrates superior performance in quantitative metrics like frame differencing, optical flow, and FID compared to models without Frame Difference Loss.
Showcases the potential of model composition for generating training data and enabling advanced video editing capabilities. |
Current model primarily excels at Level 1 video editing tasks and faces limitations in comprehending complex instructions or achieving high-level semantic editing.
Future work focuses on enhancing InstructVid2Vid's capabilities for higher-level video editing tasks like motion manipulation and story-driven editing. |
video editing, diffusion models, generative ai, text-to-video, multimodal learning |
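The InstructVid2Vid record above names a Frame Difference Loss for temporal consistency but does not state its formula. A minimal sketch, assuming the loss is the mean squared error between the adjacent-frame differences of the predicted and target videos (this form and the name `frame_difference_loss` are illustrative assumptions, not the authors' definition):

```python
def frame_difference_loss(pred, target):
    """Mean squared error between adjacent-frame differences.

    pred, target: lists of frames, each frame a flat list of floats.
    Assumed form for illustration only; the paper's exact loss may differ.
    """
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("need two videos with the same number (>= 2) of frames")
    total, count = 0.0, 0
    for t in range(len(pred) - 1):
        for p0, p1, g0, g1 in zip(pred[t], pred[t + 1], target[t], target[t + 1]):
            # Compare how each pixel changes over time in both videos.
            diff = (p1 - p0) - (g1 - g0)
            total += diff * diff
            count += 1
    return total / count
```

Under this assumed form, a constant per-video brightness offset is not penalized; only mismatched frame-to-frame changes contribute to the loss.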
2305.12252
Report |
Boosting Human-Object Interaction Detection with Text-to-Image Diffusion Model |
Jie Yang, Bingliang Li, Fengyu Yang, Ailing Zeng, Lei Zhang, Ruimao Zhang |
This paper investigates the problems of current HOI detection methods and
introduces DiffHOI, a novel HOI detection scheme grounded on a pre-trained
text-to-image diffusion model, which enhances the detector's performance via
improved data diversity and HOI representation. We demonstrate that the
internal representation space of a frozen text-to-image diffusion model is
highly relevant to verb concepts and their corresponding context. Accordingly,
we propose an adapter-style tuning method to extract various semantically
associated representations from a frozen diffusion model and CLIP model to
enhance the human and object representations from the pre-trained detector,
further reducing the ambiguity in interaction prediction. Moreover, to fill in
the gaps of HOI datasets, we propose SynHOI, a class-balanced, large-scale, and
high-diversity synthetic dataset containing over 140K HOI images with full
triplet annotations. It is built using an automatic and scalable pipeline
designed to scale up the generation of diverse and high-precision HOI-annotated
data. SynHOI could effectively relieve the long-tail issue in existing datasets
and facilitate learning interaction representations. Extensive experiments
demonstrate that DiffHOI significantly outperforms the state-of-the-art in
regular detection (i.e., 41.50 mAP) and zero-shot detection. Furthermore,
SynHOI can improve the performance of model-agnostic and backbone-agnostic HOI
detection, particularly exhibiting an outstanding 11.55% mAP improvement in
rare classes. |
This paper introduces DiffHOI, a novel HOI detection scheme that leverages the generative and representative capabilities of pre-trained text-to-image diffusion models to enhance HOI detection performance. |
Current HOI detection methods suffer from limitations such as class imbalance, small data size, limited diversity in existing datasets, and difficulties in extracting nuanced verb-associated contextual information for effective interaction prediction. |
DiffHOI utilizes an adapter-style tuning approach to extract global and local semantic representations from a frozen diffusion model and CLIP model. It also introduces SynHOI, a class-balanced, large-scale synthetic HOI dataset generated using an automatic pipeline. |
DiffHOI significantly outperforms state-of-the-art methods in regular HOI detection, achieving 41.50 mAP on HICO-DET.
SynHOI effectively addresses the long-tail issue in existing datasets and improves the performance of HOI detection, particularly in rare classes with an 11.55% mAP improvement.
DiffHOI demonstrates superior performance in zero-shot HOI detection. |
Incorporating large-scale diffusion models into HOI detection pipelines adds computational cost.
More effective prompt design strategies for generating higher-quality synthetic HOI data remain to be explored. |
human-object interaction detection, diffusion models, synthetic data generation, zero-shot learning, computer vision |
2305.11588
Report |
Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields |
Jingbo Zhang, Xiaoyu Li, Ziyu Wan, Can Wang, Jing Liao |
Text-driven 3D scene generation is widely applicable to video gaming, film
industry, and metaverse applications that have a large demand for 3D scenes.
However, existing text-to-3D generation methods are limited to producing 3D
objects with simple geometries and dreamlike styles that lack realism. In this
work, we present Text2NeRF, which is able to generate a wide range of 3D scenes
with complicated geometric structures and high-fidelity textures purely from a
text prompt. To this end, we adopt NeRF as the 3D representation and leverage a
pre-trained text-to-image diffusion model to constrain the 3D reconstruction of
the NeRF to reflect the scene description. Specifically, we employ the
diffusion model to infer the text-related image as the content prior and use a
monocular depth estimation method to offer the geometric prior. Both content
and geometric priors are utilized to update the NeRF model. To guarantee
textural and geometric consistency between different views, we introduce a
progressive scene inpainting and updating strategy for novel view synthesis of
the scene. Our method requires no additional training data but only a natural
language description of the scene as the input. Extensive experiments
demonstrate that our Text2NeRF outperforms existing methods in producing
photo-realistic, multi-view consistent, and diverse 3D scenes from a variety of
natural language prompts. Our code is available at
https://github.com/eckertzhang/Text2NeRF. |
Text2NeRF generates realistic 3D scenes from text prompts by combining a pre-trained text-to-image diffusion model with NeRF. |
Existing text-to-3D methods struggle to generate high-fidelity and diverse scenes with complex geometry, often resulting in simplistic or dreamlike outputs. Text2NeRF addresses this by leveraging the strengths of NeRF and diffusion models for realistic scene generation. |
The method infers an initial image and depth map from the text prompt using a diffusion model and a monocular depth estimation method. This information initializes a NeRF model. A progressive inpainting and updating (PIU) strategy expands the scene view-by-view, using the diffusion model to fill missing regions while maintaining consistency. Support sets with multi-view constraints and a two-stage depth alignment strategy are employed to enhance realism and accuracy. |
Generates photorealistic 3D scenes with complex geometry and textures from text prompts.
Outperforms existing text-to-3D methods both qualitatively and quantitatively in terms of scene quality, realism, and semantic relevance.
Supports generation of diverse scenes, including indoor, outdoor, and artistic styles, and allows for 360-degree scene generation. |
Struggles with scenes containing large occlusions due to limitations in depth estimation accuracy.
Requires longer optimization time compared to mesh or point cloud-based generation methods. |
text-to-3d, nerf, 3d scene generation, diffusion models, novel view synthesis |
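The Text2NeRF record above mentions a two-stage depth alignment strategy without giving details. A minimal sketch of what a global first stage could look like, assuming a least-squares scale and shift that maps monocular depth estimates onto already rendered depths at overlapping pixels (the function names and the least-squares form are assumptions for illustration):

```python
def fit_scale_shift(estimated, rendered):
    """Least-squares scale s and shift t so that s * estimated + t ~ rendered.

    estimated, rendered: flat lists of depths at the same overlapping pixels.
    """
    n = len(estimated)
    if n != len(rendered) or n < 2:
        raise ValueError("need two depth lists of equal length (>= 2)")
    mean_e = sum(estimated) / n
    mean_r = sum(rendered) / n
    sxx = sum((e - mean_e) ** 2 for e in estimated)
    sxy = sum((e - mean_e) * (r - mean_r) for e, r in zip(estimated, rendered))
    if sxx == 0:
        raise ValueError("estimated depth is constant; scale is undefined")
    scale = sxy / sxx
    shift = mean_r - scale * mean_e
    return scale, shift


def align_depth(estimated, scale, shift):
    """Apply the fitted global scale and shift to a depth list."""
    return [scale * e + shift for e in estimated]
```

A second, finer stage (not sketched here) would be needed to correct residual local misalignment that a single global scale and shift cannot capture.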
2305.11577
Report |
LeftRefill: Filling Right Canvas based on Left Reference through Generalized Text-to-Image Diffusion Model |
Chenjie Cao, Yunuo Cai, Qiaole Dong, Yikai Wang, Yanwei Fu |
This paper introduces LeftRefill, an innovative approach to efficiently
harness large Text-to-Image (T2I) diffusion models for reference-guided image
synthesis. As the name implies, LeftRefill horizontally stitches reference and
target views together as a whole input. The reference image occupies the left
side, while the target canvas is positioned on the right. Then, LeftRefill
paints the right-side target canvas based on the left-side reference and
specific task instructions. Such a task formulation shares some similarities
with contextual inpainting, akin to the actions of a human painter. This novel
formulation efficiently learns both structural and textural correspondence
between reference and target without other image encoders or adapters. We
inject task and view information through cross-attention modules in T2I models,
and further exhibit multi-view reference ability via the re-arranged
self-attention modules. These enable LeftRefill to perform consistent
generation as a generalized model without requiring test-time fine-tuning or
model modifications. Thus, LeftRefill can be seen as a simple yet unified
framework to address reference-guided synthesis. As an exemplar, we leverage
LeftRefill to address two different challenges: reference-guided inpainting and
novel view synthesis, based on the pre-trained Stable Diffusion. Code and
models are released at https://github.com/ewrfcas/LeftRefill. |
This paper introduces LeftRefill, a novel method for reference-guided image synthesis using large text-to-image diffusion models by stitching reference and target images into a single canvas and leveraging contextual inpainting. |
Existing methods for reference-guided synthesis rely on computationally expensive fine-tuning of large models or visual encoders that prioritize semantics over spatial details, hindering performance in tasks like novel view synthesis and reference-guided inpainting. |
LeftRefill stitches reference and target images horizontally and trains using the inpainting capability of Stable Diffusion, guided by task and view-specific prompt tuning and a novel block causal masking technique for consistent autoregressive generation. |
LeftRefill achieves state-of-the-art performance in both reference-guided inpainting and novel view synthesis with fewer trainable parameters.
The method effectively leverages multi-view references to enhance inpainting quality and generate consistent novel views.
The proposed block causal masking technique enables autoregressive generation in diffusion-based models, leading to improved geometric consistency in novel view synthesis. |
LeftRefill's autoregressive generation suffers from error accumulation, requiring additional reference views for correction.
Future work includes extending LeftRefill to higher resolutions and improving efficiency for larger, more powerful text-to-image models. |
reference-guided image synthesis, text-to-image (t2i) diffusion models, novel view synthesis, image inpainting, prompt tuning |
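The LeftRefill input layout described above (reference on the left, target canvas on the right, filled by inpainting) can be sketched as follows. The function name `build_canvas`, the placeholder fill value, and the 0/1 mask convention are assumptions for illustration:

```python
def build_canvas(reference, fill=0.0):
    """Stitch a reference image with an empty target canvas of the same size.

    reference: image as a list of rows (H x W).
    Returns (canvas, mask), both H x 2W. The mask is 0 on the left half
    (kept reference) and 1 on the right half (region to be inpainted).
    """
    width = len(reference[0])
    canvas = [list(row) + [fill] * width for row in reference]
    mask = [[0] * width + [1] * width for _ in reference]
    return canvas, mask
```

An inpainting model would then be asked to fill only the masked right half, so the left reference stays visible as context throughout generation.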
2305.11520
Report |
Late-Constraint Diffusion Guidance for Controllable Image Synthesis |
Chang Liu, Dong Liu |
Diffusion models, either with or without text condition, have demonstrated
impressive capability in synthesizing photorealistic images given a few or even
no words. These models may not fully satisfy user needs, as normal users or
artists intend to control the synthesized images with specific guidance, like
overall layout, color, structure, object shape, and so on. To adapt diffusion
models for controllable image synthesis, several methods have been proposed to
incorporate the required conditions as regularization upon the intermediate
features of the diffusion denoising network. These methods, known as
early-constraint ones in this paper, have difficulties in handling multiple
conditions with a single solution. They tend to train separate models for
each specific condition, which requires high training cost and results in
non-generalizable solutions. To address these difficulties, we propose a new
approach, namely late-constraint: we leave the diffusion network unchanged, but
constrain its output to be aligned with the required conditions. Specifically,
we train a lightweight condition adapter to establish the correlation between
external conditions and internal representations of diffusion models. During
the iterative denoising process, the conditional guidance is sent into the
corresponding condition adapter to manipulate the sampling process with the
established correlation. We further equip the introduced late-constraint
strategy with a timestep resampling method and an early stopping technique,
which boost the quality of the synthesized image while complying with the
guidance. Our method outperforms the existing early-constraint methods and
generalizes better to unseen conditions. Our code will be made available. |
This paper proposes Late-Constraint Diffusion Guidance (LCDG), a novel approach for controllable image synthesis that aligns the output of diffusion models with external guidance without altering the original network. |
Existing methods for controlling diffusion models, known as early-constraint methods, have limitations in handling multiple conditions and generalizing to unseen ones. LCDG addresses these limitations by externally guiding the sampling process. |
LCDG trains a lightweight Condition Adapter (CA) to learn the correlation between internal representations of diffusion models and external conditions. During sampling, it uses the CA to adjust the estimated score based on the difference between the desired and reconstructed conditions. |
LCDG achieves superior FID scores compared to existing early-constraint methods on the COCO dataset, demonstrating better sample quality.
The method exhibits strong generalization ability, effectively handling various conditions like edge, sketch, color stroke, palette, and mask with a single model.
LCDG offers flexible controllability through adjustable parameters like controlling scale and truncation threshold. |
LCDG, similar to other gradient-based methods, can increase sampling time due to additional forwarding processes.
Further research is needed to explore more efficient acceleration strategies to mitigate the increased sampling time. |
image synthesis, diffusion models, controllable generation, condition adapter, structure-aware sampling |
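The late-constraint idea above (adjusting the estimated score by the gap between the desired condition and the one the adapter reconstructs) can be sketched with a toy linear adapter. The linear adapter, the squared-error distance, and the sign convention for a noise-prediction model are all assumptions for illustration; the actual Condition Adapter is a learned network:

```python
def condition_gradient(weights, x, target):
    """Gradient of ||W x - c||^2 with respect to x for a toy linear adapter W.

    weights: adapter matrix as a list of rows; x: internal representation;
    target: desired condition c.
    """
    residual = [
        sum(w * xi for w, xi in zip(row, x)) - c
        for row, c in zip(weights, target)
    ]
    # d/dx ||W x - c||^2 = 2 W^T (W x - c)
    return [
        2.0 * sum(weights[i][j] * residual[i] for i in range(len(weights)))
        for j in range(len(x))
    ]


def guided_noise(eps, grad, scale):
    """Shift the predicted noise along the condition gradient.

    Assumes a noise-prediction model, where adding the gradient of the
    distance to eps moves the denoised sample toward a smaller distance.
    """
    return [e + scale * g for e, g in zip(eps, grad)]
```

The `scale` argument plays the role of the adjustable controlling scale mentioned in the results; when the reconstructed condition already matches the target, the gradient is zero and the noise estimate is left unchanged.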
2305.11487
Report |
PointGPT: Auto-regressively Generative Pre-training from Point Clouds |
Guangyan Chen, Meiling Wang, Yi Yang, Kai Yu, Li Yuan, Yufeng Yue |
Large language models (LLMs) based on the generative pre-training transformer
(GPT) have demonstrated remarkable effectiveness across a diverse range of
downstream tasks. Inspired by the advancements of the GPT, we present PointGPT,
a novel approach that extends the concept of GPT to point clouds, addressing
the challenges associated with disorder properties, low information density,
and task gaps. Specifically, a point cloud auto-regressive generation task is
proposed to pre-train transformer models. Our method partitions the input point
cloud into multiple point patches and arranges them in an ordered sequence
based on their spatial proximity. Then, an extractor-generator based
transformer decoder, with a dual masking strategy, learns latent
representations conditioned on the preceding point patches, aiming to predict
the next one in an auto-regressive manner. Our scalable approach allows for
learning high-capacity models that generalize well, achieving state-of-the-art
performance on various downstream tasks. In particular, our approach achieves
classification accuracies of 94.9% on the ModelNet40 dataset and 93.4% on the
ScanObjectNN dataset, outperforming all other transformer models. Furthermore,
our method also attains new state-of-the-art accuracies on all four few-shot
learning benchmarks. |
This paper introduces PointGPT, a novel self-supervised learning framework for point clouds inspired by the GPT architecture in NLP. It addresses challenges like point cloud disorder, low information density, and task-specific gaps to learn effective representations. |
Existing point cloud learning methods often depend on fully-supervised training with costly annotations. Self-supervised learning, particularly with GPT-like architectures, has shown great promise in NLP and image analysis for learning without explicit labels. This paper explores adapting this success to the point cloud domain. |
PointGPT partitions point clouds into ordered sequences of point patches using a Morton-order curve. It then utilizes an extractor-generator transformer decoder with a dual masking strategy. The extractor learns latent representations by predicting masked patches, while the generator reconstructs the point patches. Post-pre-training with a labeled hybrid dataset further enhances representation learning. |
PointGPT outperforms other single-modal SSL methods on object classification (ScanObjectNN, ModelNet40) and part segmentation (ShapeNetPart) tasks.
Scaled PointGPT models achieve state-of-the-art results on these tasks, exceeding even methods using cross-modal information or teacher models.
The method demonstrates strong generalization ability, achieving superior performance in few-shot learning scenarios. |
The data and model scales used are still smaller compared to NLP and image processing domains, limiting further exploration of PointGPT's potential.
Future work can investigate scaling PointGPT to even larger datasets and model sizes to further bridge the gap with NLP and image processing. |
point cloud, self-supervised learning, generative pre-training, transformer, representation learning |
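The Morton-order sequencing of point patches mentioned in the PointGPT record can be sketched as follows. The quantization of patch centers to a grid over the unit cube and the names `morton_code` and `order_patches` are assumptions for illustration:

```python
def morton_code(x, y, z, bits=10):
    """Interleave the low `bits` bits of integer coordinates x, y, z."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


def order_patches(centers, bits=10):
    """Return patch indices sorted along the Morton (Z-order) curve.

    centers: list of (x, y, z) patch centers, assumed normalized to [0, 1].
    """
    max_cell = (1 << bits) - 1

    def key(center):
        # Quantize each coordinate to an integer grid cell, clamped to range.
        cells = [min(max_cell, max(0, int(v * max_cell))) for v in center]
        return morton_code(cells[0], cells[1], cells[2], bits=bits)

    return sorted(range(len(centers)), key=lambda i: key(centers[i]))
```

Sorting by the interleaved code tends to keep spatially close patches close in the sequence, which is what makes next-patch prediction meaningful on an otherwise unordered point cloud.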
2305.11321
Report |
JoIN: Joint GANs Inversion for Intrinsic Image Decomposition |
Viraj Shah, Svetlana Lazebnik, Julien Philip |
In this work, we propose to solve ill-posed inverse imaging problems using a
bank of Generative Adversarial Networks (GAN) as a prior and apply our method
to the case of Intrinsic Image Decomposition for faces and materials. Our
method builds on the demonstrated success of GANs to capture complex image
distributions. At the core of our approach is the idea that the latent space of
a GAN is a well-suited optimization domain to solve inverse problems. Given an
input image, we propose to jointly invert the latent codes of a set of GANs
and combine their outputs to reproduce the input. Contrary to most GAN
inversion methods which are limited to inverting only a single GAN, we
demonstrate that it is possible to maintain distribution priors while inverting
several GANs jointly. We show that our approach is modular, allowing various
forward imaging models, and that it can successfully decompose both synthetic
and real images. |
This paper introduces JoIN, a novel method that leverages a bank of Generative Adversarial Networks (GANs) as priors to solve ill-posed inverse imaging problems, particularly focusing on Intrinsic Image Decomposition (IID) for faces and materials. |
IID is crucial for realistic image editing by decomposing images into independent components like albedo, shading, and specular reflections. Existing methods are often limited by restrictive priors or cross-contamination between components. This work addresses these limitations using the powerful image distribution modeling capabilities of GANs. |
The proposed method involves training separate GANs for each image component (albedo, shading, specular). Given an input image, the latent codes of these GANs are jointly optimized to minimize the reconstruction loss between the generated output and the target image. A novel kNN-based loss regularization is introduced to constrain the optimization and prevent cross-contamination between components. The approach is further enhanced by incorporating encoder-based initialization and generator fine-tuning techniques. |
JoIN successfully decomposes both synthetic and real images into their intrinsic components, outperforming existing methods quantitatively and qualitatively.
Independently trained GANs prove advantageous for modularity, flexibility, and preventing cross-contamination between components.
The novel kNN-based loss regularization is shown to be effective in maintaining GAN priors during optimization, leading to improved decomposition results. |
The method is limited by the training quality of the GANs and is currently better suited to distributions easily modeled by GANs, such as faces.
The optimization-based inversion process, while producing high-quality results, is computationally more expensive than feed-forward methods. |
generative adversarial networks, intrinsic image decomposition, inverse problems, gan inversion, image editing |
2305.11173
Report |
Going Denser with Open-Vocabulary Part Segmentation |
Peize Sun, Shoufa Chen, Chenchen Zhu, Fanyi Xiao, Ping Luo, Saining Xie, Zhicheng Yan |
Object detection has been expanded from a limited number of categories to
open vocabulary. Moving forward, a complete intelligent vision system requires
understanding more fine-grained object descriptions, namely object parts. In this
paper, we propose a detector with the ability to predict both open-vocabulary
objects and their part segmentation. This ability comes from two designs.
First, we train the detector jointly on part-level, object-level and
image-level data to build the multi-granularity alignment between language and
image. Second, we parse the novel object into its parts by its dense semantic
correspondence with the base object. These two designs enable the detector to
largely benefit from various data sources and foundation models. In
open-vocabulary part segmentation experiments, our method outperforms the
baseline by 3.3$\sim$7.3 mAP in cross-dataset generalization on PartImageNet,
and improves the baseline by 7.3 novel AP$_{50}$ in cross-category
generalization on Pascal Part. Finally, we train a detector that generalizes to
a wide range of part segmentation datasets while achieving better performance
than dataset-specific training. |
This paper presents a novel detector capable of performing both open-vocabulary object detection and part segmentation, enabling fine-grained understanding of objects and their components. |
Expanding object detection to include part segmentation in an open-vocabulary setting is crucial for intelligent vision systems, enabling deeper object understanding and supporting applications like robotics and image editing. |
The proposed method leverages a vision-language model trained on part, object, and image-level data to achieve multi-granularity alignment between images and text. It further parses novel objects into their parts by establishing dense semantic correspondence with base objects using a pre-trained DINO model. |
The method outperforms baselines by 3.3-7.3 mAP in cross-dataset generalization on PartImageNet.
It achieves a 7.3 novel AP$_{50}$ improvement in cross-category generalization on Pascal Part.
The model trained on joint data outperforms dataset-specific training on various part segmentation datasets. |
Joint training for object detection and part segmentation may not always benefit both tasks equally.
Further research is needed to explore better text prompt engineering for part segmentation. |
open-vocabulary learning, part segmentation, object detection, vision-language model, semantic correspondence |
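The dense semantic correspondence step described above (parsing a novel object into parts by matching it against a base object) can be sketched as nearest-neighbor label transfer over per-location features. The use of cosine similarity and the function names are assumptions for illustration; in the paper the features come from a pre-trained DINO model, which is replaced here by plain vectors:

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def transfer_part_labels(novel_feats, base_feats, base_labels):
    """Give each novel-object location the part label of its best base match."""
    labels = []
    for feat in novel_feats:
        best = max(range(len(base_feats)), key=lambda i: cosine(feat, base_feats[i]))
        labels.append(base_labels[best])
    return labels
```

For example, with base features labeled "head" and "leg", a novel-object location whose feature is closest to the base "head" feature inherits the "head" label.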
2305.11147
Report |
UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild |
Can Qin, Shu Zhang, Ning Yu, Yihao Feng, Xinyi Yang, Yingbo Zhou, Huan Wang, Juan Carlos Niebles, Caiming Xiong, Silvio Savarese, Stefano Ermon, Yun Fu, Ran Xu |
Achieving machine autonomy and human control often represent divergent
objectives in the design of interactive AI systems. Visual generative
foundation models such as Stable Diffusion show promise in navigating these
goals, especially when prompted with arbitrary languages. However, they often
fall short in generating images with spatial, structural, or geometric
controls. The integration of such controls, which can accommodate various
visual conditions in a single unified model, remains an unaddressed challenge.
In response, we introduce UniControl, a new generative foundation model that
consolidates a wide array of controllable condition-to-image (C2I) tasks within
a singular framework, while still allowing for arbitrary language prompts.
UniControl enables pixel-level-precise image generation, where visual
conditions primarily influence the generated structures and language prompts
guide the style and context. To equip UniControl with the capacity to handle
diverse visual conditions, we augment pretrained text-to-image diffusion models
and introduce a task-aware HyperNet to modulate the diffusion models, enabling
the adaptation to different C2I tasks simultaneously. Trained on nine unique
C2I tasks, UniControl demonstrates impressive zero-shot generation abilities
with unseen visual conditions. Experimental results show that UniControl often
surpasses the performance of single-task-controlled methods of comparable model
sizes. This control versatility positions UniControl as a significant
advancement in the realm of controllable visual generation. |
This paper introduces UniControl, a novel unified diffusion model for controllable visual generation. UniControl consolidates various condition-to-image tasks within a single framework, allowing for pixel-level image generation by leveraging both visual conditions and language prompts. |
Existing text-to-image generative models lack pixel-level precision for spatial control, while models like ControlNet, capable of incorporating visual conditions, require separate training for each condition. UniControl addresses this by handling diverse visual conditions in a single unified model, making it more efficient and versatile. |
UniControl leverages a mixture of experts (MOE)-style adapter and a task-aware HyperNet. The MOE adapter captures unique information from different visual conditions, while the task-aware HyperNet enables the model to adapt to different C2I tasks using task instructions. |
UniControl demonstrates impressive zero-shot generation abilities on unseen visual conditions and hybrid tasks.
Experimental results show that UniControl often surpasses the performance of single-task controlled methods of comparable model sizes.
User studies confirm UniControl's superiority over single-task models and official ControlNet implementations on various tasks. |
The model's performance is limited by the potential biases present in the training dataset (LAION-Aesthetics).
While UniControl excels in various tasks, it may face challenges when high-quality human output is desired. |
generative models, diffusion models, controllable image generation, multi-task learning, zero-shot learning |
2305.11116
Report |
LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation |
Yujie Lu, Xianjun Yang, Xiujun Li, Xin Eric Wang, William Yang Wang |
Existing automatic evaluation on text-to-image synthesis can only provide an
image-text matching score, without considering the object-level
compositionality, which results in poor correlation with human judgments. In
this work, we propose LLMScore, a new framework that offers evaluation scores
with multi-granularity compositionality. LLMScore leverages large language
models (LLMs) to evaluate text-to-image models. Initially, it transforms the
image into image-level and object-level visual descriptions. Then an evaluation
instruction is fed into the LLMs to measure the alignment between the
synthesized image and the text, ultimately generating a score accompanied by a
rationale. Our substantial analysis reveals the highest correlation of LLMScore
with human judgments on a wide range of datasets (Attribute Binding Contrast,
Concept Conjunction, MSCOCO, DrawBench, PaintSkills). Notably, our LLMScore
achieves Kendall's tau correlation with human evaluations that is 58.8% and
31.2% higher than the commonly-used text-image matching metrics CLIP and BLIP,
respectively. |
This paper presents LLMScore, a novel framework leveraging Large Language Models (LLMs) for evaluating text-to-image synthesis with a focus on multi-granularity compositionality. |
Existing automatic evaluation metrics for text-to-image synthesis often fail to capture object-level alignment between text prompts and generated images, resulting in poor correlation with human judgments. |
LLMScore first decomposes the image into image-level and object-level descriptions using vision and language models. These descriptions, along with the text prompt, are fed into an LLM (e.g., GPT-4) with specific evaluation instructions to generate a score and a rationale. |
LLMScore achieves significantly higher correlation with human judgments compared to existing metrics like CLIP and BLIP across various datasets.
The framework demonstrates accurate capture of object-level alignment through detailed rationales that highlight similarities and discrepancies between images and text prompts.
LLMScore is adaptable to different evaluation objectives (e.g., overall quality, error counting) by simply modifying the evaluation instructions provided to the LLM. |
The reliance on GPT, a non-free LLM, may limit accessibility and scalability.
Potential biases inherited from the pre-trained LLM could affect evaluation fairness and require careful consideration for specific domains. |
text-to-image synthesis, image evaluation, compositionality, large language models, multi-granularity understanding |
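The LLMScore record above reports agreement with human judgments as Kendall's tau. A minimal sketch of the basic (tau-a) statistic, which ignores ties; whether the paper uses this or a tie-corrected variant is not stated in the record, so treat the choice as an assumption:

```python
def kendall_tau(xs, ys):
    """Kendall's tau-a between two equally long score lists.

    Counts concordant minus discordant pairs, divided by the number of pairs.
    Tied pairs count as neither concordant nor discordant.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two score lists of equal length (>= 2)")
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Here `xs` would hold metric scores and `ys` human ratings for the same images; a value of 1.0 means the metric ranks every pair of images the same way the human raters do.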
2305.11031
Report |
ConsistentNeRF: Enhancing Neural Radiance Fields with 3D Consistency for Sparse View Synthesis |
Shoukang Hu, Kaichen Zhou, Kaiyu Li, Longhui Yu, Lanqing Hong, Tianyang Hu, Zhenguo Li, Gim Hee Lee, Ziwei Liu |
Neural Radiance Fields (NeRF) has demonstrated remarkable 3D reconstruction
capabilities with dense view images. However, its performance significantly
deteriorates under sparse view settings. We observe that learning the 3D
consistency of pixels among different views is crucial for improving
reconstruction quality in such cases. In this paper, we propose ConsistentNeRF,
a method that leverages depth information to regularize both multi-view and
single-view 3D consistency among pixels. Specifically, ConsistentNeRF employs
depth-derived geometry information and a depth-invariant loss to concentrate on
pixels that exhibit 3D correspondence and maintain consistent depth
relationships. Extensive experiments on recent representative works reveal that
our approach can considerably enhance model performance in sparse view
conditions, achieving improvements of up to 94% in PSNR, 76% in SSIM, and 31%
in LPIPS compared to the vanilla baselines across various benchmarks, including
DTU, NeRF Synthetic, and LLFF. |
ConsistentNeRF enhances Neural Radiance Fields by integrating multi-view and single-view 3D consistency to improve performance in sparse view scenarios. |
NeRF struggles in sparse view settings due to the lack of 3D consistency. This work addresses this limitation by enforcing 3D consistency, leading to improved performance. |
ConsistentNeRF utilizes: (1) a depth-derived mask to focus on pixels with multi-view 3D correspondence, and (2) a depth-invariant loss to enforce single-view 3D consistency using monocular depth priors. |
Achieves state-of-the-art results on DTU, NeRF Synthetic, and LLFF datasets under sparse view conditions.
Significantly outperforms vanilla NeRF and other depth-based methods.
Demonstrates the importance of 3D consistency for high-quality novel view synthesis. |
Reliance on pre-trained MVSNeRF for mask derivation limits real-world applicability.
Performance degrades when the target view is far from source views due to limitations in exploiting 3D correspondence. |
neural radiance fields, nerf, sparse view synthesis, 3d consistency, depth estimation |
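The single-view consistency term in the ConsistentNeRF record above can be illustrated with a scale-and-shift invariant depth loss, a common way to compare a rendered depth map with a monocular depth prior whose absolute scale is unknown. This is a minimal sketch under that assumption, not the paper's exact loss; the function name, the least-squares alignment, and the optional `mask` argument (standing in for a depth-derived correspondence mask) are hypothetical.

```python
import numpy as np

def scale_shift_invariant_depth_loss(rendered, prior, mask=None):
    """Mean squared error after aligning `prior` to `rendered` with a scale and shift.

    rendered: depth rendered by the radiance field (any shape).
    prior:    monocular depth prior of the same shape, known only up to scale/shift.
    mask:     optional boolean array selecting the pixels to compare (hypothetical).
    """
    rendered = np.asarray(rendered, dtype=np.float64).ravel()
    prior = np.asarray(prior, dtype=np.float64).ravel()
    if mask is None:
        mask = np.ones(rendered.shape, dtype=bool)
    else:
        mask = np.asarray(mask, dtype=bool).ravel()
    r, p = rendered[mask], prior[mask]
    # Closed-form least squares for min over (s, t) of || s * p + t - r ||^2.
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, r, rcond=None)
    return float(np.mean((s * p + t - r) ** 2))
```

Because the alignment is solved per image, a prior that differs from the rendered depth only by an affine transform incurs no penalty, while a prior that disagrees about relative depth ordering or spacing does.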
2305.10973
Report |
Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold |
Xingang Pan, Ayush Tewari, Thomas Leimkühler, Lingjie Liu, Abhimitra Meka, Christian Theobalt |
Synthesizing visual content that meets users' needs often requires flexible
and precise controllability of the pose, shape, expression, and layout of the
generated objects. Existing approaches gain controllability of generative
adversarial networks (GANs) via manually annotated training data or a prior 3D
model, which often lack flexibility, precision, and generality. In this work,
we study a powerful yet much less explored way of controlling GANs, that is, to
"drag" any points of the image to precisely reach target points in a
user-interactive manner, as shown in Fig.1. To achieve this, we propose
DragGAN, which consists of two main components: 1) a feature-based motion
supervision that drives the handle point to move towards the target position,
and 2) a new point tracking approach that leverages the discriminative
generator features to keep localizing the position of the handle points.
Through DragGAN, anyone can deform an image with precise control over where
pixels go, thus manipulating the pose, shape, expression, and layout of diverse
categories such as animals, cars, humans, landscapes, etc. As these
manipulations are performed on the learned generative image manifold of a GAN,
they tend to produce realistic outputs even for challenging scenarios such as
hallucinating occluded content and deforming shapes that consistently follow
the object's rigidity. Both qualitative and quantitative comparisons
demonstrate the advantage of DragGAN over prior approaches in the tasks of
image manipulation and point tracking. We also showcase the manipulation of
real images through GAN inversion. |
This paper proposes DragGAN, an interactive image editing method that allows users to manipulate the pose, shape, expression, and layout of objects in GAN-generated images by dragging handle points to target points. |
Controllable image synthesis is crucial for real-world applications, and existing methods often lack flexibility, precision, or generality. DragGAN offers an intuitive and versatile solution by enabling precise control over pixel movement within the learned image manifold of a GAN. |
DragGAN employs a two-step optimization process: 1) motion supervision, where a shifted feature patch loss guides the handle points towards targets, and 2) point tracking, using nearest neighbor search in the GAN's feature space to accurately track handle point locations during manipulation. |
DragGAN achieves precise control over handle point movement, enabling manipulation of various spatial attributes across diverse object categories.
The method effectively hallucinates occluded content and preserves object rigidity during deformation, indicating manipulation within the learned image manifold.
DragGAN outperforms existing GAN manipulation and point tracking methods qualitatively and quantitatively, while maintaining interactive performance. |
Editing quality is limited by the diversity of the training data and the presence of texture-less regions.
Potential misuse for creating fake images raises ethical concerns regarding personality rights and privacy. |
generative adversarial networks (gans), interactive image manipulation, point tracking, image editing, generative models |
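The point-tracking step in the DragGAN record above (nearest-neighbor search in the generator's feature space) can be sketched as follows. This is an illustration under stated assumptions: the feature map is a plain NumPy array, the search window is a square of hypothetical size `radius`, and the L1 distance is used; none of the names come from the authors' code.

```python
import numpy as np

def track_point(feat, ref_feature, point, radius):
    """Relocate a handle point by nearest-neighbor search in feature space.

    feat:        (H, W, C) feature map of the current, edited image.
    ref_feature: (C,) feature of the handle point in the initial image.
    point:       (row, col) current estimate of the handle position.
    radius:      half-size of the square search window (hypothetical parameter).
    """
    height, width, _ = feat.shape
    r0, r1 = max(point[0] - radius, 0), min(point[0] + radius + 1, height)
    c0, c1 = max(point[1] - radius, 0), min(point[1] + radius + 1, width)
    patch = feat[r0:r1, c0:c1]
    # L1 distance between every feature in the window and the reference feature.
    dist = np.abs(patch - ref_feature).sum(axis=-1)
    dr, dc = np.unravel_index(np.argmin(dist), dist.shape)
    return (r0 + int(dr), c0 + int(dc))
```

In an editing loop, this update would run after each motion-supervision step so that the next step is applied at the handle's new location.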
2305.10924
Report |
Structural Pruning for Diffusion Models |
Gongfan Fang, Xinyin Ma, Xinchao Wang |
Generative modeling has recently undergone remarkable advancements, primarily
propelled by the transformative implications of Diffusion Probabilistic Models
(DPMs). The impressive capability of these models, however, often entails
significant computational overhead during both training and inference. To
tackle this challenge, we present Diff-Pruning, an efficient compression method
tailored for learning lightweight diffusion models from pre-existing ones,
without the need for extensive re-training. The essence of Diff-Pruning is
encapsulated in a Taylor expansion over pruned timesteps, a process that
disregards non-contributory diffusion steps and ensembles informative gradients
to identify important weights. Our empirical assessment, undertaken across
several datasets highlights two primary benefits of our proposed method: 1)
Efficiency: it enables approximately a 50% reduction in FLOPs at a mere 10%
to 20% of the original training expenditure; 2) Consistency: the pruned
diffusion models inherently preserve generative behavior congruent with their
pre-trained models. Code is available at
https://github.com/VainF/Diff-Pruning. |
Presents Diff-Pruning, an efficient compression method for learning lightweight diffusion models from pre-existing ones, without extensive retraining. |
Addresses the challenge of significant computational overhead during training and inference in diffusion probabilistic models (DPMs). |
Employs Taylor expansion over pruned timesteps, discarding non-contributory diffusion steps and ensembling informative gradients to identify and remove unimportant weights. |
Achieves approximately 50% reduction in FLOPs with only 10% to 20% of the original training expenditure.
Pruned models maintain generative behavior consistent with their pre-trained counterparts.
Outperforms baseline pruning methods and scratch training in terms of FID and SSIM scores. |
Diffusion models exhibit sensitivity to model size reduction, requiring careful consideration of pruning ratios.
Further research can explore enhancing generation quality and consistency of pruned models. |
diffusion models, model compression, network pruning, generative models, taylor expansion |
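The weight-scoring idea in the Diff-Pruning record above (a Taylor expansion accumulated over selected timesteps) can be sketched as a first-order importance score. This is a simplified illustration, assuming per-timestep losses and gradients are already available as arrays; the relative-loss `threshold` rule and all names are assumptions, and real structural pruning would further aggregate these scores over channels or layers.

```python
import numpy as np

def taylor_importance(weights, grads_per_step, losses, threshold=0.05):
    """First-order Taylor importance accumulated over informative timesteps.

    weights:        (N,) parameter values.
    grads_per_step: (T, N) gradient of each timestep's loss w.r.t. the weights.
    losses:         (T,) loss value at each timestep.
    threshold:      timesteps whose loss relative to the maximum loss is at or
                    below this value are skipped (hypothetical selection rule).
    """
    weights = np.asarray(weights, dtype=np.float64)
    grads = np.asarray(grads_per_step, dtype=np.float64)
    losses = np.asarray(losses, dtype=np.float64)
    keep = losses / losses.max() > threshold
    # Sum w * dL_t/dw over the kept timesteps, then take the magnitude.
    accumulated = (weights * grads[keep]).sum(axis=0)
    return np.abs(accumulated)
```

Weights with the smallest scores are the candidates for removal, for example via `np.argsort(scores)[:k]` for a chosen pruning budget `k`.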
2305.10884
Report |
Meta-Auxiliary Network for 3D GAN Inversion |
Bangrui Jiang, Zhenhua Guo, Yujiu Yang |
Real-world image manipulation has achieved fantastic progress in recent
years. GAN inversion, which aims to map the real image to the latent code
faithfully, is the first step in this pipeline. However, existing GAN inversion
methods fail to achieve high reconstruction quality and fast inference at the
same time. In addition, existing methods are built on 2D GANs and lack
explicit mechanisms to enforce multi-view consistency. In this work, we
present a novel meta-auxiliary framework, while leveraging the newly developed
3D GANs as generator. The proposed method adopts a two-stage strategy. In the
first stage, we invert the input image to an editable latent code using
off-the-shelf inversion techniques. The auxiliary network is proposed to refine
the generator parameters with the given image as input, which both predicts
offsets for weights of convolutional layers and sampling positions of volume
rendering. In the second stage, we perform meta-learning to fast adapt the
auxiliary network to the input image, then the final reconstructed image is
synthesized via the meta-learned auxiliary network. Extensive experiments show
that our method achieves better performances on both inversion and editing
tasks. |
This paper presents a novel meta-auxiliary framework for 3D GAN inversion, enabling high-quality reconstruction and fast inference for multi-view consistent image editing. |
Existing GAN inversion methods struggle to balance high reconstruction quality and fast inference. Additionally, they often lack explicit mechanisms for multi-view consistency, which is crucial for realistic 3D image editing. |
The method employs a two-stage strategy: 1) inverting an input image into an editable latent code using existing techniques and an auxiliary network to refine generator parameters; 2) using meta-learning to adapt the auxiliary network to the input image for fast, high-quality reconstruction. |
Achieves state-of-the-art reconstruction quality comparable to optimization-based methods but with significantly faster inference speed.
Enables multi-view consistent image editing by leveraging a 3D-aware GAN generator.
Demonstrates the effectiveness of meta-learning for fast adaptation of the auxiliary network to unseen images. |
The method primarily focuses on facial image editing due to the use of a face-specific 3D GAN generator.
The performance on profile images can be further improved. |
gan inversion, 3d gan, meta-learning, image editing, multi-view consistency |
2305.10855
Report |
TextDiffuser: Diffusion Models as Text Painters |
Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, Furu Wei |
Diffusion models have gained increasing attention for their impressive
generation abilities but currently struggle with rendering accurate and
coherent text. To address this issue, we introduce TextDiffuser, focusing on
generating images with visually appealing text that is coherent with
backgrounds. TextDiffuser consists of two stages: first, a Transformer model
generates the layout of keywords extracted from text prompts, and then
diffusion models generate images conditioned on the text prompt and the
generated layout. Additionally, we contribute the first large-scale text images
dataset with OCR annotations, MARIO-10M, containing 10 million image-text pairs
with text recognition, detection, and character-level segmentation annotations.
We further collect the MARIO-Eval benchmark to serve as a comprehensive tool
for evaluating text rendering quality. Through experiments and user studies, we
show that TextDiffuser is flexible and controllable to create high-quality text
images using text prompts alone or together with text template images, and
conduct text inpainting to reconstruct incomplete images with text. The code,
model, and dataset will be available at https://aka.ms/textdiffuser. |
This paper introduces TextDiffuser, a novel two-stage diffusion model designed for generating images with visually appealing and coherent text. |
Existing diffusion models struggle to render accurate and coherent text, which is essential given the widespread use of text images in various applications. |
TextDiffuser employs a Layout Transformer to generate the layout of keywords from text prompts, then uses diffusion models to generate images conditioned on the text prompt and generated layout. This approach provides flexibility and control over the generation process, allowing for text inpainting and template-based generation. |
TextDiffuser outperforms existing methods in terms of text rendering quality, as demonstrated by quantitative metrics (FID, CLIPScore, OCR evaluation) and user studies on the MARIO-Eval benchmark.
The authors introduce MARIO-10M, a large-scale text image dataset with 10 million image-text pairs and OCR annotations, to address the lack of specialized datasets for text rendering.
TextDiffuser demonstrates controllability in text color through language descriptions, allowing for personalized text image generation. |
TextDiffuser exhibits limitations in generating images with small characters due to the use of VAE for image encoding.
Generating images from long text with many keywords can lead to disordered and overlapped text layouts, potentially due to noisy training data with numerous keywords. |
text rendering, diffusion models, image generation, text inpainting, ocr |
2305.10853
Report |
LDM3D: Latent Diffusion Model for 3D |
Gabriela Ben Melech Stan, Diana Wofk, Scottie Fox, Alex Redden, Will Saxton, Jean Yu, Estelle Aflalo, Shao-Yen Tseng, Fabio Nonato, Matthias Muller, Vasudev Lal |
This research paper proposes a Latent Diffusion Model for 3D (LDM3D) that
generates both image and depth map data from a given text prompt, allowing
users to generate RGBD images from text prompts. The LDM3D model is fine-tuned
on a dataset of tuples containing an RGB image, depth map and caption, and
validated through extensive experiments. We also develop an application called
DepthFusion, which uses the generated RGB images and depth maps to create
immersive and interactive 360-degree-view experiences using TouchDesigner. This
technology has the potential to transform a wide range of industries, from
entertainment and gaming to architecture and design. Overall, this paper
presents a significant contribution to the field of generative AI and computer
vision, and showcases the potential of LDM3D and DepthFusion to revolutionize
content creation and digital experiences. A short video summarizing the
approach can be found at https://t.ly/tdi2. |
This paper introduces LDM3D, a novel Latent Diffusion Model that generates RGB images and their corresponding depth maps from text prompts, enabling the creation of immersive 360° experiences. |
LDM3D advances generative AI by enabling the creation of more immersive and interactive content, pushing the boundaries of digital experience beyond traditional 2D representations. |
The authors fine-tuned Stable Diffusion v1.4 on a dataset of RGB images, depth maps (generated by DPT-Large), and captions, modifying the model architecture to process and generate both data types. They also developed DepthFusion, an application using TouchDesigner to create 360° views from LDM3D outputs. |
LDM3D achieves comparable image quality to Stable Diffusion v1.4 based on FID and CLIP similarity metrics.
LDM3D generates depth maps with accuracy comparable to DPT-Large.
DepthFusion successfully leverages LDM3D outputs to create immersive 360° experiences, demonstrating the potential for novel content creation. |
Fine-tuning the KL-autoencoder for RGBD data slightly decreased reconstruction quality compared to the pre-trained model on RGB data only.
Future work includes exploring alternative autoencoder architectures to further improve reconstruction quality for better content generation. |
generative ai, diffusion models, depth estimation, 360° view synthesis, immersive experiences |
2305.10764
Report |
OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding |
Minghua Liu, Ruoxi Shi, Kaiming Kuang, Yinhao Zhu, Xuanlin Li, Shizhong Han, Hong Cai, Fatih Porikli, Hao Su |
We introduce OpenShape, a method for learning multi-modal joint
representations of text, image, and point clouds. We adopt the commonly used
multi-modal contrastive learning framework for representation alignment, but
with a specific focus on scaling up 3D representations to enable open-world 3D
shape understanding. To achieve this, we scale up training data by ensembling
multiple 3D datasets and propose several strategies to automatically filter and
enrich noisy text descriptions. We also explore and compare strategies for
scaling 3D backbone networks and introduce a novel hard negative mining module
for more efficient training. We evaluate OpenShape on zero-shot 3D
classification benchmarks and demonstrate its superior capabilities for
open-world recognition. Specifically, OpenShape achieves a zero-shot accuracy
of 46.8% on the 1,156-category Objaverse-LVIS benchmark, compared to less than
10% for existing methods. OpenShape also achieves an accuracy of 85.3% on
ModelNet40, outperforming previous zero-shot baseline methods by 20% and
performing on par with some fully-supervised methods. Furthermore, we show that
our learned embeddings encode a wide range of visual and semantic concepts
(e.g., subcategories, color, shape, style) and facilitate fine-grained text-3D
and image-3D interactions. Due to their alignment with CLIP embeddings, our
learned shape representations can also be integrated with off-the-shelf
CLIP-based models for various applications, such as point cloud captioning and
point cloud-conditioned image generation. |
This paper introduces OpenShape, a method for learning multi-modal joint representations of text, image, and point clouds for open-world 3D shape understanding. |
Existing 3D shape understanding methods are limited by the scale of training data and struggle with unseen categories, hindering real-world applications. |
OpenShape scales up training data by ensembling multiple 3D datasets, employs strategies for filtering and enriching noisy text descriptions, explores scaling 3D backbone networks, and introduces a hard negative mining module. |
OpenShape achieves superior zero-shot 3D classification accuracy on ModelNet40 (85.3%) and Objaverse-LVIS (46.8%).
Learned embeddings encode rich visual and semantic concepts, enabling fine-grained text-3D and image-3D retrieval.
OpenShape embeddings can be integrated with off-the-shelf CLIP-based models for point cloud captioning and image generation. |
Training data size is still limited compared to 2D counterparts.
Current shape representations mainly focus on global features, lacking part-level information. |
3d shape understanding, multi-modal representation learning, zero-shot learning, contrastive learning, open-world recognition |
2305.10683
Report |
Paxion: Patching Action Knowledge in Video-Language Foundation Models |
Zhenhailong Wang, Ansel Blume, Sha Li, Genglin Liu, Jaemin Cho, Zineng Tang, Mohit Bansal, Heng Ji |
Action knowledge involves the understanding of textual, visual, and temporal
aspects of actions. We introduce the Action Dynamics Benchmark (ActionBench)
containing two carefully designed probing tasks: Action Antonym and Video
Reversal, which target multimodal alignment capabilities and temporal
understanding skills of the model, respectively. Despite recent video-language
models' (VidLM) impressive performance on various benchmark tasks, our
diagnostic tasks reveal their surprising deficiency (near-random performance)
in action knowledge, suggesting that current models rely on object recognition
abilities as a shortcut for action understanding. To remedy this, we propose a
novel framework, Paxion, along with a new Discriminative Video Dynamics
Modeling (DVDM) objective. The Paxion framework utilizes a Knowledge Patcher
network to encode new action knowledge and a Knowledge Fuser component to
integrate the Patcher into frozen VidLMs without compromising their existing
capabilities. Due to limitations of the widely-used Video-Text Contrastive
(VTC) loss for learning action knowledge, we introduce the DVDM objective to
train the Knowledge Patcher. DVDM forces the model to encode the correlation
between the action text and the correct ordering of video frames. Our extensive
analyses show that Paxion and DVDM together effectively fill the gap in action
knowledge understanding (~50% to 80%), while maintaining or improving
performance on a wide spectrum of both object- and action-centric downstream
tasks. The code and data will be made publicly available for research purposes
at https://github.com/MikeWangWZHL/Paxion.git. |
This paper introduces the Action Dynamics Benchmark (ActionBench), a benchmark to evaluate action knowledge in video-language models, and proposes Paxion, a novel framework to enhance action knowledge in pretrained video-language models without hurting their general capabilities. |
Existing video-language models show a surprising lack of action knowledge, relying on object recognition as a shortcut, which limits their understanding of dynamic events. |
Paxion uses a Knowledge Patcher (KP) based on Perceivers to encode action knowledge and a Knowledge Fuser (KF) to integrate KP into frozen video-language model backbones. It introduces Discriminative Video Dynamics Modeling (DVDM) objectives, including Video-Action Contrastive (VAC) and Action-Temporal Matching (ATM) losses, to train KP. |
Video-language models perform near-randomly on ActionBench, highlighting their deficiency in action knowledge.
Paxion with DVDM significantly improves performance on ActionBench, effectively patching the action knowledge gap.
Paxion maintains or surpasses the original model's performance on object-centric and action-centric downstream tasks, demonstrating its ability to improve both object and action understanding. |
The paper focuses on patching only one type of knowledge (action knowledge).
Future work includes exploring other types of physical knowledge (e.g., object affordances, mental simulation) and fusion with multiple learned Knowledge Patchers. |
action knowledge, video-language models, benchmarking, parameter-efficient fine-tuning, dynamics modeling |
2305.10675
Report |
Tuned Contrastive Learning |
Chaitanya Animesh, Manmohan Chandraker |
In recent times, contrastive learning based loss functions have become
increasingly popular for visual self-supervised representation learning owing
to their state-of-the-art (SOTA) performance. Most of the modern contrastive
learning methods generalize only to one positive and multiple negatives per
anchor. A recent state-of-the-art, supervised contrastive (SupCon) loss,
extends self-supervised contrastive learning to supervised setting by
generalizing to multiple positives and negatives in a batch and improves upon
the cross-entropy loss. In this paper, we propose a novel contrastive loss
function -- Tuned Contrastive Learning (TCL) loss, that generalizes to multiple
positives and negatives in a batch and offers parameters to tune and improve
the gradient responses from hard positives and hard negatives. We provide
theoretical analysis of our loss function's gradient response and show
mathematically how it is better than that of SupCon loss. We empirically
compare our loss function with SupCon loss and cross-entropy loss in supervised
setting on multiple classification-task datasets to show its effectiveness. We
also show the stability of our loss function to a range of hyper-parameter
settings. Unlike SupCon loss which is only applied to supervised setting, we
show how to extend TCL to self-supervised setting and empirically compare it
with various SOTA self-supervised learning methods. Hence, we show that TCL
loss achieves performance on par with SOTA methods in both supervised and
self-supervised settings. |
This paper proposes Tuned Contrastive Learning (TCL) loss, a novel contrastive loss function generalizing to multiple positives and negatives in a batch for both supervised and self-supervised settings. |
TCL loss aims to overcome limitations of the SupCon loss by improving gradient responses from hard positives and hard negatives, leading to performance gains in representation learning. |
TCL loss introduces tunable parameters (k1, k2) to regulate gradient contributions from positives and negatives. It's evaluated against SupCon and cross-entropy losses in supervised settings and compared to SOTA self-supervised learning methods. |
TCL loss consistently outperforms SupCon loss and cross-entropy loss in supervised image classification tasks.
TCL loss demonstrates stability across various hyperparameter settings, including encoder architectures, batch sizes, and augmentation strategies.
In self-supervised settings, TCL loss, using positive triplets, outperforms SimCLR and achieves comparable performance to SOTA methods. |
Choosing TCL's parameters k1 and k2 relies on heuristics.
Future work could explore loss objectives that inherently provide TCL's benefits without introducing additional parameters. |
contrastive learning, supervised learning, self-supervised learning, representation learning, loss function |
2305.10579
Report |
MultiPlaneNeRF: Neural Radiance Field with Non-Trainable Representation |
Dominik Zimny, Artur Kasymov, Adam Kania, Jacek Tabor, Maciej Zięba, Przemysław Spurek |
NeRF is a popular model that efficiently represents 3D objects from 2D
images. However, vanilla NeRF has some important limitations. NeRF must be
trained on each object separately. The training time is long since we encode
the object's shape and color in neural network weights. Moreover, NeRF does not
generalize well to unseen data. In this paper, we present MultiPlaneNeRF -- a
model that simultaneously solves the above problems. Our model works directly
on 2D images. We project 3D points on 2D images to produce non-trainable
representations. The projection step is not parametrized and a very shallow
decoder can efficiently process the representation. Furthermore, we can train
MultiPlaneNeRF on a large data set and force our implicit decoder to generalize
across many objects. Consequently, we need only replace the 2D images (without
additional training) to produce a NeRF representation of the new object. In the
experimental section, we demonstrate that MultiPlaneNeRF achieves results
comparable to state-of-the-art models for synthesizing new views and has
generalization properties. Additionally, MultiPlane decoder can be used as a
component in large generative models like GANs. |
This paper proposes MultiPlaneNeRF, a novel NeRF model that utilizes non-trainable representations of 3D objects using pre-existing 2D images, enabling efficient training of a small implicit decoder for view synthesis. |
Existing NeRF models suffer from long training times, lack of generalization to unseen data, and limitations in scalability. MultiPlaneNeRF aims to address these issues by leveraging fixed 2D images as planar representations. |
MultiPlaneNeRF projects 3D points onto a fixed set of 2D images to create non-trainable representations. A shallow decoder then aggregates color and position information from projected points to predict RGB colors and volume density, trained using a vanilla NeRF loss function. |
MultiPlaneNeRF achieves rendering results comparable to state-of-the-art models like NeRF and NSVF while using fewer trainable parameters.
The model demonstrates generalization capabilities by synthesizing novel views of unseen objects from different classes by simply replacing the input image representation.
The MultiPlane decoder can be integrated into larger generative architectures like GANs, yielding comparable results to models like EG3D with the benefit of interpretable representations. |
The trade-off between rendering quality and generalization properties requires further investigation.
Future work can explore extending MultiPlaneNeRF to handle dynamic scenes. |
neural radiance fields, view synthesis, 3d object representation, generalization, multiplane representation |
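The non-trainable representation in the MultiPlaneNeRF record above (projecting 3D points onto a fixed set of 2D images) can be sketched with a pinhole camera model. This is an illustrative sketch only: the camera parameterization `(K, R, t)`, nearest-pixel sampling, and the choice to concatenate color with pixel coordinates are assumptions, and points behind a camera are not handled.

```python
import numpy as np

def project(point, K, R, t):
    """Project a 3D world point to pixel coordinates (u, v) with a pinhole camera."""
    cam = R @ point + t          # world frame -> camera frame
    uvw = K @ cam                # camera frame -> homogeneous pixel coordinates
    return uvw[:2] / uvw[2]

def plane_features(point, images, cameras):
    """Build a fixed feature vector for `point` from a set of posed 2D images.

    images:  list of (H, W, 3) arrays.
    cameras: list of (K, R, t) tuples, one per image (hypothetical layout).
    """
    feats = []
    for img, (K, R, t) in zip(images, cameras):
        u, v = project(point, K, R, t)
        height, width, _ = img.shape
        # Nearest-pixel lookup, clamped to the image bounds.
        col = int(np.clip(np.rint(u), 0, width - 1))
        row = int(np.clip(np.rint(v), 0, height - 1))
        feats.append(np.concatenate([img[row, col], [u, v]]))
    return np.concatenate(feats)
```

Nothing in this step has learnable parameters; only the small decoder that consumes the resulting vector would be trained, which is why swapping in images of a new object requires no retraining of the representation itself.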
2305.10513
Report |
Learning Pose Image Manifolds Using Geometry-Preserving GANs and Elasticae |
Shenyuan Liang, Pavan Turaga, Anuj Srivastava |
This paper investigates the challenge of learning image manifolds,
specifically pose manifolds, of 3D objects using limited training data. It
proposes a DNN approach to manifold learning and for predicting images of
objects for novel, continuous 3D rotations. The approach uses two distinct
concepts: (1) Geometric Style-GAN (Geom-SGAN), which maps images to
low-dimensional latent representations and maintains the (first-order) manifold
geometry. That is, it seeks to preserve the pairwise distances between base
points and their tangent spaces, and (2) uses Euler's elastica to smoothly
interpolate between directed points (points + tangent directions) in the
low-dimensional latent space. When mapped back to the larger image space, the
resulting interpolations resemble videos of rotating objects. Extensive
experiments establish the superiority of this framework in learning paths on
rotation manifolds, both visually and quantitatively, relative to
state-of-the-art GANs and VAEs. |
This paper proposes a novel deep neural network (DNN) framework for learning image manifolds of 3D objects, specifically focusing on pose manifolds, with limited training data. |
Learning pose manifolds is crucial for predicting object appearance under novel viewpoints and facilitates various applications in computer vision. |
The framework utilizes a two-step approach: (1) Geometric Style-GAN (Geom-SGAN) maps images to a low-dimensional latent space while preserving pairwise distances and tangent space geometry. (2) Euler's elastica interpolates between directed points in the latent space, generating smooth rotation paths when mapped back to the image space. |
The proposed method outperforms state-of-the-art GANs and VAEs in generating realistic and accurate rotation paths for various 3D objects.
Quantitative evaluations using squared errors demonstrate the superior performance of the approach in preserving both image and tangent space geometry.
The learnt pose manifolds can be utilized for applications such as image denoising by finding the nearest neighbor on the manifold. |
The current implementation primarily focuses on rotation manifolds and assumes other imaging conditions are fixed.
Future work includes extending the framework to handle more complex transformations and exploring its application in other computer vision tasks. |
image manifolds, pose manifolds, generative models, deep learning, euler's elastica |
2305.10474
Report |
Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models |
Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-Yu Liu, Yogesh Balaji |
Despite tremendous progress in generating high-quality images using diffusion
models, synthesizing a sequence of animated frames that are both photorealistic
and temporally coherent is still in its infancy. While off-the-shelf
billion-scale datasets for image generation are available, collecting similar
video data of the same scale is still challenging. Also, training a video
diffusion model is computationally much more expensive than its image
counterpart. In this work, we explore finetuning a pretrained image diffusion
model with video data as a practical solution for the video synthesis task. We
find that naively extending the image noise prior to video noise prior in video
diffusion leads to sub-optimal performance. Our carefully designed video noise
prior leads to substantially better performance. Extensive experimental
validation shows that our model, Preserve Your Own Correlation (PYoCo), attains
SOTA zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks. It
also achieves SOTA video generation quality on the small-scale UCF-101
benchmark with a 10× smaller model using significantly less computation
than the prior art. |
This paper proposes a new video diffusion noise prior, called Preserve Your Own Correlation (PYoCo), tailored for fine-tuning text-to-image diffusion models for text-to-video generation. |
Training large-scale text-to-video diffusion models from scratch is computationally expensive and data-intensive. Leveraging pre-trained image diffusion models through fine-tuning is a practical alternative, but naively extending the image noise prior to video diffusion leads to sub-optimal performance. |
The authors analyze the correlation of noise maps in video frames and design a progressive noise model that captures temporal correlations. They fine-tune a pre-trained text-to-image diffusion model (eDiff-I) with the proposed noise prior and incorporate techniques like temporal attention, joint image-video fine-tuning, cascaded generation, and ensemble denoisers. |
PYoCo achieves state-of-the-art zero-shot text-to-video results on UCF-101 and MSR-VTT benchmarks.
The method outperforms previous approaches on unconditional video generation on UCF-101 while using a significantly smaller model size and less computation.
Ablation studies confirm that the proposed correlated noise model is superior to training from scratch or fine-tuning with an independent noise model. |
The paper primarily focuses on generating videos with short durations.
The impact of different video datasets on model performance requires further investigation. |
video generation, diffusion models, text-to-video, noise prior, fine-tuning |
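The progressive noise prior summarized in this report can be illustrated with a minimal sketch. The report only states that the noise model is progressive and captures temporal correlations; the first-order autoregressive form below, the parameter name `alpha`, and the function name `progressive_noise` are assumptions made for illustration, not the authors' code.

```python
import numpy as np

def progressive_noise(num_frames, frame_shape, alpha, rng):
    """Sample per-frame Gaussian noise in which each frame is correlated
    with the previous one (assumed AR(1) form, unit marginal variance)."""
    scale = np.sqrt(1.0 + alpha ** 2)
    noise = np.empty((num_frames, *frame_shape))
    noise[0] = rng.standard_normal(frame_shape)
    for i in range(1, num_frames):
        fresh = rng.standard_normal(frame_shape)
        # Mixing weights are chosen so that every frame keeps variance 1.
        noise[i] = (alpha * noise[i - 1] + fresh) / scale
    return noise

rng = np.random.default_rng(0)
eps = progressive_noise(8, (64, 64), alpha=2.0, rng=rng)
```

Setting `alpha=0` recovers independent per-frame noise, which corresponds to the naive extension of the image noise prior that the report describes as sub-optimal.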
2305.10456
Report |
LPMM: Intuitive Pose Control for Neural Talking-Head Model via Landmark-Parameter Morphable Model |
Kwangho Lee, Patrick Kwon, Myung Ki Lee, Namhyuk Ahn, Junsoo Lee |
While current talking head models are capable of generating photorealistic
talking head videos, they provide limited pose controllability. Most methods
require specific video sequences that contain exactly the desired head pose,
which is far from user-friendly pose control. Three-dimensional morphable
models (3DMM) offer semantic pose control, but they fail to capture certain
expressions. We present a novel method that utilizes parametric control of head
orientation and facial expression over a pre-trained neural talking-head model.
To enable this, we introduce a landmark-parameter morphable model (LPMM), which
offers control over the facial landmark domain through a set of semantic
parameters. Using LPMM, it is possible to adjust specific head pose factors,
without distorting other facial attributes. The results show our approach
provides intuitive rig-like control over neural talking head models, allowing
both parameter and image-based inputs. |
This paper introduces LPMM (Landmark-Parameter Morphable Model), a method for intuitive pose control of neural talking-head models using semantic parameters. |
Current talking-head models lack user-friendly pose control, often requiring specific video sequences or offering limited semantic control. |
The method involves training an LP-regressor to extract LPMM parameters from facial images and an LP-adaptor to convert these parameters into latent codes for a pre-trained talking-head generator. |
LPMM enables independent control of facial expressions and head orientation using semantic parameters.
The method allows both parameter-based and image-based pose control, providing flexibility to users.
Evaluation shows superior performance compared to existing methods like StyleRig, especially for in-plane rotations and complex expressions. |
Discovering parameter combinations for intuitive control of complex expressions is an open challenge.
The method's reliance on a pre-trained talking-head model limits its generalizability to unseen identities. |
talking-head synthesis, pose control, facial reenactment, landmark-parameter morphable model, semantic control |
2305.10438
Report |
IMAGINATOR: Pre-Trained Image+Text Joint Embeddings using Word-Level Grounding of Images |
Varuna Krishna, S Suryavardan, Shreyash Mishra, Sathyanarayanan Ramamoorthy, Parth Patwa, Megha Chakraborty, Aman Chadha, Amitava Das, Amit Sheth |
Word embeddings, i.e., semantically meaningful vector representation of
words, are largely influenced by the distributional hypothesis "You shall know
a word by the company it keeps" (Harris, 1954), whereas modern prediction-based
neural network embeddings rely on design choices and hyperparameter
optimization. Word embeddings such as Word2Vec and GloVe capture
contextuality and real-world analogies well, but contemporary convolution-based
image embeddings such as VGGNet and AlexNet do not capture contextual knowledge.
The popular king-queen analogy does not hold true for most commonly used vision
embeddings.
In this paper, we introduce a pre-trained joint embedding (JE), named
IMAGINATOR, trained at the object level on 21K distinct image objects from 1M
image+text pairs. JE is a way to encode multimodal data into a vector space
where the text modality serves as the grounding key, to which the complementary
modality (in this case, the image) is anchored. IMAGINATOR encapsulates three
individual representations: (i) object-object co-location, (ii) word-object
co-location, and (iii) word-object correlation. These three ways capture
complementary aspects of the two modalities which are further combined to
obtain the final JEs.
Generated JEs are intrinsically evaluated to assess how well they capture the
contextuality and real-world analogies. We also evaluate pre-trained IMAGINATOR
JEs on three downstream tasks: (i) image captioning, (ii) Image2Tweet, and
(iii) text-based image retrieval. IMAGINATOR establishes a new standard on the
aforementioned downstream tasks by outperforming the current SoTA on all the
selected tasks. IMAGINATOR will be made publicly available. The code is
available at https://github.com/varunakk/IMAGINATOR |
This paper introduces IMAGINATOR, a pre-trained joint embedding model for vision-language tasks that captures contextual relationships between words and objects. |
Current image embeddings struggle to capture contextuality and real-world analogies, hindering performance in tasks like image captioning and retrieval. IMAGINATOR aims to address this by incorporating distributional semantics from NLP. |
IMAGINATOR utilizes an object detection model (Detic) and leverages three co-location matrices (object-object, word-object, word-word) to learn joint embeddings. It uses PPMI, context distribution smoothing, SVD, and eigenvalue weighting to enhance representation quality. |
IMAGINATOR outperforms state-of-the-art models on intrinsic evaluations of word contextuality and image similarity.
It achieves superior performance on downstream tasks including image captioning, Image2Tweet, and text-based image retrieval.
The proposed $BERT_{IMAGINATOR}$ architecture effectively leverages joint embeddings for compositional understanding of image-text pairs. |
The performance of IMAGINATOR is limited by the capabilities of existing object detection techniques, which only identify a limited set of objects.
Future work includes exploring contrastive learning to improve object representations and investigating vision transformers with positional encoding for finer-grained cross-modal connections. |
joint embeddings, multimodal learning, image captioning, image retrieval, distributional semantics |
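The PPMI, context distribution smoothing, SVD, and eigenvalue weighting steps named in this report follow a standard count-based embedding recipe, sketched below. The co-location counts, the smoothing exponent 0.75, and the eigenvalue weight 0.5 are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def ppmi(counts, smoothing=0.75):
    """Positive PMI with context distribution smoothing on the column marginals."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    p_xy = counts / total
    p_x = counts.sum(axis=1, keepdims=True) / total
    ctx = counts.sum(axis=0, keepdims=True) ** smoothing
    p_c = ctx / ctx.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_xy / (p_x * p_c))
    pmi[~np.isfinite(pmi)] = 0.0   # zero counts carry no evidence
    return np.maximum(pmi, 0.0)    # keep only positive associations

def svd_embeddings(matrix, dim, eig_weight=0.5):
    """Truncated SVD; singular values are raised to eig_weight before scaling."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :dim] * (s[:dim] ** eig_weight)

# Hypothetical 3x3 co-location counts (rows: objects, columns: context words).
counts = [[10, 0, 2], [0, 8, 1], [3, 1, 6]]
embeddings = svd_embeddings(ppmi(counts), dim=2)
```

In the report's setting the same pipeline would be applied to each of the co-location matrices before the resulting vectors are combined into the final joint embedding.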
2305.10431
Report |
FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention |
Guangxuan Xiao, Tianwei Yin, William T. Freeman, Frédo Durand, Song Han |
Diffusion models excel at text-to-image generation, especially in
subject-driven generation for personalized images. However, existing methods
are inefficient due to the subject-specific fine-tuning, which is
computationally intensive and hampers efficient deployment. Moreover, existing
methods struggle with multi-subject generation as they often blend features
among subjects. We present FastComposer which enables efficient, personalized,
multi-subject text-to-image generation without fine-tuning. FastComposer uses
subject embeddings extracted by an image encoder to augment the generic text
conditioning in diffusion models, enabling personalized image generation based
on subject images and textual instructions with only forward passes. To address
the identity blending problem in the multi-subject generation, FastComposer
proposes cross-attention localization supervision during training, enforcing
the attention of reference subjects localized to the correct regions in the
target images. Naively conditioning on subject embeddings results in subject
overfitting. FastComposer proposes delayed subject conditioning in the
denoising step to maintain both identity and editability in subject-driven
image generation. FastComposer generates images of multiple unseen individuals
with different styles, actions, and contexts. It achieves
300$\times$-2500$\times$ speedup compared to fine-tuning-based methods and
requires zero extra storage for new subjects. FastComposer paves the way for
efficient, personalized, and high-quality multi-subject image creation. Code,
model, and dataset are available at
https://github.com/mit-han-lab/fastcomposer. |
FastComposer is a tuning-free method for personalized, multi-subject text-to-image generation that uses a pre-trained vision encoder to achieve efficiency and accessibility. |
Existing methods for subject-driven text-to-image generation are inefficient due to subject-specific fine-tuning and struggle with multi-subject generation due to identity blending. |
The paper proposes to augment text prompts with visual features from reference subject images using a pre-trained vision encoder and a multi-layer perceptron (MLP). They introduce cross-attention localization during training to prevent identity blending and delayed subject conditioning to balance identity preservation with text guidance. |
FastComposer achieves state-of-the-art performance on both single-subject and multi-subject image generation benchmarks.
It significantly outperforms optimization-based methods (DreamBooth, Textual Inversion, Custom Diffusion) in identity preservation while maintaining competitive prompt consistency.
FastComposer is significantly faster and more memory-efficient than fine-tuning-based approaches, achieving 300x-2500x speedup and 2.8x-6.7x memory saving. |
The current training dataset (FFHQ) is limited in size and diversity, mainly containing headshots of human faces.
The model is primarily human-centric due to the scarcity of large-scale, multi-subject datasets featuring other subjects like animals. |
text-to-image generation, diffusion models, personalization, multi-subject generation, tuning-free |
2305.10293
Report |
Infinite Class Mixup |
Thomas Mensink, Pascal Mettes |
Mixup is a widely adopted strategy for training deep networks, where
additional samples are augmented by interpolating inputs and labels of training
pairs. Mixup has been shown to improve classification performance, network
calibration, and out-of-distribution generalisation. While effective, a
cornerstone of Mixup, namely that networks learn linear behaviour patterns
between classes, is only indirectly enforced since the output interpolation is
performed at the probability level. This paper seeks to address this limitation
by mixing the classifiers directly instead of mixing the labels for each mixed
pair. We propose to define the target of each augmented sample as a uniquely
new classifier, whose parameters are a linear interpolation of the classifier
vectors of the input pair. The space of all possible classifiers is continuous
and spans all interpolations between classifier pairs. To make optimisation
tractable, we propose a dual-contrastive Infinite Class Mixup loss, where we
contrast the classifier of a mixed pair to both the classifiers and the
predicted outputs of other mixed pairs in a batch. Infinite Class Mixup is
generic in nature and applies to many variants of Mixup. Empirically, we show
that it outperforms standard Mixup and variants such as RegMixup and Remix on
balanced, long-tailed, and data-constrained benchmarks, highlighting its broad
applicability. |
This paper proposes Infinite Class Mixup, a novel training strategy that improves upon traditional Mixup by directly interpolating image classifiers instead of just label probabilities. |
Traditional Mixup indirectly enforces linear behavior between classes by interpolating labels at the probability level. This paper argues that directly interpolating classifiers provides a stronger and more direct enforcement, leading to better generalization. |
The method defines a unique classifier for each mixed image pair by linearly interpolating classifier vectors of original classes. To handle the infinite possibilities of interpolated classifiers, a dual-contrastive loss function is introduced, contrasting each mixed pair against other classifiers and mixed images within the same batch. |
Infinite Class Mixup consistently outperforms standard Mixup and its variants like RegMixup and Remix across various benchmarks, especially in data-constrained and imbalanced settings.
The dual-contrastive loss function, employing both class-axis and pair-axis contrastive learning, is shown to be crucial for the effectiveness of Infinite Class Mixup.
Analyses reveal that Infinite Class Mixup leads to lower confidence for ambiguous interpolations and better differentiation between interpolated classes compared to standard Mixup. |
The paper primarily focuses on image classification tasks. Exploring its generalization to other data modalities like point clouds or graphs remains a potential future direction.
Further investigation into the impact of different contrastive learning strategies and their potential benefits for specific tasks could be valuable. |
mixup, deep learning, image classification, contrastive learning, data augmentation |
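The dual-contrastive loss described in this report can be sketched as follows. This is a simplified reading, not the authors' implementation: `mixed_feats` stands for the network outputs on the mixed inputs, a single mixing coefficient `lam` is shared across the batch, and the two terms are combined with equal weight. All of these choices are assumptions.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def infinite_class_mixup_loss(mixed_feats, labels_a, labels_b, class_weights, lam):
    """Contrast each mixed sample with the interpolated classifier of its own
    pair, along both the classifier axis and the sample axis."""
    # One new classifier per mixed pair: interpolate the two class vectors.
    mixed_classifiers = lam * class_weights[labels_a] + (1 - lam) * class_weights[labels_b]
    # logits[i, j]: score of mixed sample i under the classifier of pair j.
    logits = mixed_feats @ mixed_classifiers.T
    diag = np.arange(len(logits))
    class_axis = -log_softmax(logits, axis=1)[diag, diag].mean()  # vs. other classifiers
    pair_axis = -log_softmax(logits, axis=0)[diag, diag].mean()   # vs. other mixed samples
    return class_axis + pair_axis

# Toy batch: 4 classes, 3 mixed pairs, features aligned with their own classifier.
W = 5.0 * np.eye(4)
la, lb = np.array([0, 1, 2]), np.array([1, 2, 3])
feats = 0.7 * W[la] + 0.3 * W[lb]
loss = infinite_class_mixup_loss(feats, la, lb, W, lam=0.7)
```

Because the targets are interpolated classifiers rather than a fixed label set, the softmax is taken over the classifiers present in the batch, which is what keeps the otherwise infinite class space tractable.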
2305.10223
Report |
NAI$_2$: Learning Noise-Aware Illumination-Interpolator for Unsupervised Low-Light Image Enhancement |
Xiaofeng Liu, Jiaxin Gao, Xin Fan, Risheng Liu |
Contemporary Low-Light Image Enhancement (LLIE) techniques have made notable
advancements in preserving image details and enhancing contrast, achieving
commendable results on specific datasets. Nevertheless, these approaches
encounter persistent challenges in efficiently mitigating dynamic noise and
accommodating diverse low-light scenarios. Insufficient constraints on complex
pixel-wise mapping learning lead to overfitting to specific types of noise and
artifacts associated with low-light conditions, reducing effectiveness in
variable lighting scenarios. To this end, we first propose a method for
estimating the noise level in low-light images quickly and accurately.
This facilitates precise denoising, prevents over-smoothing, and adapts to
dynamic noise patterns. Subsequently, we devise a Learnable Illumination
Interpolator (LII), which employs learnable interpolation operations between
the input and unit vector to satisfy general constraints between illumination
and input. Finally, we introduce a self-regularization loss that incorporates
intrinsic image properties and essential visual attributes to guide the output
towards meeting human visual expectations. Comprehensive experiments validate
the competitiveness of our proposed algorithm in both qualitative and
quantitative assessments. Notably, our noise estimation method, with linear
time complexity and suitable for various denoisers, significantly improves both
denoising and enhancement performance. Benefiting from this, our approach
achieves a 0.675dB PSNR improvement on the LOL dataset and 0.818dB on the MIT
dataset on LLIE task, even compared to supervised methods. |
This paper proposes NAI$_2$, a novel unsupervised Low-Light Image Enhancement (LLIE) method employing a denoising-first and enhancing-later pipeline. |
Existing LLIE techniques struggle to effectively mitigate dynamic noise and adapt to diverse low-light scenarios, often overfitting to specific datasets or noise types. |
NAI$_2$ leverages a novel noise estimation method based on high-order image gradients for precise denoising. It then uses a Learnable Illumination Interpolator (LII) with a self-regularization loss based on natural image statistics to ensure natural color and illumination. |
NAI$_2$ achieves state-of-the-art performance on benchmark datasets like MIT and LOL, surpassing supervised methods in some cases.
The proposed noise estimation method significantly improves denoising efficacy and efficiency compared to traditional methods.
LII, with its inherent structure constraint, ensures smooth yet structure-aware illumination maps, leading to visually pleasing enhancements. |
The noise estimation method currently focuses on Gaussian noise and requires further exploration for other noise types.
Future work will investigate incorporating data inline distribution for enhanced performance. |
low-light image enhancement, noise estimation, illumination interpolation, unsupervised learning, image restoration |
2305.10210
Report |
Object Re-Identification from Point Clouds |
Benjamin Thérien, Chengjie Huang, Adrian Chow, Krzysztof Czarnecki |
Object re-identification (ReID) from images plays a critical role in
application domains of image retrieval (surveillance, retail analytics, etc.)
and multi-object tracking (autonomous driving, robotics, etc.). However,
systems that additionally or exclusively perceive the world from depth sensors
are becoming more commonplace without any corresponding methods for object
ReID. In this work, we fill the gap by providing the first large-scale study of
object ReID from point clouds and establishing its performance relative to
image ReID. To enable such a study, we create two large-scale ReID datasets
with paired image and LiDAR observations and propose a lightweight matching
head that can be concatenated to any set or sequence processing backbone (e.g.,
PointNet or ViT), creating a family of comparable object ReID networks for both
modalities. Run in Siamese style, our proposed point cloud ReID networks can
make thousands of pairwise comparisons in real-time ($10$ Hz). Our findings
demonstrate that their performance increases with higher sensor resolution and
approaches that of image ReID when observations are sufficiently dense. Our
strongest network trained at the largest scale achieves ReID accuracy exceeding
$90\%$ for rigid objects and $85\%$ for deformable objects (without any
explicit skeleton normalization). To our knowledge, we are the first to study
object re-identification from real point cloud observations. |
This paper presents the first large-scale study of object re-identification (ReID) from point clouds, comparing its performance to image-based ReID. |
Object ReID is crucial for applications like multi-object tracking in autonomous driving and robotics, and using point clouds from LiDAR sensors can offer advantages over traditional image-based methods, especially as depth sensor resolution increases. |
The authors create two large-scale ReID datasets with paired image and LiDAR data from nuScenes and Waymo Open Dataset. They propose a lightweight, real-time matching head (RTMM) that can be used with various point cloud processing backbones (PointNet, DGCNN, Point-Transformer) for pairwise object comparisons. |
Point cloud ReID performance approaches image ReID with sufficiently dense point clouds.
Performance improves significantly with higher LiDAR sensor resolution, suggesting a promising future for point cloud ReID.
ReID accuracy exceeding 90% for rigid objects and 85% for deformable objects is achieved with their best model. |
The study is limited by computational resources, preventing training on all possible data samples.
Future work can explore fusing LiDAR and camera data, and incorporating geometric priors for improved performance. |
object re-identification, point cloud, lidar, autonomous driving, multi-object tracking |
2305.10028
Report |
Pyramid Diffusion Models For Low-light Image Enhancement |
Dewei Zhou, Zongxin Yang, Yi Yang |
Recovering noise-covered details from low-light images is challenging, and
the results given by previous methods leave room for improvement. Recent
diffusion models show realistic and detailed image generation through a
sequence of denoising refinements and motivate us to introduce them to
low-light image enhancement for recovering realistic details. However, we found
two problems when doing this, i.e., 1) diffusion models keep constant
resolution in one reverse process, which limits the speed; 2) diffusion models
sometimes result in global degradation (e.g., RGB shift). To address the above
problems, this paper proposes a Pyramid Diffusion model (PyDiff) for low-light
image enhancement. PyDiff uses a novel pyramid diffusion method to perform
sampling in a pyramid resolution style (i.e., progressively increasing
resolution in one reverse process). Pyramid diffusion makes PyDiff much faster
than vanilla diffusion models and introduces no performance degradation.
Furthermore, PyDiff uses a global corrector to alleviate the global degradation
that may occur in the reverse process, significantly improving the performance
and making the training of diffusion models easier with little additional
computational consumption. Extensive experiments on popular benchmarks show
that PyDiff achieves superior performance and efficiency. Moreover, PyDiff can
generalize well to unseen noise and illumination distributions. |
This paper proposes PyDiff, a novel pyramid diffusion model for low-light image enhancement that achieves state-of-the-art performance and efficiency. |
Existing methods for low-light image enhancement struggle to recover fine details, often resulting in blurred outputs. Diffusion models excel at generating realistic details through iterative refinement, making them suitable for this task. |
PyDiff utilizes a pyramid diffusion method that performs sampling at progressively increasing resolutions within a single reverse process, leading to significant speed improvements. It also introduces a global corrector to alleviate global degradations like RGB shifts often occurring in diffusion models. |
PyDiff achieves state-of-the-art performance on popular benchmarks like LOL and LOLv2, outperforming previous methods in both quantitative metrics and visual quality.
The pyramid diffusion method significantly accelerates inference, making PyDiff nearly twice as fast as the previous state-of-the-art method LLFlow.
PyDiff exhibits strong generalization capabilities, effectively handling unseen noise and illumination distributions. |
The global corrector, while effective, introduces an additional hyperparameter (correction threshold) that requires tuning.
The current implementation of PyDiff focuses on single-image enhancement. Exploring its potential for video enhancement could be a promising direction. |
low-light image enhancement, diffusion models, pyramid diffusion, global corrector, image restoration |
2305.09967
Report |
Variable Length Embeddings |
Johnathan Chiu, Andi Gu, Matt Zhou |
In this work, we introduce a novel deep learning architecture, Variable
Length Embeddings (VLEs), an autoregressive model that can produce a latent
representation composed of an arbitrary number of tokens. As a proof of
concept, we demonstrate the capabilities of VLEs on tasks that involve
reconstruction and image decomposition. We evaluate our experiments on a mix of
the iNaturalist and ImageNet datasets and find that VLEs achieve comparable
reconstruction results to a state of the art VAE, using less than a tenth of
the parameters. |
This paper introduces Variable Length Embeddings (VLEs), an autoregressive model that generates a latent representation with a variable number of tokens, allowing for flexible and efficient image representation. |
VLEs offer a more efficient and interpretable way to represent images compared to traditional fixed-dimensional autoencoders, potentially benefiting downstream tasks like classification, captioning, and generative modeling. |
The authors develop two VLE variants: vanilla VLE, which focuses on pixel-level reconstruction, and masked VLE, which introduces a masking mechanism to encourage semantically meaningful token representation. Both variants are trained in a self-supervised manner. |
VLEs achieve comparable reconstruction performance to state-of-the-art VAEs with significantly fewer parameters (less than one-tenth).
Vanilla VLE demonstrates a strong dependence on pixel distribution complexity, while masked VLE shows potential for capturing semantically distinct objects.
Masked VLE exhibits promising results in decomposing images into human-interpretable masks, highlighting its potential for downstream tasks. |
The current masking mechanism in masked VLE, while promising, could be further improved by incorporating image segmentation or saliency priors.
Future work includes exploring the integration of other modalities, such as image captioning, to enhance the model's understanding of contextual relationships. |
autoencoders, variable length embeddings, image representation learning, unsupervised learning, generative modeling |
2305.09900
Report |
Efficient Equivariant Transfer Learning from Pretrained Models |
Sourya Basu, Pulkit Katdare, Prasanna Sattigeri, Vijil Chenthamarakshan, Katherine Driggs-Campbell, Payel Das, Lav R. Varshney |
Efficient transfer learning algorithms are key to the success of foundation
models on diverse downstream tasks even with limited data. Recent works of Basu
et al. (2023) and Kaba et al. (2022) propose group averaging (equitune) and
optimization-based methods, respectively, over features from group-transformed
inputs to obtain equivariant outputs from non-equivariant neural networks.
While Kaba et al. (2022) are only concerned with training from scratch, we find
that equitune performs poorly on equivariant zero-shot tasks despite good
finetuning results. We hypothesize that this is because pretrained models
provide better quality features for certain transformations than others and
simply averaging them is deleterious. Hence, we propose $\lambda$-equitune that
averages the features using importance weights, $\lambda$s. These weights are
learned directly from the data using a small neural network, leading to
excellent zero-shot and finetuned results that outperform equitune. Further, we
prove that $\lambda$-equitune is equivariant and a universal approximator of
equivariant functions. Additionally, we show that the method of Kaba et al.
(2022) used with appropriate loss functions, which we call equizero, also gives
excellent zero-shot and finetuned performance. Both equitune and equizero are
special cases of $\lambda$-equitune. To show the simplicity and generality of
our method, we validate on a wide range of diverse applications and models such
as 1) image classification using CLIP, 2) deep Q-learning, 3) fairness in
natural language generation (NLG), 4) compositional generalization in
languages, and 5) image classification using pretrained CNNs such as ResNet and
AlexNet. |
This paper introduces lambda-equitune, a method for improving the zero-shot and fine-tuning performance of pretrained models on equivariant tasks by learning importance weights for features extracted from group-transformed inputs. |
Efficient transfer learning algorithms are crucial for leveraging pretrained models in diverse downstream tasks with limited data, especially those exhibiting group equivariance. |
Lambda-equitune extends the concept of equitune by incorporating learnable importance weights, lambda, assigned to features obtained from group-transformed inputs. These weights are learned directly from the data using a small neural network and are used for weighted group averaging. |
Lambda-equitune outperforms equitune and is competitive with equizero, another proposed method based on optimizing a proxy loss function over group transformations, in zero-shot learning.
For fine-tuning, lambda-equitune often surpasses both equitune and equizero.
Both lambda-equitune and equizero demonstrate superior performance compared to non-equivariant pretrained models and equitune across various tasks, including image classification using CLIP, deep Q-learning, fairness in natural language generation, compositional generalization in languages, and image classification using pretrained CNNs. |
The current work focuses on finite groups; extending it to continuous groups requires further research.
Future work can explore optimizing the design of equality and neutral sets used in fairness tasks for more general demographic groups. |
equivariant deep learning, transfer learning, zero-shot learning, fine-tuning, group equivariance |
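The weighted group averaging summarized in this report can be sketched for the simplest case: an invariant output (such as class logits) under the group of 90-degree rotations. The choice of group, the toy `model`, and the hand-written `weight_fn` are assumptions for illustration; in the paper the weights come from a small learned network.

```python
import numpy as np

def c4_orbit(image):
    """All four 90-degree rotations of a square image."""
    return [np.rot90(image, k) for k in range(4)]

def lambda_equitune(image, model, weight_fn):
    """Importance-weighted average of model outputs over the group orbit.
    With weight_fn returning a constant, this reduces to plain group averaging."""
    orbit = c4_orbit(image)
    outputs = np.stack([model(g_x) for g_x in orbit])
    weights = np.array([weight_fn(g_x) for g_x in orbit])
    return (weights[:, None] * outputs).sum(axis=0) / weights.sum()

# Hypothetical non-invariant model and positive weight function.
model = lambda x: np.array([x[0, 0], 2.0 * x[0, -1], x[1, 2]])
weight_fn = lambda x: 1.0 + x[0, 0] ** 2

img = np.random.default_rng(0).standard_normal((5, 5))
out = lambda_equitune(img, model, weight_fn)
out_rotated = lambda_equitune(np.rot90(img), model, weight_fn)  # same up to rounding
```

Rotating the input only permutes the orbit, so the weighted average is unchanged. A constant weight function gives equitune, and a weight function that puts all its mass on the single best transformation corresponds to the equizero special case mentioned in the report.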
2305.09847
Report |
Selective Guidance: Are All the Denoising Steps of Guided Diffusion Important? |
Pareesa Ameneh Golnari, Zhewei Yao, Yuxiong He |
This study examines the impact of optimizing the Stable Diffusion (SD) guided
inference pipeline. We propose optimizing certain denoising steps by limiting
the noise computation to conditional noise and eliminating unconditional noise
computation, thereby reducing the complexity of the target iterations by 50%.
Additionally, we demonstrate that later iterations of the SD are less sensitive
to optimization, making them ideal candidates for applying the suggested
optimization. Our experiments show that optimizing the last 20% of the
denoising loop iterations results in an 8.2% reduction in inference time with
almost no perceivable changes to the human eye. Furthermore, we found that by
extending the optimization to 50% of the last iterations, we can reduce
inference time by approximately 20.3%, while still generating visually pleasing
images. |
This paper proposes an optimization method for the Stable Diffusion (SD) guided inference pipeline by selectively eliminating the computation of unconditional noise in later denoising steps. |
This optimization aims to reduce the inference time of SD without significantly impacting the perceived quality of the generated images. |
The authors analyze the sensitivity of different denoising iterations and find that later iterations are less sensitive to optimization. They then propose limiting the noise computation to only the conditional noise in a certain percentage of the last iterations, effectively reducing the computational complexity. |
Optimizing the last 20% of denoising iterations results in an 8.2% reduction in inference time with almost no noticeable difference in image quality.
Extending the optimization to 50% of the last iterations achieves a 20.3% speedup while still maintaining visually pleasing images.
Further tuning of the guidance scale can compensate for detail loss when applying aggressive optimization. |
The paper mainly focuses on a single diffusion model (SD) and its performance on a limited set of prompts.
More extensive user studies are needed to thoroughly evaluate the impact of the optimization on perceived image quality. |
stable diffusion, guided diffusion, optimization, inference time, image generation |
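The selective guidance idea in this report can be sketched as a denoising loop that skips the unconditional pass in the last fraction of iterations. The `toy_denoiser` and the latent update rule are hypothetical stand-ins, not the Stable Diffusion U-Net or a real scheduler; only the branching logic and the call count are the point of the example.

```python
def toy_denoiser(latent, step, conditional):
    """Hypothetical stand-in for one U-Net evaluation."""
    return 0.1 * latent if conditional else 0.05 * latent

def denoise_loop(latent, denoiser, num_steps, guidance_scale, skip_fraction):
    """Run guided denoising; the last skip_fraction of steps use only the
    conditional noise estimate. Returns the final latent and the call count."""
    first_cond_only = num_steps - int(round(num_steps * skip_fraction))
    calls = 0
    for step in range(num_steps):
        eps_cond = denoiser(latent, step, conditional=True)
        calls += 1
        if step < first_cond_only:
            # Standard classifier-free guidance: two evaluations per step.
            eps_uncond = denoiser(latent, step, conditional=False)
            calls += 1
            eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        else:
            # Optimized step: one evaluation, no unconditional pass.
            eps = eps_cond
        latent = latent - eps / num_steps  # toy update, not a real scheduler
    return latent, calls

_, baseline_calls = denoise_loop(1.0, toy_denoiser, 50, 7.5, skip_fraction=0.0)
_, optimized_calls = denoise_loop(1.0, toy_denoiser, 50, 7.5, skip_fraction=0.2)
```

With 50 steps, skipping the unconditional pass in the last 20% of iterations drops the evaluation count from 100 to 90. The measured 8.2% time reduction reported above is slightly smaller than this 10% drop, presumably because the rest of the pipeline is unaffected.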
2305.09828
Report |
Mimetic Initialization of Self-Attention Layers |
Asher Trockman, J. Zico Kolter |
It is notoriously difficult to train Transformers on small datasets;
typically, large pre-trained models are instead used as the starting point. We
explore the weights of such pre-trained Transformers (particularly for vision)
to attempt to find reasons for this discrepancy. Surprisingly, we find that
simply initializing the weights of self-attention layers so that they "look"
more like their pre-trained counterparts allows us to train vanilla
Transformers faster and to higher final accuracies, particularly on vision
tasks such as CIFAR-10 and ImageNet classification, where we see gains in
accuracy of over 5% and 4%, respectively. Our initialization scheme is closed
form, learning-free, and very simple: we set the product of the query and key
weights to be approximately the identity, and the product of the value and
projection weights to approximately the negative identity. As this mimics the
patterns we saw in pre-trained Transformers, we call the technique "mimetic
initialization". |
This paper introduces "mimetic initialization," a learning-free initialization technique for Transformers that mimics patterns observed in pretrained models, leading to improved training and performance, especially on vision tasks. |
Transformers often struggle to train on small datasets compared to CNNs. This work aims to improve the training of vanilla Transformers on such datasets without relying on extensive pretraining or architectural modifications. |
The authors observed that pretrained Vision Transformers exhibit specific weight patterns: the product of query and key weights approximates the identity, while the product of value and projection weights approximates the negative identity. They propose a closed-form initialization scheme that replicates these patterns using the singular value decomposition. |
Mimetic initialization significantly improves CIFAR-10 classification accuracy for various ViT architectures, with gains up to 7.77%.
It also benefits ImageNet training, particularly in a ResNet-style pipeline, showing up to 4.1% accuracy improvement for ViT-Tiny.
The method shows modest gains on language modeling tasks like WikiText-103, suggesting potential for broader application. |
The study primarily focuses on vision tasks, with limited exploration on language models. Further investigation is needed to understand its full potential for language applications.
The hyperparameter tuning of alpha and beta, which control the diagonal prominence in the initialization, is not extensively discussed. A more detailed analysis of their impact could be beneficial. |
transformer, initialization, vision transformer, image classification, language modeling |
2305.08995
Report |
Denoising Diffusion Models for Plug-and-Play Image Restoration |
Yuanzhi Zhu, Kai Zhang, Jingyun Liang, Jiezhang Cao, Bihan Wen, Radu Timofte, Luc Van Gool |
Plug-and-play Image Restoration (IR) has been widely recognized as a flexible
and interpretable method for solving various inverse problems by utilizing any
off-the-shelf denoiser as the implicit image prior. However, most existing
methods focus on discriminative Gaussian denoisers. Although diffusion models
have shown impressive performance for high-quality image synthesis, their
potential to serve as a generative denoiser prior to the plug-and-play IR
methods remains to be further explored. While several other attempts have been
made to adopt diffusion models for image restoration, they either fail to
achieve satisfactory results or typically require an unacceptable number of
Neural Function Evaluations (NFEs) during inference. This paper proposes
DiffPIR, which integrates the traditional plug-and-play method into the
diffusion sampling framework. Compared to plug-and-play IR methods that rely on
discriminative Gaussian denoisers, DiffPIR is expected to inherit the
generative ability of diffusion models. Experimental results on three
representative IR tasks, including super-resolution, image deblurring, and
inpainting, demonstrate that DiffPIR achieves state-of-the-art performance on
both the FFHQ and ImageNet datasets in terms of reconstruction faithfulness and
perceptual quality with no more than 100 NFEs. The source code is available at
https://github.com/yuanzhi-zhu/DiffPIR |
This paper proposes DiffPIR, a plug-and-play image restoration method that leverages the generative capabilities of pre-trained diffusion models within a diffusion sampling framework. |
Existing plug-and-play methods primarily rely on discriminative Gaussian denoisers, limiting their performance. Diffusion models, as generative denoisers, offer improved potential for modeling complex data distributions and handling ill-posed inverse problems. |
The method decouples the data and prior terms of the image restoration optimization problem using the Half-Quadratic-Splitting (HQS) algorithm. An off-the-shelf diffusion model acts as a plug-and-play denoiser prior, while the data term is solved analytically or approximated for various degradation models. |
DiffPIR achieves state-of-the-art performance on super-resolution, image deblurring, and inpainting tasks.
The method demonstrates superior reconstruction faithfulness and perceptual quality compared to existing techniques.
DiffPIR maintains efficiency, requiring no more than 100 Neural Function Evaluations (NFEs) for inference. |
The paper primarily focuses on bicubic super-resolution, potentially limiting the generalizability to other degradation models.
Future work could explore adapting the sampling process to further reduce NFEs and enhance efficiency. |
image restoration, diffusion models, plug-and-play priors, generative denoising, half-quadratic-splitting |
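The half-quadratic-splitting loop described for 2305.08995 above can be sketched on a toy 1D inpainting problem. This is only an illustration under stated assumptions: the names `data_step`, `toy_denoiser`, and `hqs_restore` and the penalty weight `rho` are hypothetical, and a 3-tap moving average stands in for the pre-trained diffusion denoiser, so it does not reproduce DiffPIR itself. For a binary mask, the data subproblem min_x ||y - m*x||^2 + rho*||x - z||^2 has the elementwise closed form used below.

```python
def data_step(y, mask, z, rho):
    """Closed-form data subproblem for a binary mask (elementwise)."""
    return [(m * yi + rho * zi) / (m + rho) for yi, m, zi in zip(y, mask, z)]


def toy_denoiser(x):
    """Stand-in prior step: 3-tap moving average with edge replication.

    Assumption: this replaces the diffusion model used as the prior.
    """
    n = len(x)
    return [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3.0
            for i in range(n)]


def hqs_restore(y, mask, rho=0.5, iters=50):
    """Alternate the prior (denoising) step and the data-consistency step."""
    x = [yi if m else 0.0 for yi, m in zip(y, mask)]
    for _ in range(iters):
        z = toy_denoiser(x)             # prior subproblem
        x = data_step(y, mask, z, rho)  # data subproblem
    return x


# Usage: a constant signal with two missing samples (mask == 0).
observed = [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
mask = [1, 1, 0, 1, 1, 0, 1, 1]
restored = hqs_restore(observed, mask)
```

The two steps never need to know about each other beyond the shared variable, which is what lets an off-the-shelf denoiser be swapped in for the prior.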
2305.08891
Report |
Common Diffusion Noise Schedules and Sample Steps are Flawed |
Shanchuan Lin, Bingchen Liu, Jiashi Li, Xiao Yang |
We discover that common diffusion noise schedules do not enforce the last
timestep to have zero signal-to-noise ratio (SNR), and some implementations of
diffusion samplers do not start from the last timestep. Such designs are flawed
and do not reflect the fact that the model is given pure Gaussian noise at
inference, creating a discrepancy between training and inference. We show that
the flawed design causes real problems in existing implementations. In Stable
Diffusion, it severely limits the model to only generate images with medium
brightness and prevents it from generating very bright and dark samples. We
propose a few simple fixes: (1) rescale the noise schedule to enforce zero
terminal SNR; (2) train the model with v prediction; (3) change the sampler to
always start from the last timestep; (4) rescale classifier-free guidance to
prevent over-exposure. These simple changes ensure the diffusion process is
congruent between training and inference and allow the model to generate
samples more faithful to the original data distribution. |
This paper identifies and corrects flaws in common diffusion noise schedules and sampling implementations that cause discrepancies between training and inference. |
These flaws limit the generated images' brightness range and hinder the model's ability to accurately respond to prompts related to brightness. |
The authors propose: (1) rescaling noise schedules to ensure zero terminal SNR, (2) training with v prediction and loss, (3) enforcing samplers to start from the last timestep, and (4) rescaling classifier-free guidance to prevent over-exposure. |
Rescaling the noise schedule and enforcing sampling from the last timestep allows the model to generate images with a wider range of brightness.
Training with v prediction and loss maintains visual quality comparable to using epsilon loss.
The proposed classifier-free guidance rescaling technique effectively mitigates over-exposure issues encountered when terminal SNR approaches zero. |
The paper primarily focuses on Stable Diffusion, and further investigation is needed to assess the impact on other diffusion models.
The proposed rescaling method for classifier-free guidance relies on an empirically determined hyperparameter, and further exploration of optimal values is warranted. |
diffusion models, noise schedules, sampling techniques, classifier-free guidance, stable diffusion |
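Two of the fixes listed for 2305.08891 above lend themselves to a short sketch: rescaling a noise schedule so the last timestep has zero SNR, and rescaling classifier-free guidance. The code below is a minimal illustration, assuming a shift-and-scale of sqrt(alpha_bar) that keeps the first timestep unchanged and a standard-deviation matching rule for guidance. The function names and the default `phi=0.7` are illustrative choices, not values taken from the summary.

```python
import math
from statistics import pstdev


def rescale_zero_terminal_snr(betas):
    """Return new betas whose cumulative alpha_bar reaches exactly 0 at the end."""
    alphas_bar, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        alphas_bar.append(prod)
    sqrt_ab = [math.sqrt(a) for a in alphas_bar]
    first, last = sqrt_ab[0], sqrt_ab[-1]
    # Shift so the last value is 0, then scale so the first value is preserved.
    scale = first / (first - last)
    sqrt_ab = [(s - last) * scale for s in sqrt_ab]
    new_ab = [s * s for s in sqrt_ab]
    new_alphas = [new_ab[0]] + [new_ab[i] / new_ab[i - 1]
                                for i in range(1, len(new_ab))]
    return [1.0 - a for a in new_alphas]


def rescale_cfg(pos, neg, guidance_scale, phi=0.7):
    """Guidance whose output std is pulled back toward the conditional branch.

    Assumes the guided output is not constant (its std must be non-zero).
    """
    cfg = [n + guidance_scale * (p - n) for p, n in zip(pos, neg)]
    std_pos, std_cfg = pstdev(pos), pstdev(cfg)
    rescaled = [c * std_pos / std_cfg for c in cfg]
    # phi blends the rescaled and the plain guided outputs.
    return [phi * r + (1.0 - phi) * c for r, c in zip(rescaled, cfg)]


# Usage: a linear schedule with 1000 steps (endpoints are illustrative).
linear_betas = [1e-4 + (0.02 - 1e-4) * i / 999 for i in range(1000)]
fixed_betas = rescale_zero_terminal_snr(linear_betas)
```

With the final beta equal to 1, the last timestep carries no signal, which matches the pure Gaussian noise the sampler starts from at inference.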
2305.08810
Report |
AutoRecon: Automated 3D Object Discovery and Reconstruction |
Yuang Wang, Xingyi He, Sida Peng, Haotong Lin, Hujun Bao, Xiaowei Zhou |
A fully automated object reconstruction pipeline is crucial for digital
content creation. While the area of 3D reconstruction has witnessed profound
developments, the removal of background to obtain a clean object model still
relies on different forms of manual labor, such as bounding box labeling, mask
annotations, and mesh manipulations. In this paper, we propose a novel
framework named AutoRecon for the automated discovery and reconstruction of an
object from multi-view images. We demonstrate that foreground objects can be
robustly located and segmented from SfM point clouds by leveraging
self-supervised 2D vision transformer features. Then, we reconstruct decomposed
neural scene representations with dense supervision provided by the decomposed
point clouds, resulting in accurate object reconstruction and segmentation.
Experiments on the DTU, BlendedMVS and CO3D-V2 datasets demonstrate the
effectiveness and robustness of AutoRecon. |
Proposes AutoRecon, a fully automated framework for discovering and reconstructing 3D objects from multi-view images without annotations. |
Enables scalable 3D content creation and the potential for large-scale generation of free 2D and 3D object annotations for supervised learning. |
A two-stage coarse-to-fine pipeline: 1) Coarse decomposition segments the foreground object from SfM point clouds using self-supervised 2D vision transformer features and a 3D segmentation Transformer. 2) Fine decomposition reconstructs a decomposed neural scene representation within the estimated object bounding box, guided by the coarse decomposition. |
Achieves superior 3D salient object detection compared to baselines, especially on challenging datasets like CO3D.
Reconstructs background-free object models with quality comparable to or exceeding NeuS, without manual annotation or post-processing.
Produces high-quality and multi-view consistent 2D segmentation masks, outperforming existing single-view and multi-view baselines. |
Remains sensitive to issues like shadows and thin structures.
Storing multi-view ViT features is memory-intensive. |
3d object reconstruction, unsupervised object discovery, scene decomposition, neural scene representation, point cloud segmentation |
2305.08776
Report |
Bridging the Domain Gap: Self-Supervised 3D Scene Understanding with Foundation Models |
Zhimin Chen, Longlong Jing, Yingwei Li, Bing Li |
Foundation models have achieved remarkable results in 2D and language tasks
like image segmentation, object detection, and visual-language understanding.
However, their potential to enrich 3D scene representation learning is largely
untapped due to the existence of the domain gap. In this work, we propose an
innovative methodology called Bridge3D to address this gap by pre-training 3D
models using features, semantic masks, and captions sourced from foundation
models. Specifically, our method employs semantic masks from foundation models
to guide the masking and reconstruction process for the masked autoencoder,
enabling more focused attention on foreground representations. Moreover, we
bridge the 3D-text gap at the scene level using image captioning foundation
models, thereby facilitating scene-level knowledge distillation. We further
extend this bridging effort by introducing an innovative object-level knowledge
distillation method that harnesses highly accurate object-level masks and
semantic text data from foundation models. Our methodology significantly
surpasses the performance of existing state-of-the-art methods in 3D object
detection and semantic segmentation tasks. For instance, on the ScanNet
dataset, Bridge3D improves the baseline by a notable margin of 6.3%. Code will
be available at: https://github.com/Zhimin-C/Bridge3D |
This paper presents Bridge3D, a novel method that leverages multiple foundation models for self-supervised 3D scene understanding. It uses semantic masks from foundation models to guide the masking and reconstruction process of the masked autoencoder, focusing on foreground representations, and bridges the 3D-text gap at both the scene and object levels through knowledge distillation. |
Bridging the gap between the success of foundation models in 2D and language tasks with the need for enriched 3D scene representation learning is crucial for advancing 3D scene understanding. |
The method uses a three-pronged approach: 1) Semantic-guided masked autoencoder with foreground-aware masking and patch dropping. 2) Multi-modal scene-level knowledge distillation using image captioning and 3D features. 3) Multi-modal object-level knowledge distillation leveraging accurate object-level masks and semantic text from foundation models. |
Bridge3D outperforms state-of-the-art self-supervised learning methods in 3D object detection and semantic segmentation tasks.
It significantly improves performance on ScanNet and SUN RGB-D datasets for object detection and S3DIS dataset for semantic segmentation.
Ablation studies confirm the effectiveness of each component and modality used in Bridge3D. |
The current work focuses primarily on indoor 3D scenes, limiting its generalizability.
Future work will focus on extending Bridge3D to outdoor scenes and open-vocabulary 3D tasks. |
3d scene understanding, self-supervised learning, foundation models, knowledge distillation, masked autoencoder |
2305.08694
Report |
A Reproducible Extraction of Training Images from Diffusion Models |
Ryan Webster |
Recently, Carlini et al. demonstrated the widely used model Stable Diffusion
can regurgitate real training samples, which is troublesome from a copyright
perspective. In this work, we provide an efficient extraction attack on par
with the recent attack, with several orders of magnitude fewer network
evaluations. In the process, we expose a new phenomenon, which we dub template
verbatims, wherein a diffusion model will regurgitate a training sample largely
intact. Template verbatims are harder to detect as they require retrieval and
masking to correctly label. Furthermore, they are still generated by newer
systems, even those which de-duplicate their training set, and we give insight
into why they still appear during generation. We extract training images from
several state of the art systems, including Stable Diffusion 2.0, Deep Image
Floyd, and finally Midjourney v4. We release code to verify our extraction
attack, perform the attack, as well as all extracted prompts at
https://github.com/ryanwebster90/onestep-extraction. |
This paper presents an efficient extraction attack on diffusion models, revealing a new phenomenon called "template verbatims" where models regurgitate training samples with non-semantic variations. |
The research highlights copyright concerns and potential misuse of generative models, especially for artists whose work might be exploited without attribution. |
The authors propose whitebox and blackbox attacks leveraging one-step synthesis properties and edge consistency, evaluating them against various diffusion models. |
The attack achieves comparable performance to existing methods but with significantly fewer network evaluations.
Template verbatims, harder to detect due to variations, are found even in models trained on deduplicated datasets.
The attack successfully extracts training images from various models, including Stable Diffusion 2.0, Deep Image Floyd, and Midjourney v4. |
The current ground truth construction struggles with images containing rearranged patches.
Future work could explore more robust copy detection methods invariant to patch permutations. |
diffusion models, extraction attack, copyright infringement, template verbatims, generative models |
2305.08685
Report |
CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding |
Linhui Xiao, Xiaoshan Yang, Fang Peng, Ming Yan, Yaowei Wang, Changsheng Xu |
Visual Grounding (VG) is a crucial topic in the field of vision and language,
which involves locating a specific region described by expressions within an
image. To reduce the reliance on manually labeled data, unsupervised visual
grounding methods have been developed to locate regions using pseudo-labels. However,
the performance of existing unsupervised methods is highly dependent on the
quality of pseudo-labels and these methods always encounter issues with limited
diversity. In order to utilize vision and language pre-trained models to
address the grounding problem, and reasonably take advantage of pseudo-labels,
we propose CLIP-VG, a novel method that can conduct self-paced curriculum
adapting of CLIP with pseudo-language labels. We propose a simple yet efficient
end-to-end network architecture to realize the transfer of CLIP to the visual
grounding. Based on the CLIP-based architecture, we further propose
single-source and multi-source curriculum adapting algorithms, which can
progressively find more reliable pseudo-labels to learn an optimal model,
thereby achieving a balance between reliability and diversity for the
pseudo-language labels. Our method outperforms the current state-of-the-art
unsupervised method by a significant margin on RefCOCO/+/g datasets in both
single-source and multi-source scenarios, with improvements ranging from
6.78% to 10.67% and 11.39% to 14.87%, respectively. The results
even outperform existing weakly supervised visual grounding methods.
Furthermore, our method is also competitive in fully supervised setting. The
code and models are available at https://github.com/linhuixiao/CLIP-VG. |
This paper proposes CLIP-VG, a novel method that conducts self-paced curriculum adapting of CLIP with pseudo-language labels for visual grounding, aiming to address the limitations of existing unsupervised methods that heavily rely on the quality of pseudo-labels and often encounter issues with limited diversity. |
Visual grounding is a crucial task in vision and language, and this work reduces the reliance on manually labeled data by efficiently utilizing VLP models like CLIP and pseudo-labels in a self-paced curriculum learning paradigm. |
The methodology involves a simple yet efficient end-to-end pure-Transformer encoder-only network architecture based on CLIP. It introduces a reliability measurement scheme to evaluate instance-level quality and proposes single-source and multi-source self-paced curriculum adapting algorithms (SSA and MSA) to progressively find more reliable pseudo-labels for training. |
CLIP-VG significantly outperforms the current state-of-the-art unsupervised method (Pseudo-Q) on RefCOCO/+/g datasets in both single-source and multi-source scenarios, with improvements ranging from 6.78% to 10.67% and 11.39% to 14.87%, respectively.
The proposed method also surpasses existing weakly supervised methods and achieves competitive results compared to fully supervised models.
CLIP-VG demonstrates significant speedups in both training and inference compared to other SOTA models while maintaining high performance. |
The quality of the currently utilized three types of pseudo-labels remains low, which could be further improved.
The greedy sample selection strategy in SSA and MSA, while balancing efficiency and performance, represents a trade-off that could be further explored. |
visual grounding, curriculum learning, pseudo-language label, vision-language models, clip |
2305.08408
Report |
SB-VQA: A Stack-Based Video Quality Assessment Framework for Video Enhancement |
Ding-Jiun Huang, Yu-Ting Kao, Tieh-Hung Chuang, Ya-Chun Tsai, Jing-Kai Lou, Shuen-Huei Guan |
In recent years, several video quality assessment (VQA) methods have been
developed, achieving high performance. However, these methods were not
specifically trained for enhanced videos, which limits their ability to predict
video quality accurately based on human subjective perception. To address this
issue, we propose a stack-based framework for VQA that outperforms existing
state-of-the-art methods on VDPVE, a dataset consisting of enhanced videos. In
addition to proposing the VQA framework for enhanced videos, we also
investigate its application on professionally generated content (PGC). To
address copyright issues with premium content, we create the PGCVQ dataset,
which consists of videos from YouTube. We evaluate our proposed approach and
state-of-the-art methods on PGCVQ, and provide new insights on the results. Our
experiments demonstrate that existing VQA algorithms can be applied to PGC
videos, and we find that VQA performance for PGC videos can be improved by
considering the plot of a play, which highlights the importance of video
semantic understanding. |
The paper proposes SB-VQA, a stack-based video quality assessment framework for enhanced videos, and investigates its application on professionally generated content (PGC). |
Accurate VQA for enhanced videos is crucial as traditional metrics like PSNR and SSIM fail to reflect human perception. Moreover, VQA for PGC content, while important for applications like old film restoration, remains underexplored. |
SB-VQA utilizes a stack-based approach with multiple feature extractors (FANet) and patch-weighted convolution blocks to mitigate bias from diverse video enhancements. The authors create PGCVQ, a PGC dataset, by transcoding movie trailers at various bitrates and analyze the relationship between predicted quality, encoding bitrate, and video content appeal (using YouTube heatmaps). |
SB-VQA outperforms state-of-the-art VQA methods on the VDPVE dataset (enhanced videos).
SB-VQA's predicted quality scores on PGCVQ align with the expectation that higher bitrates yield better perceived quality.
A correlation is observed between predicted quality scores and content appeal derived from YouTube heatmaps, suggesting that VQA can reflect video content richness. |
The paper acknowledges potential overfitting of SB-VQA's regression block on the training dataset.
Future work could explore multi-modal models incorporating semantic understanding to enhance VQA accuracy. |
video quality assessment, video enhancement, professionally generated content, deep learning, content appeal |
2305.07895
Report |
On the Hidden Mystery of OCR in Large Multimodal Models |
Yuliang Liu, Zhang Li, Biao Yang, Chunyuan Li, Xucheng Yin, Cheng-lin Liu, Lianwen Jin, Xiang Bai |
Large models have recently played a dominant role in natural language
processing and multimodal vision-language learning. However, their
effectiveness in text-related visual tasks remains relatively unexplored. In
this paper, we conducted a comprehensive evaluation of Large Multimodal Models,
such as GPT4V and Gemini, in various text-related visual tasks including Text
Recognition, Scene Text-Centric Visual Question Answering (VQA),
Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten
Mathematical Expression Recognition (HMER). To facilitate the assessment of
Optical Character Recognition (OCR) capabilities in Large Multimodal Models, we
propose OCRBench, a comprehensive evaluation benchmark. Our study encompasses 29
datasets, making it the most comprehensive OCR evaluation benchmark available.
Furthermore, our study reveals both the strengths and weaknesses of these
models, particularly in handling multilingual text, handwritten text,
non-semantic text, and mathematical expression recognition. Most importantly,
the baseline results showcased in this study could provide a foundational
framework for the conception and assessment of innovative strategies targeted
at enhancing zero-shot multimodal techniques. The evaluation pipeline and
benchmark are available at https://github.com/Yuliang-Liu/MultimodalOCR. |
This paper presents a comprehensive evaluation of 14 Large Multimodal Models (LMMs) on various text-related visual tasks and proposes OCRBench, a new benchmark for assessing OCR capabilities in LMMs. |
Understanding LMMs' effectiveness in text-related visual tasks is crucial due to their potential to revolutionize how we interact with and analyze information from both text and images. |
The authors evaluate LMMs on 29 datasets across five key tasks: text recognition, scene text-centric VQA, document-oriented VQA, key information extraction, and handwritten mathematical expression recognition. They introduce OCRBench, a refined benchmark with 1000 manually verified question-answer pairs. |
LMMs demonstrate promising results in text recognition, sometimes matching state-of-the-art supervised methods, but still lag behind in complex tasks like handwritten mathematical expression recognition.
LMMs exhibit weaknesses in handling blurry images, handwritten text, multilingual text, and non-semantic text.
OCRBench provides a standardized and accurate tool for evaluating LMM performance on OCR tasks, revealing strengths and areas for improvement. |
The study primarily focuses on accuracy and doesn't delve into computational efficiency or resource consumption of LMMs compared to specialized models.
Future work should explore the impact of training data size and fine-tuning strategies on LMM performance in specific OCR tasks. |
large multimodal models, optical character recognition, benchmarking, text recognition, visual question answering |
2305.07710
Report |
Zero-shot racially balanced dataset generation using an existing biased StyleGAN2 |
Anubhav Jain, Nasir Memon, Julian Togelius |
Facial recognition systems have made significant strides thanks to data-heavy
deep learning models, but these models rely on large privacy-sensitive
datasets. Further, many of these datasets lack diversity in terms of ethnicity
and demographics, which can lead to biased models that can have serious
societal and security implications. To address these issues, we propose a
methodology that leverages the biased generative model StyleGAN2 to create
demographically diverse images of synthetic individuals. The synthetic dataset
is created using a novel evolutionary search algorithm that targets specific
demographic groups. By training face recognition models with the resulting
balanced dataset containing 50,000 identities per race (13.5 million images in
total), we can improve their performance and minimize biases that might have
been present in a model trained on a real dataset. |
This paper introduces a novel search-based algorithm to generate balanced synthetic facial image datasets with diverse demographics from a pre-trained, biased StyleGAN2 model, aiming to improve fairness and accuracy in facial recognition. |
Existing facial recognition models trained on real-world datasets often inherit biases due to imbalanced representation of ethnicities, leading to unfair performance across different demographic groups. This work addresses the need for balanced and privacy-aware facial image datasets to mitigate these biases. |
The authors propose an evolutionary search algorithm that operates on the latent space of a pre-trained StyleGAN2 model. By leveraging an auxiliary demographic classifier, the algorithm explores the latent space to find and generate a large number of synthetic identities belonging to specific racial groups. |
The proposed approach generates over 50,000 unique synthetic identities per racial group, totaling 13.5 million images, showcasing its ability to create large-scale, balanced datasets.
Pre-training facial recognition models (ArcFace, AdaFace, ElasticFace) on the generated dataset leads to improved accuracy on standard benchmarks like RFW, LFW, CFP-FP, and AgeDB compared to models trained solely on real data.
The generated balanced dataset aids in mitigating bias, demonstrated by a reduction in accuracy disparity across different racial groups for the trained facial recognition models. |
The reliance on an external ethnicity classifier for supervision during synthetic data generation can introduce noise due to potential misclassifications by the classifier.
The study primarily focuses on mitigating racial bias and could be extended to address other demographic attributes like gender and age. |
facial recognition, bias mitigation, synthetic data, stylegan2, evolutionary search |
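The search idea summarized for 2305.07710 above, an evolutionary algorithm that explores a generator's latent space under the guidance of an auxiliary classifier, can be sketched as a simple elitist loop. Everything here is a hypothetical stand-in: `toy_group_score` replaces the demographic classifier, plain Gaussian vectors replace StyleGAN2 latents, and the population settings are arbitrary. It is not the authors' algorithm.

```python
import random


def toy_group_score(latent, target):
    """Hypothetical classifier confidence: higher when latent is near target."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(latent, target))
    return 1.0 / (1.0 + dist_sq)


def evolve_latents(score_fn, dim=8, pop_size=16, elite=4,
                   sigma=0.3, generations=30, seed=0):
    """Keep the best latents each generation and mutate them with Gaussian noise."""
    rng = random.Random(seed)
    population = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=score_fn, reverse=True)
        history.append(score_fn(population[0]))
        parents = population[:elite]  # elitism: parents survive unchanged
        children = [[g + rng.gauss(0.0, sigma) for g in rng.choice(parents)]
                    for _ in range(pop_size - elite)]
        population = parents + children
    population.sort(key=score_fn, reverse=True)
    history.append(score_fn(population[0]))
    return population[0], history


# Usage: search for latents that the toy scorer assigns to the target group.
target_latent = [1.0] * 8
best_latent, best_scores = evolve_latents(
    lambda z: toy_group_score(z, target_latent))
```

Because the parents are carried over unchanged, the best score never decreases from one generation to the next.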
2305.07625
Report |
Meta Omnium: A Benchmark for General-Purpose Learning-to-Learn |
Ondrej Bohdal, Yinbing Tian, Yongshuo Zong, Ruchika Chavhan, Da Li, Henry Gouk, Li Guo, Timothy Hospedales |
Meta-learning and other approaches to few-shot learning are widely studied
for image recognition, and are increasingly applied to other vision tasks such
as pose estimation and dense prediction. This naturally raises the question of
whether there is any few-shot meta-learning algorithm capable of generalizing
across these diverse task types. To support the community in answering this
question, we introduce Meta Omnium, a dataset-of-datasets spanning multiple
vision tasks including recognition, keypoint localization, semantic
segmentation and regression. We experiment with popular few-shot meta-learning
baselines and analyze their ability to generalize across tasks and to transfer
knowledge between them. Meta Omnium enables meta-learning researchers to
evaluate model generalization to a much wider array of tasks than previously
possible, and provides a single framework for evaluating meta-learners across a
wide suite of vision applications in a consistent manner. |
This paper introduces Meta Omnium, a benchmark dataset for evaluating few-shot meta-learning algorithms across multiple vision tasks. |
Existing few-shot learning benchmarks focus on single tasks, limiting the development of general-purpose algorithms capable of knowledge transfer across tasks. |
The benchmark comprises datasets from four vision tasks: image classification, semantic segmentation, keypoint localization, and regression. It includes seen/unseen dataset splits for in-distribution and out-of-distribution generalization evaluation. The authors adapt several popular few-shot learning algorithms (ProtoNet, MAML, DDRR, etc.) to the multi-task setting and provide baseline results. |
Prototypical Networks show the best overall performance and out-of-distribution generalization ability.
Single-task meta-learning generally outperforms multi-task meta-learning, indicating the challenge of learning from heterogeneous task distributions.
Meta-learning significantly outperforms simple transfer learning and training-from-scratch approaches. |
Limited number of datasets within each task family.
Exploration of more sophisticated meta-learning algorithms is needed. |
meta-learning, few-shot learning, multi-task learning, benchmarking, computer vision |
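Prototypical Networks, reported as the strongest baseline for 2305.07625 above, classify a query by its distance to per-class mean embeddings. The sketch below covers only the classification case and assumes the embeddings are already computed. The function names and the toy 2D vectors are illustrative, and the benchmark's adaptations to segmentation, keypoints, and regression are not shown.

```python
import math


def prototypes(support):
    """Average the support embeddings of each class into one prototype."""
    protos = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        protos[label] = [sum(v[d] for v in vectors) / len(vectors)
                         for d in range(dim)]
    return protos


def classify(query, protos):
    """Softmax over negative squared Euclidean distances to the prototypes."""
    logits = {label: -sum((q - p) ** 2 for q, p in zip(query, proto))
              for label, proto in protos.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(v - top) for label, v in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Usage: a 2-way, 2-shot episode with toy 2D embeddings.
support_set = {"cat": [[0.0, 0.0], [0.0, 2.0]],
               "dog": [[4.0, 4.0], [6.0, 4.0]]}
class_protos = prototypes(support_set)
probs = classify([1.0, 1.0], class_protos)
```

Only the prototypes depend on the support set, so adapting to a new episode takes no gradient steps.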
2305.07304
Report |
CLIP-Count: Towards Text-Guided Zero-Shot Object Counting |
Ruixiang Jiang, Lingbo Liu, Changwen Chen |
Recent advances in visual-language models have shown remarkable zero-shot
text-image matching ability that is transferable to downstream tasks such as
object detection and segmentation. Adapting these models for object counting,
however, remains a formidable challenge. In this study, we first investigate
transferring vision-language models (VLMs) for class-agnostic object counting.
Specifically, we propose CLIP-Count, the first end-to-end pipeline that
estimates density maps for open-vocabulary objects with text guidance in a
zero-shot manner. To align the text embedding with dense visual features, we
introduce a patch-text contrastive loss that guides the model to learn
informative patch-level visual representations for dense prediction. Moreover,
we design a hierarchical patch-text interaction module to propagate semantic
information across different resolution levels of visual features. Benefiting
from the full exploitation of the rich image-text alignment knowledge of
pretrained VLMs, our method effectively generates high-quality density maps for
objects-of-interest. Extensive experiments on FSC-147, CARPK, and ShanghaiTech
crowd counting datasets demonstrate state-of-the-art accuracy and
generalizability of the proposed method. Code is available:
https://github.com/songrise/CLIP-Count. |
CLIP-Count, the first end-to-end text-guided zero-shot object counting model, estimates density maps for open-vocabulary objects using text prompts. |
Existing class-agnostic counting methods are limited by reliance on manual patch exemplars or lack of object specificity. CLIP-Count addresses these limitations with a more user-friendly and flexible text-guided approach. |
The method adapts CLIP by introducing: 1) a patch-text contrastive loss to align text and patch embeddings, 2) a hierarchical patch-text interaction module to propagate semantic information across resolutions, and 3) a CNN decoder to generate density maps. |
Outperforms state-of-the-art zero-shot counting methods on FSC-147.
Demonstrates superior cross-dataset generalizability on CARPK and ShanghaiTech.
Effectively localizes and counts diverse objects with high fidelity, as shown qualitatively. |
Performance can be limited by ambiguity in text guidance for object counting.
Future work includes collecting datasets with more fine-grained text annotations. |
class-agnostic object counting, zero-shot learning, text-guided vision, density estimation, clip |
2305.07223
Report |
TransAVS: End-To-End Audio-Visual Segmentation With Transformer |
Yuhang Ling, Yuxi Li, Zhenye Gan, Jiangning Zhang, Mingmin Chi, Yabiao Wang |
Audio-Visual Segmentation (AVS) is a challenging task, which aims to segment
sounding objects in video frames by exploring audio signals. Generally AVS
faces two key challenges: (1) Audio signals inherently exhibit a high degree of
information density, as sounds produced by multiple objects are entangled
within the same audio stream; (2) Objects of the same category tend to produce
similar audio signals, making it difficult to distinguish between them and thus
leading to unclear segmentation results. Toward this end, we propose TransAVS,
the first Transformer-based end-to-end framework for the AVS task. Specifically,
TransAVS disentangles the audio stream as audio queries, which will interact
with images and decode into segmentation masks with full transformer
architectures. This scheme not only promotes comprehensive audio-image
communication but also explicitly excavates instance cues encapsulated in the
scene. Meanwhile, to encourage these audio queries to capture distinctive
sounding objects instead of degrading to be homogeneous, we devise two
self-supervised loss functions at both query and mask levels, allowing the
model to capture distinctive features within similar audio data and achieve
more precise segmentation. Our experiments demonstrate that TransAVS achieves
state-of-the-art results on the AVSBench dataset, highlighting its
effectiveness in bridging the gap between audio and visual modalities. |
This paper proposes TransAVS, the first Transformer-based end-to-end framework for Audio-Visual Segmentation (AVS) that leverages audio cues to segment sounding objects in video frames. |
AVS is challenging because audio signals are information-dense (mixing sounds from multiple sources) and objects of the same category often produce similar sounds, making segmentation difficult. |
TransAVS disentangles the audio stream into queries that interact with image features in a Transformer architecture to generate segmentation masks. Two self-supervised losses, Audio Query Distance Loss (AQDL) and Audio Query Mask Loss (AQML), encourage the model to learn distinctive features for more precise segmentation. |
TransAVS achieves state-of-the-art results on the AVSBench dataset, outperforming previous methods in both single-source and multi-source sound segmentation.
The use of audio queries for instance-level awareness and discrimination significantly improves segmentation accuracy.
Self-supervised losses, AQDL and AQML, effectively address the challenge of sound homogeneity among objects of the same category. |
The number of audio queries needs to be carefully tuned for optimal performance.
Future work could explore incorporating temporal information for better handling of object movements and occlusions. |
audio-visual segmentation, multi-modal learning, transformer, self-supervised learning, sound source separation |
2305.07024
Report |
SparseGNV: Generating Novel Views of Indoor Scenes with Sparse Input Views |
Weihao Cheng, Yan-Pei Cao, Ying Shan |
We study the generation of novel views of indoor scenes from sparse input views.
The challenge is to achieve both photorealism and view consistency. We present
SparseGNV: a learning framework that incorporates 3D structures and image
generative models to generate novel views with three modules. The first module
builds a neural point cloud as underlying geometry, providing contextual
information and guidance for the target novel view. The second module utilizes
a transformer-based network to map the scene context and the guidance into a
shared latent space and autoregressively decodes the target view in the form of
discrete image tokens. The third module reconstructs the tokens into the image
of the target view. SparseGNV is trained across a large indoor scene dataset to
learn generalizable priors. Once trained, it can efficiently generate novel
views of an unseen indoor scene in a feed-forward manner. We evaluate SparseGNV
on both real-world and synthetic indoor scenes and demonstrate that it
outperforms state-of-the-art methods based on either neural radiance fields or
conditional image generation. |
Proposes SparseGNV, a learning framework combining 3D structures and image generative models to synthesize novel views of indoor scenes from sparse input views. |
Addresses the challenge of generating photorealistic and consistent novel views of indoor scenes, which are often spatially complex and require dense, expensive scans. |
SparseGNV uses three modules: 1) Neural geometry module to build a 3D point cloud from sparse input views; 2) View generator module to encode scene context and target viewpoint into latent space, and autoregressively decode a novel view as discrete tokens; 3) Image converter module to reconstruct the tokens into a final image. |
Outperforms state-of-the-art methods like NeRFs and conditional image generation on real-world and synthetic datasets.
Generates high-fidelity novel views with consistent structure faithful to the observations.
Demonstrates strong generalization ability by effectively leveraging sparse input information. |
Output can be less stable compared to volume rendering methods, with potential alterations in object details and lighting.
Requires camera poses and depths, which can be unavailable in extremely sparse settings. Future work could explore incorporating depth estimation into the framework. |
novel view synthesis, indoor scenes, sparse input, 3d structure, image generation |
2305.07021
Report |
Simple Token-Level Confidence Improves Caption Correctness |
Suzanne Petryk, Spencer Whitehead, Joseph E. Gonzalez, Trevor Darrell, Anna Rohrbach, Marcus Rohrbach |
The ability to judge whether a caption correctly describes an image is a
critical part of vision-language understanding. However, state-of-the-art
models often misinterpret the correctness of fine-grained details, leading to
errors in outputs such as hallucinating objects in generated captions or poor
compositional reasoning. In this work, we explore Token-Level Confidence, or
TLC, as a simple yet surprisingly effective method to assess caption
correctness. Specifically, we fine-tune a vision-language model on image
captioning, input an image and proposed caption to the model, and aggregate
either algebraic or learned token confidences over words or sequences to
estimate image-caption consistency. Compared to sequence-level scores from
pretrained models, TLC with algebraic confidence measures achieves a relative
improvement in accuracy by 10% on verb understanding in SVO-Probes and
outperforms prior state-of-the-art in image and group scores for compositional
reasoning in Winoground by a relative 37% and 9%, respectively. When training
data are available, a learned confidence estimator provides further improved
performance, reducing object hallucination rates in MS COCO Captions by a
relative 30% over the original model and setting a new state-of-the-art. |
This paper proposes Token-Level Confidence (TLC), a simple yet effective method to assess image-caption correctness by leveraging token-level confidences from a fine-tuned image captioning model. |
Current vision-language models often struggle with fine-grained details in image-caption correctness, impacting tasks like hallucination detection and compositional reasoning. |
TLC scores captions with a fine-tuned captioning model. TLC-A uses algebraic confidence measures (e.g., softmax probabilities) over token predictions. TLC-L learns a confidence estimator trained to predict token correctness against reference captions. |
TLC-A surpasses prior state-of-the-art on Winoground, showing substantial improvement in image and group scores for compositional reasoning.
TLC-A outperforms sequence-level image-text matching scores on verb understanding evaluated with SVO-Probes.
TLC-L significantly reduces object hallucination rates in generated captions on MS COCO Captions, setting a new state-of-the-art. |
TLC-L requires in-domain training data for the confidence estimator.
The study uses uncalibrated output distributions for confidence estimation, potentially limiting reliability. |
image captioning, caption correctness, hallucination reduction, compositional reasoning, vision-language models |
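A minimal sketch of the algebraic variant (TLC-A) described in the entry above, assuming the confidence of a token is the softmax probability the captioning model assigns to it. The function names and the set of aggregation modes are illustrative assumptions; the entry only states that token confidences are aggregated over words or sequences.

```python
import numpy as np

def token_confidences(logits, token_ids):
    """Softmax probability assigned to each token of a proposed caption.

    logits:    (T, V) per-position vocabulary logits from the captioning model
    token_ids: (T,) ids of the caption tokens being scored
    """
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)
    return probs[np.arange(len(token_ids)), token_ids]

def aggregate(confidences, how="min"):
    """Collapse token confidences into one score for a word or sequence."""
    if how == "min":
        return float(confidences.min())
    if how == "mean":
        return float(confidences.mean())
    if how == "prod":
        return float(confidences.prod())
    raise ValueError(f"unknown aggregation: {how}")
```

Taking the minimum makes the score sensitive to a single doubtful token, which is the behaviour one would want when a caption is wrong in only one fine-grained detail.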
2305.07017
Report |
An Inverse Scaling Law for CLIP Training |
Xianhang Li, Zeyu Wang, Cihang Xie |
CLIP, one of the pioneering foundation models that connect images and text,
has enabled many recent breakthroughs in computer vision. However, its
associated training cost is prohibitively high, imposing a significant barrier
to its widespread exploration. In this paper, we present a surprising finding
that there exists an inverse scaling law for CLIP training, whereby the larger
the image/text encoders used, the shorter the sequence length of image/text
tokens that can be applied in training. Moreover, we showcase that the strategy
for reducing image/text token length plays a crucial role in determining the
quality of this scaling law.
As a result of this finding, we are able to successfully train CLIP even with
limited computational resources. For example, using 8 A100 GPUs, our CLIP
models achieve zero-shot top-1 ImageNet-1k accuracies of 63.2% in ~2 days,
67.8% in ~3 days, and 69.3% in ~4 days. Our method also works well when scaling
up -- with G/14, we register a new record of 83.0% ImageNet-1k zero-shot
accuracy, and meanwhile accelerate the training by ~33x compared to its
OpenCLIP counterpart. By reducing the computation barrier associated with CLIP,
we hope to inspire more research in this field, particularly from academics.
Our code is available at https://github.com/UCSC-VLAA/CLIPA. |
This paper discovers an inverse scaling law for CLIP training, showing that larger image/text encoders can be trained with shorter image/text token sequences while maintaining performance. |
This is important because it lowers the computational barrier of CLIP training, making it more accessible to researchers with limited resources. |
The authors experimented with various token reduction strategies (resizing, masking, etc.) and model sizes on ImageNet-1k, COCO, and robustness benchmarks. |
Larger CLIP models exhibit smaller performance drops when trained with reduced token lengths.
Image resizing and syntax masking are the most effective strategies for reducing image and text tokens, respectively.
The proposed CLIPA framework, utilizing this inverse scaling law, achieves competitive results with significantly reduced training cost compared to OpenCLIP. |
Current CLIP models, including CLIPA, struggle with capturing complex relationships, attributes, and order information.
Future work includes investigating the inverse scaling law with even larger models and datasets, as well as addressing the limitations in relational understanding. |
clip, inverse scaling law, efficient training, foundation models, computer vision |
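To make the token-length reduction in the entry above concrete, the sketch below counts the patch tokens a ViT image encoder sees at a given input resolution. The resolutions and patch size in the usage note are illustrative assumptions, not the settings reported in the paper.

```python
def num_image_tokens(image_size, patch_size):
    """Number of patch tokens for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return (image_size // patch_size) ** 2
```

For example, with 16-pixel patches a 224-pixel image yields 14 x 14 = 196 tokens, while resizing to 112 pixels yields 7 x 7 = 49 tokens, a 4x reduction in sequence length. Because self-attention cost grows roughly quadratically with sequence length, the attention cost shrinks by more than the token count does.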
2305.07015
Report |
Exploiting Diffusion Prior for Real-World Image Super-Resolution |
Jianyi Wang, Zongsheng Yue, Shangchen Zhou, Kelvin C. K. Chan, Chen Change Loy |
We present a novel approach to leverage prior knowledge encapsulated in
pre-trained text-to-image diffusion models for blind super-resolution (SR).
Specifically, by employing our time-aware encoder, we can achieve promising
restoration results without altering the pre-trained synthesis model, thereby
preserving the generative prior and minimizing training cost. To remedy the
loss of fidelity caused by the inherent stochasticity of diffusion models, we
employ a controllable feature wrapping module that allows users to balance
quality and fidelity by simply adjusting a scalar value during the inference
process. Moreover, we develop a progressive aggregation sampling strategy to
overcome the fixed-size constraints of pre-trained diffusion models, enabling
adaptation to resolutions of any size. A comprehensive evaluation of our method
using both synthetic and real-world benchmarks demonstrates its superiority
over current state-of-the-art approaches. Code and models are available at
https://github.com/IceClear/StableSR. |
This paper introduces StableSR, a novel blind super-resolution method that leverages the generative prior of pre-trained text-to-image diffusion models for high-quality image restoration. |
The approach addresses the limitations of existing super-resolution techniques, which often require extensive training from scratch or rely on explicit degradation assumptions, limiting their generalizability and computational efficiency. |
StableSR employs a time-aware encoder to condition a frozen pre-trained diffusion model on the input low-resolution image, preserving the generative prior and enabling efficient training. It further incorporates a controllable feature wrapping module for balancing realism and fidelity, and a progressive aggregation sampling strategy for handling arbitrary image resolutions. |
StableSR achieves state-of-the-art performance on both synthetic and real-world benchmarks, surpassing existing methods in perceptual quality metrics.
The method demonstrates superior artifact removal and detail generation capabilities, producing visually compelling results.
StableSR offers flexible control over the trade-off between fidelity and realism, catering to user preferences and diverse image degradation scenarios. |
The inference speed of StableSR, being a diffusion-based approach, is slower compared to GAN-based methods, demanding further exploration into fast sampling strategies.
The pre-cleaning stage, while effective for severely degraded images, introduces an additional dependency on external models, necessitating further research into enhancing the robustness of StableSR. |
super-resolution, image restoration, diffusion models, generative prior, blind super-resolution |
2305.07011
Report |
Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers |
Dahun Kim, Anelia Angelova, Weicheng Kuo |
We present Region-aware Open-vocabulary Vision Transformers (RO-ViT) - a
contrastive image-text pretraining recipe to bridge the gap between image-level
pretraining and open-vocabulary object detection. At the pretraining phase, we
propose to randomly crop and resize regions of positional embeddings instead of
using the whole image positional embeddings. This better matches the use of
positional embeddings at region-level in the detection finetuning phase. In
addition, we replace the common softmax cross entropy loss in contrastive
learning with focal loss to better learn the informative yet difficult
examples. Finally, we leverage recent advances in novel object proposals to
improve open-vocabulary detection finetuning. We evaluate our full model on the
LVIS and COCO open-vocabulary detection benchmarks and zero-shot transfer.
RO-ViT achieves a state-of-the-art 34.1 $AP_r$ on LVIS, surpassing the best
existing approach by +7.8 points in addition to competitive zero-shot transfer
detection. Surprisingly, RO-ViT improves the image-level representation as well
and achieves the state of the art on 9 out of 12 metrics on COCO and Flickr
image-text retrieval benchmarks, outperforming competitive approaches with
larger models. |
This paper presents Region-aware Open-vocabulary Vision Transformers (RO-ViT), a contrastive image-text pretraining approach for open-vocabulary object detection, utilizing cropped positional embeddings (CPE) and focal loss during pretraining. |
Existing image-text pretraining methods are designed for image-level tasks and lack region-level understanding, hindering their performance in open-vocabulary object detection. |
RO-ViT introduces two key innovations: (1) Cropped Positional Embeddings (CPE) that randomly crop and resize regions of positional embeddings during pretraining to better match region-level use in detection, and (2) replacement of softmax cross-entropy loss with focal loss in contrastive learning to emphasize informative examples. |
RO-ViT achieves state-of-the-art performance (34.1 AP_r) on the LVIS open-vocabulary detection benchmark, surpassing the best existing approach by +7.8 AP_r.
Despite not being optimized for retrieval, RO-ViT achieves state-of-the-art performance on 9 out of 12 metrics on COCO and Flickr image-text retrieval benchmarks.
Ablation studies confirm the benefits of both CPE and focal loss, demonstrating improved region-level representation for open-vocabulary detection. |
The model's performance relies on the quality and potential biases present in the pretrained VLMs.
Future work can explore the application of RO-ViT to video-based open-vocabulary detection, leveraging its strong performance on the ego-centric dataset. |
open-vocabulary object detection, vision transformers, contrastive image-text pretraining, cropped positional embeddings, focal loss |
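A minimal sketch of swapping cross entropy for focal loss in a contrastive objective, as described in the RO-ViT entry above. It assumes a softmax over each row of an image-text similarity matrix with matching pairs on the diagonal; the paper's exact formulation may differ (for instance a sigmoid-based form), and the function name and default gamma are assumptions.

```python
import numpy as np

def contrastive_focal_loss(logits, gamma=2.0):
    """Focal-weighted contrastive loss over image-to-text similarities.

    logits: (B, B) similarity matrix; entry (i, i) is the matching pair.
    gamma:  focusing parameter; gamma = 0 recovers plain cross entropy.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    log_pt = np.diag(log_probs)   # log-probability of the correct pair
    pt = np.exp(log_pt)
    # (1 - pt)^gamma down-weights pairs the model already matches confidently.
    return float(np.mean(-((1.0 - pt) ** gamma) * log_pt))
```

With gamma above zero, easy pairs contribute little and the gradient concentrates on the informative, difficult examples mentioned in the abstract.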
2305.06973
Report |
FreePoint: Unsupervised Point Cloud Instance Segmentation |
Zhikai Zhang, Jian Ding, Li Jiang, Dengxin Dai, Gui-Song Xia |
Instance segmentation of point clouds is a crucial task in the 3D field with
numerous applications that involve localizing and segmenting objects in a
scene. However, achieving satisfactory results requires a large number of
manual annotations, which is a time-consuming and expensive process. To
alleviate dependency on annotations, we propose a method, called FreePoint, for
underexplored unsupervised class-agnostic instance segmentation on point
clouds. In detail, we represent the point features by combining coordinates,
colors, normals, and self-supervised deep features. Based on the point
features, we perform a multicut algorithm to segment point clouds into coarse
instance masks as pseudo labels, which are used to train a point cloud instance
segmentation model. To alleviate the inaccuracy of coarse masks during
training, we propose a weakly-supervised training strategy and corresponding
loss. Our work can also serve as an unsupervised pre-training pretext for
supervised semantic instance segmentation with limited annotations. For
class-agnostic instance segmentation on point clouds, FreePoint largely fills
the gap with its fully-supervised counterpart based on the state-of-the-art
instance segmentation model Mask3D and even surpasses some previous
fully-supervised methods. When serving as a pretext task and fine-tuning on
S3DIS, FreePoint outperforms training from scratch by 5.8% AP with only 10%
mask annotations. |
Proposes FreePoint, an unsupervised approach for class-agnostic instance segmentation on point clouds, using a combination of traditional features and self-supervised deep-learning embeddings. |
Addresses the labor-intensive and expensive nature of obtaining manual annotations for point cloud instance segmentation. |
1. Filters out background points using plane segmentation. 2. Extracts features by combining coordinates, colors, normals, and self-supervised deep features. 3. Generates pseudo masks via multicut algorithm on a graph constructed using point feature affinities. 4. Trains an instance segmentation model using a two-step training strategy and a weakly-supervised loss based on the pseudo masks. |
Achieves over 50% of the accuracy of its fully-supervised counterpart (Mask3D) on class-agnostic instance segmentation on ScanNet.
Surpasses some previous fully-supervised methods on class-agnostic instance segmentation.
Demonstrates strong performance as a pre-training task, improving semantic instance segmentation results with limited annotations on S3DIS. |
Performance gap exists compared to fully-supervised methods.
Reliance on a user-defined parameter (σ) for generating coarse masks, though robust and easily set with visualization. |
unsupervised learning, point cloud segmentation, instance segmentation, 3d vision, weakly-supervised learning |
2305.06710
Report |
Null-text Guidance in Diffusion Models is Secretly a Cartoon-style Creator |
Jing Zhao, Heliang Zheng, Chaoyue Wang, Long Lan, Wanrong Huang, Wenjing Yang |
Classifier-free guidance is an effective sampling technique in diffusion
models that has been widely adopted. The main idea is to extrapolate the model
in the direction of text guidance and away from null-text guidance. In this
paper, we demonstrate that null-text guidance in diffusion models is secretly a
cartoon-style creator, i.e., the generated images can be efficiently
transformed into cartoons by simply perturbing the null-text guidance.
Specifically, we propose two disturbance methods, i.e., Rollback disturbance
(Back-D) and Image disturbance (Image-D), to construct misalignment between the
noisy images used for predicting null-text guidance and text guidance
(subsequently referred to as null-text noisy image and text
noisy image respectively) in the sampling process. Back-D achieves
cartoonization by altering the noise level of null-text noisy image via
replacing $x_t$ with $x_{t+\Delta t}$. Image-D, alternatively, produces
high-fidelity, diverse cartoons by defining $x_t$ as a clean input image, which
further improves the incorporation of finer image details. Through
comprehensive experiments, we delved into the principle of noise disturbing for
null-text and uncovered that the efficacy of disturbance depends on the
correlation between the null-text noisy image and the source image. Moreover,
our proposed techniques, which can generate cartoon images and cartoonize
specific ones, are training-free and easily integrated as a plug-and-play
component in any classifier-free guided diffusion model. Project page is
available at https://nulltextforcartoon.github.io/. |
This paper discovers that null-text guidance in diffusion models can be manipulated to create cartoon-style images. |
This is the first work to achieve cartoonization using diffusion models without requiring additional training, offering a novel and efficient approach. |
The authors introduce two noise disturbance methods: Rollback disturbance (Back-D), which replaces the null-text noisy image with a noisier version, and Image disturbance (Image-D), which uses a clean input image. Both methods create a misalignment between null-text and text guidance, driving the generation towards a cartoon style. |
Both Back-D and Image-D can generate free-form cartoon images from text prompts and cartoonize specific input images.
Image-D generally produces higher-fidelity cartoons with richer details compared to Back-D.
The degree of cartoonization is influenced by the correlation between the null-text noisy image and the input image, with higher correlation leading to better results. |
The effectiveness of the cartoonization relies heavily on the performance of the underlying text-to-image diffusion model.
The paper mainly explores visual cartoonization, leaving exploration of other artistic styles for future work. |
diffusion models, cartoonization, classifier-free guidance, null-text guidance, image generation |
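A minimal sketch of how the disturbance in the entry above fits into classifier-free guidance. The standard rule extrapolates from the null-text prediction toward the text prediction; the disturbance feeds the null-text branch a different noisy image (a noisier one for Back-D). The function names, the toy model, and the choice to reuse the same timestep for both branches are assumptions made for illustration.

```python
import numpy as np

def cfg_noise(eps_text, eps_null, scale):
    """Classifier-free guidance: move away from the null-text prediction."""
    return eps_null + scale * (eps_text - eps_null)

def guided_noise_with_disturbance(model, x_text, x_null, t, text_cond, null_cond, scale=7.5):
    """Guidance where the two branches may see different noisy images.

    x_text: noisy image used for the text-conditioned prediction
    x_null: noisy image used for the null-text prediction; passing a noisier
            image here mimics the rollback disturbance (hypothetical interface)
    """
    eps_text = model(x_text, t, text_cond)
    eps_null = model(x_null, t, null_cond)
    return cfg_noise(eps_text, eps_null, scale)

def toy_model(x, t, cond):
    """Hypothetical stand-in for a noise-prediction network."""
    return x * (1.0 - t) + cond
```

Passing the same image to both branches recovers ordinary classifier-free guidance, so the disturbance can be switched on or off without retraining anything.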
2305.06525
Report |
Pyramid Texture Filtering |
Qing Zhang, Hao Jiang, Yongwei Nie, Wei-Shi Zheng |
We present a simple but effective technique to smooth out textures while
preserving the prominent structures. Our method is built upon a key observation
-- the coarsest level in a Gaussian pyramid often naturally eliminates textures
and summarizes the main image structures. This inspires our central idea for
texture filtering, which is to progressively upsample the very low-resolution
coarsest Gaussian pyramid level to a full-resolution texture smoothing result
with well-preserved structures, under the guidance of each fine-scale Gaussian
pyramid level and its associated Laplacian pyramid level. We show that our
approach is effective at separating structure from texture of different scales,
local contrasts, and forms, without degrading structures or introducing visual
artifacts. We also demonstrate the applicability of our method on various
applications including detail enhancement, image abstraction, HDR tone mapping,
inverse halftoning, and LDR image enhancement. |
This paper introduces a novel texture smoothing technique that leverages image pyramids, effectively removing textures while preserving prominent structures. |
Texture smoothing is crucial in computational photography and image analysis for tasks like image abstraction, detail enhancement, and HDR tone mapping. This method offers a simple yet effective way to achieve this, addressing limitations of previous approaches. |
The method iteratively upsamples the coarsest level of a Gaussian pyramid to the original image resolution. This upsampling is guided by finer-scale levels of both Gaussian and Laplacian pyramids, ensuring structure preservation while eliminating textures. |
The coarsest level of a Gaussian pyramid naturally eliminates textures while preserving main image structures.
Pyramid-guided structure-aware upsampling effectively removes textures of varying scales, contrasts, and forms without degrading structures or introducing artifacts.
The method proves applicable and beneficial in various applications such as detail enhancement, image abstraction, HDR tone mapping, inverse halftoning, and LDR image enhancement. |
The method may struggle to preserve small-scale structures not present in the coarsest Gaussian pyramid level.
Similar to bilateral filtering, over-smoothing can lead to gradient reversal artifacts. |
image smoothing, structure extraction, image decomposition, image pyramid, upsampling |
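A minimal sketch of the Gaussian and Laplacian pyramids that the entry above builds on. It uses 2x2 block averaging and nearest-neighbour upsampling as simple stand-ins for Gaussian blurring and smooth interpolation, and it shows only plain pyramid construction and reconstruction, not the paper's guided, structure-aware upsampling. All function names are assumptions.

```python
import numpy as np

def downsample(img):
    """Halve resolution by averaging 2x2 blocks (stand-in for blur + decimate)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    """Double resolution by nearest-neighbour repetition."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def build_pyramids(img, levels):
    """Return (gaussian, laplacian); image sides must be divisible by 2**levels."""
    gaussian = [img]
    for _ in range(levels):
        gaussian.append(downsample(gaussian[-1]))
    # Each Laplacian level stores the detail lost when going one level coarser.
    laplacian = [fine - upsample(coarse)
                 for fine, coarse in zip(gaussian[:-1], gaussian[1:])]
    return gaussian, laplacian

def reconstruct(coarsest, laplacian):
    """Invert the pyramid: upsample and add detail back, coarse to fine."""
    out = coarsest
    for detail in reversed(laplacian):
        out = upsample(out) + detail
    return out
```

Adding every Laplacian level back reproduces the input exactly. A texture filter in the spirit of the entry would instead walk the same coarse-to-fine path while deciding, level by level, which detail counts as structure and which as texture.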
2305.06422
Report |
An Empirical Study on the Robustness of the Segment Anything Model (SAM) |
Yuqing Wang, Yun Zhao, Linda Petzold |
The Segment Anything Model (SAM) is a foundation model for general image
segmentation. Although it exhibits impressive performance predominantly on
natural images, understanding its robustness against various image
perturbations and domains is critical for real-world applications where such
challenges frequently arise. In this study we conduct a comprehensive
robustness investigation of SAM under diverse real-world conditions. Our
experiments encompass a wide range of image perturbations. Our experimental
results demonstrate that SAM's performance generally declines under perturbed
images, with varying degrees of vulnerability across different perturbations.
By customizing prompting techniques and leveraging domain knowledge based on
the unique characteristics of each dataset, the model's resilience to these
perturbations can be enhanced, addressing dataset-specific challenges. This
work sheds light on the limitations and strengths of SAM in real-world
applications, promoting the development of more robust and versatile image
segmentation solutions. |
This paper presents a comprehensive robustness analysis of the Segment Anything Model (SAM) under various image perturbations and across different domains. |
Evaluating SAM's robustness is crucial for real-world applications where image perturbations are common, ensuring its reliability in challenging conditions. |
The study evaluates SAM on nine diverse datasets spanning various domains, using three prompting methods (point, box, combination) and fifteen image perturbation types with different severity levels. |
SAM's performance generally declines under perturbed images, with varying degrees of vulnerability across different perturbations.
SAM exhibits particular vulnerability to chromatic aberration, motion blur, and Gaussian noise, while showing robustness against brightness and saturation changes.
The combination of point and box prompting consistently yields superior results and improved robustness compared to single prompting methods. |
The study primarily focuses on zero-shot learning and doesn't explore fine-tuning SAM for specific domains or perturbations.
Future work includes exploring more adaptive prompting strategies, incorporating human-in-the-loop interactions, and developing dataset-specific data augmentation techniques. |
image segmentation, segment anything model, robustness evaluation, prompting methods, domain-specific analysis |
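A minimal sketch of the kind of robustness measurement described in the entry above: perturb an image, segment it, and compare the predicted mask with the ground truth. Gaussian noise is one of the perturbations the entry names; the use of intersection-over-union as the score, the noise parameterisation, and the function names are assumptions, and the segmentation model itself is left out.

```python
import numpy as np

def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to an image with values in [0, 1]."""
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

def iou(pred_mask, true_mask):
    """Intersection-over-union between two binary masks."""
    pred_mask = pred_mask.astype(bool)
    true_mask = true_mask.astype(bool)
    union = np.logical_or(pred_mask, true_mask).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(np.logical_and(pred_mask, true_mask).sum() / union)
```

Running the same prompt on the clean and the perturbed image and comparing the two scores gives a per-perturbation drop, which can then be averaged over severity levels and datasets.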
2305.06402
Report |
Analyzing Bias in Diffusion-based Face Generation Models |
Malsha V. Perera, Vishal M. Patel |
Diffusion models are becoming increasingly popular in synthetic data
generation and image editing applications. However, these models can amplify
existing biases and propagate them to downstream applications. Therefore, it is
crucial to understand the sources of bias in their outputs. In this paper, we
investigate the presence of bias in diffusion-based face generation models with
respect to attributes such as gender, race, and age. Moreover, we examine how
dataset size affects the attribute composition and perceptual quality of both
diffusion and Generative Adversarial Network (GAN) based face generation models
across various attribute classes. Our findings suggest that diffusion models
tend to worsen distribution bias in the training data for various attributes,
which is heavily influenced by the size of the dataset. Conversely, GAN models
trained on balanced datasets with a larger number of samples show less bias
across different attributes. |
This paper investigates bias in diffusion-based face generation models with respect to gender, race, and age, focusing on the impact of training dataset size on attribute composition and perceptual quality in comparison to GAN-based models. |
Understanding bias in diffusion models is crucial for promoting fairness and mitigating negative societal consequences when these models are used in real-world applications. |
The study uses the FFHQ and FairFace datasets to train diffusion and GAN models, analyzing the attribute distribution of generated images with varying training subset sizes and employing attribute classifiers to assess the results. |
Diffusion models tend to amplify existing biases in training data, particularly for gender, race, and age, and the bias is influenced by dataset size.
GAN models, when trained on balanced datasets with larger sample sizes, demonstrate better preservation of attribute composition compared to diffusion models.
Data replication is more common in diffusion models trained with smaller datasets, which can contribute to bias in attribute distribution. |
The study primarily focuses on unconditional face generation using specific diffusion and GAN architectures; exploring other architectures could provide further insights.
While automated classifiers were used, potential bias in these classifiers might require further investigation using alternative methods for determining attribute classes. |
bias in ai, diffusion models, generative adversarial networks, face generation, dataset bias |
2305.06386
Report |
Text-To-Concept (and Back) via Cross-Model Alignment |
Mazda Moayeri, Keivan Rezaei, Maziar Sanjabi, Soheil Feizi |
We observe that the mapping from an image's representation in one model to
its representation in another can be learned surprisingly well with just a
linear layer, even across diverse models. Building on this observation, we
propose text-to-concept, where features from a fixed pretrained
model are aligned linearly to the CLIP space, so that text embeddings from
CLIP's text encoder become directly comparable to the aligned features. With
text-to-concept, we convert fixed off-the-shelf vision encoders to surprisingly
strong zero-shot classifiers for free, with accuracy at times even surpassing
that of CLIP, despite being much smaller models and trained on a small fraction
of the data compared to CLIP. We show other immediate use-cases of
text-to-concept, like building concept bottleneck models with no concept
supervision, diagnosing distribution shifts in terms of human concepts, and
retrieving images satisfying a set of text-based constraints. Lastly, we
demonstrate the feasibility of concept-to-text, where vectors in a
model's feature space are decoded by first aligning to the CLIP space before being
fed to a GPT-based generative model. Our work suggests existing deep models,
with presumably diverse architectures and training, represent input samples
relatively similarly, and a two-way communication across model representation
spaces and to humans (through language) is viable. |
This paper introduces "text-to-concept", a technique that leverages linear alignment of pretrained vision models to CLIP space to enable direct comparison of text embeddings (representing human concepts) with image features from these models. |
This method makes existing vision models significantly more interpretable and functional by allowing us to understand and utilize the semantic knowledge encoded in their feature spaces through the lens of human language. |
The core methodology involves training a linear layer to map image representations from a given vision model to the representation space of a CLIP model. This allows text embeddings from CLIP's text encoder, which inherently represent concepts, to be directly compared to aligned features from the other model, enabling text-to-concept mapping. |
Linear alignment effectively maps representations across diverse models, indicating they encode information similarly despite different architectures and training.
Text-to-concept allows off-the-shelf models to perform zero-shot classification competitively with CLIP, even surpassing it in some cases (e.g., color recognition).
The method enables novel applications like building Concept Bottleneck Models without concept supervision, analyzing dataset distributions in terms of human concepts, and performing concept-based image retrieval. |
Concept vector quality depends on prompt engineering and data used, requiring refinement for optimal performance.
The method's success relies on the quality of the underlying models (CLIP, vision encoders, and language models), implying limitations inherited from these components. |
interpretability, text-to-concept, cross-model alignment, zero-shot learning, concept bottleneck models |
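The linear alignment and zero-shot step described for text-to-concept (2305.06386) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: random arrays stand in for real vision-encoder features, CLIP image features, and CLIP text embeddings, and the helper names `fit_linear_aligner` and `zero_shot_predict` are hypothetical.

```python
import numpy as np

def fit_linear_aligner(src_feats, clip_feats):
    """Least-squares fit of an affine map so that src_feats @ W + b approximates clip_feats."""
    ones = np.ones((src_feats.shape[0], 1))
    design = np.hstack([src_feats, ones])          # append a bias column
    sol, *_ = np.linalg.lstsq(design, clip_feats, rcond=None)
    return sol[:-1], sol[-1]                       # W, b

def zero_shot_predict(feats, W, b, text_embeds):
    """Classify by cosine similarity between aligned features and text embeddings."""
    aligned = feats @ W + b
    aligned = aligned / np.linalg.norm(aligned, axis=1, keepdims=True)
    text = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return (aligned @ text.T).argmax(axis=1)

# Toy stand-ins (assumptions): 200 images, 16-d source features, 8-d "CLIP" space.
rng = np.random.default_rng(0)
src = rng.normal(size=(200, 16))
clip_img = src @ rng.normal(size=(16, 8)) + 0.5    # pretend CLIP image features
text = rng.normal(size=(3, 8))                     # pretend embeddings of 3 class prompts

W, b = fit_linear_aligner(src, clip_img)
pred_aligned = zero_shot_predict(src, W, b, text)
pred_clip = zero_shot_predict(clip_img, np.eye(8), np.zeros(8), text)
```

In this toy setup the two spaces are related by an exact affine map, so the aligned features reproduce the stand-in CLIP predictions. With real encoders the fit is only approximate and would be evaluated on held-out images.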
2305.06356
Report |
HumanRF: High-Fidelity Neural Radiance Fields for Humans in Motion |
Mustafa Işık, Martin Rünz, Markos Georgopoulos, Taras Khakhulin, Jonathan Starck, Lourdes Agapito, Matthias Nießner |
Representing human performance at high-fidelity is an essential building
block in diverse applications, such as film production, computer games or
videoconferencing. To close the gap to production-level quality, we introduce
HumanRF, a 4D dynamic neural scene representation that captures full-body
appearance in motion from multi-view video input, and enables playback from
novel, unseen viewpoints. Our novel representation acts as a dynamic video
encoding that captures fine details at high compression rates by factorizing
space-time into a temporal matrix-vector decomposition. This allows us to
obtain temporally coherent reconstructions of human actors for long sequences,
while representing high-resolution details even in the context of challenging
motion. While most research focuses on synthesizing at resolutions of 4MP or
lower, we address the challenge of operating at 12MP. To this end, we introduce
ActorsHQ, a novel multi-view dataset that provides 12MP footage from 160
cameras for 16 sequences with high-fidelity, per-frame mesh reconstructions. We
demonstrate challenges that emerge from using such high-resolution data and
show that our newly introduced HumanRF effectively leverages this data, making
a significant step towards production-level quality novel view synthesis. |
Introduces HumanRF, a 4D dynamic neural scene representation that captures full-body human appearance in motion from multi-view video input for high-fidelity novel view synthesis. |
Aims to close the gap to production-level quality in human performance capture by addressing the limitations of existing dynamic NeRF methods in handling long sequences with complex motion and high-resolution data. |
Leverages a novel 4D scene representation with adaptive temporal partitioning, using a low-rank space-time tensor decomposition of feature grids and shared MLPs for efficient and temporally consistent reconstruction. |
HumanRF consistently outperforms state-of-the-art methods in novel view synthesis quality for long sequences with complex motion.
The method effectively utilizes high-resolution data (12MP), capturing fine details beyond the capabilities of previous datasets and techniques.
HumanRF demonstrates strong performance on a dynamic furry animal dataset, indicating its potential beyond human subjects. |
Current implementation requires separate optimization for each sequence, limiting generalization ability.
Lacks explicit control over articulation outside training poses. |
neural rendering, novel view synthesis, human performance capture, 4d dynamic nerf, high-resolution data |
2305.06351
Report |
Reconstructing Animatable Categories from Videos |
Gengshan Yang, Chaoyang Wang, N Dinesh Reddy, Deva Ramanan |
Building animatable 3D models is challenging due to the need for 3D scans,
laborious registration, and manual rigging, which are difficult to scale to
arbitrary categories. Recently, differentiable rendering provides a pathway to
obtain high-quality 3D models from monocular videos, but these are limited to
rigid categories or single instances. We present RAC that builds category 3D
models from monocular videos while disentangling variations over instances and
motion over time. Three key ideas are introduced to solve this problem: (1)
specializing a skeleton to instances via optimization, (2) a method for latent
space regularization that encourages shared structure across a category while
maintaining instance details, and (3) using 3D background models to disentangle
objects from the background. We show that 3D models of humans, cats, and dogs
can be learned from 50-100 internet videos. |
This supplementary material provides additional details, results, and comparisons for the paper 'Reconstructing Animatable Categories from Videos'. It focuses on shape regularization techniques, handling categories outside the DensePose framework, evaluating performance on the 'Pablo' sequence, and highlighting differences from prior work. |
This material supplements the main paper with: 1) technical details on shape regularization and ensuring smoothness in pose, deformation, and appearance; 2) extending the method to handle categories without pre-defined DensePose features (e.g., vehicles); 3) quantitative evaluation on the 'Pablo' sequence, comparing with other methods; and 4) a clear comparison table highlighting how the approach differs from related work. |
The authors elaborate on the use of eikonal regularization for shape and time-dependent positional embeddings for smooth temporal variations. For categories outside DensePose, they explain a two-stage camera pose initialization process using manual annotation and a viewpoint network. Performance evaluation on the 'Pablo' sequence involves calculating the average point-to-surface distance in the clothing region and comparing it to baselines. Finally, a table summarizes key differences from previous works regarding shape, motion, background modeling, and reliance on 3D data. |
Eikonal regularization improves surface reconstruction quality.
The method successfully reconstructs a car category model from videos, demonstrating generalization beyond human shapes.
The proposed method outperforms some single-view human shape predictors on the 'Pablo' sequence but shows limitations compared to methods using parametric models or personalized templates. |
The method depends on an initial viewpoint estimation step for categories outside DensePose, which might require manual annotation.
While the approach performs well without 3D supervision, incorporating shape priors could further improve accuracy, especially in challenging cases. |
3d reconstruction, category-level modeling, video-based animation, nerf, differentiable rendering |
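The eikonal regularization mentioned for RAC (2305.06351) penalizes a signed distance function whose gradient norm deviates from 1. The sketch below illustrates that penalty only. It assumes an analytic sphere SDF and central finite differences in place of a learned neural field with automatic differentiation, and the function names are hypothetical.

```python
import numpy as np

def sphere_sdf(points, radius=1.0):
    """Exact signed distance to a sphere centred at the origin."""
    return np.linalg.norm(points, axis=-1) - radius

def eikonal_loss(sdf, points, eps=1e-4):
    """Mean of (||grad f|| - 1)^2, with the gradient from central differences."""
    grads = np.zeros_like(points)
    for axis in range(points.shape[-1]):
        offset = np.zeros(points.shape[-1])
        offset[axis] = eps
        grads[:, axis] = (sdf(points + offset) - sdf(points - offset)) / (2 * eps)
    grad_norm = np.linalg.norm(grads, axis=-1)
    return float(np.mean((grad_norm - 1.0) ** 2))

rng = np.random.default_rng(0)
pts = rng.uniform(-2.0, 2.0, size=(1024, 3))

loss_true_sdf = eikonal_loss(sphere_sdf, pts)                    # close to 0
loss_scaled = eikonal_loss(lambda p: 2.0 * sphere_sdf(p), pts)   # close to 1
```

A valid SDF has unit gradient norm almost everywhere, so its penalty is near zero. The scaled field has gradient norm 2 and is penalized by about (2 - 1)^2 = 1.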
2305.05947
Report |
iEdit: Localised Text-guided Image Editing with Weak Supervision |
Rumeysa Bodur, Erhan Gundogdu, Binod Bhattarai, Tae-Kyun Kim, Michael Donoser, Loris Bazzani |
Diffusion models (DMs) can generate realistic images with text guidance using
large-scale datasets. However, they demonstrate limited controllability in the
output space of the generated images. We propose a novel learning method for
text-guided image editing, namely iEdit, that generates images
conditioned on a source image and a textual edit prompt. As a fully-annotated
dataset with target images does not exist, previous approaches perform
subject-specific fine-tuning at test time or adopt contrastive learning without
a target image, leading to issues on preserving the fidelity of the source
image. We propose to automatically construct a dataset derived from LAION-5B,
containing pseudo-target images with their descriptive edit prompts given input
image-caption pairs. This dataset gives us the flexibility of introducing a
weakly-supervised loss function to generate the pseudo-target image from the
latent noise of the source image conditioned on the edit prompt. To encourage
localised editing and preserve or modify spatial structures in the image, we
propose a loss function that uses segmentation masks to guide the editing
during training and optionally at inference. Our model is trained on the
constructed dataset with 200K samples and constrained GPU resources. It shows
favourable results against its counterparts in terms of image fidelity, CLIP
alignment score and qualitatively for editing both generated and real images. |
iEdit is a novel learning method for localized text-guided image editing based on Latent Diffusion Models, which takes a source image and a textual edit prompt as input to generate an edited image. |
Existing text-to-image generation models show limited controllability and struggle to balance preserving fidelity of unmodified regions while implementing localized edits according to the prompt. |
The method constructs a dataset of image pairs with automatically generated edit prompts by leveraging CLIP embeddings and manipulating image captions. This dataset is used to fine-tune a pre-trained LDM with a novel loss function incorporating segmentation masks for localized editing. |
iEdit outperforms state-of-the-art methods in terms of CLIP alignment score, demonstrating improved fidelity to the edit prompt.
It achieves a good balance between editing and preserving image fidelity, as evidenced by SSIM scores on edited and unmodified regions.
The method is computationally efficient, requiring only fine-tuning of a pre-trained LDM on a relatively small dataset with limited GPU resources. |
The automatic dataset generation may produce suboptimal image pairs, impacting training effectiveness.
Evaluation of image editing methods remains challenging due to the lack of standardized metrics and datasets. |
image editing, text-guided synthesis, diffusion models, weakly supervised learning, semantic segmentation |
2305.05901
Report |
Text-guided High-definition Consistency Texture Model |
Zhibin Tang, Tiantong He |
With the advent of depth-to-image diffusion models, text-guided generation,
editing, and transfer of realistic textures are no longer difficult. However,
due to the limitations of pre-trained diffusion models, they can only create
low-resolution, inconsistent textures. To address this issue, we present the
High-definition Consistency Texture Model (HCTM), a novel method that can
generate high-definition and consistent textures for 3D meshes according to the
text prompts. We achieve this by leveraging a pre-trained depth-to-image
diffusion model to generate single viewpoint results based on the text prompt
and a depth map. We fine-tune the diffusion model with Parameter-Efficient
Fine-Tuning to quickly learn the style of the generated result, and apply the
multi-diffusion strategy to produce high-resolution and consistent results from
different viewpoints. Furthermore, we propose a strategy that prevents the
appearance of noise on the textures caused by backpropagation. Our proposed
approach has demonstrated promising results in generating high-definition and
consistent textures for 3D meshes, as demonstrated through a series of
experiments. |
Presents HCTM, a novel method that generates high-definition, consistent textures for 3D meshes from text prompts. |
Existing text-guided 3D texture generation methods produce low-resolution, inconsistent results. |
Leverages a pre-trained depth-to-image diffusion model fine-tuned with Parameter-Efficient Fine-Tuning and a multi-diffusion strategy to generate high-resolution, consistent textures from different viewpoints. Also employs textual inversion for better prompt-image alignment and a noise reduction strategy during texture projection. |
Generates textures with higher consistency than Latent-NeRF and TEXTure.
Produces clearer, more detailed textures, as demonstrated with the 'oak wood dining table' prompt.
Exhibits greater stability than existing methods, even with challenging prompts like 'gold dining table'.
Outperforms baselines in a user study for overall quality, prompt relevance, and texture consistency, especially on complex meshes. |
Discontinuity, severe flare, and shadows still impact the visual quality.
The multi-diffusion strategy does not work well in 3D because UV mapping alters the white-noise distribution. |
3d texture generation, diffusion models, text-guided synthesis, parameter-efficient fine-tuning, multi-diffusion strategy |
2305.05803
Report |
Segment Anything Model (SAM) Enhanced Pseudo Labels for Weakly Supervised Semantic Segmentation |
Tianle Chen, Zheda Mai, Ruiwen Li, Wei-lun Chao |
Weakly supervised semantic segmentation (WSSS) aims to bypass the need for
laborious pixel-level annotation by using only image-level annotation. Most
existing methods rely on Class Activation Maps (CAM) to derive pixel-level
pseudo-labels and use them to train a fully supervised semantic segmentation
model. Although these pseudo-labels are class-aware, indicating the coarse
regions for particular classes, they are not object-aware and fail to delineate
accurate object boundaries. To address this, we introduce a simple yet
effective method harnessing the Segment Anything Model (SAM), a class-agnostic
foundation model capable of producing fine-grained instance masks of objects,
parts, and subparts. We use CAM pseudo-labels as cues to select and combine SAM
masks, resulting in high-quality pseudo-labels that are both class-aware and
object-aware. Our approach is highly versatile and can be easily integrated
into existing WSSS methods without any modification. Despite its simplicity,
our approach shows consistent gain over the state-of-the-art WSSS methods on
both PASCAL VOC and MS-COCO datasets. |
This paper proposes SEPL, a novel method that leverages the Segment Anything Model (SAM) to enhance pseudo-labels generated by Class Activation Maps (CAM) in weakly supervised semantic segmentation (WSSS). |
CAM-derived pseudo-labels, while class-aware, often lack object awareness, leading to inaccurate object boundaries. This work addresses this limitation by integrating SAM's ability to produce fine-grained instance masks. |
SEPL uses CAM pseudo-labels as cues to select and combine relevant SAM masks. It assigns each SAM mask to the class with the largest intersection area and then selects masks based on their overlap with pseudo-labels to mitigate false and partial activations. |
SEPL consistently improves the quality of pseudo-labels generated by various WSSS methods on PASCAL VOC and MS COCO datasets.
Using SEPL-enhanced pseudo-labels for training supervised segmentation models leads to significant performance improvements.
SEPL can be directly applied to initial CAMs, potentially replacing time-consuming post-processing steps and accelerating WSSS pipelines. |
SEPL's effectiveness is contingent on the quality of initial pseudo-labels and SAM masks.
Future work includes exploring SAM's hierarchical mask structure for more sophisticated mask selection. |
weakly supervised semantic segmentation, class activation maps, segment anything model (sam), pseudo-label enhancement, object boundary detection |
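The SEPL mask-selection step (2305.05803) can be sketched as below: each SAM mask is assigned to the foreground class it intersects most in the CAM pseudo-label, and it is kept only if that overlap is large enough. The `min_overlap` threshold, the background index, and the function name are assumptions made for illustration; the source states only that masks are assigned by largest intersection and selected by overlap.

```python
import numpy as np

def enhance_pseudo_labels(cam_labels, sam_masks, min_overlap=0.5, background=0):
    """Combine class-agnostic SAM masks using CAM pseudo-labels as class cues."""
    enhanced = np.full_like(cam_labels, background)
    for mask in sam_masks:
        classes, counts = np.unique(cam_labels[mask], return_counts=True)
        fg = classes != background
        if not fg.any():
            continue                                  # mask touches no class cue
        best = classes[fg][counts[fg].argmax()]       # class with largest intersection
        overlap = counts[fg].max() / mask.sum()       # fraction of the mask it covers
        if overlap >= min_overlap:
            enhanced[mask] = best                     # whole object takes the label
    return enhanced

# Toy example: a coarse 3x3 CAM blob of class 1 and one false activation of class 2.
cam = np.zeros((6, 6), dtype=int)
cam[1:4, 1:4] = 1
cam[0, 5] = 2

mask_object = np.zeros((6, 6), dtype=bool)
mask_object[1:5, 1:5] = True                          # SAM mask of the full object
mask_other = np.zeros((6, 6), dtype=bool)
mask_other[:, 5] = True                               # SAM mask mostly on background

enhanced = enhance_pseudo_labels(cam, [mask_object, mask_other])
```

Here the partial activation is expanded to the full 4x4 object mask (overlap 9/16), while the mask containing only one falsely activated pixel (overlap 1/6) is dropped.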
2305.05768
Report |
DifFIQA: Face Image Quality Assessment Using Denoising Diffusion Probabilistic Models |
Žiga Babnik, Peter Peer, Vitomir Štruc |
Modern face recognition (FR) models excel in constrained scenarios, but often
suffer from decreased performance when deployed in unconstrained (real-world)
environments due to uncertainties surrounding the quality of the captured
facial data. Face image quality assessment (FIQA) techniques aim to mitigate
these performance degradations by providing FR models with sample-quality
predictions that can be used to reject low-quality samples and reduce false
match errors. However, despite steady improvements, ensuring reliable quality
estimates across facial images with diverse characteristics remains
challenging. In this paper, we present a powerful new FIQA approach, named
DifFIQA, which relies on denoising diffusion probabilistic models (DDPM) and
ensures highly competitive results. The main idea behind the approach is to
utilize the forward and backward processes of DDPMs to perturb facial images
and quantify the impact of these perturbations on the corresponding image
embeddings for quality prediction. Because the diffusion-based perturbations
are computationally expensive, we also distill the knowledge encoded in DifFIQA
into a regression-based quality predictor, called DifFIQA(R), that balances
performance and execution time. We evaluate both models in comprehensive
experiments on 7 datasets, with 4 target FR models and against 10
state-of-the-art FIQA techniques with highly encouraging results. The source
code will be made publicly available. |
DifFIQA, a novel Face Image Quality Assessment (FIQA) technique, leverages Denoising Diffusion Probabilistic Models (DDPMs) to assess the quality of face images by analyzing their embedding stability under perturbations introduced by the forward and backward diffusion processes. |
FIQA is crucial for improving the reliability and performance of Face Recognition (FR) models in real-world scenarios, where input image quality can vary significantly. DifFIQA addresses the need for accurate and robust quality assessment across diverse facial characteristics and FR models. |
DifFIQA utilizes a custom DDPM, trained with time-dependent degradations, to generate noisy and reconstructed versions of input face images. By analyzing the disparities between the embeddings of the original, noisy, and reconstructed images in the target FR model's embedding space, DifFIQA infers the quality of the input image. To enhance efficiency, a distilled regression-based model, DifFIQA(R), is also introduced. |
DifFIQA and DifFIQA(R) demonstrate highly competitive performance, consistently outperforming state-of-the-art FIQA methods on challenging datasets like IJB-C and XQLFW.
The distillation process reduces runtime by three orders of magnitude, making DifFIQA(R) comparable in speed to faster FIQA models without substantial performance degradation.
Ablation studies highlight the importance of incorporating image flipping, forward diffusion pass, and appropriate noise levels for optimal DifFIQA performance. |
The computational cost of the original DifFIQA model poses a challenge for real-time applications, although distillation into DifFIQA(R) mitigates it.
The reliance on a CNN-based UNet for denoising in DifFIQA may limit its ability to capture global image properties, suggesting potential improvements with transformer-based models. |
face image quality assessment, face recognition, denoising diffusion probabilistic models, deep learning, computer vision |
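The core idea behind DifFIQA (2305.05768), scoring an image by how stable its embedding is under perturbation, can be illustrated with a toy sketch. Everything concrete here is an assumption: a fixed random projection stands in for the face recognition model, additive Gaussian noise stands in for the DDPM forward and backward passes, and signal strength is used as a crude proxy for image quality.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def stability_quality(image, embed, perturb, n_rounds=8):
    """Average similarity between the embedding of an image and of its perturbed copies."""
    ref = embed(image)
    sims = [cosine(ref, embed(perturb(image))) for _ in range(n_rounds)]
    return float(np.mean(sims))

rng = np.random.default_rng(0)
proj = rng.normal(size=(32, 64))

def embed(img):
    """Stand-in embedding model (assumption): a fixed linear projection."""
    return proj @ img.ravel()

def perturb(img):
    """Stand-in perturbation (assumption): additive unit-variance Gaussian noise."""
    return img + rng.normal(scale=1.0, size=img.shape)

base = rng.normal(size=(8, 8))
q_strong = stability_quality(10.0 * base, embed, perturb)   # strong signal, stable embedding
q_weak = stability_quality(0.1 * base, embed, perturb)      # weak signal, unstable embedding
```

Images whose embeddings barely move under perturbation receive scores near 1, and images dominated by the perturbation receive much lower scores. The actual method compares the original, noisy, and reconstructed images in the target FR model's embedding space.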
2305.05594
Report |
PET-NeuS: Positional Encoding Tri-Planes for Neural Surfaces |
Yiqun Wang, Ivan Skorokhodov, Peter Wonka |
A signed distance function (SDF) parametrized by an MLP is a common
ingredient of neural surface reconstruction. We build on the successful recent
method NeuS to extend it by three new components. The first component is to
borrow the tri-plane representation from EG3D and represent signed distance
fields as a mixture of tri-planes and MLPs instead of representing it with MLPs
only. Using tri-planes leads to a more expressive data structure but will also
introduce noise in the reconstructed surface. The second component is to use a
new type of positional encoding with learnable weights to combat noise in the
reconstruction process. We divide the features in the tri-plane into multiple
frequency scales and modulate them with sin and cos functions of different
frequencies. The third component is to use learnable convolution operations on
the tri-plane features using self-attention convolution to produce features
with different frequency bands. The experiments show that PET-NeuS achieves
high-fidelity surface reconstruction on standard datasets. Following previous
work and using the Chamfer metric as the most important way to measure surface
reconstruction quality, we are able to improve upon the NeuS baseline by 57% on
NeRF-synthetic (0.84 compared to 1.97) and by 15.5% on DTU (0.71 compared to
0.84). The qualitative evaluation reveals how our method can better control the
interference of high-frequency noise. Code available at
https://github.com/yiqun-wang/PET-NeuS. |
Presents PET-NeuS, a novel neural surface reconstruction method utilizing a tri-plane representation modulated by positional encoding and enhanced by self-attention convolution. |
Aims to enhance the expressiveness of neural surface reconstruction methods for preserving fine-grained local features while mitigating noise interference. |
Integrates tri-planes into NeuS framework, introduces a novel positional encoding strategy for tri-plane features, and employs self-attention convolution to generate multi-frequency tri-plane features. |
Achieves state-of-the-art surface reconstruction quality on DTU and NeRF-synthetic datasets, outperforming baselines like NeuS, VolSDF, and HF-NeuS.
Demonstrates superior ability to reconstruct fine-grained details, such as bumps, holes, and complex structures, as evidenced by qualitative results.
Exhibits faster training time compared to competing methods while maintaining high fidelity in surface reconstruction. |
Computation time remains a limitation, even though training is faster than some baselines.
Balancing fine detail reconstruction with potential overfitting and noise in flat surface areas requires further investigation. |
neural surface reconstruction, tri-plane representation, positional encoding, self-attention convolution, multi-view reconstruction |
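The relative improvements quoted in the PET-NeuS abstract (2305.05594) follow directly from the reported Chamfer distances, where lower is better. The short check below recomputes them; the helper name is hypothetical.

```python
def relative_improvement(baseline, ours):
    """Fractional reduction of an error metric relative to the baseline."""
    return (baseline - ours) / baseline

nerf_synthetic = relative_improvement(1.97, 0.84)   # NeuS 1.97 vs PET-NeuS 0.84
dtu = relative_improvement(0.84, 0.71)              # NeuS 0.84 vs PET-NeuS 0.71

print(f"{nerf_synthetic:.1%}, {dtu:.1%}")           # prints "57.4%, 15.5%"
```

Both values agree with the 57% and 15.5% stated in the abstract.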
2305.05464
Report |
Style-A-Video: Agile Diffusion for Arbitrary Text-based Video Style Transfer |
Nisha Huang, Yuxin Zhang, Weiming Dong |
Large-scale text-to-video diffusion models have demonstrated an exceptional
ability to synthesize diverse videos. However, due to the lack of extensive
text-to-video datasets and the necessary computational resources for training,
directly applying these models for video stylization remains difficult. Also,
given that the noise addition process on the input content is random and
destructive, fulfilling the style transfer task's content preservation criteria
is challenging. This paper proposes a zero-shot video stylization method named
Style-A-Video, which utilizes a generative pre-trained transformer with an
image latent diffusion model to achieve a concise text-controlled video
stylization. We improve the guidance condition in the denoising process,
establishing a balance between artistic expression and structure preservation.
Furthermore, to decrease inter-frame flicker and avoid the formation of
additional artifacts, we employ a sampling optimization and a temporal
consistency module. Extensive experiments show that we can attain superior
content preservation and stylistic performance while incurring less consumption
than previous solutions. Code will be available at
https://github.com/haha-lisa/Style-A-Video. |
This paper introduces Style-A-Video, a novel zero-shot video stylization method leveraging a generative pre-trained transformer and an image latent diffusion model for text-driven video stylization. |
Existing text-to-video diffusion models are limited by data scarcity and computational resources, making direct application to video stylization challenging. Also, existing methods struggle to balance stylistic changes with preserving the input video's content. |
Style-A-Video utilizes a combination of text prompts for style, video frames for content, and attention maps for detailed guidance in the denoising process. It uses a custom guidance method with classifier-free guidance and employs sampling optimization and a temporal consistency module to reduce flicker and artifacts. |
Style-A-Video achieves superior content preservation compared to existing text-driven video editing approaches.
The method demonstrates strong stylistic representation capabilities, effectively transferring styles from text prompts to videos.
Evaluations show that Style-A-Video excels in temporal consistency, ensuring smooth transitions between stylized frames. |
The reliance on pre-trained models may limit the method's flexibility in handling complex or unseen styles.
Future work could explore incorporating additional conditioning elements, such as depth maps or motion cues, to enhance stylization control.
Investigating the effect of other parameters on video stability is another promising avenue for future research. |
video stylization, diffusion models, text-driven editing, content preservation, temporal consistency |
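Style-A-Video (2305.05464) is described as building its guidance on classifier-free guidance. The sketch below shows only the standard single-condition form, which extrapolates from the unconditional noise prediction toward the conditional one. The paper's custom multi-condition variant is not specified in this summary and is not reproduced here; the arrays are stand-ins for denoiser outputs.

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Standard CFG: move from the unconditional prediction toward the conditional one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Stand-in noise predictions (assumptions) for one latent of shape (4, 8, 8).
rng = np.random.default_rng(0)
eps_uncond = rng.normal(size=(4, 8, 8))
eps_cond = rng.normal(size=(4, 8, 8))

eps_off = classifier_free_guidance(eps_uncond, eps_cond, scale=0.0)     # ignores the prompt
eps_plain = classifier_free_guidance(eps_uncond, eps_cond, scale=1.0)   # plain conditional
eps_strong = classifier_free_guidance(eps_uncond, eps_cond, scale=7.5)  # amplified condition
```

A scale above 1 strengthens the influence of the condition at the cost of content fidelity, which is the trade-off between artistic expression and structure preservation that the summary describes.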
2305.05445
Report |
StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-based Generator |
Jiazhi Guan, Zhanwang Zhang, Hang Zhou, Tianshu Hu, Kaisiyuan Wang, Dongliang He, Haocheng Feng, Jingtuo Liu, Errui Ding, Ziwei Liu, Jingdong Wang |
Despite recent advances in syncing lip movements with any audio waves,
current methods still struggle to balance generation quality and the model's
generalization ability. Previous studies either require long-term data for
training or produce a similar movement pattern on all subjects with low
quality. In this paper, we propose StyleSync, an effective framework that
enables high-fidelity lip synchronization. We identify that a style-based
generator would sufficiently enable such a charming property on both one-shot
and few-shot scenarios. Specifically, we design a mask-guided spatial
information encoding module that preserves the details of the given face. The
mouth shapes are accurately modified by audio through modulated convolutions.
Moreover, our design also enables personalized lip-sync by introducing style
space and generator refinement on only limited frames. Thus the identity and
talking style of a target person could be accurately preserved. Extensive
experiments demonstrate the effectiveness of our method in producing
high-fidelity results on a variety of scenes. Resources can be found at
https://hangz-nju-cuhk.github.io/projects/StyleSync. |
This paper introduces StyleSync, a novel framework that leverages a modified style-based generator for high-fidelity lip synchronization in both generalized and personalized scenarios. |
Current lip-syncing methods struggle to achieve high fidelity while maintaining generalization ability and often require extensive training data or produce repetitive lip movements. |
StyleSync utilizes a mask-guided spatial information encoding module to preserve facial details while modulating mouth shapes according to the input audio. For personalization, it employs style space and generator refinement with limited target data. |
StyleSync outperforms state-of-the-art methods in one-shot lip-syncing, producing high-fidelity results.
The personalized optimization procedure enhances fidelity by capturing individual speaking styles.
Extensive evaluations on LRW and VoxCeleb2 datasets demonstrate the effectiveness of StyleSync in terms of both quantitative metrics and subjective user experience. |
The current method relies on a fixed mask, limiting its ability to handle dynamic head poses or expressions.
Extreme jaw positions in target videos might exceed the masked area, posing challenges for seamless blending. Future work will focus on addressing these limitations. |
lip sync, stylegan, personalized lip-sync, audio-driven facial animation, generative adversarial networks |
2305.05208
Report |
Boosting Visual-Language Models by Exploiting Hard Samples |
Haonan Wang, Minbin Huang, Runhui Huang, Lanqing Hong, Hang Xu, Tianyang Hu, Xiaodan Liang, Zhenguo Li, Hong Cheng, Kenji Kawaguchi |
Contrastive Language-Image Pre-training (CLIP) has become the standard for
learning cross-modal representations between images and text. Efforts to
improve its capabilities typically demand the collection of additional data and
retraining with new loss functions. While effective, the added requirements
limit their practical use due to the increased resource and time investments
needed. In this work, we present HELIP, a cost-effective strategy tailored to
enhance the performance of existing CLIP models without the need for training a
model from scratch or collecting additional data. Our method allows for
effortless integration with existing models' training pipelines, providing an
instant boost by training them with selected challenging text-image pairs from
their original training datasets. HELIP treats each text-image pair as a single
point in the joint vision-language space, identifying those in close proximity
as hard pairs. By incorporating the challenging data, pre-trained CLIP models
are refined using both the traditional contrastive loss and the newly
introduced hard negative margin loss, ensuring the challenging data is fully
utilized. On comprehensive benchmarks, HELIP consistently boosts existing
models to achieve leading performance. In particular, it improves the zero-shot
classification accuracy on ImageNet for SLIP models pre-trained on CC3M, CC12M
and YFCC15M datasets. The improvements are 3.05%, 4.47%, and 10.1%
respectively, achieved within two epochs of training. In addition, across
fine-grained classification datasets, HELIP improves the zero-shot performance
of pre-trained CLIP and SLIP by an average of 8.4% and 18.6%, and their linear
probe performance by an average of 9.5% and 3.0%. |
This paper introduces HELIP, a cost-effective method for improving pre-trained CLIP models by fine-tuning them with challenging data selected from their original training datasets. |
Existing methods to improve CLIP models often require retraining from scratch or collecting additional data, limiting their practical use. |
HELIP identifies 'hard pairs' – pairs of images and text in close proximity in the joint vision-language space – using a novel Hard Pair Mining (HPM) strategy. It then fine-tunes the model using both the traditional contrastive loss and a new Hard Negative Margin Loss (HNML) that leverages the hard pairs. |
HELIP consistently boosts the zero-shot classification accuracy of existing CLIP models across various datasets, including ImageNet, CIFAR-10, and CIFAR-100.
It significantly improves zero-shot and linear probe performance on fine-grained image classification datasets.
HELIP enhances zero-shot image-text retrieval performance on MS-COCO and Flickr30K datasets. |
The paper acknowledges that combining HELIP with image self-supervision and larger training batch sizes could further improve linear probe performance.
Future work will explore composition-aware fine-tuning, parameter-efficient tuning, and extending the approach to other contrastive learning domains. |
contrastive language-image pretraining (clip), hard negative mining, zero-shot learning, image classification, image-text retrieval |
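The two HELIP components (2305.05208), mining hard pairs in the joint space and adding a margin loss over them, can be sketched roughly as follows. The exact formulations are not given in this summary, so the details are assumptions: each pair is represented by concatenating its normalized image and text embeddings, hard pairs are its nearest neighbours by cosine similarity, and the loss is a hinge on image-text similarities. Function names are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def mine_hard_pairs(img, txt, k=2):
    """For each text-image pair, return indices of the k closest other pairs."""
    joint = l2_normalize(np.hstack([l2_normalize(img), l2_normalize(txt)]))
    sim = joint @ joint.T
    np.fill_diagonal(sim, -np.inf)                    # a pair is not its own neighbour
    return np.argsort(-sim, axis=1)[:, :k]

def hard_negative_margin_loss(img, txt, hard_idx, margin=0.2):
    """Hinge loss: the matched text should beat each hard pair's text by a margin."""
    img, txt = l2_normalize(img), l2_normalize(txt)
    pos = np.sum(img * txt, axis=1, keepdims=True)          # shape (n, 1)
    neg = np.einsum("nd,nkd->nk", img, txt[hard_idx])       # shape (n, k)
    return float(np.mean(np.maximum(0.0, neg - pos + margin)))

# Toy data (assumption): six pairs forming two tight clusters in a 4-d space.
rng = np.random.default_rng(0)
centers = np.array([[1.0, 0, 0, 0]] * 3 + [[0, 1.0, 0, 0]] * 3)
img = centers + 0.05 * rng.normal(size=centers.shape)
txt = centers + 0.05 * rng.normal(size=centers.shape)

hard_idx = mine_hard_pairs(img, txt, k=2)
loss = hard_negative_margin_loss(img, txt, hard_idx)
```

In this toy data the hard pairs of each sample are the other members of its cluster, and the loss stays positive until matched pairs are separated from those neighbours by the margin. In fine-tuning, such a term would be added to the usual contrastive loss.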
2305.05189
Report |
SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models |
Shanshan Zhong, Zhongzhan Huang, Wushao Wen, Jinghui Qin, Liang Lin |
Diffusion models, which have emerged to become popular text-to-image
generation models, can produce high-quality and content-rich images guided by
textual prompts. However, there are limitations to semantic understanding and
commonsense reasoning in existing models when the input prompts are concise
narrative, resulting in low-quality image generation. To improve the capacities
for narrative prompts, we propose a simple-yet-effective parameter-efficient
fine-tuning approach called the Semantic Understanding and Reasoning adapter
(SUR-adapter) for pre-trained diffusion models. To reach this goal, we first
collect and annotate a new dataset SURD which consists of more than 57,000
semantically corrected multi-modal samples. Each sample contains a simple
narrative prompt, a complex keyword-based prompt, and a high-quality image.
Then, we align the semantic representation of narrative prompts to the complex
prompts and transfer knowledge of large language models (LLMs) to our
SUR-adapter via knowledge distillation so that it can acquire the powerful
semantic understanding and reasoning capabilities to build a high-quality
textual semantic representation for text-to-image generation. We conduct
experiments by integrating multiple LLMs and popular pre-trained diffusion
models to show the effectiveness of our approach in enabling diffusion models
to understand and reason about concise natural language without image quality
degradation. Our approach can make text-to-image diffusion models easier to use
with better user experience, which demonstrates our approach has the potential
for further advancing the development of user-friendly text-to-image generation
models by bridging the semantic gap between simple narrative prompts and
complex keyword-based prompts. The code is released at
https://github.com/Qrange-group/SUR-adapter. |
This paper introduces SUR-adapter, a novel fine-tuning approach for enhancing pre-trained text-to-image diffusion models with improved semantic understanding and reasoning (SUR) capabilities from LLMs. |
Existing diffusion models often struggle to generate high-quality images from concise narrative prompts due to limitations in the semantic understanding and commonsense reasoning of their text encoders. |
The authors collect a new dataset (SURD) with simple narrative prompts, complex keyword-based prompts, and corresponding images. They then use SURD to fine-tune diffusion models with SUR-adapter, which leverages knowledge distillation from LLMs and aligns representations of simple and complex prompts. |
SUR-adapter significantly improves the semantic accuracy of generated images from simple prompts across different diffusion models and control methods.
The fine-tuned models maintain image generation quality comparable to the original pre-trained models.
Deeper layers of LLMs contribute more effectively to semantic distillation. |
While improved, SUR-adapter doesn't completely solve the semantic understanding issue, suggesting a need for larger multi-modal datasets and more advanced distillation techniques.
The performance difference between LLMs of varying sizes is insignificant, indicating potential limitations in the adapter's capacity to transfer knowledge. |
diffusion model, large language model, multimodal image generation, adapter, knowledge distillation |
2305.04966
Report |
NerfAcc: Efficient Sampling Accelerates NeRFs |
Ruilong Li, Hang Gao, Matthew Tancik, Angjoo Kanazawa |
Optimizing and rendering Neural Radiance Fields is computationally expensive
due to the vast number of samples required by volume rendering. Recent works
have included alternative sampling approaches to help accelerate their methods,
however, they are often not the focus of the work. In this paper, we
investigate and compare multiple sampling approaches and demonstrate that
improved sampling is generally applicable across NeRF variants under a unified
concept of transmittance estimator. To facilitate future experiments, we
develop NerfAcc, a Python toolbox that provides flexible APIs for incorporating
advanced sampling methods into NeRF related methods. We demonstrate its
flexibility by showing that it can reduce the training time of several recent
NeRF methods by 1.5x to 20x with minimal modifications to the existing
codebase. Additionally, highly customized NeRFs, such as Instant-NGP, can be
implemented in native PyTorch using NerfAcc. |
This paper introduces NerfAcc, a Python toolbox designed to accelerate the training of Neural Radiance Fields (NeRFs) through efficient sampling techniques. |
Optimizing and rendering NeRFs is computationally expensive due to the numerous samples required for volume rendering. Existing efficient sampling methods are often tightly coupled with specific NeRF implementations, hindering wider adoption. |
The paper presents a unified view of various sampling methods as constructing a 'transmittance estimator' for importance sampling. NerfAcc decouples the sampling procedure, offering a plug-and-play solution compatible with different NeRF variants. |
NerfAcc significantly reduces training time (1.5x to 20x) for various NeRF methods across different datasets, often with slightly improved performance.
The toolbox allows for the training of an Instant-NGP model with pure Python code, achieving comparable speed and slightly better performance than the original CUDA implementation.
The unified framework enables combining different sampling approaches, leading to improved results in certain scenarios (e.g., combining occupancy grid and proposal network). |
The current implementation primarily focuses on density-based NeRFs for per-scene optimization, with limited support for SDF-based methods.
Exploring alternative update functions for the transmittance estimator, beyond EMA and SGD, could be a potential direction for future work. |
neural radiance fields, nerf, volumetric rendering, importance sampling, transmittance estimation |
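The "transmittance estimator" view in the NerfAcc summary above can be illustrated with the standard volume-rendering quadrature: the transmittance reaching a sample is the product of (1 - alpha) over all earlier samples, and each sample's rendering weight is that transmittance times its own alpha. The sketch below is plain Python for one ray; the function name `render_weights` is illustrative and is not part of NerfAcc's API.

```python
import math


def render_weights(sigmas, deltas):
    """Per-sample rendering weights along one ray.

    sigmas: densities at the samples; deltas: lengths of the sample intervals.
    Returns (weights, remaining_transmittance).
    """
    weights = []
    transmittance = 1.0  # fraction of light that reaches the current sample
    for sigma, delta in zip(sigmas, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this interval
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights, transmittance
```

Samples whose weight is close to zero (empty space, or anything behind an opaque surface) contribute almost nothing to the pixel, which is why an estimate of transmittance lets a sampler skip them.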
2305.04790
Report |
MultiModal-GPT: A Vision and Language Model for Dialogue with Humans |
Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, Kai Chen |
We present a vision and language model named MultiModal-GPT to conduct
multi-round dialogue with humans. MultiModal-GPT can follow various
instructions from humans, such as generating a detailed caption, counting the
number of objects of interest, and answering general questions from users.
MultiModal-GPT is parameter-efficiently fine-tuned from OpenFlamingo, with
Low-rank Adapter (LoRA) added both in the cross-attention part and the
self-attention part of the language model. We first construct instruction
templates with vision and language data for multi-modality instruction tuning
to make the model understand and follow human instructions. We find the quality
of training data is vital for dialogue performance: even a small amount of data
containing short answers can lead the model to respond tersely to any
instruction. To further enhance the ability to chat with humans of the
MultiModal-GPT, we utilize language-only instruction-following data to train
the MultiModal-GPT jointly. The joint training of language-only and
visual-language instructions with the same instruction template
effectively improves dialogue performance. Various demos show the ability of
continuous dialogue of MultiModal-GPT with humans. Code, dataset, and demo are
at https://github.com/open-mmlab/Multimodal-GPT |
Introduces MultiModal-GPT, a vision and language model fine-tuned from OpenFlamingo for multi-round dialogues with humans, capable of tasks like detailed captioning, object counting, and general query answering. |
Aims to bridge the gap in existing models' ability to engage in accurate, human-like multimodal dialogues. |
Fine-tunes OpenFlamingo with Low-rank Adapter (LoRA) using a unified instruction template for both language and visual-language instruction data, enabling synergistic training and improved performance. |
Joint training with language and visual-language data significantly improves dialogue performance.
Highlights the importance of high-quality training data, showing that datasets with limited response types (e.g., yes/no) can degrade the model's conversational abilities.
Demonstrates through various demos MultiModal-GPT's proficiency in maintaining continuous dialogues, handling tasks like recipe generation, object counting, OCR, and general knowledge questions. |
Current work does not include certain potentially beneficial vision and language instruction datasets like MultiInstruct.
Future work could explore more advanced techniques to improve the model's ability to handle complex dialogue scenarios. |
multimodal dialogue, vision and language model, instruction tuning, openflamingo, multimodal-gpt |
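The Low-rank Adapter (LoRA) mentioned in the MultiModal-GPT record above can be sketched in a few lines: a frozen weight matrix is left untouched and a trainable low-rank product, scaled by alpha / rank, is added to its output. The code below is a plain-Python illustration with hypothetical names and shapes (`x` is n x d_in, `w_frozen` is d_in x d_out, `lora_a` is d_in x rank, `lora_b` is rank x d_out); it is not taken from the OpenFlamingo or MultiModal-GPT code.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w_frozen, lora_a, lora_b, alpha, rank):
    """Compute x @ W + (alpha / rank) * (x @ A) @ B.

    Only lora_a and lora_b would be trained; w_frozen stays fixed.
    """
    base = matmul(x, w_frozen)
    update = matmul(matmul(x, lora_a), lora_b)
    scale = alpha / rank
    return [
        [b + scale * u for b, u in zip(base_row, update_row)]
        for base_row, update_row in zip(base, update)
    ]
```

With `lora_b` initialized to zeros, the adapted layer reproduces the frozen layer exactly, so fine-tuning starts from the pre-trained behavior.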
2305.04517
Report |
DiffBFR: Bootstrapping Diffusion Model Towards Blind Face Restoration |
Xinmin Qiu, Congying Han, Zicheng Zhang, Bonan Li, Tiande Guo, Xuecheng Nie |
Blind face restoration (BFR) is important yet challenging. Prior works
prefer to exploit GAN-based frameworks to tackle this task due to the balance
of quality and efficiency. However, these methods suffer from poor stability
and adaptability to long-tail distribution, failing to simultaneously retain
source identity and restore detail. We propose DiffBFR to introduce Diffusion
Probabilistic Model (DPM) for BFR to tackle the above problem, given its
superiority over GAN in aspects of avoiding training collapse and generating
long-tail distribution. DiffBFR utilizes a two-step design, that first restores
identity information from low-quality images and then enhances texture details
according to the distribution of real faces. This design is implemented with
two key components: 1) Identity Restoration Module (IRM) for preserving the
face details in results. Instead of denoising from pure Gaussian random
distribution with LQ images as the condition during the reverse process, we
propose a novel truncated sampling method which starts from LQ images with partial
noise added. We theoretically prove that this change shrinks the evidence lower
bound of DPM and then restores more original details. With theoretical proof,
two cascade conditional DPMs with different input sizes are introduced to
strengthen this sampling effect and reduce training difficulty in the
high-resolution image generated directly. 2) Texture Enhancement Module (TEM)
for polishing the texture of the image. Here an unconditional DPM, a LQ-free
model, is introduced to further force the restorations to appear realistic. We
theoretically proved that this unconditional DPM trained on pure HQ images
contributes to justifying the correct distribution of inference images output
from IRM in pixel-level space. Truncated sampling with fractional time step is
utilized to polish pixel-level textures while preserving identity information. |
This paper introduces DiffBFR, the first approach applying pure diffusion models to Blind Face Restoration (BFR). It leverages the advantages of diffusion models over GANs in handling long-tail distributions and avoiding training collapse. |
BFR, crucial for various applications, remains challenging due to the complex degradation in real-world images. Existing GAN-based methods struggle to restore fine-grained details and realistic textures, especially in cases of long-tail distribution. |
DiffBFR utilizes a two-step design: 1) **IRM (Identity Restoration Module):** Restores identity information from low-quality images using a novel truncated sampling method with cascaded conditional DPMs. 2) **TEM (Texture Enhancement Module):** Enhances texture details based on the distribution of real faces learned by an unconditional DPM trained on HQ images. |
DiffBFR achieves superior quantitative results compared to state-of-the-art methods, as demonstrated by FID, NIQE, and LPIPS metrics.
Qualitative results highlight DiffBFR's ability to restore high-fidelity facial details, maintain person identities, and produce realistic textures.
Theoretical analysis and ablation studies validate the effectiveness of the proposed IRM and TEM modules. |
Inference time, though reduced by truncated sampling, remains longer than GAN-based methods, requiring further optimization.
The cascaded multi-stage structure results in a larger parameter scale compared to single-stage diffusion models. |
blind face restoration, diffusion probabilistic models, long-tail distribution, identity restoration, texture enhancement |
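The truncated sampling idea in the DiffBFR record above (start the reverse process from the low-quality image with partial noise added, instead of from pure Gaussian noise) can be sketched with the standard forward-diffusion formula. The linear beta schedule, the step index, and the names `linear_alpha_bar` and `truncated_start` are assumptions for illustration; the paper's actual schedule and truncation point are not given in this summary.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bar, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar


def truncated_start(lq_pixels, alpha_bar, start_step, rng):
    """Noise the LQ image up to an intermediate step.

    x_t = sqrt(alpha_bar_t) * x_LQ + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, 1).
    The reverse process would then run from start_step down to 0.
    """
    a = alpha_bar[start_step]
    return [
        math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
        for x in lq_pixels
    ]


# Usage: begin denoising from step 300 of 1000 instead of from pure noise.
schedule = linear_alpha_bar(1000)
x_start = truncated_start([0.2, -0.4, 0.9], schedule, 300, random.Random(0))
```

Starting at an intermediate step keeps part of the LQ signal in the initial state and also shortens the reverse chain.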
2305.04470
Report |
Video Object Segmentation in Panoptic Wild Scenes |
Yuanyou Xu, Zongxin Yang, Yi Yang |
In this paper, we introduce semi-supervised video object segmentation (VOS)
to panoptic wild scenes and present a large-scale benchmark as well as a
baseline method for it. Previous benchmarks for VOS with sparse annotations are
not sufficient to train or evaluate a model that needs to process all possible
objects in real-world scenarios. Our new benchmark (VIPOSeg) contains
exhaustive object annotations and covers various real-world object categories
which are carefully divided into subsets of thing/stuff and seen/unseen classes
for comprehensive evaluation. Considering the challenges in panoptic VOS, we
propose a strong baseline method named panoptic object association with
transformers (PAOT), which uses panoptic identification to associate objects
with a pyramid architecture on multiple scales. Experimental results show that
VIPOSeg can not only boost the performance of VOS models by panoptic training
but also evaluate them comprehensively in panoptic scenes. Previous methods for
classic VOS still need to improve in performance and efficiency when dealing
with panoptic scenes, while our PAOT achieves SOTA performance with good
efficiency on VIPOSeg and previous VOS benchmarks. PAOT also ranks 1st in the
VOT2022 challenge. Our dataset is available at
https://github.com/yoxu515/VIPOSeg-Benchmark. |
This paper introduces the concept of panoptic video object segmentation (VOS) and presents VIPOSeg, a new large-scale benchmark dataset with exhaustive object annotations encompassing seen/unseen and thing/stuff classes, along with a baseline method PAOT for this task. |
Existing VOS benchmarks are limited by sparse annotations and lack of diverse object categories, hindering the development and evaluation of models equipped for real-world scenarios with numerous objects and stuff classes. |
The authors leverage the VIPSeg dataset for video panoptic segmentation to create VIPOSeg by re-splitting the data, converting annotations to VOS format, and meticulously cleaning the annotations. They also introduce PAOT, which employs decoupled identity banks for thing/stuff objects, a pyramid architecture for multi-scale matching, and efficient long-short term transformers to address panoptic scene challenges. |
VIPOSeg proves to be significantly more challenging than previous VOS benchmarks due to its dense object annotations and diverse object scales.
Training on VIPOSeg significantly improves the performance of VOS methods, including on classic VOS benchmarks.
PAOT, with its pyramid architecture and panoptic ID strategy, achieves state-of-the-art performance on VIPOSeg and other benchmarks, demonstrating its effectiveness for panoptic VOS. |
The efficiency of current VOS models on VIPOSeg requires further improvement, as most models demand over 11 GB of memory.
Future work includes exploring better memory strategies and larger ID capacities to enhance model efficiency for panoptic VOS. |
video object segmentation, panoptic segmentation, benchmark dataset, deep learning, computer vision |
2305.04461
Report |
Locally Attentional SDF Diffusion for Controllable 3D Shape Generation |
Xin-Yang Zheng, Hao Pan, Peng-Shuai Wang, Xin Tong, Yang Liu, Heung-Yeung Shum |
Although the recent rapid evolution of 3D generative neural networks greatly
improves 3D shape generation, it is still not convenient for ordinary users to
create 3D shapes and control the local geometry of generated shapes. To address
these challenges, we propose a diffusion-based 3D generation framework --
locally attentional SDF diffusion, to model plausible 3D shapes, via 2D sketch
image input. Our method is built on a two-stage diffusion model. The first
stage, named occupancy-diffusion, aims to generate a low-resolution occupancy
field to approximate the shape shell. The second stage, named SDF-diffusion,
synthesizes a high-resolution signed distance field within the occupied voxels
determined by the first stage to extract fine geometry. Our model is empowered
by a novel view-aware local attention mechanism for image-conditioned shape
generation, which takes advantage of 2D image patch features to guide 3D voxel
feature learning, greatly improving local controllability and model
generalizability. Through extensive experiments in sketch-conditioned and
category-conditioned 3D shape generation tasks, we validate and demonstrate the
ability of our method to provide plausible and diverse 3D shapes, as well as
its superior controllability and generalizability over existing work. Our code
and trained models are available at
https://zhengxinyang.github.io/projects/LAS-Diffusion.html |
This paper proposes LAS-Diffusion, a novel two-stage diffusion-based 3D shape generation framework that takes 2D sketch images as input, aiming at achieving plausible 3D shape generation with local controllability. |
Existing 3D shape generation methods struggle with quality and lack intuitive control, especially for ordinary users who wish to embed creative ideas into the generation process. |
The framework employs two stages: occupancy-diffusion, generating a low-resolution occupancy field to approximate the shape shell; and SDF-diffusion, synthesizing a high-resolution signed distance field within the occupied voxels. It utilizes a view-aware local attention mechanism to leverage 2D image patch features for guiding 3D voxel feature learning and local control. |
LAS-Diffusion outperforms existing methods in terms of local controllability and generalizability for sketch-conditioned generation tasks.
The method exhibits superior shape quality and diversity for category-conditioned generation tasks.
It demonstrates robustness to different sketch styles, including synthetic sketches, freehand sketches, and professional sketches. |
The model's sketch style is currently limited to the rendering pipeline used during training, making it less adaptable to highly distorted or inconsistent sketches.
The current work focuses on shape geometry, with future work aiming to incorporate shape appearance generation using sketches and language descriptions. |
3d shape generation, diffusion model, sketch-conditioned, local attention, sdf |
2305.04451
Report |
FashionTex: Controllable Virtual Try-on with Text and Texture |
Anran Lin, Nanxuan Zhao, Shuliang Ning, Yuda Qiu, Baoyuan Wang, Xiaoguang Han |
Virtual try-on attracts increasing research attention as a promising way for
enhancing the user experience for online cloth shopping. Though existing
methods can generate impressive results, users need to provide a well-designed
reference image containing the target fashion clothes that often do not exist.
To support user-friendly fashion customization in full-body portraits, we
propose a multi-modal interactive setting by combining the advantages of both
text and texture for multi-level fashion manipulation. With the carefully
designed fashion editing module and loss functions, FashionTex framework can
semantically control cloth types and local texture patterns without annotated
pairwise training data. We further introduce an ID recovery module to maintain
the identity of input portrait. Extensive experiments have demonstrated the
effectiveness of our proposed pipeline. |
Presents FashionTex, a novel pipeline for interactive and controllable full-body virtual try-on using text prompts to modify clothing types and texture patches to adjust local patterns. |
Addresses limitations of existing virtual try-on methods that require reference images with specific clothing items, enabling user-friendly fashion customization. |
Leverages StyleGAN's latent space for editing, designing a fashion editing module with separate mappers for text and texture inputs. Introduces a novel CLIP-based type loss for accurate cloth type manipulation and an ID recovery module to maintain portrait identity. |
Achieves precise control over cloth types and textures, enabling customization based on textual descriptions and reference patches.
Outperforms existing text-driven image manipulation methods (TediGAN, StyleCLIP) in fashion type editing based on FID and accuracy metrics.
Exhibits superior performance in texture transfer compared to TextureGAN, DiOr, and Texture Reformer, as evidenced by FID and LPIPS scores. |
Reliance on human parsing for region-specific editing may introduce errors.
Limited diversity in generated clothing styles due to the training dataset. |
virtual try-on, fashion editing, multi-modal learning, text-to-image synthesis, stylegan |
2305.04441
Report |
Prompt Tuning Inversion for Text-Driven Image Editing Using Diffusion Models |
Wenkai Dong, Song Xue, Xiaoyue Duan, Shumin Han |
Recently large-scale language-image models (e.g., text-guided diffusion
models) have considerably improved the image generation capabilities to
generate photorealistic images in various domains. Based on this success,
current image editing methods use texts to achieve intuitive and versatile
modification of images. To edit a real image using diffusion models, one must
first invert the image to a noisy latent from which an edited image is sampled
with a target text prompt. However, most methods lack one of the following:
user-friendliness (e.g., additional masks or precise descriptions of the input
image are required), generalization to larger domains, or high fidelity to the
input image. In this paper, we design an accurate and quick inversion
technique, Prompt Tuning Inversion, for text-driven image editing.
Specifically, our proposed editing method consists of a reconstruction stage
and an editing stage. In the first stage, we encode the information of the
input image into a learnable conditional embedding via Prompt Tuning Inversion.
In the second stage, we apply classifier-free guidance to sample the edited
image, where the conditional embedding is calculated by linearly interpolating
between the target embedding and the optimized one obtained in the first stage.
This technique ensures a superior trade-off between editability and high
fidelity to the input image of our method. For example, we can change the color
of a specific object while preserving its original shape and background under
the guidance of only a target text prompt. Extensive experiments on ImageNet
demonstrate the superior editing performance of our method compared to the
state-of-the-art baselines. |
This paper proposes Prompt Tuning Inversion, a novel text-driven image editing method based on diffusion models that allows accurate and efficient editing of real images using only target text prompts. |
Existing text-driven image editing methods typically lack at least one of the following: user-friendliness (they require masks or precise image descriptions), generalization to larger domains, or fidelity to the input image. This method aims to address these shortcomings. |
The method has two stages. First, Prompt Tuning Inversion encodes the input image information into a learnable conditional embedding. Second, it linearly interpolates this embedding with the target text embedding to guide the diffusion model in generating an edited image. |
The method outperforms state-of-the-art baselines like DiffEdit in terms of the trade-off between editability and fidelity to the input image.
Prompt Tuning Inversion demonstrates faster convergence and superior reconstruction quality compared to Null-Text Inversion.
Ablation studies highlight the influence of interpolation ratio and learning rate on the balance between editability and fidelity. |
The method may not successfully edit all instances of an object in an image with multiple objects.
Future work could explore techniques like precise attention map manipulation or multi-modal conditional control to overcome this limitation. |
image editing, diffusion models, text-guided synthesis, prompt tuning, image inversion |
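The editing stage described in the Prompt Tuning Inversion record above combines two simple operations: a linear interpolation between the target text embedding and the embedding optimized during reconstruction, followed by classifier-free guidance. The sketch below shows both on plain lists; the parameter name `eta` and the convention that `eta = 1` means "use only the target embedding" are assumptions, not necessarily the paper's notation.

```python
def interpolate_embedding(target_emb, optimized_emb, eta):
    """Blend embeddings: eta = 0 keeps the optimized one, eta = 1 the target."""
    return [eta * t + (1.0 - eta) * o for t, o in zip(target_emb, optimized_emb)]


def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Standard classifier-free guidance on predicted noise.

    eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

A smaller `eta` stays closer to the reconstruction (higher fidelity), while a larger `eta` follows the target prompt more strongly (higher editability), which matches the trade-off the ablations refer to.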
2305.04440
Report |
Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting |
Zhicheng Wang, Liwen Xiao, Zhiguo Cao, Hao Lu |
Class-agnostic counting (CAC) aims to count objects of interest from a query
image given few exemplars. This task is typically addressed by extracting the
features of query image and exemplars respectively and then matching their
feature similarity, leading to an extract-then-match paradigm. In this work, we
show that CAC can be simplified in an extract-and-match manner, particularly
using a vision transformer (ViT) where feature extraction and similarity
matching are executed simultaneously within the self-attention. We reveal the
rationale of such simplification from a decoupled view of the self-attention.
The resulting model, termed CACViT, simplifies the CAC pipeline into a single
pretrained plain ViT. Further, to compensate for the loss of the scale and the
order-of-magnitude information due to resizing and normalization in plain ViT,
we present two effective strategies for scale and magnitude embedding.
Extensive experiments on the FSC147 and the CARPK datasets show that CACViT
significantly outperforms state-of-the-art CAC approaches in both effectiveness
(23.60% error reduction) and generalization, which suggests CACViT provides a
concise and strong baseline for CAC. Code will be available. |
This paper presents CACViT, a simple yet effective ViT-based model for class-agnostic counting (CAC) that simplifies CAC into an extract-and-match paradigm within a single pretrained plain ViT, outperforming state-of-the-art approaches. |
CAC is important due to its potential to generalize to unseen scenes and reduced reliance on class-specific training data. Existing methods suffer from redundancy and task-specific designs by following an extract-then-match paradigm. |
The authors leverage the self-attention mechanism of ViT to simultaneously extract features and perform matching for query images and exemplars. Two effective strategies, aspect-ratio-aware scale embedding and order-of-magnitude embedding, are introduced to compensate for scale information loss due to resizing and normalization in ViT. |
CACViT significantly outperforms state-of-the-art CAC approaches on FSC147, achieving relative error reductions of 19.04% and 23.60% on the validation and test sets, respectively.
The method demonstrates strong cross-dataset generalization ability on the CARPK car counting dataset.
Ablation studies validate the effectiveness of the proposed scale and magnitude embedding strategies. |
The performance on the FSC147 test set for the 1-shot setting is unexpectedly better than the 3-shot setting, potentially due to annotation quality issues for dense environments.
The improvement on MAE by magnitude embedding alone is marginal, possibly due to inaccurate object size priors from resized exemplars. |
class-agnostic counting, vision transformer, self-attention, scale embedding, magnitude embedding |
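The "decoupled view" of self-attention in the CACViT record above can be illustrated on toy data: when query-image tokens and exemplar tokens are concatenated into one sequence, each query token's attention row splits into a query-to-query part (feature extraction) and a query-to-exemplar part (similarity matching). The sketch below uses identity projections instead of learned query/key matrices, which is a simplifying assumption, and the function name `attention_blocks` is illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_blocks(query_tokens, exemplar_tokens):
    """Split the attention rows of the query tokens into two blocks.

    Returns (self_part, match_part): attention from query tokens to query
    tokens, and from query tokens to exemplar tokens.
    """
    tokens = query_tokens + exemplar_tokens
    scale = math.sqrt(len(tokens[0]))
    attn = [
        softmax([sum(a * b for a, b in zip(t, u)) / scale for u in tokens])
        for t in tokens
    ]
    n_q = len(query_tokens)
    self_part = [row[:n_q] for row in attn[:n_q]]
    match_part = [row[n_q:] for row in attn[:n_q]]
    return self_part, match_part
```

A query token that resembles the exemplar receives a larger share of its attention mass in `match_part`, which is the matching signal a counting head could then read out.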
2305.04296
Report |
HashCC: Lightweight Method to Improve the Quality of the Camera-less NeRF Scene Generation |
Jan Olszewski |
Neural Radiance Fields has become a prominent method of scene generation via
view synthesis. A critical requirement for the original algorithm to learn
meaningful scene representation is camera pose information for each image in a
data set. Current approaches try to circumvent this assumption with
moderate success, by learning approximate camera positions alongside learning
neural representations of a scene. This requires complicated camera models,
causing a long and complicated training process, or results in a lack of
texture and sharp details in rendered scenes. In this work we introduce Hash
Color Correction (HashCC) -- a lightweight method for improving Neural Radiance
Fields rendered image quality, applicable also in situations where camera
positions for a given set of images are unknown. |
Introduces HashCC, a lightweight color correction method for NeRF to enhance rendered image quality in camera-less scenarios. |
Addresses limitations of existing camera-less NeRF methods that produce blurry results or require complex models and training, hindering wider application. |
Extends NeRF-- by incorporating a Color Correction Network with Hash Encoding and a shallow MLP, adding a color correction term to the main network output. Uses Spherical Harmonics encoding for viewing direction. |
Improves rendered image quality in 6 out of 8 scenes from the LLFF dataset based on PSNR, SSIM, and LPIPS metrics.
Enhances camera pose estimation compared to NeRF-- in most scenes.
Shows sharper details and textures compared to baseline, as demonstrated in qualitative comparisons. |
Limited improvement in camera pose estimation due to reliance on a simple pinhole camera model.
Future work can explore sophisticated camera models and exploit camera trajectory information for improved pose estimation, especially in non-forward-facing scenarios. |
neural radiance fields, camera-less nerf, hash encoding, color correction, view synthesis |
2305.04268
Report |
Multi-Space Neural Radiance Fields |
Ze-Xin Yin, Jiaxiong Qiu, Ming-Ming Cheng, Bo Ren |
Existing Neural Radiance Fields (NeRF) methods suffer from the existence of
reflective objects, often resulting in blurry or distorted rendering. Instead
of calculating a single radiance field, we propose a multi-space neural
radiance field (MS-NeRF) that represents the scene using a group of feature
fields in parallel sub-spaces, which leads to a better understanding of the
neural network toward the existence of reflective and refractive objects. Our
multi-space scheme works as an enhancement to existing NeRF methods, with only
small computational overheads needed for training and inferring the extra-space
outputs. We demonstrate the superiority and compatibility of our approach using
three representative NeRF-based models, i.e., NeRF, Mip-NeRF, and Mip-NeRF 360.
Comparisons are performed on a newly constructed dataset consisting of 25
synthetic scenes and 7 real captured scenes with complex reflection and
refraction, all having 360-degree viewpoints. Extensive experiments show that
our approach significantly outperforms the existing single-space NeRF methods
for rendering high-quality scenes concerned with complex light paths through
mirror-like objects. Our code and dataset will be publicly available at
https://zx-yin.github.io/msnerf. |
This paper introduces Multi-Space Neural Radiance Fields (MS-NeRF), a novel method addressing challenges in rendering reflective objects in 360-degree scenes. |
Existing NeRF methods struggle with reflections, often producing blurry results due to violated multi-view consistency. |
The MS-NeRF represents scenes as multiple virtual sub-spaces, each adhering to multi-view consistency. This is achieved using a lightweight multi-space module replacing the original output layer of NeRF, generating densities and features for each sub-space, subsequently decoded and weighted for final rendering. |
MS-NeRF significantly outperforms single-space NeRF methods in rendering scenes with complex reflections.
The approach demonstrates compatibility with various NeRF architectures like NeRF, Mip-NeRF, and Mip-NeRF 360.
A new dataset featuring complex reflections and refractions is introduced, including synthetic and real-world captured scenes. |
The number of sub-spaces, while not needing to precisely match the virtual images, requires careful selection.
Future work could explore automatically determining the optimal number of sub-spaces. |
neural radiance fields, nerf, reflections, 360-degree rendering, novel view synthesis |
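The final step of the MS-NeRF record above, where per-sub-space outputs are "decoded and weighted for final rendering", can be sketched as a weighted blend of the colors rendered in each sub-space. The summary does not say how the weights are normalized, so the softmax used here, the function name `combine_subspaces`, and the RGB-list representation are all assumptions for illustration.

```python
import math


def combine_subspaces(subspace_colors, weight_logits):
    """Blend one pixel's colors from K sub-spaces with softmax weights.

    subspace_colors: K colors, each a list of channel values.
    weight_logits: K unnormalized scores, one per sub-space.
    """
    m = max(weight_logits)
    exps = [math.exp(v - m) for v in weight_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    channels = len(subspace_colors[0])
    return [
        sum(w * color[ch] for w, color in zip(weights, subspace_colors))
        for ch in range(channels)
    ]
```

Intuitively, a pixel showing a mirror can draw most of its color from the sub-space that holds the reflected (virtual) scene, while other pixels rely on the sub-space that holds the real geometry.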
2305.04232
Report |
CatFLW: Cat Facial Landmarks in the Wild Dataset |
George Martvel, Nareed Farhat, Ilan Shimshoni, Anna Zamansky |
Animal affective computing is a quickly growing field of research, where only
recently first efforts to go beyond animal tracking into recognizing their
internal states, such as pain and emotions, have emerged. In most mammals,
facial expressions are an important channel for communicating information about
these states. However, unlike the human domain, there is an acute lack of
datasets that make automation of facial analysis of animals feasible.
This paper aims to fill this gap by presenting a dataset called Cat Facial
Landmarks in the Wild (CatFLW) which contains 2016 images of cat faces in
different environments and conditions, annotated with 48 facial landmarks
specifically chosen for their relationship with underlying musculature, and
relevance to cat-specific facial Action Units (CatFACS). To the best of our
knowledge, this dataset has the largest number of cat facial landmarks
available.
In addition, we describe a semi-supervised (human-in-the-loop) method of
annotating images with landmarks, used for creating this dataset, which
significantly reduces the annotation time and could be used for creating
similar datasets for other animals.
The dataset is available on request. |
This paper introduces CatFLW, a dataset of cat faces annotated with 48 facial landmarks, designed for advancing animal affective computing, especially for cats. |
Publicly available animal facial landmark datasets are scarce, hindering research in animal affective computing. This is particularly crucial for cats, where facial analysis can help with pain assessment. |
The authors selected images of single, fully visible cat faces from an existing dataset. They then used a semi-supervised 'human-in-the-loop' method to annotate facial landmarks, leveraging a gradually trained EfficientNet model to expedite the process. |
CatFLW contains 2016 images of cat faces annotated with 48 landmarks, focusing on features relevant to cat facial expressions and musculature.
The 'human-in-the-loop' annotation significantly reduced annotation time per image compared to purely manual methods.
The dataset exhibits diverse cat breeds, environments, and head poses, similar to the AnimalWeb dataset, making it suitable for training robust computer vision models. |
The dataset size, while the largest of its kind, is still limited compared to human facial landmark datasets.
Future work can focus on expanding the dataset, developing automated landmark detection models, and exploring applications in pain and emotion recognition. |
facial landmark detection, animal affective computing, cat facial expressions, pain assessment, dataset |
2305.04075
Report |
PointCMP: Contrastive Mask Prediction for Self-supervised Learning on Point Cloud Videos |
Zhiqiang Shen, Xiaoxiao Sheng, Longguang Wang, Yulan Guo, Qiong Liu, Xi Zhou |
Self-supervised learning can extract representations of good quality from
solely unlabeled data, which is appealing for point cloud videos due to their
high labelling cost. In this paper, we propose a contrastive mask prediction
(PointCMP) framework for self-supervised learning on point cloud videos.
Specifically, our PointCMP employs a two-branch structure to achieve
simultaneous learning of both local and global spatio-temporal information. On
top of this two-branch structure, a mutual similarity based augmentation module
is developed to synthesize hard samples at the feature level. By masking
dominant tokens and erasing principal channels, we generate hard samples to
facilitate learning representations with better discrimination and
generalization performance. Extensive experiments show that our PointCMP
achieves state-of-the-art performance on benchmark datasets and outperforms
existing fully supervised counterparts. Transfer learning results demonstrate
the superiority of the learned representations across different datasets and
tasks. |
This paper proposes PointCMP, a novel contrastive mask prediction framework for self-supervised learning on point cloud videos. |
Self-supervised learning on point cloud videos is crucial for reducing the high cost of annotation, and current paradigms struggle to effectively capture both local and global spatio-temporal information essential for this task. |
PointCMP leverages a two-branch structure to learn local and global information, coupled with a mutual similarity based augmentation module to generate hard samples at the feature level by masking dominant tokens and erasing principal channels. |
PointCMP achieves state-of-the-art performance on benchmark datasets for 3D action and gesture recognition, outperforming fully supervised counterparts.
The method demonstrates superior performance in linear probing and semi-supervised settings, highlighting the quality of the learned representations.
Transfer learning experiments show strong generalization capabilities across datasets and tasks, exceeding previous self-supervised methods. |
The current design of PointCMP primarily focuses on single-view point cloud videos, limiting its applicability to multi-view scenarios.
Exploring more advanced masking and augmentation strategies at the feature level could further enhance the performance of PointCMP. |
self-supervised learning, point cloud videos, contrastive learning, mask prediction, action recognition |
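The idea of masking dominant tokens to create hard samples, as described in the PointCMP record above, can be sketched as follows. This is an illustrative assumption rather than the paper's procedure: here "dominant" is taken to mean "most similar to a global feature by cosine similarity", and the names `cosine` and `mask_dominant_tokens` and the `ratio` parameter are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mask_dominant_tokens(tokens, global_feature, ratio=0.5):
    """Zero out the fraction `ratio` of tokens most similar to the global feature."""
    k = int(len(tokens) * ratio)
    ranked = sorted(range(len(tokens)),
                    key=lambda i: cosine(tokens[i], global_feature),
                    reverse=True)
    masked = set(ranked[:k])
    out = [[0.0] * len(t) if i in masked else list(t)
           for i, t in enumerate(tokens)]
    return out, sorted(masked)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
hard_sample, masked_ids = mask_dominant_tokens(tokens, [1.0, 0.0])
```

Removing the tokens that already agree most with the global feature forces the model to rely on the remaining, less informative tokens, which is what makes the sample "hard".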
2305.03989
Report |
LEO: Generative Latent Image Animator for Human Video Synthesis |
Yaohui Wang, Xin Ma, Xinyuan Chen, Antitza Dantcheva, Bo Dai, Yu Qiao |
Spatio-temporal coherency is a major challenge in synthesizing high quality
videos, particularly in synthesizing human videos that contain rich global and
local deformations. To resolve this challenge, previous approaches have
resorted to different features in the generation process aimed at representing
appearance and motion. However, in the absence of strict mechanisms to
guarantee such disentanglement, a separation of motion from appearance has
remained challenging, resulting in spatial distortions and temporal jittering
that break the spatio-temporal coherency. Motivated by this, we here propose
LEO, a novel framework for human video synthesis, placing emphasis on
spatio-temporal coherency. Our key idea is to represent motion as a sequence of
flow maps in the generation process, which inherently isolate motion from
appearance. We implement this idea via a flow-based image animator and a Latent
Motion Diffusion Model (LMDM). The former bridges a space of motion codes with
the space of flow maps, and synthesizes video frames in a warp-and-inpaint
manner. LMDM learns to capture motion prior in the training data by
synthesizing sequences of motion codes. Extensive quantitative and qualitative
analysis suggests that LEO significantly improves coherent synthesis of human
videos over previous methods on the datasets TaichiHD, FaceForensics and
CelebV-HQ. In addition, the effective disentanglement of appearance and motion
in LEO allows for two additional tasks, namely infinite-length human video
synthesis, as well as content-preserving video editing. |
This paper proposes LEO, a novel framework for human video synthesis that prioritizes spatio-temporal coherency by representing motion as a sequence of flow maps, effectively disentangling it from appearance. |
Synthesizing high-quality human videos with strong spatio-temporal coherency is challenging due to the difficulty in disentangling motion from appearance, leading to spatial distortions and temporal jittering. |
LEO utilizes a two-phase training approach. It first trains a flow-based image animator to learn latent motion codes and their mapping to flow maps. Then, a Latent Motion Diffusion Model (LMDM) learns motion prior from these codes, enabling the synthesis of coherent videos via a warp-and-inpaint mechanism. |
LEO demonstrates superior spatio-temporal coherency compared to existing methods, even in long videos (512 frames).
The disentanglement of appearance and motion enables infinite-length video synthesis and content-preserving video editing.
Quantitative evaluations on TaichiHD, FaceForensics, and CelebV-HQ datasets, using metrics like FVD, KVD, and ACD, confirm LEO’s superiority. |
The quality of unconditionally generated videos depends on the starting frame generated by a separate model, suggesting a need for improvement in that area.
The diversity of motion patterns in infinite-length generation is limited by the training data, requiring further research. |
video synthesis, human video generation, motion disentanglement, diffusion models, spatio-temporal coherency |
2305.03713
Report |
Avatar Fingerprinting for Authorized Use of Synthetic Talking-Head Videos |
Ekta Prashnani, Koki Nagano, Shalini De Mello, David Luebke, Orazio Gallo |
Modern generators render talking-head videos with impressive photorealism,
ushering in new user experiences such as videoconferencing under constrained
bandwidth budgets. Their safe adoption, however, requires a mechanism to verify
if the rendered video is trustworthy. For instance, for videoconferencing we
must identify cases in which a synthetic video portrait uses the appearance of
an individual without their consent. We term this task avatar fingerprinting.
Specifically, we learn an embedding in which the motion signatures of one
identity are grouped together, and pushed away from those of the other
identities. This allows us to link the synthetic video to the identity driving
the expressions in the video, regardless of the facial appearance shown. Avatar
fingerprinting algorithms will be critical as talking head generators become
more ubiquitous, and yet no large scale datasets exist for this new task.
Therefore, we contribute a large dataset of people delivering scripted and
improvised short monologues, accompanied by synthetic videos in which we render
videos of one person using the facial appearance of another. Project page:
https://research.nvidia.com/labs/nxp/avatar-fingerprinting/. |
The paper introduces "avatar fingerprinting", a method to verify the identity of the person driving the expressions in a synthetic talking-head video, regardless of the facial appearance. |
This is crucial for safe adoption of talking-head generators, which are becoming increasingly realistic, to prevent unauthorized use of someone's likeness. |
The method extracts temporal facial landmark distances from videos and uses a novel contrastive loss to learn a "dynamic identity embedding". In this embedding space, videos driven by the same identity cluster together, regardless of the target appearance. |
The method achieves an AUC of 0.886, outperforming baselines adapted from deepfake detection.
The learned embeddings capture motion dynamics rather than appearance, evidenced by similar distances between different target identities driven by the same person.
The method demonstrates robustness to unseen talking-head generators. |
The algorithm struggles to differentiate subjects with consistently neutral expressions.
Accuracy is affected if the generator fails to capture expressions crucial for distinguishing identities. |
avatar fingerprinting, talking-head video, synthetic media, identity verification, deepfake detection |
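The input features mentioned in the avatar fingerprinting record above (distances between facial landmarks, tracked over time) can be sketched in a few lines. This is a simplified illustration: the function names are hypothetical, and the choice of all landmark pairs with no normalization is an assumption, since the record does not specify which pairs or scaling the authors use.

```python
import math
from itertools import combinations

def landmark_distances(frame):
    """Pairwise Euclidean distances between the (x, y) landmarks of one frame."""
    return [math.dist(p, q) for p, q in combinations(frame, 2)]

def distance_sequence(frames):
    """One distance vector per frame; the sequence captures facial motion over time."""
    return [landmark_distances(f) for f in frames]

# Two toy frames with three landmarks each (hypothetical coordinates).
frames = [
    [(0.0, 0.0), (3.0, 4.0), (0.0, 4.0)],
    [(0.0, 0.0), (6.0, 8.0), (0.0, 8.0)],
]
features = distance_sequence(frames)
```

Because the features are distances rather than pixel values, they describe how a face moves and not what it looks like, which matches the goal of identifying the driving person regardless of the rendered appearance.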
2305.03689
Report |
COLA: A Benchmark for Compositional Text-to-image Retrieval |
Arijit Ray, Filip Radenovic, Abhimanyu Dubey, Bryan A. Plummer, Ranjay Krishna, Kate Saenko |
Compositional reasoning is a hallmark of human visual intelligence. Yet,
despite the size of large vision-language models, they struggle to represent
simple compositions by combining objects with their attributes. To measure this
lack of compositional capability, we design Cola, a text-to-image retrieval
benchmark to Compose Objects Localized with Attributes. To solve Cola, a model
must retrieve images with the correct configuration of attributes and objects
and avoid choosing a distractor image with the same objects and attributes but
in the wrong configuration. Cola contains about 1.2k composed queries of 168
objects and 197 attributes on around 30K images. Our human evaluation finds
that Cola is 83.33% accurate, similar to contemporary compositionality
benchmarks. Using Cola as a testbed, we explore empirical modeling designs to
adapt pre-trained vision-language models to reason compositionally. We explore
6 adaptation strategies on 2 seminal vision-language models, using
compositionality-centric test benchmarks - Cola and CREPE. We find the optimal
adaptation strategy is to train a multi-modal attention layer that jointly
attends over the frozen pre-trained image and language features. Surprisingly,
training multimodal layers on CLIP performs better than tuning a larger FLAVA
model with already pre-trained multimodal layers. Furthermore, our adaptation
strategy improves CLIP and FLAVA to comparable levels, suggesting that training
multimodal layers using contrastive attribute-object data is key, as opposed to
using them pre-trained. Lastly, we show that Cola is harder than a closely
related contemporary benchmark, CREPE, since simpler fine-tuning strategies
without multimodal layers suffice on CREPE but not on Cola. However, we still
see a significant gap between our best adaptation and human accuracy,
suggesting considerable room for further research. |
The paper introduces COLA, a text-to-image retrieval benchmark designed to test the ability of vision-language models to compose objects localized with attributes. |
Compositional reasoning, particularly binding attributes to the correct objects, is crucial for vision-language models to understand complex scenes and execute instructions accurately. |
The authors create COLA with single and multi-object queries containing multiple attributes. They then explore and compare various fine-tuning strategies on pre-trained models (CLIP and FLAVA) using the benchmark. |
A lightweight multimodal adaptation strategy using a transformer encoder-decoder to jointly attend over image and language features outperforms common tuning methods like prompt-tuning and fine-tuning.
Training multimodal layers on attribute-object data during adaptation is crucial for performance, even surpassing the use of pre-trained multimodal layers in larger models.
COLA proves to be a more challenging benchmark than existing ones, highlighting the difficulty of text-to-image retrieval with fine-grained compositional differences. |
While focusing on attribute-object compositionality, other compositional structures like relationships and scene graphs require further exploration.
The fine-tuning strategy, while effective for compositionality, might impact performance on other generic vision-language tasks. |
compositional reasoning, vision-language models, text-to-image retrieval, attribute-object binding, benchmarking |
2305.03382
Report |
Guided Image Synthesis via Initial Image Editing in Diffusion Model |
Jiafeng Mao, Xueting Wang, Kiyoharu Aizawa |
Diffusion models have the ability to generate high quality images by
denoising pure Gaussian noise images. While previous research has primarily
focused on improving the control of image generation through adjusting the
denoising process, we propose a novel direction of manipulating the initial
noise to control the generated image. Through experiments on stable diffusion,
we show that blocks of pixels in the initial latent images have a preference
for generating specific content, and that modifying these blocks can
significantly influence the generated image. In particular, we show that
modifying a part of the initial image affects the corresponding region of the
generated image while leaving other regions unaffected, which is useful for
repainting tasks. Furthermore, we find that the generation preferences of pixel
blocks are primarily determined by their values, rather than their position. By
moving pixel blocks with a tendency to generate user-desired content to
user-specified regions, our approach achieves state-of-the-art performance in
layout-to-image generation. Our results highlight the flexibility and power of
initial image manipulation in controlling the generated image. |
This paper investigates the impact of the initial noise image in diffusion models, revealing its inherent preference for generating specific content and leveraging this insight to control image generation. |
This is important because it offers a novel approach to fine-grained control in image generation, addressing limitations of prompt-based methods and enabling tasks like repainting and layout-to-image synthesis. |
The authors conduct experiments on Stable Diffusion, manipulating the initial noise image by either partially re-randomizing regions or swapping pixel blocks based on attention maps to guide content generation. |
The initial noise image exhibits distinct preferences for generating specific content, impacting the final output.
Modifying regions in the initial image leads to corresponding changes in generated images, enabling repainting tasks.
Moving pixel blocks based on their generation tendency achieves state-of-the-art performance in layout-to-image synthesis. |
The method's effectiveness is limited when guidance bounding boxes are small, particularly affecting small object generation.
Future work can explore optimizing the initial image or accelerating denoising based on optimized starting points. |
text-to-image, diffusion model, fine-grained control, layout-to-image, initial image editing |
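The two manipulations of the initial noise described in the record above (re-randomizing a region and moving pixel blocks) can be sketched on a toy latent. This is a minimal illustration under stated assumptions: a 2D list stands in for one channel of the latent, the function names and the box convention `(top, left, bottom, right)` are hypothetical, and the attention-based choice of which blocks to move is not modeled.

```python
import random

def rerandomize_region(latent, box, seed=0):
    """Copy a 2D latent and resample the box (top, left, bottom, right) from N(0, 1)."""
    rng = random.Random(seed)
    top, left, bottom, right = box
    out = [row[:] for row in latent]
    for r in range(top, bottom):
        for c in range(left, right):
            out[r][c] = rng.gauss(0.0, 1.0)
    return out

def swap_blocks(latent, src, dst, size):
    """Swap two non-overlapping size x size blocks given their top-left corners."""
    out = [row[:] for row in latent]
    (sr, sc), (dr, dc) = src, dst
    for i in range(size):
        for j in range(size):
            out[sr + i][sc + j] = latent[dr + i][dc + j]
            out[dr + i][dc + j] = latent[sr + i][sc + j]
    return out

latent = [[0.0] * 4 for _ in range(4)]
repainted = rerandomize_region(latent, (0, 0, 2, 2), seed=1)
swapped = swap_blocks([[1, 2], [3, 4]], (0, 0), (1, 1), 1)
```

Only the values inside the box change, which mirrors the reported observation that editing part of the initial image affects the corresponding region of the output while leaving other regions unaffected.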
2305.03374
Report |
DisenBooth: Identity-Preserving Disentangled Tuning for Subject-Driven Text-to-Image Generation |
Hong Chen, Yipeng Zhang, Simin Wu, Xin Wang, Xuguang Duan, Yuwei Zhou, Wenwu Zhu |
Subject-driven text-to-image generation aims to generate customized images of
the given subject based on the text descriptions, which has drawn increasing
attention. Existing methods mainly resort to finetuning a pretrained generative
model, where the identity-relevant information (e.g., the boy) and the
identity-irrelevant information (e.g., the background or the pose of the boy)
are entangled in the latent embedding space. However, the highly entangled
latent embedding may lead to the failure of subject-driven text-to-image
generation as follows: (i) the identity-irrelevant information hidden in the
entangled embedding may dominate the generation process, resulting in the
generated images heavily dependent on the irrelevant information while ignoring
the given text descriptions; (ii) the identity-relevant information carried in
the entangled embedding can not be appropriately preserved, resulting in
identity change of the subject in the generated images. To tackle the problems,
we propose DisenBooth, an identity-preserving disentangled tuning framework for
subject-driven text-to-image generation. Specifically, DisenBooth finetunes the
pretrained diffusion model in the denoising process. Different from previous
works that utilize an entangled embedding to denoise each image, DisenBooth
instead utilizes disentangled embeddings to respectively preserve the subject
identity and capture the identity-irrelevant information. We further design the
novel weak denoising and contrastive embedding auxiliary tuning objectives to
achieve the disentanglement. Extensive experiments show that our proposed
DisenBooth framework outperforms baseline models for subject-driven
text-to-image generation with the identity-preserved embedding. Additionally,
by combining the identity-preserved embedding and identity-irrelevant
embedding, DisenBooth demonstrates more generation flexibility and
controllability. |
This paper proposes DisenBooth, an identity-preserving disentangled tuning framework for subject-driven text-to-image generation. |
Existing methods suffer from entanglement of subject identity and irrelevant information in the latent space, leading to inaccurate subject generation and overfitting to background or pose. |
DisenBooth disentangles identity-relevant and irrelevant information by using separate textual and visual embeddings during finetuning of a pretrained diffusion model. It employs a weak denoising objective and a contrastive embedding objective to enforce disentanglement. |
DisenBooth outperforms baseline models in subject-driven text-to-image generation, demonstrating superior identity preservation and text prompt fidelity.
The disentangled embeddings enable more flexible and controllable generation by allowing users to combine identity-preserved embeddings with identity-irrelevant embeddings from reference images.
DisenBooth effectively disentangles identity-relevant and irrelevant information, as evidenced by ablation studies and visualizations. |
DisenBooth inherits limitations of the pretrained Stable Diffusion model.
Future work could explore more fine-grained disentanglement within identity-irrelevant information. |
text-to-image generation, subject-driven generation, disentangled representation learning, diffusion models, fine-tuning |
2305.03302
Report |
High-Fidelity 3D Face Generation from Natural Language Descriptions |
Menghua Wu, Hao Zhu, Linjia Huang, Yiyu Zhuang, Yuanxun Lu, Xun Cao |
Synthesizing high-quality 3D face models from natural language descriptions
is very valuable for many applications, including avatar creation, virtual
reality, and telepresence. However, little research has addressed this task.
We argue that the major obstacles are 1) the lack of high-quality 3D face data
with descriptive text annotation, and 2) the complex mapping relationship
between descriptive language space and shape/appearance space. To solve these
problems, we build the Describe3D dataset, the first large-scale dataset with
fine-grained text descriptions for the text-to-3D face generation task. Then we
propose a two-stage framework to first generate a 3D face that matches the
concrete descriptions, then optimize the parameters in the 3D shape and texture
space with abstract description to refine the 3D face model. Extensive
experimental results show that our method can produce a faithful 3D face that
conforms to the input descriptions with higher accuracy and quality than
previous methods. The code and Describe3D dataset are released at
https://github.com/zhuhao-nju/describe3d . |
This paper presents a novel method for generating high-fidelity 3D faces from natural language descriptions. |
Creating 3D faces is crucial for applications like avatars and VR, but current methods struggle to translate textual descriptions into accurate 3D models. |
The authors created Describe3D, a dataset of 3D faces paired with fine-grained text descriptions. They propose a two-stage pipeline: 1) concrete synthesis maps text to 3D shape and texture, and 2) abstract synthesis refines the model based on abstract descriptions using CLIP. |
The method generates 3D faces that accurately reflect both concrete and abstract descriptions.
Quantitative evaluation shows superior performance over cascading text-to-image and image-to-shape methods, as well as Latent3D.
Ablation studies demonstrate the effectiveness of the descriptive code, 3DMM representation, region-specific losses, and abstract synthesis. |
The method relies on distinguishing concrete and abstract descriptions, and struggles with complex sentences.
The dataset has limited racial diversity, impacting performance for certain ethnicities. |
3d face generation, text-to-3d, natural language processing, computer vision, deep learning |
2305.03051
Report |
Controllable Visual-Tactile Synthesis |
Ruihan Gao, Wenzhen Yuan, Jun-Yan Zhu |
Deep generative models have various content creation applications such as
graphic design, e-commerce, and virtual try-on. However, current works mainly
focus on synthesizing realistic visual outputs, often ignoring other sensory
modalities, such as touch, which limits physical interaction with users. In
this work, we leverage deep generative models to create a multi-sensory
experience where users can touch and see the synthesized object when sliding
their fingers on a haptic surface. The main challenges lie in the significant
scale discrepancy between vision and touch sensing and the lack of explicit
mapping from touch sensing data to a haptic rendering device. To bridge this
gap, we collect high-resolution tactile data with a GelSight sensor and create
a new visuotactile clothing dataset. We then develop a conditional generative
model that synthesizes both visual and tactile outputs from a single sketch. We
evaluate our method regarding image quality and tactile rendering accuracy.
Finally, we introduce a pipeline to render high-quality visual and tactile
outputs on an electroadhesion-based haptic device for an immersive experience,
allowing for challenging materials and editable sketch inputs. |
This paper presents a novel method for synthesizing both visual and tactile outputs of garments from user sketches, aiming to create a multi-sensory experience. |
Existing generative models primarily focus on visual outputs, neglecting other sensory modalities like touch. This work addresses the gap by enabling users to both see and feel synthesized objects, enhancing user experience in various applications like online shopping and virtual reality. |
The authors collect a new dataset of garments with spatially aligned visual and high-resolution tactile data. They propose a conditional GAN model that learns from dense visual supervision and sparse local tactile supervision to synthesize both outputs from a single sketch. The synthesized outputs are then rendered on a haptic device for an immersive experience. |
The proposed method outperforms baseline conditional GANs in terms of image quality and perceptual realism, as evidenced by quantitative metrics (LPIPS, SIFID) and human preference studies.
The model generalizes to unseen sketches, allowing for user-driven design edits and customization.
The system allows for text-conditioned synthesis, enabling users to modify garment designs using text prompts. |
The model struggles to generalize to user sketches with intricate patterns.
The current haptic rendering is limited to surface textures and primarily excels with relatively flat objects like garments, posing challenges for rendering 3D objects with significant surface normal changes. |
generative models, multi-sensory synthesis, visual-tactile generation, haptic rendering, conditional gans |
2305.03049
Report |
NeuralEditor: Editing Neural Radiance Fields via Manipulating Point Clouds |
Jun-Kun Chen, Jipeng Lyu, Yu-Xiong Wang |
This paper proposes NeuralEditor, which makes neural radiance fields (NeRFs)
natively editable for general shape editing tasks. Despite their impressive
results on novel-view synthesis, it remains a fundamental challenge for NeRFs
to edit the shape of the scene. Our key insight is to exploit the explicit
point cloud representation as the underlying structure to construct NeRFs,
inspired by the intuitive interpretation of NeRF rendering as a process that
projects or "plots" the associated 3D point cloud to a 2D image plane. To this
end, NeuralEditor introduces a novel rendering scheme based on deterministic
integration within K-D tree-guided density-adaptive voxels, which produces both
high-quality rendering results and precise point clouds through optimization.
NeuralEditor then performs shape editing via mapping associated points between
point clouds. Extensive evaluation shows that NeuralEditor achieves
state-of-the-art performance in both shape deformation and scene morphing
tasks. Notably, NeuralEditor supports both zero-shot inference and further
fine-tuning over the edited scene. Our code, benchmark, and demo video are
available at https://immortalco.github.io/NeuralEditor. |
Introduces NeuralEditor, a point cloud-guided NeRF model that enables general shape editing by manipulating the underlying point clouds. |
NeRF excels in novel-view synthesis but struggles with shape editing. NeuralEditor leverages point clouds for their ease of manipulation, combining the strengths of both representations. |
Uses K-D trees for density-adaptive voxels and deterministic spline integration for rendering. Employs Phong reflection for color modeling with normal vectors from the point cloud. Optimizes the point cloud via pruning, growing, and normal vector guidance. |
NeuralEditor generates more precise point clouds than PointNeRF.
Significantly outperforms baselines in shape deformation, both in zero-shot and fine-tuned settings.
Achieves smooth scene morphing between multiple scenes, a challenging task for prior work. |
Point cloud-guided NeRF models, including NeuralEditor, struggle with surfaces having complex visual effects (e.g., translucent mirrors).
Shape deformation doesn't consider the surrounding environment, limiting its ability to realistically adjust scene colors based on lighting changes. |
neural radiance fields, shape editing, point clouds, scene morphing, 3d vision |
2305.03048
Report |
Personalize Segment Anything Model with One Shot |
Renrui Zhang, Zhengkai Jiang, Ziyu Guo, Shilin Yan, Junting Pan, Xianzheng Ma, Hao Dong, Peng Gao, Hongsheng Li |
Driven by large-data pre-training, Segment Anything Model (SAM) has been
demonstrated as a powerful and promptable framework, revolutionizing the
segmentation models. Despite the generality, customizing SAM for specific
visual concepts without manual prompting is underexplored, e.g.,
automatically segmenting your pet dog in different images. In this paper, we
propose a training-free Personalization approach for SAM, termed as PerSAM.
Given only a single image with a reference mask, PerSAM first localizes the
target concept by a location prior, and segments it within other images or
videos via three techniques: target-guided attention, target-semantic
prompting, and cascaded post-refinement. In this way, we effectively adapt SAM
for private use without any training. To further alleviate the mask ambiguity,
we present an efficient one-shot fine-tuning variant, PerSAM-F. Freezing the
entire SAM, we introduce two learnable weights for multi-scale masks, only
training 2 parameters within 10 seconds for improved performance. To
demonstrate our efficacy, we construct a new segmentation dataset, PerSeg, for
personalized evaluation, and test our methods on video object segmentation with
competitive performance. Besides, our approach can also enhance DreamBooth to
personalize Stable Diffusion for text-to-image generation, which discards the
background disturbance for better target appearance learning. Code is released
at https://github.com/ZrrSkywalker/Personalize-SAM |
This paper presents PerSAM, a training-free personalization approach for the Segment Anything Model (SAM) that enables it to segment user-designated visual concepts using only one-shot data (a reference image and a mask). |
While SAM demonstrates impressive general-purpose segmentation abilities, it lacks the capacity to automatically segment specific user-defined objects in new images or videos. PerSAM addresses this limitation, making SAM more practical for personalized use cases. |
PerSAM leverages a location confidence map derived from feature similarities between the reference and test images to guide SAM's attention. It introduces target-guided attention and target-semantic prompting techniques to enhance SAM's focus on the target object. A fine-tuning variant, PerSAM-F, further improves performance by addressing the ambiguity of segmentation scales through a lightweight, scale-aware fine-tuning process. |
PerSAM achieves superior performance compared to other in-context learning methods on personalized object segmentation benchmarks, including the authors' newly introduced PerSeg dataset.
PerSAM-F, with minimal fine-tuning, further enhances accuracy, surpassing even fully trained video object segmentation models on DAVIS 2017.
The authors demonstrate PerSAM's applicability for improving DreamBooth's personalized text-to-image synthesis by mitigating background disturbance. |
The reliance on an accurate one-shot mask as input can be a limitation, although the authors provide a relaxation by allowing a bounding box as input with a slight performance trade-off.
Further exploration of personalization techniques for broader applications of SAM is suggested as future work. |
personalized object segmentation, segment anything model, one-shot learning, parameter-efficient fine-tuning, dreambooth |
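The PerSAM summary above describes a location confidence map derived from feature similarities between the reference and test images. A minimal sketch of that idea follows, assuming the target embedding is the mean of the reference features inside the mask and that similarity is cosine similarity; the function name, array shapes, and random features are hypothetical stand-ins, not SAM's actual encoder outputs.

```python
import numpy as np

def location_confidence(ref_feat, ref_mask, test_feat):
    """Cosine similarity between a masked-average target embedding and each test location.

    ref_feat, test_feat: (H, W, C) feature maps (hypothetical stand-ins for encoder features).
    ref_mask: (H, W) boolean mask of the target in the reference image.
    Returns an (H, W) confidence map with values in [-1, 1].
    """
    target = ref_feat[ref_mask].mean(axis=0)          # (C,) average feature inside the mask
    target = target / np.linalg.norm(target)
    test = test_feat / np.linalg.norm(test_feat, axis=-1, keepdims=True)
    return test @ target

rng = np.random.default_rng(0)
ref_feat = rng.normal(size=(8, 8, 16))
ref_mask = np.zeros((8, 8), dtype=bool)
ref_mask[2:4, 2:4] = True
test_feat = rng.normal(size=(8, 8, 16))
# Plant the target's feature at one test location so the peak is known.
test_feat[5, 6] = ref_feat[ref_mask].mean(axis=0)

conf = location_confidence(ref_feat, ref_mask, test_feat)
peak = tuple(int(i) for i in np.unravel_index(conf.argmax(), conf.shape))
```

The highest-confidence location (`peak`) could then serve as a positive point prompt; how PerSAM actually turns the map into prompts and attention guidance is not reproduced here.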
2305.03045
Report |
OctFormer: Octree-based Transformers for 3D Point Clouds |
Peng-Shuai Wang |
We propose octree-based transformers, named OctFormer, for 3D point cloud
learning. OctFormer can not only serve as a general and effective backbone for
3D point cloud segmentation and object detection but also has linear
complexity and is scalable for large-scale point clouds. The key challenge in
applying transformers to point clouds is reducing the quadratic, thus
overwhelming, computation complexity of attentions. To combat this issue,
several works divide point clouds into non-overlapping windows and constrain
attentions in each local window. However, the point number in each window
varies greatly, impeding the efficient execution on GPU. Observing that
attentions are robust to the shapes of local windows, we propose a novel octree
attention, which leverages sorted shuffled keys of octrees to partition point
clouds into local windows containing a fixed number of points while permitting
shapes of windows to change freely. And we also introduce dilated octree
attention to expand the receptive field further. Our octree attention can be
implemented in 10 lines of code with open-sourced libraries and runs 17 times
faster than other point cloud attentions when the point number exceeds 200k.
Built upon the octree attention, OctFormer can be easily scaled up and achieves
state-of-the-art performances on a series of 3D segmentation and detection
benchmarks, surpassing previous sparse-voxel-based CNNs and point cloud
transformers in terms of both efficiency and effectiveness. Notably, on the
challenging ScanNet200 dataset, OctFormer outperforms sparse-voxel-based CNNs
by 7.3 in mIoU. Our code and trained models are available at
https://wang-ps.github.io/octformer. |
This paper introduces OctFormer, an efficient and scalable transformer architecture for 3D point cloud understanding, based on a novel octree attention mechanism. |
Existing point cloud transformers suffer from low efficiency, hindering their applicability to large-scale point clouds. |
OctFormer leverages octree structures to divide point clouds into groups with equal point numbers for efficient window attention, enabling easy parallelization and scalability. |
OctFormer achieves state-of-the-art performance on ScanNet segmentation, outperforming previous methods including point cloud transformers and sparse-voxel-based CNNs.
OctFormer demonstrates superior efficiency, running over 17 times faster than other point cloud transformers on large-scale inputs.
OctFormer with only 18M parameters surpasses previous sparse-voxel-based CNNs with 38M parameters on ScanNet segmentation. |
OctFormer might overfit on small-scale datasets, requiring exploration of unsupervised pretraining techniques.
The current positional encoding limits the flexibility of OctFormer, demanding investigation into alternative positional encoding methods. |
3d deep learning, point cloud processing, transformers, octree, attention mechanism |
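The OctFormer record above describes partitioning a point cloud into windows with a fixed number of points by sorting octree keys. The sketch below illustrates that partitioning step only, assuming the "shuffled keys" are Morton (z-order) codes obtained by interleaving the bits of quantized coordinates; the function names, the depth, and the padding convention (`-1`) are assumptions for illustration, not the released implementation.

```python
import numpy as np

def morton_key(ijk, depth):
    """Interleave the bits of integer voxel coordinates (N, 3) into a z-order key."""
    key = np.zeros(len(ijk), dtype=np.int64)
    for bit in range(depth):
        for axis in range(3):
            key |= ((ijk[:, axis] >> bit) & 1) << (3 * bit + axis)
    return key

def octree_windows(points, window_size, depth=8):
    """Group points in [0, 1)^3 into windows of exactly `window_size` indices.

    Points are sorted by z-order key, so each window holds spatially nearby
    points while the window shape is free to vary. The last window is padded
    with -1 so every row has the same length.
    """
    ijk = np.floor(points * (1 << depth)).astype(np.int64)
    order = np.argsort(morton_key(ijk, depth), kind="stable")
    pad = (-len(order)) % window_size
    order = np.concatenate([order, np.full(pad, -1, dtype=order.dtype)])
    return order.reshape(-1, window_size)

rng = np.random.default_rng(0)
points = rng.random((1000, 3))
windows = octree_windows(points, window_size=32)
```

Because every row of `windows` has the same length, attention can be computed per window as one batched operation, which is the property the abstract credits for efficient GPU execution.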
2305.03043
Report |
Single-Shot Implicit Morphable Faces with Consistent Texture Parameterization |
Connor Z. Lin, Koki Nagano, Jan Kautz, Eric R. Chan, Umar Iqbal, Leonidas Guibas, Gordon Wetzstein, Sameh Khamis |
There is a growing demand for the accessible creation of high-quality 3D
avatars that are animatable and customizable. Although 3D morphable models
provide intuitive control for editing and animation, and robustness for
single-view face reconstruction, they cannot easily capture geometric and
appearance details. Methods based on neural implicit representations, such as
signed distance functions (SDF) or neural radiance fields, approach
photo-realism, but are difficult to animate and do not generalize well to
unseen data. To tackle this problem, we propose a novel method for constructing
implicit 3D morphable face models that are both generalizable and intuitive for
editing. Trained from a collection of high-quality 3D scans, our face model is
parameterized by geometry, expression, and texture latent codes with a learned
SDF and explicit UV texture parameterization. Once trained, we can reconstruct
an avatar from a single in-the-wild image by leveraging the learned prior to
project the image into the latent space of our model. Our implicit morphable
face models can be used to render an avatar from novel views, animate facial
expressions by modifying expression codes, and edit textures by directly
painting on the learned UV-texture maps. We demonstrate quantitatively and
qualitatively that our method improves upon photo-realism, geometry, and
expression accuracy compared to state-of-the-art methods. |
This paper proposes a novel method for constructing implicit 3D morphable face models that are both generalizable and intuitive for editing by combining the advantages of template-based 3DMMs with the quality and topological flexibility of implicit 3D representations. |
There is a growing demand for the accessible creation of high-quality 3D avatars that are animatable and customizable. |
The proposed method disentangles each facial avatar into identity and expression. It leverages an implicit geometry branch with a signed distance function (SDF) and a UV texture parameterization branch to represent the face. The model is trained on a large dataset of 3D face scans with various expressions. It also utilizes a single-shot inversion framework to map a single in-the-wild RGB image to the implicit 3D morphable model representation. |
The method achieves state-of-the-art reconstruction accuracy for photo-realistic rendering, geometry, and expression accuracy in the single-view reconstruction setting.
The learned texture map is intuitive to edit and propagates naturally during animation.
The proposed model demonstrates superior performance in expression and pose transfer between in-the-wild source and target images. |
The optimization process during inversion is relatively slow, limiting its use in real-time applications.
The reliance on a de-lighting module may make subjects appear paler than expected, and the model does not capture hair or accessories due to the limitations of the training dataset. |
neural avatars, implicit representations, texture maps, animation, inversion |
2305.03040
Report |
TUVF: Learning Generalizable Texture UV Radiance Fields |
An-Chieh Cheng, Xueting Li, Sifei Liu, Xiaolong Wang |
Textures are a vital aspect of creating visually appealing and realistic 3D
models. In this paper, we study the problem of generating high-fidelity texture
given shapes of 3D assets, which has been relatively less explored compared
with generic 3D shape modeling. Our goal is to facilitate a controllable
texture generation process, such that one texture code can correspond to a
particular appearance style independent of any input shapes from a category. We
introduce Texture UV Radiance Fields (TUVF) that generate textures in a
learnable UV sphere space rather than directly on the 3D shape. This allows the
texture to be disentangled from the underlying shape and transferable to other
shapes that share the same UV space, i.e., from the same category. We integrate
the UV sphere space with the radiance field, which provides a more efficient
and accurate representation of textures than traditional texture maps. We
perform our experiments on synthetic and real-world object datasets where we
achieve not only realistic synthesis but also substantial improvements over
state-of-the-arts on texture controlling and editing. Project Page:
https://www.anjiecheng.me/TUVF |
This paper proposes Texture UV Radiance Fields (TUVF), a novel method for generating high-quality and disentangled textures on 3D objects, enabling controllable texture synthesis and editing. |
Texture plays a vital role in creating realistic 3D models, but generating high-fidelity, controllable textures remains a challenge. Existing methods often entangle texture with geometry, limiting controllability and transferability. |
TUVF generates textures in a learnable UV sphere space, disentangling texture from the underlying 3D shape. It utilizes a Canonical Surface Auto-encoder to learn dense correspondence between a canonical UV sphere and object instances, enabling texture transfer across different shapes. A texture generator creates textures on the UV sphere, and a radiance field renders the final textured object. Adversarial learning is employed for training. |
TUVF achieves state-of-the-art results on CompCars, Photoshape, and DiffusionCats datasets, demonstrating superior texture quality and disentanglement.
It enables texture transfer across different shapes with consistent style and local details.
TUVF supports texture editing by modifying rendered images and fine-tuning the corresponding texture features. |
The current one-to-one dense mapping assumption in correspondence learning might not hold in all real-world scenarios with shape variations.
Future work could explore incorporating data-driven priors (e.g., diffusion models) and advanced neural rendering architectures (e.g., ray transformers) for further improvement. |
texture synthesis, 3d deep learning, neural rendering, generative adversarial networks, disentanglement |
2305.02981
Report |
Adversarially-Guided Portrait Matting |
Sergej Chicherin, Karen Efremyan |
We present a method for generating alpha mattes using a limited data source.
We pretrain a novel transformer-based model (StyleMatte) on portrait datasets.
We utilize this model to provide image-mask pairs for the StyleGAN3-based
network (StyleMatteGAN). This network is trained unsupervisedly and generates
previously unseen image-mask training pairs that are fed back to StyleMatte. We
demonstrate that the performance of the matte pulling network improves during
this cycle and obtains top results on the human portraits and state-of-the-art
metrics on animals dataset. Furthermore, StyleMatteGAN provides
high-resolution, privacy-preserving portraits with alpha mattes, making it
suitable for various image composition tasks. Our code is available at
https://github.com/chroneus/stylematte |
Presents StyleMatteGAN, a novel approach for generating synthetic portraits with high-quality alpha mattes using a StyleGAN3-based architecture trained in an unsupervised manner. |
Addresses the scarcity of large, high-quality datasets for portrait matting, a critical challenge in computer vision. |
Leverages a pretrained StyleGAN3 network modified to generate RGBA images and employs a cyclical training process where a transformer-based matting network (StyleMatte) is iteratively refined using synthetic data from StyleMatteGAN. |
StyleMatte achieves state-of-the-art results on benchmark datasets like P3M-10k and AM-2k.
StyleMatteGAN generates high-resolution, realistic portraits with consistent alpha mattes, as evidenced by FID scores.
Cyclical training with synthetic data improves the performance of the StyleMatte matting network. |
Generated portraits primarily exhibit a frontal head pose due to limitations in the training data.
Future work could explore 3D-aware GANs and diffusion models to enhance pose variety and image quality. |
image matting, generative adversarial networks, stylegan3, synthetic data generation, portrait matting |
2305.02677
Report |
Caption Anything: Interactive Image Description with Diverse Multimodal Controls |
Teng Wang, Jinrui Zhang, Junjie Fei, Hao Zheng, Yunlong Tang, Zhe Li, Mingqi Gao, Shanshan Zhao |
Controllable image captioning is an emerging multimodal topic that aims to
describe the image with natural language following human purpose,
e.g., looking at the specified regions or telling in a particular
text style. State-of-the-art methods are trained on annotated pairs of input
controls and output captions. However, the scarcity of such well-annotated
multimodal data largely limits their usability and scalability for interactive
AI systems. Leveraging unimodal instruction-following foundation models is a
promising alternative that benefits from broader sources of data. In this
paper, we present Caption AnyThing (CAT), a foundation model augmented image
captioning framework supporting a wide range of multimodal controls: 1) visual
controls, including points, boxes, and trajectories; 2) language controls, such
as sentiment, length, language, and factuality. Powered by Segment Anything
Model (SAM) and ChatGPT, we unify the visual and language prompts into a
modularized framework, enabling the flexible combination between different
controls. Extensive case studies demonstrate the user intention alignment
capabilities of our framework, shedding light on effective user interaction
modeling in vision-language applications. Our code is publicly available at
https://github.com/ttengwang/Caption-Anything. |
Presents Caption Anything (CAT), a training-free controllable image captioning framework augmented by pre-trained foundation models like Segment Anything Model (SAM) and ChatGPT, supporting diverse visual and language controls. |
Addresses the limitations of existing controllable image captioning methods that rely on limited annotated data and support only pre-defined control signals, aiming for enhanced user interactivity and controllability. |
Integrates pre-trained image captioners with SAM and an instruction-tuned LLM: SAM processes visual controls (points, boxes, trajectory) into masks; the captioner generates descriptions based on the masked image; and the LLM refines the caption based on language controls (sentiment, length, language, factuality). |
CAT accurately identifies and describes objects based on various visual controls (points, boxes, trajectory).
CAT generates captions with diverse language styles based on user-defined language controls.
CAT can be extended to object-centric chatting and image paragraph captioning by incorporating additional tools like VQA and OCR. |
The reliance on multiple foundation models might lead to increased computational cost.
Further quantitative analysis is needed to evaluate the performance of CAT compared to existing methods. |
controllable image captioning, foundation models, segment anything model, chatgpt, user interaction |
2305.02541
Report |
Catch Missing Details: Image Reconstruction with Frequency Augmented Variational Autoencoder |
Xinmiao Lin, Yikang Li, Jenhao Hsiao, Chiuman Ho, Yu Kong |
The popular VQ-VAE models reconstruct images through learning a discrete
codebook but suffer from a significant issue in the rapid quality degradation
of image reconstruction as the compression rate rises. One major reason is that
a higher compression rate induces more loss of visual signals on the higher
frequency spectrum which reflect the details on pixel space. In this paper, a
Frequency Complement Module (FCM) architecture is proposed to capture the
missing frequency information for enhancing reconstruction quality. The FCM can
be easily incorporated into the VQ-VAE structure, and we refer to the new model
as Frequency Augmented VAE (FA-VAE). In addition, a Dynamic Spectrum Loss (DSL)
is introduced to guide the FCMs to balance between various frequencies
dynamically for optimal reconstruction. FA-VAE is further extended to the
text-to-image synthesis task, and a Cross-attention Autoregressive Transformer
(CAT) is proposed to obtain more precise semantic attributes in texts.
Extensive reconstruction experiments with different compression rates are
conducted on several benchmark datasets, and the results demonstrate that the
proposed FA-VAE is able to restore more faithfully the details compared to SOTA
methods. CAT also shows improved generation quality with better image-text
semantic alignment. |
This paper proposes Frequency Augmented VAE (FA-VAE), a novel architecture that enhances image reconstruction quality in VQ-VAE models by addressing the loss of high-frequency details during compression. |
VQ-VAE models suffer from reduced image reconstruction quality at high compression rates due to the loss of high-frequency information. Existing methods often overlook the importance of frequency alignment for accurate reconstruction. |
FA-VAE incorporates Frequency Complement Modules (FCM) into the decoder to restore missing high-frequency information guided by a Dynamic Spectrum Loss (DSL). DSL leverages encoder activations to guide FCMs in learning a dynamic balance of frequencies for optimal reconstruction. |
FA-VAE demonstrates superior image reconstruction quality compared to state-of-the-art VQ-VAE models on FFHQ and ImageNet datasets across various compression rates.
Ablation studies confirm the effectiveness of FCMs and the DSL in enhancing reconstruction by capturing and restoring high-frequency details.
The proposed Cross-attention Autoregressive Transformer (CAT), an extension of FA-VAE for text-to-image generation, exhibits strong performance and generates high-quality images with accurate semantic alignment. |
The impact of kernel size in DSL on reconstruction quality requires further investigation.
Exploring alternative FCM architectures and merging techniques could lead to further improvements. |
image reconstruction, vq-vae, frequency analysis, image generation, text-to-image synthesis |
2305.02463
Report |
Shap-E: Generating Conditional 3D Implicit Functions |
Heewoo Jun, Alex Nichol |
We present Shap-E, a conditional generative model for 3D assets. Unlike
recent work on 3D generative models which produce a single output
representation, Shap-E directly generates the parameters of implicit functions
that can be rendered as both textured meshes and neural radiance fields. We
train Shap-E in two stages: first, we train an encoder that deterministically
maps 3D assets into the parameters of an implicit function; second, we train a
conditional diffusion model on outputs of the encoder. When trained on a large
dataset of paired 3D and text data, our resulting models are capable of
generating complex and diverse 3D assets in a matter of seconds. When compared
to Point-E, an explicit generative model over point clouds, Shap-E converges
faster and reaches comparable or better sample quality despite modeling a
higher-dimensional, multi-representation output space. We release model
weights, inference code, and samples at https://github.com/openai/shap-e. |
This paper presents Shap-E, a generative model that produces 3D assets as both textured meshes and neural radiance fields (NeRFs) conditioned on text prompts. |
This work addresses limitations of existing 3D generative models that struggle to represent complex assets efficiently. Shap-E provides a faster and more flexible approach for text-to-3D generation. |
The authors train a two-stage model. First, a 3D encoder learns to map 3D assets into implicit function parameters, trained using NeRF and differentiable rendering objectives. Second, a conditional diffusion model is trained on the encoded latents, learning from paired text-3D data. |
Shap-E generates diverse and recognizable 3D assets from text prompts in seconds.
Compared to Point-E, an explicit 3D generative model, Shap-E shows faster convergence and comparable or better sample quality.
The authors find shared success and failure cases between Shap-E and Point-E when conditioned on images, suggesting data and model architecture outweigh representation choice in influencing output. |
Shap-E struggles with multi-object composition and attribute binding, likely due to limitations in paired training data.
Generated 3D assets, while recognizable, often lack fine details. Improved encoders and incorporating optimization-based methods could enhance details and quality. |
generative models, 3d generation, text-to-3d, neural radiance fields (nerfs), diffusion models |
2305.02385
Report |
SimSC: A Simple Framework for Semantic Correspondence with Temperature Learning |
Xinghui Li, Kai Han, Xingchen Wan, Victor Adrian Prisacariu |
We propose SimSC, a remarkably simple framework, to address the problem of
semantic matching only based on the feature backbone. We discover that when
fine-tuning ImageNet pre-trained backbone on the semantic matching task, L2
normalization of the feature map, a standard procedure in feature matching,
produces an overly smooth matching distribution and significantly hinders the
fine-tuning process. By setting an appropriate temperature to the softmax, this
over-smoothness can be alleviated and the quality of features can be
substantially improved. We employ a learning module to predict the optimal
temperature for fine-tuning feature backbones. This module is trained together
with the backbone and the temperature is updated online. We evaluate our method
on three public datasets and demonstrate that we can achieve accuracy on par
with state-of-the-art methods under the same backbone without using a learned
matching head. Our method is versatile and works on various types of backbones.
We show that the accuracy of our framework can be easily improved by coupling
it with more powerful backbones. |
This paper presents SimSC, a simple yet effective framework for semantic correspondence matching. It shows that L2 normalization of the feature map produces an overly smooth matching distribution that hinders backbone fine-tuning, and proposes a learned softmax temperature to mitigate this issue. |
Existing semantic matching methods often rely on complex matching heads and training strategies, overlooking the importance of properly fine-tuning the feature backbone. This work emphasizes the backbone's significance and offers a simple solution to enhance its performance. |
The method uses a temperature learning module, implemented as a two-layer MLP, to predict the optimal temperature for the softmax operation based on the input image pair's feature maps. This module is jointly trained with the backbone, eliminating the need for manual temperature tuning. |
SimSC achieves state-of-the-art accuracy on PF-Pascal and SPair-71K datasets using ResNet101 as the backbone, despite having no learned matching head.
The framework is versatile and effectively fine-tunes both CNN-based (ResNet) and ViT-based (DINO, iBOT) backbones.
Fine-tuning the entire backbone with SimSC consistently outperforms fine-tuning only the last block, showcasing the benefit of propagating the learned temperature's effect throughout the network. |
The method's performance on transfer learning to PF-Willow, while decent, is not as significant as its results on SPair-71K, suggesting potential limitations in handling different data distributions.
The paper primarily focuses on single-scale matching. Exploring multi-scale strategies within the SimSC framework could further improve its performance, especially for challenging cases with significant scale variations between images. |
semantic correspondence, temperature learning, feature backbone fine-tuning, l2 normalization, deep learning |
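The SimSC record above attributes the over-smooth matching distribution to L2 normalization and counters it with a softmax temperature. The sketch below illustrates that effect with a fixed scalar temperature; in the paper the temperature is predicted by a learned module, which is not modeled here, and the function names and random features are hypothetical.

```python
import numpy as np

def matching_distribution(feat_src, feat_tgt, temperature):
    """Softmax over cosine similarities between source and target features.

    feat_src: (N, C), feat_tgt: (M, C). Returns an (N, M) matrix whose rows sum to 1.
    """
    src = feat_src / np.linalg.norm(feat_src, axis=1, keepdims=True)
    tgt = feat_tgt / np.linalg.norm(feat_tgt, axis=1, keepdims=True)
    logits = (src @ tgt.T) / temperature              # cosine similarities lie in [-1, 1]
    logits -= logits.max(axis=1, keepdims=True)       # subtract row max for numerical stability
    prob = np.exp(logits)
    return prob / prob.sum(axis=1, keepdims=True)

def mean_entropy(prob):
    """Average row entropy; higher means a smoother (less peaked) distribution."""
    return float(-(prob * np.log(prob + 1e-12)).sum(axis=1).mean())

rng = np.random.default_rng(0)
feat_a = rng.normal(size=(64, 32))
feat_b = rng.normal(size=(64, 32))
smooth = matching_distribution(feat_a, feat_b, temperature=1.0)
sharp = matching_distribution(feat_a, feat_b, temperature=0.05)
```

With L2-normalized features the logits are bounded in [-1, 1], so at temperature 1 the softmax stays close to uniform; dividing by a smaller temperature widens the logit range and lowers the entropy.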
2305.02312
Report |
AG3D: Learning to Generate 3D Avatars from 2D Image Collections |
Zijian Dong, Xu Chen, Jinlong Yang, Michael J. Black, Otmar Hilliges, Andreas Geiger |
While progress in 2D generative models of human appearance has been rapid,
many applications require 3D avatars that can be animated and rendered.
Unfortunately, most existing methods for learning generative models of 3D
humans with diverse shape and appearance require 3D training data, which is
limited and expensive to acquire. The key to progress is hence to learn
generative models of 3D avatars from abundant unstructured 2D image
collections. However, learning realistic and complete 3D appearance and
geometry in this under-constrained setting remains challenging, especially in
the presence of loose clothing such as dresses. In this paper, we propose a new
adversarial generative model of realistic 3D people from 2D images. Our method
captures shape and deformation of the body and loose clothing by adopting a
holistic 3D generator and integrating an efficient and flexible articulation
module. To improve realism, we train our model using multiple discriminators
while also integrating geometric cues in the form of predicted 2D normal maps.
We experimentally find that our method outperforms previous 3D- and
articulation-aware methods in terms of geometry and appearance. We validate the
effectiveness of our model and the importance of each component via systematic
ablation studies. |
This paper proposes AG3D, a novel adversarial generative model that learns to generate realistic and animatable 3D human avatars from unstructured 2D image collections, effectively capturing the shape and deformation of the body and loose clothing. |
Generating diverse and high-quality 3D avatars typically requires expensive and limited 3D training data. This work leverages widely available 2D images to learn a generative model, overcoming the limitations of 3D data acquisition. |
The method utilizes a holistic 3D generator with an efficient articulation module (Fast-SNARF) for pose control and loose clothing deformation. It employs multiple discriminators specializing in full images, faces, and normal maps, enhancing visual and geometric fidelity. |
AG3D outperforms state-of-the-art methods in terms of image quality, particularly in side views, as evidenced by FID scores and user preference studies.
The model effectively captures subtle geometric details, producing realistic 3D shapes, unlike previous methods that suffer from noise and artifacts.
Unlike part-based models, AG3D effectively handles loose clothing like dresses and skirts, avoiding discontinuity artifacts. |
The model may generate incorrect clothing patterns in occluded areas due to the ambiguity of pixel-to-body-part association in single-view training data.
The training datasets used, primarily focused on fashion, lack diversity in body shapes, skin tones, and age, potentially leading to biases in generated avatars. |
3d human generation, generative adversarial networks, articulated deformation, loose clothing modeling, normal map discriminator |
2305.02310
Report |
Real-Time Radiance Fields for Single-Image Portrait View Synthesis |
Alex Trevithick, Matthew Chan, Michael Stengel, Eric R. Chan, Chao Liu, Zhiding Yu, Sameh Khamis, Manmohan Chandraker, Ravi Ramamoorthi, Koki Nagano |
We present a one-shot method to infer and render a photorealistic 3D
representation from a single unposed image (e.g., face portrait) in real-time.
Given a single RGB input, our image encoder directly predicts a canonical
triplane representation of a neural radiance field for 3D-aware novel view
synthesis via volume rendering. Our method is fast (24 fps) on consumer
hardware, and produces higher quality results than strong GAN-inversion
baselines that require test-time optimization. To train our triplane encoder
pipeline, we use only synthetic data, showing how to distill the knowledge from
a pretrained 3D GAN into a feedforward encoder. Technical contributions include
a Vision Transformer-based triplane encoder, a camera data augmentation
strategy, and a well-designed loss function for synthetic data training. We
benchmark against the state-of-the-art methods, demonstrating significant
improvements in robustness and image quality in challenging real-world
settings. We showcase our results on portraits of faces (FFHQ) and cats (AFHQ),
but our algorithm can also be applied in the future to other categories with a
3D-aware image generator. |
Presents a one-shot method to infer and render a photorealistic 3D representation from a single unposed image in real-time using a triplane representation of a neural radiance field. |
Enables real-time 3D-aware novel view synthesis from a single image, significantly faster than optimization-based methods, opening possibilities for applications like AR/VR and 3D telepresence. |
Trains a Vision Transformer-based encoder to predict canonical triplane features from a single image, supervised using synthetic data generated from a pre-trained 3D GAN (EG3D) with on-the-fly camera augmentation. |
Achieves real-time performance (24 fps) on consumer hardware.
Outperforms state-of-the-art GAN-inversion baselines in terms of robustness and image quality on challenging real-world portraits.
Demonstrates generalization ability by lifting stylized images (drawings and paintings) to 3D. |
May struggle with strong profile views due to limitations in the training data distribution.
Temporal inconsistencies may arise when applied to videos frame-by-frame due to independent frame processing. |
novel view synthesis, 3d reconstruction, generative adversarial networks, neural radiance fields, synthetic data |
2305.02187
Report |
CLUSTSEG: Clustering for Universal Segmentation |
James Liang, Tianfei Zhou, Dongfang Liu, Wenguan Wang |
We present CLUSTSEG, a general, transformer-based framework that tackles
different image segmentation tasks (i.e., superpixel, semantic, instance, and
panoptic) through a unified neural clustering scheme. Regarding queries as
cluster centers, CLUSTSEG is innovative in two aspects: 1) cluster centers are
initialized in heterogeneous ways so as to pointedly address task-specific
demands (e.g., instance- or category-level distinctiveness), yet without
modifying the architecture; and 2) pixel-cluster assignment, formalized in a
cross-attention fashion, is alternated with cluster center update, yet without
learning additional parameters. These innovations closely link CLUSTSEG to EM
clustering and make it a transparent and powerful framework that yields
superior results across the above segmentation tasks. |
Presents ClustSeg, a universal, transformer-based segmentation framework that tackles superpixel, semantic, instance, and panoptic segmentation through a unified neural clustering scheme. |
Aims to shift the image segmentation field from task-specialized architectures towards a universal framework and to address the limitations of existing universal segmenters. |
1. Dreamy-Start: Task-specific initialization of cluster centers (queries) respecting the nature of each segmentation task.
2. Recurrent Cross-Attention: A non-parametric, recursive module for effective and efficient neural clustering by alternating between pixel-cluster assignment and cluster center update. |
ClustSeg sets new records across all metrics on COCO Panoptic val (59.0 PQ).
It establishes a new state-of-the-art on COCO instance segmentation (49.1 AP).
It ranks top in ADE20K semantic segmentation benchmarking (57.4 mIoU). |
Extra clustering loops in training may reduce computational efficiency.
Future work includes developing more robust clustering algorithms to handle complex scenarios. |
image segmentation, universal framework, clustering, transformers, deep learning |
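The CLUSTSEG record above describes Recurrent Cross-Attention as alternating pixel-cluster assignment with cluster center update, without additional learned parameters, and links it to EM clustering. A minimal sketch of such an alternation follows, assuming a softmax over clusters for the assignment step and a weighted mean for the update step; the function name, scaling, and toy data are assumptions, not the paper's exact module.

```python
import numpy as np

def recurrent_cross_attention(pixels, centers, num_iters=3):
    """Alternate soft pixel-to-cluster assignment and center update.

    pixels: (N, C) pixel features, centers: (K, C) initial cluster centers (queries).
    No parameters are learned inside the loop. Returns the updated centers and
    the (N, K) assignment computed in the final iteration.
    """
    for _ in range(num_iters):
        # Assignment step: each pixel distributes its mass over the clusters.
        logits = pixels @ centers.T / np.sqrt(pixels.shape[1])
        logits -= logits.max(axis=1, keepdims=True)
        assign = np.exp(logits)
        assign /= assign.sum(axis=1, keepdims=True)
        # Update step: each center becomes the assignment-weighted mean of pixels.
        centers = (assign.T @ pixels) / (assign.sum(axis=0)[:, None] + 1e-8)
    return centers, assign

rng = np.random.default_rng(0)
pixels = np.concatenate([rng.normal(3.0, 0.5, size=(50, 4)),
                         rng.normal(-3.0, 0.5, size=(50, 4))])
init_centers = pixels[[0, 50]].copy()                 # one seed pixel per blob
centers, assign = recurrent_cross_attention(pixels, init_centers)
```

Normalizing over the cluster axis, rather than over pixels as in standard cross-attention, is what makes the step resemble the E-step of EM clustering.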
2305.01644
Report |
Key-Locked Rank One Editing for Text-to-Image Personalization |
Yoad Tewel, Rinon Gal, Gal Chechik, Yuval Atzmon |
Text-to-image models (T2I) offer a new level of flexibility by allowing users
to guide the creative process through natural language. However, personalizing
these models to align with user-provided visual concepts remains a challenging
problem. The task of T2I personalization poses multiple hard challenges, such
as maintaining high visual fidelity while allowing creative control, combining
multiple personalized concepts in a single image, and keeping a small model
size. We present Perfusion, a T2I personalization method that addresses these
challenges using dynamic rank-1 updates to the underlying T2I model. Perfusion
avoids overfitting by introducing a new mechanism that "locks" new concepts'
cross-attention Keys to their superordinate category. Additionally, we develop
a gated rank-1 approach that enables us to control the influence of a learned
concept during inference time and to combine multiple concepts. This allows
runtime-efficient balancing of visual-fidelity and textual-alignment with a
single 100KB trained model, which is five orders of magnitude smaller than the
current state of the art. Moreover, it can span different operating points
across the Pareto front without additional training. Finally, we show that
Perfusion outperforms strong baselines in both qualitative and quantitative
terms. Importantly, key-locking leads to novel results compared to traditional
approaches, making it possible to portray personalized object interactions in
unprecedented ways, even in one-shot settings. |
This paper introduces Key-Locked Rank One Editing (Perfusion), a novel method for personalizing text-to-image (T2I) diffusion models that achieves high visual fidelity and improved textual alignment with a small model size. |
Existing T2I personalization methods often overfit to training images, limiting their ability to generate diverse and creative compositions. They also struggle to combine multiple learned concepts in a single image. |
Perfusion leverages a gated rank-one editing approach applied to the cross-attention layers of diffusion models. It introduces a 'Key-Locking' mechanism that restricts a concept's attention to its super-category, preventing overfitting and promoting generalization. It also employs a gated rank-one update to control the influence of learned concepts during inference, enabling multi-concept compositions. |
Perfusion outperforms state-of-the-art methods in qualitative and quantitative comparisons, showing improved text-alignment and visual fidelity.
The method allows for runtime control over the trade-off between visual fidelity and text alignment by adjusting sigmoid parameters.
Key-Locking enables the generation of novel compositions and interactions between individually learned concepts. |
The choice of super-category for Key-Locking can sometimes lead to 'over-generalization' effects, impacting visual fidelity.
Combining multiple concepts effectively often requires significant prompt engineering. |
text-to-image synthesis, personalization, diffusion models, rank-one editing, key-locking |
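The gated rank-one update and key-locking idea in the Perfusion report above can be sketched on a toy linear layer. This is a simplified illustration under stated assumptions, not the paper's exact formulation: the cosine-similarity gate, the parameter names `bias` and `temperature`, the function `gated_rank_one`, and the 2-D encodings are all hypothetical.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def gated_rank_one(W, e, concept_dir, target_out, bias=0.7, temperature=0.1):
    """Return W @ e plus a gated correction along one fixed output direction."""
    base = matvec(W, e)
    similarity = cosine(e, concept_dir)
    # Sigmoid gate: close to 1 for inputs aligned with the concept, else close to 0.
    gate = 1.0 / (1.0 + math.exp(-(similarity - bias) / temperature))
    # Fixed correction direction: from the unedited concept output to the target.
    correction = [t - o for t, o in zip(target_out, matvec(W, concept_dir))]
    return [b + gate * c for b, c in zip(base, correction)], gate

W_k = [[1.0, 0.0], [0.0, 1.0]]      # toy key projection
e_super = [0.0, 1.0]                # hypothetical super-category encoding
e_concept = [1.0, 0.0]              # hypothetical new-concept encoding
locked_key = matvec(W_k, e_super)   # key-locking: reuse the super-category key

key_for_concept, g_on = gated_rank_one(W_k, e_concept, e_concept, locked_key)
key_for_other, g_off = gated_rank_one(W_k, [0.0, 1.0], e_concept, locked_key)
```

Changing `bias` or `temperature` at inference time shifts how strongly the concept overrides the base layer, which mirrors the runtime control over sigmoid parameters mentioned in the report.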
2305.01569
Report |
Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation |
Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, Omer Levy |
The ability to collect a large dataset of human preferences from
text-to-image users is usually limited to companies, making such datasets
inaccessible to the public. To address this issue, we create a web app that
enables text-to-image users to generate images and specify their preferences.
Using this web app we build Pick-a-Pic, a large, open dataset of text-to-image
prompts and real users' preferences over generated images. We leverage this
dataset to train a CLIP-based scoring function, PickScore, which exhibits
superhuman performance on the task of predicting human preferences. Then, we
test PickScore's ability to perform model evaluation and observe that it
correlates better with human rankings than other automatic evaluation metrics.
Therefore, we recommend using PickScore for evaluating future text-to-image
generation models, and using Pick-a-Pic prompts as a more relevant dataset than
MS-COCO. Finally, we demonstrate how PickScore can enhance existing
text-to-image models via ranking. |
This work introduces Pick-a-Pic, a large, open dataset of user preferences over text-to-image generations, and PickScore, a CLIP-based scoring function trained on this dataset for predicting human preferences. |
Existing text-to-image generation models lack large, open datasets of human preferences, hindering the development of models that align with user expectations. |
The authors created a web app to collect user preferences on generated images, resulting in Pick-a-Pic. They then trained PickScore, a CLIP-based model, on this dataset using a reward model objective similar to InstructGPT. |
PickScore achieves superhuman performance (70.5% accuracy) in predicting human preferences, outperforming baselines like zero-shot CLIP-H (60.8%) and human experts (68.0%).
PickScore shows a stronger correlation (0.917) with human preferences than FID (-0.900) for evaluating text-to-image models, even when tested on MS-COCO captions.
PickScore effectively improves the quality of text-to-image generations via ranking, with human raters preferring its selections over those made by other scoring functions and baselines. |
Despite efforts to ensure data quality, Pick-a-Pic may contain NSFW content and inherent biases from user preferences.
Future work includes exploring the use of PickScore and Pick-a-Pic for RLHF and other alignment techniques to further improve text-to-image models. |
text-to-image generation, human preferences, dataset, evaluation metric, clip |
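The preference objective behind PickScore in the report above can be sketched with scalar scores. This is a minimal sketch, assuming each image already has a scalar score for the prompt (in the report this comes from a CLIP-based model, which is not reproduced here). The loss shown is a cross-entropy over a two-way softmax, which is one common form of reward-model objective; the exact loss used by the authors is not stated in this record.

```python
import math

def preference_probs(score_a, score_b):
    """Two-way softmax: predicted probability that each image is preferred."""
    m = max(score_a, score_b)
    ea, eb = math.exp(score_a - m), math.exp(score_b - m)
    return ea / (ea + eb), eb / (ea + eb)

def preference_loss(score_a, score_b, label):
    """Cross-entropy against the human label.

    label is (1, 0) if image A was preferred, (0, 1) if image B was,
    and (0.5, 0.5) for a tie.
    """
    pa, pb = preference_probs(score_a, score_b)
    return -(label[0] * math.log(pa) + label[1] * math.log(pb))

agree = preference_loss(2.0, 0.0, (1, 0))      # model agrees with the user
disagree = preference_loss(0.0, 2.0, (1, 0))   # model disagrees with the user
tie = preference_loss(1.0, 1.0, (0.5, 0.5))    # equal scores, tie label
```

Ranking several generations for one prompt, as in the last finding, would then amount to scoring each candidate and keeping the highest-scoring one.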
2305.01275
Report |
Segment Anything is A Good Pseudo-label Generator for Weakly Supervised Semantic Segmentation |
Peng-Tao Jiang, Yuqi Yang |
Weakly supervised semantic segmentation with weak labels is a long-lived
ill-posed problem. Mainstream methods mainly focus on improving the quality of
pseudo labels. In this report, we attempt to explore the potential of 'prompt
to masks' from the powerful class-agnostic large segmentation model,
segment-anything. Specifically, different weak labels are used as prompts to
the segment-anything model, generating precise class masks. The class masks are
utilized to generate pseudo labels to train the segmentation networks. We have
conducted extensive experiments on the PASCAL VOC 2012 dataset. Experiments
demonstrate that segment-anything can serve as a good pseudo-label generator.
The code will be made publicly available. |
This paper proposes using the Segment-Anything Model (SAM) to generate pseudo labels for weakly supervised semantic segmentation. |
Constructing large-scale finely-annotated datasets for semantic segmentation is time-consuming and expensive. This paper explores the potential of using a powerful, pre-trained model (SAM) to improve weakly supervised methods, which rely on cheaper annotations. |
The paper investigates using different weak annotations (image-level labels, points, scribbles, bounding boxes) as prompts for SAM to generate object masks. These masks are then used as pseudo labels to train segmentation networks. |
SAM with scribble prompts achieves 89.7% mIoU on PASCAL VOC 2012 train set for pseudo label generation.
Using these pseudo labels, DeepLab-v2 achieves 76.6% mIoU on the test set.
SAM outperforms other weakly supervised methods across different annotation types. |
The study is limited to the PASCAL VOC 2012 dataset.
The text prompt functionality of SAM, which is currently unavailable, could be explored in the future. |
weakly supervised semantic segmentation, segment-anything model, pseudo labels, deep learning, computer vision |
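The step from per-class masks to a pseudo-label map in the report above can be sketched as follows. The masks are assumed to be given already (calling SAM itself is out of scope here), and the conflict rule, which marks pixels claimed by more than one class as ignored, is an assumption for illustration rather than the authors' procedure.

```python
IGNORE = 255  # label commonly used for unlabeled pixels in PASCAL VOC

def merge_masks(class_masks, height, width, background=0):
    """Merge binary class masks into one label map.

    class_masks maps a class id to a 2D list of 0/1 values.
    """
    labels = [[background] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            hits = [c for c, mask in class_masks.items() if mask[y][x]]
            if len(hits) == 1:
                labels[y][x] = hits[0]
            elif len(hits) > 1:
                labels[y][x] = IGNORE  # conflicting masks: exclude from the loss
    return labels

# Hypothetical masks for two classes on a 2x3 image.
masks = {
    12: [[1, 1, 0], [0, 0, 0]],
    15: [[0, 1, 1], [0, 0, 0]],
}
pseudo = merge_masks(masks, 2, 3)
```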
2305.01257
Report |
DreamPaint: Few-Shot Inpainting of E-Commerce Items for Virtual Try-On without 3D Modeling |
Mehmet Saygin Seyfioglu, Karim Bouyarmane, Suren Kumar, Amir Tavanaei, Ismail B. Tutar |
We introduce DreamPaint, a framework to intelligently inpaint any e-commerce
product on any user-provided context image. The context image can be, for
example, the user's own image for virtual try-on of clothes from the e-commerce
catalog on themselves, the user's room image for virtual try-on of a piece of
furniture from the e-commerce catalog in their room, etc. As opposed to
previous augmented-reality (AR)-based virtual try-on methods, DreamPaint neither
uses nor requires 3D modeling of either the e-commerce product or
the user context. Instead, it directly uses 2D images of the product as
available in the product catalog database, and a 2D picture of the context, for
example taken from the user's phone camera. The method relies on few-shot fine
tuning a pre-trained diffusion model with the masked latents (e.g., Masked
DreamBooth) of the catalog images per item, whose weights are then loaded on a
pre-trained inpainting module that is capable of preserving the characteristics
of the context image. DreamPaint preserves both the product image and
the context (environment/user) image without requiring text guidance to
describe the missing part (product/context). DreamPaint can also
intelligently infer the best 3D angle of the product to place at the desired
location on the user context, even if that angle was previously unseen in the
product's reference 2D images. We compare our results against both text-guided
and image-guided inpainting modules and show that DreamPaint yields superior
performance in both subjective human study and quantitative metrics. |
This paper presents DreamPaint, a framework for intelligently inpainting e-commerce products onto user-provided context images (e.g., virtual try-on) without 3D modeling, using a combination of Masked DreamBooth and Stable Diffusion Inpainting. |
Addresses limitations of current AR-based virtual try-on methods by using readily available 2D product images and user context images, improving the e-commerce customer experience. |
Fine-tunes a pre-trained diffusion model with masked product images, enabling inpainting that preserves both product and context image characteristics. Leverages Masked DreamBooth and Stable Diffusion Inpainting modules. |
Outperforms text-guided and image-guided inpainting methods in preserving product fidelity.
Demonstrates superior performance in both subjective human evaluation and quantitative metrics (CLIP score).
Allows for flexible user control with the option of additional text prompts for refinement. |
Scalability challenges arise from the need for few-shot fine-tuning per e-commerce item.
Context-appearance entanglement can lead to alterations in product appearance (e.g., color) based on the context image. |
virtual try-on, e-commerce, image inpainting, diffusion models, dreambooth |
2305.01239
Report |
DRPT: Disentangled and Recurrent Prompt Tuning for Compositional Zero-Shot Learning |
Xiaocheng Lu, Ziming Liu, Song Guo, Jingcai Guo, Fushuo Huo, Sikai Bai, Tao Han |
Compositional Zero-shot Learning (CZSL) aims to recognize novel concepts
composed of known knowledge without training samples. Standard CZSL either
identifies visual primitives or enhances unseen composed entities, and as a
result, entanglement between state and object primitives cannot be fully
utilized. Admittedly, vision-language models (VLMs) could naturally cope with
CZSL through prompt tuning, but uneven entanglement drags the prompts
into a local optimum. In this paper, we take a further step to introduce
a novel Disentangled and Recurrent Prompt Tuning framework termed DRPT to
better tap the potential of VLMs in CZSL. Specifically, the state and object
primitives are deemed as learnable tokens of vocabulary embedded in prompts and
tuned on seen compositions. Instead of jointly tuning state and object, we
devise a disentangled and recurrent tuning strategy to suppress the traction
force caused by entanglement and gradually optimize the token parameters,
leading to a better prompt space. Notably, we develop a progressive fine-tuning
procedure that allows for incremental updates to the prompts, optimizing the
object first, then the state, and vice versa. Meanwhile, the optimization of
state and object is independent, thus clearer features can be learned to
further alleviate the issue of entangling misleading optimization. Moreover, we
quantify and analyze the entanglement in CZSL and supplement entanglement
rebalancing optimization schemes. DRPT surpasses representative
state-of-the-art methods on extensive benchmark datasets, demonstrating
superiority in both accuracy and efficiency. |
This paper proposes DRPT, a novel Disentangled and Recurrent Prompt Tuning framework for Compositional Zero-Shot Learning (CZSL) that leverages the power of Vision-Language Models (VLMs). |
Existing CZSL methods struggle to effectively utilize the entanglement between state and object primitives, often leading VLMs to converge to local optima due to uneven entanglement distribution. |
DRPT treats state and object primitives as learnable tokens within prompts. It implements a disentangled and recurrent tuning strategy to decouple parameter updates, progressively optimizing object and state tokens independently before joint optimization. |
DRPT surpasses state-of-the-art methods on three benchmark datasets (UT-Zappos, AO-Clevr, C-GQA) demonstrating superior accuracy and efficiency.
The study quantifies entanglement in CZSL and demonstrates DRPT's effectiveness in mitigating entanglement issues.
Ablation studies confirm the positive impact of disentangled recurrent tuning and entanglement re-balancing techniques. |
The paper acknowledges the potential for exploring dynamic status transition sequences with varying K values and automatic status switching in future work.
Further investigation into other re-balancing schemes for entanglement is also suggested. |
zero-shot learning, compositional zero-shot learning, prompt learning, vision-language models, entanglement |
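The disentangled, recurrent tuning strategy in the DRPT report above amounts to a schedule that decides which prompt tokens are trainable at a given time. The sketch below is one possible reading, assuming fixed-length phases that cycle through object-only, state-only, and joint updates; the phase length, the ordering, and the function name `trainable_groups` are hypothetical.

```python
def trainable_groups(epoch, phase_length=2, order=("object", "state", "joint")):
    """Return the set of token groups that receive gradient updates this epoch."""
    phase = order[(epoch // phase_length) % len(order)]
    if phase == "joint":
        return {"object", "state"}
    return {phase}

# Six epochs with two epochs per phase.
schedule = [sorted(trainable_groups(epoch)) for epoch in range(6)]
```

In a training loop, the groups not returned for an epoch would simply have their parameters frozen for that epoch.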
2305.00942
Report |
StyleAvatar: Real-time Photo-realistic Portrait Avatar from a Single Video |
Lizhen Wang, Xiaochen Zhao, Jingxiang Sun, Yuxiang Zhang, Hongwen Zhang, Tao Yu, Yebin Liu |
Face reenactment methods attempt to restore and re-animate portrait videos as
realistically as possible. Existing methods face a dilemma in quality versus
controllability: 2D GAN-based methods achieve higher image quality but suffer
in fine-grained control of facial attributes compared with 3D counterparts. In
this work, we propose StyleAvatar, a real-time photo-realistic portrait avatar
reconstruction method using StyleGAN-based networks, which can generate
high-fidelity portrait avatars with faithful expression control. We expand the
capabilities of StyleGAN by introducing a compositional representation and a
sliding window augmentation method, which enable faster convergence and improve
translation generalization. Specifically, we divide the portrait scenes into
three parts for adaptive adjustments: facial region, non-facial foreground
region, and the background. Besides, our network leverages the best of UNet,
StyleGAN and time coding for video learning, which enables high-quality video
generation. Furthermore, a sliding window augmentation method together with a
pre-training strategy are proposed to improve translation generalization and
training performance, respectively. The proposed network can converge within
two hours while ensuring high image quality and a forward rendering time of
only 20 milliseconds. Furthermore, we propose a real-time live system, which
further pushes research into applications. Results and experiments demonstrate
the superiority of our method in terms of image quality, full portrait video
generation, and real-time re-animation compared to existing facial reenactment
methods. Training and inference code for this paper are at
https://github.com/LizhenWangT/StyleAvatar. |
This paper presents StyleAvatar, a real-time photo-realistic portrait avatar reconstruction method using a StyleGAN-based network trained on a single video. |
The method addresses limitations of existing 2D and 3D portrait avatar approaches, aiming for high-fidelity, fast training, fine-grained control, and real-time efficiency. |
StyleAvatar utilizes a compositional representation, dividing the scene into facial, non-facial foreground, and background regions for adaptive adjustments. It leverages StyleGAN generators, a StyleUNet, Neural Textures, and a sliding window augmentation method for high-quality and efficient portrait avatar generation. |
StyleAvatar outperforms state-of-the-art one-shot and video-based facial reenactment methods in terms of image quality and control of facial attributes.
The proposed method achieves significantly faster training and rendering times compared to existing methods.
The system allows for real-time re-animation of the learned facial avatar with other subjects. |
The method is limited by the training dataset and struggles with poses and expressions significantly different from those seen during training.
Inaccuracies in 3DMM tracking can lead to imprecise expression control and unrealistic mouth interiors during reenactment. |
facial reenactment, stylegan, video portraits, deep learning, rendering-to-video translation |
2305.00936
Report |
Generating Texture for 3D Human Avatar from a Single Image using Sampling and Refinement Networks |
Sihun Cha, Kwanggyoon Seo, Amirsaman Ashtari, Junyong Noh |
There has been significant progress in generating an animatable 3D human
avatar from a single image. However, recovering texture for the 3D human avatar
from a single image has been relatively less addressed. Because the generated
3D human avatar reveals the occluded texture of the given image as it moves, it
is critical to synthesize the occluded texture pattern that is unseen from the
source image. To generate a plausible texture map for 3D human avatars, the
occluded texture pattern needs to be synthesized with respect to the visible
texture from the given image. Moreover, the generated texture should align with
the surface of the target 3D mesh. In this paper, we propose a texture
synthesis method for a 3D human avatar that incorporates geometry information.
The proposed method consists of two convolutional networks for the sampling and
refining process. The sampler network fills in the occluded regions of the
source image and aligns the texture with the surface of the target 3D mesh
using the geometry information. The sampled texture is further refined and
adjusted by the refiner network. To maintain the clear details in the given
image, the sampled and refined textures are blended to produce the final texture
map. To effectively guide the sampler network to achieve its goal, we designed
a curriculum learning scheme that starts from a simple sampling task and
gradually progresses to the task where the alignment needs to be considered. We
conducted experiments to show that our method outperforms previous methods
qualitatively and quantitatively. |
This paper presents a novel method for generating high-quality texture maps for 3D human avatars from single images, addressing the challenge of synthesizing occluded texture details and ensuring proper alignment with the 3D mesh. |
Recovering texture for 3D human avatars from a single image is crucial for various applications like VR/AR, but it's challenging due to limited visible texture information and the need for alignment with the 3D mesh. |
The method utilizes two convolutional networks: a Sampler Network (SamplerNet) to complete the texture map by sampling from visible regions guided by geometry information and a Refiner Network (RefinerNet) to enhance details and refine the sampled texture. A curriculum learning scheme is employed to train SamplerNet effectively. |
The proposed method outperforms previous state-of-the-art techniques in both visual quality and quantitative metrics.
It effectively synthesizes occluded texture details while preserving the appearance of visible regions in the source image.
The method demonstrates robustness to different viewpoints and successfully generates plausible textures from non-frontal images. |
The method's performance is limited by the training dataset, particularly in handling diverse clothing styles and human identities.
Current implementation relies on a supervised setting, requiring ground truth data for training. |
texture synthesis, 3d human avatar, single image, curriculum learning, deep learning |
2305.00866
Report |
Attack-SAM: Towards Attacking Segment Anything Model With Adversarial Examples |
Chenshuang Zhang, Chaoning Zhang, Taegoo Kang, Donghun Kim, Sung-Ho Bae, In So Kweon |
Segment Anything Model (SAM) has attracted significant attention recently,
due to its impressive performance on various downstream tasks in a zero-shot
manner. The computer vision (CV) field might follow the natural language processing
(NLP) field in embarking on a path from task-specific vision models toward
foundation models. However, deep vision models are widely recognized as
vulnerable to adversarial examples, which fool the model into making wrong
predictions with imperceptible perturbations. Such vulnerability to adversarial
attacks causes serious concerns when applying deep models to security-sensitive
applications. Therefore, it is critical to know whether the vision foundation
model SAM can also be fooled by adversarial attacks. To the best of our
knowledge, our work is the first of its kind to conduct a comprehensive
investigation on how to attack SAM with adversarial examples. With the basic
attack goal set to mask removal, we investigate the adversarial robustness of
SAM in the full white-box setting and transfer-based black-box settings. Beyond
the basic goal of mask removal, we further investigate and find that it is
possible to generate any desired mask by the adversarial attack. |
This paper presents the first comprehensive study on the vulnerability of the Segment Anything Model (SAM) to adversarial attacks. |
SAM, as a foundation model for image segmentation, has significant implications for various applications. Understanding its robustness against adversarial attacks is crucial, especially for security-sensitive applications. |
The authors propose a framework called Attack-SAM, which focuses on mask removal as the primary attack goal. They employ FGSM and PGD attacks with a tailored loss function (ClipMSE) to generate adversarial examples. They further investigate cross-prompt and cross-task transferability of the attacks. |
SAM is highly vulnerable to adversarial attacks in the white-box setting, exhibiting successful mask removal.
The attack demonstrates cross-prompt transferability, meaning the adversary doesn't need prior knowledge of the prompt to launch a successful attack.
Adversarial examples generated for semantic label prediction tasks can partially attack SAM, indicating cross-task transferability. |
The study primarily focuses on the mask removal attack, leaving room for exploring other attack goals and their impact on SAM.
The attack performance in challenging scenarios like cross-task attacks is partial, suggesting further research to enhance attack effectiveness. |
adversarial attacks, segment anything model (sam), image segmentation, model robustness, computer vision |
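The mask-removal attack in the Attack-SAM report above can be illustrated with PGD on a toy model. Everything here is an assumption made for illustration: a linear map stands in for SAM's mask logits, and the loss is one plausible reading of "ClipMSE" (squared distance between logits clipped from below at a negative threshold and that threshold). The name `neg_th` and all numeric values are hypothetical.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def clip_mse_and_grad(W, x, neg_th=-1.0):
    """Loss and input gradient for the toy linear 'mask logits' W @ x."""
    logits = matvec(W, x)
    # The residual is zero once a logit has reached the negative threshold.
    resid = [max(logit, neg_th) - neg_th for logit in logits]
    loss = sum(r * r for r in resid)
    grad = [sum(2.0 * r * row[j] for r, row in zip(resid, W))
            for j in range(len(x))]
    return loss, grad

def sign(v):
    return (v > 0) - (v < 0)

def pgd_mask_removal(W, x_clean, eps=0.5, alpha=0.1, steps=20):
    x = list(x_clean)
    for _ in range(steps):
        _, grad = clip_mse_and_grad(W, x)
        # Signed gradient descent step, then projection into the L-infinity ball.
        x = [xi - alpha * sign(g) for xi, g in zip(x, grad)]
        x = [min(max(xi, xc - eps), xc + eps) for xi, xc in zip(x, x_clean)]
    return x

W = [[1.0, 0.5], [0.5, 1.0]]
x_clean = [0.3, 0.3]          # clean logits are positive: mask present
x_adv = pgd_mask_removal(W, x_clean)
adv_logits = matvec(W, x_adv)
```

Setting `steps=1` and `alpha=eps` reduces this loop to a single FGSM-style step.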
2305.00599
Report |
StyleGenes: Discrete and Efficient Latent Distributions for GANs |
Evangelos Ntavelis, Mohamad Shahbazi, Iason Kastanis, Radu Timofte, Martin Danelljan, Luc Van Gool |
We propose a discrete latent distribution for Generative Adversarial Networks
(GANs). Instead of drawing latent vectors from a continuous prior, we sample
from a finite set of learnable latents. However, a direct parametrization of
such a distribution leads to an intractable linear increase in memory in order
to ensure sufficient sample diversity. We address this key issue by taking
inspiration from the encoding of information in biological organisms. Instead
of learning a separate latent vector for each sample, we split the latent space
into a set of genes. For each gene, we train a small bank of gene variants.
Thus, by independently sampling a variant for each gene and combining them into
the final latent vector, our approach can represent a vast number of unique
latent samples from a compact set of learnable parameters. Interestingly, our
gene-inspired latent encoding allows for new and intuitive approaches to
latent-space exploration, enabling conditional sampling from our
unconditionally trained model. Moreover, our approach preserves
state-of-the-art photo-realism while achieving better disentanglement than the
widely-used StyleMapping network. |
This paper introduces StyleGenes, a novel approach to GAN latent spaces that utilizes a discrete distribution inspired by biological DNA encoding for more interpretable and efficient image generation. |
This approach addresses the limitations of continuous latent spaces in GANs, particularly in terms of disentanglement and interpretability, paving the way for more controllable and diverse image synthesis. |
The method divides the latent code into smaller, independent units called "genes," each with a set of learnable "variants." By combining these variants, the model can generate a vast number of distinct images from a compact set of parameters. The relationship between genes and image attributes is analyzed to enable conditional sampling from the trained model. |
StyleGenes achieves comparable image quality to GANs with continuous latent spaces, evidenced by similar FID scores across multiple datasets.
The discrete nature of StyleGenes allows for a more straightforward analysis of the relationship between latent codes and image attributes, leading to improved disentanglement compared to StyleGAN's W space.
This approach enables conditional image generation from an unconditionally trained model by leveraging the learned associations between genes and attributes, without needing additional training or modules. |
The reliance on pre-trained classifiers to analyze attribute relationships introduces limitations due to potential biases and dataset discrepancies.
Further exploration of techniques to improve attribute-based control and incorporate real images into the codebook is warranted. |
generative adversarial networks (gans), discrete latent space, image generation, conditional image synthesis, disentanglement |
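The gene-and-variant sampling in the StyleGenes report above can be sketched in a few lines. This is a minimal sketch in which random numbers stand in for the learnable variant vectors; the bank sizes and the helper names `make_gene_bank` and `sample_latent` are hypothetical.

```python
import random

def make_gene_bank(num_genes, variants_per_gene, gene_dim, rng):
    """bank[g][k] is the k-th variant vector of gene g (random placeholders)."""
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(gene_dim)]
         for _ in range(variants_per_gene)]
        for _ in range(num_genes)
    ]

def sample_latent(bank, rng):
    """Pick one variant per gene independently and concatenate them."""
    choice = [rng.randrange(len(variants)) for variants in bank]
    latent = [value for g, k in enumerate(choice) for value in bank[g][k]]
    return latent, choice

rng = random.Random(0)
bank = make_gene_bank(num_genes=8, variants_per_gene=4, gene_dim=2, rng=rng)
latent, choice = sample_latent(bank, rng)

stored_values = 8 * 4 * 2    # numbers held in the bank
distinct_latents = 4 ** 8    # distinct latent vectors the bank can express
```

The stored parameter count grows with the sum over genes while the number of expressible latents grows with the product, which is the compactness argument made in the abstract.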
2305.00521
Report |
StyleLipSync: Style-based Personalized Lip-sync Video Generation |
Taekyung Ki, Dongchan Min |
In this paper, we present StyleLipSync, a style-based personalized lip-sync
video generative model that can generate identity-agnostic lip-synchronizing
video from arbitrary audio. To generate a video of arbitrary identities, we
leverage an expressive lip prior from the semantically rich latent space of a
pre-trained StyleGAN, where we can also design a video consistency with a
linear transformation. In contrast to the previous lip-sync methods, we
introduce pose-aware masking that dynamically locates the mask to improve the
naturalness over frames by utilizing a 3D parametric mesh predictor frame by
frame. Moreover, we propose a few-shot lip-sync adaptation method for an
arbitrary person by introducing a sync regularizer that preserves lip-sync
generalization while enhancing the person-specific visual information.
Extensive experiments demonstrate that our model can generate accurate lip-sync
videos even with the zero-shot setting and enhance characteristics of an unseen
face using a few seconds of target video through the proposed adaptation
method. |
Presents StyleLipSync, a style-based lip-sync video generative model that generates identity-agnostic lip-syncing videos from arbitrary audio using a pre-trained StyleGAN and pose-aware masking. |
Addresses limitations of previous lip-sync methods that struggle with inaccurate lip-syncing, blurry results, and lack of temporal consistency, aiming to generate high-fidelity, temporally consistent videos of arbitrary identities. |
Leverages a pre-trained StyleGAN for lip prior and linear manipulation of style codes for lip-syncing, introduces pose-aware masking using a 3D face mesh predictor for improved naturalness, and employs a Moving-average based Latent Smoothing module for temporal consistency. Additionally proposes a few-shot adaptation method for unseen faces using a sync regularizer. |
Outperforms state-of-the-art methods in lip-sync and visual quality on Voxceleb2 reconstruction.
Achieves state-of-the-art lip-sync accuracy and comparable face similarity in cross-id experiments on HDTF.
User study confirms superior lip-sync accuracy, face similarity, and visual quality compared to other methods. |
Extending to higher resolutions is challenging due to the need for a large number of identities during training.
Improving lip identity preservation in a zero-shot setting with a more effective reference encoder is a potential area for improvement. |
lip-sync, video generation, stylegan, pose-aware masking, few-shot adaptation |
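The moving-average latent smoothing mentioned in the StyleLipSync report above can be sketched as a causal running mean over per-frame latent vectors. The window size, the causal (past-only) window, and the function name `smooth_latents` are assumptions; the report does not specify them.

```python
def smooth_latents(latents, window=3):
    """Causal moving average over a list of per-frame latent vectors."""
    smoothed = []
    for t in range(len(latents)):
        start = max(0, t - window + 1)
        frames = latents[start:t + 1]
        # Average each latent dimension over the frames in the window.
        smoothed.append([sum(column) / len(frames) for column in zip(*frames)])
    return smoothed

# Toy sequence: the first dimension jitters, the second is constant.
latents = [[0.0, 1.0], [3.0, 1.0], [0.0, 1.0], [3.0, 1.0]]
smoothed = smooth_latents(latents)
```

The jittering dimension varies less after smoothing, which is the temporal-consistency effect the module is meant to provide.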
2305.00278
Report |
Segment Anything Model (SAM) Meets Glass: Mirror and Transparent Objects Cannot Be Easily Detected |
Dongsheng Han, Chaoning Zhang, Yu Qiao, Maryam Qamar, Yuna Jung, SeungKyu Lee, Sung-Ho Bae, Choong Seon Hong |
Meta AI Research has recently released SAM (Segment Anything Model) which is
trained on a large segmentation dataset of over 1 billion masks. As a
foundation model in the field of computer vision, SAM
has gained attention for its impressive performance in generic object
segmentation. Despite its strong capability in a wide range of zero-shot
transfer tasks, it remains unknown whether SAM can detect things in challenging
setups like transparent objects. In this work, we perform an empirical
evaluation of two glass-related challenging scenarios: mirror and transparent
objects. We found that SAM often fails to detect the glass in both scenarios,
which raises concerns about deploying SAM in safety-critical situations that
have various forms of glass. |
This paper presents the first empirical study evaluating the performance of the Segment Anything Model (SAM) in detecting and segmenting transparent and mirror objects. |
This evaluation is crucial because the failure of SAM to recognize glass in safety-critical applications, where glass is ubiquitous, could lead to serious consequences. |
The study uses four established benchmark datasets, two for glass (GDD and GSD) and two for mirrors (MSD and PMD), and employs five standard evaluation metrics (IoU, ACC, Fβ, MAE, and BER) to compare SAM with state-of-the-art methods in semantic and glass/mirror segmentation. |
SAM often fails to detect glass in both mirror and transparent object scenarios, significantly underperforming compared to specialized models.
The model frequently recognizes objects behind transparent surfaces but not the glass itself, highlighting its difficulty in distinguishing between transmitted and reflected light.
While SAM shows comparable performance to some methods on the PMD mirror dataset (where boundaries are clearer), its performance on the MSD dataset is unsatisfactory, often segmenting reflected objects instead of the mirror. |
The study primarily focuses on single-object scenes and might not fully represent the complexity of real-world scenarios with multiple transparent or reflective surfaces.
Further research is needed to develop strategies, such as incorporating specific data augmentation techniques or fine-tuning SAM on glass-related datasets, to improve its performance in detecting glass. |
segment anything model (sam), glass detection, transparent object segmentation, mirror segmentation, computer vision |
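Several of the evaluation metrics named in the report above (IoU, ACC, MAE, and BER) can be computed from binary masks as sketched below. The formulas are the standard definitions for binary segmentation, with BER reported as a fraction rather than a percentage. Fβ is omitted because the value of β is not given in this record. The sketch assumes both classes occur in the ground truth.

```python
def binary_metrics(pred, gt):
    """pred and gt are flat lists of 0/1 values of equal length."""
    pairs = list(zip(pred, gt))
    tp = sum(1 for p, g in pairs if p == 1 and g == 1)
    tn = sum(1 for p, g in pairs if p == 0 and g == 0)
    fp = sum(1 for p, g in pairs if p == 1 and g == 0)
    fn = sum(1 for p, g in pairs if p == 0 and g == 1)
    return {
        "iou": tp / (tp + fp + fn),
        "acc": (tp + tn) / len(pairs),
        # For binary maps, mean absolute error equals the pixel error rate.
        "mae": sum(abs(p - g) for p, g in pairs) / len(pairs),
        # Balanced error rate averages the error on each class.
        "ber": 1.0 - 0.5 * (tp / (tp + fn) + tn / (tn + fp)),
    }

pred = [1, 1, 0, 0, 1, 0]
gt = [1, 0, 0, 0, 1, 1]
metrics = binary_metrics(pred, gt)
```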
2305.00121
Report |
Learning Locally Editable Virtual Humans |
Hsuan-I Ho, Lixin Xue, Jie Song, Otmar Hilliges |
In this paper, we propose a novel hybrid representation and end-to-end
trainable network architecture to model fully editable and customizable neural
avatars. At the core of our work lies a representation that combines the
modeling power of neural fields with the ease of use and inherent 3D
consistency of skinned meshes. To this end, we construct a trainable feature
codebook to store local geometry and texture features on the vertices of a
deformable body model, thus exploiting its consistent topology under
articulation. This representation is then employed in a generative auto-decoder
architecture that admits fitting to unseen scans and sampling of realistic
avatars with varied appearances and geometries. Furthermore, our representation
allows local editing by swapping local features between 3D assets. To verify
our method for avatar creation and editing, we contribute a new high-quality
dataset, dubbed CustomHumans, for training and evaluation. Our experiments
quantitatively and qualitatively show that our method generates diverse
detailed avatars and achieves better model fitting performance compared to
state-of-the-art methods. Our code and dataset are available at
https://custom-humans.github.io/. |
This paper introduces a novel hybrid representation and generative framework for creating fully editable and customizable 3D human avatars. |
Creating personalized and easily editable avatars is crucial for enhancing user engagement in various applications like gaming and the Metaverse. |
The proposed method combines a trainable feature codebook storing local geometry and texture features on a deformable body model with a generative auto-decoder architecture. This architecture is trained on 3D scans using both 3D reconstruction and 2D adversarial losses. |
The hybrid representation allows local editing by swapping features between avatars.
The model can be fitted to unseen 3D scans, enabling personalization.
The generative framework allows for the creation of diverse and detailed avatars by sampling from the learned feature space. |
The quality of generated avatars relies heavily on the diversity and quality of training data.
The editing workflow currently requires manual intervention for feature swapping and could benefit from automation. |
3d avatars, neural fields, generative models, avatar customization, local editing |
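The local-editing idea in this record (swapping per-vertex features between two avatars that share one body-model topology) can be illustrated with a minimal sketch. The codebook layout, the function name `swap_local_features`, and the region indices are assumptions made for illustration; they are not taken from the authors' code.

```python
from typing import List, Sequence

Codebook = List[List[float]]  # one feature vector per body-model vertex (assumed layout)


def swap_local_features(target: Codebook, source: Codebook,
                        vertex_ids: Sequence[int]) -> Codebook:
    """Return a copy of `target` whose rows listed in `vertex_ids` come from `source`."""
    if len(target) != len(source):
        raise ValueError("codebooks must share the same vertex topology")
    edited = [row[:] for row in target]  # copy so the input avatar stays untouched
    for v in vertex_ids:
        edited[v] = source[v][:]
    return edited


# Toy example: 4 vertices with 2-dimensional features (both sizes are made up).
avatar_a = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]
avatar_b = [[9.0, 9.0], [9.1, 9.1], [9.2, 9.2], [9.3, 9.3]]
torso_vertices = [1, 2]  # hypothetical indices of the region being edited
edited = swap_local_features(avatar_a, avatar_b, torso_vertices)
```

Because both codebooks are indexed by the same vertices, the swap needs no correspondence search, which is the practical benefit of storing features on a mesh with consistent topology.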
2304.14610
Report |
ALL-E: Aesthetics-guided Low-light Image Enhancement |
Ling Li, Dong Liang, Yuanhang Gao, Sheng-Jun Huang, Songcan Chen |
Evaluating the performance of low-light image enhancement (LLE) is highly
subjective, thus making integrating human preferences into image enhancement a
necessity. Existing methods fail to consider this and present a series of
potentially valid heuristic criteria for training enhancement models. In this
paper, we propose a new paradigm, i.e., aesthetics-guided low-light image
enhancement (ALL-E), which introduces aesthetic preferences to LLE and
motivates training in a reinforcement learning framework with an aesthetic
reward. Each pixel, functioning as an agent, refines itself by recursive
actions, i.e., its corresponding adjustment curve is estimated sequentially.
Extensive experiments show that integrating aesthetic assessment improves both
subjective experience and objective evaluation. Our results on various
benchmarks demonstrate the superiority of ALL-E over state-of-the-art methods. |
This paper introduces ALL-E, a novel aesthetics-guided low-light image enhancement (LLE) paradigm incorporating aesthetic assessment to improve the subjective and objective quality of enhanced images. |
Existing LLE methods rely on heuristic criteria and overlook the crucial role of human subjective evaluation, particularly the impact of aesthetic preferences on perceived image quality. |
ALL-E employs a reinforcement learning framework where each pixel acts as an agent, refining itself through iterative actions guided by an aesthetic reward. It leverages a pre-trained 'aesthetic oracle network' to provide general aesthetic preferences and incorporates rewards for aesthetics, feature preservation, and exposure control. |
ALL-E generates visually more appealing enhancements compared to state-of-the-art methods, as demonstrated on LOL and LIME datasets.
Quantitative evaluations using NIQE, UNIQUE, PSNR, and SSIM metrics on various datasets confirm ALL-E's superior performance in terms of image quality.
Human subjective surveys consistently rank ALL-E as the top performer, highlighting its ability to enhance images while preserving naturalness and visual appeal. |
The paper acknowledges the potential limitation of ALL-E in handling specific scenarios like nightscapes where preserving the low-light aesthetic might be preferable to brightening the entire scene.
Future work will explore incorporating high-level semantic theme guidance to address the issue of excessive exposure adjustment in such cases. |
low-light image enhancement, aesthetic assessment, reinforcement learning, image quality assessment, computer vision |
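The recursive per-pixel adjustment described in this record can be sketched as follows. The source only states that each pixel's adjustment curve is estimated sequentially; the quadratic curve `x + a * x * (1 - x)` used here is an assumption borrowed from common curve-based enhancement methods, and the per-step parameters would come from the learned policy rather than being fixed by hand.

```python
from typing import List

Image = List[List[float]]  # grayscale image with intensities in [0, 1]


def apply_curve(image: Image, alpha: Image) -> Image:
    """Apply one step of the assumed curve x + a * x * (1 - x) to every pixel."""
    return [[x + a * x * (1.0 - x) for x, a in zip(row, arow)]
            for row, arow in zip(image, alpha)]


def enhance(image: Image, alphas: List[Image]) -> Image:
    """Apply one curve per step; each step acts on the previous step's output."""
    for alpha in alphas:
        image = apply_curve(image, alpha)
    return image


dark = [[0.05, 0.10], [0.20, 0.40]]
# Three steps with a hypothetical, spatially constant parameter of 0.8.
steps = [[[0.8, 0.8], [0.8, 0.8]] for _ in range(3)]
bright = enhance(dark, steps)
```

With parameters in [-1, 1] this particular curve keeps intensities inside [0, 1], so repeated application cannot overflow the valid range.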
2304.14573
Report |
SceneGenie: Scene Graph Guided Diffusion Models for Image Synthesis |
Azade Farshad, Yousef Yeganeh, Yu Chi, Chengzhi Shen, Björn Ommer, Nassir Navab |
Text-conditioned image generation has made significant progress in recent
years with generative adversarial networks and more recently, diffusion models.
While diffusion models conditioned on text prompts have produced impressive and
high-quality images, accurately representing complex text prompts such as the
number of instances of a specific object remains challenging.
To address this limitation, we propose a novel guidance approach for the
sampling process in the diffusion model that leverages bounding box and
segmentation map information at inference time without additional training
data. Through a novel loss in the sampling process, our approach guides the
model with semantic features from CLIP embeddings and enforces geometric
constraints, leading to high-resolution images that accurately represent the
scene. To obtain bounding box and segmentation map information, we structure
the text prompt as a scene graph and enrich the nodes with CLIP embeddings. Our
proposed model achieves state-of-the-art performance on two public benchmarks
for image generation from scene graphs, surpassing both scene graph to image
and text-based diffusion models in various metrics. Our results demonstrate the
effectiveness of incorporating bounding box and segmentation map guidance in
the diffusion model sampling process for more accurate text-to-image
generation. |
This paper introduces a novel guidance approach for diffusion models in image synthesis, enhancing accuracy in depicting complex scenes from text prompts, particularly in representing the correct number of object instances. |
Existing text-to-image generation models, while producing impressive results, struggle with accurately representing the number of instances of objects in an image, especially from complex textual descriptions. This work addresses this limitation to achieve more precise image generation. |
The proposed method leverages scene graphs derived from text prompts to predict bounding boxes and segmentation maps. These maps, along with CLIP embeddings, guide the diffusion model's sampling process, ensuring both object realism and correct scene layout. The approach incorporates CLIP text guidance, CLIP bounding box guidance (with an augmented version), and segmentation map guidance. |
The method outperforms state-of-the-art text-to-image diffusion models and scene graph-to-image approaches on the COCO-Stuff and Visual Genome benchmarks, without requiring additional training.
Using predicted bounding box and segmentation map information from text prompts leads to superior results compared to existing models.
Incorporating CLIP embeddings in the scene graph nodes enhances the accuracy of bounding box and segmentation predictions, further improving image generation quality. |
The model faces challenges in generating high-quality images of complex structures like faces, suggesting potential improvement through fine-tuning on specific datasets.
Like many diffusion models, the image generation process can be time-consuming during the reverse sampling stage. |
image synthesis, diffusion models, scene graphs, text-to-image generation, clip embeddings |
2304.14530
Report |
Generating images of rare concepts using pre-trained diffusion models |
Dvir Samuel, Rami Ben-Ari, Simon Raviv, Nir Darshan, Gal Chechik |
Text-to-image diffusion models can synthesize high-quality images, but they
have various limitations. Here we highlight a common failure mode of these
models, namely, generating uncommon concepts and structured concepts like hand
palms. We show that their limitation is partly due to the long-tail nature of
their training data: web-crawled data sets are strongly unbalanced, causing
models to under-represent concepts from the tail of the distribution. We
characterize the effect of unbalanced training data on text-to-image models and
offer a remedy. We show that rare concepts can be correctly generated by
carefully selecting suitable generation seeds in the noise space, using a small
reference set of images, a technique that we call SeedSelect. SeedSelect does
not require retraining or finetuning the diffusion model. We assess the
faithfulness, quality and diversity of SeedSelect in creating rare objects and
generating complex formations like hand images, and find it consistently
achieves superior performance. We further show the advantage of SeedSelect in
semantic data augmentation. Generating semantically appropriate images can
successfully improve performance in few-shot recognition benchmarks, for
classes from the head and from the tail of the training data of diffusion
models. |
This paper introduces SeedSelect, a method for improving text-to-image diffusion models' ability to generate uncommon or structurally complex concepts by carefully selecting the generation seed in the noise space. |
Current text-to-image diffusion models struggle to generate concepts under-represented in their training data, limiting their ability to synthesize diverse and accurate images. |
SeedSelect uses a small set of reference images to optimize the initial noise tensor (seed) via gradient descent, minimizing a combined semantic and appearance loss based on CLIP embeddings and the diffusion model's VAE. |
SeedSelect significantly improves the faithfulness of generated images for rare concepts, as evaluated by pre-trained classifiers and human raters.
SeedSelect maintains high image quality (measured by FID) comparable to the original diffusion model.
SeedSelect boosts performance in few-shot image recognition tasks, outperforming previous methods in generating valuable and diverse semantic augmentations. |
SeedSelect struggles to imitate the style of reference images.
The optimized seed is prompt-specific and doesn't generalize to other prompts. |
text-to-image synthesis, diffusion models, long-tail learning, semantic data augmentation, few-shot learning |
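The core of seed optimization (gradient descent on the initial noise while the generator stays frozen) can be shown with a toy sketch. Everything here is a stand-in: a fixed linear map `W` replaces the diffusion model, plain feature vectors replace CLIP and VAE embeddings, and a squared distance to the reference centroid replaces the combined semantic and appearance loss.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def loss_and_grad(w: Matrix, z: Vector, centroid: Vector) -> Tuple[float, Vector]:
    """Squared distance between the toy 'generated' features W z and the centroid."""
    residual = [g - c for g, c in zip(matvec(w, z), centroid)]
    loss = sum(r * r for r in residual)
    # d/dz ||W z - c||^2 = 2 W^T (W z - c)
    grad = [2.0 * sum(w[i][j] * residual[i] for i in range(len(w)))
            for j in range(len(z))]
    return loss, grad


def select_seed(w: Matrix, z0: Vector, references: List[Vector],
                lr: float = 0.1, steps: int = 200) -> Tuple[Vector, Vector]:
    """Optimize only the seed z; the toy generator W is never updated."""
    centroid = [sum(col) / len(references) for col in zip(*references)]
    z = list(z0)
    for _ in range(steps):
        _, grad = loss_and_grad(w, z, centroid)
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return z, centroid


W = [[1.0, 0.5], [0.0, 1.0]]                  # frozen toy generator
refs = [[2.0, 1.0], [2.2, 0.8], [1.8, 1.2]]   # toy reference features
z_init = [0.0, 0.0]
z_opt, c = select_seed(W, z_init, refs)
loss_before, _ = loss_and_grad(W, z_init, c)
loss_after, _ = loss_and_grad(W, z_opt, c)
```

The sketch also makes the stated limitation concrete: the optimized seed is tied to one target (here, one centroid), so a different prompt would need its own optimization run.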
2304.14403
Report |
Make It So: Steering StyleGAN for Any Image Inversion and Editing |
Anand Bhattad, Viraj Shah, Derek Hoiem, D. A. Forsyth |
StyleGAN's disentangled style representation enables powerful image editing
by manipulating the latent variables, but accurately mapping real-world images
to their latent variables (GAN inversion) remains a challenge. Existing GAN
inversion methods struggle to maintain editing directions and produce realistic
results.
To address these limitations, we propose Make It So, a novel GAN inversion
method that operates in the $\mathcal{Z}$ (noise) space rather than the typical
$\mathcal{W}$ (latent style) space. Make It So preserves editing capabilities,
even for out-of-domain images. This is a crucial property that was overlooked
in prior methods. Our quantitative evaluations demonstrate that Make It So
outperforms the state-of-the-art method PTI (Roich et al., 2021) by a factor
of five in inversion accuracy and achieves ten times better edit quality for
complex indoor scenes. |
Presents "Make It So," a novel GAN inversion method that achieves superior accuracy, edit consistency, and generalization for complex scenes compared to existing methods. |
Accurate GAN inversion is crucial for applying pre-trained GAN models for image manipulation and editing, especially in challenging domains like complex indoor scenes. |
Make It So inverts images in the noise space (Z space) using joint optimization of the noise vector and the StyleGAN generator. It introduces anchor and support losses for edit consistency and generalization and employs an exponential moving average strategy for faster, cleaner inversion. |
Outperforms the state-of-the-art method PTI by a factor of five in inversion accuracy.
Achieves ten times better edit quality than PTI on complex indoor scenes, demonstrating superior edit consistency.
Generalizes well to out-of-domain images, enabling inversion and editing of images from domains different from the StyleGAN's training data. |
The optimization-based nature of Make It So limits its real-time applicability.
Extremely challenging scenes might require more iterations for near-perfect inversion. |
gan inversion, image editing, stylegan, deep learning, computer vision |
2304.14396
Report |
Learning Articulated Shape with Keypoint Pseudo-labels from Web Images |
Anastasis Stathopoulos, Georgios Pavlakos, Ligong Han, Dimitris Metaxas |
This paper shows that it is possible to learn models for monocular 3D
reconstruction of articulated objects (e.g., horses, cows, sheep), using as few
as 50-150 images labeled with 2D keypoints. Our proposed approach involves
training category-specific keypoint estimators, generating 2D keypoint
pseudo-labels on unlabeled web images, and using both the labeled and
self-labeled sets to train 3D reconstruction models. It is based on two key
insights: (1) 2D keypoint estimation networks trained on as few as 50-150
images of a given object category generalize well and generate reliable
pseudo-labels; (2) a data selection mechanism can automatically create a
"curated" subset of the unlabeled web images that can be used for training --
we evaluate four data selection methods. Coupling these two insights enables us
to train models that effectively utilize web images, resulting in improved 3D
reconstruction performance for several articulated object categories beyond the
fully-supervised baseline. Our approach can quickly bootstrap a model and
requires only a few images labeled with 2D keypoints. This requirement can be
easily satisfied for any new object category. To showcase the practicality of
our approach for predicting the 3D shape of arbitrary object categories, we
annotate 2D keypoints on giraffe and bear images from COCO -- the annotation
process takes less than 1 minute per image. |
This paper presents a method for learning 3D reconstruction models of articulated objects from a limited set of 2D keypoint labeled images (50-150) by leveraging unlabeled web images and pseudo-labeling. |
Building 3D reconstruction models for articulated objects typically requires a large amount of labeled data, which is often unavailable. This method addresses this challenge by enabling the use of readily available web images. |
The method involves training a 2D keypoint estimator on a small labeled dataset, generating pseudo-labels for web images, and then using a data selection criterion to curate a subset of web images with high-quality pseudo-labels for training the 3D shape predictor. |
Training with keypoint pseudo-labels significantly improves 3D reconstruction performance compared to using only limited labeled data.
Data selection from web images is crucial, as using all pseudo-labels can degrade performance.
Consistency-based data selection methods outperform confidence-based methods in selecting high-quality pseudo-labels. |
The 3D shape prediction model is limited by the expressiveness of the template mesh and the assumed articulation model.
The data selection relies on the quality of the 2D keypoint estimator, which might be limited by the initial small labeled dataset. |
3d reconstruction, articulated objects, semi-supervised learning, pseudo-labeling, data selection |
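A consistency-based selection step of the kind this record favors can be sketched as below. The specific criterion (mean distance between two keypoint predictions for the same image, for example from two estimators or two views mapped back to a common frame) is an assumption; the source says only that four selection methods were evaluated and that consistency-based ones worked better than confidence-based ones.

```python
import math
from typing import Dict, List, Tuple

Keypoints = List[Tuple[float, float]]


def disagreement(a: Keypoints, b: Keypoints) -> float:
    """Mean Euclidean distance between two keypoint predictions for one image."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def select_consistent(preds_a: Dict[str, Keypoints],
                      preds_b: Dict[str, Keypoints],
                      keep: int) -> List[str]:
    """Return the `keep` image ids whose two predictions agree best."""
    scores = {img: disagreement(preds_a[img], preds_b[img]) for img in preds_a}
    return sorted(scores, key=scores.get)[:keep]


# Hypothetical web images with two keypoints each.
preds_a = {
    "img0": [(10.0, 10.0), (20.0, 20.0)],
    "img1": [(10.0, 10.0), (20.0, 20.0)],
    "img2": [(5.0, 5.0), (6.0, 6.0)],
}
preds_b = {
    "img0": [(11.0, 10.0), (20.0, 21.0)],   # small disagreement
    "img1": [(30.0, 10.0), (20.0, 40.0)],   # large disagreement
    "img2": [(5.0, 5.0), (6.0, 6.0)],       # perfect agreement
}
curated = select_consistent(preds_a, preds_b, keep=2)
```

Only the curated ids would then be added to the training set for the 3D shape predictor, which reflects the finding that using all pseudo-labels can degrade performance.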
2304.14376
Report |
Zero-shot Unsupervised Transfer Instance Segmentation |
Gyungin Shin, Samuel Albanie, Weidi Xie |
Segmentation is a core computer vision competency, with applications spanning
a broad range of scientifically and economically valuable domains. To date,
however, the prohibitive cost of annotation has limited the deployment of
flexible segmentation models. In this work, we propose Zero-shot Unsupervised
Transfer Instance Segmentation (ZUTIS), a framework that aims to meet this
challenge. The key strengths of ZUTIS are: (i) no requirement for
instance-level or pixel-level annotations; (ii) an ability of zero-shot
transfer, i.e., no assumption on access to a target data distribution; (iii) a
unified framework for semantic and instance segmentations with solid
performance on both tasks compared to state-of-the-art unsupervised methods.
Compared to previous work, we show that ZUTIS achieves a gain of 2.2 mask AP
on COCO-20K and 14.5 mIoU on ImageNet-S with 919 categories for instance and
semantic segmentations, respectively. The code is made publicly available. |
This paper introduces ZUTIS, a novel framework for zero-shot unsupervised transfer instance segmentation, enabling simultaneous segmentation of objects and prediction of their semantic categories without human supervision or access to a target dataset. |
This addresses the high cost of obtaining large, accurate collections of pixel-level annotations for training segmentation models, particularly for diverse and novel object categories. |
ZUTIS leverages a pretrained vision-language model (e.g., CLIP) to retrieve images and generate pseudo-masks for training. It employs a query-based transformer decoder for instance segmentation and a projection matrix to align image features with text embeddings for semantic segmentation. |
ZUTIS achieves comparable or better performance than state-of-the-art unsupervised instance segmentation methods on COCO-20K.
It demonstrates strong performance in zero-shot semantic segmentation, outperforming previous methods on COCO and CoCA benchmarks.
ZUTIS exhibits good generalization to novel, unseen categories, evidenced by its performance on CUB-200-2011 and unseen COCO categories. |
The reliance on a vision-language model like CLIP limits ZUTIS's ability to segment extremely rare concepts not present in the pretraining data.
The pseudo-mask generation pipeline using retrieved images and a saliency detector can lead to errors when images contain distracting objects of similar categories. |
instance segmentation, semantic segmentation, zero-shot learning, unsupervised learning, vision-language model |
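The semantic branch described in this record (a projection matrix that aligns image features with text embeddings) can be illustrated by labeling each pixel with the class whose text embedding is most similar to its projected feature. The dimensions, the matrix values, and the use of cosine similarity with an argmax are illustrative assumptions, not the authors' exact procedure.

```python
import math
from typing import List

Vector = List[float]


def project(matrix: List[Vector], v: Vector) -> Vector:
    """Map an image feature into the text-embedding space."""
    return [sum(a * b for a, b in zip(row, v)) for row in matrix]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def classify_pixels(pixel_feats: List[Vector], proj: List[Vector],
                    text_embs: List[Vector], names: List[str]) -> List[str]:
    """Assign each pixel the class name with the most similar text embedding."""
    labels = []
    for feat in pixel_feats:
        p = project(proj, feat)
        sims = [cosine(p, t) for t in text_embs]
        labels.append(names[sims.index(max(sims))])
    return labels


proj = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # toy 3-D to 2-D projection
text_embs = [[1.0, 0.0], [0.0, 1.0]]         # toy text embeddings
names = ["cat", "dog"]
pixel_feats = [[0.9, 0.1, 0.3], [0.2, 0.8, 0.5]]
labels = classify_pixels(pixel_feats, proj, text_embs, names)
```

Since class names enter only through their text embeddings, new categories can be added at inference time by appending embeddings, with no change to the image side. The sketch does not guard against zero-length vectors.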
2304.14291
Report |
EDAPS: Enhanced Domain-Adaptive Panoptic Segmentation |
Suman Saha, Lukas Hoyer, Anton Obukhov, Dengxin Dai, Luc Van Gool |
With autonomous industries on the rise, domain adaptation of the visual
perception stack is an important research direction due to the cost savings
promise. Much prior art was dedicated to domain-adaptive semantic segmentation
in the synthetic-to-real context. Despite being a crucial output of the
perception stack, panoptic segmentation has been largely overlooked by the
domain adaptation community. Therefore, we revisit well-performing domain
adaptation strategies from other fields, adapt them to panoptic segmentation,
and show that they can effectively enhance panoptic domain adaptation. Further,
we study the panoptic network design and propose a novel architecture (EDAPS)
designed explicitly for domain-adaptive panoptic segmentation. It uses a
shared, domain-robust transformer encoder to facilitate the joint adaptation of
semantic and instance features, but task-specific decoders tailored for the
specific requirements of both domain-adaptive semantic and instance
segmentation. As a result, the performance gap seen in challenging panoptic
benchmarks is substantially narrowed. EDAPS significantly improves the
state-of-the-art performance for panoptic segmentation UDA by a large margin of
20% on SYNTHIA-to-Cityscapes and even 72% on the more challenging
SYNTHIA-to-Mapillary Vistas. The implementation is available at
https://github.com/susaha/edaps. |
Proposes EDAPS, a novel architecture for domain-adaptive panoptic segmentation using a shared transformer encoder and task-specific decoders, enhanced with UDA strategies like self-training, mean teacher, rare class sampling, and ImageNet feature distance. |
Panoptic segmentation UDA has been overlooked and existing methods achieve subpar performance compared to semantic segmentation UDA, highlighting the need for specialized architectures and UDA techniques. |
Conducts a systematic study of different panoptic architectures for UDA, identifying the strengths of shared encoder and task-specific decoders. Employs an enhanced UDA strategy incorporating recent techniques from semantic segmentation UDA. |
Improves mPQ over the prior state of the art by 20% on SYNTHIA-to-Cityscapes and by 72% on SYNTHIA-to-Mapillary Vistas.
Significantly improves both recognition and segmentation quality compared to previous methods, particularly excelling in challenging classes.
Provides an efficient architecture with faster inference speed compared to previous methods like CVRN. |
Instance pseudo-labels are not explored; incorporating them could yield further performance gains.
The approach is validated on synthetic-to-real benchmarks and its effectiveness on other domain shifts requires further investigation. |
panoptic segmentation, domain adaptation, unsupervised domain adaptation, transformer, self-training |
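Among the UDA strategies listed in this record, the mean-teacher component is the easiest to show in code: the teacher that produces pseudo-labels is an exponential moving average (EMA) of the student. The sketch below uses a flat dictionary of scalar parameters and a smoothing factor of 0.9 purely for illustration; neither is taken from the paper.

```python
from typing import Dict

Params = Dict[str, float]


def ema_update(teacher: Params, student: Params, alpha: float = 0.999) -> Params:
    """Return teacher weights moved a fraction (1 - alpha) toward the student."""
    return {k: alpha * teacher[k] + (1.0 - alpha) * student[k] for k in teacher}


teacher = {"w": 0.0}
student = {"w": 1.0}   # held fixed here; in training it changes every step
for _ in range(3):
    teacher = ema_update(teacher, student, alpha=0.9)
```

A high smoothing factor makes the teacher change slowly, which tends to stabilize the pseudo-labels it generates on the target domain.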
2304.14070
Report |
Compositional 3D Human-Object Neural Animation |
Zhi Hou, Baosheng Yu, Dacheng Tao |
Human-object interactions (HOIs) are crucial for human-centric scene
understanding applications such as human-centric visual generation, AR/VR, and
robotics. Since existing methods mainly explore capturing HOIs, rendering HOI
remains less investigated. In this paper, we address this challenge in HOI
animation from a compositional perspective, i.e., animating novel HOIs
including novel interaction, novel human and/or novel object driven by a novel
pose sequence. Specifically, we adopt neural human-object deformation to model
and render HOI dynamics based on implicit neural representations. To enable the
interaction pose transferring among different persons and objects, we then
devise a new compositional conditional neural radiance field (or CC-NeRF),
which decomposes the interdependence between human and object using latent
codes to enable compositional animation control of novel HOIs. Experiments
show that the proposed method can generalize well to various novel HOI
animation settings. Our project page is https://zhihou7.github.io/CHONA/ |
This paper introduces CHONA, a novel approach for compositional 3D human-object neural animation. CHONA reconstructs and renders human-object interactions (HOIs) from sparse multi-view videos using neural implicit representations. |
Rendering 3D human-object animation is crucial for various applications like AR/VR and robotics, but existing methods struggle to handle novel interactions, human subjects, and object instances. |
CHONA employs neural human-object deformation, utilizing a pseudo bone for objects and skinning-based techniques for pose-dependent deformation. For compositional control, it utilizes compositional conditional neural radiance fields (CC-NeRF) with disentangled latent codes for human and object identity. |
CHONA outperforms baseline methods in novel pose animation tasks, especially for larger objects.
The compositional invariant learning strategy in CC-NeRF effectively disentangles human and object representations, enabling animation with novel combinations and even non-interactive persons or static objects.
Quantitative and qualitative evaluations on BEHAVE, ZJU-mocap, and CO3D datasets demonstrate the effectiveness of CHONA for compositional HOI animation. |
Accurately understanding the interaction region (object affordance) remains a challenge.
Generating object poses from human motion poses based on interaction categories is a potential area for future exploration. |
human-object interaction, 3d animation, neural radiance fields, compositional representation learning, computer vision |
2304.14006
Report |
Edit Everything: A Text-Guided Generative System for Images Editing |
Defeng Xie, Ruichen Wang, Jian Ma, Chen Chen, Haonan Lu, Dong Yang, Fobo Shi, Xiaodong Lin |
We introduce a new generative system called Edit Everything, which can take
image and text inputs and produce image outputs. Edit Everything allows users
to edit images using simple text instructions. Our system designs prompts to
guide the visual module in generating requested images. Experiments demonstrate
that Edit Everything facilitates the implementation of the visual aspects of
Stable Diffusion with the use of Segment Anything model and CLIP. Our system is
publicly available at https://github.com/DefengXie/Edit_Everything. |
Introduces 'Edit Everything,' a text-guided image editing system that uses SAM for segmentation, CLIP for object ranking, and Stable Diffusion for realistic object replacement. |
Enables efficient and precise image editing based on natural language instructions, addressing the limitations of traditional image editing tools. |
Leverages SAM to segment images, trains CLIP on a large Chinese image-text dataset for object ranking, and utilizes Stable Diffusion for generating replacement objects guided by target prompts. |
Edit Everything effectively edits images based on simple text prompts, seamlessly blending different styles.
The system supports iterative editing for complex prompts, allowing for precise control over the generated output.
Trained on a large Chinese dataset, Edit Everything outperforms open-source models in Chinese text-guided image editing. |
The system relies on pre-trained models (SAM, CLIP, SD) without architectural modifications, potentially limiting performance.
Iterative editing for complex prompts, while accurate, may not be the most efficient approach. |
image editing, text-guided generation, stable diffusion, clip, segment anything |
2304.14005
Report |
ContraNeRF: 3D-Aware Generative Model via Contrastive Learning with Unsupervised Implicit Pose Embedding |
Mijeong Kim, Hyunjoon Lee, Bohyung Han |
Although 3D-aware GANs based on neural radiance fields have achieved
competitive performance, their applicability is still limited to objects or
scenes with the ground-truths or prediction models for clearly defined
canonical camera poses. To extend the scope of applicable datasets, we propose
a novel 3D-aware GAN optimization technique through contrastive learning with
implicit pose embeddings. To this end, we first revise the discriminator design
and remove dependency on ground-truth camera poses. Then, to capture complex
and challenging 3D scene structures more effectively, we make the discriminator
estimate a high-dimensional implicit pose embedding from a given image and
perform contrastive learning on the pose embedding. The proposed approach can
be employed for the dataset, where the canonical camera pose is ill-defined
because it does not look up or estimate camera poses. Experimental results show
that our algorithm outperforms existing methods by large margins on the
datasets with multiple object categories and inconsistent canonical camera
poses. |
This paper introduces ContraNeRF, a novel 3D-aware GAN that leverages contrastive learning with implicit pose embeddings to generate images and their underlying 3D structures without relying on ground-truth camera poses. |
Existing 3D-aware GANs often depend on ground-truth camera poses, limiting their applicability to datasets with clearly defined canonical views. ContraNeRF overcomes this limitation, enabling the generation of complex scenes with heterogeneous geometric configurations. |
The authors modify the discriminator of EG3D to predict implicit pose embeddings instead of explicit camera poses. They then utilize contrastive learning to maximize the similarity between embeddings of images rendered from the same viewpoint while minimizing it for different viewpoints. |
ContraNeRF outperforms state-of-the-art methods in terms of image quality and 3D reconstruction accuracy on datasets like LSUN Bedroom, LSUN Church, AFHQ, and CUB.
The implicit pose embedding effectively captures complex 3D scene structures, even when canonical camera poses are ill-defined.
Increasing the dimensionality of the pose embedding leads to improved 3D reconstruction quality. |
While generally successful, ContraNeRF occasionally produces unrealistic geometries, likely due to outlier training samples or limitations in handling extreme viewpoints.
Future work could explore techniques to mitigate the impact of outliers and further enhance the robustness of the model to diverse camera poses. |
generative adversarial networks, 3d reconstruction, contrastive learning, implicit pose embedding, neural radiance fields |
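The contrastive objective on pose embeddings described in this record can be sketched with an InfoNCE-style loss, where embeddings of images rendered from the same viewpoint form the positive pair. The exact loss form, the temperature, and the 2-D toy embeddings are assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def info_nce(anchor: Vector, positive: Vector, negatives: List[Vector],
             temperature: float = 0.1) -> float:
    """Low when the anchor is close to the positive and far from the negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


anchor = [1.0, 0.0]
same_view = [0.9, 0.1]                     # embedding from the same viewpoint
other_views = [[0.0, 1.0], [-1.0, 0.0]]    # embeddings from different viewpoints
loss_aligned = info_nce(anchor, same_view, other_views)
# Swapping a different-view embedding into the positive slot raises the loss.
loss_misaligned = info_nce(anchor, other_views[0], [same_view, other_views[1]])
```

No explicit camera pose appears anywhere in the loss, which matches the stated motivation of avoiding ground-truth or estimated poses.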
2304.13850
Report |
Do SSL Models Have Déjà Vu? A Case of Unintended Memorization in Self-supervised Learning |
Casey Meehan, Florian Bordes, Pascal Vincent, Kamalika Chaudhuri, Chuan Guo |
Self-supervised learning (SSL) algorithms can produce useful image
representations by learning to associate different parts of natural images with
one another. However, when taken to the extreme, SSL models can unintendedly
memorize specific parts in individual training samples rather than learning
semantically meaningful associations. In this work, we perform a systematic
study of the unintended memorization of image-specific information in SSL
models -- which we refer to as déjà vu memorization. Concretely, we show
that given the trained model and a crop of a training image containing only the
background (e.g., water, sky, grass), it is possible to infer the foreground
object with high accuracy or even visually reconstruct it. Furthermore, we show
that déjà vu memorization is common to different SSL algorithms, is
exacerbated by certain design choices, and cannot be detected by conventional
techniques for evaluating representation quality. Our study of déjà vu
memorization reveals previously unknown privacy risks in SSL models, as well as
suggests potential practical mitigation strategies. Code is available at
https://github.com/facebookresearch/DejaVu. |
This paper investigates and characterizes *déjà vu memorization*: the phenomenon where self-supervised learning (SSL) models unintentionally memorize specific details from individual training images, enabling the recovery of masked information beyond what can be inferred from correlations within the data distribution. |
This work highlights previously unknown privacy risks in SSL models, particularly concerning the potential extraction of sensitive information from trained models. As SSL models gain popularity as foundation models in image processing, understanding and mitigating these risks is crucial for responsible AI development and deployment. |
The authors develop a novel testing methodology that leverages a target model trained on a specific dataset and a reference model trained on a similar but disjoint dataset. By comparing their ability to infer masked information from training images, they can distinguish between memorization (unique to the target model) and correlation (present in both models). They further visualize this memorization through image reconstructions using a public dataset and a conditional generative model (RCDM). |
SSL models exhibit a significant degree of déjà vu memorization, even surpassing supervised models in some cases.
Memorization is exacerbated by factors like increasing training epochs and larger model capacity, while training set size has minimal effect.
Certain SSL training criteria (e.g., VICReg) are more susceptible to memorization than others (e.g., SimCLR, BYOL). |
The paper primarily focuses on image classification as a measure of memorization and relies on bounding box annotations for a subset of experiments.
Further research is needed to understand the underlying mechanisms of déjà vu memorization and develop more robust mitigation strategies beyond hyperparameter tuning and architectural modifications. |
self-supervised learning, memorization, privacy risks, image reconstruction, data protection |
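The target-versus-reference comparison described in this record can be expressed as a simple score: how much more accurately the target model infers the hidden foreground label of its own training images than a reference model that never saw them. The function names and the toy predictions below are hypothetical; the paper's actual metric may be defined differently.

```python
from typing import List


def accuracy(predictions: List[str], labels: List[str]) -> float:
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def deja_vu_score(target_preds: List[str], reference_preds: List[str],
                  labels: List[str]) -> float:
    """Accuracy gap on the target's training images, given background crops only."""
    return accuracy(target_preds, labels) - accuracy(reference_preds, labels)


# Foreground labels of five training images (toy data).
labels = ["swan", "boat", "cow", "kite", "dog"]
# Labels inferred from background crops using each model's representation.
target_preds = ["swan", "boat", "cow", "kite", "cat"]      # 4 of 5 correct
reference_preds = ["swan", "boat", "horse", "bird", "cat"]  # 2 of 5 correct
score = deja_vu_score(target_preds, reference_preds, labels)
```

The reference model captures what can be guessed from dataset-level correlation (water suggests a swan or a boat), so only the gap above it is attributed to memorization of individual images.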
2304.13844
Report |
GazeSAM: What You See is What You Segment |
Bin Wang, Armstrong Aboah, Zheyuan Zhang, Ulas Bagci |
This study investigates the potential of eye-tracking technology and the
Segment Anything Model (SAM) to design a collaborative human-computer
interaction system that automates medical image segmentation. We present the
GazeSAM system to enable radiologists to collect segmentation masks by
simply looking at the region of interest during image diagnosis. The proposed
system tracks radiologists' eye movement and utilizes the eye-gaze data as the
input prompt for SAM, which automatically generates the segmentation mask in
real time. This study is the first work to leverage the power of eye-tracking
technology and SAM to enhance the efficiency of daily clinical practice.
Moreover, eye-gaze data coupled with image and corresponding segmentation
labels can be easily recorded for further advanced eye-tracking research. The
code is available at https://github.com/ukaukaaaa/GazeSAM. |
This paper presents GazeSAM, a collaborative human-computer interaction system that combines eye-tracking technology with the Segment Anything Model (SAM) for real-time medical image segmentation. |
To address the time-consuming and costly manual segmentation process in medical image analysis, GazeSAM aims to automate segmentation, thereby improving efficiency in clinical practice. |
GazeSAM uses a screen-based eye tracker to capture eye gaze data and transforms it into point prompts for SAM. The model then generates segmentation masks in real-time based on the user's eye movements, allowing for both coarse and refined segmentation. |
GazeSAM enables real-time segmentation of both 2D and 3D medical images.
The system provides an intuitive interface for users to interact with the model and refine segmentation results by simply looking at desired areas.
GazeSAM facilitates the collection of eye-tracking data synchronized with images and segmentations, which can be valuable for further research in eye-tracking and medical image analysis. |
The performance of GazeSAM may be limited by the inherent limitations of SAM in accurately segmenting medical images, especially those not well-represented in SAM's training data.
Future work includes fine-tuning SAM on large-scale medical image datasets to improve its segmentation accuracy in the medical domain. |
eye-tracking, segment anything model, medical image segmentation, human-computer interaction, real-time segmentation |
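The gaze-to-prompt step in the GazeSAM entry above can be sketched as a coordinate conversion. The sketch assumes, hypothetically, that the tracker reports gaze as normalized screen coordinates in [0, 1] and that the image is shown in a known rectangle on screen. The function name and all parameters are illustrative and are not taken from the GazeSAM code.

```python
def gaze_to_point_prompt(gaze_xy, screen_size, image_rect, image_size):
    """Map one normalized gaze sample to (x, y) pixel coordinates in the image.

    gaze_xy     -- (gx, gy) in [0, 1] relative to the screen (assumed tracker output)
    screen_size -- (width, height) of the screen in pixels
    image_rect  -- (left, top, width, height) of the displayed image on screen
    image_size  -- (width, height) of the underlying image in pixels
    Returns None when the gaze falls outside the displayed image.
    """
    # Normalized gaze -> screen pixels.
    sx = gaze_xy[0] * screen_size[0]
    sy = gaze_xy[1] * screen_size[1]
    # Screen pixels -> position relative to the displayed image, in [0, 1).
    left, top, disp_w, disp_h = image_rect
    u = (sx - left) / disp_w
    v = (sy - top) / disp_h
    if not (0.0 <= u < 1.0 and 0.0 <= v < 1.0):
        return None
    # Relative position -> pixel coordinates of the underlying image.
    return (int(u * image_size[0]), int(v * image_size[1]))


# A 512x512 image displayed at 1024x1024 on a 1920x1080 screen.
prompt = gaze_to_point_prompt((0.5, 0.5), (1920, 1080),
                              (448, 28, 1024, 1024), (512, 512))
```

The resulting pixel coordinate is what would be handed to SAM as a foreground point prompt. Samples that fall outside the image are dropped rather than clamped, which is one possible design choice.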
2304.13518
Report |
Super-NeRF: View-consistent Detail Generation for NeRF super-resolution |
Yuqi Han, Tao Yu, Xiaohang Yu, Yuwang Wang, Qionghai Dai |
The neural radiance field (NeRF) achieved remarkable success in modeling 3D
scenes and synthesizing high-fidelity novel views. However, existing NeRF-based
methods focus more on making full use of the image resolution to generate
novel views, and less on generating details under the limited
input resolution. In analogy to the extensive usage of image super-resolution,
NeRF super-resolution is an effective way to generate the high-resolution
implicit representation of 3D scenes and has great potential for applications. Up
to now, such an important topic is still under-explored. In this paper, we
propose a NeRF super-resolution method, named Super-NeRF, to generate
high-resolution NeRF from only low-resolution inputs. Given multi-view
low-resolution images, Super-NeRF constructs a consistency-controlling
super-resolution module to generate view-consistent high-resolution details for
NeRF. Specifically, an optimizable latent code is introduced for each
low-resolution input image to control the 2D super-resolution images to
converge to the view-consistent output. The latent codes of each low-resolution
image are optimized synergistically with the target Super-NeRF representation
to fully utilize the view consistency constraint inherent in NeRF construction.
We verify the effectiveness of Super-NeRF on synthetic, real-world, and
AI-generated NeRF datasets. Super-NeRF achieves state-of-the-art NeRF
super-resolution performance on high-resolution detail generation and
cross-view consistency. |
This paper proposes Super-NeRF, a novel method for achieving view-consistent super-resolution of neural radiance fields (NeRFs) using only low-resolution (LR) input images. |
High-quality NeRF reconstruction typically requires high-resolution (HR) images, which are costly to capture, store, and transmit. Super-NeRF addresses this by generating plausible HR details while preserving 3D consistency, enabling high-quality novel view synthesis from readily available LR inputs. |
Super-NeRF utilizes a consistency-controlling super-resolution (CCSR) module and a mutual learning strategy between the CCSR and an HR NeRF. The CCSR explores diverse HR image solutions guided by a pre-trained LR NeRF, and a consistency enforcing module ensures adherence to LR inputs. Mutual learning between the CCSR and HR NeRF progressively refines HR details while maintaining view consistency. |
Super-NeRF generates sharper edges and finer texture details compared to baselines on various datasets, including LLFF, Synthetic 360, Blender, and FaceScape.
Quantitative evaluation using LPIPS and NIQE metrics demonstrates improved perceptual quality and superior view consistency achieved by Super-NeRF.
Super-NeRF exhibits strong generalization capability, successfully extending to AI-generated NeRFs from Dreamfusion and gracefully handling hybrid-resolution input settings. |
Current implementation only utilizes a 4x SR model; exploring higher upsampling ratios (8x, 16x) with lightweight, powerful SR backbones is a potential future direction.
Training speed can be enhanced by integrating faster NeRF architectures like TensoRF or Instant-NGP without altering the core framework or training strategy. |
neural radiance field, nerf super-resolution, view consistency, generative super-resolution, 3d scene reconstruction |
2304.13509
Report |
EasyPortrait -- Face Parsing and Portrait Segmentation Dataset |
Karina Kvanchiani, Elizaveta Petrova, Karen Efremyan, Alexander Sautin, Alexander Kapitanov |
Recently, video conferencing apps have become more functional by offering
computer vision-based features such as real-time background removal and face
beautification. Limited variability in existing portrait segmentation and face
parsing datasets, including head poses, ethnicity, scenes, and occlusions
specific to video conferencing, motivated us to create a new dataset,
EasyPortrait, for these tasks simultaneously. It contains 40,000 primarily
indoor photos repeating video meeting scenarios with 13,705 unique users and
fine-grained segmentation masks separated into 9 classes. Inappropriate
annotation masks in other datasets led us to revise the annotator guidelines,
so that EasyPortrait can support use cases such as teeth whitening
and skin smoothing. The pipeline for data mining and high-quality mask
annotation via crowdsourcing is also proposed in this paper. In the ablation
study experiments, we proved the importance of data quantity and diversity in
head poses in our dataset for the effective learning of the model. The
cross-dataset evaluation experiments confirmed the best domain generalization
ability among portrait segmentation datasets. Moreover, we demonstrate the
simplicity of training segmentation models on EasyPortrait without extra
training tricks. The proposed dataset and trained models are publicly
available. |
This paper introduces EasyPortrait, a novel dataset for face parsing and portrait segmentation, specifically designed for video conferencing applications. |
Existing datasets lack variability in head poses, ethnicity, scenes, and occlusions typical of video conferencing, hindering the development of robust and accurate models for real-time applications. |
The authors collected 40,000 images from 13,705 unique users, focusing on indoor settings and video meeting scenarios. They employed a crowdsourcing pipeline with rigorous quality control for accurate annotation of 9 classes, including teeth. |
Data quantity and head pose diversity significantly impact model performance.
EasyPortrait-trained models achieve state-of-the-art results on cross-dataset evaluations for portrait segmentation.
EasyPortrait enables the training of high-performing models without requiring specialized training tricks or occlusion simulations. |
The dataset is currently limited to single-person scenarios, which restricts generalization to more complex scenes.
Future work includes expanding annotation to cover additional facial features and accessories like hair, glasses, and headphones. |
face parsing, portrait segmentation, dataset, video conferencing, deep learning |
2304.13445
Report |
Neural-PBIR Reconstruction of Shape, Material, and Illumination |
Cheng Sun, Guangyan Cai, Zhengqin Li, Kai Yan, Cheng Zhang, Carl Marshall, Jia-Bin Huang, Shuang Zhao, Zhao Dong |
Reconstructing the shape and spatially varying surface appearances of a
physical-world object as well as its surrounding illumination based on 2D
images (e.g., photographs) of the object has been a long-standing problem in
computer vision and graphics. In this paper, we introduce an accurate and
highly efficient object reconstruction pipeline combining neural-based object
reconstruction and physics-based inverse rendering (PBIR). Our pipeline first
leverages a neural SDF-based shape reconstruction to produce a high-quality but
potentially imperfect object shape. Then, we introduce a neural material and
lighting distillation stage to achieve high-quality predictions for material
and illumination. In the last stage, initialized by the neural predictions, we
perform PBIR to refine the initial results and obtain the final high-quality
reconstruction of object shape, material, and illumination. Experimental
results demonstrate our pipeline significantly outperforms existing methods
quality-wise and performance-wise. |
This paper presents Neural-PBIR, a novel, efficient, and accurate inverse rendering pipeline that combines neural reconstruction and physics-based inverse rendering (PBIR) to jointly estimate geometry, spatially varying material reflectance, and an HDR environment map from multi-view images of an object. |
Existing neural rendering methods are computationally expensive, often neglecting complex light transport effects like interreflection. Conversely, PBIR methods can handle complex lighting but are prone to local minima. Neural-PBIR aims to address these limitations by leveraging the strengths of both approaches. |
The pipeline consists of three stages: 1) fast neural SDF and radiance field reconstruction using a hybrid volume representation, 2) neural material and lighting distillation from the reconstructed radiance field, and 3) joint refinement of geometry, materials, and lighting using a PBIR framework with a differentiable renderer handling global illumination effects. |
Neural-PBIR outperforms state-of-the-art methods on both synthetic and real datasets in terms of reconstruction accuracy and computational efficiency.
The neural material distilling stage provides high-quality initialization for the PBIR stage, significantly reducing optimization time.
The PBIR stage, by modeling global illumination effects, improves material and lighting reconstruction accuracy compared to optimization without considering interreflection. |
The method may still exhibit 'baking' artifacts when initial predictions are far from the ground truth.
The current implementation assumes opaque materials and does not support transparent or translucent objects.
Future work will focus on addressing the limitations and extending the method to handle more complex material types. |
inverse rendering, neural rendering, physics-based inverse rendering, material reconstruction, lighting estimation |
2304.13386
Report |
VGOS: Voxel Grid Optimization for View Synthesis from Sparse Inputs |
Jiakai Sun, Zhanjie Zhang, Jiafu Chen, Guangyuan Li, Boyan Ji, Lei Zhao, Wei Xing, Huaizhong Lin |
Neural Radiance Fields (NeRF) has shown great success in novel view synthesis
due to its state-of-the-art quality and flexibility. However, NeRF requires
dense input views (tens to hundreds) and a long training time (hours to days)
for a single scene to generate high-fidelity images. Although using the voxel
grids to represent the radiance field can significantly accelerate the
optimization process, we observe that for sparse inputs, the voxel grids are
more prone to overfitting to the training views and will have holes and
floaters, which leads to artifacts. In this paper, we propose VGOS, an approach
for fast (3-5 minutes) radiance field reconstruction from sparse inputs (3-10
views) to address these issues. To improve the performance of voxel-based
radiance field in sparse input scenarios, we propose two methods: (a) We
introduce an incremental voxel training strategy, which prevents overfitting by
suppressing the optimization of peripheral voxels in the early stage of
reconstruction. (b) We use several regularization techniques to smooth the
voxels, which avoids degenerate solutions. Experiments demonstrate that VGOS
achieves state-of-the-art performance for sparse inputs with super-fast
convergence. Code will be available at https://github.com/SJoJoK/VGOS. |
This paper proposes VGOS, an approach for fast novel view synthesis from sparse inputs (3-10 views) based on voxel grid optimization, which addresses overfitting and the resulting artifacts. |
NeRF achieved great success but requires dense inputs and long training times. Existing fast methods still need dense inputs for quality, and sparse input methods are limited by pre-training needs or extra data requirements like depth maps. |
The paper introduces: (a) an incremental voxel training strategy that prevents overfitting by gradually incorporating peripheral voxels during optimization and (b) a voxel smoothing method with color-aware total variation loss and depth smoothness loss for artifact reduction. |
VGOS achieves state-of-the-art performance for sparse inputs without pre-trained models.
It outperforms previous methods on Realistic Synthetic 360° and LLFF datasets in terms of PSNR and SSIM.
The method demonstrates a significant speedup of 10-100 times compared to previous approaches, achieving high-quality results within minutes. |
The model shows slightly lower performance on LPIPS, a perceptual metric, compared to some methods using pre-trained models.
Future work could explore the integration of high-level information from pre-trained models for improved perceptual quality. |
novel view synthesis, neural radiance fields, sparse input, voxel grids, fast training |
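The incremental voxel training strategy in the VGOS entry above can be illustrated with a minimal sketch. It assumes, hypothetically, that the trainable region is an axis-aligned box centered in the grid whose extent grows linearly from half the grid to the full grid. The schedule, the `start` value, and the function names are illustrative assumptions rather than the authors' settings.

```python
def trainable_fraction(step, total_steps, start=0.5):
    """Linearly grow the trainable fraction of each axis from `start` to 1.0."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (1.0 - start) * progress


def is_trainable(index, grid_shape, step, total_steps, start=0.5):
    """True if the voxel at `index` lies inside the current central box.

    Voxels outside the box would have their gradients suppressed, so
    peripheral voxels are not optimized early in reconstruction.
    """
    fraction = trainable_fraction(step, total_steps, start)
    for i, n in zip(index, grid_shape):
        center = (n - 1) / 2.0
        if abs(i - center) > fraction * n / 2.0:
            return False
    return True


grid_shape = (8, 8, 8)
early_corner = is_trainable((0, 0, 0), grid_shape, step=0, total_steps=100)
early_center = is_trainable((4, 4, 4), grid_shape, step=0, total_steps=100)
late_corner = is_trainable((0, 0, 0), grid_shape, step=100, total_steps=100)
```

In this toy schedule a corner voxel is frozen at the start and becomes trainable only once the box has expanded to the whole grid.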
2304.13348
Report |
TextDeformer: Geometry Manipulation using Text Guidance |
William Gao, Noam Aigerman, Thibault Groueix, Vladimir G. Kim, Rana Hanocka |
We present a technique for automatically producing a deformation of an input
triangle mesh, guided solely by a text prompt. Our framework is capable of
deformations that produce both large, low-frequency shape changes, and small
high-frequency details. Our framework relies on differentiable rendering to
connect geometry to powerful pre-trained image encoders, such as CLIP and DINO.
Notably, updating mesh geometry by taking gradient steps through differentiable
rendering is notoriously challenging, commonly resulting in deformed meshes
with significant artifacts. These difficulties are amplified by noisy and
inconsistent gradients from CLIP. To overcome this limitation, we opt to
represent our mesh deformation through Jacobians, which update deformations in
a global, smooth manner (rather than locally-sub-optimal steps). Our key
observation is that Jacobians are a representation that favors smoother, large
deformations, leading to a global relation between vertices and pixels, and
avoiding localized noisy gradients. Additionally, to ensure the resulting shape
is coherent from all 3D viewpoints, we encourage the deep features computed on
the 2D encoding of the rendering to be consistent for a given vertex from all
viewpoints. We demonstrate that our method is capable of smoothly deforming a
wide variety of source meshes toward target text prompts, achieving both large
modifications to, e.g., body proportions of animals, as well as adding fine
semantic details, such as shoe laces on an army boot and fine details of a
face. |
This paper introduces TextDeformer, a method for deforming existing 3D meshes into new shapes guided by text prompts, utilizing differentiable rendering and pre-trained image encoders like CLIP. |
This approach allows for automated, semantically-aware mesh deformation, enabling both large-scale shape changes and the addition of fine details, which are challenging to achieve with traditional deformation techniques. |
TextDeformer represents deformations using per-triangle Jacobians, optimized through a combination of CLIP-based semantic loss, view consistency loss (to ensure coherence across viewpoints), and Jacobian regularization (to preserve source shape characteristics). |
TextDeformer can deform diverse source meshes to match various target texts, demonstrating generalization across different shapes and prompts.
The method can produce both high-frequency details (e.g., giraffe spots, shoe laces) and low-frequency shape modifications (e.g., body proportions of animals, guitar body shapes).
Utilizing Jacobians leads to smoother and more globally coherent deformations compared to directly optimizing vertex displacements, resulting in higher-quality output meshes. |
The optimization process can be computationally expensive, taking several hours for each deformation.
Future work could explore learning a space of prompt-driven deformations for faster inference, as well as potential improvements through neural regularization. |
3d mesh deformation, text-guided synthesis, differentiable rendering, clip, jacobians |
2304.13153
Report |
LumiGAN: Unconditional Generation of Relightable 3D Human Faces |
Boyang Deng, Yifan Wang, Gordon Wetzstein |
Unsupervised learning of 3D human faces from unstructured 2D image data is an
active research area. While recent works have achieved an impressive level of
photorealism, they commonly lack control of lighting, which prevents the
generated assets from being deployed in novel environments. To this end, we
introduce LumiGAN, an unconditional Generative Adversarial Network (GAN) for 3D
human faces with a physically based lighting module that enables relighting
under novel illumination at inference time. Unlike prior work, LumiGAN can
create realistic shadow effects using an efficient visibility formulation that
is learned in a self-supervised manner. LumiGAN generates plausible physical
properties for relightable faces, including surface normals, diffuse albedo,
and specular tint without any ground truth data. In addition to relightability,
we demonstrate significantly improved geometry generation compared to
state-of-the-art non-relightable 3D GANs and notably better photorealism than
existing relightable GANs. |
LumiGAN is an unconditional 3D GAN for generating photorealistic and relightable 3D human faces, trained solely on unstructured single-view images under unknown and varying lighting conditions. |
Existing 3D GANs for generating human faces lack control of lighting, which prevents the generated assets from being deployed in novel environments. |
The framework uses an expressive, yet efficient physically based lighting model to learn to generate geometry, albedo, specular tint, and visibility components of a person's face. It employs a novel Neural Radiance Transfer approach to efficiently model visibility, producing plausible shadows without expensive ray casting. |
Significantly improved photorealism and view consistency compared to existing relightable 3D GANs.
Improved geometry generation due to the physically-based lighting model and self-supervised training.
Generates plausible physical properties like surface normals, diffuse albedo, and specular tint without ground truth data. |
Extending NRT (Neural Radiance Transfer) to dynamic scenes is non-trivial.
Potential lack of diversity in generated faces due to dataset biases. |
generative adversarial networks (gans), 3d face generation, relighting, neural rendering, computer vision |
2304.13141
Report |
CN-DHF: Compact Neural Double Height-Field Representations of 3D Shapes |
Eric Hedlin, Jinfan Yang, Nicholas Vining, Kwang Moo Yi, Alla Sheffer |
We introduce CN-DHF (Compact Neural Double-Height-Field), a novel hybrid
neural implicit 3D shape representation that is dramatically more compact than
the current state of the art. Our representation leverages Double-Height-Field
(DHF) geometries, defined as closed shapes bounded by a pair of oppositely
oriented height-fields that share a common axis, and leverages the following
key observations: DHFs can be compactly encoded as 2D neural implicits that
capture the maximal and minimal heights along the DHF axis; and typical closed
3D shapes are well represented as intersections of a very small number (three
or fewer) of DHFs. We represent input geometries as CN-DHFs by first computing
the set of DHFs whose intersection well approximates each input shape, and then
encoding these DHFs via neural fields. Our approach delivers high-quality
reconstructions, and reduces the reconstruction error by a factor of 2.5 on
average compared to the state-of-the-art, given the same parameter count or
storage capacity. Compared to the best-performing alternative, our method
produced higher accuracy models on 94% of the 400 input shape and parameter
count combinations tested. |
The paper introduces CN-DHF, a new hybrid neural implicit representation for 3D shapes that is significantly more compact than current state-of-the-art methods. |
Compact 3D shape representations are crucial for applications like video games, streaming, and VR/AR, where storage, transmission, and processing times are critical. |
CN-DHF leverages Double-Height-Field (DHF) surfaces, representing shapes as intersections of a few DHFs. Each DHF is encoded as a 2D neural field capturing maximal and minimal heights along a chosen axis. A geometric algorithm finds optimal DHF axes, and a Multi-Layer Perceptron (MLP) models individual DHFs using a loss function combining positional and Laplacian terms. |
CN-DHF models achieve higher accuracy than state-of-the-art alternatives with the same storage capacity, reducing reconstruction error by an average factor of 2.5.
94% of CN-DHF models outperform the best alternative in terms of reconstruction accuracy for given parameter counts.
Analysis reveals that most virtual environment shapes can be accurately represented using an intersection of three or fewer DHFs. |
CN-DHF may not accurately capture portions of geometry invisible from the outside.
Representations are limited by the number of DHFs used, potentially impacting accuracy for highly complex shapes requiring more than three DHFs. |
3d shape representation, neural implicit representation, double-height-field (dhf), compactness, reconstruction accuracy |
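The core idea in the CN-DHF entry above, a shape as the intersection of a few double height-fields, can be sketched as an occupancy test. The sketch assumes, hypothetically, that each DHF axis is a coordinate axis and that a hand-written analytic function replaces the learned 2D neural field. Both are simplifications for illustration; the paper selects axes with a geometric algorithm.

```python
def make_cylinder_dhf(axis, radius=1.0, half_height=1.0):
    """Analytic DHF of a cylinder along `axis` (stand-in for a learned 2D field).

    heights(u, v) returns the (minimal, maximal) height along the axis at the
    2D location (u, v), or None where the shape has no extent.
    """
    def heights(u, v):
        if u * u + v * v > radius * radius:
            return None
        return (-half_height, half_height)
    return (axis, heights)


def inside_dhf(point, dhf):
    """A point is inside a DHF if its height lies between the two height-fields."""
    axis, heights = dhf
    h = point[axis]
    u, v = [c for i, c in enumerate(point) if i != axis]
    bounds = heights(u, v)
    if bounds is None:
        return False
    return bounds[0] <= h <= bounds[1]


def inside_shape(point, dhfs):
    """The represented shape is the intersection of all its DHFs."""
    return all(inside_dhf(point, d) for d in dhfs)


# Two cylinders, one along z (axis 2) and one along x (axis 0).
dhfs = [make_cylinder_dhf(axis=2), make_cylinder_dhf(axis=0)]
center_inside = inside_shape((0.0, 0.0, 0.0), dhfs)
corner_inside = inside_shape((0.9, 0.9, 0.0), dhfs)
```

The point (0.9, 0.9, 0.0) lies inside the x-axis cylinder but outside the z-axis one, so the intersection rejects it. This is how a second DHF carves away geometry that a single height-field pair cannot express.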
2304.13027
Report |
A Strong and Reproducible Object Detector with Only Public Datasets |
Tianhe Ren, Jianwei Yang, Shilong Liu, Ailing Zeng, Feng Li, Hao Zhang, Hongyang Li, Zhaoyang Zeng, Lei Zhang |
This work presents Focal-Stable-DINO, a strong and reproducible object
detection model which achieves 64.6 AP on COCO val2017 and 64.8 AP on COCO
test-dev using only 700M parameters without any test time augmentation. It
explores the combination of the powerful FocalNet-Huge backbone with the
effective Stable-DINO detector. Different from existing SOTA models that
utilize an extensive number of parameters and complex training techniques on
large-scale private data or merged data, our model is exclusively trained on
the publicly available dataset Objects365, which ensures the reproducibility of
our approach. |
This work presents Focal-Stable-DINO, a strong and reproducible object detection model achieving 64.6 AP on COCO val2017 and 64.8 AP on COCO test-dev using only 700M parameters without test time augmentation. |
This model addresses the limited reproducibility of recent object detection advancements due to reliance on large-scale private data and complex training methods. |
The model combines the FocalNet-Huge backbone pre-trained on ImageNet-22K with the Stable-DINO detector, trained solely on the publicly available Objects365 dataset. |
Focal-Stable-DINO achieves 64.6 AP on COCO val2017 and 64.8 AP on COCO test-dev without test time augmentation.
Analysis reveals performance disparity across object classes and significant room for improvement in detecting small objects.
Study highlights inconsistencies and inaccuracies in COCO annotations impacting model evaluation. |
Model performance still varies across object classes and remains limited on small objects.
Future work should focus on improving dataset annotation quality alongside model performance. |
object detection, focalnet, stable-dino, reproducibility, coco dataset |
2304.12944
Report |
Latent Traversals in Generative Models as Potential Flows |
Yue Song, T. Anderson Keller, Nicu Sebe, Max Welling |
Despite the significant recent progress in deep generative models, the
underlying structure of their latent spaces is still poorly understood, thereby
making the task of performing semantically meaningful latent traversals an open
research challenge. Most prior work has aimed to solve this challenge by
modeling latent structures linearly, and finding corresponding linear
directions which result in 'disentangled' generations. In this work, we instead
propose to model latent structures with a learned dynamic potential landscape,
thereby performing latent traversals as the flow of samples down the
landscape's gradient. Inspired by physics, optimal transport, and neuroscience,
these potential landscapes are learned as physically realistic partial
differential equations, thereby allowing them to flexibly vary over both space
and time. To achieve disentanglement, multiple potentials are learned
simultaneously, and are constrained by a classifier to be distinct and
semantically self-consistent. Experimentally, we demonstrate that our method
achieves trajectories that are more disentangled, both qualitatively and quantitatively,
than state-of-the-art baselines. Further, we demonstrate that our method can be
integrated as a regularization term during training, thereby acting as an
inductive bias towards the learning of structured representations, ultimately
improving model likelihood on similarly structured data. |
This paper proposes a novel method for performing disentangled latent traversals in pre-trained generative models by modeling them as the flow of particles down learned dynamic potential landscapes defined by physically realistic partial differential equations. |
Existing methods for latent traversal often struggle to disentangle semantic attributes due to limitations in modeling the complexity of the latent space. This work leverages intuitions from physics, optimal transport, and neuroscience to achieve more realistic and disentangled latent traversals. |
The method learns potential functions as partial differential equations (PDEs), specifically the wave equation, to guide the flow of samples in the latent space. An auxiliary classifier is used to encourage the learned potentials to correspond to distinct and semantically consistent transformations. For VAEs, the method can be integrated during training as a regularization term to structure the latent space and improve likelihood on similarly structured data. |
The method achieves qualitatively and quantitatively more disentangled trajectories compared to state-of-the-art baselines on pre-trained GANs (SNGAN, BigGAN, StyleGAN2) and VAEs.
Integrating the method as a regularization term during VAE training improves likelihood on MNIST and dSprites datasets, indicating a beneficial inductive bias for learning structured representations.
Empirical analysis demonstrates that the method can model unambiguous traversal paths with diverse shapes, capturing a wide range of semantic transformations. |
The current formulation mainly explores the second-order wave equation; investigating alternative PDEs could be beneficial.
The potential flow model used has inherent limitations in representing all types of physical flows (e.g., those with vorticity), potentially limiting its applicability to certain transformations. |
latent traversal, disentanglement, generative models, partial differential equations, potential flow |
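The traversal mechanism in the entry above, samples flowing down the gradient of a potential, can be sketched with a toy example. It assumes, hypothetically, a hand-written quadratic potential in place of the learned, PDE-constrained one, a finite-difference gradient, and a fixed step size. The sign convention follows the abstract's "down the landscape's gradient" wording.

```python
def numerical_gradient(potential, z, t, eps=1e-5):
    """Central finite-difference gradient of potential(z, t) with respect to z."""
    grad = []
    for i in range(len(z)):
        plus = list(z)
        minus = list(z)
        plus[i] += eps
        minus[i] -= eps
        grad.append((potential(plus, t) - potential(minus, t)) / (2.0 * eps))
    return grad


def traverse(potential, z0, steps, step_size=0.1):
    """Move a latent code down the potential's gradient, recording the path."""
    z = list(z0)
    path = [list(z)]
    for t in range(steps):
        grad = numerical_gradient(potential, z, t)
        z = [zi - step_size * gi for zi, gi in zip(z, grad)]
        path.append(list(z))
    return path


def toy_potential(z, t):
    """Quadratic bowl centered at the origin; ignores t (illustrative stand-in)."""
    return 0.5 * sum(zi * zi for zi in z)


path = traverse(toy_potential, [1.0, -2.0], steps=20)
```

In the method itself the potential varies over time and one potential is learned per semantic transformation. The toy potential here only shows the update rule.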
2304.12748
Report |
Inverting the Imaging Process by Learning an Implicit Camera Model |
Xin Huang, Qi Zhang, Ying Feng, Hongdong Li, Qing Wang |
Representing visual signals with implicit coordinate-based neural networks,
as an effective replacement of the traditional discrete signal representation,
has gained considerable popularity in computer vision and graphics. In contrast
to existing implicit neural representations which focus on modelling the scene
only, this paper proposes a novel implicit camera model which represents the
physical imaging process of a camera as a deep neural network. We demonstrate
the power of this new implicit camera model on two inverse imaging tasks: i)
generating all-in-focus photos, and ii) HDR imaging. Specifically, we devise an
implicit blur generator and an implicit tone mapper to model the aperture and
exposure of the camera's imaging process, respectively. Our implicit camera
model is jointly learned together with implicit scene models under multi-focus
stack and multi-exposure bracket supervision. We have demonstrated the
effectiveness of our new model on a large number of test images and videos,
producing accurate and visually appealing all-in-focus and high dynamic range
images. In principle, our new implicit neural camera model has the potential to
benefit a wide array of other inverse imaging tasks. |
This paper introduces an implicit neural camera model to represent the physical imaging process of a camera, enabling tasks like generating all-in-focus photos and HDR imaging. |
Existing implicit neural representations primarily focus on scene modeling, neglecting the crucial role of the camera imaging process in image formation. |
The model consists of a blur generator (simulating aperture effects) and a tone mapper (simulating exposure effects). It's jointly trained with implicit scene models using multi-focus and multi-exposure image stacks. |
Outperforms state-of-the-art methods in all-in-focus and HDR imaging.
Generates accurate and visually appealing all-in-focus and HDR images from fewer input images compared to traditional methods.
Enables controllable rendering with adjustable focus and exposure. |
Requires significant training time per scene.
Noise modeling and handling scenes with complex geometry or extreme deformations need further improvement. |
implicit neural representation, camera model, inverse imaging, hdr imaging, all-in-focus imaging |
2304.12526
Report |
Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models |
Zhendong Wang, Yifan Jiang, Huangjie Zheng, Peihao Wang, Pengcheng He, Zhangyang Wang, Weizhu Chen, Mingyuan Zhou |
Diffusion models are powerful, but they require a lot of time and data to
train. We propose Patch Diffusion, a generic patch-wise training framework, to
significantly reduce the training time costs while improving data efficiency,
which thus helps democratize diffusion model training to broader users. At the
core of our innovations is a new conditional score function at the patch level,
where the patch location in the original image is included as additional
coordinate channels, while the patch size is randomized and diversified
throughout training to encode the cross-region dependency at multiple scales.
Sampling with our method is as easy as in the original diffusion model. Through
Patch Diffusion, we could achieve ≥2× faster training, while
maintaining comparable or better generation quality. Patch Diffusion meanwhile
improves the performance of diffusion models trained on relatively small
datasets, e.g., as few as 5,000 images to train from scratch. We achieve
outstanding FID scores in line with state-of-the-art benchmarks: 1.77 on
CelebA-64×64, 1.93 on AFHQv2-Wild-64×64, and 2.72 on
ImageNet-256×256. We share our code and pre-trained models at
https://github.com/Zhendong-Wang/Patch-Diffusion. |
This paper introduces Patch Diffusion, a novel training framework for diffusion models that leverages patch-level training with location and size conditioning to significantly reduce training time and data requirements. |
Training diffusion models is computationally expensive and data-intensive, limiting accessibility for researchers with limited resources. This work aims to democratize diffusion model training by significantly reducing these costs. |
The proposed method learns a conditional score function on image patches. It incorporates patch location as additional coordinate channels and employs a stochastic patch size scheduling strategy to capture cross-region dependencies at multiple scales. |
Patch Diffusion achieves comparable or better image generation quality while reducing training time by at least 50% compared to state-of-the-art diffusion models.
The method demonstrates improved data efficiency, achieving superior generation quality on small datasets with limited training images.
Patch Diffusion effectively finetunes large-scale pre-trained diffusion models without compromising performance, as shown with ControlNet for image generation. |
The current coordinate system could be further improved by utilizing more advanced positional embeddings.
Theoretical proof of convergence for patch-wise score matching in general cases remains an open question. |
diffusion models, generative ai, patch-based learning, efficient training, data efficiency |
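The patch-level conditioning described in the Patch Diffusion report above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the helper names, the normalization of coordinates to [-1, 1], and the candidate patch sizes and their probabilities are all assumptions made for the example.

```python
import random

def normalize(index, size):
    """Map a pixel index in [0, size - 1] to [-1, 1] (assumed convention)."""
    return 2.0 * index / (size - 1) - 1.0 if size > 1 else 0.0

def crop_patch_with_coords(image, patch_size, rng):
    """Crop a random square patch and append x/y coordinate channels.

    `image` is a list of C channels, each an H x W nested list.
    The coordinates refer to positions in the ORIGINAL image, so the
    network can tell where in the full image the patch came from.
    """
    height, width = len(image[0]), len(image[0][0])
    top = rng.randrange(height - patch_size + 1)
    left = rng.randrange(width - patch_size + 1)
    rows = range(top, top + patch_size)
    cols = range(left, left + patch_size)
    pixels = [[[channel[r][c] for c in cols] for r in rows] for channel in image]
    x_channel = [[normalize(c, width) for c in cols] for _ in rows]
    y_channel = [[normalize(r, height) for _ in cols] for r in rows]
    return pixels + [x_channel, y_channel], (top, left)

def sample_patch_size(rng, sizes=(16, 32, 64), weights=(0.4, 0.4, 0.2)):
    """Draw a patch size at random (sizes and weights are hypothetical)."""
    return rng.choices(sizes, weights=weights, k=1)[0]
```

A 3-channel image therefore yields a 5-channel patch: the three cropped pixel channels plus one channel each for the horizontal and vertical position.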
2304.12439
Report |
TextMesh: Generation of Realistic 3D Meshes From Text Prompts |
Christina Tsalicoglou, Fabian Manhardt, Alessio Tonioni, Michael Niemeyer, Federico Tombari |
The ability to generate highly realistic 2D images from mere text prompts has
recently made huge progress in terms of speed and quality, thanks to the advent
of image diffusion models. Naturally, the question arises whether this can also be
achieved in the generation of 3D content from such text prompts. To this end, a
new line of methods recently emerged trying to harness diffusion models,
trained on 2D images, for supervision of 3D model generation using view
dependent prompts. While achieving impressive results, these methods, however,
have two major drawbacks. First, rather than commonly used 3D meshes, they
instead generate neural radiance fields (NeRFs), making them impractical for
most real applications. Second, these approaches tend to produce over-saturated
models, giving the output a cartoonish-looking effect. Therefore, in this work
we propose a novel method for generation of highly realistic-looking 3D meshes.
To this end, we extend NeRF to employ an SDF backbone, leading to improved 3D
mesh extraction. In addition, we propose a novel way to finetune the mesh
texture, removing the effect of high saturation and improving the details of
the output 3D mesh. |
Proposes TextMesh, a novel method for generating realistic 3D meshes from text prompts, addressing limitations of existing NeRF-based methods that produce oversaturated, non-mesh outputs. |
Enables creation of photorealistic 3D content directly usable in standard computer graphics pipelines for applications like AR/VR, overcoming drawbacks of existing methods. |
Modifies DreamFusion to use an SDF backbone for easier mesh extraction and employs a novel multi-view consistent texture refinement using a depth-conditioned diffusion model. |
Generates 3D meshes with more natural textures than state-of-the-art methods.
Demonstrates through user study that the proposed texture refinement significantly improves realism.
Shows SDF-based approach leads to smoother meshes compared to NeRF-based methods. |
Relies on metrics like CLIP R-Precision and FID_CLIP, which may not fully capture 3D consistency and realism.
Exploiting temporal consistency for further enhancing texture realism is left for future work. |
3d mesh generation, text-to-3d, diffusion models, photorealistic rendering, neural radiance fields |
2304.12406
Report |
AutoFocusFormer: Image Segmentation off the Grid |
Chen Ziwen, Kaushik Patnaik, Shuangfei Zhai, Alvin Wan, Zhile Ren, Alex Schwing, Alex Colburn, Li Fuxin |
Real-world images often have highly imbalanced content density. Some areas
are very uniform, e.g., large patches of blue sky, while other areas are
scattered with many small objects. Yet, the commonly used successive grid
downsampling strategy in convolutional deep networks treats all areas equally.
Hence, small objects are represented in very few spatial locations, leading to
worse results in tasks such as segmentation. Intuitively, retaining more pixels
representing small objects during downsampling helps to preserve important
information. To achieve this, we propose AutoFocusFormer (AFF), a
local-attention transformer image recognition backbone, which performs adaptive
downsampling by learning to retain the most important pixels for the task.
Since adaptive downsampling generates a set of pixels irregularly distributed
on the image plane, we abandon the classic grid structure. Instead, we develop
a novel point-based local attention block, facilitated by a balanced clustering
module and a learnable neighborhood merging module, which yields
representations for our point-based versions of state-of-the-art segmentation
heads. Experiments show that our AutoFocusFormer (AFF) improves significantly
over baseline models of similar sizes. |
Proposes AutoFocusFormer (AFF), the first end-to-end segmentation network with successive adaptive downsampling stages for more effective segmentation, especially of small objects. |
Standard convolutional neural networks and existing vision transformer architectures uniformly downsample images, leading to the loss of information crucial for pixel-level tasks like segmentation, particularly for small objects. |
AFF employs local attention blocks and introduces a novel balanced clustering algorithm for neighborhood definition on irregularly downsampled images. It also includes a novel adaptive downsampling module that learns to retain important pixels based on task relevance. Finally, it adapts state-of-the-art segmentation heads to operate on irregularly spaced tokens generated by the backbone. |
AFF outperforms Swin Transformers on ImageNet classification across various model sizes, demonstrating the effectiveness of adaptive downsampling.
Significant improvement over Swin Transformer baselines in semantic segmentation on ADE20K and instance segmentation on Cityscapes, highlighting AFF's strength in dense prediction tasks.
AFF-Tiny achieves comparable performance to Swin-Base, a model 3.3 times larger, on Cityscapes panoptic segmentation, showcasing efficiency and superior performance with limited resources. |
Regression observed in large object segmentation performance, suggesting room for improvement in decoder aggregation with uneven sampling rates.
Limited exploration of class-specific performance correlations, warranting further investigation to understand how AFF's improvements vary across categories. |
adaptive downsampling, image segmentation, vision transformer, local attention, balanced clustering |
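The idea of adaptive downsampling in the AutoFocusFormer report above, keeping the tokens judged most important instead of a fixed grid subset, can be sketched as a top-k selection. This is only an illustration under stated assumptions: in AFF the importance scores are learned and dropped tokens are merged into their neighbours, whereas here the scores are given as input and `keep_ratio` is a hypothetical parameter.

```python
def adaptive_downsample(positions, features, scores, keep_ratio=0.25):
    """Keep the highest-scoring tokens; the result is an irregular point set.

    positions: list of (x, y) tuples, one per token
    features:  list of feature vectors, one per token
    scores:    list of importance scores, one per token (assumed given)
    """
    num_keep = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:num_keep])  # restore the original token order
    return [positions[i] for i in kept], [features[i] for i in kept]
```

Because the kept positions no longer lie on a regular grid, any subsequent attention has to define neighbourhoods from the point positions themselves, which is the role the report assigns to the balanced clustering module.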
2304.12317
Report |
Total-Recon: Deformable Scene Reconstruction for Embodied View Synthesis |
Chonghyuk Song, Gengshan Yang, Kangle Deng, Jun-Yan Zhu, Deva Ramanan |
We explore the task of embodied view synthesis from monocular videos of
deformable scenes. Given a minute-long RGBD video of people interacting with
their pets, we render the scene from novel camera trajectories derived from the
in-scene motion of actors: (1) egocentric cameras that simulate the point of
view of a target actor and (2) 3rd-person cameras that follow the actor.
Building such a system requires reconstructing the root-body and articulated
motion of every actor, as well as a scene representation that supports
free-viewpoint synthesis. Longer videos are more likely to capture the scene
from diverse viewpoints (which helps reconstruction) but are also more likely
to contain larger motions (which complicates reconstruction). To address these
challenges, we present Total-Recon, the first method to photorealistically
reconstruct deformable scenes from long monocular RGBD videos. Crucially, to
scale to long videos, our method hierarchically decomposes the scene into the
background and objects, whose motion is decomposed into carefully initialized
root-body motion and local articulations. To quantify such "in-the-wild"
reconstruction and view synthesis, we collect ground-truth data from a
specialized stereo RGBD capture rig for 11 challenging videos, on which our
method significantly outperforms prior methods. Our code, model, and data can be found at
https://andrewsonga.github.io/totalrecon . |
This paper presents Total-Recon, a new method for embodied view synthesis from monocular videos of deformable scenes, enabling rendering from novel camera trajectories derived from the motion of actors in the scene. |
Embodied view synthesis provides highly immersive experiences in gaming and virtual reality by simulating egocentric and 3rd-person-follow camera trajectories, and it has theoretical implications in spatial cognition theory. |
Total-Recon reconstructs a deformable 3D scene representation by hierarchically decomposing the scene into object-centric neural fields, each encoding appearance, geometry, and motion. It separates object motion into global root-body movement and local articulations, enabling scalability to minute-long videos. Embodied views and 3D video filters are generated by leveraging these reconstructed motions. |
Total-Recon successfully reconstructs the geometry and appearance of dynamic scenes, including humans and pets interacting, from minute-long monocular RGBD videos.
The method outperforms state-of-the-art monocular deformable NeRF methods, even with depth supervision added to the baselines, on a new dataset of 11 challenging stereo RGBD videos.
Ablation studies demonstrate the importance of the hierarchical motion decomposition, depth supervision, and object-centric representations for achieving high-quality reconstruction and view synthesis in dynamic scenes. |
Current reliance on off-the-shelf segmentation models for scene decomposition may lead to inaccurate reconstructions in scenarios with partial occlusions.
The model is computationally expensive, requiring around 15 hours of training per sequence on multiple GPUs, limiting its real-time applicability. |
novel view synthesis, deformable nerfs, embodied view synthesis, rgbd reconstruction, object-centric representations |
2304.12308
Report |
Segment Anything in 3D with Radiance Fields |
Jiazhong Cen, Jiemin Fang, Zanwei Zhou, Chen Yang, Lingxi Xie, Xiaopeng Zhang, Wei Shen, Qi Tian |
The Segment Anything Model (SAM) emerges as a powerful vision foundation
model to generate high-quality 2D segmentation results. This paper aims to
generalize SAM to segment 3D objects. Rather than replicating the data
acquisition and annotation procedure which is costly in 3D, we design an
efficient solution, leveraging the radiance field as a cheap and off-the-shelf
prior that connects multi-view 2D images to the 3D space. We refer to the
proposed solution as SA3D, short for Segment Anything in 3D. With SA3D, the
user is only required to provide a 2D segmentation prompt (e.g., rough points)
for the target object in a single view, which is used to generate its
corresponding 2D mask with SAM. Next, SA3D alternately performs mask inverse
rendering and cross-view self-prompting across various views to iteratively
refine the 3D mask of the target object. For one view, mask inverse rendering
projects the 2D mask obtained by SAM into the 3D space with guidance of the
density distribution learned by the radiance field for 3D mask refinement;
then, cross-view self-prompting automatically extracts reliable prompts as the
input to SAM from the rendered 2D mask of the inaccurate 3D mask for a new
view. We show in experiments that SA3D adapts to various scenes and achieves 3D
segmentation within seconds. Our research reveals a potential methodology to
lift the ability of a 2D segmentation model to 3D. Our code is available at
https://github.com/Jumpat/SegmentAnythingin3D. |
This paper introduces SA3D, a method that leverages radiance fields to extend the 2D Segment Anything Model (SAM) to 3D segmentation. |
Building a 3D foundation model like SAM from scratch is impractical due to the high cost of acquiring and annotating 3D data. SA3D provides an efficient alternative by combining the power of SAM with off-the-shelf radiance fields. |
SA3D takes 2D prompts in a single view as input and iteratively refines a 3D mask through two steps: 1) **Mask inverse rendering**, projecting the 2D mask generated by SAM into 3D space guided by the radiance field's density distribution; 2) **Cross-view self-prompting**, rendering the 3D mask in new views and automatically extracting prompts for SAM to generate more complete 2D masks. |
SA3D achieves state-of-the-art 3D segmentation performance on multiple benchmarks, including NVOS and SPIn-NeRF.
The method is efficient, capable of segmenting a 3D object within seconds, especially when combined with 3D Gaussian Splatting (3D-GS).
SA3D is compatible with various radiance fields and generalizes to different segmentation tasks like instance and part segmentation. |
The segmentation performance of SA3D is affected by the quality of the pre-trained radiance field, as demonstrated by the sub-optimal results on the Replica dataset.
The current ambiguous Gaussian removal strategy for SA3D-GS has limitations in handling Gaussians that contribute significantly to rendering but also model occluded parts. |
3d segmentation, radiance fields, 3d gaussian splatting, segment anything model, foundation models |
2304.12160
Report |
End-to-End Spatio-Temporal Action Localisation with Video Transformers |
Alexey Gritsenko, Xuehan Xiong, Josip Djolonga, Mostafa Dehghani, Chen Sun, Mario Lučić, Cordelia Schmid, Anurag Arnab |
The most performant spatio-temporal action localisation models use external
person proposals and complex external memory banks. We propose a fully
end-to-end, purely-transformer based model that directly ingests an input
video, and outputs tubelets -- a sequence of bounding boxes and the action
classes at each frame. Our flexible model can be trained with either sparse
bounding-box supervision on individual frames, or full tubelet annotations, and
in both cases it predicts coherent tubelets as the output. Moreover, our
end-to-end model requires no additional pre-processing in the form of
proposals, or post-processing in terms of non-maximal suppression. We perform
extensive ablation experiments, and significantly advance the state-of-the-art
results on four different spatio-temporal action localisation benchmarks with
both sparse keyframes and full tubelet annotations. |
This paper proposes STAR, a fully end-to-end transformer-based model for spatio-temporal action localisation that directly ingests a video and outputs tubelets (sequences of bounding boxes and action classes). |
Existing methods rely on external person proposals or complex memory banks, limiting their efficiency and practicality. This paper explores a purely end-to-end approach to address these limitations. |
The model utilizes a transformer-based vision encoder and a decoder with temporal inductive biases. It can be trained with sparse bounding-box supervision or full tubelet annotations, predicting coherent tubelets in both cases. |
STAR achieves state-of-the-art results on AVA and AVA-Kinetics (keyframe-based datasets), and UCF101-24 and JHMDB (tubelet-based datasets).
The model surpasses previous methods while being end-to-end, not requiring external person detectors or memory banks.
Experiments demonstrate that STAR effectively predicts tubelets even with sparse keyframe supervision. |
The model's performance might be further improved by exploring alternative transformer architectures or training schemes.
Evaluation on more diverse and complex datasets is needed to further validate the generalizability of the proposed approach. |
action localization, spatio-temporal, transformer, end-to-end, tubelet detection |
2304.11968
Report |
Track Anything: Segment Anything Meets Videos |
Jinyu Yang, Mingqi Gao, Zhe Li, Shang Gao, Fangjing Wang, Feng Zheng |
Recently, the Segment Anything Model (SAM) has rapidly gained much attention
due to its impressive segmentation performance on images. Despite its strong
ability in image segmentation and high interactivity with different prompts, we
found that it performs poorly on consistent segmentation in videos. Therefore,
in this report, we propose the Track Anything Model (TAM), which achieves
high-performance interactive tracking and segmentation in videos. Specifically,
given a video sequence and very little human participation,
i.e., several clicks, people can track anything they are interested in and get
satisfactory results in one-pass inference. Without additional training, such
an interactive design performs impressively on video object tracking and
segmentation. All resources are available at
https://github.com/gaomingqi/Track-Anything. We hope this work can facilitate
related research. |
This paper proposes Track Anything Model (TAM), achieving high-performance interactive tracking and segmentation in videos through minimal human interaction (few clicks) and one-pass inference. |
This work addresses limitations of existing video tracking/segmentation methods, including labor-intensive annotation, specific initialization requirements, and poor performance in complex scenarios. |
TAM integrates Segment Anything Model (SAM) for interactive object initialization and refinement, and XMem for temporal object tracking. Users initialize object selection with clicks, XMem predicts subsequent frames, SAM refines uncertain masks, and users can manually correct errors during inference. |
TAM achieves competitive J&F scores on DAVIS-2016-val and DAVIS-2017-test-dev datasets with only click initialization and one-pass inference.
TAM handles challenging scenarios like multi-object separation, target deformation, scale change, and camera motion effectively.
TAM demonstrates potential in various applications, including efficient video annotation, long-term object tracking, user-friendly video editing, and a visualized development toolkit. |
TAM's performance on long videos with mask shrinkage or lacking refinement needs improvement.
Handling complex object structures with fine-grained details during initialization remains a challenge. |
interactive tracking, video object segmentation, segment anything model (sam), one-pass inference, human-in-the-loop |
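The J&F score cited in the Track Anything report above averages a region measure J and a boundary measure F. As a minimal sketch, J is the Jaccard index (intersection over union) between a predicted and a ground-truth mask; the function name is hypothetical, and returning 1.0 when both masks are empty is a convention assumed here. The boundary measure F is not shown.

```python
def region_similarity(predicted, ground_truth):
    """Jaccard index (IoU) of two binary masks given as nested lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(predicted, ground_truth):
        for p, g in zip(pred_row, gt_row):
            intersection += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union
```

For a video, the per-frame values would be averaged over all annotated frames and objects.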
2304.11829
Report |
Hierarchical Diffusion Autoencoders and Disentangled Image Manipulation |
Zeyu Lu, Chengyue Wu, Xinyuan Chen, Yaohui Wang, Lei Bai, Yu Qiao, Xihui Liu |
Diffusion models have attained impressive visual quality for image synthesis.
However, how to interpret and manipulate the latent space of diffusion models
has not been extensively explored. Prior work on diffusion autoencoders encodes the
semantic representations into a semantic latent code, which fails to reflect
the rich information of details and the intrinsic feature hierarchy. To
mitigate those limitations, we propose Hierarchical Diffusion Autoencoders
(HDAE) that exploit the fine-grained-to-abstract and low-level-to-high-level
feature hierarchy for the latent space of diffusion models. The hierarchical
latent space of HDAE inherently encodes different abstract levels of semantics
and provides more comprehensive semantic representations. In addition, we
propose a truncated-feature-based approach for disentangled image manipulation.
We demonstrate the effectiveness of our proposed approach with extensive
experiments and applications on image reconstruction, style mixing,
controllable interpolation, detail-preserving and disentangled image
manipulation, and multi-modal semantic image synthesis. |
This paper proposes Hierarchical Diffusion Autoencoders (HDAE), which exploit the inherent feature hierarchy in images to achieve a richer and more comprehensive latent space representation for diffusion models. This hierarchical representation enables finer-grained control and disentanglement in image manipulation tasks. |
A semantically meaningful, editable, and decodable latent space is crucial for interpreting generative models and enabling applications like image editing. Existing diffusion autoencoders lack this fine-grained control and often suffer from information loss, especially in lower-level features. |
The authors introduce a hierarchical latent space design that encodes features at different scales, capturing both low-level details and high-level semantics. They further propose a truncated-feature-based method for disentangled image manipulation, addressing the entanglement issues common in latent space editing. |
HDAE achieves near-perfect image reconstruction, surpassing previous GAN inversion, VAE-based, and diffusion-based methods.
The hierarchical latent space allows for style mixing, controllable interpolation, and multi-modal semantic image synthesis.
The proposed truncated-feature method significantly improves the disentanglement of attributes during image manipulation, enabling editing of specific features without unwanted alterations. |
The higher dimensionality of hierarchical semantic vectors in HDAE poses a challenge for predicting them using a latent DDIM, requiring techniques like dimensionality reduction.
Further investigation into optimizing the trade-off between disentanglement and reconstruction quality is needed. |
diffusion models, hierarchical latent space, image manipulation, disentanglement, image synthesis |
2304.11603
Report |
LaMD: Latent Motion Diffusion for Video Generation |
Yaosi Hu, Zhenzhong Chen, Chong Luo |
Generating coherent and natural movement is the key challenge in video
generation. This research proposes to condense video generation into a problem
of motion generation, to improve the expressiveness of motion and make video
generation more manageable. This can be achieved by breaking down the video
generation process into latent motion generation and video reconstruction. We
present a latent motion diffusion (LaMD) framework, which consists of a
motion-decomposed video autoencoder and a diffusion-based motion generator, to
implement this idea. Through careful design, the motion-decomposed video
autoencoder can compress patterns in movement into a concise latent motion
representation. Meanwhile, the diffusion-based motion generator is able to
efficiently generate realistic motion on a continuous latent space under
multi-modal conditions, at a cost that is similar to that of image diffusion
models. Results show that LaMD generates high-quality videos with a wide range
of motions, from stochastic dynamics to highly controllable movements. It
achieves new state-of-the-art performance on benchmark datasets, including
BAIR, Landscape and CATER-GENs, for Image-to-Video (I2V) and
Text-Image-to-Video (TI2V) generation. The source code of LaMD will be made
available soon. |
This paper presents LaMD (Latent Motion Diffusion), a novel framework for video generation that focuses on generating realistic and diverse motion. |
Generating coherent and natural movement in videos remains a key challenge. LaMD addresses this by decomposing video generation into motion generation and video reconstruction, simplifying the process and improving motion expressiveness. |
LaMD consists of: (1) MCD-VAE (Motion-Content Decomposed Video Autoencoder): Extracts compressed latent motion representations and reconstructs videos from motion and content features. (2) DMG (Diffusion-based Motion Generator): Generates motion latents conditioned on content features and optional text descriptions using a diffusion model. |
LaMD generates high-quality videos with realistic and diverse motion, outperforming previous methods on benchmark datasets.
MCD-VAE effectively decomposes and compresses motion while preserving reconstruction quality.
DMG generates controllable motion guided by both image content and text descriptions. |
Scaling MCD-VAE to larger and more diverse video datasets could further enhance its performance.
Incorporating pre-trained image autoencoders into MCD-VAE could lead to even more compressed representations and reduced computational costs. |
video generation, motion diffusion, latent space, image-to-video, text-image-to-video |
2304.11523
Report |
TransFlow: Transformer as Flow Learner |
Yawen Lu, Qifan Wang, Siqi Ma, Tong Geng, Yingjie Victor Chen, Huaijin Chen, Dongfang Liu |
Optical flow is an indispensable building block for various important
computer vision tasks, including motion estimation, object tracking, and
disparity measurement. In this work, we propose TransFlow, a pure transformer
architecture for optical flow estimation. Compared to dominant CNN-based
methods, TransFlow demonstrates three advantages. First, it provides more
accurate correlation and trustworthy matching in flow estimation by utilizing
spatial self-attention and cross-attention mechanisms between adjacent frames
to effectively capture global dependencies; second, it recovers more
compromised information (e.g., occlusion and motion blur) in flow estimation
through long-range temporal association in dynamic scenes; third, it enables a
concise self-learning paradigm and effectively eliminates the complex and
laborious multi-stage pre-training procedures. We achieve state-of-the-art
results on Sintel and KITTI-15, as well as on several downstream tasks, including
video object detection, interpolation and stabilization. Given its efficacy, we
hope TransFlow can serve as a flexible baseline for optical flow estimation. |
This paper introduces TransFlow, a novel end-to-end optical flow estimation architecture based entirely on Transformers. |
The authors aim to address limitations of CNN-based optical flow methods, such as their struggle to model global spatial dependencies and temporal associations, and their reliance on complex pretraining pipelines. |
TransFlow leverages spatial self-attention and cross-attention mechanisms to capture global dependencies for accurate flow estimation. It also models temporal association across multiple frames using a Transformer encoder. For efficient training, a self-supervised pretraining module is introduced, inspired by MAE, which strategically masks and reconstructs image patches to learn strong pixel representations. |
TransFlow achieves state-of-the-art results on Sintel and KITTI-15 benchmarks, outperforming previous methods even without the common multi-stage pretraining on synthetic datasets.
The proposed self-supervised pretraining strategy proves effective, leading to competitive results with a simplified training pipeline.
TransFlow demonstrates strong generalizability and improves performance in downstream tasks like video object detection, interpolation, and stabilization. |
The impact of varying the number of Transformer blocks and their design choices on the trade-off between accuracy and computational cost needs further exploration.
Investigating the effectiveness of TransFlow on more complex real-world scenarios with significant occlusions and challenging lighting conditions is crucial. |
optical flow estimation, vision transformer, self-supervised learning, spatial-temporal attention, global matching |
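The MAE-inspired pretraining mentioned in the TransFlow report above hides a subset of image patches and trains the model to reconstruct them. The sketch below shows only the masking step, using uniform random masking as in the original MAE; the report describes TransFlow's masking as strategic, so the uniform choice, the function name, and the 0.75 ratio in the usage note are assumptions made for illustration.

```python
import random

def random_patch_mask(num_patches, mask_ratio, rng):
    """Split patch indices into visible and masked sets.

    Only the visible patches would be fed to the encoder; the masked
    ones become reconstruction targets.
    """
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    visible = [i for i in range(num_patches) if i not in masked]
    return visible, sorted(masked)
```

With 16 patches and a ratio of 0.75, for example, 4 patches stay visible and 12 are masked.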
2304.11463
Report |
OmniLabel: A Challenging Benchmark for Language-Based Object Detection |
Samuel Schulter, Vijay Kumar B G, Yumin Suh, Konstantinos M. Dafnis, Zhixing Zhang, Shiyu Zhao, Dimitris Metaxas |
Language-based object detection is a promising direction towards building a
natural interface to describe objects in images that goes far beyond plain
category names. While recent methods show great progress in that direction,
proper evaluation is lacking. With OmniLabel, we propose a novel task
definition, dataset, and evaluation metric. The task subsumes standard- and
open-vocabulary detection as well as referring expressions. With more than 28K
unique object descriptions on over 25K images, OmniLabel provides a challenging
benchmark with diverse and complex object descriptions in a naturally
open-vocabulary setting. Moreover, a key differentiation to existing benchmarks
is that our object descriptions can refer to one, multiple or even no object,
hence, providing negative examples in free-form text. The proposed evaluation
handles the large label space and judges performance via a modified average
precision metric, which we validate by evaluating strong language-based
baselines. OmniLabel indeed provides a challenging test bed for future research
on language-based detection. |
Introduces OmniLabel, a novel benchmark for language-based object detection that unifies standard detection, open-vocabulary detection, and referring expressions. |
Existing benchmarks lack proper evaluation for language-based object detection with complex descriptions and negative examples. |
Leveraging existing datasets, the authors define an annotation process for diverse free-form object descriptions, including those referring to multiple or no objects, and propose a modified average precision metric handling a large label space. |
OmniLabel contains more diverse and complex object descriptions than prior benchmarks (RefCOCO, Flickr30k, PhraseCut).
Negative descriptions in OmniLabel pose a significant challenge to current language-based detectors.
GLIP and FIBER models achieve the best results on OmniLabel, highlighting its difficulty. |
The current version of OmniLabel has a different distribution of negative descriptions across different source datasets.
Further investigation is needed to understand the impact of the number of categories on negative description verification rates. |
object detection, language-based vision, benchmark, referring expressions, open-vocabulary |
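The OmniLabel report above mentions a modified average precision metric without giving its details. As background only, the sketch below computes standard (unmodified) average precision from ranked detections, using the common all-point interpolation; it is not the benchmark's metric, and the function name and input format are assumptions.

```python
def average_precision(scored_hits, num_ground_truth):
    """Standard AP from (score, is_true_positive) pairs.

    Matching detections to ground-truth boxes is assumed to have been
    done already; this function only integrates the precision-recall curve.
    """
    if num_ground_truth == 0:
        return 0.0
    ranked = sorted(scored_hits, key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    precisions, recalls = [], []
    for rank, (_, is_hit) in enumerate(ranked, start=1):
        true_positives += 1 if is_hit else 0
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_ground_truth)
    # Make precision non-increasing from right to left (interpolation).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap
```

If detections are pooled into one ranking, a confident box predicted for a description that matches no object can only ever be a false positive, which gives one intuition for why negative descriptions are hard: adding `(0.85, False)` to the detections `[(0.9, True), (0.8, False), (0.7, True)]` with two ground-truth objects lowers this sketch's AP from about 0.833 to 0.75.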
2304.11446
Report |
Fast Diffusion Probabilistic Model Sampling through the lens of Backward Error Analysis |
Yansong Gao, Zhihong Pan, Xin Zhou, Le Kang, Pratik Chaudhari |
Denoising diffusion probabilistic models (DDPMs) are a class of powerful
generative models. The past few years have witnessed the great success of DDPMs
in generating high-fidelity samples. A significant limitation of the DDPMs is
the slow sampling procedure. DDPMs generally need hundreds or thousands of
sequential function evaluations (steps) of neural networks to generate a
sample. This paper aims to develop a fast sampling method for DDPMs requiring
far fewer steps while retaining high sample quality. The inference process of
DDPMs approximates solving the corresponding diffusion ordinary differential
equations (diffusion ODEs) in the continuous limit. This work analyzes how the
backward error affects the diffusion ODEs and the sample quality in DDPMs. We
propose fast sampling through the Restricting Backward Error schedule
(RBE schedule), based on dynamically moderating the long-time backward error.
Our method accelerates DDPMs without any further training. Our experiments show
that sampling with an RBE schedule generates high-quality samples within only 8
to 20 function evaluations on various benchmark datasets. We achieved 12.01 FID
in 8 function evaluations on ImageNet $128\times128$, and a $20\times$
speedup compared with previous baseline samplers. |
This paper introduces a novel fast sampling method for Denoising Diffusion Probabilistic Models (DDPMs) based on dynamically moderating the long-time backward error of the diffusion ODEs. |
DDPMs, despite their prowess in generating high-fidelity samples, suffer from slow sampling procedures, needing numerous sequential function evaluations. This work aims to alleviate this bottleneck by enabling fast sampling while preserving sample quality. |
The authors analyze the impact of backward error on diffusion ODEs and sample quality in DDPMs. They propose two methods: 1) the Dynamically Restricting the Backward Error (DRBE) schedule, which chooses each step size to restrict the backward error, and 2) the Restricting Backward Error (RBE) schedule, which learns an effective inference schedule by averaging schedules generated from DRBE. |
Sampling with RBE schedule generates high-quality samples within 8 to 20 function evaluations on benchmarks like ImageNet and LSUN.
The method achieved 12.01 FID in 8 function evaluations on ImageNet 128x128, demonstrating significant speedup over baseline samplers.
Empirical analysis reveals that RBE schedule lies between linear and cosine noise schedules, approaching the latter as the number of function evaluations increases. |
The assumption of the original flow being an analytic function in backward error analysis might not always hold true.
Future work includes investigating the design of optimal noise schedules inspired by the behavior of RBE schedule. |
denoising diffusion probabilistic models, fast sampling, backward error analysis, diffusion odes, generative models |
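The results above place the RBE schedule between the linear and cosine noise schedules. As a rough point of reference, the sketch below computes the cumulative signal level for those two standard schedules only. It assumes the usual DDPM linear defaults (beta from 1e-4 to 0.02) and the usual cosine offset s = 0.008; the summary does not state which parameterization the paper compares against, and the RBE schedule itself is not reproduced here.

```python
import math

def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linearly increasing beta."""
    alpha_bar, out = 1.0, []
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        alpha_bar *= 1.0 - beta
        out.append(alpha_bar)
    return out

def cosine_alpha_bar(num_steps, s=0.008):
    """Cosine schedule, normalized so the signal level at t = 0 equals 1."""
    def f(t):
        return math.cos((t / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t + 1) / f(0) for t in range(num_steps)]

# Assumed step count of 1000, a common training-time default.
linear = linear_alpha_bar(1000)
cosine = cosine_alpha_bar(1000)
```

Both lists decrease monotonically from near 1 toward 0; the cosine schedule keeps more signal at intermediate steps than the linear one.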
2304.11342
Report |
NaviNeRF: NeRF-based 3D Representation Disentanglement by Latent Semantic Navigation |
Baao Xie, Bohan Li, Zequn Zhang, Junting Dong, Xin Jin, Jingyu Yang, Wenjun Zeng |
3D representation disentanglement aims to identify, decompose, and manipulate
the underlying explanatory factors of 3D data, which helps AI fundamentally
understand our 3D world. This task is currently under-explored and poses great
challenges: (i) the 3D representations are complex and in general contain much
more information than 2D images; (ii) many 3D representations are not well
suited for gradient-based optimization, let alone disentanglement. To address
these challenges, we use NeRF as a differentiable 3D representation, and
introduce a self-supervised Navigation to identify interpretable semantic
directions in the latent space. To the best of our knowledge, this novel method,
dubbed NaviNeRF, is the first work to achieve fine-grained 3D disentanglement
without any priors or supervision. Specifically, NaviNeRF is built upon the
generative NeRF pipeline, and equipped with an Outer Navigation Branch and an
Inner Refinement Branch. They are complementary -- the outer navigation
identifies global-view semantic directions, and the inner refinement is dedicated to
fine-grained attributes. A synergistic loss is further devised to coordinate
the two branches. Extensive experiments demonstrate that NaviNeRF has a superior
fine-grained 3D disentanglement ability compared to previous 3D-aware models. Its
performance is also comparable to editing-oriented models relying on semantic
or geometry priors. |
NaviNeRF is a novel NeRF-based 3D representation learning method that disentangles fine-grained 3D features without relying on priors or supervision. |
3D representation disentanglement is crucial for AI to understand the 3D world, but existing methods often lack interpretability and controllability and struggle with the complexity of 3D data. |
The method uses a two-branch approach: an outer navigation branch identifies semantic directions in the latent space by predicting shifts in latent codes, and an inner refinement branch focuses on fine-grained attributes and 3D consistency by applying shifts to specific dimensions of intermediate latent codes. A synergistic loss combines these branches. |
NaviNeRF achieves fine-grained 3D disentanglement, enabling continuous manipulation of specific attributes like mouth, whiskers, and hair.
It outperforms typical 3D-aware GANs (pi-GAN, GIRAFFE, StyleNeRF) in attribute manipulation quality.
NaviNeRF shows comparable performance to editing-oriented NeRF models that rely on semantic or geometric priors (FENeRF, CGOF++). |
The quality of 3D reconstruction depends heavily on the pre-trained generator.
Future work could explore unsupervised disentanglement in more complex scenes and incorporate temporal consistency. |
3d disentanglement, nerf, generative models, latent semantic navigation, 3d representation learning |
2304.11330
Report |
Self-supervised Learning by View Synthesis |
Shaoteng Liu, Xiangyu Zhang, Tao Hu, Jiaya Jia |
We present view-synthesis autoencoders (VSA) in this paper, which is a
self-supervised learning framework designed for vision transformers. Different
from traditional 2D pretraining methods, VSA can be pre-trained with multi-view
data. In each iteration, the input to VSA is one view (or multiple views) of a
3D object and the output is a synthesized image in another target pose. The
decoder of VSA has several cross-attention blocks, which use the source view as
value, source pose as key, and target pose as query. Cross-attention over these
inputs synthesizes the target view. This simple approach realizes
large-angle view synthesis and learns a spatially invariant representation, which
is a decent initialization for transformers on downstream tasks, such
as 3D classification on ModelNet40, ShapeNet Core55, and ScanObjectNN. VSA
outperforms existing methods significantly for linear probing and is
competitive for fine-tuning. The code will be made publicly available. |
This paper introduces View-Synthesis Autoencoders (VSA), a self-supervised learning framework for vision transformers using multi-view data. |
Existing self-supervised learning methods for vision transformers do not leverage the inherent 3D geometric relationships present in multi-view data. This paper aims to address this gap and learn spatial-invariant representations by synthesizing novel views from different angles. |
VSA utilizes an encoder-decoder architecture. The encoder (e.g., ViT) processes a source view, and the decoder uses cross-attention blocks with source view features, source pose, and target pose as input to synthesize a target view. Training is done by minimizing the MSE loss between synthesized and actual target views. |
VSA successfully synthesizes novel views from single or multiple source views, demonstrating its ability to learn spatial-invariant representations.
VSA achieves competitive performance on 3D classification benchmarks (ModelNet40, ShapeNet Core55, ScanObjectNN), outperforming existing methods in linear probing evaluation.
Ablation studies reveal the impact of decoder design, view sampling strategy, data augmentation, and masking ratio on VSA performance. |
The paper primarily focuses on fixed-view scenarios, and exploring dynamic view selection could further enhance performance.
While VSA demonstrates strong results, combining it with other self-supervised methods for further improvement requires investigation. |
self-supervised learning, vision transformer, view synthesis, 3d classification, multi-view data |
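The decoder described above uses the source view as value, the source pose as key, and the target pose as query. The following is a minimal single-head sketch of that wiring in plain Python. The toy embeddings, dimensions, and the function name `cross_attention` are illustrative assumptions, not the authors' implementation, which would use learned projections and multiple heads.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(dim)])
    return outputs

# Hypothetical toy inputs: 3 source tokens, 2 target tokens.
source_pose_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                   # keys
source_view_feat = [[0.2, 0.4, 0.6], [1.0, 0.0, 0.0], [0.5, 0.5, 0.5]]   # values
target_pose_emb = [[10.0, 0.0], [0.0, 0.0]]                              # queries

target_tokens = cross_attention(target_pose_emb, source_pose_emb, source_view_feat)
```

Each output token is a pose-dependent mixture of source-view features: the target pose decides which source positions to read from, and the source view supplies the content.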
2304.11312
Report |
Lookahead Diffusion Probabilistic Models for Refining Mean Estimation |
Guoqiang Zhang, Niwa Kenta, W. Bastiaan Kleijn |
We propose lookahead diffusion probabilistic models (LA-DPMs) to exploit the
correlation in the outputs of the deep neural networks (DNNs) over subsequent
timesteps in diffusion probabilistic models (DPMs) to refine the mean
estimation of the conditional Gaussian distributions in the backward process. A
typical DPM first obtains an estimate of the original data sample
$\boldsymbol{x}$ by feeding the most recent state $\boldsymbol{z}_i$ and index
$i$ into the DNN model and then computes the mean vector of the conditional
Gaussian distribution for $\boldsymbol{z}_{i-1}$. We propose to calculate a
more accurate estimate for $\boldsymbol{x}$ by performing extrapolation on the
two estimates of $\boldsymbol{x}$ that are obtained by feeding
$(\boldsymbol{z}_{i+1},i+1)$ and $(\boldsymbol{z}_{i},i)$ into the DNN model.
The extrapolation can be easily integrated into the backward process of
existing DPMs by introducing an additional connection over two consecutive
timesteps, and fine-tuning is not required. Extensive experiments showed that
plugging the additional connection into DDPM, DDIM, DEIS, S-PNDM, and
high-order DPM-Solvers leads to a significant performance gain in terms of FID
score. |
The paper proposes Lookahead Diffusion Probabilistic Models (LA-DPMs) that refine mean estimation of conditional Gaussian distributions during the backward process of DPMs by exploiting correlations in DNN outputs over consecutive timesteps. |
LA-DPMs aim to improve sampling quality, especially with a limited computational budget (fewer timesteps), which is crucial for practical applications. |
The method introduces an extrapolation operation on two recent estimates of the data sample obtained at timesteps *i* and *i+1*. This refines the data sample estimation at timestep *i* and consequently the mean estimation for the latent variable at timestep *i-1*. This is achieved by adding connections between two consecutive timesteps in the backward process. |
LA-DPMs, requiring no fine-tuning, significantly improve FID scores compared to original DPMs, particularly for a small number of timesteps.
The performance gain is observed across various DPM models like DDPM, DDIM, DEIS, S-PNDM and DPM-Solver.
The computational overhead introduced by the extrapolation operation is negligible. |
The optimal extrapolation strength (parameter *λ*) might vary across different timesteps and datasets.
Future work could involve training a separate DNN to learn optimal *λ* values. |
diffusion probabilistic models, generative models, deep learning, image generation, sampling efficiency |
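The extrapolation step described above can be illustrated with a simple linear form: push the current estimate of the data sample further along the direction in which it moved since the previous timestep. The formula below, (1 + lam) * current - lam * previous, is an assumed form of that extrapolation, and the toy vectors and the name `lookahead_estimate` are hypothetical.

```python
def lookahead_estimate(x_hat_curr, x_hat_prev, lam):
    """Extrapolate past the current estimate, away from the previous one.

    x_hat_curr: estimate of x from (z_i, i)
    x_hat_prev: estimate of x from (z_{i+1}, i+1)
    lam:        extrapolation strength; lam = 0 returns x_hat_curr unchanged
    """
    return [(1.0 + lam) * c - lam * p for c, p in zip(x_hat_curr, x_hat_prev)]

x_hat_prev = [0.0, 2.0]   # hypothetical estimate at timestep i+1
x_hat_curr = [1.0, 1.0]   # hypothetical estimate at timestep i
refined = lookahead_estimate(x_hat_curr, x_hat_prev, lam=0.5)
```

The refined estimate would then replace the plain estimate when computing the mean for the latent at timestep i-1. Because it reuses the previous network output, it adds no extra network evaluations.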
2304.11267
Report |
Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations |
Yu-Hui Chen, Raman Sarokin, Juhyun Lee, Jiuqiang Tang, Chuo-Ling Chang, Andrei Kulik, Matthias Grundmann |
The rapid development and application of foundation models have
revolutionized the field of artificial intelligence. Large diffusion models
have gained significant attention for their ability to generate photorealistic
images and support various tasks. On-device deployment of these models provides
benefits such as lower server costs, offline functionality, and improved user
privacy. However, common large diffusion models have over 1 billion parameters
and pose challenges due to restricted computational and memory resources on
devices. We present a series of implementation optimizations for large
diffusion models that achieve the fastest reported inference latency to date
(under 12 seconds for Stable Diffusion 1.4 without int8 quantization on Samsung
S23 Ultra for a 512x512 image with 20 iterations) on GPU-equipped mobile
devices. These enhancements broaden the applicability of generative AI and
improve the overall user experience across a wide range of devices. |
This paper introduces a set of GPU-aware optimizations for large diffusion models, specifically targeting on-device deployment. |
On-device deployment of large diffusion models (e.g., Stable Diffusion) is crucial for reducing server costs, enhancing privacy, and enabling offline functionality, but is challenged by limited computational and memory resources on mobile devices. |
The authors implement several optimizations: 1) specialized kernels for Group Normalization and GELU activation, 2) enhanced attention module efficiency via partially fused softmax and FlashAttention, and 3) strategic use of Winograd convolution. |
Achieved state-of-the-art inference latency for Stable Diffusion 1.4 on mobile GPUs (under 12 seconds for a 512x512 image with 20 iterations on Samsung S23 Ultra).
Significantly reduced latency compared to baseline implementation on both Samsung S23 Ultra (-52.2%) and iPhone 14 Pro Max (-32.9%).
Optimized memory usage for intermediate tensors and model weights. |
The paper focuses on Stable Diffusion; applicability of these optimizations to other diffusion models needs further investigation.
The trade-off between Winograd convolution's computational efficiency and increased memory consumption requires careful consideration. |
diffusion models, on-device ai, gpu optimization, stable diffusion, latency reduction |
2304.11113
Report |
Implicit Neural Head Synthesis via Controllable Local Deformation Fields |
Chuhan Chen, Matthew O'Toole, Gaurav Bharaj, Pablo Garrido |
High-quality reconstruction of controllable 3D head avatars from 2D videos is
highly desirable for virtual human applications in movies, games, and
telepresence. Neural implicit fields provide a powerful representation to model
3D head avatars with personalized shape, expressions, and facial parts, e.g.,
hair and mouth interior, that go beyond the linear 3D morphable model (3DMM).
However, existing methods do not model faces with fine-scale facial features,
or local control of facial parts that extrapolate asymmetric expressions from
monocular videos. Further, most condition only on 3DMM parameters with poor(er)
locality, and resolve local features with a global neural field. We build on
part-based implicit shape models that decompose a global deformation field into
local ones. Our novel formulation models multiple implicit deformation fields
with local semantic rig-like control via 3DMM-based parameters, and
representative facial landmarks. Further, we propose a local control loss and
attention mask mechanism that promote sparsity of each learned deformation
field. Our formulation renders sharper locally controllable nonlinear
deformations than previous implicit monocular approaches, especially mouth
interior, asymmetric expressions, and facial details. |
This paper presents a novel approach to modeling fine-grained facial details and non-linear local deformations in human face rigs using neural radiance fields (NeRFs), surpassing the limitations of linear 3DMMs and global deformation models. |
Existing methods for reconstructing controllable head models from 2D videos often lack the ability to represent fine-scale facial details and local control, especially for asymmetric expressions. This paper addresses these limitations. |
The method decomposes the global deformation field into multiple local fields, each centered around a pre-defined facial landmark. An attention mask filters redundant expression parameters for each local field, and a novel local control loss enforces locality and consistency. The sum of local deformations is weakly supervised by a 3DMM mesh prior. |
The approach reconstructs facial details, like wrinkles and mouth interiors, more accurately than previous methods.
It enables fine-scale control of facial expressions, including asymmetric expressions, exceeding the capabilities of linear 3DMMs.
The method achieves state-of-the-art performance on perceptual metrics for radiance image quality. |
The reconstruction quality degrades for extreme pose and expression variations, indicating limitations in generalization.
Non-facial parts, such as shoulders, are not explicitly modeled, leading to potential artifacts in those regions. |
neural radiance fields, 3d face reconstruction, local deformation fields, facial expression control, monocular video |
2304.10535
Report |
Farm3D: Learning Articulated 3D Animals by Distilling 2D Diffusion |
Tomas Jakab, Ruining Li, Shangzhe Wu, Christian Rupprecht, Andrea Vedaldi |
We present Farm3D, a method for learning category-specific 3D reconstructors
for articulated objects, relying solely on "free" virtual supervision from a
pre-trained 2D diffusion-based image generator. Recent approaches can learn a
monocular network that predicts the 3D shape, albedo, illumination, and
viewpoint of any object occurrence, given a collection of single-view images of
an object category. However, these approaches heavily rely on manually curated
clean training data, which are expensive to obtain. We propose a framework that
uses an image generator, such as Stable Diffusion, to generate synthetic
training data that are sufficiently clean and do not require further manual
curation, enabling the learning of such a reconstruction network from scratch.
Additionally, we incorporate the diffusion model as a score to enhance the
learning process. The idea involves randomizing certain aspects of the
reconstruction, such as viewpoint and illumination, generating virtual views of
the reconstructed 3D object, and allowing the 2D network to assess the quality
of the resulting image, thus providing feedback to the reconstructor. Unlike
work based on distillation, which produces a single 3D asset for each textual
prompt, our approach yields a monocular reconstruction network capable of
outputting a controllable 3D asset from any given image, whether real or
generated, in a single forward pass in a matter of seconds. Our network can be
used for analysis, including monocular reconstruction, or for synthesis,
generating articulated assets for real-time applications such as video games. |
This paper presents Farm3D, a method to learn articulated 3D models of object categories (e.g., cows, horses) solely from synthetic data generated by a pre-trained 2D image generator (Stable Diffusion) and without any manual data curation. |
Existing methods for learning 3D models from real images require extensive manual curation of training data, which is time-consuming and limits scalability. |
Farm3D replaces real training images with synthetic ones generated by prompting Stable Diffusion with category-specific text. It further leverages Stable Diffusion as a critic during training, providing virtual multi-view supervision via a modified Score Distillation Sampling (SDS) loss. |
Farm3D achieves comparable 3D reconstruction quality to state-of-the-art methods trained on curated real datasets, despite using only synthetic data.
The method generalizes to real images and enables controllable 3D synthesis by manipulating shape, appearance, and articulation.
A new synthetic 3D animal dataset (Animodel) is introduced for benchmarking single-view articulated 3D reconstruction. |
The current method is limited to a single object category.
Assumptions about object topology (e.g., 4 legs) are made. |
3d reconstruction, diffusion models, synthetic data, articulated objects, stable diffusion |
2304.10530
Report |
Collaborative Diffusion for Multi-Modal Face Generation and Editing |
Ziqi Huang, Kelvin C. K. Chan, Yuming Jiang, Ziwei Liu |
Diffusion models have recently emerged as a powerful generative tool. Despite the
great progress, existing diffusion models mainly focus on uni-modal control,
i.e., the diffusion process is driven by only one modality of condition. To
further unleash the users' creativity, it is desirable for the model to be
controllable by multiple modalities simultaneously, e.g., generating and
editing faces by describing the age (text-driven) while drawing the face shape
(mask-driven). In this work, we present Collaborative Diffusion, where
pre-trained uni-modal diffusion models collaborate to achieve multi-modal face
generation and editing without re-training. Our key insight is that diffusion
models driven by different modalities are inherently complementary regarding
the latent denoising steps, upon which bilateral connections can be
established. Specifically, we propose the dynamic diffuser, a meta-network that adaptively
hallucinates multi-modal denoising steps by predicting the spatial-temporal
influence functions for each pre-trained uni-modal model. Collaborative
Diffusion not only collaborates generation capabilities from uni-modal
diffusion models, but also integrates multiple uni-modal manipulations to
perform multi-modal editing. Extensive qualitative and quantitative experiments
demonstrate the superiority of our framework in both image quality and
condition consistency. |
This paper presents Collaborative Diffusion, a novel framework that leverages pre-trained uni-modal diffusion models for multi-modal face generation and editing without retraining. |
Existing diffusion models for face generation and editing primarily focus on uni-modal control, limiting the ability to manipulate multiple aspects simultaneously. Collaborative Diffusion addresses this limitation by enabling multi-modal control, thereby unlocking greater creative possibilities for users. |
The core of Collaborative Diffusion is the dynamic diffuser, a meta-network that predicts spatial-temporal influence functions for each pre-trained uni-modal model. These functions determine the extent of each model's contribution at every denoising step, allowing for seamless integration of multiple modalities. |
Collaborative Diffusion achieves superior image quality and condition consistency compared to existing multi-modal face generation methods like TediGAN and Composable Diffusion.
The dynamic diffuser's ability to adapt both spatially and temporally is crucial for effective collaboration between uni-modal models.
The framework's flexibility is demonstrated by extending it to multi-modal face editing with minimal modifications, showcasing its potential for various applications. |
The performance of Collaborative Diffusion is limited by the capabilities of the pre-trained uni-modal models, suggesting that training on larger datasets could further enhance results.
The potential for malicious use of the technology, such as manipulating real human faces, raises ethical concerns that need to be addressed. |
diffusion models, multi-modal generation, face generation, face editing, dynamic diffuser |
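The method above has a meta-network predict, for each uni-modal model, how much it should contribute at each location and denoising step. One way to picture the combination is a per-pixel softmax over modalities followed by a weighted sum of the models' noise predictions. The sketch below assumes that form; the influence values are hard-coded stand-ins for what the dynamic diffuser would predict, and all names are hypothetical.

```python
import math

def combine_noise_predictions(noise_preds, influence_logits):
    """Blend per-modality noise predictions with a per-pixel softmax over modalities.

    noise_preds:      one flat list of per-pixel noise values per modality
    influence_logits: one flat list of per-pixel influence scores per modality
    """
    num_pixels = len(noise_preds[0])
    combined = []
    for p in range(num_pixels):
        logits = [inf[p] for inf in influence_logits]
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        combined.append(sum(e / total * pred[p]
                            for e, pred in zip(exps, noise_preds)))
    return combined

# Hypothetical single denoising step over three pixels and two modalities.
text_noise = [1.0, 1.0, 1.0]
mask_noise = [-1.0, -1.0, -1.0]
text_logits = [0.0, 50.0, -50.0]
mask_logits = [0.0, -50.0, 50.0]

blended = combine_noise_predictions([text_noise, mask_noise],
                                    [text_logits, mask_logits])
```

In this toy case the first pixel averages both models, the second follows the text-driven model, and the third follows the mask-driven model. Recomputing the influence scores at every step is what would make the weighting temporal as well as spatial.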
2304.10520
Report |
Contrastive Tuning: A Little Help to Make Masked Autoencoders Forget |
Johannes Lehner, Benedikt Alkin, Andreas Fürst, Elisabeth Rumetshofer, Lukas Miklautz, Sepp Hochreiter |
Masked Image Modeling (MIM) methods, like Masked Autoencoders (MAE),
efficiently learn a rich representation of the input. However, for adapting to
downstream tasks, they require a sufficient amount of labeled data since their
rich features encode not only objects but also less relevant image background. In
contrast, Instance Discrimination (ID) methods focus on objects. In this work,
we study how to combine the efficiency and scalability of MIM with the ability
of ID to perform downstream classification in the absence of large amounts of
labeled data. To this end, we introduce Masked Autoencoder Contrastive Tuning
(MAE-CT), a sequential approach that utilizes the implicit clustering of the
Nearest Neighbor Contrastive Learning (NNCLR) objective to induce abstraction
in the topmost layers of a pre-trained MAE. MAE-CT tunes the rich features such
that they form semantic clusters of objects without using any labels. Notably,
MAE-CT does not rely on hand-crafted augmentations and frequently achieves its
best performances while using only minimal augmentations (crop & flip).
Further, MAE-CT is compute efficient as it requires at most 10% overhead
compared to MAE re-training. Applied to large and huge Vision Transformer (ViT)
models, MAE-CT excels over previous self-supervised methods trained on ImageNet
in linear probing, k-NN and low-shot classification accuracy as well as in
unsupervised clustering accuracy. With ViT-H/16 MAE-CT achieves a new
state-of-the-art in linear probing of 82.2%. |
This paper proposes MAE-CT, a sequential approach that combines the strengths of Masked Image Modeling (MIM) and Instance Discrimination (ID) by contrastively tuning a pre-trained Masked Autoencoder (MAE) to induce abstraction and form semantic clusters in its representation. |
MIM methods like MAE are computationally efficient but lack abstraction, while ID methods excel in low-shot learning but heavily rely on augmentations. Combining both addresses these limitations to achieve label efficiency and computational efficiency. |
The methodology involves three steps: 1) MAE pre-training, 2) initializing a Nearest Neighbor Contrastive Learning (NNCLR) head on top of the frozen MAE encoder, and 3) contrastive tuning, where the upper layers of the encoder and the NNCLR head are trained with layer-wise learning rate decay. |
MAE-CT significantly outperforms previous self-supervised methods in linear probing, k-NN, and low-shot classification on ImageNet, achieving state-of-the-art linear probing accuracy of 82.2% with ViT-H/16.
The method demonstrates superior label efficiency compared to state-of-the-art ID methods, especially with larger models, and achieves competitive performance even with minimal augmentations.
Analysis reveals that MAE-CT effectively forms object-specific clusters, as evidenced by improved cluster accuracy and silhouette scores, confirming its ability to induce abstraction in MAE representations. |
The reliance on a pre-trained MAE model could limit the applicability to other MIM methods.
Future work could explore the use of alternative contrastive learning methods or cluster-based objectives. |
self-supervised learning, masked image modeling, instance discrimination, contrastive learning, vision transformer |
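The contrastive tuning step above trains the upper encoder layers with layer-wise learning rate decay. A minimal sketch of that idea: the top layer receives the base learning rate and each layer below it is scaled down by a constant factor. The base rate, decay factor, and layer count below are placeholder values, not the settings used for MAE-CT.

```python
def layerwise_learning_rates(base_lr, num_layers, decay):
    """Return one learning rate per layer, ordered from bottom (index 0) to top.

    The top layer gets base_lr; each layer below is multiplied by `decay` once more.
    """
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

# Hypothetical values for a 12-layer encoder.
rates = layerwise_learning_rates(base_lr=1e-4, num_layers=12, decay=0.65)
```

With a decay below 1, lower layers change little while the topmost layers adapt most, which matches the goal of inducing abstraction near the top of the pre-trained encoder.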
2304.10448
Report |
ReLight My NeRF: A Dataset for Novel View Synthesis and Relighting of Real World Objects |
Marco Toschi, Riccardo De Matteo, Riccardo Spezialetti, Daniele De Gregorio, Luigi Di Stefano, Samuele Salti |
In this paper, we focus on the problem of rendering novel views from a Neural
Radiance Field (NeRF) under unobserved light conditions. To this end, we
introduce a novel dataset, dubbed ReNe (Relighting NeRF), framing real world
objects under one-light-at-a-time (OLAT) conditions, annotated with accurate
ground-truth camera and light poses. Our acquisition pipeline leverages two
robotic arms holding, respectively, a camera and an omni-directional point-wise
light source. We release a total of 20 scenes depicting a variety of objects
with complex geometry and challenging materials. Each scene includes 2000
images, acquired from 50 different points of view under 40 different OLAT
conditions. By leveraging the dataset, we perform an ablation study on the
relighting capability of variants of the vanilla NeRF architecture and identify
a lightweight architecture that can render novel views of an object under novel
light conditions, which we use to establish a non-trivial baseline for the
dataset. Dataset and benchmark are available at
https://eyecan-ai.github.io/rene. |
Introduces ReNe, a novel dataset for novel view synthesis and relighting of real-world objects under one-light-at-a-time (OLAT) conditions, and benchmarks lightweight modifications to NeRF for relighting. |
Existing relighting datasets lack diversity, real-world capture, and ground truth light information, hindering research on relighting with NeRF. |
Developed a dual-robot capture system to collect images of various objects under OLAT illumination with precise camera and light pose annotations. Conducted an ablation study on NeRF architectures modified to incorporate light information. |
ReNe dataset comprises 20 scenes, each with 2000 images captured from 50 viewpoints under 40 OLAT conditions.
Feeding relative light position and employing a separate visibility network (V5) significantly improves NeRF's relighting capabilities.
The proposed V5 architecture establishes a strong baseline for the ReNe benchmark, outperforming standard NeRF. |
Dataset is limited to frontal views due to the trajectory setup of the robotic arms.
OLAT illumination, while offering control, is not representative of complex real-world lighting. |
novel view synthesis, relighting, neural radiance fields, dataset, benchmark |
2304.10406
Report |
LiDAR-NeRF: Novel LiDAR View Synthesis via Neural Radiance Fields |
Tang Tao, Longfei Gao, Guangrun Wang, Yixing Lao, Peng Chen, Hengshuang Zhao, Dayang Hao, Xiaodan Liang, Mathieu Salzmann, Kaicheng Yu |
We introduce a new task, novel view synthesis for LiDAR sensors. While
traditional model-based LiDAR simulators with style-transfer neural networks
can be applied to render novel views, they fall short of producing accurate and
realistic LiDAR patterns because the renderers rely on explicit 3D
reconstruction and exploit game engines that ignore important attributes of
LiDAR points. We address this challenge by formulating, to the best of our
knowledge, the first differentiable end-to-end LiDAR rendering framework,
LiDAR-NeRF, leveraging a neural radiance field (NeRF) to facilitate the joint
learning of geometry and the attributes of 3D points. However, simply employing
NeRF cannot achieve satisfactory results, as it only focuses on learning
individual pixels while ignoring local information, especially in low-texture
areas, resulting in poor geometry. To this end, we have taken steps to address
this issue by introducing a structural regularization method to preserve local
structural details. To evaluate the effectiveness of our approach, we establish
an object-centric multi-view LiDAR dataset, dubbed NeRF-MVL. It contains
observations of objects from 9 categories seen from 360-degree viewpoints
captured with multiple LiDAR sensors. Our extensive experiments on the
scene-level KITTI-360 dataset, and on our object-level NeRF-MVL show that our
LiDAR-NeRF surpasses the model-based algorithms significantly. |
This paper introduces a novel differentiable rendering framework, LiDAR-NeRF, for novel view synthesis of LiDAR data. |
Existing methods for generating new LiDAR point clouds rely on explicit 3D reconstruction and game engines, resulting in inaccurate and unrealistic LiDAR patterns. Novel LiDAR view synthesis is crucial for applications like autonomous driving. |
The method leverages a neural radiance field (NeRF) to facilitate the joint learning of geometry and attributes of 3D points. It addresses the limitation of traditional NeRFs by introducing a structural regularization method to preserve local structural details, improving geometry accuracy. |
LiDAR-NeRF surpasses model-based algorithms on both scene-level (KITTI-360 dataset) and object-level (newly introduced multi-view LiDAR dataset) benchmarks.
The method effectively encodes 3D information and multiple attributes, including ray-drop probability, leading to more realistic LiDAR patterns.
LiDAR-NeRF enables scene editing by fusing novel objects into existing scenes with realistic occlusion effects. |
Limited to static scenes and requires per-scene optimization due to reliance on the NeRF formalism.
Current work focuses on synthesizing LiDAR data only; future work will explore joint rendering of LiDAR and images. |
novel view synthesis, lidar, neural radiance fields, differentiable rendering, 3d point cloud |
2304.10263
Report |
PREIM3D: 3D Consistent Precise Image Attribute Editing from a Single Image |
Jianhui Li, Jianmin Li, Haoji Zhang, Shilong Liu, Zhengyi Wang, Zihao Xiao, Kaiwen Zheng, Jun Zhu |
We study the 3D-aware image attribute editing problem in this paper, which
has wide applications in practice. Recent methods solved the problem by
training a shared encoder to map images into a 3D generator's latent space or
by per-image latent code optimization and then edited images in the latent
space. Despite their promising results near the input view, they still suffer
from the 3D inconsistency of produced images at large camera poses and
imprecise image attribute editing, like affecting unspecified attributes during
editing. For more efficient image inversion, we train a shared encoder for all
images. To alleviate 3D inconsistency at large camera poses, we propose two
novel methods, an alternating training scheme and a multi-view identity loss,
to maintain 3D consistency and subject identity. As for imprecise image
editing, we attribute the problem to the gap between the latent space of real
images and that of generated images. We compare the latent space and inversion
manifold of GAN models and demonstrate that editing in the inversion manifold
can achieve better results in both quantitative and qualitative evaluations.
Extensive experiments show that our method produces more 3D consistent images
and achieves more precise image editing than previous work. Source code and
pretrained models can be found on our project page:
https://mybabyyh.github.io/Preim3D/ |
Presents PREIM3D, a pipeline for efficient and precise 3D-aware image attribute editing from a single image, by training a shared encoder to map real images to the latent space of a 3D GAN and performing manipulations in an "inversion manifold". |
Addresses limitations of previous 3D GAN inversion techniques, which struggle to maintain 3D consistency at large camera poses and often lack editing precision, causing unintended attribute changes. |
Trains a 3D consistent encoder using an alternating training scheme with in-domain and out-domain images and a multi-view identity loss. Editing is then performed in a learned "inversion manifold" to improve precision and minimize distortion between desired and generated attributes. |
Achieves superior 3D consistency, particularly at large camera poses, compared to optimization-based and hybrid methods like IDE-3D and 3D-Inv.
Demonstrates more precise attribute editing with less impact on unrelated attributes, outperforming baselines in quantitative metrics like AA and AD.
Significantly faster inference time compared to optimization-based methods, making it suitable for interactive applications. |
Faces difficulty in reconstructing uncommon or highly detailed features (e.g., intricate earrings, unique hairstyles).
Reliance on the generator's capacity to capture real-world details limits the reconstruction fidelity in some cases. |
3d gan inversion, image attribute editing, neural radiance fields, inversion manifold, 3d consistency |
2304.10261
Report |
Anything-3D: Towards Single-view Anything Reconstruction in the Wild |
Qiuhong Shen, Xingyi Yang, Xinchao Wang |
3D reconstruction from a single-RGB image in unconstrained real-world
scenarios presents numerous challenges due to the inherent diversity and
complexity of objects and environments. In this paper, we introduce
Anything-3D, a methodical framework that ingeniously combines a series of
visual-language models and the Segment-Anything object segmentation model to
elevate objects to 3D, yielding a reliable and versatile system for single-view
conditioned 3D reconstruction task. Our approach employs a BLIP model to
generate textural descriptions, utilizes the Segment-Anything model for the
effective extraction of objects of interest, and leverages a text-to-image
diffusion model to lift object into a neural radiance field. Demonstrating its
ability to produce accurate and detailed 3D reconstructions for a wide array of
objects, Anything-3D shows promise in addressing the
limitations of existing methodologies. Through comprehensive experiments and
evaluations on various datasets, we showcase the merits of our approach,
underscoring its potential to contribute meaningfully to the field of 3D
reconstruction. Demos and code will be available at
https://github.com/Anything-of-anything/Anything-3D. |
Introduces Anything-3D, a framework that leverages visual-language models and object segmentation (SAM) to reconstruct 3D objects from single-view images in uncontrolled environments. |
Addresses the challenging problem of single-image 3D reconstruction in the wild, which has significant implications for robotics, VR/AR, and 3D printing. |
Combines BLIP for image description generation, SAM for object segmentation, and a text-to-image diffusion model with score distillation to train a neural radiance field for 3D reconstruction. |
Demonstrates successful 3D reconstruction of objects from challenging real-world images with varying lighting, occlusion, and viewpoints.
Shows proficiency in reconstructing irregularly-shaped objects and small objects in cluttered scenes.
Highlights the potential for using foundation models for 3D content creation from limited data. |
Current reconstruction quality requires further refinement.
Lacks quantitative evaluation on 3D datasets; future work will include novel view synthesis and reconstruction error assessments. |
3d reconstruction, single-view reconstruction, visual-language models, object segmentation, diffusion models |
2304.10250
Report |
Revisiting Implicit Neural Representations in Low-Level Vision |
Wentian Xu, Jianbo Jiao |
Implicit Neural Representation (INR) has been emerging in computer vision in
recent years. It has been shown to be effective in parameterising continuous
signals such as dense 3D models from discrete image data, e.g. the neural
radiance field (NeRF). However, INR is under-explored in 2D image processing
tasks. Considering the basic definition and the structure of INR, we are
interested in its effectiveness in low-level vision problems such as image
restoration. In this work, we revisit INR and investigate its application in
low-level image restoration tasks including image denoising, super-resolution,
inpainting, and deblurring. Extensive experimental evaluations suggest the
superior performance of INR in several low-level vision tasks with limited
resources, outperforming its counterparts by over 2dB. Code and models are
available at https://github.com/WenTXuL/LINR |
This paper explores the application of Implicit Neural Representation (INR) in low-level image restoration tasks for single-image restoration without requiring additional training data. |
INR's effectiveness in 3D deep learning tasks, particularly in representing continuous signals from discrete data, motivates its exploration for 2D image restoration. |
The study uses a lightweight INR (LINR) model based on SIREN, training it on corrupted images with task-specific loss functions for denoising, super-resolution, inpainting, and deblurring. |
LINR outperforms competing methods on benchmark datasets with limited resources, achieving superior PSNR and SSIM scores.
The study demonstrates that INR-based methods can benefit from joint training with multiple corruptions, further enhancing their performance.
LINR showcases promising results on real-world noisy images, suggesting its practical applicability. |
Denoising with LINR, similar to DIP, might be prone to overfitting, necessitating further research on optimal stopping criteria.
Future work could explore architectural modifications and training strategies to enhance LINR's efficiency for higher-resolution images. |
image restoration, implicit neural representation, single image restoration, zero-shot learning, low-level vision |
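The method field above says LINR is based on SIREN. As a rough illustration of what a SIREN-style implicit representation of an image looks like, the sketch below maps normalized pixel coordinates to RGB values through sine-activated layers. The layer sizes, the frequency factor `w0 = 30`, and the helper names (`init_siren`, `siren_forward`, `pixel_grid`) are assumptions for illustration, not the LINR configuration; training and the task-specific losses are omitted.

```python
import numpy as np

def init_siren(layer_sizes, w0=30.0, seed=0):
    """Create weights using the initialization scheme from the SIREN paper."""
    rng = np.random.default_rng(seed)
    params = []
    for i, (fan_in, fan_out) in enumerate(zip(layer_sizes[:-1], layer_sizes[1:])):
        # First layer: U(-1/fan_in, 1/fan_in); later layers: U(-sqrt(6/fan_in)/w0, +)
        bound = 1.0 / fan_in if i == 0 else np.sqrt(6.0 / fan_in) / w0
        weights = rng.uniform(-bound, bound, size=(fan_in, fan_out))
        params.append((weights, np.zeros(fan_out)))
    return params

def siren_forward(params, coords, w0=30.0):
    """Map (N, 2) coordinates to (N, C) outputs; sine activations on hidden layers."""
    h = coords
    for weights, bias in params[:-1]:
        h = np.sin(w0 * (h @ weights + bias))
    weights, bias = params[-1]
    return h @ weights + bias  # final layer is linear

def pixel_grid(height, width):
    """Return (height * width, 2) pixel coordinates normalized to [-1, 1]."""
    ys = np.linspace(-1.0, 1.0, height)
    xs = np.linspace(-1.0, 1.0, width)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([xx.ravel(), yy.ravel()], axis=-1)

coords = pixel_grid(8, 8)               # hypothetical 8x8 image
params = init_siren([2, 32, 32, 3])     # hypothetical layer sizes
rgb = siren_forward(params, coords)     # untrained prediction, shape (64, 3)
```

In a single-image restoration setting, the network would be fitted to the corrupted image itself, so the image is stored in the weights rather than in a pixel array.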
2304.10224
Report |
Multi-view Vision-Prompt Fusion Network: Can 2D Pre-trained Model Boost 3D Point Cloud Data-scarce Learning? |
Haoyang Peng, Baopu Li, Bo Zhang, Xin Chen, Tao Chen, Hongyuan Zhu |
Point cloud based 3D deep models have wide applications in many fields
such as autonomous driving, household robots, and so on. Inspired by the recent
prompt learning in natural language processing, this work proposes a novel
Multi-view Vision-Prompt Fusion Network (MvNet) for few-shot 3D point cloud
classification. MvNet investigates the possibility of leveraging the
off-the-shelf 2D pre-trained models to achieve the few-shot classification,
which can alleviate the over-dependence issue of the existing baseline models
towards the large-scale annotated 3D point cloud data. Specifically, MvNet
first encodes a 3D point cloud into multi-view image features for a number of
different views. Then, a novel multi-view prompt fusion module is developed to
effectively fuse information from different views to bridge the gap between 3D
point cloud data and 2D pre-trained models. A set of 2D image prompts can then
be derived to better describe the suitable prior knowledge for a large-scale
pre-trained image model for few-shot 3D point cloud classification. Extensive
experiments on ModelNet, ScanObjectNN, and ShapeNet datasets demonstrate that
MvNet achieves new state-of-the-art performance for 3D few-shot point cloud
image classification. The source code of this work will be available soon. |
A novel Multi-view Vision-Prompt Fusion Network (MvNet) is proposed for few-shot 3D point cloud classification by leveraging off-the-shelf 2D pre-trained models. |
Existing deep learning-based 3D point cloud classification methods require large amounts of data, while pre-trained 3D models suffer from domain gaps between datasets. Leveraging pre-trained 2D models can alleviate these issues. |
MvNet encodes a 3D point cloud into multi-view image features. A multi-view prompt fusion module fuses information from different views to derive 2D image prompts. These prompts are fed to a pre-trained image model for classification. |
MvNet achieves state-of-the-art performance for 3D few-shot point cloud classification on ModelNet and ScanObjectNN.
MvNet significantly improves few-shot classification performance on ScanObjectNN compared to previous methods, especially under low-shot settings.
Increasing the number of views and using both attention and convolution fusion modules effectively improve the model's performance. |
The model's optimization process requires a large memory footprint.
Future work includes exploring methods to reduce memory consumption. |
3d point cloud classification, few-shot learning, prompt learning, multi-view fusion, transfer learning |
2304.10168
Report |
High-Fidelity and Freely Controllable Talking Head Video Generation |
Yue Gao, Yuan Zhou, Jinglu Wang, Xiao Li, Xiang Ming, Yan Lu |
Talking head generation aims to generate video based on a given source identity
and target motion. However, current methods face several challenges that limit
the quality and controllability of the generated videos. First, the generated
face often has unexpected deformation and severe distortions. Second, the
driving image does not explicitly disentangle movement-relevant information,
such as poses and expressions, which restricts the manipulation of different
attributes during generation. Third, the generated videos tend to have
flickering artifacts due to the inconsistency of the extracted landmarks
between adjacent frames. In this paper, we propose a novel model that produces
high-fidelity talking head videos with free control over head pose and
expression. Our method leverages both self-supervised learned landmarks and 3D
face model-based landmarks to model the motion. We also introduce a novel
motion-aware multi-scale feature alignment module to effectively transfer the
motion without face distortion. Furthermore, we enhance the smoothness of the
synthesized talking head videos with a feature context adaptation and
propagation module. We evaluate our model on challenging datasets and
demonstrate its state-of-the-art performance. |
This paper presents PECHead, a novel method for generating high-fidelity talking head videos with control over head pose and expression. |
Current talking head generation methods suffer from limitations such as face distortion, limited controllability, and flickering artifacts. This work aims to address these challenges. |
PECHead leverages both self-supervised learned landmarks and 3D face model-based landmarks to model motion. It employs a motion-aware multi-scale feature alignment module and a context adaptation and propagation module to enhance video quality and smoothness. |
PECHead significantly outperforms existing methods in same-identity video reconstruction, exhibiting superior facial shape preservation and expression transfer.
The method demonstrates state-of-the-art performance in cross-identity face reenactment, achieving high identity preservation and video quality while minimizing pose and expression errors.
PECHead enables precise control over head pose and facial expressions, surpassing baseline methods in frontalization and expression transfer tasks. |
The current method focuses on head and face regions, and further research is needed to incorporate full-body motions.
The model currently relies on high-quality input images, and extending it to handle lower-quality inputs is an area for future work. |
talking head generation, face reenactment, motion transfer, deep learning, computer vision |
2304.10080
Report |
NeUDF: Leaning Neural Unsigned Distance Fields with Volume Rendering |
Yu-Tao Liu, Li Wang, Jie Yang, Weikai Chen, Xiaoxu Meng, Bo Yang, Lin Gao |
Multi-view shape reconstruction has achieved impressive progress thanks to
the latest advances in neural implicit surface rendering. However, existing
methods based on signed distance function (SDF) are limited to closed surfaces,
failing to reconstruct a wide range of real-world objects that contain
open-surface structures. In this work, we introduce a new neural rendering
framework, dubbed NeUDF, that can reconstruct surfaces with arbitrary topologies
solely from multi-view supervision. To gain the flexibility of representing
arbitrary surfaces, NeUDF leverages the unsigned distance function (UDF) as
surface representation. While a naive extension of an SDF-based neural renderer
cannot scale to UDF, we propose two new formulations of weight function
specially tailored for UDF-based volume rendering. Furthermore, to cope with
open surface rendering, where the in/out test is no longer valid, we present a
dedicated normal regularization strategy to resolve the surface orientation
ambiguity. We extensively evaluate our method over a number of challenging
datasets, including DTU, MGN, and Deep Fashion 3D. Experimental results
demonstrate that NeUDF can significantly outperform the state-of-the-art method
in the task of multi-view surface reconstruction, especially for complex shapes
with open boundaries. |
NeUDF: the first UDF-based neural volume rendering framework for multi-view reconstruction of shapes with arbitrary topologies. |
Existing SDF or occupancy-based neural rendering methods are limited to closed surfaces, failing to reconstruct open surfaces which are commonly seen in the real world. |
NeUDF leverages unsigned distance function (UDF) as surface representation and introduces two specially-tailored weight functions for UDF-based volume rendering and points sampling. To solve the surface orientation ambiguity, NeUDF employs a dedicated normal regularization strategy. |
NeUDF significantly outperforms the state-of-the-art methods in open surface reconstruction on DF3D and MGN datasets.
NeUDF achieves comparable performance in watertight surface reconstruction on DTU dataset.
NeUDF can reconstruct complex open surfaces such as plants, clothes, and hollow structures with high fidelity. |
NeUDF cannot reconstruct transparent surfaces well.
Severely occluded parts are challenging for reconstruction. |
multi-view reconstruction, neural rendering, open surface, unsigned distance function, implicit representation |
2304.09987
Report |
Tetra-NeRF: Representing Neural Radiance Fields Using Tetrahedra |
Jonas Kulhanek, Torsten Sattler |
Neural Radiance Fields (NeRFs) are a very recent and very popular approach
for the problems of novel view synthesis and 3D reconstruction. A popular scene
representation used by NeRFs is to combine a uniform, voxel-based subdivision
of the scene with an MLP. Based on the observation that a (sparse) point cloud
of the scene is often available, this paper proposes to use an adaptive
representation based on tetrahedra obtained by Delaunay triangulation instead
of uniform subdivision or point-based representations. We show that such a
representation enables efficient training and leads to state-of-the-art
results. Our approach elegantly combines concepts from 3D geometry processing,
triangle-based rendering, and modern neural radiance fields. Compared to
voxel-based representations, ours provides more detail around parts of the
scene likely to be close to the surface. Compared to point-based
representations, our approach achieves better performance. The source code is
publicly available at: https://jkulhanek.com/tetra-nerf. |
Presents Tetra-NeRF, a novel neural radiance field representation that leverages Delaunay triangulation of a point cloud to create an adaptive tetrahedra-based scene representation. |
Addresses limitations of uniform voxel grids and point-based representations for neural radiance fields by providing higher resolution near surfaces and enabling efficient training. |
Constructs a tetrahedra field from a point cloud, uses barycentric interpolation to query features stored at tetrahedra vertices, and employs a shallow MLP to predict density and color for volume rendering. |
Outperforms Point-NeRF, a state-of-the-art point-based method, in rendering quality.
Achieves comparable results to state-of-the-art MLP-based methods like Mip-NeRF.
Demonstrates the effectiveness of adaptive tetrahedra representation over dense grid representation with similar parameter count. |
Rendering quality can be affected by the density of the input point cloud, especially in regions with sparse points.
Current implementation has a limit on the number of intersected tetrahedra per ray, which can impact performance in complex scenes.
Future work includes exploring adaptive refinement and pruning of the tetrahedralisation and exploiting surface proximity to triangles. |
neural radiance fields, tetrahedra, delaunay triangulation, volume rendering, novel view synthesis |
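The method field above mentions barycentric interpolation of features stored at tetrahedra vertices. A minimal sketch of that one step follows, assuming the tetrahedron containing the query point is already known; the function names and the toy feature values are illustrative, not taken from the Tetra-NeRF code.

```python
import numpy as np

def barycentric_coords(point, tet_vertices):
    """Barycentric coordinates of a 3D point w.r.t. a tetrahedron given as a (4, 3) array."""
    v0 = tet_vertices[0]
    # Columns of the 3x3 matrix are the edge vectors v1 - v0, v2 - v0, v3 - v0.
    edges = (tet_vertices[1:] - v0).T
    lam_123 = np.linalg.solve(edges, point - v0)
    return np.concatenate([[1.0 - lam_123.sum()], lam_123])

def interpolate_features(point, tet_vertices, vertex_features):
    """Blend per-vertex features (4, F) with the barycentric weights of the point."""
    return barycentric_coords(point, tet_vertices) @ vertex_features

# Toy tetrahedron with a 2-dimensional feature stored at each vertex.
tet = np.array([[0.0, 0.0, 0.0],
                [1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
feats = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [2.0, 2.0],
                  [4.0, 0.0]])

centroid = tet.mean(axis=0)
lam = barycentric_coords(centroid, tet)               # all four weights equal 0.25
feature = interpolate_features(centroid, tet, feats)  # [1.75, 0.75], the vertex mean
```

All four weights lie in [0, 1] exactly when the point is inside the tetrahedron, so the same routine can double as a containment check. In the described pipeline, the interpolated feature would then be passed to the shallow MLP that predicts density and color.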
2304.09748
Report |
Reference-based Image Composition with Sketch via Structure-aware Diffusion Model |
Kangyeol Kim, Sunghyun Park, Junsoo Lee, Jaegul Choo |
Recent remarkable improvements in large-scale text-to-image generative models
have shown promising results in generating high-fidelity images. To further
enhance editability and enable fine-grained generation, we introduce a
multi-input-conditioned image composition model that incorporates a sketch as a
novel modality, alongside a reference image. Thanks to the edge-level
controllability using sketches, our method enables a user to edit or complete
an image sub-part with a desired structure (i.e., sketch) and content (i.e.,
reference image). Our framework fine-tunes a pre-trained diffusion model to
complete missing regions using the reference image while maintaining sketch
guidance. Albeit simple, this leads to wide opportunities to fulfill user needs
for obtaining the in-demand images. Through extensive experiments, we
demonstrate that our proposed method offers unique use cases for image
manipulation, enabling user-driven modifications of arbitrary scenes. |
Introduces a multi-input-conditioned image composition model for cartoons that incorporates a sketch and a reference image. |
Enhances editability of large-scale text-to-image generative models by allowing edge-level controllability and fine-grained generation. |
Fine-tunes a pre-trained diffusion model to complete missing regions using a reference image while adhering to sketch guidance. A 'sketch schedule' strategy is introduced to adjust the influence of the sketch during inference. |
Model successfully generates and manipulates targeted regions based on user-provided sketches and reference images.
Demonstrates superior performance compared to baselines using only reference images or text-sketch pairs.
Offers practical applications for background scene editing, object shape editing, and object changes in cartoons. |
Exploration of a more user-centric system for seamless interaction is needed.
Developing a highly intuitive tool incorporating the model is planned for future work. |
image composition, sketch-guided generation, diffusion models, cartoon editing, multi-input conditioning |
2304.09677
Report |
Reference-guided Controllable Inpainting of Neural Radiance Fields |
Ashkan Mirzaei, Tristan Aumentado-Armstrong, Marcus A. Brubaker, Jonathan Kelly, Alex Levinshtein, Konstantinos G. Derpanis, Igor Gilitschenski |
The popularity of Neural Radiance Fields (NeRFs) for view synthesis has led
to a desire for NeRF editing tools. Here, we focus on inpainting regions in a
view-consistent and controllable manner. In addition to the typical NeRF inputs
and masks delineating the unwanted region in each view, we require only a
single inpainted view of the scene, i.e., a reference view. We use monocular
depth estimators to back-project the inpainted view to the correct 3D
positions. Then, via a novel rendering technique, a bilateral solver can
construct view-dependent effects in non-reference views, making the inpainted
region appear consistent from any view. For non-reference disoccluded regions,
which cannot be supervised by the single reference view, we devise a method
based on image inpainters to guide both the geometry and appearance. Our
approach shows superior performance to NeRF inpainting baselines, with the
additional advantage that a user can control the generated scene via a single
inpainted image. Project page: https://ashmrz.github.io/reference-guided-3d |
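Presents a reference-guided method for inpainting Neural Radiance Fields that, given the usual NeRF inputs, per-view masks of the unwanted region, and a single inpainted reference view, fills the masked region in a view-consistent and controllable manner. |
The popularity of NeRFs for view synthesis has created demand for NeRF editing tools; inpainting in particular must stay consistent across views while remaining controllable by the user. |
Uses monocular depth estimators to back-project the inpainted reference view to the correct 3D positions, applies a novel rendering technique with a bilateral solver to construct view-dependent effects in non-reference views, and guides the geometry and appearance of disoccluded regions not covered by the reference view with image inpainters. |
Shows superior performance to NeRF inpainting baselines.
Allows a user to control the generated scene via a single inpainted image. |
|
neural radiance fields, 3d inpainting, reference-guided editing, monocular depth estimation, view synthesis |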
2304.09479
Report |
DiFaReli: Diffusion Face Relighting |
Puntawat Ponglertnapakorn, Nontawat Tritrong, Supasorn Suwajanakorn |
We present a novel approach to single-view face relighting in the wild.
Handling non-diffuse effects, such as global illumination or cast shadows, has
long been a challenge in face relighting. Prior work often assumes Lambertian
surfaces, simplified lighting models or involves estimating 3D shape, albedo,
or a shadow map. This estimation, however, is error-prone and requires many
training examples with lighting ground truth to generalize well. Our work
bypasses the need for accurate estimation of intrinsic components and can be
trained solely on 2D images without any light stage data, multi-view images, or
lighting ground truth. Our key idea is to leverage a conditional diffusion
implicit model (DDIM) for decoding a disentangled light encoding along with
other encodings related to 3D shape and facial identity inferred from
off-the-shelf estimators. We also propose a novel conditioning technique that
eases the modeling of the complex interaction between light and geometry by
using a rendered shading reference to spatially modulate the DDIM. We achieve
state-of-the-art performance on standard benchmark Multi-PIE and can
photorealistically relight in-the-wild images. Please visit our page:
https://diffusion-face-relighting.github.io |
Presents DiFaReli, a novel diffusion-based face relighting framework that generates photorealistic shading without needing precise intrinsic decomposition or 3D and lighting ground truth, trained exclusively on 2D images. |
Face relighting is crucial for various applications like AR and portrait photography. Existing methods struggle with non-diffuse effects and rely heavily on accurate estimations of intrinsic components, often requiring extensive training data. |
Leverages a conditional DDIM to decode disentangled light encoding, along with encodings for shape and identity, inferred from off-the-shelf estimators. Introduces a novel conditioning technique using a rendered shading reference for spatial modulation of the DDIM, easing the modeling of complex light-geometry interactions. |
Achieves state-of-the-art performance on the Multi-PIE dataset, outperforming existing methods.
Demonstrates the capability to realistically add, remove, or adjust the intensity of cast shadows in images.
Produces high-fidelity relighting results on in-the-wild images, effectively handling challenging cases with complex lighting. |
Cast shadow rendering may not always be physically accurate, with room for improvement in temporal consistency for video applications.
Relighting accuracy can be affected by limitations of light estimators and inherent ambiguities in distinguishing skin tone from lighting conditions. |
face relighting, diffusion models, conditional image synthesis, spatial modulation, deep learning |
2304.09463
Report |
HyperStyle3D: Text-Guided 3D Portrait Stylization via Hypernetworks |
Zhuo Chen, Xudong Xu, Yichao Yan, Ye Pan, Wenhan Zhu, Wayne Wu, Bo Dai, Xiaokang Yang |
Portrait stylization is a long-standing task enabling extensive applications.
Although 2D-based methods have made great progress in recent years, real-world
applications such as metaverse and games often demand 3D content. On the other
hand, the requirement of 3D data, which is costly to acquire, significantly
impedes the development of 3D portrait stylization methods. In this paper,
inspired by the success of 3D-aware GANs that bridge 2D and 3D domains with 3D
fields as the intermediate representation for rendering 2D images, we propose a
novel method, dubbed HyperStyle3D, based on 3D-aware GANs for 3D portrait
stylization. At the core of our method is a hyper-network learned to manipulate
the parameters of the generator in a single forward pass. It not only offers a
strong capacity to handle multiple styles with a single model, but also enables
flexible fine-grained stylization that affects only texture, shape, or local
part of the portrait. While the use of 3D-aware GANs bypasses the requirement
of 3D data, we further alleviate the necessity of style images with the CLIP
model being the stylization guidance. We conduct an extensive set of
experiments across the style, attribute, and shape, and meanwhile, measure the
3D consistency. These experiments demonstrate the superior capability of our
HyperStyle3D model in rendering 3D-consistent images in diverse styles,
deforming the face shape, and editing various attributes. |
HyperStyle3D, a novel text-driven 3D portrait stylization method based on 3D-aware GANs and hyper-networks, enabling style transfer, attribute editing, and shape deformation. |
Addresses the limitations of 2D stylization methods (lack of 3D consistency and shape deformation) and 3D methods (reliance on expensive 3D data and single-style limitations). |
Utilizes a hyper-network to predict parameter offsets for a pre-trained 3D-aware GAN generator, guided by text prompts processed by CLIP. The hyper-network is split into three parts to handle shape, attribute, and style manipulations separately. |
Achieves high-quality style transfer across diverse styles, surpassing baseline methods in qualitative and user study evaluations.
Maintains 3D consistency, exhibiting comparable depth consistency to the original 3D-aware GAN and even superior facial identity consistency after manipulation.
Enables disentangled multi-level manipulation (shape, attribute, style) by leveraging different layer groups in the hyper-network, with controllable degrees of manipulation through coefficients. |
Limited to portrait stylization and may not generalize well to other object categories.
Relies on a pre-trained 3D-aware GAN, which can limit the range of achievable styles and shapes. |
3d portrait stylization, text-guided image manipulation, hyper-networks, 3d-aware gans, clip |
2304.09423
Report |
ASM: Adaptive Skinning Model for High-Quality 3D Face Modeling |
Kai Yang, Hong Shang, Tianyang Shi, Xinghan Chen, Jingkai Zhou, Zhongqian Sun, Wei Yang |
The research fields of parametric face models and 3D face reconstruction have
been extensively studied. However, a critical question remains unanswered: how
to tailor the face model for specific reconstruction settings. We argue that
reconstruction with multi-view uncalibrated images demands a new model with
stronger capacity. Our study shifts attention from data-dependent 3D Morphable
Models (3DMM) to an understudied human-designed skinning model. We propose
Adaptive Skinning Model (ASM), which redefines the skinning model with more
compact and fully tunable parameters. With extensive experiments, we
demonstrate that ASM achieves significantly greater capacity than 3DMM, with
the additional advantage of model size and easy implementation for new
topology. We achieve state-of-the-art performance with ASM for multi-view
reconstruction on the Florence MICC Coop benchmark. Our quantitative analysis
demonstrates the importance of a high-capacity model for fully exploiting
abundant information from multi-view input in reconstruction. Furthermore, our
model with physical-semantic parameters can be directly utilized for real-world
applications, such as in-game avatar creation. As a result, our work opens up
a new research direction for parametric face models and facilitates future
research on multi-view reconstruction. |
This paper introduces ASM, a novel parametric face model based on an adaptive skinning approach, designed for high-quality 3D face modeling from multi-view uncalibrated images. |
Existing methods struggle to balance reconstruction accuracy and generalization ability, especially for the middle-end scenario of multi-view uncalibrated images. ASM aims to bridge this gap by offering high capacity and flexibility. |
ASM redefines skinning weights using Gaussian Mixture Models (GMM) and introduces dynamic bone binding. This allows for joint optimization of skinning weights, bone positions, and transformations, leading to increased model capacity. |
ASM outperforms state-of-the-art parametric face models, including 3DMMs and static skinning models, in terms of representation capacity.
It achieves state-of-the-art performance for multi-view reconstruction on the Florence MICC Coop benchmark.
ASM is lightweight, easy to implement with new topologies, and offers semantically meaningful parameters for applications like avatar creation. |
The paper acknowledges limitations in handling facial hair, which can cause artifacts in high-capacity models like ASM.
Future work includes exploring identity and expression decoupling within ASM's skinning parameters. |
3d face reconstruction, parametric face model, skinning model, gaussian mixture model, multi-view reconstruction |
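The method field above describes skinning weights defined through Gaussian Mixture Models and optimized jointly with bone positions and transformations. The sketch below shows only the general idea, under loud simplifying assumptions: one isotropic Gaussian per bone evaluated on 3D vertex positions, followed by standard linear blend skinning. It is not the ASM formulation, and all names and values are hypothetical.

```python
import numpy as np

def gaussian_skinning_weights(vertices, means, sigmas):
    """Per-vertex bone weights from one isotropic Gaussian per bone (a simplification).

    vertices: (V, 3), means: (B, 3), sigmas: (B,). Returns (V, B) with rows summing to 1.
    """
    sq_dist = ((vertices[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    density = np.exp(-0.5 * sq_dist / sigmas[None, :] ** 2)
    return density / density.sum(axis=1, keepdims=True)

def linear_blend_skinning(vertices, weights, rotations, translations):
    """Standard LBS: each vertex is a weighted sum of its per-bone rigid transforms.

    rotations: (B, 3, 3), translations: (B, 3).
    """
    per_bone = np.einsum("bij,vj->vbi", rotations, vertices) + translations[None, :, :]
    return (weights[:, :, None] * per_bone).sum(axis=1)

# Toy setup: three vertices, two bones.
vertices = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [1.0, 0.0, 0.0]])
bone_means = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
bone_sigmas = np.array([0.5, 0.5])

weights = gaussian_skinning_weights(vertices, bone_means, bone_sigmas)
rotations = np.stack([np.eye(3), np.eye(3)])
translations = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # second bone lifts along y
deformed = linear_blend_skinning(vertices, weights, rotations, translations)
```

Because the weights are a smooth function of the Gaussian means and widths, they can in principle be tuned by gradient-based optimization alongside the bone transforms, which is the property the summary attributes to ASM.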
2304.09148
Report |
SAM Fails to Segment Anything? -- SAM-Adapter: Adapting SAM in Underperformed Scenes: Camouflage, Shadow, Medical Image Segmentation, and More |
Tianrun Chen, Lanyun Zhu, Chaotao Ding, Runlong Cao, Yan Wang, Zejian Li, Lingyun Sun, Papa Mao, Ying Zang |
The emergence of large models, also known as foundation models, has brought
significant advancements to AI research. One such model is Segment Anything
(SAM), which is designed for image segmentation tasks. However, as with other
foundation models, our experimental findings suggest that SAM may fail or
perform poorly in certain segmentation tasks, such as shadow detection and
camouflaged object detection (concealed object detection). This study first
paves the way for applying the large pre-trained image segmentation model SAM
to these downstream tasks, even in situations where SAM performs poorly. Rather
than fine-tuning the SAM network, we propose \textbf{SAM-Adapter}, which
incorporates domain-specific information or visual prompts into the
segmentation network by using simple yet effective adapters. By integrating
task-specific knowledge with general knowledge learnt by the large model,
SAM-Adapter can significantly elevate the performance of SAM in challenging
tasks as shown in extensive experiments. We can even outperform task-specific
network models and achieve state-of-the-art performance in the tasks we tested:
camouflaged object detection and shadow detection. We also tested polyp
segmentation (medical image segmentation) and achieved better results. We
believe our work opens up opportunities for utilizing SAM in downstream tasks,
with potential applications in various fields, including medical image
processing, agriculture, remote sensing, and more. |
This paper presents SAM-Adapter, a novel method for adapting the Segment Anything (SAM) model to downstream tasks by incorporating domain-specific information via adapters. |
While SAM demonstrates impressive general image segmentation capabilities, it may perform poorly on specific tasks. This work addresses the crucial challenge of leveraging the knowledge acquired by large pre-trained models like SAM for enhanced performance in downstream tasks. |
SAM-Adapter utilizes SAM as the backbone network and injects task-specific knowledge through lightweight adapters. These adapters, consisting of MLPs, process domain-specific features and generate prompts to guide SAM's segmentation process. |
SAM-Adapter significantly enhances SAM's performance on challenging tasks like camouflaged object detection and shadow detection, surpassing existing state-of-the-art methods.
The method demonstrates flexibility in incorporating various forms of task-specific information, including texture features, frequency information, and hand-crafted rules.
Experiments on diverse datasets, including COD10K, CHAMELEON, CAMO, and ISTD, consistently show significant performance improvements with SAM-Adapter. |
The current work focuses on two specific downstream tasks; further exploration of SAM-Adapter's capabilities on a wider range of tasks is necessary.
Future research could investigate more specialized adapter designs tailored for specific tasks to further improve performance. |
image segmentation, foundation models, transfer learning, prompt engineering, camouflaged object detection, shadow detection |
2304.08870
Report |
UPGPT: Universal Diffusion Model for Person Image Generation, Editing and Pose Transfer |
Soon Yau Cheong, Armin Mustafa, Andrew Gilbert |
Text-to-image models (T2I) such as StableDiffusion have been used to generate
high quality images of people. However, due to the random nature of the
generation process, the person's appearance (e.g., pose, face, and clothing)
varies across samples despite using the same text prompt. The appearance inconsistency
makes T2I unsuitable for pose transfer. We address this by proposing a
multimodal diffusion model that accepts text, pose, and visual prompting. Our
model is the first unified method to perform all person image tasks -
generation, pose transfer, and mask-less edit. We also pioneer using
low-dimensional 3D body model parameters directly to demonstrate a new capability -
simultaneous pose and camera view interpolation while maintaining the person's
appearance. |
This paper presents UPGPT, a novel unified diffusion model for person image generation, editing, and pose transfer. It leverages text, pose, and visual prompts to achieve fine-grained control over image synthesis. |
Existing methods for person image generation and editing are limited in their ability to perform multiple tasks effectively. This paper addresses the need for a single, flexible framework that can generate, edit, and transfer person images with high fidelity. |
The authors propose a multimodal diffusion model that disentangles person images into content (pose, context text) and style (style text, image features). This allows for independent manipulation of these elements during image sampling. The model uses a combination of SMPL pose parameters, CLIP image embeddings, and LLM text embeddings to condition the diffusion process. |
UPGPT achieves state-of-the-art results on both text-pose guided image generation and pose transfer tasks.
The method demonstrates fine-grained control over image editing, enabling users to modify clothing texture, shape, and appearance using text or reference images.
The use of SMPL parameters allows for novel capabilities like simultaneous pose and camera view interpolation. |
One limitation is the potential for blurry faces in generated images, particularly at lower resolutions.
Future work could explore improving the fidelity of fine-grained texture transfer, potentially through enhanced image encoding techniques. |
diffusion models, image generation, pose transfer, image editing, multimodal learning |
2304.08818
Report |
Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models |
Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, Karsten Kreis |
Latent Diffusion Models (LDMs) enable high-quality image synthesis while
avoiding excessive compute demands by training a diffusion model in a
compressed lower-dimensional latent space. Here, we apply the LDM paradigm to
high-resolution video generation, a particularly resource-intensive task. We
first pre-train an LDM on images only; then, we turn the image generator into a
video generator by introducing a temporal dimension to the latent space
diffusion model and fine-tuning on encoded image sequences, i.e., videos.
Similarly, we temporally align diffusion model upsamplers, turning them into
temporally consistent video super resolution models. We focus on two relevant
real-world applications: Simulation of in-the-wild driving data and creative
content creation with text-to-video modeling. In particular, we validate our
Video LDM on real driving videos of resolution 512 x 1024, achieving
state-of-the-art performance. Furthermore, our approach can easily leverage
off-the-shelf pre-trained image LDMs, as we only need to train a temporal
alignment model in that case. Doing so, we turn the publicly available,
state-of-the-art text-to-image LDM Stable Diffusion into an efficient and
expressive text-to-video model with resolution up to 1280 x 2048. We show that
the temporal layers trained in this way generalize to different fine-tuned
text-to-image LDMs. Utilizing this property, we show the first results for
personalized text-to-video generation, opening exciting directions for future
content creation. Project page:
https://research.nvidia.com/labs/toronto-ai/VideoLDM/ |
The authors propose Video LDM, an efficient approach for training high-resolution, long-term consistent video generation models based on Latent Diffusion Models (LDMs). |
Video modeling has lagged behind image modeling due to the high computational cost associated with training on video data and the lack of large-scale video datasets. This work aims to address this gap by enabling efficient high-resolution video generation. |
The authors extend image LDMs to video generation by introducing a temporal dimension to the latent space diffusion model and fine-tuning on encoded image sequences. They also temporally align diffusion model upsamplers, turning them into temporally consistent video super resolution models. They focus on two applications: simulation of driving data and text-to-video modeling. |
Video LDM achieves state-of-the-art performance on real driving videos of resolution 512x1024 and can generate videos several minutes in length.
By temporally fine-tuning Stable Diffusion, the authors create an efficient and expressive text-to-video model with resolution up to 1280x2048.
The learned temporal layers can be transferred to different fine-tuned text-to-image LDMs, enabling personalized text-to-video generation. |
Synthesized videos are not yet indistinguishable from real content.
The model, trained on internet data, is not suitable for productization due to ethical concerns. |
video generation, diffusion models, latent diffusion models, text-to-video, high-resolution |
2304.08483
Report |
Text2Performer: Text-Driven Human Video Generation |
Yuming Jiang, Shuai Yang, Tong Liang Koh, Wayne Wu, Chen Change Loy, Ziwei Liu |
Text-driven content creation has evolved to be a transformative technique
that revolutionizes creativity. Here we study the task of text-driven human
video generation, where a video sequence is synthesized from texts describing
the appearance and motions of a target performer. Compared to general
text-driven video generation, human-centric video generation requires
maintaining the appearance of the synthesized human while performing complex
motions. In this work, we present Text2Performer to generate vivid human videos
with articulated motions from texts. Text2Performer has two novel designs: 1)
decomposed human representation and 2) diffusion-based motion sampler. First,
we decompose the VQVAE latent space into human appearance and pose
representation in an unsupervised manner by utilizing the nature of human
videos. In this way, the appearance is well maintained along the generated
frames. Then, we propose a continuous VQ-diffuser to sample a sequence of pose
embeddings. Unlike existing VQ-based methods that operate in the discrete
space, the continuous VQ-diffuser directly outputs the continuous pose embeddings
for better motion modeling. Finally, a motion-aware masking strategy is designed
to mask the pose embeddings spatio-temporally to enhance the temporal
coherence. Moreover, to facilitate the task of text-driven human video
generation, we contribute a Fashion-Text2Video dataset with manually annotated
action labels and text descriptions. Extensive experiments demonstrate that
Text2Performer generates high-quality human videos (up to 512x256 resolution)
with diverse appearances and flexible motions. |
This paper presents Text2Performer, a novel framework for generating human videos from text descriptions of appearance and motions. |
The task is important because it addresses the limitations of general text-to-video models, which struggle to generate plausible human videos with consistent appearances and complex motions. |
Text2Performer decomposes the VQVAE latent space into appearance and pose representations, and utilizes a continuous VQ-diffuser to sample pose sequences. A motion-aware masking strategy is also employed to enhance temporal coherence. |
Text2Performer outperforms baselines on FID, FVD, and KVD metrics, indicating superior video quality and diversity.
The decomposed VQ-space and continuous VQ-diffuser enable Text2Performer to maintain consistent human identities across frames.
User studies confirm that Text2Performer generates videos that are more consistent with text descriptions and exhibit better overall quality. |
Text2Performer is trained on videos with clean backgrounds, limiting its applicability to more complex scenes.
The generated videos exhibit a bias towards female models in dresses due to limitations in the training dataset. |
video generation, text-to-video, human video synthesis, vqvae, diffusion models |
2304.08480
Report |
DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training |
Yihao Chen, Xianbiao Qi, Jianan Wang, Lei Zhang |
We propose DisCo-CLIP, a distributed memory-efficient CLIP training approach,
to reduce the memory consumption of contrastive loss when training contrastive
learning models. Our approach decomposes the contrastive loss and its gradient
computation into two parts, one to calculate the intra-GPU gradients and the
other to compute the inter-GPU gradients. According to our decomposition, only
the intra-GPU gradients are computed on the current GPU, while the inter-GPU
gradients are collected via all_reduce from other GPUs instead of being
repeatedly computed on every GPU. In this way, we can reduce the GPU memory
consumption of contrastive loss computation from $\mathcal{O}(B^2)$ to
$\mathcal{O}(\frac{B^2}{N})$, where $B$ and $N$ are the batch size and the number of
GPUs used for training. Such a distributed solution is mathematically
equivalent to the original non-distributed contrastive loss computation,
without sacrificing any computation accuracy. It is particularly efficient for
large-batch CLIP training. For instance, DisCo-CLIP can enable contrastive
training of a ViT-B/32 model with a batch size of 32K or 196K using 8 or 64
A100 40GB GPUs, compared with the original CLIP solution which requires 128
A100 40GB GPUs to train a ViT-B/32 model with a batch size of 32K. The code
will be released at https://github.com/IDEA-Research/DisCo-CLIP |
This paper proposes DisCo-CLIP, a distributed memory-efficient approach for training CLIP models, which significantly reduces the memory consumption of contrastive loss computation during training, enabling the use of larger batch sizes without sacrificing accuracy. |
Large batch sizes are crucial for effective contrastive learning in CLIP, but memory constraints limit the achievable batch size, hindering research, especially in resource-constrained settings. |
DisCo-CLIP decomposes the contrastive loss and its gradient computation into intra-GPU and inter-GPU components. It calculates intra-GPU gradients locally and collects inter-GPU gradients via all_reduce operations, reducing memory consumption from O(B^2) to O(B^2/N), where B is the batch size and N is the number of GPUs. |
DisCo-CLIP achieves the same accuracy as the original CLIP with significantly reduced memory consumption and faster training times.
Using DisCo-CLIP, researchers can train a ViT-B/32 model with a batch size of 32K on 8 A100 40GB GPUs or 196K on 64, whereas the original CLIP solution requires 128 A100 40GB GPUs for a batch size of 32K.
Larger batch sizes enabled by DisCo-CLIP further improve the performance of contrastive learning models, leading to higher zero-shot classification accuracy on various datasets. |
The paper primarily evaluates DisCo-CLIP on ViT-B/32 due to resource constraints, leaving the investigation of larger backbones for future work.
The impact of the extra all_reduce operation on training speed could be further analyzed, especially in different network environments. |
contrastive learning, clip, vision-language representation learning, distributed training, memory efficiency |
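The memory scaling described in the DisCo-CLIP entry above can be illustrated with a small sketch. This is not the authors' implementation: it only shows the row-block view of an image-to-text InfoNCE loss, in which each worker materialises a (B/N) x B block of logits rather than the full B x B matrix, and the partial sums are then combined (a plain Python sum stands in for all_reduce). The gradient decomposition itself and the text-to-image direction are omitted. All function names are hypothetical, and B is assumed to be divisible by N.

```python
import math


def dot(a, b):
    """Inner product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def block_loss_sum(image_feats, text_feats, row_start, row_end):
    """Sum of image-to-text InfoNCE terms for rows [row_start, row_end).

    Only one row of B logits is alive at a time here; a real worker would
    hold the whole (row_end - row_start) x B block.
    """
    total = 0.0
    for i in range(row_start, row_end):
        logits = [dot(image_feats[i], t) for t in text_feats]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_z - logits[i]  # -log softmax at the matching caption
    return total


def full_loss(image_feats, text_feats):
    """Reference loss computed over all B rows on a single device."""
    b = len(image_feats)
    return block_loss_sum(image_feats, text_feats, 0, b) / b


def sharded_loss(image_feats, text_feats, n_workers):
    """Same loss, with rows split evenly across n_workers (assumes B % N == 0)."""
    b = len(image_feats)
    per = b // n_workers
    partials = [
        block_loss_sum(image_feats, text_feats, k * per, (k + 1) * per)
        for k in range(n_workers)
    ]
    return sum(partials) / b  # stand-in for an all_reduce over workers


def logits_bytes(batch_size, n_workers=1, bytes_per_value=4):
    """Bytes needed for the logits block held by one worker (fp32 assumed)."""
    return (batch_size // n_workers) * batch_size * bytes_per_value
```

Under these assumptions, `logits_bytes(32768)` is 4 GiB for the full matrix, while `logits_bytes(32768, 8)` is 512 MiB per worker, matching the O(B^2) to O(B^2/N) reduction stated in the abstract.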
2304.08477
Report |
Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation |
Jie An, Songyang Zhang, Harry Yang, Sonal Gupta, Jia-Bin Huang, Jiebo Luo, Xi Yin |
We propose Latent-Shift -- an efficient text-to-video generation method based
on a pretrained text-to-image generation model that consists of an autoencoder
and a U-Net diffusion model. Learning a video diffusion model in the latent
space is much more efficient than in the pixel space. The latter is often
limited to first generating a low-resolution video followed by a sequence of
frame interpolation and super-resolution models, which makes the entire
pipeline very complex and computationally expensive. To extend a U-Net from
image generation to video generation, prior work proposes to add additional
modules like 1D temporal convolution and/or temporal attention layers. In
contrast, we propose a parameter-free temporal shift module that can leverage
the spatial U-Net as is for video generation. We achieve this by shifting two
portions of the feature map channels forward and backward along the temporal
dimension. The shifted features of the current frame thus receive the features
from the previous and the subsequent frames, enabling motion learning without
additional parameters. We show that Latent-Shift achieves comparable or better
results while being significantly more efficient. Moreover, Latent-Shift can
generate images despite being finetuned for T2V generation. |
Proposes Latent-Shift, an efficient text-to-video generation method that extends a pre-trained text-to-image latent diffusion model by incorporating a parameter-free temporal shift module. |
Existing pixel-based text-to-video diffusion models are computationally expensive, requiring additional super-resolution and frame interpolation models. Latent-Shift offers a simpler, more efficient solution. |
Integrates a temporal shift module into the U-Net architecture of a pre-trained text-to-image latent diffusion model. This module shifts feature maps along the temporal dimension, enabling the model to learn temporal coherence without additional parameters. |
Achieves comparable results to existing methods on MSR-VTT and state-of-the-art results on UCF-101.
Demonstrates superior performance in video quality and text-video faithfulness compared to CogVideo in a user study.
Significantly more efficient than previous approaches due to its smaller model size and faster inference speed. |
May struggle with generating videos from complex or uncommon text prompts.
Current automatic evaluation metrics for zero-shot text-to-video generation are not ideal and need improvement. |
text-to-video generation, latent diffusion models, temporal shift module, generative ai, computer vision |
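The temporal shift described in the Latent-Shift entry above can be sketched in a few lines. This is a minimal illustration rather than the paper's code: features are stored as a list of frames, each a list of channels; the number of shifted channels (`fold`) and the zero padding at the first and last frame are assumptions chosen for the example.

```python
def temporal_shift(frames, fold):
    """Parameter-free temporal shift over a clip.

    frames: list of T frames, each a list of C channel values
            (a channel value may be a number or a whole feature map).
    fold:   number of channels shifted in each direction (assumed value).

    Channels [0, fold) at frame t are taken from frame t-1,
    channels [fold, 2*fold) from frame t+1, and the rest stay in place.
    Positions with no neighbouring frame are zero-filled (an assumption).
    """
    n_frames = len(frames)
    n_channels = len(frames[0])
    shifted = []
    for t in range(n_frames):
        row = []
        for c in range(n_channels):
            if c < fold:
                # shifted forward in time: receive from the previous frame
                row.append(frames[t - 1][c] if t > 0 else 0.0)
            elif c < 2 * fold:
                # shifted backward in time: receive from the next frame
                row.append(frames[t + 1][c] if t + 1 < n_frames else 0.0)
            else:
                row.append(frames[t][c])
        shifted.append(row)
    return shifted
```

Because the operation only moves existing values between frames, it adds no learnable parameters; the spatial layers that follow see a mix of past, current, and future features for each frame.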
2304.08463
Report |
Learning to Render Novel Views from Wide-Baseline Stereo Pairs |
Yilun Du, Cameron Smith, Ayush Tewari, Vincent Sitzmann |
We introduce a method for novel view synthesis given only a single
wide-baseline stereo image pair. In this challenging regime, 3D scene points
are regularly observed only once, requiring prior-based reconstruction of scene
geometry and appearance. We find that existing approaches to novel view
synthesis from sparse observations fail due to recovering incorrect 3D geometry
and due to the high cost of differentiable rendering that precludes their
scaling to large-scale training. We take a step towards resolving these
shortcomings by formulating a multi-view transformer encoder, proposing an
efficient, image-space epipolar line sampling scheme to assemble image features
for a target ray, and a lightweight cross-attention-based renderer. Our
contributions enable training of our method on a large-scale real-world dataset
of indoor and outdoor scenes. We demonstrate that our method learns powerful
multi-view geometry priors while reducing the rendering time. We conduct
extensive comparisons on held-out test scenes across two real-world datasets,
significantly outperforming prior work on novel view synthesis from sparse
image observations and achieving multi-view-consistent novel view synthesis. |
This paper introduces a method for novel view synthesis of complex indoor and outdoor scenes from a single wide-baseline stereo image pair. |
Existing methods for novel view synthesis either require dense input views or fail to produce high-quality results in this challenging setting due to inaccurate geometry reconstruction and costly rendering pipelines. |
The method leverages a multi-view transformer encoder for geometry-aware feature extraction, an efficient image-space epipolar line sampling scheme, and a lightweight cross-attention-based renderer to enable large-scale training and high-quality reconstructions. |
The proposed method significantly outperforms previous state-of-the-art methods for novel view synthesis from sparse inputs on standard benchmarks such as RealEstate10k and ACID.
The method effectively learns multi-view geometry priors and achieves multi-view consistent novel view synthesis.
The proposed rendering pipeline is significantly faster than volume rendering-based approaches, enabling efficient and high-quality reconstructions. |
While showing significant improvement, the rendering quality is not yet on par with single-scene optimization methods using hundreds of input images.
The generalization ability to scenes with drastically different appearances compared to the training data is limited. |
novel view synthesis, wide-baseline stereo, differentiable rendering, vision transformer, epipolar geometry |
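The geometric idea behind epipolar sampling in the wide-baseline stereo entry above can be illustrated as follows. The paper samples directly in image space; this sketch instead steps along a target ray in depth and projects each 3D point into the source view, which is a simpler, assumed stand-in showing that the candidate locations all lie on one epipolar line. The pinhole intrinsics tuple, the pose convention (source = R * target + t), and the function name are assumptions for illustration.

```python
def sample_epipolar_points(pixel, intrinsics, rotation, translation, depths):
    """Project points along a target-view ray into the source view.

    pixel:       (u, v) in the target image.
    intrinsics:  (fx, fy, cx, cy), shared by both views (assumption).
    rotation:    3x3 nested list R, mapping target to source coordinates.
    translation: length-3 list t, same convention.
    depths:      candidate depths along the target ray.
    Returns source-image (u, v) positions for points in front of the camera.
    """
    fx, fy, cx, cy = intrinsics
    u, v = pixel
    # Ray direction in target camera coordinates, at unit depth.
    ray = ((u - cx) / fx, (v - cy) / fy, 1.0)
    points = []
    for d in depths:
        target_pt = [d * r for r in ray]
        source_pt = [
            sum(rotation[i][j] * target_pt[j] for j in range(3)) + translation[i]
            for i in range(3)
        ]
        if source_pt[2] <= 0:
            continue  # behind the source camera, not visible
        points.append(
            (fx * source_pt[0] / source_pt[2] + cx,
             fy * source_pt[1] / source_pt[2] + cy)
        )
    return points
```

For a purely horizontal baseline (identity rotation, translation along x), all returned points share the same v coordinate, which is the familiar horizontal epipolar line of a rectified stereo pair. Image features gathered at these positions are what a cross-attention renderer would then aggregate for the target ray.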
2304.08386
Report |
Progressive Visual Prompt Learning with Contrastive Feature Re-formation |
Chen Xu, Haocheng Shen, Fengyuan Shi, Boheng Chen, Yixuan Liao, Xiaoxin Chen, Limin Wang |
Prompt learning has been designed as an alternative to fine-tuning for
adapting vision-language (V-L) models to downstream tasks. Previous works
mainly focus on text prompts, while work on visual prompts for V-L models is
limited. Existing visual prompt methods suffer from either mediocre performance
or an unstable training process, indicating the difficulty of visual prompt
learning. In this paper, we propose a new Progressive Visual Prompt (ProVP)
structure to strengthen the interactions among prompts of different layers.
More importantly, our ProVP can effectively propagate the image embeddings to
deep layers and behaves partially like an instance-adaptive prompt method.
To alleviate generalization deterioration, we further propose a new contrastive
feature re-formation, which prevents the serious deviation of the prompted
visual feature from the fixed CLIP visual feature distribution. Combining both,
our method (ProVP-Ref) is evaluated on 11 image benchmark datasets and achieves
7/11 state-of-the-art results on both few-shot and base-to-novel settings. To
the best of our knowledge, we are the first to demonstrate the superior
performance of visual prompts in V-L models to previous prompt-based methods in
downstream tasks. Meanwhile, it implies that our ProVP-Ref shows the best
capability to adapt and to generalize. |
This paper proposes ProVP-Ref, a novel progressive visual prompt learning approach for Vision-language models, to enhance their adaptation and generalization capabilities for downstream tasks. |
Adapting large pre-trained V-L models to downstream tasks like few-shot learning often results in overfitting or catastrophic forgetting. Existing prompt learning methods focus on text prompts with limitations in handling visual domain shifts. |
ProVP-Ref introduces a progressive visual prompt (ProVP) structure that strengthens prompt interactions across layers, and a contrastive feature re-formation strategy to prevent significant deviation from the pre-trained feature distribution. |
ProVP-Ref achieves state-of-the-art results on 7 out of 11 image benchmark datasets for few-shot learning.
ProVP-Ref exhibits superior performance on base-to-novel generalization, demonstrating its capability to adapt to unseen classes.
The method shows significant improvements on datasets with large domain shifts from pre-trained data. |
The performance of ProVP-Ref is hindered by the intrinsic limitations of CLIP's text features, particularly when dealing with a large number of classes.
The best novel performance on datasets like StanfordCars and Flowers102 is achieved by zero-shot CLIP, suggesting potential conflict between pre-trained knowledge and downstream tasks that needs further exploration. |
visual prompt learning, vision-language models, few-shot learning, generalization, contrastive learning |
2304.08345
Report |
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset |
Sihan Chen, Xingjian He, Longteng Guo, Xinxin Zhu, Weining Wang, Jinhui Tang, Jing Liu |
In this paper, we propose a Vision-Audio-Language Omni-peRception pretraining
model (VALOR) for multi-modal understanding and generation. Different from
widely-studied vision-language pretraining models, VALOR jointly models
relationships of vision, audio and language in an end-to-end manner. It
contains three separate encoders for single modality representations, and a
decoder for multimodal conditional text generation. We design two pretext tasks
to pretrain the VALOR model, including Multimodal Grouping Alignment (MGA) and
Multimodal Grouping Captioning (MGC). MGA projects vision, language and audio
to the same common space, building vision-language, audio-language and
audiovisual-language alignment simultaneously. MGC learns how to generate text
tokens conditioned on vision, audio, or both. To promote
vision-audio-language pretraining research, we construct a large-scale
high-quality tri-modality dataset named VALOR-1M, which contains 1M audible
videos with human-annotated audiovisual captions. Extensive experiments show
that VALOR can learn strong multimodal correlations and be generalized to
various downstream tasks (e.g., retrieval, captioning and question answering),
with different input modalities (e.g., vision-language, audio-language and
audiovisual-language). VALOR achieves new state-of-the-art performance on a
series of public cross-modality benchmarks. Code and data are available at the
project page https://casia-iva-group.github.io/projects/VALOR. |
The paper introduces VALOR, a novel Vision-Audio-Language Omni-perception pretraining model, for understanding and generating multimodal content. |
Existing vision-language models struggle to capture the comprehensive semantic understanding offered by incorporating audio, which often provides complementary information. VALOR aims to bridge this gap by jointly modeling vision, audio, and language. |
VALOR utilizes three encoders for individual modality representations and a multimodal decoder for text generation. Two novel pretext tasks, Multimodal Grouping Alignment (MGA) and Multimodal Grouping Captioning (MGC), facilitate cross-modal alignment and conditional text generation, respectively. A large-scale dataset, VALOR-1M, with 1 million audio-visual videos paired with human-annotated captions, is introduced to enable effective pretraining. |
VALOR achieves state-of-the-art performance on various cross-modality benchmarks, including significant improvements in text-to-video retrieval, video captioning, and video question answering.
The model effectively utilizes audio-visual clues for audio-visual retrieval and captioning tasks, showcasing its capability in handling multimodal inputs.
Experiments demonstrate the effectiveness of modality grouping strategy in improving model generalization across different downstream tasks and modalities. |
The current scale of VALOR-1M, while large, can be further expanded using unsupervised techniques to leverage a larger pool of audiovisual data.
Future work aims to incorporate vision and audio generation capabilities into the VALOR framework. |
vision-audio-language pretraining, multimodal understanding, multimodal pretraining, audiovisual captioning, cross-modality learning |
2304.08271
Report |
Open-World Weakly-Supervised Object Localization |
Jinheng Xie, Zhaochuan Luo, Yuexiang Li, Haozhe Liu, Linlin Shen, Mike Zheng Shou |
While remarkable success has been achieved in weakly-supervised object
localization (WSOL), current frameworks are not capable of locating objects of
novel categories in open-world settings. To address this issue, we are the
first to introduce a new weakly-supervised object localization task called
OWSOL (Open-World Weakly-Supervised Object Localization). During training, all
labeled data comes from known categories, while both known and novel categories
exist in the unlabeled data. To handle such data, we propose a novel paradigm
of contrastive representation co-learning using both labeled and unlabeled data
to generate a complete G-CAM (Generalized Class Activation Map) for object
localization, without the requirement of bounding box annotation. As no class
label is available for the unlabeled data, we conduct clustering over the full
training set and design a novel multiple semantic centroids-driven contrastive
loss for representation learning. We re-organize two widely used datasets,
i.e., ImageNet-1K and iNatLoc500, and propose OpenImages150 to serve as
evaluation benchmarks for OWSOL. Extensive experiments demonstrate that the
proposed method can surpass all baselines by a large margin. We believe that
this work can shift closed-set localization towards the open-world setting
and serve as a foundation for subsequent works. Code will be released at
https://github.com/ryylcc/OWSOL. |
This paper introduces Open-World Weakly-Supervised Object Localization (OWSOL), a new task aiming to localize both known and novel objects using labeled and unlabeled data. |
Current WSOL methods are limited to a closed-world setting and cannot handle novel categories, limiting their applicability to real-world scenarios. |
The authors propose a contrastive representation co-learning paradigm using supervised and multiple semantic centroids-driven contrastive losses. They also introduce generalized class activation mapping (G-CAM) for localization in a non-parametric manner. |
The proposed method outperforms existing WSOL methods and novel category discovery methods on ImageNet-1K, iNatLoc500, and OpenImages150 datasets.
Multiple semantic centroids in contrastive learning are shown to be crucial for complete object activation.
The method exhibits robustness to the number of clusters and strong zero-shot localization ability for novel categories. |
The current method doesn't differentiate between Nov-S and Nov-D categories during training.
Future work could explore fine-grained learning for Nov-S and Nov-D to improve performance. |
weakly-supervised object localization, open-world learning, contrastive learning, class activation mapping, novel category discovery |
2304.07547
Report |
TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation |
Jingyao Li, Pengguang Chen, Shengju Qian, Jiaya Jia |
The recent success of Contrastive Language-Image Pre-training (CLIP) has shown
great promise in pixel-level open-vocabulary learning tasks. A general paradigm
utilizes CLIP's text and patch embeddings to generate semantic masks. However,
existing models easily misidentify input pixels from unseen classes, thus
confusing novel classes with semantically-similar ones. In our work, we
disentangle the ill-posed optimization problem into two parallel processes: one
performs semantic matching individually, and the other judges reliability for
improving discrimination ability. Motivated by special tokens in language
modeling that represent sentence-level embeddings, we design a trusty token
that decouples the known and novel category prediction tendency. With almost no
extra overhead, we upgrade the pixel-level generalization capacity of existing
models effectively. Our TagCLIP (CLIP adapting with Trusty-guidance) boosts the
IoU of unseen classes by 7.4% and 1.7% on PASCAL VOC 2012 and COCO-Stuff 164K. |
This paper proposes TagCLIP, a novel framework for open-vocabulary semantic segmentation that improves the recognition of unseen classes by disentangling semantic matching and prediction reliability. |
Existing open-vocabulary segmentation models often misclassify pixels from unseen classes, confusing them with semantically-similar seen classes. This work addresses this issue to improve the generalization ability of these models. |
The proposed TagCLIP introduces a trusty token to capture the prediction tendency for known and novel categories. It uses a Trusty Learner module to optimize this token based on the inter-category relationship, and weighs the raw segmentation map with the trusty map during inference to enhance discrimination. |
TagCLIP significantly boosts the IoU of unseen classes by 7.4% on PASCAL VOC 2012 and 1.7% on COCO-Stuff 164K in the inductive setting.
The method demonstrates superior performance on unseen categories compared to state-of-the-art approaches, especially for hard classes.
TagCLIP also shows strong cross-dataset generalization capability, improving performance by 1.4% from COCO-Stuff 164K to PASCAL Context. |
TagCLIP's design is not specifically optimized for the transductive setting, where unseen category names are accessible during training.
Future work could explore incorporating mechanisms for more effective self-supervision on unseen names in the transductive setting. |
open-vocabulary learning, semantic segmentation, zero-shot learning, vision-language models, clip |
2304.07527
Report |
Align-DETR: Improving DETR with Simple IoU-aware BCE loss |
Zhi Cai, Songtao Liu, Guodong Wang, Zheng Ge, Xiangyu Zhang, Di Huang |
DETR has set up a simple end-to-end pipeline for object detection by
formulating this task as a set prediction problem, showing promising potential.
However, despite the significant progress in improving DETR, this paper
identifies a problem of misalignment in the output distribution, which prevents
the best-regressed samples from being assigned with high confidence, hindering
the model's accuracy. We propose a metric, recall of best-regressed samples, to
quantitatively evaluate the misalignment problem. Observing its importance, we
propose a novel Align-DETR that incorporates a localization precision-aware
classification loss in optimization. The proposed loss, IA-BCE, guides the
training of DETR to build a strong correlation between classification score and
localization precision. We also adopt a mixed-matching strategy to
give DETR-based detectors faster training convergence while keeping
an end-to-end scheme. Moreover, to overcome the dramatic decrease in sample
quality induced by the sparsity of queries, we introduce a prime sample
weighting mechanism to suppress the interference of unimportant samples.
Extensive experiments are conducted with very competitive results reported. In
particular, it delivers a 46 (+3.8)% AP on the DAB-DETR baseline with the
ResNet-50 backbone and reaches a new SOTA performance of 50.2% AP in the 1x
setting on the COCO validation set when employing the strong baseline DINO. Our
code is available at https://github.com/FelixCaae/AlignDETR. |
This paper identifies a misalignment problem in DETR, where classification confidence and localization precision are inconsistent, and proposes Align-DETR to address it. |
The misalignment problem hinders DETR's accuracy by preventing the best-regressed samples from being assigned high confidence. |
Align-DETR introduces an IoU-aware BCE loss to align classification and regression scores, adopts a mixed-matching strategy for faster convergence, and implements prime sample weighting to handle low-quality samples during training. |
Align-DETR achieves a 3.8% AP gain over the DAB-DETR baseline with a ResNet-50 backbone.
It sets a new SOTA performance of 50.2% AP on COCO validation with the DINO baseline.
Ablation studies validate the effectiveness of the proposed components, particularly the IoU-aware BCE loss and prime sample weighting. |
The improvement from the IoU branch is limited, potentially due to DETR's reliance on self-attention for duplicate removal.
Further investigation is needed to explore other methods for handling the misalignment problem in DETR. |
object detection, detr, misalignment, iou-aware loss, mixed matching |
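The IoU-aware classification loss described in this report can be illustrated with a minimal sketch. It assumes a soft target t = s^alpha * IoU^(1 - alpha) for matched queries and a score-weighted term for unmatched ones; the function name, the default `alpha` and `gamma`, and the exact form of the target are assumptions made for illustration, not the authors' released code. In a real training loop the target would normally be treated as a constant (no gradient).

```python
import math

def ia_bce_loss(scores, ious, is_positive, alpha=0.25, gamma=2.0, eps=1e-8):
    """Sketch of an IoU-aware BCE loss (hypothetical form, see lead-in).

    scores      -- predicted class probabilities in (0, 1), one per query
    ious        -- IoU of each query's box with its matched ground truth
    is_positive -- True if the query is matched to an object
    """
    total = 0.0
    for s, u, pos in zip(scores, ious, is_positive):
        s = min(max(s, eps), 1.0 - eps)  # clamp for numerical safety
        if pos:
            # Soft target mixes confidence and localization quality.
            t = (s ** alpha) * (u ** (1.0 - alpha))
            total += -(t * math.log(s) + (1.0 - t) * math.log(1.0 - s))
        else:
            # Unmatched queries are pushed toward 0, weighted by their score.
            total += -(s ** gamma) * math.log(1.0 - s)
    return total
```

Under these assumptions, a well-localized box with a low score is penalized more than the same box with a high score, which is the alignment effect the report describes.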
2304.07483
Report |
Video Generation Beyond a Single Clip |
Hsin-Ping Huang, Yu-Chuan Su, Ming-Hsuan Yang |
We tackle the long video generation problem, i.e., generating videos beyond
the output length of video generation models. Due to computational resource
constraints, video generation models can only generate video clips that are
relatively short compared with the length of real videos. Existing works apply
a sliding window approach to generate long videos at inference time, which is
often limited to generating recurrent events or homogeneous content. To
generate long videos covering diverse content and multiple events, we propose
to use additional guidance to control the video generation process. We further
present a two-stage approach to the problem, which allows us to utilize
existing video generation models to generate high-quality videos within a small
time window while modeling the video holistically based on the input guidance.
The proposed approach is complementary to existing efforts on video generation,
which focus on generating realistic video within a fixed time window. Extensive
experiments on challenging real-world videos validate the benefit of the
proposed method, which improves over state-of-the-art by up to 9.5% in
objective metrics and is preferred by users more than 80% of the time. |
This paper tackles the long video generation problem, aiming to generate videos longer than the output length of typical video generation models. The authors propose a two-stage approach using additional guidance (object labels) to control the generation process. |
Current video generation models are limited in the length of videos they can produce due to computational constraints. Existing sliding window approaches often result in repetitive content, highlighting the need for methods that can generate long videos with diverse content and multiple events. |
The proposed method decomposes the problem into two stages: keyframe generation and frame interpolation. First, keyframes representing the start of each short video clip are predicted jointly based on object label guidance and a reference frame. Then, existing video generation models are utilized to interpolate intermediate frames between keyframes, generating the complete video. |
The proposed method outperforms state-of-the-art video generation models on the challenging EPIC Kitchen dataset, showing significant improvement in metrics like LPIPS and FVD.
Jointly predicting keyframes leads to better temporal consistency and quality compared to generating keyframes independently.
User studies confirm that the generated videos have better visual quality and reproduce the content of ground truth videos more accurately. |
The current implementation relies on object labels as guidance; exploring other types of guidance like text descriptions could be beneficial.
Further research on improving the quality of intermediate representations, such as layout generation, could lead to even better long video generation results. |
video generation, long video synthesis, keyframe generation, frame interpolation, object guidance |
2304.07429
Report |
Identity Encoder for Personalized Diffusion |
Yu-Chuan Su, Kelvin C. K. Chan, Yandong Li, Yang Zhao, Han Zhang, Boqing Gong, Huisheng Wang, Xuhui Jia |
Many applications can benefit from personalized image generation models,
including image enhancement and video conferencing, to name a few. Existing
works achieved personalization by fine-tuning one model for each person. While
being successful, this approach incurs additional computation and storage
overhead for each new identity. Furthermore, it usually expects tens or
hundreds of examples per identity to achieve the best performance. To overcome
these challenges, we propose an encoder-based approach for personalization. We
learn an identity encoder which can extract an identity representation from a
set of reference images of a subject, together with a diffusion generator that
can generate new images of the subject conditioned on the identity
representation. Once trained, the model can be used to generate images of
arbitrary identities given a few examples even if the model hasn't been trained
on the identity. Our approach greatly reduces the overhead for personalized
image generation and is more applicable in many potential applications.
Empirical results show that our approach consistently outperforms existing
fine-tuning-based approaches in both image generation and reconstruction, and its
outputs are preferred by users more than 95% of the time compared with the best
performing baseline. |
This paper introduces an encoder-based approach for personalized image generation using diffusion models, enabling the generation of new images of arbitrary subjects given a few reference images. |
Existing personalized image generation methods rely on fine-tuning for each identity, leading to high computational costs and limited practicality. This paper proposes a more efficient and scalable approach using an identity encoder. |
The proposed method learns an identity encoder that extracts an identity representation from a set of reference images. A diffusion generator then synthesizes new images conditioned on this representation. The system is trained using a combination of random average embedding, identity loss, and multi-task learning to balance identity preservation, output diversity, and image quality. |
The proposed method consistently outperforms baselines in terms of image quality and is on par with the best baseline in terms of identity preservation.
The method generates diverse outputs from a few reference images, unlike baselines that require hundreds of examples.
The approach effectively extends to conditional generation tasks like super-resolution and inpainting, demonstrating superior reconstruction accuracy compared to baselines. |
The average embedding strategy might be sub-optimal for capturing all potential variations of a subject.
Output quality can be subject-dependent, potentially due to biases in the training data. |
personalized image generation, diffusion models, identity encoder, conditional generation, image inpainting, super-resolution |
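The "random average embedding" mentioned in this report is not spelled out in the summary. One plausible reading, sketched below, is to average the encoder embeddings of a randomly chosen subset of a subject's reference images so that the identity representation does not depend on any single image. The function name and the subset-sampling scheme are assumptions made for illustration.

```python
import random

def random_average_embedding(embeddings, k, rng=None):
    """Average k randomly chosen per-image embeddings (hypothetical sketch).

    embeddings -- list of equal-length float vectors, one per reference image
    k          -- number of reference images to sample (1 <= k <= len(embeddings))
    rng        -- optional random.Random instance for reproducibility
    """
    rng = rng or random.Random()
    subset = rng.sample(embeddings, k)
    dim = len(subset[0])
    # Component-wise mean over the sampled embeddings.
    return [sum(vec[d] for vec in subset) / k for d in range(dim)]
```

With `k` equal to the number of references, this reduces to the plain mean embedding.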
2304.07221
Report |
Instance-aware Dynamic Prompt Tuning for Pre-trained Point Cloud Models |
Yaohua Zha, Jinpeng Wang, Tao Dai, Bin Chen, Zhi Wang, Shu-Tao Xia |
Pre-trained point cloud models have found extensive applications in 3D
understanding tasks like object classification and part segmentation. However,
the prevailing strategy of full fine-tuning in downstream tasks leads to large
per-task storage overhead for model parameters, which limits the efficiency
when applying large-scale pre-trained models. Inspired by the recent success of
visual prompt tuning (VPT), this paper attempts to explore prompt tuning on
pre-trained point cloud models, to pursue an elegant balance between
performance and parameter efficiency. We find that while instance-agnostic static
prompting, e.g. VPT, shows some efficacy in downstream transfer, it is
vulnerable to the distribution diversity caused by various types of noise in
real-world point cloud data. To conquer this limitation, we propose a novel
Instance-aware Dynamic Prompt Tuning (IDPT) strategy for pre-trained point
cloud models. The essence of IDPT is to develop a dynamic prompt generation
module to perceive semantic prior features of each point cloud instance and
generate adaptive prompt tokens to enhance the model's robustness. Notably,
extensive experiments demonstrate that IDPT outperforms full fine-tuning in
most tasks with a mere 7% of the trainable parameters, providing a promising
solution to parameter-efficient learning for pre-trained point cloud models.
Code is available at https://github.com/zyh16143998882/ICCV23-IDPT. |
This paper proposes Instance-aware Dynamic Prompt Tuning (IDPT), a novel method for parameter-efficient tuning of pre-trained point cloud models that addresses the limitations of static prompt tuning methods like VPT when applied to real-world point cloud data. |
Full fine-tuning of pre-trained point cloud models for downstream tasks requires significant storage for model parameters, limiting efficiency. Prompt tuning offers a parameter-efficient alternative but existing static methods struggle with the distribution diversity present in real-world point cloud data. |
IDPT employs a dynamic prompt generation module that leverages EdgeConv layers to extract multi-scale contextual features from point cloud instances. These features are used to generate adaptive prompt tokens that are concatenated with the input of the last transformer layer, enhancing the model's robustness to noise and missing data. |
IDPT achieves state-of-the-art performance on the ScanObjectNN dataset for object classification, outperforming full fine-tuning in most cases with only 7% of trainable parameters.
IDPT demonstrates superior performance compared to full fine-tuning and static prompt tuning methods on both synthetic and real-world datasets for object classification and few-shot learning.
While IDPT shows improvements over static prompting for part segmentation, it still lags behind full fine-tuning, suggesting a need for further research in parameter-efficient methods for fine-grained point cloud understanding. |
The performance gap between IDPT and full fine-tuning in part segmentation highlights the challenge of parameter-efficient tuning for fine-grained point cloud tasks.
Future work could explore incorporating effective structure modeling mechanisms within the parameter-efficient tuning strategy to bridge this gap. |
point cloud, prompt tuning, pre-trained models, parameter efficiency, domain adaptation |
2304.07087
Report |
Memory Efficient Diffusion Probabilistic Models via Patch-based Generation |
Shinei Arakawa, Hideki Tsunashima, Daichi Horita, Keitaro Tanaka, Shigeo Morishima |
Diffusion probabilistic models have been successful in generating
high-quality and diverse images. However, traditional models, whose input and
output are high-resolution images, suffer from excessive memory requirements,
making them less practical for edge devices. Previous approaches for generative
adversarial networks proposed a patch-based method that uses positional
encoding and global content information. Nevertheless, designing a patch-based
approach for diffusion probabilistic models is non-trivial. In this paper, we
present a diffusion probabilistic model that generates images on a
patch-by-patch basis. We propose two conditioning methods for patch-based
generation. First, we propose position-wise conditioning using one-hot
representation to ensure patches are in proper positions. Second, we propose
Global Content Conditioning (GCC) to ensure patches have coherent content when
concatenated together. We evaluate our model qualitatively and quantitatively
on CelebA and LSUN bedroom datasets and demonstrate a moderate trade-off
between maximum memory consumption and generated image quality. Specifically,
when an entire image is divided into 2 x 2 patches, our proposed approach can
reduce the maximum memory consumption by half while maintaining comparable
image quality. |
This paper presents a memory-efficient diffusion probabilistic model for image generation, which operates on a patch-by-patch basis. |
Traditional diffusion models suffer from high memory requirements, especially for high-resolution images, limiting their practicality on edge devices. |
The model divides images into patches and utilizes two conditioning methods: 1) Position-wise conditioning using one-hot representation to specify patch location. 2) Global Content Conditioning (GCC) which extracts global content features from the entire image to ensure coherence when patches are combined. |
The proposed method can reduce the maximum memory consumption by half while maintaining comparable image quality when dividing an entire image into 2x2 patches.
The model exhibits good performance on CelebA dataset, particularly with 2x2 and 4x4 patch divisions.
On LSUN bedroom dataset, while quality is maintained with 2x2 division, further divisions lead to noticeable patch boundaries and quality degradation. |
The model struggles with datasets containing diverse image compositions, leading to patch boundary artifacts.
Extracting global content information at every diffusion step might lead to error accumulation and boundary discontinuities. |
diffusion probabilistic models, memory efficient, patch-based generation, global content conditioning, image generation |
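The position-wise conditioning in this report can be pictured with a short sketch: an image is cut into a grid of patches, and each patch is paired with a one-hot vector naming its grid cell. The function name and the nested-list image layout are assumptions for illustration, and the Global Content Conditioning branch is not modeled here.

```python
def split_with_position(image, grid):
    """Split a 2D image into grid x grid patches with one-hot positions.

    image -- list of rows (height and width assumed divisible by grid)
    grid  -- number of patches per side, e.g. 2 for a 2 x 2 division
    Returns a list of (patch, one_hot) pairs in row-major order.
    """
    height, width = len(image), len(image[0])
    ph, pw = height // grid, width // grid
    patches = []
    for r in range(grid):
        for c in range(grid):
            patch = [row[c * pw:(c + 1) * pw] for row in image[r * ph:(r + 1) * ph]]
            one_hot = [0] * (grid * grid)
            one_hot[r * grid + c] = 1  # marks which cell this patch came from
            patches.append((patch, one_hot))
    return patches
```

Each patch is a fraction of the full image, which is where the memory saving reported for the 2 x 2 division comes from.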
2304.07060
Report |
DCFace: Synthetic Face Generation with Dual Condition Diffusion Model |
Minchul Kim, Feng Liu, Anil Jain, Xiaoming Liu |
Generating synthetic datasets for training face recognition models is
challenging because dataset generation entails more than creating high fidelity
images. It involves generating multiple images of the same subjects under different
factors (e.g., variations in pose, illumination, expression, aging and
occlusion) that follow the real image conditional distribution. Previous
works have studied the generation of synthetic datasets using GAN or 3D models.
In this work, we approach the problem from the aspect of combining subject
appearance (ID) and external factor (style) conditions. These two conditions
provide a direct way to control the inter-class and intra-class variations. To
this end, we propose a Dual Condition Face Generator (DCFace) based on a
diffusion model. Our novel Patch-wise style extractor and Time-step dependent
ID loss enable DCFace to consistently produce face images of the same subject
under different styles with precise control. Face recognition models trained on
synthetic images from the proposed DCFace provide higher verification
accuracies compared to previous works by 6.11% on average in 4 out of 5
test datasets, LFW, CFP-FP, CPLFW, AgeDB and CALFW. Code is available at
https://github.com/mk-minchul/dcface |
Proposes DCFace, a two-stage dual condition diffusion model for generating synthetic face datasets with improved subject uniqueness, diversity, and label consistency. |
Addresses limitations of existing synthetic face datasets in matching real-world image distributions and label accuracy, crucial for training effective face recognition models while mitigating privacy concerns associated with real datasets. |
Combines an ID image generator with a style bank and a dual condition generator (G_mix). G_mix leverages a patch-wise style extractor and a time-step dependent ID loss to blend ID and style conditions from same-subject image pairs, enhancing control over image generation. |
Achieves state-of-the-art face recognition performance with a 0.5M synthetic image dataset, outperforming previous methods by 6.11% on average across four benchmark datasets.
Demonstrates the importance of balancing label consistency and diversity in synthetic datasets for optimal face recognition accuracy.
Shows that DDPM trained on FFHQ can generate a substantial number of unique subjects, addressing the limitation of limited subject diversity in previous GAN-based methods. |
Despite improvements, DCFace lacks 3D consistency across pose, a potential area for future work leveraging 3D priors.
The current implementation still relies on real images for training; fully synthetic datasets that eliminate any dependence on real data remain a goal for future work. |
synthetic data generation, face recognition, diffusion models, dual condition generation, dataset diversity and consistency |
2304.07039
Report |
Learning Semantic-Aware Knowledge Guidance for Low-Light Image Enhancement |
Yuhui Wu, Chen Pan, Guoqing Wang, Yang Yang, Jiwei Wei, Chongyi Li, Heng Tao Shen |
Low-light image enhancement (LLIE) investigates how to improve illumination
and produce normal-light images. The majority of existing methods improve
low-light images in a global and uniform manner, without taking into account
the semantic information of different regions. Without semantic priors, a
network may easily deviate from a region's original color. To address this
issue, we propose a novel semantic-aware knowledge-guided framework (SKF) that
can assist a low-light enhancement model in learning rich and diverse priors
encapsulated in a semantic segmentation model. We concentrate on incorporating
semantic knowledge from three key aspects: a semantic-aware embedding module
that wisely integrates semantic priors in feature representation space, a
semantic-guided color histogram loss that preserves color consistency of
various instances, and a semantic-guided adversarial loss that produces more
natural textures by semantic priors. Our SKF is appealing as a
general framework for the LLIE task. Extensive experiments show that models equipped
with the SKF significantly outperform the baselines on multiple datasets and
our SKF generalizes to different models and scenes well. The code is available
at Semantic-Aware-Low-Light-Image-Enhancement. |
This paper proposes SKF, a semantic-aware knowledge-guided framework that improves low-light image enhancement using semantic priors. |
Existing LLIE methods often enhance images globally without considering region-specific semantics, leading to color deviations and unnatural results. |
The SKF leverages a pre-trained semantic segmentation network (HRNet) as a knowledge bank and introduces: 1) Semantic-aware embedding (SE) module for refining image features using semantic features. 2) Semantic-guided color histogram (SCH) loss for preserving instance-level color consistency. 3) Semantic-guided adversarial (SA) loss for enhancing texture realism by guiding the discriminator. |
SKF consistently improves the performance of six baseline LLIE methods on LOL and LOL-v2 datasets.
LLFlow-L with SKF achieves state-of-the-art results on both LOL and LOL-v2 datasets.
The proposed framework effectively suppresses noise, preserves color consistency, and generates realistic textures in enhanced images. |
The performance improvement is limited when encountering unknown object categories.
Future work includes exploring the framework's potential in other low-level vision tasks. |
low-light image enhancement, semantic segmentation, knowledge guidance, color consistency, adversarial learning |
2304.06957
Report |
MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic Segmentation |
Jie Guo, Qimeng Wang, Yan Gao, Xiaolong Jiang, Xu Tang, Yao Hu, Baochang Zhang |
CLIP (Contrastive Language-Image Pretraining) is well-developed for
open-vocabulary zero-shot image-level recognition, while its applications in
pixel-level tasks are less investigated, where most efforts directly adopt CLIP
features without deliberative adaptations. In this work, we first demonstrate
the necessity of image-pixel CLIP feature adaptation, then provide Multi-View
Prompt learning (MVP-SEG) as an effective solution to achieve image-pixel
adaptation and to solve open-vocabulary semantic segmentation. Concretely,
MVP-SEG deliberately learns multiple prompts trained by our Orthogonal
Constraint Loss (OCLoss), by which each prompt is supervised to exploit CLIP
features on different object parts, and collaborative segmentation masks
generated by all prompts promote better segmentation. Moreover, MVP-SEG
introduces Global Prompt Refining (GPR) to further eliminate class-wise
segmentation noise. Experiments show that the multi-view prompts learned from
seen categories have strong generalization to unseen categories, and MVP-SEG+
which combines the knowledge transfer stage significantly outperforms previous
methods on several benchmarks. Moreover, qualitative results justify that
MVP-SEG does lead to better focus on different local parts. |
Proposes MVP-SEG, a multi-view prompt learning method for open-vocabulary semantic segmentation using pre-trained CLIP. |
CLIP, while powerful for image-level recognition, requires adaptation for pixel-level tasks like segmentation. Existing methods directly adopting CLIP features result in sub-optimal performance. |
Learns multiple prompts to capture different object parts, supervised by an Orthogonal Constraint Loss (OCLoss) to ensure part-wise attention. Introduces Global Prompt Refining (GPR) to leverage CLIP's classification ability and refine segmentation masks. |
MVP-SEG significantly outperforms baseline (MaskCLIP) on unseen classes, demonstrating the effectiveness of multi-view learnable prompts.
Learnable prompts outperform handcrafted prompts, showing adaptability and superiority of the proposed method.
MVP-SEG+, combining MVP-SEG with knowledge transfer, achieves state-of-the-art performance on three major benchmarks, even surpassing fully-supervised methods on some. |
The number of prompts and their effectiveness might vary across datasets and object categories.
Future work could explore alternative prompt learning strategies and architectures for further performance improvement. |
open-vocabulary semantic segmentation, zero-shot learning, clip, prompt learning, multi-view learning |
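The Orthogonal Constraint Loss in this report pushes the learned prompts to attend to different object parts. The summary does not give its formula, so the sketch below uses one common way to express such a constraint: the mean absolute cosine similarity over all pairs of prompt embeddings. The function name and this particular form are assumptions, not the paper's definition.

```python
import math

def orthogonal_constraint_loss(prompts):
    """Mean |cosine similarity| over all prompt pairs (hypothetical form).

    prompts -- list of at least two equal-length, non-zero float vectors
    Returns 0.0 for mutually orthogonal prompts and 1.0 for collinear ones.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    n = len(prompts)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(abs(cosine(prompts[i], prompts[j])) for i, j in pairs) / len(pairs)
```

Minimizing this quantity alongside the segmentation loss would discourage two prompts from collapsing onto the same direction.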
2304.06939
Report |
Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text |
Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, Yejin Choi |
In-context vision and language models like Flamingo support arbitrarily
interleaved sequences of images and text as input. This format not only enables
few-shot learning via interleaving independent supervised (image, text)
examples, but also more complex prompts involving interaction between images,
e.g., "What do image A and image B have in common?" To support this interface,
pretraining occurs over web corpora that similarly contain interleaved
images+text. To date, however, large-scale data of this form have not been
publicly available.
We release Multimodal C4, an augmentation of the popular text-only C4 corpus
with images interleaved. We use a linear assignment algorithm to place images
into longer bodies of text using CLIP features, a process that we show
outperforms alternatives. Multimodal C4 spans everyday topics like cooking,
travel, technology, etc. A manual inspection of a random sample of documents
shows that a vast majority (88%) of images are topically relevant, and that
linear assignment frequently selects individual sentences specifically
well-aligned with each image (80%). After filtering NSFW images, ads, etc., the
resulting corpus consists of 101.2M documents with 571M images interleaved in
43B English tokens. |
This paper introduces Multimodal C4 (MMC4), a large-scale dataset with interleaved image and text sequences for training multimodal language models. |
Existing multimodal datasets primarily consist of image-caption pairs, limiting the ability of models to learn complex interactions between images and text. MMC4 addresses this gap by providing a rich dataset with interleaved sequences, enabling the development of models capable of few-shot learning and complex multimodal reasoning. |
The authors augmented the existing text-only C4 dataset with images from the corresponding web pages. They employed a CLIP-based linear assignment algorithm to align images with relevant sentences within each document, ensuring topical relevance and image-text alignment. |
MMC4 consists of 101.2M documents, 571M images, and 43B English tokens, surpassing previous non-public datasets in scale.
Manual verification indicates that 87.7% of images are topically relevant to their associated documents, and 80.4% are well-aligned with their assigned sentences.
Preliminary experiments demonstrate that training a multimodal language model on MMC4 improves its performance on few-shot, in-context image captioning tasks compared to training on image-caption pairs alone. |
The paper lacks detailed empirical evaluation of the model's in-context reasoning abilities beyond few-shot image captioning.
Future work could explore the impact of data scaling and instruction tuning on multimodal in-context learning. |
multimodal language models, dataset, image-text alignment, in-context learning, few-shot learning |
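The linear assignment step in this report places each image next to a distinct sentence so that the total image-sentence similarity is maximized. The sketch below shows the idea on a small similarity matrix using brute-force search; the function name and the toy similarity values are assumptions, and a real pipeline would use a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` on CLIP similarities rather than enumerating permutations.

```python
from itertools import permutations

def assign_images_to_sentences(sim):
    """Assign each image to a distinct sentence, maximizing total similarity.

    sim -- sim[i][j] is the similarity of image i to sentence j
           (number of images must not exceed number of sentences)
    Returns (assignment, score) where assignment[i] is the sentence index
    chosen for image i. Brute force, so only suitable for tiny inputs.
    """
    n_img, n_sent = len(sim), len(sim[0])
    best, best_score = None, float("-inf")
    for perm in permutations(range(n_sent), n_img):
        score = sum(sim[i][perm[i]] for i in range(n_img))
        if score > best_score:
            best, best_score = perm, score
    return list(best), best_score
```

Unlike a greedy per-image argmax, the joint assignment can give an image its second-best sentence when that frees the best sentence for an image that needs it more.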
2304.06911
Report |
3D Feature Prediction for Masked-AutoEncoder-Based Point Cloud Pretraining |
Siming Yan, Yuqi Yang, Yuxiao Guo, Hao Pan, Peng-shuai Wang, Xin Tong, Yang Liu, Qixing Huang |
Masked autoencoders (MAE) have recently been introduced to 3D self-supervised
pretraining for point clouds due to their great success in NLP and computer
vision. Unlike MAEs used in the image domain, where the pretext task is to
restore features at the masked pixels, such as colors, the existing 3D MAE
works reconstruct the missing geometry only, i.e., the location of the masked
points. In contrast to previous studies, we advocate that point location
recovery is inessential and restoring intrinsic point features is much
superior. To this end, we propose to ignore point position reconstruction and
recover high-order features at masked points including surface normals and
surface variations, through a novel attention-based decoder which is
independent of the encoder design. We validate the effectiveness of our pretext
task and decoder design using different encoder structures for 3D training and
demonstrate the advantages of our pretrained networks on various point cloud
analysis tasks. |
This paper proposes MaskFeat3D, a novel masked autoencoding method for 3D self-supervised pretraining that focuses on reconstructing intrinsic point features (surface normals and variations) instead of point locations. |
Existing 3D MAE methods primarily focus on reconstructing masked point locations, deviating from successful 2D approaches that prioritize feature restoration. This paper argues that recovering high-order surface features is crucial for better representation learning in 3D. |
The method uses an attention-based decoder that takes masked points as queries and leverages cross-attention with encoder features to predict normals and variations. This decoder is agnostic to the encoder architecture, supporting ViT, PointNet++, and sparse CNNs. |
MaskFeat3D consistently outperforms previous 3D MAE methods on ScanObjectNN classification and ShapeNetPart segmentation.
Using both normal and surface variation as target features yields better performance than using either alone.
The method achieves state-of-the-art results on these tasks, even surpassing supervised methods in some cases. |
The computational cost can be high due to the use of attention.
Exploration of other potential 3D features for reconstruction is left for future work. |
self-supervised learning, point cloud, masked autoencoder, feature prediction, attention mechanism |
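The two target features named in this report, surface normals and surface variation, are commonly derived from a local covariance analysis. The formulation below is the standard one and is given only as an assumed reading; the summary does not state the exact definition used, and the neighborhood size k is a free parameter.

```latex
% Assumed standard definitions; N_k(p) is the set of k nearest neighbors of p.
\begin{align}
  \bar{p} &= \frac{1}{k} \sum_{q \in N_k(p)} q, \\
  C(p) &= \frac{1}{k} \sum_{q \in N_k(p)} (q - \bar{p})(q - \bar{p})^{\top}, \\
  C(p)\, v_i &= \lambda_i v_i, \qquad \lambda_0 \le \lambda_1 \le \lambda_2, \\
  n(p) &= v_0, \qquad
  \sigma(p) = \frac{\lambda_0}{\lambda_0 + \lambda_1 + \lambda_2}.
\end{align}
```

Under this definition the normal n(p) is the direction of least variance, and sigma(p) lies in [0, 1/3], with 0 for a locally planar neighborhood.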
2304.06720
Report |
Expressive Text-to-Image Generation with Rich Text |
Songwei Ge, Taesung Park, Jun-Yan Zhu, Jia-Bin Huang |
Plain text has become a prevalent interface for text-to-image synthesis.
However, its limited customization options hinder users from accurately
describing desired outputs. For example, plain text makes it hard to specify
continuous quantities, such as the precise RGB color value or importance of
each word. Furthermore, creating detailed text prompts for complex scenes is
tedious for humans to write and challenging for text encoders to interpret. To
address these challenges, we propose using a rich-text editor supporting
formats such as font style, size, color, and footnote. We extract each word's
attributes from rich text to enable local style control, explicit token
reweighting, precise color rendering, and detailed region synthesis. We achieve
these capabilities through a region-based diffusion process. We first obtain
each word's region based on attention maps of a diffusion process using plain
text. For each region, we enforce its text attributes by creating
region-specific detailed prompts and applying region-specific guidance, and
maintain its fidelity against plain-text generation through region-based
injections. We present various examples of image generation from rich text and
demonstrate that our method outperforms strong baselines with quantitative
evaluations. |
This paper introduces rich-text-to-image generation, enabling precise control over image synthesis using attributes like font style, size, color, and footnotes. |
Plain text prompts limit users' ability to specify precise details like color or object importance. Rich text offers a more expressive and user-friendly interface for text-to-image synthesis. |
The method uses a two-step process. First, it computes spatial layouts for token spans using attention maps from a plain-text diffusion process. Second, it employs region-based diffusion and guidance to render each region's attributes, preserving fidelity to the plain-text generation. |
The method generates more precise colors compared to baselines, accurately reflecting RGB values and subtle color names.
It enables local style control, applying distinct artistic styles to different image regions, unlike baselines that produce uniform styles.
It facilitates detailed region synthesis, incorporating information from footnotes to generate complex scenes with higher fidelity than competing approaches. |
The method's reliance on multiple diffusion processes leads to longer inference times compared to plain-text generation.
The token map generation process relies on a thresholding parameter that could be replaced with more advanced segmentation methods. |
text-to-image synthesis, rich text, diffusion models, attention mechanisms, controllable image generation |
2304.06718
Report |
Segment Everything Everywhere All at Once |
Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, Yong Jae Lee |
In this work, we present SEEM, a promptable and interactive model for
segmenting everything everywhere all at once in an image, as shown in Fig.1. In
SEEM, we propose a novel decoding mechanism that enables diverse prompting for
all types of segmentation tasks, aiming at a universal segmentation interface
that behaves like large language models (LLMs). More specifically, SEEM is
designed with four desiderata: i) Versatility. We introduce a new visual prompt
to unify different spatial queries including points, boxes, scribbles and
masks, which can further generalize to a different referring image; ii)
Compositionality. We learn a joint visual-semantic space between text and
visual prompts, which facilitates the dynamic composition of two prompt types
required for various segmentation tasks; iii) Interactivity. We further
incorporate learnable memory prompts into the decoder to retain segmentation
history through mask-guided cross-attention from decoder to image features; and
iv) Semantic-awareness. We use a text encoder to encode text queries and mask
labels into the same semantic space for open-vocabulary segmentation. We
conduct a comprehensive empirical study to validate the effectiveness of SEEM
across diverse segmentation tasks. Notably, our single SEEM model achieves
competitive performance across interactive segmentation, generic segmentation,
referring segmentation, and video object segmentation on 9 datasets with
minimum 1/100 supervision. Furthermore, SEEM showcases a remarkable capacity
for generalization to novel prompts or their combinations, rendering it a
readily universal image segmentation interface. |
This paper proposes SEEM, a promptable and interactive model for universal image segmentation, capable of segmenting everything in an image with semantic labels, covering every pixel, and supporting various prompt compositions. |
The research aims to address the limitations of existing segmentation models by introducing a universal interface that accommodates diverse human prompts and handles various segmentation tasks within a single model. |
SEEM utilizes an encoder-decoder architecture with a novel decoding mechanism that enables versatile prompting. It introduces visual prompts for non-textual inputs, facilitates compositionality of prompts, incorporates memory prompts for interactivity, and ensures semantic awareness for open-vocabulary segmentation. |
SEEM achieves competitive performance across interactive segmentation, generic segmentation, referring segmentation, and video object segmentation on nine datasets with minimal supervision.
The model demonstrates a remarkable capacity for generalization to novel prompts or their combinations, highlighting its potential as a universal image segmentation interface.
SEEM exhibits efficiency in interactive segmentation, requiring only one feature extraction at the start and lightweight decoding per interaction round. |
The model's performance on referring segmentation is slightly affected when trained from scratch.
Increasing the number of interactive training iterations improves accuracy but also elevates computational costs. |
image segmentation, interactive segmentation, referring segmentation, open-vocabulary segmentation, universal model |
2304.06717
Report |
Representing Volumetric Videos as Dynamic MLP Maps |
Sida Peng, Yunzhi Yan, Qing Shuai, Hujun Bao, Xiaowei Zhou |
This paper introduces a novel representation of volumetric videos for
real-time view synthesis of dynamic scenes. Recent advances in neural scene
representations demonstrate their remarkable capability to model and render
complex static scenes, but extending them to represent dynamic scenes is not
straightforward due to their slow rendering speed or high storage cost. To
solve this problem, our key idea is to represent the radiance field of each
frame as a set of shallow MLP networks whose parameters are stored in 2D grids,
called MLP maps, and dynamically predicted by a 2D CNN decoder shared by all
frames. Representing 3D scenes with shallow MLPs significantly improves the
rendering speed, while dynamically predicting MLP parameters with a shared 2D
CNN instead of explicitly storing them leads to low storage cost. Experiments
show that the proposed approach achieves state-of-the-art rendering quality on
the NHR and ZJU-MoCap datasets, while being efficient for real-time rendering
with a speed of 41.7 fps for $512 \times 512$ images on an RTX 3090 GPU. The
code is available at https://zju3dv.github.io/mlp_maps/. |
This paper proposes a novel representation of volumetric video called "dynamic MLP maps" for efficient view synthesis of dynamic scenes. |
Designing a volumetric video representation that allows for high-quality, real-time rendering while also being efficiently compressed remains an open problem. |
The authors represent each video frame as a set of small MLP networks, with their parameters stored in 2D grids called MLP maps. These parameters are dynamically predicted by a 2D CNN decoder shared across all frames. |
The approach achieves state-of-the-art rendering quality on the NHR and ZJU-MoCap datasets.
It enables real-time rendering with speeds of 41.7 fps for 512x512 images on an RTX 3090 GPU.
The method achieves compact representation, leading to low storage costs. |
The current work only handles relatively short videos (100-300 frames), limiting its applicability to longer videos.
The representation relies on dense camera views for training, similar to many existing methods. |
volumetric video, view synthesis, neural scene representation, mlp, real-time rendering |
2304.06712
Report |
What does CLIP know about a red circle? Visual prompt engineering for VLMs |
Aleksandar Shtedritski, Christian Rupprecht, Andrea Vedaldi |
Large-scale Vision-Language Models, such as CLIP, learn powerful image-text
representations that have found numerous applications, from zero-shot
classification to text-to-image generation. Despite that, their capabilities
for solving novel discriminative tasks via prompting fall behind those of large
language models, such as GPT-3. Here we explore the idea of visual prompt
engineering for solving computer vision tasks beyond classification by editing
in image space instead of text. In particular, we discover an emergent ability
of CLIP, where, by simply drawing a red circle around an object, we can direct
the model's attention to that region, while also maintaining global
information. We show the power of this simple approach by achieving
state-of-the-art in zero-shot referring expressions comprehension and strong
performance in keypoint localization tasks. Finally, we draw attention to some
potential ethical concerns of large language-vision models. |
This paper explores visual prompt engineering in Vision-Language Models (VLMs) by introducing a simple yet effective technique: marking image regions with a red circle to guide the model's attention. |
This approach aims to enhance the ability of VLMs to solve novel discriminative tasks beyond classification, bridging the gap between their capabilities and those of large language models. |
The researchers experiment with different visual prompt engineering techniques, including cropping and marking. They evaluate their method on three zero-shot tasks: naming keypoints, localizing keypoints, and referring expression comprehension. |
Marking with red circles significantly outperforms cropping and random baselines in all tasks.
The effectiveness of red circles is attributed to their presence, albeit rare, in the VLM training data (e.g., YFCC15M).
The authors achieve state-of-the-art zero-shot performance on referring expression comprehension, surpassing methods that use image cropping and manually designed relation rules. |
The reliance on the presence of specific markers in the training data may limit generalization.
The study reveals potential ethical concerns as VLMs can learn and amplify biases present in the training data, such as associating red circles with negative connotations (e.g., missing persons or criminals). |
visual prompt engineering, vision-language models, zero-shot learning, referring expression comprehension, ethical bias in ai |
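The marking technique in the entry above (2304.06712) can be illustrated with a minimal sketch. It assumes images are plain nested lists of RGB tuples and draws a circle rather than a box-fitted ellipse. The names `draw_red_circle`, `pick_best_region`, and `score_fn` are hypothetical; `score_fn` stands in for whatever image-text similarity a VLM such as CLIP would return and is not an API of any real library.

```python
RED = (255, 0, 0)


def draw_red_circle(image, cx, cy, radius, thickness=1.0):
    """Return a copy of `image` with a red ring centred at (cx, cy).

    `image` is a list of rows, each row a list of (r, g, b) tuples.
    The rest of the image is left untouched, so global context survives.
    """
    marked = [list(row) for row in image]
    for y, row in enumerate(marked):
        for x in range(len(row)):
            dist = ((x - cx) ** 2 + (y - cy) ** 2) ** 0.5
            if abs(dist - radius) <= thickness / 2:
                row[x] = RED
    return marked


def pick_best_region(image, boxes, score_fn):
    """Mark one copy of the image per candidate box and return the index
    of the box whose marked image receives the highest score.

    `score_fn` is a hypothetical placeholder for a VLM similarity between
    the marked image and a text query.
    """
    scores = []
    for (x0, y0, x1, y1) in boxes:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        radius = max(x1 - x0, y1 - y0) / 2
        scores.append(score_fn(draw_red_circle(image, cx, cy, radius)))
    return max(range(len(boxes)), key=scores.__getitem__)
```

In contrast to cropping, every candidate passed to the scorer is the full image, which is the property the entry credits for keeping global information available.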
2304.06711
Report |
DiffusionRig: Learning Personalized Priors for Facial Appearance Editing |
Zheng Ding, Xuaner Zhang, Zhihao Xia, Lars Jebe, Zhuowen Tu, Xiuming Zhang |
We address the problem of learning person-specific facial priors from a small
number (e.g., 20) of portrait photos of the same person. This enables us to
edit this specific person's facial appearance, such as expression and lighting,
while preserving their identity and high-frequency facial details. Key to our
approach, which we dub DiffusionRig, is a diffusion model conditioned on, or
"rigged by," crude 3D face models estimated from single in-the-wild images by
an off-the-shelf estimator. On a high level, DiffusionRig learns to map
simplistic renderings of 3D face models to realistic photos of a given person.
Specifically, DiffusionRig is trained in two stages: It first learns generic
facial priors from a large-scale face dataset and then person-specific priors
from a small portrait photo collection of the person of interest. By learning
the CGI-to-photo mapping with such personalized priors, DiffusionRig can "rig"
the lighting, facial expression, head pose, etc. of a portrait photo,
conditioned only on coarse 3D models while preserving this person's identity
and other high-frequency characteristics. Qualitative and quantitative
experiments show that DiffusionRig outperforms existing approaches in both
identity preservation and photorealism. Please see the project website:
https://diffusionrig.github.io for the supplemental material, video, code, and
data. |
Proposes DiffusionRig, a diffusion model that learns personalized priors for facial appearance editing from a small set of portrait photos, enabling controllable edits while preserving identity and high-frequency details. |
Addresses limitations of zero-shot facial appearance editing methods that struggle to preserve individual-specific features. |
Two-stage training: (1) Learns generic facial priors from a large-scale face dataset using a diffusion model conditioned on physical buffers (normals, albedo, Lambertian rendering) extracted by DECA. (2) Fine-tunes the model on a small set (around 20) of a specific person's photos to capture personalized priors. |
Achieves convincing appearance edits (relighting, expression, pose) while preserving identity.
Outperforms existing methods in both identity preservation and photorealism, as shown quantitatively and via user study.
Demonstrates disentanglement of physical properties from global appearance information (hairstyle, accessories) by swapping global latent codes. |
Scalability: Requires fine-tuning for each individual, limiting practicality for large-scale user adoption.
Background Inconsistency: May struggle with background preservation during dramatic head pose changes. |
diffusion models, facial appearance editing, personalized priors, 3d morphable models, image generation |
2304.06706
Report |
Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields |
Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, Peter Hedman |
Neural Radiance Field training can be accelerated through the use of
grid-based representations in NeRF's learned mapping from spatial coordinates
to colors and volumetric density. However, these grid-based approaches lack an
explicit understanding of scale and therefore often introduce aliasing, usually
in the form of jaggies or missing scene content. Anti-aliasing has previously
been addressed by mip-NeRF 360, which reasons about sub-volumes along a cone
rather than points along a ray, but this approach is not natively compatible
with current grid-based techniques. We show how ideas from rendering and signal
processing can be used to construct a technique that combines mip-NeRF 360 and
grid-based models such as Instant NGP to yield error rates that are 8% - 77%
lower than either prior technique, and that trains 24x faster than mip-NeRF
360. |
Presents Zip-NeRF, a novel architecture that combines the advantages of grid-based NeRF models (like Instant NGP) and scale-aware anti-aliased NeRFs (like mip-NeRF 360). |
Grid-based NeRFs, while fast, lack an inherent understanding of scale, leading to aliasing. Mip-NeRF 360 addresses aliasing but is not directly compatible with grid-based techniques. This work bridges this gap, aiming for both fast and high-quality rendering. |
Employs multisampling and feature downweighting to integrate iNGP's grid pyramid into mip-NeRF 360. Introduces an anti-aliased loss function to address z-aliasing arising from the proposal network. |
Reduces error rates by 8% to 77% compared to previous techniques on the mip-NeRF 360 benchmark and a proposed multiscale benchmark.
Achieves a 24x speedup in training time compared to mip-NeRF 360.
Demonstrates superior visual quality, particularly in recovering thin structures and fine details, as shown in comparative renderings. |
The rendering time, while not a focus, is not significantly improved.
Reducing the number of samples required without sacrificing quality needs further investigation. |
neural radiance fields, anti-aliasing, multisampling, grid-based nerf, inverse rendering |
2304.06700
Report |
Control3Diff: Learning Controllable 3D Diffusion Models from Single-view Images |
Jiatao Gu, Qingzhe Gao, Shuangfei Zhai, Baoquan Chen, Lingjie Liu, Josh Susskind |
Diffusion models have recently become the de-facto approach for generative
modeling in the 2D domain. However, extending diffusion models to 3D is
challenging due to the difficulties in acquiring 3D ground truth data for
training. On the other hand, 3D GANs that integrate implicit 3D representations
into GANs have shown remarkable 3D-aware generation when trained only on
single-view image datasets. However, 3D GANs do not provide straightforward
ways to precisely control image synthesis. To address these challenges, we
present Control3Diff, a 3D diffusion model that combines the strengths of
diffusion models and 3D GANs for versatile, controllable 3D-aware image
synthesis for single-view datasets. Control3Diff explicitly models the
underlying latent distribution (optionally conditioned on external inputs),
thus enabling direct control during the diffusion process. Moreover, our
approach is general and applicable to any type of controlling input, allowing
us to train it with the same diffusion objective without any auxiliary
supervision. We validate the efficacy of Control3Diff on standard image
generation benchmarks, including FFHQ, AFHQ, and ShapeNet, using various
conditioning inputs such as images, sketches, and text prompts. Please see the
project website (https://jiataogu.me/control3diff) for video comparisons. |
Presents Control3Diff, a 3D diffusion model for controllable 3D-aware image synthesis from single-view images by linking diffusion models to 3D GANs. |
Addresses the limitations of diffusion models in 3D generation due to the lack of 3D ground truth data and the difficulty in defining energy functions for guidance in the latent space. |
Leverages pre-trained 3D GANs to sample latent representations (tri-planes) and trains diffusion models on these representations for both unconditional and conditional generation, enabling precise control over 3D properties. |
Significantly outperforms existing 3D GAN inversion baselines on image-to-3D inversion tasks.
Achieves comparable or better results than Pix2Pix3D on Seg-to-3D and Edge-to-3D tasks.
Demonstrates the versatility of the framework by applying it to Text-to-3D generation and editing. |
Mode collapse in the learned latent space of 3D GANs can limit the diversity of generated samples.
The iterative nature of diffusion models results in a slower generation process compared to encoder-based approaches. |
diffusion models, 3d gans, controllable image synthesis, single-view reconstruction, 3d-aware generation |
2304.06648
Report |
DiffFit: Unlocking Transferability of Large Diffusion Models via Simple Parameter-Efficient Fine-Tuning |
Enze Xie, Lewei Yao, Han Shi, Zhili Liu, Daquan Zhou, Zhaoqiang Liu, Jiawei Li, Zhenguo Li |
Diffusion models have proven to be highly effective in generating
high-quality images. However, adapting large pre-trained diffusion models to
new domains remains an open challenge, which is critical for real-world
applications. This paper proposes DiffFit, a parameter-efficient strategy to
fine-tune large pre-trained diffusion models that enables fast adaptation to new
domains. DiffFit is embarrassingly simple: it fine-tunes only the bias term
and newly-added scaling factors in specific layers, yet yields a
significant training speed-up and reduced model storage costs. Compared with
full fine-tuning, DiffFit achieves 2$\times$ training speed-up and only needs
to store approximately 0.12% of the total model parameters. Intuitive
theoretical analysis has been provided to justify the efficacy of scaling
factors on fast adaptation. On 8 downstream datasets, DiffFit achieves superior
or competitive performances compared to the full fine-tuning while being more
efficient. Remarkably, we show that DiffFit can adapt a pre-trained
low-resolution generative model to a high-resolution one by adding minimal
cost. Among diffusion-based methods, DiffFit sets a new state-of-the-art FID of
3.02 on ImageNet 512$\times$512 benchmark by fine-tuning only 25 epochs from a
public pre-trained ImageNet 256$\times$256 checkpoint while being 30$\times$
more training efficient than the closest competitor. |
This paper proposes DiffFit, a parameter-efficient strategy for fine-tuning large pre-trained diffusion models based on DiT, enabling fast adaptation to new domains. |
Adapting large pre-trained diffusion models (like DiT) to new domains is important for real-world applications but remains a challenge due to the computational cost and storage requirements of full fine-tuning. |
DiffFit freezes most parameters of a pre-trained diffusion model and only fine-tunes the bias term, normalization, class embedding, and newly-added scaling factors in specific layers. |
DiffFit achieves a 2x training speed-up and needs to store only approximately 0.12% of the total model parameters compared to full fine-tuning.
Evaluation on 8 downstream datasets shows DiffFit achieves superior or competitive performance compared to full fine-tuning while being more efficient.
DiffFit sets a new state-of-the-art FID of 3.02 on ImageNet 512x512 benchmark by fine-tuning only 25 epochs from a pre-trained ImageNet 256x256 checkpoint, being 30x more training efficient than the closest competitor. |
The experiments mainly focus on class-conditional image generation and it is unclear if DiffFit can generalize to more complex tasks like text-to-image or video generation.
Further investigation is needed to determine if the scaling factor's effectiveness extends to deeper layers of the model. |
diffusion models, parameter-efficient fine-tuning, image generation, diffusion transformer (dit), transfer learning |
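The parameter selection described in the DiffFit entry above (2304.06648) can be sketched as a name-based filter. This is only an illustration under stated assumptions: the parameter names, their sizes, the keyword list, and the helper functions are all hypothetical and are not taken from DiT or from the authors' code. Initialising each scaling factor to 1.0 is assumed here so that the frozen model's output is unchanged before training starts.

```python
# Hypothetical parameter inventory: name -> number of scalar values.
# Names and sizes are invented for illustration only.
PARAMS = {
    "blocks.0.attn.qkv.weight": 3 * 64 * 64,
    "blocks.0.attn.qkv.bias": 3 * 64,
    "blocks.0.norm1.weight": 64,
    "blocks.0.norm1.bias": 64,
    "blocks.0.mlp.fc1.weight": 64 * 256,
    "blocks.0.mlp.fc1.bias": 256,
    "blocks.0.gamma_attn": 64,  # newly added scaling factor
    "blocks.0.gamma_mlp": 64,   # newly added scaling factor
    "class_embed.weight": 10 * 64,
}

# Substrings that mark a parameter as trainable in this sketch:
# bias terms, normalization, scaling factors, and the class embedding.
TRAINABLE_KEYWORDS = ("bias", "norm", "gamma", "class_embed")


def is_trainable(name):
    """True if the parameter name matches any trainable keyword."""
    return any(key in name for key in TRAINABLE_KEYWORDS)


def split_parameters(params):
    """Split the inventory into (trainable, frozen) dictionaries."""
    trainable = {n: c for n, c in params.items() if is_trainable(n)}
    frozen = {n: c for n, c in params.items() if not is_trainable(n)}
    return trainable, frozen


def trainable_fraction(params):
    """Share of scalar values that would be updated and stored."""
    trainable, _ = split_parameters(params)
    return sum(trainable.values()) / sum(params.values())


def apply_scale(block_output, gamma):
    """Scale a block's output channel-wise; gamma of 1.0 is the identity."""
    return [g * x for g, x in zip(gamma, block_output)]
```

The fraction returned for this toy inventory is far larger than the roughly 0.12% quoted in the entry, because a real model's weight matrices dominate the parameter count much more strongly than these small examples do.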
2304.06544
Report |
DNeRV: Modeling Inherent Dynamics via Difference Neural Representation for Videos |
Qi Zhao, M. Salman Asif, Zhan Ma |
Existing implicit neural representation (INR) methods do not fully exploit
spatiotemporal redundancies in videos. Index-based INRs ignore the
content-specific spatial features and hybrid INRs ignore the contextual
dependency on adjacent frames, leading to poor modeling capability for scenes
with large motion or dynamics. We analyze this limitation from the perspective
of function fitting and reveal the importance of frame difference. To use
explicit motion information, we propose Difference Neural Representation for
Videos (DNeRV), which consists of two streams for content and frame difference.
We also introduce a collaborative content unit for effective feature fusion. We
test DNeRV for video compression, inpainting, and interpolation. DNeRV achieves
competitive results against the state-of-the-art neural compression approaches
and outperforms existing implicit methods on downstream inpainting and
interpolation for $960 \times 1920$ videos. |
The paper proposes Difference Neural Representation for Videos (DNeRV), a novel implicit neural representation method that leverages frame differences to improve video representation, particularly in scenes with large motion or dynamic elements. |
Existing NeRV methods struggle to effectively model content-specific spatial features and temporal correlations simultaneously, leading to poor performance in videos with significant motion. |
DNeRV employs a two-stream architecture, processing both the original frame (content stream) and frame differences (diff stream). A novel Collaborative Content Unit (CCU) fuses features from both streams adaptively, enhancing the representation's ability to capture adjacent dynamics. |
DNeRV outperforms existing NeRV methods on video regression tasks for benchmark datasets like Bunny and UVG.
It demonstrates superior performance in downstream tasks, including video compression, interpolation, and inpainting, compared to other implicit methods, particularly for high-resolution videos.
The incorporation of the diff stream and CCU contributes to more robust and efficient learning of the implicit mapping in videos with large motion. |
DNeRV might face challenges in accurately representing detailed textures due to the nature of frame differences.
Future work includes exploring higher-order frame differences and extending DNeRV for specific video-related tasks. |
implicit neural representation, video representation learning, video interpolation, video inpainting, video compression |
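The two-stream input in the DNeRV entry above (2304.06544) can be sketched as follows. This is a simplified illustration, not the authors' pipeline: frames are assumed to be nested lists of grey values, each interior frame is paired with its backward and forward differences, and boundary frames are simply skipped. The function names are hypothetical, and the learned encoders and the Collaborative Content Unit are not modelled.

```python
def frame_difference(a, b):
    """Element-wise a - b for two frames given as lists of rows."""
    return [[pa - pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]


def two_stream_inputs(frames):
    """Pair every interior frame with its adjacent frame differences.

    Returns one dict per interior frame:
      "content": the frame itself (input to the content stream)
      "diff":    (frame[t] - frame[t-1], frame[t+1] - frame[t])
                 (input to the diff stream)
    """
    inputs = []
    for t in range(1, len(frames) - 1):
        backward = frame_difference(frames[t], frames[t - 1])
        forward = frame_difference(frames[t + 1], frames[t])
        inputs.append({"content": frames[t], "diff": (backward, forward)})
    return inputs
```

For a static scene both differences are all zeros, so the diff stream carries information only where something moves, which matches the entry's motivation of modelling scenes with large motion.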
2304.06461
Report |
Multi-Mode Online Knowledge Distillation for Self-Supervised Visual Representation Learning |
Kaiyou Song, Jin Xie, Shan Zhang, Zimeng Luo |
Self-supervised learning (SSL) has made remarkable progress in visual
representation learning. Some studies combine SSL with knowledge distillation
(SSL-KD) to boost the representation learning performance of small models. In
this study, we propose a Multi-mode Online Knowledge Distillation method (MOKD)
to boost self-supervised visual representation learning. Different from
existing SSL-KD methods that transfer knowledge from a static pre-trained
teacher to a student, in MOKD, two different models learn collaboratively in a
self-supervised manner. Specifically, MOKD consists of two distillation modes:
self-distillation and cross-distillation modes. Among them, self-distillation
performs self-supervised learning for each model independently, while
cross-distillation realizes knowledge interaction between different models. In
cross-distillation, a cross-attention feature search strategy is proposed to
enhance the semantic feature alignment between different models. As a result,
the two models can absorb knowledge from each other to boost their
representation learning performance. Extensive experimental results on
different backbones and datasets demonstrate that two heterogeneous models can
benefit from MOKD and outperform their independently trained baseline. In
addition, MOKD also outperforms existing SSL-KD methods for both the student
and teacher models. |
This paper proposes MOKD, a Multi-mode Online Knowledge Distillation method for boosting self-supervised visual representation learning. |
Existing SSL-KD methods use a static teacher, limiting the student's learning potential. MOKD enables collaborative learning between two models, potentially boosting both. |
MOKD uses self-distillation (individual contrastive learning) and cross-distillation (knowledge transfer between models) with a cross-attention feature search strategy for semantic alignment. |
MOKD significantly improves representation learning in both heterogeneous (ResNet-ViT) and homogeneous (two ResNets or two ViTs) model pairs.
MOKD outperforms state-of-the-art SSL methods in linear probing and k-NN evaluations on ImageNet.
Heterogeneous models trained with MOKD exhibit knowledge transfer, with ViT becoming more locally focused and ResNet becoming more globally focused. |
Training larger models repeatedly for different smaller models increases computation cost compared to offline distillation.
Future work can explore efficient fine-tuning methods to improve MOKD's efficiency. |
self-supervised learning (ssl), knowledge distillation, contrastive learning, representation learning, computer vision |
2304.06440
Report |
Zoom-VQA: Patches, Frames and Clips Integration for Video Quality Assessment |
Kai Zhao, Kun Yuan, Ming Sun, Xing Wen |
Video quality assessment (VQA) aims to simulate the human perception of video
quality, which is influenced by factors ranging from low-level color and
texture details to high-level semantic content. To effectively model these
complicated quality-related factors, in this paper, we decompose video into
three levels (i.e., patch level, frame level, and clip level), and propose a
novel Zoom-VQA architecture to perceive spatio-temporal features at different
levels. It integrates three components: patch attention module, frame pyramid
alignment, and clip ensemble strategy, respectively for capturing
region-of-interest in the spatial dimension, multi-level information at
different feature levels, and distortions distributed over the temporal
dimension. Owing to the comprehensive design, Zoom-VQA obtains state-of-the-art
results on four VQA benchmarks and achieves 2nd place in the NTIRE 2023 VQA
challenge. Notably, Zoom-VQA has outperformed the previous best results on two
subsets of LSVQ, achieving 0.8860 (+1.0%) and 0.7985 (+1.9%) of SRCC on the
respective subsets. Adequate ablation studies further verify the effectiveness
of each component. Codes and models are released in
https://github.com/k-zha14/Zoom-VQA. |
This paper proposes Zoom-VQA, a novel video quality assessment (VQA) framework that integrates information from patches, frames, and clips to better model human perception of video quality. |
Accurately assessing video quality is crucial for optimizing user experience on streaming platforms, especially with the increasing use of AI-based video enhancement techniques that introduce new types of artifacts. |
Zoom-VQA consists of two branches: an image-based branch (IQA) for global information and a clip-based branch (VQA) for local texture information. The IQA branch utilizes a patch attention module and frame pyramid alignment to capture spatial details at multiple feature levels. The VQA branch leverages a clip ensemble strategy and patch head expansion to model temporal dynamics and low-level texture information effectively. |
Zoom-VQA achieves state-of-the-art results on four VQA benchmarks (VDPVE, LSVQ, KoNViD-1k, and LIVE-VQC).
It outperforms previous best methods on LSVQ subsets, achieving 0.8860 SRCC on LSVQ_test (+1.0%) and 0.7985 SRCC on LSVQ_1080p (+1.9%).
Zoom-VQA secured 2nd place in the NTIRE 2023 VQA Challenge, demonstrating its strong generalization ability. |
The current implementation primarily focuses on No-Reference VQA; incorporating reference information could further enhance performance.
Exploring the impact of different fragment sizes and sampling strategies in the VQA branch is an area for future investigation. |
video quality assessment, deep learning, vision transformer, multi-level feature fusion, spatio-temporal analysis |
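The SRCC figures quoted in the Zoom-VQA entry above (2304.06440) are Spearman rank correlations between predicted and subjective quality scores. A minimal pure-Python sketch of the metric follows, assuming the standard definition (Pearson correlation of average ranks). The function names are hypothetical and this is not the challenge's evaluation code.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def srcc(predicted, subjective):
    """Spearman rank correlation between model scores and opinion scores."""
    return pearson(average_ranks(predicted), average_ranks(subjective))
```

Because only the ordering matters, a model can reach a high SRCC even if its scores are on a different scale from the subjective ratings.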
2304.06419
Report |
Tracking by 3D Model Estimation of Unknown Objects in Videos |
Denys Rozumnyi, Jiri Matas, Marc Pollefeys, Vittorio Ferrari, Martin R. Oswald |
Most model-free visual object tracking methods formulate the tracking task as
object location estimation given by a 2D segmentation or a bounding box in each
video frame. We argue that this representation is limited and instead propose
to guide and improve 2D tracking with an explicit object representation, namely
the textured 3D shape and 6DoF pose in each video frame. Our representation
tackles a complex long-term dense correspondence problem between all 3D points
on the object for all video frames, including frames where some points are
invisible. To achieve that, the estimation is driven by re-rendering the input
video frames as well as possible through differentiable rendering, which has
not been used for tracking before. The proposed optimization minimizes a novel
loss function to estimate the best 3D shape, texture, and 6DoF pose. We improve
the state-of-the-art in 2D segmentation tracking on three different datasets
with mostly rigid objects. |
This paper introduces a novel model-free object tracking method that goes beyond 2D segmentation by jointly estimating the 3D shape, texture, and 6DoF pose of unknown, rigid objects in videos. |
This approach provides a richer object representation compared to standard 2D trackers, enabling applications like augmented reality and object manipulation. |
The method leverages differentiable rendering to optimize the object parameters for accurately reconstructing the input video frames, guided by initial 2D segmentations from a standard tracker. A keyframe selection strategy ensures efficient optimization over long sequences. |
The method outperforms state-of-the-art 2D trackers in segmentation accuracy on datasets featuring rigid objects.
It demonstrates robustness to challenging scenarios, including object rotations and illumination changes, especially when using robust features like S2DNet.
Despite not requiring a pre-defined 3D model, the method achieves competitive 6DoF pose estimation results on the TUD-L benchmark. |
The current implementation relies on the assumption of object rigidity, limiting its applicability to certain scenarios.
The method's runtime can be further optimized for real-time performance. |
object tracking, 3d reconstruction, differentiable rendering, 6dof pose estimation, deep surface texture |
2304.06408
Report |
Intriguing properties of synthetic images: from generative adversarial networks to diffusion models |
Riccardo Corvi, Davide Cozzolino, Giovanni Poggi, Koki Nagano, Luisa Verdoliva |
Detecting fake images is becoming a major goal of computer vision. This need
is becoming more and more pressing with the continuous improvement of synthesis
methods based on Generative Adversarial Networks (GAN), and even more with the
appearance of powerful methods based on Diffusion Models (DM). Towards this
end, it is important to gain insight into which image features better
discriminate fake images from real ones. In this paper we report on our
systematic study of a large number of image generators of different families,
aimed at discovering the most forensically relevant characteristics of real and
generated images. Our experiments provide a number of interesting observations
and shed light on some intriguing properties of synthetic images: (1) not only
the GAN models but also the DM and VQ-GAN (Vector Quantized Generative
Adversarial Networks) models give rise to visible artifacts in the Fourier
domain and exhibit anomalous regular patterns in the autocorrelation; (2) when
the dataset used to train the model lacks sufficient variety, its biases can be
transferred to the generated images; (3) synthetic and real images exhibit
significant differences in the mid-high frequency signal content, observable in
their radial and angular spectral power distributions. |
This paper presents a systematic investigation into the traces left by various generative models, including GANs, VQ-GANs, and diffusion models, in synthetic images by examining their second-order statistics in both spatial and frequency domains. |
With the increasing sophistication of synthetic image generators, it becomes crucial to understand the characteristic features that distinguish them from real images for developing robust forensic detectors. |
The study analyzes a large dataset of synthetic images generated using various models, alongside real images. The analysis focuses on autocorrelation functions for spatial domain analysis and power spectra, including radial and angular spectra, for frequency domain analysis. |
All examined image generators, even the most sophisticated ones, introduce specific artifacts detectable in the spatial or frequency domain.
The training dataset used for a generative model can significantly bias the generated images, transferring artifacts present in the training data to the synthetic images.
Generative models often struggle to accurately reproduce the spectral distribution of real images at mid-high frequencies, leading to discrepancies in radial and angular spectra. |
The study primarily focuses on analyzing images 'in the lab,' without considering the impact of post-processing operations commonly applied to images in real-world scenarios.
The analysis relies on a limited number of datasets for training and evaluation, potentially limiting the generalizability of findings to other datasets. |
synthetic image detection, generative adversarial networks (gans), diffusion models, image forensics, frequency analysis |
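The frequency- and spatial-domain analysis summarized in the record above can be illustrated with a minimal NumPy sketch. It is not the authors' code: the function names, the bin count, and the choice of circular autocorrelation are assumptions made for illustration. It computes an azimuthally averaged (radial) power spectrum and a normalized autocorrelation map for a grayscale image.

```python
import numpy as np


def radial_power_spectrum(image, n_bins=32):
    """Average the 2D power spectrum over rings of constant spatial frequency."""
    image = np.asarray(image, dtype=np.float64)
    # Subtract the mean so the DC component does not dominate the spectrum.
    power = np.abs(np.fft.fft2(image - image.mean())) ** 2

    h, w = image.shape
    # Frequencies in cycles per pixel, laid out in the same order as fft2's output.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2).ravel()

    # Keep frequencies up to Nyquist (0.5) and assign each one to a ring.
    edges = np.linspace(0.0, 0.5, n_bins + 1)
    keep = radius <= 0.5
    ring = np.minimum(np.digitize(radius[keep], edges) - 1, n_bins - 1)

    sums = np.bincount(ring, weights=power.ravel()[keep], minlength=n_bins)
    counts = np.bincount(ring, minlength=n_bins)
    return edges[:-1], sums / np.maximum(counts, 1)


def autocorrelation(image):
    """Normalized circular autocorrelation via the Wiener-Khinchin theorem."""
    image = np.asarray(image, dtype=np.float64)
    power = np.abs(np.fft.fft2(image - image.mean())) ** 2
    acf = np.fft.ifft2(power).real
    # Zero lag sits at index [0, 0]; normalize by it, then centre the map.
    return np.fft.fftshift(acf / acf[0, 0])


# Toy input: vertical stripes with a period of 4 pixels (0.25 cycles/pixel).
x = np.arange(64)
stripes = np.tile(np.cos(2 * np.pi * 0.25 * x), (64, 1))
freqs, spectrum = radial_power_spectrum(stripes)
peak_frequency = freqs[np.argmax(spectrum)]
acf = autocorrelation(stripes)
```

A regular pattern such as the stripes shows up as a sharp peak in the radial spectrum and as periodic ridges in the autocorrelation map, which is the kind of trace the study looks for in generated images. A constant image would divide by zero in `autocorrelation`, so a real pipeline would need to guard against that case.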
2304.06345
Report |
ASR: Attention-alike Structural Re-parameterization |
Shanshan Zhong, Zhongzhan Huang, Wushao Wen, Jinghui Qin, Liang Lin |
The structural re-parameterization (SRP) technique is a novel deep learning
technique that achieves interconversion between different network architectures
through equivalent parameter transformations. This technique enables the
mitigation of the extra costs for performance improvement during training, such
as parameter size and inference time, through these transformations during
inference, and therefore SRP has great potential for industrial and practical
applications. The existing SRP methods have successfully considered many
commonly used architectures, such as normalizations, pooling methods, and
multi-branch convolution. However, the widely used attention modules which
drastically slow inference speed cannot be directly implemented by SRP because
these modules usually act on the backbone network in a multiplicative manner
and the modules' output is input-dependent during inference, which limits the
application scenarios of SRP. In this paper, we conduct extensive experiments
from a statistical perspective and discover an interesting phenomenon, the Stripe
Observation, which reveals that channel attention values quickly approach some
constant vectors during training. This observation inspires us to propose a
simple-yet-effective attention-alike structural re-parameterization (ASR) that
allows us to achieve SRP for a given network while enjoying the effectiveness
of the attention mechanism. Extensive experiments conducted on several standard
benchmarks demonstrate the effectiveness of ASR in generally improving the
performance of existing backbone networks, attention modules, and SRP methods
without any elaborated model crafting. We also analyze the limitations and
provide experimental and theoretical evidence for the strong robustness of the
proposed ASR. |
This paper introduces Attention-alike Structural Re-parameterization (ASR), a novel method enabling the integration of channel attention mechanisms into Structural Re-parameterization (SRP) techniques for deep learning models. |
Existing SRP methods struggle to incorporate attention modules due to their multiplicative and input-dependent nature, limiting the application of SRP despite its potential for improving model performance. ASR addresses this challenge, allowing the benefits of attention without extra parameters or computational cost during inference. |
Inspired by the observed "Stripe Observation" where channel attention values converge to constant vectors during training, ASR uses a learnable vector as input for the attention module, enabling its merging into the backbone during inference without impacting performance. |
ASR consistently improves performance across various backbone models (ResNet, VGG, ShuffleNetV2, MobileNet, ViT) and datasets (ImageNet, STL10, CIFAR10/100), with accuracy improvements up to 2.77%.
ASR demonstrates strong compatibility with existing attention modules, further enhancing the performance of models already employing attention.
ASR proves compatible with other SRP methods, such as RepVGG and ACNet, showcasing its versatility and potential for integration with existing model optimization techniques. |
ASR's current formulation primarily focuses on channel attention and doesn't directly transfer to spatial or transformer-based attention mechanisms.
While validated in classification tasks, ASR's effectiveness in more complex downstream tasks requires further investigation, potentially through designing specialized attention modules within the ASR paradigm. |
structural re-parameterization, attention mechanism, deep learning, model compression, computer vision |
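The merging step described for ASR in the record above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: a single linear map stands in for the backbone layer, the attention module is an SE-style two-layer MLP, and all weights (`W`, `b`, `W1`, `W2`, `psi`) are hypothetical random values. The point is only that once the attention input is a constant vector, its output is constant too and can be folded into the layer's weights.

```python
import numpy as np

rng = np.random.default_rng(0)
c_in, c_out, reduction = 8, 16, 4

# Backbone layer (assumed): y = W x + b, e.g. a 1x1 convolution at one position.
W = rng.normal(size=(c_out, c_in))
b = rng.normal(size=c_out)

# SE-style channel attention MLP (assumed form) with hypothetical weights.
W1 = rng.normal(size=(c_out // reduction, c_out))
W2 = rng.normal(size=(c_out, c_out // reduction))
psi = rng.normal(size=c_out)  # learnable constant vector fed to the attention module


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def attention(v):
    """Channel attention weights in (0, 1), one per output channel."""
    return sigmoid(W2 @ np.maximum(W1 @ v, 0.0))


def forward_train(x):
    """Training-time form: attention on the constant psi scales the layer output."""
    return attention(psi) * (W @ x + b)


# Inference-time re-parameterization: the attention output no longer depends
# on x, so it can be multiplied into the weights and bias once, offline.
a = attention(psi)
W_merged = a[:, None] * W
b_merged = a * b


def forward_inference(x):
    """Merged form: a plain linear layer with no attention module attached."""
    return W_merged @ x + b_merged


x = rng.normal(size=c_in)
max_gap = np.max(np.abs(forward_train(x) - forward_inference(x)))
```

The two forward functions agree up to floating-point error, so the attention branch adds no parameters or compute at inference in this toy setting.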
2304.06247
Report |
ShapeClipper: Scalable 3D Shape Learning from Single-View Images via Geometric and CLIP-based Consistency |
Zixuan Huang, Varun Jampani, Anh Thai, Yuanzhen Li, Stefan Stojanov, James M. Rehg |
We present ShapeClipper, a novel method that reconstructs 3D object shapes
from real-world single-view RGB images. Instead of relying on laborious 3D,
multi-view or camera pose annotation, ShapeClipper learns shape reconstruction
from a set of single-view segmented images. The key idea is to facilitate shape
learning via CLIP-based shape consistency, where we encourage objects with
similar CLIP encodings to share similar shapes. We also leverage off-the-shelf
normals as an additional geometric constraint so the model can learn better
bottom-up reasoning of detailed surface geometry. These two novel consistency
constraints, when used to regularize our model, improve its ability to learn
both global shape structure and local geometric details. We evaluate our method
over three challenging real-world datasets, Pix3D, Pascal3D+, and OpenImages,
where we achieve superior performance over state-of-the-art methods. |
ShapeClipper is a novel method for reconstructing 3D object shapes from single-view RGB images without relying on 3D, multi-view, or viewpoint supervision. |
Existing methods for 3D shape reconstruction rely on laborious annotations that are not scalable to real-world scenarios. |
ShapeClipper leverages CLIP-based shape consistency to encourage objects with similar CLIP encodings to share similar shapes. It also uses off-the-shelf surface normals as additional geometric constraints to improve bottom-up reasoning of surface details. |
ShapeClipper achieves superior performance over state-of-the-art methods on Pix3D, Pascal3D+, and OpenImages datasets.
CLIP-based shape consistency effectively improves top-down reasoning of global shape structure.
Geometric constraints using off-the-shelf normals enhance the reconstruction of local geometric details. |
ShapeClipper struggles with heavily occluded or deformable object categories.
Future work could explore explicitly handling shape misalignment in the semantic constraint. |
3d reconstruction, single-view reconstruction, clip, shape consistency, geometric constraints |
2304.06212
Report |
[CLS] Token is All You Need for Zero-Shot Semantic Segmentation |
Letian Wu, Wenyao Zhang, Tengping Jiang, Wankou Yang, Xin Jin, Wenjun Zeng |
In this paper, we propose an embarrassingly simple yet highly effective
zero-shot semantic segmentation (ZS3) method, based on the pre-trained
vision-language model CLIP. First, our study provides a couple of key
discoveries: (i) the global tokens (a.k.a [CLS] tokens in Transformer) of the
text branch in CLIP provide a powerful representation of semantic information
and (ii) these text-side [CLS] tokens can be regarded as category priors to
guide the CLIP visual encoder to pay more attention to the corresponding region of
interest. Based on that, we build upon the CLIP model as a backbone which we
extend with a One-Way [CLS] token navigation from text to the visual branch
that enables zero-shot dense prediction, dubbed ClsCLIP. Specifically,
we use the [CLS] token output from the text branch, as an auxiliary semantic
prompt, to replace the [CLS] token in shallow layers of the ViT-based visual
encoder. This one-way navigation embeds such global category prior earlier and
thus promotes semantic segmentation. Furthermore, to better segment tiny
objects in ZS3, we enhance ClsCLIP with a local zoom-in strategy, which
employs region proposal pre-processing, yielding ClsCLIP+. Extensive
experiments demonstrate that our proposed ZS3 method achieves SOTA
performance and is even comparable with few-shot semantic
segmentation methods. |
This paper presents ClsCLIP, a simple yet effective zero-shot semantic segmentation method that leverages the pre-trained vision-language model CLIP. |
Zero-shot semantic segmentation (ZS3) is important because it allows for the segmentation of novel, unseen categories without requiring any annotations, which is crucial for real-world applications where obtaining annotations for every possible category is infeasible. |
ClsCLIP extends CLIP for ZS3 by replacing the [CLS] token in the shallow layers of the ViT-based visual encoder with the text-side [CLS] token. This one-way navigation embeds global category priors, guiding the visual encoder to focus on relevant regions for segmentation. Furthermore, ClsCLIP+ enhances this by incorporating a region proposal pre-processing step using YOLO to provide object location priors, addressing the issue of missing tiny objects. |
ClsCLIP significantly outperforms other state-of-the-art ZS3 methods on PASCAL-5^i and COCO-20^i datasets.
The one-way [CLS] token navigation effectively guides the visual encoder to focus on regions of interest, resulting in improved segmentation performance.
ClsCLIP+, enhanced with region proposals, further improves performance, especially for segmenting tiny objects, and even surpasses some few-shot semantic segmentation methods. |
The performance of ClsCLIP+ is contingent on the accuracy of the region proposal generator.
Future work could explore alternative region proposal methods or develop end-to-end trainable approaches that jointly optimize region proposal and segmentation. |
zero-shot semantic segmentation, vision-language models, clip, prompt learning, tiny object segmentation |
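The one-way [CLS] token navigation described in the record above can be sketched with a toy encoder. Everything here is an assumption for illustration: `toy_block` is a stand-in for a real ViT block, the text-side token is assumed to already have the visual encoder's width, and overwriting the token at the input of each of the first `n_shallow` blocks is one plausible reading of "replace the [CLS] token in shallow layers", not a confirmed detail of ClsCLIP.

```python
import numpy as np


def toy_block(tokens, weight):
    """Stand-in for a transformer block: one self-attention step plus residual."""
    scores = (tokens @ weight) @ tokens.T / np.sqrt(tokens.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)
    return tokens + attn @ tokens


def encode(patch_tokens, visual_cls, block_weights, text_cls=None, n_shallow=0):
    """Run the toy encoder; optionally overwrite [CLS] in the first n_shallow blocks."""
    tokens = np.vstack([visual_cls[None, :], patch_tokens])  # row 0 is [CLS]
    for i, weight in enumerate(block_weights):
        if text_cls is not None and i < n_shallow:
            tokens = tokens.copy()
            tokens[0] = text_cls  # one-way navigation: text branch -> visual branch
        tokens = toy_block(tokens, weight)
    return tokens


rng = np.random.default_rng(0)
dim, n_patches, n_blocks = 16, 9, 4
patches = rng.normal(size=(n_patches, dim))
visual_cls = rng.normal(size=dim)
text_cls = rng.normal(size=dim)  # hypothetical text-side [CLS] for one category
weights = [0.1 * rng.normal(size=(dim, dim)) for _ in range(n_blocks)]

plain = encode(patches, visual_cls, weights)
guided = encode(patches, visual_cls, weights, text_cls=text_cls, n_shallow=2)
```

Because the patch tokens attend to the replaced token in the shallow blocks, the category prior changes every patch representation downstream, while nothing flows back to the text branch.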
2304.06211
Report |
Boosting Video Object Segmentation via Space-time Correspondence Learning |
Yurong Zhang, Liulei Li, Wenguan Wang, Rong Xie, Li Song, Wenjun Zhang |
Current top-leading solutions for video object segmentation (VOS) typically
follow a matching-based regime: for each query frame, the segmentation mask is
inferred according to its correspondence to previously processed and the first
annotated frames. They simply exploit the supervisory signals from the
groundtruth masks for learning mask prediction only, without posing any
constraint on the space-time correspondence matching, which, however, is the
fundamental building block of such regime. To alleviate this crucial yet
commonly ignored issue, we devise a correspondence-aware training framework,
which boosts matching-based VOS solutions by explicitly encouraging robust
correspondence matching during network learning. Through comprehensively
exploring the intrinsic coherence in videos on pixel and object levels, our
algorithm reinforces the standard, fully supervised training of mask
segmentation with label-free, contrastive correspondence learning. Without
requiring extra annotation cost during training, causing speed delay during
deployment, or incurring architectural modification, our algorithm provides
solid performance gains on four widely used benchmarks, i.e., DAVIS2016&2017
and YouTube-VOS2018&2019, on top of well-known matching-based VOS solutions. |
This paper presents a novel training framework for Video Object Segmentation (VOS) that boosts the performance of matching-based methods by explicitly encouraging robust space-time correspondence learning during training. |
Existing matching-based VOS methods rely heavily on accurate correspondence matching between frames but lack explicit supervision for this crucial component, potentially leading to sub-optimal results. |
The proposed method leverages the intrinsic coherence of videos on both pixel and object levels. It introduces self-supervised contrastive learning objectives, enforcing pixel-level consistency and object-level coherence without requiring additional annotations. |
The framework significantly improves the performance of state-of-the-art matching-based VOS methods (STCN and XMem) on DAVIS and YouTube-VOS datasets.
Ablation studies demonstrate the effectiveness of both pixel-level and object-level correspondence learning components.
The method introduces minimal computational overhead during training and doesn't affect inference time. |
The current method mainly explores local consistency within consecutive frames, leaving long-term consistency as future work.
The algorithm relies on an external memory module for storing past frames, which can be potentially improved for better efficiency and scalability. |
video object segmentation, correspondence learning, self-supervised learning, contrastive learning, computer vision |
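The label-free contrastive correspondence objective mentioned in the record above can be illustrated with a generic InfoNCE loss over pixel embeddings. The record does not give the paper's exact loss, so this is a standard formulation used as a stand-in; the function names, the temperature, and the way positives are indexed are assumptions.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def pixel_info_nce(query, keys, positive, temperature=0.1):
    """Generic InfoNCE loss.

    query:    (N, D) pixel embeddings from one frame
    keys:     (M, D) pixel embeddings from another frame
    positive: (N,) index into keys of the pixel matching each query
    """
    logits = l2_normalize(query) @ l2_normalize(keys).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(query)), positive].mean()


rng = np.random.default_rng(0)
keys = rng.normal(size=(6, 8))                     # toy embeddings, frame t+1
query = keys + 0.01 * rng.normal(size=keys.shape)  # same pixels, frame t
correct = np.arange(6)
wrong = np.roll(correct, 1)

loss_correct = pixel_info_nce(query, keys, correct)
loss_wrong = pixel_info_nce(query, keys, wrong)
```

The loss is low when each query is most similar to its true correspondence and high otherwise, which is the pressure that encourages robust matching during training. The positives here come from a toy construction; in practice they would be derived from the video's own temporal coherence.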
2304.06140
Report |
An Edit Friendly DDPM Noise Space: Inversion and Manipulations |
Inbar Huberman-Spiegelglas, Vladimir Kulikov, Tomer Michaeli |
Denoising diffusion probabilistic models (DDPMs) employ a sequence of white
Gaussian noise samples to generate an image. In analogy with GANs, those noise
maps could be considered as the latent code associated with the generated
image. However, this native noise space does not possess a convenient
structure, and is thus challenging to work with in editing tasks. Here, we
propose an alternative latent noise space for DDPM that enables a wide range of
editing operations via simple means, and present an inversion method for
extracting these edit-friendly noise maps for any given image (real or
synthetically generated). As opposed to the native DDPM noise space, the
edit-friendly noise maps do not have a standard normal distribution and are not
statistically independent across timesteps. However, they allow perfect
reconstruction of any desired image, and simple transformations on them
translate into meaningful manipulations of the output image (e.g. shifting,
color edits). Moreover, in text-conditional models, fixing those noise maps
while changing the text prompt modifies semantics while retaining structure.
We illustrate how this property enables text-based editing of real images via
the diverse DDPM sampling scheme (in contrast to the popular non-diverse DDIM
inversion). We also show how it can be used within existing diffusion-based
editing methods to improve their quality and diversity. Webpage:
https://inbarhub.github.io/DDPM_inversion |
This paper introduces a novel method for inverting denoising diffusion probabilistic models (DDPMs) by extracting edit-friendly noise maps that allow for diverse image editing. |
The native noise space in DDPMs is not conducive to intuitive edits. This new method enables meaningful image manipulations by working with an alternative, edit-friendly latent noise space. |
The authors propose an optimization-based inversion approach to extract edit-friendly noise maps for any given image. This is achieved by minimizing the difference between the generated and target images across multiple timesteps in the DDPM sampling process. |
The extracted noise maps allow perfect reconstruction of the input image.
Simple transformations on the noise maps translate to semantically meaningful edits in the output image (e.g., shifting, color adjustments).
By fixing the noise maps and changing the text prompt in text-conditional models, edits can be applied while preserving image structure. |
The method relies on an optimization process during inversion, which can be computationally expensive.
Future work could explore more efficient inversion techniques and expand the range of editable image properties. |
ddpm, diffusion models, image editing, latent space manipulation, text-guided image manipulation |
2304.06107
Report |
PATMAT: Person Aware Tuning of Mask-Aware Transformer for Face Inpainting |
Saman Motamed, Jianjin Xu, Chen Henry Wu, Fernando De la Torre |
Generative models such as StyleGAN2 and Stable Diffusion have achieved
state-of-the-art performance in computer vision tasks such as image synthesis,
inpainting, and de-noising. However, current generative models for face
inpainting often fail to preserve fine facial details and the identity of the
person, despite creating aesthetically convincing image structures and
textures. In this work, we propose Person Aware Tuning (PAT) of Mask-Aware
Transformer (MAT) for face inpainting, which addresses this issue. Our proposed
method, PATMAT, effectively preserves identity by incorporating reference
images of a subject and fine-tuning a MAT architecture trained on faces. By
using ~40 reference images, PATMAT creates anchor points in MAT's style module,
and tunes the model using the fixed anchors to adapt the model to a new face
identity. Moreover, PATMAT's use of multiple images per anchor during training
allows the model to use fewer reference images than competing methods. We
demonstrate that PATMAT outperforms state-of-the-art models in terms of image
quality, the preservation of person-specific details, and the identity of the
subject. Our results suggest that PATMAT can be a promising approach for
improving the quality of personalized face inpainting. |
This paper presents PATMAT, a personalized face inpainting method that fine-tunes a pre-trained Mask-Aware Transformer (MAT) using a few reference images to preserve identity. |
Existing face inpainting methods struggle to preserve fine facial details and identity, which is crucial for applications like security, entertainment, and photo restoration. |
PATMAT utilizes a pre-trained MAT and conditions its style manipulation module by defining anchors in the noise-style space, inspired by Pivot Tuning. It uses multiple images per anchor and introduces regularization to prevent overfitting. |
PATMAT outperforms state-of-the-art models in image quality and identity preservation with limited reference images.
A user study confirmed PATMAT-C (multiple images per anchor) preserves identity better than PATMAT-S (single image per anchor).
Human judges struggled to distinguish PATMAT-C's inpainted images from real images, demonstrating its high perceptual quality. |
PATMAT's performance depends on the diversity and coverage of the reference images. It may struggle with poses, accessories, and lighting conditions not present in the training data.
The method relies on manual data separation for grouping images with distinct features like glasses and lighting. Automating this step could improve the approach. |
face inpainting, identity preservation, mask-aware transformer, person aware tuning, style manipulation |
2304.06061
Report |
CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes |
Maria Parelli, Alexandros Delitzas, Nikolas Hars, Georgios Vlassis, Sotirios Anagnostidis, Gregor Bachmann, Thomas Hofmann |
Training models to apply linguistic knowledge and visual concepts from 2D
images to 3D world understanding is a promising direction that researchers have
only recently started to explore. In this work, we design a novel 3D
pre-training Vision-Language method that helps a model learn semantically
meaningful and transferable 3D scene point cloud representations. We inject the
representational power of the popular CLIP model into our 3D encoder by
aligning the encoded 3D scene features with the corresponding 2D image and text
embeddings produced by CLIP. To assess our model's 3D world reasoning
capability, we evaluate it on the downstream task of 3D Visual Question
Answering. Experimental quantitative and qualitative results show that our
pre-training method outperforms state-of-the-art works in this task and leads
to an interpretable representation of 3D scene features. |
This paper introduces a novel vision-language pre-training method for 3D Question Answering that aligns 3D scene features with corresponding 2D image and text embeddings from CLIP. |
This work addresses the gap in pre-training methods for 3D Question Answering that leverage both visual and linguistic modalities, aiming to improve 3D scene understanding. |
The authors design a 3D scene encoder based on VoteNet and a transformer and pre-train it by minimizing the cosine distance between the scene embedding and corresponding CLIP text and image embeddings. |
The pre-trained scene encoder significantly improves the performance on the ScanQA 3D-VQA benchmark compared to training from scratch.
The proposed method outperforms the state-of-the-art on ScanQA, even without using multi-view image features.
Visualization of the learned scene features shows that semantically similar scenes cluster together, indicating a meaningful representation space. |
The pre-training currently uses only a single top-down view of the scene.
Further exploration of more complex question types and reasoning tasks in 3D scenes is needed. |
3d vision-language pre-training, 3d visual question answering, clip, scene understanding, multi-modal learning |
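The pre-training objective described in the record above, minimizing the cosine distance between a scene embedding and the corresponding CLIP text and image embeddings, can be written as a small sketch. The equal weighting of the two terms and the function names are assumptions; the embeddings here are plain vectors rather than outputs of real encoders.

```python
import numpy as np


def cosine_distance(a, b):
    """1 - cosine similarity; 0 for aligned vectors, 2 for opposite ones."""
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))


def alignment_loss(scene_emb, clip_text_emb, clip_image_emb):
    """Sum of cosine distances to the text and image targets (equal weights assumed)."""
    return cosine_distance(scene_emb, clip_text_emb) + cosine_distance(
        scene_emb, clip_image_emb
    )


scene = np.array([1.0, 0.0, 0.0])
aligned = alignment_loss(scene, 2.0 * scene, 0.5 * scene)
orthogonal = alignment_loss(scene, np.array([0.0, 1.0, 0.0]), np.array([0.0, 0.0, 1.0]))
```

Since cosine distance ignores vector length, the loss only pulls the direction of the 3D scene embedding toward the CLIP targets, which are kept fixed in this sketch.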
2304.06025
Report |
DreamPose: Fashion Image-to-Video Synthesis via Stable Diffusion |
Johanna Karras, Aleksander Holynski, Ting-Chun Wang, Ira Kemelmacher-Shlizerman |
We present DreamPose, a diffusion-based method for generating animated
fashion videos from still images. Given an image and a sequence of human body
poses, our method synthesizes a video containing both human and fabric motion.
To achieve this, we transform a pretrained text-to-image model (Stable
Diffusion) into a pose-and-image guided video synthesis model, using a novel
fine-tuning strategy, a set of architectural changes to support the added
conditioning signals, and techniques to encourage temporal consistency. We
fine-tune on a collection of fashion videos from the UBC Fashion dataset. We
evaluate our method on a variety of clothing styles and poses, and demonstrate
that our method produces state-of-the-art results on fashion video
animation. Video results are available on our project page. |
DreamPose is a novel diffusion-based method for animating still fashion images, generating photorealistic videos of people wearing diverse clothing styles in motion by leveraging a pretrained Stable Diffusion model. |
Fashion videos are more informative than static images but scarce. DreamPose addresses this by enabling the creation of realistic fashion videos from readily available still images, enhancing online shopping experiences and fashion content creation. |
DreamPose modifies Stable Diffusion by incorporating a split CLIP-VAE image encoder for detailed appearance conditioning and concatenating multi-pose representations for temporal consistency. A two-stage finetuning scheme, first on a fashion video dataset and then on specific subject images, enhances realism and identity preservation. |
DreamPose generates high-quality fashion videos with realistic fabric motion and diverse appearances.
Quantitative metrics demonstrate DreamPose outperforms state-of-the-art methods in image quality, temporal consistency, and identity preservation.
User studies confirm DreamPose generates more realistic and faithful animations compared to existing techniques. |
DreamPose may struggle with complex patterns and can produce occasional artifacts in challenging poses.
Future work includes improving computational efficiency, enhancing complex pattern fidelity, and exploring alternative conditioning signals like segmentation masks. |
image animation, diffusion models, fashion videos, stable diffusion, pose conditioning |
2304.06022
Report |
SAM Struggles in Concealed Scenes -- Empirical Study on "Segment Anything" |
Ge-Peng Ji, Deng-Ping Fan, Peng Xu, Ming-Ming Cheng, Bowen Zhou, Luc Van Gool |
Segmenting anything is a ground-breaking step toward artificial general
intelligence, and the Segment Anything Model (SAM) greatly fosters the
foundation models for computer vision. We could not be more excited to probe
the performance traits of SAM. In particular, exploring situations in which SAM
does not perform well is interesting. In this report, we choose three concealed
scenes, i.e., camouflaged animals, industrial defects, and medical lesions, to
evaluate SAM under unprompted settings. Our main observation is that SAM looks
unskilled in concealed scenes. |
This paper presents an empirical study on the Segment Anything Model (SAM), exploring its limitations in handling concealed scenes. |
Understanding the limitations of SAM, a groundbreaking model for image segmentation, is crucial to further improve its performance and guide future research in computer vision. |
The authors quantitatively evaluate SAM on camouflaged object segmentation benchmarks and qualitatively analyze its performance on concealed scenes like camouflaged animals, industrial defects, and medical lesions. |
SAM, while demonstrating improvement with larger model sizes, still lags behind state-of-the-art models in camouflaged object segmentation.
SAM struggles to segment objects concealed within similar backgrounds or those lacking clear boundaries in concealed scenes.
The lack of domain-specific knowledge, such as medical imaging, limits SAM's ability to accurately segment concealed lesions. |
The study primarily focuses on visual analysis and could benefit from more in-depth investigation into the model's internal representations.
Exploring techniques like incorporating prior knowledge or domain adaptation to improve SAM's performance in concealed scenes is a potential future direction. |
segment anything model, sam, concealed scene understanding, camouflaged object segmentation, medical image segmentation |
2304.06020
Report |
VidStyleODE: Disentangled Video Editing via StyleGAN and NeuralODEs |
Moayed Haji Ali, Andrew Bond, Tolga Birdal, Duygu Ceylan, Levent Karacan, Erkut Erdem, Aykut Erdem |
We propose VidStyleODE, a spatiotemporally continuous disentangled
Video representation based upon StyleGAN and
Neural-ODEs. Effective traversal of the latent space learned by
Generative Adversarial Networks (GANs) has been the basis for recent
breakthroughs in image editing. However, the applicability of such advancements
to the video domain has been hindered by the difficulty of representing and
controlling videos in the latent space of GANs. In particular, videos are
composed of content (i.e., appearance) and complex motion components that
require a special mechanism to disentangle and control. To achieve this,
VidStyleODE encodes the video content in a pre-trained StyleGAN $\mathcal{W}_+$
space and benefits from a latent ODE component to summarize the spatiotemporal
dynamics of the input video. Our novel continuous video generation process then
combines the two to generate high-quality and temporally consistent videos with
varying frame rates. We show that our proposed method enables a variety of
applications on real videos: text-guided appearance manipulation, motion
manipulation, image animation, and video interpolation and extrapolation.
Project website: https://cyberiada.github.io/VidStyleODE |
VidStyleODE, a novel method for disentangled video editing, leverages StyleGAN and Neural ODEs to achieve spatiotemporally consistent manipulation of video content and motion. |
Existing video editing techniques struggle to balance high-quality generation, content-motion disentanglement, and accurate manipulation. VidStyleODE addresses these challenges by learning a continuous video representation. |
VidStyleODE encodes video content into StyleGAN's latent space and captures motion dynamics with a latent ODE. It then predicts latent directions conditioned on external style cues (e.g., text) to manipulate the video while preserving temporal consistency. |
VidStyleODE achieves state-of-the-art results on text-guided video editing, outperforming baselines in terms of visual quality, temporal consistency, and manipulation accuracy.
It enables various applications, including image animation, video interpolation/extrapolation, and local motion transfer, by disentangling content and motion representations.
It introduces a novel CLIP-based temporal consistency loss that improves video quality and training stability compared to adversarial training. |
Generation quality is limited by the pre-trained StyleGAN generator, and test-time fine-tuning could improve results.
Future work could explore using second-order ODEs for enhanced dynamics representation and enable more sophisticated text-guided manipulation of local dynamics. |
video editing, stylegan, neural odes, text-guided manipulation, video generation |
2304.05868
Report |
Mesh2Tex: Generating Mesh Textures from Image Queries |
Alexey Bokhovkin, Shubham Tulsiani, Angela Dai |
Remarkable advances have been achieved recently in learning neural
representations that characterize object geometry, while generating textured
objects suitable for downstream applications and 3D rendering remains at an
early stage. In particular, reconstructing textured geometry from images of
real objects is a significant challenge -- reconstructed geometry is often
inexact, making realistic texturing a significant challenge. We present
Mesh2Tex, which learns a realistic object texture manifold from uncorrelated
collections of 3D object geometry and photorealistic RGB images, by leveraging
a hybrid mesh-neural-field texture representation. Our texture representation
enables compact encoding of high-resolution textures as a neural field in the
barycentric coordinate system of the mesh faces. The learned texture manifold
enables effective navigation to generate an object texture for a given 3D
object geometry that matches to an input RGB image, which maintains robustness
even under challenging real-world scenarios where the mesh geometry
approximates an inexact match to the underlying geometry in the RGB image.
Mesh2Tex can effectively generate realistic object textures for an object mesh
to match real image observations towards digitization of real environments,
significantly improving over previous state of the art. |
Mesh2Tex learns a texture manifold conditioned on mesh geometry, enabling high-resolution texture generation and image-based texture transfer. |
Generating textured 3D objects is important for various applications, but existing methods struggle with realistic texturing, especially from real images. |
The method uses a hybrid mesh-neural-field texture representation, trained adversarially with differentiable rendering to match real 2D images. |
Mesh2Tex outperforms the state of the art in unconditional texture generation (FID, KID).
It successfully transfers textures from single images, even with inexact geometry matches.
It demonstrates robustness to unknown image pose through NOC-guided patch-based optimization. |
Mesh2Tex lacks explicit semantic modeling and has no probabilistic sampling for occlusions.
Future work could incorporate semantic understanding and explore probabilistic texture generation. |
texture generation, texture transfer, neural fields, differentiable rendering, 3d shape analysis |
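The record above describes textures encoded as a field over the barycentric coordinate system of the mesh faces. The sketch below shows only the coordinate system itself: it computes barycentric coordinates of a point on a triangle and uses them to query a stand-in "field". In Mesh2Tex the field is a learned neural network; here plain interpolation of three hypothetical per-vertex colors takes its place.

```python
import numpy as np


def barycentric(p, a, b, c):
    """Barycentric coordinates (u, v, w) of point p with respect to triangle abc."""
    v0, v1, v2 = b - a, c - a, p - a
    d00, d01, d11 = v0 @ v0, v0 @ v1, v1 @ v1
    d20, d21 = v2 @ v0, v2 @ v1
    denom = d00 * d11 - d01 * d01  # zero only for degenerate triangles
    v = (d11 * d20 - d01 * d21) / denom
    w = (d00 * d21 - d01 * d20) / denom
    return np.array([1.0 - v - w, v, w])


def toy_texture_field(bary, vertex_colors):
    """Stand-in for a learned field: blend per-vertex colors by the coordinates."""
    return bary @ vertex_colors


# One mesh face and three hypothetical vertex colors (red, green, blue).
a = np.array([0.0, 0.0, 0.0])
b = np.array([1.0, 0.0, 0.0])
c = np.array([0.0, 1.0, 0.0])
colors = np.eye(3)

point = np.array([0.25, 0.25, 0.0])
coords = barycentric(point, a, b, c)
rgb = toy_texture_field(coords, colors)
```

Barycentric coordinates are defined per face and sum to one, so a field queried this way lives directly on the mesh surface and does not depend on a UV unwrapping.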
2304.05866
Report |
NoisyTwins: Class-Consistent and Diverse Image Generation through StyleGANs |
Harsh Rangwani, Lavish Bansal, Kartik Sharma, Tejan Karmali, Varun Jampani, R. Venkatesh Babu |
StyleGANs are at the forefront of controllable image generation as they
produce a latent space that is semantically disentangled, making it suitable
for image editing and manipulation. However, the performance of StyleGANs
severely degrades when trained via class-conditioning on large-scale
long-tailed datasets. We find that one reason for degradation is the collapse
of latents for each class in the $\mathcal{W}$ latent space. With NoisyTwins,
we first introduce an effective and inexpensive augmentation strategy for class
embeddings, which then decorrelates the latents based on self-supervision in
the $\mathcal{W}$ space. This decorrelation mitigates collapse, ensuring that
our method preserves intra-class diversity with class-consistency in image
generation. We show the effectiveness of our approach on large-scale real-world
long-tailed datasets of ImageNet-LT and iNaturalist 2019, where our method
outperforms other methods by $\sim 19\%$ on FID, establishing a new
state-of-the-art. |
This paper proposes NoisyTwins, a novel method to improve class-consistency and diversity in StyleGAN-generated images, particularly for long-tailed datasets. |
StyleGANs struggle with class-conditional generation on long-tailed datasets, often resulting in mode collapse (limited diversity) or class confusion. |
NoisyTwins introduces noise augmentation to class embeddings and uses a Barlow Twins-inspired loss to enforce invariance to these augmentations in the StyleGAN's W latent space. |
NoisyTwins achieves state-of-the-art FID scores on ImageNet-LT and iNaturalist 2019, outperforming previous methods by ~19%.
The method effectively mitigates both mode collapse and class confusion, generating diverse and class-consistent images even for tail classes.
NoisyTwins demonstrates strong performance in few-shot image generation scenarios, improving FID scores by 22.2% on average. |
The reliance on CLIP for evaluation might introduce biases from the CLIP model itself.
Exploring NoisyTwins for conditioning on more complex attributes beyond class labels is a potential area for improvement. |
generative adversarial networks, stylegan, long-tailed learning, image generation, class-conditional generation |
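The noise augmentation and decorrelation idea summarized in the NoisyTwins entry above can be illustrated with a minimal NumPy sketch. It assumes a Barlow Twins-style objective computed on two batches of latents produced from two independently noised copies of the same class embeddings. The toy linear `mapping`, the noise scale `sigma`, and the off-diagonal weight `lam` are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

def noisy_twins(class_emb, sigma, rng):
    """Return two independently noised copies of the class embeddings."""
    noise_a = rng.normal(0.0, sigma, size=class_emb.shape)
    noise_b = rng.normal(0.0, sigma, size=class_emb.shape)
    return class_emb + noise_a, class_emb + noise_b

def barlow_twins_loss(w_a, w_b, lam=0.005, eps=1e-8):
    """Barlow Twins-style loss on two (batch, dim) latent batches."""
    # Standardize every latent dimension over the batch.
    z_a = (w_a - w_a.mean(axis=0)) / (w_a.std(axis=0) + eps)
    z_b = (w_b - w_b.mean(axis=0)) / (w_b.std(axis=0) + eps)
    # Cross-correlation matrix between the two views, shape (dim, dim).
    c = z_a.T @ z_b / w_a.shape[0]
    diag = np.diag(c)
    on_diag = np.sum((1.0 - diag) ** 2)           # invariance to the noise
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)  # decorrelation across dimensions
    return float(on_diag + lam * off_diag)

rng = np.random.default_rng(0)
class_emb = rng.normal(size=(64, 16))   # hypothetical class embeddings for one batch
mapping = rng.normal(size=(16, 32))     # toy stand-in for the StyleGAN mapping network
emb_a, emb_b = noisy_twins(class_emb, sigma=0.1, rng=rng)
loss = barlow_twins_loss(emb_a @ mapping, emb_b @ mapping)
```

The diagonal term rewards latents that stay the same under embedding noise, while the off-diagonal term penalizes redundant latent dimensions, which is the mechanism the entry credits with mitigating collapse.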
2304.05818
Report |
Gradient-Free Textual Inversion |
Zhengcong Fei, Mingyuan Fan, Junshi Huang |
Recent works on personalized text-to-image generation usually learn to bind a
special token with specific subjects or styles of a few given images by tuning
its embedding through gradient descent. It is natural to question whether we
can optimize the textual inversions by only accessing the process of model
inference. Requiring only forward computation to determine the textual
inversion retains the benefits of lower GPU memory use, simple deployment, and
secure access for scalable models. In this paper, we introduce a
\emph{gradient-free} framework to optimize the continuous textual inversion in
an iterative evolutionary strategy. Specifically, we first initialize an
appropriate token embedding for textual inversion with the consideration of
visual and text vocabulary information. Then, we decompose the optimization of
evolutionary strategy into dimension reduction of searching space and
non-convex gradient-free optimization in subspace, which significantly
accelerates the optimization process with negligible performance loss.
Experiments in several applications demonstrate that the performance of
text-to-image models equipped with our proposed gradient-free method is
comparable to that of gradient-based counterparts, while supporting various
GPU/CPU platforms, flexible deployment, and computational efficiency. |
This paper presents the first gradient-free framework for personalized text-to-image generation, using an iterative evolutionary strategy to optimize textual inversion without requiring model gradients, making it suitable for limited-resource settings and large models. |
Existing gradient-based methods for personalizing text-to-image models are computationally expensive and impractical for large models or restricted access scenarios. This work addresses these limitations by enabling personalization using only model inference, making it more accessible and efficient. |
The proposed framework initializes the textual inversion embedding using cross-attention between given images and text vocabulary. Then, it employs a gradient-free optimization (CMA-ES) in a lower-dimensional subspace, determined by PCA or prior normalization, for efficient exploration and exploitation. |
Gradient-free textual inversion achieves comparable image generation quality to gradient-based methods, both qualitatively and according to human evaluation.
The proposed general condition initialization strategy significantly accelerates optimization convergence.
Single pseudo-word inversion outperforms multi-word counterparts in terms of editability and maintains comparable reconstruction quality. |
Balancing exploration and exploitation in the evolutionary strategy needs further investigation for potential improvement.
The potential for bias in generated images, similar to other generative models, requires further investigation and mitigation strategies. |
text-to-image generation, textual inversion, gradient-free optimization, evolutionary strategy, personalization |
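The subspace search described in the Gradient-Free Textual Inversion entry above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: a basic elite-averaging evolution strategy is swapped in for CMA-ES, a random projection stands in for the PCA-derived subspace, and a toy quadratic `toy_loss` replaces the forward pass of a text-to-image model. All names and hyperparameters are hypothetical.

```python
import numpy as np

def evolve_in_subspace(loss_fn, e0, proj, iters=200, pop=20, elite=5,
                       sigma=0.3, seed=0):
    """Minimize loss_fn(e0 + proj @ z) over z using forward evaluations only."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(proj.shape[1])
    best_z, best_loss = mean.copy(), loss_fn(e0 + proj @ mean)
    for _ in range(iters):
        # Sample candidate subspace vectors around the current mean.
        cands = mean + sigma * rng.normal(size=(pop, mean.size))
        losses = np.array([loss_fn(e0 + proj @ z) for z in cands])
        order = np.argsort(losses)
        # Move the mean to the average of the best candidates (no gradients used).
        mean = cands[order[:elite]].mean(axis=0)
        if losses[order[0]] < best_loss:
            best_z, best_loss = cands[order[0]].copy(), float(losses[order[0]])
        sigma *= 0.98  # slowly shrink the search radius
    return e0 + proj @ best_z, best_loss

# Toy problem: the optimum lies inside the searched subspace by construction.
rng = np.random.default_rng(1)
dim, sub_dim = 64, 8
e0 = rng.normal(size=dim)                               # initial token embedding
proj = rng.normal(size=(dim, sub_dim)) / np.sqrt(dim)   # assumed random projection
target = e0 + proj @ rng.normal(size=sub_dim)           # hypothetical optimum

def toy_loss(e):
    return float(np.sum((e - target) ** 2))

initial_loss = toy_loss(e0)
best_e, best_loss = evolve_in_subspace(toy_loss, e0, proj)
```

Searching over the 8-dimensional `z` rather than the 64-dimensional embedding is what keeps the number of forward evaluations manageable, which is the motivation the entry gives for the dimension reduction step.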
2304.05772
Report |
An Image Quality Assessment Dataset for Portraits |
Nicolas Chahine, Ana-Stefania Calarasanu, Davide Garcia-Civiero, Theo Cayla, Sira Ferradans, Jean Ponce |
Year after year, the demand for ever-better smartphone photos continues to
grow, in particular in the domain of portrait photography. Manufacturers thus
use perceptual quality criteria throughout the development of smartphone
cameras. This costly procedure can be partially replaced by automated
learning-based methods for image quality assessment (IQA). Due to its
subjective nature, it is necessary to estimate and guarantee the consistency of
the IQA process, a characteristic lacking in the mean opinion scores (MOS)
widely used for crowdsourcing IQA. In addition, existing blind IQA (BIQA)
datasets pay little attention to the difficulty of cross-content assessment,
which may degrade the quality of annotations. This paper introduces PIQ23, a
portrait-specific IQA dataset of 5116 images of 50 predefined scenarios
acquired by 100 smartphones, covering a high variety of brands, models, and use
cases. The dataset includes individuals of various genders and ethnicities who
have given explicit and informed consent for their photographs to be used in
public research. It is annotated by pairwise comparisons (PWC) collected from
over 30 image quality experts for three image attributes: face detail
preservation, face target exposure, and overall image quality. An in-depth
statistical analysis of these annotations allows us to evaluate their
consistency over PIQ23. Finally, we show through an extensive comparison with
existing baselines that semantic information (image context) can be used to
improve IQA predictions. The dataset along with the proposed statistical
analysis and BIQA algorithms are available at
https://github.com/DXOMARK-Research/PIQ2023 |
Introduces PIQ23, the first smartphone portrait quality assessment dataset, featuring 5116 images across 50 scenes, annotated by experts via pairwise comparisons for face detail, exposure, and overall quality. |
Addresses the growing need for automated portrait quality evaluation in smartphone camera development, moving beyond generic IQA datasets to focus on portrait-specific attributes. |
Constructed PIQ23 with 100 smartphones capturing diverse portrait scenarios. Expert annotations were gathered using pairwise comparisons and a controlled lab environment. A novel statistical analysis method assessed annotation consistency and clustered images based on quality. |
PIQ23 is the first IQA dataset with legally obtained explicit consent from all individuals depicted, addressing ethical concerns.
The statistical analysis method quantifies uncertainty in pairwise comparison data and identifies significant quality differences between images.
The proposed SEM-HyperIQA model, integrating semantic information and multitasking, outperforms existing BIQA methods on PIQ23, demonstrating the importance of content awareness. |
Color quality annotation, attempted but excluded due to high subjectivity and difficulty in pairwise comparisons, needs further investigation.
Future work could explore additional portrait-specific attributes beyond the initial three, expanding the dataset and model capabilities. |
image quality assessment, portrait photography, smartphone cameras, pairwise comparison, semantic segmentation |
2304.05750
Report |
Segment Anything Is Not Always Perfect: An Investigation of SAM on Different Real-world Applications |
Wei Ji, Jingjing Li, Qi Bi, Tingwei Liu, Wenbo Li, Li Cheng |
Recently, Meta AI Research introduced a general, promptable Segment Anything
Model (SAM) pre-trained on an unprecedentedly large segmentation dataset
(SA-1B). Without a doubt, the emergence of SAM will yield significant benefits
for a wide array of practical image segmentation applications. In this study,
we conduct a series of intriguing investigations into the performance of SAM
across various applications, particularly in the fields of natural images,
agriculture, manufacturing, remote sensing, and healthcare. We analyze and
discuss the benefits and limitations of SAM, while also presenting an outlook
on its future development in segmentation tasks. By doing so, we aim to give a
comprehensive understanding of SAM's practical applications. This work is
expected to provide insights that facilitate future research activities toward
generic segmentation. Source code is publicly available. |
This paper investigates the performance of the Segment Anything Model (SAM) on a variety of real-world applications beyond natural images. |
While SAM has shown impressive results on general segmentation tasks, it's crucial to understand its capabilities and limitations in diverse, specialized domains like agriculture and healthcare. |
The authors evaluate SAM on various segmentation subtasks within natural images, agriculture, manufacturing, remote sensing, and healthcare. They analyze SAM's performance both qualitatively (visual results) and quantitatively (comparison with state-of-the-art models on benchmarks). |
SAM excels in common scenes with distinct objects, demonstrating strong generalization from its training.
SAM requires strong prior knowledge in complex scenes, often needing specific prompts to perform well.
SAM struggles with low-contrast, small, and irregular objects, highlighting limitations in handling such cases. |
The study primarily focuses on visual and limited quantitative analysis, lacking in-depth performance evaluation.
Future work includes exploring application-oriented SAMs, new prompt modes, and extending to video and semi-supervised learning. |
segment anything model (sam), image segmentation, real-world applications, computer vision, foundation models |
2304.05659
Report |
RIFormer: Keep Your Vision Backbone Effective While Removing Token Mixer |
Jiahao Wang, Songyang Zhang, Yong Liu, Taiqiang Wu, Yujiu Yang, Xihui Liu, Kai Chen, Ping Luo, Dahua Lin |
This paper studies how to keep a vision backbone effective while removing
token mixers in its basic building blocks. Token mixers, such as self-attention in
vision transformers (ViTs), are intended to perform information communication
between different spatial tokens but suffer from considerable computational
cost and latency. However, directly removing them leads to an incomplete
model structure prior, and thus causes a significant accuracy drop. To this
end, we first develop a RepIdentityFormer based on the re-parameterization idea
to study the token-mixer-free model architecture. We then explore an
improved learning paradigm to break the limitation of simple token mixer free
backbone, and summarize the empirical practice into 5 guidelines. Equipped with
the proposed optimization strategy, we are able to build an extremely simple
vision backbone with encouraging performance, while enjoying the high
efficiency during inference. Extensive experiments and ablative analysis also
demonstrate that the inductive bias of a network architecture can be
incorporated into a simple network structure with an appropriate optimization
strategy. We hope this work can serve as a starting point for the exploration
of optimization-driven efficient network design. Project page:
https://techmonsterwang.github.io/RIFormer/. |
This paper presents RIFormer, a vision backbone that maintains efficacy while removing computationally expensive token mixers. |
Token mixers, like self-attention in ViTs, are computationally costly and limit backbone efficiency on resource-constrained devices. RIFormer explores removing these while preserving effectiveness. |
The study uses structural re-parameterization to train a model with an affine transformation replacing the token mixer, later merging it into the LayerNorm during inference. It further explores knowledge distillation, using a teacher model with a token mixer to guide the token mixer-free student. |
RIFormer achieves competitive performance on ImageNet-1K while surpassing models with token mixers in inference speed.
The paper provides five guidelines for effectively training such token-mixer-free models using knowledge distillation.
Analysis suggests that the inductive bias introduced by token mixers can be implicitly learned by simpler structures using the proposed training method. |
The paper primarily focuses on image classification, leaving its application to other vision tasks unexplored.
Future work could involve investigating the impact of this method on tasks like object detection and image deblurring. |
vision backbones, token mixers, knowledge distillation, efficient networks, structural re-parameterization |
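The re-parameterization step mentioned in the RIFormer entry above (an affine transformation merged into LayerNorm at inference) can be checked numerically. The sketch below assumes the simplest case, a per-channel scale and shift applied directly to the LayerNorm output; the paper's exact block layout may differ, and the names `scale` and `shift` are hypothetical.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm over the last (channel) axis with learnable gamma and beta."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def fold_affine_into_layer_norm(gamma, beta, scale, shift):
    """Return LayerNorm parameters equivalent to LayerNorm followed by an affine."""
    # (x_hat * gamma + beta) * scale + shift
    #   = x_hat * (gamma * scale) + (beta * scale + shift)
    return gamma * scale, beta * scale + shift

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 10, 8))  # (batch, tokens, channels)
gamma, beta, scale, shift = (rng.normal(size=8) for _ in range(4))

# Training-time form: LayerNorm followed by a separate affine layer.
train_out = layer_norm(x, gamma, beta) * scale + shift

# Inference-time form: a single LayerNorm with merged parameters.
merged_gamma, merged_beta = fold_affine_into_layer_norm(gamma, beta, scale, shift)
infer_out = layer_norm(x, merged_gamma, merged_beta)
```

Because both forms compute the same function, the extra affine layer adds no inference cost once its parameters are folded in.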
2304.05568
Report |
Improving Diffusion Models for Scene Text Editing with Dual Encoders |
Jiabao Ji, Guanhua Zhang, Zhaowen Wang, Bairu Hou, Zhifei Zhang, Brian Price, Shiyu Chang |
Scene text editing is a challenging task that involves modifying or inserting
specified texts in an image while maintaining its natural and realistic
appearance. Most previous approaches to this task rely on style-transfer models
that crop out text regions and feed them into image transfer models, such as
GANs. However, these methods are limited in their ability to change text style
and are unable to insert texts into images. Recent advances in diffusion models
have shown promise in overcoming these limitations with text-conditional image
editing. However, our empirical analysis reveals that state-of-the-art
diffusion models struggle with rendering correct text and controlling text
style. To address these problems, we propose DIFFSTE to improve pre-trained
diffusion models with a dual encoder design, which includes a character encoder
for better text legibility and an instruction encoder for better style control.
An instruction tuning framework is introduced to train our model to learn the
mapping from the text instruction to the corresponding image with either the
specified style or the style of the surrounding texts in the background. Such a
training method further brings our method the zero-shot generalization ability
to the following three scenarios: generating text with unseen font variation,
e.g., italic and bold, mixing different fonts to construct a new font, and
using more relaxed forms of natural language as the instructions to guide the
generation task. We evaluate our approach on five datasets and demonstrate its
superior performance in terms of text correctness, image naturalness, and style
controllability. Our code is publicly available at
https://github.com/UCSB-NLP-Chang/DiffSTE |
Proposed DiffSTE, a method leveraging dual encoders (character and instruction) to enhance pre-trained diffusion models for scene text editing. |
Existing methods for scene text editing either struggle with style control (GAN-based style transfer) or text accuracy (diffusion models). |
Introduce a dual encoder design to improve Stable Diffusion: character encoder for accurate spelling, instruction encoder for style control. Train using instruction tuning on synthetic and real-world datasets with varying style instructions. |
Achieves superior performance in text correctness, image naturalness, and style control on five datasets.
Shows strong zero-shot generalization to unseen font variations (e.g., bold, italic) and combinations.
Can be guided by more natural language instructions. |
Current model is limited to single-word editing.
Evaluation focuses on single-word editing; more complex scene text editing scenarios can be explored. |
scene text editing, diffusion models, instruction tuning, dual encoder, zero-shot learning |
2304.05552
Report |
DynamicDet: A Unified Dynamic Architecture for Object Detection |
Zhihao Lin, Yongtao Wang, Jinhe Zhang, Xiaojie Chu |
Dynamic neural networks are an emerging research topic in deep learning. With
adaptive inference, dynamic models can achieve remarkable accuracy and
computational efficiency. However, it is challenging to design a powerful
dynamic detector, because there is no suitable dynamic architecture or exiting
criterion for object detection. To tackle these difficulties, we propose a
dynamic framework for object detection, named DynamicDet. Firstly, we carefully
design a dynamic architecture based on the nature of the object detection task.
Then, we propose an adaptive router to analyze the multi-scale information and
to decide the inference route automatically. We also present a novel
optimization strategy with an exiting criterion based on the detection losses
for our dynamic detectors. Last, we present a variable-speed inference
strategy, which helps to realize a wide range of accuracy-speed trade-offs with
only one dynamic detector. Extensive experiments conducted on the COCO
benchmark demonstrate that the proposed DynamicDet achieves new
state-of-the-art accuracy-speed trade-offs. For instance, with comparable
accuracy, the inference speed of our dynamic detector Dy-YOLOv7-W6 surpasses
YOLOv7-E6 by 12%, YOLOv7-D6 by 17%, and YOLOv7-E6E by 39%. The code is
available at https://github.com/VDIGPKU/DynamicDet. |
This paper proposes DynamicDet, a dynamic neural network framework for object detection that allows for adaptable inference routes based on image difficulty to achieve a wide range of accuracy-speed trade-offs using a single model. |
Existing object detectors often require training multiple models for different accuracy-speed trade-offs, leading to high computational costs. Dynamic inference addresses this by adapting computation based on input data, offering improved efficiency. |
DynamicDet uses two cascaded detectors with an adaptive router. The router analyzes multi-scale features to estimate an image's difficulty score and dynamically chooses the appropriate detector (faster or more accurate). A novel, hyperparameter-free optimization strategy with an adaptive offset is used to train the router, ensuring accurate difficulty assessment and balanced detector usage. |
DynamicDet achieves state-of-the-art accuracy-speed trade-offs, outperforming other real-time object detectors.
The method generalizes well across one-stage and two-stage detectors and is compatible with both CNN- and transformer-based backbones.
The proposed adaptive router is lightweight and effectively learns to distinguish between "easy" and "hard" images for dynamic routing. |
The variable-speed inference strategy relies on a sufficiently large validation set for robust threshold determination.
Future work could explore more sophisticated difficulty assessment mechanisms beyond the proposed adaptive router for further performance improvement. |
object detection, dynamic neural network, accuracy-speed trade-off, adaptive inference, computer vision |
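The variable-speed inference idea in the DynamicDet entry above (routing by a difficulty score, with the threshold set from a validation set) can be sketched in plain Python. This is a hedged illustration only: the quantile rule, the function names, and the example scores are assumptions, not the authors' procedure.

```python
def threshold_for_hard_fraction(val_scores, hard_fraction):
    """Pick a threshold so that about `hard_fraction` of validation images reach it."""
    if not 0.0 <= hard_fraction <= 1.0:
        raise ValueError("hard_fraction must be in [0, 1]")
    ranked = sorted(val_scores)
    n_easy = round(len(ranked) * (1.0 - hard_fraction))
    if n_easy >= len(ranked):
        return float("inf")  # every image takes the fast path
    return ranked[n_easy]

def route(score, threshold):
    """Easy images exit after the first detector; hard ones continue to the second."""
    return "hard" if score >= threshold else "easy"

# Hypothetical router outputs (difficulty scores) on a validation set.
val_scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.80, 0.95]
threshold = threshold_for_hard_fraction(val_scores, hard_fraction=0.3)
routes = [route(s, threshold) for s in val_scores]
```

Raising `hard_fraction` sends more images through the slower, more accurate path, so a single trained detector can cover a range of accuracy-speed trade-offs. With few validation images the quantile estimate becomes noisy, which matches the limitation noted in the entry.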
2304.05523
Report |
MoMo: A shared encoder Model for text, image and multi-Modal representations |
Rakesh Chada, Zhaoheng Zheng, Pradeep Natarajan |
We propose a self-supervised shared encoder model that achieves strong
results on several visual, language and multimodal benchmarks while being data,
memory and run-time efficient. We make three key contributions. First, in
contrast to most existing works, we use a single transformer with all the
encoder layers processing both the text and the image modalities. Second, we
propose a stage-wise training strategy where the model is first trained on
images, then jointly with unimodal text and image datasets and finally jointly
with text and text-image datasets. Third, to preserve information across both
the modalities, we propose a training pipeline that learns simultaneously from
gradient updates of different modalities at each training update step. The
results on downstream text-only, image-only and multimodal tasks show that our
model is competitive with several strong models while using fewer parameters
and less pre-training data. For example, MoMo performs competitively with
FLAVA on multimodal (+3.1), image-only (+1.1) and text-only (-0.1) tasks
despite having 2/5th the number of parameters and using 1/3rd the image-text
training pairs. Finally, we ablate various design choices and further show that
increasing model size produces significant performance gains indicating
potential for substantial improvements with larger models using our approach. |
This paper introduces MoMo, a self-supervised shared encoder model for text, image, and multimodal representation learning that is efficient in terms of data, memory, and runtime. |
MoMo addresses the limitations of existing multimodal models that often rely on huge training corpora or models with numerous parameters by using a single transformer encoder for all modalities and a stage-wise training strategy. |
MoMo employs a three-stage training pipeline: first trained on images (Masked Image Modeling), then jointly on unimodal text and images (Masked Language Modeling), and finally on unimodal text and multimodal image-text data (Cross-Modal Masking, contrastive, and matching losses). |
MoMo achieves competitive performance on multimodal, image-only, and text-only tasks despite using significantly fewer parameters and less pre-training data compared to models like FLAVA and CLIP.
A multi-stage training approach where the model learns simultaneously from different modalities at each training step is crucial for effective multimodal representation learning.
Scaling up the model size leads to considerable performance gains, highlighting the potential for further improvements with larger models using this approach. |
The model's performance on certain tasks, like VQA, could benefit from additional pre-training data.
Future work could explore incorporating more modalities and larger models. |
multimodal learning, vision-language pre-training, shared encoder, transformer, self-supervised learning |
2304.05395
Report |
SE-ORNet: Self-Ensembling Orientation-aware Network for Unsupervised Point Cloud Shape Correspondence |
Jiacheng Deng, Chuxin Wang, Jiahao Lu, Jianfeng He, Tianzhu Zhang, Jiyang Yu, Zhe Zhang |
Unsupervised point cloud shape correspondence aims to obtain dense
point-to-point correspondences between point clouds without manually annotated
pairs. However, humans and some animals have bilateral symmetry and various
orientations, which lead to severe mispredictions of symmetrical parts.
Besides, point cloud noise disrupts consistent representations for point clouds
and thus degrades the shape correspondence accuracy. To address the above
issues, we propose a Self-Ensembling ORientation-aware Network termed SE-ORNet.
The key to our approach is to exploit an orientation estimation module with a
domain adaptive discriminator to align the orientations of point cloud pairs,
which significantly alleviates the mispredictions of symmetrical parts.
Additionally, we design a self-ensembling framework for unsupervised point cloud
shape correspondence. In this framework, the disturbances of point cloud noise
are overcome by perturbing the inputs of the student and teacher networks with
different data augmentations and constraining the consistency of predictions.
Extensive experiments on both human and animal datasets show that our SE-ORNet
can surpass state-of-the-art unsupervised point cloud shape correspondence
methods. |
This paper introduces SE-ORNet, a novel self-ensembling orientation-aware network designed for unsupervised point cloud shape correspondence. |
Existing methods struggle with the mismatching of symmetrical parts in point clouds with different orientations and are sensitive to noise. This paper aims to address these issues. |
SE-ORNet utilizes an orientation estimation module with domain adaptation to align point cloud pairs, mitigating mismatches. Additionally, a self-ensembling framework with consistency losses ensures robust feature representations despite noise and orientation variations. |
SE-ORNet surpasses state-of-the-art methods on human and animal benchmarks, including SHREC, SURREAL, TOSCA, and SMAL.
The orientation estimation module effectively aligns point cloud orientations, significantly improving correspondence accuracy for symmetrical parts.
The self-ensembling framework enhances robustness to noise, leading to more consistent and reliable feature representations. |
The performance of orientation estimation relies on the accuracy of the pre-defined angle bins.
The computational cost of the self-ensembling framework is relatively high. |
point cloud shape correspondence, unsupervised learning, self-ensembling, orientation estimation, domain adaptation |
2304.05390
Report |
HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models |
Eslam Mohamed Bakr, Pengzhan Sun, Xiaoqian Shen, Faizan Farooq Khan, Li Erran Li, Mohamed Elhoseiny |
In recent years, Text-to-Image (T2I) models have been extensively studied,
especially with the emergence of diffusion models that achieve state-of-the-art
results on T2I synthesis tasks. However, existing benchmarks heavily rely on
subjective human evaluation, limiting their ability to holistically assess the
model's capabilities. Furthermore, there is a significant gap between efforts
in developing new T2I architectures and those in evaluation. To address this,
we introduce HRS-Bench, a concrete evaluation benchmark for T2I models that is
Holistic, Reliable, and Scalable. Unlike existing benchmarks that focus on
limited aspects, HRS-Bench measures 13 skills that can be categorized into five
major categories: accuracy, robustness, generalization, fairness, and bias. In
addition, HRS-Bench covers 50 scenarios, including fashion, animals,
transportation, food, and clothes. We evaluate nine recent large-scale T2I
models using metrics that cover a wide range of skills. A human evaluation
aligned with 95% of our evaluations on average was conducted to probe the
effectiveness of HRS-Bench. Our experiments demonstrate that existing models
often struggle to generate images with the desired count of objects, visual
text, or grounded emotions. We hope that our benchmark helps ease future
text-to-image generation research. The code and data are available at
https://eslambakr.github.io/hrsbench.github.io |
The paper introduces HRS-Bench, a holistic, reliable, and scalable benchmark for evaluating text-to-image models beyond just image quality. |
Existing T2I benchmarks are limited in scope, often focusing on few aspects like fidelity or bias, hindering comprehensive model assessment. |
HRS-Bench uses a large dataset of prompts across 50 scenarios and measures 13 skills grouped into five categories: accuracy, robustness, generalization, fairness, and bias. It utilizes a combination of automatic metrics (e.g., UniDet for counting, CLIPScore for similarity) and human evaluation. |
Existing models struggle with object counting accuracy, especially with increased prompt complexity.
Generating images with visual text or grounded emotions remains a significant challenge for current models.
While models show robustness against language perturbations like typos, they struggle with complex compositions, especially spatial, size, and color arrangements. |
Accessing the full training data for some models is challenging, limiting the evaluation of aspects like creativity.
Future work could expand the benchmark to include more intricate visual reasoning and common-sense understanding skills. |
text-to-image synthesis, benchmarking, evaluation metrics, multi-modal learning, generative ai |
2304.05265
Report |
Controllable Textual Inversion for Personalized Text-to-Image Generation |
Jianan Yang, Haobo Wang, Yanming Zhang, Ruixuan Xiao, Sai Wu, Gang Chen, Junbo Zhao |
The recent large-scale generative modeling has attained unprecedented
performance especially in producing high-fidelity images driven by text
prompts. Text inversion (TI), alongside the text-to-image model backbones, is
proposed as an effective technique in personalizing the generation when the
prompts contain user-defined, unseen or long-tail concept tokens. Despite that,
we find and show that the deployment of TI remains full of "dark-magics" -- to
name a few, the harsh requirement of additional datasets, arduous human efforts
in the loop and lack of robustness. In this work, we propose a much-enhanced
version of TI, dubbed Controllable Textual Inversion (COTI), in resolving all
the aforementioned problems and in turn delivering a robust, data-efficient and
easy-to-use framework. The core of COTI is a theoretically-guided loss
objective instantiated with a comprehensive and novel weighted scoring
mechanism, encapsulated by an active-learning paradigm. The extensive results
show that COTI significantly outperforms the prior TI-related approaches with a
26.05 decrease in the FID score and a 23.00% boost in the R-precision. |
This paper proposes Controllable Textual Inversion (COTI), an enhanced text inversion (TI) framework for personalized text-to-image generation that addresses limitations of existing TI methods, such as the need for large datasets and manual data selection. |
Existing text-to-image generation models struggle to produce high-quality images for prompts containing unseen or long-tail concepts. TI offers a solution but often relies on manual data selection and large datasets, limiting its practicality. |
COTI utilizes an active learning paradigm with a novel weighted scoring system to automatically select high-quality training data from a web-crawled dataset. The scoring system combines aesthetic and concept-matching scores, dynamically balancing their importance based on the evolving text embedding during training. |
COTI significantly outperforms baseline TI approaches, achieving a 26.05 decrease in FID score and a 23.00% boost in R-precision.
The method demonstrates successful learning of concept attributes, progressively refining image quality across active learning cycles.
Ablation studies confirm the effectiveness of both the dual scoring system and the dynamic training schedule. |
The current implementation primarily focuses on single-concept personalization and may require further exploration for concepts with multiple visual representations.
Future work could investigate extending COTI to other text-guided generative tasks. |
text-to-image generation, textual inversion, active learning, personalized image synthesis, aesthetic image assessment |
2304.05139
Report |
NeAT: Neural Artistic Tracing for Beautiful Style Transfer |
Dan Ruta, Andrew Gilbert, John Collomosse, Eli Shechtman, Nicholas Kolkin |
Style transfer is the task of reproducing the semantic contents of a source
image in the artistic style of a second target image. In this paper, we present
NeAT, a new state-of-the-art feed-forward style transfer method. We
re-formulate feed-forward style transfer as image editing, rather than image
generation, resulting in a model which improves over the state-of-the-art in
both preserving the source content and matching the target style. An important
component of our model's success is identifying and fixing "style halos", a
commonly occurring artefact across many style transfer techniques. In addition
to training and testing on standard datasets, we introduce the BBST-4M dataset,
a new, large scale, high resolution dataset of 4M images. As a component of
curating this data, we present a novel model able to classify if an image is
stylistic. We use BBST-4M to improve and measure the generalization of NeAT
across a huge variety of styles. Not only does NeAT offer state-of-the-art
quality and generalization, it is designed and trained for fast inference at
high resolution. |
This supplementary material provides additional details about NeAT (Neural Artistic Tracing), a novel style transfer method, and introduces BBST-4M, a large-scale dataset for style transfer. |
This work addresses the limitations of existing style transfer datasets and methods by introducing a new dataset with diverse styles and a model that generalizes well to unseen styles. |
The authors create BBST-4M using images from Flickr and Behance.net. They train NeAT using a combination of adversarial loss, style loss, content loss, identity loss, contrastive loss, and a novel patch-based discriminator. |
NeAT trained on BBST-4M demonstrates strong generalization capabilities and produces high-quality stylizations.
BBST-4M, with its diverse range of styles, facilitates the development of more robust style transfer models.
NeAT shows promise for video stylization, even with a simple frame-by-frame approach. |
The video stylization approach lacks explicit temporal consistency mechanisms.
Quantitative comparisons with other state-of-the-art style transfer methods are limited in the supplementary material. |
style transfer, deep learning, computer vision, dataset, neural networks |
2304.05097
Report |
One-Shot High-Fidelity Talking-Head Synthesis with Deformable Neural Radiance Field |
Weichuang Li, Longhao Zhang, Dong Wang, Bin Zhao, Zhigang Wang, Mulin Chen, Bang Zhang, Zhongjian Wang, Liefeng Bo, Xuelong Li |
Talking head generation aims to generate faces that maintain the identity
information of the source image and imitate the motion of the driving image.
Most pioneering methods rely primarily on 2D representations and thus will
inevitably suffer from face distortion when large head rotations are
encountered. Recent works instead employ explicit 3D structural representations
or implicit neural rendering to improve performance under large pose changes.
Nevertheless, the fidelity of identity and expression is not so desirable,
especially for novel-view synthesis. In this paper, we propose HiDe-NeRF, which
achieves high-fidelity and free-view talking-head synthesis. Drawing on the
recently proposed Deformable Neural Radiance Fields, HiDe-NeRF represents the
3D dynamic scene as a canonical appearance field and an implicit deformation
field, where the former comprises the canonical source face and the latter
models the driving pose and expression. In particular, we improve fidelity from
two aspects: (i) to enhance identity expressiveness, we design a generalized
appearance module that leverages multi-scale volume features to preserve face
shape and details; (ii) to improve expression preciseness, we propose a
lightweight deformation module that explicitly decouples the pose and
expression to enable precise expression modeling. Extensive experiments
demonstrate that our proposed approach can generate better results than
previous works. Project page: https://www.waytron.net/hidenerf/ |
This paper proposes HiDe-NeRF, a novel one-shot and subject-agnostic Deformable Neural Radiance Field for high-fidelity and free-view talking-head synthesis. |
Existing talking head generation methods struggle to generate high-fidelity results, particularly in preserving source identity and mimicking driving expressions, especially under large head rotations. |
HiDe-NeRF represents a 3D dynamic scene as a canonical appearance field (multi-scale tri-plane representation of the source face) and an implicit deformation field. The deformation field is learned using a novel Lightweight Expression-aware Deformation (LED) module that decouples pose and expression for precise modeling. A Multi-scale Generalized Appearance (MGA) module ensures identity expressiveness. Finally, the model renders the synthesized image and refines the texture details. |
HiDe-NeRF outperforms state-of-the-art methods in both self-reenactment and cross-identity reenactment tasks on multiple benchmark datasets, demonstrating superior performance in preserving source identity and mimicking driving expressions.
The proposed method exhibits excellent free-view synthesis capability, accurately redirecting the face while maintaining identity and expression consistency across different viewpoints.
Ablation studies confirm the effectiveness of the proposed MGA and LED modules in enhancing identity preservation and expression preciseness. |
HiDe-NeRF struggles with handling occlusions in the source image.
The method's performance degrades under extreme pose changes due to pose bias in training datasets.
Future work includes addressing occlusions, mitigating pose bias, and exploring other modality-driven talking head synthesis. |
talking-head synthesis, deformable neural radiance fields, one-shot learning, identity preservation, expression modeling |
2304.05051
Report |
FashionSAP: Symbols and Attributes Prompt for Fine-grained Fashion Vision-Language Pre-training |
Yunpeng Han, Lisai Zhang, Qingcai Chen, Zhijian Chen, Zhonghua Li, Jianxin Yang, Zhao Cao |
Fashion vision-language pre-training models have shown efficacy for a wide
range of downstream tasks. However, general vision-language pre-training models
pay less attention to fine-grained domain features, while these features are
important in distinguishing the specific domain tasks from general tasks. We
propose a method for fine-grained fashion vision-language pre-training based on
fashion Symbols and Attributes Prompt (FashionSAP) to model fine-grained
multi-modalities fashion attributes and characteristics. Firstly, we propose
the fashion symbols, a novel abstract fashion concept layer, to represent
different fashion items and to generalize various kinds of fine-grained fashion
features, making modelling fine-grained attributes more effective. Secondly,
the attributes prompt method is proposed to make the model learn specific
attributes of fashion items explicitly. We design proper prompt templates
according to the format of fashion data. Comprehensive experiments are
conducted on two public fashion benchmarks, i.e., FashionGen and FashionIQ, and
FashionSAP gets SOTA performances for four popular fashion tasks. The ablation
study also shows the proposed abstract fashion symbols, and the attribute
prompt method enables the model to acquire fine-grained semantics in the
fashion domain effectively. The obvious performance gains from FashionSAP
provide a new baseline for future fashion task research. |
This paper proposes FashionSAP, a novel fine-grained fashion vision-language pre-training model that leverages fashion symbols and attribute prompts to learn attribute-level fashion knowledge. |
General vision-language pre-training models often overlook the fine-grained attributes crucial for understanding fashion items, limiting their effectiveness in fashion-related tasks. |
FashionSAP utilizes: (1) Nine abstract fashion symbols representing broad categories based on body parts and functionalities, aiding in general feature capture. (2) An attribute prompt method with specifically designed templates to explicitly learn fine-grained fashion characteristics from attribute annotations. |
FashionSAP achieves state-of-the-art performance on four popular fashion tasks: text-to-image retrieval, image-to-text retrieval, category recognition, and subcategory recognition, using the FashionGen and FashionIQ datasets.
Ablation studies confirm the significant contribution of fashion symbols and attribute prompts in improving performance across all tasks.
Visualization using Grad-CAM highlights FashionSAP’s ability to focus on precise regions of interest, demonstrating effective fine-grained alignment between text and image modalities. |
The current work explores a limited set of fashion symbols based solely on category attributes; future research could investigate more diverse symbol representations.
Further exploration is needed to investigate the full potential of the attribute prompt framework in learning richer and more nuanced fashion representations. |
vision-language pre-training, fine-grained visual recognition, fashion analysis, attribute prompt learning, multi-modal representation learning |
2304.04968
Report |
Re-imagine the Negative Prompt Algorithm: Transform 2D Diffusion into 3D, alleviate Janus problem and Beyond |
Mohammadreza Armandpour, Ali Sadeghian, Huangjie Zheng, Amir Sadeghian, Mingyuan Zhou |
Although text-to-image diffusion models have made significant strides in
generating images from text, they are sometimes more inclined to generate
images like the data on which the model was trained rather than the provided
text. This limitation has hindered their usage in both 2D and 3D applications.
To address this problem, we explored the use of negative prompts but found that
the current implementation fails to produce desired results, particularly when
there is an overlap between the main and negative prompts. To overcome this
issue, we propose Perp-Neg, a new algorithm that leverages the geometrical
properties of the score space to address the shortcomings of the current
negative prompts algorithm. Perp-Neg does not require any training or
fine-tuning of the model. Moreover, we experimentally demonstrate that Perp-Neg
provides greater flexibility in generating images by enabling users to edit out
unwanted concepts from the initially generated images in 2D cases. Furthermore,
to extend the application of Perp-Neg to 3D, we conducted a thorough
exploration of how Perp-Neg can be used in 2D to condition the diffusion model
to generate desired views, rather than being biased toward the canonical views.
Finally, we applied our 2D intuition to integrate Perp-Neg with the
state-of-the-art text-to-3D (DreamFusion) method, effectively addressing its
Janus (multi-head) problem. Our project page is available at
https://Perp-Neg.github.io/ |
This paper proposes Perp-Neg, a novel sampling algorithm for text-to-image diffusion models to address limitations of current negative prompts when there's overlap with the main prompt. |
Current text-to-image models struggle to accurately represent complex prompts, often generating images resembling training data instead of the input text. This is particularly problematic with negative prompts, hindering their use in both 2D and 3D applications. |
Perp-Neg leverages geometrical properties of the score space, ensuring negative prompt guidance remains perpendicular to the main prompt's direction, preventing unintended removal of desired concepts. This is achieved without requiring any model training or fine-tuning. |
Significantly higher success rate (73.1% vs 42% for side view, 40.4% vs 14.6% for back view) in generating images aligned with specific viewpoint prompts compared to baseline methods.
Perp-Neg provides better control over negative attribute elimination, allowing for more nuanced image generation.
Integration of Perp-Neg with DreamFusion alleviates the Janus problem in text-to-3D generation by improving view faithfulness of the underlying 2D diffusion model. |
The paper primarily focuses on single-object generation and view control; further exploration is needed for more complex scenes and prompt compositions.
Fine-tuning the negative prompt weight functions for optimal performance can be time-consuming. |
text-to-image generation, diffusion models, negative prompts, view synthesis, 3d generation |
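For the Perp-Neg entry above, the idea of keeping negative-prompt guidance perpendicular to the main prompt's direction can be illustrated with plain vector projection. This is a minimal sketch, not the authors' implementation: the function names, the flat-list representation of score/noise predictions, and the `guidance_scale` and `neg_weight` parameters are all assumptions made for illustration.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def perpendicular_component(v, ref):
    """Return the part of v that is orthogonal to ref."""
    denom = dot(ref, ref)
    if denom == 0.0:
        # No reference direction to project onto: v is returned unchanged.
        return list(v)
    scale = dot(v, ref) / denom
    return [x - scale * r for x, r in zip(v, ref)]


def perp_neg_guidance(eps_uncond, eps_pos, eps_neg,
                      guidance_scale=7.5, neg_weight=1.0):
    """Combine predictions so the negative prompt only acts
    orthogonally to the main prompt (hypothetical helper)."""
    # Directions relative to the unconditional prediction.
    d_pos = [p - u for p, u in zip(eps_pos, eps_uncond)]
    d_neg = [n - u for n, u in zip(eps_neg, eps_uncond)]
    # Drop the part of the negative direction that overlaps the main one.
    d_neg_perp = perpendicular_component(d_neg, d_pos)
    return [u + guidance_scale * (dp - neg_weight * dn)
            for u, dp, dn in zip(eps_uncond, d_pos, d_neg_perp)]
```

When the negative prompt fully overlaps the main prompt, the perpendicular component vanishes and the result reduces to ordinary guidance on the main prompt, which is the failure case the entry says plain negative prompts handle poorly.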
2304.04962
Report |
Mask-Based Modeling for Neural Radiance Fields |
Ganlin Yang, Guoqiang Wei, Zhizheng Zhang, Yan Lu, Dong Liu |
Most Neural Radiance Fields (NeRFs) exhibit limited generalization
capabilities, which restrict their applicability in representing multiple
scenes using a single model. To address this problem, existing generalizable
NeRF methods simply condition the model on image features. These methods still
struggle to learn precise global representations over diverse scenes since they
lack an effective mechanism for interacting among different points and views.
In this work, we unveil that 3D implicit representation learning can be
significantly improved by mask-based modeling. Specifically, we propose masked
ray and view modeling for generalizable NeRF (MRVM-NeRF), which is a
self-supervised pretraining target to predict complete scene representations
from partially masked features along each ray. With this pretraining target,
MRVM-NeRF enables better use of correlations across different points and views
as the geometry priors, which thereby strengthens the capability of capturing
intricate details within the scenes and boosts the generalization capability
across different scenes. Extensive experiments demonstrate the effectiveness of
our proposed MRVM-NeRF on both synthetic and real-world datasets, qualitatively
and quantitatively. Besides, we also conduct experiments to show the
compatibility of our proposed method with various backbones and its superiority
under few-shot cases. |
This paper proposes MRVM (Masked Ray and View Modeling), a self-supervised pretraining strategy for generalizable Neural Radiance Fields (NeRF) to improve their ability to represent multiple scenes with a single model. |
Most NeRFs lack generalization capability due to limited conditioning on image features and struggle to learn global representations across diverse scenes. |
MRVM introduces a pretraining objective that predicts complete scene representations from partially masked features along rays and across views. This encourages interactions across different points and views, enhancing the learning of global 3D scene priors. |
MRVM-NeRF significantly improves performance on synthetic datasets (ShapeNet) in both category-agnostic and category-specific settings.
It also demonstrates effectiveness on challenging real-world datasets (NeRF Synthetic, DTU, LLFF) using both MLP-based and Transformer-based architectures.
The learned priors from MRVM are beneficial for both cross-scene generalization and per-scene finetuning. |
The paper explores a limited set of masking strategies and ratios.
Future work could investigate the impact of different masking patterns and optimize them for specific scene complexities. |
neural radiance fields, generalizable nerf, self-supervised learning, masked modeling, 3d scene representation |
2304.04909
Report |
SATR: Zero-Shot Semantic Segmentation of 3D Shapes |
Ahmed Abdelreheem, Ivan Skorokhodov, Maks Ovsjanikov, Peter Wonka |
We explore the task of zero-shot semantic segmentation of 3D shapes by using
large-scale off-the-shelf 2D image recognition models. Surprisingly, we find
that modern zero-shot 2D object detectors are better suited for this task than
contemporary text/image similarity predictors or even zero-shot 2D segmentation
networks. Our key finding is that it is possible to extract accurate 3D
segmentation maps from multi-view bounding box predictions by using the
topological properties of the underlying surface. For this, we develop the
Segmentation Assignment with Topological Reweighting (SATR) algorithm and
evaluate it on ShapeNetPart and our proposed FAUST benchmarks. SATR achieves
state-of-the-art performance and outperforms a baseline algorithm by 1.3% and
4% average mIoU on the FAUST coarse and fine-grained benchmarks, respectively,
and by 5.2% average mIoU on the ShapeNetPart benchmark. Our source code and
data will be publicly released. Project webpage:
https://samir55.github.io/SATR/. |
This paper proposes SATR, a novel method for zero-shot 3D shape segmentation using off-the-shelf 2D zero-shot object detectors and leveraging the topological properties of 3D surfaces. |
Extending the success of vision-language models in 2D zero-shot recognition to 3D is hindered by limited 3D data. This work explores using readily available 2D models for efficient and accurate 3D shape segmentation. |
SATR leverages 2D object detector (GLIP) predictions from multiple views of a 3D shape. It then refines these predictions by introducing Gaussian geodesic reweighting and visibility smoothing techniques, which utilize the topological information of the mesh. |
SATR achieves state-of-the-art performance on ShapeNetPart and the proposed FAUST benchmarks.
It significantly outperforms baseline methods, especially in fine-grained segmentation tasks.
Ablation studies demonstrate the effectiveness of the proposed Gaussian geodesic reweighting and visibility smoothing techniques. |
The random view sampling algorithm does not guarantee complete triangle coverage.
Evaluation with other large language models is limited by their public availability. |
zero-shot learning, 3d shape segmentation, vision-language models, object detection, topology |
2304.04820
Report |
Binary Latent Diffusion |
Ze Wang, Jiang Wang, Zicheng Liu, Qiang Qiu |
In this paper, we show that a binary latent space can be explored for compact
yet expressive image representations. We model the bi-directional mappings
between an image and the corresponding latent binary representation by training
an auto-encoder with a Bernoulli encoding distribution. On the one hand, the
binary latent space provides a compact discrete image representation of which
the distribution can be modeled more efficiently than pixels or continuous
latent representations. On the other hand, we now represent each image patch as
a binary vector instead of an index of a learned codebook as in discrete image
representations with vector quantization. In this way, we obtain binary latent
representations that allow for better image quality and high-resolution image
representations without any multi-stage hierarchy in the latent space. In this
binary latent space, images can now be generated effectively using a binary
latent diffusion model tailored specifically for modeling the prior over the
binary image representations. We present both conditional and unconditional
image generation experiments with multiple datasets, and show that the proposed
method performs comparably to state-of-the-art methods while dramatically
improving the sampling efficiency to as few as 16 steps without using any
test-time acceleration. The proposed framework can also be seamlessly scaled to
$1024 \times 1024$ high-resolution image generation without resorting to latent
hierarchy or multi-stage refinements. |
This paper introduces a method for representing and generating images in a compact binary latent space using a novel binary latent diffusion model. |
Representing images in a binary latent space offers a compact and expressive alternative to continuous or vector-quantized representations, enabling efficient high-resolution image generation without complex hierarchical latent structures. |
The method involves training an auto-encoder with a Bernoulli latent distribution to learn bidirectional mappings between images and binary codes. A binary latent diffusion model, tailored for Bernoulli distributions, is then trained to efficiently model the prior over these binary representations, allowing for novel sample generation. |
The binary latent diffusion model achieves comparable image generation quality and diversity to state-of-the-art methods with significantly fewer denoising steps and faster sampling speed.
The method allows for high-resolution (1024x1024) image generation in a single shot without resorting to hierarchical latent structures.
Binary latent representations offer a good balance between compactness and expressiveness, achieving better reconstruction quality with fewer bits compared to vector quantization. |
The current implementation utilizes a plain transformer architecture for the sampler, which may limit its ability to model images with complex global dependencies.
Further exploration of different noise schedulers and their impact on sample quality and efficiency is needed. |
image generation, diffusion models, binary latent space, bernoulli distribution, representation learning |
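For the Binary Latent Diffusion entry above, the two ingredients named in the summary (a Bernoulli encoding and a diffusion process over binary codes) can be sketched in a few lines. The entry does not give the noising rule, so the one used here, mixing each bit toward a fair coin flip with a `keep` coefficient, is an assumption for illustration; the function names are hypothetical as well.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bernoulli_encode(logits, rng):
    """Sample a binary code from per-bit Bernoulli probabilities."""
    return [1 if rng.random() < sigmoid(x) else 0 for x in logits]


def noisy_binary_latent(z0, keep, rng):
    """Noise a binary code (assumed rule, not from the entry).

    keep = 1.0 leaves the code unchanged; keep = 0.0 replaces
    every bit with an independent fair coin flip.
    """
    out = []
    for bit in z0:
        p_one = keep * bit + 0.5 * (1.0 - keep)
        out.append(1 if rng.random() < p_one else 0)
    return out


rng = random.Random(0)
code = bernoulli_encode([-2.0, 0.0, 3.0, 50.0], rng)
noised = noisy_binary_latent(code, keep=0.5, rng=rng)
```

A denoising model would then be trained to predict the clean bits (or the flips) from `noised`; that part is omitted here.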
2304.04742
Report |
Detection Transformer with Stable Matching |
Shilong Liu, Tianhe Ren, Jiayu Chen, Zhaoyang Zeng, Hao Zhang, Feng Li, Hongyang Li, Jun Huang, Hang Su, Jun Zhu, Lei Zhang |
This paper is concerned with the matching stability problem across different
decoder layers in DEtection TRansformers (DETR). We point out that the unstable
matching in DETR is caused by a multi-optimization path problem, which is
highlighted by the one-to-one matching design in DETR. To address this problem,
we show that the most important design is to use and only use positional
metrics (like IOU) to supervise classification scores of positive examples.
Under the principle, we propose two simple yet effective modifications by
integrating positional metrics to DETR's classification loss and matching cost,
named position-supervised loss and position-modulated cost. We verify our
methods on several DETR variants. Our methods show consistent improvements over
baselines. By integrating our methods with DINO, we achieve 50.4 and 51.5 AP on
the COCO detection benchmark using ResNet-50 backbones under 12 epochs and 24
epochs training settings, achieving a new record under the same setting. We
achieve 63.8 AP on COCO detection test-dev with a Swin-Large backbone. Our code
will be made available at https://github.com/IDEA-Research/Stable-DINO. |
This paper identifies and addresses the unstable matching problem in DEtection TRansformers (DETR), proposing a solution based on using positional metrics like Intersection over Union (IoU) to supervise classification scores. |
Unstable matching across decoder layers, caused by a multi-optimization path problem, hinders the training stability and efficiency of DETR-like models. |
The paper proposes two modifications: (1) position-supervised loss, using IoU to directly supervise classification scores of positive examples, and (2) position-modulated cost, incorporating IoU into the matching cost to down-weight inaccurate predictions. Additionally, a dense memory fusion technique is introduced to merge encoder and backbone features, enhancing feature utilization. |
Significantly improved training stability, evidenced by reduced inconsistencies in matching across decoder layers.
Faster convergence, especially during early training stages, attributed to both the stable matching strategy and the memory fusion technique.
State-of-the-art performance on the COCO object detection benchmark, achieving 50.4 AP and 51.5 AP with ResNet-50 backbones under 1x and 2x training schedules, respectively. |
The method is only validated on image-based object detection and segmentation tasks, leaving its applicability to other domains like 3D object detection unexplored.
The study primarily focuses on classification aspects of loss and matching, leaving the optimization of localization components for future work. |
object detection, detection transformers, detr, stable matching, position supervision |
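For the stable-matching entry above, the two modifications can be sketched on a single prediction. This is a simplified illustration under stated assumptions: it uses plain binary cross-entropy with the IoU as a soft target rather than whatever exact loss form the paper adopts, and the cost function, the `alpha` exponent, and all function names are hypothetical.

```python
import math


def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def position_supervised_bce(prob, iou, is_positive, eps=1e-7):
    """Cross-entropy whose target for a matched (positive) query is
    its IoU with the ground-truth box; negatives keep target 0."""
    target = iou if is_positive else 0.0
    p = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def position_modulated_class_cost(prob, iou, alpha=0.5):
    """Matching cost (lower is better) in which the class score is
    scaled down when the predicted box overlaps poorly."""
    return -(prob * iou ** alpha)
```

With this cost, a confident prediction whose box barely overlaps the ground truth is a worse match than an equally confident one with high IoU, which is the down-weighting of inaccurate predictions that the entry describes.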
2304.04709
Report |
Can SAM Segment Anything? When SAM Meets Camouflaged Object Detection |
Lv Tang, Haoke Xiao, Bo Li |
SAM is a segmentation model recently released by Meta AI Research and has
been gaining attention quickly due to its impressive performance in generic
object segmentation. However, its ability to generalize to specific scenes such
as camouflaged scenes is still unknown. Camouflaged object detection (COD)
involves identifying objects that are seamlessly integrated into their
surroundings and has numerous practical applications in fields such as
medicine, art, and agriculture. In this study, we try to ask if SAM can address
the COD task and evaluate the performance of SAM on the COD benchmark by
employing maximum segmentation evaluation and camouflage location evaluation.
We also compare SAM's performance with 22 state-of-the-art COD methods. Our
results indicate that while SAM shows promise in generic object segmentation,
its performance on the COD task is limited. This presents an opportunity for
further research to explore how to build a stronger SAM that may address the
COD task. The results of this paper are provided in
https://github.com/luckybird1994/SAMCOD. |
This paper evaluates the performance of the Segment Anything Model (SAM) on the task of Camouflaged Object Detection (COD). |
COD is an important task with applications in various fields, and understanding the generalization capabilities of foundation models like SAM in specific domains is crucial. |
The authors evaluate SAM on three COD benchmark datasets (CAMO, COD10K, NC4K) using two evaluation schemes: maximum segmentation evaluation (selecting the best prediction among multiple outputs) and camouflage location evaluation (analyzing the proportion of predictions exceeding a given F-measure threshold). The results are compared with 22 state-of-the-art COD methods. |
SAM's performance on COD is limited compared to state-of-the-art COD methods.
SAM's maximum segmentation performance is significantly lower than the best-performing COD methods.
SAM's ability to accurately locate camouflaged objects also requires further improvement. |
The evaluation does not explore fine-tuning SAM on COD datasets.
Future work could investigate modifications to SAM's architecture to better address COD challenges. |
camouflaged object detection, segment anything model (sam), foundation models, computer vision, image segmentation |
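For the SAM-on-COD entry above, the two evaluation schemes can be sketched on flattened binary masks. The scoring choices here are assumptions: the entry mentions an F-measure threshold but not its weighting, so the sketch uses a configurable `beta_sq` defaulting to plain F1, and the function names are hypothetical.

```python
def f_measure(pred, gt, beta_sq=1.0):
    """F-measure between two flat binary masks (lists of 0/1)."""
    tp = sum(1 for p, g in zip(pred, gt) if p and g)
    if tp == 0:
        return 0.0
    precision = tp / sum(1 for p in pred if p)
    recall = tp / sum(1 for g in gt if g)
    return (1.0 + beta_sq) * precision * recall / (beta_sq * precision + recall)


def best_candidate(candidates, gt, score_fn=f_measure):
    """Maximum segmentation evaluation: among several predicted
    masks, keep the one that scores best against the ground truth."""
    scores = [score_fn(c, gt) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


def fraction_above(scores, threshold):
    """Camouflage location evaluation: share of scores that exceed
    a given threshold."""
    return sum(1 for s in scores if s > threshold) / len(scores)
```

Note that the first scheme is an upper bound by construction, since it picks the best mask using the ground truth; the entry reports that SAM still trails dedicated COD methods under it.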
2304.04704
Report |
Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition |
Shuhuai Ren, Aston Zhang, Yi Zhu, Shuai Zhang, Shuai Zheng, Mu Li, Alex Smola, Xu Sun |
This work proposes POMP, a prompt pre-training method for vision-language
models. Being memory and computation efficient, POMP enables the learned prompt
to condense semantic information for a rich set of visual concepts with over
twenty-thousand classes. Once pre-trained, the prompt with a strong
transferable ability can be directly plugged into a variety of visual
recognition tasks including image classification, semantic segmentation, and
object detection, to boost recognition performances in a zero-shot manner.
Empirical evaluation shows that POMP achieves state-of-the-art performances on
21 datasets, e.g., 67.0% average accuracy on 10 classification datasets (+3.1%
compared to CoOp) and 84.4 hIoU on open-vocabulary Pascal VOC segmentation
(+6.9 compared to ZSSeg). Our code is available at
https://github.com/amazon-science/prompt-pretraining. |
This paper proposes POMP, a memory and computation-efficient prompt pre-training method for vision-language models using the ImageNet-21K dataset. |
Existing prompt tuning methods are computationally expensive for large-scale datasets, limiting their ability to learn a universal, task-agnostic prompt for visual recognition. |
POMP introduces 'local contrast' to reduce memory overhead by sampling a subset of classes during training and 'local correction' to mitigate bias introduced by sampling. |
POMP achieves state-of-the-art accuracy on ImageNet-21K (25.3%) with CLIP ViT-B/16 backbone.
It outperforms previous methods in cross-dataset image classification, achieving 67.0% average accuracy on 10 datasets.
POMP excels in open-vocabulary semantic segmentation and object detection, surpassing previous state-of-the-art methods. |
The theoretical risk of using a subsampled class set for estimating the expected contrastive loss needs investigation.
Utilizing the semantic hierarchy within ImageNet-21K could further enhance performance. |
prompt learning, vision-language models, zero-shot learning, image recognition, semantic segmentation, object detection |
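For the POMP entry above, 'local contrast' (computing the contrastive loss over a sampled subset of classes) can be sketched as below. The correction term is an assumption: the entry only says that 'local correction' mitigates the bias introduced by sampling, so the sketch exposes a generic `margin` added to negative-class logits rather than claiming the paper's exact form. `logit_fn` stands in for the image-to-class-prompt similarity and is hypothetical.

```python
import math
import random


def sample_local_classes(num_classes, true_class, k, rng):
    """Pick the true class plus k - 1 distinct negative classes."""
    negatives = [c for c in range(num_classes) if c != true_class]
    return [true_class] + rng.sample(negatives, k - 1)


def local_contrast_loss(logit_fn, true_class, class_subset, margin=0.0):
    """Softmax cross-entropy restricted to the sampled classes.

    margin is added to every negative logit (assumed correction).
    """
    logits = []
    for c in class_subset:
        z = logit_fn(c)
        if c != true_class:
            z += margin
        logits.append(z)
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[class_subset.index(true_class)]


rng = random.Random(0)
subset = sample_local_classes(num_classes=1000, true_class=7, k=8, rng=rng)
loss = local_contrast_loss(lambda c: 5.0 if c == 7 else 0.0, 7, subset)
```

Only `k` class prompts need to be encoded per step instead of all classes, which is where the memory saving described in the entry comes from.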
2304.04694
Report |
Video-kMaX: A Simple Unified Approach for Online and Near-Online Video Panoptic Segmentation |
Inkyu Shin, Dahun Kim, Qihang Yu, Jun Xie, Hong-Seok Kim, Bradley Green, In So Kweon, Kuk-Jin Yoon, Liang-Chieh Chen |
Video Panoptic Segmentation (VPS) aims to achieve comprehensive pixel-level
scene understanding by segmenting all pixels and associating objects in a
video. Current solutions can be categorized into online and near-online
approaches. Evolving over time, each category has its own specialized
designs, making it nontrivial to adapt models between different categories. To
alleviate the discrepancy, in this work, we propose a unified approach for
online and near-online VPS. The meta architecture of the proposed Video-kMaX
consists of two components: within-clip segmenter (for clip-level segmentation)
and cross-clip associater (for association beyond clips). We propose clip-kMaX
(clip k-means mask transformer) and HiLA-MB (Hierarchical Location-Aware Memory
Buffer) to instantiate the segmenter and associater, respectively. Our general
formulation includes the online scenario as a special case by adopting clip
length of one. Without bells and whistles, Video-kMaX sets a new
state-of-the-art on KITTI-STEP and VIPSeg for video panoptic segmentation, and
VSPW for video semantic segmentation. Code will be made publicly available. |
This paper presents Video-kMaX, a simple and unified approach for online and near-online video panoptic segmentation. |
Existing methods for video panoptic segmentation often require specific design choices depending on whether they process the video frame-by-frame (online) or clip-by-clip (near-online). This paper aims to alleviate this discrepancy with a unified approach. |
The proposed Video-kMaX consists of two components: clip-kMaX (clip k-means mask transformer) for clip-level segmentation and HiLA-MB (Hierarchical Location-Aware Memory Buffer) for cross-clip association. Clip-kMaX extends the image-level k-means mask transformer to the clip level by concatenating clip-level pixel features. HiLA-MB leverages appearance and location features for long-term association across clips using a hierarchical matching scheme. |
Video-kMaX sets a new state-of-the-art on KITTI-STEP and VIPSeg for video panoptic segmentation, and VSPW for video semantic segmentation.
The proposed clip-kMaX effectively handles long sequences of video frames with k-means cross-attention.
HiLA-MB significantly improves long-term association quality compared to methods relying solely on appearance features. |
The model struggles to track objects with large random movements and heavy occlusion.
Future work could explore incorporating more sophisticated motion models for robust object association. |
video panoptic segmentation, online segmentation, near-online segmentation, k-means mask transformer, memory module |
2304.04515
Report |
SOOD: Towards Semi-Supervised Oriented Object Detection |
Wei Hua, Dingkang Liang, Jingyu Li, Xiaolong Liu, Zhikang Zou, Xiaoqing Ye, Xiang Bai |
Semi-Supervised Object Detection (SSOD), aiming to explore unlabeled data for
boosting object detectors, has become an active task in recent years. However,
existing SSOD approaches mainly focus on horizontal objects, leaving
multi-oriented objects that are common in aerial images unexplored. This paper
proposes a novel Semi-supervised Oriented Object Detection model, termed SOOD,
built upon the mainstream pseudo-labeling framework. Towards oriented objects
in aerial scenes, we design two loss functions to provide better supervision.
Focusing on the orientations of objects, the first loss regularizes the
consistency between each pseudo-label-prediction pair (includes a prediction
and its corresponding pseudo label) with adaptive weights based on their
orientation gap. Focusing on the layout of an image, the second loss
regularizes the similarity and explicitly builds the many-to-many relation
between the sets of pseudo-labels and predictions. Such a global consistency
constraint can further boost semi-supervised learning. Our experiments show
that when trained with the two proposed losses, SOOD surpasses the
state-of-the-art SSOD methods under various settings on the DOTA-v1.5
benchmark. The code will be available at https://github.com/HamPerdredes/SOOD. |
This paper proposes SOOD, the first semi-supervised oriented object detection method, which introduces two novel losses (RAW and GC) to adapt the dense pseudo-labeling framework for oriented object detection. |
Oriented object detection in aerial images is crucial but suffers from high annotation costs. Semi-supervised methods can leverage unlabeled data to improve object detectors and reduce annotation effort. |
SOOD builds upon a dense pseudo-labeling framework with a teacher-student model. It introduces two novel losses: 1) Rotation-aware Adaptive Weighting (RAW) loss considers orientation differences to weigh pseudo-label-prediction pairs. 2) Global Consistency (GC) loss uses optimal transport to enforce layout similarity between teacher and student predictions. |
SOOD outperforms state-of-the-art SSOD methods on DOTA-v1.5 under various partially labeled data settings (10%, 20%, 30%).
SOOD also surpasses existing methods on the fully labeled DOTA-v1.5 benchmark, demonstrating its ability to learn from unlabeled data.
Ablation studies confirm the effectiveness of both RAW and GC losses. |
SOOD's utilization of aerial object characteristics beyond orientation and layout is limited.
The RAW and GC losses, currently separate, could be integrated for better synergy. |
semi-supervised learning, oriented object detection, aerial images, pseudo-labeling, optimal transport |
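The rotation-aware weighting idea in the SOOD report above can be illustrated with a small sketch. The source only states that each pseudo-label-prediction pair is weighted adaptively by its orientation gap; the specific form `1 + alpha * gap / (pi / 2)`, the parameter `alpha`, the assumption that box angles are periodic in pi, and all function names are hypothetical choices made for illustration, not the paper's exact loss.

```python
import math

def orientation_gap(theta_pred, theta_pseudo):
    """Smallest absolute angle difference in radians, assuming a period of pi."""
    diff = abs(theta_pred - theta_pseudo) % math.pi
    return min(diff, math.pi - diff)

def raw_weight(theta_pred, theta_pseudo, alpha=1.0):
    """Hypothetical weight: a larger orientation gap gives a larger weight."""
    gap = orientation_gap(theta_pred, theta_pseudo)
    return 1.0 + alpha * gap / (math.pi / 2)

def weighted_consistency_loss(pairs, alpha=1.0):
    """pairs: list of (per_pair_loss, theta_pred, theta_pseudo) tuples."""
    total = 0.0
    for loss, theta_pred, theta_pseudo in pairs:
        total += raw_weight(theta_pred, theta_pseudo, alpha) * loss
    return total / max(len(pairs), 1)
```

Under these assumptions, a pair whose orientations agree keeps its original loss, while a pair that disagrees by 90 degrees contributes twice as much when `alpha` is 1.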
2304.04514
Report |
DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment |
Lewei Yao, Jianhua Han, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, Hang Xu |
This paper presents DetCLIPv2, an efficient and scalable training framework
that incorporates large-scale image-text pairs to achieve open-vocabulary
object detection (OVD). Unlike previous OVD frameworks that typically rely on a
pre-trained vision-language model (e.g., CLIP) or exploit image-text pairs via
a pseudo labeling process, DetCLIPv2 directly learns the fine-grained
word-region alignment from massive image-text pairs in an end-to-end manner. To
accomplish this, we employ a maximum word-region similarity between region
proposals and textual words to guide the contrastive objective. To enable the
model to gain localization capability while learning broad concepts, DetCLIPv2
is trained with a hybrid supervision from detection, grounding and image-text
pair data under a unified data formulation. By jointly training with an
alternating scheme and adopting low-resolution input for image-text pairs,
DetCLIPv2 exploits image-text pair data efficiently and effectively: DetCLIPv2
utilizes 13X more image-text pairs than DetCLIP with a similar training time
and improves performance. With 13M image-text pairs for pre-training, DetCLIPv2
demonstrates superior open-vocabulary detection performance, e.g., DetCLIPv2
with Swin-T backbone achieves 40.4% zero-shot AP on the LVIS benchmark, which
outperforms previous works GLIP/GLIPv2/DetCLIP by 14.4/11.4/4.5% AP,
respectively, and even beats its fully-supervised counterpart by a large
margin. |
Introduces DetCLIPv2, an open-vocabulary object detection framework that learns word-region alignment directly from image-text pairs via end-to-end training. |
Addresses limitations of prior OVD methods that rely on pre-trained VL models or pseudo labeling by directly learning from large-scale image-text pairs. |
Joint training with detection, grounding, and image-text data using a unified formulation. Employs maximum word-region similarity for contrastive learning, aligning visual regions with textual concepts. |
Achieves 40.4% zero-shot AP on LVIS with Swin-T backbone, outperforming previous state-of-the-art methods.
Demonstrates efficient training, utilizing 13x more image-text pairs than DetCLIP with similar training time.
Exhibits strong generalization, achieving state-of-the-art fine-tuning performance on LVIS and ODinW13. |
Localization capability heavily relies on bounding box annotations from detection data.
Noisy and incomplete descriptions in web-crawled image-text pairs impact learning efficiency. |
open-vocabulary object detection, word-region alignment, contrastive learning, image-text pairs, weakly-supervised learning |
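The maximum word-region similarity mentioned in the DetCLIPv2 report above can be sketched as follows: each word is matched to its most similar region proposal, and the per-word maxima are aggregated into one image-text score. Cosine similarity and the mean aggregation over words are assumptions made for this sketch, and the function names are hypothetical; the source does not specify these details.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def image_text_similarity(region_feats, word_feats):
    """For each word keep its best-matching region, then average over words."""
    regions = [normalize(r) for r in region_feats]
    words = [normalize(w) for w in word_feats]
    best_per_word = [max(dot(w, r) for r in regions) for w in words]
    return sum(best_per_word) / len(best_per_word)
```

A score of this kind could then feed a standard image-text contrastive objective, with matched pairs scored against the other captions in the batch.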
2304.04452
Report |
Neural Residual Radiance Fields for Streamably Free-Viewpoint Videos |
Liao Wang, Qiang Hu, Qihan He, Ziyu Wang, Jingyi Yu, Tinne Tuytelaars, Lan Xu, Minye Wu |
The success of the Neural Radiance Fields (NeRFs) for modeling and free-view
rendering static objects has inspired numerous attempts on dynamic scenes.
Current techniques that utilize neural rendering for facilitating free-view
videos (FVVs) are restricted to either offline rendering or are capable of
processing only brief sequences with minimal motion. In this paper, we present
a novel technique, Residual Radiance Field or ReRF, as a highly compact neural
representation to achieve real-time FVV rendering on long-duration dynamic
scenes. ReRF explicitly models the residual information between adjacent
timestamps in the spatial-temporal feature space, with a global
coordinate-based tiny MLP as the feature decoder. Specifically, ReRF employs a
compact motion grid along with a residual feature grid to exploit inter-frame
feature similarities. We show such a strategy can handle large motions without
sacrificing quality. We further present a sequential training scheme to
maintain the smoothness and the sparsity of the motion/residual grids. Based on
ReRF, we design a special FVV codec that achieves three orders of magnitude
compression rate and provides a companion ReRF player to support online
streaming of long-duration FVVs of dynamic scenes. Extensive experiments
demonstrate the effectiveness of ReRF for compactly representing dynamic
radiance fields, enabling an unprecedented free-viewpoint viewing experience in
speed and quality. |
Presents Residual Radiance Field (ReRF), a novel neural representation for streamable free-viewpoint viewing of long-duration dynamic scenes. |
Existing methods for free-viewpoint videos (FVVs) are either offline or limited to short sequences with minimal motion. ReRF aims to enable real-time FVV rendering on long, dynamic scenes with high compression. |
ReRF uses a global tiny MLP as a feature decoder and models feature space with explicit grids. It employs a compact motion grid for inter-frame position offsets and a sparse residual grid for error compensation and new regions. A two-stage sequential training scheme with motion pooling and sparsity regularizers is used. |
Achieves high-quality free-viewpoint rendering comparable to per-frame reconstructions but with significantly less storage.
Outperforms other dynamic scene reconstruction methods in terms of visual quality, especially in long sequences with large motions.
Enables real-time decoding and rendering (20 fps) with a companion ReRF player, supporting traditional video controls like pause, play, seek, etc. |
Per-frame training time needs to be improved further.
Reliance on multi-view capture systems for dynamic sequences. |
neural rendering, free-viewpoint video, dynamic scene reconstruction, neural compression, streaming |
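The residual idea in the ReRF report above (a motion grid for inter-frame offsets plus a sparse residual grid for compensation) can be sketched on a toy 1D feature grid. Integer offsets, boundary clamping, the dictionary used for the sparse residual, and the function names are all simplifying assumptions for illustration; the actual method operates on 3D feature grids.

```python
def reconstruct_frame(prev_grid, motion, residual):
    """Warp the previous grid by per-cell offsets, then add sparse residuals.

    prev_grid: list of floats (features of the previous frame)
    motion:    list of int offsets, one per cell (toy motion grid)
    residual:  dict {cell_index: correction} (toy sparse residual grid)
    """
    n = len(prev_grid)
    out = []
    for i in range(n):
        src = min(max(i + motion[i], 0), n - 1)  # clamp to the grid bounds
        out.append(prev_grid[src] + residual.get(i, 0.0))
    return out

def decode_sequence(first_grid, deltas):
    """Decode frames one after another from (motion, residual) pairs."""
    frames = [list(first_grid)]
    for motion, residual in deltas:
        frames.append(reconstruct_frame(frames[-1], motion, residual))
    return frames
```

Only the first grid and the compact per-frame deltas need to be stored in this toy setup, which is the intuition behind the storage savings the report describes.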
2304.04415
Report |
Meta Compositional Referring Expression Segmentation |
Li Xu, Mark He Huang, Xindi Shang, Zehuan Yuan, Ying Sun, Jun Liu |
Referring expression segmentation aims to segment an object described by a
language expression from an image. Despite the recent progress on this task,
existing models tackling this task may not be able to fully capture semantics
and visual representations of individual concepts, which limits their
generalization capability, especially when handling novel compositions of
learned concepts. In this work, through the lens of meta learning, we propose a
Meta Compositional Referring Expression Segmentation (MCRES) framework to
enhance model compositional generalization performance. Specifically, to handle
various levels of novel compositions, our framework first uses training data to
construct a virtual training set and multiple virtual testing sets, where data
samples in each virtual testing set contain a level of novel compositions
w.r.t. the virtual training set. Then, following a novel meta optimization
scheme to optimize the model to obtain good testing performance on the virtual
testing sets after training on the virtual training set, our framework can
effectively drive the model to better capture semantics and visual
representations of individual concepts, and thus obtain robust generalization
performance even when handling novel compositions. Extensive experiments on
three benchmark datasets demonstrate the effectiveness of our framework. |
The paper proposes Meta Compositional Referring Expression Segmentation (MCRES), a meta-learning framework to improve generalization performance of RES models when handling novel compositions of learned concepts (e.g., "dark coffee"). |
Existing RES models struggle to generalize to testing samples containing novel compositions of learned concepts, limiting their practical application. |
MCRES constructs a virtual training set and multiple virtual testing sets representing different levels of novel compositions. A meta-optimization scheme then optimizes the model on these sets, encouraging it to learn semantics and visual representations of individual concepts for better generalization. |
MCRES achieves state-of-the-art performance on RefCOCO, RefCOCO+, and RefCOCOg benchmarks.
The framework consistently improves performance across various RES model architectures (transformer and CNN based).
Ablation studies validate the effectiveness of handling different levels of novel compositions, the meta-optimization scheme, and the virtual sets construction strategy. |
The framework introduces additional training time overhead due to the meta-optimization process.
Future work could explore automatically identifying the most effective level of novel composition for each sample. |
referring expression segmentation, meta learning, compositional generalization, computer vision, natural language processing |
2304.04395
Report |
Instance Neural Radiance Field |
Yichen Liu, Benran Hu, Junkai Huang, Yu-Wing Tai, Chi-Keung Tang |
This paper presents one of the first learning-based NeRF 3D instance
segmentation pipelines, dubbed as Instance Neural Radiance Field, or Instance
NeRF. Taking a NeRF pretrained from multi-view RGB images as input, Instance
NeRF can learn 3D instance segmentation of a given scene, represented as an
instance field component of the NeRF model. To this end, we adopt a 3D
proposal-based mask prediction network on the sampled volumetric features from
NeRF, which generates discrete 3D instance masks. The coarse 3D mask prediction
is then projected to image space to match 2D segmentation masks from different
views generated by existing panoptic segmentation models, which are used to
supervise the training of the instance field. Notably, beyond generating
consistent 2D segmentation maps from novel views, Instance NeRF can query
instance information at any 3D point, which greatly enhances NeRF object
segmentation and manipulation. Our method is also one of the first to achieve
such results in pure inference. Experimented on synthetic and real-world NeRF
datasets with complex indoor scenes, Instance NeRF surpasses previous NeRF
segmentation works and competitive 2D segmentation methods in segmentation
performance on unseen views. Watch the demo video at
https://youtu.be/wW9Bme73coI. Code and data are available at
https://github.com/lyclyc52/Instance_NeRF. |
Presents Instance-NeRF (iNeRF), one of the first learning-based NeRF pipelines for 3D instance segmentation, which learns 3D instance segmentation from a pre-trained NeRF without ground truth segmentation. |
Addresses the limitations of 3D instance segmentation relying on depth sensors or custom equipment by leveraging the ability of NeRF to associate 2D images with 3D. |
Employs a 3D proposal-based mask prediction network on NeRF volumetric features, projects coarse 3D masks to image space, and uses 2D segmentation from existing models to match instances across views and supervise the training of a 3D instance field component within the NeRF model. |
Achieves state-of-the-art 3D instance segmentation in NeRF without requiring ground-truth segmentation during inference.
Introduces a Neural Instance Field capable of generating multi-view consistent 2D segmentation and continuous 3D segmentation using NeRF representation.
Outperforms competitive 2D segmentation methods and prior NeRF segmentation approaches on synthetic indoor scenes. |
Relies on existing 2D panoptic segmentation models, which may impact performance if the models are inaccurate.
Future work includes extending the method to handle dynamic scenes and more complex real-world scenarios. |
nerf, 3d instance segmentation, neural instance field, multi-view consistency, unsupervised segmentation |
2304.04344
Report |
Towards Real-time Text-driven Image Manipulation with Unconditional Diffusion Models |
Nikita Starodubcev, Dmitry Baranchuk, Valentin Khrulkov, Artem Babenko |
Recent advances in diffusion models enable many powerful instruments for
image editing. One of these instruments is text-driven image manipulations:
editing semantic attributes of an image according to the provided text
description.
Existing diffusion-based methods already achieve high-quality image
manipulations for a broad range of text prompts. However, in practice, these
methods require high computation costs even with a high-end GPU. This greatly
limits potential real-world applications of diffusion-based image editing,
especially when running on user devices.
In this paper, we address efficiency of the recent text-driven editing
methods based on unconditional diffusion models and develop a novel algorithm
that learns image manipulations 4.5-10 times faster and applies them 8 times
faster. We carefully evaluate the visual quality and expressiveness of our
approach on multiple datasets using human annotators. Our experiments
demonstrate that our algorithm achieves the quality of much more expensive
methods. Finally, we show that our approach can adapt the pretrained model to
the user-specified image and text description on the fly in just 4 seconds. In
this setting, we notice that more compact unconditional diffusion models can be
considered as a rational alternative to the popular text-conditional
counterparts. |
This paper introduces a novel algorithm for text-driven image manipulation using unconditional diffusion models that significantly improves efficiency without sacrificing visual quality. |
Existing diffusion-based methods for text-driven image editing, while effective, are computationally expensive, limiting their practical applications, especially on user devices. |
The paper leverages two main ingredients: 1) replacing the sequential DDIM encoding with a closed-form, stochastic encoding at a single time step, and 2) updating the model parameters at a single decoding step per training iteration. |
The proposed algorithm learns image manipulations 4.5-10x faster and applies them 8x faster than previous diffusion-based methods.
Despite using approximate encoding and decoding, the approach achieves comparable visual and editing quality to DiffusionCLIP, significantly outperforming GAN-based alternatives.
The paper demonstrates that unconditional diffusion models can learn text-guided manipulations from a single image, enabling fast, on-the-fly editing. |
The proposed method, while more efficient, still requires careful hyperparameter tuning for optimal results.
The reliance on CLIP for semantic guidance can limit the expressiveness and success of certain text-driven manipulations. |
image manipulation, diffusion models, text-guided editing, unconditional diffusion models, clip |
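The report above mentions replacing sequential DDIM encoding with a closed-form, stochastic encoding at a single time step. A minimal sketch, assuming this refers to the standard forward-diffusion marginal x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, is shown below. The noise schedule `betas`, the flat list representation of an image, and the function names are illustrative assumptions.

```python
import math
import random

def alpha_bar(t, betas):
    """Cumulative product of (1 - beta_s) for s = 0..t."""
    prod = 1.0
    for beta in betas[: t + 1]:
        prod *= 1.0 - beta
    return prod

def encode_single_step(x0, t, betas, rng):
    """Sample x_t directly from x_0 without iterating over earlier steps."""
    a = alpha_bar(t, betas)
    signal, noise = math.sqrt(a), math.sqrt(1.0 - a)
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in x0]

# Usage: jump straight to step 50 of a hypothetical 100-step linear schedule.
betas = [1e-4 + i * (0.02 - 1e-4) / 99 for i in range(100)]
x_t = encode_single_step([0.5, -0.25, 1.0], 50, betas, random.Random(0))
```

Because the sample is drawn in one shot, its cost does not grow with the time step, unlike a step-by-step inversion.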
2304.04269
Report |
HumanSD: A Native Skeleton-Guided Diffusion Model for Human Image Generation |
Xuan Ju, Ailing Zeng, Chenchen Zhao, Jianan Wang, Lei Zhang, Qiang Xu |
Controllable human image generation (HIG) has numerous real-life
applications. State-of-the-art solutions, such as ControlNet and T2I-Adapter,
introduce an additional learnable branch on top of the frozen pre-trained
stable diffusion (SD) model, which can enforce various conditions, including
skeleton guidance of HIG. While such a plug-and-play approach is appealing, the
inevitable and uncertain conflicts between the original images produced from
the frozen SD branch and the given condition incur significant challenges for
the learnable branch, which essentially conducts image feature editing for
condition enforcement. In this work, we propose a native skeleton-guided
diffusion model for controllable HIG called HumanSD. Instead of performing
image editing with dual-branch diffusion, we fine-tune the original SD model
using a novel heatmap-guided denoising loss. This strategy effectively and
efficiently strengthens the given skeleton condition during model training
while mitigating the catastrophic forgetting effects. HumanSD is fine-tuned on
the assembly of three large-scale human-centric datasets with text-image-pose
information, two of which are established in this work. As shown in Figure 1,
HumanSD outperforms ControlNet in terms of accurate pose control and image
quality, particularly when the given skeleton guidance is sophisticated. |
This paper introduces HumanSD, a novel skeleton-guided diffusion model for controllable human image generation that directly fine-tunes the Stable Diffusion model with skeleton conditions, enhancing pose control and image quality. |
Controllable human image generation is crucial for various applications, but current diffusion-based methods struggle with accurate pose control, especially in complex scenarios. HumanSD addresses these limitations by enabling native skeleton guidance during image generation. |
The authors propose a novel Heatmap-guided Denoising Loss to mitigate catastrophic forgetting during fine-tuning. They also establish two large-scale human-centric datasets, GHI and LAION-Human, to train HumanSD. |
HumanSD significantly outperforms state-of-the-art methods like ControlNet in terms of pose accuracy, particularly with challenging poses.
The model demonstrates high fidelity in replicating desired human poses while preserving image quality and text-image consistency.
The proposed Heatmap-guided Denoising Loss proves effective in improving both pose control and background preservation compared to vanilla fine-tuning. |
HumanSD still faces challenges with extremely crowded scenes and complex/rare actions.
The evaluation system for text and pose-guided image generation needs further development to be more comprehensive and robust. |
human image generation, diffusion models, pose control, stable diffusion, heatmap-guided denoising loss |
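The heatmap-guided denoising loss named in the HumanSD report above can be pictured as a denoising error that is up-weighted where a pose heatmap is active. The sketch below assumes a per-pixel weight of `1 + scale * heatmap` applied to the squared noise-prediction error; that weighting form, the `scale` parameter, and the flat list representation are assumptions for illustration rather than the paper's exact definition.

```python
def heatmap_weighted_mse(pred_noise, true_noise, heatmap, scale=1.0):
    """Mean squared noise error, emphasized where the heatmap is high.

    pred_noise, true_noise, heatmap: equal-length lists of floats.
    """
    total = 0.0
    for p, t, h in zip(pred_noise, true_noise, heatmap):
        total += (1.0 + scale * h) * (p - t) ** 2
    return total / len(pred_noise)
```

With an all-zero heatmap this reduces to the ordinary mean squared error, so the extra term only changes the emphasis placed on the highlighted regions.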
2304.04231
Report |
CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model |
Dingkang Liang, Jiahao Xie, Zhikang Zou, Xiaoqing Ye, Wei Xu, Xiang Bai |
Supervised crowd counting relies heavily on costly manual labeling, which is
difficult and expensive, especially in dense scenes. To alleviate the problem,
we propose a novel unsupervised framework for crowd counting, named CrowdCLIP.
The core idea is built on two observations: 1) the recent contrastive
pre-trained vision-language model (CLIP) has presented impressive performance
on various downstream tasks; 2) there is a natural mapping between crowd
patches and count text. To the best of our knowledge, CrowdCLIP is the first to
investigate the vision language knowledge to solve the counting problem.
Specifically, in the training stage, we exploit the multi-modal ranking loss by
constructing ranking text prompts to match the size-sorted crowd patches to
guide the image encoder learning. In the testing stage, to deal with the
diversity of image patches, we propose a simple yet effective progressive
filtering strategy to first select the highly potential crowd patches and then
map them into the language space with various counting intervals. Extensive
experiments on five challenging datasets demonstrate that the proposed
CrowdCLIP achieves superior performance compared to previous unsupervised
state-of-the-art counting methods. Notably, CrowdCLIP even surpasses some
popular fully-supervised methods under the cross-dataset setting. The source
code will be available at https://github.com/dk-liang/CrowdCLIP. |
This paper proposes CrowdCLIP, a novel unsupervised crowd counting framework that leverages a vision-language model (CLIP) to estimate the number of people in images without any labeled data. |
Existing supervised crowd counting methods rely heavily on expensive and time-consuming manual labeling. This paper explores the potential of vision-language models for unsupervised crowd counting, aiming to alleviate the dependence on labeled data. |
CrowdCLIP fine-tunes the image encoder of CLIP using a ranking-based contrastive loss with size-sorted image patches and corresponding count text prompts. During inference, it employs a progressive filtering strategy to select highly potential crowd patches and map them into appropriate count intervals. |
CrowdCLIP outperforms the current state-of-the-art unsupervised method (CSS-CCNN) by a large margin on five challenging datasets.
CrowdCLIP even surpasses some popular fully-supervised methods in cross-dataset evaluation scenarios.
Ablation studies validate the effectiveness of the ranking-based contrastive fine-tuning, the proposed progressive filtering strategy, and the design choices of text prompts. |
CrowdCLIP currently only provides count-level estimations and lacks the ability to generate point-level localization information.
Future work can focus on exploring unsupervised localization techniques for crowd counting to provide more comprehensive crowd analysis. |
crowd counting, unsupervised learning, vision-language model, clip, contrastive learning |
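The inference stage in the CrowdCLIP report above (filter patches, then map the remaining ones to count intervals in the language space) can be sketched with precomputed image-text similarities. The boolean `is_crowd` flags stand in for the progressive filtering strategy, whose details are not given in the source; the interval values and function names are likewise hypothetical.

```python
def predict_patch_count(similarities, interval_counts):
    """Pick the count whose text prompt is most similar to the patch."""
    best = max(range(len(similarities)), key=lambda k: similarities[k])
    return interval_counts[best]

def predict_image_count(patch_similarities, interval_counts, is_crowd):
    """Sum per-patch counts over the patches that survive filtering."""
    total = 0
    for sims, keep in zip(patch_similarities, is_crowd):
        if keep:
            total += predict_patch_count(sims, interval_counts)
    return total
```

In practice the similarities would come from the fine-tuned image encoder and the frozen text encoder, one score per count prompt.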
2304.03950
Report |
GANHead: Towards Generative Animatable Neural Head Avatars |
Sijing Wu, Yichao Yan, Yunhao Li, Yuhao Cheng, Wenhan Zhu, Ke Gao, Xiaobo Li, Guangtao Zhai |
To bring digital avatars into people's lives, there is strong demand for
efficiently generating complete, realistic, and animatable head avatars. This
task is challenging, and it is difficult for existing methods to satisfy all
the requirements at once. To achieve these goals, we propose GANHead
(Generative Animatable Neural Head Avatar), a novel generative head model that
takes advantage of both the fine-grained control over the explicit expression
parameters and the realistic rendering results of implicit representations.
Specifically, GANHead represents coarse geometry, fine-grained details and
texture via three networks in canonical space to obtain the ability to generate
complete and realistic head avatars. To achieve flexible animation, we define
the deformation field by standard linear blend skinning (LBS), with the learned
continuous pose and expression bases and LBS weights. This allows the avatars
to be directly animated by FLAME parameters and generalize well to unseen poses
and expressions. Compared to state-of-the-art (SOTA) methods, GANHead achieves
superior performance on head avatar generation and raw scan fitting. |
GANHead, a novel generative model for creating realistic and animatable 3D head avatars, is presented. |
Generating complete, realistic, and animatable 3D head avatars efficiently is crucial for various applications like VR/AR and the metaverse, but remains a challenge for existing methods. |
GANHead leverages implicit representations with three neural networks to model coarse geometry, fine details, and texture in canonical space. It employs a deformation module based on FLAME parameters for animation and generalizability. |
GANHead generates high-quality head avatars with detailed geometry and realistic textures.
The generated avatars are controllable by FLAME parameters, enabling animation and generalization to unseen poses and expressions.
GANHead outperforms SOTA methods in head avatar generation and raw scan fitting, exhibiting superior reconstruction quality in both shape and texture. |
The current model still struggles to generate realistic hair with complex topology.
Training requires significant GPU memory. |
generative model, 3d head avatar, implicit representation, animatable avatar, flame parameters |
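The linear blend skinning (LBS) step referenced in the GANHead report above deforms each canonical point as a weighted sum of per-bone rigid transforms. The sketch below is a 2D simplification with hand-specified weights and transforms; in the report the weights and the pose and expression bases are learned and the transforms act in 3D, so the names and values here are illustrative assumptions only.

```python
import math

def rotate_translate(point, angle, offset):
    """Apply a 2D rotation by `angle` (radians) followed by a translation."""
    c, s = math.cos(angle), math.sin(angle)
    x, y = point
    return (c * x - s * y + offset[0], s * x + c * y + offset[1])

def linear_blend_skinning(point, weights, bone_transforms):
    """Blend the point's position under each bone transform by its weight.

    weights:         per-bone skinning weights, expected to sum to 1
    bone_transforms: list of (angle, (tx, ty)) rigid transforms
    """
    out_x = out_y = 0.0
    for w, (angle, offset) in zip(weights, bone_transforms):
        px, py = rotate_translate(point, angle, offset)
        out_x += w * px
        out_y += w * py
    return (out_x, out_y)
```

A point fully bound to one bone follows that bone rigidly, while a point with split weights lands between the positions the individual bones would give it.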
2304.03869
Report |
Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis |
Qiucheng Wu, Yujian Liu, Handong Zhao, Trung Bui, Zhe Lin, Yang Zhang, Shiyu Chang |
Diffusion-based models have achieved state-of-the-art performance on
text-to-image synthesis tasks. However, one critical limitation of these models
is the low fidelity of generated images with respect to the text description,
such as missing objects, mismatched attributes, and mislocated objects. One key
reason for such inconsistencies is the inaccurate cross-attention to text in
both the spatial dimension, which controls at what pixel region an object
should appear, and the temporal dimension, which controls how different levels
of details are added through the denoising steps. In this paper, we propose a
new text-to-image algorithm that adds explicit control over spatial-temporal
cross-attention in diffusion models. We first utilize a layout predictor to
predict the pixel regions for objects mentioned in the text. We then impose
spatial attention control by combining the attention over the entire text
description and that over the local description of the particular object in the
corresponding pixel region of that object. The temporal attention control is
further added by allowing the combination weights to change at each denoising
step, and the combination weights are optimized to ensure high fidelity between
the image and the text. Experiments show that our method generates images with
higher fidelity compared to diffusion-model-based baselines without fine-tuning
the diffusion model. Our code is publicly available at
https://github.com/UCSB-NLP-Chang/Diffusion-SpaceTime-Attn. |
This paper proposes a new text-to-image synthesis algorithm that addresses the low fidelity issue of diffusion models by explicitly controlling spatial-temporal cross-attention. |
Existing diffusion models often generate images inconsistent with text descriptions (e.g., missing objects, mismatched attributes), particularly for complex scenes. |
The algorithm uses a layout predictor to determine object positions and optimizes a novel spatial-temporal attention mechanism in the diffusion model. This guides the model to attend to both global and local text descriptions, focusing on overall composition initially and refining object details in later stages. |
The method significantly outperforms baseline diffusion models in generating images faithful to complex text descriptions.
Both automatic and human evaluations demonstrate the effectiveness of the proposed spatial-temporal attention control.
The method generalizes well to novel object combinations, suggesting its potential for creative applications. |
The current optimization scheme is time-consuming, taking around 10 minutes per image.
The layout predictor's performance might be improved, especially for object positions at the image edge. |
text-to-image synthesis, diffusion models, cross-attention, image fidelity, layout prediction |
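The spatial-temporal attention control in the report above combines attention over the full text with attention over an object's local description inside that object's predicted region, using weights that change per denoising step. A minimal sketch under simplifying assumptions follows: attention maps are flat lists over pixels, the combination is a convex blend, and the per-step weights are supplied directly rather than optimized as in the paper. All names are hypothetical.

```python
def combine_attention(global_attn, local_attn, region_mask, weight):
    """Blend global and local attention inside the object's region only."""
    combined = []
    for g, l, inside in zip(global_attn, local_attn, region_mask):
        if inside:
            combined.append(weight * g + (1.0 - weight) * l)
        else:
            combined.append(g)  # outside the region, keep global attention
    return combined

def attention_over_steps(global_attn, local_attn, region_mask, step_weights):
    """Apply a different blend weight at each denoising step."""
    return [
        combine_attention(global_attn, local_attn, region_mask, w)
        for w in step_weights
    ]
```

A schedule that starts near 1 and decreases would favor the global description early and the local description later, matching the report's description of composing the overall scene first and refining object details afterwards.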
2304.03768
Report |
SparseFormer: Sparse Visual Recognition via Limited Latent Tokens |
Ziteng Gao, Zhan Tong, Limin Wang, Mike Zheng Shou |
Human visual recognition is a sparse process, where only a few salient visual
cues are attended to rather than traversing every detail uniformly. However,
most current vision networks follow a dense paradigm, processing every single
visual unit (e.g., pixel or patch) in a uniform manner. In this paper, we
challenge this dense paradigm and present a new method, coined SparseFormer, to
imitate humans' sparse visual recognition in an end-to-end manner. SparseFormer
learns to represent images using a highly limited number of tokens (down to 49)
in the latent space with sparse feature sampling procedure instead of
processing dense units in the original pixel space. Therefore, SparseFormer
circumvents most of dense operations on the image space and has much lower
computational costs. Experiments on the ImageNet classification benchmark
dataset show that SparseFormer achieves performance on par with canonical or
well-established models while offering better accuracy-throughput tradeoff.
Moreover, the design of our network can be easily extended to video
classification with promising performance at lower computational costs. We hope
that our work can provide an alternative way for visual modeling and inspire
further research on sparse neural architectures. The code will be publicly
available at https://github.com/showlab/sparseformer |
SparseFormer is a novel vision architecture that sparsely represents images using a limited number of latent tokens and transformers in the latent space, mimicking human sparse visual recognition. |
This approach addresses the limitations of dense processing in conventional vision networks, offering a computationally efficient alternative. |
SparseFormer employs sparse feature sampling and adaptive feature decoding to build latent tokens and iteratively refines their region of interest (RoI) using a focusing transformer. A cortex transformer then processes these tokens for recognition. |
SparseFormer achieves comparable performance to dense counterparts on ImageNet classification with a better accuracy-throughput trade-off.
It effectively focuses on foregrounds in an end-to-end manner using only classification signals.
The architecture extends well to video classification, demonstrating efficiency on Kinetics-400. |
The performance of SparseFormer heavily relies on the number of latent tokens.
Further exploration of token initialization and scaling strategies is needed. |
sparse visual recognition, vision transformer, latent tokens, image classification, video classification |
2304.03752
Report |
V3Det: Vast Vocabulary Visual Detection Dataset |
Jiaqi Wang, Pan Zhang, Tao Chu, Yuhang Cao, Yujie Zhou, Tong Wu, Bin Wang, Conghui He, Dahua Lin |
Recent advances in detecting arbitrary objects in the real world are trained
and evaluated on object detection datasets with a relatively restricted
vocabulary. To facilitate the development of more general visual object
detection, we propose V3Det, a vast vocabulary visual detection dataset with
precisely annotated bounding boxes on massive images. V3Det has several
appealing properties: 1) Vast Vocabulary: It contains bounding boxes of objects
from 13,204 categories on real-world images, which is 10 times larger than the
existing large vocabulary object detection dataset, e.g., LVIS. 2) Hierarchical
Category Organization: The vast vocabulary of V3Det is organized by a
hierarchical category tree which annotates the inclusion relationship among
categories, encouraging the exploration of category relationships in vast and
open vocabulary object detection. 3) Rich Annotations: V3Det comprises
precisely annotated objects in 243k images and professional descriptions of
each category written by human experts and a powerful chatbot. By offering a
vast exploration space, V3Det enables extensive benchmarks on both vast and
open vocabulary object detection, leading to new observations, practices, and
insights for future research. It has the potential to serve as a cornerstone
dataset for developing more general visual perception systems. V3Det is
available at https://v3det.openxlab.org.cn/. |
V3Det, a vast vocabulary visual detection dataset with 13,204 categories hierarchically organized with a category tree, is introduced. |
Existing object detection datasets have a restricted vocabulary, limiting the development of more general visual object detection systems capable of detecting arbitrary objects. |
V3Det leverages the Bamboo classification dataset and web data for image and category acquisition, employs a coarse-to-fine annotation pipeline with multiple verification stages, and provides rich category descriptions. |
V3Det contains bounding boxes of objects from 13,204 categories, 10 times larger than existing datasets like LVIS.
Benchmarks on V3Det reveal insights and best practices for vast and open vocabulary object detection.
Pretraining on V3Det significantly improves the class generalizability of detectors, highlighting its value for open-vocabulary algorithms. |
Limited resources restricted the evaluation of all potential object detectors.
Further exploration of techniques for efficiently training and evaluating models on such a vast vocabulary dataset is needed. |
object detection, vast vocabulary, open vocabulary, dataset, benchmark |
2304.03659
Report |
Probing Conceptual Understanding of Large Visual-Language Models |
Madeline Schiappa, Raiyaan Abdullah, Shehreen Azad, Jared Claypoole, Michael Cogswell, Ajay Divakaran, Yogesh Rawat |
In recent years large visual-language (V+L) models have achieved great
success in various downstream tasks. However, it is not well studied whether
these models have a conceptual grasp of the visual content. In this work we
focus on conceptual understanding of these large V+L models. To facilitate this
study, we propose novel benchmarking datasets for probing three different
aspects of content understanding: 1) relations, 2)
composition, and 3) context. Our probes are grounded in
cognitive science and help determine if a V+L model can, for example, determine
if snow garnished with a man is implausible, or if it can identify beach
furniture by knowing it is located on a beach. We experimented with many recent
state-of-the-art V+L models and observe that these models mostly fail
to demonstrate a conceptual understanding. This study reveals several
interesting insights, such as that cross-attention helps in learning
conceptual understanding, and that CNNs are better with texture and
patterns, while Transformers are better at color and shape. We
further utilize some of these insights and investigate a simple
finetuning technique that rewards the three conceptual understanding measures
with promising initial results. The proposed benchmarks will drive the
community to delve deeper into conceptual understanding and foster advancements
in the capabilities of large V+L models. The code and dataset are available at:
https://tinyurl.com/vlm-robustness |
This paper proposes three novel benchmark datasets (Probe-R, Probe-C, Probe-B) to evaluate the conceptual understanding of large visual-language (V+L) models in terms of relations, composition, and context. |
It is crucial for real-world applications that V+L models develop an understanding of visual content beyond memorization, similar to the 'conceptual maps' used by humans. |
The study evaluates various state-of-the-art V+L models using these datasets. Probe-R uses image-text matching with swapped predicates or objects. Probe-C assesses compositional understanding through image-prompt matching with swapped compositions or objects. Probe-B analyzes contextual understanding by observing performance changes after background removal or replacement. |
Existing V+L models largely fail to demonstrate robust conceptual understanding, particularly struggling with relational and contextual reasoning.
Cross-attention mechanisms in V+L models are found to be beneficial for learning conceptual understanding.
CNN-based models show strength in texture and pattern recognition, while Transformer-based models excel in color and shape understanding. |
The study primarily focuses on visual perception and could be extended to incorporate subjective inference.
Future work can explore the impact of larger and more diverse training datasets on V+L models' conceptual understanding. |
visual-language models, conceptual understanding, benchmarking datasets, compositionality, contextual reasoning |
2304.03542
Report |
Better "CMOS" Produces Clearer Images: Learning Space-Variant Blur Estimation for Blind Image Super-Resolution |
Xuhai Chen, Jiangning Zhang, Chao Xu, Yabiao Wang, Chengjie Wang, Yong Liu |
Most of the existing blind image Super-Resolution (SR) methods assume that
the blur kernels are space-invariant. However, the blur involved in real
applications is usually space-variant due to object motion, defocus,
etc., resulting in a severe performance drop for advanced SR methods. To
address this problem, we first introduce two new datasets with out-of-focus
blur, i.e., NYUv2-BSR and Cityscapes-BSR, to support further research on
blind SR with space-variant blur. Based on these datasets, we design a novel
Cross-MOdal fuSion network (CMOS) that estimates both blur and semantics
simultaneously, which leads to improved SR results. It involves a feature
Grouping Interactive Attention (GIA) module to make the two modalities interact
more effectively and avoid inconsistency. GIA can also be used for the
interaction of other features because of the universality of its structure.
Qualitative and quantitative experiments against state-of-the-art methods
on the above datasets and real-world images demonstrate the superiority of our
method, e.g., improving PSNR/SSIM by +1.91/+0.0048 over MANet on NYUv2-BSR. |
This paper introduces two new datasets with out-of-focus blur for blind image super-resolution and proposes CMOS, a novel cross-modal fusion network, for estimating space-variant blur. |
Real-world blur is often space-variant, which significantly degrades the performance of existing SR methods that assume space-invariant blur. This work aims to address this limitation and improve blind SR in more realistic scenarios. |
The authors propose CMOS, a multi-scale network that leverages semantic information to improve space-variant blur estimation. It uses a novel Grouping Interactive Attention (GIA) module for effective interaction between blur and semantic features. They also introduce two new datasets with synthetic out-of-focus blur for training and evaluation. |
CMOS, combined with a non-blind SR model, achieves state-of-the-art performance on the proposed datasets, outperforming existing blind SR methods by a significant margin.
The effectiveness of using semantic information and the proposed GIA module is demonstrated through ablation studies.
CMOS also shows superior visual results on real-world images with space-variant blur. |
The current work focuses on out-of-focus blur as a representative case of space-variant blur.
Future work can explore the integration of CMOS with more advanced non-blind SR techniques and extend the approach to other types of spatially variant blur. |
blind image super-resolution, space-variant blur, out-of-focus blur, cross-modal fusion, grouping interactive attention |
2304.03526
Report |
Lift3D: Synthesize 3D Training Data by Lifting 2D GAN to 3D Generative Radiance Field |
Leheng Li, Qing Lian, Luozhou Wang, Ningning Ma, Ying-Cong Chen |
This work explores the use of 3D generative models to synthesize training
data for 3D vision tasks. The key requirements of the generative models are
that the generated data should be photorealistic to match the real-world
scenarios, and the corresponding 3D attributes should be aligned with given
sampling labels. However, we find that the recent NeRF-based 3D GANs hardly
meet the above requirements due to their designed generation pipeline and the
lack of explicit 3D supervision. In this work, we propose Lift3D, an inverted
2D-to-3D generation framework to achieve the data generation objectives. Lift3D
has several merits compared to prior methods: (1) Unlike previous 3D GANs, whose
output resolution is fixed after training, Lift3D can generalize to any
camera intrinsics with higher resolution and photorealistic output. (2) By
lifting a well-disentangled 2D GAN to a 3D object NeRF, Lift3D provides explicit 3D
information of generated objects, thus offering accurate 3D annotations for
downstream tasks. We evaluate the effectiveness of our framework by augmenting
autonomous driving datasets. Experimental results demonstrate that our data
generation framework can effectively improve the performance of 3D object
detectors. Project page: https://len-li.github.io/lift3d-web. |
Lift3D, a novel 2D-to-3D generation framework, synthesizes 3D training data by lifting a pretrained 2D GAN to a 3D generative radiance field. |
Current 3D GANs struggle to generate high-resolution, multi-view consistent images with accurate 3D annotations, limiting their use for data augmentation in 3D vision tasks. |
Lift3D disentangles a 2D GAN to generate multi-view images with pseudo pose labels, then lifts them to a 3D object NeRF using a shared conditional NeRF and optimized latent codes. |
Outperforms state-of-the-art data augmentation methods for 3D object detection on KITTI and nuScenes datasets.
Demonstrates superior visual quality and multi-view consistency compared to previous 3D GANs.
Enables unsupervised training of 3D object detectors with promising results. |
Current method lacks explicit relation reasoning between generated objects and the environment.
Illumination gaps exist between synthetic objects and real-world backgrounds. |
data augmentation, 3d object detection, generative adversarial networks, neural radiance fields, autonomous driving |
2304.03486
Report |
Can we learn better with hard samples? |
Subin Sahayam, John Zakkam, Umarani Jayaraman |
In deep learning, mini-batch training is commonly used to optimize network
parameters. However, the traditional mini-batch method may not learn the
under-represented samples and complex patterns in the data, leading to a longer
time for generalization. To address this problem, a variant of the traditional
algorithm has been proposed, which trains the network focusing on mini-batches
with high loss. The study evaluates the effectiveness of the proposed training
using various deep neural networks trained on three benchmark datasets
(CIFAR-10, CIFAR-100, and STL-10). The deep neural networks used in the study
are ResNet-18, ResNet-50, EfficientNet-B4, EfficientNetV2-S, and
MobileNetV3-S. The experimental results showed that the proposed method can
significantly improve the test accuracy and speed up the convergence compared
to the traditional mini-batch training method. Furthermore, we introduce a
hyper-parameter delta (δ) that decides how many mini-batches are
considered for training. Experiments with various values of δ found that
smaller δ values generally yield similar test accuracy and faster
generalization. We show that the
proposed method generalizes in 26.47% fewer epochs than the
traditional mini-batch method with EfficientNet-B4 on STL-10. The proposed method
also improves the test top-1 accuracy by 7.26% in ResNet-18 on CIFAR-100. |
This paper proposes a novel mini-batch training method that prioritizes learning from hard samples to accelerate the convergence of deep neural networks. |
The ability to efficiently learn from hard samples is crucial for improving the performance and generalization of deep learning models, especially in terms of faster convergence. |
The method introduces a hyper-parameter (δ) that selects a fraction of mini-batches with the highest loss values for training in each iteration. |
The proposed method significantly reduces convergence time while maintaining or even improving test accuracy compared to traditional mini-batch training on benchmark datasets like CIFAR-10, CIFAR-100, and STL-10.
Smaller δ values, focusing on the hardest samples, often lead to the most significant acceleration in convergence.
The effectiveness of the method varies depending on the network architecture and dataset size, with larger networks and smaller datasets showing greater benefits. |
While the method accelerates convergence, it doesn't guarantee improved accuracy in every case.
The assumption of sample independence limits the method's applicability to datasets with inherent dependencies, such as time series or 3D images. |
deep learning, mini-batch training, hard sample mining, convergence acceleration, image classification |
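The selection step summarized in this report can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `select_hard_batches`, the use of precomputed per-batch losses, the reading of δ as a fraction, and the round-up rule are all hypothetical choices made for the example.

```python
import math


def select_hard_batches(batch_losses, delta):
    """Return indices of the highest-loss mini-batches.

    batch_losses: list of floats, one loss value per mini-batch.
    delta: fraction in (0, 1] of mini-batches to keep (an assumed
           reading of the delta hyper-parameter described above).
    """
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1]")
    # Keep at least one mini-batch; rounding up is an assumption.
    k = max(1, math.ceil(delta * len(batch_losses)))
    # Rank mini-batch indices from highest loss to lowest.
    ranked = sorted(range(len(batch_losses)),
                    key=lambda i: batch_losses[i], reverse=True)
    return ranked[:k]


losses = [0.2, 1.5, 0.7, 0.9]
hard = select_hard_batches(losses, delta=0.5)
print(hard)  # indices of the two highest-loss mini-batches
```

In a training loop, the returned indices would pick which mini-batches receive additional gradient updates; how often the losses are refreshed is left open here.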
2304.03411
Report |
InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning |
Jing Shi, Wei Xiong, Zhe Lin, Hyun Joon Jung |
Recent advances in personalized image generation allow a pre-trained
text-to-image model to learn a new concept from a set of images. However,
existing personalization approaches usually require heavy test-time finetuning
for each concept, which is time-consuming and difficult to scale. We propose
InstantBooth, a novel approach built upon pre-trained text-to-image models that
enables instant text-guided image personalization without any test-time
finetuning. We achieve this with several major components. First, we learn the
general concept of the input images by converting them to a textual token with
a learnable image encoder. Second, to keep the fine details of the identity, we
learn rich visual feature representation by introducing a few adapter layers to
the pre-trained model. We train our components only on text-image pairs without
using paired images of the same concept. Compared to test-time finetuning-based
methods like DreamBooth and Textual-Inversion, our model can generate
competitive results on unseen concepts concerning language-image alignment,
image fidelity, and identity preservation while being 100 times faster. |
This paper introduces InstantBooth, a novel approach for personalized text-to-image generation that eliminates the need for time-consuming test-time finetuning. |
Existing personalization methods often require heavy test-time finetuning for each new concept, making them inefficient and difficult to scale. This paper tackles this limitation, enabling instant personalization. |
InstantBooth leverages a pre-trained text-to-image diffusion model and incorporates: (1) an image encoder to convert input images into a textual concept embedding, (2) adapter layers to inject rich visual features for identity preservation, and (3) techniques like balanced sampling and concept token renormalization for balancing identity and language alignment. |
InstantBooth achieves comparable results to test-time finetuning-based methods like DreamBooth and Textual-Inversion.
It demonstrates superior performance in language-image alignment and identity preservation.
The method is approximately 100 times faster than test-time finetuning-based methods such as DreamBooth. |
The current model requires separate training for each category.
The adapter design only allows for a single concept to provide identity details. |
text-to-image generation, personalized image synthesis, test-time finetuning, diffusion models, adapter layers |
2304.03373
Report |
Training-Free Layout Control with Cross-Attention Guidance |
Minghao Chen, Iro Laina, Andrea Vedaldi |
Recent diffusion-based generators can produce high-quality images from
textual prompts. However, they often disregard textual instructions that
specify the spatial layout of the composition. We propose a simple approach
that achieves robust layout control without the need for training or
fine-tuning of the image generator. Our technique manipulates the
cross-attention layers that the model uses to interface textual and visual
information and steers the generation in the desired direction given, e.g., a
user-specified layout. To determine how to best guide attention, we study the
role of attention maps and explore two alternative strategies, forward and
backward guidance. We thoroughly evaluate our approach on three benchmarks and
provide several qualitative examples and a comparative analysis of the two
strategies that demonstrate the superiority of backward guidance compared to
forward guidance, as well as prior work. We further demonstrate the versatility
of layout guidance by extending it to applications such as editing the layout
and context of real images. |
This paper proposes a training-free layout control method for diffusion-based image generators using cross-attention guidance, enabling user-specified layout control without retraining. |
Existing text-to-image generators struggle to accurately interpret and represent spatial relationships specified in text prompts, limiting their controllability. |
The method introduces two strategies: forward guidance (directly biasing attention maps) and backward guidance (using backpropagation to optimize latent codes for desired layout). |
Backward guidance significantly outperforms forward guidance and prior work in achieving layout control while maintaining image quality.
Analysis reveals the importance of all tokens, including special tokens like start and padding tokens, in shaping the layout.
The method effectively extends to real-image layout editing, enabling manipulation of object position and context within generated scenes. |
The impact of initial noise on layout needs further investigation.
Alternative optimization strategies remain to be explored for improved speed and performance. |
layout control, text-to-image generation, diffusion models, cross-attention, image editing |
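The two guidance strategies in this report can be illustrated with a toy sketch on a single cross-attention map. Everything here is an assumption for illustration: the function names `layout_energy` and `forward_bias`, the squared-penalty form of the energy, and the blending rule are hypothetical and are not taken from the paper's code. In a real diffusion pipeline, backward guidance would differentiate such an energy with respect to the latent using an autodiff framework; that step is omitted.

```python
def layout_energy(attn, box):
    """Penalty that is 0 when all attention mass lies inside the box.

    attn: 2D list of non-negative attention weights for one text token.
    box: (top, left, bottom, right) with half-open index ranges.
    """
    top, left, bottom, right = box
    total = sum(sum(row) for row in attn)
    if total == 0:
        raise ValueError("attention map has no mass")
    inside = sum(sum(row[left:right]) for row in attn[top:bottom])
    return (1.0 - inside / total) ** 2


def forward_bias(attn, box, strength=0.5):
    """Blend the attention map toward a uniform map over the box.

    The total attention mass is preserved; strength=1.0 moves all of
    it inside the box.
    """
    top, left, bottom, right = box
    total = sum(sum(row) for row in attn)
    area = (bottom - top) * (right - left)
    target = total / area  # uniform weight assigned inside the box
    biased = []
    for r, row in enumerate(attn):
        new_row = []
        for c, weight in enumerate(row):
            in_box = top <= r < bottom and left <= c < right
            goal = target if in_box else 0.0
            new_row.append((1.0 - strength) * weight + strength * goal)
        biased.append(new_row)
    return biased


attn = [[1.0, 1.0], [1.0, 1.0]]
box = (0, 0, 1, 2)  # top row of a 2x2 map
print(layout_energy(attn, box))
print(layout_energy(forward_bias(attn, box, strength=0.5), box))
```

The energy drops as attention mass moves into the box, which is the quantity a backward-guidance step would try to reduce.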
2304.03307
Report |
Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting |
Syed Talal Wasim, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, Mubarak Shah |
Adapting contrastive image-text pretrained models like CLIP to video
classification has gained attention due to its cost-effectiveness and
competitive performance. However, recent works in this area face a trade-off.
Finetuning the pretrained model to achieve strong supervised performance
results in low zero-shot generalization. Similarly, freezing the backbone to
retain zero-shot capability causes a significant drop in supervised accuracy.
Because of this, recent works in the literature typically train separate models for
supervised and zero-shot action recognition. In this work, we propose a
multimodal prompt learning scheme that works to balance the supervised and
zero-shot performance under a single unified training. Our prompting approach
on the vision side caters for three aspects: 1) Global video-level prompts to
model the data distribution; 2) Local frame-level prompts to provide per-frame
discriminative conditioning; and 3) a summary prompt to extract a condensed
video representation. Additionally, we define a prompting scheme on the text
side to augment the textual context. Through this prompting scheme, we can
achieve state-of-the-art zero-shot performance on Kinetics-600, HMDB51 and
UCF101 while remaining competitive in the supervised setting. By keeping the
pretrained backbone frozen, we optimize a much lower number of parameters and
retain the existing general representation which helps achieve the strong
zero-shot performance. Our codes/models are released at
https://github.com/TalalWasim/Vita-CLIP. |
This paper proposes Vita-CLIP, a multimodal prompt learning approach for adapting CLIP to video recognition, balancing supervised learning with zero-shot generalization using a single unified training scheme. |
Existing methods for adapting CLIP to video recognition often sacrifice zero-shot generalization for supervised performance, or vice versa. This work aims to address this trade-off with a unified model. |
Vita-CLIP introduces a multimodal prompting scheme involving: 1) video-level prompts for global distribution learning, 2) frame-level prompts for per-frame discriminative conditioning, 3) a summary prompt for condensed video representation, and 4) text prompts for augmented textual context. |
Vita-CLIP achieves state-of-the-art zero-shot performance on Kinetics-600, HMDB51, and UCF101, outperforming previous methods by significant margins.
It maintains competitive supervised performance on Kinetics-400 and Something-Something-V2 compared to methods fine-tuning the entire CLIP backbone.
The method effectively captures per-frame variations and overall video distribution, as shown through ablations and visualizations. |
The performance on fine-grained datasets like Something-Something-V2, while improved over previous vision-language models, is still lower than cross-entropy based methods, suggesting future work in this area.
Exploring more efficient prompting techniques and extending the method to other video understanding tasks like retrieval could be interesting research directions. |
video recognition, zero-shot learning, prompt learning, clip, vision-language models |
2304.03285
Report |
$\text{DC}^2$: Dual-Camera Defocus Control by Learning to Refocus |
Hadi Alzayer, Abdullah Abuolaim, Leung Chun Chan, Yang Yang, Ying Chen Lou, Jia-Bin Huang, Abhishek Kar |
Smartphone cameras today are increasingly approaching the versatility and
quality of professional cameras through a combination of hardware and software
advancements. However, fixed aperture remains a key limitation, preventing
users from controlling the depth of field (DoF) of captured images. At the same
time, many smartphones now have multiple cameras with different fixed apertures
-- specifically, an ultra-wide camera with a wider field of view and deeper DoF
and a higher resolution primary camera with shallower DoF. In this work, we
propose $\text{DC}^2$, a system for defocus control for synthetically varying
camera aperture, focus distance and arbitrary defocus effects by fusing
information from such a dual-camera system. Our key insight is to leverage
a real-world smartphone camera dataset by using image refocus as a proxy task for
learning to control defocus. Quantitative and qualitative evaluations on
real-world data demonstrate our system's efficacy, where we outperform
the state of the art on defocus deblurring, bokeh rendering, and image refocus.
Finally, we demonstrate creative post-capture defocus control enabled by our
method, including tilt-shift and content-based defocus effects. |
The paper introduces $DC^2$, a learning-based system for depth-of-field control in dual-camera smartphones, enabling defocus deblurring, depth-based blur rendering, and image refocusing. |
Current smartphone cameras lack post-capture depth-of-field control due to fixed apertures, and existing methods for defocus manipulation often focus on isolated aspects like deblurring or bokeh rendering. $DC^2$ offers a unified approach to address these limitations using readily available dual-camera systems. |
The system uses a novel training strategy by leveraging image refocusing as a proxy task. It is trained on a dataset of real-world dual-camera images with varying focus distances, learning to fuse information from wide and ultra-wide cameras to control defocus. |
$DC^2$ outperforms state-of-the-art methods on defocus deblurring, even without being explicitly trained on all-in-focus images.
It achieves competitive performance in synthesizing shallow depth-of-field effects compared to dedicated bokeh rendering methods.
The system excels at image refocusing, surpassing baselines that rely on sequential deblurring and blurring steps. |
The method relies on the ultra-wide camera having a deeper depth-of-field than the wide camera, limiting its effectiveness for systems with similar camera configurations.
It depends on the accuracy of pre-existing optical flow and stereo depth algorithms, which can be unreliable in the presence of defocus blur, presenting an area for improvement in future work. |
depth-of-field control, dual-camera systems, defocus deblurring, image refocusing, bokeh rendering |
2304.03284
Report |
SegGPT: Segmenting Everything In Context |
Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, Tiejun Huang |
We present SegGPT, a generalist model for segmenting everything in context.
We unify various segmentation tasks into a generalist in-context learning
framework that accommodates different kinds of segmentation data by
transforming them into the same format of images. The training of SegGPT is
formulated as an in-context coloring problem with random color mapping for each
data sample. The objective is to accomplish diverse tasks according to the
context, rather than relying on specific colors. After training, SegGPT can
perform arbitrary segmentation tasks in images or videos via in-context
inference, such as object instance, stuff, part, contour, and text. SegGPT is
evaluated on a broad range of tasks, including few-shot semantic segmentation,
video object segmentation, semantic segmentation, and panoptic segmentation.
Our results show strong capabilities in segmenting in-domain and out-of-domain
targets, either qualitatively or quantitatively. |
SegGPT is a generalist model for segmenting everything in context, unifying various segmentation tasks into an in-context learning framework. |
Existing specialist segmentation models are limited to specific tasks, requiring new models and expensive annotation for different settings. This work aims to train a single model capable of solving diverse segmentation tasks. |
The model views segmentation as a general format for visual perception, accommodating different data types by transforming them into images. It uses an in-context coloring training scheme with random color mapping to foster generalizability. |
SegGPT achieves comparable or better performance than state-of-the-art specialist models on few-shot semantic segmentation benchmarks, including out-of-domain tasks.
Despite not being specifically trained for video object segmentation, SegGPT achieves competitive results on benchmarks like YouTube-VOS 2018 and DAVIS 2017.
The model shows strong qualitative results on arbitrary object/part segmentation, text segmentation, and close-set instance/semantic segmentation with learnable prompt tuning. |
While the random color scheme enhances generalization, it makes training more challenging, potentially leading to inferior performance on in-domain tasks with large datasets.
Future work includes scaling up the model size and exploring self-supervised learning for improved performance and addressing data limitations. |
segmentation, generalist model, in-context learning, computer vision, vision transformer |
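The random color mapping described in this report can be sketched on a toy label map. This is only an illustration under assumptions: the function name `random_color_mapping`, the RGB palette, and the rule that colors stay distinct within one sample are hypothetical and not drawn from the SegGPT code.

```python
import random


def random_color_mapping(label_map, rng):
    """Recolor a segmentation label map with a fresh random palette.

    label_map: 2D list of integer segment labels.
    rng: a random.Random instance, so that each data sample can draw
         its own palette.
    Returns (colored_map, palette), where palette maps label -> RGB.
    """
    labels = sorted({label for row in label_map for label in row})
    palette = {}
    used = set()
    for label in labels:
        color = tuple(rng.randrange(256) for _ in range(3))
        while color in used:  # keep colors distinct within the sample
            color = tuple(rng.randrange(256) for _ in range(3))
        used.add(color)
        palette[label] = color
    colored = [[palette[label] for label in row] for row in label_map]
    return colored, palette


labels = [[0, 0, 1],
          [2, 1, 1]]
colored, palette = random_color_mapping(labels, random.Random(0))
print(palette)
```

Because the palette changes from sample to sample, a model trained on such targets cannot tie a fixed color to a fixed class and has to read the mapping from the in-context example instead.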
2304.03283
Report |
Diffusion Models as Masked Autoencoders |
Chen Wei, Karttikeya Mangalam, Po-Yao Huang, Yanghao Li, Haoqi Fan, Hu Xu, Huiyu Wang, Cihang Xie, Alan Yuille, Christoph Feichtenhofer |
There has been a longstanding belief that generation can facilitate a true
understanding of visual data. In line with this, we revisit generatively
pre-training visual representations in light of recent interest in denoising
diffusion models. While directly pre-training with diffusion models does not
produce strong representations, we condition diffusion models on masked input
and formulate diffusion models as masked autoencoders (DiffMAE). Our approach
is capable of (i) serving as a strong initialization for downstream recognition
tasks, (ii) conducting high-quality image inpainting, and (iii) being
effortlessly extended to video where it produces state-of-the-art
classification accuracy. We further perform a comprehensive study on the pros
and cons of design choices and build connections between diffusion models and
masked autoencoders. |
This paper introduces Diffusion Masked Autoencoders (DiffMAE), a novel self-supervised learning framework that unifies diffusion models with masked autoencoders. |
This work addresses the challenge of effectively utilizing generative pre-training for visual recognition tasks, inspired by the success of generative language models. |
DiffMAE incorporates masking into diffusion models, training the model to predict the pixel distribution of masked image regions conditioned on visible regions. The model leverages a ViT-based architecture and explores various decoder designs and training strategies. |
DiffMAE achieves strong performance on ImageNet classification, comparable to leading self-supervised learning methods, while also enabling high-quality image inpainting.
DiffMAE demonstrates superior inpainting capabilities compared to specialized inpainting algorithms, quantitatively and qualitatively.
The framework effortlessly extends to video, achieving state-of-the-art results on Kinetics-400 video classification and demonstrating promising video inpainting capabilities. |
There is a trade-off between optimal settings for recognition and inpainting tasks, requiring further exploration for a unified approach.
Future work includes investigating the potential of incorporating techniques like tokenization and layer scale to further enhance performance. |
diffusion models, masked autoencoders, generative pre-training, self-supervised learning, image and video inpainting |
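A toy sketch of how inputs for such a masked diffusion objective might be prepared: visible patches stay clean and act as conditioning, while masked patches are noised with the standard forward-diffusion rule x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. The function name `diffmae_style_inputs`, the patch representation as flat lists, and the way the mask is sampled are assumptions for illustration, not the authors' code.

```python
import math
import random


def diffmae_style_inputs(patches, mask_ratio, alpha_bar, rng):
    """Split patches into clean visible ones and noised masked ones.

    patches: list of patches, each a list of floats.
    mask_ratio: fraction of patches to mask (rounded to a count).
    alpha_bar: cumulative signal level in [0, 1] at the chosen timestep.
    rng: a random.Random instance.
    Returns (visible, noisy_masked), both dicts keyed by patch index.
    """
    n = len(patches)
    num_masked = int(round(mask_ratio * n))
    masked_idx = set(rng.sample(range(n), num_masked))
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    visible, noisy_masked = {}, {}
    for i, patch in enumerate(patches):
        if i in masked_idx:
            # Forward diffusion applied to masked patches only.
            noisy_masked[i] = [signal * v + noise * rng.gauss(0.0, 1.0)
                               for v in patch]
        else:
            visible[i] = list(patch)
    return visible, noisy_masked


patches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
visible, noisy = diffmae_style_inputs(patches, 0.75, 0.5, random.Random(0))
print(sorted(visible), sorted(noisy))
```

A denoising network would then take the visible patches as context and predict the clean content of the noised ones; that network is outside the scope of this sketch.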
2304.03266
Report |
Neural Fields meet Explicit Geometric Representation for Inverse Rendering of Urban Scenes |
Zian Wang, Tianchang Shen, Jun Gao, Shengyu Huang, Jacob Munkberg, Jon Hasselgren, Zan Gojcic, Wenzheng Chen, Sanja Fidler |
Reconstruction and intrinsic decomposition of scenes from captured imagery
would enable many applications such as relighting and virtual object insertion.
Recent NeRF-based methods achieve impressive fidelity of 3D reconstruction, but
bake the lighting and shadows into the radiance field, while mesh-based methods
that facilitate intrinsic decomposition through differentiable rendering have
not yet scaled to the complexity and scale of outdoor scenes. We present a
novel inverse rendering framework for large urban scenes capable of jointly
reconstructing the scene geometry, spatially-varying materials, and HDR
lighting from a set of posed RGB images with optional depth. Specifically, we
use a neural field to account for the primary rays, and use an explicit mesh
(reconstructed from the underlying neural field) for modeling secondary rays
that produce higher-order lighting effects such as cast shadows. By faithfully
disentangling complex geometry and materials from lighting effects, our method
enables photorealistic relighting with specular and shadow effects on several
outdoor datasets. Moreover, it supports physics-based scene manipulations such
as virtual object insertion with ray-traced shadow casting. |
Presents FEGR, a novel hybrid-rendering pipeline for inverse rendering of large urban scenes, combining neural fields and explicit mesh representations for efficient and high-quality intrinsic decomposition. |
Enables realistic relighting and virtual object insertion in large-scale environments by disentangling geometry, materials, and HDR lighting, which is crucial for applications like AR/VR and digital twins. |
Uses a neural field for primary ray rendering and volumetrically renders a G-buffer, then extracts a mesh from the signed distance field for efficient physics-based rendering of secondary rays, enabling high-quality shadow and specular effects. |
Significantly outperforms state-of-the-art in novel-view synthesis under varying lighting on the NeRF-OSR dataset.
Demonstrates high-quality intrinsic decomposition on challenging single-illumination driving scenes, surpassing baseline methods in albedo, geometry, and environment map accuracy.
Enables photorealistic virtual object insertion with accurate cast shadows, confirmed by a user study where participants significantly preferred FEGR over baseline methods. |
Relies on manually designed priors for regularization due to the ill-posed nature of inverse rendering, potentially limiting generalizability.
Currently limited to static scenes, requiring future extensions with dynamic NeRF techniques to handle dynamic environments. |
inverse rendering, neural rendering, neural fields, explicit mesh representation, urban scenes |
2304.03246
Report |
Inst-Inpaint: Instructing to Remove Objects with Diffusion Models |
Ahmet Burak Yildirim, Vedat Baday, Erkut Erdem, Aykut Erdem, Aysegul Dundar |
The image inpainting task refers to erasing unwanted pixels from images and
filling them in a semantically consistent and realistic way. Traditionally, the
pixels to be erased are defined with binary masks. From the
application point of view, a user needs to generate the masks for the objects
they would like to remove, which can be time-consuming and prone to errors. In
this work, we are interested in an image inpainting algorithm that estimates
which object to remove based on natural language input and simultaneously
removes it. For this purpose, we first construct a dataset named
GQA-Inpaint for this task. Second, we present a novel inpainting framework,
Inst-Inpaint, that can remove objects from images based on the instructions
given as text prompts. We set various GAN and diffusion-based baselines and run
experiments on synthetic and real image datasets. We compare methods with
different evaluation metrics that measure the quality and accuracy of the
models and show significant quantitative and qualitative improvements. |
This paper presents Inst-Inpaint, a novel end-to-end image inpainting framework that removes objects from images based solely on textual instructions, eliminating the need for binary masks. |
The proposed instructional image inpainting task offers a more natural and user-friendly way to control image inpainting compared to traditional mask-based methods. |
The authors create a new real image dataset, GQA-Inpaint, derived from the GQA dataset, and propose Inst-Inpaint, a conditional Latent Diffusion Model trained on this dataset. Inst-Inpaint leverages text prompts and source image encoding to perform object removal. |
Inst-Inpaint achieves superior FID scores and CLIP-based inpainting scores compared to baseline methods like Instruct X-Decoder, InstPix2Pix, and CLIPSeg on the GQA-Inpaint dataset.
The model effectively removes objects in complex scenarios and demonstrates accurate attention to objects targeted for removal.
Analysis of attention maps reveals that Inst-Inpaint implicitly identifies objects for removal with higher accuracy than methods explicitly predicting masks based on prompts, such as CLIPSeg and X-Decoder. |
The reliance on autoencoders in the LDM architecture can lead to poor reconstruction of complex patterns in the output, even when object removal is successful.
Future work can explore more powerful autoencoders or alternative optimization strategies to address reconstruction quality limitations. |
instruction-based inpainting, diffusion models, image editing, text-to-image synthesis, gqa dataset |
2304.03199
Report |
Face Animation with an Attribute-Guided Diffusion Model |
Bohan Zeng, Xuhui Liu, Sicheng Gao, Boyu Liu, Hong Li, Jianzhuang Liu, Baochang Zhang |
Face animation has achieved much progress in computer vision. However,
prevailing GAN-based methods suffer from unnatural distortions and artifacts
due to sophisticated motion deformation. In this paper, we propose a Face
Animation framework with an attribute-guided Diffusion Model (FADM), which is
the first work to exploit the superior modeling capacity of diffusion models
for photo-realistic talking-head generation. To mitigate the uncontrollable
synthesis effect of the diffusion model, we design an Attribute-Guided
Conditioning Network (AGCN) to adaptively combine the coarse animation features
and 3D face reconstruction results, which can incorporate appearance and motion
conditions into the diffusion process. These specific designs help FADM rectify
unnatural artifacts and distortions, and also enrich high-fidelity facial
details through iterative diffusion refinements with accurate animation
attributes. FADM can flexibly and effectively improve existing animation
videos. Extensive experiments on widely used talking-head benchmarks validate
the effectiveness of FADM over prior arts. |
This paper introduces FADM, a novel face animation framework utilizing an attribute-guided diffusion model to enhance the quality of animation results, rectifying distortions and artifacts common in GAN-based methods. |
Existing GAN-based face animation methods often produce unnatural distortions and artifacts. This work leverages the superior modeling capacity of diffusion models to generate more photo-realistic talking-head videos. |
FADM consists of a coarse generative module, 3D face reconstruction, an attribute-guided conditioning network (AGCN), and a diffusion rendering module. AGCN combines coarse animation features with 3D face reconstruction to guide the diffusion process, ensuring accurate animation attributes and high-fidelity facial details. |
FADM achieves state-of-the-art performance on widely used talking-head benchmarks like VoxCeleb and CelebA.
It effectively rectifies distortions and enriches facial details while preserving accurate appearance and motion.
The framework can also be applied to improve the quality of existing animation videos. |
The performance of FADM on datasets with low-resolution images and blurred textures, like VoxCeleb2, can be further improved.
Exploring more effective attribute-guided strategies to further enhance the controllability and fidelity of face animation is a promising direction. |
face animation, diffusion models, generative models, deep learning, computer vision |
2304.03119
Report |
Zero-shot Generative Model Adaptation via Image-specific Prompt Learning |
Jiayi Guo, Chaofei Wang, You Wu, Eric Zhang, Kai Wang, Xingqian Xu, Shiji Song, Humphrey Shi, Gao Huang |
Recently, CLIP-guided image synthesis has shown appealing performance on
adapting a pre-trained source-domain generator to an unseen target domain. It
does not require any target-domain samples but only the textual domain labels.
The training is highly efficient, e.g., a few minutes. However, existing
methods still have some limitations in the quality of generated images and may
suffer from the mode collapse issue. A key reason is that a fixed adaptation
direction is applied for all cross-domain image pairs, which leads to identical
supervision signals. To address this issue, we propose an Image-specific Prompt
Learning (IPL) method, which learns specific prompt vectors for each
source-domain image. This produces a more precise adaptation direction for
every cross-domain image pair, endowing the target-domain generator with
greatly enhanced flexibility. Qualitative and quantitative evaluations on
various domains demonstrate that IPL effectively improves the quality and
diversity of synthesized images and alleviates the mode collapse. Moreover, IPL
is independent of the structure of the generative model, such as generative
adversarial networks or diffusion models. Code is available at
https://github.com/Picsart-AI-Research/IPL-Zero-Shot-Generative-Model-Adaptation. |
This paper proposes Image-specific Prompt Learning (IPL), a novel approach for zero-shot generative model adaptation that addresses the limitations of existing methods relying on fixed adaptation directions. |
Existing CLIP-guided zero-shot image synthesis methods, while efficient, suffer from limited image quality and mode collapse due to fixed adaptation directions applied to all cross-domain image pairs. IPL aims to overcome these limitations by introducing image-specific prompt learning. |
IPL is a two-stage method. Stage 1 trains a latent mapper to generate image-specific prompt vectors for each source image using a contrastive learning scheme and a domain regularization loss. Stage 2 incorporates the trained mapper to generate adaptive, image-specific adaptation directions for training the target-domain generator. |
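The adaptation direction used in Stage 2 can be illustrated with a directional loss that compares the shift between source and target image embeddings with the shift between source and target text embeddings. This is a minimal sketch, not the authors' implementation: the random vectors stand in for CLIP embeddings, the cosine form of the loss is an assumption, and `directional_loss`, `unit`, and all variable names are hypothetical.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def directional_loss(img_src, img_tgt, txt_src, txt_tgt):
    """1 - cosine similarity between the image shift and the text shift."""
    img_dir = unit(img_tgt - img_src)
    txt_dir = unit(txt_tgt - txt_src)
    return float(1.0 - img_dir @ txt_dir)

rng = np.random.default_rng(0)
dim = 8                                  # toy embedding size (assumption)
txt_target = rng.normal(size=dim)        # stand-in for the target-domain label embedding
fixed_src_prompt = rng.normal(size=dim)  # one source prompt shared by all images
img_src = rng.normal(size=(3, dim))      # stand-ins for three source-image embeddings
img_tgt = rng.normal(size=(3, dim))      # stand-ins for the adapted images
# Hypothetical image-specific prompts; in IPL these would come from the Stage 1 mapper.
specific_src_prompts = fixed_src_prompt + 0.5 * rng.normal(size=(3, dim))

for i in range(3):
    fixed = directional_loss(img_src[i], img_tgt[i], fixed_src_prompt, txt_target)
    specific = directional_loss(img_src[i], img_tgt[i], specific_src_prompts[i], txt_target)
    print(f"image {i}: fixed-direction loss {fixed:.3f}, image-specific loss {specific:.3f}")
```

With a single fixed source prompt, every image pair is supervised by the same text direction; with per-image prompts, the text direction changes from image to image, which is the property the method description attributes to IPL.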
IPL effectively improves the quality and diversity of synthesized images compared to existing methods like NADA.
IPL alleviates the mode collapse issue observed in previous approaches.
IPL is model-agnostic and can be applied to both GAN-based and diffusion-based generative models. |
The visualization and interpretability of the learned prompt vectors remain challenging.
The performance of IPL in scenarios with large domain shifts requires further investigation. |
generative model adaptation, zero-shot learning, clip, prompt learning, image synthesis |
2304.02978
Report |
Simplifying Low-Light Image Enhancement Networks with Relative Loss Functions |
Yu Zhang, Xiaoguang Di, Junde Wu, Rao Fu, Yong Li, Yue Wang, Yanwu Xu, Guohui Yang, Chunhui Wang |
Image enhancement is a common technique used to mitigate issues such as
severe noise, low brightness, low contrast, and color deviation in low-light
images. However, providing an optimal high-light image as a reference for
low-light image enhancement tasks is impossible, which makes the learning
process more difficult than other image processing tasks. As a result, although
several low-light image enhancement methods have been proposed, most of them
are either too complex or insufficient in addressing all the issues in
low-light images. In this paper, to make the learning easier in low-light image
enhancement, we introduce FLW-Net (Fast and LightWeight Network) and two
relative loss functions. Specifically, we first recognize the challenges of the
need for a large receptive field to obtain global contrast and the lack of an
absolute reference, which limits the simplification of network structures in
this task. Then, we propose an efficient global feature information extraction
component and two loss functions based on relative information to overcome
these challenges. Finally, we conducted comparative experiments to demonstrate
the effectiveness of the proposed method, and the results confirm that the
proposed method can significantly reduce the complexity of supervised low-light
image enhancement networks while improving processing effect. The code is
available at https://github.com/hitzhangyu/FLW-Net. |
This paper presents FLW-Net, a fast and lightweight network for low-light image enhancement, along with two novel relative loss functions to simplify the learning process. |
Low-light image enhancement suffers from a lack of optimal reference images, making existing methods either too complex or insufficient. FLW-Net addresses this by simplifying network structure and using relative loss functions that don't require exact output-reference matching. |
FLW-Net utilizes a Global Feature Extraction (GFE) component to efficiently extract global information from image histograms. It employs two relative loss functions: L_brightness for brightness order similarity and L_structure for similar gradient orders between enhanced and reference images. |
FLW-Net achieves comparable or better performance than state-of-the-art methods while maintaining faster processing speed.
Relative loss functions, particularly L_brightness and L_structure, significantly improve PSNR and SSIM, demonstrating effectiveness in noise removal and structural preservation.
Combining the proposed loss functions with other networks, like RetinexNet and KIND, enhances their performance with fewer parameters or operations. |
Enhancement results depend on the desired brightness parameter (μ_test).
Training requires paired data, limiting applicability to unpaired datasets. |
low-light image enhancement, lightweight network, relative loss functions, global feature extraction, image restoration |
2304.02827
Report |
DITTO-NeRF: Diffusion-based Iterative Text To Omni-directional 3D Model |
Hoigi Seo, Hayeon Kim, Gwanghyun Kim, Se Young Chun |
The increasing demand for high-quality 3D content creation has motivated the
development of automated methods for creating 3D object models from a single
image and/or from a text prompt. However, the reconstructed 3D objects using
state-of-the-art image-to-3D methods still exhibit low correspondence to the
given image and low multi-view consistency. Recent state-of-the-art text-to-3D
methods are also limited, yielding 3D samples with low diversity per prompt
with long synthesis time. To address these challenges, we propose DITTO-NeRF, a
novel pipeline to generate a high-quality 3D NeRF model from a text prompt or a
single image. Our DITTO-NeRF consists of constructing a high-quality partial 3D
object for limited in-boundary (IB) angles using the given or text-generated 2D
image from the frontal view and then iteratively reconstructing the remaining
3D NeRF using an inpainting latent diffusion model. We propose progressive 3D
object reconstruction schemes in terms of scales (low to high resolution),
angles (IB angles initially to outer-boundary (OB) later), and masks (object to
background boundary) in our DITTO-NeRF so that high-quality information on IB
can be propagated into OB. Our DITTO-NeRF outperforms state-of-the-art methods
in terms of fidelity and diversity qualitatively and quantitatively with much
faster training times than prior arts on image/text-to-3D such as DreamFusion,
and NeuralLift-360. |
DITTO-NeRF, a novel pipeline, generates high-quality 3D NeRF models from text prompts or single images by iteratively reconstructing partial 3D objects using inpainting latent diffusion models. |
Existing image-to-3D methods struggle with low correspondence and multi-view consistency, while text-to-3D methods suffer from low sample diversity and long synthesis times. DITTO-NeRF aims to address these challenges. |
DITTO-NeRF constructs a high-quality partial 3D object for limited in-boundary angles using a text-generated or user-provided 2D image. It then iteratively reconstructs the remaining 3D NeRF using an inpainting latent diffusion model, employing progressive schemes for scales, angles, and masks. |
Outperforms state-of-the-art image-to-3D methods in fidelity and multi-view consistency.
Exceeds existing text-to-3D methods in output fidelity and diversity.
Achieves these improvements with significantly faster training times compared to prior arts like DreamFusion and NeuralLift-360. |
Limited depth estimation accuracy for images with minimal shadows or generated from out-of-distribution data.
3D object quality is dependent on the quality of images generated by the diffusion model. |
nerf, text-to-3d, image-to-3d, diffusion models, 3d object generation |
2304.02797
Report |
DeLiRa: Self-Supervised Depth, Light, and Radiance Fields |
Vitor Guizilini, Igor Vasiljevic, Jiading Fang, Rares Ambrus, Sergey Zakharov, Vincent Sitzmann, Adrien Gaidon |
Differentiable volumetric rendering is a powerful paradigm for 3D
reconstruction and novel view synthesis. However, standard volume rendering
approaches struggle with degenerate geometries in the case of limited viewpoint
diversity, a common scenario in robotics applications. In this work, we propose
to use the multi-view photometric objective from the self-supervised depth
estimation literature as a geometric regularizer for volumetric rendering,
significantly improving novel view synthesis without requiring additional
information. Building upon this insight, we explore the explicit modeling of
scene geometry using a generalist Transformer, jointly learning a radiance
field as well as depth and light fields with a set of shared latent codes. We
demonstrate that sharing geometric information across tasks is mutually
beneficial, leading to improvements over single-task learning without an
increase in network complexity. Our DeLiRa architecture achieves
state-of-the-art results on the ScanNet benchmark, enabling high quality
volumetric rendering as well as real-time novel view and depth synthesis in the
limited viewpoint diversity setting. |
Introduces multi-view photometric loss as regularization for volumetric rendering to improve 3D geometry learning, especially in limited viewpoint scenarios, and proposes DeLiRa, a novel architecture that jointly learns depth, light, and radiance fields from a shared latent space. |
Addresses the challenge of degenerate geometries in volumetric rendering due to limited viewpoint diversity, a common issue in applications like robotics. |
Combines volumetric rendering with a self-supervised multi-view photometric loss, using depth inferred from rendering to enforce multi-view consistency. DeLiRa utilizes a shared latent representation and cross-attention decoders for efficient and effective multi-task learning. |
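How depth can be inferred from volumetric rendering and then reused for a multi-view consistency check can be sketched as follows. This is a minimal illustration using the standard volume-rendering weights, not the DeLiRa code: `expected_depth` and `reproject` are hypothetical helper names, the intrinsics and pose are toy values, and depth is treated as z-depth for simplicity.

```python
import numpy as np

def expected_depth(sigmas, t_vals):
    """Depth of one ray as the weight-averaged sample distance."""
    # Distance between consecutive samples; the last interval is open-ended.
    deltas = np.diff(t_vals, append=t_vals[-1] + 1e10)
    alphas = 1.0 - np.exp(-sigmas * deltas)                         # opacity per sample
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))  # transmittance
    weights = trans * alphas
    return float(np.sum(weights * t_vals)), weights

def reproject(pixel, depth, K, R, t):
    """Lift a pixel to 3D with its depth and project it into a second camera."""
    ray = np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])
    point = depth * ray        # 3D point in the first camera frame
    cam2 = R @ point + t       # same point in the second camera frame
    uv = K @ cam2
    return uv[:2] / uv[2]

# Toy ray: density concentrated at the third sample.
t_vals = np.array([1.0, 2.0, 3.0, 4.0])
sigmas = np.array([0.0, 0.0, 50.0, 0.0])
depth, weights = expected_depth(sigmas, t_vals)

# Toy camera: 100-pixel focal length, second view shifted along x.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
uv = reproject((32.0, 32.0), depth, K, np.eye(3), np.array([0.3, 0.0, 0.0]))
print(depth, uv)
```

A photometric loss would then compare the colour at `pixel` in the first image with the colour sampled at `uv` in the second image, so errors in the rendered depth show up as colour mismatches.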
Achieves state-of-the-art depth and view synthesis on ScanNet, outperforming methods reliant on ground truth or pre-trained networks.
Demonstrates that multi-view photometric loss effectively regularizes volumetric rendering, enabling accurate geometry recovery in limited viewpoint settings.
Shows joint learning of depth, light, and radiance fields in a shared latent space improves performance across tasks compared to single-task networks. |
Remains scene-specific and requires retraining for new scenes.
Requires image overlap for multi-view photometric self-supervision, limiting applicability in very sparse view scenarios. |
volumetric rendering, depth estimation, neural radiance fields, self-supervised learning, multi-view photometric loss |
2304.02744
Report |
StyleGAN Salon: Multi-View Latent Optimization for Pose-Invariant Hairstyle Transfer |
Sasikarn Khwanmuang, Pakkapon Phongthawee, Patsorn Sangkloy, Supasorn Suwajanakorn |
Our paper seeks to transfer the hairstyle of a reference image to an input
photo for virtual hair try-on. We target a variety of challenging scenarios,
such as transforming a long hairstyle with bangs to a pixie cut, which requires
removing the existing hair and inferring how the forehead would look, or
transferring partially visible hair from a hat-wearing person in a different
pose. Past solutions leverage StyleGAN for hallucinating any missing parts and
producing a seamless face-hair composite through so-called GAN inversion or
projection. However, there remains a challenge in controlling the
hallucinations to accurately transfer hairstyle and preserve the face shape and
identity of the input. To overcome this, we propose a multi-view optimization
framework that uses "two different views" of reference composites to
semantically guide occluded or ambiguous regions. Our optimization shares
information between two poses, which allows us to produce high fidelity and
realistic results from incomplete references. Our framework produces
high-quality results and outperforms prior work in a user study that consists
of significantly more challenging hair transfer scenarios than previously
studied. Project page: https://stylegan-salon.github.io/. |
This paper presents StyleGAN Salon, a novel pose-invariant hairstyle transfer pipeline that leverages multi-view latent optimization to transfer hairstyles between images with significant pose differences. |
Hairstyle transfer in the wild is challenging, particularly when dealing with large pose discrepancies between input face and reference hair images. Existing methods often struggle with preserving hair texture, facial features, and background details. |
The method involves constructing two guide images from different viewpoints using EG3D for geometric consistency and employs a multi-stage optimization strategy. First, it hallucinates missing details by optimizing in the W latent space of StyleGAN2. Subsequently, it refines the output by optimizing in the extended W+ space to recover fine-grained details of both face and hair. Lastly, it utilizes Pivotal Tuning Inversion (PTI) to further enhance the fidelity of the final output. |
Outperforms state-of-the-art methods like StyleYourHair, Barbershop, and LOHO in user studies, demonstrating superior hairstyle transfer quality, especially in challenging scenarios like pose misalignment, bangs removal, and hat removal.
Exhibits better preservation of input facial shape compared to other methods, as indicated by lower RMSE scores on facial landmarks.
Successfully handles a variety of challenging scenarios, including transitions from long to short hairstyles, bangs/hat removal, and background inpainting. |
Struggles with eccentric hairstyles and faces.
Relies on multiple pretrained networks, which can introduce biases and limitations. |
hairstyle transfer, generative adversarial networks, stylegan, multi-view optimization, pose invariance |
2304.02642
Report |
Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models |
Xuhui Jia, Yang Zhao, Kelvin C. K. Chan, Yandong Li, Han Zhang, Boqing Gong, Tingbo Hou, Huisheng Wang, Yu-Chuan Su |
This paper proposes a method for generating images of customized objects
specified by users. The method is based on a general framework that bypasses
the lengthy optimization required by previous approaches, which often employ a
per-object optimization paradigm. Our framework adopts an encoder to capture
high-level identifiable semantics of objects, producing an object-specific
embedding with only a single feed-forward pass. The acquired object embedding
is then passed to a text-to-image synthesis model for subsequent generation. To
effectively blend an object-aware embedding space into a well-developed
text-to-image model under the same generation context, we investigate different
network designs and training strategies, and propose a simple yet effective
regularized joint training scheme with an object identity preservation loss.
Additionally, we propose a caption generation scheme that becomes a critical
piece in ensuring that the object-specific embedding is faithfully reflected in the
generation process, while keeping control and editing abilities. Once trained,
the network is able to produce diverse content and styles, conditioned on both
texts and objects. We demonstrate through experiments that our proposed method
is able to synthesize images with compelling output quality, appearance
diversity, and object fidelity, without the need of test-time optimization.
Systematic studies are also conducted to analyze our models, providing insights
for future work. |
This paper presents a novel method for personalized image synthesis that bypasses the need for per-object optimization, enabling efficient generation of customized images from a single reference image and a text prompt. |
Existing personalized image synthesis methods heavily rely on time-consuming per-object optimization, hindering their scalability and practicality. This work addresses this limitation by introducing an efficient and generalizable framework. |
The proposed framework leverages a pre-trained text-to-image diffusion model augmented with an object encoder. It employs a regularized joint training scheme with a cross-reference regularization to preserve object identity while maintaining editing capability. Additionally, a caption generation scheme enhances personalization by providing diverse text captions. |
The method generates high-quality, diverse images with strong object fidelity, outperforming existing state-of-the-art approaches.
It exhibits superior efficiency by eliminating the need for per-object optimization, achieving comparable performance in a single forward pass.
The framework demonstrates generalizability, effectively synthesizing images of various objects, even those unseen during training. |
The method may struggle to generate accurate details when the reference image lacks complete information.
Potential biases in training data could lead to biased image generation, requiring further investigation and mitigation strategies. |
image synthesis, text-to-image generation, personalized image synthesis, diffusion models, object encoding |
2304.02637
Report |
GenPhys: From Physical Processes to Generative Models |
Ziming Liu, Di Luo, Yilun Xu, Tommi Jaakkola, Max Tegmark |
Since diffusion models (DM) and the more recent Poisson flow generative
models (PFGM) are inspired by physical processes, it is reasonable to ask: Can
physical processes offer additional new generative models? We show that the
answer is yes. We introduce a general family, Generative Models from Physical
Processes (GenPhys), where we translate partial differential equations (PDEs)
describing physical processes to generative models. We show that generative
models can be constructed from s-generative PDEs (s for smooth). GenPhys
subsume the two existing generative models (DM and PFGM) and even give rise to
new families of generative models, e.g., "Yukawa Generative Models" inspired
from weak interactions. On the other hand, some physical processes by default
do not belong to the GenPhys family, e.g., the wave equation and the
Schrödinger equation, but could be made into the GenPhys family with some
modifications. Our goal with GenPhys is to explore and expand the design space
of generative models. |
This paper introduces GenPhys, a framework that converts partial differential equations (PDEs) describing physical processes into generative models. |
GenPhys expands the design space of generative models by leveraging the dynamics of diverse physical structures. |
The framework leverages the connection between PDEs and density flow. PDEs describing physical processes are rewritten as density flow equations, which then serve as the foundation for generative models. The s-generative property (smoothness, well-behaved density) is introduced as a requirement for a PDE to be suitable for GenPhys. |
GenPhys subsumes existing generative models like Diffusion Models (DM) and Poisson Flow Generative Models (PFGM).
The framework can leverage new physical processes to create new generative models (e.g., "Yukawa Generative Models" inspired by weak interactions).
Dispersion relations can serve as a rigorous criterion to determine whether a PDE is suitable for conversion into a generative model. |
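As a worked illustration of what a dispersion relation reveals, the plane-wave ansatz can be substituted into the diffusion equation and into the wave equation. This is a textbook derivation with unit coefficients assumed; it is not the paper's precise criterion, only the kind of calculation such a criterion builds on.

```latex
% Plane-wave ansatz: \phi(\mathbf{x}, t) = e^{i(\mathbf{k}\cdot\mathbf{x} - \omega t)}
% Diffusion equation: \partial_t \phi = \nabla^2 \phi
\begin{align}
-i\omega\,\phi &= -|\mathbf{k}|^2\,\phi
  \quad\Longrightarrow\quad \omega(\mathbf{k}) = -i\,|\mathbf{k}|^2, \\
\phi(\mathbf{x}, t) &= e^{i\mathbf{k}\cdot\mathbf{x}}\, e^{-|\mathbf{k}|^2 t}.
\end{align}
% Wave equation: \partial_t^2 \phi = \nabla^2 \phi
\begin{align}
-\omega^2\,\phi = -|\mathbf{k}|^2\,\phi
  \quad\Longrightarrow\quad \omega(\mathbf{k}) = \pm|\mathbf{k}|.
\end{align}
```

For diffusion, the frequency is imaginary and every mode decays, with high-wavenumber detail decaying fastest, so the field is progressively smoothed. For the wave equation, the frequency is real and modes only oscillate, which is consistent with the statement that the wave equation does not belong to the GenPhys family without modification.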
The paper primarily focuses on linear PDEs, leaving the exploration of non-linear PDEs for future work.
Further investigation is needed to analyze and test the practical performance of the newly proposed GenPhys models. |
generative models, physics-inspired ai, partial differential equations, density flow, dispersion relation |
2304.02633
Report |
HNeRV: A Hybrid Neural Representation for Videos |
Hao Chen, Matt Gwilliam, Ser-Nam Lim, Abhinav Shrivastava |
Implicit neural representations store videos as neural networks and have
performed well for various vision tasks such as video compression and
denoising. With frame index or positional index as input, implicit
representations (NeRV, E-NeRV, etc.) reconstruct video from fixed and
content-agnostic embeddings. Such embedding largely limits the regression
capacity and internal generalization for video interpolation. In this paper, we
propose a Hybrid Neural Representation for Videos (HNeRV), where a learnable
encoder generates content-adaptive embeddings, which act as the decoder input.
Besides the input embedding, we introduce HNeRV blocks, which ensure model
parameters are evenly distributed across the entire network, such that higher
layers (layers near the output) can have more capacity to store high-resolution
content and video details. With content-adaptive embeddings and re-designed
architecture, HNeRV outperforms implicit methods in video regression tasks for
both reconstruction quality (+4.7 PSNR) and convergence speed (16×
faster), and shows better internal generalization. As a simple and efficient
video representation, HNeRV also shows decoding advantages for speed,
flexibility, and deployment, compared to traditional codecs (H.264, H.265) and
learning-based compression methods. Finally, we explore the effectiveness of
HNeRV on downstream tasks such as video compression and video inpainting. We
provide a project page at https://haochen-rye.github.io/HNeRV, and code at
https://github.com/haochen-rye/HNeRV |
This paper introduces HNeRV, a hybrid neural representation for videos that combines a learnable encoder for content-adaptive embeddings with a redesigned decoder architecture for even parameter distribution. |
Existing implicit neural representations for videos suffer from limited generalizability and regression capacity due to content-agnostic embeddings and uneven parameter distribution in decoders. HNeRV addresses these limitations, aiming for improved quality, speed, and generalization in video representation. |
HNeRV consists of a learnable encoder (ConvNeXt blocks) to generate compact frame embeddings and a decoder (HNeRV blocks) that takes embeddings as input. HNeRV blocks are designed to balance parameters across layers, enhancing the representation of high-resolution content. |
HNeRV significantly outperforms implicit methods (NeRV, E-NeRV) in video reconstruction quality (+4.7 PSNR) and convergence speed (16x faster).
The more even parameter distribution in HNeRV's decoder, which gives later layers more capacity, proves crucial for reconstructing high-resolution videos.
HNeRV demonstrates strong performance in downstream tasks like video compression (competing with H.264, H.265) and video inpainting (comparable to SOTA). |
HNeRV requires training a new model for each video, which limits its applicability in settings where per-video fitting is impractical.
Determining optimal embedding size, model size, and network architecture for HNeRV remains an open challenge. |
neural representation, video compression, video regression, video inpainting, internal generalization |
2304.02626
Report |
Dynamic Point Fields |
Sergey Prokudin, Qianli Ma, Maxime Raafat, Julien Valentin, Siyu Tang |
Recent years have witnessed significant progress in the field of neural
surface reconstruction. While the extensive focus was put on volumetric and
implicit approaches, a number of works have shown that explicit graphics
primitives such as point clouds can significantly reduce computational
complexity, without sacrificing the reconstructed surface quality. However,
less emphasis has been put on modeling dynamic surfaces with point primitives.
In this work, we present a dynamic point field model that combines the
representational benefits of explicit point-based graphics with implicit
deformation networks to allow efficient modeling of non-rigid 3D surfaces.
Using explicit surface primitives also allows us to easily incorporate
well-established constraints such as as-isometric-as-possible regularisation.
While learning this deformation model is prone to local optima when trained in
a fully unsupervised manner, we propose to additionally leverage semantic
information such as keypoint dynamics to guide the deformation learning. We
demonstrate our model with an example application of creating an expressive
animatable human avatar from a collection of 3D scans. Here, previous methods
mostly rely on variants of the linear blend skinning paradigm, which
fundamentally limits the expressivity of such models when dealing with complex
cloth appearances such as long skirts. We show the advantages of our dynamic
point field framework in terms of its representational power, learning
efficiency, and robustness to out-of-distribution novel poses. |
This paper introduces Dynamic Point Fields (DPF), a novel model combining point-based graphics and deformation networks to efficiently model non-rigid 3D surfaces. |
DPFs offer a more efficient and compact alternative to implicit models for representing and animating complex dynamic surfaces, with benefits like faster training, better reconstruction quality, and lower memory requirements. |
DPF learns a deformation field represented by a neural network to warp points from a canonical point cloud to target shapes. It leverages constraints like as-isometric-as-possible regularization and keypoint guidance to learn plausible deformations. |
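The as-isometric-as-possible constraint mentioned above can be sketched as a penalty on changes in the distance between neighbouring points before and after deformation. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: `knn_edges` and `isometric_loss` are hypothetical names, the neighbourhood size `k` is arbitrary, and the absolute-difference form of the penalty is one possible choice.

```python
import numpy as np

def knn_edges(points, k):
    """Return (source, neighbour) index pairs for each point's k nearest neighbours."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    nbrs = np.argsort(dists, axis=1)[:, :k]
    src = np.repeat(np.arange(len(points)), k)
    return src, nbrs.ravel()

def isometric_loss(canonical, deformed, k=4):
    """Mean absolute change in edge length between canonical and deformed points."""
    src, dst = knn_edges(canonical, k)         # edges are defined on the canonical cloud
    len_before = np.linalg.norm(canonical[src] - canonical[dst], axis=1)
    len_after = np.linalg.norm(deformed[src] - deformed[dst], axis=1)
    return float(np.mean(np.abs(len_before - len_after)))

rng = np.random.default_rng(0)
canonical = rng.normal(size=(20, 3))

# A rigid motion (rotation about z plus translation) preserves all edge lengths.
angle = 0.7
rot_z = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                  [np.sin(angle),  np.cos(angle), 0.0],
                  [0.0, 0.0, 1.0]])
rigid = canonical @ rot_z.T + np.array([1.0, -2.0, 0.5])
stretched = 2.0 * canonical                    # uniform scaling changes every edge

print(isometric_loss(canonical, rigid), isometric_loss(canonical, stretched))
```

The penalty is close to zero for the rigid motion and clearly positive for the stretch, so adding it to a deformation objective discourages the network from tearing or inflating the surface while still allowing bending.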
DPF outperforms state-of-the-art implicit methods in static surface reconstruction, achieving better quality with smaller model size.
DPF demonstrates superior performance in learning deformation fields compared to SDF-based methods and non-rigid registration techniques.
DPF enables high-quality animation of clothed humans, particularly excelling with challenging garments like skirts, outperforming LBS-based methods. |
The deformation optimization can struggle with large deformations and topological changes when guidance is limited.
The current per-frame optimization is computationally expensive, limiting real-time applicability. Future work could explore meta-learning for faster inference. |
dynamic surface reconstruction, deformation learning, point cloud processing, neural surface representation, 3d human animation |
2304.02602
Report |
Generative Novel View Synthesis with 3D-Aware Diffusion Models |
Eric R. Chan, Koki Nagano, Matthew A. Chan, Alexander W. Bergman, Jeong Joon Park, Axel Levy, Miika Aittala, Shalini De Mello, Tero Karras, Gordon Wetzstein |
We present a diffusion-based model for 3D-aware generative novel view
synthesis from as few as a single input image. Our model samples from the
distribution of possible renderings consistent with the input and, even in the
presence of ambiguity, is capable of rendering diverse and plausible novel
views. To achieve this, our method makes use of existing 2D diffusion backbones
but, crucially, incorporates geometry priors in the form of a 3D feature
volume. This latent feature field captures the distribution over possible scene
representations and improves our method's ability to generate view-consistent
novel renderings. In addition to generating novel views, our method has the
ability to autoregressively synthesize 3D-consistent sequences. We demonstrate
state-of-the-art results on synthetic renderings and room-scale scenes; we also
show compelling results for challenging, real-world objects. |
This paper introduces a diffusion-based model for novel view synthesis that leverages 3D geometry priors, enabling realistic view synthesis from a single input image. |
Existing few-shot view synthesis methods struggle with long-range extrapolation and handling complex, real-world scenes due to their reliance on regression objectives. |
The method combines 2D diffusion models with a 3D feature volume capturing the scene representation. Input image(s) are encoded into the 3D feature volume, rendered from the target viewpoint, and fed to a U-Net denoiser along with a noisy image, iteratively generating the novel view. |
Outperforms state-of-the-art methods on ShapeNet and Matterport3D datasets in terms of image quality and view consistency.
Generates plausible and realistic novel views for challenging, real-world objects in the CO3D dataset, a first for single-shot NVS on this dataset.
Demonstrates high geometric consistency, as evidenced by dense COLMAP reconstructions from generated sequences. |
Limited output resolution (128x128) and slow inference speed due to the diffusion process.
Potential for minor inconsistencies and drift in challenging real-world datasets. |
novel view synthesis, diffusion models, 3d geometry priors, generative models, single image view synthesis |
2304.02364
Report |
What's in a Name? Beyond Class Indices for Image Recognition |
Kai Han, Yandong Li, Sagar Vaze, Jie Li, Xuhui Jia |
Existing machine learning models demonstrate excellent performance in image
object recognition after training on a large-scale dataset under full
supervision. However, these models only learn to map an image to a predefined
class index, without revealing the actual semantic meaning of the object in the
image. In contrast, vision-language models like CLIP are able to assign
semantic class names to unseen objects in a 'zero-shot' manner, although they
still rely on a predefined set of candidate names at test time. In this paper,
we reconsider the recognition problem and task a vision-language model to
assign class names to images given only a large and essentially unconstrained
vocabulary of categories as prior information. We use non-parametric methods to
establish relationships between images which allow the model to automatically
narrow down the set of possible candidate names. Specifically, we propose
iteratively clustering the data and voting on class names within them, showing
that this enables a roughly 50% improvement over the baseline on ImageNet.
Furthermore, we tackle this problem both in unsupervised and partially
supervised settings, as well as with a coarse-grained and fine-grained search
space as the unconstrained dictionary. |
This paper proposes a new task termed *Semantic Category Discovery (SCD)*, where the goal is to automatically assign semantic class names to images given a large, unconstrained vocabulary of categories, going beyond traditional class index prediction. |
Existing image recognition models rely on predefined class indices, limiting their ability to handle unseen objects and adapt to new categories. This work aims to bridge the gap to human-like perception, where we can assign semantic names to objects directly. |
The proposed method leverages non-parametric clustering (e.g., k-means) on image features and a pre-trained vision-language model (e.g., CLIP). It iteratively refines cluster assignments and votes on class names within clusters to narrow down the vocabulary to the most relevant concepts. |
The method significantly outperforms baseline zero-shot transfer approaches on datasets like ImageNet, Stanford Dogs, and CUB, with a roughly 50% improvement over the baseline on ImageNet.
Surprisingly, the proposed method can also improve clustering accuracy compared to strong baselines like DINO, indicating its effectiveness in grouping semantically similar images.
The paper shows that choosing appropriate initial clustering algorithms (constrained semi-supervised k-means) and leveraging strong features (DINO) are crucial for good performance. |
Despite improvements, the absolute accuracy of the method remains relatively low, highlighting the challenge of unconstrained semantic naming and the need for further research.
The reliance on pre-trained models like CLIP introduces potential biases from the internet-scale data they are trained on, necessitating further investigation into transparency and controllability for real-world deployment. |
image recognition, semantic category discovery, vision-language models, zero-shot learning, clustering |
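The cluster-then-vote idea in the entry above can be sketched as a single voting round. This is a minimal illustration, assuming image and vocabulary embeddings live in a shared space (as with CLIP) and that cluster assignments are already available; `vote_cluster_names` is a hypothetical helper, and the iterative refinement described in the summary is not reproduced.

```python
import numpy as np

def normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def vote_cluster_names(image_emb, text_emb, cluster_ids):
    """Return {cluster id: index of the vocabulary entry with the most votes}.

    image_emb: (n_images, d), text_emb: (n_names, d), cluster_ids: (n_images,).
    """
    sims = normalize(image_emb) @ normalize(text_emb).T  # (n_images, n_names)
    per_image = sims.argmax(axis=1)                      # each image's vote
    winners = {}
    for c in np.unique(cluster_ids):
        votes = np.bincount(per_image[cluster_ids == c],
                            minlength=text_emb.shape[0])
        winners[int(c)] = int(votes.argmax())
    return winners

# Toy vocabulary of three names and five images in two clusters.
text_emb = np.eye(3)
image_emb = np.array([[1.0, 0.1, 0.0],
                      [0.9, 0.2, 0.0],
                      [0.2, 1.0, 0.0],   # outlier that votes for name 1
                      [0.0, 0.1, 1.0],
                      [0.0, 0.3, 0.9]])
cluster_ids = np.array([0, 0, 0, 1, 1])

winners = vote_cluster_names(image_emb, text_emb, cluster_ids)
narrowed_vocabulary = sorted(set(winners.values()))
```

The outlier in cluster 0 is outvoted by its neighbors, and the candidate vocabulary shrinks to the names that won at least one cluster.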
2304.02330
Report |
SMPConv: Self-moving Point Representations for Continuous Convolution |
Sanghyeon Kim, Eunbyung Park |
Continuous convolution has recently gained prominence due to its ability to
handle irregularly sampled data and model long-term dependency. Also, the
promising experimental results of using large convolutional kernels have
catalyzed the development of continuous convolution since they can construct
large kernels very efficiently. Leveraging neural networks, more specifically
multilayer perceptrons (MLPs), is by far the most prevalent approach to
implementing continuous convolution. However, there are a few drawbacks, such
as high computational costs, complex hyperparameter tuning, and limited
descriptive power of filters. This paper suggests an alternative approach to
building a continuous convolution without neural networks, resulting in better
computational efficiency and improved performance. We present self-moving
point representations where weight parameters freely move, and interpolation
schemes are used to implement continuous functions. When applied to construct
convolutional kernels, the experimental results have shown improved performance
with drop-in replacement in the existing frameworks. Due to its lightweight
structure, we are the first to demonstrate the effectiveness of continuous
convolution in a large-scale setting, e.g., ImageNet, presenting
improvements over prior art. Our code is available at
https://github.com/sangnekim/SMPConv |
This paper proposes SMPConv, a novel method for continuous convolution that utilizes self-moving point representations and interpolation schemes, eliminating the need for computationally expensive neural networks. |
Current continuous convolution methods rely heavily on neural networks, leading to high computational costs, complex hyperparameter tuning, and limitations in the descriptive power of filters. SMPConv aims to address these issues by offering a more efficient and effective alternative. |
SMPConv represents convolutional kernels as continuous functions using self-moving points. These points, associated with weight parameters and radii, are learned during training and interpolated to generate kernel values at arbitrary locations. This approach allows for constructing large, adaptive receptive fields with minimal computational overhead. |
SMPConv achieves state-of-the-art results on sequential image classification tasks, such as sMNIST and pMNIST, demonstrating its capability in handling long-term dependencies.
On CIFAR10 image classification, SMPConv outperforms its MLP-based counterparts with fewer parameters and significantly faster training times, indicating its effectiveness for 2D image data.
For the first time, a continuous convolution method, SMPConv, is successfully applied to ImageNet-scale image classification, achieving competitive results with fewer parameters compared to existing models. |
Limited computational budget restricted the number of experiments for large-scale image classification.
Further exploration of regularization techniques and prior knowledge integration could potentially improve performance for tasks requiring long-term dependency modeling. |
continuous convolution, self-moving point representation, large kernel convolution, efficient deep learning, image classification |
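To make the point-based kernel construction in the SMPConv entry above concrete, the following sketch samples a 2D kernel from a handful of points, each carrying a position, a radius, and a weight. The triangular L1 falloff, the plain sum over points, and the name `smp_kernel` are assumptions chosen for illustration; the paper's interpolation and normalization may differ, and in training the positions, radii, and weights would be learned rather than fixed.

```python
import numpy as np

def smp_kernel(positions, radii, weights, size):
    """Sample a (size, size) kernel on a regular grid over [-1, 1]^2.

    positions: (n, 2) point coordinates, radii: (n,), weights: (n,).
    Each point contributes weight * max(0, 1 - L1 distance / radius).
    """
    axis = np.linspace(-1.0, 1.0, size)
    grid = np.stack(np.meshgrid(axis, axis, indexing="ij"), axis=-1)
    # L1 distance from every grid cell to every point: (size, size, n)
    l1 = np.abs(grid[:, :, None, :] - positions[None, None, :, :]).sum(axis=-1)
    influence = np.clip(1.0 - l1 / radii, 0.0, None)
    return (influence * weights).sum(axis=-1)

# One point at the origin with weight 2, sampled at two different radii.
positions = np.array([[0.0, 0.0]])
weights = np.array([2.0])
narrow = smp_kernel(positions, np.array([1.0]), weights, size=3)
wide = smp_kernel(positions, np.array([2.0]), weights, size=3)
```

Because the kernel is defined by a continuous function of position, the same points can be resampled at any `size`, which is what lets large kernels be built without storing one weight per tap.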
2304.02051
Report |
Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing |
Alberto Baldrati, Davide Morelli, Giuseppe Cartella, Marcella Cornia, Marco Bertini, Rita Cucchiara |
Fashion illustration is used by designers to communicate their vision and to
bring the design idea from conceptualization to realization, showing how
clothes interact with the human body. In this context, computer vision can thus
be used to improve the fashion design process. Differently from previous works
that mainly focused on the virtual try-on of garments, we propose the task of
multimodal-conditioned fashion image editing, guiding the generation of
human-centric fashion images by following multimodal prompts, such as text,
human body poses, and garment sketches. We tackle this problem by proposing a
new architecture based on latent diffusion models, an approach that has not
been used before in the fashion domain. Given the lack of existing datasets
suitable for the task, we also extend two existing fashion datasets, namely
Dress Code and VITON-HD, with multimodal annotations collected in a
semi-automatic manner. Experimental results on these new datasets demonstrate
the effectiveness of our proposal, both in terms of realism and coherence with
the given multimodal inputs. Source code and collected multimodal annotations
are publicly available at:
https://github.com/aimagelab/multimodal-garment-designer. |
Introduces Multimodal Garment Designer (MGD), a novel human-centric latent diffusion model for fashion image editing, enabling the generation of novel fashion images conditioned on text, human poses, and garment sketches. |
Addresses the limitations of existing fashion image editing methods by introducing a multimodal approach that leverages text, human poses, and garment sketches to guide the generation process, enhancing control and personalization in fashion design. |
Presents a novel MGD architecture based on latent diffusion models, incorporating a denoising network conditioned on text embeddings, pose maps, and garment sketches. Extends existing fashion datasets (Dress Code and VITON-HD) with multimodal annotations, including textual descriptions and garment sketches, collected semi-automatically. |
MGD consistently outperforms competitor models in terms of image realism and coherence with input modalities on the newly collected multimodal datasets.
The model effectively combines and utilizes text, pose, and sketch information in a disentangled manner, making each modality optional during generation.
User studies confirm that MGD generates more realistic and coherent images compared to baseline methods. |
MGD occasionally struggles to generate accurate hand details, particularly when hands occupy a small portion of the input image.
The performance of the model is reliant on the quality of the input sketch, and inaccuracies in the sketch can lead to artifacts in the generated image. Future work will focus on addressing these limitations, potentially through improved hand modeling techniques or sketch refinement methods. |
fashion image editing, latent diffusion models, multimodal conditioning, garment sketch guidance, human-centric generation |
2304.02012
Report |
EGC: Image Generation and Classification via a Diffusion Energy-Based Model |
Qiushan Guo, Chuofan Ma, Yi Jiang, Zehuan Yuan, Yizhou Yu, Ping Luo |
Learning image classification and image generation using the same set of
network parameters is a challenging problem. Recent advanced approaches that
perform well in one task often exhibit poor performance in the other. This work
introduces an energy-based classifier and generator, namely EGC, which can
achieve superior performance in both tasks using a single neural network.
Unlike a conventional classifier that outputs a label given an image (i.e., a
conditional distribution $p(y|\mathbf{x})$), the forward pass in EGC is a
classifier that outputs a joint distribution $p(\mathbf{x},y)$, enabling an
image generator in its backward pass by marginalizing out the label $y$. This
is done by estimating the energy and classification probability given a noisy
image in the forward pass, while denoising it using the score function
estimated in the backward pass. EGC achieves competitive generation results
compared with state-of-the-art approaches on ImageNet-1k, CelebA-HQ and LSUN
Church, while achieving superior classification accuracy and robustness against
adversarial attacks on CIFAR-10. This work represents the first successful
attempt to simultaneously excel in both tasks using a single set of network
parameters. We believe that EGC bridges the gap between discriminative and
generative learning. |
This paper proposes EGC, a novel energy-based model that unifies image classification and generation using a single neural network. |
Bridging the gap between discriminative and generative learning is a challenging problem, and existing models often excel in one task while performing poorly in the other. EGC aims to achieve superior performance in both tasks simultaneously. |
EGC leverages the diffusion process to improve the accuracy of estimated scores for stable training and image sampling. The forward pass acts as a classifier, predicting the joint distribution of noisy image and label. The backward pass acts as a generator, denoising data using unconditional and conditional scores. |
EGC achieves competitive generation results compared to state-of-the-art approaches on ImageNet, CelebA-HQ, and LSUN Church datasets.
EGC achieves superior classification accuracy and robustness against adversarial attacks on CIFAR-10, surpassing existing explicit EBMs.
EGC demonstrates promising applications in image inpainting, semantic interpolation, and high-resolution image generation. |
The paper acknowledges that incorporating stronger data augmentation techniques could further improve the results.
Exploring network architectures specifically designed for optimizing the gradient of the energy function could further enhance performance. |
energy-based model, diffusion model, image generation, image classification, generative learning |
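The link between the classifier and the generator in the EGC entry above can be illustrated by treating classifier logits as negative joint energies. This is a sketch under strong assumptions: a linear map `W @ x + b` stands in for the network, the noise conditioning of the diffusion process is omitted, and all function names are hypothetical. With logits f(x), the class posterior is softmax(f(x)), the unnormalized log-density of x is logsumexp(f(x)), and its gradient with respect to x (the score) is what a denoising step would use.

```python
import numpy as np

def logsumexp(z):
    """Numerically stable log(sum(exp(z)))."""
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def logits(W, b, x):
    """Stand-in network: f(x)[y] = -E(x, y) for each label y."""
    return W @ x + b

def class_probs(W, b, x):
    """p(y | x), the usual classifier output."""
    return softmax(logits(W, b, x))

def unnormalized_log_px(W, b, x):
    """log p(x) up to a constant, obtained by marginalizing out the label."""
    return logsumexp(logits(W, b, x))

def score(W, b, x):
    """Gradient of unnormalized_log_px with respect to x (analytic for a linear f)."""
    return W.T @ class_probs(W, b, x)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))   # 4 classes, 6 input dimensions
b = rng.normal(size=4)
x = rng.normal(size=6)

probs = class_probs(W, b, x)
grad = score(W, b, x)
```

In a real model the gradient would come from automatic differentiation through the network, which is why the summary describes generation as happening in the backward pass.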
2304.01999
Report |
Revisiting the Evaluation of Image Synthesis with GANs |
Mengping Yang, Ceyuan Yang, Yichi Zhang, Qingyan Bai, Yujun Shen, Bo Dai |
A good metric, which promises a reliable comparison between solutions, is
essential for any well-defined task. Unlike most vision tasks that have
per-sample ground-truth, image synthesis tasks target generating unseen data
and hence are usually evaluated through a distributional distance between one
set of real samples and another set of generated samples. This study presents
an empirical investigation into the evaluation of synthesis performance, with
generative adversarial networks (GANs) as a representative of generative
models. In particular, we make in-depth analyses of various factors, including
how to represent a data point in the representation space, how to calculate a
fair distance using selected samples, and how many instances to use from each
set. Extensive experiments conducted on multiple datasets and settings reveal
several important findings. Firstly, a group of models that include both
CNN-based and ViT-based architectures serve as reliable and robust feature
extractors for measurement evaluation. Secondly, Centered Kernel Alignment
(CKA) provides a better comparison across various extractors and hierarchical
layers in one model. Finally, CKA is more sample-efficient and enjoys better
agreement with human judgment in characterizing the similarity between two
internal data correlations. These findings contribute to the development of a
new measurement system, which enables a consistent and reliable re-evaluation
of current state-of-the-art generative models. |
This paper presents an empirical study investigating evaluation paradigms for generative adversarial networks (GANs) in image synthesis, focusing on the feature extractor and distributional distance. |
Accurately evaluating the performance of GANs is crucial for assessing progress in image synthesis. Existing metrics like FID have limitations, necessitating a systematic investigation for reliable comparisons. |
The study explores various feature extractors (CNNs, ViTs, MLPs) and distributional distances (FID, CKA) using techniques like heatmap visualization, histogram matching attacks, and human evaluations. |
A combination of CNN-based and ViT-based architectures provides reliable and robust feature extraction for evaluating GANs.
Centered Kernel Alignment (CKA) offers better comparison across different extractors and hierarchical layers than FID.
CKA demonstrates greater sample efficiency and stronger agreement with human judgment in assessing GAN-generated image quality. |
The study primarily focuses on image-level evaluation, without addressing potential biases from low-level image processing.
Future work could explore the impact of image resolution and dataset size on the evaluation results. |
image synthesis, generative adversarial networks, evaluation metrics, feature extractors, distributional distance |
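For readers unfamiliar with Centered Kernel Alignment, the following is a sketch of the standard linear CKA between two feature matrices with the same number of rows. It shows the generic similarity measure only; how the study above applies CKA to real and generated sample sets is not reproduced here, and `linear_cka` is an illustrative name.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between feature matrices X (n, d1) and Y (n, d2)."""
    X = X - X.mean(axis=0, keepdims=True)   # center each feature column
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# CKA is invariant to orthogonal transforms and isotropic scaling.
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
same = linear_cka(X, X)
rotated = linear_cka(X, 3.0 * X @ Q)

# Independent random features score below 1.
unrelated = linear_cka(X, rng.normal(size=(200, 5)))
```

The invariance to rotation and scale is one reason a measure of this kind can be compared across different feature extractors and layers.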
2304.01900
Report |
PODIA-3D: Domain Adaptation of 3D Generative Model Across Large Domain Gap Using Pose-Preserved Text-to-Image Diffusion |
Gwanghyun Kim, Ji Ha Jang, Se Young Chun |
Recently, significant advancements have been made in 3D generative models;
however, training these models across diverse domains is challenging and
requires a huge amount of training data and knowledge of pose distribution.
Text-guided domain adaptation methods have allowed the generator to be adapted
to the target domains using text prompts, thereby obviating the need for
assembling numerous data. Recently, DATID-3D has shown impressive sample
quality in text-guided domain adaptation, preserving diversity in text by
leveraging text-to-image diffusion. However, adapting 3D generators to domains
with significant domain gaps from the source domain remains challenging due to
the following issues in current text-to-image diffusion models: 1) shape-pose
trade-off in diffusion-based translation, 2) pose bias, and 3) instance bias in
the target domain, resulting in inferior 3D shapes, low text-image
correspondence, and low intra-domain diversity in the generated samples. To
address these issues, we propose a novel pipeline called PODIA-3D, which uses
pose-preserved text-to-image diffusion-based domain adaptation for 3D
generative models. We construct a pose-preserved text-to-image diffusion model
that allows the use of extremely high-level noise for significant domain
changes. We also propose specialized-to-general sampling strategies to improve
the details of the generated samples. Moreover, to overcome the instance bias,
we introduce a text-guided debiasing method that improves intra-domain
diversity. Consequently, our method successfully adapts 3D generators across
significant domain gaps. Our qualitative results and user study demonstrate
that our approach outperforms existing 3D text-guided domain adaptation methods
in terms of text-image correspondence, realism, diversity of rendered images,
and sense of depth of 3D shapes in the generated samples. |
Presents PODIA-3D, a pose-preserved text-to-image diffusion-based domain adaptation method for 3D generative models, enabling adaptation across large domain gaps (e.g., from human faces to animals) with strong text-image correspondence and high-quality 3D shapes. |
Training 3D generative models on diverse domains is challenging due to the need for vast amounts of training data and pose information. Existing domain adaptation methods struggle to handle significant domain shifts and often lead to low text-image correspondence and poor 3D shapes. |
1. Construct pose-preserved text-to-image diffusion models (PPD) by fine-tuning depth-guided diffusion models on data preserving source poses but with target shapes. 2. Propose a specialized-to-general sampling strategy to generate target images, leveraging PPD for structure and pose and general diffusion models for details. 3. Fine-tune 3D generators adversarially on the generated pose-aware target dataset. 4. Introduce text-guided debiasing to improve intra-domain diversity. |
Achieves superior text-image correspondence and 3D shape quality compared to existing 3D text-guided domain adaptation methods like StyleGANFusion and DATID-3D.
Demonstrates successful adaptation of EG3D to various animal and character domains with large domain gaps from the source FFHQ dataset.
Shows the effectiveness of the proposed PPD and specialized-to-general sampling in generating high-quality, pose-consistent target images. |
Domain adaptation to non-living objects (e.g., chairs) with less directional information can result in low text-image correspondence.
Reliance on text-to-image diffusion models means inheriting their limitations, such as potential biases or difficulty in generating certain image features. |
3d generative models, domain adaptation, text-to-image synthesis, diffusion models, pose preservation |
2304.01716
Report |
Decoupling Dynamic Monocular Videos for Dynamic View Synthesis |
Meng You, Junhui Hou |
The challenge of dynamic view synthesis from dynamic monocular videos, i.e.,
synthesizing novel views for free viewpoints given a monocular video of a
dynamic scene captured by a moving camera, mainly lies in accurately modeling
the dynamic objects of a scene using limited 2D frames, each with a varying
timestamp and viewpoint. Existing methods usually require pre-processed 2D
optical flow and depth maps by off-the-shelf methods to supervise the network,
making them suffer from the inaccuracy of the pre-processed supervision and the
ambiguity when lifting the 2D information to 3D. In this paper, we tackle this
challenge in an unsupervised fashion. Specifically, we decouple the motion of
the dynamic objects into object motion and camera motion, respectively
regularized by proposed unsupervised surface consistency and patch-based
multi-view constraints. The former enforces the 3D geometric surfaces of moving
objects to be consistent over time, while the latter regularizes their
appearances to be consistent across different viewpoints. Such a fine-grained
motion formulation can alleviate the learning difficulty for the network, thus
enabling it to produce not only novel views with higher quality but also more
accurate scene flows and depth than existing methods requiring extra
supervision. |
This paper proposes an unsupervised learning approach for dynamic view synthesis from monocular videos, which eliminates the need for pre-processed depth and optical flow as supervision by introducing two novel unsupervised regularization terms: surface consistency and patch-based multi-view consistency. |
Existing methods heavily rely on pre-processed 2D optical flow and depth maps, leading to limitations such as performance degradation due to inaccurate pre-processed data, ambiguity in lifting 2D information to 3D, and computational expenses. |
The method decouples object and camera motion, using a static NeRF for the background and an NSFF for dynamic objects and scene flow. The surface consistency constraint enforces temporal consistency of moving object surfaces, while the patch-based multi-view constraint ensures consistency between rendered novel views and the input view. |
The unsupervised method outperforms state-of-the-art supervised methods on the NVIDIA Dynamic Scene Dataset, achieving superior PSNR and comparable SSIM and LPIPS.
It effectively handles dynamic scenes, particularly those with significant motion, showing substantial improvements over existing methods.
The method produces accurate depth and flow maps for novel views in an unsupervised manner, comparable to those generated by supervised methods. |
The method struggles with non-rigid deformations due to the surface consistency constraint's limitations in handling such cases.
The reliance on separate modeling for static and dynamic parts and the use of mask supervision represent areas for future simplification. |
dynamic view synthesis, neural radiance fields (nerf), unsupervised learning, scene flow estimation, monocular video |
2304.01515
Report |
Text-Conditioned Sampling Framework for Text-to-Image Generation with Masked Generative Models |
Jaewoong Lee, Sangwon Jang, Jaehyeong Jo, Jaehong Yoon, Yunji Kim, Jin-Hwa Kim, Jung-Woo Ha, Sung Ju Hwang |
Token-based masked generative models are gaining popularity for their fast
inference time with parallel decoding. While recent token-based approaches
achieve competitive performance to diffusion-based models, their generation
performance is still suboptimal as they sample multiple tokens simultaneously
without considering the dependence among them. We empirically investigate this
problem and propose a learnable sampling model, Text-Conditioned Token
Selection (TCTS), to select optimal tokens via localized supervision with text
information. TCTS improves not only the image quality but also the semantic
alignment of the generated images with the given texts. To further improve the
image quality, we introduce a cohesive sampling strategy, Frequency Adaptive
Sampling (FAS), to each group of tokens divided according to the self-attention
maps. We validate the efficacy of TCTS combined with FAS with various
generative tasks, demonstrating that it significantly outperforms the baselines
in image-text alignment and image quality. Our text-conditioned sampling
framework further reduces the original inference time by more than 50% without
modifying the original generative model. |
This paper introduces a text-conditioned sampling framework for text-to-image generation with masked generative models, aiming to improve text alignment and image quality. |
Current token-based diffusion models for text-to-image generation, while fast, struggle with inconsistency in generated images due to simultaneous token sampling, leading to a trade-off between speed and quality. This is particularly problematic for text alignment. |
The authors propose Text-Conditioned Token Selection (TCTS), a learnable model trained to identify and resample misaligned tokens based on text conditions. They further introduce Frequency Adaptive Sampling (FAS) to address over-simplification in low-frequency image areas by selectively applying persistent sampling based on self-attention maps. |
Revocable sampling strategies like TCTS improve text alignment compared to fixed methods, mitigating error accumulation.
TCTS, especially when combined with FAS, outperforms baselines in text alignment metrics (MID-L) and maintains competitive image quality (FID) on MS-COCO and CUB datasets.
The framework facilitates fast local image refinement and mask-free object editing leveraging cross-attention maps. |
The computational overhead of TCTS, while marginal, could be further optimized.
The paper primarily focuses on single-object datasets, and further exploration is needed for more complex multi-object scenes. |
text-to-image generation, token-based diffusion models, revocable sampling, text alignment, image refinement |
2304.01489
Report |
Improved Visual Fine-tuning with Natural Language Supervision |
Junyang Wang, Yuanhong Xu, Juhua Hu, Ming Yan, Jitao Sang, Qi Qian |
Fine-tuning a visual pre-trained model can leverage the semantic information
from large-scale pre-training data and mitigate the over-fitting problem on
downstream vision tasks with limited training examples. While the problem of
catastrophic forgetting in the pre-trained backbone has been extensively studied
for fine-tuning, the potential bias from the corresponding pre-training task
and data has attracted less attention. In this work, we investigate this problem by
demonstrating that the obtained classifier after fine-tuning will be close to
that induced by the pre-trained model. To reduce the bias in the classifier
effectively, we introduce a reference distribution obtained from a fixed text
classifier, which can help regularize the learned vision classifier. The
proposed method, Text Supervised fine-tuning (TeS), is evaluated with diverse
pre-trained vision models including ResNet and ViT, and text encoders including
BERT and CLIP, on 11 downstream tasks. The consistent improvement with a clear
margin over distinct scenarios confirms the effectiveness of our proposal. Code
is available at https://github.com/idstcv/TeS. |
This paper proposes TeS, a method using text supervision from a fixed text encoder to improve fine-tuning of pre-trained vision models for image classification, reducing bias without catastrophic forgetting. |
Fine-tuning pre-trained vision models can suffer from bias due to the pre-training task and data, while tackling this issue often leads to catastrophic forgetting. Text supervision offers a readily available source of information to address this challenge. |
TeS introduces a reference distribution from a fixed text classifier (using class names). It minimizes KL-divergence between class-level distributions from vision and text encoders and further introduces instance-level regularization by approximating text representations for each image. |
TeS consistently outperforms conventional fine-tuning and label smoothing methods across various pre-trained vision models (ResNet, ViT), text encoders (BERT, CLIP), and datasets.
Text encoders pre-trained with visual data (CLIP) show superior performance in supervising visual fine-tuning compared to pure language models (BERT).
TeS shows significant improvements in few-shot learning scenarios and for datasets with long-tailed distributions. |
The current method requires exact class names, limiting its applicability in scenarios with restricted access to such information.
Future work could explore combining TeS with state-of-the-art methods specifically designed for class imbalance learning. |
fine-tuning, text supervision, vision-language pre-training, image classification, bias reduction |
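The class-level regularization described in the TeS entry above can be sketched as a cross-entropy term plus a KL term that pulls the vision classifier's predictions toward a reference distribution from a fixed text classifier. The KL direction, the `weight` parameter, and the function names are assumptions for illustration; the instance-level term mentioned in the summary is omitted, and the reference logits are taken as given rather than computed from a text encoder.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax for a (batch, classes) array."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def kl_divergence(q, p):
    """Row-wise KL(q || p) for strictly positive distributions."""
    return (q * (np.log(q) - np.log(p))).sum(axis=1)

def regularized_loss(vision_logits, reference_logits, labels, weight=1.0):
    """Return (total, cross_entropy, kl_term) averaged over the batch."""
    p = softmax(vision_logits)        # learned vision classifier
    q = softmax(reference_logits)     # fixed text-derived reference
    rows = np.arange(labels.shape[0])
    cross_entropy = float(-np.log(p[rows, labels]).mean())
    kl_term = float(kl_divergence(q, p).mean())
    return cross_entropy + weight * kl_term, cross_entropy, kl_term

vision_logits = np.array([[2.0, 0.5, -1.0],
                          [0.1, 1.5, 0.3]])
reference_logits = np.array([[1.0, 1.0, -0.5],
                             [0.0, 2.0, 0.0]])
labels = np.array([0, 1])

total, ce, kl = regularized_loss(vision_logits, reference_logits, labels, weight=0.5)
```

Setting `weight=0` recovers plain fine-tuning with cross-entropy, so the KL term is the only place where the text side influences training in this sketch.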
2304.01436
Report |
Learning Personalized High Quality Volumetric Head Avatars from Monocular RGB Videos |
Ziqian Bai, Feitong Tan, Zeng Huang, Kripasindhu Sarkar, Danhang Tang, Di Qiu, Abhimitra Meka, Ruofei Du, Mingsong Dou, Sergio Orts-Escolano, Rohit Pandey, Ping Tan, Thabo Beeler, Sean Fanello, Yinda Zhang |
We propose a method to learn a high-quality implicit 3D head avatar from a
monocular RGB video captured in the wild. The learnt avatar is driven by a
parametric face model to achieve user-controlled facial expressions and head
poses. Our hybrid pipeline combines the geometry prior and dynamic tracking of
a 3DMM with a neural radiance field to achieve fine-grained control and
photorealism. To reduce over-smoothing and improve out-of-model expressions
synthesis, we propose to predict local features anchored on the 3DMM geometry.
These learnt features are driven by 3DMM deformation and interpolated in 3D
space to yield the volumetric radiance at a designated query point. We further
show that using a Convolutional Neural Network in the UV space is critical in
incorporating spatial context and producing representative local features.
Extensive experiments show that we are able to reconstruct high-quality
avatars, with more accurate expression-dependent details, good generalization
to out-of-training expressions, and quantitatively superior renderings compared
to other state-of-the-art approaches. |
This paper presents a method to create high-quality, controllable 3D head avatars from monocular RGB videos, leveraging a 3DMM-anchored neural radiance field. |
Creating realistic and controllable avatars from accessible data like monocular videos is crucial for AR/VR, gaming, and visual effects. |
The method combines a 3DMM for facial tracking with a neural radiance field for detailed rendering. It uses a CNN in UV space to predict expression-dependent features attached to 3DMM vertices, enhancing detail and generalization to unseen expressions. |
Reconstructs high-quality avatars from short monocular videos.
Captures fine-grained details and accurate articulations, outperforming previous methods in terms of visual quality.
Demonstrates good generalization to out-of-training expressions and novel view synthesis. |
Training is subject-specific and time-consuming, similar to other NeRF-based methods.
Limited ability to model components completely missing in the 3DMM, such as the tongue. |
head avatar, neural radiance field, 3d morphable model, monocular reconstruction, expression transfer |
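The head-avatar report above describes features anchored on 3DMM vertices and interpolated in 3D space at a query point. A minimal sketch of such an interpolation follows, assuming inverse-distance weighting over the `k` nearest vertices. The weighting scheme, the value of `k`, and the function name `interpolate_features` are illustrative assumptions, not details confirmed by the report.

```python
import math

def interpolate_features(query, vertices, features, k=3, eps=1e-8):
    """Blend the features of the k vertices nearest to `query`.

    Weights are inverse distances normalized to sum to 1 (an assumed scheme).
    """
    dists = [math.dist(query, v) for v in vertices]
    nearest = sorted(range(len(vertices)), key=lambda i: dists[i])[:k]
    weights = [1.0 / (dists[i] + eps) for i in nearest]
    total = sum(weights)
    dim = len(features[0])
    return [
        sum(w * features[i][d] for w, i in zip(weights, nearest)) / total
        for d in range(dim)
    ]

# Toy mesh: three vertices, each carrying a 2-D feature vector.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# The query sits midway between the first two vertices, so with k=2 their
# features are blended with equal weight.
blended = interpolate_features((0.5, 0.0, 0.0), vertices, features, k=2)
```

Because the features travel with the vertices, deforming the mesh moves the features along with it, which is what lets the 3DMM drive the radiance field in this kind of design.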
2304.01200
Report |
Video Instance Segmentation in an Open-World |
Omkar Thawakar, Sanath Narayan, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Jorma Laaksonen, Mubarak Shah, Fahad Shahbaz Khan |
Existing video instance segmentation (VIS) approaches generally follow a
closed-world assumption, where only seen category instances are identified and
spatio-temporally segmented at inference. Open-world formulation relaxes the
closed-world static-learning assumption as follows: (a) first, it distinguishes
a set of known categories and labels an unknown object as 'unknown', and
then (b) it incrementally learns the class of an unknown as and when the
corresponding semantic labels become available. We propose the first open-world
VIS approach, named OW-VISFormer, that introduces a novel feature enrichment
mechanism and a spatio-temporal objectness (STO) module. The feature enrichment
mechanism based on a light-weight auxiliary network aims at accurate
pixel-level (unknown) object delineation from the background as well as
distinguishing category-specific known semantic classes. The STO module strives
to generate instance-level pseudo-labels by enhancing the foreground
activations through a contrastive loss. Moreover, we also introduce an
extensive experimental protocol to measure the characteristics of OW-VIS. Our
OW-VISFormer performs favorably against a solid baseline in OW-VIS setting.
Further, we evaluate our contributions in the standard fully-supervised VIS
setting by integrating them into the recent SeqFormer, achieving an absolute
gain of 1.6% AP on the YouTube-VIS 2019 val. set. Lastly, we show the
generalizability of our contributions for the open-world detection (OWOD)
setting, outperforming the best existing OWOD method in the literature. Code
and models, along with OW-VIS splits, are available at
https://github.com/OmkarThawakar/OWVISFormer. |
This paper introduces OW-VISFormer, the first approach for open-world video instance segmentation (OW-VIS). OW-VISFormer employs a novel feature enrichment mechanism and a spatio-temporal objectness module to identify and segment both known and unknown object instances in videos, allowing for incremental learning of new object categories. |
Existing VIS approaches operate under a closed-world assumption, limiting their ability to handle novel object categories. OW-VIS addresses this by enabling the model to identify unknown objects and incrementally learn their categories as annotations become available, crucial for real-world applications where new objects are constantly encountered. |
OW-VISFormer leverages a light-weight auxiliary network (ScratchNet) to generate shallow features that complement standard pre-trained features, improving pixel-level object delineation. A spatio-temporal objectness (STO) module with a contrastive loss enhances foreground activations, facilitating the identification of candidate unknown objects and improving mask prediction for both known and unknown instances. |
OW-VISFormer consistently outperforms the baseline on various OW-VIS splits, demonstrating its effectiveness in segmenting both known and unknown objects.
Integrating the proposed feature enrichment and STO module into SeqFormer, a fully-supervised VIS method, yields a 1.6% absolute gain in AP on the YouTube-VIS 2019 val. set.
The proposed approach generalizes well to open-world object detection (OWOD), surpassing the state-of-the-art OW-DETR method on the MS COCO OWOD split. |
The current OW-VISFormer framework focuses on single-stage instance segmentation. Exploring its integration with two-stage VIS methods could be beneficial.
The impact of different memory replay strategies for incremental learning in OW-VIS warrants further investigation. |
video instance segmentation, open-world learning, incremental learning, object detection, computer vision |
2304.01198
Report |
Open-Vocabulary Semantic Segmentation with Decoupled One-Pass Network |
Cong Han, Yujie Zhong, Dengjie Li, Kai Han, Lin Ma |
Recently, the open-vocabulary semantic segmentation problem has attracted
increasing attention and the best performing methods are based on two-stream
networks: one stream for proposal mask generation and the other for segment
classification using a pretrained visual-language model. However, existing
two-stream methods require passing a great number of (up to a hundred) image
crops into the visual-language model, which is highly inefficient. To address
the problem, we propose a network that only needs a single pass through the
visual-language model for each input image. Specifically, we first propose a
novel network adaptation approach, termed patch severance, to restrict the
harmful interference between the patch embeddings in the pre-trained visual
encoder. We then propose classification anchor learning to encourage the
network to spatially focus on more discriminative features for classification.
Extensive experiments demonstrate that the proposed method achieves outstanding
performance, surpassing state-of-the-art methods while being 4 to 7 times
faster at inference. Code: https://github.com/CongHan0808/DeOP.git |
This paper proposes Decoupled One-Pass Network (DeOP) for open-vocabulary semantic segmentation, which maintains the zero-shot ability of VLMs while being computationally efficient. |
Existing two-stream methods for open-vocabulary semantic segmentation are computationally expensive, as they require passing many image crops through the visual-language model. |
DeOP uses a decoupled, two-stream architecture with a class-agnostic mask proposal network and a mask classification network based on a frozen CLIP visual encoder. It introduces two novel components: Generalized Patch Severance (GPS) to reduce interference between patch embeddings and Classification Anchor Learning (CAL) to identify discriminative features for classification. |
DeOP achieves state-of-the-art performance on COCO-Stuff and Pascal VOC in both intra- and cross-dataset evaluations.
DeOP is 4 to 7 times faster than multi-pass methods at inference.
GPS and CAL significantly contribute to the improvement of segmentation performance. |
Applying GPS to shallower layers of the visual encoder can hurt performance.
Exploring more sophisticated CAL modules could further enhance performance. |
semantic segmentation, open-vocabulary learning, zero-shot learning, vision-language models, clip |
2304.01197
Report |
Bringing Telepresence to Every Desk |
Shengze Wang, Ziheng Wang, Ryan Schmelzle, Liujie Zheng, YoungJoong Kwon, Soumyadip Sengupta, Henry Fuchs |
In this paper, we work to bring telepresence to every desktop. Unlike
commercial systems, personal 3D video conferencing systems must render
high-quality videos while remaining financially and computationally viable for
the average consumer. To this end, we introduce a capturing and rendering
system that only requires 4 consumer-grade RGBD cameras and synthesizes
high-quality free-viewpoint videos of users as well as their environments.
Experimental results show that our system renders high-quality free-viewpoint
videos without using object templates or heavy pre-processing. While not
real-time, our system is fast and does not require per-video optimizations.
Moreover, our system is robust to complex hand gestures and clothing, and it
can generalize to new users. This work provides a strong basis for further
optimization, and it will help bring telepresence to every desk in the near
future. The code and dataset will be made available on our website
https://mcmvmc.github.io/PersonalTelepresence/. |
This paper presents a novel view synthesis system for personal 3D telepresence using only 4 consumer-grade RGBD cameras, offering a cost-effective way to synthesize high-quality free-viewpoint videos. |
This system addresses the limitations of current commercial 3D telepresence solutions, which are often expensive and require dedicated physical spaces, making them inaccessible to the average user. |
The system utilizes a novel volumetric representation called Multi-layer Point Cloud (MPC) to address depth biases in RGBD cameras, improving reconstruction accuracy, especially for slanted surfaces. It also incorporates a temporal renderer for temporal smoothing and Spatial Skip Connections for high-resolution rendering under limited GPU memory. |
The system produces high-quality free-viewpoint videos of users and their environments, outperforming baseline methods in terms of accuracy and stability.
It accurately reconstructs challenging details like hand gestures and fast body movements without relying on object templates or heavy pre-processing.
The system demonstrates generalizability to new users and environments, highlighting its potential for wider adoption. |
The system is not yet real-time, with cost volume construction being a bottleneck.
The current system does not support immersive display technologies like autostereo displays. |
novel view synthesis, 3d telepresence, rgbd cameras, volumetric rendering, personal telepresence |
2304.01186
Report |
Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos |
Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Siran Chen, Ying Shan, Xiu Li, Qifeng Chen |
Generating text-editable and pose-controllable character videos is in
pressing demand for creating various digital humans. Nevertheless, this task has
been restricted by the absence of a comprehensive dataset featuring paired
video-pose captions and the generative prior models for videos. In this work,
we design a novel two-stage training scheme that can utilize easily obtained
datasets (i.e., image-pose pairs and pose-free videos) and the pre-trained
text-to-image (T2I) model to obtain the pose-controllable character videos.
Specifically, in the first stage, the keypoint-image pairs are used only
for controllable text-to-image generation. We learn a zero-initialized
convolutional encoder to encode the pose information. In the second stage, we
finetune the motion of the above network via a pose-free video dataset by
adding the learnable temporal self-attention and reformed cross-frame
self-attention blocks. Powered by our new designs, our method successfully
generates continuously pose-controllable character videos while keeping the
editing and concept composition ability of the pre-trained T2I model. The code
and models will be made publicly available. |
This paper presents a novel two-stage training scheme for generating text-editable and pose-controllable character videos, leveraging pre-trained text-to-image models and easily obtainable datasets. |
Generating such videos is crucial for various digital human applications but limited by the lack of paired video-pose captions and effective video generative prior models. |
The method uses a pose encoder to incorporate pose information into a pre-trained text-to-image model (Stage 1), followed by fine-tuning on a pose-free video dataset to ensure temporal consistency (Stage 2). |
The approach successfully generates high-quality character videos controllable by both text prompts and pose sequences.
It inherits robust concept generation and composition capabilities from pre-trained T2I models.
The method outperforms existing techniques in terms of generation quality, text-video alignment, pose-video alignment, and temporal coherence. |
The model's performance on complex scenes with multiple interacting characters needs further investigation.
Future work may explore incorporating additional control signals, such as depth or style, for more versatile video generation. |
text-to-video generation, pose control, character animation, diffusion models, deep learning |
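The Follow Your Pose report above mentions a zero-initialized encoder for the pose signal. One common reason for zero initialization is that the pretrained model's features are left untouched at the start of training. The sketch below illustrates that property under strong simplifying assumptions: the pose signal is added residually, and a per-channel scalar gain stands in for the convolutional encoder. The function `inject_pose` and all values are hypothetical.

```python
def inject_pose(base_features, pose_features, weight):
    """Add a scaled pose signal to pretrained features, channel by channel.

    `weight` holds one scalar gain per channel. A real encoder would be a
    convolution; this sketch does not attempt to model that.
    """
    return [f + w * p for f, w, p in zip(base_features, weight, pose_features)]

base = [0.3, -1.2, 0.8]        # features from the pretrained T2I model (toy values)
pose = [1.0, 0.5, -2.0]        # encoded pose signal (toy values)
zero_init = [0.0, 0.0, 0.0]    # zero-initialized gains

# At initialization the pose branch contributes nothing.
at_start = inject_pose(base, pose, zero_init)

# Once the gains move away from zero, the pose signal starts to steer the features.
after_training = inject_pose(base, pose, [0.1, 0.0, -0.2])
```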
2304.01184
Report |
WeakTr: Exploring Plain Vision Transformer for Weakly-supervised Semantic Segmentation |
Lianghui Zhu, Yingyue Li, Jiemin Fang, Yan Liu, Hao Xin, Wenyu Liu, Xinggang Wang |
This paper explores the properties of the plain Vision Transformer (ViT) for
Weakly-supervised Semantic Segmentation (WSSS). The class activation map (CAM)
is of critical importance for understanding a classification network and
launching WSSS. We observe that different attention heads of ViT focus on
different image areas. Thus a novel weight-based method is proposed to
end-to-end estimate the importance of attention heads, while the self-attention
maps are adaptively fused for high-quality CAM results that tend to have more
complete objects. Besides, we propose a ViT-based gradient clipping decoder for
online retraining with the CAM results to complete the WSSS task. We name this
plain Transformer-based Weakly-supervised learning framework WeakTr. It
achieves the state-of-the-art WSSS performance on standard benchmarks, i.e.,
78.4% mIoU on the val set of PASCAL VOC 2012 and 50.3% mIoU on the val set of
COCO 2014. Code is available at https://github.com/hustvl/WeakTr. |
This paper proposes WeakTr, a novel weakly supervised semantic segmentation framework using a plain Vision Transformer (ViT), introducing a weight-based method for fusing attention heads to generate high-quality class activation maps. |
Weakly supervised semantic segmentation typically relies on class activation maps (CAMs) generated from classification networks, but traditional methods for CAM generation have limitations. This paper addresses those limitations by leveraging the multi-head attention mechanism of ViTs. |
The authors introduce a weight-based method to estimate the importance of different attention heads in ViT for adaptive fusion, leading to improved CAM quality. They also propose a ViT-based gradient clipping decoder for online retraining using the generated CAMs to achieve semantic segmentation. |
WeakTr achieves state-of-the-art WSSS performance, reaching 78.4% mIoU on the PASCAL VOC 2012 val set and 50.3% mIoU on the COCO 2014 val set.
The proposed weight-based attention head fusion method generates higher-quality CAMs compared to traditional mean-sum approaches.
The ViT-based gradient clipping decoder effectively leverages the generated CAMs for improved semantic segmentation. |
The computational cost of WeakTr is not addressed, which is crucial for real-world applications.
The paper focuses on single-label WSSS; exploring its effectiveness in multi-label settings would be beneficial. Future work could explore the impact of different ViT architectures and pretraining strategies on WeakTr's performance. |
weakly supervised semantic segmentation, vision transformer, class activation map, attention mechanism, computer vision |
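The WeakTr report above contrasts weight-based fusion of attention heads with plain mean fusion. The sketch below shows the difference on toy data. In the paper the head weights are estimated end-to-end, whereas here they are fixed constants chosen by hand, and the function `fuse_heads` is a hypothetical name.

```python
def fuse_heads(head_maps, weights=None):
    """Fuse per-head attention maps (flat lists of equal length) into one map.

    With weights=None this is mean fusion; otherwise it is a weighted sum
    with the weights normalized to sum to 1.
    """
    n = len(head_maps)
    if weights is None:
        weights = [1.0] * n
    total = sum(weights)
    norm = [w / total for w in weights]
    size = len(head_maps[0])
    return [sum(norm[h] * head_maps[h][p] for h in range(n)) for p in range(size)]

# Three toy heads over four patches: two focused heads and one uninformative head.
heads = [
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.8, 0.2, 0.0],
    [0.25, 0.25, 0.25, 0.25],
]

mean_map = fuse_heads(heads)                                 # every head counts equally
weighted_map = fuse_heads(heads, weights=[2.0, 2.0, 0.0])    # uninformative head ignored
```

Down-weighting the uniform head removes its constant background contribution, which is the kind of effect adaptive fusion is meant to achieve.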
2304.01172
Report |
Generative Multiplane Neural Radiance for 3D-Aware Image Generation |
Amandeep Kumar, Ankan Kumar Bhunia, Sanath Narayan, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan |
We present a method to efficiently generate 3D-aware high-resolution images
that are view-consistent across multiple target views. The proposed multiplane
neural radiance model, named GMNR, consists of a novel α-guided
view-dependent representation (α-VdR) module for learning view-dependent
information. The α-VdR module, facilitated by an α-guided pixel
sampling technique, computes the view-dependent representation efficiently by
learning viewing direction and position coefficients. Moreover, we propose a
view-consistency loss to enforce photometric similarity across multiple views.
The GMNR model can generate 3D-aware high-resolution images that are
view-consistent across multiple camera poses, while maintaining the
computational efficiency in terms of both training and inference time.
Experiments on three datasets demonstrate the effectiveness of the proposed
modules, leading to favorable results in terms of both generation quality and
inference time, compared to existing approaches. Our GMNR model generates
3D-aware images of 1024 × 1024 pixels at 17.6 FPS on a single V100. Code:
https://github.com/VIROBO-15/GMNR |
This paper proposes Generative Multiplane Neural Radiance (GMNR), an efficient approach for synthesizing 3D-aware and view-consistent high-resolution images across different camera poses. |
Generating 3D-aware images that maintain consistency across views is challenging due to the lack of 3D geometry supervision and the need for high-resolution outputs at extrapolated views. |
GMNR leverages multiplane images and introduces an α-guided view-dependent representation (α-VdR) module. This module learns view-dependent information by efficiently sampling pixels using an α-guided technique and computing view-dependent pixel representations. It also incorporates a view-consistency loss to enforce photometric similarity across multiple rendered views. |
GMNR outperforms the baseline GMPI method in terms of FID, KID, identity consistency, depth accuracy, and pose accuracy on FFHQ and AFHQv2-Cats datasets.
The α-VdR module significantly improves image quality at extrapolated views by learning view-dependent information.
GMNR achieves comparable or better performance than state-of-the-art methods like EG3D and StyleNeRF while maintaining high inference speed (17.6 FPS for 1024x1024 images on a single V100). |
The maximum sampling rate within the α-VdR module is limited by training batch size.
Future work can explore extending GMNR to handle more complex scenes and object categories. |
3d-aware image generation, view-consistency, multiplane images, generative adversarial networks, view-dependent representation |
2304.01114
Report |
Associating Spatially-Consistent Grouping with Text-supervised Semantic Segmentation |
Yabo Zhang, Zihao Wang, Jun Hao Liew, Jingjia Huang, Manyu Zhu, Jiashi Feng, Wangmeng Zuo |
In this work, we investigate performing semantic segmentation solely through
the training on image-sentence pairs. Due to the lack of dense annotations,
existing text-supervised methods can only learn to group an image into semantic
regions via pixel-insensitive feedback. As a result, their grouped results are
coarse and often contain small spurious regions, limiting the upper-bound
performance of segmentation. On the other hand, we observe that grouped results
from self-supervised models are more semantically consistent and break the
bottleneck of existing methods. Motivated by this, we propose to associate
self-supervised spatially-consistent grouping with text-supervised semantic
segmentation. Considering the part-like grouped results, we further adapt a
text-supervised model from image-level to region-level recognition with two
core designs. First, we encourage fine-grained alignment with a one-way
noun-to-region contrastive loss, which reduces the mismatched noun-region
pairs. Second, we adopt a contextually aware masking strategy to enable
simultaneous recognition of all grouped regions. Coupled with
spatially-consistent grouping and region-adapted recognition, our method
achieves 59.2% mIoU and 32.4% mIoU on Pascal VOC and Pascal Context benchmarks,
significantly surpassing the state-of-the-art methods. |
This paper proposes a novel text-supervised semantic segmentation method that leverages spatially-consistent grouping from self-supervised vision models to improve segmentation performance. |
Existing text-supervised methods struggle to produce spatially consistent segmentation results due to relying solely on pixel-insensitive image-sentence matching losses. This limits their upper-bound performance as incorrectly grouped pixels are difficult to separate during recognition. |
The proposed method utilizes self-supervised features for consistent region grouping and adapts a text-supervised model (CLIP) for region-level recognition with two key designs: 1) a context-aware masking strategy for efficient and effective encoding of grouped regions, and 2) a one-way noun-region contrastive loss to encourage fine-grained alignment while minimizing mismatched noun-region pairs. |
The method achieves state-of-the-art performance on Pascal VOC, Pascal Context, and COCO benchmarks for text-supervised semantic segmentation.
Qualitative results demonstrate higher quality segmentation masks with fewer spurious regions and more accurate boundaries compared to previous methods.
Ablation studies confirm the effectiveness of the proposed masking strategy, fine-tuning approach, and one-way noun-region alignment loss. |
The method relies on an external self-supervised model for grouping, which adds complexity.
The one-way alignment, while effective, might overlook potential matches where regions are described indirectly in the paired sentence. |
semantic segmentation, text supervision, self-supervised learning, vision-language models, clip |
2304.00964
Report |
Robust Text-driven Image Editing Method that Adaptively Explores Directions in Latent Spaces of StyleGAN and CLIP |
Tsuyoshi Baba, Kosuke Nishida, Kyosuke Nishida |
Automatic image editing is in great demand because of its numerous
applications, and the use of natural language instructions is essential to
achieving flexible and intuitive editing as the user imagines. A pioneering
work in text-driven image editing, StyleCLIP, finds an edit direction in the
CLIP space and then edits the image by mapping the direction to the StyleGAN
space. At the same time, it is difficult to tune appropriate inputs other than
the original image and text instructions for image editing. In this study, we
propose a method to construct the edit direction adaptively in the StyleGAN and
CLIP spaces with SVM. Our model represents the edit direction as a normal
vector in the CLIP space obtained by training an SVM to classify positive and
negative images. The images are retrieved from a large-scale image corpus,
originally used for pre-training StyleGAN, according to the CLIP similarity
between the images and the text instruction. We confirmed that our model
performed as well as the StyleCLIP baseline, whereas it allows simple inputs
without increasing the computational time. |
This paper introduces StyleCLIP-FEU, a text-driven image editing method that eliminates the need for a neutral text description required by StyleCLIP. |
This addresses the limitations of StyleCLIP, which requires users to manually provide a neutral text description of the original image, making it less user-friendly and intuitive. |
StyleCLIP-FEU leverages a Support Vector Machine (SVM) to adaptively construct the editing direction in the latent spaces of StyleGAN and CLIP. This involves retrieving positive and negative images from a large corpus based on CLIP similarity to the instruction, and training an SVM to find a hyperplane separating them. |
StyleCLIP-FEU achieves comparable editing performance to StyleCLIP without requiring neutral text input.
The method is robust to changes in hyperparameters controlling editing strength, unlike StyleCLIP.
Subjective evaluation shows that StyleCLIP-FEU outperforms StyleCLIP in both accuracy of edits and naturalness of generated images. |
Adaptive selection of the sparsity hyperparameter for finer control over editing needs further exploration.
The method currently focuses on facial images and could be extended to other domains. |
image editing, text-guided synthesis, stylegan, clip, support vector machine |
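The StyleCLIP-FEU report above represents the edit direction as the normal vector of an SVM hyperplane separating positive from negative images. The sketch below shows that idea on toy 2-D vectors standing in for CLIP embeddings. The trainer is a plain subgradient-descent linear SVM written for illustration, not the solver used by the authors. The hyperparameters and the names `train_linear_svm` and `unit` are assumptions.

```python
import math

def train_linear_svm(pos, neg, epochs=200, lr=0.1, reg=0.01):
    """Fit w, b by subgradient descent on the L2-regularized hinge loss."""
    data = [(x, 1.0) for x in pos] + [(x, -1.0) for x in neg]
    dim = len(pos[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularizer shrinks w on every step.
            w = [wi * (1.0 - lr * reg) for wi in w]
            # The hinge term contributes only when the sample is inside the margin.
            if margin < 1.0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def unit(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

# Toy embeddings: positives lie toward +x, negatives toward -x.
pos = [[2.0, 0.1], [1.5, -0.2], [2.2, 0.3]]
neg = [[-2.0, 0.2], [-1.8, -0.1], [-2.1, 0.0]]

w, b = train_linear_svm(pos, neg)
direction = unit(w)   # unit normal of the separating hyperplane, used as the edit direction
```

In the method as summarized, the positive and negative sets come from retrieval by CLIP similarity to the instruction, so no neutral text has to be supplied to define where the direction starts.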
2304.00962
Report |
RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding |
Jihan Yang, Runyu Ding, Weipeng Deng, Zhe Wang, Xiaojuan Qi |
We propose a lightweight and scalable Regional Point-Language Contrastive
learning framework, namely RegionPLC, for open-world 3D scene
understanding, aiming to identify and recognize open-set objects and
categories. Specifically, based on our empirical studies, we introduce a
3D-aware SFusion strategy that fuses 3D vision-language pairs derived from
multiple 2D foundation models, yielding high-quality, dense region-level
language descriptions without human 3D annotations. Subsequently, we devise a
region-aware point-discriminative contrastive learning objective to enable
robust and effective 3D learning from dense regional language supervision. We
carry out extensive experiments on ScanNet, ScanNet200, and nuScenes datasets,
and our model outperforms prior 3D open-world scene understanding approaches by
an average of 17.2% and 9.1% for semantic and instance segmentation,
respectively, while maintaining greater scalability and lower resource demands.
Furthermore, our method has the flexibility to be effortlessly integrated with
language models to enable open-ended grounded 3D reasoning without extra
task-specific training. Code is available at https://github.com/CVMI-Lab/PLA. |
This paper introduces RegionPLC, a novel regional point-language contrastive learning framework for open-world 3D scene understanding, enabling recognition and localization of unseen object categories. |
Open-world 3D scene understanding is crucial for real-world applications but challenging due to the scarcity of dense 3D semantic annotations. Existing methods suffer from limitations such as constrained vocabulary space and high resource requirements. |
RegionPLC leverages diverse 2D foundation models to generate dense region-level 3D-language pairs using a novel supplementary-oriented fusion strategy (SFusion). It then employs a region-aware point-discriminative contrastive learning objective to train a 3D backbone for open-world understanding. |
RegionPLC significantly outperforms prior open-world methods, achieving an average of 17.2% gains in unseen category mIoU for semantic segmentation and 9.1% in mAP50 for instance segmentation.
It demonstrates promising zero-shot segmentation performance, achieving 40.5% higher foreground mIoU compared to previous state-of-the-art with only language supervision.
RegionPLC is also lightweight, requiring only 17% of OpenScene's training cost and 5% of its storage, while being easily integrated with language models for open-ended grounded 3D reasoning. |
Current integration of 2D image features is straightforward and can be further improved with more advanced strategies.
Visual prompts are pre-defined and can benefit from more adaptive techniques. |
open-world learning, 3d scene understanding, point cloud segmentation, vision-language learning, contrastive learning |
2304.00838
Report |
MetaHead: An Engine to Create Realistic Digital Head |
Dingyun Zhang, Chenglai Zhong, Yudong Guo, Yang Hong, Juyong Zhang |
Collecting and labeling training data is an important step for
learning-based methods, but the process is time-consuming and prone to bias. For
face analysis tasks, although some generative models can be used to generate
face data, they can only achieve a subset of generation diversity,
reconstruction accuracy, 3D consistency, high-fidelity visual quality, and easy
editability. One recent related work is the graphics-based generative method,
but it can only render low-realism heads at high computational cost. In this
paper, we propose MetaHead, a unified and full-featured controllable digital
head engine, which consists of a controllable head radiance field (MetaHead-F)
to super-realistically generate or reconstruct view-consistent 3D controllable
digital heads and a generic top-down image generation framework LabelHead to
generate digital heads consistent with the given customizable feature labels.
Experiments validate that our controllable digital head engine achieves the
state-of-the-art generation visual quality and reconstruction accuracy.
Moreover, the generated labeled data can assist real training data and
significantly surpass the labeled data generated by graphics-based methods in
terms of training effect. |
This paper presents MetaHead, a unified and full-featured controllable digital head engine that enables realistic head reconstruction, control, generation, and label-consistent synthesis. |
MetaHead addresses the limitations of existing methods in generation diversity, reconstruction accuracy, 3D consistency, fidelity, and editability, especially in challenging scenarios, and aims to streamline data collection and annotation for face analysis tasks. |
Combines a controllable head radiance field (MetaHead-F) with a generic top-down image generation framework (LabelHead). MetaHead-F leverages a novel decoder-GAN combination strategy with hierarchical attention for high-quality, disentangled control, while LabelHead uses embedded features for label-consistent generation and bidirectional label estimation. |
Achieves state-of-the-art generation quality and reconstruction accuracy, outperforming existing methods on standard metrics.
Enables precise and decoupled 3D control over identity, expression, texture, illumination, pose, and other customizable features like gaze and hair color.
Demonstrates the ability to generate labeled data significantly better than graphics-based methods, improving downstream tasks like landmark estimation and gaze estimation, even with limited real data. |
Potential misuse of highly realistic generated heads necessitates safeguarding measures.
Current implementation focuses on head generation; expanding to full-body generation presents further challenges. |
digital human, generative model, 3d head, radiance field, data augmentation |
2304.00793
Report |
FinnWoodlands Dataset |
Juan Lagos, Urho Lempiö, Esa Rahtu |
While the availability of large and diverse datasets has contributed to
significant breakthroughs in autonomous driving and indoor applications,
forestry applications are still lagging behind and new forest datasets would
most certainly contribute to achieving significant progress in the development
of data-driven methods for forest-like scenarios. This paper introduces a
forest dataset called FinnWoodlands, which consists of RGB stereo
images, point clouds, and sparse depth maps, as well as ground truth manual
annotations for semantic, instance, and panoptic segmentation.
FinnWoodlands comprises a total of 4226 objects manually annotated,
out of which 2562 objects (60.6%) correspond to tree trunks classified into
three different instance categories, namely "Spruce Tree", "Birch Tree", and
"Pine Tree". Besides tree trunks, we also annotated "Obstacles" objects as
instances as well as the semantic stuff classes "Lake", "Ground", and "Track".
Our dataset can be used in forestry applications where a holistic
representation of the environment is relevant. We provide an initial benchmark
using three models for instance segmentation, panoptic segmentation, and depth
completion, and illustrate the challenges that such unstructured scenarios
introduce. |
This paper introduces FinnWoodlands, a forest dataset with RGB stereo images, point clouds, sparse depth maps, and ground truth annotations for semantic, instance, and panoptic segmentation, aiming to advance holistic scene understanding in forestry applications. |
Forest datasets are limited compared to urban or indoor datasets, hindering the development of data-driven methods for forestry applications like autonomous navigation and resource management. |
Collected data from Finnish forests using a backpack-mounted LIDAR and stereo camera setup. Manually annotated 300 frames with semantic, instance, and panoptic segmentation ground truths. Provided initial benchmark results using Mask R-CNN, EfficientPS, and FuseNet for instance segmentation, panoptic segmentation, and depth completion, respectively. |
FinnWoodlands contains 4226 manually annotated objects, of which 2562 (60.6%) are tree trunks.
Mask R-CNN and EfficientPS show promising results for object detection but struggle with accurate segmentation, especially in dense forest areas.
FuseNet demonstrates good generalization in depth completion but loses fine details of objects like trees. |
Limited diversity in terms of geographical location and seasonal variation.
Extend the dataset with more annotated frames and explore other computer vision tasks relevant to forestry. |
forestry, dataset, panoptic segmentation, instance segmentation, depth completion |
2304.00784
Report |
Disentangled Pre-training for Image Matting |
Yanda Li, Zilong Huang, Gang Yu, Ling Chen, Yunchao Wei, Jianbo Jiao |
Image matting requires high-quality pixel-level human annotations to support
the training of a deep model in recent literature. However, such annotation is
costly and hard to scale, significantly holding back the development of the
research. In this work, we make the first attempt towards addressing this
problem, by proposing a self-supervised pre-training approach that can leverage
infinite numbers of data to boost the matting performance. The pre-training
task is designed in a similar manner as image matting, where random trimap and
alpha matte are generated to achieve an image disentanglement objective. The
pre-trained model is then used as an initialisation of the downstream matting
task for fine-tuning. Extensive experimental evaluations show that the proposed
approach outperforms both the state-of-the-art matting methods and other
alternative self-supervised initialisation approaches by a large margin. We
also show the robustness of the proposed approach over different backbone
architectures. Our project page is available at
https://crystraldo.github.io/dpt_mat/. |
This paper proposes Disentangled Pre-training (DPT), a self-supervised pre-training approach for image matting to leverage large-scale unlabeled data. |
High-quality pixel-level annotations for image matting are costly and limit the development of data-driven deep learning methods. |
DPT simulates the matting process with synthetic data. It generates random trimaps for guidance and alpha mattes as pseudo labels. It then trains an encoder-decoder network to predict alpha mattes from composited images, mimicking the image disentanglement objective of image matting. |
DPT outperforms state-of-the-art image matting methods on Composition-1k and Distinct-646 datasets.
The method shows consistent performance improvements across different network backbones (CNN and Transformer).
The pre-trained model effectively learns contour information, as shown by its ability to extract object contours even without fine-tuning. |
The training data is class-agnostic and may limit its applicability to semantic-related tasks.
Future work could explore incorporating semantic information into the pre-training process. |
image matting, self-supervised learning, pre-training, disentanglement, trimap |
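The DPT method above trains on composited images with generated alpha mattes and trimaps. A minimal sketch of that data synthesis, assuming the standard compositing equation I = alpha * F + (1 - alpha) * B and grayscale images stored as nested lists. The helpers `random_alpha` and `trimap_from_alpha`, and the thresholds `lo` and `hi`, are hypothetical stand-ins; the record does not specify how the paper generates its random trimaps and mattes.

```python
import random

def composite(fg, bg, alpha):
    """Blend per pixel: I = alpha * F + (1 - alpha) * B."""
    return [
        [a * f + (1.0 - a) * b for f, b, a in zip(f_row, b_row, a_row)]
        for f_row, b_row, a_row in zip(fg, bg, alpha)
    ]

def random_alpha(height, width, rng):
    """Hypothetical pseudo label: independent random alpha values in [0, 1)."""
    return [[rng.random() for _ in range(width)] for _ in range(height)]

def trimap_from_alpha(alpha, lo=0.05, hi=0.95):
    """Hypothetical trimap: 0.0 = background, 1.0 = foreground, 0.5 = unknown."""
    def label(a):
        if a <= lo:
            return 0.0
        if a >= hi:
            return 1.0
        return 0.5
    return [[label(a) for a in row] for row in alpha]

rng = random.Random(0)
foreground = [[1.0, 1.0], [1.0, 1.0]]   # toy white foreground
background = [[0.0, 0.0], [0.0, 0.0]]   # toy black background
alpha = random_alpha(2, 2, rng)
image = composite(foreground, background, alpha)   # network input
trimap = trimap_from_alpha(alpha)                  # guidance; alpha is the target
```

With a white foreground on a black background the composite equals the alpha matte itself, which makes the blend easy to check by eye.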
2304.00749
Report |
Small but Mighty: Enhancing 3D Point Clouds Semantic Segmentation with U-Next Framework |
Ziyin Zeng, Qingyong Hu, Zhong Xie, Jian Zhou, Yongyang Xu |
We study the problem of semantic segmentation of large-scale 3D point clouds.
In recent years, significant research efforts have been directed toward local
feature aggregation, improved loss functions and sampling strategies. However, the
fundamental framework of point cloud semantic segmentation has been largely
overlooked, with most existing approaches relying on the U-Net architecture by
default. In this paper, we propose U-Next, a small but mighty framework
designed for point cloud semantic segmentation. The key to this framework is to
learn multi-scale hierarchical representations from semantically similar
feature maps. Specifically, we build our U-Next by stacking multiple U-Net
$L^1$ codecs in a nested and densely arranged manner to minimize the semantic
gap, while simultaneously fusing the feature maps across scales to effectively
recover the fine-grained details. We also devised a multi-level deep
supervision mechanism to further smooth gradient propagation and facilitate
network optimization. Extensive experiments conducted on three large-scale
benchmarks including S3DIS, Toronto3D, and SensatUrban demonstrate the
superiority and the effectiveness of the proposed U-Next architecture. Our
U-Next architecture shows consistent and visible performance improvements
across different tasks and baseline models, indicating its great potential to
serve as a general framework for future research. |
This paper proposes U-Next, a novel architecture for 3D point cloud semantic segmentation, designed to learn multi-scale hierarchical representations from semantically similar feature maps. |
Existing point cloud segmentation approaches heavily rely on the U-Net architecture, overlooking its limitations in handling information loss during aggressive downsampling and upsampling inherent to 3D point cloud data. |
U-Next leverages multiple stacked U-Net L1 sub-networks, minimizing semantic gaps between feature maps. It incorporates multi-level deep supervision to facilitate smooth gradient propagation and enhance network optimization. |
U-Next consistently outperforms U-Net and U-Net++ architectures on benchmarks like S3DIS, Toronto3D, and SensatUrban.
The architecture shows improvements across different baseline models (RandLA-Net, PointNet++, BAAF-Net, LACV-Net), highlighting its generalizability.
U-Next demonstrates significant performance gains without incurring substantial computational overhead. |
The optimal level of U-Next requires consideration based on accuracy and computational cost trade-offs.
Future work will explore U-Next's applicability across different data modalities and tasks to further assess its potential. |
3d point cloud, semantic segmentation, deep learning, u-net, multi-scale feature fusion |
2304.00719
Report |
Multi-Modal Representation Learning with Text-Driven Soft Masks |
Jaeyoo Park, Bohyung Han |
We propose a visual-linguistic representation learning approach within a
self-supervised learning framework by introducing a new operation, loss, and
data augmentation strategy. First, we generate diverse features for the
image-text matching (ITM) task via soft-masking the regions in an image, which
are most relevant to a certain word in the corresponding caption, instead of
completely removing them. Since our framework relies only on image-caption
pairs with no fine-grained annotations, we identify the relevant regions to
each word by computing the word-conditional visual attention using multi-modal
encoder. Second, we encourage the model to focus more on hard but diverse
examples by proposing a focal loss for the image-text contrastive learning
(ITC) objective, which alleviates the inherent limitations of overfitting and
bias issues. Last, we perform multi-modal data augmentations for
self-supervised learning via mining various examples by masking texts and
rendering distortions on images. We show that the combination of these three
innovations is effective for learning a pretrained model, leading to
outstanding performance on multiple vision-language downstream tasks. |
This paper proposes a novel visual-linguistic representation learning framework that utilizes soft feature masking and diverse regularizations to enhance performance in vision-language tasks. |
Existing vision-language models, pretrained only on image-caption pairs, tend to overfit to discriminative image regions and lack understanding of finer details. This paper addresses this limitation. |
The proposed method introduces three key components: 1) Text-driven soft feature masking to diversify visual features by suppressing activations at important regions based on word-conditional Grad-CAM. 2) Focal image-text contrastive learning to emphasize hard examples and address overfitting and bias. 3) Multi-modal data augmentation with strong augmentations and binary caption masking to further diversify training samples. |
The proposed approach achieves state-of-the-art performance among detector-free methods on various vision-language downstream tasks, including image-text retrieval, visual entailment, and visual question answering.
Ablation studies demonstrate the effectiveness of each component, with soft masking, focal ITC loss, and multi-modal data augmentation all contributing to performance gains.
Qualitative analysis of word-conditional Grad-CAM visualizations highlights the model's ability to capture more accurate and comprehensive object and attribute representations compared to the baseline. |
The model tends to learn biases towards objects and scenes when the most salient parts are heavily masked.
Optimal masking strategies and their impact on specific downstream tasks need further exploration. |
multi-modal representation learning, vision-language pretraining, soft feature masking, focal contrastive learning, data augmentation |
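The focal image-text contrastive objective in the record above can be illustrated with the usual focal weighting applied to a softmax over image-to-caption similarities. This is a sketch under assumptions: the form -(1 - p)^gamma * log(p), the `gamma` and `temperature` defaults, and the toy similarity rows are illustrative and not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def focal_contrastive_loss(sim_row, positive_index, gamma=2.0, temperature=0.07):
    """Focal-weighted contrastive loss for one image against a batch of captions.

    p is the softmax probability of the matching caption. The factor
    (1 - p) ** gamma shrinks the loss of easy pairs (p near 1), so hard
    pairs dominate. gamma = 0 recovers plain cross-entropy.
    """
    probs = softmax([s / temperature for s in sim_row])
    p = probs[positive_index]
    return -((1.0 - p) ** gamma) * math.log(p)

# Toy similarity rows (made up): the matching caption is at index 0.
easy = focal_contrastive_loss([0.9, 0.1, 0.0], positive_index=0)
hard = focal_contrastive_loss([0.3, 0.35, 0.2], positive_index=0)
```

The easy pair contributes almost nothing, while the hard pair keeps most of its cross-entropy loss.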
2304.00341
Report |
JacobiNeRF: NeRF Shaping with Mutual Information Gradients |
Xiaomeng Xu, Yanchao Yang, Kaichun Mo, Boxiao Pan, Li Yi, Leonidas Guibas |
We propose a method that trains a neural radiance field (NeRF) to encode not
only the appearance of the scene but also semantic correlations between scene
points, regions, or entities -- aiming to capture their mutual co-variation
patterns. In contrast to the traditional first-order photometric reconstruction
objective, our method explicitly regularizes the learning dynamics to align the
Jacobians of highly-correlated entities, which proves to maximize the mutual
information between them under random scene perturbations. By paying attention
to this second-order information, we can shape a NeRF to express semantically
meaningful synergies when the network weights are changed by a delta along the
gradient of a single entity, region, or even a point. To demonstrate the merit
of this mutual information modeling, we leverage the coordinated behavior of
scene entities that emerges from our shaping to perform label propagation for
semantic and instance segmentation. Our experiments show that a JacobiNeRF is
more efficient in propagating annotations among 2D pixels and 3D points
compared to NeRFs without mutual information shaping, especially in extremely
sparse label regimes -- thus reducing annotation burden. The same machinery can
further be used for entity selection or scene modifications. |
This paper introduces JacobiNeRF, a novel method for shaping Neural Radiance Fields (NeRFs) to encode semantic correlations between scene elements by aligning their Jacobians in the network's tangent space. |
Standard NeRFs excel at scene appearance and geometry but often lack awareness of semantic relationships, hindering tasks like entity selection, annotation, and editing. JacobiNeRF addresses this by encoding semantic correlations directly into the NeRF representation. |
The method leverages the equivalence between mutual information and Jacobian cosine similarity. By applying contrastive learning on NeRF gradients, it aligns the tangent space with semantic correlations derived from self-supervised features (e.g., DINO). |
JacobiNeRF effectively propagates sparse annotations for semantic and instance segmentation, outperforming methods relying solely on first-order information or 2D features.
The approach generalizes well to novel views, demonstrating superior performance on distant viewpoints compared to methods overfitting to source views.
Beyond segmentation, JacobiNeRF enables tasks like entity selection and consistent scene recoloring by leveraging the encoded semantic correlations. |
The current implementation relies on self-supervised visual features, which may limit performance compared to incorporating stronger semantic priors.
The dense label propagation strategy could benefit from more sophisticated gradient subsampling techniques to improve efficiency and accuracy. |
neural radiance fields, nerf, semantic segmentation, instance segmentation, mutual information |
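The JacobiNeRF record above relies on two quantities: the cosine similarity between the Jacobians of two scene points, and the response at one point when the weights move along the gradient of another. A toy sketch of both, assuming short hand-written vectors stand in for per-point gradients of a NeRF output with respect to its weights; the function names are hypothetical and no real NeRF is involved.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the angle between two Jacobian (gradient) vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def response_to_perturbation(jac_i, jac_j, step):
    """First-order change at point j when the weights move by step * jac_i."""
    return step * dot(jac_i, jac_j)

# Made-up Jacobians: points a and b belong to the same entity, c does not.
jac_a = [1.0, 2.0, 0.0]
jac_b = [2.0, 4.0, 0.0]
jac_c = [0.0, 0.0, 3.0]

same_entity = cosine_similarity(jac_a, jac_b)               # close to 1
other_entity = cosine_similarity(jac_a, jac_c)              # 0
moved_b = response_to_perturbation(jac_a, jac_b, step=0.1)  # b responds
moved_c = response_to_perturbation(jac_a, jac_c, step=0.1)  # c does not
```

Points whose Jacobians are aligned respond together to a perturbation along one point's gradient, which is the behavior the record describes being used for label propagation.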
2304.00334
Report |
TalkCLIP: Talking Head Generation with Text-Guided Expressive Speaking Styles |
Yifeng Ma, Suzhen Wang, Yu Ding, Bowen Ma, Tangjie Lv, Changjie Fan, Zhipeng Hu, Zhidong Deng, Xin Yu |
In order to produce facial-expression-specified talking head videos, previous
audio-driven one-shot talking head methods need to use a reference video with a
matching speaking style (i.e., facial expressions). However, finding videos
with a desired style may not be easy, potentially restricting their
application. In this work, we propose an expression-controllable one-shot
talking head method, dubbed TalkCLIP, where the expression in a speech is
specified by the natural language. This would significantly ease the difficulty
of searching for a video with a desired speaking style. Here, we first
construct a text-video paired talking head dataset, in which each video has
alternative prompt-alike descriptions. Specifically, our descriptions involve
coarse-level emotion annotations and facial action unit (AU) based fine-grained
annotations. Then, we introduce a CLIP-based style encoder that first projects
natural language descriptions to the CLIP text embedding space and then aligns
the textual embeddings to the representations of speaking styles. As extensive
textual knowledge has been encoded by CLIP, our method can even generalize to
infer a speaking style whose description has not been seen during training.
Extensive experiments demonstrate that our method achieves the advanced
capability of generating photo-realistic talking heads with vivid facial
expressions guided by text descriptions. |
This paper proposes TalkCLIP, a novel one-shot talking head generation framework that produces photo-realistic videos with speaking styles controlled by natural language descriptions. |
Existing methods for generating expressive talking heads rely on reference videos with matching speaking styles, which can be difficult and time-consuming to find. TalkCLIP addresses this limitation by enabling direct control of expressions via text, making the process more user-friendly and flexible. |
The authors construct TA-MEAD, a text-annotated talking head dataset based on MEAD, with coarse-level emotion and fine-grained AU-based descriptions. TalkCLIP utilizes a CLIP-based text encoder, trained with guidance from a video-to-speaking-style encoder, to map text descriptions to latent speaking style codes. These codes, along with audio features, drive a facial animation decoder and image renderer to generate the final video. |
TalkCLIP generates photo-realistic talking heads with vivid facial expressions accurately reflecting the input text descriptions.
The method exhibits strong generalization capabilities, effectively handling out-of-domain text descriptions not seen during training.
TalkCLIP achieves comparable or superior performance to state-of-the-art methods that rely on reference videos for style control. |
TalkCLIP may struggle to generate accurate speaking styles for abstract text descriptions, like idioms.
The method does not currently consider the emotional content of the input audio, potentially leading to inconsistencies between audio and generated video. |
talking head generation, text-guided synthesis, expressive facial animation, clip, visual-language learning |
2304.00287
Report |
Vision Transformers with Mixed-Resolution Tokenization |
Tomer Ronen, Omer Levy, Avram Golbert |
Vision Transformer models process input images by dividing them into a
spatially regular grid of equal-size patches. Conversely, Transformers were
originally introduced over natural language sequences, where each token
represents a subword - a chunk of raw data of arbitrary size. In this work, we
apply this approach to Vision Transformers by introducing a novel image
tokenization scheme, replacing the standard uniform grid with a
mixed-resolution sequence of tokens, where each token represents a patch of
arbitrary size. Using the Quadtree algorithm and a novel saliency scorer, we
construct a patch mosaic where low-saliency areas of the image are processed in
low resolution, routing more of the model's capacity to important image
regions. Using the same architecture as vanilla ViTs, our Quadformer models
achieve substantial accuracy gains on image classification when controlling for
the computational budget. Code and models are publicly available at
https://github.com/TomerRonen34/mixed-resolution-vit . |
This paper introduces Quadformer, a Vision Transformer that utilizes a novel mixed-resolution tokenization scheme based on Quadtrees and a saliency scorer, allowing it to process important image regions in high resolution. |
The standard uniform grid tokenization in ViTs can be inefficient, treating all image regions equally. Quadformer aims to improve efficiency by focusing computational resources on salient image areas. |
Quadformer replaces the uniform grid with a Quadtree-based patch mosaic, using a saliency scorer to determine patch sizes. It employs 2D position embeddings and adapts the standard ViT architecture to process mixed-resolution tokens. |
Quadformer with feature-based saliency consistently outperforms vanilla ViTs in accuracy by up to 0.88% when controlling for the number of patches or GMACs.
Despite not using accelerated inference techniques, Quadformer also shows gains when controlling for inference speed.
Quadformers exhibit less sensitivity to out-of-distribution input lengths compared to vanilla ViTs, enabling a better inference-time compute-accuracy trade-off with a single model. |
The runtime overhead of the feature-based saliency scorer can be significant for small ViT models.
Finding faster high-quality saliency estimators is crucial for efficient mixed-resolution tokenization in small ViTs. |
vision transformers, tokenization, quadtrees, saliency, mixed-resolution |
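The Quadtree tokenization in the record above can be sketched as a greedy loop: start from a coarse grid and keep splitting the most salient patch into four until a patch budget is reached. This is a sketch under assumptions: the pixel-variance scorer is a stand-in for the paper's saliency scorer, and the function names, the square image, and the power-of-two sizes are all illustrative choices.

```python
import heapq

def patch_saliency(image, x, y, size):
    """Stand-in scorer (assumption): pixel variance inside the patch."""
    values = [image[r][c] for r in range(y, y + size) for c in range(x, x + size)]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def quadtree_patches(image, max_size, min_size, num_patches):
    """Return (x, y, size) patches covering a square image.

    Assumes the image side is divisible by max_size and that max_size and
    min_size are powers of two. Each split replaces one patch with four,
    so the patch count grows by three per split.
    """
    side = len(image)
    heap = []  # min-heap on negated saliency, so the most salient pops first
    for y in range(0, side, max_size):
        for x in range(0, side, max_size):
            score = patch_saliency(image, x, y, max_size)
            heapq.heappush(heap, (-score, x, y, max_size))
    finished = []  # patches already at min_size, which cannot be split
    while heap and len(heap) + len(finished) + 3 <= num_patches:
        _, x, y, size = heapq.heappop(heap)
        if size <= min_size:
            finished.append((x, y, size))
            continue
        half = size // 2
        for dy in (0, half):
            for dx in (0, half):
                score = patch_saliency(image, x + dx, y + dy, half)
                heapq.heappush(heap, (-score, x + dx, y + dy, half))
    return sorted(finished + [(x, y, size) for _, x, y, size in heap])

# Toy 8x8 image: detail only in the top-left 4x4 corner, flat elsewhere.
image = [[(r * 4 + c) if r < 4 and c < 4 else 0 for c in range(8)] for r in range(8)]
patches = quadtree_patches(image, max_size=4, min_size=2, num_patches=7)
```

In this toy case only the detailed corner is refined into 2x2 patches, while the three flat quadrants stay as single 4x4 patches.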
2304.00186
Report |
Subject-driven Text-to-Image Generation via Apprenticeship Learning |
Wenhu Chen, Hexiang Hu, Yandong Li, Nataniel Ruiz, Xuhui Jia, Ming-Wei Chang, William W. Cohen |
Recent text-to-image generation models like DreamBooth have made remarkable
progress in generating highly customized images of a target subject, by
fine-tuning an ``expert model'' for a given subject from a few examples.
However, this process is expensive, since a new expert model must be learned
for each subject. In this paper, we present SuTI, a Subject-driven
Text-to-Image generator that replaces subject-specific fine tuning with
in-context learning. Given a few demonstrations of a new subject, SuTI can
instantly generate novel renditions of the subject in different scenes, without
any subject-specific optimization. SuTI is powered by apprenticeship learning,
where a single apprentice model is learned from data generated by a massive
number of subject-specific expert models. Specifically, we mine millions of
image clusters from the Internet, each centered around a specific visual
subject. We adopt these clusters to train a massive number of expert models,
each specializing in a different subject. The apprentice model SuTI then learns
to imitate the behavior of these fine-tuned experts. SuTI can generate
high-quality and customized subject-specific images 20x faster than
optimization-based SoTA methods. On the challenging DreamBench and
DreamBench-v2, our human evaluation shows that SuTI significantly outperforms
existing models like InstructPix2Pix, Textual Inversion, Imagic, Prompt2Prompt,
Re-Imagen and DreamBooth, especially on the subject and text alignment aspects. |
This paper introduces SuTI, a subject-driven text-to-image generation model that uses in-context learning to generate customized images of a target subject without subject-specific optimization. |
Existing subject-driven image generation methods require fine-tuning specific models for each subject, which is slow and expensive. |
SuTI employs apprenticeship learning, where it's trained on a massive dataset of image clusters to imitate the behavior of millions of specialized expert models. |
SuTI generates high-quality, customized subject-specific images 20x faster than optimization-based methods.
On DreamBench and DreamBench-v2, SuTI significantly outperforms existing models in human evaluations.
SuTI shows strong capabilities in subject re-contextualization, attribute editing, artistic style transfer, and accessorization. |
SuTI's generations are less diverse than DreamBooth and less faithful to low-level visual details.
SuTI struggles with highly compositional prompts. |
text-to-image generation, subject-driven generation, in-context learning, apprenticeship learning, diffusion models |
2304.00049
Report |
Ranking Regularization for Critical Rare Classes: Minimizing False Positives at a High True Positive Rate |
Mohammadi Kiarash, Zhao He, Mengyao Zhai, Frederick Tung |
In many real-world settings, the critical class is rare and a missed
detection carries a disproportionately high cost. For example, tumors are rare
and a false negative diagnosis could have severe consequences on treatment
outcomes; fraudulent banking transactions are rare and an undetected occurrence
could result in significant losses or legal penalties. In such contexts,
systems are often operated at a high true positive rate, which may require
tolerating high false positives. In this paper, we present a novel approach to
address the challenge of minimizing false positives for systems that need to
operate at a high true positive rate. We propose a ranking-based regularization
(RankReg) approach that is easy to implement, and show empirically that it not
only effectively reduces false positives, but also complements conventional
imbalanced learning losses. With this novel technique in hand, we conduct a
series of experiments on three broadly explored datasets (CIFAR-10&100 and
Melanoma) and show that our approach lifts the previous state-of-the-art
performance by notable margins. |
Presents RankReg, a ranking-based regularization method to minimize false positives in imbalanced classification tasks where a high true positive rate is critical. |
In critical applications like medical diagnosis and fraud detection, minimizing false positives at high true positive rates is crucial, which conventional methods often fail to address. |
RankReg adds a regularization term that penalizes lower rankings of critical positives, encouraging the model to rank them higher than non-critical negatives. This is optimized using a differentiable ranking method based on a combinatorial solver. |
RankReg consistently outperforms existing methods, including the state-of-the-art ALM, on CIFAR-10, CIFAR-100, and Melanoma datasets.
The method is complementary to various conventional imbalanced learning losses, demonstrating its general applicability.
RankReg shows robustness to label noise, making it suitable for real-world scenarios. |
The paper primarily focuses on binary classification tasks, and extending it to multi-class scenarios with more than one critical class requires further investigation.
The buffer size for storing positive samples impacts performance and might require task-specific tuning. |
imbalanced classification, false positive rate minimization, critical rare classes, ranking regularization, differentiable ranking |
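The quantity RankReg targets in the record above, the false positive rate at a high true positive rate, can be computed directly from scores and labels. The sketch below covers only this evaluation metric, not the regularizer: the differentiable ranking via a combinatorial solver is not reproduced. The function name and the toy scores are hypothetical.

```python
import math

def fpr_at_tpr(scores, labels, target_tpr=0.95):
    """False positive rate at the highest threshold reaching target_tpr.

    labels use 1 for the critical (positive) class and 0 otherwise.
    Assumes at least one positive, one negative, and target_tpr > 0.
    """
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    needed = math.ceil(target_tpr * len(positives))  # positives that must be caught
    threshold = positives[needed - 1]
    false_positives = sum(1 for s in negatives if s >= threshold)
    return false_positives / len(negatives)

labels = [1, 1, 1, 0, 0, 0]
# One positive (0.4) is ranked below a negative (0.7).
before = fpr_at_tpr([0.9, 0.8, 0.4, 0.7, 0.3, 0.1], labels, target_tpr=1.0)
# The same positive ranked above every negative.
after = fpr_at_tpr([0.9, 0.8, 0.75, 0.7, 0.3, 0.1], labels, target_tpr=1.0)
```

Catching every positive in the first case forces the threshold down to 0.4 and admits one of three negatives; once the low-ranked positive sits above all negatives, the false positive rate at the same true positive rate drops to zero.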
2303.18193
Report |
GVP: Generative Volumetric Primitives |
Mallikarjun B R, Xingang Pan, Mohamed Elgharib, Christian Theobalt |
Advances in 3D-aware generative models have pushed the boundary of image
synthesis with explicit camera control. To achieve high-resolution image
synthesis, several attempts have been made to design efficient generators, such
as hybrid architectures with both 3D and 2D components. However, such a design
compromises multiview consistency, and the design of a pure 3D generator with
high resolution is still an open problem. In this work, we present Generative
Volumetric Primitives (GVP), the first pure 3D generative model that can sample
and render 512-resolution images in real-time. GVP jointly models a number of
volumetric primitives and their spatial information, both of which can be
efficiently generated via a 2D convolutional network. The mixture of these
primitives naturally captures the sparsity and correspondence in the 3D volume.
The training of such a generator with a high degree of freedom is made possible
through a knowledge distillation technique. Experiments on several datasets
demonstrate superior efficiency and 3D consistency of GVP over the
state-of-the-art. |
Presents Generative Volumetric Primitives (GVP), the first 3D-aware generative model based on pure 3D representation that can render 512-resolution images in real-time. |
Addresses the limitations of existing 3D-aware GANs in achieving high-resolution image synthesis with real-time rendering and multiview consistency. |
Utilizes a mixture of volumetric primitives (MVP) representation, where each primitive models a local volume's color and density. Employs a 2D convolutional network to efficiently generate primitives and their spatial information. Leverages knowledge distillation from a pretrained 3D-aware GAN (EG3D) for stable training. |
Achieves much faster rendering than previous pure 3D GANs.
Preserves better multiview consistency compared to hybrid architectures that rely on 2D upsampling.
Learned primitives effectively capture 3D volume sparsity and adapt to different samples, hinting at correspondence learning. |
Struggles to generate high-quality details for certain features like curly hair due to the discontinuous 3D scene representation.
Purely adversarial training proves unstable due to the discontinuous representation. |
generative models, 3d-aware gans, volumetric primitives, real-time rendering, multiview consistency |
2303.18181
Report |
A Closer Look at Parameter-Efficient Tuning in Diffusion Models |
Chendong Xiang, Fan Bao, Chongxuan Li, Hang Su, Jun Zhu |
Large-scale diffusion models like Stable Diffusion are powerful and find
various real-world applications while customizing such models by fine-tuning is
both memory and time inefficient. Motivated by the recent progress in natural
language processing, we investigate parameter-efficient tuning in large
diffusion models by inserting small learnable modules (termed adapters). In
particular, we decompose the design space of adapters into orthogonal factors
-- the input position, the output position as well as the function form, and
perform Analysis of Variance (ANOVA), a classical statistical approach for
analyzing the correlation between discrete (design options) and continuous
variables (evaluation metrics). Our analysis suggests that the input position
of adapters is the critical factor influencing the performance of downstream
tasks. Then, we carefully study the choice of the input position, and we find
that putting the input position after the cross-attention block can lead to the
best performance, validated by additional visualization analyses. Finally, we
provide a recipe for parameter-efficient tuning in diffusion models, which is
comparable if not superior to the fully fine-tuned baseline (e.g., DreamBooth)
with only 0.75 \% extra parameters, across various customized tasks. |
This paper presents a systematic study on parameter-efficient tuning of large-scale diffusion models using lightweight trainable modules called adapters. |
Fine-tuning entire large diffusion models like Stable Diffusion for customization is computationally and memory intensive. Parameter-efficient tuning offers a more efficient alternative. |
The authors decompose the adapter design space into input position, output position, and function form. They utilize Analysis of Variance (ANOVA) to identify the most impactful factor on downstream task performance. |
Input position of the adapter is the most critical factor for effective parameter-efficient tuning.
Placing the adapter after the cross-attention block in the U-Net architecture yields the best performance.
The proposed adapter-based tuning achieves comparable or better results than full fine-tuning (DreamBooth) with only 0.75% extra parameters. |
The study primarily focuses on Stable Diffusion due to its open-source nature.
Exploration of adapter placement beyond a single position is left for future work. |
diffusion models, parameter-efficient tuning, adapters, stable diffusion, transfer learning |
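The study above uses Analysis of Variance to decide which design factor matters most. A minimal sketch of the classical one-way ANOVA F statistic, assuming evaluation scores are grouped by one discrete design option. The scores below are made up for illustration and are not results from the paper.

```python
def one_way_anova_f(groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Made-up metric values for three hypothetical adapter input positions.
scores_by_position = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0],
]
f_statistic = one_way_anova_f(scores_by_position)
```

A large F value means the group means differ by more than the scatter within groups would explain, which is the sense in which a factor such as the input position can be called influential.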
2303.18144
Report |
Siamese DETR |
Zeren Chen, Gengshi Huang, Wei Li, Jianing Teng, Kun Wang, Jing Shao, Chen Change Loy, Lu Sheng |
Recent self-supervised methods are mainly designed for representation
learning with the base model, e.g., ResNets or ViTs. They cannot be easily
transferred to DETR, with task-specific Transformer modules. In this work, we
present Siamese DETR, a Siamese self-supervised pretraining approach for the
Transformer architecture in DETR. We consider learning view-invariant and
detection-oriented representations simultaneously through two complementary
tasks, i.e., localization and discrimination, in a novel multi-view learning
framework. Two self-supervised pretext tasks are designed: (i) Multi-View
Region Detection aims at learning to localize regions-of-interest between
augmented views of the input, and (ii) Multi-View Semantic Discrimination
attempts to improve object-level discrimination for each region. The proposed
Siamese DETR achieves state-of-the-art transfer performance on COCO and PASCAL
VOC detection using different DETR variants in all setups. Code is available at
https://github.com/Zx55/SiameseDETR. |
This paper introduces Siamese DETR, a novel self-supervised pretraining method for DETR that utilizes a Siamese network architecture. |
This approach addresses the challenge of pretraining DETR's task-specific Transformer modules for object detection, which conventional self-supervised methods struggle with. |
Siamese DETR employs two pretext tasks: (i) Multi-View Region Detection, which learns to locate corresponding regions in augmented views, and (ii) Multi-View Semantic Discrimination, which enhances object-level discrimination by maximizing global and regional semantic consistency. |
Siamese DETR surpasses previous methods like UP-DETR and DETReg in transfer learning performance for object detection on COCO and PASCAL VOC benchmarks.
The method exhibits robustness across various DETR variants (Vanilla, Conditional, Deformable).
Siamese DETR demonstrates faster convergence and stronger objectness priors compared to its counterparts. |
Siamese DETR still depends on a pre-trained CNN backbone (e.g., SwAV) and does not yet encompass a unified pretraining strategy for both CNN and Transformer components.
Future work could explore more efficient frameworks for end-to-end DETR pretraining. |
self-supervised learning, object detection, detr, transformer, siamese networks |
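Multi-View Region Detection, as summarized above, needs a region that is visible in both augmented views. A sketch of that geometry, assuming the two views are axis-aligned crops of the same image given as (x0, y0, x1, y1) boxes. The function names are hypothetical, and the paper's actual region sampling and augmentations are not reproduced.

```python
def intersect(box_a, box_b):
    """Overlap of two (x0, y0, x1, y1) boxes, or None if they are disjoint."""
    x0, y0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x1, y1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    if x0 >= x1 or y0 >= y1:
        return None
    return (x0, y0, x1, y1)

def to_view_coords(region, crop):
    """Express an image-space region in the normalized [0, 1] frame of a crop."""
    cx0, cy0, cx1, cy1 = crop
    width, height = cx1 - cx0, cy1 - cy0
    return (
        (region[0] - cx0) / width,
        (region[1] - cy0) / height,
        (region[2] - cx0) / width,
        (region[3] - cy0) / height,
    )

view_a = (0, 0, 100, 100)     # first crop, in image pixels
view_b = (50, 50, 150, 150)   # second crop, shifted diagonally
shared = intersect(view_a, view_b)
box_in_a = to_view_coords(shared, view_a)   # where the region sits in view A
box_in_b = to_view_coords(shared, view_b)   # localization target in view B
```

The same image region lands in the bottom-right quarter of the first view and the top-left quarter of the second, which is the kind of cross-view correspondence a localization pretext task can supervise.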
2303.18080
Report |
One-shot Unsupervised Domain Adaptation with Personalized Diffusion Models |
Yasser Benigmim, Subhankar Roy, Slim Essid, Vicky Kalogeiton, Stéphane Lathuilière |
Adapting a segmentation model from a labeled source domain to a target
domain, where a single unlabeled datum is available, is one of the most
challenging problems in domain adaptation and is otherwise known as one-shot
unsupervised domain adaptation (OSUDA). Most of the prior works have addressed
the problem by relying on style transfer techniques, where the source images
are stylized to have the appearance of the target domain. Departing from the
common notion of transferring only the target ``texture'' information, we
leverage text-to-image diffusion models (e.g., Stable Diffusion) to generate a
synthetic target dataset with photo-realistic images that not only faithfully
depict the style of the target domain, but are also characterized by novel
scenes in diverse contexts. The text interface in our method Data AugmenTation
with diffUsion Models (DATUM) endows us with the possibility of guiding the
generation of images towards desired semantic concepts while respecting the
original spatial context of a single training image, which is not possible in
existing OSUDA methods. Extensive experiments on standard benchmarks show that
our DATUM surpasses the state-of-the-art OSUDA methods by up to +7.1%. The
implementation is available at https://github.com/yasserben/DATUM |
This paper introduces DATUM, a data augmentation pipeline powered by diffusion models for one-shot unsupervised domain adaptation in semantic segmentation. |
Addresses the challenging problem of adapting a segmentation model to a target domain when only a single unlabeled target image is available, a scenario known as one-shot unsupervised domain adaptation (OSUDA). |
1. **Personalization Stage**: Fine-tune a pre-trained text-to-image diffusion model (e.g., Stable Diffusion) on the single target image to capture its style. 2. **Data Generation Stage**: Generate a synthetic target dataset using the fine-tuned model, guided by prompts containing class names (e.g., 'car', 'bus') to increase diversity. 3. **Adaptive Segmentation Stage**: Train a segmentation model using a standard UDA method on the labeled source data and the generated synthetic target data. |
DATUM significantly outperforms existing OSUDA methods on standard benchmarks (GTA to Cityscapes, SYNTHIA to Cityscapes) by up to +7.1% mIoU.
Generating synthetic images that resemble the target domain's style and content is more effective than simply stylizing source images with target texture.
Class-aware prompts in the data generation stage lead to more diverse and informative synthetic images, boosting performance. |
Potential for generating nonsensical objects due to the diffusion model's limitations, requiring caution in deployment.
Reliance on pre-trained diffusion models that might encode biases. |
unsupervised domain adaptation, one-shot learning, semantic segmentation, diffusion models, data augmentation |
2303.17968
Report |
VDN-NeRF: Resolving Shape-Radiance Ambiguity via View-Dependence Normalization |
Bingfan Zhu, Yanchao Yang, Xulong Wang, Youyi Zheng, Leonidas Guibas |
We propose VDN-NeRF, a method to train neural radiance fields (NeRFs) for
better geometry under non-Lambertian surface and dynamic lighting conditions
that cause significant variation in the radiance of a point when viewed from
different angles. Instead of explicitly modeling the underlying factors that
result in the view-dependent phenomenon, which could be complex yet not
inclusive, we develop a simple and effective technique that normalizes the
view-dependence by distilling invariant information already encoded in the
learned NeRFs. We then jointly train NeRFs for view synthesis with
view-dependence normalization to attain quality geometry. Our experiments show
that even though shape-radiance ambiguity is inevitable, the proposed
normalization can minimize its effect on geometry, which essentially aligns the
optimal capacity needed for explaining view-dependent variations. Our method
applies to various baselines and significantly improves geometry without
changing the volume rendering pipeline, even if the data is captured under a
moving light source. Code is available at: https://github.com/BoifZ/VDN-NeRF. |
Proposes VDN-NeRF, a method to improve geometry reconstruction in neural radiance fields (NeRFs) under non-Lambertian surface and dynamic lighting, by normalizing view-dependent radiance variations. |
Addresses the challenge of shape-radiance ambiguity in NeRFs, where inaccurate geometry can be compensated by overly complex radiance functions, especially under varying lighting. |
Normalizes view-dependence by distilling invariant features from rendered images using a depth prediction network and incorporating these features into a neural feature field alongside the radiance field. |
Achieves state-of-the-art geometry reconstruction compared to various baselines, evidenced by higher IoU and lower Chamfer Distance.
Demonstrates robustness to varying lighting conditions, maintaining better geometry quality than baselines under dynamic illumination.
Shows effectiveness in challenging real-world scenarios with dynamic lighting, such as underwater scenes and intra-oral scans. |
Limited exploration of the interplay between the level of feature invariance and the complexity of the scene.
Future work could investigate extending the method to explicitly model and leverage temporal information for dynamic scene reconstruction.
Future work could investigate the choice of depth prediction network and its impact on performance. |
neural radiance fields, view-dependence normalization, shape-radiance ambiguity, dynamic lighting, geometry reconstruction |
2303.17905
Report |
3D-aware Image Generation using 2D Diffusion Models |
Jianfeng Xiang, Jiaolong Yang, Binbin Huang, Xin Tong |
In this paper, we introduce a novel 3D-aware image generation method that
leverages 2D diffusion models. We formulate the 3D-aware image generation task
as multiview 2D image set generation, and further as a sequential
unconditional-conditional multiview image generation process. This allows us to
utilize 2D diffusion models to boost the generative modeling power of the
method. Additionally, we incorporate depth information from monocular depth
estimators to construct the training data for the conditional diffusion model
using only still images. We train our method on a large-scale dataset, i.e.,
ImageNet, which is not addressed by previous methods. It produces high-quality
images that significantly outperform prior methods. Furthermore, our approach
showcases its capability to generate instances with large view angles, even
though the training images are diverse and unaligned, gathered from
"in-the-wild" real-world environments. |
This paper introduces a novel 3D-aware image generation method utilizing 2D diffusion models, framing the task as a sequential unconditional-conditional multiview image generation process. |
This approach leverages the power of 2D diffusion models for high-quality image generation, addressing the limitations of previous GAN-based methods in handling large-scale, in-the-wild datasets. |
The method uses monocular depth estimators to construct multiview training data from still images. It then employs an unconditional diffusion model to generate the initial view and a conditional model to iteratively generate subsequent views conditioned on previous ones. |
The method significantly outperforms state-of-the-art 3D-aware GANs on ImageNet in terms of image quality and diversity.
It demonstrates comparable or better performance on smaller single-category datasets while producing more realistic 3D geometry.
The method showcases the capability to generate scenes with large view angles, even up to 360 degrees, from unaligned training data. |
The reliance on estimated depth maps can introduce inaccuracies in the generated geometry.
Generating images with very large view angles can lead to degraded quality due to data bias and domain drift. |
3d-aware image generation, diffusion models, multiview image synthesis, depth estimation, generative modeling |
2303.17803
Report |
Rethinking Local Perception in Lightweight Vision Transformer |
Qihang Fan, Huaibo Huang, Jiyang Guan, Ran He |
Vision Transformers (ViTs) have been shown to be effective in various vision
tasks. However, resizing them to a mobile-friendly size leads to significant
performance degradation. Therefore, developing lightweight vision transformers
has become a crucial area of research. This paper introduces CloFormer, a
lightweight vision transformer that leverages context-aware local enhancement.
CloFormer explores the relationship between globally shared weights often used
in vanilla convolutional operators and token-specific context-aware weights
appearing in attention, then proposes an effective and straightforward module
to capture high-frequency local information. In CloFormer, we introduce
AttnConv, a convolution operator in attention's style. The proposed AttnConv
uses shared weights to aggregate local information and deploys carefully
designed context-aware weights to enhance local features. The combination of
the AttnConv and vanilla attention which uses pooling to reduce FLOPs in
CloFormer enables the model to perceive high-frequency and low-frequency
information. Extensive experiments were conducted in image classification,
object detection, and semantic segmentation, demonstrating the superiority of
CloFormer. The code is available at https://github.com/qhfan/CloFormer. |
This paper introduces CloFormer, a lightweight vision transformer that enhances local perception through a context-aware convolution operator called AttnConv and a two-branch structure capturing both high and low-frequency information. |
Designing lightweight vision transformers suitable for mobile devices with minimal performance degradation compared to larger models is crucial. |
CloFormer employs a two-branch architecture. The local branch leverages AttnConv, combining shared-weight convolution for local feature aggregation and context-aware weights for enhancement. The global branch utilizes downsampled vanilla attention for capturing low-frequency global information. These branches are then fused for a comprehensive representation. |
CloFormer achieves state-of-the-art performance on ImageNet classification, surpassing competitors with similar model sizes and FLOPs.
In COCO object detection and instance segmentation tasks, CloFormer consistently outperforms other backbones, demonstrating its effectiveness in dense prediction tasks.
Spectral analysis confirms CloFormer's ability to capture both high-frequency and low-frequency information effectively through its two-branch design. |
The gating mechanism in AttnConv, while introducing stronger nonlinearity, might require careful tuning for optimal performance.
Exploring alternative fusion strategies for the local and global branches could potentially further enhance CloFormer's performance. |
vision transformer, lightweight model, local perception, context-aware, attnconv |
2303.17604
Report |
Token Merging for Fast Stable Diffusion |
Daniel Bolya, Judy Hoffman |
The landscape of image generation has been forever changed by open vocabulary
diffusion models. However, at their core these models use transformers, which
makes generation slow. Better implementations to increase the throughput of
these transformers have emerged, but they still evaluate the entire model. In
this paper, we instead speed up diffusion models by exploiting natural
redundancy in generated images by merging redundant tokens. After making some
diffusion-specific improvements to Token Merging (ToMe), our ToMe for Stable
Diffusion can reduce the number of tokens in an existing Stable Diffusion model
by up to 60% while still producing high quality images without any extra
training. In the process, we speed up image generation by up to 2x and reduce
memory consumption by up to 5.6x. Furthermore, this speed-up stacks with
efficient implementations such as xFormers, minimally impacting quality while
being up to 5.4x faster for large images. Code is available at
https://github.com/dbolya/tomesd. |
This paper introduces 'ToMe for Stable Diffusion', a technique applying Token Merging (ToMe) to speed up Stable Diffusion without retraining. |
Open-vocabulary diffusion models like Stable Diffusion revolutionized image generation but are computationally expensive, limiting their accessibility. |
The authors adapt ToMe for dense prediction tasks by introducing 'unmerging', enabling token reduction during processing and reconstruction afterwards. They improve upon the naive ToMe application by optimizing token partitioning and experimentally evaluate different design choices for when, where, and how to apply ToMe. |
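The merge-then-unmerge idea described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, tokens are split alternately into source and destination sets (a simplification of the partitioning the paper tunes), and unmerging simply copies the merged destination value back to every source token that was folded into it.

```python
import numpy as np

def merge_tokens(x, r):
    """Merge the r most similar (src, dst) pairs; return merged tokens and an unmerge function.

    x: (N, C) array of tokens, r: number of src tokens to merge away (r <= ceil(N / 2)).
    """
    n = x.shape[0]
    src_idx = np.arange(0, n, 2)            # alternating split (an assumption)
    dst_idx = np.arange(1, n, 2)
    src, dst = x[src_idx], x[dst_idx]

    # Cosine similarity between every src token and every dst token.
    src_n = src / np.linalg.norm(src, axis=1, keepdims=True)
    dst_n = dst / np.linalg.norm(dst, axis=1, keepdims=True)
    sim = src_n @ dst_n.T                   # (n_src, n_dst)

    best_dst = sim.argmax(axis=1)           # most similar dst for each src
    order = np.argsort(-sim.max(axis=1))    # src tokens, most similar first
    merged_src = order[:r]                  # src tokens that are merged away
    kept_src = np.sort(order[r:])           # src tokens that survive

    # Average each dst token with the src tokens merged into it.
    sums = dst.astype(float)                # astype returns a copy
    counts = np.ones(len(dst_idx))
    np.add.at(sums, best_dst[merged_src], src[merged_src])
    np.add.at(counts, best_dst[merged_src], 1)
    out = np.concatenate([src[kept_src], sums / counts[:, None]], axis=0)

    def unmerge(y):
        """Restore N tokens; each merged src token copies its dst token's value."""
        n_kept = len(kept_src)
        y_src, y_dst = y[:n_kept], y[n_kept:]
        full = np.empty((n, y.shape[1]), dtype=y.dtype)
        full[src_idx[kept_src]] = y_src
        full[dst_idx] = y_dst
        full[src_idx[merged_src]] = y_dst[best_dst[merged_src]]
        return full

    return out, unmerge
```

In use, a block would run on the reduced set `out` and then call `unmerge` on its output so that the next layer again sees one token per spatial position.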
ToMe for Stable Diffusion speeds up image generation by up to 2x.
It reduces memory consumption by up to 5.6x.
It maintains high visual quality comparable to the original Stable Diffusion model, even when merging 60% of tokens. |
The current unmerging strategy is simple and could be improved for better information retention.
Exploration of proportional attention or key-based similarity for token merging in diffusion models is left for future work. |
image generation, stable diffusion, token merging, speed optimization, memory reduction |
2303.17603
Report |
NeRF-Supervised Deep Stereo |
Fabio Tosi, Alessio Tonioni, Daniele De Gregorio, Matteo Poggi |
We introduce a novel framework for training deep stereo networks effortlessly
and without any ground-truth. By leveraging state-of-the-art neural rendering
solutions, we generate stereo training data from image sequences collected with
a single handheld camera. On top of them, a NeRF-supervised training procedure
is carried out, from which we exploit rendered stereo triplets to compensate
for occlusions and depth maps as proxy labels. This results in stereo networks
capable of predicting sharp and detailed disparity maps. Experimental results
show that models trained under this regime yield a 30-40% improvement over
existing self-supervised methods on the challenging Middlebury dataset, filling
the gap to supervised models and, most times, outperforming them at zero-shot
generalization. |
Introduces NeRF-Supervised (NS) learning, a novel framework for training deep stereo networks without ground truth by leveraging neural rendering to generate stereo training data from image sequences captured with a single handheld camera. |
Addresses the challenge of obtaining flexible and scalable training data for deep stereo networks, a major limitation despite advancements in self-supervised learning and synthetic data. |
Utilizes Neural Radiance Fields (NeRF) to generate stereo triplets and depth maps from sparse image sequences. Employs a NeRF-supervised training protocol that combines a triplet photometric loss (addressing occlusions) and a rendered disparity loss (enhancing details) to train stereo networks. |
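The two supervision signals mentioned above can be sketched as follows. This is a hedged illustration, not the paper's code: the function names are hypothetical, the photometric error is a plain L1 difference, and the per-pixel minimum over the left and right warps is used here as one common way to discount pixels occluded in a single view. The depth-to-disparity conversion is the standard rectified-stereo relation d = f * B / z.

```python
import numpy as np

def depth_to_disparity(depth, focal_px, baseline):
    """Convert rendered depth to disparity for a rectified pair: d = f * B / z."""
    return focal_px * baseline / np.maximum(depth, 1e-6)

def triplet_photometric_loss(center, left_warped, right_warped):
    """Per-pixel minimum of the two reconstruction errors, averaged over the image.

    center, left_warped, right_warped: (H, W, 3) arrays. The warped images are the
    left/right views resampled into the center view using the predicted disparity.
    """
    err_l = np.abs(center - left_warped).mean(axis=-1)   # (H, W)
    err_r = np.abs(center - right_warped).mean(axis=-1)  # (H, W)
    # A pixel occluded in one side view is usually visible in the other,
    # so the smaller error is the one that reflects the disparity quality.
    return float(np.minimum(err_l, err_r).mean())
```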
Models trained with NS achieve a 30-40% improvement over existing self-supervised methods on the Middlebury dataset.
NS-trained networks demonstrate state-of-the-art zero-shot generalization, outperforming models trained on synthetic datasets and existing self-supervised methods.
The ease of data collection and rendering allows for building extensive and scalable training datasets, potentially leading to even better results with more collected scenes. |
Current data collection is limited to small-scale, static scenes.
NS-trained networks, like other stereo networks, face challenges in handling complex scenarios such as transparent surfaces or nighttime images. Future work may explore larger-scale data collection and specialized NeRF variants to address these limitations. |
deep stereo matching, self-supervised learning, zero-shot generalization, neural rendering, neural radiance fields |
2303.17599
Report |
Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models |
Wen Wang, Yan Jiang, Kangyang Xie, Zide Liu, Hao Chen, Yue Cao, Xinlong Wang, Chunhua Shen |
Large-scale text-to-image diffusion models achieve unprecedented success in
image generation and editing. However, how to extend such success to video
editing is unclear. Recent initial attempts at video editing require
significant text-to-video data and computation resources for training, which is
often not accessible. In this work, we propose vid2vid-zero, a simple yet
effective method for zero-shot video editing. Our vid2vid-zero leverages
off-the-shelf image diffusion models, and doesn't require training on any
video. At the core of our method is a null-text inversion module for
text-to-video alignment, a cross-frame modeling module for temporal
consistency, and a spatial regularization module for fidelity to the original
video. Without any training, we leverage the dynamic nature of the attention
mechanism to enable bi-directional temporal modeling at test time. Experiments
and analyses show promising results in editing attributes, subjects, places,
etc., in real-world videos. Code is made available at
https://github.com/baaivision/vid2vid-zero. |
This paper presents vid2vid-zero, a zero-shot video editing method using off-the-shelf image diffusion models without requiring video training data. |
Existing video editing methods need substantial text-video data and computation for training, limiting their accessibility. This method addresses this by leveraging pre-trained image diffusion models for efficient video editing. |
vid2vid-zero utilizes: 1) a null-text inversion module for aligning text prompts with video content, 2) a spatial regularization module for maintaining fidelity to the original video, and 3) a cross-frame modeling module for ensuring temporal consistency. Notably, it introduces a spatial-temporal attention module for bi-directional temporal modeling during testing. |
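Bi-directional temporal modeling of this kind can be illustrated by letting the queries of each frame attend to the keys and values of all frames at once. The NumPy sketch below is an assumption-laden illustration (single head, no learned projections, hypothetical function names), not the released implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_temporal_attention(q, k, v):
    """q, k, v: (F, N, D) arrays for F frames with N tokens each.

    Keys and values are flattened across frames, so a token in any frame can
    attend to tokens in earlier and later frames alike.
    """
    f, n, d = q.shape
    k_all = k.reshape(f * n, d)
    v_all = v.reshape(f * n, d)
    attn = softmax(q @ k_all.T / np.sqrt(d), axis=-1)  # (F, N, F*N)
    return attn @ v_all                                # (F, N, D)
```

Because no new weights are introduced, attention of this form can reuse the projections of a pre-trained image model at test time; the cost is that the attention matrix grows with the number of frames.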
vid2vid-zero effectively edits video styles, attributes, backgrounds, and subjects while preserving temporal consistency and faithfulness to the source.
The proposed spatial-temporal attention mechanism is crucial for bidirectional temporal modeling, outperforming methods like Sparse-Causal Attention and Temporal-only Attention.
The method excels in user preference tests, demonstrating superior quality, text-to-video alignment, and fidelity compared to techniques like Tune-A-Video and Plug-and-Play. |
The method may inherit limitations of the base image diffusion model, in particular the lack of temporal and motion priors, which are absent from image-only training data.
Directly editing actions in videos remains a challenge due to the absence of explicit motion modeling capabilities. |
diffusion models, video editing, zero-shot learning, vision-language models, temporal consistency |
2303.17598
Report |
Consistent View Synthesis with Pose-Guided Diffusion Models |
Hung-Yu Tseng, Qinbo Li, Changil Kim, Suhib Alsisan, Jia-Bin Huang, Johannes Kopf |
Novel view synthesis from a single image has been a cornerstone problem for
many Virtual Reality applications that provide immersive experiences. However,
most existing techniques can only synthesize novel views within a limited range
of camera motion or fail to generate consistent and high-quality novel views
under significant camera movement. In this work, we propose a pose-guided
diffusion model to generate a consistent long-term video of novel views from a
single image. We design an attention layer that uses epipolar lines as
constraints to facilitate the association between different viewpoints.
Experimental results on synthetic and real-world datasets demonstrate the
effectiveness of the proposed diffusion model against state-of-the-art
transformer-based and GAN-based approaches. |
This paper introduces a pose-guided diffusion model for synthesizing long-term, consistent novel view videos from a single image. |
Existing view synthesis methods struggle to generate consistent and high-quality novel views under significant camera movement, limiting immersive VR applications. |
The proposed method utilizes a UNet-based diffusion model with a novel epipolar attention layer. This layer leverages epipolar line constraints to associate features between input and output viewpoints, ensuring geometric consistency. Stochastic conditioning and fixed noise injection during inference further enhance temporal coherence. |
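The epipolar constraint behind such a layer can be sketched with standard two-view geometry. The code below is a minimal illustration under assumptions: both views share the intrinsics `K`, the relative pose maps points as X2 = R X1 + t, and the Gaussian falloff with width `sigma` is a hypothetical choice for turning point-to-line distance into a soft attention weight.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_matrix(K, R, t):
    """F such that x2^T F x1 = 0 for pixels x1 (view 1) and x2 (view 2)."""
    K_inv = np.linalg.inv(K)
    return K_inv.T @ skew(t) @ R @ K_inv

def epipolar_weights(F, p_src, pix_tgt, sigma=1.0):
    """Weight each pixel in view 2 by its distance to the epipolar line of p_src.

    p_src: (x, y) pixel in view 1; pix_tgt: (M, 2) pixels in view 2.
    """
    a, b, c = F @ np.array([p_src[0], p_src[1], 1.0])   # line a*x + b*y + c = 0
    dist = np.abs(pix_tgt[:, 0] * a + pix_tgt[:, 1] * b + c) / np.hypot(a, b)
    return np.exp(-(dist ** 2) / (2.0 * sigma ** 2))
```

For a purely horizontal camera translation the epipolar lines are horizontal, so pixels on the same row as `p_src` receive weight 1 and the weight decays with vertical offset.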
Outperforms state-of-the-art transformer and GAN-based methods in synthesizing realistic and consistent novel views.
Demonstrates superior performance in both short-term and long-term view synthesis scenarios.
Effectiveness of the epipolar attention layer is validated through ablation studies. |
Limited capability in handling scenes with significantly different scales compared to training data.
Inference is computationally expensive due to the multi-step denoising process. |
view synthesis, diffusion models, epipolar geometry, single-image view synthesis, long-term view synthesis |
2303.17591
Report |
Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models |
Eric Zhang, Kai Wang, Xingqian Xu, Zhangyang Wang, Humphrey Shi |
The unlearning problem of deep learning models, once primarily an academic
concern, has become a prevalent issue in the industry. The significant advances
in text-to-image generation techniques have prompted global discussions on
privacy, copyright, and safety, as numerous unauthorized personal IDs, content,
artistic creations, and potentially harmful materials have been learned by
these models and later utilized to generate and distribute uncontrolled
content. To address this challenge, we propose Forget-Me-Not, an
efficient and low-cost solution designed to safely remove specified IDs,
objects, or styles from a well-configured text-to-image model in as little as
30 seconds, without impairing its ability to generate other content. Alongside
our method, we introduce the Memorization Score (M-Score) and
ConceptBench to measure the models' capacity to generate general
concepts, grouped into three primary categories: ID, object, and style. Using
M-Score and ConceptBench, we demonstrate that Forget-Me-Not can effectively
eliminate targeted concepts while maintaining the model's performance on other
concepts. Furthermore, Forget-Me-Not offers two practical extensions: a)
removal of potentially harmful or NSFW content, and b) enhancement of model
accuracy, inclusion and diversity through concept correction and
disentanglement. It can also be adapted as a lightweight model patch for
Stable Diffusion, allowing for concept manipulation and convenient
distribution. To encourage future research in this critical area and promote
the development of safe and inclusive generative models, we will open-source
our code and ConceptBench at
https://github.com/SHI-Labs/Forget-Me-Not. |
This paper introduces Forget-Me-Not, an efficient and low-cost method for removing specific concepts (IDs, objects, styles) from trained text-to-image diffusion models without retraining the entire model. |
Addresses growing concerns about privacy, copyright, safety, and bias present in large-scale text-to-image models by enabling targeted removal of sensitive or unwanted information. |
Forget-Me-Not employs an "attention resteering" technique, minimizing the influence of target concept embeddings on the model's cross-attention layers through targeted fine-tuning. |
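One way to read "minimizing the influence of target concept embeddings" is as a penalty on the cross-attention probability assigned to the concept's tokens. The sketch below makes that reading concrete; the squared-magnitude form of the penalty, the tensor layout, and the function name are assumptions for illustration, not details confirmed by the summary above.

```python
import numpy as np

def attention_resteering_loss(attn_probs, concept_token_idx):
    """Sum of squared cross-attention probabilities on the concept tokens.

    attn_probs: (heads, pixels, tokens) attention probabilities from one
    cross-attention layer; concept_token_idx: positions of the concept's
    tokens in the prompt. Fine-tuning would drive this quantity toward zero.
    """
    concept_attn = attn_probs[..., concept_token_idx]
    return float(np.sum(concept_attn ** 2))
```

In a full pipeline this quantity would be summed over the cross-attention layers and minimized with respect to a subset of the model's weights.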
Successfully removes targeted concepts like identities (e.g., Elon Musk) and styles (e.g., Van Gogh) while preserving the model's ability to generate other content.
Enables concept correction and disentanglement, allowing suppressed concepts to emerge and correcting biased representations.
Can be used for removing harmful or NSFW content, as demonstrated with the removal of nudity triggered by specific prompts. |
Faces challenges in identifying and forgetting abstract concepts.
May require manual intervention, such as concept-specific hyperparameter tuning. |
concept forgetting, text-to-image synthesis, diffusion models, stable diffusion, privacy, copyright, safety, bias |
2303.17569
Report |
Iterative Prompt Learning for Unsupervised Backlit Image Enhancement |
Zhexin Liang, Chongyi Li, Shangchen Zhou, Ruicheng Feng, Chen Change Loy |
We propose a novel unsupervised backlit image enhancement method, abbreviated
as CLIP-LIT, by exploring the potential of Contrastive Language-Image
Pre-Training (CLIP) for pixel-level image enhancement. We show that the
open-world CLIP prior not only aids in distinguishing between backlit and
well-lit images, but also in perceiving heterogeneous regions with different
luminance, facilitating the optimization of the enhancement network. Unlike
high-level and image manipulation tasks, directly applying CLIP to enhancement
tasks is non-trivial, owing to the difficulty in finding accurate prompts. To
solve this issue, we devise a prompt learning framework that first learns an
initial prompt pair by constraining the text-image similarity between the
prompt (negative/positive sample) and the corresponding image (backlit
image/well-lit image) in the CLIP latent space. Then, we train the enhancement
network based on the text-image similarity between the enhanced result and the
initial prompt pair. To further improve the accuracy of the initial prompt
pair, we iteratively fine-tune the prompt learning framework to reduce the
distribution gaps between the backlit images, enhanced results, and well-lit
images via rank learning, boosting the enhancement performance. Our method
alternates between updating the prompt learning framework and enhancement
network until visually pleasing results are achieved. Extensive experiments
demonstrate that our method outperforms state-of-the-art methods in terms of
visual quality and generalization ability, without requiring any paired data. |
This paper introduces CLIP-LIT, a novel unsupervised backlit image enhancement method leveraging Contrastive Language-Image Pre-Training (CLIP) for pixel-level enhancement. |
Existing methods struggle to effectively enhance backlit images due to the challenge of preserving well-lit regions while improving underexposed areas. This method explores the open-world CLIP prior to address these limitations. |
The methodology involves two stages: 1) Initializing prompts by constraining text-image similarity in CLIP space, and training an initial enhancement network. 2) Refining prompts via rank learning using backlit images, enhanced results, and well-lit images, iteratively improving the enhancement network. |
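The text-image similarity constraint in the first stage can be sketched as a two-way softmax over cosine similarities between an image embedding and the positive and negative prompt embeddings. This is an illustrative sketch with hypothetical function names; the embeddings are assumed to come from a frozen CLIP encoder, and the binary cross-entropy form of the loss is an assumption.

```python
import numpy as np

def well_lit_probability(img_emb, pos_prompt, neg_prompt):
    """Softmax over cosine similarities to the positive (well-lit) and negative (backlit) prompts."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    s_pos, s_neg = cos(img_emb, pos_prompt), cos(img_emb, neg_prompt)
    m = max(s_pos, s_neg)                       # subtract the max for numerical stability
    e_pos, e_neg = np.exp(s_pos - m), np.exp(s_neg - m)
    return e_pos / (e_pos + e_neg)

def prompt_init_loss(img_emb, is_well_lit, pos_prompt, neg_prompt):
    """Binary cross-entropy used to fit the prompt pair to labelled image embeddings."""
    p = well_lit_probability(img_emb, pos_prompt, neg_prompt)
    y = 1.0 if is_well_lit else 0.0
    return float(-(y * np.log(p + 1e-12) + (1.0 - y) * np.log(1.0 - p + 1e-12)))
```

Under the same assumptions, the enhancement network could then be trained to raise `well_lit_probability` for its outputs while the prompts stay fixed.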
CLIP-LIT outperforms state-of-the-art methods in visual quality and quantitative metrics on both BAID and Backlit300 datasets.
Iterative prompt learning effectively guides the network to focus on luminance and color distribution, leading to superior enhancement.
The method generalizes well to unseen data, demonstrating robustness to diverse lighting conditions and scene content. |
The method may not handle extreme over-/under-exposed regions due to sRGB limitations.
Current implementation doesn't address noise; future work could explore noise augmentation. |
backlit image enhancement, unsupervised learning, clip, prompt learning, image restoration |
2303.17561
Report |
SoftCLIP: Softer Cross-modal Alignment Makes CLIP Stronger |
Yuting Gao, Jinfeng Liu, Zihan Xu, Tong Wu, Enwei Zhang, Wei Liu, Jie Yang, Ke Li, Xing Sun |
During the preceding biennium, vision-language pre-training has achieved
noteworthy success on several downstream tasks. Nevertheless, acquiring
high-quality image-text pairs, where the pairs are entirely exclusive of each
other, remains a challenging task, and noise exists in the commonly used
datasets. To address this issue, we propose SoftCLIP, a novel approach that
relaxes the strict one-to-one constraint and achieves a soft cross-modal
alignment by introducing a softened target, which is generated from the
fine-grained intra-modal self-similarity. The intra-modal guidance allows
two pairs to share some local similarities and models
many-to-many relationships between the two modalities. Besides, since the
positive still dominates in the softened target distribution, we disentangle
the negatives in the distribution to further boost the relation alignment with
the negatives in the cross-modal learning. Extensive experiments demonstrate
the effectiveness of SoftCLIP. In particular, on ImageNet zero-shot
classification task, using CC3M/CC12M as pre-training dataset, SoftCLIP brings
a top-1 accuracy improvement of 6.8%/7.2% over the CLIP baseline. |
SoftCLIP, a novel approach that leverages intra-modal self-similarity to achieve soft cross-modal alignment in vision-language pre-training, thereby improving performance. |
Acquiring perfectly matched image-text pairs for contrastive learning is challenging due to noise and many-to-many relationships between modalities in existing datasets, leading to suboptimal alignment. |
SoftCLIP utilizes: (1) Fine-grained intra-modal self-similarities (between ROIs and tags) as softened targets for cross-modal alignment. (2) Disentanglement of negatives in the softened target distribution to enhance relation alignment. |
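The softened-target idea can be sketched as replacing the one-hot contrastive label with a distribution derived from intra-modal self-similarity, then comparing it to the cross-modal prediction with a KL divergence. The code is a minimal sketch under assumptions: the mixing weight `mix`, the shared temperature `tau`, and the function names are hypothetical, only the image-to-text direction is shown, and the negative-disentangling step is omitted.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_alignment_loss(img_emb, txt_emb, intra_feat, tau=0.07, mix=0.5):
    """KL(target || prediction), averaged over the batch.

    img_emb, txt_emb: (B, D) L2-normalised embeddings of paired images and texts.
    intra_feat: (B, D2) L2-normalised features of one modality (for example
    pooled ROI or tag features) whose self-similarity defines the soft target.
    """
    b = img_emb.shape[0]
    pred = softmax(img_emb @ txt_emb.T / tau, axis=1)        # cross-modal prediction
    soft = softmax(intra_feat @ intra_feat.T / tau, axis=1)  # intra-modal self-similarity
    target = mix * np.eye(b) + (1.0 - mix) * soft            # positive still dominates
    kl = np.sum(target * (np.log(target + 1e-12) - np.log(pred + 1e-12)), axis=1)
    return float(kl.mean())
```

Setting `mix=1.0` reduces the target to the usual one-hot label, in which case the loss matches the standard image-to-text contrastive cross-entropy.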
SoftCLIP significantly outperforms CLIP baselines on ImageNet zero-shot classification, achieving a top-1 accuracy improvement of 6.8%/7.2% with CC3M/CC12M pre-training.
It also exhibits significant gains on other zero-shot classification datasets, zero-shot image-text retrieval (Flickr30K, MS-COCO), instance retrieval (Oxford, Paris Buildings), and copy detection (INRIA Copydays).
Ablation studies confirm the effectiveness of each proposed component (soft alignment, relation enhancement, symmetric KL-Divergence). |
The reliance on a pre-trained object-attribute detector introduces additional complexity.
Exploration of alternative softened target sources and aggregation methods could further enhance performance. |
vision-language pre-training, contrastive learning, soft targets, cross-modal alignment, intra-modal self-similarity |
2303.17559
Report |
DDP: Diffusion Model for Dense Visual Prediction |
Yuanfeng Ji, Zhe Chen, Enze Xie, Lanqing Hong, Xihui Liu, Zhaoqiang Liu, Tong Lu, Zhenguo Li, Ping Luo |
We propose a simple, efficient, yet powerful framework for dense visual
predictions based on the conditional diffusion pipeline. Our approach follows a
"noise-to-map" generative paradigm for prediction by progressively removing
noise from a random Gaussian distribution, guided by the image. The method,
called DDP, efficiently extends the denoising diffusion process into the modern
perception pipeline. Without task-specific design and architecture
customization, DDP is easy to generalize to most dense prediction tasks, e.g.,
semantic segmentation and depth estimation. In addition, DDP shows attractive
properties such as dynamic inference and uncertainty awareness, in contrast to
previous single-step discriminative methods. We show top results on three
representative tasks with six diverse benchmarks. Without tricks, DDP achieves
state-of-the-art or competitive performance on each task compared to the
specialist counterparts. For example, semantic segmentation (83.9 mIoU on
Cityscapes), BEV map segmentation (70.6 mIoU on nuScenes), and depth estimation
(0.05 REL on KITTI). We hope that our approach will serve as a solid baseline
and facilitate future research. |
This paper presents DDP, a new framework for dense visual prediction tasks using conditional diffusion models. It iteratively removes noise from a random Gaussian distribution guided by the input image. |
Existing diffusion-based perception models are inefficient and complex. DDP offers a simple, effective, and general framework applicable to diverse dense prediction tasks. |
DDP separates image encoding from map decoding. During training, Gaussian noise is progressively added to the ground truth. The map decoder, conditioned on the encoded image, learns to reverse this process. Inference involves refining an initial noise map using the decoder and input image. |
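The training-time noising of the ground-truth map can be illustrated with the standard DDPM forward process, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. The sketch below assumes a cosine noise schedule and hypothetical function names; the schedule and the encoding of the ground-truth map actually used by DDP may differ.

```python
import numpy as np

def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal level alpha_bar_t under a cosine schedule (an assumed choice)."""
    def f(u):
        return np.cos((u / T + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return f(t) / f(0)

def noise_map(x0, t, T, rng):
    """Sample x_t from q(x_t | x_0) for a ground-truth map x0 scaled to a fixed range.

    Returns the noisy map and the noise that was added; a map decoder conditioned
    on image features would be trained to recover x0 (or the noise) from x_t.
    """
    a = cosine_alpha_bar(t, T)
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps
```

At inference the process runs in reverse from pure noise, and the number of reverse steps is the knob behind the accuracy-versus-compute trade-off noted in the results.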
DDP achieves state-of-the-art or competitive results on semantic segmentation (e.g., 83.9 mIoU on Cityscapes), BEV map segmentation (70.6 mIoU on nuScenes), and depth estimation (0.05 REL on KITTI).
The method supports dynamic inference, trading off computation for prediction quality by adjusting the number of sampling steps.
DDP naturally provides uncertainty estimations by analyzing the consistency of predictions across sampling steps. |
Multi-step inference increases computational cost compared to single-step methods.
While effective on tested benchmarks, further research is needed to assess DDP's performance on a wider range of tasks. |
diffusion models, dense prediction, semantic segmentation, bev map segmentation, depth estimation |
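The training-time noising of the ground-truth map and the step-count trade-off described in this record can be illustrated with a minimal sketch. It uses the standard closed-form forward diffusion step; the cosine schedule, the flat-list representation of the map, and all function names are assumptions made for illustration and are not taken from the DDP code.

```python
import math
import random


def cosine_alpha_bar(t, num_steps):
    """Cumulative signal level alpha_bar(t); the cosine schedule is an assumed choice."""
    s = 0.008

    def f(u):
        return math.cos((u / num_steps + s) / (1 + s) * math.pi / 2) ** 2

    return f(t) / f(0)


def add_noise(x0, t, num_steps, rng):
    """Sample x_t ~ N(sqrt(a) * x0, (1 - a) * I), with x0 given as a flat list."""
    a = cosine_alpha_bar(t, num_steps)
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]


def sampling_timesteps(num_steps, num_sampling_steps):
    """Evenly spaced, decreasing timesteps; fewer steps means cheaper inference."""
    return [
        round(num_steps * (1 - i / num_sampling_steps))
        for i in range(num_sampling_steps)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    ground_truth = [0.5, -1.0, 0.25]  # hypothetical encoded label values
    print(add_noise(ground_truth, 500, 1000, rng))
    print(sampling_timesteps(1000, 4))
```

At `t = 0` the map is returned unchanged, and at `t = num_steps` almost no signal remains, which is why inference can start from pure Gaussian noise.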
2303.17546
Report |
PAIR-Diffusion: A Comprehensive Multimodal Object-Level Image Editor |
Vidit Goel, Elia Peruzzo, Yifan Jiang, Dejia Xu, Xingqian Xu, Nicu Sebe, Trevor Darrell, Zhangyang Wang, Humphrey Shi |
Generative image editing has recently witnessed extremely fast-paced growth.
Some works use high-level conditioning such as text, while others use low-level
conditioning. Nevertheless, most of them lack fine-grained control over the
properties of the different objects present in the image, i.e. object-level
image editing. In this work, we tackle the task by perceiving the images as an
amalgamation of various objects and aim to control the properties of each
object in a fine-grained manner. Out of these properties, we identify structure
and appearance as the most intuitive to understand and useful for editing
purposes. We propose PAIR Diffusion, a generic framework that can enable a
diffusion model to control the structure and appearance properties of each
object in the image. We show that having control over the properties of each
object in an image leads to comprehensive editing capabilities. Our framework
allows for various object-level editing operations on real images such as
reference image-based appearance editing, free-form shape editing, adding
objects, and variations. Thanks to our design, we do not require any inversion
step. Additionally, we propose multimodal classifier-free guidance which
enables editing images using both reference images and text when using our
approach with foundational diffusion models. We validate the above claims by
extensively evaluating our framework on both unconditional and foundational
diffusion models. Please refer to
https://vidit98.github.io/publication/conference-paper/pair_diff.html for code
and model release. |
This paper proposes PAIR Diffusion, a generic framework that enables object-level structure and appearance editing in diffusion models. |
Most existing image editing methods lack fine-grained control over individual object properties, limiting comprehensive editing capabilities. |
PAIR Diffusion extracts per-object structure (shape, category) using panoptic segmentation and appearance representations using pre-trained image encoders (VGG, DINOv2). It conditions diffusion models on these representations, enabling object-level manipulation. |
Enables diverse object-level edits: appearance and shape editing, object addition, and variations.
Achieves realistic and faithful appearance editing, outperforming baselines in quantitative evaluations (FID, L1, SSIM).
Demonstrates precise structure control, surpassing previous methods (SEAN) in mIoU and SSIM scores. |
Current architecture modifications for incorporating structure and appearance are simple and can be further improved.
Future work includes extending explicit control to other object aspects (illumination, pose) and improving identity preservation during editing. |
image editing, diffusion models, object-level editing, generative models, multimodal inference |
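One plausible way to obtain a per-object appearance vector from encoder features and a segmentation map is to average the features over each segment. The sketch below assumes this masked average pooling; the record does not state the exact pooling used, and the function name and nested-list data layout are hypothetical.

```python
def per_object_appearance(features, segment_ids):
    """Average feature vectors over the pixels of each segment.

    features:    H x W grid (nested lists) of C-dimensional feature vectors
    segment_ids: H x W grid of integer segment ids (e.g. from panoptic segmentation)
    Returns a dict mapping segment id -> mean feature vector.
    Masked average pooling is an assumption made for this sketch.
    """
    sums, counts = {}, {}
    for feat_row, id_row in zip(features, segment_ids):
        for vec, seg in zip(feat_row, id_row):
            if seg not in sums:
                sums[seg] = [0.0] * len(vec)
                counts[seg] = 0
            sums[seg] = [s + v for s, v in zip(sums[seg], vec)]
            counts[seg] += 1
    return {seg: [s / counts[seg] for s in total] for seg, total in sums.items()}


if __name__ == "__main__":
    feats = [[[1.0, 0.0], [3.0, 2.0]], [[0.0, 4.0], [0.0, 8.0]]]
    segs = [[0, 0], [1, 1]]
    print(per_object_appearance(feats, segs))
```

Swapping the vector of one segment for a vector pooled from a reference image is the kind of object-level appearance edit the record describes.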
2303.17225
Report |
FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation |
Jie Qin, Jie Wu, Pengxiang Yan, Ming Li, Ren Yuxi, Xuefeng Xiao, Yitong Wang, Rui Wang, Shilei Wen, Xin Pan, Xingang Wang |
Recently, open-vocabulary learning has emerged to accomplish segmentation for
arbitrary categories of text-based descriptions, which extends
segmentation systems to more general-purpose application scenarios. However,
existing methods are devoted to designing specialized architectures or parameters
for specific segmentation tasks. These customized design paradigms lead to
fragmentation between various segmentation tasks, thus hindering the uniformity
of segmentation models. Hence in this paper, we propose FreeSeg, a generic
framework to accomplish Unified, Universal and Open-Vocabulary Image
Segmentation. FreeSeg optimizes an all-in-one network via one-shot training and
employs the same architecture and parameters to handle diverse segmentation
tasks seamlessly in the inference procedure. Additionally, adaptive prompt
learning facilitates the unified model to capture task-aware and
category-sensitive concepts, improving model robustness in multi-task and
varied scenarios. Extensive experimental results demonstrate that FreeSeg
establishes new state-of-the-art results in performance and generalization on
three segmentation tasks, which outperforms the best task-specific
architectures by a large margin: 5.5% mIoU on semantic segmentation, 17.6% mAP
on instance segmentation, 20.1% PQ on panoptic segmentation for the unseen
class on COCO. |
FreeSeg is a novel framework for Unified, Universal and Open-Vocabulary Image Segmentation, using a single model to handle semantic, instance, and panoptic segmentation of arbitrary categories. |
Existing open-vocabulary segmentation methods are task-specific, hindering model uniformity and resource efficiency. FreeSeg addresses this by enabling a single model to handle diverse segmentation tasks and arbitrary categories. |
FreeSeg employs a two-stage framework: 1) extracting universal mask proposals via a unified network trained with multi-task labels and 2) zero-shot classification on masks using CLIP with adaptive prompt learning for task and category awareness. |
FreeSeg achieves state-of-the-art performance on open-vocabulary semantic, instance, and panoptic segmentation, outperforming previous methods by a large margin on unseen classes.
The method shows strong generalization across datasets, demonstrating its ability to handle different data distributions and domains.
FreeSeg's multi-task training reduces training costs by two-thirds compared to single-task training while achieving superior performance and generalization. |
FreeSeg's performance on instance segmentation, while exceeding previous open-vocabulary methods, is lower than specialized instance segmentation models due to the use of mask supervision.
Future work could explore incorporating box-level supervision to further improve instance segmentation performance without compromising the framework's universality. |
open-vocabulary segmentation, universal segmentation, multi-task learning, prompt learning, zero-shot learning |
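The second stage described in this record, zero-shot classification of mask proposals against text embeddings, can be sketched as cosine similarity followed by a softmax. The toy embeddings, the temperature value, and the function names below are assumptions; a real system would obtain the embeddings from CLIP's image and text encoders.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def classify_masks(mask_embeddings, text_embeddings, class_names, temperature=0.01):
    """Assign each mask embedding to the class whose text embedding is most similar.

    The temperature of 0.01 is an assumed value, not one taken from the paper.
    """
    text = [l2_normalize(t) for t in text_embeddings]
    results = []
    for m in mask_embeddings:
        m = l2_normalize(m)
        logits = [sum(a * b for a, b in zip(m, t)) / temperature for t in text]
        peak = max(logits)  # subtract the maximum for numerical stability
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((class_names[best], probs))
    return results


if __name__ == "__main__":
    masks = [[1.0, 0.1], [0.0, 2.0]]  # hypothetical mask embeddings
    texts = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical text embeddings
    for label, probs in classify_masks(masks, texts, ["cat", "grass"]):
        print(label, probs)
```

Because the class list is only a set of text embeddings, new categories can be added at inference time without retraining the mask proposal network.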
2303.17189
Report |
LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation |
Guangcong Zheng, Xianpan Zhou, Xuewei Li, Zhongang Qi, Ying Shan, Xi Li |
Recently, diffusion models have achieved great success in image synthesis.
However, in layout-to-image generation, where an image often
has a complex scene of multiple objects, exerting strong control over both
the global layout map and each detailed object remains a challenging task. In
this paper, we propose a diffusion model named LayoutDiffusion that can obtain
higher generation quality and greater controllability than the previous works.
To overcome the difficult multimodal fusion of image and layout, we propose to
construct a structural image patch with region information and transform the
patched image into a special layout to fuse with the normal layout in a unified
form. Moreover, Layout Fusion Module (LFM) and Object-aware Cross Attention
(OaCA) are proposed to model the relationship among multiple objects and
designed to be object-aware and position-sensitive, allowing for precise
control of spatially related information. Extensive experiments show that
our LayoutDiffusion outperforms the previous SOTA methods on FID and CAS by a
relative 46.35% and 26.70% on COCO-stuff and 44.29% and 41.82% on VG. Code is
available at https://github.com/ZGCTroy/LayoutDiffusion. |
This paper proposes LayoutDiffusion, a diffusion model for layout-to-image generation that achieves high quality and controllability by fusing layout and image information in a unified form. |
Existing text-to-image generation models struggle with precise control over object placement and scene composition, while GAN-based layout-to-image methods suffer from limitations like unstable training. LayoutDiffusion addresses these challenges with a novel diffusion-based approach. |
The method introduces a structural image patch with region information, treating each patch as a special object. It employs a Layout Fusion Module (LFM) to model relationships between layout objects and an Object-aware Cross Attention (OaCA) mechanism to fuse multi-resolution image patches with the layout. |
LayoutDiffusion significantly outperforms state-of-the-art GAN-based and diffusion-based methods on FID, DS, and CAS metrics.
The model demonstrates accurate control over object placement, size, and category, as evidenced by high YOLOScore.
LayoutDiffusion generates high-quality images with diverse appearances for a given layout. |
Generating realistic images with no distortion and overlap for complex multi-object layouts remains challenging.
The model is trained from scratch on specific datasets, and future work could explore combining it with text-guided diffusion models pre-trained on large text-image datasets. |
layout-to-image generation, diffusion models, image synthesis, multimodal fusion, controllable generation |
2303.17158
Report |
KD-DLGAN: Data Limited Image Generation via Knowledge Distillation |
Kaiwen Cui, Yingchen Yu, Fangneng Zhan, Shengcai Liao, Shijian Lu, Eric Xing |
Generative Adversarial Networks (GANs) rely heavily on large-scale training
data for training high-quality image generation models. With limited training
data, the GAN discriminator often suffers from severe overfitting which
directly leads to degraded generation especially in generation diversity.
Inspired by the recent advances in knowledge distillation (KD), we propose
KD-DLGAN, a knowledge-distillation based generation framework that introduces
pre-trained vision-language models for training effective data-limited
generation models. KD-DLGAN consists of two innovative designs. The first is
aggregated generative KD that mitigates the discriminator overfitting by
challenging the discriminator with harder learning tasks and distilling more
generalizable knowledge from the pre-trained models. The second is correlated
generative KD that improves the generation diversity by distilling and
preserving the diverse image-text correlation within the pre-trained models.
Extensive experiments over multiple benchmarks show that KD-DLGAN achieves
superior image generation with limited training data. In addition, KD-DLGAN
complements the state-of-the-art with consistent and substantial performance
gains. |
KD-DLGAN, a novel image generation framework, leverages knowledge distillation from vision-language models to improve GAN training with limited data. |
Effective GAN training typically requires large-scale datasets. This work addresses the challenge of data-limited image generation, particularly the discriminator overfitting issue. |
The paper introduces two novel generative KD techniques: (1) Aggregated Generative KD (AGKD) challenges the discriminator with harder learning tasks by aggregating real/fake sample features and distilling knowledge from a pretrained CLIP model. (2) Correlated Generative KD (CGKD) distills and preserves diverse image-text correlations from CLIP to the GAN discriminator, improving generation diversity. |
KD-DLGAN consistently outperforms existing state-of-the-art data-limited image generation methods on various benchmarks (CIFAR, ImageNet, 100-shot, AFHQ).
Both AGKD and CGKD techniques individually improve performance, and their combination yields the best results.
The method generalizes well across different GAN architectures (StyleGAN-v2, BigGAN), generation tasks (object, face), and training data sizes. |
The study primarily focuses on CLIP as the teacher model; exploring other vision-language models is left for future work.
Future research could explore the applications of KD-DLGAN in other image generation tasks like translation and editing. |
generative adversarial networks, knowledge distillation, data-limited image generation, vision-language models, discriminator overfitting |
2303.17155
Report |
Discriminative Class Tokens for Text-to-Image Diffusion Models |
Idan Schwartz, Vésteinn Snæbjarnarson, Hila Chefer, Ryan Cotterell, Serge Belongie, Lior Wolf, Sagie Benaim |
Recent advances in text-to-image diffusion models have enabled the generation
of diverse and high-quality images. While impressive, the images often fall
short of depicting subtle details and are susceptible to errors due to
ambiguity in the input text. One way of alleviating these issues is to train
diffusion models on class-labeled datasets. This approach has two
disadvantages: (i) supervised datasets are generally small compared to
large-scale scraped text-image datasets on which text-to-image models are
trained, affecting the quality and diversity of the generated images, and (ii)
the input is a hard-coded label, as opposed to free-form text, limiting the
control over the generated images.
In this work, we propose a non-invasive fine-tuning technique that
capitalizes on the expressive potential of free-form text while achieving high
accuracy through discriminative signals from a pretrained classifier. This is
done by iteratively modifying the embedding of an added input token of a
text-to-image diffusion model, by steering generated images toward a given
target class according to a classifier. Our method is fast compared to prior
fine-tuning methods and does not require a collection of in-class images or
retraining of a noise-tolerant classifier. We evaluate our method extensively,
showing that the generated images (i) are more accurate and of higher quality
than standard diffusion models, (ii) can be used to augment training data in a
low-resource setting, and (iii) reveal information about the data used to train
the guiding classifier. The code is available at
https://github.com/idansc/discriminative_class_tokens. |
The paper proposes a novel fine-tuning technique for text-to-image diffusion models that introduces a discriminative class token representing specific classes from a pre-trained classifier. |
This technique addresses the limitations of existing text-to-image models in generating images with subtle details and resolving ambiguity in input text, leading to more accurate and higher-quality image generation. |
The method iteratively optimizes the embedding of the added class token by generating images and using feedback from the pre-trained classifier to steer the token towards generating images of the target class. A gradient skipping technique is used for efficient training. |
Generated images are more accurate and of higher quality (lower FID scores) compared to standard diffusion models.
The technique can be used for data augmentation, improving classifier performance in low-resource settings.
The optimized tokens can reveal information about the data used to train the guiding classifier. |
The method may still exhibit limitations in resolving highly ambiguous classes.
Further research is needed to explore deeper backpropagation for potentially enhanced results. |
text-to-image synthesis, diffusion models, classifier guidance, fine-grained details, lexical ambiguity |
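The core idea in this record, updating only an added token embedding so that a frozen classifier scores the output as the target class, can be illustrated with a toy sketch. Here a single frozen linear map stands in for the whole "diffusion model plus classifier" pipeline, which is a strong simplification and an assumption of this example; the function names, learning rate, and step count are hypothetical.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def class_probs(token, frozen_map):
    """Class probabilities for a token; frozen_map has one row of weights per class."""
    logits = [sum(w * t for w, t in zip(row, token)) for row in frozen_map]
    return softmax(logits)


def optimize_token(token, frozen_map, target, lr=0.5, steps=50):
    """Gradient descent on the token only; frozen_map is never updated."""
    token = list(token)
    for _ in range(steps):
        probs = class_probs(token, frozen_map)
        # Gradient of cross-entropy with respect to the logits: probs - onehot(target).
        dlogits = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
        grad = [
            sum(dlogits[k] * frozen_map[k][j] for k in range(len(frozen_map)))
            for j in range(len(token))
        ]
        token = [t - lr * g for t, g in zip(token, grad)]
    return token


if __name__ == "__main__":
    frozen = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # hypothetical frozen weights
    learned = optimize_token([0.0, 0.0], frozen, target=1)
    print(learned, class_probs(learned, frozen))
```

In the actual method the gradient has to flow back through image generation, which is where the gradient skipping mentioned in this record comes in; the toy map above sidesteps that cost entirely.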
2303.17123
Report |
Masked and Adaptive Transformer for Exemplar Based Image Translation |
Chang Jiang, Fei Gao, Biao Ma, Yuhao Lin, Nannan Wang, Gang Xu |
We present a novel framework for exemplar based image translation. Recent
advanced methods for this task mainly focus on establishing cross-domain
semantic correspondence, which sequentially dominates image generation in the
manner of local style control. Unfortunately, cross-domain semantic matching is
challenging; and matching errors ultimately degrade the quality of generated
images. To overcome this challenge, we improve the accuracy of matching on the
one hand, and diminish the role of matching in image generation on the other
hand. To achieve the former, we propose a masked and adaptive transformer (MAT)
for learning accurate cross-domain correspondence, and executing context-aware
feature augmentation. To achieve the latter, we use source features of the
input and global style codes of the exemplar, as supplementary information, for
decoding an image. Besides, we devise a novel contrastive style learning
method, for acquiring quality-discriminative style representations, which in turn
benefit high-quality image generation. Experimental results show that our
method, dubbed MATEBIT, performs considerably better than state-of-the-art
methods, in diverse image translation tasks. The codes are available at
https://github.com/AiArt-HDU/MATEBIT. |
This paper proposes MATEBIT, a novel framework for exemplar-based image translation that improves cross-domain semantic matching and integrates local and global style control for high-fidelity image generation. |
Exemplar-based image translation is challenging because cross-domain semantic matching is difficult and errors degrade the quality of generated images. This work aims to address this challenge for higher quality image translation. |
The proposed MATEBIT framework utilizes a Masked and Adaptive Transformer (MAT) for accurate cross-domain correspondence learning and feature augmentation. It also introduces a Contrastive Style Learning (CSL) method for discriminative style representation and employs a U-Net architecture with skip connections for preserving semantic information. |
MATEBIT consistently outperforms state-of-the-art methods in terms of FID, SWD, and style relevance metrics on various datasets.
The proposed MAT effectively refines cross-domain correspondence and augments features, leading to improved image quality compared to baseline models.
The CSL method enhances style control and generates high-quality images by learning to discriminate subtle differences in perceptual quality. |
Artistic portraits generated from facial photos show some degradation, potentially due to differences in edge maps between photos and paintings.
Future work will explore semi-supervised learning or domain transfer techniques to address the limitations in edge map representation. |
image translation, exemplar-based, transformer, contrastive learning, style control |
2303.17076
Report |
DiffCollage: Parallel Generation of Large Content with Diffusion Models |
Qinsheng Zhang, Jiaming Song, Xun Huang, Yongxin Chen, Ming-Yu Liu |
We present DiffCollage, a compositional diffusion model that can generate
large content by leveraging diffusion models trained on generating pieces of
the large content. Our approach is based on a factor graph representation where
each factor node represents a portion of the content and a variable node
represents their overlap. This representation allows us to aggregate
intermediate outputs from diffusion models defined on individual nodes to
generate content of arbitrary size and shape in parallel without resorting to
an autoregressive generation procedure. We apply DiffCollage to various tasks,
including infinite image generation, panorama image generation, and
long-duration text-guided motion generation. Extensive experimental results
with a comparison to strong autoregressive baselines verify the effectiveness
of our approach. |
This paper proposes DiffCollage, a novel method for generating large content (e.g., images, videos, panoramas) using diffusion models trained on smaller pieces of content. |
This is important because collecting large-scale datasets for training diffusion models can be prohibitively expensive for certain content types. DiffCollage leverages the abundance of smaller pieces to synthesize high-quality large content. |
DiffCollage uses a factor graph representation of the large content, where each node represents a portion and is associated with a pre-trained diffusion model. The method approximates the joint distribution of the large content using Bethe approximation and leverages this to generate pieces in parallel and merge them seamlessly. |
DiffCollage outperforms autoregressive baselines in infinite image generation, achieving better FID+ scores and significantly faster generation speeds.
It enables text-to-motion generation of long sequences with complex actions, exceeding the capabilities of models trained on shorter sequences.
DiffCollage successfully synthesizes 360-degree panoramas from normal perspective images conditioned on semantic segmentation maps, showcasing its ability to handle complex dependency structures. |
DiffCollage relies on conditional independence assumptions between content pieces, potentially limiting its ability to capture long-range dependencies.
Parallel computation in DiffCollage comes at the cost of increased memory footprint compared to autoregressive methods. |
diffusion models, large content generation, factor graphs, bethe approximation, parallel sampling |
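The factor-graph combination described in this record can be checked on a small discrete example. For a chain of three variables where the pieces (a, b) and (b, c) overlap in b, multiplying the piece marginals and dividing by the overlap marginal recovers the joint exactly when the chain is Markov. The binary chain, its probabilities, and the function names below are assumptions chosen for illustration; DiffCollage applies the analogous combination to diffusion model outputs rather than to probability tables.

```python
from itertools import product


def chain_joint(prior, transition):
    """Joint p(a, b, c) of a three-step binary Markov chain."""
    return {
        (a, b, c): prior[a] * transition[a][b] * transition[b][c]
        for a, b, c in product(range(2), repeat=3)
    }


def marginal(joint, keep):
    """Marginalise a joint table onto the variable positions listed in keep."""
    out = {}
    for state, p in joint.items():
        key = tuple(state[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def collage_probability(state, p_ab, p_bc, p_b):
    """Product of the piece marginals divided by the marginal of their overlap."""
    a, b, c = state
    return p_ab[(a, b)] * p_bc[(b, c)] / p_b[(b,)]


if __name__ == "__main__":
    joint = chain_joint([0.3, 0.7], [[0.9, 0.1], [0.4, 0.6]])
    p_ab, p_bc, p_b = marginal(joint, (0, 1)), marginal(joint, (1, 2)), marginal(joint, (1,))
    for state, p in joint.items():
        print(state, p, collage_probability(state, p_ab, p_bc, p_b))
```

The equality breaks once a and c are dependent given b, which mirrors the conditional independence limitation listed in this record.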
2303.16891
Report |
Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations |
Vibashan VS, Ning Yu, Chen Xing, Can Qin, Mingfei Gao, Juan Carlos Niebles, Vishal M. Patel, Ran Xu |
Existing instance segmentation models learn task-specific information using
manual mask annotations from base (training) categories. These mask annotations
require tremendous human effort, limiting the scalability to annotate novel
(new) categories. To alleviate this problem, Open-Vocabulary (OV) methods
leverage large-scale image-caption pairs and vision-language models to learn
novel categories. In summary, an OV method learns task-specific information
using strong supervision from base annotations and novel category information
using weak supervision from image-caption pairs. This difference between
strong and weak supervision leads to overfitting on base categories, resulting
in poor generalization towards novel categories. In this work, we overcome this
issue by learning both base and novel categories from pseudo-mask annotations
generated by the vision-language model in a weakly supervised manner using our
proposed Mask-free OVIS pipeline. Our method automatically generates
pseudo-mask annotations by leveraging the localization ability of a pre-trained
vision-language model for objects present in image-caption pairs. The generated
pseudo-mask annotations are then used to supervise an instance segmentation
model, freeing the entire pipeline from any labour-expensive instance-level
annotations and overfitting. Our extensive experiments show that our method
trained with just pseudo-masks significantly improves the mAP scores on the
MS-COCO dataset and OpenImages dataset compared to the recent state-of-the-art
methods trained with manual masks. Codes and models are provided in
https://vibashan.github.io/ovis-web/. |
This paper proposes Mask-free OVIS, a novel pipeline for open-vocabulary instance segmentation that does not require any human-annotated instance-level labels. |
Existing instance segmentation methods rely on expensive manual annotations, limiting their scalability to novel categories. Open-Vocabulary methods, while promising, often overfit to base categories due to the discrepancy between strong base supervision and weak novel category supervision. |
The method generates pseudo-mask annotations for both base and novel categories using a pre-trained vision-language model (VLM). It leverages a weakly-supervised proposal network and iterative masking with GradCAM to localize objects and generate accurate masks. These pseudo-masks are then used to train a Mask-RCNN model for open-vocabulary instance segmentation. |
Mask-free OVIS achieves state-of-the-art performance on MS-COCO and OpenImages datasets for open-vocabulary instance segmentation, even without using any manual mask annotations.
The weakly-supervised proposal network effectively generalizes to novel categories compared to fully-supervised counterparts.
Iterative masking with GradCAM significantly improves the quality of pseudo-mask generation by capturing less discriminative object regions. |
The performance of Mask-free OVIS, while surpassing existing methods trained without base annotations, is still lower than those fine-tuned with base annotations, suggesting room for improvement in pseudo-mask quality.
The iterative masking strategy, while effective, may introduce redundant activations if performed for too many iterations, requiring careful hyperparameter tuning. |
open-vocabulary learning, instance segmentation, weakly-supervised learning, vision-language models, pseudo-labeling |
2303.16513
Report |
Cascaded Local Implicit Transformer for Arbitrary-Scale Super-Resolution |
Hao-Wei Chen, Yu-Syuan Xu, Min-Fong Hong, Yi-Min Tsai, Hsien-Kai Kuo, Chun-Yi Lee |
Implicit neural representation has recently shown a promising ability in
representing images with arbitrary resolutions. In this paper, we present a
Local Implicit Transformer (LIT), which integrates the attention mechanism and
frequency encoding technique into a local implicit image function. We design a
cross-scale local attention block to effectively aggregate local features. To
further improve representative power, we propose a Cascaded LIT (CLIT) that
exploits multi-scale features, along with a cumulative training strategy that
gradually increases the upsampling scales during training. We have conducted
extensive experiments to validate the effectiveness of these components and
analyze various training strategies. The qualitative and quantitative results
demonstrate that LIT and CLIT achieve favorable results and outperform the
prior works in arbitrary-scale super-resolution tasks. |
This paper introduces Local Implicit Transformer (LIT) and Cascaded LIT (CLIT) for arbitrary-scale super-resolution, integrating attention mechanisms and frequency encoding into local implicit image functions for improved performance. |
Existing super-resolution methods often require separate models for different upsampling scales. LIT and CLIT address this limitation by enabling arbitrary-scale super-resolution with a single model, expanding potential applications. |
LIT employs a cross-scale local attention block to aggregate local features and a decoder to generate residual images. CLIT extends this by using a cascaded framework with multi-scale feature embeddings, trained with a cumulative strategy that progressively increases upsampling scales. |
CLIT outperforms prior local implicit neural representation methods on standard benchmarks like DIV2K, Set5, Set14, B100, and Urban100.
Qualitative results demonstrate CLIT's ability to reconstruct sharp details and continuous structures more effectively than LIIF and LTE.
Ablation studies confirm the contribution of each component in LIT and the effectiveness of the cumulative training strategy for both LIT and CLIT. |
Increasing the local grid size in LIT improves performance but also increases training time.
Future work could explore extending CLIT with more sophisticated encoders or exploring its application in other image restoration tasks. |
super-resolution, arbitrary-scale, implicit neural representation, local attention, cascaded framework |
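Arbitrary-scale output in local implicit image functions rests on querying continuous coordinates instead of a fixed pixel grid. The sketch below shows that query side only: it builds pixel-centre coordinates in [-1, 1] for any output size and finds the low-resolution cell each query falls into. The coordinate convention and function names are assumptions for illustration; the attention block and decoder described in this record are not modelled.

```python
def pixel_centers(n):
    """Centres of n equally sized cells covering the interval [-1, 1]."""
    return [-1.0 + (2 * i + 1) / n for i in range(n)]


def query_grid(height, width):
    """All (y, x) query coordinates for an output image of the given size."""
    return [(y, x) for y in pixel_centers(height) for x in pixel_centers(width)]


def nearest_latent_index(coord, n):
    """Index of the low-resolution cell (out of n) that contains coord."""
    idx = int((coord + 1.0) / 2.0 * n)
    return min(max(idx, 0), n - 1)


if __name__ == "__main__":
    # Upsample a 2-cell axis to 4 output pixels (scale 2).
    for x in pixel_centers(4):
        print(x, nearest_latent_index(x, 2))
    # Any output size works with the same functions, e.g. 3 x 5.
    print(len(query_grid(3, 5)))
```

Because the output size only changes the list of query coordinates, one set of model weights can serve every upsampling factor, which is the property this record highlights.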
2303.16509
Report |
HoloDiffusion: Training a 3D Diffusion Model using 2D Images |
Animesh Karnewar, Andrea Vedaldi, David Novotny, Niloy Mitra |
Diffusion models have emerged as the best approach for generative modeling of
2D images. Part of their success is due to the possibility of training them on
millions if not billions of images with a stable learning objective. However,
extending these models to 3D remains difficult for two reasons. First, finding
a large quantity of 3D training data is much more complex than for 2D images.
Second, while it is conceptually trivial to extend the models to operate on 3D
rather than 2D grids, the associated cubic growth in memory and compute
complexity makes this infeasible. We address the first challenge by introducing
a new diffusion setup that can be trained, end-to-end, with only posed 2D
images for supervision; and the second challenge by proposing an image
formation model that decouples model memory from spatial memory. We evaluate
our method on real-world data, using the CO3D dataset which has not been used
to train 3D generative models before. We show that our diffusion models are
scalable, train robustly, and are competitive with existing approaches for 3D
generative modeling in terms of sample quality and fidelity. |
This paper introduces HoloDiffusion, the first 3D-aware generative diffusion model trained with posed 2D images, producing 3D-consistent images. |
Extending diffusion models to 3D enhances generative capabilities, offering view consistency and direct manipulation potential for applications like object placement and content creation. |
The method uses a hybrid explicit-implicit 3D feature grid, decoupling model memory from spatial memory. A novel diffusion process learns the distribution of these grids using only posed 2D images by generating intermediate 3D features and applying a denoising 3D UNet trained with a photometric loss. |
HoloDiffusion generates high-quality, 3D-consistent samples, outperforming baselines qualitatively.
The model demonstrates robustness and scalability, training effectively on a large dataset of real-world videos.
Quantitative metrics like FID and KID confirm the superior performance of HoloDiffusion compared to existing methods. |
The method currently relies on camera information during training, requiring future work to explore joint viewpoint estimation.
Exploration of conditional generation, editing capabilities for shape and appearance, and multi-class training are promising future directions. |
diffusion models, 3d generative models, view synthesis, neural rendering, 3d reconstruction |
2303.16493
Report |
AnyFlow: Arbitrary Scale Optical Flow with Implicit Neural Representation |
Hyunyoung Jung, Zhuo Hui, Lei Luo, Haitao Yang, Feng Liu, Sungjoo Yoo, Rakesh Ranjan, Denis Demandolx |
To apply optical flow in practice, it is often necessary to resize the input
to smaller dimensions in order to reduce computational costs. However,
downsizing inputs makes the estimation more challenging because objects and
motion ranges become smaller. Even though recent approaches have demonstrated
high-quality flow estimation, they tend to fail to accurately model small
objects and precise boundaries when the input resolution is lowered,
restricting their applicability to high-resolution inputs. In this paper, we
introduce AnyFlow, a robust network that estimates accurate flow from images of
various resolutions. By representing optical flow as a continuous
coordinate-based representation, AnyFlow generates outputs at arbitrary scales
from low-resolution inputs, demonstrating superior performance over prior works
in capturing tiny objects with detail preservation on a wide range of scenes.
We establish a new state-of-the-art performance of cross-dataset generalization
on the KITTI dataset, while achieving comparable accuracy on the online
benchmarks to other SOTA methods. |
This paper introduces AnyFlow, a novel neural network architecture for optical flow estimation that handles arbitrary image resolutions and remains robust even with low-resolution inputs. |
Existing optical flow methods struggle with low-resolution images, limiting their use on devices where resizing to smaller sizes is often necessary for reduced computational cost. AnyFlow addresses this by producing high-quality flow estimations at arbitrary scales, even from downsampled input. |
The proposed AnyFlow builds upon RAFT and introduces: 1) a neural implicit flow upsampler for arbitrary scale output; 2) a multi-scale feature warping module to leverage high-resolution representations; 3) a dynamic lookup scheme with region encoding to adapt to diverse motions and input sizes. The model is trained with multi-scale inputs. |
Achieves state-of-the-art cross-dataset generalization performance on KITTI, surpassing previous methods by a significant margin.
Demonstrates robustness to downsampling, maintaining high accuracy even with 50% downsampled images, outperforming existing methods that degrade substantially.
Generates high-resolution optical flow directly from low-resolution input, avoiding artifacts introduced by interpolation or super-resolution techniques used in other methods. |
The dynamic lookup with region encoding, while improving performance on the training set, did not show consistent improvements on the test set, suggesting further exploration is needed.
Future work could explore the application of AnyFlow to downstream tasks like video super-resolution, which require accurate optical flow from low-resolution input. |
optical flow, arbitrary resolution, implicit neural representation, multi-scale feature warping, dynamic lookup |
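The coordinate-based view of flow described in this report can be illustrated with a minimal sketch: a low-resolution flow field is queried at continuous coordinates, and the returned vectors are rescaled to the output resolution. Plain bilinear interpolation stands in here for AnyFlow's learned implicit upsampler, which the report does not specify in detail; the function name `query_flow` and its arguments are hypothetical.

```python
import math

def query_flow(flow, x, y, scale):
    """Query a low-resolution flow field at a continuous location.

    flow  : H x W grid (list of rows) of (u, v) vectors in low-res pixel units
    x, y  : continuous coordinates in low-res pixel units
    scale : output size / input size; flow vectors grow by this factor
    Bilinear interpolation is an assumed stand-in for a learned decoder.
    """
    h, w = len(flow), len(flow[0])
    # Clamp the query to the valid grid.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0

    def lerp(a, b, t):
        return a + (b - a) * t

    out = []
    for c in range(2):  # c = 0 is the horizontal component, c = 1 the vertical
        top = lerp(flow[y0][x0][c], flow[y0][x1][c], fx)
        bottom = lerp(flow[y1][x0][c], flow[y1][x1][c], fx)
        # A displacement measured at low resolution must be multiplied by
        # the resize factor to be valid at the output resolution.
        out.append(lerp(top, bottom, fy) * scale)
    return tuple(out)
```

Because the query coordinates are continuous, the same function serves any output scale; only the sampling grid and `scale` change.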
2303.16482
Report |
Point2Pix: Photo-Realistic Point Cloud Rendering via Neural Radiance Fields |
Tao Hu, Xiaogang Xu, Shu Liu, Jiaya Jia |
Synthesizing photo-realistic images from a point cloud is challenging because
of the sparsity of point cloud representation. Recent Neural Radiance Fields
and extensions are proposed to synthesize realistic images from 2D input. In
this paper, we present Point2Pix as a novel point renderer to link the 3D
sparse point clouds with 2D dense image pixels. Taking advantage of the point
cloud 3D prior and NeRF rendering pipeline, our method can synthesize
high-quality images from colored point clouds, generally for novel indoor
scenes. To improve the efficiency of ray sampling, we propose point-guided
sampling, which focuses on valid samples. Also, we present Point Encoding to
build Multi-scale Radiance Fields that provide discriminative 3D point
features. Finally, we propose Fusion Encoding to efficiently synthesize
high-quality images. Extensive experiments on the ScanNet and ARKitScenes
datasets demonstrate the effectiveness and generalization. |
Proposes Point2Pix, a novel point cloud renderer that synthesizes photo-realistic images from colored point clouds, particularly for indoor scenes, by bridging the gap between point clouds and Neural Radiance Fields (NeRF). |
Addresses the challenge of synthesizing realistic images from sparse point clouds, leveraging the strengths of point cloud 3D priors and NeRF rendering. |
Combines point-guided sampling for efficient ray focusing, Multi-scale Radiance Fields for extracting discriminative 3D point features, and a Fusion Decoder to generate high-quality images from rendered feature maps. |
Achieves state-of-the-art results on the ScanNet and ARKitScenes datasets, outperforming existing point cloud renderers and demonstrating strong generalization to novel indoor scenes.
Significantly reduces rendering time and memory consumption compared to traditional NeRF-based methods due to efficient point-guided sampling and multi-scale feature rendering.
Demonstrates applicability in point cloud upsampling and in-painting by leveraging the learned 3D point features and attributes. |
Rendering speed is relatively slow compared to caching-based methods.
Generalization to arbitrary environments beyond indoor scenes remains a challenge. |
point cloud rendering, neural radiance fields, 3d point features, point-guided sampling, fusion decoder |
2303.16333
Report |
Flow supervision for Deformable NeRF |
Chaoyang Wang, Lachlan Ewen MacDonald, Laszlo A. Jeni, Simon Lucey |
In this paper we present a new method for deformable NeRF that can directly
use optical flow as supervision. We overcome the major challenge with respect
to the computational inefficiency of enforcing the flow constraints to the
backward deformation field, used by deformable NeRFs. Specifically, we show
that inverting the backward deformation function is actually not needed for
computing scene flows between frames. This insight dramatically simplifies the
problem, as one is no longer constrained to deformation functions that can be
analytically inverted. Instead, thanks to the weak assumptions required by our
derivation based on the inverse function theorem, our approach can be extended
to a broad class of commonly used backward deformation fields. We present
results on monocular novel view synthesis with rapid object motion, and
demonstrate significant improvements over baselines without flow supervision. |
This paper presents a new method to apply optical flow supervision to deformable NeRF, by deriving an analytical solution to compute object velocities directly from the backward warping field. |
Current deformable NeRFs struggle with rapid object motion due to the lack of temporal regularization. This method allows for the use of optical flow as supervision to address this issue. |
The method leverages the inverse function theorem to compute velocity fields from the backward warping field without needing an explicit inverse function. Scene flows are then computed via time integration of the velocities, enabling the use of optical flow for supervision. |
Flow supervision significantly improves the convergence speed and reconstruction quality for rapid object motion.
The method outperforms baseline deformable NeRFs on datasets with lower effective multi-view factors.
The method enables clean separation of moving foreground objects and static background. |
The method suffers from scale ambiguity due to the lack of depth supervision.
The choice of the canonical frame can significantly impact performance for highly deformable objects. |
deformable nerf, optical flow, novel view synthesis, dynamic scene reconstruction, temporal regularization |
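The velocity computation summarized in this report can be made concrete with a small numerical sketch. If a backward warp w(x, t) sends an observed point to a fixed canonical coordinate, then differentiating w(x(t), t) = const with respect to time gives v = -(∂w/∂x)^{-1} ∂w/∂t, so no explicit inverse of the warp is needed. The toy rotation warp, the function names, and the finite-difference derivatives below are illustrative assumptions, not the paper's implementation.

```python
import math

def backward_warp(x, t, omega=0.5):
    """Toy backward warp (assumed): rotate the point x observed at time t
    back into the canonical frame, undoing a rotation of angle omega * t."""
    a = -omega * t
    return (math.cos(a) * x[0] - math.sin(a) * x[1],
            math.sin(a) * x[0] + math.cos(a) * x[1])

def velocity_from_backward_warp(warp, x, t, eps=1e-5):
    """Return v = -(dw/dx)^{-1} (dw/dt) for a 2D warp.

    Derivatives are taken by central finite differences; a real system
    would more likely use automatic differentiation.
    """
    # Columns of the spatial Jacobian dw/dx.
    col0 = [(p - m) / (2 * eps)
            for p, m in zip(warp((x[0] + eps, x[1]), t), warp((x[0] - eps, x[1]), t))]
    col1 = [(p - m) / (2 * eps)
            for p, m in zip(warp((x[0], x[1] + eps), t), warp((x[0], x[1] - eps), t))]
    # Partial derivative of the warp with respect to time.
    dw_dt = [(p - m) / (2 * eps)
             for p, m in zip(warp(x, t + eps), warp(x, t - eps))]
    # Solve J v = -dw_dt with the closed-form 2x2 inverse.
    det = col0[0] * col1[1] - col1[0] * col0[1]
    vx = -(col1[1] * dw_dt[0] - col1[0] * dw_dt[1]) / det
    vy = -(-col0[1] * dw_dt[0] + col0[0] * dw_dt[1]) / det
    return vx, vy
```

For the toy warp, the point rotates about the origin at angular rate `omega`, so at x = (1, 0) the recovered velocity should be close to (0, omega). Integrating such velocities over time would give the scene flow that the report says is compared against optical flow.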
2303.16201
Report |
ASIC: Aligning Sparse in-the-wild Image Collections |
Kamal Gupta, Varun Jampani, Carlos Esteves, Abhinav Shrivastava, Ameesh Makadia, Noah Snavely, Abhishek Kar |
We present a method for joint alignment of sparse in-the-wild image
collections of an object category. Most prior works assume either ground-truth
keypoint annotations or a large dataset of images of a single object category.
However, neither of the above assumptions holds true for the long tail of the
objects present in the world. We present a self-supervised technique that
directly optimizes on a sparse collection of images of a particular
object/object category to obtain consistent dense correspondences across the
collection. We use pairwise nearest neighbors obtained from deep features of a
pre-trained vision transformer (ViT) model as noisy and sparse keypoint matches
and make them dense and accurate matches by optimizing a neural network that
jointly maps the image collection into a learned canonical grid. Experiments on
CUB and SPair-71k benchmarks demonstrate that our method can produce globally
consistent and higher quality correspondences across the image collection when
compared to existing self-supervised methods. Code and other material will be
made available at https://kampta.github.io/asic. |
This paper presents ASIC, a method for obtaining consistent dense correspondences across a small collection of in-the-wild images of an object or object category, without requiring manual annotations. |
Dense image alignment in the wild is crucial for various applications, but most existing methods rely on annotated keypoints, large datasets, or specific object knowledge. ASIC addresses this gap by leveraging pre-trained self-supervised vision models to enable low-shot dense correspondence. |
ASIC utilizes sparse pseudo-correspondences from deep features of a pre-trained ViT model as noisy keypoint matches. It then jointly optimizes a neural network to map the image collection into a learned canonical grid, using a contrastive loss, equivariance regularization, and reconstruction terms. |
ASIC produces globally consistent and dense canonical space mappings for various object categories, effectively acting as a continuous co-segmentation.
The method outperforms or achieves competitive results with existing unsupervised keypoint correspondence approaches on SPair-71k, CUB-200, PF-Willow, and SAMURAI datasets.
ASIC exhibits superior consistency in keypoint propagation over image sequences compared to baselines, as demonstrated by the proposed k-cycle PCK metric. |
ASIC may struggle with left-right ambiguity in symmetric objects due to the nature of SSL models.
The method might not handle large viewpoint changes well, especially when intermediate viewpoints are scarce.
Future work will explore applications in few-shot tasks like reconstruction, pose estimation, and tracking. |
dense correspondence, image alignment, self-supervised learning, vision transformer, low-shot learning |
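The pseudo-correspondence step in this report (pairwise nearest neighbors between deep features) can be sketched as follows. The sketch keeps only mutual nearest neighbors under cosine similarity; the mutual check and the similarity measure are assumptions made for illustration, and the feature vectors would in practice come from a pre-trained ViT rather than being hand-written.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mutual_nearest_neighbors(feats_a, feats_b):
    """Return index pairs (i, j) where feats_a[i] and feats_b[j] are each
    other's nearest neighbor. These serve as sparse, noisy keypoint matches."""
    best_ab = [max(range(len(feats_b)), key=lambda j: cosine(fa, feats_b[j]))
               for fa in feats_a]
    best_ba = [max(range(len(feats_a)), key=lambda i: cosine(feats_a[i], fb))
               for fb in feats_b]
    # Keep a match only if it is the best choice in both directions.
    return [(i, j) for i, j in enumerate(best_ab) if best_ba[j] == i]
```

Matches of this kind are sparse and noisy, which is why the report describes a further optimization that maps all images into a shared canonical grid.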
2303.16187
Report |
Visual Chain-of-Thought Diffusion Models |
William Harvey, Frank Wood |
Recent progress with conditional image diffusion models has been stunning,
and this holds true whether we are speaking about models conditioned on a text
description, a scene layout, or a sketch. Unconditional image diffusion models
are also improving but lag behind, as do diffusion models which are conditioned
on lower-dimensional features like class labels. We propose to close the gap
between conditional and unconditional models using a two-stage sampling
procedure. In the first stage we sample an embedding describing the semantic
content of the image. In the second stage we sample the image conditioned on
this embedding and then discard the embedding. Doing so lets us leverage the
power of conditional diffusion models on the unconditional generation task,
which we show improves FID by 25-50% compared to standard unconditional
generation. |
This paper introduces Visual Chain-of-Thought Diffusion Models (VCDM), a two-stage sampling procedure for improved unconditional and lightly-conditional image generation. |
Unconditional and lightly-conditional diffusion models lag behind their heavily-conditioned counterparts in terms of sample quality. This paper aims to bridge this performance gap. |
VCDM first samples a semantically-meaningful CLIP embedding and then, conditioned on this embedding, samples the final image using a conditional diffusion model. The CLIP embedding is discarded after image generation. |
VCDM consistently outperforms standard unconditional diffusion models (EDM) on AFHQ, FFHQ, and ImageNet datasets.
The performance gap between VCDM and using an oracle CLIP embedding is small, indicating the effectiveness of the learned auxiliary model for embedding sampling.
VCDM achieves significant FID improvements with minimal computational overhead compared to baselines. |
VCDM relies on the availability of pretrained CLIP embedders, which might be limited for certain domains.
Exploring alternative self-supervised representations or joint diffusion models over both image and embedding spaces could further enhance VCDM. |
diffusion models, image generation, clip embeddings, unconditional generation, conditional generation |
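The two-stage sampling procedure in this report reduces to a short control flow, sketched below. The two samplers are toy stand-ins (hypothetical names, Gaussian noise instead of diffusion models); only the structure, sample an embedding, sample an image conditioned on it, then discard the embedding, follows the description above.

```python
import random

def toy_embedding_sampler(rng):
    """Stage 1 stand-in: draw a small 'semantic' embedding."""
    return [rng.gauss(0.0, 1.0) for _ in range(4)]

def toy_image_sampler(embedding, rng):
    """Stage 2 stand-in: draw 'pixels' whose mean depends on the embedding."""
    mean = sum(embedding) / len(embedding)
    return [mean + 0.1 * rng.gauss(0.0, 1.0) for _ in range(16)]

def sample_two_stage(sample_embedding, sample_image, rng):
    """Unconditional sampling through a conditional model."""
    embedding = sample_embedding(rng)      # stage 1: semantic content
    image = sample_image(embedding, rng)   # stage 2: image given content
    return image                           # the embedding is discarded
```

From the caller's point of view the procedure is still unconditional: no label or prompt is supplied, and only the image is returned.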
2303.15951
Report |
F$^{2}$-NeRF: Fast Neural Radiance Field Training with Free Camera Trajectories |
Peng Wang, Yuan Liu, Zhaoxi Chen, Lingjie Liu, Ziwei Liu, Taku Komura, Christian Theobalt, Wenping Wang |
This paper presents a novel grid-based NeRF called F2-NeRF (Fast-Free-NeRF)
for novel view synthesis, which enables arbitrary input camera trajectories and
only costs a few minutes for training. Existing fast grid-based NeRF training
frameworks, like Instant-NGP, Plenoxels, DVGO, or TensoRF, are mainly designed
for bounded scenes and rely on space warping to handle unbounded scenes.
The two existing widely used space-warping methods are designed only for the
forward-facing trajectory or the 360-degree object-centric trajectory but
cannot process arbitrary trajectories. In this paper, we delve deep into the
mechanism of space warping to handle unbounded scenes. Based on our analysis,
we further propose a novel space-warping method called perspective warping,
which allows us to handle arbitrary trajectories in the grid-based NeRF
framework. Extensive experiments demonstrate that F2-NeRF is able to use the
same perspective warping to render high-quality images on two standard datasets
and a new free trajectory dataset collected by us. Project page:
https://totoro97.github.io/projects/f2-nerf. |
This paper presents F$^2$-NeRF, a novel grid-based NeRF method that enables fast training with free camera trajectories for novel view synthesis in unbounded scenes. |
Existing fast grid-based NeRF methods are limited to bounded scenes or specific camera trajectories (forward-facing or 360° object-centric) due to their reliance on space warping techniques. |
F$^2$-NeRF introduces a new perspective warping method that generalizes to arbitrary camera trajectories. It maps 3D points to a warp space based on their projections onto multiple input views using PCA. Additionally, it employs an adaptive space subdivision scheme and multiple hash functions to efficiently handle unbounded scenes. |
F$^2$-NeRF outperforms baseline methods on a new Free dataset with challenging free camera trajectories while achieving training times of around 12 minutes on a 2080Ti GPU.
The proposed perspective warping is shown to be compatible with both forward-facing and 360° object-centric trajectories, achieving comparable results to specialized warping techniques on LLFF and NeRF-360-V2 datasets.
Ablation studies validate the effectiveness of perspective warping and perspective sampling for improved rendering quality on free trajectories. |
The current implementation relies on a fixed number of cameras for perspective warping computation, potentially limiting its representation capacity for scenes with drastic changes in viewpoints.
The proposed perspective warping utilizes a linear approximation for perspective sampling, which may lead to suboptimal sampling in certain cases with complex geometry. |
neural radiance fields, novel view synthesis, space warping, perspective warping, unbounded scenes |
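For context on what a space warp does, the sketch below shows a contraction commonly used for 360-degree object-centric captures: points inside the unit ball are left untouched, and everything outside is squeezed into a ball of radius 2. This is the contraction popularized by mip-NeRF 360, given here as an example of the trajectory-specific warps the report contrasts against; it is not F$^2$-NeRF's perspective warping.

```python
import math

def contract(point):
    """Map an unbounded 3D point into a ball of radius 2.

    Points with norm <= 1 are unchanged; a point with norm r > 1 is moved
    along its direction to radius 2 - 1/r.
    """
    norm = math.sqrt(sum(c * c for c in point))
    if norm <= 1.0:
        return tuple(point)
    factor = (2.0 - 1.0 / norm) / norm
    return tuple(c * factor for c in point)
```

A warp like this assumes the cameras circle a central object, which is the assumption that breaks down for free trajectories.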
2303.15893
Report |
VIVE3D: Viewpoint-Independent Video Editing using 3D-Aware GANs |
Anna Frühstück, Nikolaos Sarafianos, Yuanlu Xu, Peter Wonka, Tony Tung |
We introduce VIVE3D, a novel approach that extends the capabilities of
image-based 3D GANs to video editing and is able to represent the input video
in an identity-preserving and temporally consistent way. We propose two new
building blocks. First, we introduce a novel GAN inversion technique
specifically tailored to 3D GANs by jointly embedding multiple frames and
optimizing for the camera parameters. Second, besides traditional semantic face
edits (e.g. for age and expression), we are the first to demonstrate edits that
show novel views of the head enabled by the inherent properties of 3D GANs and
our optical flow-guided compositing technique to combine the head with the
background video. Our experiments demonstrate that VIVE3D generates
high-fidelity face edits at consistent quality from a range of camera
viewpoints which are composited with the original video in a temporally and
spatially consistent manner. |
Presents VIVE3D, a novel method for viewpoint-independent video editing that leverages 3D-aware GANs to enable realistic and consistent edits of facial attributes and camera viewpoints. |
Existing video editing techniques struggle with maintaining consistency when altering viewpoints, limiting their application in scenarios requiring perspective changes. |
VIVE3D decomposes the input video into identity and offset latent codes, allowing for personalized generator fine-tuning and per-frame editing. It also employs optical flow correction to ensure seamless compositing of the edited face onto the original body, even under significant viewpoint changes. |
VIVE3D successfully performs attribute edits (like aging) and viewpoint adjustments while preserving temporal and spatial consistency.
The method surpasses existing 2D GAN-based techniques in handling viewpoint changes, showcasing superior quantitative and qualitative results.
VIVE3D demonstrates strong generalization ability by compositing faces and motions from different videos with plausible outcomes. |
Lighting inconsistencies between source and target videos can impact the realism of composites, suggesting an area for future improvement.
The reliance on per-frame optimization, while effective, can be computationally intensive. Exploring encoding strategies for EG3D could improve efficiency. |
video editing, 3d gans, viewpoint synthesis, facial attribute editing, deep learning |
2303.15892
Report |
Head3D: Complete 3D Head Generation via Tri-plane Feature Distillation |
Yuhao Cheng, Yichao Yan, Wenhan Zhu, Ye Pan, Bowen Pan, Xiaokang Yang |
Head generation with diverse identities is an important task in computer
vision and computer graphics, widely used in multimedia applications. However,
current full head generation methods require a large number of 3D scans or
multi-view images to train the model, resulting in expensive data acquisition
cost. To address this issue, we propose Head3D, a method to generate full 3D
heads with limited multi-view images. Specifically, our approach first extracts
facial priors represented by tri-planes learned in EG3D, a 3D-aware generative
model, and then proposes feature distillation to deliver the 3D frontal faces
into complete heads without compromising head integrity. To mitigate the domain
gap between the face and head models, we present dual-discriminators to guide
the frontal and back head generation, respectively. Our model achieves
cost-efficient and diverse complete head generation with photo-realistic
renderings and high-quality geometry representations. Extensive experiments
demonstrate the effectiveness of our proposed Head3D, both qualitatively and
quantitatively. |
This paper proposes Head3D, a novel method for generating complete 3D heads using limited multi-view images and a pre-trained 3D face generator. |
Existing 3D head generation methods are limited to frontal faces or require expensive 3D scans. Head3D addresses these limitations by leveraging a cost-effective approach. |
Head3D extracts facial priors from a pre-trained EG3D model via tri-plane feature distillation, transferring identity information while completing head geometry. A dual-discriminator approach addresses the distribution gap between frontal and back head images. |
Head3D generates high-fidelity complete heads with photo-realistic renderings and detailed geometry.
The proposed tri-plane feature distillation effectively transfers identity information while maintaining head integrity.
Dual-discriminators improve generation quality by addressing the distribution gap and quantity imbalance between front and back views. |
The quality of generated heads, while high, is slightly lower than the pre-trained face generator due to knowledge distillation and limited data.
Future work could explore improving generation quality and extending the method to handle various head poses and expressions. |
3d head generation, tri-plane feature distillation, dual-discriminator, knowledge distillation, 3d-aware gan |
2303.15780
Report |
Instruct 3D-to-3D: Text Instruction Guided 3D-to-3D conversion |
Hiromichi Kamata, Yuiko Sakuma, Akio Hayakawa, Masato Ishii, Takuya Narihira |
We propose a high-quality 3D-to-3D conversion method, Instruct 3D-to-3D. Our
method is designed for a novel task, which is to convert a given 3D scene to
another scene according to text instructions. Instruct 3D-to-3D applies
pretrained Image-to-Image diffusion models for 3D-to-3D conversion. This
enables the likelihood maximization of each viewpoint image and high-quality 3D
generation. In addition, our proposed method explicitly inputs the source 3D
scene as a condition, which enhances 3D consistency and controllability of how
much of the source 3D scene structure is reflected. We also propose dynamic
scaling, which allows the intensity of the geometry transformation to be
adjusted. We performed quantitative and qualitative evaluations and showed that
our proposed method achieves higher quality 3D-to-3D conversions than baseline
methods. |
Presents Instruct 3D-to-3D, a method for converting a 3D scene into another based on text instructions, leveraging pretrained Image-to-Image diffusion models and dynamic scaling for high quality and 3D consistency. |
Editing 3D scenes with text instructions is desirable for easier and more intuitive 3D content creation. |
Uses a pretrained Image-to-Image diffusion model (InstructPix2Pix) conditioned on the source 3D scene and the text instruction to optimize a target 3D model. Dynamic scaling is employed to control the strength of geometry transformation by gradually decreasing and increasing the 3D resolution. |
Achieves higher quality 3D-to-3D conversions than CLIP-NeRF and DreamFusion in qualitative and quantitative evaluations.
Demonstrates better preservation of source 3D scene structure while reflecting text instructions.
User study confirms a strong preference for Instruct 3D-to-3D over baseline methods in terms of overall conversion quality. |
Limitations in handling instructions requiring spatial reasoning, such as accurately placing objects.
Future work to incorporate depth information and improve spatial reasoning capabilities. |
3d-to-3d conversion, text-guided 3d editing, diffusion models, dynamic scaling, implicit 3d representation |
2303.15768
Report |
RobustSwap: A Simple yet Robust Face Swapping Model against Attribute Leakage |
Jaeseong Lee, Taewoo Kim, Sunghyun Park, Younggun Lee, Jaegul Choo |
Face swapping aims at injecting a source image's identity (i.e., facial
features) into a target image, while strictly preserving the target's
attributes, which are irrelevant to identity. However, we observed that
previous approaches still suffer from source attribute leakage, where the
source image's attributes interfere with the target image's. In this paper, we
analyze the latent space of StyleGAN and find the adequate combination of the
latents geared for the face swapping task. Based on the findings, we develop a
simple yet robust face swapping model, RobustSwap, which is resistant to the
potential source attribute leakage. Moreover, we exploit the coordination of
3DMM's implicit and explicit information as a guidance to incorporate the
structure of the source image and the precise pose of the target image. Despite
our method solely utilizing an image dataset without identity labels for
training, our model has the capability to generate high-fidelity and temporally
consistent videos. Through extensive qualitative and quantitative evaluations,
we demonstrate that our method shows significant improvements compared with the
previous face swapping models in synthesizing both images and videos. Project
page is available at https://robustswap.github.io/ |
This paper introduces RobustSwap, a novel face swapping model that addresses the issue of source attribute leakage in previous approaches. |
Existing face swapping methods often exhibit source attribute leakage, where attributes from the source image, such as hair or pose, contaminate the swapped image. This leakage degrades the quality and realism of the results. |
The paper analyzes StyleGAN's latent space to find an optimal combination of latent codes for preserving target attributes while injecting source identity. The proposed RobustSwap model utilizes a target attribute encoder, source identity encoder, and a shape-guided identity injection mechanism based on 3DMM. |
RobustSwap demonstrates superior performance in preserving target attributes and minimizing source attribute leakage compared to existing methods.
The method achieves state-of-the-art quantitative results on CelebA-HQ dataset across various metrics, including identity similarity, attribute preservation, and image quality.
RobustSwap can generate high-quality and temporally consistent videos even without training on video datasets, highlighting its robustness and generalization ability. |
The model's reliance on a pre-trained StyleGAN might limit its generalizability to unseen domains or facial variations.
Further research could explore enhancing identity preservation while maintaining attribute fidelity. |
face swapping, source attribute leakage, stylegan, 3d morphable model (3dmm), latent space analysis |
2303.15649
Report |
StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing |
Senmao Li, Joost van de Weijer, Taihang Hu, Fahad Shahbaz Khan, Qibin Hou, Yaxing Wang, Jian Yang |
A significant research effort is focused on exploiting the amazing capacities
of pretrained diffusion models for the editing of images. They either finetune
the model, or invert the image in the latent space of the pretrained model.
However, they suffer from two problems: (1) Unsatisfying results for selected
regions, and unexpected changes in nonselected regions. (2) They require
careful text prompt editing where the prompt should include all visual objects
in the input image. To address this, we propose two improvements: (1) Only
optimizing the input of the value linear network in the cross-attention layers
is sufficiently powerful to reconstruct a real image. (2) We propose attention
regularization to preserve the object-like attention maps after editing,
enabling us to obtain accurate style editing without invoking significant
structural changes. We further improve the editing technique which is used for
the unconditional branch of classifier-free guidance, as well as the
conditional one as used by P2P. Extensive experimental prompt-editing results
on a variety of images demonstrate qualitatively and quantitatively that our
method has editing capabilities superior to those of existing and concurrent works. |
This paper introduces StyleDiffusion, a method for accurate text-based editing of real images using pre-trained text-guided diffusion models. |
Existing methods struggle with achieving accurate edits in specific regions while preserving the rest of the image and often require complex prompt engineering. StyleDiffusion addresses these limitations. |
StyleDiffusion maps a real image to the input embeddings of the *value* linear layer in the cross-attention layers, keeping the text embedding for the *key* layer fixed. This preserves structure from the input image while enabling style editing. It also introduces an attention regularization to maintain attention map accuracy during editing and proposes P2Plus, an enhancement to the P2P editing technique for improved handling of large structural changes. |
StyleDiffusion achieves more accurate editing of real images compared to baselines like Null-text and SDEdit, as demonstrated qualitatively and quantitatively using metrics like Structure Dist, NS-LPIPS, and Clipscore.
The method successfully preserves the structure of non-edited regions in the image while modifying the targeted elements.
Attention regularization proves essential for maintaining the fidelity of attention maps during editing, resulting in more precise edits. |
StyleDiffusion may not perform optimally when the input image contains objects in uncommon poses or when the semantic gap between source and target prompts is too large.
Future work could explore extending StyleDiffusion to handle multiple edits simultaneously and further improve its efficiency for real-time applications. |
image editing, diffusion models, text-guided synthesis, attention mechanisms, style transfer |
2303.15446
Report |
SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications |
Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan |
Self-attention has become a de facto choice for capturing global context in
various vision applications. However, its quadratic computational complexity
with respect to image resolution limits its use in real-time applications,
especially for deployment on resource-constrained mobile devices. Although
hybrid approaches have been proposed to combine the advantages of convolutions
and self-attention for a better speed-accuracy trade-off, the expensive matrix
multiplication operations in self-attention remain a bottleneck. In this work,
we introduce a novel efficient additive attention mechanism that effectively
replaces the quadratic matrix multiplication operations with linear
element-wise multiplications. Our design shows that the key-value interaction
can be replaced with a linear layer without sacrificing any accuracy. Unlike
previous state-of-the-art methods, our efficient formulation of self-attention
enables its usage at all stages of the network. Using our proposed efficient
additive attention, we build a series of models called "SwiftFormer" which
achieves state-of-the-art performance in terms of both accuracy and mobile
inference speed. Our small variant achieves 78.5% top-1 ImageNet-1K accuracy
with only 0.8 ms latency on iPhone 14, which is more accurate and 2x faster
compared to MobileViT-v2. Code: https://github.com/Amshaker/SwiftFormer |
This paper introduces SwiftFormer, a novel efficient additive attention mechanism for transformer-based real-time mobile vision applications, which replaces quadratic matrix multiplications in self-attention with linear element-wise multiplications. |
Self-attention, while effective for capturing global context in vision applications, has a quadratic computational complexity that limits its use in real-time applications on mobile devices. SwiftFormer addresses this by providing an efficient alternative. |
The paper proposes an efficient additive attention mechanism that eliminates the need for matrix multiplications by focusing on query-key interactions and using a linear layer for context encoding. This allows for a consistent hybrid design with the attention block used in all stages of the network. |
SwiftFormer achieves state-of-the-art performance in terms of accuracy and mobile inference speed, outperforming existing ConvNets, transformer-based, and hybrid models.
The small variant of SwiftFormer achieves 78.5% top-1 ImageNet-1K accuracy with only 0.8ms latency on iPhone 14, surpassing MobileViT-v2 in both accuracy and speed.
The effectiveness of SwiftFormer is demonstrated across various tasks, including image classification, object detection, instance segmentation, and semantic segmentation. |
The paper primarily focuses on 224x224 image resolution for evaluation and comparison.
Future work may explore the use of neural architecture search to potentially further optimize SwiftFormer's performance. |
computer vision, mobile vision, efficient deep learning, transformers, self-attention |
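The linear-cost idea behind the additive attention in this report can be sketched in plain Python. Each token's query is scored against a single weight vector, the scores pool the queries into one global query, and that global query interacts with every key element-wise, so no n-by-n attention matrix is formed. The softmax normalization, the identity in place of the final linear layer, and the omission of learned projections are simplifying assumptions; `w_a` would be a learned parameter in a real model.

```python
import math

def additive_attention(queries, keys, w_a):
    """Simplified additive attention over n tokens of dimension d.

    queries, keys : lists of n vectors, each of length d
    w_a           : length-d weight vector (learned in practice)
    Cost is O(n * d) rather than O(n^2 * d).
    """
    d = len(w_a)
    # One scalar score per token.
    scores = [sum(q_j * w_j for q_j, w_j in zip(q, w_a)) / math.sqrt(d)
              for q in queries]
    # Numerically stable softmax over tokens (an assumed normalization).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Pool all queries into a single global query vector.
    global_q = [sum(a * q[j] for a, q in zip(alphas, queries)) for j in range(d)]
    # Element-wise interaction with each key, plus a residual query term.
    return [[q[j] + global_q[j] * k[j] for j in range(d)]
            for q, k in zip(queries, keys)]
```

Every step is a sum or an element-wise product over tokens, which is why cost grows linearly with the number of tokens and the block can be afforded at every stage of the network.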
2303.15437
Report |
FaceLit: Neural 3D Relightable Faces |
Anurag Ranjan, Kwang Moo Yi, Jen-Hao Rick Chang, Oncel Tuzel |
We propose a generative framework, FaceLit, capable of generating a 3D face
that can be rendered at various user-defined lighting conditions and views,
learned purely from 2D images in-the-wild without any manual annotation. Unlike
existing works that require careful capture setup or human labor, we rely on
off-the-shelf pose and illumination estimators. With these estimates, we
incorporate the Phong reflectance model in the neural volume rendering
framework. Our model learns to generate shape and material properties of a face
such that, when rendered according to the natural statistics of pose and
illumination, they produce photorealistic face images with multiview 3D and
illumination consistency. Our method enables photorealistic generation of faces
with explicit illumination and view controls on multiple datasets - FFHQ,
MetFaces and CelebA-HQ. We show state-of-the-art photorealism among 3D aware
GANs on FFHQ dataset achieving an FID score of 3.5. |
Proposes FaceLit, a generative framework that learns a disentangled 3D model of a face from 2D images, enabling rendering under various user-defined lighting conditions and views. |
Existing 3D generative models entangle geometry and illumination, limiting controllability. FaceLit addresses this by incorporating physics-based illumination for disentanglement. |
Embeds a simplified Phong reflectance model with Spherical Harmonics into the EG3D neural volume rendering pipeline. The model learns to generate shape and material properties for realistic rendering under varying pose and illumination. |
Achieves state-of-the-art FID score of 3.5 among 3D aware GANs on FFHQ dataset.
Demonstrates photorealistic generation of faces with explicit illumination and view controls on FFHQ, MetFaces, and CelebA-HQ.
Shows improved detail in challenging areas like lips and teeth compared to previous methods. |
Does not model all physical aspects of the scene, e.g., global illumination or subsurface scattering, which could further improve realism.
Accuracy is limited by the performance of external pose and illumination estimation methods (DECA). |
generative model, 3d face reconstruction, relighting, neural volume rendering, disentanglement |
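To make the Phong reflectance idea in the FaceLit entry concrete, here is a minimal sketch of classic Phong shading for one surface point. It assumes a single directional light rather than the Spherical Harmonics illumination the entry mentions, and the function name, the `ambient` term, and all parameter values are illustrative assumptions, not FaceLit's implementation.

```python
import math


def _normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def phong_shade(normal, light_dir, view_dir, albedo, k_spec, shininess, ambient=0.1):
    """Classic Phong shading for one point and one directional light.

    light_dir points from the surface toward the light, view_dir from the
    surface toward the camera. albedo is an RGB tuple in [0, 1].
    (Hypothetical helper: a single directional light is an assumption.)
    """
    n, l, v = _normalize(normal), _normalize(light_dir), _normalize(view_dir)
    n_dot_l = max(0.0, _dot(n, l))
    # Mirror the light direction about the normal: r = 2 (n.l) n - l
    r = tuple(2.0 * _dot(n, l) * nc - lc for nc, lc in zip(n, l))
    # No specular highlight when the light is behind the surface
    spec = k_spec * max(0.0, _dot(r, v)) ** shininess if n_dot_l > 0 else 0.0
    # Diffuse term is modulated by albedo; specular term is added on top
    return tuple(min(1.0, a * (ambient + n_dot_l) + spec) for a in albedo)


front_lit = phong_shade((0, 0, 1), (0, 0, 1), (0, 0, 1), (0.5, 0.4, 0.3), 0.2, 10)
back_lit = phong_shade((0, 0, 1), (0, 0, -1), (0, 0, 1), (0.5, 0.4, 0.3), 0.2, 10)
```

Because the shading depends on the normal and material terms separately from the light direction, a generator that outputs geometry and material can be relit by changing only `light_dir`.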
2303.15435
Report |
The Stable Signature: Rooting Watermarks in Latent Diffusion Models |
Pierre Fernandez, Guillaume Couairon, Hervé Jégou, Matthijs Douze, Teddy Furon |
Generative image modeling enables a wide range of applications but raises
ethical concerns about responsible deployment. This paper introduces an active
strategy combining image watermarking and Latent Diffusion Models. The goal is
for all generated images to conceal an invisible watermark allowing for future
detection and/or identification. The method quickly fine-tunes the latent
decoder of the image generator, conditioned on a binary signature. A
pre-trained watermark extractor recovers the hidden signature from any
generated image and a statistical test then determines whether it comes from
the generative model. We evaluate the invisibility and robustness of the
watermarks on a variety of generation tasks, showing that Stable Signature
works even after the images are modified. For instance, it detects the origin
of an image generated from a text prompt, then cropped to keep $10\%$ of the
content, with over $90\%$ accuracy at a false positive rate below $10^{-6}$. |
This paper proposes Stable Signature, a method to embed invisible watermarks directly into images generated by Latent Diffusion Models (LDMs) for detection and identification purposes. |
The rise of AI-generated content necessitates reliable methods for detecting and tracing generated images to address concerns about authenticity, copyright, and misuse. |
Stable Signature fine-tunes the LDM decoder by back-propagating a combined loss of a perceptual image loss and a hidden message loss from a pre-trained watermark extractor. The extractor, trained using a simplified HiDDeN method, ensures robust watermark recovery even after image modifications. |
Stable Signature achieves high detection rates (e.g., 99% for unmodified images) with low false positive rates (e.g., 1 in 1 billion) on various image transformations.
The method demonstrates accurate identification of the specific LDM model used for generation even with a large number of deployed models.
Stable Signature maintains high image quality, with minimal perceptual differences between watermarked and original LDM outputs. |
The watermark's robustness against model purification attacks, where the model is fine-tuned to remove the watermark, requires further investigation.
Exploring more powerful traitor tracing codes and accusation algorithms to enhance the identification of colluding users is an area for future work. |
image watermarking, latent diffusion models, ai-generated content detection, content authentication, responsible ai |
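The statistical test described in the Stable Signature entry can be illustrated with a simple bit-matching test. This sketch assumes that, for a non-watermarked image, each extracted bit matches the signature independently with probability 1/2; the 48-bit signature length and all function names are illustrative assumptions rather than details taken from the entry.

```python
from math import comb


def false_positive_rate(k, tau):
    """P(at least tau of k bits match) when every bit is a fair coin flip."""
    return sum(comb(k, i) for i in range(tau, k + 1)) / 2 ** k


def min_threshold(k, target_fpr):
    """Smallest number of matching bits whose false positive rate is <= target."""
    for tau in range(k + 2):  # tau = k + 1 gives an empty tail, i.e. FPR 0
        if false_positive_rate(k, tau) <= target_fpr:
            return tau


def is_watermarked(extracted_bits, signature_bits, tau):
    """Flag the image when enough extracted bits agree with the signature."""
    matches = sum(e == s for e, s in zip(extracted_bits, signature_bits))
    return matches >= tau


K = 48  # assumed signature length, for illustration only
TAU = min_threshold(K, 1e-6)
```

Choosing the threshold from the binomial tail lets the operator fix the false positive rate in advance, independently of how robust the extractor turns out to be.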
2303.15403
Report |
Training-free Content Injection using h-space in Diffusion Models |
Jaeseok Jeong, Mingi Kwon, Youngjung Uh |
Diffusion models (DMs) synthesize high-quality images in various domains.
However, controlling their generative process is still hazy because the
intermediate variables in the process are not rigorously studied. Recently, the
bottleneck feature of the U-Net, namely $h$-space, is found to convey the
semantics of the resulting image. It enables StyleCLIP-like latent editing
within DMs. In this paper, we explore further usage of $h$-space beyond
attribute editing, and introduce a method to inject the content of one image
into another image by combining their features in the generative processes.
Briefly, given the original generative process of the other image, 1) we
gradually blend the bottleneck feature of the content with proper
normalization, and 2) we calibrate the skip connections to match the injected
content. Unlike custom-diffusion approaches, our method does not require
time-consuming optimization or fine-tuning. Instead, our method manipulates
intermediate features within a feed-forward generative process. Furthermore,
our method does not require supervision from external networks. The code is
available at https://curryjung.github.io/InjectFusion/ |
This paper introduces InjectFusion, a training-free method for injecting content from one image into another using pretrained diffusion models by manipulating feature maps in the bottleneck of the U-Net. |
Controlling the generative process of diffusion models remains a challenge. This method provides a novel way to control content without requiring additional training or external networks, unlike previous approaches. |
InjectFusion uses spherical interpolation (Slerp) to blend bottleneck features of content and original images while preserving statistical correlations within the model. It further introduces "latent calibration" to fine-tune the blending and preserve image quality. |
Simply replacing bottleneck features leads to content injection but with significant distortion.
Slerp with proper normalization effectively injects content while maintaining high image quality.
InjectFusion successfully injects content even when using out-of-domain images as stylistic references. |
The small spatial dimensions of the bottleneck feature map limit the granularity of local content injection.
Injecting content from out-of-domain images with drastically different semantics can lead to poor results. |
diffusion models, content injection, image synthesis, generative models, training-free |
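The spherical interpolation (Slerp) named in the InjectFusion entry can be sketched on plain vectors. This is the textbook Slerp formula applied to flattened features; the paper's normalization step and latent calibration are not reproduced, and the function name is hypothetical.

```python
import math


def slerp(a, b, t):
    """Spherical interpolation between vectors a and b at ratio t in [0, 1].

    Uses the angle between a and b; the result is ill-conditioned when the
    vectors are (nearly) antipodal, which this sketch does not handle.
    """
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding
    omega = math.acos(cos_omega)
    if omega < 1e-8:
        # Nearly parallel vectors: fall back to linear interpolation
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    s = math.sin(omega)
    w_a = math.sin((1 - t) * omega) / s
    w_b = math.sin(t * omega) / s
    return [w_a * x + w_b * y for x, y in zip(a, b)]


halfway = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

Unlike linear blending, Slerp between two unit vectors stays on the unit sphere, which is one way to keep the blended feature's norm close to what the network saw during training.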
2303.15389
Report |
EVA-CLIP: Improved Training Techniques for CLIP at Scale |
Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, Yue Cao |
Contrastive language-image pre-training, CLIP for short, has gained
increasing attention for its potential in various scenarios. In this paper, we
propose EVA-CLIP, a series of models that significantly improve the efficiency
and effectiveness of CLIP training. Our approach incorporates new techniques
for representation learning, optimization, and augmentation, enabling EVA-CLIP
to achieve superior performance compared to previous CLIP models with the same
number of parameters but significantly smaller training costs. Notably, our
largest 5.0B-parameter EVA-02-CLIP-E/14+ with only 9 billion seen samples
achieves 82.0 zero-shot top-1 accuracy on ImageNet-1K val. A smaller
EVA-02-CLIP-L/14+ with only 430 million parameters and 6 billion seen samples
achieves 80.4 zero-shot top-1 accuracy on ImageNet-1K val. To facilitate open
access and open research, we release the complete suite of EVA-CLIP to the
community at https://github.com/baaivision/EVA/tree/master/EVA-CLIP. |
This paper proposes EVA-CLIP, a family of models that significantly improves the efficiency and effectiveness of CLIP training, achieving superior performance compared to previous CLIP models with the same number of parameters but significantly smaller training costs. |
Training CLIP models is challenging due to high computational costs and training instability when scaling up. EVA-CLIP provides a feasible, efficient, and effective solution by significantly reducing training costs, stabilizing training, and improving zero-shot performance. |
The EVA-CLIP approach leverages pre-trained EVA representations for initialization, utilizes the LAMB optimizer, implements random dropping of input tokens (FLIP), and employs a speedup trick called flash attention. |
The 5.0B-parameter EVA-02-CLIP-E/14+ achieves 82.0% zero-shot top-1 accuracy on ImageNet-1K val with only 9 billion seen samples.
The smaller EVA-02-CLIP-L/14+ achieves 80.4% zero-shot top-1 accuracy on ImageNet-1K val using only 430 million parameters and 6 billion seen samples.
EVA-CLIP models demonstrate superior performance across various zero-shot benchmarks, including image classification, video classification, and image-text retrieval tasks. |
The zero-shot retrieval performance of EVA-02-CLIP-E/14, while competitive, is slightly lower than OpenCLIP-G/14, possibly due to the smaller text encoder capacity and fewer training samples.
Future work could explore further scaling up EVA-CLIP models with larger text encoders and more training data, especially for retrieval tasks. |
clip, image classification, zero-shot learning, vision-language pre-training, efficient training |
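The random dropping of input tokens mentioned in the EVA-CLIP entry can be sketched as follows. The 50% keep ratio, the 196-token sequence, and the function name are assumptions chosen for illustration; the entry does not state the ratio used.

```python
import random


def drop_tokens(tokens, keep_ratio, rng):
    """Keep a random subset of tokens, preserving their original order."""
    n_keep = max(1, round(len(tokens) * keep_ratio))
    kept_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept_idx]


patch_tokens = list(range(196))  # stand-in for 14 x 14 image patch tokens
kept = drop_tokens(patch_tokens, keep_ratio=0.5, rng=random.Random(0))
```

Since the encoder then processes only the kept tokens, the per-sample cost of a training step shrinks with the keep ratio.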
2303.15342
Report |
Exploring Continual Learning of Diffusion Models |
Michał Zając, Kamil Deja, Anna Kuzina, Jakub M. Tomczak, Tomasz Trzciński, Florian Shkurti, Piotr Miłoś |
Diffusion models have achieved remarkable success in generating high-quality
images thanks to their novel training procedures applied to unprecedented
amounts of data. However, training a diffusion model from scratch is
computationally expensive. This highlights the need to investigate the
possibility of training these models iteratively, reusing computation while the
data distribution changes. In this study, we take the first step in this
direction and evaluate the continual learning (CL) properties of diffusion
models. We begin by benchmarking the most common CL methods applied to
Denoising Diffusion Probabilistic Models (DDPMs), where we note the strong
performance of the experience replay with the reduced rehearsal coefficient.
Furthermore, we provide insights into the dynamics of forgetting, which exhibit
diverse behavior across diffusion timesteps. We also uncover certain pitfalls
of using the bits-per-dimension metric for evaluating CL. |
This paper presents the first study on the continual learning (CL) properties of diffusion models, benchmarking common CL methods on DDPMs and revealing insights into forgetting dynamics and evaluation metrics. |
Training diffusion models from scratch is computationally expensive, making iterative training with changing data distributions highly desirable. |
The authors evaluated various CL methods, including finetuning, L2 regularization, experience replay, and generative replay, on MNIST and Fashion-MNIST datasets. |
DDPMs exhibit significant catastrophic forgetting in CL settings.
Experience replay with a reduced rehearsal coefficient effectively prevents catastrophic forgetting in DDPMs.
Bits-per-dimension (BPD) is an unreliable metric for evaluating generative CL performance in diffusion models. |
The study was limited to MNIST and Fashion-MNIST datasets.
Future work could explore novel CL strategies tailored for diffusion models and extend the analysis to more complex scenarios like text-to-image generation. |
continual learning, diffusion models, generative models, catastrophic forgetting, experience replay |
2303.15234
Report |
Prompt Tuning based Adapter for Vision-Language Model Adaption |
Jingchen Sun, Jiayu Qin, Zihao Lin, Changyou Chen |
Large pre-trained vision-language (VL) models have shown significant promise
in adapting to various downstream tasks. However, fine-tuning the entire
network is challenging due to the massive number of model parameters. To
address this issue, efficient adaptation methods such as prompt tuning have
been proposed. We explore the idea of prompt tuning with multi-task pre-trained
initialization and find it can significantly improve model performance. Based
on our findings, we introduce a new model, termed Prompt-Adapter, that combines
pre-trained prompt tuning with an efficient adaptation network. Our approach
beats the state-of-the-art methods in few-shot image classification on 11
public datasets, especially in settings with limited data instances such as
1 shot, 2 shots, 4 shots, and 8 shots images. Our proposed method demonstrates
the promise of combining prompt tuning and parameter-efficient networks for
efficient vision-language model adaptation. The code is publicly available at:
https://github.com/Jingchensun/prompt_adapter. |
This paper introduces Prompt-Adapter, a novel model for efficient vision-language model adaptation by combining pre-trained prompt tuning and an efficient adaptation network (cache model). |
Adapting large pre-trained vision-language models to downstream tasks is challenging due to their size and complexity. This paper addresses this challenge by improving efficiency while maintaining high performance, especially in few-shot learning scenarios. |
The paper proposes Prompt-Adapter, which leverages a pre-trained text prompt from CoOp and a cache model similar to Tip-Adapter. They explore both training-free (Prompt-Adapter) and fine-tuned (Prompt-Adapter-F) variants. Additionally, they investigate the impact of multi-task prompt initialization and different training strategies. |
Prompt-Adapter achieves superior few-shot image classification performance on 11 datasets, outperforming baselines like CoOp and Tip-Adapter.
The study shows that multi-task prompt initialization significantly improves performance compared to random or manual initialization.
Separately training the prompt and the cache model is found to be more effective than joint training. |
The method shows slight decreases in accuracy on datasets with high intra-class visual feature variance, such as EuroSAT and OxfordPets.
Future work could explore combining other prompt learning methods or expanding the approach to other downstream tasks beyond image classification. |
vision-language models, prompt tuning, few-shot learning, image classification, parameter-efficient learning |
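The cache model in the Prompt-Adapter entry is described as similar to Tip-Adapter. The sketch below follows the commonly cited Tip-Adapter-style formulation, in which few-shot features act as keys and their one-hot labels as values; the `alpha` and `beta` values, the toy features, and the function names are assumptions, and Prompt-Adapter's exact formulation may differ.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cache_logits(query, keys, values, class_weights, alpha=1.0, beta=5.5):
    """Combine cache-model logits with text-classifier logits.

    query:         L2-normalized image feature
    keys:          L2-normalized few-shot training features
    values:        one-hot labels, one per key
    class_weights: L2-normalized text embeddings, one per class
                   (e.g. produced from a learned prompt)
    """
    n_classes = len(class_weights)
    # Affinity is 1 for an identical key and decays with cosine distance
    affinities = [math.exp(-beta * (1.0 - _dot(query, k))) for k in keys]
    cache = [sum(a * v[c] for a, v in zip(affinities, values)) for c in range(n_classes)]
    text = [_dot(query, w) for w in class_weights]
    return [alpha * c + t for c, t in zip(cache, text)]


logits = cache_logits(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1, 0], [0, 1]],
    class_weights=[[1.0, 0.0], [0.0, 1.0]],
)
```

With no trainable parameters in the cache, this corresponds to a training-free variant; a fine-tuned variant would treat `keys` as learnable.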
2303.15067
Report |
Intersection over Union with smoothing for bounding box regression |
Petra Števuliáková, Petr Hurtik |
We focus on the construction of a loss function for the bounding box
regression. The Intersection over Union (IoU) metric is improved to converge
faster, to make the surface of the loss function smooth and continuous over the
whole searched space, and to reach a more precise approximation of the labels.
The main principle is adding a smoothing part to the original IoU, where the
smoothing part is given by a linear space with values that increase from the
ground truth bounding box to the border of the input image, and thus covers the
whole spatial search space. We show the motivation and formalism behind this
loss function and experimentally prove that it outperforms IoU, DIoU, CIoU, and
SIoU by a large margin. We experimentally show that the proposed loss function
is robust with respect to the noise in the dimension of ground truth bounding
boxes. The reference implementation is available at
gitlab.com/irafm-ai/smoothing-iou. |
This paper introduces a novel smoothing modification to the standard Intersection over Union (IoU) loss function, aiming to improve bounding box regression in object detection. |
The proposed method addresses limitations of existing IoU-based losses by improving convergence speed and robustness against noisy labels, crucial for real-world applications with limited or imperfect data. |
The approach involves adding a smoothing part, a linear space with values increasing from the ground truth bounding box to the image border, to the standard IoU loss. This guides gradient descent and mitigates the effects of noisy labels. |
The smoothing IoU loss outperforms standard IoU, SIoU, DIoU, and CIoU in regression accuracy on both clean and noisy datasets.
It exhibits lower overfitting on clean data and higher underfitting on noisy data, indicating robustness against label noise.
The method shows stable regression accuracy even with high noise levels (up to 60%), surpassing the performance of other losses trained on clean data. |
The paper employs a custom dataset with limited diversity, potentially impacting the generalizability of the findings.
Future work could explore the integration of the smoothing approach with other advanced IoU-based losses for enhanced performance. |
bounding box regression, intersection over union, object detection, noisy labels, loss function |
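The motivation for a smoothing term can be seen in a small sketch: plain IoU is zero for every pair of non-overlapping boxes, so the loss cannot tell a near miss from a far miss. The `linear_distance_term` below is a hypothetical stand-in that grows linearly with coordinate offset; it is not the smoothing function defined in the paper, and the weight and image size are assumptions.

```python
def iou(a, b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def linear_distance_term(pred, gt, img_w, img_h):
    """Hypothetical smoothing term in [0, 1]: mean normalized coordinate offset."""
    dx = (abs(pred[0] - gt[0]) + abs(pred[2] - gt[2])) / (2 * img_w)
    dy = (abs(pred[1] - gt[1]) + abs(pred[3] - gt[3])) / (2 * img_h)
    return (dx + dy) / 2


def smoothed_iou_loss(pred, gt, img_w, img_h, weight=1.0):
    """IoU loss plus the illustrative linear term (weight is an assumption)."""
    return (1.0 - iou(pred, gt)) + weight * linear_distance_term(pred, gt, img_w, img_h)


gt_box = (40, 40, 60, 60)
near_miss = (70, 40, 90, 60)
far_miss = (0, 0, 20, 20)
loss_near = smoothed_iou_loss(near_miss, gt_box, 100, 100)
loss_far = smoothed_iou_loss(far_miss, gt_box, 100, 100)
```

Both misses have the same plain IoU loss of 1, while the added term ranks the near miss as better, giving gradient descent a direction to follow across the whole image.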
2303.15043
Report |
Joint Video Multi-Frame Interpolation and Deblurring under Unknown Exposure Time |
Wei Shang, Dongwei Ren, Yi Yang, Hongzhi Zhang, Kede Ma, Wangmeng Zuo |
Natural videos captured by consumer cameras often suffer from low framerate
and motion blur due to the combination of dynamic scene complexity, lens and
sensor imperfection, and less than ideal exposure setting. As a result,
computational methods that jointly perform video frame interpolation and
deblurring begin to emerge with the unrealistic assumption that the exposure
time is known and fixed. In this work, we aim ambitiously for a more realistic
and challenging task - joint video multi-frame interpolation and deblurring
under unknown exposure time. Toward this goal, we first adopt a variant of
supervised contrastive learning to construct an exposure-aware representation
from input blurred frames. We then train two U-Nets for intra-motion and
inter-motion analysis, respectively, adapting to the learned exposure
representation via gain tuning. We finally build our video reconstruction
network upon the exposure and motion representation by progressive
exposure-adaptive convolution and motion refinement. Extensive experiments on
both simulated and real-world datasets show that our optimized method achieves
notable performance gains over the state-of-the-art on the joint video x8
interpolation and deblurring task. Moreover, on the seemingly implausible x16
interpolation task, our method outperforms existing methods by more than 1.5 dB
in terms of PSNR. |
This paper proposes VIDUE, a novel method for jointly interpolating and deblurring videos with unknown exposure times. |
Existing video frame interpolation and deblurring methods often assume fixed and known exposure time, which is unrealistic for real-world videos captured by consumer cameras. This work addresses the more challenging and realistic setting of unknown exposure time. |
VIDUE leverages supervised contrastive learning to construct an exposure-aware representation from input blurred frames. It then uses two U-Nets for intra-motion and inter-motion analysis, adapting them to the learned exposure representation via gain tuning. Finally, it builds a video reconstruction network with exposure-adaptive convolution and motion refinement. |
VIDUE achieves state-of-the-art performance on joint video x8 interpolation and deblurring, outperforming existing methods by a significant margin on both synthetic and real-world datasets.
The method demonstrates robust performance under different exposure time settings, effectively handling the challenges posed by unknown blur.
VIDUE exhibits promising results on the challenging x16 interpolation task, surpassing previous approaches by more than 1.5 dB in PSNR. |
The computational complexity of VIDUE could be further reduced for practical applications.
Future work can explore optimizing VIDUE using perceptual quality metrics to improve the temporal coherence of the reconstructed videos. |
video frame interpolation, video deblurring, unknown exposure time, adaptive computation, supervised contrastive learning |
2303.14968
Report |
Blind Image Quality Assessment via Vision-Language Correspondence: A Multitask Learning Perspective |
Weixia Zhang, Guangtao Zhai, Ying Wei, Xiaokang Yang, Kede Ma |
We aim at advancing blind image quality assessment (BIQA), which predicts the
human perception of image quality without any reference information. We develop
a general and automated multitask learning scheme for BIQA to exploit auxiliary
knowledge from other tasks, in a way that the model parameter sharing and the
loss weighting are determined automatically. Specifically, we first describe
all candidate label combinations (from multiple tasks) using a textual
template, and compute the joint probability from the cosine similarities of the
visual-textual embeddings. Predictions of each task can be inferred from the
joint distribution, and optimized by carefully designed loss functions. Through
comprehensive experiments on learning three tasks - BIQA, scene classification,
and distortion type identification, we verify that the proposed BIQA method 1)
benefits from the scene classification and distortion type identification tasks
and outperforms the state-of-the-art on multiple IQA datasets, 2) is more
robust in the group maximum differentiation competition, and 3) realigns the
quality annotations from different IQA datasets more effectively. The source
code is available at https://github.com/zwx8981/LIQE. |
This paper proposes LIQE, a novel blind image quality assessment (BIQA) method that leverages multitask learning via vision-language correspondence. |
Addressing the challenge of limited human-annotated data in BIQA, this work explores incorporating auxiliary knowledge from other vision tasks like scene classification and distortion type identification to improve quality prediction accuracy. |
LIQE utilizes a pre-trained CLIP model to obtain visual and textual embeddings for input images and textual descriptions of scene, distortion, and quality. It jointly optimizes a multitask objective function with dynamically weighted fidelity losses for quality prediction, scene classification, and distortion type identification. |
LIQE outperforms state-of-the-art BIQA methods on multiple benchmark datasets, demonstrating the benefits of multitask learning with vision-language correspondence.
It exhibits improved generalizability in cross-dataset evaluations and the group maximum differentiation (gMAD) competition, indicating better perceptual scale alignment across datasets.
The method effectively leverages distortion type identification as an auxiliary task to aid BIQA, suggesting a cooperative relationship between them. |
The performance of LIQE on algorithm-dependent distortions is limited, suggesting a need for task-specific training in such scenarios.
Future work includes exploring other auxiliary tasks and more sophisticated loss weighting schemes to further enhance BIQA performance. |
blind image quality assessment, multitask learning, vision-language correspondence, clip, perceptual quality |
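The step from cosine similarities to per-task predictions in the LIQE entry can be sketched as a softmax over all candidate label combinations followed by marginalization. The temperature value, the toy labels, the use of the expected quality level as the final score, and all function names are assumptions for illustration.

```python
import math
from collections import defaultdict


def joint_distribution(similarities, temperature=0.01):
    """Softmax over cosine similarities of (scene, distortion, quality) prompts."""
    top = max(similarities.values())  # subtract the max for numerical stability
    exps = {k: math.exp((s - top) / temperature) for k, s in similarities.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def marginal(joint, axis):
    """Sum the joint over all labels except the one at position `axis`."""
    out = defaultdict(float)
    for labels, p in joint.items():
        out[labels[axis]] += p
    return dict(out)


def expected_quality(joint):
    """Probability-weighted quality level (quality is at position 2)."""
    return sum(level * p for level, p in marginal(joint, 2).items())


toy_similarities = {
    ("animal", "blur", 1): 0.20,
    ("animal", "blur", 5): 0.20,
    ("human", "noise", 1): 0.20,
    ("human", "noise", 5): 0.20,
}
joint = joint_distribution(toy_similarities)
score = expected_quality(joint)
```

Scene and distortion predictions come from the same joint by calling `marginal(joint, 0)` or `marginal(joint, 1)`, so all three tasks share one set of similarities.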
2303.14960
Report |
Ambiguity-Resistant Semi-Supervised Learning for Dense Object Detection |
Chang Liu, Weiming Zhang, Xiangru Lin, Wei Zhang, Xiao Tan, Junyu Han, Xiaomao Li, Errui Ding, Jingdong Wang |
With basic Semi-Supervised Object Detection (SSOD) techniques, one-stage
detectors generally obtain limited gains compared with two-stage detectors.
We experimentally find that the root lies in two kinds of ambiguities: (1)
Selection ambiguity that selected pseudo labels are less accurate, since
classification scores cannot properly represent the localization quality. (2)
Assignment ambiguity that samples are matched with improper labels in
pseudo-label assignment, as the strategy is misguided by missed objects and
inaccurate pseudo boxes. To tackle these problems, we propose
Ambiguity-Resistant Semi-supervised Learning (ARSL) for one-stage detectors.
Specifically, to alleviate the selection ambiguity, Joint-Confidence Estimation
(JCE) is proposed to jointly quantify the classification and localization
quality of pseudo labels. As for the assignment ambiguity, Task-Separation
Assignment (TSA) is introduced to assign labels based on pixel-level
predictions rather than unreliable pseudo boxes. It employs a
"divide-and-conquer" strategy and separately exploits positives for the
classification and localization task, which is more robust to the assignment
ambiguity. Comprehensive experiments demonstrate that ARSL effectively
mitigates the ambiguities and achieves state-of-the-art SSOD performance on MS
COCO and PASCAL VOC. Codes can be found at
https://github.com/PaddlePaddle/PaddleDetection. |
This paper proposes Ambiguity-Resistant Semi-supervised Learning (ARSL) to address the limited performance of one-stage detectors in semi-supervised object detection. |
One-stage detectors, despite their efficiency, lag behind two-stage counterparts in semi-supervised object detection due to ambiguities in pseudo-label selection and assignment. |
ARSL tackles two ambiguities: (1) Selection ambiguity is mitigated by Joint-Confidence Estimation (JCE), jointly quantifying classification and localization quality. (2) Assignment ambiguity is addressed by Task-Separation Assignment (TSA), assigning labels based on pixel-level predictions, separately leveraging potential positives for classification and localization. |
ARSL significantly boosts one-stage detector performance in semi-supervised settings, outperforming previous methods on COCO-Standard.
JCE effectively reduces selection ambiguity, reflected in improved Top-5 IoU and correlation between classification and localization quality.
TSA, assigning labels based on dense predictions instead of boxes, mitigates assignment ambiguity by increasing true positives and reducing false positives. |
The slight increase in false positives in TSA is attributed to treating all ambiguous candidates as positives for classification.
Future work can explore more sophisticated strategies to further refine the selection of potential positives in TSA, potentially by incorporating contextual information or leveraging relationships between objects. |
semi-supervised object detection, one-stage detectors, pseudo-labeling, joint-confidence estimation, task-separation assignment |
2303.14707
Report |
Clean-NeRF: Reformulating NeRF to account for View-Dependent Observations |
Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang |
While Neural Radiance Fields (NeRFs) have achieved unprecedented novel view
synthesis results, they struggle with large-scale cluttered scenes that have
sparse input views and highly view-dependent appearances. Specifically,
existing NeRF-based models tend to produce blurry renderings and inaccurate
volumetric reconstructions, with many reconstruction errors appearing as foggy
"floaters" hovering within the entire volume of an opaque 3D scene. Such
inaccuracies impede NeRF's potential for accurate 3D NeRF registration, object
detection, segmentation, etc., which possibly explains why little research has
so far directly addressed these fundamental 3D computer vision problems. This
paper analyzes NeRF's struggles in such settings and proposes
Clean-NeRF for accurate 3D reconstruction and novel view rendering in complex
scenes. Our key insights consist of enforcing effective appearance and geometry
constraints, which are absent in the conventional NeRF reconstruction, by 1)
automatically detecting and modeling view-dependent appearances in the training
views to prevent them from interfering with density estimation, complemented
by 2) a geometric correction procedure performed on each traced ray
during inference. Clean-NeRF can be implemented as a plug-in that can
immediately benefit existing NeRF-based methods without additional input. Codes
will be released. |
This paper proposes Clean-NeRF, an extension to NeRF for accurate 3D reconstruction and novel view rendering in complex scenes with sparse view inputs. |
Existing NeRF-based models struggle with blurry rendering and inaccurate volumetric reconstruction in complex scenes, hindering their application in 3D computer vision tasks. |
Clean-NeRF enforces appearance and geometry constraints by 1) decomposing and modeling view-dependent and view-independent color components during training, and 2) applying a geometry correction procedure to eliminate density errors during inference. |
Effectively removes "floaters" (density errors) in challenging indoor scenes.
Recovers intricate object details, especially for glossy surfaces.
Outperforms baselines in quantitative metrics such as PSNR, SSIM, and LPIPS. |
Assumes fixed lighting conditions and no semi-transparent objects.
May misinterpret consistently appearing specular highlights as view-independent colors. |
neural radiance fields, nerf, 3d reconstruction, novel view synthesis, appearance decomposition |
2303.14662
Report |
OTAvatar: One-shot Talking Face Avatar with Controllable Tri-plane Rendering |
Zhiyuan Ma, Xiangyu Zhu, Guojun Qi, Zhen Lei, Lei Zhang |
Controllability, generalizability and efficiency are the major objectives of
constructing face avatars represented by neural implicit field. However,
existing methods have not managed to accommodate the three requirements
simultaneously. They either focus on static portraits, restricting the
representation ability to a specific subject, or suffer from substantial
computational cost, limiting their flexibility. In this paper, we propose
One-shot Talking face Avatar (OTAvatar), which constructs face avatars by a
generalized controllable tri-plane rendering solution so that each personalized
avatar can be constructed from only one portrait as the reference.
Specifically, OTAvatar first inverts a portrait image to a motion-free identity
code. Second, the identity code and a motion code are utilized to modulate an
efficient CNN to generate a tri-plane formulated volume, which encodes the
subject in the desired motion. Finally, volume rendering is employed to
generate an image in any view. The core of our solution is a novel
decoupling-by-inverting strategy that disentangles identity and motion in the
latent code via optimization-based inversion. Benefiting from the efficient
tri-plane representation, we achieve controllable rendering of generalized face
avatar at $35$ FPS on A100. Experiments show promising performance of
cross-identity reenactment on subjects out of the training set and better 3D
consistency. |
This paper proposes OTAvatar, a one-shot talking face avatar generation method based on controllable tri-plane rendering. |
Existing methods for creating talking face avatars struggle to balance controllability, generalizability, and efficiency. They are either limited to specific individuals, computationally expensive, or fail to produce high-quality animations. |
OTAvatar leverages a pre-trained 3D face generator and introduces a motion controller module. It utilizes a decoupling-by-inverting strategy to disentangle identity and motion in the latent code during training and inference. This allows for one-shot avatar creation from a single portrait and animation driven by 3DMM coefficients. |
Achieves one-shot reconstruction and animation of photo-realistic face avatars.
Demonstrates superior performance in cross-identity reenactment and multi-view consistency compared to 2D and 3D baselines.
Enables real-time inference speed at 35 FPS on A100 GPU due to efficient tri-plane representation and compact architecture. |
Relies on accurate 3DMM coefficient extraction for optimal performance.
Future work could explore alternative motion representations beyond 3DMM coefficients. |
talking face avatar, one-shot learning, volume rendering, 3d face animation, generative adversarial networks |
2303.14651
Report |
You Only Segment Once: Towards Real-Time Panoptic Segmentation |
Jie Hu, Linyan Huang, Tianhe Ren, Shengchuan Zhang, Rongrong Ji, Liujuan Cao |
In this paper, we propose YOSO, a real-time panoptic segmentation framework.
YOSO predicts masks via dynamic convolutions between panoptic kernels and image
feature maps, in which you only need to segment once for both instance and
semantic segmentation tasks. To reduce the computational overhead, we design a
feature pyramid aggregator for the feature map extraction, and a separable
dynamic decoder for the panoptic kernel generation. The aggregator
re-parameterizes interpolation-first modules in a convolution-first way, which
significantly speeds up the pipeline without any additional costs. The decoder
performs multi-head cross-attention via separable dynamic convolution for
better efficiency and accuracy. To the best of our knowledge, YOSO is the first
real-time panoptic segmentation framework that delivers competitive performance
compared to state-of-the-art models. Specifically, YOSO achieves 46.4 PQ, 45.6
FPS on COCO; 52.5 PQ, 22.6 FPS on Cityscapes; 38.0 PQ, 35.4 FPS on ADE20K; and
34.1 PQ, 7.1 FPS on Mapillary Vistas. Code is available at
https://github.com/hujiecpp/YOSO. |
YOSO, a real-time panoptic segmentation framework that predicts masks via dynamic convolutions between panoptic kernels and image feature maps, allowing for simultaneous instance and semantic segmentation. |
Real-time panoptic segmentation is challenging due to computationally intensive separate branches for semantic and instance segmentation. Existing methods struggle to achieve both speed and accuracy. |
YOSO employs a feature pyramid aggregator with convolution-first aggregation (CFA) for efficient feature extraction. It utilizes a separable dynamic decoder with separable dynamic convolution attention (SDCA) for lightweight and accurate panoptic kernel generation. |
YOSO achieves competitive speed and accuracy compared to state-of-the-art models on COCO, Cityscapes, ADE20K, and Mapillary Vistas datasets.
CFA significantly reduces computational burden without re-training or sacrificing performance.
SDCA outperforms traditional multi-head cross-attention in both accuracy and efficiency for panoptic kernel generation. |
YOSO's performance on instance segmentation of smaller objects can be further improved.
Exploring alternative lightweight backbones for enhanced efficiency. |
panoptic segmentation, real-time, dynamic convolution, feature pyramid, separable attention |
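For the YOSO entry above, the idea of predicting masks via dynamic convolutions between panoptic kernels and image feature maps can be illustrated with a minimal sketch. It assumes each kernel acts as a 1x1 convolution over a C-channel feature map and that every pixel is assigned to the kernel with the highest response; the function names, shapes, and the argmax assignment are hypothetical simplifications and are not taken from the YOSO code base.

```python
# Hypothetical sketch: names and shapes are illustrative, not the YOSO implementation.

def predict_masks(kernels, features):
    """Apply each kernel as a 1x1 dynamic convolution over the feature map.

    kernels:  N lists of C floats (one panoptic kernel per predicted segment)
    features: C channels, each an H x W grid of floats
    returns:  N mask-logit grids of size H x W
    """
    height, width = len(features[0]), len(features[0][0])
    masks = []
    for kernel in kernels:
        mask = [[sum(kernel[c] * features[c][y][x] for c in range(len(kernel)))
                 for x in range(width)]
                for y in range(height)]
        masks.append(mask)
    return masks


def panoptic_labels(masks):
    """Assign each pixel to the kernel with the highest mask logit (a simplification)."""
    height, width = len(masks[0]), len(masks[0][0])
    return [[max(range(len(masks)), key=lambda n: masks[n][y][x])
             for x in range(width)]
            for y in range(height)]


features = [
    [[1.0, 0.0], [1.0, 0.0]],  # channel 0 responds on the left column
    [[0.0, 1.0], [0.0, 1.0]],  # channel 1 responds on the right column
]
kernels = [[1.0, -1.0], [-1.0, 1.0]]  # kernel 0 favors channel 0, kernel 1 favors channel 1
masks = predict_masks(kernels, features)
labels = panoptic_labels(masks)
print(labels)  # [[0, 1], [0, 1]]
```

Because things and stuff share the same set of kernels, a single pass over the feature map produces every segment, which is the sense in which the model segments only once.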
2303.14541
Report |
UnScene3D: Unsupervised 3D Instance Segmentation for Indoor Scenes |
David Rozenberszki, Or Litany, Angela Dai |
3D instance segmentation is fundamental to geometric understanding of the
world around us. Existing methods for instance segmentation of 3D scenes rely
on supervision from expensive, manual 3D annotations. We propose UnScene3D, the
first fully unsupervised 3D learning approach for class-agnostic 3D instance
segmentation of indoor scans. UnScene3D first generates pseudo masks by
leveraging self-supervised color and geometry features to find potential object
regions. We operate on a basis of geometric oversegmentation, enabling
efficient representation and learning on high-resolution 3D data. The coarse
proposals are then refined through self-training our model on its predictions.
Our approach improves over state-of-the-art unsupervised 3D instance
segmentation methods by more than 300% Average Precision score, demonstrating
effective instance segmentation even in challenging, cluttered 3D scenes. |
This paper introduces UnScene3D, the first fully unsupervised 3D learning approach for class-agnostic 3D instance segmentation of indoor scans. |
3D instance segmentation is crucial for scene understanding but existing methods rely on expensive manual annotations. |
The method generates pseudo masks by leveraging self-supervised color and geometry features, then refines them through self-training on a 3D transformer-based model. |
UnScene3D significantly outperforms clustering-based unsupervised methods (over 300% improvement in AP).
The method effectively leverages both color and geometric signals for improved pseudo mask generation.
Self-training significantly enhances the density and completeness of instance proposals. |
Graph coarsening currently relies on a mesh representation; the approach could be extended to other 3D representations.
Small objects might be missed during pseudo annotation generation. |
3d instance segmentation, unsupervised learning, self-training, geometric primitives, rgb-d scans |
2303.14536
Report |
SUDS: Scalable Urban Dynamic Scenes |
Haithem Turki, Jason Y. Zhang, Francesco Ferroni, Deva Ramanan |
We extend neural radiance fields (NeRFs) to dynamic large-scale urban scenes.
Prior work tends to reconstruct single video clips of short durations (up to 10
seconds). Two reasons are that such methods (a) tend to scale linearly with the
number of moving objects and input videos because a separate model is built for
each and (b) tend to require supervision via 3D bounding boxes and panoptic
labels, obtained manually or via category-specific models. As a step towards
truly open-world reconstructions of dynamic cities, we introduce two key
innovations: (a) we factorize the scene into three separate hash table data
structures to efficiently encode static, dynamic, and far-field radiance
fields, and (b) we make use of unlabeled target signals consisting of RGB
images, sparse LiDAR, off-the-shelf self-supervised 2D descriptors, and most
importantly, 2D optical flow.
Operationalizing such inputs via photometric, geometric, and feature-metric
reconstruction losses enables SUDS to decompose dynamic scenes into the static
background, individual objects, and their motions. When combined with our
multi-branch table representation, such reconstructions can be scaled to tens
of thousands of objects across 1.2 million frames from 1700 videos spanning
geospatial footprints of hundreds of kilometers, (to our knowledge) the largest
dynamic NeRF built to date.
We present qualitative initial results on a variety of tasks enabled by our
representations, including novel-view synthesis of dynamic urban scenes,
unsupervised 3D instance segmentation, and unsupervised 3D cuboid detection. To
compare to prior work, we also evaluate on KITTI and Virtual KITTI 2,
surpassing state-of-the-art methods that rely on ground truth 3D bounding box
annotations while being 10x quicker to train. |
The paper introduces SUDS (Scalable Urban Dynamic Scenes), a novel approach extending neural radiance fields (NeRFs) to reconstruct large-scale, dynamic urban environments from multi-view videos, achieving scalability and handling dynamic elements like vehicles and pedestrians. |
Existing NeRF methods struggle with city-scale dynamic scenes due to limitations in handling numerous moving objects and reliance on labeled data like 3D bounding boxes. SUDS addresses these challenges, aiming for open-world dynamic city reconstructions. |
SUDS leverages a three-branch hash table structure to represent static background, dynamic objects, and far-field environment. It utilizes unlabeled inputs: RGB images, sparse LiDAR, optical flow, and self-supervised 2D descriptors. Spatial partitioning allows independent model training for different city areas, enabling scalability. |
SUDS builds what is, to the authors' knowledge, the largest dynamic NeRF to date, covering over 100 square kilometers.
It outperforms baseline methods on City-1M dataset and standard benchmarks (KITTI, Virtual KITTI 2) for novel view synthesis, even with limited training views.
The learned representation enables downstream tasks like unsupervised 3D instance segmentation and cuboid detection. |
Current implementation doesn't extrapolate object motion beyond the captured video boundaries.
The method relies on accurate camera pose estimation; joint optimization of camera parameters during training remains underexplored. |
neural radiance fields, dynamic scene reconstruction, large-scale 3d modeling, unsupervised learning, urban environments |
2303.14516
Report |
OVeNet: Offset Vector Network for Semantic Segmentation |
Stamatis Alexandropoulos, Christos Sakaridis, Petros Maragos |
Semantic segmentation is a fundamental task in visual scene understanding. We
focus on the supervised setting, where ground-truth semantic annotations are
available. Based on knowledge about the high regularity of real-world scenes,
we propose a method for improving class predictions by learning to selectively
exploit information from neighboring pixels. In particular, our method is based
on the prior that for each pixel, there is a seed pixel in its close
neighborhood sharing the same prediction with the former. Motivated by this
prior, we design a novel two-head network, named Offset Vector Network
(OVeNet), which generates both standard semantic predictions and a dense 2D
offset vector field indicating the offset from each pixel to the respective
seed pixel, which is used to compute an alternative, seed-based semantic
prediction. The two predictions are adaptively fused at each pixel using a
learnt dense confidence map for the predicted offset vector field. We supervise
offset vectors indirectly via optimizing the seed-based prediction and via a
novel loss on the confidence map. Compared to the baseline state-of-the-art
architectures HRNet and HRNet+OCR on which OVeNet is built, the latter achieves
significant performance gains on three prominent benchmarks for semantic
segmentation, namely Cityscapes, ACDC and ADE20K. Code is available at
https://github.com/stamatisalex/OVeNet |
OVeNet, a novel two-head network for semantic segmentation, leverages a learnt offset vector field to exploit information from neighboring pixels, thereby enhancing class predictions. |
Existing methods often misclassify pixels, especially near boundaries, due to overlooking the regularity of real-world scenes. OVeNet addresses this by learning to selectively use information from neighboring pixels. |
OVeNet comprises two heads: one predicts semantic logits, the other predicts an offset vector field and a confidence map. Offsets resample logits to generate a seed-based prediction, fused with the initial prediction using the confidence map. |
OVeNet significantly outperforms HRNet and HRNet+OCR baselines on Cityscapes, ACDC, and ADE20K.
It demonstrates significant improvement in per-class accuracy, particularly in challenging conditions like fog and night in the ACDC dataset.
The confidence map effectively guides the fusion of predictions, improving boundary delineation and overall segmentation quality. |
The model's performance is sensitive to the offset vector length, requiring careful hyperparameter tuning.
The current implementation is limited by memory constraints, leading to a reduced number of blocks in the offset head. |
semantic segmentation, offset vector network, seed pixel, deep learning, computer vision |
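For the OVeNet entry above, the seed-based prediction and its fusion with the initial prediction can be sketched as follows. This is a minimal illustration under stated assumptions: offsets are rounded to the nearest pixel and clamped to the image (rather than interpolated), and the fusion is a per-pixel linear blend weighted by the confidence map. The function name and data layout are hypothetical.

```python
# Hypothetical sketch: nearest-neighbor resampling and linear blending are assumptions.

def seed_based_fusion(logits, offsets, confidence):
    """Fuse initial and seed-based class scores at every pixel.

    logits:     H x W x K per-pixel class scores from the semantic head
    offsets:    H x W pairs (dy, dx) pointing from each pixel to its seed pixel
    confidence: H x W values in [0, 1]; weight given to the seed-based prediction
    """
    height, width = len(logits), len(logits[0])
    fused = []
    for y in range(height):
        row = []
        for x in range(width):
            dy, dx = offsets[y][x]
            # Look up the seed pixel, clamped to stay inside the image.
            sy = min(max(y + round(dy), 0), height - 1)
            sx = min(max(x + round(dx), 0), width - 1)
            seed = logits[sy][sx]
            c = confidence[y][x]
            row.append([(1.0 - c) * a + c * b for a, b in zip(logits[y][x], seed)])
        fused.append(row)
    return fused


def argmax_classes(scores):
    """Pick the highest-scoring class per pixel."""
    return [[max(range(len(px)), key=lambda k: px[k]) for px in row] for row in scores]


# A 1 x 3 image with 2 classes; the middle pixel is initially misclassified as class 1.
logits = [[[2.0, 0.0], [0.4, 0.6], [2.0, 0.0]]]
offsets = [[(0, 0), (0, -1), (0, 0)]]  # the middle pixel points to its left neighbor
confidence = [[0.0, 1.0, 0.0]]         # only the middle pixel trusts its seed
fused = seed_based_fusion(logits, offsets, confidence)
print(argmax_classes(logits))  # [[0, 1, 0]]
print(argmax_classes(fused))   # [[0, 0, 0]]
```

In the example, the confident offset lets the middle pixel inherit the scores of its neighbor, which is the kind of correction near boundaries that the entry describes.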
2303.14471
Report |
HQ3DAvatar: High Quality Controllable 3D Head Avatar |
Kartik Teotia, Mallikarjun B R, Xingang Pan, Hyeongwoo Kim, Pablo Garrido, Mohamed Elgharib, Christian Theobalt |
Multi-view volumetric rendering techniques have recently shown great
potential in modeling and synthesizing high-quality head avatars. A common
approach to capture full head dynamic performances is to track the underlying
geometry using a mesh-based template or 3D cube-based graphics primitives.
While these model-based approaches achieve promising results, they often fail
to learn complex geometric details such as the mouth interior, hair, and
topological changes over time. This paper presents a novel approach to building
highly photorealistic digital head avatars. Our method learns a canonical space
via an implicit function parameterized by a neural network. It leverages
multiresolution hash encoding in the learned feature space, allowing for
high-quality, faster training and high-resolution rendering. At test time, our
method is driven by a monocular RGB video. Here, an image encoder extracts
face-specific features that also condition the learnable canonical space. This
encourages deformation-dependent texture variations during training. We also
propose a novel optical flow based loss that ensures correspondences in the
learned canonical space, thus encouraging artifact-free and temporally
consistent renderings. We show results on challenging facial expressions and
show free-viewpoint renderings at interactive real-time rates for medium image
resolutions. Our method outperforms all existing approaches, both visually and
numerically. We will release our multiple-identity dataset to encourage further
research. Our Project page is available at:
https://vcai.mpi-inf.mpg.de/projects/HQ3DAvatar/ |
This paper presents HQ3DAvatar, a novel method for creating high-quality, controllable 3D head avatars from multi-view video data that can be driven by a monocular RGB video at test time. |
Creating realistic and controllable digital humans is crucial for various applications like VR/AR, VFX, and media production. |
The method learns a canonical space via an implicit function parameterized by a neural network, leveraging multiresolution hash encoding for efficiency and high resolution. It utilizes an image encoder to extract face-specific features that condition the canonical space, enabling deformation-dependent texture variations. A novel optical flow based loss ensures temporal coherence and reduces artifacts. |
HQ3DAvatar achieves state-of-the-art photorealism, outperforming existing methods in visual quality and accuracy, especially in challenging regions like hair and mouth interior.
The method enables dynamic free-view synthesis from arbitrary monocular viewpoints, with promising results for generalization to in-the-wild videos.
It allows for high-resolution rendering, showcasing the first 2K results in the literature, and enables real-time performance at medium resolutions (480x270). |
The method may produce artifacts in regions with strong disocclusions, like the tongue moving out of the mouth.
The current solution is person-specific, and future work could explore generalization to unseen identities. |
volumetric rendering, implicit representations, neural radiance fields, neural avatars, free-viewpoint rendering |
2303.14420
Report |
Human Preference Score: Better Aligning Text-to-Image Models with Human Preference |
Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, Hongsheng Li |
Recent years have witnessed a rapid growth of deep generative models, with
text-to-image models gaining significant attention from the public. However,
existing models often generate images that do not align well with human
preferences, such as awkward combinations of limbs and facial expressions. To
address this issue, we collect a dataset of human choices on generated images
from the Stable Foundation Discord channel. Our experiments demonstrate that
current evaluation metrics for generative models do not correlate well with
human choices. Thus, we train a human preference classifier with the collected
dataset and derive a Human Preference Score (HPS) based on the classifier.
Using HPS, we propose a simple yet effective method to adapt Stable Diffusion
to better align with human preferences. Our experiments show that HPS
outperforms CLIP in predicting human choices and has good generalization
capability toward images generated from other models. By tuning Stable
Diffusion with the guidance of HPS, the adapted model is able to generate
images that are more preferred by human users. The project page is available
here: https://tgxs002.github.io/align_sd_web/ . |
This paper introduces Human Preference Score (HPS) for aligning text-to-image models with human preferences by leveraging a large-scale dataset of human choices on generated images. |
Existing evaluation metrics like IS, FID, and CLIP score often fail to capture subtle human preferences in generated images, particularly concerning aspects like awkward compositions. This misalignment necessitates a new metric that better reflects human choices. |
The authors collect a large dataset of human choices on images generated by Stable Diffusion. They then fine-tune a CLIP model on this dataset to develop a human preference classifier, from which HPS is derived. This HPS is then used to adapt Stable Diffusion by explicitly training it to distinguish between preferred and non-preferred images. |
Existing evaluation metrics (IS, FID, CLIP score) show poor correlation with human choices on the collected dataset.
The fine-tuned CLIP model, used to derive HPS, demonstrates superior performance in predicting human preferences compared to the original CLIP score.
Adapting Stable Diffusion using HPS guidance leads to the generation of images that are significantly more preferred by human users, as evidenced by user studies. |
The collected dataset, though large, represents preferences from a limited user group active on the Stable Foundation Discord channel and might not reflect global diversity in preferences.
The use of prompts engineered by experienced Stable Diffusion users might introduce bias and deviate from natural language patterns. |
text-to-image generation, human preference learning, stable diffusion, evaluation metrics, aesthetic quality |
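For the Human Preference Score entry above, a preference classifier built on CLIP-style embeddings might look like the sketch below. It assumes the text and image embeddings are already computed, that the probability of an image being chosen is a softmax over temperature-scaled cosine similarities among the candidates for one prompt, and that the score is the cosine similarity scaled by 100. The temperature value, the scaling factor, and all function names are assumptions of this sketch, not details confirmed by the entry.

```python
import math

# Hypothetical sketch: embeddings are plain lists; temperature and scaling are assumed.

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def preference_probabilities(text_emb, image_embs, temperature=0.01):
    """Softmax over similarities: predicted chance that each candidate is chosen."""
    sims = [cosine(text_emb, e) / temperature for e in image_embs]
    top = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def preference_score(text_emb, image_emb):
    """Scalar score for one prompt-image pair (assumed form: 100 * cosine)."""
    return 100.0 * cosine(text_emb, image_emb)


text_emb = [1.0, 0.0]
image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = preference_probabilities(text_emb, image_embs)
best = probs.index(max(probs))
print(best, preference_score(text_emb, image_embs[best]))
```

Fine-tuning would adjust the encoders so that the image a user actually picked receives the highest probability among the candidates generated for the same prompt.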
2303.14412
Report |
Freestyle Layout-to-Image Synthesis |
Han Xue, Zhiwu Huang, Qianru Sun, Li Song, Wenjun Zhang |
Typical layout-to-image synthesis (LIS) models generate images for a closed
set of semantic classes, e.g., 182 common objects in COCO-Stuff. In this work,
we explore the freestyle capability of the model, i.e., how far can it generate
unseen semantics (e.g., classes, attributes, and styles) onto a given layout,
and call the task Freestyle LIS (FLIS). Thanks to the development of
large-scale pre-trained language-image models, a number of discriminative
models (e.g., image classification and object detection) trained on limited
base classes are empowered with the ability of unseen class prediction.
Inspired by this, we opt to leverage large-scale pre-trained text-to-image
diffusion models to achieve the generation of unseen semantics. The key
challenge of FLIS is how to enable the diffusion model to synthesize images
from a specific layout which very likely violates its pre-learned knowledge,
e.g., the model never sees "a unicorn sitting on a bench" during its
pre-training. To this end, we introduce a new module called Rectified
Cross-Attention (RCA) that can be conveniently plugged in the diffusion model
to integrate semantic masks. This "plug-in" is applied in each cross-attention
layer of the model to rectify the attention maps between image and text tokens.
The key idea of RCA is to enforce each text token to act on the pixels in a
specified region, allowing us to freely put a wide variety of semantics from
pre-trained knowledge (which is general) onto the given layout (which is
specific). Extensive experiments show that the proposed diffusion network
produces realistic and freestyle layout-to-image generation results with
diverse text inputs, which has a high potential to spawn a bunch of interesting
applications. Code is available at https://github.com/essunny310/FreestyleNet. |
This paper proposes Freestyle Layout-to-Image Synthesis (FLIS) which generates images with unseen semantics (classes, attributes, styles) onto a layout by leveraging pre-trained text-to-image diffusion models. |
Existing layout-to-image synthesis (LIS) methods are limited to generating images with semantics from a fixed set of classes. FLIS breaks this limitation and allows for more creative and diverse image generation. |
The paper introduces Rectified Cross-Attention (RCA) that integrates layout information into the pre-trained diffusion model. RCA rectifies the attention maps between image and text tokens, forcing each text token to act on pixels within its corresponding mask region. |
FreestyleNet can synthesize unseen objects, bind new attributes to objects, and render images in various styles.
FreestyleNet outperforms state-of-the-art LIS methods in FID score, indicating high visual quality.
RCA effectively enforces spatial alignment between generated images and input layouts. |
The model struggles to generate rare or unreasonable semantics that are not well-represented in the pre-trained knowledge.
The method requires a user-defined set of category names which can be challenging to obtain for long-tailed datasets. |
image generation, layout-to-image synthesis, text-to-image synthesis, diffusion models, cross-attention |
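For the Freestyle LIS entry above, the effect of Rectified Cross-Attention can be sketched as a masked softmax: for each image pixel, attention logits of text tokens whose layout region does not contain that pixel are excluded before normalization. The sketch assumes every pixel is covered by at least one token region; the function name and data layout are hypothetical and do not reproduce the FreestyleNet code.

```python
import math

# Hypothetical sketch: a masked softmax standing in for one rectified attention map.

def rectified_cross_attention(scores, token_masks):
    """Restrict each text token to the pixels of its layout region.

    scores:      P x T raw attention logits (P image pixels, T text tokens)
    token_masks: T lists of P booleans; token_masks[t][p] is True if token t
                 is allowed to act on pixel p
    returns:     P x T attention weights; each row sums to 1 over allowed tokens
    """
    weights = []
    for p, row in enumerate(scores):
        allowed = [t for t in range(len(row)) if token_masks[t][p]]
        if not allowed:
            raise ValueError(f"pixel {p} is not covered by any token region")
        top = max(row[t] for t in allowed)  # stabilize the exponentials
        exps = [math.exp(row[t] - top) if token_masks[t][p] else 0.0
                for t in range(len(row))]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Two pixels and two tokens; the raw logits favor token 1 everywhere.
scores = [[0.0, 5.0], [0.0, 5.0]]
token_masks = [
    [True, False],   # token 0 (e.g. "unicorn") owns pixel 0
    [False, True],   # token 1 (e.g. "bench") owns pixel 1
]
attention = rectified_cross_attention(scores, token_masks)
print(attention)  # [[1.0, 0.0], [0.0, 1.0]]
```

Even though the unrectified logits prefer the second token at both pixels, the layout mask forces the first pixel to attend only to the token assigned to its region.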
2303.14407
Report |
LPFF: A Portrait Dataset for Face Generators Across Large Poses |
Yiqian Wu, Jing Zhang, Hongbo Fu, Xiaogang Jin |
The creation of 2D realistic facial images and 3D face shapes using
generative networks has been a hot topic in recent years. Existing face
generators exhibit exceptional performance on faces in small to medium poses
(with respect to frontal faces) but struggle to produce realistic results for
large poses. The distorted rendering results on large poses in 3D-aware
generators further show that the generated 3D face shapes are far from the
distribution of 3D faces in reality. We find that the above issues are caused
by the training dataset's pose imbalance.
In this paper, we present LPFF, a large-pose Flickr face dataset comprised of
19,590 high-quality real large-pose portrait images. We utilize our dataset to
train a 2D face generator that can process large-pose face images, as well as a
3D-aware generator that can generate realistic human face geometry. To better
validate our pose-conditional 3D-aware generators, we develop a new FID measure
to evaluate the 3D-level performance. Through this novel FID measure and other
experiments, we show that LPFF can help 2D face generators extend their latent
space and better manipulate the large-pose data, and help 3D-aware face
generators achieve better view consistency and more realistic 3D reconstruction
results. |
This paper introduces LPFF, a large-pose face dataset containing 19,590 high-quality images, designed to address the pose imbalance in existing datasets used for training face generators. |
Existing face generators, both 2D and 3D-aware, struggle to generate realistic results for faces at large poses due to the lack of sufficient large-pose training data. |
The authors developed a pipeline to collect, process, and filter large-pose face images from Flickr. They then used this dataset to train a 2D face generator (StyleGAN2-ada) and a 3D-aware generator (EG3D). A new FID measure for 3D-aware generators is also proposed for evaluation. |
LPFF enables StyleGAN2-ada to generate and manipulate large-pose faces more realistically.
LPFF leads to more realistic face geometry generation in EG3D, with better view consistency and higher quality rendering at large poses.
A novel FID measure for evaluating pose-conditional 3D-aware generators is proposed. |
The dataset still suffers from semantic attribute imbalance (e.g., smiling faces are more prevalent in frontal views).
The proposed processing pipeline cannot handle extreme poses where the face is fully occluded. |
face generation, large-pose faces, dataset, generative adversarial networks (gans), 3d-aware generators |
2303.14389
Report |
MDTv2: Masked Diffusion Transformer is a Strong Image Synthesizer |
Shanghua Gao, Pan Zhou, Ming-Ming Cheng, Shuicheng Yan |
Despite its success in image synthesis, we observe that diffusion
probabilistic models (DPMs) often lack contextual reasoning ability to learn
the relations among object parts in an image, leading to a slow learning
process. To solve this issue, we propose a Masked Diffusion Transformer (MDT)
that introduces a mask latent modeling scheme to explicitly enhance the DPMs'
ability to contextual relation learning among object semantic parts in an
image. During training, MDT operates in the latent space to mask certain
tokens. Then, an asymmetric diffusion transformer is designed to predict masked
tokens from unmasked ones while maintaining the diffusion generation process.
Our MDT can reconstruct the full information of an image from its incomplete
contextual input, thus enabling it to learn the associated relations among
image tokens. We further improve MDT with a more efficient macro network
structure and training strategy, named MDTv2. Experimental results show that
MDTv2 achieves superior image synthesis performance, e.g., a new SOTA FID score
of 1.58 on the ImageNet dataset, and has more than 10x faster learning speed
than the previous SOTA DiT. The source code is released at
https://github.com/sail-sg/MDT. |
This paper introduces Masked Diffusion Transformer (MDT), a novel approach for enhancing contextual representation learning in diffusion probabilistic models (DPMs) for image synthesis, and its improved version MDTv2. |
DPMs often struggle to learn associated relations between object parts in an image, leading to slow training convergence. MDT addresses this by explicitly enhancing the contextual learning ability of DPMs. |
MDT employs a mask latent modeling scheme. It operates in the latent space, masking certain image tokens and using an asymmetric diffusion transformer to predict masked tokens from unmasked ones. MDTv2 further enhances MDT with long shortcuts in the encoder, dense input shortcuts in the decoder, and improved training strategies. |
MDTv2 demonstrates superior image synthesis performance compared to previous state-of-the-art methods, achieving a new SOTA FID score of 1.58 on ImageNet for class-conditional image generation.
MDT learns significantly faster during training, converging about 3x faster than DiT.
MDTv2 further accelerates training, achieving up to 5x faster convergence than MDT and up to 18x faster convergence than DiT. |
The optimal masking ratio needs further investigation for different model sizes and datasets.
Exploring the effectiveness of MDT on higher-resolution image generation and other downstream tasks is promising. |
image synthesis, diffusion models, masked modeling, transformer, contextual learning |
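For the MDT entry above, the token-masking step of mask latent modeling can be sketched as a random split of latent token positions into visible and masked sets. The masking ratio of 0.3, the uniform random sampling, and the function name are illustrative assumptions; the entry itself notes that the optimal ratio needs further investigation.

```python
import random

# Hypothetical sketch: uniform random masking over latent token positions.

def split_masked_tokens(num_tokens, mask_ratio, rng):
    """Return (visible_idx, masked_idx) for one training sample.

    The model would process the visible tokens and be trained to predict the
    tokens at the masked positions while still running the diffusion objective.
    """
    num_masked = int(num_tokens * mask_ratio)
    masked_idx = sorted(rng.sample(range(num_tokens), num_masked))
    masked_set = set(masked_idx)
    visible_idx = [i for i in range(num_tokens) if i not in masked_set]
    return visible_idx, masked_idx


rng = random.Random(0)  # seeded only to make the example repeatable
visible_idx, masked_idx = split_masked_tokens(16, 0.3, rng)
print(len(visible_idx), len(masked_idx))  # 12 4
```

Masking is a training-time device in this description; at inference all latent tokens would be visible.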
2303.14386
Report |
Prompt-Guided Transformers for End-to-End Open-Vocabulary Object Detection |
Hwanjun Song, Jihwan Bang |
Prompt-OVD is an efficient and effective framework for open-vocabulary object
detection that utilizes class embeddings from CLIP as prompts, guiding the
Transformer decoder to detect objects in both base and novel classes.
Additionally, our novel RoI-based masked attention and RoI pruning techniques
help leverage the zero-shot classification ability of the Vision
Transformer-based CLIP, resulting in improved detection performance at minimal
computational cost. Our experiments on the OV-COCO and OVLVIS datasets
demonstrate that Prompt-OVD achieves an impressive 21.2 times faster inference
speed than the first end-to-end open-vocabulary detection method (OV-DETR),
while also achieving higher APs than four two-stage-based methods operating
within similar inference time ranges. Code will be made available soon. |
Prompt-OVD, an end-to-end open-vocabulary object detection framework that uses class embeddings from CLIP as prompts to guide the Transformer decoder. |
Existing open-vocabulary object detection methods tend to overfit to base classes and suffer from slow inference; Prompt-OVD aims to address both limitations. |
The framework leverages prompt-based decoding, RoI-based masked attention, and RoI pruning to achieve efficient and effective open-vocabulary object detection. |
Prompt-OVD achieves a 21.2 times faster inference speed than OV-DETR, a previous end-to-end method.
Prompt-OVD achieves higher APs on base and novel classes compared to two-stage OVD methods with similar inference speeds.
The framework shows strong performance in both OV-COCO and OV-LVIS datasets. |
The performance gap between object detection and instance segmentation could be further reduced.
Exploring more advanced prediction ensemble strategies could enhance the synergy between CLIP and the detection model. |
open-vocabulary object detection, prompt-based decoding, vision transformer, clip, roi-based masked attention |
2303.14377
Report |
Unsupervised Domain Adaption with Pixel-level Discriminator for Image-aware Layout Generation |
Chenchen Xu, Min Zhou, Tiezheng Ge, Yuning Jiang, Weiwei Xu |
Layout is essential for graphic design and poster generation. Recently,
applying deep learning models to generate layouts has attracted increasing
attention. This paper focuses on using the GAN-based model conditioned on image
contents to generate advertising poster graphic layouts, which requires an
advertising poster layout dataset with paired product images and graphic
layouts. However, the paired images and layouts in the existing dataset are
collected by inpainting and annotating posters, respectively. There exists a
domain gap between inpainted posters (source domain data) and clean product
images (target domain data). Therefore, this paper combines unsupervised domain
adaption techniques to design a GAN with a novel pixel-level discriminator
(PD), called PDA-GAN, to generate graphic layouts according to image contents.
The PD is connected to the shallow level feature map and computes the GAN loss
for each input-image pixel. Both quantitative and qualitative evaluations
demonstrate that PDA-GAN can achieve state-of-the-art performances and generate
high-quality image-aware graphic layouts for advertising posters. |
This paper presents PDA-GAN, a novel GAN-based model for generating image-aware graphic layouts of advertising posters, leveraging unsupervised domain adaptation to bridge the domain gap between clean product images and inpainted images. |
Generating advertising posters often relies on a paired dataset of product images and graphic layouts. Existing datasets suffer from a domain gap due to using inpainted poster images, leading to unrealistic layouts. PDA-GAN addresses this gap. |
PDA-GAN incorporates a pixel-level discriminator (PD) connected to shallow feature maps. This PD analyzes pixel-level discrepancies to align the feature spaces of inpainted and clean product images, enabling the generation of layouts consistent with image content details. |
PDA-GAN significantly outperforms state-of-the-art methods in generating image-aware layouts, evidenced by both quantitative and qualitative results.
Compared to methods using Gaussian blur for domain adaptation, PDA-GAN achieves superior performance, particularly in metrics related to background complexity, subject occlusion, and product occlusion.
The pixel-level discriminator proves to be more effective than global or patch-level discriminators, highlighting the importance of fine-grained domain adaptation at the pixel level. |
One limitation is the potential bias towards source domain data during training due to additional reconstruction loss. Future work can explore more balanced training strategies.
Another limitation is the limited control over layout diversity and user constraints. Future research can focus on incorporating explicit controls for element categories, positions, and overall layout variations. |
layout generation, generative adversarial networks (gans), unsupervised domain adaptation, advertising posters, image-aware design |
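For the PDA-GAN entry above, computing a GAN loss for each input-image pixel can be sketched as a per-pixel binary cross-entropy over a probability map produced by the discriminator. The label convention (1 for source-domain inpainted posters, 0 for target-domain clean product images), the cross-entropy form, and the function name are assumptions of this sketch rather than details taken from the entry.

```python
import math

# Hypothetical sketch: per-pixel domain classification loss, averaged over the map.

def pixel_level_bce(prob_map, is_source):
    """Average binary cross-entropy over all pixels of a discriminator output.

    prob_map:  H x W predicted probabilities that each pixel is from the source domain
    is_source: True if the input image is an inpainted poster, False if it is clean
    """
    target = 1.0 if is_source else 0.0
    eps = 1e-7  # keep the logarithms finite
    total, count = 0.0, 0
    for row in prob_map:
        for p in row:
            p = min(max(p, eps), 1.0 - eps)
            total += -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))
            count += 1
    return total / count


undecided = [[0.5, 0.5], [0.5, 0.5]]
confident = [[0.9, 0.9], [0.9, 0.9]]
print(pixel_level_bce(undecided, is_source=True))
print(pixel_level_bce(confident, is_source=True))
```

A discriminator that cannot tell the domains apart yields a loss of ln 2 per pixel, while one that confidently recognizes inpainted pixels yields a lower loss; the generator side would be trained to push the map toward the undecided case.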
2303.14297
Report |
AgileGAN3D: Few-Shot 3D Portrait Stylization by Augmented Transfer Learning |
Guoxian Song, Hongyi Xu, Jing Liu, Tiancheng Zhi, Yichun Shi, Jianfeng Zhang, Zihang Jiang, Jiashi Feng, Shen Sang, Linjie Luo |
While substantial progress has been made in automated 2D portrait
stylization, admirable 3D portrait stylization from a single user photo remains
an unresolved challenge. One primary obstacle here is the lack of high
quality stylized 3D training data. In this paper, we propose a novel framework
AgileGAN3D that can produce 3D artistically appealing and personalized
portraits with detailed geometry. New stylization can be obtained with just a
few (around 20) unpaired 2D exemplars. We achieve this by first leveraging
existing 2D stylization capabilities, style prior creation, to produce a
large amount of augmented 2D style exemplars. These augmented exemplars are
generated with accurate camera pose labels, as well as paired real face images,
which prove to be critical for the downstream 3D stylization task. Capitalizing
on the recent advancement of 3D-aware GAN models, we perform guided
transfer learning on a pretrained 3D GAN generator to produce
multi-view-consistent stylized renderings. In order to achieve 3D GAN inversion
that can preserve the subject's identity well, we incorporate a multi-view
consistency loss in the training of our encoder. Our pipeline demonstrates
strong capability in turning user photos into a diverse range of 3D artistic
portraits. Both qualitative results and quantitative evaluations have been
conducted to show the superior performance of our method. Code and pretrained
models will be released for reproducibility. |
Introduces AgileGAN3D, a novel framework that generates high-quality 3D stylized portraits with detailed geometry from a single user photo and a few 2D style exemplars. |
Addresses the challenge of limited high-quality 3D data for 3D portrait stylization, enabling personalized and artistic 3D content creation. |
Combines style prior creation, guided transfer learning, and 3D GAN inversion with multi-view consistency loss. |
Generates visually appealing and multi-view consistent 3D stylized portraits.
Outperforms baseline methods in terms of perceptual quality and identity preservation.
Demonstrates robustness across different genders, face shapes, hairstyles, and illumination conditions. |
Gaze direction bias and occasional failure to preserve accessories require further improvement.
Potential for misuse in generating fake images. |
3d stylization, generative adversarial networks, neural radiance fields, few-shot learning, 3d portrait generation |
2303.14038
Report |
Accelerating Vision-Language Pretraining with Free Language Modeling |
Teng Wang, Yixiao Ge, Feng Zheng, Ran Cheng, Ying Shan, Xiaohu Qie, Ping Luo |
State-of-the-art vision-language pretraining (VLP) achieves exemplary
performance but suffers from high training costs resulting from slow
convergence and long training time, especially on large-scale web datasets. An
essential obstacle to training efficiency lies in the entangled prediction rate
(percentage of tokens for reconstruction) and corruption rate (percentage of
corrupted tokens) in masked language modeling (MLM), that is, a proper
corruption rate is achieved at the cost of a large portion of output tokens
being excluded from prediction loss. To accelerate the convergence of VLP, we
propose a new pretraining task, namely, free language modeling (FLM), that
enables a 100% prediction rate with arbitrary corruption rates. FLM
successfully frees the prediction rate from the tie-up with the corruption rate
while allowing the corruption spans to be customized for each token to be
predicted. FLM-trained models are encouraged to learn better and faster given
the same GPU time by exploiting bidirectional contexts more flexibly. Extensive
experiments show FLM could achieve an impressive 2.5x pretraining time
reduction in comparison to the MLM-based methods, while keeping competitive
performance on both vision-language understanding and generation tasks. Code
will be public at https://github.com/TencentARC/FLM. |
This paper proposes Free Language Modeling (FLM), a novel pre-training objective for Vision-Language Pretraining (VLP), to accelerate training by decoupling prediction rate from corruption rate and enabling flexible corruption patterns. |
Existing VLP methods using Masked Language Modeling (MLM) suffer from slow convergence and long training times, especially on large web datasets, due to the entangled prediction and corruption rates limiting the utilization of output tokens. |
FLM employs an encode-corrupt-predict framework. It first encodes input text bidirectionally. Then, it constructs independent corruption-prediction tasks by injecting random span corruptions into encoded features. Finally, a reconstructor predicts each token by reasoning over uncorrupted bidirectional contexts, achieving 100% prediction rate. |
FLM achieves a 2.5x speed-up in pre-training time compared to MLM while maintaining comparable performance on various VL understanding tasks.
FLM demonstrates superior performance on VL generation tasks, such as image captioning, compared to MLM, AR, and PrefixLM.
Ablation studies validate the effectiveness of individual components of FLM, including decomposed bidirectional encoding, deep reconstructor, and flexible corruption rate. |
FLM's performance on cross-modal retrieval tasks currently lags behind MLM, requiring further exploration of better corruption strategies to enhance global feature alignment.
The optimal corruption rate for FLM might vary across different corruption methods, suggesting future research on effectively combining different corruption types to improve context diversity. |
vision-language pretraining, free language modeling, training acceleration, corruption-prediction, bidirectional contextual representation |
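The difference between prediction rate and corruption rate described in the FLM entry above can be illustrated with a toy sketch. It does not reproduce the paper's implementation: the function names, the fixed span length, and the rule for placing each span around its target token are assumptions made purely for illustration. In the MLM toy, one shared mask is drawn and only masked positions are predicted, so the prediction rate equals the corruption rate. In the FLM-style toy, every token gets its own corruption span, so every token is predicted regardless of how much is corrupted.

```python
import random


def mlm_task(n_tokens, corruption_rate, rng):
    """MLM toy: one shared mask; only the masked positions enter the loss."""
    n_masked = max(1, round(n_tokens * corruption_rate))
    masked = set(rng.sample(range(n_tokens), n_masked))
    return masked, len(masked) / n_tokens


def flm_tasks(n_tokens, span_len, rng):
    """FLM-style toy: one corruption span per target token (hypothetical rule)."""
    if not 1 <= span_len <= n_tokens:
        raise ValueError("span_len must be between 1 and n_tokens")
    spans = {}
    for target in range(n_tokens):
        # Assumption: the span is a contiguous window that contains the target,
        # so the target itself is never visible when it is being predicted.
        start_lo = max(0, target - span_len + 1)
        start_hi = min(target, n_tokens - span_len)
        start = rng.randint(start_lo, start_hi)
        spans[target] = set(range(start, start + span_len))
    # Every position is a prediction target, so the prediction rate is 1.0.
    return spans, len(spans) / n_tokens


rng = random.Random(0)
_, mlm_rate = mlm_task(20, 0.4, rng)
spans, flm_rate = flm_tasks(20, 8, rng)
print("MLM prediction rate:", mlm_rate)
print("FLM prediction rate:", flm_rate)
```

With 20 tokens and a 40% corruption rate, the MLM toy predicts 8 of the 20 tokens, whereas the FLM-style toy predicts all 20, each under its own 8-token span.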
2303.13873
Report |
Fantasia3D: Disentangling Geometry and Appearance for High-quality Text-to-3D Content Creation |
Rui Chen, Yongwei Chen, Ningxin Jiao, Kui Jia |
Automatic 3D content creation has achieved rapid progress recently due to the
availability of pre-trained, large language models and image diffusion models,
forming the emerging topic of text-to-3D content creation. Existing text-to-3D
methods commonly use implicit scene representations, which couple the geometry
and appearance via volume rendering and are suboptimal in terms of recovering
finer geometries and achieving photorealistic rendering; consequently, they are
less effective for generating high-quality 3D assets. In this work, we propose
a new method, Fantasia3D, for high-quality text-to-3D content creation. Key to
Fantasia3D is the disentangled modeling and learning of geometry and
appearance. For geometry learning, we rely on a hybrid scene representation,
and propose to encode the surface normal extracted from the representation as the
input of the image diffusion model. For appearance modeling, we introduce the
spatially varying bidirectional reflectance distribution function (BRDF) into
the text-to-3D task, and learn the surface material for photorealistic
rendering of the generated surface. Our disentangled framework is more
compatible with popular graphics engines, supporting relighting, editing, and
physical simulation of the generated 3D assets. We conduct thorough experiments
that show the advantages of our method over existing ones under different
text-to-3D task settings. Project page and source codes:
https://fantasia3d.github.io/. |
Introduces Fantasia3D, a novel text-to-3D generation method that disentangles geometry and appearance modeling, enabling high-quality surface and material generation. |
Existing text-to-3D methods struggle to generate high-quality surfaces and photorealistic rendering due to coupled geometry and appearance learning. |
Leverages a hybrid scene representation (DMTet) for geometry modeling, using rendered normal maps as input for a pre-trained image diffusion model. Introduces spatially varying BRDF for appearance modeling, enabling photorealistic rendering with learned surface materials. |
Disentangled geometry and appearance learning outperforms entangled approaches, producing superior 3D assets.
Shape encoding of rendered normal maps proves crucial for high-quality geometry generation.
Fantasia3D generates more realistic and higher-quality 3D content compared to state-of-the-art methods like DreamFusion and Magic3D. |
Limited ability to generate loose geometries like hair and fur.
Primarily focuses on object generation, lacking support for complete scenes with backgrounds. |
text-to-3d, 3d content creation, disentangled representation learning, brdf, photorealistic rendering |
2303.13843
Report |
CompoNeRF: Text-guided Multi-object Compositional NeRF with Editable 3D Scene Layout |
Haotian Bai, Yuanhuiyi Lyu, Lutao Jiang, Sijia Li, Haonan Lu, Xiaodong Lin, Lin Wang |
Recent advances have shown promise in merging neural radiance fields (NeRFs)
with pre-trained diffusion models for text-to-3D object generation. However,
one enduring challenge is their inadequate capability to accurately parse and
regenerate consistent multi-object environments. Specifically, these models
encounter difficulties in accurately representing quantity and style prompted
by multi-object texts, often resulting in a collapse of the rendering fidelity
that fails to match the semantic intricacies. Moreover, amalgamating these
elements into a coherent 3D scene is a substantial challenge, stemming from
the generic distribution inherent in diffusion models. To tackle the issue of
'guidance collapse' and enhance consistency, we propose a novel framework,
dubbed CompoNeRF, by integrating an editable 3D scene layout with
object-specific and scene-wide guidance mechanisms. It begins by interpreting a
complex text into an editable 3D layout populated with multiple NeRFs, each
paired with a corresponding subtext prompt for precise object depiction. Next,
a tailored composition module seamlessly blends these NeRFs, promoting
consistency, while the dual-level text guidance reduces ambiguity and boosts
accuracy. Notably, the unique modularity of CompoNeRF permits NeRF
decomposition. This enables flexible scene editing and recomposition into new
scenes based on the edited layout or text prompts. Utilizing the open source
Stable Diffusion model, CompoNeRF not only generates scenes with high fidelity
but also paves the way for innovative multi-object composition using editable
3D layouts. Remarkably, our framework achieves up to a 54% improvement in
performance, as measured by the multi-view CLIP score metric. Code is available
at https://github.com/hbai98/Componerf. |
Introduces CompoNeRF, a novel framework for synthesizing coherent multi-object 3D scenes from text descriptions by integrating an editable 3D scene layout with object-specific and scene-wide guidance mechanisms. |
Addresses the 'guidance collapse' problem in existing text-to-3D methods, which struggle to accurately represent and compose multiple objects in a scene as described by text. |
Interprets multi-object text prompts into editable 3D layouts with bounding boxes, each associated with a distinct NeRF and subtext. Employs a composition module to blend individual NeRFs while maintaining global consistency guided by dual-level text prompts (global and object-specific). |
Achieves superior object identity accuracy and context relevance compared to previous methods like Latent-NeRF and SJC.
Demonstrates up to a 54% improvement in performance as measured by the multi-view CLIP score metric.
Enables flexible scene editing and recomposition by decomposing and caching individual NeRFs for reuse. |
Limited in interpreting uncommon object integrations or scenes due to the reliance on the pre-trained diffusion model's knowledge.
Faces occasionally exhibit the 'multi-face' issue, requiring further research into stronger geometric constraints or improved diffusion guidance. |
text-to-3d, neural radiance fields (nerfs), scene composition, 3d scene understanding, generative ai |
2303.13791
Report |
Progressively Optimized Local Radiance Fields for Robust View Synthesis |
Andreas Meuleman, Yu-Lun Liu, Chen Gao, Jia-Bin Huang, Changil Kim, Min H. Kim, Johannes Kopf |
We present an algorithm for reconstructing the radiance field of a
large-scale scene from a single casually captured video. The task poses two
core challenges. First, most existing radiance field reconstruction approaches
rely on accurate pre-estimated camera poses from Structure-from-Motion
algorithms, which frequently fail on in-the-wild videos. Second, using a
single, global radiance field with finite representational capacity does not
scale to longer trajectories in an unbounded scene. For handling unknown poses,
we jointly estimate the camera poses with radiance field in a progressive
manner. We show that progressive optimization significantly improves the
robustness of the reconstruction. For handling large unbounded scenes, we
dynamically allocate new local radiance fields trained with frames within a
temporal window. This further improves robustness (e.g., performs well even
under moderate pose drifts) and allows us to scale to large scenes. Our
extensive evaluation on the Tanks and Temples dataset and our collected outdoor
dataset, Static Hikes, shows that our approach compares favorably with the
state-of-the-art. |
This paper introduces an algorithm for reconstructing large-scale scene radiance fields from casual videos using progressive joint optimization of camera poses and local radiance fields. |
Reconstructing radiance fields from casual videos is challenging due to inaccurate camera pose estimation and limitations of global radiance fields in large scenes. |
The method uses a progressive scheme to estimate camera poses and dynamically allocates local radiance fields to model the scene. It incorporates monocular depth and optical flow for robust optimization. |
The method achieves high-quality view synthesis on long video sequences.
It outperforms existing methods in terms of robustness and scalability.
The proposed progressive optimization scheme significantly improves pose estimation accuracy. |
The method assumes continuous video without shot changes.
It currently doesn't handle dynamic elements in the scene. |
radiance fields, novel view synthesis, camera pose estimation, progressive optimization, local radiance fields |
2303.13756
Report |
GP-VTON: Towards General Purpose Virtual Try-on via Collaborative Local-Flow Global-Parsing Learning |
Zhenyu Xie, Zaiyu Huang, Xin Dong, Fuwei Zhao, Haoye Dong, Xijin Zhang, Feida Zhu, Xiaodan Liang |
Image-based Virtual Try-ON aims to transfer an in-shop garment onto a
specific person. Existing methods employ a global warping module to model the
anisotropic deformation for different garment parts, which fails to preserve
the semantic information of different parts when receiving challenging inputs
(e.g., intricate human poses, difficult garments). Moreover, most of them
directly warp the input garment to align with the boundary of the preserved
region, which usually requires texture squeezing to meet the boundary shape
constraint and thus leads to texture distortion. The above inferior performance
hinders existing methods from real-world applications. To address these
problems and take a step towards real-world virtual try-on, we propose a
General-Purpose Virtual Try-ON framework, named GP-VTON, by developing an
innovative Local-Flow Global-Parsing (LFGP) warping module and a Dynamic
Gradient Truncation (DGT) training strategy. Specifically, compared with the
previous global warping mechanism, LFGP employs local flows to warp garment
parts individually, and assembles the local warped results via the global
garment parsing, resulting in reasonable warped parts and a semantically correct
intact garment even with challenging inputs. On the other hand, our DGT training
strategy dynamically truncates the gradient in the overlap area and the warped
garment is no longer required to meet the boundary constraint, which effectively
avoids the texture squeezing problem. Furthermore, our GP-VTON can be easily
extended to the multi-category scenario and jointly trained using data from
different garment categories. Extensive experiments on two high-resolution
benchmarks demonstrate our superiority over the existing state-of-the-art
methods. |
This paper introduces GP-VTON, a unified framework for general-purpose virtual try-on, capable of generating realistic try-on results even in challenging scenarios (e.g., intricate human poses, complex garment inputs) and extendable to multi-category scenarios. |
Existing VTON methods face challenges in handling complex poses and garments, often leading to artifacts and texture distortions. They also primarily focus on upper-body try-on, limiting their practical applications. |
GP-VTON leverages a novel Local-Flow Global-Parsing (LFGP) warping module to deform garment parts individually and assemble them using global garment parsing. It also employs a Dynamic Gradient Truncation (DGT) training strategy to minimize texture distortion around preserved regions. |
GP-VTON outperforms state-of-the-art methods on high-resolution benchmarks VITON-HD and DressCode, demonstrating its ability to generate more realistic and semantically accurate try-on results.
The LFGP module effectively handles challenging poses and garments, reducing artifacts like damaged sleeves, blended pant legs, and adhesive regions.
The DGT strategy successfully minimizes texture distortion around preserved regions, resulting in more visually appealing try-on results. |
The paper acknowledges limitations and social impact of GP-VTON in the supplementary materials (not provided).
Future work could explore extending GP-VTON to handle more diverse garment types and complex scenes. |
virtual try-on, image synthesis, generative adversarial networks, deep learning, computer vision |
2303.13744
Report |
Conditional Image-to-Video Generation with Latent Flow Diffusion Models |
Haomiao Ni, Changhao Shi, Kai Li, Sharon X. Huang, Martin Renqiang Min |
Conditional image-to-video (cI2V) generation aims to synthesize a new
plausible video starting from an image (e.g., a person's face) and a condition
(e.g., an action class label like smile). The key challenge of the cI2V task
lies in the simultaneous generation of realistic spatial appearance and
temporal dynamics corresponding to the given image and condition. In this
paper, we propose an approach for cI2V using novel latent flow diffusion models
(LFDM) that synthesize an optical flow sequence in the latent space based on
the given condition to warp the given image. Compared to previous
direct-synthesis-based works, our proposed LFDM can better synthesize spatial
details and temporal motion by fully utilizing the spatial content of the given
image and warping it in the latent space according to the generated
temporally-coherent flow. The training of LFDM consists of two separate stages:
(1) an unsupervised learning stage to train a latent flow auto-encoder for
spatial content generation, including a flow predictor to estimate latent flow
between pairs of video frames, and (2) a conditional learning stage to train a
3D-UNet-based diffusion model (DM) for temporal latent flow generation. Unlike
previous DMs operating in pixel space or latent feature space that couples
spatial and temporal information, the DM in our LFDM only needs to learn a
low-dimensional latent flow space for motion generation, thus being more
computationally efficient. We conduct comprehensive experiments on multiple
datasets, where LFDM consistently outperforms prior art. Furthermore, we show
that LFDM can be easily adapted to new domains by simply finetuning the image
decoder. Our code is available at https://github.com/nihaomiao/CVPR23_LFDM. |
This paper proposes Latent Flow Diffusion Models (LFDM) for conditional image-to-video generation, which synthesizes temporally-coherent optical flow sequences in the latent space to warp the given image. |
Existing methods struggle to simultaneously maintain spatial details and temporal coherence. LFDM addresses this by reusing spatial content from the given image through warping guided by the generated latent flow. |
LFDM employs a two-stage training strategy: (1) Unsupervised training of a latent flow auto-encoder for spatial content and flow estimation. (2) Conditional training of a 3D-UNet diffusion model to generate temporal latent flow from class labels. |
LFDM outperforms previous state-of-the-art methods in conditional image-to-video generation on multiple datasets.
LFDM exhibits smaller training-testing gaps, indicating better generalization to unseen images.
LFDM can be easily adapted to new domains by simply finetuning the image decoder. |
LFDM currently focuses on single-subject videos and struggles with multiple moving subjects.
The sampling process with DDPM is slow and can be improved by exploring fast sampling techniques. |
image-to-video generation, diffusion models, optical flow, latent space, conditional generation |
2303.13743
Report |
TEGLO: High Fidelity Canonical Texture Mapping from Single-View Images |
Vishal Vinod, Tanmay Shah, Dmitry Lagun |
Recent work in Neural Fields (NFs) learns 3D representations from
class-specific single view image collections. However, they are unable to
reconstruct the input data preserving high-frequency details. Further, these
methods do not disentangle appearance from geometry and hence are not suitable
for tasks such as texture transfer and editing. In this work, we propose TEGLO
(Textured EG3D-GLO) for learning 3D representations from single view
in-the-wild image collections for a given class of objects. We accomplish this
by training a conditional Neural Radiance Field (NeRF) without any explicit 3D
supervision. We equip our method with editing capabilities by creating a dense
correspondence mapping to a 2D canonical space. We demonstrate that such
mapping enables texture transfer and texture editing without requiring meshes
with shared topology. Our key insight is that by mapping the input image pixels
onto the texture space we can achieve near perfect reconstruction (>= 74 dB
PSNR at 1024^2 resolution). Our formulation allows for high quality 3D
consistent novel view synthesis with high-frequency details at megapixel image
resolution. |
TEGLO learns textured 3D representations from single-view in-the-wild images of objects, enabling tasks like texture transfer and editing, without relying on 3D supervision or textured mesh datasets. |
Existing NeRF-based methods struggle to reconstruct high-frequency details and disentangle appearance from geometry, limiting their use in tasks like texture manipulation. TEGLO addresses these limitations. |
TEGLO uses a two-stage approach: (1) trains a conditional NeRF using tri-planes and GLO to learn per-object latent codes; (2) learns dense correspondences between 3D surface points and a 2D canonical space using the rendered output from the first stage. |
Achieves near-perfect reconstruction of input images (>= 74 dB PSNR at 1024^2 resolution).
Enables high-fidelity single-view 3D reconstruction and novel view synthesis at arbitrary resolutions.
Performs texture transfer and editing without requiring mesh-based methods or spatial fine-tuning. |
Requires significant computational resources for training and inference.
Limited to mapping target image pixels, resulting in missing pixel artifacts for certain views. |
neural radiance fields, texture representation, dense correspondences, generative latent optimization, single-view 3d reconstruction |
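The ">= 74 dB PSNR" figure in the TEGLO entry above can be put in perspective with the standard definition PSNR = 10 log10(MAX^2 / MSE). The sketch below inverts that definition to find the mean squared error implied by a given PSNR. It assumes 8-bit images with a peak value of 255; the entry does not state which peak value was used, so the resulting numbers are illustrative only.

```python
import math


def psnr_db(mse, max_value=255.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_for_psnr(psnr, max_value=255.0):
    """Mean squared error implied by a PSNR value (inverse of psnr_db)."""
    return max_value ** 2 / 10.0 ** (psnr / 10.0)


# Assumption: 8-bit intensity range, peak value 255.
mse_74 = mse_for_psnr(74.0)
rmse_74 = math.sqrt(mse_74)
print("MSE at 74 dB:", mse_74)
print("RMSE at 74 dB (gray levels):", rmse_74)
```

Under that assumption, 74 dB corresponds to a root-mean-square error of roughly 0.05 gray levels, well below one 8-bit quantization step, which is consistent with the entry's description of the reconstruction as near perfect.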
2303.13714
Report |
High Fidelity Image Synthesis With Deep VAEs In Latent Space |
Troy Luhman, Eric Luhman |
We present fast, realistic image generation on high-resolution, multimodal
datasets using hierarchical variational autoencoders (VAEs) trained on a
deterministic autoencoder's latent space. In this two-stage setup, the
autoencoder compresses the image into its semantic features, which are then
modeled with a deep VAE. With this method, the VAE avoids modeling the
fine-grained details that constitute the majority of the image's code length,
allowing it to focus on learning its structural components. We demonstrate the
effectiveness of our two-stage approach, achieving an FID of 9.34 on the
ImageNet-256 dataset which is comparable to BigGAN. We make our implementation
available online. |
This paper introduces a two-stage approach for high-fidelity image generation using hierarchical variational autoencoders (VAEs) trained on the latent space of a pretrained deterministic autoencoder (DAE). |
This approach addresses the limitations of traditional VAEs in generating realistic images on large datasets by separating the modeling of high-frequency details from semantic structure. |
The DAE first compresses images into low-dimensional latent representations, removing imperceptible details. Then, a deep hierarchical VAE is trained on these latents to learn the underlying semantic relationships, leveraging classifier-free guidance for improved image fidelity. |
Achieved an FID of 9.34 on ImageNet-256, comparable to BigGAN and demonstrating significant improvement over previous hierarchical VAEs.
Showed the importance of latent space compression by comparing different downsampling factors, with 4x and 8x performing best.
Demonstrated the interpretability and flexibility of the latent space through image manipulations like interpolation and outpainting. |
Inability to compute data likelihood, limiting its use in tasks like density estimation.
While improved, unguided sample quality still lags behind state-of-the-art diffusion models and GANs. |
image generation, variational autoencoders, latent space, classifier-free guidance, deep generative models |
2303.13703
Report |
End-to-End Diffusion Latent Optimization Improves Classifier Guidance |
Bram Wallace, Akash Gokul, Stefano Ermon, Nikhil Naik |
Classifier guidance -- using the gradients of an image classifier to steer
the generations of a diffusion model -- has the potential to dramatically
expand the creative control over image generation and editing. However,
currently classifier guidance requires either training new noise-aware models
to obtain accurate gradients or using a one-step denoising approximation of the
final generation, which leads to misaligned gradients and sub-optimal control.
We highlight this approximation's shortcomings and propose a novel guidance
method: Direct Optimization of Diffusion Latents (DOODL), which enables
plug-and-play guidance by optimizing diffusion latents w.r.t. the gradients of
a pre-trained classifier on the true generated pixels, using an invertible
diffusion process to achieve memory-efficient backpropagation. Showcasing the
potential of more precise guidance, DOODL outperforms one-step classifier
guidance on computational and human evaluation metrics across different forms
of guidance: using CLIP guidance to improve generations of complex prompts from
DrawBench, using fine-grained visual classifiers to expand the vocabulary of
Stable Diffusion, enabling image-conditioned generation with a CLIP visual
encoder, and improving image aesthetics using an aesthetic scoring network.
Code at https://github.com/salesforce/DOODL. |
Presents DOODL, a method enabling precise guidance of pretrained diffusion models by directly optimizing diffusion latents with respect to a model-based loss on the final generation. |
Overcomes limitations of existing classifier guidance techniques that rely on noise-aware classifiers or one-step denoising approximations, which lead to sub-optimal control over image generation. |
Leverages EDICT, an invertible diffusion algorithm, to compute gradients of model losses with respect to the final generated image and uses these gradients to iteratively optimize diffusion latents for improved control and flexibility. |
DOODL outperforms one-step classifier guidance on DrawBench, showing improved generation of images from complex prompts.
Expands the vocabulary of Stable Diffusion by leveraging fine-grained visual classifiers, enabling generation of rare or unseen concepts.
Enables image-conditioned generation with CLIP, demonstrating personalized entity generation without retraining or finetuning. |
Requires more optimization iterations for certain tasks, such as visual personalization.
Can sometimes lead to warping or deformation of content during optimization, particularly when targeting aesthetic improvement. |
diffusion models, classifier guidance, image generation, invertible neural networks, direct latent optimization |
2303.13518
Report |
Three ways to improve feature alignment for open vocabulary detection |
Relja Arandjelović, Alex Andonian, Arthur Mensch, Olivier J. Hénaff, Jean-Baptiste Alayrac, Andrew Zisserman |
The core problem in zero-shot open vocabulary detection is how to align
visual and text features, so that the detector performs well on unseen classes.
Previous approaches train the feature pyramid and detection head from scratch,
which breaks the vision-text feature alignment established during pretraining,
and struggle to prevent the language model from forgetting unseen classes.
We propose three methods to alleviate these issues. Firstly, a simple scheme
is used to augment the text embeddings which prevents overfitting to a small
number of classes seen during training, while simultaneously saving memory and
computation. Secondly, the feature pyramid network and the detection head are
modified to include trainable gated shortcuts, which encourages vision-text
feature alignment and guarantees it at the start of detection training.
Finally, a self-training approach is used to leverage a larger corpus of
image-text pairs thus improving detection performance on classes with no human
annotated bounding boxes.
Our three methods are evaluated on the zero-shot version of the LVIS
benchmark, each of them showing clear and significant benefits. Our final
network achieves a new state-of-the-art on the mAP-all metric and demonstrates
competitive performance for mAP-rare, as well as superior transfer to COCO and
Objects365. |
This paper introduces three methods to enhance the alignment of visual and textual features for improved zero-shot open vocabulary object detection. |
Zero-shot open vocabulary detection, enabling the detection of objects not seen during training, heavily relies on strong alignment between visual and textual representations. Previous methods struggle to maintain this alignment, particularly when incorporating components trained from scratch. |
The paper proposes (1) efficient text augmentation during training using dropout or precomputed embedding variants, (2) an alignment preserving architecture (APA) for the detector's feature pyramid network and head, using shortcuts and gates to retain feature alignment from pretraining, and (3) self-training with pseudo-labeling on a large image-text dataset to further improve alignment. |
Text augmentation with a frozen language model outperforms training the language model, improving speed and memory efficiency.
APA significantly improves both overall detection (mAP-all) and detection of unseen objects (mAP-rare).
Self-training considerably boosts performance, particularly for unseen objects, and surpasses the performance of previous self-training methods. |
Zero-shot detection still struggles with certain common objects frequently seen in training images but not annotated with bounding boxes.
Future work could explore more efficient use of large image-text datasets through improved pseudo-labeling and self-supervised learning techniques. |
open vocabulary object detection, zero-shot learning, feature alignment, self-training, text augmentation |
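The open vocabulary detection entry above says that trainable gated shortcuts guarantee vision-text alignment at the start of detection training. One way this can work is sketched below. The gate form is an assumption, not the paper's stated design: the class name `GatedShortcut`, the scalar parameter `alpha`, and the `tanh` gate are all hypothetical. The idea is that a newly initialised block is added to a shortcut through a gate that starts at zero, so the pretrained (aligned) features pass through unchanged until training opens the gate.

```python
import math


class GatedShortcut:
    """Toy gated shortcut: output = features + tanh(alpha) * block(features)."""

    def __init__(self, block):
        self.block = block  # newly initialised layer: callable list -> list
        self.alpha = 0.0    # hypothetical trainable scalar; 0 closes the gate

    def __call__(self, features):
        gate = math.tanh(self.alpha)
        update = self.block(features)
        # With gate == 0 the output equals the input, so whatever alignment
        # the pretrained features had is preserved exactly at initialisation.
        return [f + gate * u for f, u in zip(features, update)]


# Stand-in for a randomly initialised layer (purely illustrative).
layer = GatedShortcut(lambda xs: [10.0 * x + 1.0 for x in xs])
pretrained = [1.0, -2.0, 0.5]
at_init = layer(pretrained)

layer.alpha = 0.5  # pretend training has moved the gate
after_update = layer(pretrained)
print(at_init, after_update)
```

At initialisation the layer is an identity map on the pretrained features; once `alpha` moves away from zero, the new block starts to contribute.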
2303.13514
Report |
SAOR: Single-View Articulated Object Reconstruction |
Mehmet Aygün, Oisin Mac Aodha |
We introduce SAOR, a novel approach for estimating the 3D shape, texture, and
viewpoint of an articulated object from a single image captured in the wild.
Unlike prior approaches that rely on pre-defined category-specific 3D templates
or tailored 3D skeletons, SAOR learns to articulate shapes from single-view
image collections with a skeleton-free part-based model without requiring any
3D object shape priors. To prevent ill-posed solutions, we propose a
cross-instance consistency loss that exploits disentangled object shape
deformation and articulation. This is helped by a new silhouette-based sampling
mechanism to enhance viewpoint diversity during training. Our method only
requires estimated object silhouettes and relative depth maps from
off-the-shelf pre-trained networks during training. At inference time, given a
single-view image, it efficiently outputs an explicit mesh representation. We
obtain improved qualitative and quantitative results on challenging quadruped
animals compared to relevant existing work. |
Introduces SAOR, a self-supervised method for reconstructing the 3D shape, texture, and viewpoint of articulated objects from single images, without relying on 3D templates or skeletons. |
Reconstructing the 3D shape of articulated objects in the wild, particularly animals, from single images remains challenging due to limitations of existing methods, such as reliance on 3D templates or difficulty in modeling articulation. |
SAOR uses a skeleton-free, part-based model that learns to articulate shapes from single-view images. It utilizes a cross-instance consistency loss and a silhouette-based sampling mechanism to handle the ill-posed nature of 3D reconstruction and enhance viewpoint diversity during training. |
Outperforms previous methods that do not use explicit 3D supervision on keypoint transfer tasks for birds and quadrupeds.
Demonstrates multi-view consistent 3D shape reconstructions, successfully capturing articulation and viewpoint differences.
Exhibits generalization capabilities, reconstructing plausible 3D shapes from non-photorealistic images, such as drawings. |
Texture predictions, while promising, could be improved with refinement techniques.
Struggles with images containing unusual viewpoints or significant object occlusion. |
3d reconstruction, articulated objects, self-supervised learning, skeleton-free modeling, single-view reconstruction |
2303.13455
Report |
CoBIT: A Contrastive Bi-directional Image-Text Generation Model |
Haoxuan You, Mandy Guo, Zhecan Wang, Kai-Wei Chang, Jason Baldridge, Jiahui Yu |
The field of vision and language has witnessed a proliferation of pre-trained
foundation models. Most existing methods are independently pre-trained with
a contrastive objective like CLIP, an image-to-text generative objective like PaLI,
or a text-to-image generative objective like Parti. However, the three objectives
can be pre-trained on the same data, image-text pairs, and intuitively they
complement each other as contrasting provides global alignment capacity and
generation grants fine-grained understanding. In this work, we present a
Contrastive Bi-directional Image-Text generation model (CoBIT), which attempts
to unify the three pre-training objectives in one framework. Specifically,
CoBIT employs a novel unicoder-decoder structure, consisting of an image
unicoder, a text unicoder and a cross-modal decoder. The image/text unicoders
can switch between encoding and decoding in different tasks, enabling
flexibility and shared knowledge that benefits both image-to-text and
text-to-image generations. CoBIT achieves superior performance in image
understanding, image-text understanding (Retrieval, Captioning, VQA, SNLI-VE)
and text-based content creation, particularly in zero-shot scenarios. For
instance, 82.7% in zero-shot ImageNet classification, 9.37 FID score in
zero-shot text-to-image generation and 44.8 CIDEr in zero-shot captioning. |
This paper introduces CoBIT, a novel vision-language model that unifies contrastive learning, image-to-text generation, and text-to-image generation within a single framework using a unicoder-decoder structure. |
This unification aims to consolidate the strengths of each pre-training objective and enable the model to excel in a wide range of vision and vision-language tasks. |
CoBIT utilizes a novel unicoder-decoder structure, enabling the image and text unicoders to switch between encoding and decoding modes depending on the task. The model is pre-trained on large-scale image-text datasets using contrastive loss, image-to-text generation loss, and text-to-image generation loss. |
CoBIT achieves state-of-the-art zero-shot performance on ImageNet classification (82.7% accuracy) and MS-COCO text-to-image generation (9.37 FID score).
The model shows strong performance in other zero-shot tasks, such as image-text retrieval and image captioning.
Ablation studies demonstrate the effectiveness of the proposed unicoder structure and the benefits of unifying the three pre-training objectives. |
The paper identifies a slight contradiction between the text-to-image and image-to-text generation objectives during training, suggesting a need for further exploration to better harmonize these tasks.
Future work could investigate scaling up the model and exploring more diverse and challenging datasets to further enhance its capabilities. |
vision-language model, contrastive learning, image-to-text generation, text-to-image generation, unicoder-decoder |
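The contrastive part of CoBIT's pre-training can be illustrated with a CLIP-style symmetric InfoNCE loss. This is a minimal sketch, not CoBIT's implementation: the function names, the temperature value, and the toy embeddings are assumptions made for illustration, and the generative losses are omitted.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    top = max(row)
    log_sum = top + math.log(sum(math.exp(x - top) for x in row))
    return [x - log_sum for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: pair k of images and texts is the positive pair.

    temperature=0.07 is an assumed value, not taken from the paper.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    n = len(logits)
    # Image-to-text direction: softmax over each row.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: softmax over each column.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy check: matched pairs give a much lower loss than swapped pairs.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss rewards global image-text alignment only, which is why the abstract describes it as complementary to the two generative objectives.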
2303.13450
Report |
Set-the-Scene: Global-Local Training for Generating Controllable NeRF Scenes |
Dana Cohen-Bar, Elad Richardson, Gal Metzer, Raja Giryes, Daniel Cohen-Or |
Recent breakthroughs in text-guided image generation have led to remarkable
progress in the field of 3D synthesis from text. By optimizing neural radiance
fields (NeRF) directly from text, recent methods are able to produce remarkable
results. Yet, these methods are limited in their control of each object's
placement or appearance, as they represent the scene as a whole. This can be a
major issue in scenarios that require refining or manipulating objects in the
scene. To remedy this deficit, we propose a novel Global-Local training
framework for synthesizing a 3D scene using object proxies. A proxy represents
the object's placement in the generated scene and optionally defines its coarse
geometry. The key to our approach is to represent each object as an independent
NeRF. We alternate between optimizing each NeRF on its own and as part of the
full scene. Thus, a complete representation of each object can be learned,
while also creating a harmonious scene with style and lighting match. We show
that using proxies allows a wide variety of editing options, such as adjusting
the placement of each independent object, removing objects from a scene, or
refining an object. Our results show that Set-the-Scene offers a powerful
solution for scene synthesis and manipulation, filling a crucial gap in
controllable text-to-3D synthesis. |
Introduces Set-the-Scene, a framework for synthesizing controllable 3D scenes from text prompts and 3D object proxies using a Global-Local training approach with composable NeRFs. |
Current text-to-3D methods lack control over object placement and appearance, limiting scene customization and editing. |
Represents scenes as composable NeRFs, each built around a proxy defining placement and optionally coarse geometry. Employs a Global-Local training strategy, alternating between optimizing individual NeRFs locally and the entire scene globally using score distillation and shape loss. |
Generates scenes matching user-defined object placements and styles guided by text prompts.
Enables post-training editing like object relocation, duplication, removal, geometry modification, and color scheme adjustments.
Outperforms single-NeRF methods in generating complex scenes with consistent object relationships and styles, as demonstrated qualitatively and through a user study. |
Generation quality limited by the underlying single-object text-to-3D method (Latent-NeRF) and diffusion model.
Occasional generation of objects as textures within the background NeRF instead of separate geometry. |
text-to-3d synthesis, neural radiance fields (nerf), composable scene representation, score distillation, 3d scene editing |
2303.13396
Report |
Zero-guidance Segmentation Using Zero Segment Labels |
Pitchaporn Rewatbowornwong, Nattanat Chatthee, Ekapol Chuangsuwanich, Supasorn Suwajanakorn |
CLIP has enabled new and exciting joint vision-language applications, one of
which is open-vocabulary segmentation, which can locate any segment given an
arbitrary text query. In our research, we ask whether it is possible to
discover semantic segments without any user guidance in the form of text
queries or predefined classes, and label them using natural language
automatically. We propose a novel problem, zero-guidance segmentation, and the
first baseline that leverages two pre-trained generalist models, DINO and CLIP,
to solve this problem without any fine-tuning or segmentation dataset. The
general idea is to first segment an image into small over-segments, encode them
into CLIP's visual-language space, translate them into text labels, and merge
semantically similar segments together. The key challenge, however, is how to
encode a visual segment into a segment-specific embedding that balances global
and local context information, both useful for recognition. Our main
contribution is a novel attention-masking technique that balances the two
contexts by analyzing the attention layers inside CLIP. We also introduce
several metrics for the evaluation of this new task. With CLIP's innate
knowledge, our method can precisely locate the Mona Lisa painting among a
museum crowd. Project page: https://zero-guide-seg.github.io/. |
Introduces "zero-guidance segmentation," a novel problem aiming to segment images and label segments in natural language without predefined classes or text guidance, and proposes the first baseline solution. |
Significantly advances semantic segmentation by eliminating the need for user input or predefined classes, enabling more flexible and comprehensive image understanding. |
Leverages pretrained DINO and CLIP models. First, over-segments an image using DINO features. Then, maps each segment to CLIP's visual-language embedding space using a novel attention-masking technique called "global subtraction" to balance global and local contexts. Finally, translates embeddings to text labels and merges semantically similar segments. |
Presents qualitative results demonstrating the method's ability to discover semantic segments and label them with diverse and meaningful text descriptions.
Proposes new evaluation metrics to address the challenges of arbitrary label granularity and synonyms, enabling quantitative assessment of segmentation quality and text label accuracy.
Shows promising results on Pascal Context and Pascal VOC datasets, particularly in discovering a wider range of objects compared to existing zero-shot open-vocabulary methods. |
Label reassignment during evaluation remains challenging due to the potential mismatch between predicted and ground-truth labels, highlighting the need for a better understanding of object parts and relationships.
Global context leakage can still occur, particularly for background segments sharing boundaries with salient objects, suggesting avenues for improvement in attention masking and segment encoding. |
semantic segmentation, zero-shot learning, vision-language models, clip, dino |
2303.13277
Report |
SINE: Semantic-driven Image-based NeRF Editing with Prior-guided Editing Field |
Chong Bao, Yinda Zhang, Bangbang Yang, Tianxing Fan, Zesong Yang, Hujun Bao, Guofeng Zhang, Zhaopeng Cui |
Despite the great success in 2D editing using user-friendly tools, such as
Photoshop, semantic strokes, or even text prompts, similar capabilities in 3D
areas are still limited, either relying on 3D modeling skills or allowing
editing within only a few categories. In this paper, we present a novel
semantic-driven NeRF editing approach, which enables users to edit a neural
radiance field with a single image, and faithfully delivers edited novel views
with high fidelity and multi-view consistency. To achieve this goal, we propose
a prior-guided editing field to encode fine-grained geometric and texture
editing in 3D space, and develop a series of techniques to aid the editing
process, including cyclic constraints with a proxy mesh to facilitate geometric
supervision, a color compositing mechanism to stabilize semantic-driven texture
editing, and a feature-cluster-based regularization to preserve the irrelevant
content unchanged. Extensive experiments and editing examples on both
real-world and synthetic data demonstrate that our method achieves
photo-realistic 3D editing using only a single edited image, pushing the boundary
of semantic-driven editing in 3D real-world scenes. Our project webpage:
https://zju3dv.github.io/sine/. |
Proposes SINE, a semantic-driven image-based editing approach for NeRFs, enabling 3D scene editing using a single image or text prompts. |
Addresses limitations in 3D editing tools that require 3D modeling expertise or offer limited editing capabilities, aiming for effortless and realistic 3D scene manipulation. |
Learns a prior-guided editing field to encode geometric and texture modifications, utilizing shape priors (DIF, depth prediction) and semantic texture priors (ViT) for multi-view consistency. Introduces cyclic constraints with a proxy mesh, color compositing, and feature-cluster-based regularization to enhance editing quality and control. |
Achieves realistic geometric deformations from single-view edits, outperforming EG3D and EditNeRF in visual quality and generalization.
Enables semantic-aware texture editing using target images or text prompts, surpassing ARF and CLIP-NeRF in detail and realism.
Demonstrates effective editing control, preserving irrelevant scene parts through feature-cluster-based regularization. |
Current approach doesn't support edits involving topology changes (e.g., breaking objects).
Assumes user edits are semantically meaningful, limiting the use of nonsensical target images. |
nerf editing, semantic editing, single-view editing, 3d scene manipulation, prior-guided learning |
2303.13273
Report |
TAPS3D: Text-Guided 3D Textured Shape Generation from Pseudo Supervision |
Jiacheng Wei, Hao Wang, Jiashi Feng, Guosheng Lin, Kim-Hui Yap |
In this paper, we investigate an open research task of generating
controllable 3D textured shapes from the given textual descriptions. Previous
works either require ground truth caption labeling or extensive optimization
time. To resolve these issues, we present a novel framework, TAPS3D, to train a
text-guided 3D shape generator with pseudo captions. Specifically, based on
rendered 2D images, we retrieve relevant words from the CLIP vocabulary and
construct pseudo captions using templates. Our constructed captions provide
high-level semantic supervision for generated 3D shapes. Further, in order to
produce fine-grained textures and increase geometry diversity, we propose to
adopt low-level image regularization to enable fake-rendered images to align
with the real ones. During the inference phase, our proposed model can generate
3D textured shapes from the given text without any additional optimization. We
conduct extensive experiments to analyze each of our proposed components and
show the efficacy of our framework in generating high-fidelity 3D textured and
text-relevant shapes. |
Introduces TAPS3D, a novel framework for generating controllable 3D textured shapes from text descriptions without ground truth captions or extensive optimization. |
Addresses limitations of prior text-to-3D methods that require labeled captions or suffer from long optimization times, making text-guided 3D generation practical. |
Generates pseudo captions from rendered 2D images using CLIP word retrieval and templates. Trains a text-conditioned 3D generator (pretrained GET3D) with high-level CLIP loss and low-level image regularization. |
Generates high-fidelity 3D textured shapes consistent with input text prompts.
Significantly faster inference compared to optimization-based methods.
Quantitative evaluation shows superior performance in image quality (FID), text-image relevance (CLIP-R-Precision), and geometry quality (FPD). |
Limited capacity to generate fine-grained details for different object parts.
Reliance on diverse training images to handle complex text input. |
3d shape generation, text-to-3d, pseudo supervision, clip, generative adversarial networks |
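The pseudo-caption step in TAPS3D (retrieve relevant words for a rendered image, then fill a template) can be sketched as follows. This is an illustrative toy, not the authors' code: the embeddings are hand-made stand-ins for CLIP features, and `pseudo_caption`, `top_k`, and the template string are hypothetical names and values.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pseudo_caption(image_emb, vocab_embs, category, top_k=2,
                   template="a {words} {category}"):
    """Pick the top_k vocabulary words closest to the image embedding
    and insert them into a caption template (template is an assumption)."""
    ranked = sorted(vocab_embs,
                    key=lambda word: cosine(image_emb, vocab_embs[word]),
                    reverse=True)
    return template.format(words=" ".join(ranked[:top_k]), category=category)


# Toy stand-ins for CLIP word and image embeddings (hypothetical values).
vocab = {
    "red": [1.0, 0.0, 0.0],
    "sporty": [0.8, 0.6, 0.0],
    "wooden": [0.0, 0.0, 1.0],
}
caption = pseudo_caption([1.0, 0.2, 0.0], vocab, category="car")
```

In this toy run the two nearest words are "red" and "sporty", so the generated caption is "a red sporty car".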
2303.13232
Report |
Transforming Radiance Field with Lipschitz Network for Photorealistic 3D Scene Stylization |
Zicheng Zhang, Yinglu Liu, Congying Han, Yingwei Pan, Tiande Guo, Ting Yao |
Recent advances in 3D scene representation and novel view synthesis have
witnessed the rise of Neural Radiance Fields (NeRFs). Nevertheless, it is not
trivial to exploit NeRF for the photorealistic 3D scene stylization task, which
aims to generate visually consistent and photorealistic stylized scenes from
novel views. Simply coupling NeRF with photorealistic style transfer (PST) will
result in cross-view inconsistency and degradation of stylized view syntheses.
Through a thorough analysis, we demonstrate that this non-trivial task can be
simplified in a new light: When transforming the appearance representation of a
pre-trained NeRF with Lipschitz mapping, the consistency and photorealism
across source views will be seamlessly encoded into the syntheses. That
motivates us to build a concise and flexible learning framework namely LipRF,
which upgrades arbitrary 2D PST methods with Lipschitz mapping tailored for the
3D scene. Technically, LipRF first pre-trains a radiance field to reconstruct
the 3D scene, and then emulates the style on each view by 2D PST as the prior
to learn a Lipschitz network to stylize the pre-trained appearance. Given that
the Lipschitz condition strongly affects the expressivity of the neural network,
we devise an adaptive regularization to balance the reconstruction and
stylization. A gradual gradient aggregation strategy is further introduced to
optimize LipRF in a cost-efficient manner. We conduct extensive experiments to
show the high quality and robust performance of LipRF on both photorealistic 3D
stylization and object appearance editing. |
This paper presents LipRF, a novel framework that leverages a Lipschitz-constrained MLP to transform pre-trained NeRF appearance representations for photorealistic 3D scene stylization, ensuring consistency and photorealism in stylized novel views. |
Photorealistic 3D scene stylization, aiming to generate consistent and realistic stylized novel views, is challenging due to the lack of style loss tailored for NeRF training and the limitations of 2D PST methods causing inconsistencies across views. |
LipRF pre-trains a radiance field (Plenoxels) for scene reconstruction and learns a Lipschitz MLP to map the pre-trained appearance to stylized versions. It utilizes 2D PST stylization on individual views as guidance and employs adaptive regularization based on spectral normalization to balance reconstruction and stylization quality. Gradual gradient aggregation ensures cost-efficient optimization. |
LipRF successfully preserves photorealism and consistency in stylized scenes, outperforming existing 2D PST and 3D stylization methods.
Adaptive regularization effectively balances stylization quality and adherence to the Lipschitz constraint, crucial for preserving image structure.
Gradual gradient aggregation enables efficient training of LipRF, reducing memory footprint and computational cost. |
LipRF's performance depends on the accuracy of the pre-trained radiance field, potentially limiting its application to scenes where accurate NeRF reconstruction is challenging.
Future work could explore joint optimization of the radiance field and the Lipschitz MLP for enhanced stylization. |
3d scene stylization, neural radiance fields, lipschitz networks, photorealistic style transfer, novel view synthesis |
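The Lipschitz constraint in LipRF relies on spectral normalization of the MLP's weight matrices. The sketch below shows plain spectral normalization with a fixed bound, estimated by power iteration; it is not the paper's adaptive regularization, and the function names and the `bound` parameter are assumptions for illustration.

```python
import math


def mat_vec(weights, vec):
    """Compute W @ v for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def mat_t_vec(weights, vec):
    """Compute W.T @ u for a matrix stored as a list of rows."""
    cols = len(weights[0])
    return [sum(weights[r][c] * vec[r] for r in range(len(weights)))
            for c in range(cols)]


def norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def spectral_norm(weights, iters=50):
    """Estimate the largest singular value of W by power iteration.

    The all-ones start vector is an arbitrary choice; a random start is
    the usual way to avoid an unlucky orthogonal initialization.
    """
    v = [1.0] * len(weights[0])
    for _ in range(iters):
        u = mat_vec(weights, v)
        u_norm = norm(u)
        if u_norm == 0.0:
            return 0.0
        u = [x / u_norm for x in u]
        v = mat_t_vec(weights, u)
        v_norm = norm(v)
        if v_norm == 0.0:
            return 0.0
        v = [x / v_norm for x in v]
    return norm(mat_vec(weights, v))


def lipschitz_clip(weights, bound=1.0):
    """Rescale W so its spectral norm is at most `bound`."""
    sigma = spectral_norm(weights)
    scale = min(1.0, bound / sigma) if sigma > 0.0 else 1.0
    return [[w * scale for w in row] for row in weights]


layer = [[3.0, 0.0], [0.0, 1.0]]
clipped = lipschitz_clip(layer, bound=1.0)
```

A linear layer is Lipschitz with constant equal to its spectral norm, so bounding each layer (with 1-Lipschitz activations such as ReLU) bounds the whole MLP. A tighter bound preserves more of the source structure but limits how far the appearance can be changed, which is the trade-off the adaptive regularization is meant to balance.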
2303.13126
Report |
MagicFusion: Boosting Text-to-Image Generation Performance by Fusing Diffusion Models |
Jing Zhao, Heliang Zheng, Chaoyue Wang, Long Lan, Wenjing Yang |
The advent of open-source AI communities has produced a cornucopia of
powerful text-guided diffusion models that are trained on various datasets.
However, few explorations have been conducted on ensembling such models to combine
their strengths. In this work, we propose a simple yet effective method called
Saliency-aware Noise Blending (SNB) that can empower the fused text-guided
diffusion models to achieve more controllable generation. Specifically, we
experimentally find that the responses of classifier-free guidance are highly
related to the saliency of generated images. Thus we propose to trust different
models in their areas of expertise by blending the predicted noises of two
diffusion models in a saliency-aware manner. SNB is training-free and can be
completed within a DDIM sampling process. Additionally, it can automatically
align the semantics of two noise spaces without requiring additional
annotations such as masks. Extensive experiments show the impressive
effectiveness of SNB in various applications. Project page is available at
https://magicfusion.github.io/. |
This paper introduces Saliency-aware Noise Blending (SNB), a method to fuse pre-trained text-guided diffusion models for more controllable image generation. |
Leveraging the strengths of multiple pre-trained diffusion models allows for the creation of images that combine their individual capabilities, enabling fine-grained control, creative scene composition, and cross-domain fusion. |
SNB uses classifier-free guidance to generate saliency maps from two diffusion models, creating a mask that guides the blending of their predicted noises during the denoising sampling process. |
SNB allows the fusion of a general model with a fine-grained model, enabling the generation of specific objects within complex scenes.
The method enables recontextualization by fusing a general model with a DreamBooth model, placing specific objects in new settings.
SNB facilitates cross-domain fusion, combining the creative composition of cartoon models with the photorealism of general models. |
The method currently relies on manual tuning of hyperparameters for optimal blending.
Future work could explore extending SNB to fuse more than two diffusion models simultaneously. |
image generation, diffusion models, model fusion, text-to-image synthesis, classifier-free guidance |
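The blending step in SNB can be sketched for a single denoising step as follows. This is a rough illustration under stated assumptions rather than the MagicFusion implementation: saliency is taken here as the absolute classifier-free guidance response, the mask is a hard per-element comparison, and the function names and guidance scale are hypothetical. Noise predictions are flat lists of floats for simplicity.

```python
def guidance_response(eps_cond, eps_uncond):
    """Assumed saliency proxy: magnitude of the classifier-free guidance term."""
    return [abs(c - u) for c, u in zip(eps_cond, eps_uncond)]


def guided_noise(eps_cond, eps_uncond, scale):
    """Standard classifier-free guidance combination."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def blend_noise(cond_a, uncond_a, cond_b, uncond_b, scale=7.5):
    """Blend two models' noise predictions with a saliency-derived mask.

    Each element takes model A's guided noise where A's response is
    stronger, and model B's otherwise (hard mask; an assumption).
    """
    sal_a = guidance_response(cond_a, uncond_a)
    sal_b = guidance_response(cond_b, uncond_b)
    mask = [1.0 if a > b else 0.0 for a, b in zip(sal_a, sal_b)]
    noise_a = guided_noise(cond_a, uncond_a, scale)
    noise_b = guided_noise(cond_b, uncond_b, scale)
    blended = [m * x + (1.0 - m) * y
               for m, x, y in zip(mask, noise_a, noise_b)]
    return blended, mask


# Toy step: model A responds at position 0, model B at position 1.
blended, mask = blend_noise(
    cond_a=[1.0, 0.0], uncond_a=[0.0, 0.0],
    cond_b=[0.0, 2.0], uncond_b=[0.0, 0.0],
    scale=1.0,
)
```

Because the mask is recomputed from the two models' own predictions at each step, no training and no user-supplied masks are needed, consistent with the abstract's description.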
2303.13076
Report |
CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching |
Xiaoshi Wu, Feng Zhu, Rui Zhao, Hongsheng Li |
Open-vocabulary detection (OVD) is an object detection task aiming at
detecting objects from novel categories beyond the base categories on which the
detector is trained. Recent OVD methods rely on large-scale visual-language
pre-trained models, such as CLIP, for recognizing novel objects. We identify
the two core obstacles that need to be tackled when incorporating these models
into detector training: (1) the distribution mismatch that happens when
applying a VL-model trained on whole images to region recognition tasks; (2)
the difficulty of localizing objects of unseen classes. To overcome these
obstacles, we propose CORA, a DETR-style framework that adapts CLIP for
Open-vocabulary detection by Region prompting and Anchor pre-matching. Region
prompting mitigates the whole-to-region distribution gap by prompting the
region features of the CLIP-based region classifier. Anchor pre-matching helps
learn generalizable object localization through a class-aware matching mechanism.
We evaluate CORA on the COCO OVD benchmark, where we achieve 41.7 AP50 on novel
classes, which outperforms the previous SOTA by 2.4 AP50 even without resorting
to extra training data. When extra training data is available, we train
CORA$^+$ on both ground-truth base-category annotations and additional pseudo
bounding box labels computed by CORA. CORA$^+$ achieves 43.1 AP50 on the COCO
OVD benchmark and 28.1 box APr on the LVIS OVD benchmark. |
Introduces CORA, a DETR-style framework that adapts CLIP for open-vocabulary detection using Region Prompting and Anchor Pre-Matching. |
Addresses two obstacles to using CLIP for open-vocabulary detection: the whole-image-to-region distribution mismatch and the difficulty of localizing novel classes. |
The paper introduces Region Prompting to adapt CLIP for region-level tasks and Anchor Pre-Matching for efficient and generalizable object localization. |
CORA achieves 41.7 AP50 on novel classes of the COCO OVD benchmark, outperforming the previous state-of-the-art by 2.4 AP50 without extra training data.
Region Prompting effectively mitigates the distribution gap, boosting classification performance on novel classes from 63.9% to 74.1%.
Anchor Pre-Matching enables efficient class-aware object localization, leading to better generalization to novel classes. |
The method relies on the performance of the CLIP model, which may limit its ability to detect objects with complex visual appearances or relationships.
Future work could explore incorporating other modalities, such as depth or semantic segmentation, to further improve object localization. |
open-vocabulary detection, clip, region prompting, anchor pre-matching, detr |
2303.13071
Report |
PanoHead: Geometry-Aware 3D Full-Head Synthesis in 360$^{\circ}$ |
Sizhe An, Hongyi Xu, Yichun Shi, Guoxian Song, Umit Ogras, Linjie Luo |
Synthesis and reconstruction of 3D human heads have gained increasing interest
in computer vision and computer graphics recently. Existing state-of-the-art 3D
generative adversarial networks (GANs) for 3D human head synthesis are either
limited to near-frontal views or hard to preserve 3D consistency in large view
angles. We propose PanoHead, the first 3D-aware generative model that enables
high-quality view-consistent image synthesis of full heads in $360^\circ$ with
diverse appearance and detailed geometry using only in-the-wild unstructured
images for training. At its core, we lift up the representation power of recent
3D GANs and bridge the data alignment gap when training from in-the-wild images
with widely distributed views. Specifically, we propose a novel two-stage
self-adaptive image alignment for robust 3D GAN training. We further introduce
a tri-grid neural volume representation that effectively addresses front-face
and back-head feature entanglement rooted in the widely-adopted tri-plane
formulation. Our method instills prior knowledge of 2D image segmentation in
adversarial learning of 3D neural scene structures, enabling compositable head
synthesis in diverse backgrounds. Benefiting from these designs, our method
significantly outperforms previous 3D GANs, generating high-quality 3D heads
with accurate geometry and diverse appearances, even with long wavy and afro
hairstyles, renderable from arbitrary poses. Furthermore, we show that our
system can reconstruct full 3D heads from single input images for personalized
realistic 3D avatars. |
PanoHead is the first 3D GAN framework to synthesize view-consistent, high-fidelity full-head images in 360° from only single-view images, enabling 3D portrait creation. |
Existing 3D GANs for head synthesis are either limited to near-frontal views or struggle to maintain 3D consistency across large view angles, hindering applications like digital avatars and telepresence. |
PanoHead builds upon EG3D and introduces: (1) foreground-aware tri-discriminator for separating foreground and background; (2) tri-grid volume representation to address feature entanglement in tri-plane; (3) two-stage image alignment with a self-adaptation module for robust training on in-the-wild images. |
Synthesizes high-fidelity 360° full-head images with detailed geometry, outperforming SOTA methods in qualitative and quantitative evaluations.
Generates background-free 3D head geometry, even with diverse hairstyles.
Demonstrates compelling single-view 3D head reconstruction and novel-view synthesis. |
Minor artifacts persist (e.g., teeth area, flickering textures).
Lacks finer high-frequency geometric details (e.g., hair tips). |
3d gan, full-head synthesis, 360° view synthesis, single-view reconstruction, neural rendering |
2303.13062
Report |
SIEDOB: Semantic Image Editing by Disentangling Object and Background |
Wuyang Luo, Su Yang, Xinjian Zhang, Weishan Zhang |
Semantic image editing provides users with a flexible tool to modify a given
image guided by a corresponding segmentation map. In this task, the features of
the foreground objects and the backgrounds are quite different. However, all
previous methods handle backgrounds and objects as a whole using a monolithic
model. Consequently, they remain limited in processing content-rich images and
suffer from generating unrealistic objects and texture-inconsistent
backgrounds. To address this issue, we propose a novel paradigm,
Semantic Image Editing by Disentangling Object and Background (SIEDOB), the
core idea of which is to explicitly leverage several heterogeneous subnetworks
for objects and backgrounds. First, SIEDOB disassembles the edited input into
background regions and instance-level objects. Then, we feed them into the
dedicated generators. Finally, all synthesized parts are embedded in their
original locations, and a fusion network is used to obtain a harmonized result.
Moreover,
to produce high-quality edited images, we propose some innovative designs,
including Semantic-Aware Self-Propagation Module, Boundary-Anchored Patch
Discriminator, and Style-Diversity Object Generator, and integrate them into
SIEDOB. We conduct extensive experiments on Cityscapes and ADE20K-Room datasets
and show that our method markedly outperforms the baselines, especially in
synthesizing realistic and diverse objects and texture-consistent backgrounds. |
Presents SIEDOB, a novel semantic image editing framework that disentangles object and background generation for improved realism and texture consistency in complex scenes. |
Existing methods struggle with generating realistic and coherent edits in images with multiple, diverse objects and backgrounds, particularly in content-rich scenes. |
Employs a heterogeneous model that disassembles the edited image into background regions and instance-level objects. Different generators synthesize corresponding content, integrated via a fusion network. Introduces innovations like the Semantic-Aware Self-Propagation Module, Boundary-Anchored Patch Discriminator, and Style-Diversity Object Generator to enhance quality. |
Outperforms state-of-the-art methods in visual quality and metrics like FID, LPIPS, and mIoU.
Effectively handles scenes with dense, overlapping objects, producing superior results compared to methods that treat the entire image uniformly.
Demonstrates improved texture consistency between edited and known regions in both objects and backgrounds. |
Struggles with generating objects from rare categories due to limited training data.
Object generation quality is challenged by extreme poses or large-scale occlusions.
Future work could explore incorporating mechanisms to handle rare categories and challenging object configurations. |
semantic image editing, generative adversarial networks, image synthesis, disentanglement learning, texture consistency |
2303.13005
Report |
From Knowledge Distillation to Self-Knowledge Distillation: A Unified Approach with Normalized Loss and Customized Soft Labels |
Zhendong Yang, Ailing Zeng, Zhe Li, Tianke Zhang, Chun Yuan, Yu Li |
Knowledge Distillation (KD) uses the teacher's prediction logits as soft
labels to guide the student, while self-KD does not need a real teacher to
acquire the soft labels. This work unifies the formulations of the two tasks by
decomposing and reorganizing the generic KD loss into a Normalized KD (NKD)
loss and customized soft labels for both target class (image's category) and
non-target classes named Universal Self-Knowledge Distillation (USKD). We
decompose the KD loss and find the non-target loss from it forces the student's
non-target logits to match the teacher's, but the sum of the two non-target
logits is different, preventing them from being identical. NKD normalizes the
non-target logits to equalize their sum. It can be generally used for KD and
self-KD to better use the soft labels for distillation loss. USKD generates
customized soft labels for both target and non-target classes without a
teacher. It smooths the target logit of the student as the soft target label
and uses the rank of the intermediate feature to generate the soft non-target
labels with Zipf's law. For KD with teachers, our NKD achieves state-of-the-art
performance on CIFAR-100 and ImageNet datasets, boosting the ImageNet Top-1
accuracy of ResNet-18 from 69.90% to 71.96% with a ResNet-34 teacher. For
self-KD without teachers, USKD is the first self-KD method that can be
effectively applied to both CNN and ViT models with negligible additional time
and memory cost, resulting in new state-of-the-art results, such as 1.17% and
0.55% accuracy gains on ImageNet for MobileNet and DeiT-Tiny, respectively. Our
codes are available at https://github.com/yzd-v/cls_KD. |
This paper proposes two novel methods: Normalized Knowledge Distillation (NKD), which improves upon traditional KD by normalizing non-target logits for better soft label utilization, and Universal Self-Knowledge Distillation (USKD), which introduces customized soft labels for both target and non-target classes, enabling self-KD for both CNN and ViT models. |
The paper aims to address limitations in knowledge distillation (KD) and self-knowledge distillation (self-KD) by improving the utilization of soft labels for distillation loss and proposing a more general and effective method for generating customized soft labels for self-KD. |
NKD normalizes the non-target logits in KD loss to equalize their sum with the teacher's, facilitating better knowledge transfer. USKD generates customized soft labels by smoothing the student's target logit and using the rank of intermediate features, determined through weak supervision, to generate soft non-target labels based on Zipf's law. |
NKD achieves state-of-the-art performance for KD, outperforming previous methods on CIFAR-100 and ImageNet, demonstrating significant accuracy gains.
USKD, with its customized soft labels, effectively performs self-KD on both CNN and ViT models, achieving state-of-the-art results with negligible additional time and resource consumption compared to baseline training.
The paper provides analysis and visualizations demonstrating the effectiveness of normalizing non-target logits, customizing soft labels, and the impact of different smoothing methods and rank determination approaches. |
The performance of USKD with varying hyperparameters for non-target loss requires further investigation.
The effectiveness of NKD and USKD on more diverse tasks beyond image classification and object detection remains to be explored. |
knowledge distillation, self-knowledge distillation, normalized loss, customized soft labels, cnn and vit models |
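The non-target normalization described in the NKD entry above can be illustrated with a small NumPy sketch. It assumes softmax probabilities, a single sample, and a hypothetical weight `gamma`. The function names are illustrative and are not taken from the authors' repository, and temperature scaling is omitted.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D logit vector.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def normalized_non_target(probs, target):
    # Drop the target class and renormalize the remaining classes to sum to 1.
    mask = np.ones_like(probs, dtype=bool)
    mask[target] = False
    non_target = probs[mask]
    return non_target / non_target.sum()

def nkd_style_loss(student_logits, teacher_logits, target, gamma=1.0):
    # Assumed form: a target-class term plus a cross-entropy between the
    # normalized non-target distributions of teacher and student.
    s = softmax(student_logits)
    t = softmax(teacher_logits)
    target_term = -t[target] * np.log(s[target])
    s_hat = normalized_non_target(s, target)
    t_hat = normalized_non_target(t, target)
    non_target_term = -np.sum(t_hat * np.log(s_hat))
    return float(target_term + gamma * non_target_term)

student_logits = np.array([2.0, 1.0, 0.5, -1.0])
teacher_logits = np.array([3.0, 0.5, 1.0, -0.5])
loss = nkd_style_loss(student_logits, teacher_logits, target=0)
```

After normalization both non-target distributions sum to 1, so the student can in principle match the teacher exactly, which is the property the entry attributes to NKD.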
2303.12950
Report |
LightPainter: Interactive Portrait Relighting with Freehand Scribble |
Yiqun Mei, He Zhang, Xuaner Zhang, Jianming Zhang, Zhixin Shu, Yilin Wang, Zijun Wei, Shi Yan, HyunJoon Jung, Vishal M. Patel |
Recent portrait relighting methods have achieved realistic results of
portrait lighting effects given a desired lighting representation such as an
environment map. However, these methods are not intuitive for user interaction
and lack precise lighting control. We introduce LightPainter, a scribble-based
relighting system that allows users to interactively manipulate portrait
lighting effects with ease. This is achieved by two conditional neural networks:
a delighting module that recovers geometry and albedo optionally conditioned on
skin tone, and a scribble-based module for relighting. To train the relighting
module, we propose a novel scribble simulation procedure to mimic real user
scribbles, which allows our pipeline to be trained without any human
annotations. We demonstrate high-quality and flexible portrait lighting editing
capability with both quantitative and qualitative experiments. User study
comparisons with commercial lighting editing tools also demonstrate consistent
user preference for our method. |
This paper introduces LightPainter, a novel scribble-based interactive portrait relighting system that enables users to manipulate portrait lighting effects easily. |
Existing portrait relighting methods rely on lighting representations like environment maps or exemplar images, which are not intuitive for user interaction and lack precise lighting control. |
LightPainter uses two conditional neural networks: a delighting module to recover geometry and albedo optionally conditioned on skin tone, and a scribble-based module for relighting. It uses a novel scribble simulation procedure to mimic real user scribbles for training. |
LightPainter allows flexible lighting editing with scribbles and enables skin tone control with SkinFill.
User study shows LightPainter is more user-friendly and generates more faithful relighting results than other methods.
LightPainter outperforms state-of-the-art methods in terms of photorealism and fidelity on both light stage and in-the-wild images. |
LightPainter's performance relies on accurate geometry estimation.
The current scribble simulation may not cover all real-world cases. |
portrait relighting, interactive image editing, scribble-based interface, deep learning, computer vision |
2303.12865
Report |
NeRF-GAN Distillation for Efficient 3D-Aware Generation with Convolutions |
Mohamad Shahbazi, Evangelos Ntavelis, Alessio Tonioni, Edo Collins, Danda Pani Paudel, Martin Danelljan, Luc Van Gool |
Pose-conditioned convolutional generative models struggle with high-quality
3D-consistent image generation from single-view datasets, due to their lack of
sufficient 3D priors. Recently, the integration of Neural Radiance Fields
(NeRFs) and generative models, such as Generative Adversarial Networks (GANs),
has transformed 3D-aware generation from single-view images. NeRF-GANs exploit
the strong inductive bias of neural 3D representations and volumetric rendering
at the cost of higher computational complexity. This study aims at revisiting
pose-conditioned 2D GANs for efficient 3D-aware generation at inference time by
distilling 3D knowledge from pretrained NeRF-GANs. We propose a simple and
effective method, based on re-using the well-disentangled latent space of a
pre-trained NeRF-GAN in a pose-conditioned convolutional network to directly
generate 3D-consistent images corresponding to the underlying 3D
representations. Experiments on several datasets demonstrate that the proposed
method obtains results comparable with volumetric rendering in terms of quality
and 3D consistency while benefiting from the computational advantage of
convolutional networks. The code will be available at:
https://github.com/mshahbazi72/NeRF-GAN-Distillation |
This paper introduces a novel method for distilling pretrained NeRF-GANs into pose-conditioned convolutional generators, enabling efficient 3D-aware image generation. |
While NeRF-GANs excel in 3D-aware generation from single-view images, their reliance on volumetric rendering makes them computationally expensive. This work addresses this limitation by transferring the 3D knowledge to a faster convolutional generator. |
The method leverages the disentangled latent space of a pretrained NeRF-GAN (EG3D) to supervise a convolutional generator. By sharing the latent space and training with a combination of reconstruction and adversarial losses, the convolutional generator learns to produce 3D-consistent images. |
The proposed method generates images with comparable quality and 3D consistency to the NeRF-GAN, as evidenced by FID/KID scores and pose/identity preservation metrics.
It significantly outperforms traditional pose-conditioned GANs and a recent baseline (SURF) in terms of 3D consistency.
The convolutional generator achieves superior efficiency compared to the volumetric rendering approach, allowing for larger batch sizes and faster inference. |
The quality and consistency of the generated images are inherently limited by the pretrained NeRF-GAN.
Future work could focus on achieving even stronger correspondence between the convolutional and volumetric rendering in terms of semantic details. |
generative adversarial networks, neural radiance fields, 3d-aware generation, knowledge distillation, convolutional networks |
2303.12790
Report |
$CrowdDiff$: Multi-hypothesis Crowd Density Estimation using Diffusion Models |
Yasiru Ranasinghe, Nithin Gopalakrishnan Nair, Wele Gedara Chaminda Bandara, Vishal M. Patel |
Crowd counting is a fundamental problem in crowd analysis which is typically
accomplished by estimating a crowd density map and summing over the density
values. However, this approach suffers from background noise accumulation and
loss of density due to the use of broad Gaussian kernels to create the ground
truth density maps. This issue can be overcome by narrowing the Gaussian
kernel. However, existing approaches perform poorly when trained with ground
truth density maps with broad kernels. To deal with this limitation, we propose
using conditional diffusion models to predict density maps, as diffusion models
show high fidelity to training data during generation. With that, we present
$CrowdDiff$ that generates the crowd density map as a reverse diffusion
process. Furthermore, as the intermediate time steps of the diffusion process
are noisy, we incorporate a regression branch for direct crowd estimation only
during training to improve the feature learning. In addition, owing to the
stochastic nature of the diffusion model, we propose producing multiple
density maps to improve counting performance, in contrast to existing crowd
counting pipelines. We conduct extensive experiments on publicly available
datasets to validate the effectiveness of our method. $CrowdDiff$ outperforms
existing state-of-the-art crowd counting methods on several public crowd
analysis benchmarks with significant improvements. |
This paper introduces CrowdDiff, a novel crowd counting framework employing denoising diffusion probabilistic models to generate crowd density maps, enhancing accuracy by using narrow density kernels and enabling iterative improvement through multiple realizations. |
Existing density-based methods struggle with background noise and density loss, especially in congested scenes, while localization-based methods require crowd density heuristics. CrowdDiff addresses these limitations by combining the strengths of both approaches. |
CrowdDiff leverages a denoising diffusion process to generate crowd density maps, utilizing narrow Gaussian kernels for higher fidelity. A counting branch aids feature learning during training, and a novel fusion method combines multiple density map realizations for improved accuracy. |
CrowdDiff surpasses state-of-the-art crowd counting methods on public datasets, particularly excelling in dense scenes.
Using narrow kernels with diffusion models enables accurate counting in congested regions and reduces background noise accumulation.
The proposed crowd map fusion method significantly boosts counting performance by leveraging the stochastic nature of diffusion models. |
The iterative inference process of diffusion models leads to higher inference times compared to some existing methods.
Future work could explore the use of consistency models to speed up inference without sacrificing accuracy. |
crowd counting, diffusion models, density map estimation, crowd analysis, computer vision |
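The density-map formulation in the CrowdDiff entry above (one Gaussian per annotated person, count obtained by summing the map) can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the point coordinates, map size, and `sigma` values are made up, and each kernel is normalized on the grid so that it contributes exactly one count.

```python
import numpy as np

def density_map(points, shape, sigma):
    # Place one unit-mass Gaussian per annotated head position (row, col).
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    dmap = np.zeros(shape, dtype=np.float64)
    for (py, px) in points:
        g = np.exp(-((ys - py) ** 2 + (xs - px) ** 2) / (2.0 * sigma ** 2))
        dmap += g / g.sum()  # normalize on the grid: each person adds 1
    return dmap

points = [(10, 12), (11, 15), (40, 40)]
narrow = density_map(points, (64, 64), sigma=1.0)
broad = density_map(points, (64, 64), sigma=6.0)

count_narrow = narrow.sum()  # summing the map recovers the crowd count
count_broad = broad.sum()
peak_narrow = narrow.max()   # narrow kernels keep mass concentrated at heads
peak_broad = broad.max()     # broad kernels smear mass into the background
```

Both maps sum to the same count, but the broad-kernel map spreads each person's mass over many background pixels, which is the setting where the entry reports noise accumulation.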
2303.12786
Report |
FeatureNeRF: Learning Generalizable NeRFs by Distilling Foundation Models |
Jianglong Ye, Naiyan Wang, Xiaolong Wang |
Recent works on generalizable NeRFs have shown promising results on novel
view synthesis from single or few images. However, such models have rarely been
applied to other downstream tasks beyond synthesis, such as semantic
understanding and parsing. In this paper, we propose a novel framework named
FeatureNeRF to learn generalizable NeRFs by distilling pre-trained vision
foundation models (e.g., DINO, Latent Diffusion). FeatureNeRF transfers 2D
pre-trained foundation models to 3D space via neural rendering, and then
extracts deep features for 3D query points from NeRF MLPs. Consequently, it
can map 2D images to continuous 3D semantic feature volumes, which can be
used for various downstream tasks. We evaluate FeatureNeRF on tasks of 2D/3D
semantic keypoint transfer and 2D/3D object part segmentation. Our extensive
experiments demonstrate the effectiveness of FeatureNeRF as a generalizable 3D
semantic feature extractor. Our project page is available at
https://jianglongye.com/featurenerf/ . |
Proposes FeatureNeRF, a framework that distills pre-trained 2D vision foundation models into generalizable NeRFs to enable 3D semantic understanding from 2D images. |
Existing generalizable NeRFs focus on novel view synthesis and lack semantic understanding capabilities, while 3D foundation models are limited by the availability of large-scale 3D datasets. |
FeatureNeRF adds a feature branch to the NeRF MLP and trains it to match the features extracted from a 2D foundation model (e.g., DINO, Latent Diffusion) of the rendered image. It further introduces internal NeRF features and a coordinate loss to improve 3D semantic understanding. |
FeatureNeRF outperforms baselines in 2D/3D semantic keypoint transfer and 2D/3D object part segmentation tasks.
It enables novel view semantic keypoint transfer and part co-segmentation by rendering feature maps from unseen viewpoints.
The learned 3D semantic feature representation can be applied to editing applications, such as 3D part texture swapping. |
The performance of FeatureNeRF relies on the quality of the pre-trained 2D foundation models.
The current implementation requires known camera poses, limiting its application to in-the-wild images. |
neural radiance fields, nerf, foundation models, 3d semantic understanding, feature distillation |
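The idea in the FeatureNeRF entry above, rendering per-point features along a ray and matching them to a 2D teacher feature, can be sketched with standard NeRF volume-rendering weights. This is an illustration under assumptions: the mean-squared-error loss, the array shapes, and all names are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def render_weights(sigmas, deltas):
    # Standard volume rendering: alpha_i = 1 - exp(-sigma_i * delta_i),
    # T_i = prod_{j<i} (1 - alpha_j), weight_i = T_i * alpha_i.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    return trans * alphas

def render_feature(sigmas, deltas, point_features):
    # Weighted sum of per-sample feature vectors along one ray -> shape (D,).
    return render_weights(sigmas, deltas) @ point_features

def distillation_loss(rendered, teacher):
    # Assumed loss: mean squared error against the 2D teacher feature.
    return float(np.mean((rendered - teacher) ** 2))

rng = np.random.default_rng(0)
sigmas = rng.uniform(0.0, 5.0, size=16)       # densities at 16 ray samples
deltas = np.full(16, 0.1)                     # spacing between samples
point_features = rng.normal(size=(16, 8))     # feature-branch outputs
teacher_feature = rng.normal(size=8)          # e.g. a 2D foundation-model feature

rendered = render_feature(sigmas, deltas, point_features)
loss = distillation_loss(rendered, teacher_feature)
```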
2303.12733
Report |
On the De-duplication of LAION-2B |
Ryan Webster, Julien Rabin, Loic Simon, Frederic Jurie |
Generative models, such as DALL-E, Midjourney, and Stable Diffusion, have
societal implications that extend beyond the field of computer science. These
models require large image databases like LAION-2B, which contain two billion
images. At this scale, manual inspection is difficult and automated analysis is
challenging. In addition, recent studies show that duplicated images pose
copyright problems for models trained on LAION-2B, which hinders its usability.
This paper proposes an algorithmic chain that runs with modest compute and
compresses CLIP features to enable efficient duplicate detection, even for vast
image volumes. Our approach demonstrates that roughly 700 million images, or
about 30%, of LAION-2B's images are likely duplicated. Our method also
provides the histograms of duplication on this dataset, which we use to reveal
more examples of verbatim copies by Stable Diffusion and further justify the
approach. The current version of the de-duplicated set will be distributed
online. |
This paper presents an algorithmic chain for efficient duplicate detection in large image datasets, focusing on LAION-2B, using compressed CLIP features. |
Duplicate images in massive datasets like LAION-2B pose copyright concerns for trained models, especially generative ones like Stable Diffusion. This work aims to address this issue by efficiently identifying and removing duplicates, improving dataset usability. |
The authors propose SNIP, a contrastive feature compression technique that preserves text-image alignment in CLIP features. They use SNIP with approximate nearest neighbor search (IVFPQ) to efficiently find duplicates in LAION-2B. An adaptive thresholding strategy based on asymmetric distances is used to identify duplicates. |
The SNIP compression method shows better semantic retention for multimodal tasks compared to MSE-based compression while maintaining competitive retrieval performance.
The proposed method identifies roughly 700 million duplicate images in LAION-2B, approximately one-third of the dataset, with a precision of 91%.
By synthesizing images from the most duplicated subset, the authors were able to identify additional cases of verbatim copying by Stable Diffusion with significantly fewer resources than previous studies. |
The current de-duplication method is based on a conservative threshold and may miss some duplicates.
Future work could explore the impact of prompt variability and distinctiveness on image duplication. |
de-duplication, clip, image retrieval, generative models, copyright |
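Feature-based duplicate detection, as in the LAION-2B entry above, can be illustrated with a brute-force sketch. It is not the authors' method: it uses random vectors in place of CLIP features, exact pairwise cosine similarity instead of SNIP compression with IVFPQ search, and a fixed hypothetical `threshold` instead of the adaptive one the entry describes.

```python
import numpy as np

def find_duplicates(features, threshold=0.95):
    # Return index pairs (i, j), i < j, whose cosine similarity exceeds threshold.
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T
    i_idx, j_idx = np.where(np.triu(sim, k=1) > threshold)
    return list(zip(i_idx.tolist(), j_idx.tolist()))

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 32))                   # four unrelated "images"
near_copy = base[1] + 0.01 * rng.normal(size=32)  # slightly perturbed copy of item 1
features = np.vstack([base, near_copy])

pairs = find_duplicates(features)
```

The quadratic similarity matrix is only feasible for small sets. At the scale of billions of images, an approximate nearest-neighbor index over compressed features is needed, which is what the entry's pipeline provides.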
2303.12688
Report |
Pix2Video: Video Editing using Image Diffusion |
Duygu Ceylan, Chun-Hao Paul Huang, Niloy J. Mitra |
Image diffusion models, trained on massive image collections, have emerged as
the most versatile image generator models in terms of quality and diversity.
They support inverting real images and conditional (e.g., text) generation,
making them attractive for high-quality image editing applications. We
investigate how to use such pre-trained image models for text-guided video
editing. The critical challenge is to achieve the target edits while still
preserving the content of the source video. Our method works in two simple
steps: first, we use a pre-trained structure-guided (e.g., depth) image
diffusion model to perform text-guided edits on an anchor frame; then, in the
key step, we progressively propagate the changes to the future frames via
self-attention feature injection to adapt the core denoising step of the
diffusion model. We then consolidate the changes by adjusting the latent code
for the frame before continuing the process. Our approach is training-free and
generalizes to a wide range of edits. We demonstrate the effectiveness of the
approach by extensive experimentation and compare it against four different
prior and parallel efforts (on arXiv). We demonstrate that realistic
text-guided video edits are possible, without any compute-intensive
preprocessing or video-specific finetuning. |
This paper presents Pix2Video, a training-free method for text-guided video editing that leverages pre-trained image diffusion models, particularly a depth-conditioned Stable Diffusion model. |
Existing video editing techniques often require extensive training or per-video fine-tuning. This method aims to bridge this gap by leveraging the power of pre-trained image diffusion models for coherent and efficient video editing. |
The method employs a two-step process: (1) It uses a pre-trained depth-guided image diffusion model to perform text-guided editing on a selected anchor frame. (2) It propagates changes to other frames via self-attention feature injection in the diffusion model's denoising step and consolidates these changes by adjusting latent codes to ensure similarity with the preceding frame. |
Pix2Video can perform both localized (e.g., changing an object's color) and global (e.g., changing the overall style) edits on videos.
The method exhibits superior performance compared to baseline methods, achieving higher faithfulness to the text prompt and maintaining better temporal consistency.
User studies confirmed that Pix2Video generates edits that are more faithful to the prompts and are generally preferred over the results of other methods. |
The temporal coherency of the generated video can be further improved.
Handling longer videos requires addressing challenges with increasing distance from the anchor frame. |
video editing, image diffusion models, text-guided editing, stable diffusion, self-attention |
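The self-attention feature injection mentioned in the Pix2Video entry above can be sketched as attention in which the current frame supplies the queries while keys and values come from other frames. Which frames are used (here the anchor and the previous frame), the projection matrices, and all shapes are assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over token matrices of shape (tokens, dim).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def injected_self_attention(cur, anchor, prev, w_q, w_k, w_v):
    # Queries come from the current frame; keys and values are computed from
    # the concatenated anchor-frame and previous-frame features (assumption).
    context = np.concatenate([anchor, prev], axis=0)
    return attention(cur @ w_q, context @ w_k, context @ w_v)

rng = np.random.default_rng(0)
cur, anchor, prev = (rng.normal(size=(5, 8)) for _ in range(3))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))

out = injected_self_attention(cur, anchor, prev, w_q, w_k, w_v)
```

Because every output token is a convex combination of values from the reference frames, the edited appearance of those frames is carried into the current frame without any training.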
2303.12678
Report |
Uni-Fusion: Universal Continuous Mapping |
Yijun Yuan, Andreas Nuechter |
We present Uni-Fusion, a universal continuous mapping framework for surfaces,
surface properties (color, infrared, etc.) and more (latent features in CLIP
embedding space, etc.). We propose the first universal implicit encoding model
that supports encoding of both geometry and different types of properties (RGB,
infrared, features, etc.) without requiring any training. Based on this, our
framework divides the point cloud into regular grid voxels and generates a
latent feature in each voxel to form a Latent Implicit Map (LIM) for geometries
and arbitrary properties. Then, by fusing a local LIM frame-wise into a
global LIM, an incremental reconstruction is achieved. Encoded with
corresponding types of data, our Latent Implicit Map is capable of generating
continuous surfaces, surface property fields, surface feature fields, and all
other possible options. To demonstrate the capabilities of our model, we
implement three applications: (1) incremental reconstruction for surfaces and
color, (2) 2D-to-3D transfer of fabricated properties, and (3) open-vocabulary scene
understanding by creating a text CLIP feature field on surfaces. We evaluate
Uni-Fusion against existing methods in each application, where it
shows high flexibility while performing best or being
competitive. The project page of Uni-Fusion is available at
https://jarrome.github.io/Uni-Fusion/ . |
Presents Uni-Fusion, a universal continuous mapping framework for surfaces, surface properties (color, infrared, etc.), and high-dimensional features like CLIP embeddings, without requiring any training. |
Addresses the need for a single, universal mapping model in robotics that can handle various types of information, including geometry and surface properties, for tasks like reconstruction and scene understanding. |
Decouples Gaussian Process Regression (GPR) using kernel function approximation to encode local point cloud data into latent vectors. These vectors form a Latent Implicit Map (LIM) that is incrementally reconstructed by fusing local LIMs frame-wise into a global LIM. |
Uni-Fusion achieves state-of-the-art surface reconstruction accuracy on ScanNet, outperforming previous methods like BNV-Fusion.
It demonstrates high-quality color reconstruction on the Replica dataset, achieving results comparable to NeRF-SLAM in visual quality.
Uni-Fusion successfully performs open-vocabulary scene understanding by constructing a surface field for CLIP embeddings, enabling it to respond to various semantic queries. |
Currently lacks support for remapping, which is necessary for bundle adjustment and loop closing.
Future work involves exploring Visual Language Navigation (VLN) applications leveraging Uni-Fusion's ability to construct 3D embedding maps. |
continuous mapping, surface reconstruction, scene understanding, open-vocabulary, neural implicit maps |
2303.12417
Report |
CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data |
Yihan Zeng, Chenhan Jiang, Jiageng Mao, Jianhua Han, Chaoqiang Ye, Qingqiu Huang, Dit-Yan Yeung, Zhen Yang, Xiaodan Liang, Hang Xu |
Contrastive Language-Image Pre-training, benefiting from large-scale
unlabeled text-image pairs, has demonstrated great performance in open-world
vision understanding tasks. However, due to the limited Text-3D data pairs,
adapting the success of 2D Vision-Language Models (VLM) to the 3D space remains
an open problem. Existing works that leverage VLM for 3D understanding
generally resort to constructing intermediate 2D representations for the 3D
data, but at the cost of losing 3D geometry information. To take a step toward
open-world 3D vision understanding, we propose Contrastive Language-Image-Point
Cloud Pretraining (CLIP$^2$) to directly learn the transferable 3D point cloud
representation in realistic scenarios with a novel proxy alignment mechanism.
Specifically, we exploit naturally existing correspondences in 2D and 3D
scenarios, and build well-aligned and instance-based text-image-point proxies
from those complex scenarios. On top of that, we propose a cross-modal
contrastive objective to learn semantic and instance-level aligned point cloud
representation. Experimental results on both indoor and outdoor scenarios show
that our learned 3D representation has great transfer ability in downstream
tasks, including zero-shot and few-shot 3D recognition, which boosts the
state-of-the-art methods by large margins. Furthermore, we provide analyses of
the capability of different representations in real scenarios and present an
optional ensemble scheme. |
Presents CLIP², a novel framework for Contrastive Language-Image-Point Cloud Pretraining that learns transferable 3D point cloud representation directly from real-world scenarios with a new proxy alignment mechanism. |
Addresses the challenge of adapting the success of 2D Vision-Language Models (VLM) to 3D space due to limited Text-3D data pairs, aiming for open-world 3D vision understanding. |
Leverages existing large-scale point cloud datasets and constructs language-image-point triplets (Triplet Proxy Collection) to pretrain a point cloud encoder using a cross-modal contrastive objective (Cross-Modal Pretraining). |
Achieves state-of-the-art zero-shot transfer performance on 5 datasets, including indoor/outdoor scenes and single-object benchmarks.
Significantly outperforms baseline methods on zero-shot 3D recognition tasks, demonstrating the effectiveness of the learned 3D representation.
Showcases strong open-vocabulary recognition and localization abilities in both indoor and outdoor scenarios, recognizing objects beyond the predefined ground truth vocabulary. |
The current proxy generation process cannot provide bounding boxes for 3D objects that are as accurate and tight as those from dedicated detectors.
Limited by the scale of current proxy data, further performance improvement is expected with larger and more diverse proxy datasets. |
3d vision, vision-language model, zero-shot learning, point cloud representation learning, open-world recognition |
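A cross-modal contrastive objective of the kind named in the CLIP² entry above can be sketched as a symmetric InfoNCE loss between point-cloud embeddings and their paired text and image embeddings. The equal weighting of the two terms, the `temperature` value, and the function names are assumptions. The entry does not specify the exact formulation.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce(a, b, temperature=0.07):
    # Symmetric contrastive loss: row i of `a` is the positive for row i of `b`.
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    diag = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_ba = -log_softmax(logits, axis=0)[diag, diag].mean()
    return float(0.5 * (loss_ab + loss_ba))

def cross_modal_loss(point_emb, text_emb, image_emb):
    # Assumed combination: align points with text and with images, equal weights.
    return info_nce(point_emb, text_emb) + info_nce(point_emb, image_emb)

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(6, 16))
image_emb = text_emb + 0.05 * rng.normal(size=(6, 16))
point_emb = text_emb + 0.05 * rng.normal(size=(6, 16))

aligned = cross_modal_loss(point_emb, text_emb, image_emb)
shuffled = cross_modal_loss(np.roll(point_emb, 1, axis=0), text_emb, image_emb)
```

The loss is low when each point-cloud embedding sits next to its own text and image proxy, and high when the pairing is shuffled.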
2303.12368
Report |
MAIR: Multi-view Attention Inverse Rendering with 3D Spatially-Varying Lighting Estimation |
JunYong Choi, SeokYeong Lee, Haesol Park, Seung-Won Jung, Ig-Jae Kim, Junghyun Cho |
We propose a scene-level inverse rendering framework that uses multi-view
images to decompose the scene into geometry, an SVBRDF, and 3D spatially-varying
lighting. Because multi-view images provide a variety of information about the
scene, multi-view images in object-level inverse rendering have been taken for
granted. However, owing to the absence of a multi-view HDR synthetic dataset,
scene-level inverse rendering has mainly been studied using single-view images.
We were able to successfully perform scene-level inverse rendering using
multi-view images by expanding the OpenRooms dataset, designing efficient
pipelines to handle multi-view images, and splitting spatially-varying
lighting. Our experiments show that the proposed method not only achieves
better performance than single-view-based methods, but also achieves robust
performance on unseen real-world scenes. Also, our sophisticated 3D
spatially-varying lighting volume allows for photorealistic object insertion in
any 3D location. |
This paper presents MAIR, the first multi-view inverse rendering framework for scene-level decomposition into geometry, spatially-varying BRDF, and 3D spatially-varying lighting. |
Existing single-view inverse rendering methods struggle with complex real-world scenes due to reliance on contextual information for specular reflectance. MAIR overcomes this limitation by exploiting multi-view images and MVS depth, enabling more accurate and robust scene decomposition. |
MAIR uses a three-stage training pipeline: 1) Estimate direct lighting and geometry from multi-view inputs. 2) Estimate material properties using the estimated direct lighting and multi-view aggregation. 3) Infer 3D spatially-varying lighting by combining all estimated components. The authors create OpenRooms FF, a multi-view extension of the OpenRooms dataset, to train and evaluate MAIR. |
MAIR outperforms single-view methods in material and geometry estimation on the OpenRooms FF dataset.
Qualitative results on real-world images demonstrate MAIR's robustness in handling complex scenes and separating materials from lighting.
MAIR enables realistic object insertion in both synthetic and real-world scenes by accurately reproducing 3D lighting. |
Cascaded pipeline structure makes MAIR susceptible to errors in depth estimation.
Non-parametric VSG lighting representation limits its application in tasks such as light source editing. |
inverse rendering, multi-view stereo, spatially-varying lighting, scene understanding, object insertion |
2303.12346
Report |
NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation |
Shengming Yin, Chenfei Wu, Huan Yang, Jianfeng Wang, Xiaodong Wang, Minheng Ni, Zhengyuan Yang, Linjie Li, Shuguang Liu, Fan Yang, Jianlong Fu, Gong Ming, Lijuan Wang, Zicheng Liu, Houqiang Li, Nan Duan |
In this paper, we propose NUWA-XL, a novel Diffusion over Diffusion
architecture for eXtremely Long video generation. Most current work generates
long videos segment by segment sequentially, which normally leads to a gap
between training on short videos and inferring long videos, and the sequential
generation is inefficient. Instead, our approach adopts a "coarse-to-fine"
process, in which the video can be generated in parallel at the same
granularity. A global diffusion model is applied to generate the keyframes
across the entire time range, and then local diffusion models recursively fill
in the content between nearby frames. This simple yet effective strategy allows
us to directly train on long videos (3376 frames) to reduce the
training-inference gap, and makes it possible to generate all segments in
parallel. To evaluate our model, we build the FlintstonesHD dataset, a new
benchmark for long video generation. Experiments show that our model not only
generates high-quality long videos with both global and local coherence, but
also decreases the average inference time from 7.55 min to 26 s (by 94.26%) at
the same hardware setting when generating 1024 frames. The homepage link is
https://msra-nuwa.azurewebsites.net/
Proposes NUWA-XL, a "Diffusion over Diffusion" architecture for generating extremely long videos using a "coarse-to-fine" process. |
Existing methods, relying on "Autoregressive over X" architectures, struggle with a training-inference gap and inefficient sequential generation, leading to incoherent and unrealistic long videos. |
1. A global diffusion model generates keyframes spanning the entire video, creating a coarse storyline. 2. Local diffusion models recursively fill in content between adjacent keyframes with increasing detail. |
Directly trained on long videos (3376 frames), eliminating the training-inference gap.
Generates higher-quality long videos with better global and local coherence compared to "Autoregressive over X" methods.
Significantly faster inference (a 94.26% reduction in inference time when generating 1024 frames) due to parallel processing of local diffusions. |
Limited evaluation on open-domain long videos due to data availability; currently validated on a cartoon dataset.
Requires significant GPU resources for parallel inference to achieve the speedup. |
video generation, long video generation, diffusion models, coarse-to-fine, parallel inference |
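The inference-time figure in the NUWA-XL entry above can be checked with a short calculation. The two timings are the ones stated in the abstract. The variable names are illustrative.

```python
sequential_seconds = 7.55 * 60  # 7.55 min reported for sequential generation
parallel_seconds = 26.0         # 26 s reported for NUWA-XL at 1024 frames

# Fraction of inference time removed, and the equivalent speedup factor.
reduction = (sequential_seconds - parallel_seconds) / sequential_seconds
speedup = sequential_seconds / parallel_seconds

reduction_percent = round(reduction * 100, 2)
```

The 94.26% figure is therefore a reduction in wall-clock time, which corresponds to a speedup factor of roughly 17x.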
2303.12343
Report |
LD-ZNet: A Latent Diffusion Approach for Text-Based Image Segmentation |
Koutilya Pnvr, Bharat Singh, Pallabi Ghosh, Behjat Siddiquie, David Jacobs |
Large-scale pre-training tasks like image classification, captioning, or
self-supervised techniques do not incentivize learning the semantic boundaries
of objects. However, recent generative foundation models built using text-based
latent diffusion techniques may learn semantic boundaries. This is because they
have to synthesize intricate details about all objects in an image based on a
text description. Therefore, we present a technique for segmenting real and
AI-generated images using latent diffusion models (LDMs) trained on
internet-scale datasets. First, we show that the latent space of LDMs (z-space)
is a better input representation compared to other feature representations like
RGB images or CLIP encodings for text-based image segmentation. By training the
segmentation models on the latent z-space, which creates a compressed
representation across several domains like different forms of art, cartoons,
illustrations, and photographs, we are also able to bridge the domain gap
between real and AI-generated images. We show that the internal features of
LDMs contain rich semantic information and present a technique in the form of
LD-ZNet to further boost the performance of text-based segmentation. Overall,
we show up to 6% improvement over standard baselines for text-to-image
segmentation on natural images. For AI-generated imagery, we show close to 20%
improvement compared to state-of-the-art techniques. The project is available
at https://koutilya-pnvr.github.io/LD-ZNet/. |
This paper presents LD-ZNet, a novel text-based image segmentation technique leveraging latent diffusion models (LDMs) trained on large-scale datasets. |
The work addresses the limitation of large-scale pre-training tasks (e.g., image classification) in learning semantic boundaries of objects, which is crucial for open-world image segmentation, particularly in editing AI-generated content. |
The methodology involves analyzing the latent space (z-space) of LDMs as input representation for segmentation and incorporating internal LDM features via cross-attention into a segmentation network (ZNet) to create LD-ZNet. |
LD-ZNet shows up to 6% improvement over baselines for text-to-image segmentation on natural images.
For AI-generated imagery, LD-ZNet achieves close to 20% improvement over state-of-the-art techniques.
Analysis reveals that the internal features of LDMs contain rich semantic information, especially in middle layers and specific timesteps during the denoising process. |
The approach's reliance on LDMs increases inference time compared to some baselines.
Future work could explore optimizing the trade-off between performance and computational cost. |
image segmentation, latent diffusion models, text-to-image synthesis, ai-generated images, semantic segmentation |
2303.12326
Report |
Make Encoder Great Again in 3D GAN Inversion through Geometry and Occlusion-Aware Encoding |
Ziyang Yuan, Yiming Zhu, Yu Li, Hongyu Liu, Chun Yuan |
3D GAN inversion aims to achieve high reconstruction fidelity and reasonable
3D geometry simultaneously from a single image input. However, existing 3D GAN
inversion methods rely on time-consuming optimization for each individual case.
In this work, we introduce a novel encoder-based inversion framework based on
EG3D, one of the most widely used 3D GAN models. We leverage the inherent
properties of EG3D's latent space to design a discriminator and a background
depth regularization. This enables us to train a geometry-aware encoder capable
of converting the input image into corresponding latent code. Additionally, we
explore the feature space of EG3D and develop an adaptive refinement stage that
improves the representation ability of features in EG3D to enhance the recovery
of fine-grained textural details. Finally, we propose an occlusion-aware fusion
operation to prevent distortion in unobserved regions. Our method achieves
impressive results comparable to optimization-based methods while operating up
to 500 times faster. Our framework is well-suited for applications such as
semantic editing. |
This paper introduces a novel encoder-based 3D GAN inversion method for EG3D that leverages a "canonical latent space" within the model for improved reconstruction fidelity and 3D geometry. |
Existing 3D GAN inversion methods are either time-consuming (optimization-based) or lack fidelity (encoder-based). This method aims to achieve both efficiency and high-quality inversion. |
The method utilizes a geometry-aware encoder trained with a canonical latent discriminator and background depth regularization. It also uses an adaptive feature alignment module to refine generator features and an occlusion-aware fusion operation for multi-view consistency. |
Achieves high-quality inversion comparable to optimization-based methods while being significantly faster (up to 500 times).
Exhibits robust performance even when inverting images with extreme poses.
Demonstrates effectiveness for 3D-aware semantic editing applications. |
The reliance on paired data for training may limit generalization.
Future work could explore applying the method to other 3D GAN architectures. |
3d gan inversion, eg3d, canonical latent space, adaptive feature alignment, occlusion-aware fusion |
2303.12218
Report |
Compositional 3D Scene Generation using Locally Conditioned Diffusion |
Ryan Po, Gordon Wetzstein |
Designing complex 3D scenes has been a tedious, manual process requiring
domain expertise. Emerging text-to-3D generative models show great promise for
making this task more intuitive, but existing approaches are limited to
object-level generation. We introduce locally conditioned diffusion as
an approach to compositional scene diffusion, providing control over semantic
parts using text prompts and bounding boxes while ensuring seamless transitions
between these parts. We demonstrate a score distillation sampling-based
text-to-3D synthesis pipeline that enables compositional 3D scene generation at
a higher fidelity than relevant baselines. |
This paper introduces locally conditioned diffusion, a method for compositional 3D scene generation using diffusion models with control over semantic elements through text prompts and bounding boxes. |
Designing 3D scenes is a laborious process, and this method aims to simplify it while offering control over scene composition. |
The method leverages pre-trained text-conditioned 2D diffusion models and applies locally conditioned diffusion to a score distillation sampling-based 3D generation pipeline. It utilizes 3D bounding boxes and text prompts to guide the generation process. |
The method generates high-quality 3D scenes that adhere to the user-specified layout with seamless transitions between elements.
It provides control over the size and position of individual assets within the scene.
It outperforms baseline methods in terms of CLIP R-Precision, indicating better alignment with input prompts. |
The generation process can be slow, especially for scenes with multiple distinct elements, due to reliance on thousands of denoising iterations.
The heavy reliance on high guidance scales for score distillation sampling can lead to limited diversity in generated outputs. |
3d scene generation, diffusion models, compositional synthesis, text-to-3d, score distillation sampling |
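The compositing idea behind locally conditioned diffusion can be illustrated with a toy sketch. It assumes each region has its own noise prediction (a stand-in for a diffusion model conditioned on that region's text prompt) and that the composed prediction takes, at each pixel, the prediction of the region whose mask covers it. The function name `compose_noise`, the nested-list layout, and the toy values are hypothetical; the actual pipeline applies this inside score distillation sampling rather than on plain grids.

```python
def compose_noise(predictions, masks):
    """Blend per-region noise predictions using binary region masks.

    predictions: list of H x W grids, one per text prompt (hypothetical
                 stand-ins for denoiser outputs conditioned on each prompt).
    masks:       list of H x W grids of 0/1, assumed to partition the image.
    """
    height, width = len(masks[0]), len(masks[0][0])
    composed = [[0.0] * width for _ in range(height)]
    for pred, mask in zip(predictions, masks):
        for y in range(height):
            for x in range(width):
                # Each pixel only receives the prediction of its own region.
                composed[y][x] += mask[y][x] * pred[y][x]
    return composed


# Toy 2 x 4 example: the left half follows prompt A, the right half prompt B.
pred_a = [[1.0] * 4 for _ in range(2)]
pred_b = [[-1.0] * 4 for _ in range(2)]
mask_a = [[1, 1, 0, 0] for _ in range(2)]
mask_b = [[0, 0, 1, 1] for _ in range(2)]
composed = compose_noise([pred_a, pred_b], [mask_a, mask_b])
```

Because every denoising step sees the full composed image, neighbouring regions can still influence each other, which is one plausible reason the transitions between parts stay seamless.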
2303.11938
Report |
3D-CLFusion: Fast Text-to-3D Rendering with Contrastive Latent Diffusion |
Yu-Jhe Li, Tao Xu, Ji Hou, Bichen Wu, Xiaoliang Dai, Albert Pumarola, Peizhao Zhang, Peter Vajda, Kris Kitani |
We tackle the task of text-to-3D creation with pre-trained latent-based NeRFs
(NeRFs that generate 3D objects given input latent code). Recent works such as
DreamFusion and Magic3D have shown great success in generating 3D content using
NeRFs and text prompts, but the current approach of optimizing a NeRF for every
text prompt is 1) extremely time-consuming and 2) often leads to low-resolution
outputs. To address these challenges, we propose a novel method named
3D-CLFusion which leverages the pre-trained latent-based NeRFs and performs
fast 3D content creation in less than a minute. In particular, we introduce a
latent diffusion prior network for learning the w latent from the input CLIP
text/image embeddings. This pipeline allows us to produce the w latent without
further optimization during inference and the pre-trained NeRF is able to
perform multi-view high-resolution 3D synthesis based on the latent. The
novelty of our model lies in introducing contrastive learning while training
the diffusion prior, which enables the generation of valid view-invariant
latent codes. We demonstrate through experiments the
effectiveness of our proposed view-invariant diffusion process for fast
text-to-3D creation, e.g., 100 times faster than DreamFusion. We note that our
model can serve as a plug-and-play tool for text-to-3D with
pre-trained NeRFs. |
This paper introduces 3D-CLFusion, a novel method for fast text-to-3D creation that leverages pre-trained latent-based NeRFs and a latent diffusion prior network. |
Current text-to-3D methods using NeRFs are time-consuming (taking hours per object) and often produce low-resolution outputs due to optimizing a NeRF from scratch for each text prompt. 3D-CLFusion addresses these limitations by enabling fast generation (<1 minute) and high-resolution rendering. |
3D-CLFusion consists of a diffusion prior network trained on CLIP image embeddings and a pre-trained latent-based NeRF. It uses contrastive learning during training to ensure the generated latent codes are view-invariant, allowing for consistent 3D object generation from various viewpoints. |
3D-CLFusion generates 3D objects from text prompts significantly faster (around 100x) than methods like DreamFusion and Magic3D.
The use of contrastive learning in the diffusion process is crucial for achieving view-invariant latent codes and thus, consistent 3D objects.
The approach demonstrates promising results on different pre-trained NeRF generators, including StyleNeRF and EG3D, for various object classes like faces and cars. |
The generated 3D objects are limited to the domain of the pre-trained NeRF model used.
Future work could explore extending the method to handle a wider variety of objects and scenes by incorporating more diverse pre-trained NeRF models or developing techniques for generalizing across domains. |
text-to-3d, neural radiance fields (nerfs), latent diffusion models, contrastive learning, view-invariance |
2303.11916
Report |
CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion |
Geonmo Gu, Sanghyuk Chun, Wonjae Kim, HeeJae Jun, Yoohoon Kang, Sangdoo Yun |
This paper proposes a novel diffusion-based model, CompoDiff, for solving
zero-shot Composed Image Retrieval (ZS-CIR) with latent diffusion. This paper
also introduces a new synthetic dataset, named SynthTriplets18M, with 18.8
million triplets of reference images, conditions, and corresponding target
images to train CIR models. CompoDiff and SynthTriplets18M tackle the shortcomings of
previous CIR approaches, such as poor generalizability due to the small dataset
scale and the limited types of conditions. CompoDiff not only achieves a new
state-of-the-art on four ZS-CIR benchmarks, including FashionIQ, CIRR, CIRCO,
and GeneCIS, but also enables more versatile and controllable CIR by
accepting various conditions, such as negative text and image masks.
CompoDiff also shows the controllability of the condition strength between text
and image queries and the trade-off between inference speed and performance,
which are unavailable with existing CIR methods. The code and dataset are
available at https://github.com/navervision/CompoDiff |
This paper introduces CompoDiff, a novel diffusion-based model for zero-shot Composed Image Retrieval (ZS-CIR) using latent diffusion. It also presents SynthTriplets18M, a large synthetic dataset for training CIR models. |
Existing CIR methods suffer from limited generalizability due to small datasets and restricted condition types. This work aims to address these limitations and enable versatile CIR with diverse conditions. |
CompoDiff leverages a latent diffusion model with classifier-free guidance to edit reference images in CLIP latent space. It is trained on a massive synthetic dataset, SynthTriplets18M, generated by automatically creating and filtering image-caption triplets. |
CompoDiff achieves state-of-the-art zero-shot performance on FashionIQ, CIRR, CIRCO, and GeneCIS benchmarks.
Training existing CIR methods on SynthTriplets18M also leads to significant improvements, surpassing previous zero-shot methods.
CompoDiff allows versatile CIR with multiple conditions (negative text, image masks) and enables controlling condition strength and inference speed. |
Current CIR benchmarks might not fully represent real-world queries.
Quantitative evaluation of retrieval quality on large-scale databases requires further exploration. |
composed image retrieval, latent diffusion, zero-shot learning, synthetic dataset, classifier-free guidance |
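The classifier-free guidance mentioned in the CompoDiff entry can be sketched in its standard single-condition form: the guided prediction moves from the unconditional output toward the conditional one by a guidance scale. This is a minimal illustration on plain lists; the function name and toy values are hypothetical, and the way CompoDiff combines several conditions (text, negative text, masks) is not specified in this entry and is not modelled here.

```python
def classifier_free_guidance(uncond, cond, scale):
    """Standard classifier-free guidance on flat vectors.

    uncond: denoiser output without the condition.
    cond:   denoiser output with the condition.
    scale:  guidance strength; 0 ignores the condition, 1 reproduces `cond`,
            larger values extrapolate past it.
    """
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Toy vectors standing in for predictions in a latent embedding space.
uncond = [0.0, 1.0, 2.0]
cond = [1.0, 1.0, 0.0]
guided = classifier_free_guidance(uncond, cond, scale=2.0)
```

Varying `scale` is the kind of knob that lets a user trade off how strongly a condition steers the result.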
2303.11797
Report |
CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation |
Seokju Cho, Heeseong Shin, Sunghwan Hong, Anurag Arnab, Paul Hongsuck Seo, Seungryong Kim |
Open-vocabulary semantic segmentation presents the challenge of labeling each
pixel within an image based on a wide range of text descriptions. In this work,
we introduce a novel cost-based approach to adapt vision-language foundation
models, notably CLIP, for the intricate task of semantic segmentation. Through
aggregating the cosine similarity score, i.e., the cost volume between image
and text embeddings, our method potently adapts CLIP for segmenting seen and
unseen classes by fine-tuning its encoders, addressing the challenges faced by
existing methods in handling unseen classes. Building upon this, we explore
methods to effectively aggregate the cost volume considering its multi-modal
nature of being established between image and text embeddings. Furthermore, we
examine various methods for efficiently fine-tuning CLIP. |
This paper presents CAT-Seg, a novel cost-based framework for open-vocabulary semantic segmentation that leverages CLIP by aggregating cosine similarity scores between image and text embeddings. |
Existing methods struggle to adapt CLIP for pixel-level prediction due to overfitting issues when fine-tuning. This paper addresses this gap by proposing a cost aggregation approach that effectively adapts CLIP to the segmentation task. |
CAT-Seg computes a cost volume from image and text embeddings of CLIP and aggregates it through spatial and class aggregation modules. Additionally, it utilizes embedding guidance and efficiently fine-tunes CLIP encoders for optimal performance. |
CAT-Seg achieves state-of-the-art results on standard open-vocabulary benchmarks, outperforming previous methods by a large margin.
The framework generalizes well to multi-domain datasets, showing robustness to domain shifts.
CAT-Seg demonstrates strong efficiency in both training and inference compared to region-text methods. |
The reliability of evaluation datasets for open-vocabulary semantic segmentation is questionable due to ambiguities in ground truth.
Further investigation into handcrafted text prompts for improved performance is a potential avenue for future work. |
open-vocabulary semantic segmentation, vision-language models, clip, cost aggregation, fine-tuning |
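The cost volume at the heart of the CAT-Seg entry is the cosine similarity between every per-pixel image embedding and every class text embedding. The sketch below computes it on nested lists; the function names, the tiny 1 x 2 grid, and the 2-dimensional embeddings are hypothetical stand-ins for CLIP features. Taking the per-pixel argmax of the raw cost, as done at the end, is only a naive baseline: the method described in the entry aggregates the cost volume with spatial and class aggregation modules first.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cost_volume(image_embeddings, text_embeddings):
    """image_embeddings: H x W x D, text_embeddings: N x D -> H x W x N."""
    return [
        [[cosine_similarity(pixel, text) for text in text_embeddings]
         for pixel in row]
        for row in image_embeddings
    ]


# Toy inputs: a 1 x 2 image with 2-dimensional embeddings and two classes.
image_embeddings = [[[1.0, 0.0], [0.0, 2.0]]]
text_embeddings = [[3.0, 0.0], [0.0, 1.0]]
volume = cost_volume(image_embeddings, text_embeddings)

# Naive labelling: pick the class with the highest raw similarity per pixel.
labels = [
    [max(range(len(costs)), key=costs.__getitem__) for costs in row]
    for row in volume
]
```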
2303.11749
Report |
Detecting Everything in the Open World: Towards Universal Object Detection |
Zhenyu Wang, Yali Li, Xi Chen, Ser-Nam Lim, Antonio Torralba, Hengshuang Zhao, Shengjin Wang |
In this paper, we formally address universal object detection, which aims to
detect every scene and predict every category. The dependence on human
annotations, the limited visual information, and the novel categories in the
open world severely restrict the universality of traditional detectors. We
propose UniDetector, a universal object detector that can recognize an
enormous number of categories in the open world. The critical points for the
universality of UniDetector are: 1) it leverages images of multiple sources and
heterogeneous label spaces for training through the alignment of image and text
spaces, which guarantees sufficient information for universal representations.
2) it generalizes to the open world easily while keeping the balance between
seen and unseen classes, thanks to abundant information from both vision and
language modalities. 3) it further improves generalization to novel
categories through our proposed decoupled training scheme and probability
calibration. These contributions allow UniDetector to detect over 7k
categories, the largest measurable category size so far, with only about 500
classes participating in training. UniDetector exhibits strong zero-shot
generalization on large-vocabulary datasets like LVIS, ImageNetBoxes,
and VisualGenome: it surpasses the traditional supervised baselines by more
than 4% on average without seeing any corresponding images. On 13 public
detection datasets with various scenes, UniDetector also achieves
state-of-the-art performance with only 3% of the training data. |
Proposes UniDetector, a universal object detection framework capable of detecting a vast number of categories in open-world scenarios, even those not present during training. |
Addresses the limitations of traditional object detectors that struggle to generalize to unseen categories and diverse scenes, aiming for human-like generalization capabilities in object detection. |
Leverages image-text pre-training to align image and text spaces, enabling detection of novel categories. Employs decoupled training of proposal generation and RoI classification stages, along with probability calibration, to enhance open-world performance. |
Achieves state-of-the-art performance on 13 diverse object detection datasets with only 3% of the training data used by comparable methods.
Outperforms traditional supervised methods on large-vocabulary datasets by over 4% AP, demonstrating strong zero-shot generalization.
Achieves 49.3% AP on COCO with a ResNet50 backbone and 1x schedule, highlighting its effectiveness in both open and closed-world settings. |
Current implementation primarily focuses on object-centric datasets, with future work exploring more complex scenes.
Relies on accurate language embeddings for novel category detection, which could be limited by the quality of pre-trained language models. |
object detection, open-world learning, zero-shot learning, image-text pre-training, universal object detection |
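The probability calibration named in the UniDetector entry is not spelled out here. One common way to rebalance seen and unseen classes is to divide each class probability by a power of that class's prior, so that frequently predicted (typically seen) classes are damped. The sketch below illustrates that idea only; the function name, the `gamma` parameter, and the toy numbers are assumptions, not the published formulation.

```python
def calibrate(probabilities, priors, gamma):
    """Divide each class probability by its prior raised to `gamma`.

    probabilities: raw per-class scores for one detection.
    priors:        assumed per-class prior probabilities (hypothetical values).
    gamma:         calibration strength; 0 leaves the scores unchanged.
    """
    return [p / (prior ** gamma) for p, prior in zip(probabilities, priors)]


# Toy case: class 0 is a frequent seen class, class 1 a rare novel class.
raw = [0.6, 0.5]
priors = [0.9, 0.1]
calibrated = calibrate(raw, priors, gamma=1.0)
```

In this toy case the raw scores favour the frequent class, while the calibrated scores favour the rare one.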
2303.11686
Report |
Learning a 3D Morphable Face Reflectance Model from Low-cost Data |
Yuxuan Han, Zhibo Wang, Feng Xu |
Modeling non-Lambertian effects such as facial specularity leads to a more
realistic 3D Morphable Face Model. Existing works build parametric models for
diffuse and specular albedo using Light Stage data. However, diffuse and
specular albedo alone cannot determine the full BRDF. In addition, the requirement of
Light Stage data is hard for the research community to fulfill. This paper
proposes the first 3D morphable face reflectance model with spatially varying
BRDF using only low-cost, publicly available data. We apply linear shininess
weighting in parametric modeling to represent spatially varying specular
intensity and shininess. Then an inverse rendering algorithm is developed to
reconstruct the reflectance parameters from non-Light Stage data, which are
used to train an initial morphable reflectance model. To enhance the model's
generalization capability and expressive power, we further propose an
update-by-reconstruction strategy to finetune it on an in-the-wild dataset.
Experimental results show that our method obtains decent rendering results with
plausible facial specularities. Our code is released at
https://yxuhan.github.io/ReflectanceMM/index.html. |
This paper introduces the first 3D morphable face reflectance model that incorporates spatially varying Bidirectional Reflectance Distribution Function (BRDF) learned from readily accessible, low-cost data. |
Existing morphable face models struggle to realistically represent facial specularity, a key element for lifelike rendering. This work addresses this limitation by using a novel BRDF model trained on widely available data, removing the reliance on expensive and complex Light Stage setups. |
The authors utilize a linear combination of Blinn-Phong BRDFs with predefined exponents to characterize the specular component of reflectance. They develop an inverse rendering method to estimate reflectance parameters from the Multi-PIE dataset, generating an initial model. This model is then refined on the FFHQ dataset through a joint face reconstruction and model update process. |
The model successfully captures spatially varying specular intensity and shininess on faces, leading to more realistic renderings compared to models using a global specular exponent.
Fine-tuning the model on in-the-wild data significantly enhances its generalization ability, demonstrated through superior performance in photometric face reconstruction on the CelebA-HQ dataset.
The model exhibits plausible disentanglement of diffuse and specular shading and shows promise for relighting applications, generating realistic specular reflections under novel lighting conditions. |
The model currently employs a Lambertian BRDF for diffuse reflectance, limiting its ability to represent subsurface scattering effects. Integrating a more sophisticated diffuse reflectance model could enhance realism.
Representing the complex specular properties of the eye region remains a challenge. Further research is needed to effectively model specular reflections around the eyes. |
3d morphable face model, reflectance modeling, brdf, inverse rendering, face relighting |
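The specular term described in this entry, a linear combination of Blinn-Phong lobes with predefined exponents, can be sketched as follows. Each lobe is the cosine between the surface normal and the half vector raised to a fixed exponent, and per-point weights decide how much each lobe contributes. The exponent values, weights, and function names are hypothetical, and normalization factors that a full BRDF would include are omitted.

```python
import math


def normalize(vector):
    """Scale a non-zero 3D vector to unit length."""
    length = math.sqrt(sum(c * c for c in vector))
    return [c / length for c in vector]


def specular(normal, light_dir, view_dir, weights, exponents):
    """Weighted sum of Blinn-Phong lobes, one lobe per predefined exponent."""
    light = normalize(light_dir)
    view = normalize(view_dir)
    # Half vector between light and view (undefined if they are opposite).
    half = normalize([l + v for l, v in zip(light, view)])
    n_dot_h = max(sum(n * h for n, h in zip(normalize(normal), half)), 0.0)
    return sum(w * n_dot_h ** p for w, p in zip(weights, exponents))


EXPONENTS = [1, 8, 64]  # hypothetical predefined exponents

normal = [0.0, 0.0, 1.0]
view = [0.0, 0.0, 1.0]
off_axis_light = [1.0, 0.0, 1.0]

# All weight on the low exponent gives a broad highlight, all weight on the
# high exponent gives a tight one that fades quickly away from the peak.
broad = specular(normal, off_axis_light, view, [1.0, 0.0, 0.0], EXPONENTS)
sharp = specular(normal, off_axis_light, view, [0.0, 0.0, 1.0], EXPONENTS)
```

Storing one weight vector per surface point is what lets both specular intensity (the sum of the weights) and apparent shininess (their distribution over exponents) vary spatially.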
2303.11424
Report |
Polynomial Implicit Neural Representations For Large Diverse Datasets |
Rajhans Singh, Ankita Shukla, Pavan Turaga |
Implicit neural representations (INR) have gained significant popularity for
signal and image representation for many end-tasks, such as superresolution, 3D
modeling, and more. Most INR architectures rely on sinusoidal positional
encoding, which accounts for high-frequency information in data. However, the
finite encoding size restricts the model's representational power. Higher
representational power is needed to go from representing a single given image
to representing large and diverse datasets. Our approach addresses this gap by
representing an image with a polynomial function and eliminates the need for
positional encodings. Therefore, to achieve a progressively higher degree of
polynomial representation, we use element-wise multiplications between features
and affine-transformed coordinate locations after every ReLU layer. The
proposed method is evaluated qualitatively and quantitatively on large datasets
like ImageNet. The proposed Poly-INR model performs comparably to
state-of-the-art generative models without any convolution, normalization, or
self-attention layers, and with far fewer trainable parameters. With far fewer
trainable parameters and higher representational power, our approach paves the way
for broader adoption of INR models for generative modeling tasks in complex
domains. The code is available at https://github.com/Rajhans0/Poly_INR |
This paper proposes Poly-INR, a novel Implicit Neural Representation (INR) model that leverages polynomial functions to represent large and diverse image datasets. |
Existing INRs, often reliant on sinusoidal positional encoding, face limitations in representational power when scaled to large datasets like ImageNet. Poly-INR addresses this by using polynomials, enabling efficient parameterization and handling of high-frequency information. |
Poly-INR consists of a mapping network (converting latent codes to affine parameters) and a synthesis network. The latter progressively increases the degree of the polynomial representation by element-wise multiplication between features and affine-transformed coordinate locations after each ReLU layer. |
Poly-INR achieves comparable performance to state-of-the-art CNN-based GANs (e.g., StyleGAN-XL) on ImageNet, with 3-4 times fewer parameters.
It outperforms previous INR-based GANs (CIPS, INR-GAN) on FFHQ dataset with a smaller model size.
The model demonstrates strong capabilities in image interpolation, extrapolation, style-mixing, high-resolution sampling, and inversion. |
Poly-INR's computational cost is higher than multi-scale CNN generators for high-resolution synthesis due to pixel-wise computation.
The model sometimes exhibits GAN artifacts (e.g., multiple limbs) potentially due to limitations in the discriminator's shape understanding. |
implicit neural representations, generative models, polynomial representation, image synthesis, stylegan |
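The way the Poly-INR synthesis network raises the polynomial degree level by level can be sketched with a toy evaluator. At each level the current features are multiplied element-wise by an affine transform of the pixel coordinates, then passed through a linear layer and a ReLU. The helper names, the tiny one-channel configuration, and the exact ordering of multiplication, linear layer, and activation are assumptions made for illustration; in the described model the affine parameters come from the mapping network.

```python
def relu(vector):
    return [max(v, 0.0) for v in vector]


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def synthesize(x, y, levels):
    """Evaluate a toy Poly-INR-style network at pixel coordinate (x, y).

    levels: list of (affine, linear) pairs, where
      affine is a C x 3 matrix applied to [x, y, 1] and
      linear is a C x C matrix applied after the element-wise product.
    """
    coords = [x, y, 1.0]
    features = [1.0] * len(levels[0][0])
    for affine, linear in levels:
        modulation = matvec(affine, coords)
        # Multiplying by an affine function of (x, y) raises the degree by one.
        features = [f * m for f, m in zip(features, modulation)]
        features = relu(matvec(linear, features))
    return features


# One channel, three identical levels whose affine map simply returns x.
levels = [([[1.0, 0.0, 0.0]], [[1.0]])] * 3
value = synthesize(2.0, 0.0, levels)
```

With three levels and positive inputs (where the ReLU acts as the identity), the toy network computes x cubed, showing how depth controls the degree without any positional encoding.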
2303.11396
Report |
Text2Tex: Text-driven Texture Synthesis via Diffusion Models |
Dave Zhenyu Chen, Yawar Siddiqui, Hsin-Ying Lee, Sergey Tulyakov, Matthias Nießner |
We present Text2Tex, a novel method for generating high-quality textures for
3D meshes from given text prompts. Our method incorporates inpainting into
a pre-trained depth-aware image diffusion model to progressively synthesize
high-resolution partial textures from multiple viewpoints. To avoid
accumulating inconsistent and stretched artifacts across views, we dynamically
segment the rendered view into a generation mask, which represents the
generation status of each visible texel. This partitioned view representation
guides the depth-aware inpainting model to generate and update partial textures
for the corresponding regions. Furthermore, we propose an automatic view
sequence generation scheme to determine the next best view for updating the
partial texture. Extensive experiments demonstrate that our method
significantly outperforms the existing text-driven approaches and GAN-based
methods. |
Text2Tex: a novel method for generating high-quality 3D textures on meshes from text prompts. |
Automating 3D texture design using text guidance is important for efficient 3D content creation, but existing methods struggle to produce high-quality and consistent results. |
Text2Tex leverages a pretrained depth-aware text-to-image diffusion model to progressively generate and refine textures. It uses a view partitioning technique for consistent inpainting and an automatic viewpoint selection scheme for refinement. |
Significantly outperforms existing text-driven methods in FID and KID on Objaverse dataset.
Outperforms category-specific GAN-based methods on ShapeNet car dataset.
Preferred by human users in a user study for realism and fidelity to text prompts. |
Generated textures can exhibit shading effects inherited from the diffusion backbone.
Future work could explore fine-tuning the diffusion model to remove shading artifacts. |
3d texture synthesis, text-guided generation, depth-aware diffusion model, view partitioning, automatic viewpoint selection |
2303.11324
Report |
Open-vocabulary Panoptic Segmentation with Embedding Modulation |
Xi Chen, Shuang Li, Ser-Nam Lim, Antonio Torralba, Hengshuang Zhao |
Open-vocabulary image segmentation is attracting increasing attention due to
its critical applications in the real world. Traditional closed-vocabulary
segmentation methods are not able to characterize novel objects, whereas
several recent open-vocabulary attempts obtain unsatisfactory results, i.e.,
notable performance reduction on the closed vocabulary and massive demand for
extra data. To this end, we propose OPSNet, an omnipotent and data-efficient
framework for Open-vocabulary Panoptic Segmentation. Specifically, the
exquisitely designed Embedding Modulation module, together with several
meticulous components, enables adequate embedding enhancement and information
exchange between the segmentation model and the visual-linguistic well-aligned
CLIP encoder, resulting in superior segmentation performance under both open-
and closed-vocabulary settings with much less need for additional data.
Extensive experimental evaluations are conducted across multiple datasets
(e.g., COCO, ADE20K, Cityscapes, and PascalContext) under various
circumstances, where the proposed OPSNet achieves state-of-the-art results,
which demonstrates the effectiveness and generality of the proposed approach.
The code and trained models will be made publicly available. |
This paper proposes OPSNet, an omnipotent and data-efficient framework for Open-vocabulary Panoptic Segmentation that uses an Embedding Modulation module for enhanced information exchange between the segmentation model and a visual-linguistic CLIP encoder. |
Open-vocabulary image segmentation is crucial for real-world applications as it allows for the segmentation and recognition of both known and unknown objects, overcoming limitations of traditional closed-vocabulary methods that fail to characterize novel objects. |
OPSNet predicts class-agnostic object masks and utilizes a Spatial Adapter to extract CLIP visual features. It employs Embedding Modulation, combining query and CLIP embeddings, to enhance recognition. Mask Filtering refines mask proposals, while Decoupled Supervision uses image-level labels for training, expanding training concepts. |
OPSNet achieves state-of-the-art results in open-vocabulary panoptic segmentation across multiple datasets (COCO, ADE20K, Cityscapes).
It demonstrates superior performance compared to previous open-vocabulary semantic segmentation methods while maintaining strong performance on closed-vocabulary datasets.
The proposed Embedding Modulation module effectively enhances object recognition by leveraging both query and CLIP embeddings, balancing in-domain accuracy and generalization to novel categories. |
The accuracy of open-vocabulary predictions can be further improved, as current results show occasional noise and misclassifications.
Future work includes exploring the use of larger and more diverse datasets for training to further enhance the generalization ability of OPSNet to broader object categories. |
open-vocabulary segmentation, panoptic segmentation, clip, embedding modulation, cross-dataset generalization |
2303.11316
Report |
Generative Semantic Segmentation |
Jiaqi Chen, Jiachen Lu, Xiatian Zhu, Li Zhang |
We present Generative Semantic Segmentation (GSS), a generative learning
approach for semantic segmentation. Uniquely, we cast semantic segmentation as
an image-conditioned mask generation problem. This is achieved by replacing the
conventional per-pixel discriminative learning with a latent prior learning
process. Specifically, we model the variational posterior distribution of
latent variables given the segmentation mask. To that end, the segmentation
mask is expressed as a special type of image (dubbed maskige). This
posterior distribution allows segmentation masks to be generated unconditionally.
To achieve semantic segmentation on a given image, we further introduce a
conditioning network. It is optimized by minimizing the divergence between the
posterior distribution of maskige (i.e., segmentation masks) and the latent
prior distribution of input training images. Extensive experiments on standard
benchmarks show that our GSS performs competitively with prior art
alternatives in the standard semantic segmentation setting, while achieving a
new state of the art in the more challenging cross-domain setting. |
This paper introduces Generative Semantic Segmentation (GSS), a novel approach that formulates semantic segmentation as an image-conditioned mask generation problem, marking a shift from traditional discriminative learning paradigms. |
This new perspective allows leveraging the power of off-the-shelf big generative models pretrained on massive datasets, potentially leading to more efficient and domain-agnostic segmentation models. |
GSS uses a two-stage optimization process. First, it learns a latent posterior distribution for reconstructing segmentation masks efficiently using the concept of "maskige" and pretrained VQVAE. Second, it learns a latent prior distribution conditioned on input images to generate segmentation masks. |
GSS achieves competitive performance compared to state-of-the-art discriminative models on Cityscapes and ADE20K.
GSS outperforms existing methods in cross-domain semantic segmentation on the MSeg dataset, demonstrating superior domain generalization capabilities.
The proposed "maskige" mechanism proves to be efficient and domain-agnostic, enabling the use of pretrained generative models and transfer learning across datasets. |
While showing promising results, GSS's performance still lags behind the top discriminative models, indicating potential for improvement in segmentation accuracy.
The current approach is limited by the color space used for "maskige" representation, particularly when dealing with a large number of categories, and exploring higher-dimensional representations could be beneficial. |
generative semantic segmentation, maskige, vqvae, cross-domain segmentation, domain generalization |
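The "maskige" idea, expressing a segmentation mask as an ordinary color image so that image generators can produce it, can be illustrated with a toy round trip. The sketch assumes a fixed class-to-color palette for encoding and nearest-color lookup for decoding; the palette, function names, and noise model are hypothetical, and the mapping used by GSS itself may differ.

```python
PALETTE = {0: (0, 0, 0), 1: (255, 0, 0), 2: (0, 255, 0)}  # hypothetical colors


def mask_to_maskige(mask, palette):
    """Replace every class index with its RGB color."""
    return [[palette[label] for label in row] for row in mask]


def maskige_to_mask(maskige, palette):
    """Map every pixel back to the class with the nearest palette color."""
    def nearest(color):
        return min(
            palette,
            key=lambda label: sum(
                (a - b) ** 2 for a, b in zip(palette[label], color)
            ),
        )
    return [[nearest(color) for color in row] for row in maskige]


mask = [[0, 1], [2, 1]]
maskige = mask_to_maskige(mask, PALETTE)

# Simulate an imperfect generated maskige by shifting the red channel.
noisy = [[(r + 10, g, b) for (r, g, b) in row] for row in maskige]
recovered = maskige_to_mask(noisy, PALETTE)
```

The round trip only tolerates noise while palette colors stay well separated, which gives some intuition for the color-space limitation the entry mentions for large numbers of categories.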
2303.11313
Report |
CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition |
Deepti Hegde, Jeya Maria Jose Valanarasu, Vishal M. Patel |
Vision-Language models like CLIP have been widely adopted for various tasks
due to their impressive zero-shot capabilities. However, CLIP is not suitable
for extracting 3D geometric features as it was trained only on images and text
with natural language supervision. We address this limitation and
propose a new framework termed CG3D (CLIP Goes 3D) where a 3D encoder is
learned to exhibit zero-shot capabilities. CG3D is trained using triplets of
pointclouds, corresponding rendered 2D images, and texts using natural language
supervision. To align the features in a multimodal embedding space, we utilize
contrastive loss on 3D features obtained from the 3D encoder, as well as visual
and text features extracted from CLIP. We note that the natural images used to
train CLIP and the rendered 2D images in CG3D have a distribution shift.
Attempting to train the visual and text encoder to account for this shift
results in catastrophic forgetting and a notable decrease in performance. To
solve this, we employ prompt tuning and introduce trainable parameters in the
input space to shift CLIP towards the 3D pre-training dataset utilized in CG3D.
We extensively test our pre-trained CG3D framework and demonstrate its
impressive capabilities in zero-shot, open scene understanding, and retrieval
tasks. Further, it also serves as strong starting weights for fine-tuning in
downstream 3D recognition tasks. |
Presents CG3D, a new framework for training a 3D encoder with natural language supervision leveraging CLIP, enabling zero-shot 3D recognition and serving as strong initialization for fine-tuning 3D recognition tasks. |
Addresses the lack of 3D networks with zero-shot capabilities similar to CLIP, crucial for 3D understanding tasks and improving existing 3D backbones. |
Trains a 3D encoder with contrastive loss aligning 3D, image, and text features using ShapeNet data. Employs prompt tuning to shift CLIP's visual encoder towards rendered 3D objects. |
Significantly outperforms PointCLIP in zero-shot 3D recognition on ModelNet and ScanObjectNN.
Demonstrates effective scene querying with language and cross-modal 3D retrieval.
Provides a competitive performance boost when used as pre-training for downstream 3D recognition tasks. |
Limited pre-training dataset size and reliance on simulated point cloud objects.
Focus on objects rather than scenes, limiting full scene understanding capabilities. |
3d vision, zero-shot learning, vision-language models, clip, prompt tuning |
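The alignment step described for CG3D (a contrastive loss between 3D features and CLIP image and text features) can be sketched as below. This is a minimal illustration, not the authors' implementation: the symmetric InfoNCE form, the temperature value, and the names `info_nce` and `alignment_loss` are assumptions made for this example.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def info_nce(anchors, targets, temperature=0.07):
    """Mean cross-entropy in which anchors[i] should match targets[i]."""
    a = [normalize(v) for v in anchors]
    t = [normalize(v) for v in targets]
    loss = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(a)

def alignment_loss(f3d, fimg, ftxt, temperature=0.07):
    """Hypothetical total loss: symmetric 3D-image and 3D-text terms."""
    return (info_nce(f3d, fimg, temperature) + info_nce(fimg, f3d, temperature)
            + info_nce(f3d, ftxt, temperature) + info_nce(ftxt, f3d, temperature)) / 2

# Toy batch of two samples with 2-D features.
features = [[1.0, 0.0], [0.0, 1.0]]
aligned = alignment_loss(features, features, features)
shuffled = alignment_loss(features, features[::-1], features[::-1])
```

With matched triplets the loss is close to zero, while swapping the image and text features of the two samples makes it large, which is the behaviour a contrastive alignment objective is meant to have.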
2303.11306
Report |
Localizing Object-level Shape Variations with Text-to-Image Diffusion Models |
Or Patashnik, Daniel Garibi, Idan Azuri, Hadar Averbuch-Elor, Daniel Cohen-Or |
Text-to-image models give rise to workflows which often begin with an
exploration step, where users sift through a large collection of generated
images. The global nature of the text-to-image generation process prevents
users from narrowing their exploration to a particular object in the image. In
this paper, we present a technique to generate a collection of images that
depicts variations in the shape of a specific object, enabling an object-level
shape exploration process. Creating plausible variations is challenging as it
requires control over the shape of the generated object while respecting its
semantics. A particular challenge when generating object variations is
accurately localizing the manipulation applied over the object's shape. We
introduce a prompt-mixing technique that switches between prompts along the
denoising process to attain a variety of shape choices. To localize the
image-space operation, we present two techniques that use the self-attention
layers in conjunction with the cross-attention layers. Moreover, we show that
these localization techniques are general and effective beyond the scope of
generating object variations. Extensive results and comparisons demonstrate the
effectiveness of our method in generating object variations, and the competence
of our localization techniques. |
This paper introduces a technique to generate variations in the shape of a specific object within an image using text-to-image diffusion models, enabling object-level shape exploration. |
Existing text-to-image generation methods lack object-level control, making it difficult to refine specific objects during exploration. This method addresses this limitation by allowing users to generate and explore variations of a chosen object within an image. |
The method uses a prompt-mixing technique that leverages the coarse-to-fine nature of the denoising process. It uses different prompts in different time intervals to control the image layout, object shape, and fine details. It also introduces two localization techniques: one based on injecting self-attention maps from the original image to preserve shapes of other objects, and another using a novel self-segmentation method based on attention maps to preserve the background and selected objects. |
The method generates a diverse range of plausible shape variations for a specific object in an image, while preserving the overall image composition.
The introduced localization techniques effectively preserve the shapes of other objects in the image and the background, resulting in more coherent and realistic variations.
Quantitative and qualitative comparisons with other methods demonstrate the effectiveness of the proposed method in generating diverse and faithful object-level variations while preserving the original image content. |
The automatic proxy word selection, while generally effective, may sometimes produce unexpected words that lead to less plausible shape variations.
The method currently explores a discrete word space for proxy words, which could be extended to a continuous space for more nuanced control over shape variations. |
text-to-image synthesis, diffusion models, object-level control, shape exploration, image editing |
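The prompt-mixing idea (different prompts in different intervals of the denoising process) can be sketched as a per-step prompt schedule. This is an illustrative sketch only: the function name `prompt_schedule`, the interval boundaries, and the choice to use the original prompt in the first and last intervals and a proxy prompt in the middle one are assumptions, not details taken from the report.

```python
def prompt_schedule(original, proxy, num_steps, shape_start, shape_end):
    """Return one prompt per denoising step (step 0 is the noisiest).

    Steps in [shape_start, shape_end) use the proxy prompt, which is assumed
    to control object shape; all other steps use the original prompt, which
    is assumed to control layout (early steps) and fine details (late steps).
    """
    if not 0 <= shape_start <= shape_end <= num_steps:
        raise ValueError("expected 0 <= shape_start <= shape_end <= num_steps")
    return [
        proxy if shape_start <= step < shape_end else original
        for step in range(num_steps)
    ]

# Hypothetical example: swap "oranges" for "pumpkins" in the middle steps.
schedule = prompt_schedule(
    "a basket with oranges", "a basket with pumpkins",
    num_steps=10, shape_start=3, shape_end=7,
)
```

A sampler would then condition step `i` on `schedule[i]`; the localization techniques described in the report are not modeled here.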
2303.11162
Report |
Picture that Sketch: Photorealistic Image Generation from Abstract Sketches |
Subhadeep Koley, Ayan Kumar Bhunia, Aneeshan Sain, Pinaki Nath Chowdhury, Tao Xiang, Yi-Zhe Song |
Given an abstract, deformed, ordinary sketch from untrained amateurs like you
and me, this paper turns it into a photorealistic image - just like those shown
in Fig. 1(a), all non-cherry-picked. We differ significantly from prior art in
that we do not dictate an edgemap-like sketch to start with, but aim to work
with abstract free-hand human sketches. In doing so, we essentially democratise
the sketch-to-photo pipeline, "picturing" a sketch regardless of how well you
sketch. Our contribution at the outset is a decoupled encoder-decoder training
paradigm, where the decoder is a StyleGAN trained on photos only. This
importantly ensures that generated results are always photorealistic. The rest
is then all centred around how best to deal with the abstraction gap between
sketch and photo. For that, we propose an autoregressive sketch mapper trained
on sketch-photo pairs that maps a sketch to the StyleGAN latent space. We
further introduce specific designs to tackle the abstract nature of human
sketches, including a fine-grained discriminative loss on the back of a trained
sketch-photo retrieval model, and a partial-aware sketch augmentation strategy.
Finally, we showcase a few downstream tasks our generation model enables,
amongst them is showing how fine-grained sketch-based image retrieval, a
well-studied problem in the sketch community, can be reduced to an image
(generated) to image retrieval task, surpassing state-of-the-arts. We put
forward generated results in the supplementary for everyone to scrutinise. |
This paper presents a novel method to generate photorealistic images from abstract, deformed, amateur sketches. |
Existing sketch-to-photo methods rely on pixel-aligned edgemaps and fail to produce realistic outputs from abstract human sketches. This work aims to democratize sketch-to-photo generation by enabling photorealistic generation from untrained amateur sketches. |
The proposed method employs a decoupled encoder-decoder training paradigm. A StyleGAN, pre-trained on photos, acts as a decoder, ensuring photorealistic outputs. An autoregressive sketch mapper, trained on sketch-photo pairs, encodes sketches into the StyleGAN's latent space. A fine-grained discriminative loss and a partial-aware sketch augmentation strategy further enhance the generation from abstract sketches. |
The proposed method significantly outperforms state-of-the-art methods in terms of photorealism and fidelity to the input sketch.
The model exhibits strong generalization ability, effectively handling sketches with varying levels of abstraction and noise.
The generation model enables downstream applications such as fine-grained sketch-based image retrieval and precise semantic editing. |
The quality of generated photos is limited by the diversity and quality of the training data.
Future work can explore more sophisticated architectures for the sketch mapper and incorporate additional constraints for enhanced control over the generation process. |
sketch-to-photo generation, generative adversarial networks (gans), image-to-image translation, stylegan, fine-grained image retrieval |
2303.11120
Report |
Positional Diffusion: Ordering Unordered Sets with Diffusion Probabilistic Models |
Francesco Giuliari, Gianluca Scarpellini, Stuart James, Yiming Wang, Alessio Del Bue |
Positional reasoning is the process of ordering unsorted parts contained in a
set into a consistent structure. We present Positional Diffusion, a
plug-and-play graph formulation with Diffusion Probabilistic Models to address
positional reasoning. We use the forward process to map elements' positions in
a set to random positions in a continuous space. Positional Diffusion learns to
reverse the noising process and recover the original positions through an
Attention-based Graph Neural Network. We conduct extensive experiments with
benchmark datasets including two puzzle datasets, three sentence ordering
datasets, and one visual storytelling dataset, demonstrating that our method
outperforms long-standing research on puzzle solving by up to +18% compared to
the second-best deep learning method, and performs on par against the
state-of-the-art methods on sentence ordering and visual storytelling. Our work
highlights the suitability of diffusion models for ordering problems and
proposes a novel formulation and method for solving various ordering tasks.
Project website at https://iit-pavis.github.io/Positional_Diffusion/ |
Positional Diffusion is a novel graph-based Diffusion Probabilistic Model (DPM) for positional reasoning, addressing the challenge of ordering elements in unordered sets. |
Positional reasoning is a fundamental human skill crucial for various tasks, and a robust, task-agnostic method for addressing this challenge is highly desirable. |
The method uses a graph representation of the set, where each element is a node, and employs an Attention-based Graph Neural Network (GNN) within a DPM framework. During training, the model learns to reverse a noising process applied to node positions, guided by node features. At inference, it iteratively refines initially random positions to recover the correct order. |
Positional Diffusion achieves state-of-the-art performance on puzzle solving, outperforming previous methods by a significant margin.
It demonstrates competitive results on sentence ordering, achieving state-of-the-art performance on a subset of benchmark datasets.
The model also exhibits strong performance on visual storytelling, on par with state-of-the-art methods designed specifically for this task. |
The performance of Positional Diffusion on sentence ordering tasks with loosely structured text, such as ROCStories, is relatively weaker.
Future work includes exploring different graph structures beyond fully connected graphs to potentially enhance performance. |
positional reasoning, diffusion probabilistic models, graph neural networks, puzzle solving, sentence ordering, visual storytelling |
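The forward process that maps element positions to random positions can be sketched with the standard DDPM closed form, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps. The linear beta schedule, its endpoints, and the names `alpha_bar` and `noise_positions` are assumptions for illustration; the report does not state which schedule the authors use.

```python
import math
import random

def alpha_bar(t, num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) over the first t steps (linear schedule)."""
    prod = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / max(num_steps - 1, 1)
        prod *= 1.0 - beta
    return prod

def noise_positions(positions, t, num_steps, rng):
    """Noise each position vector to step t; also return the sampled noise.

    The noise is what a denoising network (an attention-based GNN in the
    report) would be trained to predict from the noisy positions and the
    node features.
    """
    ab = alpha_bar(t, num_steps)
    noisy, noise = [], []
    for p in positions:
        eps = [rng.gauss(0.0, 1.0) for _ in p]
        noisy.append([math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e
                      for x, e in zip(p, eps)])
        noise.append(eps)
    return noisy, noise

# Four puzzle pieces on a 2x2 grid, with coordinates in [-1, 1].
grid = [[-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]]
noisy_grid, eps = noise_positions(grid, t=500, num_steps=1000, rng=random.Random(0))
```

At `t = 0` the positions are returned unchanged, and at the final step almost no signal from the original layout remains, so inference can start from purely random positions.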
2303.11108
Report |
CHATEDIT: Towards Multi-turn Interactive Facial Image Editing via Dialogue |
Xing Cui, Zekun Li, Peipei Li, Yibo Hu, Hailin Shi, Zhaofeng He |
This paper explores interactive facial image editing via dialogue and
introduces the ChatEdit benchmark dataset for evaluating image editing and
conversation abilities in this context. ChatEdit is constructed from the
CelebA-HQ dataset, incorporating annotated multi-turn dialogues corresponding
to user edit requests on the images. The dataset is challenging, as it requires
the system to dynamically track user requests, edit images, and generate
appropriate responses. Accordingly, we propose three benchmark tasks: (i) user
edit request tracking, (ii) image editing, and (iii) response generation. We
present a novel baseline framework that integrates a dialogue module for both
tracking user requests and generating responses and an image editing module for
image editing. Unlike previous approaches, our framework directly tracks user
edit requests from the entire dialogue history up to the current turn and
modifies the original image rather than adjusting the previous turn's output,
thereby reducing error accumulation and preventing attribute forgetfulness.
Extensive experiments on the ChatEdit dataset underline our framework's
superior performance against prior models, while also highlighting potential
room for further research. We will release the code and data publicly to
facilitate advancements in complex interactive facial image editing. |
This paper introduces ChatEdit, a benchmark dataset for multi-turn interactive facial image editing via dialogue, and proposes a novel framework combining a task-oriented dialogue module and an image editing module. |
Existing approaches to interactive image editing suffer from error accumulation, attribute forgetting, and limited response generation capabilities, highlighting the need for a dedicated benchmark and improved methods. |
ChatEdit is constructed from CelebA-HQ, enriched with multi-turn dialogues and user belief states. The proposed framework leverages a T5-based dialogue module for request tracking and response generation, and StyleCLIP for image editing guided by extracted requests. |
The proposed framework outperforms single-turn editing methods, demonstrating reduced error accumulation and improved image quality.
Extracting concise user requests from dialogues significantly improves performance compared to directly using raw dialogue context.
Human evaluations confirm that the framework generates higher-quality images and more engaging, human-like responses compared to previous methods. |
The dataset currently considers a limited set of editable attributes, potentially hindering generalization to out-of-domain user requests.
The current two-stage framework might benefit from end-to-end training to further enhance image editing quality. |
interactive image editing, facial image manipulation, dialogue systems, task-oriented dialogue, benchmark dataset |
2303.11086
Report |
Pluralistic Aging Diffusion Autoencoder |
Peipei Li, Rui Wang, Huaibo Huang, Ran He, Zhaofeng He |
Face aging is an ill-posed problem because multiple plausible aging patterns
may correspond to a given input. Most existing methods often produce one
deterministic estimation. This paper proposes a novel CLIP-driven Pluralistic
Aging Diffusion Autoencoder (PADA) to enhance the diversity of aging patterns.
First, we employ diffusion models to generate diverse low-level aging details
via a sequential denoising reverse process. Second, we present Probabilistic
Aging Embedding (PAE) to capture diverse high-level aging patterns, which
represents age information as probabilistic distributions in the common CLIP
latent space. A text-guided KL-divergence loss is designed to guide this
learning. Our method can achieve pluralistic face aging conditioned on
open-world aging texts and arbitrary unseen face images. Qualitative and
quantitative experiments demonstrate that our method can generate more diverse
and high-quality plausible aging results. |
This paper proposes PADA, a CLIP-driven Pluralistic Aging Diffusion Autoencoder, to generate diverse and plausible face aging results. |
Face aging is an ill-posed problem as multiple plausible aging patterns can exist for a single input. Existing methods often produce only one deterministic estimation, limiting their realism. |
PADA uses diffusion models for diverse low-level aging details and introduces Probabilistic Aging Embedding (PAE) to capture high-level aging patterns as distributions in the CLIP latent space. Text-guided KL-divergence loss aids in learning PAE. |
PADA generates more diverse aging results than state-of-the-art methods, capturing both high-level (e.g., shape, skin color) and low-level (e.g., wrinkles) variations.
PADA achieves superior aging accuracy and identity preservation compared to existing methods.
The method allows for flexible user interaction, enabling aging based on open-world text descriptions and arbitrary reference images. |
The conflict between aging accuracy and identity preservation requires careful balancing.
The reliance on pre-trained models might introduce biases. |
face aging, diffusion models, clip, probabilistic embedding, generative models |
2303.11073
Report |
Discovering Interpretable Directions in the Semantic Latent Space of Diffusion Models |
René Haas, Inbar Huberman-Spiegelglas, Rotem Mulayoff, Tomer Michaeli |
Denoising Diffusion Models (DDMs) have emerged as a strong competitor to
Generative Adversarial Networks (GANs). However, despite their widespread use
in image synthesis and editing applications, their latent space is still not as
well understood. Recently, a semantic latent space for DDMs, coined
'h-space', was shown to facilitate semantic image editing in a way
reminiscent of GANs. The h-space is comprised of the bottleneck activations
in the DDM's denoiser across all timesteps of the diffusion process. In this
paper, we explore the properties of h-space and propose several novel methods
for finding meaningful semantic directions within it. We start by studying
unsupervised methods for revealing interpretable semantic directions in
pretrained DDMs. Specifically, we show that global latent directions emerge as
the principal components in the latent space. Additionally, we provide a novel
method for discovering image-specific semantic directions by spectral analysis
of the Jacobian of the denoiser w.r.t. the latent code. Next, we extend the
analysis by finding directions in a supervised fashion in unconditional DDMs.
We demonstrate how such directions can be found by relying on either a labeled
data set of real images or by annotating generated samples with a
domain-specific attribute classifier. We further show how to semantically
disentangle the found direction by simple linear projection. Our approaches are
applicable without requiring any architectural modifications, text-based
guidance, CLIP-based optimization, or model fine-tuning. |
This paper proposes several supervised and unsupervised methods for discovering interpretable directions in the semantic latent space of Denoising Diffusion Models (DDMs). |
The work is important because it provides new approaches for semantic image editing in DDMs, an area that has been less explored compared to Generative Adversarial Networks (GANs). |
The authors leverage the recently proposed 'h-space' for DDMs, which comprises the bottleneck activations of the denoiser across all timesteps. They explore unsupervised methods like principal component analysis (PCA) and spectral analysis of the denoiser's Jacobian. They also propose a supervised method that utilizes labeled data or attribute classifiers to find directions corresponding to specific attributes. |
Principal components in the h-space correspond to global semantic directions like pose, gender, and age.
Spectral analysis of the denoiser's Jacobian reveals image-specific semantic directions, enabling localized edits like opening/closing of eyes and mouth.
Supervised methods using labeled data or attribute classifiers effectively discover directions for specific attributes and can be disentangled using linear projection. |
Unsupervised methods show limited interpretability when applied to DDMs trained on less structured datasets.
Future work includes exploring alternative unsupervised techniques for diverse datasets and extending the supervised method to more complex attributes. |
denoising diffusion models, semantic image editing, latent space manipulation, unsupervised learning, supervised learning |
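The disentanglement by "simple linear projection" can be sketched as removing from a found direction its component along an unwanted direction. The helper names and the toy vectors below are hypothetical; real h-space directions are high-dimensional activations, and how the unwanted direction is obtained is not modeled here.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def remove_component(direction, unwanted):
    """Project `direction` onto the orthogonal complement of `unwanted`."""
    scale = dot(direction, unwanted) / dot(unwanted, unwanted)
    return [d - scale * u for d, u in zip(direction, unwanted)]

# Toy example: a direction entangled with an unwanted attribute on axis 1.
found = [1.0, 1.0, 0.0]
unwanted = [0.0, 2.0, 0.0]
cleaned = remove_component(found, unwanted)
```

After the projection, moving along `cleaned` no longer changes the coordinate associated with `unwanted`, which is the sense in which the two attributes are linearly disentangled.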
2303.11052
Report |
ContraNeRF: Generalizable Neural Radiance Fields for Synthetic-to-real Novel View Synthesis via Contrastive Learning |
Hao Yang, Lanqing Hong, Aoxue Li, Tianyang Hu, Zhenguo Li, Gim Hee Lee, Liwei Wang |
Although many recent works have investigated generalizable NeRF-based novel
view synthesis for unseen scenes, they seldom consider the synthetic-to-real
generalization, which is desired in many practical applications. In this work,
we first investigate the effects of synthetic data in synthetic-to-real novel
view synthesis and surprisingly observe that models trained with synthetic data
tend to produce sharper but less accurate volume densities. For pixels where
the volume densities are correct, fine-grained details will be obtained.
Otherwise, severe artifacts will be produced. To maintain the advantages of
using synthetic data while avoiding its negative effects, we propose to
introduce geometry-aware contrastive learning to learn multi-view consistent
features with geometric constraints. Meanwhile, we adopt cross-view attention
to further enhance the geometry perception of features by querying features
across input views. Experiments demonstrate that under the synthetic-to-real
setting, our method can render images with higher quality and better
fine-grained details, outperforming existing generalizable novel view synthesis
methods in terms of PSNR, SSIM, and LPIPS. When trained on real data, our
method also achieves state-of-the-art results. |
This paper proposes ContraNeRF, a novel view synthesis method based on neural radiance fields (NeRF) that generalizes well from synthetic data to real data using contrastive learning with geometry consistency. |
Synthetic-to-real novel view synthesis, while desired for its cost-effectiveness, is seldom investigated in existing generalizable NeRF methods. This paper observes that models trained on synthetic data often produce sharper but less accurate volume densities on real data, leading to artifacts. ContraNeRF addresses this issue by incorporating geometry awareness during training. |
The proposed method consists of three key components: (1) Geometry Aware Feature Extraction enhances image features by exchanging information between source views through cross-view attention. (2) Geometry Aware Contrastive Learning utilizes geometric constraints to enhance multi-view consistency by comparing similarities of local features between pairs of source views. (3) Rendering utilizes a coarse-to-fine sampling strategy, accumulating colors along the ray weighted by densities after softmax. |
ContraNeRF outperforms existing generalizable NeRF methods in synthetic-to-real novel view synthesis, achieving higher PSNR and SSIM and lower LPIPS on the ScanNet dataset.
The method also demonstrates state-of-the-art results on DTU and LLFF datasets under the real-to-real setting.
Experiments reveal that a small proportion of real data combined with synthetic data can achieve performance comparable to using real data alone. |
Current methods, including ContraNeRF, struggle to generate high-quality images for highly blurred scenes, a common occurrence in real-world datasets.
Future work can explore incorporating deblurring techniques within the synthetic-to-real generalization framework for improved performance. |
novel view synthesis, neural radiance fields, synthetic-to-real generalization, contrastive learning, geometry awareness |
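The rendering step, described as accumulating colors along the ray weighted by densities after softmax, can be sketched as follows. This is only a reading of that one sentence: the function names are hypothetical, the coarse-to-fine sampling is omitted, and the weighting differs from the alpha compositing used in the original NeRF formulation.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def render_ray(densities, colors):
    """Blend per-sample RGB colors using softmax-normalized densities."""
    weights = softmax(densities)
    pixel = [sum(w * c[k] for w, c in zip(weights, colors)) for k in range(3)]
    return pixel, weights

# Three samples along one ray; the middle sample has the highest density.
pixel, weights = render_ray(
    densities=[0.1, 4.0, 0.2],
    colors=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
)
```

Because the weights sum to one, the rendered pixel is a convex combination of the sample colors, dominated by the sample with the largest predicted density.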
2303.10598
Report |
StyleRF: Zero-shot 3D Style Transfer of Neural Radiance Fields |
Kunhao Liu, Fangneng Zhan, Yiwen Chen, Jiahui Zhang, Yingchen Yu, Abdulmotaleb El Saddik, Shijian Lu, Eric Xing |
3D style transfer aims to render stylized novel views of a 3D scene with
multi-view consistency. However, most existing work suffers from a three-way
dilemma over accurate geometry reconstruction, high-quality stylization, and
being generalizable to arbitrary new styles. We propose StyleRF (Style Radiance
Fields), an innovative 3D style transfer technique that resolves the three-way
dilemma by performing style transformation within the feature space of a
radiance field. StyleRF employs an explicit grid of high-level features to
represent 3D scenes, with which high-fidelity geometry can be reliably restored
via volume rendering. In addition, it transforms the grid features according to
the reference style which directly leads to high-quality zero-shot style
transfer. StyleRF consists of two innovative designs. The first is
sampling-invariant content transformation that makes the transformation
invariant to the holistic statistics of the sampled 3D points and accordingly
ensures multi-view consistency. The second is deferred style transformation of
2D feature maps which is equivalent to the transformation of 3D points but
greatly reduces memory footprint without degrading multi-view consistency.
Extensive experiments show that StyleRF achieves superior 3D stylization
quality with precise geometry reconstruction and it can generalize to various
new styles in a zero-shot manner. |
Presents StyleRF, a method for zero-shot 3D style transfer of neural radiance fields. |
Enables stylization of 3D scenes using arbitrary artistic styles without requiring paired training data. |
Leverages a feature grid to represent the 3D scene and employs a novel style transfer network that operates on the feature grid, enabling view-consistent stylization. |
Achieves high-quality stylization of 3D scenes with fidelity to both the input content and reference styles.
Exhibits strong multi-view consistency, producing stylized novel views that align seamlessly.
Outperforms existing 2D and 3D style transfer methods in terms of visual quality and view consistency. |
Limited to relatively simple scenes due to the computational cost of NeRF-based representations.
Future work could explore incorporating temporal consistency for stylizing dynamic scenes. |
3d style transfer, neural radiance fields, zero-shot learning, computer vision, deep learning |
2303.10340
Report |
3D Data Augmentation for Driving Scenes on Camera |
Wenwen Tong, Jiangwei Xie, Tianyu Li, Hanming Deng, Xiangwei Geng, Ruoyi Zhou, Dingchen Yang, Bo Dai, Lewei Lu, Hongyang Li |
Driving scenes are so diverse and complicated that it is impossible to
collect all cases with human effort alone. While data augmentation is an
effective technique to enrich the training data, existing methods for camera
data in autonomous driving applications are confined to the 2D image plane,
which may not optimally increase data diversity in 3D real-world scenarios. To
this end, we propose a 3D data augmentation approach termed Drive-3DAug, aiming
at augmenting the driving scenes on camera in the 3D space. We first utilize
Neural Radiance Field (NeRF) to reconstruct the 3D models of background and
foreground objects. Then, augmented driving scenes can be obtained by placing
the 3D objects with adapted location and orientation at the pre-defined valid
region of backgrounds. As such, the training database could be effectively
scaled up. However, the 3D object modeling is constrained by the image quality
and the limited viewpoints. To overcome these problems, we modify the original
NeRF by introducing a geometric rectified loss and a symmetric-aware training
strategy. We evaluate our method for the camera-only monocular 3D detection
task on the Waymo and nuScenes datasets. The proposed data augmentation
approach contributes to a gain of 1.7% and 1.4% in terms of detection accuracy,
on Waymo and nuScenes respectively. Furthermore, the constructed 3D models
serve as digital driving assets and could be recycled for different detectors
or other 3D perception tasks. |
This paper proposes Drive-3DAug, a novel 3D data augmentation approach for camera-based 3D perception in autonomous driving, which leverages NeRF to reconstruct and manipulate 3D models of driving scenes. |
Existing data augmentation methods for camera data are limited to 2D image manipulations, hindering diversity in generated driving scenes which is crucial for improving 3D perception, especially in handling challenging long-tail scenarios. |
The approach uses NeRF to reconstruct 3D models of backgrounds and foreground objects from driving scenes. It introduces a geometric rectified loss to handle imperfect object extraction and a symmetric-aware training strategy to enhance viewpoint diversity. The augmented scenes are created by placing objects in valid regions of the backgrounds while considering physical constraints. |
Drive-3DAug significantly improves monocular 3D detection performance, achieving a 1.7% gain on Waymo and 1.4% on nuScenes.
It effectively addresses challenges in previous methods by enabling realistic object rotation and translation in 3D space.
Reconstructed 3D models serve as reusable digital driving assets, benefiting various perception tasks. |
The current method primarily augments data under good illumination conditions and with limited object classes.
Future work includes expanding the approach to encompass diverse weather conditions and a wider range of objects. |
data augmentation, 3d object detection, autonomous driving, neural radiance fields (nerf), digital driving assets |
2303.10137
Report |
A Recipe for Watermarking Diffusion Models |
Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Ngai-Man Cheung, Min Lin |
Diffusion models (DMs) have demonstrated advantageous potential on generative
tasks. Widespread interest exists in incorporating DMs into downstream
applications, such as producing or editing photorealistic images. However,
practical deployment and unprecedented power of DMs raise legal issues,
including copyright protection and monitoring of generated content. In this
regard, watermarking has been a proven solution for copyright protection and
content monitoring, but it is underexplored in the DMs literature.
Specifically, DMs generate samples from longer tracks and may have newly
designed multimodal structures, necessitating the modification of conventional
watermarking pipelines. To this end, we conduct comprehensive analyses and
derive a recipe for efficiently watermarking state-of-the-art DMs (e.g., Stable
Diffusion), via training from scratch or finetuning. Our recipe is
straightforward but involves empirically ablated implementation details,
providing a foundation for future research on watermarking DMs. The code is
available at https://github.com/yunqing-me/WatermarkDM. |
This paper presents a comprehensive empirical study and derives a practical recipe for watermarking diffusion models (DMs) for copyright protection and content monitoring. |
The widespread use of DMs in generating realistic images raises legal concerns about copyright and the proliferation of generated content. |
For unconditional/class-conditional DMs, the recipe embeds predefined watermarks directly into the training data using an encoder-decoder architecture.
For text-to-image DMs, it finetunes pretrained models to generate a specific watermark image in response to a trigger prompt, using weight-constrained regularization to minimize performance degradation. |
Watermarks can be reliably embedded in and recovered from both unconditional/class-conditional and text-to-image DMs.
Increasing the complexity of watermarks can lead to a degradation of the generative performance of DMs.
Weight-constrained finetuning effectively mitigates performance degradation in text-to-image DMs during watermark embedding. |
Embedding complex watermarks can negatively impact the quality of generated images.
Future research can explore more sophisticated watermarking techniques to further improve robustness and minimize performance impact. |
diffusion models, watermarking, copyright protection, content monitoring, generative models |
2303.10126
Report |
IRGen: Generative Modeling for Image Retrieval |
Yidan Zhang, Ting Zhang, Dong Chen, Yujing Wang, Qi Chen, Xing Xie, Hao Sun, Weiwei Deng, Qi Zhang, Fan Yang, Mao Yang, Qingmin Liao, Baining Guo |
While generative modeling has been ubiquitous in natural language processing
and computer vision, its application to image retrieval remains unexplored. In
this paper, we recast image retrieval as a form of generative modeling by
employing a sequence-to-sequence model, contributing to the current unified
theme. Our framework, IRGen, is a unified model that enables end-to-end
differentiable search, thus achieving superior performance thanks to direct
optimization. While developing IRGen we tackle the key technical challenge of
converting an image into quite a short sequence of semantic units in order to
enable efficient and effective retrieval. Empirical experiments demonstrate
that our model yields significant improvement over three commonly used
benchmarks, for example, 22.9% higher than the best baseline method in
precision@10 on In-shop dataset with comparable recall@10 score. |
This paper recasts image retrieval as a generative modeling problem, proposing IRGen, a sequence-to-sequence model that predicts discrete visual tokens representing a query image's nearest neighbor. |
Existing image retrieval pipelines, relying on separate feature extraction and ANN search stages, lack end-to-end optimization. IRGen aims to bridge this gap, offering direct optimization for superior performance. |
IRGen utilizes a novel semantic image tokenizer that compresses image representations into short sequences of semantically meaningful tokens. A Transformer-based encoder-decoder architecture then predicts identifiers of nearest neighbors during retrieval. |
IRGen outperforms state-of-the-art image retrieval methods on In-shop Clothes, CUB200, and Cars196 datasets, demonstrating superior precision.
The model shows promising scalability, achieving excellent results on million-level datasets like ImageNet and Places365.
The proposed semantic image tokenizer proves more effective than random identifiers or those derived from hierarchical k-means or RQ-VAE. |
Handling billion-scale datasets efficiently requires further research on balancing model capacity and inference speed.
Efficiently updating the model with fresh data without retraining remains an open challenge. |
image retrieval, generative modeling, sequence-to-sequence, semantic image tokenizer, transformer |
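The IRGen abstract above reports precision@10 and recall@10. As a minimal sketch of how such metrics are commonly computed, the following uses the plain set-based definitions. The function names and toy identifiers are illustrative assumptions, and the benchmark's own evaluation protocol may define recall differently (for example, as a hit-or-miss indicator per query).

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant ids that appear in the top-k results."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    # Use a set so a duplicated id in the ranking is counted once.
    hits = sum(1 for item in set(retrieved[:k]) if item in relevant)
    return hits / len(relevant)


# Toy example with hypothetical image ids.
ranking = ["img_a", "img_b", "img_c", "img_d"]
ground_truth = {"img_a", "img_c", "img_e"}
p_at_2 = precision_at_k(ranking, ground_truth, 2)
r_at_2 = recall_at_k(ranking, ground_truth, 2)
```

Both metrics share the same hit count; they differ only in the denominator, which is why a method can gain precision while recall stays comparable.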
2303.10083
Report |
$α$Surf: Implicit Surface Reconstruction for Semi-Transparent and Thin Objects with Decoupled Geometry and Opacity |
Tianhao Wu, Hanxue Liang, Fangcheng Zhong, Gernot Riegler, Shimon Vainer, Cengiz Oztireli |
Implicit surface representations such as the signed distance function (SDF)
have emerged as a promising approach for image-based surface reconstruction.
However, existing optimization methods assume solid surfaces and are therefore
unable to properly reconstruct semi-transparent surfaces and thin structures,
which also exhibit low opacity due to the blending effect with the background.
While neural radiance field (NeRF) based methods can model semi-transparency
and achieve photo-realistic quality in synthesized novel views, their
volumetric geometry representation tightly couples geometry and opacity, and
therefore cannot be easily converted into surfaces without introducing
artifacts. We present $\alpha$Surf, a novel surface representation with
decoupled geometry and opacity for the reconstruction of semi-transparent and
thin surfaces where the colors mix. Ray-surface intersections on our
representation can be found in closed-form via analytical solutions of cubic
polynomials, which avoids Monte-Carlo sampling and is fully differentiable by
construction. Our qualitative and quantitative evaluations show that our
approach can accurately reconstruct surfaces with semi-transparent and thin
parts with fewer artifacts, achieving better reconstruction quality than
state-of-the-art SDF and NeRF methods. Website: https://alphasurf.netlify.app/ |
$\alpha$Surf is a novel grid-based surface representation for reconstructing semi-transparent and thin objects with decoupled geometry and opacity. |
Existing SDF-based methods struggle to reconstruct surfaces exhibiting semi-transparency or thin structures due to the assumption of solid surfaces. NeRF-based methods can model semi-transparency but their volumetric geometry representation tightly couples geometry and opacity. |
The representation utilizes separate values on a grid to model geometry, opacity, and appearance. It leverages a closed-form solution for finding ray-surface intersections via cubic polynomial root finding and employs differentiable alpha compositing during rendering. The method is initialized from a pre-trained Plenoxels model and incorporates surface-specific regularization for optimization. |
$\alpha$Surf accurately reconstructs surfaces with semi-transparent and thin parts with fewer artifacts than state-of-the-art SDF and NeRF methods.
It produces higher quality surfaces, as evidenced by lower Chamfer distance scores compared to baselines.
The method effectively removes noisy inner surfaces and density floaters commonly present in NeRF reconstructions. |
The reconstructed surfaces tend to be less smooth compared to MLP-based SDF methods due to the lack of spatial smoothness encoded in the MLP.
The method currently requires a separate background model for 360° real-world scenes. |
surface reconstruction, semi-transparent surfaces, thin structures, implicit surface representation, differentiable rendering |
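The $\alpha$Surf summary above mentions alpha compositing over ray-surface intersections. The sketch below shows standard front-to-back alpha compositing for a single ray, assuming the intersections have already been found and sorted from near to far. The function name and data layout are illustrative assumptions, not the authors' implementation, and plain Python is used for readability (an autodiff framework would be needed to obtain gradients).

```python
def composite_along_ray(intersections):
    """Front-to-back alpha compositing.

    intersections: list of (alpha, (r, g, b)) tuples sorted near to far,
    where alpha is the surface opacity in [0, 1].
    Returns the composited color and the remaining transmittance.
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for alpha, rgb in intersections:
        weight = transmittance * alpha
        for channel in range(3):
            color[channel] += weight * rgb[channel]
        transmittance *= 1.0 - alpha
    return tuple(color), transmittance


# A half-transparent red surface in front of an opaque blue one.
pixel, leftover = composite_along_ray([
    (0.5, (1.0, 0.0, 0.0)),
    (1.0, (0.0, 0.0, 1.0)),
])
```

Because opacity is a separate per-surface value here, a thin or semi-transparent surface can keep a sharp geometric location while still contributing only partially to the pixel color.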
2303.10073
Report |
DialogPaint: A Dialog-based Image Editing Model |
Jingxuan Wei, Shiyu Wu, Xin Jiang, Yequan Wang |
We introduce DialogPaint, a novel framework that bridges conversational
interactions with image editing, enabling users to modify images through
natural dialogue. By integrating a dialogue model with the Stable Diffusion
image transformation technique, DialogPaint offers a more intuitive and
interactive approach to image modifications. Our method stands out by
effectively interpreting and executing both explicit and ambiguous
instructions, handling tasks such as object replacement, style transfer, and
color modification. Notably, DialogPaint supports iterative, multi-round
editing, allowing users to refine image edits over successive interactions.
Comprehensive evaluations highlight the robustness and versatility of our
approach, marking a significant advancement in dialogue-driven image editing. |
Introduces DialogPaint, a novel framework for interactive image editing through multi-round dialogue. |
Addresses limitations of existing image editing methods that struggle with ambiguous instructions and lack intuitive human-computer interaction. |
Combines a dialogue model with Stable Diffusion image editing. Leverages self-instruct methodology to generate synthetic dialogue and image pairs for model training. |
Outperforms baseline models like InstructPix2Pix in preserving background details and isolating object edits.
Successfully handles multi-turn edits, including object addition/removal, color modifications, and scene transformations.
Demonstrates robustness and versatility across diverse image editing tasks, evidenced by quantitative metrics (Perplexity, FID, PRD) and positive user feedback (Overall Satisfaction, MOS). |
Limited dataset diversity and volume may hinder performance in complex editing scenarios.
Future work includes refining the model's ability to balance transformation with preservation for more natural edits. |
dialogue-based image editing, natural language processing, image transformation, multi-round interactions, stable diffusion |
2303.09833
Report |
FreeDoM: Training-Free Energy-Guided Conditional Diffusion Model |
Jiwen Yu, Yinhuai Wang, Chen Zhao, Bernard Ghanem, Jian Zhang |
Recently, conditional diffusion models have gained popularity in numerous
applications due to their exceptional generation ability. However, many
existing methods are training-required. They need to train a time-dependent
classifier or a condition-dependent score estimator, which increases the cost
of constructing conditional diffusion models and is inconvenient to transfer
across different conditions. Some current works aim to overcome this limitation
by proposing training-free solutions, but most can only be applied to a
specific category of tasks and not to more general conditions. In this work, we
propose a training-Free conditional Diffusion Model (FreeDoM) used for various
conditions. Specifically, we leverage off-the-shelf pre-trained networks, such
as a face detection model, to construct time-independent energy functions,
which guide the generation process without requiring training. Furthermore,
because the construction of the energy function is very flexible and adaptable
to various conditions, our proposed FreeDoM has a broader range of applications
than existing training-free methods. FreeDoM is advantageous in its simplicity,
effectiveness, and low cost. Experiments demonstrate that FreeDoM is effective
for various conditions and suitable for diffusion models of diverse data
domains, including image and latent code domains. |
Proposes FreeDoM, a training-free method for conditional diffusion models using off-the-shelf pre-trained networks to construct time-independent energy functions for guidance. |
Addresses the inflexibility and high cost of retraining conditional diffusion models for new conditions. |
Approximates time-dependent energy functions using time-independent distance measuring functions based on pre-trained networks. Employs an efficient time-travel strategy for large data domains. Constructs energy functions by projecting conditions and intermediate results into the same feature space for distance measurement. |
Generates high-quality images consistent with diverse conditions including text, segmentation maps, sketches, landmarks, face IDs, and style images.
Offers controllability by adjusting the learning rate of the energy function.
Demonstrates faster inference speed and better alignment with conditioned style images compared to UGD. |
Sampling time is higher than training-required methods due to energy function derivative computation and time-travel strategy.
Controlling fine-grained structure features in large data domains using the energy function is difficult. |
conditional diffusion models, training-free, energy guidance, image generation, time-travel strategy |
2303.09826
Report |
Learning Data-Driven Vector-Quantized Degradation Model for Animation Video Super-Resolution |
Zixi Tuo, Huan Yang, Jianlong Fu, Yujie Dun, Xueming Qian |
Existing real-world video super-resolution (VSR) methods focus on designing a
general degradation pipeline for open-domain videos while ignoring intrinsic
data characteristics, which strongly limits their performance when applied
to some specific domains (e.g., animation videos). In this paper, we thoroughly
explore the characteristics of animation videos and leverage the rich priors in
real-world animation data for a more practical animation VSR model. In
particular, we propose a multi-scale Vector-Quantized Degradation model for
animation video Super-Resolution (VQD-SR) to decompose the local details from
global structures and transfer the degradation priors in real-world animation
videos to a learned vector-quantized codebook for degradation modeling. A
rich-content Real Animation Low-quality (RAL) video dataset is collected for
extracting the priors. We further propose a data enhancement strategy for
high-resolution (HR) training videos based on our observation that existing HR
videos are mostly collected from the Web which contains conspicuous compression
artifacts. The proposed strategy is valid to lift the upper bound of animation
VSR performance, regardless of the specific VSR model. Experimental results
demonstrate the superiority of the proposed VQD-SR over state-of-the-art
methods, through extensive quantitative and qualitative evaluations of the
latest animation video super-resolution benchmark. The code and pre-trained
models can be downloaded at https://github.com/researchmm/VQD-SR. |
This paper introduces VQD-SR, a novel multi-scale vector-quantized degradation model for animation video super-resolution. |
Existing real-world video super-resolution methods often fail to generalize to the animation domain as they disregard the inherent characteristics of such videos, resulting in subpar outcomes. |
The authors collected a large-scale Real Animation Low-quality (RAL) video dataset to study real-world animation degradation. They designed a multi-scale VQGAN trained on RAL to learn and transfer degradation priors. A stochastic top-k VQ strategy expands the degradation space for better generalization. Lastly, they proposed an HR-SR data enhancement strategy to improve the quality of HR training videos. |
VQD-SR outperforms state-of-the-art methods in quantitative metrics (MANIQA) on the AVC-RealLQ benchmark.
Qualitative comparisons demonstrate VQD-SR's ability to restore sharper lines, reduce artifacts, and handle intended scenarios like out-of-focus blur more naturally.
Extensive ablation studies validate the effectiveness of the VQ degradation model, HR-SR enhancement strategy, and other design choices. |
VQD-SR may still struggle with extreme cases of degradation, such as severe color distortions.
The HR-SR enhancement strategy, while effective for animation, may not directly apply to natural videos due to their complexity. |
animation video super-resolution, degradation modeling, vector quantization, vqgan, data enhancement |
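The VQD-SR summary above mentions a stochastic top-k VQ strategy that expands the degradation space. A minimal sketch of the idea: instead of always picking the nearest codebook entry, pick one of the k nearest at random. Uniform sampling among the k candidates, the squared Euclidean distance, and all names below are assumptions made for illustration; the paper's exact selection rule may differ.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def stochastic_top_k_quantize(feature, codebook, k=3, rng=None):
    """Return (index, code) sampled uniformly from the k nearest codes."""
    if not codebook:
        raise ValueError("codebook must not be empty")
    rng = rng or random.Random()
    ranked = sorted(
        range(len(codebook)),
        key=lambda i: squared_distance(feature, codebook[i]),
    )
    candidates = ranked[: max(1, min(k, len(codebook)))]
    index = rng.choice(candidates)
    return index, codebook[index]


# Toy 2-D codebook; with k=1 this reduces to ordinary nearest-neighbor VQ.
toy_codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
index, code = stochastic_top_k_quantize((0.1, 0.1), toy_codebook, k=2)
```

Setting `k=1` recovers deterministic quantization, so the amount of added variety is controlled by a single parameter.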
2303.09813
Report |
DiffusionSeg: Adapting Diffusion Towards Unsupervised Object Discovery |
Chaofan Ma, Yuhuan Yang, Chen Ju, Fei Zhang, Jinxiang Liu, Yu Wang, Ya Zhang, Yanfeng Wang |
Learning from a large corpus of data, pre-trained models have achieved
impressive progress nowadays. As popular generative pre-training, diffusion
models capture both low-level visual knowledge and high-level semantic
relations. In this paper, we propose to exploit such knowledgeable diffusion
models for mainstream discriminative tasks, i.e., unsupervised object
discovery: saliency segmentation and object localization. However,
challenges exist, as there is a structural difference between generative and
discriminative models, which limits direct use. Besides, the lack of
explicitly labeled data significantly limits performance in unsupervised
settings. To tackle these issues, we introduce DiffusionSeg, one novel
synthesis-exploitation framework containing two-stage strategies. To alleviate
data insufficiency, we synthesize abundant images, and propose a novel
training-free AttentionCut to obtain masks in the first synthesis stage. In the
second exploitation stage, to bridge the structural gap, we use the inversion
technique, to map the given image back to diffusion features. These features
can be directly used by downstream architectures. Extensive experiments and
ablation studies demonstrate the superiority of adapting diffusion for
unsupervised object discovery. |
This paper proposes DiffusionSeg, a novel synthesis-exploitation framework leveraging the visual knowledge of pre-trained text-to-image diffusion models for unsupervised object discovery, including saliency segmentation and object localization. |
This work is significant as it explores the potential of generative pre-trained models for mainstream discriminative tasks, aiming to bridge the gap between generative and discriminative modeling. |
The synthesis stage generates abundant image-mask pairs by leveraging cross- and self-attention within a pre-trained diffusion model, utilizing a novel training-free method called AttentionCut. The exploitation stage employs diffusion inversion to map a given image back to diffusion features, which are then used by a lightweight decoder trained on the synthetic data for object discovery. |
DiffusionSeg achieves state-of-the-art performance on six standard object discovery benchmarks, surpassing previous methods in both saliency segmentation and object localization.
The analysis of the synthesized dataset demonstrates its ability to reliably simulate real-world data properties, enabling effective training of object discovery models.
Ablation studies validate the effectiveness of each component, highlighting the importance of AttentionCut and the CLIP-classifiable prior for knowledge extraction and object discovery. |
The current method primarily focuses on single-object discovery, and extending it to multi-object scenarios requires further investigation.
The computational cost of diffusion inversion remains a challenge for real-time applications, requiring exploration of efficient inference strategies. |
diffusion models, unsupervised object discovery, saliency segmentation, object localization, generative pre-training |
2303.09604
Report |
DS-Fusion: Artistic Typography via Discriminated and Stylized Diffusion |
Maham Tanveer, Yizhi Wang, Ali Mahdavi-Amiri, Hao Zhang |
We introduce a novel method to automatically generate an artistic typography
by stylizing one or more letter fonts to visually convey the semantics of an
input word, while ensuring that the output remains readable. To address an
assortment of challenges with our task at hand including conflicting goals
(artistic stylization vs. legibility), lack of ground truth, and immense search
space, our approach utilizes large language models to bridge texts and visual
images for stylization and build an unsupervised generative model with a
diffusion model backbone. Specifically, we employ the denoising generator in
Latent Diffusion Model (LDM), with the key addition of a CNN-based
discriminator to adapt the input style onto the input text. The discriminator
uses rasterized images of a given letter/word font as real samples and output
of the denoising generator as fake samples. Our model is coined DS-Fusion for
discriminated and stylized diffusion. We showcase the quality and versatility
of our method through numerous examples, qualitative and quantitative
evaluation, as well as ablation studies. User studies comparing to strong
baselines including CLIPDraw and DALL-E 2, as well as artist-crafted
typographies, demonstrate strong performance of DS-Fusion. |
Introduces DS-Fusion, a novel method for automatically generating artistic typography by stylizing letter fonts to visually convey the semantics of an input word while maintaining readability. |
Artistic typography is challenging due to conflicting goals (artistic stylization vs. legibility), lack of ground truth, and an immense search space. |
Employs Latent Diffusion Model with a CNN-based discriminator. Generates style images from the input word using LDM. Fine-tunes the denoising generator on these images using diffusion loss and discriminator loss to adapt the input style onto the input text. |
Generates artistic typography by blending styles into glyph shapes.
Outperforms baselines like DALL-E 2 and CLIPDraw in user studies for style and legibility.
Demonstrates versatility in accommodating different semantics, letters, and styles. |
Struggles with multi-letter inputs when style images and letters are dissimilar.
Current implementation optimizes for specific style and glyph combinations, limiting generalizability. |
artistic typography, diffusion models, generative design, text-to-image synthesis, adversarial learning |
2303.09556
Report |
Efficient Diffusion Training via Min-SNR Weighting Strategy |
Tiankai Hang, Shuyang Gu, Chen Li, Jianmin Bao, Dong Chen, Han Hu, Xin Geng, Baining Guo |
Denoising diffusion models have been a mainstream approach for image
generation; however, training these models often suffers from slow convergence.
In this paper, we discovered that the slow convergence is partly due to
conflicting optimization directions between timesteps. To address this issue,
we treat the diffusion training as a multi-task learning problem, and introduce
a simple yet effective approach referred to as Min-SNR-$\gamma$. This method
adapts loss weights of timesteps based on clamped signal-to-noise ratios, which
effectively balances the conflicts among timesteps. Our results demonstrate a
significant improvement in converging speed, 3.4$\times$ faster than previous
weighting strategies. It is also more effective, achieving a new record FID
score of 2.06 on the ImageNet $256\times256$ benchmark using smaller
architectures than that employed in previous state-of-the-art. The code is
available at https://github.com/TiankaiHang/Min-SNR-Diffusion-Training. |
This paper introduces Min-SNR-γ, a novel loss weighting strategy for diffusion model training that addresses the issue of slow convergence caused by conflicting optimization directions between timesteps. |
Training diffusion models is computationally expensive and slow convergence is a major bottleneck for research. This work tackles this issue, potentially enabling faster experimentation and development of diffusion models. |
The authors treat diffusion training as a multi-task learning problem and propose the Min-SNR-γ strategy, which assigns loss weights to each timestep based on a clamped signal-to-noise ratio. This approach aims to balance the optimization conflicts between different noise levels during training. |
Min-SNR-γ significantly accelerates convergence speed, achieving a 3.4x speedup compared to previous weighting strategies.
The method effectively balances loss across different noise levels, resulting in a more efficient training process.
It achieves a state-of-the-art FID score of 2.06 on the ImageNet 256x256 benchmark. |
The paper mainly focuses on image generation and further exploration is needed for other applications of diffusion models.
The optimal value for the hyperparameter γ may require task-specific tuning. |
diffusion models, image generation, multi-task learning, loss weighting, signal-to-noise ratio |
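The Min-SNR-γ summary above describes per-timestep loss weights based on a clamped signal-to-noise ratio. The sketch below illustrates the clamping under stated assumptions: a linear beta schedule, SNR defined as ᾱ_t / (1 − ᾱ_t), a weight of min(SNR_t, γ) for a loss on the predicted clean image, and that value divided by SNR_t for a noise-prediction loss. The schedule, the default γ = 5, and the function names are illustrative choices, not taken from the text above.

```python
def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed schedule)."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def snr_per_timestep(betas):
    """SNR_t = alpha_bar_t / (1 - alpha_bar_t) with alpha_bar_t = prod(1 - beta)."""
    snrs = []
    alpha_bar = 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta
        snrs.append(alpha_bar / (1.0 - alpha_bar))
    return snrs


def min_snr_weights(snrs, gamma=5.0, prediction="x0"):
    """Clamp the SNR at gamma; rescale by 1/SNR for noise prediction."""
    if prediction == "x0":
        return [min(s, gamma) for s in snrs]
    if prediction == "eps":
        return [min(s, gamma) / s for s in snrs]
    raise ValueError("prediction must be 'x0' or 'eps'")


snrs = snr_per_timestep(linear_betas(1000))
weights_x0 = min_snr_weights(snrs, gamma=5.0, prediction="x0")
weights_eps = min_snr_weights(snrs, gamma=5.0, prediction="eps")
```

The clamp caps the weight of low-noise timesteps, where the SNR is very large, so they cannot dominate the total loss.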
2303.09551
Report |
SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving |
Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Jie Zhou, Jiwen Lu |
3D scene understanding plays a vital role in vision-based autonomous driving.
While most existing methods focus on 3D object detection, they have difficulty
describing real-world objects of arbitrary shapes and infinite classes. Towards
a more comprehensive perception of a 3D scene, in this paper, we propose a
SurroundOcc method to predict the 3D occupancy with multi-camera images. We
first extract multi-scale features for each image and adopt spatial 2D-3D
attention to lift them to the 3D volume space. Then we apply 3D convolutions to
progressively upsample the volume features and impose supervision on multiple
levels. To obtain dense occupancy prediction, we design a pipeline to generate
dense occupancy ground truth without expensive occupancy annotations.
Specifically, we fuse multi-frame LiDAR scans of dynamic objects and static
scenes separately. Then we adopt Poisson Reconstruction to fill the holes and
voxelize the mesh to get dense occupancy labels. Extensive experiments on
nuScenes and SemanticKITTI datasets demonstrate the superiority of our method.
Code and dataset are available at https://github.com/weiyithu/SurroundOcc |
This paper proposes SurroundOcc, a method to predict dense and accurate 3D occupancy from multi-camera images for autonomous driving. |
3D occupancy prediction provides a more comprehensive understanding of the scene compared to 3D object detection, which can struggle with real-world objects of arbitrary shapes and classes. |
The method uses a 2D backbone to extract multi-scale features from each image, then employs 2D-3D spatial attention to lift the information to 3D volume features. A 3D convolution network upsamples and fuses these features to predict occupancy. To train the network, the authors created a pipeline to generate dense occupancy ground truth from sparse LiDAR point clouds and existing 3D detection labels. |
SurroundOcc achieves state-of-the-art performance on 3D semantic occupancy prediction on the nuScenes dataset.
The method also excels in 3D scene reconstruction, outperforming depth estimation methods and other 3D reconstruction approaches.
Experiments demonstrate the effectiveness of dense occupancy supervision over sparse LiDAR points for training. |
The current method only focuses on single-frame occupancy prediction, limiting its application to scenarios like motion prediction where occupancy flow is crucial.
The authors plan to extend the framework to predict occupancy flow from multi-frame surrounding images and explore self-supervised occupancy prediction without LiDAR data. |
3d occupancy prediction, autonomous driving, multi-camera perception, dense ground truth generation, 2d-3d spatial attention |
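The SurroundOcc summary above describes fusing multi-frame LiDAR scans and voxelizing the result into occupancy labels. The sketch below covers only the simplest part of that idea: moving several scans into a common frame and marking which voxels contain at least one point. It assumes each scan's pose is a pure translation, omits the Poisson reconstruction step entirely, and uses hypothetical function names.

```python
import math


def fuse_scans(scans, offsets):
    """Shift each scan by its (dx, dy, dz) offset into a shared frame."""
    if len(scans) != len(offsets):
        raise ValueError("need one offset per scan")
    fused = []
    for points, (dx, dy, dz) in zip(scans, offsets):
        fused.extend((x + dx, y + dy, z + dz) for x, y, z in points)
    return fused


def voxelize_points(points, voxel_size, origin=(0.0, 0.0, 0.0)):
    """Return the set of integer voxel indices containing a point."""
    if voxel_size <= 0:
        raise ValueError("voxel_size must be positive")
    occupied = set()
    for x, y, z in points:
        occupied.add((
            math.floor((x - origin[0]) / voxel_size),
            math.floor((y - origin[1]) / voxel_size),
            math.floor((z - origin[2]) / voxel_size),
        ))
    return occupied


# Two toy scans taken one metre apart along x.
scan_a = [(0.1, 0.1, 0.1), (0.4, 0.2, 0.3)]
scan_b = [(0.2, 0.0, 0.0)]
fused = fuse_scans([scan_a, scan_b], [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)])
occupied_voxels = voxelize_points(fused, voxel_size=0.5)
```

Fusing several frames before voxelizing is what makes the labels denser than any single sparse scan.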
2303.09522
Report |
P+: Extended Textual Conditioning in Text-to-Image Generation |
Andrey Voynov, Qinghao Chu, Daniel Cohen-Or, Kfir Aberman |
We introduce an Extended Textual Conditioning space in text-to-image models,
referred to as $P+$. This space consists of multiple textual conditions,
derived from per-layer prompts, each corresponding to a layer of the denoising
U-net of the diffusion model.
We show that the extended space provides greater disentangling and control
over image synthesis. We further introduce Extended Textual Inversion (XTI),
where the images are inverted into $P+$, and represented by per-layer tokens.
We show that XTI is more expressive and precise, and converges faster than
the original Textual Inversion (TI) space. The extended inversion method does
not involve any noticeable trade-off between reconstruction and editability and
induces more regular inversions.
We conduct a series of extensive experiments to analyze and understand the
properties of the new space, and to showcase the effectiveness of our method
for personalizing text-to-image models. Furthermore, we utilize the unique
properties of this space to achieve previously unattainable results in
object-style mixing using text-to-image models. Project page:
https://prompt-plus.github.io |
This paper introduces \(\mathcal{P}+\), an extended textual conditioning space for text-to-image models, which uses multiple textual conditions corresponding to different layers of the denoising U-net, allowing for greater disentanglement and control over image synthesis. |
This is important because it allows for more fine-grained control over image generation and enables new possibilities for personalization and style mixing. |
The authors analyze the properties of different U-net layers, revealing their varying influence on image attributes. They then leverage this insight to develop Extended Textual Inversion (XTI), which learns per-layer token embeddings for representing specific concepts. Finally, they showcase the capabilities of \(\mathcal{P}+\) and XTI through various experiments and a user study. |
Different layers of the U-net demonstrate distinct sensitivities to image attributes, with coarse layers predominantly influencing shape and structure, while fine layers primarily affect appearance.
XTI outperforms the original Textual Inversion (TI) in terms of both subject fidelity and text similarity, while also demonstrating faster convergence.
\(\mathcal{P}+\) enables effective object-style mixing by combining token embeddings from different XTI inversions across various layers. |
While XTI achieves impressive results, it still falls short of the reconstruction quality achievable by fine-tuning the entire model.
The disentanglement of attributes across U-net layers is not perfect, which can limit the level of control in style mixing. |
text-to-image synthesis, diffusion models, textual inversion, style mixing, personalization |
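The P+ summary above describes object-style mixing by combining per-layer token embeddings from different XTI inversions. A minimal sketch of that combination follows, assuming each inversion is stored as a dictionary from layer name to embedding. The layer names and the choice of which layers count as coarse are hypothetical placeholders, not the layers of any specific model.

```python
def mix_per_layer_tokens(object_tokens, style_tokens, coarse_layers):
    """Take coarse-layer tokens from the object and the rest from the style."""
    if object_tokens.keys() != style_tokens.keys():
        raise ValueError("both inversions must cover the same layers")
    mixed = {}
    for layer in object_tokens:
        source = object_tokens if layer in coarse_layers else style_tokens
        mixed[layer] = source[layer]
    return mixed


# Hypothetical layer names and tiny stand-in embeddings.
object_inversion = {"down_0": [0.1], "mid": [0.2], "up_0": [0.3]}
style_inversion = {"down_0": [0.9], "mid": [0.8], "up_0": [0.7]}
mixed_tokens = mix_per_layer_tokens(
    object_inversion, style_inversion, coarse_layers={"mid"}
)
```

Since the summary reports that coarse layers mostly influence shape and fine layers mostly influence appearance, moving a layer in or out of `coarse_layers` shifts the balance between object structure and style.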
2303.09472
Report |
DiffIR: Efficient Diffusion Model for Image Restoration |
Bin Xia, Yulun Zhang, Shiyin Wang, Yitong Wang, Xinglong Wu, Yapeng Tian, Wenming Yang, Luc Van Gool |
The diffusion model (DM) has achieved SOTA performance by modeling the image
synthesis process as a sequential application of a denoising network.
However, different from image synthesis, image restoration (IR) has a strong
constraint to generate results in accordance with ground-truth. Thus, for IR,
traditional DMs running massive iterations on a large model to estimate whole
images or feature maps is inefficient. To address this issue, we propose an
efficient DM for IR (DiffIR), which consists of a compact IR prior extraction
network (CPEN), dynamic IR transformer (DIRformer), and denoising network.
Specifically, DiffIR has two training stages: pretraining and training DM. In
pretraining, we input ground-truth images into CPEN$_{S1}$ to capture a compact
IR prior representation (IPR) to guide DIRformer. In the second stage, we train
the DM to directly estimate the same IPR as pretrained CPEN$_{S1}$ only using
LQ images. We observe that since the IPR is only a compact vector, DiffIR can
use fewer iterations than traditional DM to obtain accurate estimations and
generate more stable and realistic results. Since the iterations are few, our
DiffIR can adopt a joint optimization of CPEN$_{S2}$, DIRformer, and denoising
network, which can further reduce the estimation error influence. We conduct
extensive experiments on several IR tasks and achieve SOTA performance while
consuming less computational costs. Code is available at
https://github.com/Zj-BinXia/DiffIR. |
This paper introduces DiffIR, an efficient Diffusion Model (DM) designed for Image Restoration (IR) tasks like inpainting, super-resolution, and deblurring. |
Traditional DMs, highly effective for image synthesis, are computationally expensive for IR tasks where most input pixels are already present. DiffIR addresses this inefficiency by leveraging the strength of DMs in estimating data distributions to guide the restoration process. |
DiffIR consists of a compact IR prior extraction network (CPEN), a dynamic IR transformer (DIRformer), and a denoising network. It operates in two stages: (1) pretraining where CPEN extracts a compact IR prior representation (IPR) from ground-truth images to guide the DIRformer, and (2) training the DM to estimate the IPR solely from LQ images. This allows for joint optimization of DM and DIRformer, enhancing robustness against estimation errors. |
DiffIR achieves state-of-the-art performance on benchmark datasets for inpainting, super-resolution, and deblurring tasks.
It outperforms other DM-based methods significantly in terms of efficiency, consuming considerably less computational resources while achieving better or comparable results.
DiffIR demonstrates faster convergence speed compared to traditional DMs due to its focus on estimating a compact IPR rather than generating entire images. |
The current implementation of DiffIR primarily focuses on single-image restoration tasks.
Exploration of more complex diffusion processes and network architectures within the DiffIR framework could be beneficial. |
image restoration, diffusion model, deep learning, image inpainting, super-resolution |
2303.09431
Report |
NeRFMeshing: Distilling Neural Radiance Fields into Geometrically-Accurate 3D Meshes |
Marie-Julie Rakotosaona, Fabian Manhardt, Diego Martin Arroyo, Michael Niemeyer, Abhijit Kundu, Federico Tombari |
With the introduction of Neural Radiance Fields (NeRFs), novel view synthesis
has recently made a big leap forward. At the core, NeRF proposes that each 3D
point can emit radiance, making it possible to conduct view synthesis using
differentiable volumetric rendering. While neural radiance fields can
accurately represent 3D scenes for computing the image rendering, 3D meshes are
still the main scene representation supported by most computer graphics and
simulation pipelines, enabling tasks such as real time rendering and
physics-based simulations. Obtaining 3D meshes from neural radiance fields
still remains an open challenge since NeRFs are optimized for view synthesis,
not enforcing an accurate underlying geometry on the radiance field. We thus
propose a novel compact and flexible architecture that enables easy 3D surface
reconstruction from any NeRF-driven approach. Upon having trained the radiance
field, we distill the volumetric 3D representation into a Signed Surface
Approximation Network, allowing easy extraction of the 3D mesh and appearance.
Our final 3D mesh is physically accurate and can be rendered in real time on an
array of devices. |
Presents NeRFMeshing, a novel method for extracting geometrically accurate and compact 3D meshes from trained NeRF models, enabling real-time rendering and integration with existing graphics pipelines. |
NeRFs excel at view synthesis but lack accurate underlying geometry needed for tasks like real-time rendering, physics simulations, and integration with standard computer graphics pipelines. |
Introduces a Signed Surface Approximation Network (SSAN) trained on pre-trained NeRF data to approximate a Truncated Signed Distance Field (TSDF) and appearance features. It leverages NeRF's rendered depth distribution and enforces smoothness and normal consistency. A 3D mesh is extracted using marching cubes and rendered in real-time using an appearance network. |
NeRFMeshing achieves superior geometric accuracy compared to baselines like SNeRG and MobileNeRF on the Blender Synthetic dataset.
The method demonstrates high-quality mesh reconstruction even on challenging unbounded scenes from the Mip-NeRF 360 dataset.
The extracted meshes are suitable for real-time rendering and can be readily used in physics-based simulations and scene editing. |
Rendering highly detailed surfaces can lead to large mesh sizes, suggesting a need for adaptive mesh reconstruction.
For large, detailed scenes, resolution must be limited to keep the model size manageable. |
neural radiance fields, 3d mesh reconstruction, real-time rendering, novel view synthesis, computer graphics |
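As a rough illustration of the truncated signed distance field (TSDF) named in the NeRFMeshing method summary above, the sketch below clamps a signed distance to a truncation band and then looks for sign changes along a ray. A sign change is the 1D analogue of the zero crossing that marching cubes turns into mesh faces. The function names, the sample values, and the normalization to [-1, 1] are assumptions made for illustration, not details taken from the paper.

```python
def truncated_sdf(distance, truncation):
    """Clamp a signed distance to [-truncation, truncation], then scale to [-1, 1]."""
    if truncation <= 0:
        raise ValueError("truncation must be positive")
    clamped = max(-truncation, min(truncation, distance))
    return clamped / truncation


def zero_crossings(tsdf_samples):
    """Return indices i where the TSDF changes sign between samples i and i + 1."""
    return [
        i
        for i in range(len(tsdf_samples) - 1)
        if tsdf_samples[i] * tsdf_samples[i + 1] < 0
    ]


# Signed distances sampled along one hypothetical ray (positive = outside).
ray = [truncated_sdf(d, truncation=0.5) for d in (2.0, 0.25, -0.25, -3.0)]
print(ray)                  # values are clamped into [-1, 1]
print(zero_crossings(ray))  # the surface lies between samples 1 and 2
```

Truncation keeps the regression target bounded far from the surface, so only the band around the zero level set carries geometric detail.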
2303.09412
Report |
NeRFtrinsic Four: An End-To-End Trainable NeRF Jointly Optimizing Diverse Intrinsic and Extrinsic Camera Parameters |
Hannah Schieber, Fabian Deuser, Bernhard Egger, Norbert Oswald, Daniel Roth |
Novel view synthesis using neural radiance fields (NeRF) is the
state-of-the-art technique for generating high-quality images from novel
viewpoints. Existing methods require a priori knowledge about extrinsic and
intrinsic camera parameters. This limits their applicability to synthetic
scenes, or real-world scenarios with the necessity of a preprocessing step.
Current research on the joint optimization of camera parameters and NeRF
focuses on refining noisy extrinsic camera parameters and often relies on the
preprocessing of intrinsic camera parameters. Further approaches are limited to
cover only one single camera intrinsic. To address these limitations, we
propose a novel end-to-end trainable approach called NeRFtrinsic Four. We
utilize Gaussian Fourier features to estimate extrinsic camera parameters and
dynamically predict varying intrinsic camera parameters through the supervision
of the projection error. Our approach outperforms existing joint optimization
methods on LLFF and BLEFF. In addition to these existing datasets, we introduce
a new dataset called iFF with varying intrinsic camera parameters. NeRFtrinsic
Four is a step forward in joint optimization NeRF-based view synthesis and
enables more realistic and flexible rendering in real-world scenarios with
varying camera parameters. |
Presents NeRFtrinsic Four, an end-to-end trainable neural radiance field (NeRF) framework for novel view synthesis that jointly optimizes diverse intrinsic and extrinsic camera parameters, eliminating the need for preprocessing steps like SfM. |
Existing NeRF methods require a priori knowledge of camera parameters, limiting their applicability to synthetic scenes or requiring preprocessing. This work addresses limitations in current joint optimization methods, enabling more realistic and flexible rendering for real-world scenarios with varying camera parameters. |
Utilizes Gaussian Fourier features to estimate extrinsic camera parameters and dynamically predicts varying intrinsic camera parameters through the supervision of projection error. Introduces a novel dataset, iFF, with varying intrinsic camera parameters. |
Outperforms existing joint optimization methods (NeRF-- and SiNeRF) on LLFF and BLEFF benchmarks in terms of image quality and camera parameter estimation.
Demonstrates superior performance on the newly introduced iFF dataset, highlighting the advantage of handling diverse intrinsic camera parameters.
Shows improved stability and accuracy in camera pose prediction through the use of Gaussian Fourier features and an SSIM loss function. |
Limited to forward-facing scenes and not yet suitable for 360° scenes.
Pose MLP initialization remains challenging, suggesting potential for future work in regularization methods. |
neural radiance fields, novel view synthesis, camera parameter estimation, gaussian fourier features, joint optimization |
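The Gaussian Fourier features named in the NeRFtrinsic Four method summary above can be sketched as follows. An input vector is projected with a random Gaussian matrix and passed through cosine and sine. The names `make_projection` and `gaussian_fourier_features`, the scale `sigma`, and the feature count are hypothetical. How the paper wires these features into its pose network is not stated in this record.

```python
import math
import random


def make_projection(in_dim, num_features, sigma, seed=0):
    """Sample a (num_features x in_dim) matrix with entries drawn from N(0, sigma^2)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, sigma) for _ in range(in_dim)] for _ in range(num_features)]


def gaussian_fourier_features(x, projection):
    """Map x to [cos(2*pi*Bx), sin(2*pi*Bx)] for a fixed random matrix B."""
    cos_part, sin_part = [], []
    for row in projection:
        phase = 2.0 * math.pi * sum(b * xi for b, xi in zip(row, x))
        cos_part.append(math.cos(phase))
        sin_part.append(math.sin(phase))
    return cos_part + sin_part


B = make_projection(in_dim=3, num_features=4, sigma=1.0)
features = gaussian_fourier_features([0.1, -0.2, 0.3], B)
print(len(features))  # twice the number of random frequencies
```

The matrix is sampled once and kept fixed. A larger `sigma` gives higher-frequency features, which is the usual knob to tune in this kind of encoding.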
2303.09319
Report |
Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation |
Yiyang Ma, Huan Yang, Wenjing Wang, Jianlong Fu, Jiaying Liu |
Language-guided image generation has achieved great success nowadays by using
diffusion models. However, texts can be less detailed to describe
highly-specific subjects such as a particular dog or a certain car, which makes
pure text-to-image generation not accurate enough to satisfy user requirements.
In this work, we present a novel Unified Multi-Modal Latent Diffusion
(UMM-Diffusion) which takes joint texts and images containing specified
subjects as input sequences and generates customized images with the subjects.
To be more specific, both input texts and images are encoded into one unified
multi-modal latent space, in which the input images are learned to be projected
to pseudo word embedding and can be further combined with text to guide image
generation. Besides, to eliminate the irrelevant parts of the input images such
as background or illumination, we propose a novel sampling technique of
diffusion models used by the image generator which fuses the results guided by
multi-modal input and pure text input. By leveraging the large-scale
pre-trained text-to-image generator and the designed image encoder, our method
is able to generate high-quality images with complex semantics from both
aspects of input texts and images. |
This paper introduces UMM-Diffusion, a novel framework for generating images from both text descriptions and specific subjects provided as images, encoding them into a unified multimodal latent space. |
Current text-to-image generation models struggle to accurately depict highly specific subjects. This work allows for greater customization and control over the generated images by incorporating visual subjects. |
The method uses a Text-and-Image Unified Encoder (TIUE) that leverages pre-trained CLIP encoders to project both text and images into a shared latent space. A novel fusing sampling technique combines multi-modal and text-only guidance to mitigate overfitting on irrelevant image details. |
UMM-Diffusion generates high-quality, customizable images with diverse novel views of the target subjects, aligning with text descriptions while preserving visual features.
The model successfully disentangles style information, allowing for stylized image generation guided by either text or input image styles.
UMM-Diffusion supports multiple guidance images within a single input, enabling composition of several subjects into one image. |
When several subjects are provided, the model may mix their features, producing fused visual representations.
Generating rare or highly-fictitious subjects can result in distorted or inaccurate visual details. |
image generation, multi-modal learning, diffusion models, text-to-image synthesis, clip |
2303.09295
Report |
DIRE for Diffusion-Generated Image Detection |
Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, Houqiang Li |
Diffusion models have shown remarkable success in visual synthesis, but have
also raised concerns about potential abuse for malicious purposes. In this
paper, we seek to build a detector for telling apart real images from
diffusion-generated images. We find that existing detectors struggle to detect
images generated by diffusion models, even if we include generated images from
a specific diffusion model in their training data. To address this issue, we
propose a novel image representation called DIffusion Reconstruction Error
(DIRE), which measures the error between an input image and its reconstruction
counterpart by a pre-trained diffusion model. We observe that
diffusion-generated images can be approximately reconstructed by a diffusion
model while real images cannot. It provides a hint that DIRE can serve as a
bridge to distinguish generated and real images. DIRE provides an effective way
to detect images generated by most diffusion models, and it is general for
detecting generated images from unseen diffusion models and robust to various
perturbations. Furthermore, we establish a comprehensive diffusion-generated
benchmark including images generated by eight diffusion models to evaluate the
performance of diffusion-generated image detectors. Extensive experiments on
our collected benchmark demonstrate that DIRE exhibits superiority over
previous generated-image detectors. The code and dataset are available at
https://github.com/ZhendongWang6/DIRE. |
This paper proposes DIRE (Diffusion Reconstruction Error), a novel image representation for detecting diffusion-generated images, which leverages the distinct reconstruction errors between real and generated images. |
The rise of diffusion models in visual synthesis necessitates reliable detectors to prevent misuse for malicious purposes, such as generating deepfakes. |
DIRE measures the error between an input image and its reconstruction obtained by inverting and reconstructing the image using a pre-trained diffusion model (DDIM). A binary classifier is then trained on DIRE representations to distinguish real from generated images. |
DIRE exhibits superior generalization ability compared to existing generated image detectors, achieving high accuracy on unseen diffusion models.
DIRE shows robustness to image perturbations, including Gaussian blur and JPEG compression.
Analysis of noise patterns and frequency information in DIRE further supports its effectiveness in distinguishing real and generated images. |
The reliance on a pre-trained diffusion model for reconstruction might limit DIRE's applicability if the model is not robust or generalized.
Future work could explore the use of DIRE for detecting images generated by other generative models beyond diffusion models. |
diffusion model, image generation, image forensics, deepfake detection, generalization |
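The DIRE representation described in the record above is the per-pixel error between an image and its diffusion-model reconstruction. A minimal sketch follows, assuming images are nested lists of floats. The `reconstruct` argument is a hypothetical callable that stands in for DDIM inversion followed by reconstruction with a pre-trained model. The toy `perfect` and `lossy` functions only mimic the two cases and do not run a diffusion model.

```python
def dire(image, reconstruct):
    """Per-pixel absolute error between an image and its reconstruction.

    `reconstruct` is a hypothetical callable standing in for DDIM inversion
    followed by DDIM sampling with a pre-trained diffusion model.
    """
    recon = reconstruct(image)
    return [
        [abs(p - q) for p, q in zip(row, recon_row)]
        for row, recon_row in zip(image, recon)
    ]


def mean_error(error_map):
    """Average a DIRE map into one scalar, useful for a quick sanity check."""
    values = [v for row in error_map for v in row]
    return sum(values) / len(values)


# Toy stand-ins: a "generated" image is reconstructed exactly, a "real" one is not.
perfect = lambda img: [row[:] for row in img]
lossy = lambda img: [[0.5 for _ in row] for row in img]

image = [[0.0, 1.0], [1.0, 0.0]]
print(mean_error(dire(image, perfect)))  # no reconstruction error
print(mean_error(dire(image, lossy)))    # every pixel is off by 0.5
```

In the method as summarized, the full error map (not just its mean) would be the input to the binary classifier.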
2303.09270
Report |
SpectralCLIP: Preventing Artifacts in Text-Guided Style Transfer from a Spectral Perspective |
Zipeng Xu, Songlong Xing, Enver Sangineto, Nicu Sebe |
Owing to the power of vision-language foundation models, e.g., CLIP, the area
of image synthesis has seen recent important advances. Particularly, for style
transfer, CLIP enables transferring more general and abstract styles without
collecting the style images in advance, as the style can be efficiently
described with natural language, and the result is optimized by minimizing the
CLIP similarity between the text description and the stylized image. However,
directly using CLIP to guide style transfer leads to undesirable artifacts
(mainly written words and unrelated visual entities) spread over the image. In
this paper, we propose SpectralCLIP, which is based on a spectral
representation of the CLIP embedding sequence, where most of the common
artifacts occupy specific frequencies. By masking the band including these
frequencies, we can condition the generation process to adhere to the target
style properties (e.g., color, texture, paint stroke, etc.) while excluding the
generation of larger-scale structures corresponding to the artifacts.
Experimental results show that SpectralCLIP prevents the generation of
artifacts effectively in quantitative and qualitative terms, without impairing
the stylisation quality. We also apply SpectralCLIP to text-conditioned image
generation and show that it prevents written words in the generated images. Our
code is available at https://github.com/zipengxuc/SpectralCLIP. |
Proposes SpectralCLIP, a novel method that leverages spectral analysis to prevent artifact generation in CLIP-guided style transfer. |
Addresses the problem of undesirable visual and textual artifacts in CLIP-guided style transfer, enhancing the quality and realism of generated images. |
Transforms the CLIP embedding sequence into the frequency domain and employs band-stop filters to remove frequencies associated with artifact scales. |
Significantly reduces both visual and textual artifacts in stylized images.
Maintains high consistency with target styles while preventing artifacts.
Outperforms CLIPstyler and forget-to-spell CLIP in artifact reduction and overall quality, according to a user study. |
Lacks a clear explanation for the relationship between artifact scales and target styles.
Relies on empirically defined band combinations, requiring manual selection for new styles. |
style transfer, clip, artifact removal, spectral analysis, image generation |
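The band-stop filtering summarized in the SpectralCLIP record above can be sketched on a single channel of a token sequence: transform to the frequency domain, zero a band of coefficients, and transform back. The sketch assumes an orthonormal DCT as the transform, since the summary only says "frequency domain". The band limits `low` and `high` are hypothetical parameters.

```python
import math


def dct(seq):
    """Orthonormal DCT-II of a list of floats."""
    n = len(seq)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(
            x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(seq)
        ))
    return out


def idct(coeffs):
    """Inverse of `dct` (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        out.append(sum(
            math.sqrt((1.0 if k == 0 else 2.0) / n) * c
            * math.cos(math.pi * (i + 0.5) * k / n)
            for k, c in enumerate(coeffs)
        ))
    return out


def band_stop(seq, low, high):
    """Zero the DCT coefficients with index in [low, high) and transform back."""
    coeffs = dct(seq)
    kept = [0.0 if low <= k < high else c for k, c in enumerate(coeffs)]
    return idct(kept)


channel = [1.0, 2.0, 4.0, 3.0]      # one embedding channel across four tokens
print(band_stop(channel, 1, 3))     # the same channel with a mid band removed
```

For an embedding sequence with many channels, the same filter would be applied to each channel independently along the token axis.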
2303.09252
Report |
GridCLIP: One-Stage Object Detection by Grid-Level CLIP Representation Learning |
Jiayi Lin, Shaogang Gong |
A vision-language foundation model pretrained on very large-scale image-text
paired data has the potential to provide generalizable knowledge representation
for downstream visual recognition and detection tasks, especially on
supplementing the undersampled categories in downstream model training. Recent
studies utilizing CLIP for object detection have shown that a two-stage
detector design typically outperforms a one-stage detector, while requiring
more expensive training resources and longer inference time. In this work, we
propose a one-stage detector GridCLIP that narrows its performance gap to those
of two-stage detectors, with approximately 43 and 5 times faster than its
two-stage counterpart (ViLD) in the training and test process respectively.
GridCLIP learns grid-level representations to adapt to the intrinsic principle
of one-stage detection learning by expanding the conventional CLIP image-text
holistic mapping to a more fine-grained, grid-text alignment. This differs from
the region-text mapping in two-stage detectors that apply CLIP directly by
treating regions as images. Specifically, GridCLIP performs Grid-level
Alignment to adapt the CLIP image-level representations to grid-level
representations by aligning to CLIP category representations to learn the
annotated (especially frequent) categories. To learn generalizable visual
representations of broader categories, especially undersampled ones, we perform
Image-level Alignment during training to propagate broad pre-learned categories
in the CLIP image encoder from the image-level to the grid-level
representations. Experiments show that the learned CLIP-based grid-level
representations boost the performance of undersampled (infrequent and novel)
categories, reaching comparable detection performance on the LVIS benchmark. |
This paper introduces GridCLIP, a one-stage object detector that leverages CLIP's representation space to improve the detection of undersampled categories. |
Existing object detection datasets often suffer from long-tail distributions, where some categories have very few training samples, hindering the performance on these undersampled categories. GridCLIP aims to address this issue by transferring knowledge from the CLIP model. |
GridCLIP employs two key alignment strategies: (1) Grid-level Alignment maps localized grid-level image features to CLIP's text embeddings for base categories. (2) Image-level Alignment performs knowledge distillation from a fixed CLIP image encoder to guide the learning of both base and novel categories at the image level. |
GridCLIP achieves comparable performance to two-stage detectors on LVIS while being significantly faster in training and inference.
Both grid-level and image-level alignments are shown to contribute to the improved detection of undersampled categories.
Analysis reveals that CLIP's image encoder can effectively capture multiple categories within an image, supporting the effectiveness of image-level alignment. |
The gap between base and novel categories in GridCLIP suggests potential for further improvement by refining the alignment strategies for novel categories.
Exploring alternative one-stage detectors or incorporating learnable prompts could further enhance GridCLIP's performance. |
object detection, vision-language models, clip, undersampled categories, long-tail distribution |
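Grid-level alignment, as summarized in the GridCLIP record above, compares each grid cell's feature with category text embeddings. The sketch below shows only the inference-side idea: assign each cell the category with the highest cosine similarity. The vectors, category names, and function names are made up for illustration. The training losses used by the paper are not reproduced here.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def classify_grid(grid_features, text_embeddings):
    """Assign each grid cell the category whose text embedding is most similar."""
    labels = []
    for feature in grid_features:
        scores = {name: cosine(feature, emb) for name, emb in text_embeddings.items()}
        labels.append(max(scores, key=scores.get))
    return labels


# Hypothetical 2-D embeddings for two categories and three grid cells.
text_embeddings = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
grid_features = [[0.9, 0.1], [0.2, 0.8], [2.0, 0.5]]
print(classify_grid(grid_features, text_embeddings))
```

Because the classifier is just a set of text embeddings, categories can be added by embedding new names, without changing the detector's parameters.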
2303.09181
Report |
Global Knowledge Calibration for Fast Open-Vocabulary Segmentation |
Kunyang Han, Yong Liu, Jun Hao Liew, Henghui Ding, Yunchao Wei, Jiajun Liu, Yitong Wang, Yansong Tang, Yujiu Yang, Jiashi Feng, Yao Zhao |
Recent advancements in pre-trained vision-language models, such as CLIP, have
enabled the segmentation of arbitrary concepts solely from textual inputs, a
process commonly referred to as open-vocabulary semantic segmentation (OVS).
However, existing OVS techniques confront a fundamental challenge: the trained
classifier tends to overfit on the base classes observed during training,
resulting in suboptimal generalization performance to unseen classes. To
mitigate this issue, recent studies have proposed the use of an additional
frozen pre-trained CLIP for classification. Nonetheless, this approach incurs
heavy computational overheads as the CLIP vision encoder must be repeatedly
forward-passed for each mask, rendering it impractical for real-world
applications. To address this challenge, our objective is to develop a fast OVS
model that can perform comparably or better without the extra computational
burden of the CLIP image encoder during inference. To this end, we propose a
core idea of preserving the generalizable representation when fine-tuning on
known classes. Specifically, we introduce a text diversification strategy that
generates a set of synonyms for each training category, which prevents the
learned representation from collapsing onto specific known category names.
Additionally, we employ a text-guided knowledge distillation method to preserve
the generalizable knowledge of CLIP. Extensive experiments demonstrate that our
proposed model achieves robust generalization performance across various
datasets. Furthermore, we perform a preliminary exploration of open-vocabulary
video segmentation and present a benchmark that can facilitate future
open-vocabulary research in the video domain. |
This paper proposes Global Knowledge Calibration (GKC), a fast open-vocabulary segmentation method that preserves generalizability to unseen categories during training and needs no additional frozen CLIP model at inference, which speeds up inference. |
Existing open-vocabulary segmentation (OVS) models often overfit to seen categories, limiting their generalization to unseen ones. While using a frozen CLIP model during inference helps, it introduces significant computational overhead. |
GKC introduces two key components: 1) a text diversification strategy using WordNet synonyms to prevent overfitting to specific category names and 2) a text-guided knowledge distillation approach that utilizes CLIP's multi-modal alignment to guide the training process. |
GKC achieves state-of-the-art performance on multiple benchmarks while being significantly faster than previous methods.
Text diversification and text-guided distillation are shown to effectively improve generalization ability.
The paper introduces a preliminary exploration of open-vocabulary video segmentation and constructs a new benchmark for future research. |
The video OVS model suffers from overfitting with increased training iterations.
Future work will focus on addressing the overfitting issue in video OVS. |
open-vocabulary segmentation, knowledge distillation, text diversification, vision-language models, video segmentation |
2303.08914
Report |
MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge |
Wei Lin, Leonid Karlinsky, Nina Shvetsova, Horst Possegger, Mateusz Kozinski, Rameswar Panda, Rogerio Feris, Hilde Kuehne, Horst Bischof |
Large scale Vision-Language (VL) models have shown tremendous success in
aligning representations between visual and text modalities. This enables
remarkable progress in zero-shot recognition, image generation & editing, and
many other exciting tasks. However, VL models tend to over-represent objects
while paying much less attention to verbs, and require additional tuning on
video data for best zero-shot action recognition performance. While previous
work relied on large-scale, fully-annotated data, in this work we propose an
unsupervised approach. We adapt a VL model for zero-shot and few-shot action
recognition using a collection of unlabeled videos and an unpaired action
dictionary. Based on that, we leverage Large Language Models and VL models to
build a text bag for each unlabeled video via matching, text expansion and
captioning. We use those bags in a Multiple Instance Learning setup to adapt an
image-text backbone to video data. Although finetuned on unlabeled video data,
our resulting models demonstrate high transferability to numerous unseen
zero-shot downstream tasks, improving the base VL model performance by up to
14%, and even comparing favorably to fully-supervised baselines in both
zero-shot and few-shot video recognition transfer. The code will be released
later at https://github.com/wlin-at/MAXI. |
This paper presents MAtch, eXpand and Improve (MAXI), an unsupervised finetuning approach for zero-shot action recognition using unlabeled videos and language knowledge. |
Existing VL models often underperform in zero-shot action recognition due to their object-centric focus. Previous works relied on fully annotated video data for finetuning, which is costly and limits generalizability. This paper proposes an unsupervised approach to overcome these limitations. |
MAXI constructs a text bag for each unlabeled video by combining information from a predefined action dictionary, GPT-3 text expansion, and BLIP captioning. It then uses Multiple Instance Learning (MIL) to finetune a VL model on these unlabeled video-text bag pairs. |
Unsupervised finetuning with MAXI significantly improves the zero-shot action recognition performance of the VL model by up to 14% on seven unseen benchmarks.
The proposed approach even outperforms several state-of-the-art supervised methods that are trained with full annotation on the same data.
MAXI demonstrates strong few-shot learning capability, outperforming baselines in most cases, even with extremely limited data. |
Performance improvement is not consistent across all datasets due to varying domain shifts to the action dictionary used for training.
Further exploration of temporal modeling in the unsupervised finetuning setting is needed. |
zero-shot learning, action recognition, vision-language models, unsupervised learning, multiple instance learning |
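The Multiple Instance Learning step in the record above treats every text in a video's bag as a possible positive. One common way to write such an objective is a MIL-NCE-style loss, sketched below for a single video. The record does not say which MIL loss the paper uses, so this choice, the temperature value, and the function name are assumptions.

```python
import math


def mil_nce_loss(similarities, bag_indices, temperature=0.07):
    """MIL-NCE-style loss for one video.

    similarities: video-to-text similarity for every text in the batch.
    bag_indices:  positions of the texts that belong to this video's text bag.
    """
    logits = [s / temperature for s in similarities]
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - shift) for l in logits]
    positive_mass = sum(exps[i] for i in bag_indices)
    return -math.log(positive_mass / sum(exps))


# Hypothetical similarities between one video and four candidate texts.
sims = [0.9, 0.7, 0.1, 0.2]
print(mil_nce_loss(sims, bag_indices=[0, 1]))  # texts 0 and 1 form the bag
```

Summing over the bag inside the logarithm means the model is rewarded if any text in the bag matches the video, which tolerates noisy entries from automatic matching and captioning.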
2303.08817
Report |
DeepMIM: Deep Supervision for Masked Image Modeling |
Sucheng Ren, Fangyun Wei, Samuel Albanie, Zheng Zhang, Han Hu |
Deep supervision, which involves extra supervisions to the intermediate
features of a neural network, was widely used in image classification in the
early deep learning era since it significantly reduces the training difficulty
and eases the optimization like avoiding gradient vanish over the vanilla
training. Nevertheless, with the emergence of normalization techniques and
residual connection, deep supervision in image classification was gradually
phased out. In this paper, we revisit deep supervision for masked image
modeling (MIM) that pre-trains a Vision Transformer (ViT) via a
mask-and-predict scheme. Experimentally, we find that deep supervision drives
the shallower layers to learn more meaningful representations, accelerates
model convergence, and expands attention diversities. Our approach, called
DeepMIM, significantly boosts the representation capability of each layer. In
addition, DeepMIM is compatible with many MIM models across a range of
reconstruction targets. For instance, using ViT-B, DeepMIM on MAE achieves 84.2
top-1 accuracy on ImageNet, outperforming MAE by +0.6. By combining DeepMIM
with a stronger tokenizer CLIP, our model achieves state-of-the-art performance
on various downstream tasks, including image classification (85.6 top-1
accuracy on ImageNet-1K, outperforming MAE-CLIP by +0.8), object detection
(52.8 APbox on COCO) and semantic segmentation (53.1 mIoU on ADE20K). Code and
models are available at https://github.com/OliverRensu/DeepMIM. |
This paper revisits deep supervision for Masked Image Modeling (MIM) and proposes DeepMIM, a framework that applies deep supervision to intermediate features in the encoder of MIM models, enhancing representation learning, particularly in shallower layers. |
In MIM pretraining, the decoder implicitly deepens the network, so shallower encoder layers often receive a weaker learning signal. Deep supervision aims to address this by strengthening the representation learning capability of these layers. |
DeepMIM appends lightweight decoders to intermediate encoder blocks, introducing deep supervision. It optionally incorporates a hybrid target generator that produces progressively easier reconstruction targets for shallower layers, further enhancing learning. |
DeepMIM consistently improves performance across a range of MIM models and reconstruction targets.
DeepMIM-MAE achieves 84.2% top-1 accuracy on ImageNet-1K with ViT-B, outperforming MAE by 0.6 points.
Combined with a CLIP tokenizer, DeepMIM achieves state-of-the-art results on ImageNet-1K (85.6% top-1), COCO object detection (52.8 APbox), and ADE20K segmentation (53.1 mIoU). |
The hybrid target generator, while beneficial, introduces additional computational overhead.
Future work could explore alternative hybrid target generation methods with lower computational cost. |
self-supervised learning, masked image modeling, deep supervision, vision transformer, representation learning |
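Deep supervision as summarized in the DeepMIM record above adds a reconstruction loss at several intermediate encoder blocks, not only at the last one. The sketch below sums a masked mean-squared error over the predictions of each supervised block. The MSE target, the equal default weights, and all names are assumptions. The hybrid target generator, which gives shallower layers easier targets, is omitted.

```python
def masked_mse(pred, target, mask):
    """Mean squared error over masked patches only (mask[i] is True if patch i is masked)."""
    errors = [
        (p - t) ** 2
        for patch_pred, patch_target, masked in zip(pred, target, mask)
        if masked
        for p, t in zip(patch_pred, patch_target)
    ]
    return sum(errors) / len(errors) if errors else 0.0


def deep_supervision_loss(layer_preds, target, mask, weights=None):
    """Weighted sum of masked reconstruction losses, one term per supervised block."""
    if weights is None:
        weights = [1.0] * len(layer_preds)
    return sum(w * masked_mse(p, target, mask) for w, p in zip(weights, layer_preds))


# Two patches of two pixels each; only the first patch is masked.
target = [[1.0, 1.0], [0.0, 0.0]]
mask = [True, False]
shallow_pred = [[0.0, 0.0], [5.0, 5.0]]  # poor guess from an intermediate block
final_pred = [[1.0, 1.0], [0.0, 0.0]]    # exact guess from the last block
print(deep_supervision_loss([shallow_pred, final_pred], target, mask))
```

Errors on visible patches are ignored, as in mask-and-predict training, so the second patch of `shallow_pred` does not contribute to the loss.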
2303.08767
Report |
Highly Personalized Text Embedding for Image Manipulation by Stable Diffusion |
Inhwa Han, Serin Yang, Taesung Kwon, Jong Chul Ye |
Diffusion models have shown superior performance in image generation and
manipulation, but the inherent stochasticity presents challenges in preserving
and manipulating image content and identity. While previous approaches like
DreamBooth and Textual Inversion have proposed model or latent representation
personalization to maintain the content, their reliance on multiple reference
images and complex training limits their practicality. In this paper, we
present a simple yet highly effective approach to personalization using highly
personalized (HiPer) text embedding by decomposing the CLIP embedding space for
personalization and content manipulation. Our method does not require model
fine-tuning or identifiers, yet still enables manipulation of background,
texture, and motion with just a single image and target text. Through
experiments on diverse target texts, we demonstrate that our approach produces
highly personalized and complex semantic image edits across a wide range of
tasks. We believe that the novel understanding of the text embedding space
presented in this work has the potential to inspire further research across
various tasks. |
This paper introduces HiPer, a novel approach for personalized text-to-image generation using Stable Diffusion, which enables precise image manipulation while preserving subject identity from a single source image. |
Existing diffusion models struggle with maintaining content and identity during image manipulation. While methods like DreamBooth and Textual Inversion offer some personalization, they require multiple reference images and extensive training. |
HiPer leverages a novel understanding of CLIP embedding space. It decomposes embeddings, designating a portion for personalization (HiPer embedding) and optimizing it while maintaining source image semantics. This allows manipulation of background, texture, and motion using only a single image and target text. |
HiPer successfully manipulates images across various attributes like motion, background, and texture while preserving subject identity.
Compared to Imagic, DreamBooth, and Textual Inversion, HiPer demonstrates superior performance in both qualitative and quantitative evaluations, showcasing better content preservation and semantic alignment.
The method is computationally efficient, requiring only around 3 minutes for optimization. |
HiPer exhibits limitations in manipulating images with prompts requiring counting or specific color matching, and struggles with complex artificial objects.
While preserving identity, the overall generated image may appear somewhat unnatural due to limitations of the base Stable Diffusion Model. |
image manipulation, text-to-image synthesis, diffusion models, personalization, clip embedding |
2303.08714
Report |
ResDiff: Combining CNN and Diffusion Model for Image Super-Resolution |
Shuyao Shang, Zhengyang Shan, Guangxing Liu, LunQian Wang, XingHua Wang, Zekai Zhang, Jinglin Zhang |
Adapting the Diffusion Probabilistic Model (DPM) for direct image
super-resolution is wasteful, given that a simple Convolutional Neural Network
(CNN) can recover the main low-frequency content. Therefore, we present
ResDiff, a novel Diffusion Probabilistic Model based on Residual structure for
Single Image Super-Resolution (SISR). ResDiff utilizes a combination of a CNN,
which restores primary low-frequency components, and a DPM, which predicts the
residual between the ground-truth image and the CNN predicted image. In
contrast to the common diffusion-based methods that directly use LR images to
guide the noise towards HR space, ResDiff utilizes the CNN's initial prediction
to direct the noise towards the residual space between HR space and
CNN-predicted space, which not only accelerates the generation process but also
acquires superior sample quality. Additionally, a frequency-domain-based loss
function for CNN is introduced to facilitate its restoration, and a
frequency-domain guided diffusion is designed for DPM on behalf of predicting
high-frequency details. The extensive experiments on multiple benchmark
datasets demonstrate that ResDiff outperforms previous diffusion based methods
in terms of shorter model convergence time, superior generation quality, and
more diverse samples. |
This paper presents ResDiff, a novel residual Diffusion Probabilistic Model (DPM) for Single Image Super-Resolution (SISR). It leverages a CNN to recover low-frequency image components and a DPM to predict the residual high-frequency details, improving efficiency and quality. |
Current DPMs for SISR are inefficient as they attempt to recover the entire image from noise. This work addresses this by separating low and high-frequency recovery, leading to faster convergence and better quality. |
ResDiff employs a pre-trained CNN with frequency-domain loss functions for initial image restoration. A Frequency Domain-guided Diffusion (FD-guided Diffusion) then refines high-frequency details using novel modules: Frequency-Domain Information Splitter and high-frequency guided cross-attention. |
ResDiff achieves faster convergence than previous diffusion-based SISR methods.
It generates higher-quality super-resolved images with finer details.
The model produces more diverse samples compared to other approaches. |
While ResDiff improves convergence speed, operations like DWT remain computationally expensive.
The model's performance is limited by its smaller size compared to some SOTA methods. Utilizing larger U-net models could bridge this gap. |
super-resolution, diffusion models, deep learning, computer vision, image restoration |
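The residual split described for ResDiff above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names and the toy arrays are hypothetical, and the "CNN prediction" is simply a hand-written smoothed image standing in for a real network's output.

```python
# Minimal sketch of a residual training target, assuming images are
# nested lists of floats. All names here are illustrative only.

def residual_target(hr, cnn_pred):
    """Residual the diffusion model would learn: ground truth minus CNN output."""
    return [[h - c for h, c in zip(hr_row, cnn_row)]
            for hr_row, cnn_row in zip(hr, cnn_pred)]


def reconstruct(cnn_pred, residual):
    """Final output: CNN prediction plus the predicted residual."""
    return [[c + r for c, r in zip(cnn_row, res_row)]
            for cnn_row, res_row in zip(cnn_pred, residual)]


# Toy high-resolution image and a smoothed stand-in for the CNN prediction.
hr = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
cnn_pred = [[0.25, 0.5, 0.25], [0.5, 0.5, 0.5]]

residual = residual_target(hr, cnn_pred)
restored = reconstruct(cnn_pred, residual)
```

With a perfect residual prediction, adding it back to the CNN output recovers the ground truth exactly, which is why the diffusion model only needs to model what the CNN missed.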
2303.08695
Report |
RefiNeRF: Modelling dynamic neural radiance fields with inconsistent or missing camera parameters |
Shuja Khalid, Frank Rudzicz |
Novel view synthesis (NVS) is a challenging task in computer vision that
involves synthesizing new views of a scene from a limited set of input images.
Neural Radiance Fields (NeRF) have emerged as a powerful approach to address
this problem, but they require accurate knowledge of camera intrinsic
and extrinsic parameters. Traditionally, structure-from-motion (SfM)
and multi-view stereo (MVS) approaches have been used to extract camera
parameters, but these methods can be unreliable and may fail in certain cases.
In this paper, we propose a novel technique that leverages unposed images from
dynamic datasets, such as the NVIDIA dynamic scenes dataset, to learn camera
parameters directly from data. Our approach is highly extensible and can be
integrated into existing NeRF architectures with minimal modifications. We
demonstrate the effectiveness of our method on a variety of static and dynamic
scenes and show that it outperforms traditional SfM and MVS approaches. The
code for our method is publicly available at
https://github.com/redacted/refinerf.
Our approach offers a promising new direction for improving the accuracy and
robustness of NVS using NeRF, and we anticipate that it will be a valuable tool
for a wide range of applications in computer vision and graphics. |
This paper introduces refiNeRF, a novel method to model dynamic neural radiance fields (NeRFs) even with inconsistent or missing camera parameters. |
Accurate camera parameters are crucial for NeRFs to synthesize novel views, but traditional methods like SfM can be unreliable. RefiNeRF aims to improve the accuracy and robustness of NeRFs in challenging real-world scenarios. |
The method refines camera parameters by jointly optimizing them with the NeRF model using a photometric loss. It employs a learning scheduler for stable training and leverages multi-resolution encoding for high-fidelity reconstruction. |
RefiNeRF outperforms state-of-the-art methods like BARF and NeRF-- in novel view synthesis quality on the NVIDIA dynamic scenes dataset.
The method effectively refines even significantly perturbed camera poses, improving reconstruction metrics compared to using coarse initializations.
RefiNeRF demonstrates generalizability by enabling novel view synthesis on the challenging Cholec80 dataset, where traditional SfM methods struggle. |
RefiNeRF inherits limitations of the original NeRF, such as slow optimization and rendering speed.
The current method is computationally demanding, limiting the length of processable video clips, particularly for high-resolution data. |
neural radiance fields, novel view synthesis, camera pose estimation, dynamic scenes, deep learning |
2303.08686
Report |
Weakly Supervised Monocular 3D Object Detection using Multi-View Projection and Direction Consistency |
Runzhou Tao, Wencheng Han, Zhongying Qiu, Cheng-zhong Xu, Jianbing Shen |
Monocular 3D object detection has become a mainstream approach in autonomous
driving because it is easy to deploy. A prominent advantage is that it does not
need LiDAR point clouds during the inference. However, most current methods
still rely on 3D point cloud data for labeling the ground truths used in the
training phase. This inconsistency between the training and inference makes it
hard to utilize the large-scale feedback data and increases the data collection
expenses. To bridge this gap, we propose a new weakly supervised monocular 3D
object detection method, which can train the model with only 2D labels
marked on images. To be specific, we explore three types of consistency in this
task, i.e. the projection, multi-view and direction consistency, and design a
weakly-supervised architecture based on these consistencies. Moreover, we
propose a new 2D direction labeling method in this task to guide the model for
accurate rotation direction prediction. Experiments show that our
weakly-supervised method achieves comparable performance with some fully
supervised methods. When used as a pre-training method, our model can
significantly outperform the corresponding fully-supervised baseline with only
1/3 3D labels. https://github.com/weakmono3d/weakmono3d |
This paper proposes a weakly supervised method for monocular 3D object detection, eliminating the need for 3D point cloud annotations during training by using only 2D bounding box and direction labels on images. |
This is important because it enables the utilization of large-scale feedback data from production cars, which lack 3D annotations but are crucial for improving model robustness in real-world scenarios. |
The method leverages three types of consistency: 1) Projection Consistency ensures the projected 3D bounding boxes align with 2D labels. 2) Multi-view Consistency enforces consistency between predictions from different viewpoints of the same object. 3) Direction Consistency aligns predicted 3D box rotations with newly proposed 2D direction labels. |
The method achieves comparable performance to some fully supervised methods on the KITTI benchmark.
It performs well on a newly collected dataset (ProdCars) from production cars.
As a pre-training method, it outperforms the fully supervised baseline with only 1/3 of 3D labels. |
The performance using video sequences as multi-view data is not as good as using multi-camera data.
The method assumes objects are stationary between frames for video sequence data, which might not always hold true in real-world scenarios.
Future work could explore handling object motion in video sequences and extending the method to other label-efficient settings. |
3d object detection, weakly supervised learning, monocular vision, autonomous driving, consistency loss |
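The projection consistency idea in the record above can be illustrated with a pinhole camera: project the eight corners of a predicted 3D box into the image and compare the enclosing rectangle with the 2D label. The sketch below is a hedged illustration only. The coordinate convention (x right, y down, z forward, yaw about the vertical axis), the intrinsics `f`, `cu`, `cv`, and the L1 comparison are all assumptions, not details taken from the paper.

```python
import math


def box_corners(center, dims, yaw):
    """Eight corners of a 3D box in camera coordinates (assumed convention)."""
    cx, cy, cz = center
    length, height, width = dims
    corners = []
    for dx in (-length / 2, length / 2):
        for dy in (-height / 2, height / 2):
            for dz in (-width / 2, width / 2):
                # Rotate the horizontal offsets about the vertical (y) axis.
                x = cx + dx * math.cos(yaw) + dz * math.sin(yaw)
                z = cz - dx * math.sin(yaw) + dz * math.cos(yaw)
                corners.append((x, cy + dy, z))
    return corners


def projected_bbox(center, dims, yaw, f, cu, cv):
    """Axis-aligned 2D box enclosing the projected 3D corners."""
    us, vs = [], []
    for x, y, z in box_corners(center, dims, yaw):
        us.append(f * x / z + cu)  # pinhole projection, z must be positive
        vs.append(f * y / z + cv)
    return (min(us), min(vs), max(us), max(vs))


def projection_loss(pred_box2d, label_box2d):
    """Mean absolute difference between box coordinates (illustrative choice)."""
    return sum(abs(a - b) for a, b in zip(pred_box2d, label_box2d)) / 4


box = projected_bbox((0.0, 0.0, 10.0), (2.0, 2.0, 2.0), 0.0, 100.0, 0.0, 0.0)
loss = projection_loss(box, box)
```

A loss of this shape needs only 2D box labels, which is what makes the supervision weak: the 3D prediction is penalized solely through its image-plane footprint.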
2303.08622
Report |
Zero-Shot Contrastive Loss for Text-Guided Diffusion Image Style Transfer |
Serin Yang, Hyunmin Hwang, Jong Chul Ye |
Diffusion models have shown great promise in text-guided image style
transfer, but there is a trade-off between style transformation and content
preservation due to their stochastic nature. Existing methods require
computationally expensive fine-tuning of diffusion models or an additional neural
network. To address this, we propose a zero-shot contrastive loss for
diffusion models that doesn't require additional fine-tuning or auxiliary
networks. By leveraging patch-wise contrastive loss between generated samples
and original image embeddings in the pre-trained diffusion model, our method
can generate images with the same semantic content as the source image in a
zero-shot manner. Our approach outperforms existing methods while preserving
content and requiring no additional training, not only for image style transfer
but also for image-to-image translation and manipulation. Our experimental
results validate the effectiveness of our proposed method. |
This paper proposes ZeCon, a zero-shot contrastive loss for diffusion models, enabling image style transfer while preserving content without requiring fine-tuning or auxiliary networks. |
Existing diffusion-based style transfer methods struggle with the trade-off between style transformation and content preservation due to their stochastic nature. |
ZeCon leverages patch-wise contrastive loss between generated samples and original image embeddings within a pre-trained diffusion model. By incorporating this loss, the model maintains semantic consistency throughout the generation process. |
ZeCon outperforms GAN-based methods in terms of content preservation and style transformation quality.
Compared to other diffusion models, ZeCon achieves superior style transfer from unseen domains without additional training.
ZeCon is computationally efficient, requiring no training and demonstrating faster inference than methods like DiffusionCLIP. |
Finding optimal weights for different losses still requires user adjustment.
The method occasionally exhibits limitations by displaying text prompts from the targeted style on the generated images. |
image style transfer, diffusion models, contrastive loss, content preservation, zero-shot learning |
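A patch-wise contrastive loss of the kind described for ZeCon above is commonly written in InfoNCE form: each generated patch should match the source patch at the same location and differ from source patches elsewhere. The following is a minimal sketch under that assumption. The feature vectors, the cosine similarity, and the temperature `tau` are illustrative choices, not values from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def patch_nce(gen_feats, src_feats, tau=0.07):
    """InfoNCE over patches: positive = same index, negatives = other indices."""
    total = 0.0
    for i, query in enumerate(gen_feats):
        logits = [cosine(query, key) / tau for key in src_feats]
        # Numerically stable log-sum-exp over all source patches.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_denom - logits[i]
    return total / len(gen_feats)


src = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_loss = patch_nce(src, src)
shuffled_loss = patch_nce(list(reversed(src)), src)
```

When the generated patches keep the spatial layout of the source, the loss is near zero; shuffling the patch locations drives it up, which is the signal that encourages content preservation.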
2303.08594
Report |
FastInst: A Simple Query-Based Model for Real-Time Instance Segmentation |
Junjie He, Pengyu Li, Yifeng Geng, Xuansong Xie |
Recent attention in instance segmentation has focused on query-based models.
Despite being non-maximum suppression (NMS)-free and end-to-end, the
superiority of these models on high-accuracy real-time benchmarks has not been
well demonstrated. In this paper, we show the strong potential of query-based
models on efficient instance segmentation algorithm designs. We present
FastInst, a simple, effective query-based framework for real-time instance
segmentation. FastInst can execute at a real-time speed (i.e., 32.5 FPS) while
yielding an AP of more than 40 (i.e., 40.5 AP) on COCO test-dev without bells
and whistles. Specifically, FastInst follows the meta-architecture of recently
introduced Mask2Former. Its key designs include instance activation-guided
queries, dual-path update strategy, and ground truth mask-guided learning,
which enable us to use lighter pixel decoders, fewer Transformer decoder
layers, while achieving better performance. The experiments show that FastInst
outperforms most state-of-the-art real-time counterparts, including strong
fully convolutional baselines, in both speed and accuracy. Code can be found at
https://github.com/junjiehe96/FastInst . |
This paper proposes FastInst, a simple and efficient query-based model for real-time instance segmentation. |
Real-time instance segmentation is crucial for applications like self-driving cars and robotics, but existing query-based methods are often computationally expensive. FastInst addresses this gap by demonstrating the potential of query-based models for efficient instance segmentation. |
FastInst builds upon the Mask2Former architecture and introduces three key innovations: (1) Instance activation-guided queries that dynamically select pixel embeddings with high semantics as initial queries, (2) a dual-path Transformer decoder that alternately updates query and pixel features for richer embeddings, and (3) ground truth mask-guided learning to enhance the performance of masked attention. |
FastInst surpasses most state-of-the-art real-time instance segmentation methods in both speed and accuracy on the COCO dataset.
With a ResNet-50 backbone, FastInst-D1 achieves 35.6 AP at 53.8 FPS, outperforming strong convolutional baselines.
Using a ResNet-50-d-DCN backbone, FastInst-D3 achieves real-time performance (32.5 FPS) with an AP exceeding 40 (40.5 AP). |
Like other query-based models, FastInst struggles with segmenting small objects.
While effective, ground truth mask-guided learning increases training costs. |
instance segmentation, query-based model, real-time, transformer, computer vision |
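The instance activation-guided query selection summarized for FastInst above can be approximated as a top-k pick over per-pixel class scores. This is a simplified sketch with hypothetical names. It assumes `class_probs` already excludes any background class and omits whatever additional filtering the actual model applies.

```python
def select_queries(class_probs, embeddings, num_queries):
    """Pick the pixel embeddings with the highest class activation.

    class_probs[i] holds per-class probabilities for pixel i (background
    excluded by assumption); embeddings[i] is that pixel's feature vector.
    """
    scores = [max(p) for p in class_probs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = ranked[:num_queries]
    return top, [embeddings[i] for i in top]


probs = [[0.1, 0.2], [0.9, 0.05], [0.3, 0.6]]
feats = [[0.0], [1.0], [2.0]]
indices, queries = select_queries(probs, feats, 2)
```

Starting the decoder from embeddings that already sit on likely objects is one plausible reason fewer decoder layers are needed than with learned, image-independent queries.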
2303.08566
Report |
Sensitivity-Aware Visual Parameter-Efficient Fine-Tuning |
Haoyu He, Jianfei Cai, Jing Zhang, Dacheng Tao, Bohan Zhuang |
Visual Parameter-Efficient Fine-Tuning (PEFT) has become a powerful
alternative to full fine-tuning for adapting pre-trained vision models to
downstream tasks: it tunes only a small number of parameters while freezing
the vast majority to ease storage burden and optimization difficulty.
However, existing PEFT methods introduce trainable parameters to the same
positions across different tasks depending solely on human heuristics and
neglect the domain gaps. To this end, we study where to introduce and how to
allocate trainable parameters by proposing a novel Sensitivity-aware visual
Parameter-efficient fine-Tuning (SPT) scheme, which adaptively allocates
trainable parameters to task-specific important positions given a desired
tunable parameter budget. Specifically, our SPT first quickly identifies the
sensitive parameters that require tuning for a given task in a data-dependent
way. Next, our SPT further boosts the representational capability for the
weight matrices whose number of sensitive parameters exceeds a pre-defined
threshold by utilizing existing structured tuning methods, e.g., LoRA [23] or
Adapter [22], to replace directly tuning the selected sensitive parameters
(unstructured tuning) under the budget. Extensive experiments on a wide range
of downstream recognition tasks show that our SPT is complementary to the
existing PEFT methods and largely boosts their performance, e.g., SPT improves
Adapter with supervised pre-trained ViT-B/16 backbone by 4.2% and 1.4% mean
Top-1 accuracy, reaching SOTA performance on FGVC and VTAB-1k benchmarks,
respectively. Source code is at https://github.com/ziplab/SPT |
This paper presents SPT, a Sensitivity-aware visual Parameter-efficient fine-Tuning scheme that adaptively allocates trainable parameters to task-specific important positions. |
Existing PEFT methods introduce trainable parameters to the same positions across different tasks, neglecting domain gaps. SPT addresses this by identifying and leveraging task-specific parameter sensitivity. |
SPT identifies sensitive parameters using a data-dependent criterion based on loss reduction when tuned. It then allocates a budget of trainable parameters using both unstructured (directly tuning sensitive parameters) and structured (using methods like LoRA or Adapter) tuning. |
SPT consistently outperforms existing PEFT methods and full fine-tuning, especially with self-supervised backbones.
Structured tuning is particularly effective for datasets with large domain gaps.
SPT is robust to the number of training samples used to calculate parameter sensitivity. |
The fine-tuning memory cost of SPT is slightly higher than some reparameterization-based methods due to sparse gradient updates.
Future work includes adapting SPT to more downstream tasks and improving training efficiency. |
parameter-efficient fine-tuning, transfer learning, vision transformers, sensitivity analysis, structured tuning |
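The two steps described for SPT above (score parameters, then decide per weight matrix between unstructured and structured tuning) can be sketched as follows. The squared-gradient score is an assumed first-order proxy for the loss reduction obtained by tuning a parameter, and the matrix names and threshold are hypothetical; none of these specifics are taken from the paper.

```python
def sensitivity_scores(grad_batches):
    """Accumulate squared gradients per parameter over a few batches."""
    scores = [0.0] * len(grad_batches[0])
    for grads in grad_batches:
        for i, g in enumerate(grads):
            scores[i] += g * g
    return scores


def select_sensitive(scores, budget):
    """Indices of the `budget` highest-scoring parameters, in index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def allocate(selected, matrix_of, threshold):
    """Choose a tuning mode per weight matrix from its sensitive-parameter count."""
    counts = {}
    for i in selected:
        counts[matrix_of[i]] = counts.get(matrix_of[i], 0) + 1
    # Structured tuning (e.g. LoRA or Adapter) when the count exceeds the threshold.
    return {name: ("structured" if n > threshold else "unstructured")
            for name, n in counts.items()}


grads = [[0.1, -2.0, 0.5, 0.0], [0.2, 1.0, -0.5, 0.0]]
matrix_of = ["attn.q", "attn.q", "mlp.fc1", "mlp.fc1"]
scores = sensitivity_scores(grads)
chosen = select_sensitive(scores, budget=3)
plan = allocate(chosen, matrix_of, threshold=1)
```

Because the scores come from gradients on task data, the same backbone can receive a different allocation for every downstream task under a fixed parameter budget.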
2303.08370
Report |
Harnessing Low-Frequency Neural Fields for Few-Shot View Synthesis |
Liangchen Song, Zhong Li, Xuan Gong, Lele Chen, Zhang Chen, Yi Xu, Junsong Yuan |
Neural Radiance Fields (NeRF) have led to breakthroughs in the novel view
synthesis problem. Positional Encoding (P.E.) is a critical factor that brings
the impressive performance of NeRF, where low-dimensional coordinates are
mapped to high-dimensional space to better recover scene details. However,
blindly increasing the frequency of P.E. leads to overfitting when the
reconstruction problem is highly underconstrained, e.g., few-shot images for
training. We harness low-frequency neural fields to regularize high-frequency
neural fields from overfitting to better address the problem of few-shot view
synthesis. We propose reconstructing with a low-frequency only field and then
finishing details with a high-frequency equipped field. Unlike most existing
solutions that regularize the output space (i.e., rendered images), our
regularization is conducted in the input space (i.e., signal frequency). We
further propose a simple-yet-effective strategy for tuning the frequency to
avoid overfitting few-shot inputs: enforcing consistency among the frequency
domain of rendered 2D images. Thanks to the input space regularizing scheme,
our method readily applies to inputs beyond spatial locations, such as the time
dimension in dynamic scenes. Comparisons with state-of-the-art on both
synthetic and natural datasets validate the effectiveness of our proposed
solution for few-shot view synthesis. Code is available at
https://github.com/lsongx/halo. |
This paper presents HALO, a novel method for few-shot view synthesis that leverages low-frequency neural fields to regularize high-frequency neural fields and prevent overfitting. |
NeRF struggles with few-shot view synthesis due to overfitting to limited training views, resulting in inaccurate scene representations. HALO addresses this limitation by harnessing the smooth geometry produced by low-frequency neural fields to guide the learning of high-frequency details. |
HALO consists of a three-stage training process: (1) Train a low-frequency NeRF (Lo-NeRF) with tuned frequency; (2) Train a ray-based field supervised by Lo-NeRF to efficiently predict rough depth for each ray; (3) Train a high-frequency NeRF (Hi-NeRF), guided by the ray-based field and regularized to maintain geometry consistency with Lo-NeRF. |
HALO achieves comparable results to state-of-the-art methods like DietNeRF on 360° rendering tasks, without relying on external semantic supervision.
The method demonstrates superior extrapolation ability compared to DietNeRF, accurately reconstructing periodic structures and textures in unseen areas.
HALO effectively improves novel view synthesis quality on forward-facing light field data and dynamic scenes, demonstrating its generalizability. |
The optimal frequency for Lo-NeRF is determined empirically, potentially limiting its applicability to diverse scenes.
The method assumes the availability of at least three views for reconstructing a reasonable initial geometry. |
novel view synthesis, neural radiance fields (nerf), few-shot learning, positional encoding, frequency regularization |
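The frequency knob that HALO regularizes is the number of bands in the standard NeRF positional encoding. The sketch below shows that encoding for a single scalar coordinate; the function name and the chosen band counts are illustrative, and how HALO actually tunes the frequency is not reproduced here.

```python
import math


def positional_encoding(p, num_freqs):
    """NeRF-style encoding: sin and cos of p at frequencies 2^k * pi."""
    out = []
    for k in range(num_freqs):
        angle = (2 ** k) * math.pi * p
        out.append(math.sin(angle))
        out.append(math.cos(angle))
    return out


# A low-frequency field sees only the first few bands of the same encoding.
low = positional_encoding(0.5, num_freqs=2)
high = positional_encoding(0.5, num_freqs=6)
```

The low-frequency encoding is a prefix of the high-frequency one, so restricting the band count limits how sharply the field can vary with position, which is the overfitting control the abstract refers to.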
2303.08331
Report |
Towards High-Quality and Efficient Video Super-Resolution via Spatial-Temporal Data Overfitting |
Gen Li, Jie Ji, Minghai Qin, Wei Niu, Bin Ren, Fatemeh Afghah, Linke Guo, Xiaolong Ma |
As deep convolutional neural networks (DNNs) are widely used in various
fields of computer vision, leveraging the overfitting ability of the DNN to
achieve video resolution upscaling has become a new trend in the modern video
delivery system. By dividing videos into chunks and overfitting each chunk with
a super-resolution model, the server encodes videos before transmitting them to
the clients, thus achieving better video quality and transmission efficiency.
However, a large number of chunks are expected to ensure good overfitting
quality, which substantially increases the storage and consumes more bandwidth
resources for data transmission. On the other hand, decreasing the number of
chunks through training optimization techniques usually requires high model
capacity, which significantly slows down execution speed. To reconcile these, we
propose a novel method for high-quality and efficient video resolution
upscaling tasks, which leverages the spatial-temporal information to accurately
divide video into chunks, thus keeping the number of chunks as well as the
model size to a minimum. Additionally, we advance our method into a single
overfitting model by a data-aware joint training technique, which further
reduces the storage requirement with negligible quality drop. We deploy our
models on an off-the-shelf mobile phone, and experimental results show that our
method achieves real-time video super-resolution with high video quality.
Compared with the state-of-the-art, our method achieves 28 fps streaming speed
with 41.6 PSNR, which is 14× faster and 2.29 dB better in the live video
resolution upscaling tasks. Code is available at
https://github.com/coulsonlee/STDO-CVPR2023.git |
This paper proposes STDO, a novel spatial-temporal data overfitting approach for high-quality and efficient video resolution upscaling, which leverages spatial-temporal information to divide video into chunks for overfitting with independent or a single jointly trained SR model. |
Existing video resolution upscaling methods suffer from limited generalization ability or require large models with high computation costs for overfitting. STDO addresses these limitations by efficiently encoding HR videos into LR videos and compact SR models while maintaining high super-resolution quality. |
STDO divides video frames into patches and groups them into chunks based on PSNR values obtained from a pre-trained SR model. It then overfits each chunk with an independent SR model. Additionally, it introduces JSTDO, which utilizes a data-aware joint training technique to generate a single SR model for the entire video with minimal quality loss. |
STDO consistently outperforms state-of-the-art methods in video super-resolution quality (PSNR) while using smaller SR models with lower computation costs.
JSTDO effectively reduces the model size while maintaining comparable PSNR to STDO, making it suitable for deployment on resource-constrained devices.
Deploying on mobile devices, STDO achieves real-time video super-resolution performance with high video quality. |
The performance of STDO may be affected when encountering significant scene changes in long videos.
Future work includes exploring more sophisticated data scheduling strategies for joint training in JSTDO to further improve efficiency and quality. |
video super-resolution, data overfitting, spatial-temporal information, joint training, mobile deployment |
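The PSNR-based grouping described for STDO above can be sketched as: score each patch by the PSNR a pre-trained SR model achieves on it, sort, and split into chunks. The equal-size split and all names below are assumptions for illustration; the record states only that patches are grouped into chunks based on PSNR values.

```python
import math


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    return float("inf") if mse == 0 else 10 * math.log10(peak ** 2 / mse)


def group_into_chunks(patch_psnrs, num_chunks):
    """Sort patch indices by PSNR and split them into near-equal chunks."""
    order = sorted(range(len(patch_psnrs)), key=lambda i: patch_psnrs[i])
    size = math.ceil(len(order) / num_chunks)
    return [order[i:i + size] for i in range(0, len(order), size)]


# Hypothetical per-patch PSNR values from a pre-trained SR model.
chunks = group_into_chunks([30.0, 22.5, 41.0, 27.0], num_chunks=2)
```

Grouping by difficulty rather than by time means hard patches from anywhere in the video end up in the same chunk, so each small model overfits data of similar complexity.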
2303.08320
Report |
VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation |
Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, Tieniu Tan |
A diffusion probabilistic model (DPM), which constructs a forward diffusion
process by gradually adding noise to data points and learns the reverse
denoising process to generate new samples, has been shown to handle complex
data distribution. Despite its recent success in image synthesis, applying DPMs
to video generation is still challenging due to high-dimensional data spaces.
Previous methods usually adopt a standard diffusion process, where frames in
the same video clip are destroyed with independent noises, ignoring the content
redundancy and temporal correlation. This work presents a decomposed diffusion
process via resolving the per-frame noise into a base noise that is shared
among all frames and a residual noise that varies along the time axis. The
denoising pipeline employs two jointly-learned networks to match the noise
decomposition accordingly. Experiments on various datasets confirm that our
approach, termed as VideoFusion, surpasses both GAN-based and diffusion-based
alternatives in high-quality video generation. We further show that our
decomposed formulation can benefit from pre-trained image diffusion models and
supports text-conditioned video creation well. |
Presents VideoFusion, a decomposed diffusion probabilistic model for high-quality video generation that decomposes per-frame noise into shared base noise and time-varying residual noise, enabling efficient learning of spatial-temporal correlations. |
Addresses the challenge of applying diffusion models to high-dimensional video data by leveraging content redundancy and temporal correlations within video frames. |
Decomposes diffusion process into shared base noise and residual noise; employs two jointly-learned networks for denoising, leveraging pre-trained image diffusion models for base noise estimation. |
Outperforms GAN-based and diffusion-based methods on UCF101, Sky Time-lapse, and TaiChi-HD datasets in terms of FVD, KVD, and IS.
Effectively leverages pre-trained image diffusion models, improving efficiency and results.
Exhibits potential for content control and generation of longer coherent video sequences. |
Shared base noise might limit motion diversity in generated videos.
Current implementation relies on pre-trained prior for conditioning, potentially limiting performance in long-text video generation. |
video generation, diffusion models, decomposed representation, pre-trained models, content control |
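The noise decomposition described for VideoFusion above can be sketched as mixing one shared base noise with a per-frame residual noise. The mixing weights sqrt(lam) and sqrt(1 - lam) are an assumed parameterization chosen so that each frame's noise keeps unit variance; `lam`, the function name, and the flat per-frame vectors are illustrative.

```python
import math
import random


def decomposed_noise(num_frames, dim, lam, rng):
    """Per-frame noise = sqrt(lam) * shared base + sqrt(1 - lam) * residual."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    base = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    frames = []
    for _ in range(num_frames):
        residual = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        frames.append([math.sqrt(lam) * b + math.sqrt(1.0 - lam) * r
                       for b, r in zip(base, residual)])
    return base, frames


# lam = 1.0 makes every frame share exactly the same noise.
shared_base, shared_frames = decomposed_noise(3, 4, 1.0, random.Random(0))
# lam = 0.5 gives frames that are correlated but not identical.
_, mixed_frames = decomposed_noise(2, 20000, 0.5, random.Random(1))
```

Under this parameterization `lam` controls how much noise is common to all frames, which is the knob that lets a denoiser exploit the content redundancy between neighboring frames.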
2303.08132
Report |
InstMove: Instance Motion for Object-centric Video Segmentation |
Qihao Liu, Junfeng Wu, Yi Jiang, Xiang Bai, Alan Yuille, Song Bai |
Despite significant efforts, cutting-edge video segmentation methods still
remain sensitive to occlusion and rapid movement, due to their reliance on the
appearance of objects in the form of object embeddings, which are vulnerable to
these disturbances. A common solution is to use optical flow to provide motion
information, but essentially it only considers pixel-level motion, which still
relies on appearance similarity and hence is often inaccurate under occlusion
and fast movement. In this work, we study the instance-level motion and present
InstMove, which stands for Instance Motion for Object-centric Video
Segmentation. In comparison to pixel-wise motion, InstMove mainly relies on
instance-level motion information that is free from image feature embeddings,
and features physical interpretations, making it more accurate and robust
toward occlusion and fast-moving objects. To better fit in with the video
segmentation tasks, InstMove uses instance masks to model the physical presence
of an object and learns the dynamic model through a memory network to predict
its position and shape in the next frame. With only a few lines of code,
InstMove can be integrated into current SOTA methods for three different video
segmentation tasks and boost their performance. Specifically, we improve the
previous state of the art by 1.5 AP on the OVIS dataset, which features heavy occlusions, and
4.9 AP on the YouTubeVIS-Long dataset, which mainly contains fast-moving objects.
These results suggest that instance-level motion is robust and accurate, and
hence serves as a powerful solution in complex scenarios for object-centric
video segmentation. |
This paper introduces InstMove, a novel instance motion module for object-centric video segmentation that predicts object motion and deformation directly from instance masks. |
Existing video segmentation methods struggle with occlusion and rapid movement due to their reliance on appearance-based object embeddings. InstMove addresses this by leveraging instance-level motion information, which is more robust and accurate. |
InstMove utilizes an RNN-based module with a memory network to extract motion features from previous instance masks, store and retrieve dynamic information, and predict the position and shape of the object in the next frame. Image features can be incorporated to refine boundary prediction. |
InstMove significantly outperforms optical flow-based motion prediction, especially in challenging scenarios with occlusions or fast-moving objects.
Integrating InstMove with SOTA methods for VIS, VOS, and MOTS tasks leads to consistent performance improvements on benchmarks like OVIS, YouTubeVIS-Long, and BDD100K.
The improvements are particularly notable in complex scenarios with heavy occlusion and rapid motion, highlighting the effectiveness of incorporating instance-level motion information. |
The current implementation relies on low-level image features for boundary refinement, which might limit its generalizability.
The computational cost of InstMove could be further optimized for real-time applications. |
video segmentation, instance motion, motion prediction, object tracking, occlusion handling |
2303.08131
Report |
A Simple Framework for Open-Vocabulary Segmentation and Detection |
Hao Zhang, Feng Li, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianfeng Gao, Jianwei Yang, Lei Zhang |
We present OpenSeeD, a simple Open-vocabulary Segmentation and Detection
framework that jointly learns from different segmentation and detection
datasets. To bridge the gap of vocabulary and annotation granularity, we first
introduce a pre-trained text encoder to encode all the visual concepts in two
tasks and learn a common semantic space for them. This gives us reasonably good
results compared with the counterparts trained on segmentation task only. To
further reconcile them, we locate two discrepancies: (i) task discrepancy --
segmentation requires extracting masks for both foreground objects and
background stuff, while detection merely cares about the former; (ii) data
discrepancy -- box and mask annotations have different spatial granularity,
and thus not directly interchangeable. To address these issues, we propose a
decoupled decoding to reduce the interference between foreground/background and
a conditioned mask decoding to assist in generating masks for given boxes. To
this end, we develop a simple encoder-decoder model encompassing all three
techniques and train it jointly on COCO and Objects365. After pre-training, our
model exhibits competitive or stronger zero-shot transferability for both
segmentation and detection. Specifically, OpenSeeD beats the state-of-the-art
method for open-vocabulary instance and panoptic segmentation across 5
datasets, and outperforms previous work for open-vocabulary detection on LVIS
and ODinW under similar settings. When transferred to specific tasks, our model
achieves new SoTA for panoptic segmentation on COCO and ADE20K, and instance
segmentation on ADE20K and Cityscapes.
Finally, we note that OpenSeeD is the first to explore the potential of joint
training on segmentation and detection, and hope it can be received as a strong
baseline for developing a single model for both tasks in open world. |
This paper proposes OpenSeeD, a simple framework for building an open-vocabulary model that can perform both segmentation and detection by jointly learning from segmentation and detection datasets. |
Existing methods primarily focus on either open-vocabulary detection or segmentation, but not both. This work explores bridging the gap between detection and segmentation to achieve a single model for both tasks in the open world. |
OpenSeeD employs a shared text encoder to align visual and textual semantics. It tackles task discrepancies by using decoupled foreground/background decoding and addresses data discrepancies via conditioned mask decoding. |
OpenSeeD achieves state-of-the-art zero-shot segmentation performance on multiple datasets, outperforming methods like ODISE and X-Decoder.
It exhibits competitive zero-shot detection performance, surpassing GLIP on LVIS under similar settings.
OpenSeeD sets new state-of-the-art results for task-specific transfer on COCO and ADE20K panoptic segmentation and ADE20K and Cityscapes instance segmentation. |
The model currently doesn't incorporate referring/grounding data or large-scale image-text pairs, which could further enhance training data and semantic coverage.
Future work will explore a more comprehensive joint training approach that leverages these additional data sources. |
open vocabulary, segmentation, detection, joint learning, conditioned mask decoding |
2303.08129
Report |
PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection |
Anthony Chen, Kevin Zhang, Renrui Zhang, Zihan Wang, Yuheng Lu, Yandong Guo, Shanghang Zhang |
Masked Autoencoders learn strong visual representations and achieve
state-of-the-art results in several independent modalities, yet very few works
have addressed their capabilities in multi-modality settings. In this work, we
focus on point cloud and RGB image data, two modalities that are often
presented together in the real world, and explore their meaningful
interactions. To improve upon the cross-modal synergy in existing works, we
propose PiMAE, a self-supervised pre-training framework that promotes 3D and 2D
interaction through three aspects. Specifically, we first notice the importance
of masking strategies between the two sources and utilize a projection module
to complementarily align the mask and visible tokens of the two modalities.
Then, we utilize a well-crafted two-branch MAE pipeline with a novel shared
decoder to promote cross-modality interaction in the mask tokens. Finally, we
design a unique cross-modal reconstruction module to enhance representation
learning for both modalities. Through extensive experiments performed on
large-scale RGB-D scene understanding benchmarks (SUN RGB-D and ScanNetV2), we
discover it is nontrivial to interactively learn point-image features, where we
greatly improve multiple 3D detectors, 2D detectors, and few-shot classifiers
by 2.9%, 6.7%, and 2.4%, respectively. Code is available at
https://github.com/BLVLab/PiMAE. |
PiMAE, a novel self-supervised pre-training framework for 3D object detection, is introduced. It leverages masked autoencoders to learn interactive point cloud and RGB image representations. |
Existing methods struggle to effectively bridge 3D and 2D data for enhanced feature learning in multi-modal settings. PiMAE addresses this limitation by maximizing cross-modal synergy between point cloud and image data. |
PiMAE employs a two-branch MAE architecture with a shared decoder, ensuring both modal-specific and cross-modal learning. It introduces a novel complementary masking strategy, aligning masks between projected point tokens and image patches, and incorporates a cross-modal reconstruction module to strengthen representation learning. |
PiMAE significantly improves the performance of 3D object detectors, outperforming state-of-the-art methods by a large margin on SUN RGB-D and ScanNetV2 datasets.
PiMAE demonstrates strong generalization ability, enhancing 2D object detection on ScanNetV2 and monocular 3D detection on KITTI.
PiMAE excels in few-shot image classification, demonstrating the effectiveness of its learned image representations on CIFAR-FS, FC100, and miniImageNet datasets. |
The reliance on projection for alignment might limit PiMAE's applicability to scenarios with inaccurate camera poses.
Future work can investigate alternative fusion mechanisms beyond the shared decoder to further enhance cross-modal learning. |
3d object detection, multi-modal learning, masked autoencoders, self-supervised learning, point cloud, rgb image fusion |
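The complementary masking idea in the PiMAE entry above can be illustrated with a minimal sketch. It assumes that each point token has already been projected onto one image patch (the `point_to_patch` list is a hypothetical, precomputed input) and that patches covered by visible point tokens are the ones masked in the image branch. The function name, the mask ratio, and this exact rule are illustrative assumptions, not the authors' implementation.

```python
import random


def complementary_masks(point_to_patch, num_patches, point_mask_ratio=0.6, seed=0):
    """Return (masked_points, visible_points, masked_patches, visible_patches).

    point_to_patch[i] is the index of the image patch that point token i
    projects onto (assumed to be precomputed from the camera pose).
    """
    rng = random.Random(seed)
    num_points = len(point_to_patch)
    num_masked = int(round(point_mask_ratio * num_points))

    # Randomly mask a subset of the point tokens.
    masked_points = set(rng.sample(range(num_points), num_masked))
    visible_points = [i for i in range(num_points) if i not in masked_points]

    # Complementary rule (assumption): an image patch is masked whenever a
    # visible point token projects onto it, so the two branches see
    # different regions of the scene.
    masked_patches = {point_to_patch[i] for i in visible_points}
    visible_patches = [p for p in range(num_patches) if p not in masked_patches]

    return sorted(masked_points), visible_points, sorted(masked_patches), visible_patches
```

With this rule, no image patch that stays visible overlaps a visible point token, which is one simple way to read "complementary" alignment of the two masks.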
2303.08120
Report |
Blind Video Deflickering by Neural Filtering with a Flawed Atlas |
Chenyang Lei, Xuanchi Ren, Zhaoxiang Zhang, Qifeng Chen |
Many videos contain flickering artifacts. Common causes of flicker include
video processing algorithms, video generation algorithms, and capturing videos
under specific situations. Prior work usually requires specific guidance such
as the flickering frequency, manual annotations, or extra consistent videos to
remove the flicker. In this work, we propose a general flicker removal
framework that only receives a single flickering video as input without
additional guidance. Since it is blind to a specific flickering type or
guidance, we name this "blind deflickering." The core of our approach is
utilizing the neural atlas in cooperation with a neural filtering strategy. The
neural atlas is a unified representation for all frames in a video that
provides temporal consistency guidance but is flawed in many cases. To this
end, a neural network is trained to mimic a filter to learn the consistent
features (e.g., color, brightness) and avoid introducing the artifacts in the
atlas. To validate our method, we construct a dataset that contains diverse
real-world flickering videos. Extensive experiments show that our method
achieves satisfying deflickering performance and even outperforms baselines
that use extra guidance on a public benchmark. |
This paper proposes the first "blind deflickering" approach, capable of removing diverse flickering artifacts from videos without needing to know the specific flicker type or requiring extra guidance. |
Many videos suffer from flickering artifacts due to various reasons (e.g., old cameras, high-speed cameras, video processing algorithms), and existing methods are often task-specific or require additional guidance like consistent videos, which limits their applicability. |
The method leverages a neural atlas to represent all video frames in a unified manner, ensuring temporal consistency. Since the atlas can have flaws, it employs a neural filtering strategy to learn invariant features from distorted versions of the atlas and input frames, effectively removing flicker while preserving important details. |
The proposed method achieves state-of-the-art performance on a newly constructed dataset containing diverse flickering videos.
It outperforms baselines designed for specific flickering types.
Even without using extra input videos for guidance, it surpasses methods that rely on them, demonstrating its effectiveness and broader applicability. |
The method might not handle temporal inconsistencies arising from content variations (e.g., significant content differences in generated videos or large scratches in old films).
Future work could explore extensions to address these limitations and apply the blind deflickering concept to other tasks like novel view synthesis. |
video deflickering, neural atlas, neural filtering, temporal consistency, video processing |
2303.08096
Report |
MELON: NeRF with Unposed Images in SO(3) |
Axel Levy, Mark Matthews, Matan Sela, Gordon Wetzstein, Dmitry Lagun |
Neural radiance fields enable novel-view synthesis and scene reconstruction
with photorealistic quality from a few images, but require known and accurate
camera poses. Conventional pose estimation algorithms fail on smooth or
self-similar scenes, while methods performing inverse rendering from unposed
views require a rough initialization of the camera orientations. The main
difficulty of pose estimation lies in real-life objects being almost invariant
under certain transformations, making the photometric distance between rendered
views non-convex with respect to the camera parameters. Using an equivalence
relation that matches the distribution of local minima in camera space, we
reduce this space to its quotient set, in which pose estimation becomes a more
convex problem. Using a neural network to regularize pose estimation, we
demonstrate that our method - MELON - can reconstruct a neural radiance field
from unposed images with state-of-the-art accuracy while requiring ten times
fewer views than adversarial approaches. |
MELON infers a neural radiance field from unposed images by simultaneously training a CNN encoder that maps images to camera poses and a neural radiance field of the scene. |
This approach eliminates the need for known camera poses in neural rendering, which is a significant limitation in applications like novel view synthesis and scene reconstruction. |
MELON introduces a Modulo-Equivalent Loss (MEL) that replicates encoder outputs based on an equivalence relation in camera space. This allows the encoder to operate in a quotient set, simplifying pose estimation. The method is illustrated with a 1D toy problem and then applied to 3D inverse rendering. |
MELON demonstrates competitive reconstruction metrics on synthetic and real datasets, outperforming existing methods like GNeRF in terms of pose accuracy and novel view synthesis quality.
It exhibits robustness to noise, generating noise-free novel views even from noisy input images.
The method requires significantly fewer views compared to adversarial approaches, successfully reconstructing scenes from as few as six unposed images. |
Characterizing the full loss landscape of 3D inverse rendering with unknown poses is still an open question, and the theoretical analysis currently relies on simplifying assumptions.
The assumption of a perfectly object-centered setup limits its applicability to real-world scenarios, and predicting camera extrinsics in SE(3) remains a challenge. |
neural radiance fields, pose estimation, novel view synthesis, inverse rendering, unposed images |
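The replication idea behind the Modulo-Equivalent Loss in the MELON entry above can be sketched on a 1D toy problem. The toy `render` function, the `photometric` distance, and the choice to take the minimum loss over the replicated poses are assumptions made for illustration; the entry only states that encoder outputs are replicated according to an equivalence relation.

```python
import math


def render(theta):
    """Toy renderer: a 2-pixel 'image' that is almost invariant under a
    rotation by pi (hypothetical stand-in for a self-similar object)."""
    return [
        math.cos(2 * theta) + 0.1 * math.cos(theta),
        math.sin(2 * theta) + 0.1 * math.sin(theta),
    ]


def photometric(a, b):
    """Squared L2 distance between two rendered views."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def modulo_equivalent_loss(theta_pred, target, num_replicas=2):
    """Replicate the predicted angle over its equivalence class
    (theta + 2*pi*k/num_replicas) and keep the best replica."""
    candidates = [theta_pred + 2 * math.pi * k / num_replicas for k in range(num_replicas)]
    losses = [photometric(render(c), target) for c in candidates]
    best = min(range(num_replicas), key=losses.__getitem__)
    return losses[best], candidates[best]
```

In this toy setting a prediction that is off by exactly pi sits in a local minimum of the plain photometric loss, while the replicated loss still recovers the true angle, because one of the replicas coincides with it.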
2303.08085
Report |
Alias-Free Convnets: Fractional Shift Invariance via Polynomial Activations |
Hagay Michaeli, Tomer Michaeli, Daniel Soudry |
Although CNNs are believed to be invariant to translations, recent works have
shown this is not the case, due to aliasing effects that stem from downsampling
layers. The existing architectural solutions to prevent aliasing are partial
since they do not address the aliasing effects that originate in non-linearities. We
propose an extended anti-aliasing method that tackles both downsampling and
non-linear layers, thus creating truly alias-free, shift-invariant CNNs. We
show that the presented model is invariant to integer as well as fractional
(i.e., sub-pixel) translations, thus outperforming other shift-invariant
methods in terms of robustness to adversarial translations. |
This paper proposes Alias-Free Convnet (AFC), a convolutional neural network architecture that achieves shift-invariance by eliminating aliasing effects through the use of polynomial activations and alias-free downsampling layers. |
Shift-invariance is a desirable property in CNNs for image classification as it improves generalization, robustness to adversarial attacks, and consistency of predictions under image translations. |
The authors modify the ConvNeXt architecture by replacing standard activations with polynomial activations and strided convolutions with BlurPool layers. Polynomial activations expand a signal's bandwidth by only a bounded factor, so the aliasing they would cause can be avoided by upsampling before the activation and low-pass filtering and downsampling after it. BlurPool layers implement alias-free downsampling using low-pass filtering before subsampling. |
AFC achieves 100% consistency to both integer and fractional pixel translations.
AFC demonstrates superior robustness to adversarial attacks based on image translations, maintaining high accuracy even under fractional pixel shifts.
The study provides the first demonstration of competitive performance with polynomial activations on ImageNet. |
The modifications for alias-free properties come at the cost of a slight reduction in standard test accuracy compared to the baseline ConvNeXt model.
The guaranteed shift-invariance of AFC is limited to circular translations. While the paper shows improved robustness to other types of translations, perfect invariance is not guaranteed. |
convolutional neural networks, shift-invariance, aliasing, polynomial activations, adversarial robustness |
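The "low-pass filtering before subsampling" step in the Alias-Free Convnets entry above can be illustrated in one dimension. This is a minimal sketch that assumes circular padding and a small binomial kernel; the kernel, the function names, and the 1D setting are illustrative choices, and such a short kernel only attenuates aliasing rather than removing it for every signal.

```python
def blur_pool_1d(signal, stride=2, kernel=(0.25, 0.5, 0.25)):
    """Circularly low-pass filter the signal, then subsample it."""
    n = len(signal)
    half = len(kernel) // 2
    blurred = [
        sum(w * signal[(i + j - half) % n] for j, w in enumerate(kernel))
        for i in range(n)
    ]
    return blurred[::stride]


def naive_subsample(signal, stride=2):
    """Strided subsampling without any anti-aliasing filter."""
    return list(signal[::stride])


x = [0.0, 4.0, 0.0, 4.0]          # highest-frequency pattern
x_shifted = x[1:] + x[:1]         # the same pattern shifted by one pixel

print(naive_subsample(x), naive_subsample(x_shifted))  # [0.0, 0.0] [4.0, 4.0]
print(blur_pool_1d(x), blur_pool_1d(x_shifted))        # [2.0, 2.0] [2.0, 2.0]
```

The naive output flips completely under a one-pixel shift, while the blurred output does not change, which is the behaviour that anti-aliased downsampling is meant to provide.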
2303.08084
Report |
Editing Implicit Assumptions in Text-to-Image Diffusion Models |
Hadas Orgad, Bahjat Kawar, Yonatan Belinkov |
Text-to-image diffusion models often make implicit assumptions about the
world when generating images. While some assumptions are useful (e.g., the sky
is blue), they can also be outdated, incorrect, or reflective of social biases
present in the training data. Thus, there is a need to control these
assumptions without requiring explicit user input or costly re-training. In
this work, we aim to edit a given implicit assumption in a pre-trained
diffusion model. Our Text-to-Image Model Editing method, TIME for short,
receives a pair of inputs: a "source" under-specified prompt for which the
model makes an implicit assumption (e.g., "a pack of roses"), and a
"destination" prompt that describes the same setting, but with a specified
desired attribute (e.g., "a pack of blue roses"). TIME then updates the model's
cross-attention layers, as these layers assign visual meaning to textual
tokens. We edit the projection matrices in these layers such that the source
prompt is projected close to the destination prompt. Our method is highly
efficient, as it modifies a mere 2.2% of the model's parameters in under one
second. To evaluate model editing approaches, we introduce TIMED (TIME
Dataset), containing 147 source and destination prompt pairs from various
domains. Our experiments (using Stable Diffusion) show that TIME is successful
in model editing, generalizes well for related prompts unseen during editing,
and imposes minimal effect on unrelated generations. |
TIME is a method for editing implicit assumptions in text-to-image diffusion models after training. |
Text-to-image models often make implicit assumptions that are useful but can also be outdated, incorrect, or reflect societal biases. |
TIME leverages the ability of diffusion models to generate different outputs based on explicit specification. It modifies the projection matrices in the cross-attention layers to map a user-specified 'source' prompt closer to a 'destination' prompt with the desired attribute. |
TIME successfully edits model assumptions for various prompts.
The method generalizes well to related prompts unseen during editing.
The overall generative quality of the model remains unaffected after editing as measured by FID and CLIP Score. |
TIME inherits the generative limitations of the diffusion model it edits; it cannot teach the model entirely new concepts.
The method can sometimes apply an edit too mildly or aggressively, hindering its generalizability or specificity. |
text-to-image generation, diffusion models, model editing, implicit assumptions, social bias mitigation |
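One way to make "the source prompt is projected close to the destination prompt" in the TIME entry above concrete is a ridge-regularized least-squares problem. The formulation below is a sketch under stated assumptions: $W$ is an original cross-attention key or value projection matrix, $W'$ its edited version, $c_i$ the embeddings of the source prompt tokens, $c_i^{*}$ the embeddings of the corresponding destination prompt tokens, and $\lambda > 0$ a regularization strength. These symbols are introduced here for illustration and do not appear in the entry.

```latex
\min_{W'} \; \sum_{i} \bigl\lVert W' c_i - W c_i^{*} \bigr\rVert_2^2
          \;+\; \lambda \,\bigl\lVert W' - W \bigr\rVert_F^2

% Setting the gradient with respect to W' to zero:
%   \sum_i (W' c_i - W c_i^{*}) c_i^{\top} + \lambda (W' - W) = 0
% gives the closed-form solution

W' \;=\; \Bigl( \lambda W + \sum_{i} W c_i^{*} c_i^{\top} \Bigr)
         \Bigl( \lambda I + \sum_{i} c_i c_i^{\top} \Bigr)^{-1}
```

A closed form of this kind needs no gradient-based training, which would be consistent with the sub-second editing time quoted in the abstract. The second term keeps $W'$ close to $W$, which limits the effect on unrelated prompts.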
2303.08063
Report |
Interpretable ODE-style Generative Diffusion Model via Force Field Construction |
Weiyang Jin, Yongpei Zhu, Yuxi Peng |
For a considerable time, researchers have focused on developing a method that
establishes a deep connection between the generative diffusion model and
mathematical physics. Despite previous efforts, progress has been limited to
the pursuit of a single specialized method. In order to advance the
interpretability of diffusion models and explore new research directions, it is
essential to establish a unified ODE-style generative diffusion model. Such a
model should draw inspiration from physical models and possess a clear
geometric meaning. This paper aims to identify various physical models that are
suitable for constructing ODE-style generative diffusion models accurately from
a mathematical perspective. We then summarize these models into a unified
method. Additionally, we perform a case study where we use the theoretical
model identified by our method to develop a range of new diffusion model
methods, and conduct experiments. Our experiments on CIFAR-10 demonstrate the
effectiveness of our approach. We have constructed a computational framework
that attains highly proficient results with regard to image generation speed,
alongside an additional model that demonstrates exceptional performance in both
Inception score and FID score. These results underscore the significance of our
method in advancing the field of diffusion models. |
This paper proposes a novel method for constructing interpretable ODE-style generative diffusion models by leveraging force field construction inspired by physical models. |
This work aims to enhance the interpretability of diffusion models and explore new research avenues by establishing a unified ODE-style generative diffusion model framework grounded in mathematical physics. |
The authors establish a connection between ODE-style diffusion models and the transport equation from physics. They utilize Green's functions to construct vector fields satisfying initial and final distribution conditions and provide solutions for specific cases like isotropic fields. Different trajectory types (linear, distribution-based, curve) are proposed and their learning process is formulated as a score matching objective. Sampling methods tailored for each trajectory type are also presented. |
The diffusion model with multi-sample linear superposition achieved the best Inception and FID scores on CIFAR-10.
The diffusion model with a one-sample straight line demonstrated high efficiency within a limited number of iterations.
The study revealed that excessively high numbers of fitted curves in multi-sample straight line models can lead to mode collapse. |
The paper primarily focuses on image generation, and further investigation is needed to extend the proposed method to other data modalities.
The analysis of mode collapse in multi-sample straight-line trajectories, while insightful, could benefit from more detailed theoretical exploration. |
generative diffusion models, force field construction, ode-style diffusion models, interpretable machine learning, score matching |
2303.07945
Report |
Edit-A-Video: Single Video Editing with Object-Aware Consistency |
Chaehun Shin, Heeseung Kim, Che Hyun Lee, Sang-gil Lee, Sungroh Yoon |
Although text-to-video (TTV) models have recently achieved
remarkable success, few approaches have extended TTV to
video editing. Motivated by TTV models adapted from
diffusion-based text-to-image (TTI) models, we propose a video editing
framework given only a pretrained TTI model and a single text-video pair,
which we term Edit-A-Video. The framework consists of two stages: (1) inflating
the 2D model into the 3D model by appending temporal modules and tuning on the
source video, and (2) inverting the source video into the noise and editing with
target text prompt and attention map injection. Each stage enables the temporal
modeling and preservation of semantic attributes of the source video. One of
the key challenges for video editing is the background inconsistency
problem, where the regions not included for the edit suffer from undesirable
and inconsistent temporal alterations. To mitigate this issue, we also
introduce a novel mask blending method, termed sparse-causal blending (SC
Blending). We improve previous mask blending methods to reflect the temporal
consistency so that the area where the editing is applied exhibits smooth
transition while also achieving spatio-temporal consistency of the unedited
regions. We present extensive experimental results over various types of text
and videos, and demonstrate the superiority of the proposed method compared to
baselines in terms of background consistency, text alignment, and video editing
quality. |
This paper introduces Edit-A-Video, a novel framework for text-guided video editing using a pre-trained text-to-image (TTI) model and a single source video. |
Existing text-to-video editing methods often struggle to maintain temporal consistency, especially in background regions, leading to unrealistic and jarring edits. |
Edit-A-Video consists of two stages: 1) Inflating a 2D TTI model to a 3D model by adding temporal modules and finetuning on the source video, 2) Inverting the source video into noise and iteratively denoising it towards the target text while injecting attention maps. Crucially, it employs a novel "sparse-causal blending" (SC Blending) method to ensure smooth and consistent edits across frames. |
Edit-A-Video successfully edits videos to match target text prompts while preserving background consistency and source video dynamics.
The proposed SC Blending method significantly reduces background inconsistencies compared to traditional blending techniques.
Quantitative and qualitative comparisons demonstrate Edit-A-Video's superiority over existing methods in terms of editing quality, text alignment, and background preservation. |
The method is currently limited to editing short video clips due to computational constraints.
Future work could explore more sophisticated temporal modeling techniques and user-interactive editing tools. |
video editing, text-guided synthesis, diffusion models, temporal consistency, attention mechanisms |
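The mask blending step in the Edit-A-Video entry above can be sketched as follows. Per-frame blending with a mask is standard: edited content is kept where the mask is 1 and source content elsewhere. How the mask is made temporally consistent is an assumption here: this sketch takes the element-wise maximum of the masks of the first, previous, and current frames. The function name and the flat-list frame representation are hypothetical.

```python
def blend_frames(source, edited, masks):
    """Blend edited frames into source frames with temporally smoothed masks.

    source, edited: lists of frames, each frame a flat list of pixel values.
    masks: list of per-frame masks with values in [0, 1], same layout.
    """
    blended = []
    for t, (src, edt) in enumerate(zip(source, edited)):
        # Assumption: combine the masks of the first, previous and current
        # frame so that the edited region does not flicker between frames.
        references = [masks[0], masks[max(t - 1, 0)], masks[t]]
        smooth_mask = [max(values) for values in zip(*references)]

        # Standard mask blending: edited pixels inside the mask,
        # untouched source pixels outside it.
        blended.append([m * e + (1 - m) * s for m, e, s in zip(smooth_mask, edt, src)])
    return blended
```

Pixels outside every referenced mask are copied from the source video unchanged, which is what keeps unedited background regions stable over time.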
2303.07938
Report |
Controllable Mesh Generation Through Sparse Latent Point Diffusion Models |
Zhaoyang Lyu, Jinyi Wang, Yuwei An, Ya Zhang, Dahua Lin, Bo Dai |
Mesh generation is of great value in various applications involving computer
graphics and virtual content, yet designing generative models for meshes is
challenging due to their irregular data structure and inconsistent topology of
meshes in the same category. In this work, we design a novel sparse latent
point diffusion model for mesh generation. Our key insight is to regard point
clouds as an intermediate representation of meshes, and model the distribution
of point clouds instead. While meshes can be generated from point clouds via
techniques like Shape as Points (SAP), the challenges of directly generating
meshes can be effectively avoided. To boost the efficiency and controllability
of our mesh generation method, we propose to further encode point clouds to a
set of sparse latent points with point-wise, semantically meaningful features, where
two DDPMs are trained in the space of sparse latent points to respectively
model the distribution of the latent point positions and features at these
latent points. We find that sampling in this latent space is faster than
directly sampling dense point clouds. Moreover, the sparse latent points also
enable us to explicitly control both the overall structures and local details
of the generated meshes. Extensive experiments are conducted on the ShapeNet
dataset, where our proposed sparse latent point diffusion model achieves
superior performance in terms of generation quality and controllability when
compared to existing methods. |
This paper proposes SLIDE, a novel sparse latent point diffusion model for controllable 3D mesh generation. |
Mesh generation is crucial in computer graphics but challenging due to irregular mesh data and inconsistent topology. Existing methods struggle with limited topology and quality issues. This work aims to address these challenges by using point clouds as an intermediate representation and introducing a novel sparse latent point diffusion model. |
The approach involves: 1) Training an autoencoder that encodes a point cloud to features at a sparse set of latent points and decodes it back. 2) Training two DDPMs in the latent space, one for the distribution of sparse latent point positions and the other for the distribution of features at these points. |
SLIDE generates high-quality meshes with diverse topologies, outperforming baselines in visual quality and metrics like 1-NN, MMD, and COV.
The model allows controllable mesh generation by manipulating the positions of sparse latent points, enabling control over overall structure and local details without part annotations.
SLIDE is efficient, achieving faster generation speeds compared to DDPMs directly trained on point clouds. |
The correspondence between sparse latent points across different shapes needs improvement for better control.
Exploring alternative surface reconstruction techniques beyond SAP might further enhance mesh quality. |
mesh generation, point cloud, diffusion models, deep learning, controllable generation |
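The SLIDE entry above encodes a dense point cloud into features at a sparse set of latent points but does not say how those points are chosen. A common choice for picking well-spread points is farthest point sampling, sketched below as an assumption rather than as the authors' method; the function name is hypothetical.

```python
def farthest_point_sampling(points, k, start=0):
    """Return indices of k points that are spread out over the point cloud.

    points: list of coordinate tuples; start: index of the first chosen point.
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    chosen = [start]
    # Distance from every point to its nearest already-chosen point.
    min_dist = [squared_distance(p, points[start]) for p in points]

    while len(chosen) < k:
        # Pick the point that is farthest from the current selection.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        min_dist = [
            min(d, squared_distance(p, points[nxt]))
            for d, p in zip(min_dist, points)
        ]
    return chosen
```

Each dense point can then be associated with its nearest selected point, so moving a latent point gives a handle on the local region it covers, in the spirit of the control described in the entry.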
2303.07937
Report |
Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation |
Junyoung Seo, Wooseok Jang, Min-Seop Kwak, Hyeonsu Kim, Jaehoon Ko, Junho Kim, Jin-Hwa Kim, Jiyoung Lee, Seungryong Kim |
Text-to-3D generation has shown rapid progress in recent days with the advent
of score distillation, a methodology of using pretrained text-to-2D diffusion
models to optimize neural radiance field (NeRF) in the zero-shot setting.
However, the lack of 3D awareness in the 2D diffusion models destabilizes score
distillation-based methods from reconstructing a plausible 3D scene. To address
this issue, we propose 3DFuse, a novel framework that incorporates 3D awareness
into pretrained 2D diffusion models, enhancing the robustness and 3D
consistency of score distillation-based methods. We realize this by first
constructing a coarse 3D structure of a given text prompt and then utilizing a
projected, view-specific depth map as a condition for the diffusion model.
Additionally, we introduce a training strategy that enables the 2D diffusion
model to learn to handle the errors and sparsity within the coarse 3D structure
for robust generation, as well as a method for ensuring semantic consistency
throughout all viewpoints of the scene. Our framework surpasses the limitations
of prior art, and has significant implications for 3D consistent generation of
2D diffusion models. |
This paper introduces 3DFuse, a novel framework that improves 3D consistency in text-to-3D generation by incorporating 3D awareness into pretrained 2D diffusion models. |
Existing score distillation-based text-to-3D generation methods often produce geometrically inconsistent scenes due to the lack of 3D awareness in 2D diffusion models. |
3DFuse uses a consistency injection module to condition the diffusion model on sparse depth projections of a generated point cloud, effectively guiding the generation process with 3D information. It also employs semantic code sampling to reduce ambiguity in text prompts and enhance semantic consistency. |
3DFuse significantly improves the geometric consistency and fidelity of generated 3D scenes compared to baselines like DreamFusion, SJC, and ProlificDreamer.
Qualitative results and a proposed COLMAP-based quantitative metric demonstrate the effectiveness of 3DFuse in ensuring geometric consistency.
User studies confirm that 3DFuse generates 3D scenes with higher fidelity and better geometric consistency than previous methods. |
The approach inherits the limitations of pretrained diffusion models in reflecting complex user prompts.
Potential societal biases inherent in the training data may affect the generated 3D scenes.
Future work could explore incorporating more sophisticated 3D priors and addressing the limitations of pretrained diffusion models. |
text-to-3d generation, score distillation sampling, 3d consistency, diffusion models, neural radiance fields (nerf) |
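The "sparse depth projections of a generated point cloud" mentioned in the 3DFuse entry above can be sketched with a pinhole camera and a z-buffer. The camera convention (x right, y down, z forward, principal point at the image centre), the function name, and the use of 0 for empty pixels are assumptions made for this illustration.

```python
def project_sparse_depth(points, focal, width, height):
    """Project camera-space points to a sparse depth map.

    points: iterable of (x, y, z) in camera coordinates.
    Returns a height x width nested list; 0.0 marks pixels with no point.
    """
    depth = [[0.0] * width for _ in range(height)]
    cx, cy = (width - 1) / 2, (height - 1) / 2

    for x, y, z in points:
        if z <= 0:
            continue  # the point is behind the camera
        u = int(round(focal * x / z + cx))
        v = int(round(focal * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            # z-buffer: keep the closest point that lands on this pixel.
            if depth[v][u] == 0.0 or z < depth[v][u]:
                depth[v][u] = z
    return depth
```

Rendering the same point cloud from each sampled viewpoint gives one sparse depth map per view, which can then serve as the view-specific condition described in the entry.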
2303.07820
Report |
Adaptive Rotated Convolution for Rotated Object Detection |
Yifan Pu, Yiru Wang, Zhuofan Xia, Yizeng Han, Yulin Wang, Weihao Gan, Zidong Wang, Shiji Song, Gao Huang |
Rotated object detection aims to identify and locate objects in images with
arbitrary orientation. In this scenario, the oriented directions of objects
vary considerably across different images, while multiple orientations of
objects exist within an image. This intrinsic characteristic makes it
challenging for standard backbone networks to extract high-quality features of
these arbitrarily oriented objects. In this paper, we present the Adaptive
Rotated Convolution (ARC) module to handle the aforementioned challenges. In
our ARC module, the convolution kernels rotate adaptively to extract object
features with varying orientations in different images, and an efficient
conditional computation mechanism is introduced to accommodate the large
orientation variations of objects within an image. The two designs work
seamlessly in the rotated object detection problem. Moreover, ARC can conveniently
serve as a plug-and-play module in various vision backbones to boost their
representation ability to detect oriented objects accurately. Experiments on
commonly used benchmarks (DOTA and HRSC2016) demonstrate that equipped with our
proposed ARC module in the backbone network, the performance of multiple
popular oriented object detectors is significantly improved (e.g., +3.03% mAP on
Rotated RetinaNet and +4.16% on CFA). Combined with the highly competitive
method Oriented R-CNN, the proposed approach achieves state-of-the-art
performance on the DOTA dataset with 81.77% mAP. Code is available at
https://github.com/LeapLabTHU/ARC. |
This paper proposes Adaptive Rotated Convolution (ARC), a plug-and-play module for boosting backbone network performance in rotated object detection. |
Standard backbone networks struggle to extract quality features from arbitrarily oriented objects, as their orientations differ significantly across and within images. |
ARC rotates convolution kernels adaptively based on input image features using a routing function. It employs conditional computation to efficiently handle multiple object orientations within a single image. |
ARC significantly improves the performance of various rotated object detectors (single-stage and two-stage) on DOTA and HRSC2016 datasets.
Combined with Oriented R-CNN, ARC achieves state-of-the-art performance on DOTA.
ARC maintains efficiency with a minimal increase in FLOPs and a slight drop in FPS compared to baseline models. |
The paper mainly focuses on replacing 3x3 convolutions and doesn't explore other kernel sizes extensively.
Future work could investigate the integration of ARC with transformer-based backbones. |
rotated object detection, adaptive convolution, dynamic networks, conditional computation, backbone networks |
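The kernel rotation described in the ARC entry above can be sketched for a single 3x3 kernel. This sketch assumes rotation about the kernel centre with bilinear interpolation and zero padding, and it models the conditional computation as a weighted sum of several rotated kernels. The angles and weights would come from the routing function, which is not modelled here; all function names are hypothetical.

```python
import math


def rotate_kernel_3x3(kernel, theta):
    """Rotate a 3x3 kernel by theta radians using bilinear interpolation."""
    def sample(x, y):
        # Bilinear lookup at continuous coordinates, (0, 0) is the centre;
        # positions outside the 3x3 support contribute zero.
        x0, y0 = math.floor(x), math.floor(y)
        total = 0.0
        for dy in (0, 1):
            for dx in (0, 1):
                xi, yi = x0 + dx, y0 + dy
                weight = (1 - abs(x - xi)) * (1 - abs(y - yi))
                if -1 <= xi <= 1 and -1 <= yi <= 1:
                    total += weight * kernel[yi + 1][xi + 1]
        return total

    c, s = math.cos(theta), math.sin(theta)
    rotated = [[0.0] * 3 for _ in range(3)]
    for row, y in enumerate((-1, 0, 1)):
        for col, x in enumerate((-1, 0, 1)):
            # Map each output position back into the original kernel.
            rotated[row][col] = sample(c * x + s * y, -s * x + c * y)
    return rotated


def adaptive_kernel(kernels, angles, weights):
    """Weighted sum of kernels, each rotated by its own predicted angle."""
    rotated = [rotate_kernel_3x3(k, a) for k, a in zip(kernels, angles)]
    return [
        [sum(w * r[i][j] for w, r in zip(weights, rotated)) for j in range(3)]
        for i in range(3)
    ]
```

Because the rotated kernels are combined into one kernel before the convolution is applied, a design like this adds only a small amount of computation on top of a standard 3x3 convolution.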
2303.07418
Report |
FreeNeRF: Improving Few-shot Neural Rendering with Free Frequency Regularization |
Jiawei Yang, Marco Pavone, Yue Wang |
Novel view synthesis with sparse inputs is a challenging problem for neural
radiance fields (NeRF). Recent efforts alleviate this challenge by introducing
external supervision, such as pre-trained models and extra depth signals, and
by non-trivial patch-based rendering. In this paper, we present Frequency
regularized NeRF (FreeNeRF), a surprisingly simple baseline that outperforms
previous methods with minimal modifications to the plain NeRF. We analyze the
key challenges in few-shot neural rendering and find that frequency plays an
important role in NeRF's training. Based on the analysis, we propose two
regularization terms. One is to regularize the frequency range of NeRF's
inputs, while the other is to penalize the near-camera density fields. Both
techniques are "free lunches" at no additional computational cost. We
demonstrate that even with one line of code change, the original NeRF can
achieve similar performance as other complicated methods in the few-shot
setting. FreeNeRF achieves state-of-the-art performance across diverse
datasets, including Blender, DTU, and LLFF. We hope this simple baseline will
motivate a rethinking of the fundamental role of frequency in NeRF's training
under the low-data regime and beyond. |
This paper introduces FreeNeRF, a simple yet effective baseline for few-shot neural rendering that leverages frequency and occlusion regularization. |
Few-shot neural rendering is challenging because NeRF models often overfit to limited training views and struggle to generalize to novel views. |
The authors analyze the failure modes of NeRF in few-shot settings and propose (1) frequency regularization to stabilize training by gradually introducing high-frequency components, and (2) occlusion regularization to penalize dense fields near the camera, mitigating artifacts like floaters. |
FreeNeRF outperforms previous state-of-the-art methods on Blender, DTU, and LLFF datasets in terms of novel view synthesis quality.
The proposed method introduces minimal computational overhead compared to a plain NeRF, requiring no pre-training or additional rendering steps.
Ablation studies validate the effectiveness of both frequency and occlusion regularization in improving few-shot neural rendering performance. |
A longer frequency curriculum can cause blurriness, resulting in worse (i.e., higher) LPIPS scores despite higher PSNR.
Occlusion regularization might lead to over-regularization and incomplete representations of near-camera objects. |
neural rendering, nerf, few-shot learning, frequency regularization, occlusion regularization |
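The two regularizers in the FreeNeRF entry above can be sketched in a few lines. The linear schedule, the function names, and the use of a plain mean for the near-camera penalty are simplifying assumptions for illustration; the entry only states that high-frequency components are introduced gradually and that density near the camera is penalized.

```python
def frequency_mask(step, total_steps, num_freqs):
    """Per-frequency weights in [0, 1], lowest frequency first.

    The number of visible frequency bands grows linearly with training
    progress; the band currently being introduced gets a fractional weight.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    visible = num_freqs * progress
    return [min(max(visible - i, 0.0), 1.0) for i in range(num_freqs)]


def occlusion_penalty(densities, num_near):
    """Mean density over the num_near samples closest to the camera.

    densities: per-sample densities along one ray, ordered near to far.
    """
    near = densities[:num_near]
    return sum(near) / len(near)
```

The mask would be multiplied element-wise with the positional-encoding features before they enter the network, and the penalty would be added to the rendering loss with a small weight. Neither step needs an extra network or additional rendering passes, in line with the low overhead stated in the entry.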
2303.07274
Report |
Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images |
Nitzan Bitton-Guetta, Yonatan Bitton, Jack Hessel, Ludwig Schmidt, Yuval Elovici, Gabriel Stanovsky, Roy Schwartz |
Weird, unusual, and uncanny images pique the curiosity of observers because
they challenge commonsense. For example, an image released during the 2022
World Cup depicts the famous soccer stars Lionel Messi and Cristiano Ronaldo
playing chess, which playfully violates our expectation that their competition
should occur on the football field. Humans can easily recognize and interpret
these unconventional images, but can AI models do the same? We introduce
WHOOPS!, a new dataset and benchmark for visual commonsense. The dataset is
comprised of purposefully commonsense-defying images created by designers using
publicly-available image generation tools like Midjourney. We consider several
tasks posed over the dataset. In addition to image captioning, cross-modal
matching, and visual question answering, we introduce a difficult explanation
generation task, where models must identify and explain why a given image is
unusual. Our results show that state-of-the-art models such as GPT3 and BLIP2
still lag behind human performance on WHOOPS!. We hope our dataset will inspire
the development of AI models with stronger visual commonsense reasoning
abilities. Data, models and code are available at the project website:
whoops-benchmark.github.io |
Introduced WHOOPS!, a novel dataset of 500 synthetic images designed to challenge AI models' ability to reason about commonsense and compositionality in vision-and-language tasks. |
Existing vision-and-language models struggle to demonstrate commonsense reasoning and often rely on superficial correlations. WHOOPS! provides a challenging benchmark to foster development in this area. |
Employed a human-in-the-loop approach, using designers and text-to-image models (e.g., Midjourney) to craft images that violate commonsense expectations. Collected annotations for four tasks: explanation generation, image captioning, cross-modal matching, and visual question answering. |
State-of-the-art models lag significantly behind human performance on all tasks, particularly in generating explanations for the unusual images.
Analysis reveals that the challenge stems from the 'weirdness' of the images, not their synthetic nature.
Developed an automatic evaluation metric for explanation generation using GPT4, achieving over 81% accuracy compared to human judgment. |
The dataset size, while sufficient for the current study, could be expanded to encompass a wider range of commonsense violations.
Despite efforts to filter for potentially offensive content, some images might still be perceived as such and require further refinement. |
commonsense reasoning, vision and language, image captioning, visual question answering, explanation generation |
2303.07216
Report |
Parallel Vertex Diffusion for Unified Visual Grounding |
Zesen Cheng, Kehan Li, Peng Jin, Xiangyang Ji, Li Yuan, Chang Liu, Jie Chen |
Unified visual grounding pursues a simple and generic technical route to
leverage multi-task data with less task-specific design. The most advanced
methods typically present boxes and masks as vertex sequences to model
referring detection and segmentation as an autoregressive sequential vertex
generation paradigm. However, generating high-dimensional vertex sequences
sequentially is error-prone because the upstream of the sequence remains static
and cannot be refined based on downstream vertex information, even if there is
a significant location gap. Besides, with limited vertexes, the inferior
fitting of objects with complex contours restricts the performance upper bound.
To deal with this dilemma, we propose a parallel vertex generation paradigm for
superior high-dimension scalability with a diffusion model by simply modifying
the noise dimension. An intuitive materialization of our paradigm is Parallel
Vertex Diffusion (PVD) to directly set vertex coordinates as the generation
target and use a diffusion model to train and infer. We claim that it has two
flaws: (1) unnormalized coordinates cause a high variance of the loss value; (2)
the original training objective of PVD only considers point consistency but
ignores geometry consistency. To solve the first flaw, Center Anchor Mechanism
(CAM) is designed to convert coordinates as normalized offset values to
stabilize the training loss value. For the second flaw, Angle summation loss
(ASL) is designed to constrain the geometry difference of prediction and ground
truth vertexes for geometry-level consistency. Empirical results show that our
PVD achieves state-of-the-art in both referring detection and segmentation, and
our paradigm is more scalable and efficient than sequential vertex generation
with high-dimension data. |
This paper proposes Parallel Vertex Diffusion (PVD), a novel paradigm for unified visual grounding that leverages a diffusion model to generate vertexes of bounding boxes and masks in parallel, overcoming limitations of sequential generation methods. |
Existing sequential vertex generation methods for unified visual grounding suffer from error accumulation and struggle to scale to high-dimensional data (complex object boundaries). This paper addresses these issues with a parallel approach using diffusion models, enabling more accurate and efficient grounding. |
The proposed PVD method utilizes a diffusion model with a specifically designed "Denoiser" network. To further enhance performance, a Center Anchor Mechanism (CAM) is introduced for coordinate normalization and an Angle Summation Loss (ASL) for ensuring geometry consistency. |
PVD achieves state-of-the-art results on benchmark datasets for both referring expression comprehension (REC) and referring image segmentation (RIS).
PVD demonstrates superior scalability compared to sequential methods, exhibiting improved performance and efficiency with an increasing number of vertexes.
Quantitative analysis highlights the effectiveness of CAM and ASL in stabilizing training and enhancing geometry consistency. |
The current implementation of PVD is limited to generating a fixed number of vertexes. Adaptively determining the optimal number based on object complexity could be explored.
Investigating more sophisticated geometry constraints beyond angle summation could further improve performance, particularly for highly irregular objects. |
visual grounding, referring expression comprehension, referring image segmentation, diffusion models, parallel vertex generation |
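The Center Anchor Mechanism in this report converts raw vertex coordinates into normalized offsets to stabilize the loss. A minimal sketch of that conversion follows. It assumes the anchor is a single reference point and that offsets are scaled by the image width and height; both choices, and the function names `to_offsets` and `from_offsets`, are illustrative assumptions, not the authors' implementation.

```python
def to_offsets(vertices, anchor, width, height):
    """Express each (x, y) vertex as an offset from the anchor.

    Assumption: offsets are divided by the image size so that they stay
    in a small, scale-independent range.
    """
    ax, ay = anchor
    return [((x - ax) / width, (y - ay) / height) for x, y in vertices]


def from_offsets(offsets, anchor, width, height):
    """Invert to_offsets and recover pixel coordinates."""
    ax, ay = anchor
    return [(dx * width + ax, dy * height + ay) for dx, dy in offsets]
```

Because the offsets are bounded by the image size, a regression loss on them varies far less from sample to sample than a loss on raw pixel coordinates.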
2303.06994
Report |
Synthesizing Realistic Image Restoration Training Pairs: A Diffusion Approach |
Tao Yang, Peiran Ren, Xuansong Xie, Lei Zhang |
In supervised image restoration tasks, one key issue is how to obtain the
aligned high-quality (HQ) and low-quality (LQ) training image pairs.
Unfortunately, such HQ-LQ training pairs are hard to capture in practice, and
hard to synthesize due to the complex unknown degradation in the wild. While
several sophisticated degradation models have been manually designed to
synthesize LQ images from their HQ counterparts, the distribution gap between
the synthesized and real-world LQ images remains large. We propose a new
approach to synthesizing realistic image restoration training pairs using the
emerging denoising diffusion probabilistic model (DDPM).
First, we train a DDPM, which could convert a noisy input into the desired LQ
image, with a large amount of collected LQ images, which define the target data
distribution. Then, for a given HQ image, we synthesize an initial LQ image by
using an off-the-shelf degradation model, and iteratively add proper Gaussian
noises to it. Finally, we denoise the noisy LQ image using the pre-trained DDPM
to obtain the final LQ image, which falls into the target distribution of
real-world LQ images. Thanks to the strong capability of DDPM in distribution
approximation, the synthesized HQ-LQ image pairs can be used to train robust
models for real-world image restoration tasks, such as blind face image
restoration and blind image super-resolution. Experiments demonstrated the
superiority of our proposed approach to existing degradation models. Code and
data will be released. |
This paper presents a novel diffusion-based approach for synthesizing realistic image restoration training pairs, aiming to bridge the distribution gap between synthetic and real-world low-quality images. |
Acquiring aligned high-quality and low-quality image pairs for training supervised image restoration models is challenging. Existing degradation models often fail to capture the complexities of real-world degradations, leading to limited performance on real images. This diffusion-based approach generates more realistic training pairs, potentially improving the robustness of trained models. |
The proposed method first trains a denoising diffusion probabilistic model (DDPM) on a large dataset of real-world low-quality images. To synthesize a training pair, an initial low-quality image is generated from a high-quality image using an off-the-shelf degradation model. Gaussian noise is then iteratively added to this initial image, and the noisy result is denoised with the pre-trained DDPM, which pulls it towards the target distribution of real-world low-quality images. |
Synthesized image pairs using the proposed method achieve lower FID and higher PSNR/SSIM compared to pairs generated using only handcrafted degradation models, indicating closer distribution to real-world data and better structural preservation.
Blind face restoration models trained on the proposed pairs demonstrate superior performance on both synthetic and real-world images, achieving better FID/LPIPS/PSNR/SSIM and improved visual quality with finer details, as evidenced by quantitative metrics and user study.
Blind image super-resolution models trained on the proposed pairs exhibit enhanced ability to handle complex real-world degradations, generating higher quality reconstructions with fewer artifacts and better detail preservation, outperforming existing methods in FID/LPIPS/PSNR/SSIM and user preference. |
The quality of synthesized pairs is influenced by the initial handcrafted degradation model and the number of diffusion steps, requiring careful parameter selection.
Collecting a diverse and representative LQ image dataset is crucial for training an effective DDPM, which can be time-consuming and laborious. Future work could explore unsupervised or semi-supervised methods to alleviate this reliance on large labeled datasets. |
image restoration, denoising diffusion probabilistic model (ddpm), degradation modeling, blind face restoration, blind image super-resolution |
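The noise-injection step in this report can be illustrated with the standard DDPM forward process, where adding Gaussian noise for t steps has the closed form x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps. The sketch below covers only that step. The linear beta schedule and the function names are assumptions for illustration, and the subsequent denoising with the trained DDPM is omitted because it requires the learned network.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed, commonly used schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bar(betas, t):
    """Cumulative product of (1 - beta) over the first t steps."""
    prod = 1.0
    for beta in betas[:t]:
        prod *= 1.0 - beta
    return prod


def diffuse(x0, t, betas, rng):
    """Sample x_t from x_0 in one shot instead of t sequential noise steps."""
    a = alpha_bar(betas, t)
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]
```

A hypothetical call such as `diffuse(initial_lq_pixels, 400, linear_betas(1000), random.Random(0))` would yield the noisy low-quality image that the pre-trained DDPM then denoises; the step count 400 is an arbitrary example value.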
2303.06930
Report |
Twin Contrastive Learning with Noisy Labels |
Zhizhong Huang, Junping Zhang, Hongming Shan |
Learning from noisy data is a challenging task that significantly degrades
the model performance. In this paper, we present TCL, a novel twin contrastive
learning model to learn robust representations and handle noisy labels for
classification. Specifically, we construct a Gaussian mixture model (GMM) over
the representations by injecting the supervised model predictions into GMM to
link label-free latent variables in GMM with label-noisy annotations. Then, TCL
detects the examples with wrong labels as the out-of-distribution examples by
another two-component GMM, taking into account the data distribution. We
further propose a cross-supervision with an entropy regularization loss that
bootstraps the true targets from model predictions to handle the noisy labels.
As a result, TCL can learn discriminative representations aligned with
estimated labels through mixup and contrastive learning. Extensive experimental
results on several standard benchmarks and real-world datasets demonstrate the
superior performance of TCL. In particular, TCL achieves 7.5% improvements on
CIFAR-10 with 90% noisy label -- an extremely noisy scenario. The source code
is available at https://github.com/Hzzone/TCL. |
This paper proposes TCL, a twin contrastive learning model, to learn robust representations and handle noisy labels for image classification. |
Learning with noisy labels is a crucial problem as mislabeled data is prevalent and can significantly degrade model performance. |
TCL leverages a Gaussian mixture model (GMM) over contrastive learning representations, linking label-free latent variables with noisy annotations. It then detects mislabeled samples as out-of-distribution examples using another two-component GMM and utilizes cross-supervision with entropy regularization to estimate true labels. |
TCL demonstrates superior performance on CIFAR-10/100 with various noise ratios, especially achieving 7.5% improvement on CIFAR-10 with 90% noise.
The proposed out-of-distribution label noise detection method proves effective in handling extremely noisy scenarios.
TCL outperforms state-of-the-art methods on real-world noisy datasets like WebVision and Clothing1M. |
The assumption of uniform label distribution might not hold for all datasets.
Future work includes incorporating semantic information for low noise ratios and exploring dynamic GMM updates. |
noisy labels, contrastive learning, out-of-distribution detection, cross-supervision, robust representation learning |
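The report says TCL flags mislabeled samples with a two-component GMM. The sketch below fits such a mixture with EM on a scalar per-sample score and returns the posterior of the higher-mean component. What the score is, and that a higher score means "more likely mislabeled", are assumptions made here for illustration; the function name `noisy_label_posteriors` is hypothetical and this is not the authors' code.

```python
import math


def _normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def noisy_label_posteriors(scores, num_iters=50, min_var=1e-6):
    """Fit a two-component 1-D GMM with EM; return P(high-mean component | x)."""
    if not scores:
        raise ValueError("scores must be non-empty")
    n = len(scores)
    lo, hi = min(scores), max(scores)
    means = [lo, hi]  # component 0 = "clean", component 1 = "noisy" (assumed)
    spread = max((hi - lo) ** 2 / 4.0, min_var)
    variances = [spread, spread]
    weights = [0.5, 0.5]
    post = [0.5] * n
    for _ in range(num_iters):
        # E-step: responsibility of the high-mean component for each sample.
        post = []
        for x in scores:
            p0 = weights[0] * _normal_pdf(x, means[0], variances[0])
            p1 = weights[1] * _normal_pdf(x, means[1], variances[1])
            total = p0 + p1
            post.append(p1 / total if total > 0.0 else 0.5)
        # M-step: re-estimate weights, means, and variances.
        n1 = sum(post)
        n0 = n - n1
        if n0 < 1e-12 or n1 < 1e-12:
            break
        means = [
            sum((1.0 - r) * x for r, x in zip(post, scores)) / n0,
            sum(r * x for r, x in zip(post, scores)) / n1,
        ]
        variances = [
            max(sum((1.0 - r) * (x - means[0]) ** 2 for r, x in zip(post, scores)) / n0, min_var),
            max(sum(r * (x - means[1]) ** 2 for r, x in zip(post, scores)) / n1, min_var),
        ]
        weights = [n0 / n, n1 / n]
    return post
```

Samples whose posterior exceeds a chosen threshold (0.5 is a natural default) would be treated as mislabeled and handed to the label-estimation step.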
2303.06919
Report |
NeRFLiX: High-Quality Neural View Synthesis by Learning a Degradation-Driven Inter-viewpoint MiXer |
Kun Zhou, Wenbo Li, Yi Wang, Tao Hu, Nianjuan Jiang, Xiaoguang Han, Jiangbo Lu |
Neural radiance fields (NeRF) show great success in novel view synthesis.
However, in real-world scenes, recovering high-quality details from the source
images is still challenging for the existing NeRF-based approaches, due to the
potential imperfect calibration information and scene representation
inaccuracy. Even with high-quality training frames, the synthetic novel views
produced by NeRF models still suffer from notable rendering artifacts, such as
noise, blur, etc. To improve the synthesis quality of NeRF-based
approaches, we propose NeRFLiX, a general NeRF-agnostic restorer paradigm by
learning a degradation-driven inter-viewpoint mixer. Specifically, we design a
NeRF-style degradation modeling approach and construct large-scale training
data, enabling the possibility of effectively removing NeRF-native rendering
artifacts for existing deep neural networks. Moreover, beyond the degradation
removal, we propose an inter-viewpoint aggregation framework that is able to
fuse highly related high-quality training images, pushing the performance of
cutting-edge NeRF models to entirely new levels and producing highly
photo-realistic synthetic views. |
This paper proposes NeRFLiX, a general-purpose NeRF-agnostic restoration method for improving the quality of neural view synthesis by learning a degradation-driven inter-viewpoint mixer. |
Existing NeRF models often produce synthetic views with notable artifacts due to imperfect camera calibration, scene representation inaccuracy, and other limitations. NeRFLiX addresses this issue by learning to remove these artifacts and enhance the quality of NeRF-rendered images. |
The authors introduce a NeRF-style degradation simulator (NDS) to generate a large-scale paired dataset of degraded and high-quality views. This dataset is used to train an inter-viewpoint mixer (IVM) that learns to restore a high-quality view by aggregating information from multiple neighboring high-quality reference views. A view selection strategy is also proposed to efficiently choose the most relevant reference views. |
NeRFLiX consistently improves the performance of various state-of-the-art NeRF models on different datasets, including LLFF, Tanks and Temples, and Noisy LLFF Synthetic.
The proposed NDS effectively simulates NeRF-style degradations, outperforming existing image degradation methods.
NeRFLiX enables training acceleration for NeRF models, achieving better results with reduced training time. |
The proposed NDS is one of many possible solutions for NeRF degradation simulation and can be further explored.
Exploring real-time inter-viewpoint mixers would be beneficial for practical applications. |
neural radiance fields (nerf), novel view synthesis, image restoration, degradation simulation, inter-viewpoint aggregation |
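The report mentions a view selection strategy that picks the most relevant reference views for the inter-viewpoint mixer, but it does not say how relevance is measured. The sketch below assumes, purely for illustration, that relevance is the Euclidean distance between camera positions; the function name `select_reference_views` is hypothetical.

```python
import math


def select_reference_views(target_pos, ref_positions, k=2):
    """Return indices of the k reference cameras closest to the target camera.

    Assumption: closeness of camera centers stands in for view relevance.
    """
    ranked = sorted(
        range(len(ref_positions)),
        key=lambda i: math.dist(target_pos, ref_positions[i]),
    )
    return ranked[:k]
```

A fuller criterion would likely also account for viewing direction, since two cameras can be close together yet look at different parts of the scene.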
2303.06885
Report |
DR2: Diffusion-based Robust Degradation Remover for Blind Face Restoration |
Zhixin Wang, Xiaoyun Zhang, Ziying Zhang, Huangjie Zheng, Mingyuan Zhou, Ya Zhang, Yanfeng Wang |
Blind face restoration usually synthesizes degraded low-quality data with a
pre-defined degradation model for training, while more complex cases could
happen in the real world. This gap between the assumed and actual degradation
hurts the restoration performance where artifacts are often observed in the
output. However, it is expensive and infeasible to include every type of
degradation to cover real-world cases in the training data. To tackle this
robustness issue, we propose Diffusion-based Robust Degradation Remover (DR2)
to first transform the degraded image to a coarse but degradation-invariant
prediction, then employ an enhancement module to restore the coarse prediction
to a high-quality image. By leveraging a well-performing denoising diffusion
probabilistic model, our DR2 diffuses input images to a noisy status where
various types of degradation give way to Gaussian noise, and then captures
semantic information through iterative denoising steps. As a result, DR2 is
robust against common degradation (e.g. blur, resize, noise and compression)
and compatible with different designs of enhancement modules. Experiments in
various settings show that our framework outperforms state-of-the-art methods
on heavily degraded synthetic and real-world datasets. |
This paper introduces DR2E, a two-stage blind face restoration framework that first removes degradation from inputs using a pre-trained diffusion model and then enhances the coarse output for high-quality restoration. |
Existing blind face restoration methods struggle with real-world degraded images due to the reliance on pre-defined degradation models during training, leading to artifacts in the output. |
DR2E consists of a Diffusion-based Robust Degradation Remover (DR2) and an Enhancement module. DR2 leverages a pre-trained denoising diffusion probabilistic model (DDPM) to transform degraded images into coarse but degradation-invariant predictions by diffusing them into a noisy status where degradation becomes similar to Gaussian noise. The Enhancement module then refines the coarse prediction into a high-quality image. |
DR2E demonstrates robustness against various degradation types like blur, resize, noise, and compression.
The framework is flexible and compatible with different Enhancement module designs, allowing for incorporating various restoration methods.
Experiments on synthetic and real-world datasets show that DR2E outperforms state-of-the-art methods, particularly on heavily degraded images. |
The sampling process of DR2, relying on a DDPM, can be slow.
Choosing the optimal controlling parameters for DR2 currently requires manual tuning. |
blind face restoration, denoising diffusion probabilistic model, degradation removal, robust image restoration, deep learning |
2303.06880
Report |
Uni3D: A Unified Baseline for Multi-dataset 3D Object Detection |
Bo Zhang, Jiakang Yuan, Botian Shi, Tao Chen, Yikang Li, Yu Qiao |
Current 3D object detection models follow a single dataset-specific training
and testing paradigm, which often faces a serious detection accuracy drop when
they are directly deployed in another dataset. In this paper, we study the task
of training a unified 3D detector from multiple datasets. We observe that this
appears to be a challenging task, mainly because these datasets
present substantial data-level differences and taxonomy-level variations caused
by different LiDAR types and data acquisition standards. Inspired by this
observation, we present Uni3D, which leverages a simple data-level correction
operation and a designed semantic-level coupling-and-recoupling module to
alleviate the unavoidable data-level and taxonomy-level differences,
respectively. Our method is simple and easily combined with many 3D object
detection baselines such as PV-RCNN and Voxel-RCNN, enabling them to
effectively learn from multiple off-the-shelf 3D datasets to obtain more
discriminative and generalizable representations. Experiments are conducted on
many dataset consolidation settings including Waymo-nuScenes, nuScenes-KITTI,
Waymo-KITTI, and Waymo-nuScenes-KITTI consolidations. Their results demonstrate
that Uni3D exceeds a series of individual detectors trained on a single
dataset, with a 1.04x parameter increase over a selected baseline detector. We
expect this work will inspire the research of 3D generalization since it will
push the limits of perceptual performance. |
This paper proposes Uni3D, a unified 3D object detection framework trained on multiple datasets to address the accuracy drop when single-dataset models are tested on different datasets (dataset-interference issue). |
Current 3D object detection models are trained and evaluated on single datasets, leading to significant accuracy drops when deployed on datasets with different distributions, hindering generalization. |
Uni3D uses a data-level correction operation to normalize features based on dataset-specific mean and variance. It also employs a semantic-level coupling-and-recoupling module to learn dataset-agnostic features using spatial-wise and dataset-level attention. Finally, it uses dataset-specific detection heads for prediction. |
Uni3D significantly improves cross-dataset detection accuracy compared to single-dataset training or pre-training.
The data-level correction and semantic-level coupling-and-recoupling modules are shown to be effective in addressing data-level and taxonomy-level differences between datasets.
Uni3D enhances the zero-shot learning ability of the baseline detector, making it more robust to unseen scenes. |
The parameter sharing of coordinate-origin shift across different classes may be suboptimal and needs further exploration.
The BEV feature copy method, while ensuring training-and-testing consistency, is not the optimal solution for addressing the inconsistency between multi-dataset training and single-dataset inference. Further research is needed to explore better fusion strategies for BEV features from different datasets. |
3d object detection, multi-dataset learning, domain generalization, lidar point cloud, autonomous driving |
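The data-level correction in this report normalizes features with dataset-specific mean and variance. The sketch below shows that idea on scalar features grouped by dataset. The function names, the dictionary layout, and the use of batch statistics computed on the fly are assumptions for illustration, not the authors' implementation.

```python
import math


def dataset_stats(features):
    """Mean and (population) variance of one dataset's feature values."""
    n = len(features)
    mean = sum(features) / n
    var = sum((f - mean) ** 2 for f in features) / n
    return mean, var


def normalize_per_dataset(batches, eps=1e-5):
    """Normalize each dataset's features with its own statistics.

    batches maps a dataset name to a non-empty list of scalar features.
    """
    out = {}
    for name, feats in batches.items():
        mean, var = dataset_stats(feats)
        out[name] = [(f - mean) / math.sqrt(var + eps) for f in feats]
    return out
```

After this step, features from sensors with very different value ranges land on a comparable scale, so the shared layers that follow see more consistent inputs.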
2303.06840
Report |
DDFM: Denoising Diffusion Model for Multi-Modality Image Fusion |
Zixiang Zhao, Haowen Bai, Yuanzhi Zhu, Jiangshe Zhang, Shuang Xu, Yulun Zhang, Kai Zhang, Deyu Meng, Radu Timofte, Luc Van Gool |
Multi-modality image fusion aims to combine different modalities to produce
fused images that retain the complementary features of each modality, such as
functional highlights and texture details. To leverage strong generative priors
and address challenges such as unstable training and lack of interpretability
for GAN-based generative methods, we propose a novel fusion algorithm based on
the denoising diffusion probabilistic model (DDPM). The fusion task is
formulated as a conditional generation problem under the DDPM sampling
framework, which is further divided into an unconditional generation subproblem
and a maximum likelihood subproblem. The latter is modeled in a hierarchical
Bayesian manner with latent variables and inferred by the
expectation-maximization (EM) algorithm. By integrating the inference solution
into the diffusion sampling iteration, our method can generate high-quality
fused images with natural image generative priors and cross-modality
information from source images. Note that all we require is an unconditional
pre-trained generative model, and no fine-tuning is needed. Our extensive
experiments indicate that our approach yields promising fusion results in
infrared-visible image fusion and medical image fusion. The code is available
at https://github.com/Zhaozixiang1228/MMIF-DDFM. |
This paper proposes DDFM, a novel multi-modality image fusion algorithm based on denoising diffusion probabilistic models (DDPM). |
Existing GAN-based image fusion methods suffer from unstable training and lack interpretability. DDFM leverages the strong generative priors of DDPM for high-quality fusion while addressing the limitations of GAN-based methods. |
DDFM formulates image fusion as a conditional generation problem within the DDPM sampling framework. It decomposes the problem into an unconditional generation part handled by a pre-trained DDPM and a likelihood rectification part. The latter utilizes a hierarchical Bayesian model with latent variables and is inferred by the Expectation-Maximization (EM) algorithm. The solution is then integrated into the DDPM loop for conditional image generation. |
DDFM effectively preserves structural and detail information from source images in both infrared-visible and medical image fusion tasks.
DDFM consistently outperforms state-of-the-art methods on various datasets based on quantitative metrics including EN, SD, MI, VIF, Qabf, and SSIM.
Ablation studies validate the contribution of individual components in DDFM, including the DDPM module and EM module. |
The current implementation of DDFM relies on a pre-trained DDPM, which might limit its performance on specific datasets.
Future work could explore incorporating task-specific information during training for improved fusion results. |
image fusion, denoising diffusion probabilistic model, generative model, multi-modality, likelihood rectification |
2303.06705
Report |
Retinexformer: One-stage Retinex-based Transformer for Low-light Image Enhancement |
Yuanhao Cai, Hao Bian, Jing Lin, Haoqian Wang, Radu Timofte, Yulun Zhang |
When enhancing low-light images, many deep learning algorithms are based on
the Retinex theory. However, the Retinex model does not consider the
corruptions hidden in the dark or introduced by the light-up process. Besides,
these methods usually require a tedious multi-stage training pipeline and rely
on convolutional neural networks, showing limitations in capturing long-range
dependencies. In this paper, we formulate a simple yet principled One-stage
Retinex-based Framework (ORF). ORF first estimates the illumination information
to light up the low-light image and then restores the corruption to produce the
enhanced image. We design an Illumination-Guided Transformer (IGT) that
utilizes illumination representations to direct the modeling of non-local
interactions of regions with different lighting conditions. By plugging IGT
into ORF, we obtain our algorithm, Retinexformer. Comprehensive quantitative
and qualitative experiments demonstrate that our Retinexformer significantly
outperforms state-of-the-art methods on thirteen benchmarks. The user study and
application on low-light object detection also reveal the latent practical
values of our method. Code, models, and results are available at
https://github.com/caiyuanhao1998/Retinexformer |
This paper proposes Retinexformer, the first Transformer-based algorithm for low-light image enhancement. |
Many deep learning methods for low-light image enhancement rely on the Retinex theory but ignore corruptions or require multi-stage training. Existing methods also struggle to capture long-range dependencies. |
This work formulates the One-stage Retinex-based Framework (ORF) and designs an Illumination-Guided Transformer (IGT). ORF estimates illumination and restores corruption in a single stage. IGT, plugged into ORF as the corruption restorer, uses illumination representations to guide long-range dependency modeling. |
Retinexformer significantly outperforms state-of-the-art methods on thirteen benchmarks, achieving up to 6 dB improvement on the SID and SDSD datasets.
User study confirms Retinexformer's superior visual quality compared to competing algorithms.
Retinexformer effectively preprocesses low-light images for object detection, improving average precision by 0.8 AP compared to the best fully supervised method. |
The model's performance on extremely dark images could be further improved.
Exploring more efficient self-attention mechanisms for greater computational efficiency is a promising direction for future work. |
low-light image enhancement, retinex theory, transformer, illumination-guided attention, one-stage framework |
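The first stage of the framework in this report estimates illumination and uses it to light up the dark image. Under Retinex theory an image is the element-wise product of reflectance and illumination, so dividing by an illumination estimate brightens the image. The sketch below uses the per-pixel maximum channel as a crude illumination estimate. That estimator and the function name `light_up` are assumptions for illustration; the paper learns its estimator, and the corruption-restoration stage is omitted here.

```python
def light_up(pixels, eps=1e-3):
    """Brighten RGB pixels in [0, 1] by dividing out an illumination estimate.

    Assumption: illumination is approximated by the maximum channel value.
    """
    out = []
    for r, g, b in pixels:
        illum = max(r, g, b, eps)  # eps avoids division by zero on black pixels
        scale = 1.0 / illum
        out.append((min(r * scale, 1.0), min(g * scale, 1.0), min(b * scale, 1.0)))
    return out
```

Lighting up in this way also amplifies any noise hidden in the dark regions, which is why the framework follows it with a restoration stage.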
2303.06678
Report |
PointPatchMix: Point Cloud Mixing with Patch Scoring |
Yi Wang, Jiaze Wang, Jinpeng Li, Zixu Zhao, Guangyong Chen, Anfeng Liu, Pheng-Ann Heng |
Data augmentation is an effective regularization strategy for mitigating
overfitting in deep neural networks, and it plays a crucial role in 3D vision
tasks, where the point cloud data is relatively limited. While mixing-based
augmentation has shown promise for point clouds, previous methods mix point
clouds either on block level or point level, which has constrained their
ability to strike a balance between generating diverse training samples and
preserving the local characteristics of point clouds. Additionally, the varying
importance of each part of the point clouds has not been fully considered,
because not all parts contribute equally to the classification task, and some
parts may contain unimportant or redundant information. To overcome these
challenges, we propose PointPatchMix, a novel approach that mixes point clouds
at the patch level and integrates a patch scoring module to generate
content-based targets for mixed point clouds. Our approach preserves local
features at the patch level, while the patch scoring module assigns targets
based on the content-based significance score from a pre-trained teacher model.
We evaluate PointPatchMix on two benchmark datasets, ModelNet40 and
ScanObjectNN, and demonstrate significant improvements over various baselines
in both synthetic and real-world datasets, as well as few-shot settings. With
Point-MAE as our baseline, our model surpasses previous methods by a
significant margin, achieving 86.3% accuracy on ScanObjectNN and 94.1% accuracy
on ModelNet40. Furthermore, our approach shows strong generalization across
multiple architectures and enhances the robustness of the baseline model. |
This paper proposes PointPatchMix, a novel point cloud data augmentation method based on patch-level mixing and content-based target generation using a pre-trained teacher model. |
Data augmentation is crucial for point cloud processing due to limited data availability, and existing methods struggle to balance diversity and local feature preservation while assigning accurate targets to mixed point clouds. |
PointPatchMix divides point clouds into patches, mixes them using an optimal assignment algorithm based on Earth Mover's Distance (EMD), and assigns content-based targets using patch significance scores derived from a pre-trained teacher model. |
PointPatchMix significantly outperforms state-of-the-art methods on both synthetic (ModelNet40) and real-world (ScanObjectNN) point cloud classification datasets.
The method demonstrates strong generalization ability across various network architectures (PointNet, PointNet++, Transformer) and improves performance in few-shot learning settings.
Ablation studies confirm the effectiveness of patch-level mixing, content-based target generation, and the choice of optimal patch assignment strategy. |
The current study primarily focuses on point cloud classification, and future work could explore its application to other domains like segmentation.
Investigating the computational cost and efficiency of PointPatchMix, particularly in resource-constrained environments, could be beneficial. |
point cloud, data augmentation, pointpatchmix, classification, transformer |
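The report describes assigning content-based targets to a mixed point cloud from patch significance scores. The sketch below builds a soft label whose class weights are proportional to the summed scores of the patches kept from each source cloud. How the scores are combined, and the function name `mixed_target`, are assumptions for illustration; the scores themselves would come from the pre-trained teacher model.

```python
def mixed_target(scores_a, scores_b, take_from_b, label_a, label_b, num_classes):
    """Soft label for a cloud mixed patch-by-patch from clouds A and B.

    take_from_b[i] is True when patch slot i is filled from cloud B.
    Assumption: each class weight is the share of total significance it keeps.
    """
    if not (len(scores_a) == len(scores_b) == len(take_from_b)):
        raise ValueError("all inputs must have one entry per patch slot")
    kept_a = sum(s for s, from_b in zip(scores_a, take_from_b) if not from_b)
    kept_b = sum(s for s, from_b in zip(scores_b, take_from_b) if from_b)
    total = kept_a + kept_b
    if total <= 0.0:
        raise ValueError("kept patches must have positive total significance")
    target = [0.0] * num_classes
    target[label_a] += kept_a / total
    target[label_b] += kept_b / total
    return target
```

Compared with weighting labels by the fraction of points taken from each cloud, this lets one highly significant patch outweigh several uninformative ones.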
2303.06628
Report |
Preventing Zero-Shot Transfer Degradation in Continual Learning of Vision-Language Models |
Zangwei Zheng, Mingyuan Ma, Kai Wang, Ziheng Qin, Xiangyu Yue, Yang You |
Continual learning (CL) can help pre-trained vision-language models
efficiently adapt to new or under-trained data distributions without
re-training. Nevertheless, during the continual training of the Contrastive
Language-Image Pre-training (CLIP) model, we observe that the model's zero-shot
transfer ability significantly degrades due to catastrophic forgetting.
Existing CL methods can mitigate forgetting by replaying previous data.
However, since the CLIP dataset is private, replay methods cannot access the
pre-training dataset. In addition, replaying data of previously learned
downstream tasks can enhance their performance but comes at the cost of
sacrificing zero-shot performance. To address this challenge, we propose a
novel method ZSCL to prevent zero-shot transfer degradation in the continual
learning of vision-language models in both feature and parameter space. In the
feature space, a reference dataset is introduced for distillation between the
current and initial models. The reference dataset should be semantically
diverse but need not be labeled, seen in pre-training, or consist of matched
image-text pairs. In parameter space, we prevent a large parameter shift by
averaging weights during the training. We propose a more challenging
Multi-domain Task Incremental Learning (MTIL) benchmark to evaluate different
methods, where tasks are from various domains instead of class-separated in a
single dataset. Our method outperforms other methods in the traditional
class-incremental learning setting and the MTIL by 9.7% average score. Our code
is available at https://github.com/Thunderbeee/ZSCL. |
This paper investigates and addresses the problem of zero-shot transfer degradation in Continual Learning (CL) of vision-language models, particularly the Contrastive Language-Image Pre-training (CLIP) model. |
Continual learning is crucial for efficiently adapting pre-trained vision-language models to new data distributions without costly retraining. However, current methods suffer from catastrophic forgetting, leading to degraded zero-shot transfer ability. |
The paper introduces ZSCL, a novel method that prevents zero-shot transfer degradation in both feature and parameter space. It employs distillation with a reference dataset for feature space preservation and weight ensemble during training for parameter space regularization. |
ZSCL effectively prevents zero-shot transfer degradation in CLIP, maintaining high performance on both previously learned and new tasks.
The use of a reference dataset with diverse semantics for distillation proves crucial for preserving the feature space learned during pre-training.
ZSCL consistently outperforms existing CL methods on both conventional class-incremental learning benchmarks and the proposed Multi-domain Task Incremental Learning (MTIL) benchmark. |
ZSCL currently relies on a reference dataset, and future work could explore methods to remove this dependency.
The authors plan to expand ZSCL for next-token prediction tasks in multi-modality models utilizing large language models. |
continual learning, vision-language models, zero-shot transfer, catastrophic forgetting, clip |
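The two ZSCL ingredients described above (distillation against the initial model in feature space, weight averaging in parameter space) can be illustrated with a minimal pure-Python sketch. The function and class names, the cross-entropy form of the distillation loss, and the uniform running average are assumptions made for illustration; the record above does not specify the exact loss or averaging schedule used by the authors.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of similarity scores."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Cross-entropy between the initial model's (teacher) and the current
    model's (student) similarity distributions on one reference sample.
    Assumed form; it is smallest when the student matches the teacher."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

class WeightAverager:
    """Uniform running average of parameter snapshots taken during training
    (hypothetical helper; parameters are a flat list of floats here)."""

    def __init__(self, weights):
        self.avg = list(weights)
        self.count = 1

    def update(self, weights):
        self.count += 1
        self.avg = [a + (w - a) / self.count for a, w in zip(self.avg, weights)]
        return self.avg
```

In a training loop, `distillation_loss` would be added to the task loss for each reference sample, and `WeightAverager.update` would be called every few steps so that the averaged weights, rather than the latest ones, are kept.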
2303.06547
Report |
Towards Universal Vision-language Omni-supervised Segmentation |
Bowen Dong, Jiaxi Gu, Jianhua Han, Hang Xu, Wangmeng Zuo |
Existing open-world universal segmentation approaches usually leverage CLIP
and pre-computed proposal masks to treat open-world segmentation tasks as
proposal classification. However, 1) these works cannot handle universal
segmentation in an end-to-end manner, and 2) the limited scale of panoptic
datasets restricts the open-world segmentation ability on things classes. In
this paper, we present Vision-Language Omni-Supervised Segmentation (VLOSS).
VLOSS starts from a Mask2Former universal segmentation framework with CLIP text
encoder. To improve the open-world segmentation ability, we leverage
omni-supervised data (i.e., panoptic segmentation data, object detection data,
and image-text pairs data) into training, thus enriching the open-world
segmentation ability and achieving better segmentation accuracy. To better
improve the training efficiency and fully release the power of omni-supervised
data, we propose several advanced techniques, i.e., FPN-style encoder,
switchable training technique, and positive classification loss. Benefiting
from the end-to-end training manner with proposed techniques, VLOSS can be
applied to various open-world segmentation tasks without further adaptation.
Experimental results on different open-world panoptic and instance segmentation
benchmarks demonstrate the effectiveness of VLOSS. Notably, with fewer
parameters, our VLOSS with Swin-Tiny backbone surpasses MaskCLIP by ~2% in
terms of mask AP on LVIS v1 dataset. |
This paper proposes VLOSS, a universal open-world segmentation framework that leverages omni-supervised data (panoptic, detection, image-text pairs) and CLIP for enhanced recognition. |
Existing open-world segmentation methods cannot handle universal segmentation in an end-to-end manner, and the limited scale of panoptic datasets restricts their open-world recognition ability. |
VLOSS utilizes a Mask2Former base with CLIP, trained on a mix of panoptic, detection, and image-text data. It introduces a FPN-style encoder, switchable training technique, and positive classification loss to improve training efficiency and leverage diverse annotations. |
VLOSS achieves comparable results to state-of-the-art MaskCLIP on ADE20K panoptic segmentation with fewer parameters.
On LVIS v1, VLOSS with Swin-Tiny backbone surpasses MaskCLIP by ~2% in mask AP.
Qualitative results showcase VLOSS's capability to segment and recognize both seen and unseen things and stuff classes. |
The current method underutilizes regions not present in annotations for weakly-supervised datasets.
The work doesn't incorporate visual grounding datasets, which could further improve open-world recognition. |
open-world segmentation, universal segmentation, vision-language models, omni-supervised learning, clip |
2303.06464
Report |
PARASOL: Parametric Style Control for Diffusion Image Synthesis |
Gemma Canet Tarrés, Dan Ruta, Tu Bui, John Collomosse |
We propose PARASOL, a multi-modal synthesis model that enables disentangled,
parametric control of the visual style of the image by jointly conditioning
synthesis on both content and a fine-grained visual style embedding. We train a
latent diffusion model (LDM) using specific losses for each modality and adapt
the classifier-free guidance for encouraging disentangled control over
independent content and style modalities at inference time. We leverage
auxiliary semantic and style-based search to create training triplets for
supervision of the LDM, ensuring complementarity of content and style cues.
PARASOL shows promise for enabling nuanced control over visual style in
diffusion models for image creation and stylization, as well as generative
search where text-based search results may be adapted to more closely match
user intent by interpolating both content and style descriptors. |
PARASOL is a multi-modal synthesis model that enables disentangled, parametric control of visual style in images, jointly conditioning synthesis on both content and fine-grained visual style embedding. |
Current deep generative models lack fine-grained control over visual style: they are often limited to coarse-grained inputs such as text descriptions, or they struggle to disentangle content from style information. |
PARASOL leverages a latent diffusion model (LDM) trained with specific losses for content and style, employing classifier-free guidance for disentangled control at inference. It uses auxiliary semantic and style-based search to create training triplets, ensuring complementarity of content and style cues. |
PARASOL achieves superior performance in transferring specific styles compared to existing multi-modal and style transfer models.
The model enables fine-grained control over the degree of style transfer and content preservation via parameters like 'lambda' for inversion and 'g_s', 'g_y' for classifier-free guidance.
PARASOL supports style and content interpolation, enabling the creation of novel images by combining different styles and semantic concepts. |
Challenges remain in disentangling style from content for certain ambiguous styles.
Addressing challenging content like faces often requires additional specialized training. |
image synthesis, style control, diffusion models, multi-modal learning, generative search |
2303.06424
Report |
Regularized Vector Quantization for Tokenized Image Synthesis |
Jiahui Zhang, Fangneng Zhan, Christian Theobalt, Shijian Lu |
Quantizing images into discrete representations has been a fundamental
problem in unified generative modeling. Predominant approaches learn the
discrete representation either in a deterministic manner by selecting the
best-matching token or in a stochastic manner by sampling from a predicted
distribution. However, deterministic quantization suffers from severe codebook
collapse and misalignment with the inference stage, while stochastic quantization
suffers from low codebook utilization and a perturbed reconstruction objective.
This paper presents a regularized vector quantization framework that
effectively mitigates the above issues by applying regularization from two
perspectives. The first is a prior distribution regularization which measures
the discrepancy between a prior token distribution and the predicted token
distribution to avoid codebook collapse and low codebook utilization. The
second is a stochastic mask regularization that introduces stochasticity during
quantization to strike a good balance between inference stage misalignment and
unperturbed reconstruction objective. In addition, we design a probabilistic
contrastive loss which serves as a calibrated metric to further mitigate the
perturbed reconstruction objective. Extensive experiments show that the
proposed quantization framework outperforms prevailing vector quantization
methods consistently across different generative models including
auto-regressive models and diffusion models. |
This paper presents a regularized vector quantization framework for tokenized image synthesis that addresses limitations of existing deterministic and stochastic quantization methods, such as codebook collapse, low codebook utilization, and perturbed reconstruction objectives. |
Quantizing images into discrete representations is crucial for unified generative modeling. Existing methods struggle to balance accurate representation learning, efficient codebook usage, and high-fidelity image generation. |
The proposed framework employs a prior distribution regularization to encourage full codebook utilization and prevent collapse. It also introduces a stochastic mask regularization to balance deterministic and stochastic quantization, mitigating misalignment during inference. Finally, a probabilistic contrastive loss is designed for elastic image reconstruction, adapting to the varying discrepancies caused by stochastic sampling. |
The regularized quantization consistently outperforms existing methods in image reconstruction and generation quality across diverse datasets and generative models (auto-regressive and diffusion).
Prior distribution regularization and stochastic mask regularization are shown to effectively mitigate codebook collapse and inference stage misalignment respectively.
The probabilistic contrastive loss improves image reconstruction and generation quality by enabling elastic image reconstruction and adapting to the perturbations caused by stochastic sampling. |
The current method employs the same learning objective for the encoder and decoder, which may not be optimal for both accurate representation and realistic image generation.
Exploring different prior distributions, such as Gaussian, for potential performance improvement. |
vector quantization, image synthesis, generative modeling, discrete representation learning, contrastive learning |
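The prior distribution regularization described above penalizes the discrepancy between a prior token distribution and the predicted token distribution. A minimal sketch follows, assuming a uniform prior over the codebook and a KL divergence as the discrepancy measure; both choices, and the function names, are illustrative assumptions rather than details confirmed by this record.

```python
import math

def mean_token_distribution(token_probs):
    """Average the per-position token probabilities over a batch.
    token_probs: list of rows, each a probability vector over the codebook."""
    n = len(token_probs)
    k = len(token_probs[0])
    return [sum(row[j] for row in token_probs) / n for j in range(k)]

def kl_to_uniform(dist):
    """KL(dist || uniform) over a codebook of size len(dist).
    Zero when every code is used equally; log(K) when one code takes all mass."""
    k = len(dist)
    return sum(p * math.log(p * k) for p in dist if p > 0)

# A collapsed codebook (every position picks code 0) is heavily penalized,
# while perfectly balanced usage incurs no penalty.
collapsed = mean_token_distribution([[1.0, 0.0, 0.0, 0.0]] * 2)
balanced = mean_token_distribution(
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
)
penalty_collapsed = kl_to_uniform(collapsed)
penalty_balanced = kl_to_uniform(balanced)
```

Adding such a term to the training loss pushes the average token usage toward the prior, which is how the regularizer discourages both codebook collapse and low utilization.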
2303.06373
Report |
Recursive Generalization Transformer for Image Super-Resolution |
Zheng Chen, Yulun Zhang, Jinjin Gu, Linghe Kong, Xiaokang Yang |
Transformer architectures have exhibited remarkable performance in image
super-resolution (SR). Because of the quadratic computational complexity of
self-attention (SA) in Transformers, existing methods tend to adopt SA in a
local region to reduce overheads. However, the local design restricts the
global context exploitation, which is crucial for accurate image
reconstruction. In this work, we propose the Recursive Generalization
Transformer (RGT) for image SR, which can capture global spatial information
and is suitable for high-resolution images. Specifically, we propose the
recursive-generalization self-attention (RG-SA). It recursively aggregates
input features into representative feature maps, and then utilizes
cross-attention to extract global information. Meanwhile, the channel
dimensions of attention matrices (query, key, and value) are further scaled to
mitigate the redundancy in the channel domain. Furthermore, we combine the
RG-SA with local self-attention to enhance the exploitation of the global
context, and propose the hybrid adaptive integration (HAI) for module
integration. The HAI allows the direct and effective fusion between features at
different levels (local or global). Extensive experiments demonstrate that our
RGT outperforms recent state-of-the-art methods quantitatively and
qualitatively. Code and pre-trained models are available at
https://github.com/zhengchen1999/RGT. |
This paper proposes the Recursive Generalization Transformer (RGT) for image super-resolution, which can effectively capture global spatial information with linear computational complexity, making it suitable for high-resolution images. |
Existing Transformer-based image SR methods rely on local attention mechanisms to reduce computational complexity, limiting their ability to capture global context crucial for accurate image reconstruction. This work addresses this limitation by enabling effective global information modeling with manageable complexity. |
The paper introduces the recursive-generalization self-attention (RG-SA) module. RG-SA first employs a recursive generalization module (RGM) to compress input features into representative feature maps. Then, it performs cross-attention between the input features and the representative maps to capture global dependencies. Additionally, it scales the channel dimensions of attention matrices to reduce redundancy and complexity. The RGT architecture combines RG-SA with local self-attention in an alternating arrangement, further enhancing global context utilization through the proposed hybrid adaptive integration (HAI) method. |
RGT quantitatively outperforms recent state-of-the-art image SR methods on benchmark datasets across different scaling factors.
RGT qualitatively surpasses other methods in handling challenging cases, reconstructing more image details and alleviating blurring artifacts.
RGT achieves a better trade-off between model complexity and performance compared to existing CNN-based and Transformer-based methods. |
The current design of RGM mainly utilizes depth-wise convolutions, which could be further explored for better feature aggregation.
Exploring the application of RGT in other low-level vision tasks beyond image super-resolution. |
image super-resolution, vision transformer, global attention, recursive generalization, hybrid adaptive integration |
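The core idea of RG-SA described above, compressing the input into a few representative features and then cross-attending from the full input to them, can be sketched in pure Python on a 1D token sequence. This is a simplified stand-in: average pooling replaces the learned recursive generalization module, and there are no learned projections or channel scaling. All names are hypothetical.

```python
import math

def avg_pool_tokens(tokens, stride):
    """Compress N tokens into ceil(N / stride) representative tokens
    by averaging consecutive groups (stand-in for the learned RGM)."""
    pooled = []
    for start in range(0, len(tokens), stride):
        group = tokens[start:start + stride]
        dim = len(group[0])
        pooled.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return pooled

def cross_attention(queries, keys, values):
    """Scaled dot-product attention from every query to every key/value."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([
            sum(w / z * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return out

# 8 input tokens attend to 2 representative tokens: 8 x 2 score entries
# instead of the 8 x 8 needed by full self-attention.
tokens = [[float(i), 1.0] for i in range(8)]
reps = avg_pool_tokens(tokens, stride=4)
out = cross_attention(tokens, reps, reps)
```

Because the number of representative tokens is much smaller than the number of input tokens, each input position still receives information aggregated from the whole sequence at a fraction of the cost of global self-attention.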
2303.06329
Report |
MetaViewer: Towards A Unified Multi-View Representation |
Ren Wang, Haoliang Sun, Yuling Ma, Xiaoming Xi, Yilong Yin |
Existing multi-view representation learning methods typically follow a
specific-to-uniform pipeline, extracting latent features from each view and
then fusing or aligning them to obtain the unified object representation.
However, manually pre-specified fusion functions and view-private redundant
information mixed into the features can degrade the quality of the derived
representation. To overcome these issues, we propose a novel
bi-level-optimization-based multi-view learning framework, where the
representation is learned in a uniform-to-specific manner. Specifically, we
train a meta-learner, namely MetaViewer, to learn fusion and model the
view-shared meta representation in outer-level optimization. Starting from this
meta representation, view-specific base-learners are then required to rapidly
reconstruct the corresponding view in inner-level optimization. MetaViewer eventually
updates by observing reconstruction processes from uniform to specific over all
views, and learns an optimal fusion scheme that separates and filters out
view-private information. Extensive experimental results in downstream tasks
such as classification and clustering demonstrate the effectiveness of our
method. |
Proposes MetaViewer, a novel bi-level optimization framework for multi-view representation learning that learns a unified representation in a uniform-to-specific manner. |
Addresses limitations of traditional specific-to-uniform multi-view learning methods that struggle with data-driven fusion and filtering view-private redundant information. |
MetaViewer uses a meta-learner to learn the fusion of view-shared information and base-learners to reconstruct individual views, effectively separating view-private information. |
MetaViewer outperforms state-of-the-art methods in clustering tasks on multiple benchmarks.
The learned unified representation achieves superior classification results, particularly in datasets with a large number of classes.
Ablation studies confirm the effectiveness of meta-learning fusion and the robustness to hyperparameter settings. |
The current implementation primarily focuses on reconstruction-based self-supervision.
Exploring alternative meta-learner architectures beyond convolutional layers could be beneficial. |
multi-view learning, representation learning, meta-learning, bi-level optimization, self-supervision |
2303.06285
Report |
DeltaEdit: Exploring Text-free Training for Text-Driven Image Manipulation |
Yueming Lyu, Tianwei Lin, Fu Li, Dongliang He, Jing Dong, Tieniu Tan |
Text-driven image manipulation remains challenging in training or inference
flexibility. Conditional generative models depend heavily on expensive
annotated training data. Meanwhile, recent frameworks, which leverage
pre-trained vision-language models, are limited by either per-text-prompt
optimization or inference-time hyper-parameter tuning. In this work, we
propose a novel framework named DeltaEdit to address these problems.
Our key idea is to investigate and identify a space, namely delta image and
text space that has well-aligned distribution between CLIP visual feature
differences of two images and CLIP textual embedding differences of source and
target texts. Based on the CLIP delta space, the DeltaEdit network is designed
to map CLIP visual feature differences to the editing directions of
StyleGAN in the training phase. Then, in the inference phase, DeltaEdit predicts
StyleGAN's editing directions from the differences of the CLIP textual
features. In this way, DeltaEdit is trained in a text-free manner. Once
trained, it can well generalize to various text prompts for zero-shot inference
without bells and whistles. Code is available at
https://github.com/Yueming6568/DeltaEdit. |
Proposes DeltaEdit, a novel framework for text-driven image manipulation that uses a text-free training paradigm, eliminating the need for expensive annotated text data during training. |
Addresses the limitations of previous text-driven image manipulation methods that suffer from training/inference inflexibility, poor generalization, and dependence on expensive annotated training data. |
Leverages the semantically aligned CLIP delta image-text feature space to train a Delta Mapper network. This network learns a mapping from image feature differences to StyleGAN's latent style space changes, enabling text-driven manipulation during inference by utilizing CLIP text embedding differences. |
Achieves state-of-the-art performance on various datasets (FFHQ, LSUN Cat, Church, Horse) with high-quality and disentangled editing results.
Generalizes well to unseen text prompts for zero-shot inference without requiring per-prompt optimization or hyper-parameter tuning.
Demonstrates superior efficiency compared to previous methods, with significantly reduced training and inference times. |
The quality of manipulation relies on the pre-trained StyleGAN and CLIP models.
Struggles with manipulating images containing attributes not well-represented in the training dataset. |
text-driven image manipulation, text-free training, clip, stylegan, zero-shot learning |
2303.05970
Report |
Exploring Recurrent Long-term Temporal Fusion for Multi-view 3D Perception |
Chunrui Han, Jinrong Yang, Jianjian Sun, Zheng Ge, Runpei Dong, Hongyu Zhou, Weixin Mao, Yuang Peng, Xiangyu Zhang |
Long-term temporal fusion is a crucial but often overlooked technique in
camera-based Bird's-Eye-View (BEV) 3D perception. Existing methods mostly
fuse frames in a parallel manner. While parallel fusion can benefit from long-term
information, it suffers from increasing computational and memory overheads as
the fusion window size grows. Alternatively, BEVFormer adopts a recurrent
fusion pipeline so that history information can be efficiently integrated, yet
it fails to benefit from longer temporal frames. In this paper, we explore an
embarrassingly simple long-term recurrent fusion strategy built upon the
LSS-based methods and find it already able to enjoy the merits from both sides,
i.e., rich long-term information and efficient fusion pipeline. A temporal
embedding module is further proposed to improve the model's robustness against
occasionally missed frames in practical scenarios. We name this simple but
effective fusing pipeline VideoBEV. Experimental results on the nuScenes
benchmark show that VideoBEV obtains strong performance on various camera-based
3D perception tasks, including object detection (55.4% mAP and 62.9% NDS),
segmentation (48.6% vehicle mIoU), tracking (54.8% AMOTA), and motion
prediction (0.80m minADE and 0.463 EPA). |
This paper proposes VideoBEV, a simple yet effective recurrent long-term temporal fusion framework for camera-based Bird's-Eye-View (BEV) 3D perception. |
Long-term temporal fusion is crucial for accurate 3D perception in autonomous driving but often overlooked or limited in existing methods. VideoBEV addresses this by efficiently fusing long-term information for comprehensive scene understanding. |
VideoBEV leverages a recurrent fusion module that sequentially integrates BEV features from a long sequence of frames. Additionally, a temporal embedding module is introduced to enhance robustness against missed frames in real-world scenarios. |
VideoBEV achieves state-of-the-art performance on the nuScenes benchmark across various 3D perception tasks, including 3D object detection (55.4% mAP and 62.9% NDS), map segmentation, and 3D object tracking (54.8% AMOTA).
The study demonstrates, for the first time, that recurrent temporal fusion with longer sequences (e.g., 16 frames in 8s) brings further benefits for perception accuracy.
VideoBEV maintains efficiency compared to parallel fusion methods, with consistently low overhead for memory and computation even with longer video inputs. |
The paper primarily focuses on high-level BEV feature fusion, leaving room for exploration of incorporating more advanced temporal fusion techniques at lower feature levels.
Further research could investigate extending VideoBEV with more sophisticated motion modeling and prediction capabilities. |
3d perception, autonomous driving, temporal fusion, bird's-eye-view (bev), recurrent neural networks |
2303.05892
Report |
Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection |
Luting Wang, Yi Liu, Penghui Du, Zihan Ding, Yue Liao, Qiaosong Qi, Biaolong Chen, Si Liu |
Open-vocabulary object detection aims to provide object detectors trained on
a fixed set of object categories with the generalizability to detect objects
described by arbitrary text queries. Previous methods adopt knowledge
distillation to extract knowledge from Pretrained Vision-and-Language Models
(PVLMs) and transfer it to detectors. However, due to the non-adaptive proposal
cropping and single-level feature mimicking processes, they suffer from
information destruction during knowledge extraction and inefficient knowledge
transfer. To remedy these limitations, we propose an Object-Aware Distillation
Pyramid (OADP) framework, including an Object-Aware Knowledge Extraction (OAKE)
module and a Distillation Pyramid (DP) mechanism. When extracting object
knowledge from PVLMs, the former adaptively transforms object proposals and
adopts object-aware mask attention to obtain precise and complete knowledge of
objects. The latter introduces global and block distillation for more
comprehensive knowledge transfer to compensate for the missing relation
information in object distillation. Extensive experiments show that our method
achieves significant improvement compared to current methods. Especially on the
MS-COCO dataset, our OADP framework reaches $35.6$ mAP$^{\text{N}}_{50}$,
surpassing the current state-of-the-art method by $3.3$ mAP$^{\text{N}}_{50}$.
Code is released at https://github.com/LutingWang/OADP. |
This paper proposes OADP, an Object-Aware Distillation Pyramid framework for open-vocabulary object detection. |
Existing methods suffer from information loss during knowledge extraction from pre-trained vision-and-language models and inefficient knowledge transfer to detectors. |
OADP uses an OAKE module to adaptively transform object proposals and extract precise knowledge with masked attention. It also employs a DP mechanism with global, block, and object distillation for comprehensive knowledge transfer. |
OADP achieves 35.6 mAPN@50 on OV-COCO, surpassing the previous state-of-the-art by 3.3 mAPN@50.
On OV-LVIS, OADP achieves 21.9 AP_r for object detection and 21.7 AP_r for instance segmentation, outperforming previous methods by more than 1.1 AP_r and 1.9 AP_r respectively.
Ablation studies demonstrate the effectiveness of OAKE and the DP mechanism in improving detection performance. |
The training cost of OADP is high due to the use of a large-scale image-text model.
The performance of OADP on novel categories is still lower than that on base categories. |
open-vocabulary object detection, knowledge distillation, vision and language, object proposal, adaptive learning |
2303.05828
Report |
Adapting Contrastive Language-Image Pretrained (CLIP) Models for Out-of-Distribution Detection |
Nikolas Adaloglou, Felix Michels, Tim Kaiser, Markus Kollmann |
We present a comprehensive experimental study on pretrained feature
extractors for visual out-of-distribution (OOD) detection, focusing on adapting
contrastive language-image pretrained (CLIP) models. Without fine-tuning on the
training data, we are able to establish a positive correlation ($R^2\geq0.92$)
between in-distribution classification and unsupervised OOD detection for CLIP
models in 4 benchmarks. We further propose a new simple and scalable method
called pseudo-label probing (PLP) that adapts vision-language models
for OOD detection. Given a set of label names of the training set, PLP trains a
linear layer using the pseudo-labels derived from the text encoder of CLIP. To
test the OOD detection robustness of pretrained models, we develop a novel
feature-based adversarial OOD data manipulation approach to create adversarial
samples. Intriguingly, we show that (i) PLP outperforms the previous
state-of-the-art (Ming et al., 2022) on all 5 large-scale benchmarks based on
ImageNet, specifically by an average AUROC gain of 3.4% using the largest CLIP
model (ViT-G), (ii) linear probing outperforms fine-tuning by
large margins for CLIP architectures (e.g., CLIP ViT-H achieves a mean gain of
7.3% AUROC across all ImageNet-based benchmarks), and (iii)
billion-parameter CLIP models still fail at detecting adversarially manipulated
OOD images. The code and adversarially created datasets will be made publicly
available. |
This paper investigates the use of pretrained CLIP models for visual out-of-distribution (OOD) detection and proposes a new method called pseudo-label probing (PLP) for adapting CLIP to this task. |
Accurate OOD detection is crucial for real-world applications to ensure safety during deployment, and leveraging pretrained models like CLIP can significantly benefit this task. |
The paper conducts experiments with 25 pretrained feature extractors on various OOD benchmarks. PLP utilizes CLIP's text encoder to generate pseudo-labels for training a linear layer on top of CLIP's visual features. The authors also introduce a novel feature-based adversarial OOD data manipulation technique. |
CLIP models show a strong correlation between in-distribution classification accuracy and unsupervised OOD detection performance.
PLP outperforms previous state-of-the-art methods on ImageNet benchmarks, achieving an average AUROC gain of 3.4% with CLIP ViT-G.
Linear probing on CLIP features surpasses fine-tuning for OOD detection on ImageNet-based benchmarks, indicating that OOD-related information is readily available in large-scale models. |
The study primarily focuses on ImageNet-based benchmarks and might not generalize to other datasets.
Further research is needed to understand the impact of PLP on in-distribution test accuracy and explore its applicability to other visual feature extractors. |
out-of-distribution detection, clip, contrastive language-image pretraining, pseudo-label probing, adversarial robustness |
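The pseudo-labeling step of PLP described above can be sketched as follows: each training image is assigned the class whose label-name text embedding is most similar to the image feature, and those pseudo-labels then supervise a linear layer on top of the visual features. The function names and the use of cosine similarity with an argmax are assumptions for illustration; the features here are plain lists standing in for CLIP embeddings.

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def pseudo_labels(image_feats, text_feats):
    """Assign each image the index of the most similar label-name embedding.
    image_feats: visual features, one per image.
    text_feats: text-encoder embeddings, one per label name."""
    text = [normalize(t) for t in text_feats]
    labels = []
    for feat in image_feats:
        f = normalize(feat)
        sims = [sum(a * b for a, b in zip(f, t)) for t in text]
        labels.append(max(range(len(sims)), key=sims.__getitem__))
    return labels

# Toy 2D example: two label embeddings and two image features.
toy_text = [[1.0, 0.0], [0.0, 1.0]]
toy_images = [[2.0, 0.1], [0.2, 3.0]]
toy_labels = pseudo_labels(toy_images, toy_text)
```

The resulting labels would then be used as targets for a standard cross-entropy-trained linear probe, so no ground-truth annotations are needed beyond the list of label names.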
2303.05807
Report |
Aleth-NeRF: Low-light Condition View Synthesis with Concealing Fields |
Ziteng Cui, Lin Gu, Xiao Sun, Xianzheng Ma, Yu Qiao, Tatsuya Harada |
Commonly captured low-light scenes are challenging for most computer vision
techniques, including Neural Radiance Fields (NeRF). Vanilla NeRF is
viewer-centred and simplifies the rendering process to light emission from 3D
locations in the viewing direction, thus failing to model the darkness induced
by low illumination. Inspired by the emission theory of the ancient Greeks, in
which visual perception is accomplished by rays cast from the eyes, we make slight
modifications to vanilla NeRF to train on multiple views of low-light scenes,
so that we can render the well-lit scene in an unsupervised manner. We
introduce a surrogate concept, Concealing Fields, that reduces the transport of
light during the volume rendering stage. Specifically, our proposed method,
Aleth-NeRF, directly learns from the dark image to understand volumetric object
representation and concealing field under priors. By simply eliminating
Concealing Fields, we can render a single or multi-view well-lit image(s) and
gain superior performance over other 2D low-light enhancement methods.
Additionally, we collect the first paired LOw-light and normal-light Multi-view
(LOM) datasets for future research. This version is invalid, please refer to
our new AAAI version: arXiv:2312.09093 |
Presents Aleth-NeRF, the first NeRF-based method trained on dark multi-view sRGB images for unsupervised low-light enhancement. |
Vanilla NeRF struggles with low-light scenes due to its viewer-centered approach, failing to model light attenuation. Existing solutions either require known lighting conditions or rely on 2D enhancement methods that lack 3D consistency. |
Introduces 'Concealing Fields' into the NeRF framework to simulate light attenuation, effectively extending the transmittance function. Employs priors like value range, structure similarity, and color constancy to enable unsupervised learning of these fields from low-light images. |
Achieves state-of-the-art performance on the LOL dataset for single-image low-light enhancement, demonstrating high generation quality.
Introduces LOM, the first paired low-light and normal-light multi-view dataset for benchmarking.
Outperforms existing 2D enhancement methods on the LOM dataset, exhibiting superior image quality and multi-view consistency in low-light scene rendering. |
Requires separate training for each scene, limiting generalizability.
May struggle with scenes exhibiting non-uniform lighting or strong shadows.
Future work includes addressing these limitations and exploring applications in dynamic scene relighting. |
neural radiance fields, low-light image enhancement, unsupervised learning, novel view synthesis, 3d scene understanding |
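The role of Concealing Fields in volume rendering can be illustrated with a small sketch on a single ray. It assumes that the concealing field contributes a multiplicative factor in (0, 1] to the transmittance at each sample, and that removing it means setting every factor to 1; the exact formulation in the paper may differ, and the function name and grayscale colors are illustrative.

```python
def render_ray(alphas, colors, conceal=None):
    """Composite samples along one ray, front to back.
    alphas: per-sample opacity in [0, 1].
    colors: per-sample grayscale radiance.
    conceal: optional per-sample attenuation in (0, 1]; None disables it
             (assumed multiplicative form of the concealing field)."""
    if conceal is None:
        conceal = [1.0] * len(alphas)
    transmittance = 1.0
    pixel = 0.0
    for alpha, color, c in zip(alphas, colors, conceal):
        transmittance *= c                 # extra attenuation from concealing
        pixel += transmittance * alpha * color
        transmittance *= (1.0 - alpha)     # standard NeRF occlusion
    return pixel

alphas = [0.5, 1.0]
colors = [1.0, 0.5]
dark = render_ray(alphas, colors, conceal=[0.5, 0.5])  # training-time rendering
lit = render_ray(alphas, colors)                       # concealing removed
```

Under this assumption the model is fit to dark images with the concealing factors active, and the same density and color are rendered without them at test time to obtain the brighter view.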
2303.05775
Report |
Self-NeRF: A Self-Training Pipeline for Few-Shot Neural Radiance Fields |
Jiayang Bai, Letian Huang, Wen Gong, Jie Guo, Yanwen Guo |
Recently, Neural Radiance Fields (NeRF) have emerged as a potent method for
synthesizing novel views from a dense set of images. Despite its impressive
performance, NeRF is plagued by its necessity for numerous calibrated views and
its accuracy diminishes significantly in a few-shot setting. To address this
challenge, we propose Self-NeRF, a self-evolved NeRF that iteratively refines
the radiance fields from very few input views, without incorporating
additional priors. Basically, we train our model under the supervision of
reference and unseen views simultaneously in an iterative procedure. In each
iteration, we label unseen views with the predicted colors or warped pixels
generated by the model from the preceding iteration. However, these expanded
pseudo-views are afflicted by imprecision in color and warping artifacts, which
degrades the performance of NeRF. To alleviate this issue, we construct an
uncertainty-aware NeRF with specialized embeddings. Some techniques such as
cone entropy regularization are further utilized to leverage the pseudo-views
in the most efficient manner. Through experiments under various settings, we
verified that our Self-NeRF is robust to input with uncertainty and surpasses
existing methods when trained on limited training data. |
This paper introduces Self-NeRF, a novel iterative self-training pipeline for Neural Radiance Fields (NeRF) designed to enhance novel view synthesis from a limited set of input views (few-shot). |
NeRF often struggles with few-shot scenarios, leading to degenerate solutions and overfitting. This work aims to address this challenge by iteratively refining NeRF reconstructions without relying on additional priors. |
Self-NeRF operates by iteratively training an uncertainty-aware NeRF model using both seen views and synthesized pseudo-views. The pseudo-views are generated in two ways: by warping seen views based on predicted depth and by using direct predictions from the previous iteration's model. The uncertainty-aware nature of the model allows it to handle inaccuracies inherent in these pseudo-views. |
Self-NeRF effectively synthesizes novel views with superior detail compared to existing few-shot NeRF methods.
The iterative training process demonstrably improves the quality of reconstructions over multiple iterations.
Self-NeRF exhibits robustness to varying numbers of input views, demonstrating its effectiveness in few-shot settings. |
The reliance on iterative training increases the overall computational cost.
Although the method is effective, its performance gains decrease as the number of input views increases. |
neural radiance fields, nerf, few-shot learning, novel view synthesis, self-training |
2303.05724
Report |
3D Cinemagraphy from a Single Image |
Xingyi Li, Zhiguo Cao, Huiqiang Sun, Jianming Zhang, Ke Xian, Guosheng Lin |
We present 3D Cinemagraphy, a new technique that marries 2D image animation
with 3D photography. Given a single still image as input, our goal is to
generate a video that contains both visual content animation and camera motion.
We empirically find that naively combining existing 2D image animation and 3D
photography methods leads to obvious artifacts or inconsistent animation. Our
key insight is that representing and animating the scene in 3D space offers a
natural solution to this task. To this end, we first convert the input image
into feature-based layered depth images using predicted depth values, followed
by unprojecting them to a feature point cloud. To animate the scene, we perform
motion estimation and lift the 2D motion into the 3D scene flow. Finally, to
resolve the problem of hole emergence as points move forward, we propose to
bidirectionally displace the point cloud as per the scene flow and synthesize
novel views by separately projecting them into target image planes and blending
the results. Extensive experiments demonstrate the effectiveness of our method.
A user study is also conducted to validate the compelling rendering results of
our method. |
Presents 3D Cinemagraphy, a novel technique generating videos with plausible animation and camera motion from a single still image. |
Traditional cinemagraphs lack 3D immersion and parallax effects; this work aims to bridge the gap between 2D image animation and 3D photography for a more realistic experience. |
Converts input image to feature-based layered depth images, unprojects to a feature point cloud, estimates and lifts 2D motion to 3D scene flow, animates the point cloud bidirectionally to address holes, and renders novel views at each time step. |
Outperforms baselines combining 2D animation and novel view synthesis in quantitative metrics (PSNR, SSIM, LPIPS).
Produces visually compelling results with fewer artifacts like flickering or jelly-like effects compared to alternative approaches.
Demonstrates generalization ability on in-the-wild photos, paintings, and synthetic images, especially with user-provided masks and flow hints for controlled animation. |
Performance depends on the accuracy of depth estimation, particularly for challenging structures like thin objects.
Currently focuses on fluid motion animation, leaving more complex motions like cyclic movements for future exploration. |
3d cinemagraphy, image animation, novel view synthesis, point cloud animation, single image animation |
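The 3D Cinemagraphy method above lifts 2D motion into 3D scene flow using predicted depth. A minimal sketch of one way to do this, assuming a pinhole camera with hypothetical intrinsics (fx, fy, cx, cy) and assuming the depth at the displaced pixel is available; the paper's exact formulation may differ:

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with the given depth to a 3D camera-space point."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def lift_flow_to_3d(u, v, flow_uv, depth_src, depth_dst, fx, fy, cx, cy):
    """Scene flow = 3D position of the displaced pixel minus that of the source pixel."""
    du, dv = flow_uv
    p_src = unproject(u, v, depth_src, fx, fy, cx, cy)
    p_dst = unproject(u + du, v + dv, depth_dst, fx, fy, cx, cy)
    return tuple(b - a for a, b in zip(p_src, p_dst))


# Hypothetical values: a pixel at the principal point moving 10 px to the right
# while staying at depth 2.0.
flow3d = lift_flow_to_3d(50.0, 50.0, (10.0, 0.0), 2.0, 2.0,
                         fx=100.0, fy=100.0, cx=50.0, cy=50.0)
print(flow3d)
```

Because the depth is unchanged in this example, the resulting scene flow has only a lateral component.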
2303.05699
Report |
Feature Unlearning for Pre-trained GANs and VAEs |
Saemi Moon, Seunghyuk Cho, Dongwoo Kim |
We tackle the problem of feature unlearning from pre-trained image
generative models, namely GANs and VAEs. Unlike a common unlearning task where an
unlearning target is a subset of the training set, we aim to unlearn a specific
feature, such as hairstyle from facial images, from the pre-trained generative
models. As the target feature is only present in a local region of an image,
unlearning the entire image from the pre-trained model may result in losing
other details in the remaining region of the image. To specify which features
to unlearn, we collect randomly generated images that contain the target
features. We then identify a latent representation corresponding to the target
feature and then use the representation to fine-tune the pre-trained model.
Through experiments on MNIST, CelebA, and FFHQ datasets, we show that target
features are successfully removed while keeping the fidelity of the original
models. Further experiments with an adversarial attack show that the unlearned
model is more robust under the presence of malicious parties. |
This paper presents a novel framework for unlearning specific features from pre-trained image generative models, such as GANs and VAEs. |
This addresses the problem of unwanted or harmful content generation while avoiding the need to retrain the entire model. |
The method involves identifying a latent representation of the target feature and then fine-tuning the pre-trained model to prevent the generation of images with that feature. This is achieved by collecting images with the target feature, identifying a corresponding latent vector, and then using that vector to guide the fine-tuning process. |
The unlearned models successfully reduce the generation of target features, achieving similar target feature ratios to oracle models trained without the target feature.
The unlearning process maintains high image quality, as demonstrated by comparable Inception Score and Fréchet Inception Distance scores to the original and oracle models.
The unlearned models show increased robustness against adversarial attacks, making them less susceptible to manipulation for generating unwanted content. |
The proposed method's effectiveness heavily relies on the quality of the latent space disentanglement.
Future work includes exploring more sophisticated feature disentanglement algorithms to improve the precision of feature unlearning. |
generative adversarial networks, variational autoencoders, machine unlearning, feature unlearning, adversarial robustness |
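The feature-unlearning method above identifies a latent vector corresponding to the target feature from collected samples. A minimal sketch of one common way to obtain such a vector, assuming it is the difference between the mean latent of samples with the feature and the mean latent of samples without it (an assumption for illustration; the function names and toy latents are hypothetical):

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def target_direction(latents_with_feature, latents_without_feature):
    """Mean-difference direction pointing from 'feature absent' to 'feature present'."""
    mu_pos = mean_vector(latents_with_feature)
    mu_neg = mean_vector(latents_without_feature)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def feature_score(latent, direction):
    """Projection of a latent onto the (unnormalized) target direction."""
    return sum(a * b for a, b in zip(latent, direction))


# Toy 2-D latents: the feature is carried by the first coordinate.
with_feature = [[2.0, 0.0], [4.0, 1.0]]
without_feature = [[0.0, 0.0], [0.0, 1.0]]
direction = target_direction(with_feature, without_feature)
score = feature_score([1.0, 5.0], direction)
print(direction, score)
```

A score like this could then guide fine-tuning, for example by penalizing generations whose latents project strongly onto the direction.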
2303.05646
Report |
Iterative Few-shot Semantic Segmentation from Image Label Text |
Haohan Wang, Liang Liu, Wuhao Zhang, Jiangning Zhang, Zhenye Gan, Yabiao Wang, Chengjie Wang, Haoqian Wang |
Few-shot semantic segmentation aims to learn to segment unseen class objects
with the guidance of only a few support images. Most previous methods rely on
the pixel-level label of support images. In this paper, we focus on a more
challenging setting, in which only the image-level labels are available. We
propose a general framework to first generate coarse masks with the help of
the powerful vision-language model CLIP, and then iteratively and mutually
refine the mask predictions of support and query images. Extensive experiments
on PASCAL-5i and COCO-20i datasets demonstrate that our method not only
outperforms the state-of-the-art weakly supervised approaches by a significant
margin, but also achieves comparable or better results to recent supervised
methods. Moreover, our method generalizes well to
images in the wild and uncommon classes. Code will be available at
https://github.com/Whileherham/IMR-HSNet. |
This document provides style instructions for authors submitting papers to the IJCAI-22 Proceedings. |
These guidelines ensure uniformity in formatting for all published papers in the IJCAI-22 proceedings. |
The paper details specific formatting requirements for various aspects like layout, fonts, headings, citations, illustrations, tables, formulas, algorithms etc. It also provides downloadable LaTeX and Microsoft Word templates that implement these guidelines. |
The use of Adobe's Portable Document Format (PDF) is mandatory for the electronic manuscript submission.
For uniformity, Adobe's Times Roman font is strongly recommended.
Authors are required to use the provided LaTeX or Microsoft Word templates for formatting. |
The document assumes the use of LaTeX or Microsoft Word; it does not provide instructions for other word processing software.
The document doesn't extensively cover accessibility aspects for readers with disabilities. |
ijcai, conference paper formatting, style guidelines, latex template, microsoft word template |
2303.05503
Report |
Open-world Instance Segmentation: Top-down Learning with Bottom-up Supervision |
Tarun Kalluri, Weiyao Wang, Heng Wang, Manmohan Chandraker, Lorenzo Torresani, Du Tran |
Many top-down architectures for instance segmentation achieve significant
success when trained and tested on a pre-defined closed-world taxonomy. However,
when deployed in the open world, they exhibit a notable bias towards seen classes
and suffer from a significant performance drop. In this work, we propose a novel
approach for open world instance segmentation called bottom-Up and top-Down
Open-world Segmentation (UDOS) that combines classical bottom-up segmentation
algorithms within a top-down learning framework. UDOS first predicts parts of
objects using a top-down network trained with weak supervision from bottom-up
segmentations. The bottom-up segmentations are class-agnostic and do not
overfit to specific taxonomies. The part-masks are then fed into affinity-based
grouping and refinement modules to predict robust instance-level segmentations.
UDOS enjoys both the speed and efficiency from the top-down architectures and
the generalization ability to unseen categories from bottom-up supervision. We
validate the strengths of UDOS on multiple cross-category as well as
cross-dataset transfer tasks from 5 challenging datasets including MS-COCO,
LVIS, ADE20k, UVO and OpenImages, achieving significant improvements over
state-of-the-art across the board. Our code and models are available on our
project page. |
This paper introduces UDOS (Bottom-Up and Top-Down Open-World Segmentation), a novel method for open-world instance segmentation that combines the strengths of bottom-up and top-down approaches. |
Open-world instance segmentation is crucial for real-world applications where models encounter novel objects not present in the training taxonomy. |
UDOS leverages a top-down network trained with weak supervision from class-agnostic bottom-up segmentation to predict object parts. These parts are then grouped using affinity scores and refined for boundary accuracy, enabling the detection of both seen and unseen objects. |
UDOS outperforms state-of-the-art methods in cross-category generalization on COCO, achieving 33.5% box AR and 31.6% mask AR.
The method excels in cross-dataset generalization, setting new state-of-the-art results on UVO, ADE20k, and OpenImages datasets without fine-tuning.
Ablation studies validate the contribution of each module and the importance of design choices. |
UDOS faces challenges with densely clustered objects of similar appearance.
Future work could explore more robust grouping methods or incorporate recent innovations like Segment Anything (SAM) for improved initial segmentation. |
open-world learning, instance segmentation, bottom-up segmentation, top-down learning, cross-category generalization |
2303.05499
Report |
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection |
Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang |
In this paper, we present an open-set object detector, called Grounding DINO,
by marrying Transformer-based detector DINO with grounded pre-training, which
can detect arbitrary objects with human inputs such as category names or
referring expressions. The key solution to open-set object detection is
introducing language to a closed-set detector for open-set concept
generalization. To effectively fuse language and vision modalities, we
conceptually divide a closed-set detector into three phases and propose a tight
fusion solution, which includes a feature enhancer, a language-guided query
selection, and a cross-modality decoder for cross-modality fusion. While
previous works mainly evaluate open-set object detection on novel categories,
we propose to also perform evaluations on referring expression comprehension
for objects specified with attributes. Grounding DINO performs remarkably well
on all three settings, including benchmarks on COCO, LVIS, ODinW, and
RefCOCO/+/g. Grounding DINO achieves a 52.5 AP on the COCO detection
zero-shot transfer benchmark, i.e., without any training data from COCO. It
sets a new record on the ODinW zero-shot benchmark with a mean 26.1 AP. Code
will be available at https://github.com/IDEA-Research/GroundingDINO. |
The paper presents Grounding DINO, an open-set object detector that leverages grounded pre-training to enable the detection of arbitrary objects specified by human language input. |
Open-set object detection is crucial for developing visual intelligence systems capable of understanding novel concepts, with applications ranging from image editing to generic object detection. |
The paper proposes a tight fusion approach, incorporating language information into multiple phases of a DINO detector: a feature enhancer for cross-modality fusion, a language-guided query selection module, and a cross-modality decoder. |
Grounding DINO achieves state-of-the-art performance on open-set detection benchmarks, including 52.5 AP on COCO zero-shot transfer and 26.1 mean AP on ODinW zero-shot.
The paper extends open-set object detection evaluation to Referring Expression Comprehension (REC) tasks, revealing a need for future work to focus on REC zero-shot performance.
Ablation studies demonstrate the effectiveness of each proposed fusion component in enhancing open-set object detection performance. |
While achieving impressive open-set detection results, Grounding DINO lacks segmentation capabilities.
The training data scale is limited compared to the largest GLIP models, potentially hindering further performance improvement. |
open-set object detection, referring expression comprehension, transformer-based detectors, multi-modal learning, grounded pre-training |
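The Grounding DINO method above includes a language-guided query selection module. A minimal sketch of the idea, assuming queries are chosen as the image tokens whose best dot-product similarity to any text token is highest; the function name and toy features are hypothetical, and the real module operates on batched tensors:

```python
def language_guided_query_selection(image_feats, text_feats, num_queries):
    """Return indices of the image tokens most similar to any text token."""
    scores = []
    for img in image_feats:
        # Dot-product similarity between this image token and every text token.
        sims = [sum(a * b for a, b in zip(img, txt)) for txt in text_feats]
        # Keep only the best-matching text token for this image token.
        scores.append(max(sims))
    # Sort token indices by descending score (stable, so ties keep input order).
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return order[:num_queries]


# Four toy image tokens and two toy text tokens in a 2-D feature space.
image_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
text_feats = [[1.0, 0.0], [0.0, 2.0]]
selected = language_guided_query_selection(image_feats, text_feats, 2)
print(selected)
```

The selected indices would then initialize the decoder queries, so that decoding starts from regions the text is most likely to refer to.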
2303.05498
Report |
Mark My Words: Dangers of Watermarked Images in ImageNet |
Kirill Bykov, Klaus-Robert Müller, Marina M. -C. Höhne |
The utilization of pre-trained networks, especially those trained on
ImageNet, has become a common practice in Computer Vision. However, prior
research has indicated that a significant number of images in the ImageNet
dataset contain watermarks, making pre-trained networks susceptible to learning
artifacts such as watermark patterns within their latent spaces. In this paper,
we aim to assess the extent to which popular pre-trained architectures display
such behavior and to determine which classes are most affected. Additionally,
we examine the impact of watermarks on the extracted features. Contrary to the
popular belief that the Chinese logographic watermarks impact the "carton"
class only, our analysis reveals that a variety of ImageNet classes, such as
"monitor", "broom", "apron" and "safe" rely on spurious correlations. Finally,
we propose a simple approach to mitigate this issue in fine-tuned networks by
ignoring the encodings from the feature-extractor layer of ImageNet pre-trained
networks that are most susceptible to watermark imprints. |
This paper investigates the impact of watermarks on ImageNet pre-trained models and reveals that many ImageNet classes, beyond the previously known "carton" class, are susceptible to spurious correlations with watermarks, particularly Chinese logograms. |
This is important because the presence of watermarks in training data can lead to models learning unintended artifacts, hindering their generalization ability and potentially leading to incorrect predictions. |
The authors analyzed the activations of 20 popular ImageNet pre-trained architectures on datasets with and without watermarks (Chinese, Latin, Hindi, and Numeric). They then measured the models' ability to differentiate between watermarked and normal images using AUC ROC. |
Numerous ImageNet classes exhibit sensitivity to Chinese watermarks, not just "carton".
This sensitivity is prevalent across all tested pre-trained architectures.
Ignoring the most watermark-sensitive representations during fine-tuning can mitigate the reliance on watermarks without significantly impacting performance. |
The study primarily focuses on Chinese logographic watermarks.
Future work can explore the impact of other watermark types and develop more sophisticated mitigation techniques. |
imagenet, watermarks, spurious correlations, deep learning, transfer learning |
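The watermark study above scores how well a representation separates watermarked from normal images using AUC ROC. A minimal sketch of that metric, computed as the probability that a randomly chosen watermarked image gets a higher activation than a randomly chosen normal one; the activation values below are made up for illustration:

```python
def auc_roc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical activations of one feature on watermarked and normal images.
watermarked = [0.9, 0.8, 0.4]
normal = [0.5, 0.3, 0.2]
auc = auc_roc(watermarked, normal)
print(round(auc, 3))
```

An AUC near 0.5 means the feature carries no watermark signal, while values near 1.0 (or near 0.0, for a feature that is suppressed by watermarks) indicate strong sensitivity.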
2303.05456
Report |
Restoration based Generative Models |
Jaemoo Choi, Yesom Park, Myungjoo Kang |
Denoising diffusion models (DDMs) have recently attracted increasing
attention by showing impressive synthesis quality. DDMs are built on a
diffusion process that pushes data to the noise distribution and the models
learn to denoise. In this paper, we establish the interpretation of DDMs in
terms of image restoration (IR). Integrating IR literature allows us to use an
alternative objective and diverse forward processes, without being confined to the
diffusion process. By imposing prior knowledge on the loss function grounded on
MAP-based estimation, we eliminate the need for the expensive sampling of DDMs.
Also, we propose multi-scale training, which improves the performance
compared to the diffusion process, by taking advantage of the flexibility of
the forward process. Experimental results demonstrate that our model improves
the quality and efficiency of both training and inference. Furthermore, we show
the applicability of our model to inverse problems. We believe that our
framework paves the way for designing a new type of flexible general generative
model. |
This paper introduces Restoration-based Generative Models (RGMs), a flexible generative model family inspired by image restoration techniques, to enhance the efficiency and flexibility of Denoising Diffusion Models (DDMs). |
DDMs, while effective, suffer from slow and computationally expensive sampling processes due to their reliance on iterative denoising and Gaussian noising processes. |
The authors leverage a Maximum A Posteriori (MAP)-based estimation with a learned prior term to replace the MMSE objective of DDMs. This approach enables efficient sampling with fewer steps by alleviating the ill-posedness inherent in inverse problems. Furthermore, they introduce flexibility in designing the degradation process, proposing a multi-scale approach that progressively reduces image dimension for more efficient latent representation. |
RGMs achieve comparable image generation quality to state-of-the-art DDMs, with significantly faster inference speed (e.g., FID 2.47 on CIFAR10 with only seven network function evaluations).
The framework demonstrates flexibility through successful implementation of various prior terms (KLD, MMD, DSWD) and degradation processes, showcasing its adaptability and potential for further exploration.
Beyond image generation, RGMs exhibit promising results in solving inverse problems like super-resolution and colorization when incorporated into Plug-and-Play algorithms. |
While showing strong empirical performance, the paper lacks theoretical justification for the effectiveness of RGMs.
The exploration of more sophisticated and effective forward processes beyond the multi-scale approach is left as future work. |
generative models, denoising diffusion models, image restoration, map estimation, plug-and-play algorithms |
2303.05371
Report |
3DGen: Triplane Latent Diffusion for Textured Mesh Generation |
Anchit Gupta, Wenhan Xiong, Yixin Nie, Ian Jones, Barlas Oğuz |
Latent diffusion models for image generation have crossed a quality threshold
which enabled them to achieve mass adoption. Recently, a series of works have
made advancements towards replicating this success in the 3D domain,
introducing techniques such as point cloud VAE, triplane representation, neural
implicit surfaces and differentiable rendering based training. We take another
step along this direction, combining these developments in a two-step pipeline
consisting of 1) a triplane VAE which can learn latent representations of
textured meshes and 2) a conditional diffusion model which generates the
triplane features. For the first time this architecture allows conditional and
unconditional generation of high quality textured or untextured 3D meshes
across multiple diverse categories in a few seconds on a single GPU. It
outperforms previous work substantially on image-conditioned and unconditional
generation on mesh quality as well as texture generation. Furthermore, we
demonstrate the scalability of our model to large datasets for increased
quality and diversity. We will release our code and trained models. |
Presents 3DGen, a two-stage pipeline for high-quality textured 3D mesh generation, utilizing a triplane VAE for latent representation learning and a conditional diffusion model for feature generation. |
Aims to address limitations in existing 3D generation methods, such as scalability, joint geometry and texture learning, and practical computational constraints, bridging the gap towards practical and high-quality 3D object generation. |
Combines a triplane VAE, trained with rendering-based reconstruction loss, with a conditional diffusion model incorporating 3D-aware convolutions and classifier-free guidance, enabling image-conditioned, text-conditioned, and unconditional generation. |
Achieves state-of-the-art performance in unconditional and image-conditioned mesh generation, outperforming competitors like NFD and 3DILG in FID scores and geometry fidelity.
Demonstrates superior performance in textured mesh generation compared to GET3D, with significant FID score improvements, generating high-quality meshes with detailed geometry and textures.
Shows scalability and improved quality by pre-training on the large-scale Objaverse dataset, particularly benefiting low-resource categories and enabling text-guided generation. |
Despite improvements, the model's generality still lags behind image generation models trained on massive datasets.
Future work can explore utilizing 2D image datasets as weak supervision or leveraging 2D generative models to further enhance 3D generation capabilities. |
3d mesh generation, textured mesh, latent diffusion model, triplane representation, variational autoencoder |
2303.05342
Report |
Knowledge-augmented Few-shot Visual Relation Detection |
Tianyu Yu, Yangning Li, Jiaoyan Chen, Yinghui Li, Hai-Tao Zheng, Xi Chen, Qingbin Liu, Wenqiang Liu, Dongxiao Huang, Bei Wu, Yexin Wang |
Visual Relation Detection (VRD) aims to detect relationships between objects
for image understanding. Most existing VRD methods rely on thousands of
training samples of each relationship to achieve satisfactory performance. Some
recent papers tackle this problem by few-shot learning with elaborately
designed pipelines and pre-trained word vectors. However, the performance of
existing few-shot VRD models is severely hampered by the poor generalization
capability, as they struggle to handle the vast semantic diversity of visual
relationships. Nonetheless, humans have the ability to learn new relationships
with just a few examples based on their knowledge. Inspired by this, we devise a
knowledge-augmented, few-shot VRD framework leveraging both textual knowledge
and visual relation knowledge to improve the generalization ability of few-shot
VRD. The textual knowledge and visual relation knowledge are acquired from a
pre-trained language model and an automatically constructed visual relation
knowledge graph, respectively. We extensively validate the effectiveness of our
framework. Experiments conducted on three benchmarks from the commonly used
Visual Genome dataset show that our performance surpasses existing
state-of-the-art models with a large improvement. |
This paper proposes a knowledge-augmented few-shot visual relation detection framework that leverages textual and visual relation knowledge to improve generalization ability. |
Existing few-shot VRD models struggle with the vast semantic diversity of visual relationships, limiting their performance. Humans, however, leverage knowledge to learn new relationships from few examples, motivating this work. |
The framework acquires textual knowledge from a pre-trained language model using prompt-based representations. It constructs a visual relation knowledge graph from image captions, encoded into a distributed representation using a pre-trained BERT model. A Mixture-of-Experts module fuses both knowledge sources to predict relationships. |
The proposed framework significantly outperforms state-of-the-art models on three VRD benchmarks, even without VRD pre-training or vision-language pre-training.
Both textual and visual relation knowledge significantly improve performance, especially for unseen object pairs and triplets.
A novel prompt template for textual knowledge and a visual relation knowledge graph constructed from a large corpus of image captions contribute to the performance gains. |
The current approach only utilizes image captions for visual relation knowledge; exploring other sources like videos could be beneficial.
Future work can explore more sophisticated knowledge fusion approaches beyond the Mixture-of-Experts module. |
visual relation detection, few-shot learning, knowledge augmentation, textual knowledge, visual relation knowledge graph |
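The few-shot VRD framework above fuses textual and visual relation knowledge with a Mixture-of-Experts module. A minimal sketch of such a fusion, assuming two experts that each output relation logits and a gate that produces one score per expert; the names and numbers are hypothetical, and the paper's gating network is not reproduced here:

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_fuse(expert_logits, gate_scores):
    """Weighted sum of per-expert relation logits using softmax gate weights."""
    weights = softmax(gate_scores)
    num_classes = len(expert_logits[0])
    return [
        sum(w * logits[c] for w, logits in zip(weights, expert_logits))
        for c in range(num_classes)
    ]


# Expert 0: textual-knowledge logits; expert 1: knowledge-graph logits (toy values).
expert_logits = [[2.0, 0.0], [0.0, 4.0]]
gate_scores = [0.0, 0.0]  # equal gate scores give equal weights
fused = moe_fuse(expert_logits, gate_scores)
print(fused)
```

In practice the gate scores would depend on the input, letting the model lean on whichever knowledge source is more informative for a given object pair.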
2303.05323
Report |
Controllable Video Generation by Learning the Underlying Dynamical System with Neural ODE |
Yucheng Xu, Li Nanbo, Arushi Goel, Zijian Guo, Zonghai Yao, Hamidreza Kasaei, Mohammadreze Kasaei, Zhibin Li |
Videos depict the change of complex dynamical systems over time in the form
of discrete image sequences. Generating controllable videos by learning the
dynamical system is an important yet underexplored topic in the computer vision
community. This paper presents a novel framework, TiV-ODE, to generate highly
controllable videos from a static image and a text caption. Specifically, our
framework leverages the ability of Neural Ordinary Differential
Equations (Neural ODEs) to represent complex dynamical systems as a set of
nonlinear ordinary differential equations. The resulting framework is capable
of generating videos with both desired dynamics and content. Experiments
demonstrate the ability of the proposed method to generate highly
controllable and visually consistent videos, and its capability of modeling
dynamical systems. Overall, this work is a significant step towards developing
advanced controllable video generation models that can handle complex and
dynamic scenes. |
Presents TiV-ODE, a novel framework for generating highly controllable videos from a static image and text caption by leveraging Neural ODEs to represent complex dynamical systems. |
Addresses limitations of traditional video generation methods by enabling control over both motion and appearance, and modeling the underlying continuous dynamical system for flexible frame rate generation. |
Combines image and text embeddings using a transformer, uses these as initial conditions for a Neural ODE, solves the ODE at desired timesteps to generate latent vectors, and decodes these into video frames using a VQ-VAE. |
Outperforms state-of-the-art methods such as MAGE on metrics including FID and LPIPS, on CATER and a new synthetic robot pick-and-place dataset.
Demonstrates controllability by accurately manipulating objects in videos based on text captions.
Successfully models continuous dynamics, enabling video generation with arbitrary and non-uniform frame rates (e.g., slow-motion effects). |
Training and solving the Neural ODE can be time-consuming, especially for complex motions.
Reliance on the first frame for visual information can lead to weaker constraints on later frames and potential blurring. |
video generation, controllable generation, neural ode, dynamical systems, text-to-video |
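The TiV-ODE method above solves the ODE at the desired timesteps, which is what permits arbitrary and non-uniform frame rates. A minimal sketch of that idea, assuming toy linear dynamics dz/dt = -z in place of the learned network and a fixed-step RK4 integrator (both are illustrative choices, not taken from the paper):

```python
import math


def rk4_step(f, z, t, h):
    """One classical Runge-Kutta step for dz/dt = f(z, t)."""
    k1 = f(z, t)
    k2 = f([zi + 0.5 * h * ki for zi, ki in zip(z, k1)], t + 0.5 * h)
    k3 = f([zi + 0.5 * h * ki for zi, ki in zip(z, k2)], t + 0.5 * h)
    k4 = f([zi + h * ki for zi, ki in zip(z, k3)], t + h)
    return [zi + h / 6.0 * (a + 2 * b + 2 * c + d)
            for zi, a, b, c, d in zip(z, k1, k2, k3, k4)]


def solve_at(f, z0, times, substeps=10):
    """Return the latent state at each requested time; times need not be evenly spaced."""
    z, t = list(z0), times[0]
    states = [list(z0)]
    for t_next in times[1:]:
        h = (t_next - t) / substeps
        for _ in range(substeps):
            z = rk4_step(f, z, t, h)
            t += h
        t = t_next  # avoid floating-point drift between frames
        states.append(z)
    return states


def toy_dynamics(z, t):
    """Stand-in for the learned dynamics network: exponential decay."""
    return [-zi for zi in z]


# Dense frames early (slow motion), sparse frames later.
times = [0.0, 0.1, 0.15, 0.5, 2.0]
states = solve_at(toy_dynamics, [1.0], times)
for t, s in zip(times, states):
    print(t, s[0], math.exp(-t))
```

Each returned state would be decoded into one video frame, so the frame timing is set entirely by the list of query times.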
2303.05275
Report |
Detecting Images Generated by Diffusers |
Davide Alessandro Coccomini, Andrea Esuli, Fabrizio Falchi, Claudio Gennaro, Giuseppe Amato |
This paper explores the task of detecting images generated by text-to-image
diffusion models. To evaluate this, we consider images generated from captions
in the MSCOCO and Wikimedia datasets using two state-of-the-art models: Stable
Diffusion and GLIDE. Our experiments show that it is possible to detect the
generated images using simple Multi-Layer Perceptrons (MLPs), starting from
features extracted by CLIP, or traditional Convolutional Neural Networks
(CNNs). We also observe that models trained on images generated by Stable
Diffusion can detect images generated by GLIDE relatively well; however, the
reverse is not true. Lastly, we find that incorporating the associated textual
information with the images rarely leads to significant improvement in
detection results but that the type of subject depicted in the image can have a
significant impact on performance. This work provides insights into the
feasibility of detecting generated images, and has implications for security
and privacy concerns in real-world applications. The code to reproduce our
results is available at:
https://github.com/davide-coccomini/Detecting-Images-Generated-by-Diffusers |
This paper investigates the detection of images generated by text-to-image diffusion models, specifically Stable Diffusion and GLIDE, using simple MLPs and CNNs. |
The ability to detect synthetic images is crucial for addressing concerns related to misinformation, deepfakes, and the integrity of online information, especially with the increasing accessibility and realism of text-to-image generation models. |
The study trains MLPs and CNNs to separate real images from the MSCOCO and Wikimedia datasets from images generated from their captions. It evaluates detection of Stable Diffusion and GLIDE images both when the detector is trained on the same generator and when it is trained on the other one. It also analyzes the impact of image category and linguistic features on detection. |
Pretrained CNNs and MLPs using CLIP-extracted features can effectively detect images generated by Stable Diffusion and GLIDE when trained on data generated by the same method.
Models trained on Stable Diffusion-generated images show some ability to detect GLIDE-generated images, but not vice versa.
Images depicting inanimate objects are more challenging to classify correctly, suggesting that generating believable animate objects is more difficult for current text-to-image models. |
The generalization ability of classifiers across different text-to-image generation methods is limited.
Further research is needed to explore the impact of more sophisticated language models in a multimodal detection setup. |
image generation, diffusion models, synthetic image detection, stable diffusion, glide |
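As a rough illustration of the detection setup summarized above (a small MLP on top of pre-extracted image features), the sketch below trains a one-hidden-layer MLP with plain NumPy. The random Gaussian vectors are only stand-ins for CLIP features, and the layer size, learning rate, and the names `train_mlp`, `predict`, and `make_split` are assumptions for illustration, not the authors' configuration.

```python
import numpy as np

def train_mlp(x, y, hidden=16, lr=0.1, epochs=500, seed=0):
    """Fit a one-hidden-layer MLP (ReLU, sigmoid output) by full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w1 = rng.normal(0.0, 0.1, size=(d, hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0.0, 0.1, size=hidden)
    b2 = 0.0
    for _ in range(epochs):
        h = np.maximum(x @ w1 + b1, 0.0)              # hidden activations
        p = 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))      # predicted P(generated)
        g_logit = (p - y) / n                         # gradient of mean BCE w.r.t. logits
        g_h = np.outer(g_logit, w2) * (h > 0)         # backprop through ReLU
        w2 = w2 - lr * (h.T @ g_logit)
        b2 = b2 - lr * g_logit.sum()
        w1 = w1 - lr * (x.T @ g_h)
        b1 = b1 - lr * g_h.sum(axis=0)
    return w1, b1, w2, b2

def predict(params, x):
    """Return 1 for 'generated' and 0 for 'real'."""
    w1, b1, w2, b2 = params
    h = np.maximum(x @ w1 + b1, 0.0)
    return (h @ w2 + b2 > 0).astype(int)

# Synthetic stand-ins for image features: label 0 = real, 1 = generated.
rng = np.random.default_rng(42)

def make_split(n_per_class, dim=8):
    real = rng.normal(-1.0, 1.0, size=(n_per_class, dim))
    fake = rng.normal(+1.0, 1.0, size=(n_per_class, dim))
    x = np.vstack([real, fake])
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    return x, y

x_train, y_train = make_split(200)
x_test, y_test = make_split(100)
params = train_mlp(x_train, y_train)
accuracy = float((predict(params, x_test) == y_test).mean())
```

A cross-generator check in the spirit of the report would train on features from one generator and evaluate on features from the other; with real features, the gap between the two directions is what the authors report as asymmetric.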
2303.05266
Report |
From Visual Prompt Learning to Zero-Shot Transfer: Mapping Is All You Need |
Ziqing Yang, Zeyang Sha, Michael Backes, Yang Zhang |
Visual prompt learning, as a newly emerged technique, leverages the knowledge
learned by a large-scale pre-trained model and adapts it to downstream tasks
through the usage of prompts. While previous research has focused on designing
effective prompts, in this work, we argue that compared to prompt design, a
good mapping strategy matters more. In this sense, we propose SeMap, a more
effective mapping using the semantic alignment between the pre-trained model's
knowledge and the downstream task. Our experimental results show that SeMap can
largely boost the performance of visual prompt learning. Moreover, our
experiments show that SeMap is capable of achieving competitive zero-shot
transfer, indicating that it can perform the downstream task without any
fine-tuning on the corresponding dataset. This demonstrates the potential of
our proposed method to be used in a broader range of applications where the
zero-shot transfer is desired. Results suggest that our proposed SeMap could
lead to significant advancements in both visual prompt learning and zero-shot
transfer. We hope with SeMap, we can help the community move forward to more
efficient and lightweight utilization of large vision models. |
This paper proposes SeMap, a semantics-based mapping strategy for visual prompt learning that leverages semantic alignment between pre-trained and downstream tasks. |
Existing visual prompt learning methods primarily focus on prompt design, neglecting the importance of effective mapping strategies for performance improvement. |
The paper introduces SeMap-1 (1-on-1 mapping based on highest semantic similarity) and SeMap-a (adaptive k-on-1 mapping based on semantic similarity clustering) to map pre-trained model outputs to downstream task labels using CLIP's text encoder for semantic similarity measurement. |
SeMap consistently outperforms existing visual prompt learning methods (RM-VP, FM-VP) by a large margin across various datasets.
SeMap achieves competitive zero-shot transfer performance without prompt optimization, even surpassing some visual prompt learning methods.
Mapping strategy shows a greater impact on performance compared to prompt design, highlighting its importance in visual prompt learning. |
The performance of zero-shot transfer using SeMap heavily relies on the similarity between downstream and pre-trained datasets.
Future work includes exploring the effectiveness of SeMap on more challenging downstream tasks and investigating its generalization capabilities to other pre-trained models. |
visual prompt learning, zero-shot transfer, mapping strategy, semantic alignment, pre-trained models |
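The 1-on-1 mapping described above can be sketched as a nearest-neighbour lookup in a text embedding space. In this minimal sketch the small hand-written vectors are hypothetical stand-ins for CLIP text-encoder outputs, and the function names are illustrative. Each downstream label independently takes its most similar pre-trained class, so uniqueness of the assignment is not enforced here.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between rows of a (M, D) and rows of b (K, D)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def one_to_one_mapping(downstream_emb, pretrained_emb):
    """For each downstream label, return the index of the most similar pre-trained class."""
    return cosine_similarity_matrix(downstream_emb, pretrained_emb).argmax(axis=1)

def map_logits(pretrained_logits, mapping):
    """Read downstream scores off the pre-trained model's logits via the mapping."""
    return pretrained_logits[:, mapping]

# Hypothetical embeddings: pre-trained classes (tabby cat, golden retriever, sports car)
pretrained_emb = np.array([[1.0, 0.1, 0.0],
                           [0.1, 1.0, 0.0],
                           [0.0, 0.0, 1.0]])
# Hypothetical embeddings: downstream labels (cat, dog)
downstream_emb = np.array([[0.9, 0.2, 0.1],
                           [0.2, 0.8, 0.0]])

mapping = one_to_one_mapping(downstream_emb, pretrained_emb)
pretrained_logits = np.array([[2.0, 0.5, -1.0]])     # one image, three pre-trained classes
downstream_logits = map_logits(pretrained_logits, mapping)
```

Because the mapping is computed from label text alone, it needs no downstream training images, which is consistent with the zero-shot use reported above.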
2303.05251
Report |
Masked Image Modeling with Local Multi-Scale Reconstruction |
Haoqing Wang, Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhi-Hong Deng, Kai Han |
Masked Image Modeling (MIM) achieves outstanding success in self-supervised
representation learning. Unfortunately, MIM models typically carry a huge
computational burden and learn slowly, which is an inevitable obstacle
for their industrial applications. Although the lower layers play the key role
in MIM, existing MIM models conduct reconstruction task only at the top layer
of encoder. The lower layers are not explicitly guided and the interaction
among their patches is only used for calculating new activations. Considering
the reconstruction task requires non-trivial inter-patch interactions to reason
target signals, we apply it to multiple local layers including lower and upper
layers. Further, since the multiple layers are expected to learn information at
different scales, we design local multi-scale reconstruction, where the lower
and upper layers reconstruct fine-scale and coarse-scale supervision signals
respectively. This design not only accelerates the representation learning
process by explicitly guiding multiple layers, but also facilitates multi-scale
semantic understanding of the input. Extensive experiments show that with
significantly less pre-training burden, our model achieves comparable or better
performance on classification, detection and segmentation tasks than existing
MIM models. |
This paper proposes LocalMIM, a new Masked Image Modeling (MIM) technique that uses local multi-scale reconstruction to learn visual representations. It introduces reconstruction tasks at multiple layers of the encoder, with each layer focusing on different scales of the input image, rather than solely at the top layer like traditional MIM methods. |
Existing MIM models suffer from high computational cost and slow learning, hindering their practical use. The authors argue that lower encoder layers are crucial for learning but are not explicitly guided in existing MIM models. LocalMIM aims to address this by explicitly guiding the learning of both lower and upper layers through multi-scale reconstruction, leading to faster and more efficient representation learning. |
LocalMIM divides the input image into regions and extracts supervision signals (e.g., HOG features, normalized pixels) at multiple scales. It applies local reconstruction losses at specific layers of the encoder, with lower layers reconstructing fine-scale signals and upper layers reconstructing coarse-scale signals. The model uses an asymmetric encoder-decoder structure where the decoder is lightweight, minimizing computational overhead. |
LocalMIM significantly outperforms existing MIM models in terms of pre-training efficiency, achieving comparable or better results on ImageNet-1K classification with considerably less training time.
The learned representations generalize well to downstream tasks, demonstrating superior performance on ADE20K semantic segmentation and COCO object detection/segmentation compared to previous MIM methods.
Ablation studies validate the importance of local reconstructions, multi-scale supervisions, and the choice of reconstruction targets and decoder design. |
The selection of optimal locations for local reconstruction in the encoder is currently based on empirical observations and may require further investigation.
While the paper primarily focuses on image-level representation learning, exploring the application of LocalMIM to other vision tasks like video understanding could be a promising future direction. |
self-supervised learning, masked image modeling, representation learning, vision transformers, multi-scale learning |
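To make the multi-scale supervision described above concrete, the following sketch builds normalized-pixel targets at several region sizes and pairs finer regions with lower layers. The layer names, region sizes, and helper names are assumptions chosen for illustration; the report does not state the exact configuration, and HOG targets are not shown.

```python
import numpy as np

def patch_targets(image, patch, eps=1e-6):
    """Split an (H, W, C) image into non-overlapping regions and normalize each region."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    mean = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def masked_mse(pred, target, mask):
    """Mean squared error averaged over masked regions only (mask value 1 = masked)."""
    per_region = ((pred - target) ** 2).mean(axis=1)
    return float((per_region * mask).sum() / max(mask.sum(), 1.0))

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))

# Hypothetical layer-to-scale assignment: fine regions for lower layers, coarse for upper.
scales = {"layer_2": 4, "layer_6": 8, "layer_12": 16}
targets = {name: patch_targets(image, p) for name, p in scales.items()}

# A total loss would sum masked_mse over the chosen layers, each with its own decoder output.
fine = targets["layer_2"]
zero_loss = masked_mse(fine, fine, np.ones(fine.shape[0]))
```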
2303.05122
Report |
M-Tuning: Prompt Tuning with Mitigated Label Bias in Open-Set Scenarios |
Ning Liao, Xiaopeng Zhang, Min Cao, Junchi Yan, Qi Tian |
In realistic open-set scenarios where labels of a part of testing data are
totally unknown, when vision-language (VL) prompt learning methods encounter
inputs related to unknown classes (i.e., not seen during training), they always
predict them as one of the training classes. The exhibited label bias causes
difficulty in open set recognition (OSR), in which an image should be correctly
predicted as one of the known classes or the unknown one. To achieve this goal,
we propose a vision-language prompt tuning method with mitigated label bias
(M-Tuning). It introduces open words from the WordNet to extend the range of
words forming the prompt texts from only closed-set label words to more, and
thus prompts are tuned in a simulated open-set scenario. Besides, inspired by
the observation that classifying directly on large datasets causes a much
higher false positive rate than on small datasets, we propose a Combinatorial
Tuning and Testing (CTT) strategy for improving performance. CTT decomposes
M-Tuning on large datasets as multiple independent group-wise tuning on fewer
classes, then makes accurate and comprehensive predictions by selecting the
optimal sub-prompt. Finally, given the lack of VL-based OSR baselines in the
literature, especially for prompt methods, we contribute new baselines for fair
comparisons. Our method achieves the best performance on datasets with various
scales, and extensive ablation studies also validate its effectiveness. |
Proposes M-Tuning, a vision-language prompt tuning method that mitigates label bias for open-set recognition (OSR) by introducing open words from WordNet during training, simulating an open-set scenario and improving generalization to unknown classes. |
Existing VL prompt learning methods struggle with open-set scenarios where testing data includes unknown classes, leading to misclassification. This work addresses this limitation by enabling OSR within VL prompt learning. |
M-Tuning extends prompt vocabulary with open words unrelated to training/testing labels, simulating an open-set scenario. For large datasets, a Combinatorial Tuning and Testing (CTT) strategy divides the data into groups for independent tuning, improving accuracy. New baselines are constructed for fair comparison. |
M-Tuning significantly outperforms existing prompt learning and OSR methods in unknown detection tasks on various datasets.
CTT strategy effectively improves performance on large-scale datasets by decomposing the tuning and inference processes.
Analysis shows that open words less similar to closed-set classes improve performance, suggesting flexibility in open word selection. |
The performance of M-Tuning may vary with different choices of open words and grouping strategies.
Further exploration is needed to optimize the selection and utilization of open words from potentially diverse sources beyond WordNet. |
vision-language, open set recognition, prompt tuning, label bias, combinatorial tuning and testing |
2303.05031
Report |
CoralStyleCLIP: Co-optimized Region and Layer Selection for Image Editing |
Ambareesh Revanur, Debraj Basu, Shradha Agrawal, Dhwanit Agarwal, Deepak Pai |
Edit fidelity is a significant issue in open-world controllable generative
image editing. Recently, CLIP-based approaches have traded off simplicity to
alleviate these problems by introducing spatial attention in a handpicked layer
of a StyleGAN. In this paper, we propose CoralStyleCLIP, which incorporates a
multi-layer attention-guided blending strategy in the feature space of
StyleGAN2 for obtaining high-fidelity edits. We propose multiple forms of our
co-optimized region and layer selection strategy to demonstrate the variation
of time complexity with the quality of edits over different architectural
intricacies while preserving simplicity. We conduct extensive experimental
analysis and benchmark our method against state-of-the-art CLIP-based methods.
Our findings suggest that CoralStyleCLIP results in high-quality edits while
preserving the ease of use. |
CoralStyleCLIP, a novel method for text-driven image editing, co-optimizes region and layer selection in StyleGAN2 for high-fidelity edits with minimal manual intervention. |
Existing methods struggle to achieve both ease of use and edit fidelity, often requiring manual selection of layers and resulting in undesirable edits to unintended regions. |
CoralStyleCLIP introduces a multi-layer attention-guided blending strategy, learning both latent edit directions and spatial masks for each StyleGAN2 layer. Two variants are presented: one using pre-trained segment selection for faster training and another using a convolutional attention network for finer control. |
CoralStyleCLIP produces high-quality edits localized to relevant regions, outperforming baselines in accuracy and minimizing unwanted modifications.
The method automatically learns appropriate layers for editing, selecting earlier layers for coarse edits (e.g., shape) and later layers for finer details (e.g., color).
Segment selection offers significant speed advantages over the attention network, achieving comparable results for less complex edits. |
Segment selection can be limited by the pre-defined segments of the model, potentially leading to over- or under-selection.
The attention network variant, while more accurate, incurs higher training costs. |
image editing, text-guided image manipulation, stylegan, clip, attention mechanisms |
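The attention-guided blending summarized above can be pictured as a per-layer convex combination of original and edited feature maps under a soft spatial mask. This is a minimal sketch under that assumption; the shapes, the `blend_layer` name, and the hard left/right mask are illustrative and not taken from the paper.

```python
import numpy as np

def blend_layer(original, edited, mask):
    """Blend (C, H, W) feature maps using an (H, W) soft mask with values in [0, 1]."""
    mask = np.clip(mask, 0.0, 1.0)[None, :, :]      # broadcast over channels
    return mask * edited + (1.0 - mask) * original

rng = np.random.default_rng(0)
original = rng.normal(size=(4, 8, 8))               # features of the unedited image
edited = original + 1.0                             # features after a latent edit
mask = np.zeros((8, 8))
mask[:, 4:] = 1.0                                   # edit only the right half

blended = blend_layer(original, edited, mask)
```

Regions where the mask is zero keep the original features exactly, which is the mechanism that would limit unintended changes outside the edited region.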
2303.04989
Report |
ARS-DETR: Aspect Ratio-Sensitive Detection Transformer for Aerial Oriented Object Detection |
Ying Zeng, Yushi Chen, Xue Yang, Qingyun Li, Junchi Yan |
Existing oriented object detection methods commonly use metric AP$_{50}$ to
measure the performance of the model. We argue that AP$_{50}$ is inherently
unsuitable for oriented object detection due to its large tolerance in angle
deviation. Therefore, we advocate using a high-precision metric, e.g., AP$_{75}$,
to measure the performance of models. In this paper, we propose an Aspect Ratio
Sensitive Oriented Object Detector with Transformer, termed ARS-DETR, which
exhibits a competitive performance in high-precision oriented object detection.
Specifically, a new angle classification method, called Aspect Ratio aware
Circle Smooth Label (AR-CSL), is proposed to smooth the angle label in a more
reasonable way and discard the hyperparameter introduced by previous work
(e.g. CSL). Then, a rotated deformable attention module is designed to rotate
the sampling points with the corresponding angles and eliminate the
misalignment between region features and sampling points. Moreover, a dynamic
weight coefficient according to the aspect ratio is adopted to calculate the
angle loss. Comprehensive experiments on several challenging datasets show that
our method achieves competitive performance on the high-precision oriented
object detection task. |
The paper proposes Aspect Ratio-Sensitive Detection Transformer (ARS-DETR) for high-precision oriented object detection in aerial images. |
Current oriented object detectors often neglect the sensitivity of objects with different aspect ratios to angle, and existing metrics like AP50 aren't sensitive enough to reflect angle prediction accuracy. This hinders high-precision oriented object detection crucial for tasks like fine-grained recognition. |
The authors propose Aspect Ratio Aware Circle Smooth Label (AR-CSL) to smooth angle labels dynamically based on object aspect ratio using SkewIoU. They also introduce a Rotated Deformable Attention module to align features with object orientation and use aspect ratio sensitive matching/loss during training. The effectiveness of these methods is demonstrated using Deformable DETR as the base architecture. |
AR-CSL outperforms CSL with various radii and angle discrete granularities, achieving better AP75 across different detectors.
The Rotated Deformable Attention module aligns features effectively, leading to significant improvements in AP75.
ARS-DETR achieves competitive performance on high-precision oriented object detection across DOTA-v1.0, DIOR-R, and OHD-SJTU datasets. |
The paper focuses on AP75 as the main metric for high-precision detection. While justified, exploring other metrics like AP90 could be interesting.
The computational complexity of ARS-DETR compared to other oriented object detectors is not discussed. |
oriented object detection, high-precision detection, detection transformer, feature alignment, remote sensing |
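For context on the hyperparameter that AR-CSL is said to remove, the sketch below shows a Gaussian-window circular smooth label in the style of the CSL baseline, where `radius` must be hand-tuned. It is not AR-CSL itself, which per the report derives the smoothing from aspect ratio via SkewIoU. The bin count, default radius, and untruncated window are assumptions.

```python
import numpy as np

def circular_smooth_label(angle_bin, num_bins=180, radius=6.0):
    """Soft angle label that wraps around, so bin 179 is adjacent to bin 0."""
    bins = np.arange(num_bins)
    diff = np.abs(bins - angle_bin)
    circular_dist = np.minimum(diff, num_bins - diff)   # distance on the angle circle
    return np.exp(-(circular_dist ** 2) / (2.0 * radius ** 2))

label = circular_smooth_label(0)
narrow = circular_smooth_label(0, radius=2.0)
```

A smaller radius concentrates the label mass near the true angle, which is the kind of knob an aspect-ratio-dependent scheme would set per object instead of globally.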
2303.04970
Report |
LMR: A Large-Scale Multi-Reference Dataset for Reference-based Super-Resolution |
Lin Zhang, Xin Li, Dongliang He, Errui Ding, Zhaoxiang Zhang |
It is widely agreed that reference-based super-resolution (RefSR) achieves
superior results by referring to similar high quality images, compared to
single image super-resolution (SISR). Intuitively, the more references, the
better performance. However, previous RefSR methods have all focused on
single-reference image training, while multiple reference images are often
available in testing or practical applications. The root cause of such
training-testing mismatch is the absence of publicly available multi-reference
SR training datasets, which greatly hinders research efforts on multi-reference
super-resolution. To this end, we construct a large-scale, multi-reference
super-resolution dataset, named LMR. It contains 112,142 groups of 300x300
training images, which is 10x the size of the largest existing RefSR dataset. The image
size is also much larger. More importantly, each group is equipped with 5
reference images with different similarity levels. Furthermore, we propose a
new baseline method for multi-reference super-resolution: MRefSR, including a
Multi-Reference Attention Module (MAM) for feature fusion of an arbitrary
number of reference images, and a Spatial Aware Filtering Module (SAFM) for the
fused feature selection. The proposed MRefSR achieves significant improvements
over state-of-the-art approaches on both quantitative and qualitative
evaluations. Our code and data will be made available soon. |
This paper introduces LMR, the first large-scale multi-reference dataset for reference-based super-resolution (RefSR), and proposes MRefSR, a novel multi-reference RefSR baseline method. |
Existing RefSR methods rely on single-reference training datasets, limiting their effectiveness in real-world applications where multiple reference images are often available. |
LMR is constructed from the MegaDepth dataset by selecting and cropping image patches with varying similarity levels. MRefSR leverages a Multi-Reference Attention Module (MAM) for feature fusion and a Spatial Aware Filtering Module (SAFM) for feature selection. |
MRefSR significantly outperforms state-of-the-art methods on both CUFED5 and LMR datasets.
Models trained on LMR exhibit strong generalization ability on other RefSR datasets.
MRefSR effectively utilizes multiple reference images, leading to superior visual quality compared to single-reference methods. |
The impact of different similarity levels of reference images requires further exploration.
Investigating the effectiveness of MRefSR on other computer vision tasks is a promising future direction. |
reference-based super-resolution, multi-reference super-resolution, dataset, deep learning, computer vision |
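One way to picture feature fusion over an arbitrary number of reference images is softmax attention across the reference axis at each spatial position. The sketch below is a generic attention formulation under that assumption, not the authors' Multi-Reference Attention Module; the shapes and the `fuse_references` name are illustrative.

```python
import numpy as np

def fuse_references(query, refs):
    """Fuse N reference feature maps per position.

    query: (P, D) features of the low-resolution input at P positions.
    refs:  (N, P, D) reference features already aligned to the input.
    Returns fused (P, D) features and (N, P) attention weights.
    """
    scores = np.einsum("pd,npd->np", query, refs) / np.sqrt(query.shape[1])
    scores = scores - scores.max(axis=0, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=0, keepdims=True)
    fused = np.einsum("np,npd->pd", weights, refs)
    return fused, weights

rng = np.random.default_rng(0)
query = rng.normal(size=(16, 8))
refs_five = rng.normal(size=(5, 16, 8))     # five references, as in each LMR group
refs_one = refs_five[:1]                    # the same code handles a single reference

fused_five, weights_five = fuse_references(query, refs_five)
fused_one, weights_one = fuse_references(query, refs_one)
```

Because the softmax runs over the reference axis, the number of references can differ between training and testing without changing any parameter shapes.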
2303.04838
Report |
The Casual Conversations v2 Dataset |
Bilal Porgali, Vítor Albiero, Jordan Ryda, Cristian Canton Ferrer, Caner Hazirbas |
This paper introduces a new large consent-driven dataset aimed at assisting
in the evaluation of algorithmic bias and robustness of computer vision and
audio speech models in regards to 11 attributes that are self-provided or
labeled by trained annotators. The dataset includes 26,467 videos of 5,567
unique paid participants, with an average of almost 5 videos per person,
recorded in Brazil, India, Indonesia, Mexico, Vietnam, Philippines, and the
USA, representing diverse demographic characteristics. The participants agreed
for their data to be used in assessing fairness of AI models and provided
self-reported age, gender, language/dialect, disability status, physical
adornments, physical attributes and geo-location information, while trained
annotators labeled apparent skin tone using the Fitzpatrick Skin Type and Monk
Skin Tone scales, and voice timbre. Annotators also labeled for different
recording setups and per-second activity annotations. |
Introduces Casual Conversations v2, a large and diverse dataset designed for evaluating fairness and robustness in audio, vision, and speech AI models. |
Addresses the lack of ethically constructed benchmarks for identifying fairness issues in AI models, particularly concerning demographic attributes. |
Collected 26,467 videos from 5,567 participants across 7 countries, encompassing self-reported demographics, annotated physical attributes, and diverse recording setups. |
Models trained on FairFace exhibit better accuracy across datasets and demographic groups.
Significant performance bias exists in vision models towards household items from higher-income backgrounds.
Strong correlation observed between native language and spoken language among participants. |
Limited number of race categories in some datasets like UTKFace and RFW.
Reliance on binary gender categories in several datasets can be limiting and potentially discriminatory. |
fairness, robustness, dataset, audio-visual, speech |
2303.04803
Report |
Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models |
Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, Shalini De Mello |
We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation,
which unifies pre-trained text-image diffusion and discriminative models to
perform open-vocabulary panoptic segmentation. Text-to-image diffusion models
have the remarkable ability to generate high-quality images with diverse
open-vocabulary language descriptions. This demonstrates that their internal
representation space is highly correlated with open concepts in the real world.
Text-image discriminative models like CLIP, on the other hand, are good at
classifying images into open-vocabulary labels. We leverage the frozen internal
representations of both these models to perform panoptic segmentation of any
category in the wild. Our approach outperforms the previous state of the art by
significant margins on both open-vocabulary panoptic and semantic segmentation
tasks. In particular, with COCO training only, our method achieves 23.4 PQ and
30.0 mIoU on the ADE20K dataset, with 8.3 PQ and 7.9 mIoU absolute improvement
over the previous state of the art. We open-source our code and models at
https://github.com/NVlabs/ODISE . |
This paper proposes ODISE, an open-vocabulary diffusion-based panoptic segmentation model that leverages the internal representations of pre-trained text-to-image diffusion models for open-vocabulary segmentation. |
Open-vocabulary recognition is crucial for real-world applications, as it allows models to recognize limitless categories, unlike closed-vocabulary approaches that are limited by their training data. |
ODISE uses a frozen text-to-image diffusion model to extract visual features from an image and its caption. It trains a mask generator on these features to predict panoptic masks and utilizes a mask classification module to categorize masks into open-vocabulary categories. |
ODISE achieves state-of-the-art performance on open-vocabulary panoptic segmentation, outperforming previous methods by a significant margin.
It also excels in open-vocabulary semantic segmentation, object detection, and open-world instance segmentation tasks.
The study finds that the internal representations of text-to-image diffusion models are better suited for open-vocabulary segmentation compared to traditional discriminative models. |
The category definitions in existing datasets can be ambiguous, affecting evaluation accuracy.
Potential bias in the pre-trained diffusion model's internal representation due to web-crawled data. |
open-vocabulary, panoptic segmentation, diffusion models, text-to-image generation, computer vision |
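The mask classification step described above can be approximated by pooling features inside each predicted mask and comparing the result with text embeddings of candidate category names. The sketch assumes mean pooling and cosine similarity, which the report does not spell out; the tiny feature map and the two text vectors are hypothetical.

```python
import numpy as np

def mask_pool(features, masks):
    """Average (H, W, D) features inside each of M binary (H, W) masks -> (M, D)."""
    m = masks.reshape(masks.shape[0], -1).astype(float)
    f = features.reshape(-1, features.shape[-1])
    return (m @ f) / np.maximum(m.sum(axis=1, keepdims=True), 1.0)

def classify_masks(mask_emb, text_emb):
    """Assign each mask embedding to the most similar text embedding."""
    a = mask_emb / np.linalg.norm(mask_emb, axis=1, keepdims=True)
    b = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return (a @ b.T).argmax(axis=1)

# Toy 2x2 feature map: left column looks like category 0, right column like category 1.
features = np.array([[[1.0, 0.0], [0.0, 1.0]],
                     [[1.0, 0.0], [0.0, 1.0]]])
masks = np.array([[[1, 0], [1, 0]],      # mask covering the left column
                  [[0, 1], [0, 1]]])     # mask covering the right column
text_emb = np.array([[1.0, 0.0],         # hypothetical embedding for "grass"
                     [0.0, 1.0]])        # hypothetical embedding for "sky"

mask_emb = mask_pool(features, masks)
labels = classify_masks(mask_emb, text_emb)
```

Adding a category at test time only requires appending another row to `text_emb`, which is what makes this style of classifier open-vocabulary.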
2303.04761
Report |
Video-P2P: Video Editing with Cross-attention Control |
Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, Jiaya Jia |
This paper presents Video-P2P, a novel framework for real-world video editing
with cross-attention control. While attention control has proven effective for
image editing with pre-trained image generation models, there are currently no
large-scale video generation models publicly available. Video-P2P addresses
this limitation by adapting an image generation diffusion model to complete
various video editing tasks. Specifically, we propose to first tune a
Text-to-Set (T2S) model to complete an approximate inversion and then optimize
a shared unconditional embedding to achieve accurate video inversion with a
small memory cost. For attention control, we introduce a novel
decoupled-guidance strategy, which uses different guidance strategies for the
source and target prompts. The optimized unconditional embedding for the source
prompt improves reconstruction ability, while an initialized unconditional
embedding for the target prompt enhances editability. Incorporating the
attention maps of these two branches enables detailed editing. These technical
designs enable various text-driven editing applications, including word swap,
prompt refinement, and attention re-weighting. Video-P2P works well on
real-world videos for generating new characters while optimally preserving
their original poses and scenes. It significantly outperforms previous
approaches. |
This paper presents Video-P2P, a novel framework for real-world video editing using cross-attention control with pre-trained image generation diffusion models. |
Addresses the lack of publicly available large-scale video generation models for video editing tasks, enabling detailed control over object properties and actions within real-world videos. |
Adapts a text-to-image diffusion model into a text-to-set model for video processing. Optimizes a shared unconditional embedding for accurate video inversion. Introduces a decoupled-guidance strategy for attention control, utilizing separate guidance for source and target prompts. |
Successfully performs local and global video editing tasks like word swapping, prompt refinement, and attention re-weighting.
Demonstrates superior performance in preserving temporal coherence and background details compared to existing methods like Tune-A-Video and Dreamix.
Quantitative analysis using metrics like CLIP Score, Masked PSNR, LPIPS, and Object Semantic Variance (OSV) confirms the effectiveness of Video-P2P in achieving semantic consistency and high editing quality. |
Limited ability to edit video motion due to the use of an image-based diffusion model.
Future work will focus on enhancing Video-P2P to handle more complex editing scenarios, such as adding new objects into the video. |
video editing, diffusion models, cross-attention control, text-to-video generation, unconditional embedding |
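The decoupled-guidance idea above builds on standard classifier-free guidance, in which the noise estimate is extrapolated from the unconditional prediction toward the conditional one. The sketch below applies that formula to two branches with different unconditional embeddings. `toy_denoiser`, the embedding values, and the guidance scale of 7.5 are placeholders assumed for illustration, not the authors' model or settings.

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: move from the unconditional toward the conditional estimate."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

def toy_denoiser(latent, embedding):
    """Hypothetical stand-in for a diffusion U-Net; returns a noise estimate."""
    return 0.1 * latent + embedding.mean()

latent = np.ones((2, 4))
source_prompt = np.full(3, 0.5)
target_prompt = np.full(3, 0.8)
optimized_null = np.full(3, 0.2)    # unconditional embedding tuned during inversion
initial_null = np.zeros(3)          # default unconditional embedding, left untuned

# Source branch: optimized unconditional embedding, favouring faithful reconstruction.
eps_source = guided_noise(toy_denoiser(latent, optimized_null),
                          toy_denoiser(latent, source_prompt), scale=7.5)
# Target branch: initial unconditional embedding, favouring editability.
eps_target = guided_noise(toy_denoiser(latent, initial_null),
                          toy_denoiser(latent, target_prompt), scale=7.5)
```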
2303.04748
Report |
CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP |
Junbo Zhang, Runpei Dong, Kaisheng Ma |
Training a 3D scene understanding model requires complicated human
annotations, which are laborious to collect and result in a model only encoding
close-set object semantics. In contrast, vision-language pre-training models
(e.g., CLIP) have shown remarkable open-world reasoning properties. To this
end, we propose directly transferring CLIP's feature space to 3D scene
understanding model without any form of supervision. We first modify CLIP's
input and forwarding process so that it can be adapted to extract dense pixel
features for 3D scene contents. We then project multi-view image features to
the point cloud and train a 3D scene understanding model with feature
distillation. Without any annotations or additional training, our model
achieves promising annotation-free semantic segmentation results on
open-vocabulary semantics and long-tailed concepts. Besides, serving as a
cross-modal pre-training framework, our method can be used to improve data
efficiency during fine-tuning. Our model outperforms previous SOTA methods in
various zero-shot and data-efficient learning benchmarks. Most importantly, our
model successfully inherits CLIP's rich-structured knowledge, allowing 3D scene
understanding models to recognize not only object concepts but also open-world
semantics. |
This paper proposes CLIP-FO3D, a novel method for transferring CLIP's feature space to 3D scene understanding models without any human supervision, enabling open-world 3D scene understanding. |
Training 3D scene understanding models typically requires extensive human annotations, limiting them to recognizing only a fixed set of object semantics. CLIP-FO3D addresses this limitation by leveraging the open-world reasoning capabilities of CLIP. |
CLIP-FO3D extracts dense pixel features from 3D scene RGB views by modifying CLIP's input and forwarding process. It then projects these features to the point cloud and trains a 3D scene understanding model using feature distillation. |
CLIP-FO3D achieves impressive annotation-free semantic segmentation results on standard benchmarks (ScanNet, S3DIS) and open-vocabulary concepts.
The model exhibits remarkable open-world properties, recognizing long-tailed categories and successfully identifying regions relevant to open-world text queries (e.g., color, affordance).
CLIP-FO3D outperforms existing methods in zero-shot and data-efficient learning tasks, demonstrating its effectiveness in scenarios with limited or no annotations. |
CLIP-FO3D's performance on recognizing smaller objects and fine-grained details could be further improved.
The computational cost of extracting dense pixel features from CLIP can be high, posing challenges for real-time applications. |
3d scene understanding, open-world learning, zero-shot learning, data-efficient learning, clip |
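The projection and distillation steps above can be sketched as follows: per-point targets are formed by averaging 2D features over the views in which a point is visible, and a 3D model's per-point features are then pulled toward those targets. The cosine-distance objective is an assumption, since the report only says "feature distillation", and the visibility mask stands in for a real camera projection.

```python
import numpy as np

def aggregate_views(view_feats, visible):
    """Average per-view features for each point over the views that see it.

    view_feats: (V, N, D) pixel features gathered for N points in V views.
    visible:    (V, N) boolean visibility mask.
    Returns (N, D) targets and an (N,) mask of points seen by at least one view.
    """
    w = visible.astype(float)[:, :, None]
    count = w.sum(axis=0)
    mean = (view_feats * w).sum(axis=0) / np.maximum(count, 1.0)
    return mean, count[:, 0] > 0

def cosine_distillation_loss(student, target, valid, eps=1e-8):
    """Mean (1 - cosine similarity) over points that have a valid target."""
    s = student / (np.linalg.norm(student, axis=1, keepdims=True) + eps)
    t = target / (np.linalg.norm(target, axis=1, keepdims=True) + eps)
    cos = (s * t).sum(axis=1)
    return float((1.0 - cos)[valid].mean())

view_feats = np.array([[[1.0, 0.0], [0.0, 2.0], [5.0, 5.0]],
                       [[3.0, 0.0], [9.0, 9.0], [5.0, 5.0]]])   # V=2, N=3, D=2
visible = np.array([[True, True, False],
                    [True, False, False]])

target, valid = aggregate_views(view_feats, visible)
loss_same = cosine_distillation_loss(target, target, valid)
loss_orthogonal = cosine_distillation_loss(target[:, ::-1], target, valid)
```

Points that no view observes receive no target and are excluded from the loss, so the 3D model is only supervised where 2D evidence exists.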
2303.04707
Report |
DiM: Distilling Dataset into Generative Model |
Kai Wang, Jianyang Gu, Daquan Zhou, Zheng Zhu, Wei Jiang, Yang You |
Dataset distillation reduces the network training cost by synthesizing small
and informative datasets from large-scale ones. Despite the success of the
recent dataset distillation algorithms, three drawbacks still limit their wider
application: i). the synthetic images perform poorly on large architectures;
ii). they need to be re-optimized when the distillation ratio changes; iii).
the limited diversity restricts the performance when the distillation ratio is
large. In this paper, we propose a novel distillation scheme to
\textbf{D}istill information of large train sets \textbf{i}nto generative
\textbf{M}odels, named DiM. Specifically, DiM learns to use a generative model
to store the information of the target dataset. During the distillation phase,
we minimize the differences in logits predicted by a model pool between real
and generated images. At the deployment stage, the generative model synthesizes
various training samples from random noises on the fly. Due to the simple yet
effective designs, the trained DiM can be directly applied to different
distillation ratios and large architectures without extra cost. We validate the
proposed DiM across 4 datasets and achieve state-of-the-art results on all of
them. To the best of our knowledge, we are the first to achieve higher accuracy
on complex architectures than simple ones, such as 75.1\% with ResNet-18 and
72.6\% with ConvNet-3 on ten images per class of CIFAR-10. Besides, DiM
outperforms previous methods by 10\% $\sim$ 22\% when images per class are 1
and 10 on the SVHN dataset. |
This paper proposes DiM, a novel dataset distillation method that distills information from a large training dataset into a generative model instead of synthetic images. |
Existing dataset distillation methods have limitations in cross-architecture generalization, redeployment efficiency (require redistillation when target data size changes), and often perform poorly on larger architectures. |
DiM employs a conditional GAN trained with a logits matching strategy using a pool of diverse models. This allows the generator to learn to synthesize discriminative images helpful for downstream tasks across different architectures. |
DiM significantly outperforms state-of-the-art methods on various benchmarks, especially for large architectures and low image-per-class settings.
It achieves superior cross-architecture generalization, effectively distilling knowledge from simple architectures to larger ones.
DiM demonstrates high redeployment efficiency, needing only a single training for various target data sizes, unlike previous methods. |
Generating training samples during deployment introduces extra computational effort compared to using static synthetic images.
Future work includes exploring lighter generative models and applying DiM to large-scale datasets and tasks like object detection and semantic segmentation. |
dataset distillation, generative adversarial networks, cross-architecture generalization, logits matching, model pooling |
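The logits matching idea in the DiM report above can be illustrated with a minimal sketch. It assumes that one model is sampled from the pool per step and that the loss is an L1 distance between batch-averaged logits; both choices, and the names `logits_matching_loss` and `pool_loss`, are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def logits_matching_loss(real_logits, fake_logits):
    """L1 distance between batch-averaged logits of real and generated images.

    Both arrays have shape (batch, num_classes). Averaging over the batch and
    using an L1 distance are assumptions made for this sketch.
    """
    real_mean = real_logits.mean(axis=0)
    fake_mean = fake_logits.mean(axis=0)
    return float(np.abs(real_mean - fake_mean).sum())

def pool_loss(model_pool, real_images, fake_images, rng):
    """Sample one model from the pool (an assumption) and compare its logits."""
    model = model_pool[rng.integers(len(model_pool))]
    return logits_matching_loss(model(real_images), model(fake_images))

# Toy usage: each "model" is a hypothetical linear classifier on flat inputs.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(12, 10)) for _ in range(3)]
model_pool = [lambda x, w=w: x @ w for w in weights]
real = rng.normal(size=(8, 12))
fake = rng.normal(size=(8, 12))
loss_same = pool_loss(model_pool, real, real, rng)  # identical batches
loss_diff = pool_loss(model_pool, real, fake, rng)  # different batches
```

In a real setup the generated batch would come from the conditional generator, and the loss would be backpropagated into the generator only.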
2303.04664
Report |
Centroid-centered Modeling for Efficient Vision Transformer Pre-training |
Xin Yan, Zuchao Li, Lefei Zhang, Bo Du, Dacheng Tao |
Masked Image Modeling (MIM) is a new self-supervised vision pre-training
paradigm using Vision Transformer (ViT). Previous works can be pixel-based or
token-based, using original pixels or discrete visual tokens from parametric
tokenizer models, respectively. Our proposed approach, \textbf{CCViT},
leverages k-means clustering to obtain centroids for image modeling without
supervised training of a tokenizer model. The centroids represent patch pixels
and index tokens and have the property of local invariance. The non-parametric
centroid tokenizer takes only seconds to create and is faster for token
inference. Specifically, we adopt patch masking and centroid replacement
strategies to construct corrupted inputs, and two stacked encoder blocks to
predict corrupted patch tokens and reconstruct original patch pixels.
Experiments show that the ViT-B model with only 300 epochs achieves 84.3\%
top-1 accuracy on ImageNet-1K classification and 51.6\% on ADE20K semantic
segmentation. Our approach achieves competitive results with BEiTv2 without
distillation training from other models and outperforms other methods such as
MAE. |
This paper introduces CCViT, a Vision Transformer pre-training framework based on centroid-centered MIM, which uses k-means clustering for efficient image modeling without training a separate tokenizer. |
Existing token-based MIM methods are computationally expensive due to the need for training a separate tokenizer, while pixel-based methods require a redundant decoder. CCViT addresses these limitations by leveraging centroids for both token and pixel learning. |
CCViT uses k-means clustering on a small subset of the pre-training data to obtain centroids, which act as both token indices and representative patch pixels. The model is pre-trained using blockwise masking and centroid replacement strategies, with a dual objective of predicting centroid tokens and reconstructing original patch pixels. |
CCViT achieves 84.3% top-1 accuracy on ImageNet-1K classification and 51.6% mIoU on ADE20K segmentation with only 300 epochs.
The centroid-based tokenizer is significantly faster to train and infer compared to parametric tokenizers used in previous works.
CCViT demonstrates better noise resistance compared to BEiT and BEiTv2, suggesting its ability to learn more robust and locally invariant representations. |
The study is limited to a base-size ViT model and 300 pre-training epochs due to resource constraints.
Future work will investigate scaling up the model and data size, and exploring knowledge distillation for potential improvement. |
vision transformer, self-supervised learning, masked image modeling, k-means clustering, centroid-based representation |
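The centroid tokenizer described in the CCViT report above can be sketched as follows. This is a minimal illustration, assuming plain Lloyd-style k-means on flattened patches and squared Euclidean distance; the helper names (`extract_patches`, `tokenize`, `kmeans`, `corrupt`) and the toy sizes are hypothetical and not taken from the paper.

```python
import numpy as np

def extract_patches(image, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)

def tokenize(patches, centroids):
    """Index of the nearest centroid for each patch (squared Euclidean)."""
    d = ((patches[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

def kmeans(data, k, iters=10, seed=0):
    """Plain k-means; initial centroids are k distinct random data points."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(iters):
        tokens = tokenize(data, centroids)
        for j in range(k):
            members = data[tokens == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids

def corrupt(patches, centroids, replace_mask):
    """Centroid replacement: swap selected patches for their centroid pixels."""
    out = patches.copy()
    tokens = tokenize(patches, centroids)
    out[replace_mask] = centroids[tokens[replace_mask]]
    return out, tokens

# Toy usage on a random 32x32 RGB image with 8x8 patches.
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
patches = extract_patches(image, patch=8)
centroids = kmeans(patches, k=4)
replace_mask = np.zeros(len(patches), dtype=bool)
replace_mask[::4] = True
corrupted, tokens = corrupt(patches, centroids, replace_mask)
```

Each centroid serves both as a token index (its row number) and as representative patch pixels (its values), which is what lets one codebook feed both the token-prediction and pixel-reconstruction objectives.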
2303.04587
Report |
A Prompt Log Analysis of Text-to-Image Generation Systems |
Yutong Xie, Zhaoying Pan, Jinge Ma, Luo Jie, Qiaozhu Mei |
Recent developments in large language models (LLM) and generative AI have
unleashed the astonishing capabilities of text-to-image generation systems to
synthesize high-quality images that are faithful to a given reference text,
known as a "prompt". These systems have immediately received lots of attention
from researchers, creators, and common users. Despite plentiful efforts to
improve the generative models, there is limited work on understanding the
information needs of the users of these systems at scale. We conduct the first
comprehensive analysis of large-scale prompt logs collected from multiple
text-to-image generation systems. Our work is analogous to analyzing the query
logs of Web search engines, a line of work that has made critical contributions
to the glory of the Web search industry and research. Compared with Web search
queries, text-to-image prompts are significantly longer, often organized into
special structures that consist of the subject, form, and intent of the
generation tasks and present unique categories of information needs. Users make
more edits within creation sessions, which present remarkable exploratory
patterns. There is also a considerable gap between the user-input prompts and
the captions of the images included in the open training data of the generative
models. Our findings provide concrete implications on how to improve
text-to-image generation systems for creation purposes. |
This paper presents the first comprehensive analysis of large-scale prompt logs from text-to-image generation systems (Midjourney, Stable Diffusion, LDMs), revealing user information needs and workflows. |
Understanding user information needs is crucial for improving text-to-image generation systems and facilitating AI-powered creativity. |
The authors analyze millions of user prompts, comparing them to Web search queries and image captions in training datasets, examining term frequencies, prompt structures, session patterns, and correlations with user ratings. |
Prompts typically describe the subject, form, and intent of the desired image.
Text-to-image prompts differ significantly from Web search queries, exhibiting greater length, exploratory patterns, and a new category of “exploratory prompts”.
Longer prompts and specific terms correlate with higher-rated generated images. |
The analysis relies on open datasets, potentially excluding private training data.
Further research is needed to develop tools and glossaries for extracting subject, form, and intent from prompts. |
text-to-image generation, ai-generated content (aigc), ai for creativity, prompt analysis, query log analysis |
2303.04248
Report |
TRACT: Denoising Diffusion Models with Transitive Closure Time-Distillation |
David Berthelot, Arnaud Autef, Jierui Lin, Dian Ang Yap, Shuangfei Zhai, Siyuan Hu, Daniel Zheng, Walter Talbott, Eric Gu |
Denoising Diffusion models have demonstrated their proficiency for generative
sampling. However, generating good samples often requires many iterations.
Consequently, techniques such as binary time-distillation (BTD) have been
proposed to reduce the number of network calls for a fixed architecture. In
this paper, we introduce TRAnsitive Closure Time-distillation (TRACT), a new
method that extends BTD. For single-step diffusion, TRACT improves FID by up to
2.4x on the same architecture, and achieves new single-step Denoising Diffusion
Implicit Models (DDIM) state-of-the-art FID (7.4 for ImageNet64, 3.8 for
CIFAR10). Finally, we tease apart the method through extended ablations. The
PyTorch implementation will be released soon. |
This paper proposes TRAnsitive Closure Time-distillation (TRACT), a novel method for distilling diffusion models to significantly improve the quality of generated samples within a few steps, ideally one. |
Generating high-quality samples from diffusion models typically demands a large number of inference steps, hindering their efficiency. TRACT addresses this limitation by enabling the generation of high-quality samples in just one or two steps. |
TRACT extends Binary Time-Distillation (BTD) by using a self-teaching approach with exponential moving average (EMA) to distill the output of a teacher model's inference from step t to t' with t' < t. The method is independent of specific noise schedules or samplers, demonstrating effectiveness with both variance preserving and variance exploding schedules, and with both DDIM and Runge-Kutta samplers. |
TRACT achieves state-of-the-art FID scores for single-step diffusion models, notably 7.4 for ImageNet64 and 3.8 for CIFAR10.
Ablations confirm the importance of the self-teaching EMA momentum and demonstrate that 2-phase distillation schedules generally outperform schedules with more phases.
Beyond time distillation, TRACT can also distill knowledge to other, potentially smaller, architectures with minimal performance loss. |
Self-teaching in TRACT might lead to less efficient objectives compared to supervised training in BTD.
Future work could explore distilling from even higher step count teachers, enabled by TRACT's flexible reduction in steps between training phases, potentially unlocking new applications for diffusion models. |
diffusion models, generative sampling, knowledge distillation, time distillation, self-teaching |
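The EMA component of the self-teaching approach in the TRACT report above can be sketched generically. The snippet assumes that the self-teacher's parameters track the student's through a standard exponential moving average; the dictionary-of-floats parameter format and the name `ema_update` are hypothetical simplifications.

```python
def ema_update(teacher, student, momentum):
    """Return new self-teacher parameters as an EMA of the student's.

    teacher, student: dicts mapping parameter names to floats (a toy stand-in
    for real weight tensors). momentum close to 1 means slow tracking.
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }

# Toy usage: the self-teacher drifts toward a fixed student over many steps.
teacher = {"w": 1.0}
student = {"w": 0.0}
one_step = ema_update(teacher, student, momentum=0.9)
tracked = dict(teacher)
for _ in range(200):
    tracked = ema_update(tracked, student, momentum=0.9)
```

The momentum value trades off target stability against how quickly the self-teacher reflects student improvements, which is the quantity the reported ablations examine.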
2303.04244
Report |
A Light-Weight Contrastive Approach for Aligning Human Pose Sequences |
Robert T. Collins |
We present a simple unsupervised method for learning an encoder mapping short
3D pose sequences into embedding vectors suitable for sequence-to-sequence
alignment by dynamic time warping. Training samples consist of temporal windows
of frames containing 3D body points such as mocap markers or skeleton joints. A
light-weight, 3-layer encoder is trained using a contrastive loss function that
encourages embedding vectors of augmented sample pairs to have cosine
similarity 1, and similarity 0 with all other samples in a minibatch. When
multiple scripted training sequences are available, temporal alignments
inferred from an initial round of training are harvested to extract additional,
cross-performance match pairs for a second phase of training to refine the
encoder. In addition to being simple, the proposed method is fast to train,
making it easy to adapt to new data using different marker sets or skeletal
joint layouts. Experimental results illustrate ease of use, transferability,
and utility of the learned embeddings for comparing and analyzing human
behavior sequences. |
This paper presents a simple unsupervised contrastive learning approach for aligning 3D human pose sequences. |
Temporal alignment of pose sequences is crucial for various tasks in human motion understanding, including studying variability, identifying repeated actions, transferring labels, and detecting anomalies. |
A lightweight 3-layer encoder is trained using a contrastive loss function based on cosine similarity. The method utilizes data augmentation and a two-phase training approach, refining the encoder with cross-performance matching pairs obtained through dynamic time warping (DTW). |
The method produces discriminative pose sequence representations, effectively aligning complex, multi-action sequences.
Learned representations demonstrate transferability across datasets with different sensors, point sets, and even movement styles (Taiji to Karate).
Experiments on the Penn Action dataset show state-of-the-art alignment performance, outperforming several previous methods. |
The current method relies on DTW, limiting its ability to handle out-of-order action sequences.
Manual correspondence mapping between different point sets is required, presenting a challenge for automation. |
pose sequence alignment, contrastive learning, dynamic time warping, unsupervised learning, human motion analysis |
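The contrastive objective in the pose-alignment report above pushes cosine similarity toward 1 for augmented pairs and toward 0 for all other pairs in a minibatch. A minimal sketch follows, assuming a squared-error penalty between the similarity matrix and the identity; the exact loss form used in the paper is not given here, so that choice is an assumption.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def contrastive_loss(z_a, z_b):
    """Mean squared gap between the similarity matrix and the identity.

    Row i of z_a and row i of z_b embed two augmentations of sample i, so the
    diagonal is pushed to 1 and every off-diagonal entry to 0.
    """
    sim = cosine_similarity_matrix(z_a, z_b)
    target = np.eye(len(z_a))
    return float(((sim - target) ** 2).mean())

# Toy usage: orthonormal embeddings satisfy the target exactly.
rng = np.random.default_rng(0)
perfect = contrastive_loss(np.eye(4), np.eye(4))
z = rng.normal(size=(4, 16))
noisy = contrastive_loss(z, z + 0.1 * rng.normal(size=(4, 16)))
```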
2303.04186
Report |
End-to-end Face-swapping via Adaptive Latent Representation Learning |
Chenhao Lin, Pengbin Hu, Chao Shen, Qian Li |
Taking full advantage of the excellent performance of StyleGAN, style
transfer-based face swapping methods have been extensively investigated
recently. However, these studies require separate face segmentation and
blending modules for successful face swapping, and the fixed selection of the
manipulated latent code in these works is reckless, thus degrading face
swapping quality, generalizability, and practicability. This paper proposes a
novel and end-to-end integrated framework for high resolution and attribute
preservation face swapping via Adaptive Latent Representation Learning.
Specifically, we first design a multi-task dual-space face encoder by sharing
the underlying feature extraction network to simultaneously complete the facial
region perception and face encoding. This encoder enables us to control the
face pose and attribute individually, thus enhancing the face swapping quality.
Next, we propose an adaptive latent codes swapping module to adaptively learn
the mapping between the facial attributes and the latent codes and select
effective latent codes for improved retention of facial attributes. Finally,
the initial face swapping image generated by StyleGAN2 is blended with the
facial region mask generated by our encoder to address the background blur
problem. Our framework, which integrates facial perception and blending into the
end-to-end training and testing process, achieves highly realistic
face swapping on wild faces without segmentation masks. Experimental results
demonstrate the superior performance of our approach over state-of-the-art
methods. |
This paper introduces FS-ALL, an end-to-end framework for high-resolution face swapping that uses adaptive latent representation learning to improve identity transfer and attribute preservation. |
Existing face swapping methods struggle to achieve high generalizability and realism simultaneously, often requiring separate segmentation and blending modules. Fixed latent code manipulation in these methods leads to low-quality swapping and poor attribute preservation. |
The framework uses a multi-task dual-space encoder (MDE) to perceive facial regions and map faces into separate pose and attribute latent spaces. An adaptive latent code swapping module (ALS) then selects and swaps effective latent codes based on a learnable network, enhancing attribute retention. Finally, StyleGAN2 generates the swapped face, refined by an internal blending module. |
FS-ALL demonstrates superior performance over state-of-the-art methods in both qualitative and quantitative evaluations.
The adaptive latent code swapping module improves identity transfer and attribute preservation compared to fixed latent code manipulation.
The multi-task dual-space encoder effectively maintains facial details and generates accurate masks for seamless blending. |
The decoupling of latent codes for attribute control on certain datasets requires further improvement.
The method currently relies on a pre-trained StyleGAN2 model, limiting its flexibility in generating specific face styles. |
face swapping, deepfake, adaptive latent representation learning, generative adversarial networks (gans), attribute preservation |
2303.04105
Report |
Your representations are in the network: composable and parallel adaptation for large scale models |
Yonatan Dukler, Alessandro Achille, Hao Yang, Varsha Vivek, Luca Zancato, Benjamin Bowman, Avinash Ravichandran, Charless Fowlkes, Ashwin Swaminathan, Stefano Soatto |
We propose InCA, a lightweight method for transfer learning that
cross-attends to any activation layer of a pre-trained model. During training,
InCA uses a single forward pass to extract multiple activations, which are
passed to external cross-attention adapters, trained anew and combined or
selected for downstream tasks. We show that, even when selecting a single
top-scoring adapter, InCA achieves performance comparable to full fine-tuning,
at a cost comparable to fine-tuning just the last layer. For example, with a
cross-attention probe 1.3% the size of a pre-trained ViT-L/16 model, we achieve
performance within 0.2% of the full fine-tuning paragon at a computational
training cost of 51% of the baseline, on average across 11 downstream
classification tasks. Unlike other forms of efficient adaptation, InCA does not
require backpropagating through the pre-trained model, thus leaving its
execution unaltered at both training and inference. The versatility of InCA is
best illustrated in fine-grained tasks, which may require accessing information
absent in the last layer but accessible in intermediate layer activations.
Since the backbone is fixed, InCA allows parallel ensembling as well as
parallel execution of multiple tasks. InCA achieves state-of-the-art
performance in the ImageNet-to-Sketch multi-task benchmark. |
This paper introduces InCA (Introspective-Cross-Attention), a novel transfer learning framework that adapts large pre-trained models by attaching lightweight cross-attention modules to intermediate activations. |
Full fine-tuning of large-scale models is computationally expensive and impractical for many real-world applications. InCA provides an efficient and versatile alternative for adapting these models to downstream tasks. |
InCA works by attaching and training multiple isolated, lightweight cross-attention adapters in parallel to different activations of a frozen pre-trained model. These adapters learn to extract task-relevant information from the activations, enabling efficient adaptation without modifying the original model. |
A single InCA adapter, only 1.3% the size of the full model, achieves comparable accuracy to full fine-tuning on 11 diverse downstream classification tasks.
InCA's isolated adaptation is highly computationally efficient, allowing adaptation of massive models like ViT-G/14 on a single GPU.
The method enables flexible learning scenarios, including multi-task learning and class-incremental learning, by combining or incrementally modifying learned adapters. |
The paper primarily focuses on image classification tasks. Further investigation is needed to evaluate InCA's performance on other vision tasks like object detection or segmentation.
While the paper explores ensembling adapters, more sophisticated ensembling techniques and their impact on robustness and out-of-distribution performance remain to be explored. |
transfer learning, parameter-efficient fine-tuning, cross-attention, intermediate representations, multi-task learning |
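The adapter design in the InCA report above, a small cross-attention module reading activations of a frozen backbone, can be sketched in a forward-only form. The single learnable query, the single attention head, and the layer names `block_6` and `block_12` are assumptions for illustration; training code is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class CrossAttentionAdapter:
    """One learnable query attends over the tokens of a frozen layer."""

    def __init__(self, dim, num_classes, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(dim)
        self.query = rng.normal(scale=s, size=(1, dim))  # learnable query
        self.w_k = rng.normal(scale=s, size=(dim, dim))
        self.w_v = rng.normal(scale=s, size=(dim, dim))
        self.w_out = rng.normal(scale=s, size=(dim, num_classes))

    def __call__(self, activations):
        """activations: (tokens, dim) taken from one frozen backbone layer."""
        k = activations @ self.w_k
        v = activations @ self.w_v
        attn = softmax(self.query @ k.T / np.sqrt(k.shape[1]))  # (1, tokens)
        pooled = attn @ v                                       # (1, dim)
        return (pooled @ self.w_out)[0]                         # (num_classes,)

# Toy usage: one forward pass yields activations from several layers, and each
# adapter reads its own layer independently, so adapters can run in parallel.
rng = np.random.default_rng(1)
layer_activations = {
    "block_6": rng.normal(size=(197, 64)),
    "block_12": rng.normal(size=(197, 64)),
}
adapters = {
    name: CrossAttentionAdapter(dim=64, num_classes=10, seed=i)
    for i, name in enumerate(layer_activations)
}
logits = {name: adapters[name](acts) for name, acts in layer_activations.items()}
```

Because the adapters only consume activations, no gradient needs to flow through the backbone, which matches the report's claim that the pre-trained model's execution is left unaltered.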
2303.04001
Report |
ELODIN: Naming Concepts in Embedding Spaces |
Rodrigo Mello, Filipe Calegario, Geber Ramalho |
Despite recent advancements, the field of text-to-image synthesis still
suffers from lack of fine-grained control. Using only text, it remains
challenging to deal with issues such as concept coherence and concept
contamination. We propose a method to enhance control by generating specific
concepts that can be reused throughout multiple images, effectively expanding
natural language with new words that can be combined much like a painter's
palette. Unlike previous contributions, our method does not copy visuals from
input data and can generate concepts through text alone. We perform a set of
comparisons that finds our method to be a significant improvement over
text-only prompts. |
Introduces ELODIN, a method for generating and using 'named concepts' (namecons), custom keywords associated with specific visual concepts in the embedding space of text-to-image models, enhancing control over concept coherence and contamination in generated images. |
Addresses limitations of text-based prompts in achieving precise visual consistency and preventing unintended interactions between concepts in text-to-image synthesis. |
ELODIN searches the embedding space by optimizing an embedding vector through backpropagation, guided by a similarity loss (e.g., text-image or face similarity), and associates it with a user-defined keyword (namecon). Namecons are then integrated into prompts, replacing guiding concepts' embeddings during inference. |
ELODIN reduces concept contamination (e.g., maintaining distinct colors) compared to text-only prompts.
ELODIN improves concept coherence (e.g., preserving consistent facial features) across multiple generations.
Quantitative analysis using face similarity metrics shows higher coherence for images generated with namecons. |
Limited exploration of loss functions beyond text-image and face similarity.
Further research needed on applicability to non-visual modalities and tasks like segmentation/object detection. |
text-to-image synthesis, concept naming, embedding space, fine-grained control, concept coherence |
2303.03991
Report |
OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception |
Xiaofeng Wang, Zheng Zhu, Wenbo Xu, Yunpeng Zhang, Yi Wei, Xu Chi, Yun Ye, Dalong Du, Jiwen Lu, Xingang Wang |
Semantic occupancy perception is essential for autonomous driving, as
automated vehicles require a fine-grained perception of the 3D urban
structures. However, existing relevant benchmarks lack diversity in urban
scenes, and they only evaluate front-view predictions. Towards a comprehensive
benchmarking of surrounding perception algorithms, we propose OpenOccupancy,
which is the first surrounding semantic occupancy perception benchmark. In the
OpenOccupancy benchmark, we extend the large-scale nuScenes dataset with dense
semantic occupancy annotations. Previous annotations rely on LiDAR points
superimposition, where some occupancy labels are missed due to sparse LiDAR
channels. To mitigate the problem, we introduce the Augmenting And Purifying
(AAP) pipeline to ~2x densify the annotations, where ~4000 human hours are
involved in the labeling process. Besides, camera-based, LiDAR-based and
multi-modal baselines are established for the OpenOccupancy benchmark.
Furthermore, considering the complexity of surrounding occupancy perception
lies in the computational burden of high-resolution 3D predictions, we propose
the Cascade Occupancy Network (CONet) to refine the coarse prediction, which
improves performance by ~30% relative to the baseline. We hope the
OpenOccupancy benchmark will boost the development of surrounding occupancy
perception algorithms. |
This paper presents OpenOccupancy, the first benchmark designed for surrounding semantic occupancy perception in driving scenarios. |
Surrounding semantic occupancy perception is crucial for autonomous driving as it enables a fine-grained understanding of 3D urban structures, which is essential for safe navigation. Existing benchmarks lack diversity in urban scenes and focus on front-view predictions, limiting their effectiveness in evaluating surrounding perception algorithms. |
The benchmark extends the nuScenes dataset with dense semantic occupancy annotations using the Augmenting And Purifying (AAP) pipeline. It proposes camera-based, LiDAR-based, and multi-modal baselines, and introduces the Cascade Occupancy Network (CONet) to improve efficiency and accuracy of high-resolution occupancy predictions. |
Surrounding occupancy perception paradigm outperforms single-view methods.
Multi-modal baseline significantly enhances performance by adaptively fusing camera and LiDAR data.
CONet improves efficiency and accuracy by refining low-resolution predictions. |
Benchmark currently relies on the nuScenes dataset and could be expanded to include more diverse driving scenarios.
Future work can explore more sophisticated fusion methods and architectures for improved performance. |
autonomous driving, semantic occupancy perception, benchmarking, multi-modal fusion, 3d perception |
2303.03932
Report |
FFT-based Dynamic Token Mixer for Vision |
Yuki Tatsunami, Masato Taki |
Multi-head-self-attention (MHSA)-equipped models have achieved notable
performance in computer vision. Their computational complexity grows
quadratically with the number of pixels in input feature maps, resulting in slow
processing, especially when dealing with high-resolution images. New types of
token-mixer are proposed as an alternative to MHSA to circumvent this problem:
an FFT-based token-mixer involves global operations similar to MHSA but with
lower computational complexity. However, despite its attractive properties, the
FFT-based token-mixer has not been carefully examined in terms of its
compatibility with the rapidly evolving MetaFormer architecture. Here, we
propose a novel token-mixer called Dynamic Filter and novel image recognition
models, DFFormer and CDFFormer, to close the gaps above. The results of image
classification and downstream tasks, analysis, and visualization show that our
models are helpful. Notably, their throughput and memory efficiency when
dealing with high-resolution image recognition is remarkable. Our results
indicate that Dynamic Filter is one of the token-mixer options that should be
seriously considered. The code is available at
https://github.com/okojoalg/dfformer |
This paper introduces the "dynamic filter," a novel mechanism for dynamically generating global filters in vision models, and proposes two new MetaFormer classes: DFFormer and CDFFormer. |
This work addresses the limitations of traditional global filters and aims to close the performance gap between FFT-based models and state-of-the-art vision models, particularly in handling high-resolution images. |
The authors develop a dynamic filter that generates global filters based on image content using an MLP. This dynamic filter is then integrated into a MetaFormer architecture, resulting in DFFormer and a hybrid model with convolutions, CDFFormer. Extensive experiments are conducted on ImageNet-1K, ADE20K, and COCO benchmarks. |
DFFormer and CDFFormer achieve competitive performance compared to state-of-the-art MHSA-free models on ImageNet-1K image classification.
The models demonstrate superior performance in downstream tasks like semantic segmentation (ADE20K) and object detection (COCO) compared to ResNet and PoolFormer backbones.
DFFormer and CDFFormer exhibit significantly faster processing and lower memory usage than MHSA-based models at high resolutions, making them beneficial for tasks requiring high-resolution inputs. |
The current implementation of the dynamic filter does not inherently support arbitrary resolutions due to its reliance on element-wise products.
Further investigation is needed to understand the observed differences in representation learning between FFT-based token-mixers and MHSA within hierarchical architectures. |
computer vision, vision transformers, metaformer, fft, dynamic filter |
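The dynamic filter in the report above generates a global frequency-domain filter from image content and applies it by element-wise product. A minimal sketch follows, assuming the filter is a softmax-weighted mix of `K` basis filters with one set of mixing weights per image, predicted by a two-layer MLP on globally averaged features; the per-image weighting and all names are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_filter(x, basis_filters, w1, w2):
    """Apply a content-dependent global filter in the frequency domain.

    x: (H, W, C) real feature map.
    basis_filters: (K, H, W // 2 + 1, C) complex basis filters.
    w1, w2: MLP weights mapping pooled features (C,) to K mixing weights.
    """
    pooled = x.mean(axis=(0, 1))                       # (C,)
    hidden = np.maximum(pooled @ w1, 0.0)              # ReLU
    coeff = softmax(hidden @ w2)                       # (K,), sums to 1
    filt = np.tensordot(coeff, basis_filters, axes=1)  # (H, W // 2 + 1, C)
    freq = np.fft.rfft2(x, axes=(0, 1))
    return np.fft.irfft2(freq * filt, s=x.shape[:2], axes=(0, 1))

# Toy usage: all-ones basis filters act as the identity in frequency space.
rng = np.random.default_rng(0)
H, W, C, K = 8, 8, 4, 3
x = rng.normal(size=(H, W, C))
w1 = rng.normal(size=(C, 16))
w2 = rng.normal(size=(16, K))
identity_basis = np.ones((K, H, W // 2 + 1, C), dtype=complex)
y_identity = dynamic_filter(x, identity_basis, w1, w2)
shape = (K, H, W // 2 + 1, C)
random_basis = rng.normal(size=shape) + 1j * rng.normal(size=shape)
y = dynamic_filter(x, random_basis, w1, w2)
```

The basis filters have a fixed spatial size tied to `H` and `W`, which illustrates why the element-wise product does not directly support arbitrary input resolutions.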
2303.03887
Report |
How to Construct Energy for Images? Denoising Autoencoder Can Be Energy Based Model |
Weili Zeng |
Energy-based models parameterize the unnormalized log-probability of data
samples, but there is a lack of guidance on how to construct the "energy". In
this paper, we propose a Denoising-EBM which decomposes the image energy into
"semantic energy" and "texture energy". We define the "semantic energy" in the
latent space of DAE to model the high-level representations, and define the
pixel-level reconstruction error for denoising as "texture energy". Inspired by
score-based models, our model utilizes multi-scale noisy samples for
maximum-likelihood training and it outputs a vector instead of a scalar for
exploring a larger set of functions during optimization. After training, the
semantics are first synthesized by fast MCMC through "semantic energy", and
then the pixel-level refinement of semantic image will be performed to generate
perfect samples based on "texture energy". Ultimately, our model can outperform
most EBMs in image generation. We also demonstrate that Denoising-EBM has
top performance among EBMs for out-of-distribution detection. |
This paper proposes Denoising-EBM, a novel energy-based model framework for images, decomposing image energy into 'semantic energy' learned in the latent space of a Denoising Autoencoder (DAE) and 'texture energy' defined by pixel-level denoising reconstruction error. |
Existing energy-based models lack guidance on constructing physically meaningful energy functions for images and often face computational challenges in training and sampling due to high dimensionality. This work addresses these limitations by leveraging DAEs to learn energy functions in a more efficient and interpretable manner. |
Denoising-EBM utilizes a DAE with a U-Net structure and a semantic decoder. It models the latent distribution of noisy real data in the DAE's latent space as 'semantic energy' and defines 'texture energy' using denoising reconstruction error. The model is trained using maximum likelihood with a two-stage MCMC sampling strategy for efficient and stable optimization. |
Denoising-EBM outperforms most existing EBMs in image generation tasks on datasets like CIFAR-10 and CelebA, achieving comparable results to GAN-based methods.
The two-stage sampling strategy allows for faster generation than traditional EBMs and score-based models, as demonstrated by significantly reduced sampling time on CIFAR-10.
Denoising-EBM demonstrates top performance among EBMs in out-of-distribution detection tasks, indicating its ability to accurately estimate data likelihood and penalize non-data-like regions. |
The performance of Denoising-EBM is sensitive to the choice of noise levels and interval density during training, requiring careful tuning.
Future work includes generalizing the energy function to continuous time and applying the framework to larger-scale images. |
energy-based models, denoising autoencoders, image generation, out-of-distribution detection, mcmc sampling |
2303.03808
Report |
Multiscale Tensor Decomposition and Rendering Equation Encoding for View Synthesis |
Kang Han, Wei Xiang |
Rendering novel views from captured multi-view images has made considerable
progress since the emergence of the neural radiance field. This paper aims to
further advance the quality of view synthesis by proposing a novel approach
dubbed the neural radiance feature field (NRFF). We first propose a multiscale
tensor decomposition scheme to organize learnable features so as to represent
scenes from coarse to fine scales. We demonstrate many benefits of the proposed
multiscale representation, including more accurate scene shape and appearance
reconstruction, and faster convergence compared with the single-scale
representation. Instead of encoding view directions to model view-dependent
effects, we further propose to encode the rendering equation in the feature
space by employing the anisotropic spherical Gaussian mixture predicted from
the proposed multiscale representation. The proposed NRFF improves
state-of-the-art rendering results by over 1 dB in PSNR on both the NeRF and
NSVF synthetic datasets. A significant improvement has also been observed on
the real-world Tanks & Temples dataset. Code can be found at
https://github.com/imkanghan/nrff. |
This paper introduces NRFF, a novel approach for view synthesis using neural radiance feature fields, employing multiscale tensor decomposition and rendering equation encoding for enhanced quality. |
View synthesis methods often compromise between compact representation and computational efficiency. NRFF addresses this by combining the strengths of neural and learnable feature representations. |
NRFF represents scenes at multiple scales using tensor decomposition, allowing for detailed reconstruction. It then encodes the rendering equation in feature space using anisotropic spherical Gaussians, enabling effective modeling of view-dependent effects. |
NRFF surpasses state-of-the-art methods by over 1 dB in PSNR on the NeRF and NSVF synthetic datasets, with a significant improvement also observed on the real-world Tanks & Temples dataset.
The multiscale representation results in faster convergence and better rendering quality compared to single-scale methods.
Encoding the rendering equation in feature space proves superior to traditional view direction encoding methods, leading to more accurate reflections and illumination effects. |
NRFF currently utilizes a larger MLP compared to some learnable feature methods, impacting training and testing time.
Multiscale representation increases computational overhead due to interpolation weight calculations, which could be addressed through optimization and GPU texture memory. |
view synthesis, neural rendering, rendering equation, multiscale representation, tensor decomposition |
2303.03667
Report |
Run, Don't Walk: Chasing Higher FLOPS for Faster Neural Networks |
Jierun Chen, Shiu-hong Kao, Hao He, Weipeng Zhuo, Song Wen, Chul-Ho Lee, S. -H. Gary Chan |
To design fast neural networks, many works have been focusing on reducing the
number of floating-point operations (FLOPs). We observe that such reduction in
FLOPs, however, does not necessarily lead to a similar level of reduction in
latency. This mainly stems from inefficiently low floating-point operations per
second (FLOPS). To achieve faster networks, we revisit popular operators and
demonstrate that such low FLOPS is mainly due to frequent memory access of the
operators, especially the depthwise convolution. We hence propose a novel
partial convolution (PConv) that extracts spatial features more efficiently, by
cutting down redundant computation and memory access simultaneously. Building
upon our PConv, we further propose FasterNet, a new family of neural networks,
which attains substantially higher running speed than others on a wide range of
devices, without compromising on accuracy for various vision tasks. For
example, on ImageNet-1k, our tiny FasterNet-T0 is 2.8×, 3.3×, and
2.4× faster than MobileViT-XXS on GPU, CPU, and ARM processors,
respectively, while being 2.9% more accurate. Our large FasterNet-L achieves
impressive 83.5% top-1 accuracy, on par with the emerging Swin-B, while
having 36% higher inference throughput on GPU, as well as saving 37%
compute time on CPU. Code is available at
https://github.com/JierunChen/FasterNet. |
This paper introduces FasterNet, a novel family of neural networks designed for high-speed inference on various devices, and a new operator called Partial Convolution (PConv) as its core building block. |
Many neural network designs focus on reducing FLOPs, but this doesn't always translate to reduced latency due to inefficiently low FLOPS caused by frequent memory access in operators like Depthwise Convolution. |
The authors propose PConv, which extracts spatial features by applying a regular convolution on a subset of input channels while leaving others untouched. This reduces both FLOPs and memory access. FasterNet leverages PConv with Pointwise Convolutions in an inverted residual block structure, optimizing normalization and activation layer placement for further latency reduction. |
PConv achieves significantly higher FLOPS (operations per second) than depthwise and group convolution, while requiring fewer FLOPs (total operations) than a regular convolution.
PConv, combined with Pointwise Convolution, effectively approximates a regular convolution for feature transformation.
FasterNet consistently outperforms state-of-the-art networks in terms of accuracy-latency/throughput trade-off on ImageNet-1k classification and COCO object detection/instance segmentation tasks. |
The stride of PConv is limited to 1 to ensure spatial resolution alignment.
FasterNet's receptive field might be limited by its convolutional architecture. |
neural networks, efficient inference, convolutional neural networks, partial convolution, fasternet |
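The PConv operator described in the FasterNet report above can be illustrated with a minimal pure-Python sketch. It assumes a 3x3 kernel, stride 1, and zero padding, and it convolves the first `c_p` channels while copying the rest through unchanged. The function names, the toy tensor sizes, and the partial ratio of 1/4 are illustrative assumptions, not the authors' implementation.

```python
def conv3x3_same(x, weight):
    """Regular 3x3 convolution with zero padding and stride 1.

    x: list of C_in feature maps, each an H x W list of lists.
    weight: weight[o][i] is the 3x3 kernel from input channel i to output o.
    """
    h, w = len(x[0]), len(x[0][0])
    out = []
    for kernels in weight:
        fmap = [[0.0] * w for _ in range(h)]
        for i, kernel in enumerate(kernels):
            for r in range(h):
                for c in range(w):
                    for dr in (-1, 0, 1):
                        for dc in (-1, 0, 1):
                            rr, cc = r + dr, c + dc
                            if 0 <= rr < h and 0 <= cc < w:
                                fmap[r][c] += kernel[dr + 1][dc + 1] * x[i][rr][cc]
        out.append(fmap)
    return out


def partial_conv(x, weight, c_p):
    """Convolve the first c_p channels; copy the remaining channels unchanged."""
    untouched = [[row[:] for row in fmap] for fmap in x[c_p:]]
    return conv3x3_same(x[:c_p], weight) + untouched


def conv_flops(h, w, k, c_in, c_out):
    """Multiply-accumulate count of a dense k x k convolution (stride 1)."""
    return h * w * k * k * c_in * c_out


# Toy input: 8 channels of 4x4, partial ratio 1/4 (an illustrative choice).
c, c_p, h, w = 8, 2, 4, 4
x = [[[float(ch + r + col) for col in range(w)] for r in range(h)] for ch in range(c)]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
zero = [[0, 0, 0] for _ in range(3)]
weight = [[identity if i == o else zero for i in range(c_p)] for o in range(c_p)]
y = partial_conv(x, weight, c_p)
flop_ratio = conv_flops(h, w, 3, c_p, c_p) / conv_flops(h, w, 3, c, c)
```

With a partial ratio of 1/4, the convolved slice needs 1/16 of the multiply-accumulates of a dense convolution over all channels, because the cost scales with the product of input and output channel counts.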
2303.03595
Report |
LoGoNet: Towards Accurate 3D Object Detection with Local-to-Global Cross-Modal Fusion |
Xin Li, Tao Ma, Yuenan Hou, Botian Shi, Yuchen Yang, Youquan Liu, Xingjiao Wu, Qin Chen, Yikang Li, Yu Qiao, Liang He |
LiDAR-camera fusion methods have shown impressive performance in 3D object
detection. Recent advanced multi-modal methods mainly perform global fusion,
where image features and point cloud features are fused across the whole scene.
Such practice lacks fine-grained region-level information, yielding suboptimal
fusion performance. In this paper, we present the novel Local-to-Global fusion
network (LoGoNet), which performs LiDAR-camera fusion at both local and global
levels. Concretely, the Global Fusion (GoF) of LoGoNet is built upon previous
literature, while we exclusively use point centroids to more precisely
represent the position of voxel features, thus achieving better cross-modal
alignment. As to the Local Fusion (LoF), we first divide each proposal into
uniform grids and then project these grid centers to the images. The image
features around the projected grid points are sampled to be fused with
position-decorated point cloud features, maximally utilizing the rich
contextual information around the proposals. The Feature Dynamic Aggregation
(FDA) module is further proposed to achieve information interaction between
these locally and globally fused features, thus producing more informative
multi-modal features. Extensive experiments on both Waymo Open Dataset (WOD)
and KITTI datasets show that LoGoNet outperforms all state-of-the-art 3D
detection methods. Notably, LoGoNet ranks 1st on Waymo 3D object detection
leaderboard and obtains 81.02 mAPH (L2) detection performance. It is noteworthy
that, for the first time, the detection performance on three classes surpasses
80 APH (L2) simultaneously. Code will be available at
https://github.com/sankin97/LoGoNet. |
This paper proposes LoGoNet, a novel local-to-global fusion network for 3D object detection in autonomous driving, improving LiDAR-camera fusion by incorporating both global and local feature interactions. |
Existing LiDAR-camera fusion methods for 3D object detection primarily focus on global fusion, lacking fine-grained local information crucial for accurate object localization and classification, especially in complex scenes. |
LoGoNet utilizes a global fusion module for scene-level fusion, a local fusion module for proposal-level fusion with position encoding, and a feature dynamic aggregation module for interaction between local and global features. |
Achieves state-of-the-art performance on Waymo Open Dataset and KITTI dataset, ranking 1st on Waymo 3D object detection leaderboard with 81.02 mAPH (L2).
Demonstrates the effectiveness of local-to-global fusion by surpassing previous best methods, including BEVFusion, by a significant margin.
Shows consistent improvements across different object classes (vehicle, pedestrian, cyclist) and difficulty levels on both benchmarks. |
The method utilizes a frozen image branch, potentially limiting further performance improvements from joint optimization.
Future work can explore extending the local-to-global fusion strategy to other multi-modal tasks and incorporating temporal information for more robust detection in dynamic environments. |
3d object detection, lidar-camera fusion, local-to-global fusion, autonomous driving, waymo open dataset |
2303.03405
Report |
Neural Style Transfer for Vector Graphics |
Valeria Efimova, Artyom Chebykin, Ivan Jarsky, Evgenii Prosvirnin, Andrey Filchenkov |
Neural style transfer draws researchers' attention, but the interest focuses
on bitmap images. Various models have been developed for bitmap image
generation both online and offline with arbitrary and pre-trained styles.
However, style transfer between vector images has hardly been
considered. Our research shows that applying standard content and style losses
changes the drawing style of a vector image only insignificantly, because the structure of
vector primitives differs a lot from pixels. To handle this problem, we
introduce new loss functions. We also develop a new method based on
differentiable rasterization that uses these loss functions and can change the
color and shape parameters of the content image corresponding to the drawing of
the style image. Qualitative experiments demonstrate the effectiveness of the
proposed VectorNST method compared with the state-of-the-art neural style
transfer approaches for bitmap images and the only existing approach for
stylizing vector images, DiffVG. Although the proposed model does not achieve
the quality and smoothness of style transfer between bitmap images, we consider
our work an important early step in this area. VectorNST code and demo service
are available at https://github.com/IzhanVarsky/VectorNST. |
This paper introduces VectorNST, a novel neural style transfer method specifically designed for vector graphics, addressing the limitations of existing bitmap-based approaches. |
Style transfer has largely focused on bitmap images, neglecting the unique characteristics and advantages of vector graphics. VectorNST offers a solution for scalable style transfer without the drawbacks of rasterization and vectorization. |
The method leverages differentiable rasterization (DiffVG) to enable backpropagation through the vector image representation. It employs a modified LPIPS loss for style capture and a novel contour loss to preserve content fidelity during style transfer. |
VectorNST successfully transfers artistic styles to vector images, preserving sharp contours and object shapes.
Qualitative comparisons demonstrate superior performance over existing vector and raster-based style transfer methods.
User study confirms that VectorNST produces more aesthetically pleasing stylized vector images. |
The method inherits limitations from DiffVG, such as the inability to optimize vector topology (number of curves).
The feature extractor (VGG-19) is trained on bitmap images, limiting its ability to fully capture vector image characteristics. |
neural style transfer, vector graphics, differentiable rasterization, perceptual loss, contour loss |
2303.03361
Report |
Nerflets: Local Radiance Fields for Efficient Structure-Aware 3D Scene Representation from 2D Supervision |
Xiaoshuai Zhang, Abhijit Kundu, Thomas Funkhouser, Leonidas Guibas, Hao Su, Kyle Genova |
We address efficient and structure-aware 3D scene representation from images.
Nerflets are our key contribution -- a set of local neural radiance fields that
together represent a scene. Each nerflet maintains its own spatial position,
orientation, and extent, within which it contributes to panoptic, density, and
radiance reconstructions. By leveraging only photometric and inferred panoptic
image supervision, we can directly and jointly optimize the parameters of a set
of nerflets so as to form a decomposed representation of the scene, where each
object instance is represented by a group of nerflets. During experiments with
indoor and outdoor environments, we find that nerflets: (1) fit and approximate
the scene more efficiently than traditional global NeRFs, (2) allow the
extraction of panoptic and photometric renderings from arbitrary views, and (3)
enable tasks rare for NeRFs, such as 3D panoptic segmentation and interactive
editing. |
This paper introduces Nerflets, a novel 3D scene representation composed of multiple local neural radiance fields, to efficiently represent 3D scenes with structure awareness from 2D image supervision. |
Existing methods for 3D scene representation from images often require 3D ground truth, lack efficiency, or don't handle object instances well. Nerflets address these issues, offering a compact, efficient, and comprehensive solution. |
Each Nerflet is a small NeRF with spatial pose, influencing a local region. They are jointly optimized using photometric and 2D panoptic segmentation losses to decompose the scene. A greedy merge algorithm then groups Nerflets into object instances. |
Nerflets achieve state-of-the-art performance for panoptic novel view synthesis on KITTI-360, outperforming methods relying on 3D supervision.
They demonstrate superior novel view synthesis quality on ScanNet compared to baselines, capturing object details better due to explicit parameter allocation.
Nerflets enable interactive scene editing by directly manipulating individual Nerflets, leading to cleaner results than methods without explicit scene decomposition. |
Nerflets currently do not model dynamic scene content, which could be a potential future direction.
The assumption of a fixed number of Nerflets regardless of scene complexity might be limiting. Dynamically adjusting the number of Nerflets based on scene complexity could be explored. |
3d scene representation, neural radiance fields, nerflets, panoptic segmentation, scene editing |
2303.03003
Report |
Efficient Large-scale Scene Representation with a Hybrid of High-resolution Grid and Plane Features |
Yuqi Zhang, Guanying Chen, Shuguang Cui |
Existing neural radiance fields (NeRF) methods for large-scale scene modeling
require days of training using multiple GPUs, hindering their applications in
scenarios with limited computing resources. Although fast-optimization NeRF
variants based on explicit dense or hash grid features have been proposed,
their effectiveness has mainly been demonstrated in object-scale scene
representation. In this paper, we point out that the low feature resolution in
explicit representation is the bottleneck for large-scale unbounded scene
representation. To address this problem, we introduce a new and efficient
hybrid feature representation for NeRF that fuses the 3D hash-grids and
high-resolution 2D dense plane features. Compared with the dense-grid
representation, the resolution of a dense 2D plane can be scaled up more
efficiently. Based on this hybrid representation, we propose a fast
optimization NeRF variant, called GP-NeRF, that achieves better rendering
results while maintaining a compact model size. Extensive experiments on
multiple large-scale unbounded scene datasets show that our model can converge
in 1.5 hours using a single GPU while achieving results comparable to or even
better than the existing method that requires about one day's training with 8
GPUs. |
This paper introduces GP-NeRF, a novel neural radiance field variant that uses a hybrid feature representation of 3D hash-grids and high-resolution 2D dense plane features for efficient large-scale unbounded scene modeling. |
Existing large-scale scene modeling methods often require days of training on multiple GPUs, hindering their practicality for users with limited computational resources, while faster explicit-feature variants are bottlenecked by low feature resolution at this scale. |
The method combines a space contraction strategy for compact unbounded scene representation with a hybrid feature representation. This representation leverages the efficiency of hash-grids and enhances it with multi-resolution dense plane features to mitigate collision issues, allowing for high-resolution scene representation with low memory consumption. The model then uses a lightweight MLP to regress density and color from the interpolated hybrid features. |
GP-NeRF achieves comparable or better rendering quality than state-of-the-art Mega-NeRF while being significantly faster, converging in 1.5 hours on a single GPU compared to a day on 8 GPUs for Mega-NeRF.
The proposed hybrid feature representation outperforms baselines using only dense-grids, hash-grids, or TensoRF representations in terms of rendering quality and efficiency.
Ablation studies confirm the effectiveness of plane features in enhancing the hybrid representation, improving accuracy with minimal parameter increase. |
While significantly faster, GP-NeRF still doesn't achieve real-time scene reconstruction.
The method lacks explicit modeling of dynamic objects, limiting its application to static scenes. |
neural radiance fields, large-scale scene modeling, 3d reconstruction, hybrid feature representation, fast nerf optimization |
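The GP-NeRF report above mentions a space contraction strategy for unbounded scenes without giving its form. As a hedged illustration, the sketch below uses the contraction popularized by mip-NeRF 360, which leaves points inside the unit ball unchanged and squashes everything else into a ball of radius 2. Whether GP-NeRF uses exactly this function is an assumption, and the name `contract` is hypothetical.

```python
import math


def contract(point):
    """Map an unbounded 3D point into a ball of radius 2.

    Assumed form (mip-NeRF 360 style): identity for ||x|| <= 1,
    otherwise (2 - 1/||x||) * x / ||x||.
    """
    norm = math.sqrt(sum(v * v for v in point))
    if norm <= 1.0:
        return tuple(point)
    scale = (2.0 - 1.0 / norm) / norm
    return tuple(scale * v for v in point)


near = contract((0.5, 0.0, 0.0))     # inside the unit ball: unchanged
mid = contract((2.0, 0.0, 0.0))      # mapped to radius 2 - 1/2 = 1.5
far = contract((1000.0, 0.0, 0.0))   # mapped just inside radius 2
```

A bounded domain of this kind is what lets fixed-resolution hash-grid and plane features index an unbounded scene: distant content is compressed into the outer shell of the ball.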
2303.02995
Report |
HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention |
Shijie Geng, Jianbo Yuan, Yu Tian, Yuxiao Chen, Yongfeng Zhang |
The success of large-scale contrastive vision-language pretraining (CLIP) has
benefited both visual recognition and multimodal content understanding. The
concise design brings CLIP the advantage in inference efficiency against other
vision-language models with heavier cross-attention fusion layers, making it a
popular choice for a wide spectrum of downstream tasks. However, CLIP does not
explicitly capture the hierarchical nature of high-level and fine-grained
semantics conveyed in images and texts, which is arguably critical to
vision-language understanding and reasoning. To this end, we equip both the
visual and language branches in CLIP with hierarchy-aware attentions, namely
Hierarchy-aware CLIP (HiCLIP), to progressively discover semantic hierarchies
layer-by-layer from both images and texts in an unsupervised manner. As a
result, such hierarchical aggregation significantly improves the cross-modal
alignment. To demonstrate the advantages of HiCLIP, we conduct qualitative
analysis on its unsupervised hierarchy induction during inference, as well as
extensive quantitative experiments on both visual recognition and
vision-language downstream tasks. |
The paper proposes HiCLIP, a model that incorporates hierarchy-aware attention into CLIP to explicitly capture hierarchical structures in vision and language. |
CLIP lacks explicit mechanisms to capture the hierarchical nature of semantics in images and texts, which is crucial for multimodal understanding and reasoning. |
HiCLIP introduces "hierarchy-aware attention" by utilizing affinity scores between adjacent patches/tokens to guide the attention mechanism. These scores, evolving layer-by-layer, encourage grouping spatially and semantically similar elements, progressively building hierarchical representations. |
HiCLIP significantly outperforms CLIP family models on zero-shot image-text retrieval.
HiCLIP achieves superior performance on visual recognition tasks, especially when combined with self-supervised learning (HiDeCLIP).
Visualization of HiCLIP's hierarchy induction process demonstrates its capability to discover meaningful visual and language hierarchies in an unsupervised manner. |
The current unsupervised hierarchy induction relies on manually set thresholds for visual parsing.
The experiments use at most 30 million image-text pairs because of computational limitations. |
multimodal learning, contrastive learning, vision-language models, hierarchy-aware attention, unsupervised hierarchy induction |
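The hierarchy-aware attention in the HiCLIP report above can be illustrated for a 1D token sequence. The sketch assumes a Tree-Transformer-style formulation: the mask entry for tokens i and j is the product of the adjacent affinities between them, affinities can only be kept or raised from one layer to the next, and the mask scales ordinary softmax attention. Whether HiCLIP uses exactly these formulas, and how they extend to 2D patch grids, is an assumption here, and all function names are hypothetical.

```python
import math


def group_mask(adjacent_affinity):
    """C[i][j] = product of the adjacent affinities on the path from i to j."""
    n = len(adjacent_affinity) + 1
    mask = [[1.0] * n for _ in range(n)]
    for i in range(n):
        prod = 1.0
        for j in range(i + 1, n):
            prod *= adjacent_affinity[j - 1]
            mask[i][j] = mask[j][i] = prod
    return mask


def update_affinity(previous, current):
    """Assumed layer-wise update a + (1 - a) * a_hat, which never lowers a."""
    return [a + (1.0 - a) * a_hat for a, a_hat in zip(previous, current)]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def hierarchy_aware_attention(scores, adjacent_affinity):
    """Scale row-wise softmax attention by the group mask."""
    mask = group_mask(adjacent_affinity)
    return [
        [m * p for m, p in zip(mask_row, softmax(score_row))]
        for mask_row, score_row in zip(mask, scores)
    ]


# Three tokens: the first two form a group, the third is separated from them.
weights = hierarchy_aware_attention([[0.0] * 3 for _ in range(3)], [1.0, 0.0])
```

In this toy case the first two tokens attend to each other but not to the third, which is the grouping behaviour the report describes; raising the second affinity in a later layer would merge all three tokens into one group.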
2303.02984
Report |
Learning multi-scale local conditional probability models of images |
Zahra Kadkhodaie, Florentin Guth, Stéphane Mallat, Eero P Simoncelli |
Deep neural networks can learn powerful prior probability models for images,
as evidenced by the high-quality generations obtained with recent score-based
diffusion methods. But the means by which these networks capture complex global
statistical structure, apparently without suffering from the curse of
dimensionality, remain a mystery. To study this, we incorporate diffusion
methods into a multi-scale decomposition, reducing dimensionality by assuming a
stationary local Markov model for wavelet coefficients conditioned on
coarser-scale coefficients. We instantiate this model using convolutional
neural networks (CNNs) with local receptive fields, which enforce both the
stationarity and Markov properties. Global structures are captured using a CNN
with receptive fields covering the entire (but small) low-pass image. We test
this model on a dataset of face images, which are highly non-stationary and
contain large-scale geometric structures. Remarkably, denoising,
super-resolution, and image synthesis results all demonstrate that these
structures can be captured with significantly smaller conditioning
neighborhoods than required by a Markov model implemented in the pixel domain.
Our results show that score estimation for large complex images can be reduced
to low-dimensional Markov conditional models across scales, alleviating the
curse of dimensionality. |
This paper presents a low-dimensional image probability model based on a multi-scale decomposition with local Markov conditional probabilities of wavelet coefficients. |
The work aims to address the curse of dimensionality in score-based diffusion models for images, investigating how these models capture global structure despite high data dimensionality. |
The model factorizes the image probability distribution into conditional probabilities of wavelet coefficients conditioned on coarser scales, assuming stationarity and locality. These conditional scores are estimated using CNNs with local receptive fields. |
Multi-scale denoising with the proposed model significantly outperforms conventional pixel-domain denoisers, especially for high noise levels.
The model captures long-range dependencies in face images even with small receptive fields (as small as 9x9) in the wavelet domain.
Super-resolution and synthesis experiments demonstrate that the model generates more realistic face images compared to models based on local Markov assumptions in the pixel domain. |
The dimensionality of conditioning neighborhoods, while reduced, is still high and requires further reduction.
Further research is needed to extend the model to more diverse datasets beyond centered faces. |
diffusion models, wavelet transform, markov random fields, image denoising, super-resolution |
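The multi-scale factorization in the report above rests on splitting an image into a coarse low-pass image plus wavelet detail coefficients at each scale. The sketch below shows that split with an orthonormal 2D Haar transform, chosen here only for simplicity; the wavelet family, the function names, and the toy image are assumptions, not the authors' setup. The learned part of the model, the conditional score networks, is not shown.

```python
def haar_analysis(image):
    """One level of an orthonormal 2D Haar transform.

    Returns (low, details): the half-resolution low-pass image and a tuple
    of three detail bands, each of size H/2 x W/2.
    """
    h, w = len(image), len(image[0])
    low, col_diff, row_diff, diag = (
        [[0.0] * (w // 2) for _ in range(h // 2)] for _ in range(4)
    )
    for r in range(0, h, 2):
        for c in range(0, w, 2):
            a, b = image[r][c], image[r][c + 1]
            e, f = image[r + 1][c], image[r + 1][c + 1]
            low[r // 2][c // 2] = (a + b + e + f) / 2
            col_diff[r // 2][c // 2] = (a - b + e - f) / 2
            row_diff[r // 2][c // 2] = (a + b - e - f) / 2
            diag[r // 2][c // 2] = (a - b - e + f) / 2
    return low, (col_diff, row_diff, diag)


def haar_synthesis(low, details):
    """Exact inverse of haar_analysis."""
    col_diff, row_diff, diag = details
    h, w = len(low), len(low[0])
    image = [[0.0] * (2 * w) for _ in range(2 * h)]
    for r in range(h):
        for c in range(w):
            l, x, y, z = low[r][c], col_diff[r][c], row_diff[r][c], diag[r][c]
            image[2 * r][2 * c] = (l + x + y + z) / 2
            image[2 * r][2 * c + 1] = (l - x + y - z) / 2
            image[2 * r + 1][2 * c] = (l + x - y - z) / 2
            image[2 * r + 1][2 * c + 1] = (l - x - y + z) / 2
    return image


def decompose(image, levels):
    """Return the coarsest low-pass image and the detail bands, coarse to fine."""
    bands = []
    low = image
    for _ in range(levels):
        low, details = haar_analysis(low)
        bands.append(details)
    return low, bands[::-1]


def reconstruct(low, bands):
    for details in bands:
        low = haar_synthesis(low, details)
    return low


toy = [[float(4 * r + c) for c in range(4)] for r in range(4)]
coarse, bands = decompose(toy, levels=2)
restored = reconstruct(coarse, bands)
```

Under the report's description, the global model only has to handle the small `coarse` image, while each set of detail bands is modeled conditionally on the low-pass image at the same scale using a local neighbourhood.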
2303.02943
Report |
Adaptive Texture Filtering for Single-Domain Generalized Segmentation |
Xinhui Li, Mingjia Li, Yaxing Wang, Chuan-Xian Ren, Xiaojie Guo |
Domain generalization in semantic segmentation aims to alleviate the
performance degradation on unseen domains through learning domain-invariant
features. Existing methods diversify images in the source domain by adding
complex or even abnormal textures to reduce the sensitivity to domain-specific
features. However, these approaches depend heavily on the richness of the
texture bank, and training them can be time-consuming. In contrast to importing
textures arbitrarily or augmenting styles randomly, we focus on the single
source domain itself to achieve generalization. In this paper, we present a
novel adaptive texture filtering mechanism to suppress the influence of texture
without using augmentation, thus eliminating the interference of
domain-specific features. Further, we design a hierarchical guidance
generalization network equipped with structure-guided enhancement modules,
whose purpose is to learn domain-invariant generalized knowledge. Extensive
experiments together with ablation studies on widely-used datasets are
conducted to verify the effectiveness of the proposed model, and reveal its
superiority over other state-of-the-art alternatives. |
This paper proposes a novel adaptive filtering mechanism (AFM) and a hierarchical guidance generalization network (HGGN) for single-domain generalization in semantic segmentation, aiming to learn domain-invariant features by suppressing domain-specific textures. |
Domain generalization in semantic segmentation is crucial for real-world applications like autonomous driving where models need to generalize well to unseen domains. |
The AFM adaptively filters out textures from images to generate content-dependent representations, while the HGGN with structure-guided enhancement modules learns domain-invariant features under contour supervision. |
The proposed method outperforms state-of-the-art methods on benchmark datasets (GTA5, SYNTHIA, Cityscapes, BDD-100K, Mapillary) for domain generalization in semantic segmentation.
The adaptive texture filtering in AFM proves to be more effective than fixed filtering levels.
The hierarchical design of HGGN with contour supervision significantly improves the generalization ability compared to using only the backbone network. |
The method's performance relies on the pre-trained texture filtering generator, which might be limited by the diversity of the training data.
Future work could explore incorporating other domain-invariant features beyond texture and shape information. |
domain generalization, semantic segmentation, texture suppression, adaptive filtering, hierarchical guidance |
2303.02936
Report |
UniHCP: A Unified Model for Human-Centric Perceptions |
Yuanzheng Ci, Yizhou Wang, Meilin Chen, Shixiang Tang, Lei Bai, Feng Zhu, Rui Zhao, Fengwei Yu, Donglian Qi, Wanli Ouyang |
Human-centric perceptions (e.g., pose estimation, human parsing, pedestrian
detection, person re-identification, etc.) play a key role in industrial
applications of visual models. While specific human-centric tasks have their
own relevant semantic aspect to focus on, they also share the same underlying
semantic structure of the human body. However, few works have attempted to
exploit such homogeneity and design a general-purpose model for human-centric
tasks. In this work, we revisit a broad range of human-centric tasks and unify
them in a minimalist manner. We propose UniHCP, a Unified Model for
Human-Centric Perceptions, which unifies a wide range of human-centric tasks in
a simplified end-to-end manner with the plain vision transformer architecture.
With large-scale joint training on 33 human-centric datasets, UniHCP can
outperform strong baselines on several in-domain and downstream tasks by direct
evaluation. When adapted to a specific task, UniHCP achieves new SOTAs on a
wide range of human-centric tasks, e.g., 69.8 mIoU on CIHP for human parsing,
86.18 mA on PA-100K for attribute prediction, 90.3 mAP on Market1501 for ReID,
and 85.8 JI on CrowdHuman for pedestrian detection, performing better than
specialized models tailored for each task. |
This paper proposes UniHCP, a unified model for human-centric perceptions that can simultaneously handle pose estimation, semantic part segmentation, pedestrian detection, ReID, and person attribute recognition using a single architecture. |
A unified model for human-centric tasks can exploit the shared underlying semantic structure of the human body to improve performance, enable fast adaptation to new tasks, and decrease memory cost in large-scale multitask system deployment. |
UniHCP utilizes a plain vision transformer as a shared encoder-decoder architecture and introduces task-specific queries and a task-guided interpreter to handle the diversity of data and output structures across different tasks. |
UniHCP achieves state-of-the-art performance on nine out of twelve human-centric benchmark datasets after fine-tuning.
The model performs strongly even under direct evaluation on datasets used in pretraining and shows promising transferability to unseen datasets.
Ablation studies show the effectiveness of the weight-sharing design and the data-efficient transfer ability with prompt tuning. |
The person ReID task requires finetuning for optimal performance due to its disparity from other tasks.
The model raises ethical concerns regarding potential identity information leaking in ReID, which requires careful handling and limited release of the pretrained model. |
human-centric perception, unified model, vision transformer, multitask learning, weight sharing |
2303.02688
Report |
Text2Face: A Multi-Modal 3D Face Model |
Will Rowan, Patrik Huber, Nick Pears, Andrew Keeling |
We present the first 3D morphable modelling approach, whereby 3D face shape
can be directly and completely defined using a textual prompt. Building on work
in multi-modal learning, we extend the FLAME head model to a common
image-and-text latent space. This allows for direct 3D Morphable Model (3DMM)
parameter generation and therefore shape manipulation from textual
descriptions. Our method, Text2Face, has many applications, for example
generating police photofits, where the input is already in natural language. It
further enables multi-modal 3DMM image fitting to sketches and sculptures, as
well as images. |
Presents Text2Face, the first 3D morphable modelling approach that directly generates 3D face shapes from textual descriptions. |
Automates 3D face creation from text, enabling applications like generating police photofits directly from witness descriptions, multi-modal 3DMM fitting (sketches, sculptures), and improved initialization for model-to-image fitting. |
Trains a deep MLP (Text2Face) to map CLIP embeddings to FLAME model parameters, using a dataset of synthetic faces with corresponding CLIP embeddings and FLAME parameters extracted via DECA. |
Successfully generates 3D faces with identity, expression, and detail from text prompts.
Demonstrates multi-modal fitting capabilities, generating 3D faces from sketches and sculptures.
Enables texture mapping from DALL-E generated images onto the generated 3D meshes. |
Potential for inherited gender and racial biases from CLIP impacting 3D face generation.
Limited exploration of text prompts for fine-grained shape manipulation. |
3d morphable model, text-to-3d, clip, face generation, multi-modal learning |
2303.02584
Report |
Super-Resolution Neural Operator |
Min Wei, Xuesong Zhang |
We propose Super-resolution Neural Operator (SRNO), a deep operator learning
framework that can resolve high-resolution (HR) images at arbitrary scales from
the low-resolution (LR) counterparts. Treating the LR-HR image pairs as
continuous functions approximated with different grid sizes, SRNO learns the
mapping between the corresponding function spaces. From the perspective of
approximation theory, SRNO first embeds the LR input into a higher-dimensional
latent representation space, trying to capture sufficient basis functions, and
then iteratively approximates the implicit image function with a kernel
integral mechanism, followed by a final dimensionality reduction step to
generate the RGB representation at the target coordinates. The key
characteristics distinguishing SRNO from prior continuous SR works are: 1) the
kernel integral in each layer is efficiently implemented via the Galerkin-type
attention, which possesses non-local properties in the spatial domain and
therefore benefits the grid-free continuum; and 2) the multilayer attention
architecture allows for the dynamic latent basis update, which is crucial for
SR problems to "hallucinate" high-frequency information from the LR image.
Experiments show that SRNO outperforms existing continuous SR methods in terms
of both accuracy and running time. Our code is at
https://github.com/2y7c3/Super-Resolution-Neural-Operator |
This paper proposes Super-Resolution Neural Operator (SRNO), a deep operator learning framework to resolve high-resolution (HR) images at arbitrary scales from low-resolution (LR) counterparts. |
Existing deep learning-based SR methods often require training separate models for each scaling factor, proving inefficient for arbitrary scale requirements. SRNO addresses this limitation by learning the mapping between continuous function spaces representing LR-HR image pairs. |
SRNO follows a three-step methodology: 1) Lifting: embeds the LR input into a higher-dimensional latent space using a CNN encoder and spatial interpolation; 2) Iterative kernel integral: approximates the image function with a kernel integral mechanism, implemented efficiently via Galerkin-type attention to capture non-local spatial relationships; 3) Projection: reduces the final dimensionality to generate the RGB representation at the target coordinates. |
SRNO outperforms existing continuous SR methods in both reconstruction accuracy and running time, irrespective of the encoder used.
The Galerkin-type attention mechanism in SRNO contributes to its superior function approximation capability.
SRNO effectively captures global image structures, leading to better visual quality with fewer artifacts compared to methods like LIIF and LTE. |
The impact of varying the number of basis functions and iterative updating layers requires further investigation.
Exploring alternative sampling strategies beyond random and sequential methods could potentially yield further performance improvements. |
super-resolution, neural operator, deep learning, galerkin-type attention, continuous image representation |
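The Galerkin-type attention named in the SRNO report above is a softmax-free attention that first multiplies normalized keys and values, then applies the queries. The sketch below follows the commonly cited form z = Q (LN(K)^T LN(V)) / n for n coordinates with d-dimensional latents. Learned projections, multiple heads, and learnable layer-norm parameters are omitted, and the helper names and toy inputs are assumptions for illustration.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def layer_norm(rows, eps=1e-5):
    """Normalize each row (one latent vector) to zero mean and unit variance."""
    out = []
    for row in rows:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def galerkin_attention(q, k, v):
    """Softmax-free attention: z = Q (LN(K)^T LN(V)) / n."""
    n = len(q)
    # The d x d context costs O(n d^2), so the cost is linear in n.
    context = matmul(transpose(layer_norm(k)), layer_norm(v))
    return [[val / n for val in row] for row in matmul(q, context)]


# Toy latents at n = 4 coordinates with d = 3 channels.
q = [[0.1 * (i + j) for j in range(3)] for i in range(4)]
k = [[0.2 * (i * j + 1) for j in range(3)] for i in range(4)]
v = [[0.3 * (i - j) for j in range(3)] for i in range(4)]
z = galerkin_attention(q, k, v)
```

Because there is no softmax, the product can be evaluated as Q (K^T V) instead of (Q K^T) V, which avoids the n x n attention matrix and keeps the cost manageable when the number of query coordinates grows with the target resolution.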
2303.02416
Report |
PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling |
Yuan Liu, Songyang Zhang, Jiacheng Chen, Kai Chen, Dahua Lin |
Masked Image Modeling (MIM) has achieved promising progress with the advent
of Masked Autoencoders (MAE) and BEiT. However, subsequent works have
complicated the framework with new auxiliary tasks or extra pre-trained models,
inevitably increasing computational overhead. This paper undertakes a
fundamental analysis of MIM from the perspective of pixel reconstruction, which
examines the input image patches and reconstruction target, and highlights two
critical but previously overlooked bottlenecks. Based on this analysis, we
propose a remarkably simple and effective method, PixMIM, that entails
two strategies: 1) filtering the high-frequency components from the
reconstruction target to de-emphasize the network's focus on texture-rich
details and 2) adopting a conservative data transform strategy to alleviate the
problem of missing foreground in MIM training. PixMIM can be easily
integrated into most existing pixel-based MIM approaches (i.e., using raw images
as reconstruction target) with negligible additional computation. Without bells
and whistles, our method consistently improves three MIM approaches, MAE,
ConvMAE, and LSMAE, across various downstream tasks. We believe this effective
plug-and-play method will serve as a strong baseline for self-supervised
learning and provide insights for future improvements of the MIM framework.
Code and models are available at
https://github.com/open-mmlab/mmselfsup/tree/dev-1.x/configs/selfsup/pixmim. |
This paper presents PixMIM, a simple yet effective method for improving masked image modeling (MIM) by focusing on pixel reconstruction. |
Existing MIM methods either complicate the framework with extra tasks or rely on computationally expensive pre-trained models for target generation. This paper addresses these limitations by revisiting the fundamental aspects of pixel reconstruction. |
PixMIM introduces two key strategies: (1) Filtering high-frequency components from the reconstruction target to prioritize learning of low-frequency patterns like shapes and global structures. (2) Replacing Random Resized Crop (RRC) with Simple Resized Crop (SRC) to preserve more semantically important foreground information in input patches. |
PixMIM consistently improves the performance of three baselines (MAE, ConvMAE, LSMAE) on various downstream tasks like ImageNet classification, COCO object detection, and ADE20K semantic segmentation.
The method enhances model robustness against domain shifts, evidenced by superior performance on out-of-distribution ImageNet variants.
PixMIM leads to models with a stronger shape bias, aligning them more closely with human visual perception. |
Current experiments primarily focus on ViT-B architecture; further evaluation on larger models is needed.
The bandwidth of the low-pass filter is a hyperparameter that might require tuning for different datasets and input resolutions. Investigating a self-adaptive bandwidth is a potential future direction. |
self-supervised learning, masked image modeling, pixel reconstruction, vision transformer, representation learning |
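The first PixMIM strategy above, filtering high-frequency components out of the reconstruction target, can be illustrated with a minimal sketch. The ideal circular mask, the `radius` parameter, and the function name `low_pass_filter` are assumptions made for illustration; the summary does not specify the exact filter used.

```python
import numpy as np

def low_pass_filter(image, radius):
    """Zero out all frequencies farther than `radius` from the spectrum centre.

    `image` is a 2D grayscale array. The ideal (hard-edged) circular mask
    is an illustrative choice, not necessarily the filter used in the paper.
    """
    h, w = image.shape
    # Move the zero-frequency (DC) component to the centre of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    mask = dist <= radius
    # Undo the shift and return to the pixel domain; drop numerical imaginary noise.
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * mask)))

rng = np.random.default_rng(0)
image = rng.random((32, 32))               # stand-in for an image patch
target = low_pass_filter(image, radius=8)  # smoothed reconstruction target
```

A smaller `radius` removes more texture detail from the target, which is the bandwidth hyperparameter the limitations above refer to.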
2303.02151
Report |
Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners |
Renrui Zhang, Xiangfei Hu, Bohao Li, Siyuan Huang, Hanqiu Deng, Hongsheng Li, Yu Qiao, Peng Gao |
Visual recognition in low-data regimes requires deep neural networks to learn
generalized representations from limited training samples. Recently, CLIP-based
methods have shown promising few-shot performance, benefiting from
contrastive language-image pre-training. We then ask whether more diverse
pre-training knowledge can be cascaded to further assist few-shot
representation learning. In this paper, we propose CaFo, a Cascade of
Foundation models that incorporates diverse prior knowledge of various
pre-training paradigms for better few-shot learning. Our CaFo incorporates
CLIP's language-contrastive knowledge, DINO's vision-contrastive knowledge,
DALL-E's vision-generative knowledge, and GPT-3's language-generative
knowledge. Specifically, CaFo works by 'Prompt, Generate, then Cache'. Firstly,
we leverage GPT-3 to produce textual inputs for prompting CLIP with rich
downstream linguistic semantics. Then, we generate synthetic images via DALL-E
to expand the few-shot training data without any manual effort. Finally, we
introduce a learnable cache model to adaptively blend the predictions from CLIP
and DINO. By such collaboration, CaFo can fully unleash the potential of
different pre-training methods and unify them to achieve state-of-the-art performance in
few-shot classification. Code is available at
https://github.com/ZrrSkywalker/CaFo. |
This paper proposes CaFo, a cascade of foundation models (CLIP, DINO, DALL-E, GPT-3) for few-shot image classification. |
Few-shot learning requires good generalization from limited data, and leveraging diverse pre-trained knowledge can enhance this ability. |
CaFo employs a 'Prompt, Generate, then Cache' pipeline: 1) GPT-3 generates semantic prompts for CLIP. 2) DALL-E synthesizes additional training images. 3) A learnable cache model fuses predictions from CLIP and DINO based on distribution similarity. |
CaFo achieves state-of-the-art few-shot classification performance on 11 datasets.
Zero-shot CaFo, trained only on DALL-E generated images, demonstrates competitive results.
Ablations confirm the contribution of each component and the effectiveness of the adaptive inference strategy. |
The current work explores a limited set of foundation models.
Future work could investigate incorporating more diverse pre-trained models, such as masked-generative or 3D models. |
few-shot learning, foundation models, vision-language pre-training, data augmentation, knowledge ensemble |
2303.02091
Report |
Delicate Textured Mesh Recovery from NeRF via Adaptive Surface Refinement |
Jiaxiang Tang, Hang Zhou, Xiaokang Chen, Tianshu Hu, Errui Ding, Jingdong Wang, Gang Zeng |
Neural Radiance Fields (NeRF) have constituted a remarkable breakthrough in
image-based 3D reconstruction. However, their implicit volumetric
representations differ significantly from the widely-adopted polygonal meshes
and lack support from common 3D software and hardware, making their rendering
and manipulation inefficient. To overcome this limitation, we present a novel
framework that generates textured surface meshes from images. Our approach
begins by efficiently initializing the geometry and view-dependency decomposed
appearance with a NeRF. Subsequently, a coarse mesh is extracted, and an
iterative surface refining algorithm is developed to adaptively adjust both
vertex positions and face density based on re-projected rendering errors. We
jointly refine the appearance with geometry and bake it into texture images for
real-time rendering. Extensive experiments demonstrate that our method achieves
superior mesh quality and competitive rendering quality. |
This paper proposes NeRF2Mesh, a framework for reconstructing textured surface meshes from multi-view RGB images, enabling compatibility with common 3D hardware and software. |
NeRF's implicit volumetric representations are inefficient for rendering and manipulation, lacking support from standard 3D tools. Polygonal meshes address these limitations, but their direct reconstruction poses challenges. |
NeRF2Mesh first initializes geometry and decomposed appearance (diffuse and specular) using a grid-based NeRF. It then extracts a coarse mesh, followed by iterative refinement of vertex positions and face density based on re-projected rendering errors. Finally, the appearance is baked into texture images. |
NeRF2Mesh achieves superior mesh quality with accurate thin structure reconstruction compared to previous methods.
The method results in relatively smaller mesh sizes due to adaptive face density adjustment.
The framework achieves competitive rendering quality and enables real-time rendering with standard 3D software and hardware. |
The current method bakes lighting into textures, limiting relighting capabilities.
The relatively small appearance network struggles with complex view-dependent effects, impacting surface quality in those regions. |
neural radiance fields, surface reconstruction, mesh generation, texture baking, 3d reconstruction |
2303.02001
Report |
Zero-shot Object Counting |
Jingyi Xu, Hieu Le, Vu Nguyen, Viresh Ranjan, Dimitris Samaras |
Class-agnostic object counting aims to count object instances of an arbitrary
class at test time. It is challenging but also enables many potential
applications. Current methods require human-annotated exemplars as inputs which
are often unavailable for novel categories, especially for autonomous systems.
Thus, we propose zero-shot object counting (ZSC), a new setting where only the
class name is available during test time. Such a counting system does not
require human annotators in the loop and can operate automatically. Starting
from a class name, we propose a method that can accurately identify the optimal
patches which can then be used as counting exemplars. Specifically, we first
construct a class prototype to select the patches that are likely to contain
the objects of interest, namely class-relevant patches. Furthermore, we
introduce a model that can quantitatively measure how suitable an arbitrary
patch is as a counting exemplar. By applying this model to all the candidate
patches, we can select the most suitable patches as exemplars for counting.
Experimental results on a recent class-agnostic counting dataset, FSC-147,
validate the effectiveness of our method. Code is available at
https://github.com/cvlab-stonybrook/zero-shot-counting |
This supplementary material provides additional experiments and analyses for the zero-shot object counting method presented in the main paper. |
This supplementary material aims to enhance the understanding and validate the effectiveness of the proposed zero-shot object counting method. |
The authors conduct ablation studies on different aspects of their method including exploring different methods for acquiring candidate patches, comparing the use of predicted counting errors versus objectness scores for selecting exemplars, and evaluating the performance of using correlation matching with a generated prototype as an alternative to patch selection. |
Using a combination of randomly sampled patches and RPN proposals as candidate patches yields the best performance.
Selecting counting exemplars based on predicted counting error outperforms using objectness scores.
The proposed patch selection method achieves better results compared to directly using a generated prototype for correlation matching. |
The study primarily focuses on the FSC-147 dataset, and further evaluation on other datasets is needed.
Future work could explore incorporating additional information, such as object scale and shape, to further improve exemplar selection. |
zero-shot learning, object counting, exemplar selection, patch selection, error prediction |
2303.01559
Report |
Improving GAN Training via Feature Space Shrinkage |
Haozhe Liu, Wentian Zhang, Bing Li, Haoqian Wu, Nanjun He, Yawen Huang, Yuexiang Li, Bernard Ghanem, Yefeng Zheng |
Due to the outstanding capability for data generation, Generative Adversarial
Networks (GANs) have attracted considerable attention in unsupervised learning.
However, training GANs is difficult, since the training distribution is dynamic
for the discriminator, leading to unstable image representation. In this paper,
we address the problem of training GANs from a novel perspective, i.e.,
robust image classification. Motivated by studies on robust image
representation, we propose a simple yet effective module, namely AdaptiveMix,
for GANs, which shrinks the regions of training data in the image
representation space of the discriminator. Considering it is intractable to
directly bound feature space, we propose to construct hard samples and narrow
down the feature distance between hard and easy samples. The hard samples are
constructed by mixing a pair of training images. We evaluate the effectiveness
of our AdaptiveMix with widely-used and state-of-the-art GAN architectures. The
evaluation results demonstrate that our AdaptiveMix can facilitate the training
of GANs and effectively improve the image quality of generated samples. We also
show that our AdaptiveMix can be further applied to image classification and
Out-Of-Distribution (OOD) detection tasks, by equipping it with
state-of-the-art methods. Extensive experiments on seven publicly available
datasets show that our method effectively boosts the performance of baselines.
The code is publicly available at
https://github.com/WentianZhang-ML/AdaptiveMix. |
This paper introduces AdaptiveMix, a novel module designed to enhance the training stability of Generative Adversarial Networks (GANs) by shrinking the feature space representation within the discriminator. |
Training GANs is inherently challenging due to the dynamic nature of the training distribution, often leading to unstable image representation and low-quality generated samples. This work addresses this challenge by enhancing the robustness of image representation in the discriminator. |
AdaptiveMix operates by constructing hard samples through the linear mixing of training images and then minimizes the feature distance between these hard samples and easy (original) training samples. This process effectively shrinks the regions occupied by training data in the discriminator’s feature space, enhancing representation robustness. |
AdaptiveMix significantly improves the performance of various GAN architectures, including DCGAN and StyleGAN-V2, achieving lower FID scores and generating higher-quality images.
The module exhibits effectiveness across different datasets, particularly showcasing substantial improvements when trained on a limited number of samples.
Beyond image generation, AdaptiveMix demonstrates applicability to image classification and Out-Of-Distribution (OOD) detection tasks, consistently boosting the performance of baseline models. |
The paper primarily focuses on linear mixing for hard sample generation; exploring other mixing strategies is a potential avenue for future work.
While the paper provides theoretical analysis connecting AdaptiveMix to Lipschitz continuity under the L1 norm, extending this analysis to other distance metrics could further strengthen the theoretical grounding. |
generative adversarial networks, image generation, robust image classification, out-of-distribution detection, feature space shrinkage |
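The AdaptiveMix idea described above (mix two training images into a hard sample, then pull its feature toward the features of the easy samples) can be sketched as follows. The exact loss form is an assumption: here the hard sample's feature is compared, under an L1 distance, with the same linear mix of the two easy samples' features. The toy feature extractors and all names are hypothetical.

```python
import numpy as np

def mix_images(x_a, x_b, lam):
    """Build a hard sample as a linear mix of two training samples."""
    return lam * x_a + (1.0 - lam) * x_b

def shrinkage_loss(features_fn, x_a, x_b, lam):
    """Mean L1 distance between the hard sample's feature and mixed easy features.

    Assumed form for illustration; the paper's exact objective may differ.
    """
    f_hard = features_fn(mix_images(x_a, x_b, lam))
    f_easy = lam * features_fn(x_a) + (1.0 - lam) * features_fn(x_b)
    return np.abs(f_hard - f_easy).mean()

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 16))                    # toy "discriminator" weights
linear_features = lambda x: W @ x               # linear map: mixing commutes with it
nonlinear_features = lambda x: np.tanh(W @ x)   # non-linear map: loss is non-zero

x_a, x_b = rng.normal(size=16), rng.normal(size=16)
loss_linear = shrinkage_loss(linear_features, x_a, x_b, lam=0.3)
loss_nonlinear = shrinkage_loss(nonlinear_features, x_a, x_b, lam=0.3)
```

For a linear feature map the loss is zero by construction, so minimizing it on a non-linear discriminator pushes the features toward behaving linearly between training samples.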
2303.01494
Report |
Image as Set of Points |
Xu Ma, Yuqian Zhou, Huan Wang, Can Qin, Bin Sun, Chang Liu, Yun Fu |
What is an image and how to extract latent features? Convolutional Networks
(ConvNets) consider an image as organized pixels in a rectangular shape and
extract features via convolutional operation in local region; Vision
Transformers (ViTs) treat an image as a sequence of patches and extract
features via attention mechanism in a global range. In this work, we introduce
a straightforward and promising paradigm for visual representation, which is
called Context Clusters. Context clusters (CoCs) view an image as a set of
unorganized points and extract features via a simplified clustering algorithm. In
detail, each point includes the raw feature (e.g., color) and positional
information (e.g., coordinates), and a simplified clustering algorithm is
employed to group and extract deep features hierarchically. Our CoCs are
convolution- and attention-free, and only rely on a clustering algorithm for
spatial interaction. Owing to the simple design, we show CoCs endow gratifying
interpretability via the visualization of clustering process. Our CoCs aim at
providing a new perspective on image and visual representation, which may enjoy
broad applications in different domains and exhibit profound insights. Even
though we are not targeting SOTA performance, CoCs still achieve comparable or
even better results than ConvNets or ViTs on several benchmarks. Codes are
available at: https://github.com/ma-xu/Context-Cluster. |
This paper introduces Context Clusters (CoCs), a novel visual representation paradigm that uses a simplified clustering algorithm to extract features from images viewed as sets of unorganized points. |
This approach offers a new perspective on image understanding and feature extraction, distinct from Convolutional Networks (ConvNets) and Vision Transformers (ViTs). It provides promising interpretability through visualization of the clustering process and demonstrates strong generalization ability across different data domains. |
CoCs treat images as point clouds, with each point containing color and positional information. A hierarchical clustering algorithm groups these points into clusters, aggregates features within each cluster, and dispatches the aggregated information back to individual points. This process facilitates context-aware feature learning. |
CoCs achieve comparable or superior performance to ConvNets and ViTs on ImageNet-1K classification, demonstrating the effectiveness of clustering for visual representation.
Visualization of the clustering process reveals that CoCs can effectively group semantically similar image regions, highlighting their interpretability.
CoCs exhibit strong generalization ability by achieving promising results on 3D point cloud classification (ScanObjectNN), object detection and instance segmentation (MS COCO), and semantic segmentation (ADE20K). |
The fixed-center clustering strategy, adopted for computational efficiency, may limit the model's ability to capture complex relationships compared to dynamic center updates.
The current CoC architecture requires compromises to accommodate the rectangular feature map format of common detection and segmentation heads, potentially limiting its performance for those tasks. |
visual representation learning, clustering algorithms, image understanding, point cloud analysis, interpretability |
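The cluster, aggregate, and dispatch cycle described for Context Clusters can be sketched in a few lines. This is a simplified illustration under stated assumptions: centers are fixed as means of contiguous chunks of points, similarity is cosine similarity, and a sigmoid of the similarity weights both aggregation and dispatch. None of these details is confirmed by the summary above, and the function name is hypothetical.

```python
import numpy as np

def context_cluster_step(points, num_centers):
    """One clustering step on `points` of shape (n, d); returns (updated, assignment)."""
    n, _ = points.shape
    # Fixed centers: mean of contiguous chunks of points (assumed stand-in for pooling).
    chunks = np.array_split(np.arange(n), num_centers)
    centers = np.stack([points[idx].mean(axis=0) for idx in chunks])
    # Cosine similarity between every point and every center.
    pn = points / (np.linalg.norm(points, axis=1, keepdims=True) + 1e-8)
    cn = centers / (np.linalg.norm(centers, axis=1, keepdims=True) + 1e-8)
    sim = pn @ cn.T                              # shape (n, num_centers)
    assign = sim.argmax(axis=1)                  # each point joins its closest center
    out = points.copy()
    for c in range(num_centers):
        members = np.where(assign == c)[0]
        if members.size == 0:
            continue
        weights = 1.0 / (1.0 + np.exp(-sim[members, c]))   # sigmoid of similarity
        # Aggregate: similarity-weighted mean of the member features.
        aggregated = (weights[:, None] * points[members]).sum(axis=0) / weights.sum()
        # Dispatch: send the aggregated feature back to each member point.
        out[members] = points[members] + weights[:, None] * aggregated
    return out, assign

rng = np.random.default_rng(0)
# 64 points, each with 3 colour values and 2 coordinates, as in the description above.
points = rng.random((64, 5))
updated, assignment = context_cluster_step(points, num_centers=4)
```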
2303.01416
Report |
3D generation on ImageNet |
Ivan Skorokhodov, Aliaksandr Siarohin, Yinghao Xu, Jian Ren, Hsin-Ying Lee, Peter Wonka, Sergey Tulyakov |
Existing 3D-from-2D generators are typically designed for well-curated
single-category datasets, where all the objects have (approximately) the same
scale, 3D location, and orientation, and the camera always points to the center
of the scene. This makes them inapplicable to diverse, in-the-wild datasets of
non-alignable scenes rendered from arbitrary camera poses. In this work, we
develop a 3D generator with Generic Priors (3DGP): a 3D synthesis framework
with more general assumptions about the training data, and show that it scales
to very challenging datasets, like ImageNet. Our model is based on three new
ideas. First, we incorporate an inaccurate off-the-shelf depth estimator into
3D GAN training via a special depth adaptation module to handle the
imprecision. Then, we create a flexible camera model and a regularization
strategy for it to learn its distribution parameters during training. Finally,
we extend the recent ideas of transferring knowledge from pre-trained
classifiers into GANs for patch-wise trained models by employing a simple
distillation-based technique on top of the discriminator. It achieves more
stable training than the existing methods and speeds up the convergence by at
least 40%. We explore our model on four datasets: SDIP Dogs 256x256, SDIP
Elephants 256x256, LSUN Horses 256x256, and ImageNet 256x256, and demonstrate
that 3DGP outperforms the recent state-of-the-art in terms of both texture and
geometry quality. Code and visualizations:
https://snap-research.github.io/3dgp. |
The paper presents 3DGP, a 3D-aware generative model capable of synthesizing diverse, in-the-wild images from datasets like ImageNet, overcoming the limitations of existing models designed for single-category, aligned datasets. |
Existing 3D-from-2D generators struggle with diverse, non-alignable datasets due to the lack of a single canonical pose and the variability in object scales and camera parameters. This work tackles these challenges to enable 3D synthesis for in-the-wild data. |
The model incorporates three key novelties: a learnable 'Ball-in-Sphere' camera distribution to handle diverse camera poses, adversarial depth supervision using an off-the-shelf depth estimator with a depth adaptor to guide geometry learning, and knowledge distillation from a pre-trained ResNet50 into the discriminator for improved image fidelity. |
3DGP outperforms state-of-the-art 3D-aware generators in image appearance (FID) and geometry quality on non-aligned single-category datasets (SDIP Dogs, SDIP Elephants, LSUN Horses).
The model successfully demonstrates multi-categorical 3D synthesis on the challenging ImageNet dataset, producing realistic images and outperforming baselines.
Ablation studies demonstrate the effectiveness of each proposed component (learnable camera, adversarial depth supervision, knowledge distillation) in improving geometry and overall generation quality. |
The visual quality of 3DGP, while exceeding existing 3D generators, is still lower than state-of-the-art 2D generators.
The model exhibits background sticking artifacts, potentially due to dataset bias towards frontal views and limitations of the tri-plane representation. |
3d synthesis, generative adversarial networks, depth supervision, camera distribution learning, knowledge distillation |
2303.01267
Report |
Token Contrast for Weakly-Supervised Semantic Segmentation |
Lixiang Ru, Heliang Zheng, Yibing Zhan, Bo Du |
Weakly-Supervised Semantic Segmentation (WSSS) using image-level labels
typically utilizes Class Activation Map (CAM) to generate the pseudo labels.
Limited by the local structure perception of CNN, CAM usually cannot identify
the integral object regions. Though the recent Vision Transformer (ViT) can
remedy this flaw, we observe it also brings the over-smoothing issue, i.e., the
final patch tokens incline to be uniform. In this work, we propose Token
Contrast (ToCo) to address this issue and further explore the virtue of ViT for
WSSS. Firstly, motivated by the observation that intermediate layers in ViT can
still retain semantic diversity, we designed a Patch Token Contrast module
(PTC). PTC supervises the final patch tokens with the pseudo token relations
derived from intermediate layers, allowing them to align the semantic regions
and thus yield more accurate CAM. Secondly, to further differentiate the
low-confidence regions in CAM, we devised a Class Token Contrast module (CTC)
inspired by the fact that class tokens in ViT can capture high-level semantics.
CTC facilitates the representation consistency between uncertain local regions
and global objects by contrasting their class tokens. Experiments on the PASCAL
VOC and MS COCO datasets show the proposed ToCo can remarkably surpass other
single-stage competitors and achieve comparable performance with
state-of-the-art multi-stage methods. Code is available at
https://github.com/rulixiang/ToCo. |
This paper proposes Token Contrast (ToCo), a novel approach for weakly-supervised semantic segmentation (WSSS) that leverages Vision Transformer (ViT) and addresses the over-smoothing issue inherent in ViT. |
WSSS with image-level labels typically relies on Class Activation Map (CAM), but existing methods using CNNs or ViTs have limitations in accurately identifying integral object regions due to local structure perception or over-smoothing. |
ToCo introduces two novel modules: Patch Token Contrast (PTC) and Class Token Contrast (CTC). PTC utilizes intermediate layer knowledge to supervise and diversify final patch tokens, mitigating over-smoothing. CTC contrasts global and local class tokens to enhance representation consistency between less discriminative and global object regions. |
ToCo significantly outperforms state-of-the-art single-stage WSSS methods on PASCAL VOC and MS COCO datasets.
The proposed method achieves comparable results to multi-stage WSSS methods while only using image-level labels.
Extensive ablation studies validate the effectiveness of PTC and CTC in addressing over-smoothing and improving CAM quality. |
The paper mainly evaluates ToCo on natural image datasets, and its generalization to other domains is not extensively studied.
The computational cost of ViT, especially for larger variants, may be a limitation for real-time applications. |
weakly-supervised semantic segmentation, vision transformer, over-smoothing, class activation map, token contrast |
2303.01237
Report |
FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation |
Xiaoyu Shi, Zhaoyang Huang, Dasong Li, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, Jifeng Dai, Hongsheng Li |
FlowFormer introduces a transformer architecture into optical flow estimation
and achieves state-of-the-art performance. The core component of FlowFormer is
the transformer-based cost-volume encoder. Inspired by the recent success of
masked autoencoding (MAE) pretraining in unleashing transformers' capacity of
encoding visual representation, we propose Masked Cost Volume Autoencoding
(MCVA) to enhance FlowFormer by pretraining the cost-volume encoder with a
novel MAE scheme. Firstly, we introduce a block-sharing masking strategy to
prevent masked information leakage, as the cost maps of neighboring source
pixels are highly correlated. Secondly, we propose a novel pre-text
reconstruction task, which encourages the cost-volume encoder to aggregate
long-range information and ensures pretraining-finetuning consistency. We also
show how to modify the FlowFormer architecture to accommodate masks during
pretraining. Pretrained with MCVA, FlowFormer++ ranks 1st among published
methods on both Sintel and KITTI-2015 benchmarks. Specifically, FlowFormer++
achieves 1.07 and 1.94 average end-point error (AEPE) on the clean and final
pass of Sintel benchmark, leading to 7.76% and 7.18% error reductions from
FlowFormer. FlowFormer++ obtains 4.52 F1-all on the KITTI-2015 test set,
improving FlowFormer by 0.16. |
This paper proposes Masked Cost Volume Autoencoding (MCVA), a self-supervised pretraining scheme to enhance the cost-volume encoding of FlowFormer for better optical flow estimation. |
Pretraining transformers on large datasets is crucial for optical flow estimation, and MCVA enables pretraining of the FlowFormer's cost-volume encoder for improved performance. |
The paper introduces block-sharing masking to prevent information leakage and proposes a novel pre-text reconstruction task mimicking the FlowFormer's decoding process to ensure pretraining-finetuning consistency. |
FlowFormer++ with MCVA ranks 1st among published methods on Sintel and KITTI-2015 benchmarks.
It achieves 1.07 and 1.94 AEPE on Sintel clean and final pass, a 7.76% and 7.18% error reduction from FlowFormer.
On KITTI-2015, it obtains 4.52 F1-all, improving FlowFormer by 0.16 and outperforming the previous best model S-Flow by 0.12. |
The pretraining process requires large-scale video datasets like YouTube-VOS.
Further investigation into more efficient pretraining strategies for optical flow estimation is needed. |
optical flow estimation, transformer, self-supervised learning, masked autoencoding, pretraining |
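The FlowFormer++ figures reported above can be cross-checked with a short calculation. The baseline values computed here are implied by the stated errors and reductions; they are derived, not quoted from the summary. The KITTI calculation assumes F1-all is lower-is-better, so "improving by 0.16" means the baseline was 0.16 higher.

```python
def implied_baseline(new_error, relative_reduction):
    """Baseline error implied by a new error and a relative error reduction."""
    return new_error / (1.0 - relative_reduction)

sintel_clean_baseline = implied_baseline(1.07, 0.0776)   # about 1.16 AEPE
sintel_final_baseline = implied_baseline(1.94, 0.0718)   # about 2.09 AEPE
kitti_baseline = 4.52 + 0.16                              # about 4.68 F1-all

print(round(sintel_clean_baseline, 2),
      round(sintel_final_baseline, 2),
      round(kitti_baseline, 2))
```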
2303.01091
Report |
OPE-SR: Orthogonal Position Encoding for Designing a Parameter-free Upsampling Module in Arbitrary-scale Image Super-Resolution |
Gaochao Song, Luo Zhang, Ran Su, Jianfeng Shi, Ying He, Qian Sun |
Implicit neural representation (INR) is a popular approach for
arbitrary-scale image super-resolution (SR). As a key component of INR,
position encoding improves its representation ability. Motivated by position
encoding, we propose orthogonal position encoding (OPE) - an extension of
position encoding - and an OPE-Upscale module to replace the INR-based
upsampling module for arbitrary-scale image super-resolution. Like INR, our
OPE-Upscale Module takes 2D coordinates and latent code as inputs; however, it
does not require training parameters. This parameter-free feature allows the
OPE-Upscale Module to directly perform linear combination operations to
reconstruct an image in a continuous manner, achieving an arbitrary-scale image
reconstruction. As a concise SR framework, our method has high computing
efficiency and consumes less memory compared to the state-of-the-art (SOTA),
which has been confirmed by extensive experiments and evaluations. In addition,
our method has comparable results with SOTA in arbitrary scale image
super-resolution. Last but not least, we show that OPE corresponds to a set
of orthogonal basis functions, justifying our design principle. |
This paper proposes Orthogonal Position Encoding (OPE), a novel position encoding method inspired by 2D-Fourier Series, and uses it to design a parameter-free upsampling module (OPE-Upscale) for arbitrary-scale image super-resolution. |
Existing INR-based upsampling modules for arbitrary-scale SR increase network complexity and suffer from limitations in learning symmetric features. This work aims to address these issues by simplifying the SR framework and providing an interpretable image representation. |
OPE represents continuous image patches as linear combinations of orthogonal basis functions derived from 2D-Fourier Series. The OPE-Upscale module utilizes these basis functions and latent codes extracted from a feature map to reconstruct target image pixels at arbitrary scales. Patch ensemble is introduced to ensure seamless stitching of reconstructed patches. |
The proposed OPE method achieves comparable image super-resolution performance to state-of-the-art methods, with significantly reduced computational complexity and memory consumption.
The OPE-Upscale module demonstrates superior time efficiency, especially for larger scale factors, compared to INR-based counterparts.
OPE effectively addresses the flipping consistency problem observed in INR-based methods, producing accurate symmetrical outputs for flipped inputs. |
The performance of OPE slightly degrades for low scale factors due to the simplified representation of larger grid regions in the continuous 2D domain.
Future work will explore sampling strategies to enhance OPE's performance at low scale factors without significantly compromising its efficiency. Additionally, exploring other orthogonal basis functions, like Legendre or Chebyshev polynomials, for position encoding is of interest. |
image super-resolution, arbitrary-scale, position encoding, orthogonal basis, parameter-free |
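The OPE representation described above, a continuous patch written as a linear combination of orthogonal 2D Fourier basis functions, can be sketched as follows. The √2 normalization, the [-1, 1] coordinate range, the pixel-centre sampling, and all function names are assumptions chosen for illustration. The random latent code stands in for one extracted from a feature map.

```python
import numpy as np

def ope_1d(x, n_max):
    """1D basis [1, √2·cos(πnx), √2·sin(πnx)] for n = 1..n_max on [-1, 1]."""
    terms = [np.ones_like(x)]
    for n in range(1, n_max + 1):
        terms.append(np.sqrt(2) * np.cos(np.pi * n * x))
        terms.append(np.sqrt(2) * np.sin(np.pi * n * x))
    return np.stack(terms, axis=-1)              # shape (..., 2 * n_max + 1)

def ope_2d(x, y, n_max):
    """2D basis as the outer product of the 1D bases along x and y."""
    bx, by = ope_1d(x, n_max), ope_1d(y, n_max)
    return (bx[..., :, None] * by[..., None, :]).reshape(*x.shape, -1)

def render_patch(latent, size, n_max):
    """Evaluate the continuous patch at a size x size grid of pixel centres."""
    coords = (np.arange(size) + 0.5) / size * 2 - 1
    yy, xx = np.meshgrid(coords, coords, indexing="ij")
    return ope_2d(xx, yy, n_max) @ latent        # parameter-free linear combination

n_max = 2
rng = np.random.default_rng(0)
latent = rng.normal(size=(2 * n_max + 1) ** 2)   # stand-in latent code
patch_small = render_patch(latent, size=4, n_max=n_max)
patch_large = render_patch(latent, size=8, n_max=n_max)   # same code, finer grid

# Numerical orthonormality check of the 1D basis on a dense uniform grid.
grid = (np.arange(256) + 0.5) / 256 * 2 - 1
basis = ope_1d(grid, n_max)
gram = basis.T @ basis / grid.size
```

Because rendering is only a matrix product between fixed basis values and the latent code, the same code can be evaluated at any grid size without trainable upsampling parameters.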
2303.00848
Report |
Understanding Diffusion Objectives as the ELBO with Simple Data Augmentation |
Diederik P. Kingma, Ruiqi Gao |
To achieve the highest perceptual quality, state-of-the-art diffusion models
are optimized with objectives that typically look very different from the
maximum likelihood and the Evidence Lower Bound (ELBO) objectives. In this
work, we reveal that diffusion model objectives are actually closely related to
the ELBO.
Specifically, we show that all commonly used diffusion model objectives
equate to a weighted integral of ELBOs over different noise levels, where the
weighting depends on the specific objective used. Under the condition of
monotonic weighting, the connection is even closer: the diffusion objective
then equals the ELBO, combined with simple data augmentation, namely Gaussian
noise perturbation. We show that this condition holds for a number of
state-of-the-art diffusion models.
In experiments, we explore new monotonic weightings and demonstrate their
effectiveness, achieving state-of-the-art FID scores on the high-resolution
ImageNet benchmark. |
This paper reveals a close relationship between various diffusion model objectives and the Evidence Lower Bound (ELBO), showing they are equivalent to a weighted integral of ELBOs over different noise levels. |
This connection provides a deeper understanding of diffusion models and their relationship to traditional likelihood-based generative models. |
The authors analyze the weighted diffusion loss, generalizing various objectives by expressing them as special cases with specific weighting functions. They prove that monotonic weighting functions lead to equivalence with the ELBO combined with Gaussian noise perturbation. |
All commonly used diffusion model objectives can be expressed as a weighted integral of ELBOs over noise levels.
Monotonic weighting functions in diffusion objectives equate to maximizing the ELBO with Gaussian noise data augmentation.
Experiments on ImageNet using novel monotonic weighting functions achieve state-of-the-art FID scores for high-resolution image generation. |
Empirical results can be sensitive to hyperparameter choices and may require re-tuning for different datasets or resolutions.
Future work includes comparing diffusion models to other likelihood-based models using the established equivalence for a more comprehensive evaluation. |
diffusion models, generative models, evidence lower bound (elbo), data augmentation, image generation |
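The weighted diffusion loss mentioned in the entry above can be written out as a sketch. The notation is assumed for illustration and does not come from the summary: $\lambda_t$ is the log signal-to-noise ratio at time $t$, $\hat{\boldsymbol{\epsilon}}_\theta$ is a noise-prediction network, $\mathbf{z}_t$ is the noised input, and $w$ is the weighting function.

```latex
\mathcal{L}_w(\mathbf{x}) = \frac{1}{2}\,
\mathbb{E}_{t \sim \mathcal{U}(0,1),\ \boldsymbol{\epsilon} \sim \mathcal{N}(0,\mathbf{I})}
\left[ w(\lambda_t) \cdot \left(-\frac{d\lambda_t}{dt}\right) \cdot
\left\| \hat{\boldsymbol{\epsilon}}_\theta(\mathbf{z}_t; \lambda_t) - \boldsymbol{\epsilon} \right\|_2^2 \right]
```

Under this form, a constant weighting $w(\lambda_t) = 1$ would correspond to the negative ELBO up to a constant, and the individual objectives differ only in their choice of $w$.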
2303.00748
Report |
Efficient and Explicit Modelling of Image Hierarchies for Image Restoration |
Yawei Li, Yuchen Fan, Xiaoyu Xiang, Denis Demandolx, Rakesh Ranjan, Radu Timofte, Luc Van Gool |
The aim of this paper is to propose a mechanism to efficiently and explicitly
model image hierarchies in the global, regional, and local range for image
restoration. To achieve that, we start by analyzing two important properties of
natural images including cross-scale similarity and anisotropic image features.
Inspired by that, we propose the anchored stripe self-attention which achieves
a good balance between the space and time complexity of self-attention and the
modelling capacity beyond the regional range. Then we propose a new network
architecture dubbed GRL to explicitly model image hierarchies in the Global,
Regional, and Local range via anchored stripe self-attention, window
self-attention, and channel attention enhanced convolution. Finally, the
proposed network is applied to 7 image restoration types, covering both real
and synthetic settings. The proposed method sets the new state-of-the-art for
several of those. Code will be available at
https://github.com/ofsoundof/GRL-Image-Restoration.git. |
This paper presents GRL, a transformer network for image restoration that efficiently models image hierarchies in the global, regional, and local ranges. |
Natural images exhibit features at various scales. Explicitly modelling these hierarchical dependencies is crucial for high-quality image restoration, especially with increasing image resolutions. |
The authors propose anchored stripe self-attention, inspired by cross-scale similarity and anisotropic image features, to efficiently capture long-range dependencies. This mechanism is integrated with window self-attention and channel attention enhanced convolutions in the GRL architecture. |
GRL achieves state-of-the-art performance on various image restoration tasks, including denoising, super-resolution, deblurring, and JPEG artifact removal.
The method shows significant PSNR improvements over previous state-of-the-art methods, such as Restormer, on datasets like GoPro and RealBlur-R.
A tiny version of the network, GRL-T, demonstrates high efficiency with significantly reduced model complexity while maintaining competitive accuracy. |
Theoretical guarantees for similarity propagation in the anchored self-attention mechanism require further investigation.
Future work can explore the application of GRL to other image restoration tasks and more complex degradation scenarios. |
image restoration, transformer networks, self-attention, cross-scale similarity, anisotropic image features |
2303.00521
Report |
Quality-aware Pre-trained Models for Blind Image Quality Assessment |
Kai Zhao, Kun Yuan, Ming Sun, Mading Li, Xing Wen |
Blind image quality assessment (BIQA) aims to automatically evaluate the
perceived quality of a single image, whose performance has been improved by
deep learning-based methods in recent years. However, the paucity of labeled
data somewhat restrains deep learning-based BIQA methods from unleashing their
full potential. In this paper, we propose to solve the problem by a pretext
task customized for BIQA in a self-supervised learning manner, which enables
learning representations from orders of magnitude more data. To constrain the
learning process, we propose a quality-aware contrastive loss based on a simple
assumption: the quality of patches from a distorted image should be similar,
but vary from patches from the same image with different degradations and
patches from different images. Further, we improve the existing degradation
process and form a degradation space with the size of roughly $2\times10^7$.
After pre-training on ImageNet with our method, models are more sensitive to
image quality and perform significantly better on downstream BIQA tasks.
Experimental results show that our method obtains remarkable improvements on
popular BIQA datasets. |
This paper proposes Quality-aware Pre-Trained (QPT) models for Blind Image Quality Assessment (BIQA) to address the challenge of limited labeled data by utilizing self-supervised learning on a massive scale. |
Existing BIQA datasets are too small to fully leverage the power of deep learning. This paper aims to overcome this limitation and improve the performance of BIQA models. |
The paper introduces a novel self-supervised learning framework based on MoCoV2. It involves a complex degradation process with shuffle order, high-order, and skip operations to generate diverse distorted images. A quality-aware contrastive loss distinguishes between patches with varying perceptual qualities, enabling the model to learn quality-aware representations. |
QPT models significantly outperform state-of-the-art BIQA methods on five benchmark datasets.
QPT models demonstrate strong generalization ability and can be easily integrated with existing methods by replacing pre-trained weights.
The degradation space and the quality-aware contrastive loss are crucial for the effectiveness of QPT. |
Larger-scale datasets like JFT-300M could further enhance QPT's performance.
Exploring the trade-off between model capacity and pre-training time is crucial for practical applications. |
blind image quality assessment, self-supervised learning, contrastive learning, image degradation modeling, pre-training |
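The quality-aware contrastive assumption in the entry above (patches from the same distorted image are positives; patches from the same image under a different degradation, or from other images, are negatives) can be illustrated with a minimal in-batch InfoNCE sketch. This is a simplified stand-in, not the authors' MoCoV2-based implementation: the function name, the `image_ids` and `degradation_ids` arguments, and the temperature value are all hypothetical.

```python
import numpy as np


def quality_aware_contrastive_loss(features, image_ids, degradation_ids, temperature=0.2):
    """In-batch InfoNCE where positives share both source image and degradation.

    features: (N, D) patch embeddings (hypothetical encoder output).
    image_ids, degradation_ids: length-N labels identifying each patch's origin.
    """
    features = np.asarray(features, dtype=np.float64)
    features = features / np.linalg.norm(features, axis=1, keepdims=True)
    image_ids = np.asarray(image_ids)
    degradation_ids = np.asarray(degradation_ids)

    n = len(features)
    not_self = ~np.eye(n, dtype=bool)
    # Positives: same image AND same degradation, excluding the anchor itself.
    same_distorted_image = (image_ids[:, None] == image_ids[None, :]) & (
        degradation_ids[:, None] == degradation_ids[None, :]
    )
    positives = same_distorted_image & not_self

    # Scaled cosine similarities, shifted per row for numerical stability.
    sim = features @ features.T / temperature
    sim = sim - sim.max(axis=1, keepdims=True)
    exp_sim = np.exp(sim) * not_self

    pos_sum = (exp_sim * positives).sum(axis=1)
    all_sum = exp_sim.sum(axis=1)
    has_positive = positives.any(axis=1)  # skip anchors without any positive
    losses = -np.log(pos_sum[has_positive] / all_sum[has_positive])
    return float(losses.mean())
```

The loss is small when patches that share a distorted source image have similar embeddings and large when embeddings follow image content rather than degradation.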
2303.00404
Report |
Distilled Reverse Attention Network for Open-world Compositional Zero-Shot Learning |
Yun Li, Zhe Liu, Saurav Jha, Sally Cripps, Lina Yao |
Open-World Compositional Zero-Shot Learning (OW-CZSL) aims to recognize new
compositions of seen attributes and objects. In OW-CZSL, methods built on the
conventional closed-world setting degrade severely due to the unconstrained OW
test space. While previous works alleviate the issue by pruning compositions
according to external knowledge or correlations in seen pairs, they introduce
biases that harm the generalization. Some methods thus predict state and object
with independently constructed and trained classifiers, ignoring that
attributes are highly context-dependent and visually entangled with objects. In
this paper, we propose a novel Distilled Reverse Attention Network to address
the challenges. We also model attributes and objects separately but with
different motivations, capturing contextuality and locality, respectively. We
further design a reverse-and-distill strategy that learns disentangled
representations of elementary components in training data supervised by reverse
attention and knowledge distillation. We conduct experiments on three datasets
and consistently achieve state-of-the-art (SOTA) performance. |
This paper proposes DRANet for Open-World Compositional Zero-Shot Learning, which disentangles visual primitives of attributes and objects using a novel reverse-and-distill strategy. |
OW-CZSL, aiming to recognize unseen compositions of seen elements, is challenging due to the unconstrained output space. Existing methods either suffer from biases introduced by external knowledge or fail to address the context-dependent nature of attributes and visual entanglement. |
DRANet utilizes non-local attention for attributes to capture context and local attention for objects to enhance locality. It then leverages reverse attention and knowledge distillation to disentangle attribute and object features for improved generalization. |
DRANet achieves state-of-the-art performance on three benchmark datasets (MIT-States, UT-Zappos, C-GQA).
The proposed reverse-and-distill strategy effectively disentangles attribute and object embeddings, improving recognition of unseen compositions.
Employing different feature extractors tailored for attributes and objects, considering their distinct characteristics, further benefits the model's performance. |
Reverse attention might cause focal confusion or lead to inconsistencies between the predicted attributes and objects.
Future work includes extending the disentanglement strategy to multi-object recognition and exploring alternative disentanglement methods to address limitations. |
compositional zero-shot learning, open-world learning, disentanglement, reverse attention, knowledge distillation |
2303.00354
Report |
Unlimited-Size Diffusion Restoration |
Yinhuai Wang, Jiwen Yu, Runyi Yu, Jian Zhang |
Recently, using diffusion models for zero-shot image restoration (IR) has
become a new hot paradigm. This type of method only needs to use the
pre-trained off-the-shelf diffusion models, without any finetuning, and can
directly handle various IR tasks. The upper limit of the restoration
performance depends on the pre-trained diffusion models, which are in rapid
evolution. However, current methods only discuss how to deal with fixed-size
images, but dealing with images of arbitrary sizes is very important for
practical applications. This paper focuses on how to use those diffusion-based
zero-shot IR methods to deal with any size while maintaining the excellent
characteristics of zero-shot. A simple way to solve arbitrary size is to divide
it into fixed-size patches and solve each patch independently. But this may
yield significant artifacts since it neither considers the global semantics of
all patches nor the local information of adjacent patches. Inspired by the
Range-Null space Decomposition, we propose the Mask-Shift Restoration to
address local incoherence and propose the Hierarchical Restoration to alleviate
out-of-domain issues. Our simple, parameter-free approaches can be used not
only for image restoration but also for image generation of unlimited sizes,
with the potential to be a general tool for diffusion models. Code:
https://github.com/wyhuai/DDNM/tree/main/hq_demo |
This paper proposes two parameter-free methods, Mask-Shift Restoration (MSR) and Hierarchical Restoration (HiR), to enable diffusion-based zero-shot image restoration methods to handle images of unlimited size. |
Existing diffusion-based zero-shot image restoration methods primarily focus on fixed-size images, limiting their practical application in real-world scenarios where desired output sizes can vary. |
MSR addresses local incoherence by processing the image in overlapping patches and using restored regions as constraints. HiR tackles out-of-domain issues by first restoring a low-resolution version of the image, then using it as a global prior for the final restoration. |
MSR effectively eliminates boundary artifacts when processing large images in patches.
HiR significantly improves the semantic correctness of the restored images, especially in large-scale inpainting and super-resolution tasks.
Both MSR and HiR are parameter-free, training-free, and can be flexibly combined and applied to various diffusion models and zero-shot restoration methods. |
The proposed methods have a higher computational cost than supervised methods.
Performance relies on the pre-trained diffusion models, limiting their effectiveness for tasks where suitable models are unavailable. |
image restoration, diffusion models, zero-shot learning, unlimited size, range-null space decomposition |
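The overlapping-patch idea behind MSR in the entry above can be sketched as follows: patches are visited with a shift smaller than the patch size, and pixels restored by earlier patches are passed to later ones as known constraints. This is only an illustration of the tiling logic on a 2D grayscale array. `restore_patch` is a hypothetical callback standing in for a diffusion-based restorer, and the function names are not taken from the paper or its code.

```python
import numpy as np


def patch_starts(length, patch, shift):
    """Start offsets so that patches of size `patch` cover `length` with overlap."""
    assert 0 < shift <= patch, "shift must not exceed the patch size"
    if length <= patch:
        return [0]
    starts = list(range(0, length - patch, shift))
    starts.append(length - patch)  # final patch is aligned to the border
    return starts


def restore_in_overlapping_patches(degraded, patch, shift, restore_patch):
    """Restore a 2D image patch by patch, reusing already restored pixels.

    restore_patch(crop, known_values, known_mask) is a hypothetical callback:
    it receives the degraded crop, the values restored so far, and a boolean
    mask marking which of those values are already fixed.
    """
    height, width = degraded.shape
    output = np.zeros_like(degraded, dtype=np.float64)
    done = np.zeros((height, width), dtype=bool)
    for top in patch_starts(height, patch, shift):
        for left in patch_starts(width, patch, shift):
            window = (slice(top, top + patch), slice(left, left + patch))
            known_mask = done[window].copy()
            restored = restore_patch(degraded[window], output[window], known_mask)
            # Keep pixels fixed by earlier patches; fill in only the new region.
            output[window] = np.where(known_mask, output[window], restored)
            done[window] = True
    return output
```

Because every pixel is written exactly once, the result does not depend on how a later patch would have restored an already fixed region, which is the property that suppresses seams in this simplified setting.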
2303.00165
Report |
Diffusion Probabilistic Fields |
Peiye Zhuang, Samira Abnar, Jiatao Gu, Alex Schwing, Joshua M. Susskind, Miguel Ángel Bautista |
Diffusion probabilistic models have quickly become a major approach for
generative modeling of images, 3D geometry, video and other domains. However,
to adapt diffusion generative modeling to these domains the denoising network
needs to be carefully designed for each domain independently, oftentimes under
the assumption that data lives in a Euclidean grid. In this paper we introduce
Diffusion Probabilistic Fields (DPF), a diffusion model that can learn
distributions over continuous functions defined over metric spaces, commonly
known as fields. We extend the formulation of diffusion probabilistic models to
deal with this field parametrization in an explicit way, enabling us to define
an end-to-end learning algorithm that side-steps the requirement of
representing fields with latent vectors as in previous approaches (Dupont et
al., 2022a; Du et al., 2021). We empirically show that, while using the same
denoising network, DPF effectively deals with different modalities like 2D
images and 3D geometry, in addition to modeling distributions over fields
defined on non-Euclidean metric spaces. |
This paper introduces Diffusion Probabilistic Fields (DPF), a novel diffusion model capable of learning distributions over continuous functions defined on metric spaces (fields), unifying generative modeling across different data domains. |
Existing diffusion models often assume data lies on a grid and require domain-specific denoising networks. DPF overcomes these limitations by unifying data representation as fields and enabling a single model to handle diverse domains. |
DPF uses an explicit field parameterization with context and query pairs, employing a PerceiverIO architecture as the score field network. This allows continuous evaluation and efficient handling of large numbers of context and query pairs during training and inference. |
DPF demonstrates compelling generative performance on diverse domains like images (CelebA-HQ, CIFAR-10), 3D geometry (ShapeNet), and spherical data, outperforming existing domain-agnostic methods.
The explicit field parameterization enables end-to-end learning, surpassing the performance of two-stage approaches that rely on latent representations.
DPF exhibits resolution-free generation capabilities, allowing for sampling at different resolutions than seen during training. |
The computational cost of the score network can be prohibitive for high-resolution data, necessitating further exploration of efficient transformer architectures.
Sampling in DPF, similar to other diffusion models, requires iterating over all timesteps, leading to slower inference compared to GANs. Investigating faster sampling techniques while maintaining sample quality is crucial. |
diffusion models, generative modeling, fields, perceiverio, domain-agnostic |
2303.00157
Report |
Semi-supervised Parametric Real-world Image Harmonization |
Ke Wang, Michaël Gharbi, He Zhang, Zhihao Xia, Eli Shechtman |
Learning-based image harmonization techniques are usually trained to undo
synthetic random global transformations applied to a masked foreground in a
single ground truth photo. This simulated data does not model many of the
important appearance mismatches (illumination, object boundaries, etc.) between
foreground and background in real composites, leading to models that do not
generalize well and cannot model complex local changes. We propose a new
semi-supervised training strategy that addresses this problem and lets us learn
complex local appearance harmonization from unpaired real composites, where
foreground and background come from different images. Our model is fully
parametric. It uses RGB curves to correct the global colors and tone and a
shading map to model local variations. Our method outperforms previous work on
established benchmarks and real composites, as shown in a user study, and
processes high-resolution images interactively. |
This paper introduces a novel semi-supervised dual-stream training strategy for real-world image harmonization, addressing limitations of existing methods trained on synthetic data. |
Existing methods struggle to generalize to real-world composites due to the domain gap between synthetic training data and real composites, which exhibit complex appearance mismatches. |
The proposed method alternates between supervised training on artist-retouched image pairs and unsupervised adversarial training on unpaired real composites. A parametric model with global RGB curves and a local shading map is employed for efficient and high-resolution processing. |
The method outperforms state-of-the-art methods on the iHarmony benchmark and on real composite datasets.
A user study confirms its superior performance on real-world composites.
The model enables local tonal adjustments, unlike previous methods limited to global corrections. |
The method's generalization to a wider range of image harmonization operations beyond color and shading is yet to be explored.
Future work could focus on incorporating more attributes into the model to further enhance realism. |
image harmonization, semi-supervised learning, adversarial training, parametric model, shading correction |
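The parametric model in the entry above (global RGB curves followed by a local shading map) can be illustrated with a small sketch. The parameterization is an assumption made for illustration: each curve is a piecewise-linear function given by its values at uniformly spaced knots, and the shading map is a per-pixel multiplicative gain. The function and argument names are hypothetical.

```python
import numpy as np


def apply_parametric_harmonization(foreground, background, mask, curve_values, shading_map):
    """Apply per-channel tone curves and a shading map to a masked foreground.

    foreground, background: (H, W, 3) arrays with values in [0, 1].
    mask: (H, W) foreground alpha in [0, 1].
    curve_values: (3, K) curve outputs at K uniformly spaced inputs in [0, 1].
    shading_map: (H, W) multiplicative gain modelling local variations.
    """
    knots = np.linspace(0.0, 1.0, curve_values.shape[1])
    # Global color and tone correction: one piecewise-linear curve per channel.
    toned = np.stack(
        [np.interp(foreground[..., c], knots, curve_values[c]) for c in range(3)],
        axis=-1,
    )
    # Local correction: scale each pixel by the shading map.
    shaded = np.clip(toned * shading_map[..., None], 0.0, 1.0)
    alpha = mask[..., None]
    return alpha * shaded + (1.0 - alpha) * background
```

Since both operations are cheap per-pixel functions, parameters predicted at low resolution could in principle be applied to a full-resolution image, which is consistent with the interactive high-resolution processing the entry describes.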
2302.14859
Report |
BakedSDF: Meshing Neural SDFs for Real-Time View Synthesis |
Lior Yariv, Peter Hedman, Christian Reiser, Dor Verbin, Pratul P. Srinivasan, Richard Szeliski, Jonathan T. Barron, Ben Mildenhall |
We present a method for reconstructing high-quality meshes of large unbounded
real-world scenes suitable for photorealistic novel view synthesis. We first
optimize a hybrid neural volume-surface scene representation designed to have
well-behaved level sets that correspond to surfaces in the scene. We then bake
this representation into a high-quality triangle mesh, which we equip with a
simple and fast view-dependent appearance model based on spherical Gaussians.
Finally, we optimize this baked representation to best reproduce the captured
viewpoints, resulting in a model that can leverage accelerated polygon
rasterization pipelines for real-time view synthesis on commodity hardware. Our
approach outperforms previous scene representations for real-time rendering in
terms of accuracy, speed, and power consumption, and produces high quality
meshes that enable applications such as appearance editing and physical
simulation. |
BakedSDF presents a method for reconstructing high-quality meshes of large unbounded real-world scenes suitable for photorealistic novel view synthesis, enabling real-time rendering on commodity hardware. |
Existing NeRF-based methods struggle to balance high-quality reconstruction with real-time rendering capabilities, especially on commodity hardware. BakedSDF addresses this by baking a neural volumetric representation into an efficiently renderable mesh. |
BakedSDF utilizes a hybrid neural volume-surface representation optimized in contracted coordinate space. This representation is then baked into a high-quality triangle mesh, equipped with a view-dependent appearance model based on spherical Gaussians, and fine-tuned to reproduce captured viewpoints. |
BakedSDF outperforms previous scene representations for real-time rendering in terms of accuracy, speed, and power consumption.
It produces high-quality meshes suitable for applications like appearance editing and physical simulation.
The results demonstrate that spherical Gaussians are a practical representation for view-dependent appearance in view synthesis. |
Because the mesh is fully opaque, the method struggles to represent semi-transparent content and scenes with small or detailed geometry.
The output meshes have a significant on-disk footprint, posing potential storage and streaming challenges. |
neural radiance fields, signed distance function, surface reconstruction, real-time rendering, view synthesis |
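The view-dependent appearance model in the entry above can be illustrated with the commonly used spherical Gaussian lobe form. Whether this matches the paper's exact parameterization is an assumption: here a color is a diffuse term plus a sum of lobes, each with a color, a mean direction, and a sharpness. All names are hypothetical.

```python
import numpy as np


def spherical_gaussian_color(view_dir, diffuse, lobe_colors, lobe_means, lobe_sharpness):
    """Evaluate diffuse color plus spherical Gaussian lobes for one view direction.

    view_dir: (3,) viewing direction (normalized internally).
    diffuse: (3,) view-independent color.
    lobe_colors: (N, 3) color of each lobe.
    lobe_means: (N, 3) mean direction of each lobe (normalized internally).
    lobe_sharpness: (N,) non-negative sharpness of each lobe.
    """
    d = view_dir / np.linalg.norm(view_dir)
    mu = lobe_means / np.linalg.norm(lobe_means, axis=1, keepdims=True)
    # Each lobe peaks at 1 when the view direction equals its mean direction.
    weights = np.exp(lobe_sharpness * (mu @ d - 1.0))
    return diffuse + weights @ lobe_colors
```

The evaluation needs only a dot product and an exponential per lobe, which is the kind of cost that fits a rasterization shader.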
2302.14771
Report |
Generic-to-Specific Distillation of Masked Autoencoders |
Wei Huang, Zhiliang Peng, Li Dong, Furu Wei, Jianbin Jiao, Qixiang Ye |
Large vision Transformers (ViTs) driven by self-supervised pre-training
mechanisms achieved unprecedented progress. Lightweight ViT models limited by
the model capacity, however, benefit little from those pre-training mechanisms.
Knowledge distillation defines a paradigm to transfer representations from
large (teacher) models to small (student) ones. However, the conventional
single-stage distillation easily gets stuck on task-specific transfer, failing
to retain the task-agnostic knowledge crucial for model generalization. In this
study, we propose generic-to-specific distillation (G2SD), to tap the potential
of small ViT models under the supervision of large models pre-trained by masked
autoencoders. In generic distillation, decoder of the small model is encouraged
to align feature predictions with hidden representations of the large model, so
that task-agnostic knowledge can be transferred. In specific distillation,
predictions of the small model are constrained to be consistent with those of
the large model, to transfer task-specific features which guarantee task
performance. With G2SD, the vanilla ViT-Small model respectively achieves
98.7%, 98.1% and 99.3% of the performance of its teacher (ViT-Base) for image
classification, object detection, and semantic segmentation, setting a solid
baseline for two-stage vision distillation. Code will be available at
https://github.com/pengzhiliang/G2SD. |
This paper introduces Generic-to-Specific Distillation (G2SD), a two-stage knowledge distillation approach for lightweight Vision Transformers (ViTs), transferring both task-agnostic and task-specific knowledge from large masked autoencoder pre-trained models. |
Lightweight ViTs struggle to benefit from self-supervised pre-training methods like Masked Image Modeling (MIM), limiting their performance. G2SD addresses this by effectively transferring knowledge from larger, MIM-pretrained teachers, bridging the performance gap with CNNs in resource-constrained settings. |
G2SD uses two stages: 1) **Generic Distillation:** Aligns student decoder feature predictions with hidden representations of the teacher's decoder from a pre-trained MAE, transferring task-agnostic knowledge. 2) **Specific Distillation:** Fine-tunes the student on a specific task using a fine-tuned teacher MAE, transferring task-specific knowledge via consistent prediction. |
Vanilla ViT-Small with G2SD achieves 98.7% of the top-1 accuracy of its teacher (ViT-Base) on ImageNet-1k.
G2SD surpasses single-stage distillation counterparts and competing methods in object detection and semantic segmentation tasks, demonstrating strong generalization ability.
The method proves effective for lightweight ViTs, pushing their performance to a new height and establishing a solid baseline for two-stage vision model distillation. |
The study primarily focuses on transferring knowledge from MAE-pretrained teachers; exploring other MIM methods could further enhance performance.
Investigating the impact of varying teacher-student model size ratios and more efficient distillation strategies remains for future work. |
knowledge distillation, vision transformers, masked image modeling, self-supervised learning, lightweight models |
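The two stages described in the G2SD entry above can be sketched as two loss functions: a feature-alignment loss for generic distillation and a prediction-consistency loss for specific distillation. The concrete choices here are assumptions, not the authors' settings: mean squared error is swapped in for the feature loss, KL divergence with a temperature for the prediction loss, and all function names are hypothetical.

```python
import numpy as np


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over the last axis."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def generic_distillation_loss(student_features, teacher_features):
    """Stage 1: align student decoder predictions with teacher hidden features."""
    diff = np.asarray(student_features, dtype=np.float64) - np.asarray(
        teacher_features, dtype=np.float64
    )
    return float(np.mean(diff ** 2))


def specific_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Stage 2: KL(teacher || student) between task predictions."""
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = np.sum(np.exp(log_p_teacher) * (log_p_teacher - log_p_student), axis=-1)
    return float(kl.mean())
```

In a training loop the first loss would be minimized during pre-training-style distillation and the second during task fine-tuning, possibly combined with the ordinary task loss.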
2302.14736
Report |
TextIR: A Simple Framework for Text-based Editable Image Restoration |
Yunpeng Bai, Cairong Wang, Shuzhao Xie, Chao Dong, Chun Yuan, Zhi Wang |
Most existing image restoration methods use neural networks to learn strong
image-level priors from huge data to estimate the lost information. However,
these works still struggle in cases when images have severe information
deficits. Introducing external priors or using reference images to provide
information also have limitations in the application domain. In contrast, text
input is more readily available and provides information with higher
flexibility. In this work, we design an effective framework that allows the
user to control the restoration process of degraded images with text
descriptions. We use the text-image feature compatibility of the CLIP to
alleviate the difficulty of fusing text and image features. Our framework can
be used for various image restoration tasks, including image inpainting, image
super-resolution, and image colorization. Extensive experiments demonstrate the
effectiveness of our method. |
This paper presents TextIR, a novel framework for text-based editable image restoration leveraging the text-image feature compatibility of CLIP. |
Existing image restoration methods struggle with severe information deficits, and while external priors or reference images can help, they have limitations. Text input offers a more flexible and accessible alternative. |
TextIR utilizes CLIP's shared embedding space to train a generator that takes degraded images and text descriptions as input. During training, ground truth images are translated into CLIP image embeddings to simulate text conditions. The generator incorporates multi-level features from the degraded image and modulates them with text-derived style codes. |
TextIR outperforms a diffusion-based method in text-guided inpainting, producing more natural and realistic results.
The framework effectively colorizes grayscale images based on text descriptions, demonstrating accurate target localization and color matching.
In super-resolution, TextIR surpasses a blind face restoration method, generating clearer results consistent with the provided text. |
The current implementation of TextIR relies on CLIP's pre-trained knowledge and may not generalize well to unseen concepts or domains.
Future work could explore alternative text-image fusion mechanisms or incorporate additional constraints for improved control over the restoration process. |
image restoration, text-guided image editing, clip, image inpainting, super-resolution |
2302.14728
Report |
Global Context-Aware Person Image Generation |
Prasun Roy, Saumik Bhattacharya, Subhankar Ghosh, Umapada Pal, Michael Blumenstein |
We propose a data-driven approach for context-aware person image generation.
Specifically, we attempt to generate a person image such that the synthesized
instance can blend into a complex scene. In our method, the position, scale,
and appearance of the generated person are semantically conditioned on the
existing persons in the scene. The proposed technique is divided into three
sequential steps. At first, we employ a Pix2PixHD model to infer a coarse
semantic mask that represents the new person's spatial location, scale, and
potential pose. Next, we use a data-centric approach to select the closest
representation from a precomputed cluster of fine semantic masks. Finally, we
adopt a multi-scale, attention-guided architecture to transfer the appearance
attributes from an exemplar image. The proposed strategy enables us to
synthesize semantically coherent realistic persons that can blend into an
existing scene without altering the global context. We conclude our findings
with relevant qualitative and quantitative evaluations. |
This paper proposes a data-driven approach for generating person images that blend seamlessly into complex scenes, considering the global context of existing people. |
Existing person image generation methods often produce unrealistic results due to their reliance on local attributes and neglect of global contextual information. |
The proposed method uses a three-stage approach: (1) Estimating the target person's location and pose with a Pix2PixHD model, (2) Refining the semantic map using a data-driven approach with a clustered knowledge base, (3) Rendering the refined map by transferring appearance attributes from an exemplar image. |
The proposed method generates semantically coherent and realistic persons that blend well with existing scenes.
A data-driven refinement strategy improves the visual quality and realism of the generated images.
The method achieves state-of-the-art results on various qualitative and quantitative benchmarks. |
The method may struggle with unconventional poses or misclassified outliers during clustering.
Future work could explore better ways to model global scene context and develop a more robust end-to-end approach. |
image generation, context-aware, person image synthesis, deep learning, computer vision |
2302.14683
Report |
IntrinsicNGP: Intrinsic Coordinate based Hash Encoding for Human NeRF |
Bo Peng, Jun Hu, Jingtao Zhou, Xuan Gao, Juyong Zhang |
Recently, many works have been proposed to utilize the neural radiance field
for novel view synthesis of human performers. However, most of these methods
require hours of training, making them difficult for practical use. To address
this challenging problem, we propose IntrinsicNGP, which can train from scratch
and achieve high-fidelity results in a few minutes with videos of a human
performer. To achieve this target, we introduce a continuous and optimizable
intrinsic coordinate rather than the original explicit Euclidean coordinate in
the hash encoding module of instant-NGP. With this novel intrinsic coordinate,
IntrinsicNGP can aggregate inter-frame information for dynamic objects with the
help of proxy geometry shapes. Moreover, the results trained with the given
rough geometry shapes can be further refined with an optimizable offset field
based on the intrinsic coordinate. Extensive experimental results on several
datasets demonstrate the effectiveness and efficiency of IntrinsicNGP. We also
illustrate our approach's ability to edit the shape of reconstructed subjects. |
This paper presents IntrinsicNGP, a novel view synthesis method for human bodies that can be trained from scratch in minutes on monocular videos by using an intrinsic coordinate representation for hash encoding in instant-NGP. |
Existing methods for human NeRF require hours of training, making them impractical for common users. IntrinsicNGP addresses this by enabling fast, high-fidelity novel view synthesis within minutes. |
IntrinsicNGP uses a UV-D mapping to represent query points with intrinsic coordinates based on nearest points on a rough human surface mesh and signed distance. It employs hash encoding on these coordinates for fast NeRF training and introduces an offset field to refine details. |
IntrinsicNGP achieves high-fidelity novel view synthesis comparable to state-of-the-art methods on ZJU-MoCap and custom datasets.
It converges significantly faster (within minutes) than other methods, which typically take hours.
IntrinsicNGP allows for shape editing of the reconstructed human body by manipulating the input surface mesh. |
The method's reliance on a template model (SMPL) can limit expressiveness despite using an offset field.
Future work could explore combining IntrinsicNGP with more advanced human shape reconstruction methods for improved accuracy and detail. |
neural rendering, human performance capture, novel view synthesis, intrinsic coordinates, hash encoding |
2302.14475
Report |
Benchmarking Deepart Detection |
Yabin Wang, Zhiwu Huang, Xiaopeng Hong |
Deepfake technologies have been blurring the boundaries between the real and
unreal, likely resulting in malicious events. By leveraging newly emerged
deepfake technologies, deepfake researchers have been making great strides in
creating deepfake artworks (deeparts), which are further closing the gap
between reality and fantasy. To address the ethical questions that may arise,
this paper establishes a deepart detection database (DDDB) that consists of a
set of high-quality conventional art images (conarts) and five sets of deepart
images generated by five state-of-the-art deepfake models. This database
enables us to explore once-for-all deepart detection and continual deepart
detection. For the two new problems, we suggest four benchmark evaluations and
four families of solutions on the constructed DDDB. The comprehensive study
demonstrates the effectiveness of the proposed solutions on the established
benchmark dataset, which is capable of paving a way to more interesting
directions of deepart detection. The constructed benchmark dataset and the
source code will be made publicly available. |
This paper introduces DDDB, the first deepart detection database, and proposes two new deepart detection tasks: once-for-all deepart detection (ODD) and continual deepart detection (CDD). |
The emergence of highly realistic deepfake artworks (deeparts) necessitates detection and copyright identification to address ethical concerns. |
The authors construct DDDB with deeparts from five models and conarts from LAION-5B, designing four benchmark evaluations: one for ODD and three for CDD with varying rehearsal constraints. They propose solutions for each benchmark, including adapting existing methods and introducing a transformation framework to rescue rehearsal-free methods for the most challenging CDD scenario. |
Deeparts are significantly different from traditional deepfakes, rendering existing deepfake detectors ineffective.
Continual deepart detection methods generally outperform once-for-all methods, particularly with the proposed transformation framework in rehearsal-free settings.
The study highlights the challenge of deepart detection due to the high realism and closeness to real artworks. |
The paper acknowledges the limited availability of high-quality conarts and the reliance on Stable Diffusion's training data.
Future work includes exploring the use of easily-acquired conarts, collecting more diverse data, and leveraging deepart prompts. |
deepfake detection, deepart, continual learning, benchmarking, copyright identification |
2302.14452
Report |
An Effective Crop-Paste Pipeline for Few-shot Object Detection |
Shaobo Lin, Kun Wang, Xingyu Zeng, Rui Zhao |
Few-shot object detection (FSOD) aims to expand an object detector for novel
categories given only a few instances for training. However, detecting novel
categories with only a few samples usually leads to the problem of
misclassification. In FSOD, we notice the false positive (FP) of novel
categories is prominent, in which the base categories are often recognized as
novel ones. To address this issue, a novel data augmentation pipeline that
Crops the Novel instances and Pastes them on the selected Base images, called
CNPB, is proposed. There are two key questions to be answered: (1) How to
select useful base images? and (2) How to combine novel and base data? We
design a multi-step selection strategy to find useful base data. Specifically,
we first discover the base images which contain the FP of novel categories and
select a certain number of samples from them to balance the base and novel
categories. Then the bad cases, such as the base images that have unlabeled ground
truth or easily confused base instances, are removed by using CLIP. Finally,
the same category strategy is adopted, in which a novel instance with category
n is pasted on the base image with the FP of n. During combination, a novel
instance is cropped and randomly down-sized, and then pasted at the assigned
optimal location from the randomly generated candidates in a selected base
image. Our method is simple yet effective and is easy to plug into existing
FSOD methods, demonstrating significant potential for use. Extensive
experiments on PASCAL VOC and MS COCO validate the effectiveness of our method. |
This paper proposes CNPB, a novel data augmentation pipeline for Few-Shot Object Detection (FSOD) that addresses the issue of misclassifying base categories as novel categories (false positives). |
FSOD models often struggle with misclassification, particularly false positives where base categories are incorrectly identified as novel categories. This limits their accuracy and practical applicability. |
CNPB works by cropping novel instances and pasting them onto carefully selected base images containing false positives. The key steps include: (1) Identifying base images with false positives using a trained FSOD model, (2) Selecting a balanced subset of these base images, (3) Removing unsuitable base images (e.g., containing unlabeled ground truth) using the CLIP model, and (4) Pasting a novel instance onto a base image containing a false positive of the same category. |
CNPB consistently reduces the false positive ratio of novel categories in FSOD models.
CNPB significantly improves the performance of multiple baseline FSOD methods (TFA, FSCE, DeFRCN).
CNPB achieves state-of-the-art performance on PASCAL VOC and MS COCO datasets. |
The improvement on MS COCO is less significant than on PASCAL VOC due to the higher shot settings used.
Further exploration of advanced data augmentation techniques on the pasted novel instances might yield additional benefits. |
few-shot object detection, data augmentation, false positives, misclassification, computer vision |
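The crop-and-paste step of CNPB described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names `crop_and_paste` and `augment`, the scale range, the number of candidates, and the optional `score_fn` used to pick the "optimal" location are all hypothetical, since the report does not state how candidates are scored.

```python
import numpy as np


def crop_and_paste(novel_img, novel_box, base_img, center, scale):
    """Crop `novel_box` from `novel_img`, resize it by `scale`, and paste it
    onto a copy of `base_img` centred at `center` = (x, y).

    Returns the augmented image and the pasted box (x0, y0, x1, y1),
    which would serve as the new ground-truth box of the novel instance.
    """
    x0, y0, x1, y1 = novel_box
    crop = novel_img[y0:y1, x0:x1]

    # Target size after down-sizing (at least one pixel per side).
    h = max(1, int(round(crop.shape[0] * scale)))
    w = max(1, int(round(crop.shape[1] * scale)))

    # Nearest-neighbour resize via index sampling (keeps the sketch NumPy-only).
    rows = np.arange(h) * crop.shape[0] // h
    cols = np.arange(w) * crop.shape[1] // w
    patch = crop[rows][:, cols]

    # Clamp the top-left corner so the patch stays inside the base image.
    H, W = base_img.shape[:2]
    h, w = min(h, H), min(w, W)
    px = int(np.clip(center[0] - w // 2, 0, W - w))
    py = int(np.clip(center[1] - h // 2, 0, H - h))

    out = base_img.copy()
    out[py:py + h, px:px + w] = patch[:h, :w]
    return out, (px, py, px + w, py + h)


def augment(novel_img, novel_box, base_img, rng, num_candidates=10,
            scale_range=(0.5, 1.0), score_fn=None):
    """Draw a random scale and random candidate centres, then paste at the
    candidate that `score_fn` rates highest (first candidate if None).
    The scoring criterion is an assumption; the source does not specify it."""
    H, W = base_img.shape[:2]
    scale = rng.uniform(*scale_range)
    candidates = [(int(rng.integers(0, W)), int(rng.integers(0, H)))
                  for _ in range(num_candidates)]
    best = candidates[0] if score_fn is None else max(candidates, key=score_fn)
    return crop_and_paste(novel_img, novel_box, base_img, best, scale)
```

In a full pipeline, the base image passed in would already have been selected as containing a false positive of the same novel category and filtered with CLIP; those steps are outside this sketch.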
2302.14434
Report |
A Hierarchical Representation Network for Accurate and Detailed Face Reconstruction from In-The-Wild Images |
Biwen Lei, Jianqiang Ren, Mengyang Feng, Miaomiao Cui, Xuansong Xie |
Limited by the nature of the low-dimensional representational capacity of
3DMM, most of the 3DMM-based face reconstruction (FR) methods fail to recover
high-frequency facial details, such as wrinkles and dimples. Some attempt to
solve the problem by introducing detail maps or non-linear operations; however,
the results are still not vivid. To this end, we present a novel
hierarchical representation network (HRN) to achieve accurate and detailed face
reconstruction from a single image. Specifically, we implement the geometry
disentanglement and introduce the hierarchical representation to fulfill
detailed face modeling. Meanwhile, 3D priors of facial details are incorporated
to enhance the accuracy and authenticity of the reconstruction results. We also
propose a de-retouching module to achieve better decoupling of the geometry and
appearance. It is noteworthy that our framework can be extended to a multi-view
fashion by considering detail consistency of different views. Extensive
experiments on two single-view and two multi-view FR benchmarks demonstrate
that our method outperforms the existing methods in both reconstruction
accuracy and visual effects. Finally, we introduce a high-quality 3D face
dataset FaceHD-100 to boost the research of high-fidelity face reconstruction.
The project homepage is at https://younglbw.github.io/HRN-homepage/. |
This paper introduces Hierarchical Representation Network (HRN), a novel method for accurate and detailed 3D face reconstruction from single and multi-view images. |
Current 3DMM-based face reconstruction methods struggle to recover high-frequency facial details. This paper aims to address this by introducing a novel hierarchical representation network that captures details in a coarse-to-fine manner. |
The method decouples facial geometry into low, mid, and high-frequency details, representing them with blendshape coefficients, a vertex-wise deformation map, and a pixel-wise displacement map, respectively. It utilizes two image translation networks to estimate detail maps and incorporates 3D priors of facial details for enhanced accuracy. A de-retouching module helps decouple geometry and appearance. |
HRN outperforms state-of-the-art methods on single-view face reconstruction benchmarks (FaceScape, REALY) in terms of detail capturing and shape accuracy.
The method generalizes well to multi-view face reconstruction, achieving superior performance on FaceScape and ESRC datasets with only a few input views.
Ablation studies validate the contribution of each component, including hierarchical modeling, contour-aware loss, 3D detail priors, and the de-retouching module. |
The paper acknowledges limitations regarding the handling of extreme poses and heavy occlusions.
Future work will focus on extending the method for high-quality head reconstruction and exploring alternative detail modeling approaches. |
3d face reconstruction, hierarchical representation, detail modeling, 3d morphable model (3dmm), de-retouching |
2302.14431
Report |
Efficient Masked Autoencoders with Self-Consistency |
Zhaowen Li, Yousong Zhu, Zhiyang Chen, Wei Li, Chaoyang Zhao, Liwei Wu, Rui Zhao, Ming Tang, Jinqiao Wang |
Inspired by masked language modeling (MLM) in natural language processing,
masked image modeling (MIM) has been recognized as a strong and popular
self-supervised pre-training method in computer vision. However, its high
random mask ratio would result in two serious problems: 1) the data are not
efficiently exploited, which brings inefficient pre-training (e.g., 1600 epochs
for MAE vs. 300 epochs for the supervised), and 2) the high uncertainty and
inconsistency of the pre-trained model, i.e., the prediction of the same patch
may be inconsistent under different mask rounds. To tackle these problems, we
propose efficient masked autoencoders with self-consistency (EMAE), to improve
the pre-training efficiency and increase the consistency of MIM. In particular,
we progressively divide the image into K non-overlapping parts, each of which
is generated by a random mask and has the same mask ratio. Then the MIM task is
conducted in parallel on all parts in an iteration and generates predictions.
Besides, we design a self-consistency module to further maintain the
consistency of predictions of overlapping masked patches among parts. Overall,
the proposed method is able to exploit the data more efficiently and obtains
reliable representations. Experiments on ImageNet show that EMAE achieves even
higher results with only 300 pre-training epochs under ViT-Base than MAE (1600
epochs). EMAE also consistently obtains state-of-the-art transfer performance
on various downstream tasks, such as object detection and semantic segmentation. |
This paper proposes Efficient Masked Autoencoders with Self-Consistency (EMAE) to improve pre-training efficiency and consistency in Masked Image Modeling (MIM). |
High random mask ratios in MIM lead to inefficient pre-training and high inconsistency in the pre-trained model. |
EMAE divides the image into non-overlapping parts, performs MIM on each part in parallel, and utilizes a self-consistency module to maintain consistency among overlapping predictions. |
EMAE achieves higher accuracy on ImageNet linear evaluation with fewer epochs compared to MAE.
EMAE consistently obtains state-of-the-art transfer performance on object detection, instance segmentation, and semantic segmentation.
Ablation studies demonstrate the effectiveness of whole data utilization and the self-consistency module. |
The method's performance on larger datasets and architectures needs further investigation due to resource constraints.
The model's reliance on training data statistics might lead to inheriting biases, potentially with negative social impacts. |
self-supervised learning, masked image modeling, vision transformer, pre-training, computer vision |
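The partitioning idea in EMAE can be illustrated with a short sketch. It assumes that the K visible sets are obtained by splitting one random permutation of patch indices, so that every part sees a disjoint 1/K of the patches and masks the rest (K = 4 gives a 75% mask ratio). The function names are hypothetical and the paper's exact "progressive" division procedure may differ.

```python
import numpy as np


def split_into_parts(num_patches, k, rng):
    """Split a random permutation of patch indices into k disjoint,
    equally sized visible sets. If num_patches is not divisible by k,
    the leftover patches are visible in no part."""
    perm = rng.permutation(num_patches)
    usable = num_patches - num_patches % k
    return np.split(perm[:usable], k)


def masks_from_parts(num_patches, parts):
    """Build one boolean mask per part (True = masked, False = visible)."""
    masks = np.ones((len(parts), num_patches), dtype=bool)
    for i, visible in enumerate(parts):
        masks[i, visible] = False
    return masks


rng = np.random.default_rng(0)
parts = split_into_parts(num_patches=196, k=4, rng=rng)
masks = masks_from_parts(196, parts)
```

Because each patch is visible in exactly one part, all patches contribute to the encoder input within a single iteration, and any patch masked in two different parts yields two predictions that a consistency term could compare.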
2302.14368
Report |
Towards Enhanced Controllability of Diffusion Models |
Wonwoong Cho, Hareesh Ravi, Midhun Harikumar, Vinh Khuc, Krishna Kumar Singh, Jingwan Lu, David I. Inouye, Ajinkya Kale |
Denoising Diffusion models have shown remarkable capabilities in generating
realistic, high-quality and diverse images. However, the extent of
controllability during generation is underexplored. Inspired by techniques
based on GAN latent space for image manipulation, we train a diffusion model
conditioned on two latent codes, a spatial content mask and a flattened style
embedding. We rely on the inductive bias of the progressive denoising process
of diffusion models to encode pose/layout information in the spatial structure
mask and semantic/style information in the style code. We propose two generic
sampling techniques for improving controllability. We extend composable
diffusion models to allow for some dependence between conditional inputs, to
improve the quality of generations while also providing control over the amount
of guidance from each latent code and their joint distribution. We also propose
timestep dependent weight scheduling for content and style latents to further
improve the translations. We observe better controllability compared to
existing methods and show that without explicit training objectives, diffusion
models can be used for effective image manipulation and image translation. |
This paper introduces a novel framework to enhance the controllability of image-conditioned diffusion models for image translation and manipulation. |
Diffusion models often lack the fine-grained controllability offered by GANs, limiting their use in applications like reference-based image translation. |
The proposed method learns disentangled content and style latent spaces by training separate encoders alongside the diffusion model. Two novel sampling techniques, Generalized Composable Diffusion Models (GCDM) and timestep-dependent weight scheduling, are introduced to improve controllability during generation. |
GCDM outperforms existing methods, including DiffuseIT and SAE, achieving better FID and LPIPS scores on image translation tasks.
Timestep scheduling, leveraging the inductive bias of diffusion models, further enhances translation quality and control by weighting content and style information across timesteps.
The learned latent spaces demonstrate desirable properties for manipulation, allowing for attribute-specific editing via PCA and smooth content/style interpolations. |
Further research is needed to explore training diffusion models with timestep scheduling to implicitly learn a mixture-of-experts model.
Exploring the use of classifiers to potentially discover better directions for attribute manipulation in the latent space is a promising future direction. |
diffusion models, image translation, image manipulation, controllable generation, latent space |
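The two sampling ideas in this report, combining guidance from content and style codes and weighting them per timestep, can be sketched generically. This is not the paper's exact GCDM formula: the mixing parameter `lam`, the guidance `scale`, the sigmoid schedule and its `sharpness` and `midpoint` values are all illustrative assumptions, as is the choice to weight content more at noisier timesteps.

```python
import numpy as np


def sigmoid_schedule(t, num_steps, sharpness=0.025, midpoint=0.5):
    """Hypothetical timestep-dependent weights: content dominates at large
    (noisy) t, style at small t. The two weights sum to 1."""
    z = sharpness * (t - midpoint * num_steps)
    w_content = 1.0 / (1.0 + np.exp(-z))
    return w_content, 1.0 - w_content


def guided_eps(eps_uncond, eps_content, eps_style, eps_joint,
               t, num_steps, scale=3.0, lam=0.5):
    """Blend a joint guidance direction with independently composed
    content/style directions, in the spirit of classifier-free guidance.

    eps_* are noise predictions of the same shape from one denoiser run
    with no condition, content only, style only, and both conditions.
    """
    w_c, w_s = sigmoid_schedule(t, num_steps)
    independent = (w_c * (eps_content - eps_uncond)
                   + w_s * (eps_style - eps_uncond))
    joint = eps_joint - eps_uncond
    return eps_uncond + scale * (lam * joint + (1.0 - lam) * independent)
```

With `lam=1.0` this reduces to ordinary classifier-free guidance on the joint condition; with `lam=0.0` the two conditions are treated as independent, as in composable diffusion.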
2302.14290
Report |
Learning to Retain while Acquiring: Combating Distribution-Shift in Adversarial Data-Free Knowledge Distillation |
Gaurav Patel, Konda Reddy Mopuri, Qiang Qiu |
Data-free Knowledge Distillation (DFKD) has gained popularity recently, with
the fundamental idea of carrying out knowledge transfer from a Teacher neural
network to a Student neural network in the absence of training data. However,
in the Adversarial DFKD framework, the student network's accuracy suffers due
to the non-stationary distribution of the pseudo-samples under multiple
generator updates. To this end, at every generator update, we aim to maintain
the student's performance on previously encountered examples while acquiring
knowledge from samples of the current distribution. Thus, we propose a
meta-learning inspired framework by treating the task of Knowledge-Acquisition
(learning from newly generated samples) and Knowledge-Retention (retaining
knowledge on previously met samples) as meta-train and meta-test, respectively.
Hence, we dub our method as Learning to Retain while Acquiring. Moreover, we
identify an implicit aligning factor between the Knowledge-Retention and
Knowledge-Acquisition tasks indicating that the proposed student update
strategy enforces a common gradient direction for both tasks, alleviating
interference between the two objectives. Finally, we support our hypothesis by
presenting an extensive evaluation and comparison of our method with prior art on
multiple datasets. |
This paper introduces a novel meta-learning inspired student update strategy for Adversarial Data-Free Knowledge Distillation (DFKD) that maintains student performance on past data (Knowledge-Retention) while learning from new data (Knowledge-Acquisition). |
In Adversarial DFKD, the student network's accuracy suffers due to the constantly changing distribution of generated pseudo-samples. The proposed method addresses this by encouraging the student to retain knowledge from previously encountered distributions. |
The method treats Knowledge-Acquisition (learning from new samples) and Knowledge-Retention (retaining knowledge from past samples) as meta-train and meta-test tasks, respectively. This strategy implicitly aligns the gradients of both tasks, enforcing a common optimization path. |
The proposed method demonstrates significant improvement in the learning evolution and peak accuracies compared to existing Adversarial DFKD methods.
It exhibits global monotonicity in student learning, ensuring consistently high accuracy throughout the distillation process.
The method is scalable across different network architectures and replay schemes, showing consistent improvements with both Memory Buffer and Generative Replay. |
The method's performance on complex datasets like Tiny-ImageNet with Generative Replay requires further investigation.
Training a VAE for Generative Replay on a stream of synthetic samples can be challenging due to distribution drift. |
knowledge distillation, data-free learning, meta-learning, adversarial learning, distribution shift |
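The implicit gradient alignment mentioned in this report can be motivated with a first-order sketch. The notation is assumed for illustration only: $\theta$ denotes the student parameters, $\alpha$ an inner step size, and $\mathcal{L}_{\mathrm{acq}}$ and $\mathcal{L}_{\mathrm{ret}}$ the Knowledge-Acquisition (meta-train) and Knowledge-Retention (meta-test) losses. The paper's exact objective may weight or arrange these terms differently.

```latex
\begin{align}
\theta' &= \theta - \alpha \nabla_\theta \mathcal{L}_{\mathrm{acq}}(\theta)
  && \text{(meta-train step)} \\
\mathcal{L}_{\mathrm{total}}(\theta)
  &= \mathcal{L}_{\mathrm{acq}}(\theta) + \mathcal{L}_{\mathrm{ret}}(\theta')
  && \text{(meta-test evaluated at } \theta'\text{)} \\
\mathcal{L}_{\mathrm{ret}}(\theta')
  &\approx \mathcal{L}_{\mathrm{ret}}(\theta)
   - \alpha \, \nabla_\theta \mathcal{L}_{\mathrm{ret}}(\theta)^{\top}
     \nabla_\theta \mathcal{L}_{\mathrm{acq}}(\theta)
  && \text{(first-order Taylor expansion)}
\end{align}
```

Under this approximation, minimizing $\mathcal{L}_{\mathrm{total}}$ also maximizes the inner product of the two gradients, which is one way to read the claim that the update enforces a common gradient direction for both tasks.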
2302.14007
Report |
Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training |
Ziyu Guo, Renrui Zhang, Longtian Qiu, Xianzhi Li, Pheng-Ann Heng |
Masked Autoencoders (MAE) have shown promising performance in self-supervised
learning for both 2D and 3D computer vision. However, existing MAE-style
methods can only learn from the data of a single modality, i.e., either images
or point clouds, which neglects the implicit semantic and geometric correlation
between 2D and 3D. In this paper, we explore how the 2D modality can benefit 3D
masked autoencoding, and propose Joint-MAE, a 2D-3D joint MAE framework for
self-supervised 3D point cloud pre-training. Joint-MAE randomly masks an input
3D point cloud and its projected 2D images, and then reconstructs the masked
information of the two modalities. For better cross-modal interaction, we
construct Joint-MAE with two hierarchical 2D-3D embedding modules, a joint
encoder, and a joint decoder with modal-shared and modal-specific decoders. On
top of this, we further introduce two cross-modal strategies to boost the 3D
representation learning, which are local-aligned attention mechanisms for 2D-3D
semantic cues, and a cross-reconstruction loss for 2D-3D geometric constraints.
By our pre-training paradigm, Joint-MAE achieves superior performance on
multiple downstream tasks, e.g., 92.4% accuracy for linear SVM on ModelNet40
and 86.07% accuracy on the hardest split of ScanObjectNN. |
This paper proposes Joint-MAE, a novel 2D-3D joint masked autoencoding framework for self-supervised 3D point cloud pre-training, leveraging readily available 2D images to enhance 3D representation learning. |
Existing MAE methods only learn from single modality data (images or point clouds) neglecting the implicit correlations between 2D and 3D. Joint-MAE addresses this by exploiting the dense, fine-grained information in 2D images to benefit 3D point cloud understanding. |
Joint-MAE projects 3D point clouds into 2D depth maps. It uses hierarchical modules for 2D and 3D token embedding, masks tokens, and employs a joint encoder for cross-modal interaction. A joint decoder with modal-shared and specific components reconstructs masked data. Further, it introduces local-aligned attention for better feature interaction and a cross-reconstruction loss for geometric constraint. |
Joint-MAE outperforms existing self-supervised methods, achieving 92.4% accuracy on ModelNet40 with linear SVM.
It demonstrates superior performance on out-of-distribution data, surpassing Point-MAE by 0.89% on the challenging ScanObjectNN dataset.
It excels in few-shot learning scenarios and achieves state-of-the-art results on part segmentation, highlighting its strong representation learning capability. |
While Joint-MAE demonstrates the benefit of 2D for 3D pre-training, exploring the reverse (3D benefiting 2D MAE) is left for future work.
The current design of Joint-MAE relies on projecting point clouds into depth maps; incorporating other 2D modalities like RGB images could further enhance performance. |
self-supervised learning, masked autoencoding, point cloud representation learning, multi-modal learning, cross-modal interaction |
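The projection of a point cloud into a 2D depth map, which Joint-MAE uses to obtain its 2D modality, can be sketched as below. This is a simplified illustration that assumes a single orthographic view along the z-axis and a hypothetical function name; the paper's actual projection (number of views, camera model, resolution) is not specified in this report.

```python
import numpy as np


def project_to_depth_map(points, size=32):
    """Orthographically project an (N, 3) point cloud along the z-axis.

    Returns a (size, size) depth map holding, per pixel, the smallest
    normalised z among the points that land there, plus a boolean mask
    of pixels hit by at least one point. Empty pixels get depth 0.
    """
    pts = np.asarray(points, dtype=np.float64)
    mins = pts.min(axis=0)
    # Uniform scaling by the largest extent preserves the aspect ratio.
    extent = float(np.max(pts.max(axis=0) - mins)) or 1.0
    norm = (pts - mins) / extent  # every coordinate now lies in [0, 1]

    cols = np.minimum((norm[:, 0] * size).astype(int), size - 1)
    rows = np.minimum((norm[:, 1] * size).astype(int), size - 1)

    depth = np.full((size, size), np.inf)
    # Unbuffered minimum so that the nearest point wins in each pixel.
    np.minimum.at(depth, (rows, cols), norm[:, 2])

    mask = np.isfinite(depth)
    depth[~mask] = 0.0
    return depth, mask
```

The resulting map can be split into patches and masked like an ordinary image, while the original points are tokenized and masked on the 3D side.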
2302.13987
Report |
UMIFormer: Mining the Correlations between Similar Tokens for Multi-View 3D Reconstruction |
Zhenwei Zhu, Liying Yang, Ning Li, Chaohao Jiang, Yanyan Liang |
In recent years, many video tasks have achieved breakthroughs by utilizing
the vision transformer and establishing spatial-temporal decoupling for feature
extraction. Although multi-view 3D reconstruction also faces multiple images as
input, it cannot immediately inherit their success due to completely ambiguous
associations between unstructured views. There is no usable prior
relationship similar to the temporal-coherence property of a video.
To solve this problem, we propose a novel transformer network for Unstructured
Multiple Images (UMIFormer). It exploits transformer blocks for decoupled
intra-view encoding and designed blocks for token rectification that mine the
correlation between similar tokens from different views to achieve decoupled
inter-view encoding. Afterward, all tokens acquired from various branches are
compressed into a fixed-size compact representation while preserving rich
information for reconstruction by leveraging the similarities between tokens.
We empirically demonstrate on ShapeNet that our decoupled learning
method is adaptable to unstructured multiple images. Meanwhile, the
experiments also verify our model outperforms existing SOTA methods by a large
margin. Code will be available at https://github.com/GaryZhu1996/UMIFormer. |
This paper introduces UMIFormer, a novel transformer network that decouples intra- and inter-view feature extraction for multi-view 3D reconstruction from unstructured images. |
Existing methods struggle to effectively extract features from unstructured multi-view images due to the lack of prior positional correspondence, like temporal coherence in videos. |
UMIFormer leverages transformer blocks for intra-view encoding and introduces Inter-View-Decoupled Blocks (IVDBs) based on similar token correlations for inter-view encoding. A Similar-Token Merger (STM) compresses features into a compact representation for the decoder. |
UMIFormer significantly outperforms previous state-of-the-art methods on the ShapeNet benchmark.
The proposed decoupled learning method is shown to be effective for unstructured multi-view images.
The model exhibits robustness to varying numbers of input views. |
The model requires large memory and faces computational challenges with a high number of input views.
Future work includes model compression and algorithm acceleration for higher resolution reconstruction and improved inference efficiency. |
3d reconstruction, vision transformer, multi-view learning, deep learning, computer vision |
2302.13770
Report |
Mask Reference Image Quality Assessment |
Pengxiang Xiao, Shuai He, Limin Liu, Anlong Ming |
Understanding semantic information is an essential step in knowing what is
being learned in both full-reference (FR) and no-reference (NR) image quality
assessment (IQA) methods. However, especially for many severely distorted
images, even if there is an undistorted image as a reference (FR-IQA), it is
difficult to perceive the lost semantic and texture information of distorted
images directly. In this paper, we propose a Mask Reference IQA (MR-IQA) method
that masks specific patches of a distorted image and supplements missing
patches with the reference image patches. In this way, our model only needs to
input the reconstructed image for quality assessment. First, we design a mask
generator to select the best candidate patches from reference images and
supplement the lost semantic information in distorted images, thus providing
more reference for quality assessment; in addition, the different masked
patches imply different data augmentations, which favors model training and
reduces overfitting. Second, we provide a Mask Reference Network (MRNet): the
dedicated modules can prevent disturbances due to masked patches and help
eliminate the patch discontinuity in the reconstructed image. Our method
achieves state-of-the-art performance on the benchmark KADID-10k, LIVE and
CSIQ datasets and has better generalization performance across datasets. The
code and results are available in the supplementary material. |
This paper proposes Mask Reference IQA (MR-IQA) to recover lost semantic and texture information in distorted images for better quality assessment. |
Existing FR-IQA methods struggle to recover lost semantic and texture details in distorted images, hindering accurate quality assessment. MR-IQA addresses this by directly incorporating reference image information into distorted regions. |
The method uses a Mask Generator (MG) to select severely distorted patches based on MAE difference with reference images. These patches are then replaced with corresponding reference patches, creating a masked image. This masked image is then fed into a Mask Reference Network (MRNet), a modified Swin Transformer, for quality prediction. The MRNet incorporates a Feature Mask Module (FMM) to mitigate interference from masked patches and enhance feature processing. |
MR-IQA achieves state-of-the-art performance on LIVE, CSIQ, and KADID-10k datasets.
It outperforms both traditional and deep learning-based FR and NR IQA methods.
The method shows strong generalization ability across different datasets. |
The performance improvement is not consistent across all datasets, with less pronounced gains on datasets with simpler distortion types.
Future work can explore optimizing the masking strategy and adapting the approach for NR-IQA. |
image quality assessment, full-reference iqa, semantic information, mask reference image, swin transformer |
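The mask-and-replace step of MR-IQA can be illustrated with a small sketch. It reads "MAE difference" as the per-patch mean absolute error between the distorted and reference images, which is an assumption, and the function name, patch size, and replacement ratio are hypothetical. The learned parts of the method (MRNet and its Feature Mask Module) are not covered.

```python
import numpy as np


def mask_reference(distorted, reference, patch=8, ratio=0.25):
    """Replace the most distorted patches with reference patches.

    Patches are scored by mean absolute error against the reference;
    the top `ratio` fraction is copied over from the reference image.
    Returns the reconstructed image and the boolean patch mask.
    """
    H, W = distorted.shape[:2]
    gh, gw = H // patch, W // patch

    diff = np.abs(distorted.astype(np.float64) - reference.astype(np.float64))
    # Group pixels by patch; a trailing axis covers grayscale and colour.
    diff = diff[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, -1)
    score = diff.mean(axis=(1, 3, 4))  # shape (gh, gw)

    n = int(round(ratio * gh * gw))
    mask = np.zeros((gh, gw), dtype=bool)
    if n > 0:
        worst = np.argsort(score, axis=None)[::-1][:n]
        mask[np.unravel_index(worst, score.shape)] = True

    out = distorted.copy()
    for r, c in zip(*np.nonzero(mask)):
        ys = slice(r * patch, (r + 1) * patch)
        xs = slice(c * patch, (c + 1) * patch)
        out[ys, xs] = reference[ys, xs]
    return out, mask
```

Varying `ratio` or the selection rule across training iterations would produce different reconstructed inputs from the same image pair, which is the data-augmentation effect the abstract mentions.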
2302.13543
Report |
BLiRF: Bandlimited Radiance Fields for Dynamic Scene Modeling |
Sameera Ramasinghe, Violetta Shevchenko, Gil Avraham, Anton Van Den Hengel |
Reasoning about the 3D structure of a non-rigid dynamic scene from a single moving
camera is an under-constrained problem. Inspired by the remarkable progress of
neural radiance fields (NeRFs) in photo-realistic novel view synthesis of
static scenes, extensions have been proposed for dynamic settings. These
methods heavily rely on neural priors in order to regularize the problem. In
this work, we take a step back and reinvestigate how current implementations
may entail deleterious effects, including limited expressiveness, entanglement
of light and density fields, and sub-optimal motion localization. As a remedy,
we advocate for a bridge between classic non-rigid-structure-from-motion
(NRSfM) and NeRF, enabling the well-studied priors of the former to constrain
the latter. To this end, we propose a framework that factorizes time and space
by formulating a scene as a composition of bandlimited, high-dimensional
signals. We demonstrate compelling results across complex dynamic scenes that
involve changes in lighting, texture and long-range dynamics. |
This paper proposes BLiRF, a novel framework for dynamic 3D scene modeling that represents radiance fields as bandlimited signals, allowing for the integration of explicit and implicit priors and enabling efficient factorization of spatio-temporal dynamics. |
Existing dynamic NeRF extensions, heavily reliant on implicit neural priors, suffer from limitations like dependence on a canonical frame, entanglement of light and density fields, limited expressiveness, and sub-optimal motion localization. |
BLiRF models the scene as a composition of bandlimited, high-dimensional signals, factoring in spatio-temporal dynamics. An implementation enforces a low-rank constraint on shape space, a neural prior over the frequency domain, and a union-of-subspaces prior on shape deformation over time. |
BLiRF demonstrates superior modeling of long-range dynamics and motion localization compared to ray deformation-based methods.
The framework effectively disentangles light and density fields, capturing scenes with dynamic lighting and textures.
BLiRF exhibits faster training times and doesn't necessitate complex loss regularizers or optimization procedures common in other dynamic NeRF architectures. |
The volumetric representation limits reconstruction resolution, a trade-off for speed common in grid-based NeRF models.
Exploration of alternative implementations and more complex priors within the generic framework is left for future work. |
neural radiance fields, dynamic scene modeling, novel view synthesis, non-rigid structure from motion, space-time factorization |
2302.13331
Report |
Learning Input-agnostic Manipulation Directions in StyleGAN with Text Guidance |
Yoonjeon Kim, Hyunsu Kim, Junho Kim, Yunjey Choi, Eunho Yang |
With the advantages of fast inference and human-friendly flexible
manipulation, image-agnostic style manipulation via text guidance enables new
applications that were not previously available. The state-of-the-art
text-guided image-agnostic manipulation method embeds the representation of
each channel of StyleGAN independently in the Contrastive Language-Image
Pre-training (CLIP) space, and provides it in the form of a Dictionary to
quickly find out the channel-wise manipulation direction during inference time.
However, in this paper we argue that this dictionary, which is constructed by
controlling each channel individually, is too limited to accommodate the
versatility of text guidance since the collective and interactive relations
among multiple channels are not considered. Indeed, we show that it fails to
discover a large portion of manipulation directions that can be found by
existing methods, which manually manipulate latent space without texts. To
alleviate this issue, we propose a novel method that learns a Dictionary, whose
entry corresponds to the representation of a single channel, by taking into
account the manipulation effect coming from the interaction with multiple other
channels. We demonstrate that our strategy resolves the inability of previous
methods in finding diverse known directions from unsupervised methods and
unknown directions from random text while maintaining the real-time inference
speed and disentanglement ability. |
This paper proposes Multi2One, a novel method for text-guided image manipulation in StyleGAN that learns a dictionary to represent multi-channel manipulation effects in CLIP space. |
Existing text-guided manipulation methods, particularly StyleCLIP's GlobalDirection, fail to capture the full manipulation capabilities of StyleGAN due to their reliance on single-channel manipulation representations, leading to limited coverage of possible edits. |
Multi2One learns a dictionary by embedding the manipulation effects of known directions from unsupervised methods (GANspace, SeFa) into CLIP space. It leverages both the reconstruction of these known directions and the mapping of their multi-channel manipulation effects to CLIP space to learn a more comprehensive representation. |
Multi2One demonstrates superior performance in reconstructing unsupervised directions compared to StyleCLIP GlobalDirection.
It achieves higher cosine similarity scores between manipulated images and text guidance in CLIP space, indicating better alignment with user intent.
The method successfully discovers manipulation directions that were not present in the original unsupervised directions, highlighting its ability to generalize to unseen combinations of semantic attributes. |
The flexibility and diversity of text input are not fully utilized due to limitations in CLIP's encoding ability and deterministic representation.
Future work could explore incorporating more advanced language models or alternative encoding schemes to enhance the expressiveness and controllability of text-guided manipulation. |
text-guided image manipulation, stylegan, clip, dictionary learning, unsupervised directions |
2302.13279
Report |
Makeup Extraction of 3D Representation via Illumination-Aware Image Decomposition |
Xingchao Yang, Takafumi Taketomi, Yoshihiro Kanamori |
Facial makeup enriches the beauty of not only real humans but also virtual
characters; therefore, makeup for 3D facial models is highly in demand in
productions. However, painting directly on 3D faces and capturing real-world
makeup are costly, and extracting makeup from 2D images often struggles with
shading effects and occlusions. This paper presents the first method for
extracting makeup for 3D facial models from a single makeup portrait. Our
method consists of the following three steps. First, we exploit the strong
prior of 3D morphable models via regression-based inverse rendering to extract
coarse materials such as geometry and diffuse/specular albedos that are
represented in the UV space. Second, we refine the coarse materials, which may
have missing pixels due to occlusions. We apply inpainting and optimization.
Finally, we extract the bare skin, makeup, and an alpha matte from the diffuse
albedo. Our method offers various applications for not only 3D facial models
but also 2D portrait images. The extracted makeup is well-aligned in the UV
space, from which we build a large-scale makeup dataset and a parametric makeup
model for 3D faces. Our disentangled materials also yield robust makeup
transfer and illumination-aware makeup interpolation/removal without a
reference image. |
This paper introduces a novel method for extracting facial makeup for 3D models from a single portrait image, enabling illumination-aware makeup manipulation in both 2D and 3D domains. |
Existing makeup transfer methods struggle with physical constraints like lighting and occlusions, while this method offers an integrated solution for realistic makeup application on 3D models. |
The method uses a three-step approach: (1) coarse facial material extraction using 3DMM fitting, (2) UV completion and material refinement via optimization, and (3) makeup extraction using a network trained on makeup and non-makeup albedo datasets. |
The method disentangles bare skin, makeup, and illumination components, enabling realistic makeup transfer while preserving lighting conditions.
The extracted makeup, represented in UV space, facilitates building a large-scale makeup dataset and a PCA-based makeup model for 3D faces.
The framework allows for various applications such as 3D makeup avatar creation, makeup editing, and illumination-aware makeup interpolation/removal. |
The method's reliance on 3DMM limits its ability to capture the full range of skin tones and subtle geometric details.
The current approach focuses on diffuse albedo for makeup extraction; future work could explore specular albedo for more realistic makeup representation. |
makeup extraction, 3d face reconstruction, illumination-aware makeup transfer, uv completion, inverse rendering |
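The third step above separates the diffuse albedo into bare skin, makeup, and an alpha matte. A minimal sketch of how such layers could be recombined is shown below, assuming a standard per-texel alpha blend; the blend formula and the function names `composite_albedo` and `interpolate_makeup` are illustrative assumptions, not taken from the paper.

```python
def composite_albedo(bare_skin, makeup, alpha):
    """Blend per-texel RGB values: alpha * makeup + (1 - alpha) * bare_skin.

    Assumption: a plain alpha blend relates the three extracted layers.
    """
    return [
        tuple(a * m + (1.0 - a) * s for m, s in zip(mk, sk))
        for sk, mk, a in zip(bare_skin, makeup, alpha)
    ]


def interpolate_makeup(bare_skin, makeup, alpha, strength):
    """Scale the matte by strength in [0, 1]: 0 removes makeup, 1 keeps it."""
    scaled = [a * strength for a in alpha]
    return composite_albedo(bare_skin, makeup, scaled)


# One texel of bare skin, one of makeup, and a half-opaque matte.
bare = [(0.8, 0.6, 0.5)]
lipstick = [(0.9, 0.1, 0.2)]
matte = [0.5]
print(composite_albedo(bare, lipstick, matte))
print(interpolate_makeup(bare, lipstick, matte, 0.0))  # makeup removed
```

Because the blend operates on albedo rather than on shaded pixels, lighting can be applied afterwards, which is the sense in which interpolation and removal are illumination-aware.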
2302.13153
Report |
Directed Diffusion: Direct Control of Object Placement through Attention Guidance |
Wan-Duo Kurt Ma, J. P. Lewis, Avisek Lahiri, Thomas Leung, W. Bastiaan Kleijn |
Text-guided diffusion models such as DALLE-2, Imagen, eDiff-I, and Stable
Diffusion are able to generate an effectively endless variety of images given
only a short text prompt describing the desired image content. In many cases
the images are of very high quality. However, these models often struggle to
compose scenes containing several key objects such as characters in specified
positional relationships. The missing capability to "direct" the placement of
characters and objects both within and across images is crucial in
storytelling, as recognized in the literature on film and animation theory. In
this work, we take a particularly straightforward approach to providing the
needed direction. Drawing on the observation that the cross-attention maps for
prompt words reflect the spatial layout of objects denoted by those words, we
introduce an optimization objective that produces "activation" at desired
positions in these cross-attention maps. The resulting approach is a step
toward generalizing the applicability of text-guided diffusion models beyond
single images to collections of related images, as in storybooks. Directed
Diffusion provides easy high-level positional control over multiple objects,
while making use of an existing pre-trained model and maintaining a coherent
blend between the positioned objects and the background. Moreover, it requires
only a few lines to implement. |
This paper introduces Directed Diffusion, a method to control object placement in text-to-image synthesis using pre-trained diffusion models without fine-tuning. |
Existing text-to-image models struggle to compose scenes with multiple objects in specific positions, hindering their use in storytelling and other applications requiring layout control. |
The method leverages the spatial interpretation of cross-attention maps in diffusion models. It optimizes a weight vector to re-weight trailing attention maps, guiding the placement of objects within user-specified bounding boxes during the denoising process. |
Directed Diffusion enables consistent control over the positioning of multiple objects, facilitating image generation for storytelling.
The method ensures seamless integration of positioned objects with the background, maintaining contextual interactions like shadows and lighting.
It offers a simple and efficient approach, requiring only bounding box specifications and a small optimization without extensive training or code changes. |
The method relies on the existing capabilities and limitations of pre-trained models, potentially inheriting their biases or struggling with complex prompts.
While enabling object placement, the approach currently focuses on static images and does not address challenges in generating dynamic scenes or videos. |
denoising diffusion, text-to-image synthesis, object placement, cross-attention guidance, storytelling |
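The record above describes an objective that produces activation inside a user-specified box of a cross-attention map. The sketch below shows one simple loss with that property: one minus the fraction of attention mass that falls inside the box. It is an illustrative stand-in, not the paper's exact objective, and `box_attention_loss` is a hypothetical name.

```python
def box_attention_loss(attn, box):
    """Return 1 - (attention mass inside box) / (total attention mass).

    attn: 2D list of non-negative attention values for one prompt token.
    box:  (top, left, bottom, right) in pixel indices, end-exclusive.
    Assumption: this ratio loss is a stand-in for the paper's objective.
    """
    top, left, bottom, right = box
    total = sum(sum(row) for row in attn)
    if total <= 0:
        raise ValueError("attention map must have positive total mass")
    inside = sum(sum(row[left:right]) for row in attn[top:bottom])
    return 1.0 - inside / total


uniform = [[1, 1], [1, 1]]
print(box_attention_loss(uniform, (0, 0, 1, 2)))  # half the mass is in the top row

focused = [[0, 0], [2, 2]]
print(box_attention_loss(focused, (1, 0, 2, 2)))  # all mass already inside the box
```

The loss is zero when all attention for the token lies inside the box, so minimizing it pushes the corresponding object toward the requested region.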
2302.12995
Report |
Raw Image Reconstruction with Learned Compact Metadata |
Yufei Wang, Yi Yu, Wenhan Yang, Lanqing Guo, Lap-Pui Chau, Alex Kot, Bihan Wen |
While raw images exhibit advantages over sRGB images (e.g., linearity and
fine-grained quantization level), they are not widely used by common users due
to the large storage requirements. Very recent works propose to compress raw
images by designing the sampling masks in the raw image pixel space, leading to
suboptimal image representations and redundant metadata. In this paper, we
propose a novel framework to learn a compact representation in the latent space
serving as the metadata in an end-to-end manner. Furthermore, we propose a
novel sRGB-guided context model with improved entropy estimation strategies,
which leads to better reconstruction quality, smaller size of metadata, and
faster speed. We illustrate how the proposed raw image compression scheme can
adaptively allocate more bits to image regions that are important from a global
perspective. The experimental results show that the proposed method can achieve
superior raw image reconstruction results using a smaller size of the metadata
on both uncompressed sRGB images and JPEG images. |
This paper proposes a novel end-to-end deep encoding framework for raw image reconstruction that learns compact metadata in latent space with adaptive bit allocation, leading to high-fidelity reconstruction with less storage overhead. |
Raw images, despite advantages like linearity and fine-grained quantization, are not widely used due to large storage requirements. Existing compression methods suffer from suboptimal representations and metadata redundancy. |
The framework uses an sRGB-guided context model for efficient latent code encoding and a hyperprior model with improved entropy estimation strategies for further compression. It adaptively allocates bits based on image content, prioritizing complex regions. |
Achieves superior raw image reconstruction quality with lower storage overhead than previous state-of-the-art methods on AdobeFiveK and NUS datasets.
The sRGB-guided context model allows for adaptive bit allocation, prioritizing complex regions and resulting in efficient compression.
The proposed method shows robustness when reconstructing raw images from compressed JPEG images of varying quality factors. |
The current implementation only considers the information from a single sRGB image.
Future work could explore incorporating information from adjacent frames in a video to further reduce redundancy. |
raw image reconstruction, image compression, latent space, adaptive bit allocation, context modeling |
2302.12764
Report |
Modulating Pretrained Diffusion Models for Multimodal Image Synthesis |
Cusuh Ham, James Hays, Jingwan Lu, Krishna Kumar Singh, Zhifei Zhang, Tobias Hinz |
We present multimodal conditioning modules (MCM) for enabling conditional
image synthesis using pretrained diffusion models. Previous multimodal
synthesis works rely on training networks from scratch or fine-tuning
pretrained networks, both of which are computationally expensive for large,
state-of-the-art diffusion models. Our method uses pretrained networks but
does not require any updates to the diffusion network's parameters.
MCM is a small module trained to modulate the diffusion network's predictions
during sampling using 2D modalities (e.g., semantic segmentation maps,
sketches) that were unseen during the original training of the diffusion model.
We show that MCM enables user control over the spatial layout of the image and
leads to increased control over the image generation process. Training MCM is
cheap as it does not require gradients from the original diffusion net,
consists of only ~1% of the number of parameters of the base diffusion
model, and is trained using only a limited number of training examples. We
evaluate our method on unconditional and text-conditional models to demonstrate
the improved control over the generated images and their alignment with respect
to the conditioning inputs. |
This paper introduces Multimodal Conditioning Modules (MCM), a lightweight method for adapting pretrained diffusion models to perform multimodal image synthesis without requiring any updates to the original model's parameters. |
Training diffusion models from scratch or fine-tuning them for specific conditions is computationally expensive. This paper addresses this by enabling multimodal control of pretrained models in a computationally efficient manner. |
MCM is a small diffusion-like network trained to modulate the predictions of a pretrained diffusion model during sampling. It takes new modalities and the diffusion model's intermediate outputs as input and outputs parameters that modulate the noise prediction at each sampling timestep. |
MCM enables user control over spatial layout and generation process of images using new modalities like segmentation maps and sketches.
It achieves high-quality results comparable to fine-tuned models while being significantly smaller and using less training data.
MCM is flexible with respect to sampling methods and can be applied to both unconditional and conditional diffusion models. |
MCM currently only supports 2D modalities.
It struggles to ground semantics with poor-quality training data. |
image synthesis, diffusion models, multimodal learning, conditional image generation, pretrained models |
2302.12469
Report |
Unsupervised Discovery of Semantic Latent Directions in Diffusion Models |
Yong-Hyun Park, Mingi Kwon, Junghyo Jo, Youngjung Uh |
Despite the success of diffusion models (DMs), we still lack a thorough
understanding of their latent space. While image editing with GANs builds upon
latent space, DMs rely on editing the conditions such as text prompts. We
present an unsupervised method to discover interpretable editing directions for
the latent variables $\mathbf{x}_t \in \mathcal{X}$ of DMs. Our method adopts
Riemannian geometry between $\mathcal{X}$ and the intermediate feature maps
$\mathcal{H}$ of the U-Nets to provide a deep understanding over the
geometrical structure of $\mathcal{X}$. The discovered semantic latent
directions mostly yield disentangled attribute changes, and they are globally
consistent across different samples. Furthermore, editing in earlier timesteps
edits coarse attributes, while ones in later timesteps focus on high-frequency
details. We define the curvedness of a line segment between samples to show
that $\mathcal{X}$ is a curved manifold. Experiments on different baselines and
datasets demonstrate the effectiveness of our method even on Stable Diffusion.
Our source code will be made publicly available for future researchers. |
This paper presents an unsupervised method for discovering interpretable editing directions in the latent space of pre-trained diffusion models (DMs). |
Understanding the latent space of DMs is crucial for developing controllable image editing techniques, similar to what has been achieved with GANs. |
The method leverages Riemannian geometry by analyzing the Jacobian of the mapping between the latent space and the intermediate feature space of the U-Net. |
The discovered directions correspond to semantically meaningful image manipulations, such as changing age, gender, or breed.
Editing in earlier timesteps affects coarse attributes, while later timesteps control fine details.
The latent space of DMs exhibits a curved manifold structure. |
Some editing directions can be entangled due to dataset bias and model limitations.
The method exhibits occasional abrupt changes when applied to Stable Diffusion, suggesting a more complex latent space structure. |
machine learning, diffusion model, latent space, image editing, unsupervised learning |
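The method above analyzes the Jacobian of the map from the latent space to the U-Net's intermediate features. A minimal sketch of that idea follows, assuming a small toy function in place of the U-Net and a finite-difference Jacobian; the helper names are hypothetical. The top right-singular vectors of the Jacobian are the latent directions that change the features the most.

```python
import numpy as np


def numerical_jacobian(f, x, eps=1e-5):
    """Central-difference Jacobian of f at x (rows: outputs, columns: inputs)."""
    x = np.asarray(x, dtype=float)
    out_dim = np.asarray(f(x), dtype=float).size
    jac = np.zeros((out_dim, x.size))
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        diff = np.asarray(f(x + step), dtype=float) - np.asarray(f(x - step), dtype=float)
        jac[:, i] = diff.ravel() / (2 * eps)
    return jac


def top_latent_directions(f, x, k=1):
    """Return the k right-singular vectors of the Jacobian with largest singular values."""
    jac = numerical_jacobian(f, x)
    _, singular_values, vt = np.linalg.svd(jac, full_matrices=False)
    return vt[:k], singular_values[:k]


# Toy "feature map" (assumption): features respond 3x more to the first latent axis.
A = np.array([[3.0, 0.0], [0.0, 1.0]])
directions, strengths = top_latent_directions(lambda x: A @ x, np.zeros(2), k=1)
print(directions, strengths)
```

For a real diffusion model the Jacobian is far too large to build column by column, so this explicit construction is only meant to make the geometry concrete.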
2302.12464
Report |
RGI: robust GAN-inversion for mask-free image inpainting and unsupervised pixel-wise anomaly detection |
Shancong Mou, Xiaoyi Gu, Meng Cao, Haoping Bai, Ping Huang, Jiulong Shan, Jianjun Shi |
Generative adversarial networks (GANs), trained on a large-scale image
dataset, can be a good approximator of the natural image manifold.
GAN-inversion, using a pre-trained generator as a deep generative prior, is a
promising tool for image restoration under corruptions. However, the
performance of GAN-inversion can be limited by a lack of robustness to unknown
gross corruptions, i.e., the restored image might easily deviate from the
ground truth. In this paper, we propose a Robust GAN-inversion (RGI) method
with a provable robustness guarantee to achieve image restoration under unknown
gross corruptions, where a small fraction of pixels are completely
corrupted. Under mild assumptions, we show that the restored image and the
identified corrupted region mask converge asymptotically to the ground truth.
Moreover, we extend RGI to Relaxed-RGI (R-RGI) for generator fine-tuning to
mitigate the gap between the GAN learned manifold and the true image manifold
while avoiding trivial overfitting to the corrupted input image, which further
improves the image restoration and corrupted region mask identification
performance. The proposed RGI/R-RGI method unifies two important applications
with state-of-the-art (SOTA) performance: (i) mask-free semantic inpainting,
where the corruptions are unknown missing regions, the restored background can
be used to restore the missing content; (ii) unsupervised pixel-wise anomaly
detection, where the corruptions are unknown anomalous regions, the retrieved
mask can be used as the anomalous region's segmentation mask. |
This paper proposes Robust GAN-inversion (RGI) and Relaxed RGI (R-RGI) methods to improve robustness and accuracy of GAN-inversion for image restoration under unknown gross corruptions, where a small fraction of pixels are completely corrupted. |
Existing GAN-inversion methods lack robustness to gross corruptions and suffer from an approximation gap between the learned and true image manifolds, limiting their performance in image restoration and anomaly detection. |
RGI learns the latent representation and the corrupted-region mask simultaneously by minimizing a reconstruction loss with a sparsity penalty on the mask. R-RGI extends RGI by incorporating generator fine-tuning to mitigate the approximation gap. |
RGI/R-RGI provably converges to the true clean image and corrupted region mask asymptotically.
RGI/R-RGI enables mask-free semantic inpainting, achieving comparable performance to methods requiring pre-configured masks.
R-RGI significantly outperforms state-of-the-art unsupervised pixel-wise anomaly detection methods on a synthetic defect dataset. |
The computational cost of RGI/R-RGI is high due to the optimization process for each image.
The performance of RGI/R-RGI relies on sufficient training data for GAN to learn a generalizable image manifold. |
gan-inversion, image restoration, anomaly detection, semantic inpainting, robust optimization |
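The idea of fitting a latent code while a sparsity penalty isolates grossly corrupted pixels can be illustrated on a toy problem. The sketch below swaps the GAN generator for a linear map `A` and the mask for an additive sparse residual `s`, and alternates between a least-squares update of the latent `z` and soft-thresholding of `s`. This is a simplified stand-in under those stated assumptions, not the RGI algorithm itself.

```python
import numpy as np


def soft_threshold(r, lam):
    """Proximal operator of lam * ||.||_1: shrink each entry toward zero by lam."""
    return np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)


def robust_linear_inversion(A, x, lam=0.5, iters=50):
    """Alternately minimize 0.5 * ||x - A z - s||^2 + lam * ||s||_1 over z and s.

    Assumption: a linear "generator" A replaces the GAN, and the sparse
    residual s plays the role of the corrupted-region mask.
    """
    s = np.zeros_like(x, dtype=float)
    z = np.zeros(A.shape[1])
    for _ in range(iters):
        z, *_ = np.linalg.lstsq(A, x - s, rcond=None)  # fit latent on cleaned signal
        s = soft_threshold(x - A @ z, lam)              # absorb large residuals
    return z, s


# Constant signal with true latent 2.0 and one grossly corrupted entry.
A = np.ones((8, 1))
x = 2.0 * np.ones(8)
x[3] += 10.0
z, s = robust_linear_inversion(A, x)
print(z, np.nonzero(s)[0])
```

The nonzero entries of `s` mark the detected corruption, which corresponds loosely to the corrupted-region mask used for inpainting or anomaly segmentation.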
2302.12400
Report |
Towards Stable Test-Time Adaptation in Dynamic Wild World |
Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Zhiquan Wen, Yaofo Chen, Peilin Zhao, Mingkui Tan |
Test-time adaptation (TTA) has been shown to be effective at tackling distribution
shifts between training and testing data by adapting a given model on test
samples. However, the online model updating of TTA may be unstable and this is
often a key obstacle preventing existing TTA methods from being deployed in the
real world. Specifically, TTA may fail to improve or even harm the model
performance when test data have: 1) mixed distribution shifts, 2) small batch
sizes, and 3) online imbalanced label distribution shifts, which are quite
common in practice. In this paper, we investigate the causes of instability and find
that the batch norm layer is a crucial factor hindering TTA stability.
Conversely, TTA can perform more stably with batch-agnostic norm layers, i.e.,
group or layer norm. However, we observe that TTA with group and layer norms
does not always succeed and still suffers many failure cases. By digging into
the failure cases, we find that certain noisy test samples with large gradients
may disturb the model adaptation and result in collapsed trivial solutions, i.e.,
assigning the same class label for all samples. To address the above collapse
issue, we propose a sharpness-aware and reliable entropy minimization method,
called SAR, for further stabilizing TTA from two aspects: 1) remove partial
noisy samples with large gradients, 2) encourage model weights to go to a flat
minimum so that the model is robust to the remaining noisy samples. Promising
results demonstrate that SAR performs more stably than prior methods and is
computationally efficient under the above wild test scenarios. |
This paper proposes a sharpness-aware and reliable entropy minimization method (SAR) to stabilize online test-time adaptation (TTA) under wild test settings (mixed distribution shifts, small batch sizes, and imbalanced label shifts). |
Existing TTA methods often fail to improve or even harm model performance under these common real-world test scenarios, hindering their practical deployment. |
The paper first analyzes and verifies that batch-agnostic norm layers are more beneficial for stable TTA than batch norm. To address the model collapse issue of entropy-based methods on these models, SAR removes noisy samples with large gradients based on entropy and encourages optimization to a flat minimum for robustness to remaining noisy samples. |
Batch-agnostic norm layers (group and layer norm) are more beneficial for stable TTA under wild test settings than batch norm.
Online entropy minimization on group/layer norm models may lead to collapsed trivial solutions.
SAR stabilizes online TTA under wild test settings by effectively removing noisy samples and optimizing to a flat minimum, outperforming prior methods. |
The paper focuses on entropy-based online TTA methods and may not be directly applicable to other TTA strategies.
Future work can explore incorporating other stability-enhancing techniques into SAR or investigating its effectiveness on broader tasks beyond image classification. |
test-time adaptation, domain shift, entropy minimization, sharpness-aware learning, model robustness |
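The first part of SAR, as described above, removes noisy samples using their prediction entropy. A minimal sketch of such a filter follows; setting the threshold to a fraction of ln(num_classes) is an assumption made for illustration, and the sharpness-aware weight update is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def reliable_indices(batch_logits, threshold):
    """Keep only samples whose prediction entropy is below the threshold."""
    return [
        i for i, logits in enumerate(batch_logits)
        if entropy(softmax(logits)) < threshold
    ]


batch = [[10.0, 0.0, 0.0],   # confident prediction, low entropy
         [0.0, 0.0, 0.0]]    # uniform prediction, entropy = ln(3)
threshold = 0.4 * math.log(3)  # assumption: a fraction of the maximum entropy
print(reliable_indices(batch, threshold))
```

Only the retained samples would contribute to the entropy-minimization loss; the rest are skipped for that update step.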
2302.12253
Report |
DisCO: Portrait Distortion Correction with Perspective-Aware 3D GANs |
Zhixiang Wang, Yu-Lun Liu, Jia-Bin Huang, Shin'ichi Satoh, Sizhuo Ma, Gurunandan Krishnan, Jian Wang |
Close-up facial images captured at short distances often suffer from
perspective distortion, resulting in exaggerated facial features and
unnatural/unattractive appearances. We propose a simple yet effective method
for correcting perspective distortions in a single close-up face. We first
perform GAN inversion using a perspective-distorted input facial image by
jointly optimizing the camera intrinsic/extrinsic parameters and face latent
code. To address the ambiguity of joint optimization, we initialize from a short
distance and apply optimization scheduling, reparametrizations, and geometric
regularization. Re-rendering the portrait at a proper focal length and camera
distance effectively corrects perspective distortions and produces more
natural-looking results. Our experiments show that our method compares
favorably against previous approaches qualitatively and quantitatively. We
showcase numerous examples validating the applicability of our method on
in-the-wild portrait photos. We will release our code and the evaluation
protocol to facilitate future work. |
This paper introduces DisCO, a novel method for correcting perspective distortions in close-up facial images using perspective-aware 3D GAN inversion. |
Close-up photos, like selfies, often suffer from undesirable perspective distortions that make facial features appear exaggerated. Existing correction methods struggle with severe distortions and cannot synthesize missing details. |
DisCO jointly optimizes camera parameters (focal length, distance) and face latent code. To address optimization ambiguity, it employs strategies like close-up distance initialization, separate optimization scheduling, parameter reparameterizations, and geometric constraints. It further utilizes a geometry-aware stitching technique to handle full images, ensuring consistent manipulation of both the face and body. |
DisCO outperforms previous methods qualitatively and quantitatively on benchmark datasets.
The method effectively corrects severe distortions in in-the-wild images, generating more natural and visually pleasing results.
It allows for additional visual effects like dolly-zoom videos. |
DisCO faces challenges with out-of-distribution faces, such as extreme expressions or occlusions.
The current implementation relies on optimization-based inversion, which limits its speed. Future work will explore encoder-based solutions for real-time performance. |
perspective correction, 3d gan inversion, portrait distortion, face editing, dolly zoom |
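Re-rendering at a larger camera distance with a matching focal length, and the dolly-zoom effect mentioned above, both rest on the pinhole relation that projected size is proportional to focal length divided by distance. The sketch below works through that arithmetic; the function names and the example numbers are illustrative assumptions.

```python
def projected_size(focal, object_size, distance):
    """Pinhole projection: image-plane size of an object at the given distance."""
    return focal * object_size / distance


def dolly_zoom_focal(focal, distance, new_distance):
    """Focal length that keeps the subject's projected size when the camera moves."""
    return focal * new_distance / distance


# Moving the camera from 0.3 m to 1.5 m requires a 5x longer focal length.
f_near, d_near, d_far = 24.0, 0.3, 1.5
f_far = dolly_zoom_focal(f_near, d_near, d_far)
print(f_far)
print(projected_size(f_near, 0.2, d_near), projected_size(f_far, 0.2, d_far))
```

The face keeps its size in the frame, but the ratio of distances from the camera to the nose and to the ears moves closer to one, which is what reduces the exaggerated features.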
2302.12251
Report |
VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion |
Yiming Li, Zhiding Yu, Christopher Choy, Chaowei Xiao, Jose M. Alvarez, Sanja Fidler, Chen Feng, Anima Anandkumar |
Humans can easily imagine the complete 3D geometry of occluded objects and
scenes. This appealing ability is vital for recognition and understanding. To
enable such capability in AI systems, we propose VoxFormer, a Transformer-based
semantic scene completion framework that can output complete 3D volumetric
semantics from only 2D images. Our framework adopts a two-stage design where we
start from a sparse set of visible and occupied voxel queries from depth
estimation, followed by a densification stage that generates dense 3D voxels
from the sparse ones. A key idea of this design is that the visual features on
2D images correspond only to the visible scene structures rather than the
occluded or empty spaces. Therefore, starting with the featurization and
prediction of the visible structures is more reliable. Once we obtain the set
of sparse queries, we apply a masked autoencoder design to propagate the
information to all the voxels by self-attention. Experiments on SemanticKITTI
show that VoxFormer outperforms the state of the art with a relative
improvement of 20.0% in geometry and 18.1% in semantics and reduces GPU memory
during training to less than 16GB. Our code is available on
https://github.com/NVlabs/VoxFormer. |
Proposes VoxFormer, a Transformer-based framework for camera-based 3D semantic scene completion, which outputs complete 3D volumetric semantics from only 2D images. |
Enabling AI systems to imagine the complete 3D geometry of occluded objects and scenes is vital for recognition and understanding in applications like autonomous driving. |
Adopts a two-stage design: (1) a query proposal network generates sparse occupied voxel queries from depth estimation, (2) a masked autoencoder-like Transformer densifies the sparse voxels and performs semantic segmentation. |
Outperforms state-of-the-art camera-based methods by a large margin on SemanticKITTI.
Achieves comparable performance to LiDAR-based methods, especially in safety-critical short-range areas.
Significantly improves the completion of small objects compared to baselines. |
Long-range performance needs further improvement due to unreliable depth estimation at far distances.
Decoupling long-range and short-range scene completion is a potential future direction. |
semantic scene completion, 3d vision, autonomous driving, transformer, camera-based perception |
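The first stage above derives a sparse set of occupied voxel queries from estimated depth. A minimal geometric sketch of that idea follows: it back-projects each depth pixel through a pinhole camera and records the voxel it lands in. The paper uses a learned query proposal network, so this plain back-projection, the intrinsics, and the name `depth_to_voxel_queries` are assumptions for illustration only.

```python
import math


def depth_to_voxel_queries(depth, fx, fy, cx, cy, voxel_size):
    """Back-project a depth map and return the set of occupied voxel indices.

    depth: 2D list of metric depths (rows are v, columns are u); <= 0 means invalid.
    fx, fy, cx, cy: pinhole intrinsics (assumed known).
    """
    occupied = set()
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # skip pixels without a valid depth estimate
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            occupied.add((
                math.floor(x / voxel_size),
                math.floor(y / voxel_size),
                math.floor(z / voxel_size),
            ))
    return occupied


depth = [[1.0, 1.0],
         [0.0, 2.0]]
print(sorted(depth_to_voxel_queries(depth, 1.0, 1.0, 0.0, 0.0, 0.5)))
```

Only visible surfaces produce queries here, which matches the observation in the record that image features correspond to visible structure rather than occluded or empty space.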
2302.12248
Report |
Learning Visual Representations via Language-Guided Sampling |
Mohamed El Banani, Karan Desai, Justin Johnson |
Although an object may appear in numerous contexts, we often describe it in a
limited number of ways. Language allows us to abstract away visual variation to
represent and communicate concepts. Building on this intuition, we propose an
alternative approach to visual representation learning: using language
similarity to sample semantically similar image pairs for contrastive learning.
Our approach diverges from image-based contrastive learning by sampling view
pairs using language similarity instead of hand-crafted augmentations or
learned clusters. Our approach also differs from image-text contrastive
learning by relying on pre-trained language models to guide the learning rather
than directly minimizing a cross-modal loss. Through a series of experiments,
we show that language-guided learning yields better features than image-based
and image-text representation learning approaches. |
This paper proposes a new method called language-guided contrastive learning for visual representation learning, which utilizes language similarity to sample semantically similar image pairs for contrastive learning. |
Current image-based contrastive learning methods rely on visual similarity as a proxy for conceptual similarity, which limits the learned visual invariances. This work uses language as a proxy for conceptual similarity to improve generalization. |
The method samples image pairs with similar captions using a pre-trained sentence encoder (SBERT) and uses those pairs for contrastive learning with SimCLR, SimSiam, or SLIP. |
Language-guided contrastive learning outperforms image-only and image-text contrastive learning on linear probe and few-shot classification tasks.
The approach is robust to the choice of sampling strategy or language model, showing consistent performance gains with different sentence encoders.
Sampling nearest neighbors in language space provides higher-quality pairs for training compared to sampling in visual feature space, especially for self-supervised visual models. |
Image captions can be noisy or vague, resulting in the retrieval of unrelated image pairs.
A caption only captures one aspect of an image, potentially leading to similarity based on irrelevant factors. |
contrastive learning, visual representation learning, self-supervised learning, language-guided learning, image captioning |
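The sampling step above pairs each image with the image whose caption is nearest in a sentence-embedding space. A minimal sketch follows, assuming the caption embeddings have already been computed (for example by a sentence encoder) and are given as plain vectors; `language_guided_pairs` is a hypothetical name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def language_guided_pairs(caption_embeddings):
    """For each image index, pick the other image with the most similar caption."""
    n = len(caption_embeddings)
    pairs = []
    for i, emb in enumerate(caption_embeddings):
        best = max(
            (j for j in range(n) if j != i),
            key=lambda j: cosine(emb, caption_embeddings[j]),
        )
        pairs.append((i, best))
    return pairs


# Toy embeddings (assumption): captions 0 and 1 are similar, as are 2 and 3.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(language_guided_pairs(embeddings))
```

Each returned pair would then be treated as two views of the same concept in a contrastive objective, in place of two augmentations of a single image.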
2302.12237
Report |
Learning Neural Volumetric Representations of Dynamic Humans in Minutes |
Chen Geng, Sida Peng, Zhen Xu, Hujun Bao, Xiaowei Zhou |
This paper addresses the challenge of quickly reconstructing free-viewpoint
videos of dynamic humans from sparse multi-view videos. Some recent works
represent the dynamic human as a canonical neural radiance field (NeRF) and a
motion field, which are learned from videos through differentiable rendering.
But the per-scene optimization generally requires hours. Other generalizable
NeRF models leverage learned prior from datasets and reduce the optimization
time by only finetuning on new scenes at the cost of visual fidelity. In this
paper, we propose a novel method for learning neural volumetric videos of
dynamic humans from sparse view videos in minutes with competitive visual
quality. Specifically, we define a novel part-based voxelized human
representation to better distribute the representational power of the network
to different human parts. Furthermore, we propose a novel 2D motion
parameterization scheme to increase the convergence rate of deformation field
learning. Experiments demonstrate that our model can be learned 100 times
faster than prior per-scene optimization methods while being competitive in the
rendering quality. Training our model on a $512 \times 512$ video with 100
frames typically takes about 5 minutes on a single RTX 3090 GPU. The code will
be released on our project page: https://zju3dv.github.io/instant_nvr |
This paper presents a novel dynamic human representation that significantly accelerates the optimization of neural human models from videos, achieving a 100x speedup compared to previous methods. |
Creating volumetric videos of human performers from multi-view videos has many applications, but existing methods suffer from lengthy optimization times, hindering their practical use. |
The proposed representation combines a part-based voxelized human model with a 2D motion parameterization scheme. The human body is decomposed into parts, each represented by an independent NeRF network with varying resolutions, optimizing representational power distribution. A 2D surface parameterization is used to predict motion, leveraging the fact that human motion primarily occurs at the surface level, which significantly reduces the dimensionality of the motion field and improves convergence rate. |
The proposed method achieves 100x faster optimization compared to previous neural human representations.
It maintains competitive rendering quality with state-of-the-art methods on benchmark datasets like ZJU-MoCap and MonoCap.
Training the model on a 100-frame monocular video with 512x512 resolution takes approximately 5 minutes on an RTX 3090 GPU. |
The method currently relies on accurate SMPL parameters, which may be difficult to obtain in unconstrained environments.
It focuses on reconstructing foreground dynamic humans and cannot handle dynamic backgrounds. |
neural human modeling, volumetric video, nerf, motion parameterization, fast optimization |
2302.12231
Report |
DiffusioNeRF: Regularizing Neural Radiance Fields with Denoising Diffusion Models |
Jamie Wynn, Daniyar Turmukhambetov |
Under good conditions, Neural Radiance Fields (NeRFs) have shown impressive
results on novel view synthesis tasks. NeRFs learn a scene's color and density
fields by minimizing the photometric discrepancy between training views and
differentiable renderings of the scene. Once trained from a sufficient set of
views, NeRFs can generate novel views from arbitrary camera positions. However,
the scene geometry and color fields are severely under-constrained, which can
lead to artifacts, especially when trained with few input views.
To alleviate this problem we learn a prior over scene geometry and color,
using a denoising diffusion model (DDM). Our DDM is trained on RGBD patches of
the synthetic Hypersim dataset and can be used to predict the gradient of the
logarithm of a joint probability distribution of color and depth patches. We
show that these gradients of logarithms of RGBD patch priors serve to
regularize geometry and color of a scene. During NeRF training, random RGBD
patches are rendered and the estimated gradient of the log-likelihood is
backpropagated to the color and density fields. Evaluations on LLFF, the most
relevant dataset, show that our learned prior achieves improved quality in the
reconstructed geometry and improved generalization to novel views. Evaluations
on DTU show improved reconstruction quality among NeRF methods. |
This paper introduces DiffusioNeRF, a novel approach for regularizing Neural Radiance Fields (NeRFs) using Denoising Diffusion Models (DDMs). |
NeRFs often produce low-quality or physically implausible geometries and appearances, particularly when trained on a limited number of input views. This method aims to address this issue and improve the quality of NeRF reconstructions. |
A DDM is trained on RGBD patches from the synthetic Hypersim dataset to learn a prior over scene geometry and color. The DDM provides gradients of the log-likelihood of RGBD patches, which are then used to regularize the NeRF's density and color fields during training. |
The learned prior improves the quality of reconstructed geometry, resulting in more plausible depth maps.
DiffusioNeRF shows improved generalization to novel views, particularly in the few-view setting.
On the DTU dataset, DiffusioNeRF achieves improved reconstruction quality compared to other NeRF methods, even surpassing some SDF-based methods. |
The DDM regularization can sometimes lead to over-smoothing of thin structures.
Further research is needed on the principled combination of DDM gradients with the NeRF objective to optimize the scheduling of diffusion time (τ) and gradient weights. |
neural radiance fields, nerf, denoising diffusion models, novel view synthesis, 3d reconstruction |
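The regularization idea in the report above (following the gradient of a log-prior over rendered RGBD patches) can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: it assumes an analytic Gaussian prior in place of the learned DDM, and it updates the patch directly instead of backpropagating to the NeRF's color and density fields. The names `toy_score` and `regularize_step` are hypothetical.

```python
import numpy as np

def toy_score(patch, mean, sigma):
    """Gradient of log N(patch; mean, sigma^2 I) with respect to the patch."""
    return -(patch - mean) / sigma**2

def regularize_step(patch, mean, sigma, weight, lr):
    """One gradient-ascent step on the weighted log-prior of a rendered patch."""
    return patch + lr * weight * toy_score(patch, mean, sigma)

rng = np.random.default_rng(0)
# Hypothetical 4x4 RGBD patch: three color channels plus one depth channel.
prior_mean = np.full((4, 4, 4), 0.5)
patch = rng.uniform(0.0, 1.0, size=prior_mean.shape)

before = np.linalg.norm(patch - prior_mean)
for _ in range(50):
    patch = regularize_step(patch, prior_mean, sigma=1.0, weight=1.0, lr=0.1)
after = np.linalg.norm(patch - prior_mean)

# Each step shrinks the distance to the prior mode by a factor of 0.9.
print(f"distance to prior mode: {before:.4f} -> {after:.4f}")
```

In the method as described, the score would come from the trained diffusion model and would be combined with the photometric loss; the diffusion time and gradient weight would also need scheduling, which the report lists as open work.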
2302.12228
Report |
Encoder-based Domain Tuning for Fast Personalization of Text-to-Image Models |
Rinon Gal, Moab Arar, Yuval Atzmon, Amit H. Bermano, Gal Chechik, Daniel Cohen-Or |
Text-to-image personalization aims to teach a pre-trained diffusion model to
reason about novel, user-provided concepts, embedding them into new scenes
guided by natural language prompts. However, current personalization approaches
struggle with lengthy training times, high storage requirements or loss of
identity. To overcome these limitations, we propose an encoder-based
domain-tuning approach. Our key insight is that by underfitting on a large set
of concepts from a given domain, we can improve generalization and create a
model that is more amenable to quickly adding novel concepts from the same
domain. Specifically, we employ two components: First, an encoder that takes as
an input a single image of a target concept from a given domain, e.g. a
specific face, and learns to map it into a word-embedding representing the
concept. Second, a set of regularized weight-offsets for the text-to-image
model that learn how to effectively ingest additional concepts. Together, these
components are used to guide the learning of unseen concepts, allowing us to
personalize a model using only a single image and as few as 5 training steps -
accelerating personalization from dozens of minutes to seconds, while
preserving quality. |
This paper proposes Encoder for Tuning (E4T), an encoder-based domain-tuning method for fast personalization of text-to-image models, enabling adaptation to novel concepts in seconds. |
Current personalization methods for text-to-image models are slow, require significant storage for each concept, and often lead to overfitting. |
E4T pretrains on a large dataset of a specific domain (e.g., faces, cats) to learn an encoder that maps concept images to word embeddings and weight offsets for efficient model tuning. At inference, it personalizes the model using a single image and few training steps. |
E4T achieves comparable or superior personalization quality to existing methods like Textual Inversion and DreamBooth, using only a single image and significantly less training time.
The iterative refinement approach used in E4T allows the model to focus on high-level details first and progressively refine the concept representation during denoising.
Quantitative evaluation demonstrates E4T's effectiveness in capturing identity while adhering to user prompts, placing it on the Pareto front for both metrics. |
The reliance on large, domain-specific datasets for encoder pretraining limits E4T's applicability to concepts with abundant training data.
The need for inference-time tuning, while fast, requires capable hardware and more memory compared to direct fine-tuning methods. |
text-to-image synthesis, personalization, diffusion models, encoder-decoder architecture, domain adaptation |
2302.12066
Report |
Teaching CLIP to Count to Ten |
Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, Tali Dekel |
Large vision-language models (VLMs), such as CLIP, learn rich joint
image-text representations, facilitating advances in numerous downstream tasks,
including zero-shot classification and text-to-image generation. Nevertheless,
existing VLMs exhibit a prominent well-documented limitation - they fail to
encapsulate compositional concepts such as counting. We introduce a simple yet
effective method to improve the quantitative understanding of VLMs, while
maintaining their overall performance on common benchmarks. Specifically, we
propose a new counting-contrastive loss used to finetune a pre-trained VLM in
tandem with its original objective. Our counting loss is deployed over
automatically-created counterfactual examples, each consisting of an image and
a caption containing an incorrect object count. For example, an image depicting
three dogs is paired with the caption "Six dogs playing in the yard". Our loss
encourages discrimination between the correct caption and its counterfactual
variant which serves as a hard negative example. To the best of our knowledge,
this work is the first to extend CLIP's capabilities to object counting.
Furthermore, we introduce "CountBench" - a new image-text counting benchmark
for evaluating a model's understanding of object counting. We demonstrate a
significant improvement over state-of-the-art baseline models on this task.
Finally, we leverage our count-aware CLIP model for image retrieval and
text-conditioned image generation, demonstrating that our model can produce
specific counts of objects more reliably than existing ones. |
This paper introduces a novel method for improving the quantitative understanding of large-scale vision-language models (VLMs) like CLIP, enabling them to better comprehend and process object counts in images and text. |
Existing VLMs struggle with compositional concepts like counting, limiting their performance in tasks such as image retrieval and text-to-image generation. This work addresses this limitation, enhancing VLMs' ability to accurately associate object counts in text with visual representations. |
The method involves creating a filtered counting training set with captions explicitly stating object counts. A novel counting-contrastive loss is introduced, training the VLM to distinguish between correct captions and counterfactual ones with incorrect object counts. |
The proposed method significantly improves zero-shot count classification accuracy on the newly introduced CountBench benchmark.
The finetuned VLMs retain their performance on general zero-shot classification tasks, demonstrating the preservation of their original knowledge.
The enhanced VLMs exhibit improved performance in text-to-image retrieval and generation, producing results that better adhere to specified object counts in text prompts. |
The method's performance is limited by the availability of training data, particularly for images with a large number of objects and corresponding captions accurately stating the count.
The current implementation focuses on counting up to ten and may not generalize well to larger numbers without further adaptation. |
vision-language models, clip, counting, compositionality, text-to-image generation |
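The counterfactual-caption idea in the report above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the helper names `make_counterfactual` and `counting_loss` are hypothetical, the similarity scores are plain numbers standing in for CLIP image-text similarities, and the loss is written as a two-way softmax cross-entropy, which is one plausible form of a contrastive loss with a single hard negative.

```python
import numpy as np

NUMBER_WORDS = ["two", "three", "four", "five", "six",
                "seven", "eight", "nine", "ten"]

def make_counterfactual(caption, rng):
    """Swap the first number word in the caption for a different one."""
    words = caption.split()
    for i, word in enumerate(words):
        if word.lower() in NUMBER_WORDS:
            choices = [n for n in NUMBER_WORDS if n != word.lower()]
            swapped = str(rng.choice(choices))
            # Preserve the capitalization of the original word.
            words[i] = swapped.capitalize() if word[0].isupper() else swapped
            return " ".join(words)
    raise ValueError("caption contains no number word")

def counting_loss(sim_true, sim_counterfactual):
    """-log(exp(s_true) / (exp(s_true) + exp(s_cf))), computed stably."""
    return float(np.logaddexp(sim_true, sim_counterfactual) - sim_true)

rng = np.random.default_rng(0)
caption = "Three dogs playing in the yard"
counterfactual = make_counterfactual(caption, rng)

# The loss is small when the true caption scores higher than the counterfactual.
loss_good = counting_loss(sim_true=5.0, sim_counterfactual=0.0)
loss_bad = counting_loss(sim_true=0.0, sim_counterfactual=5.0)
```

The loss only compares the correct caption against its own counterfactual, so it targets the count word specifically; the report notes that it is used alongside the original contrastive objective rather than replacing it.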
2302.11831
Report |
Embedding Fourier for Ultra-High-Definition Low-Light Image Enhancement |
Chongyi Li, Chun-Le Guo, Man Zhou, Zhexin Liang, Shangchen Zhou, Ruicheng Feng, Chen Change Loy |
Ultra-High-Definition (UHD) photo has gradually become the standard
configuration in advanced imaging devices. The new standard unveils many issues
in existing approaches for low-light image enhancement (LLIE), especially in
dealing with the intricate issue of joint luminance enhancement and noise
removal while remaining efficient. Unlike existing methods that address the
problem in the spatial domain, we propose a new solution, UHDFour, that embeds
Fourier transform into a cascaded network. Our approach is motivated by a few
unique characteristics in the Fourier domain: 1) most luminance information
concentrates on amplitudes while noise is closely related to phases, and 2) a
high-resolution image and its low-resolution version share similar amplitude
patterns. Through embedding Fourier into our network, the amplitude and phase of
a low-light image are separately processed to avoid amplifying noise when
enhancing luminance. Besides, UHDFour is scalable to UHD images by implementing
amplitude and phase enhancement under the low-resolution regime and then
adjusting the high-resolution scale with few computations. We also contribute
the first real UHD LLIE dataset, UHD-LL, that contains 2,150
low-noise/normal-clear 4K image pairs with diverse darkness and noise levels
captured in different scenarios. With this dataset, we systematically analyze
the performance of existing LLIE methods for processing UHD images and
demonstrate the advantage of our solution. We believe our new framework,
coupled with the dataset, would push the frontier of LLIE towards UHD. The code
and dataset are available at https://li-chongyi.github.io/UHDFour. |
This paper proposes UHDFour, a novel UHD Low-Light Image Enhancement (LLIE) framework that leverages Fourier transform in a cascaded network for efficient joint luminance enhancement and noise removal, addressing limitations of existing spatial domain methods. |
Existing LLIE methods struggle to handle real-world UHD images due to limitations in noise removal, suboptimal enhancement, incompatibility with high-resolution inputs, and inefficiency. UHDFour tackles these challenges by processing images in the Fourier domain. |
UHDFour consists of two sub-networks, LRNet and HRNet. LRNet processes a downsampled image in the Fourier domain, enhancing amplitude and phase separately, and estimates a low-resolution output. HRNet then refines the amplitude and phase at high resolution using LRNet's outputs and estimates the final high-resolution result. |
UHDFour outperforms 14 state-of-the-art LLIE methods on the newly introduced UHD-LL dataset, achieving superior quantitative and qualitative results.
The paper introduces UHD-LL, the first real-world UHD LLIE dataset with 2,150 low-noise/normal-clear 4K image pairs, addressing the lack of diverse, high-resolution benchmark data.
Analysis reveals that existing LLIE models, even when retrained, fail to effectively handle noise and maintain image fidelity in UHD images. |
The study is limited to image enhancement, excluding video data and adversarial losses.
Models trained on sRGB data may not generalize to extreme cases where limited bit depth causes information loss, which motivates exploring HDR data. |
low-light image enhancement, uhd image processing, fourier transform, deep learning, image denoising |
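The first Fourier-domain observation in the report above (luminance lives mostly in the amplitude) can be checked in an idealized setting. The sketch below assumes a noise-free image that is darkened by a uniform scale factor, which is far simpler than real low-light data; the function names are hypothetical and only standard NumPy FFT calls are used.

```python
import numpy as np

def amplitude_phase(img):
    """Split the 2D Fourier spectrum of an image into amplitude and phase."""
    spectrum = np.fft.fft2(img)
    return np.abs(spectrum), np.angle(spectrum)

def recombine(amplitude, phase):
    """Rebuild a real image from an amplitude map and a phase map."""
    return np.real(np.fft.ifft2(amplitude * np.exp(1j * phase)))

rng = np.random.default_rng(0)
normal = rng.uniform(0.2, 1.0, size=(32, 32))  # stand-in normal-light image
dark = 0.1 * normal                            # uniformly darkened copy

amp_normal, _ = amplitude_phase(normal)
amp_dark, phase_dark = amplitude_phase(dark)

# Uniform darkening scales the amplitude and leaves the phase unchanged, so
# pairing the normal-light amplitude with the dark phase restores the image.
restored = recombine(amp_normal, phase_dark)
```

Real low-light images also contain noise, which the report associates with the phase; that part is not modeled here and is the reason the method processes the two components separately.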
2302.11710
Report |
Controlled and Conditional Text to Image Generation with Diffusion Prior |
Pranav Aggarwal, Hareesh Ravi, Naveen Marri, Sachin Kelkar, Fengbin Chen, Vinh Khuc, Midhun Harikumar, Ritiz Tambi, Sudharshan Reddy Kakumanu, Purvak Lapsiya, Alvin Ghouas, Sarah Saber, Malavika Ramprasad, Baldo Faieta, Ajinkya Kale |
Denoising Diffusion models have shown remarkable performance in generating
diverse, high quality images from text. Numerous techniques have been proposed
on top of or in alignment with models like Stable Diffusion and Imagen that
generate images directly from text. A less explored approach is DALLE-2's
two-step process comprising a Diffusion Prior that generates a CLIP image embedding
from text and a Diffusion Decoder that generates an image from a CLIP image
embedding. We explore the capabilities of the Diffusion Prior and the
advantages of an intermediate CLIP representation. We observe that Diffusion
Prior can be used in a memory and compute efficient way to constrain the
generation to a specific domain without altering the larger Diffusion Decoder.
Moreover, we show that the Diffusion Prior can be trained with additional
conditional information such as color histogram to further control the
generation. We show quantitatively and qualitatively that the proposed
approaches perform better than prompt engineering for domain specific
generation and existing baselines for color conditioned generation. We believe
that our observations and results will instigate further research into the
diffusion prior and uncover more of its capabilities. |
This paper explores the capabilities of Diffusion Prior, a component of DALLE-2, for controllable and conditional text-to-image generation by training it on specific domains and with additional conditional information like color histograms. |
This approach allows for domain-specific and conditional generation without modifying the larger Diffusion Decoder, making it memory and computationally efficient. |
The authors trained separate Diffusion Prior models on datasets of textures, vectors, isolated objects, and color histograms. They also trained a custom LDM conditioned on CLIP L/14 image embeddings as the Diffusion Decoder. |
Domain-specific priors effectively constrain image generation to the desired domain (textures, vectors, isolated objects) and outperform Stable Diffusion with prompt engineering.
The color-conditioned prior generates images aligning with both the text prompt and color palette, surpassing color transfer methods applied to Stable Diffusion outputs in terms of quality and semantic relevance.
The proposed method is more memory and computationally efficient than finetuning large diffusion models for similar tasks. |
The color prior might be biased towards generating vector images when trained on a dataset containing vector images with color histograms.
Further research is needed to explore the approach's effectiveness on a wider range of domains and conditional inputs. |
diffusion models, text-to-image generation, conditional image generation, domain adaptation, diffusion prior |
2302.11566
Report |
Vid2Avatar: 3D Avatar Reconstruction from Videos in the Wild via Self-supervised Scene Decomposition |
Chen Guo, Tianjian Jiang, Xu Chen, Jie Song, Otmar Hilliges |
We present Vid2Avatar, a method to learn human avatars from monocular
in-the-wild videos. Reconstructing humans that move naturally from monocular
in-the-wild videos is difficult. Solving it requires accurately separating
humans from arbitrary backgrounds. Moreover, it requires reconstructing
detailed 3D surface from short video sequences, making it even more
challenging. Despite these challenges, our method does not require any
groundtruth supervision or priors extracted from large datasets of clothed
human scans, nor do we rely on any external segmentation modules. Instead, it
solves the tasks of scene decomposition and surface reconstruction directly in
3D by modeling both the human and the background in the scene jointly,
parameterized via two separate neural fields. Specifically, we define a
temporally consistent human representation in canonical space and formulate a
global optimization over the background model, the canonical human shape and
texture, and per-frame human pose parameters. A coarse-to-fine sampling
strategy for volume rendering and novel objectives are introduced for a clean
separation of dynamic human and static background, yielding detailed and robust
3D human geometry reconstructions. We evaluate our methods on publicly
available datasets and show improvements over prior art. |
This paper presents Vid2Avatar, a method to reconstruct detailed 3D avatars from monocular in-the-wild videos via self-supervised scene decomposition, without requiring groundtruth supervision, priors from large datasets, or external segmentation modules. |
Reconstructing humans from in-the-wild videos is challenging because it requires separating humans from arbitrary backgrounds and reconstructing detailed surfaces from short video sequences. |
The method jointly models the human and background with separate neural fields and optimizes them globally. It defines a temporally consistent human representation in canonical space and utilizes a coarse-to-fine sampling strategy with novel objectives for clean separation. |
Vid2Avatar outperforms state-of-the-art methods in 2D segmentation, novel view synthesis, and 3D reconstruction.
It achieves robust and detailed 3D reconstruction of humans with complex clothing and facial features.
It demonstrates high-quality results on various in-the-wild videos from different sources. |
The method relies on reasonable pose estimates as input.
It faces challenges with loose clothing due to fast dynamics. |
3d human reconstruction, scene decomposition, neural rendering, implicit neural representation, monocular video |
2302.11562
Report |
Uncovering Bias in Face Generation Models |
Cristian Muñoz, Sara Zannone, Umar Mohammed, Adriano Koshiyama |
Recent advancements in GANs and diffusion models have enabled the creation of
high-resolution, hyper-realistic images. However, these models may misrepresent
certain social groups and present bias. Understanding bias in these models
remains an important research question, especially for tasks that support
critical decision-making and could affect minorities. The contribution of this
work is a novel analysis covering architectures and embedding spaces for
fine-grained understanding of bias over three approaches: generators, attribute
modifiers, and post-processing bias mitigators. This work shows that generators
suffer from bias across all social groups with attribute preferences such as
between 75%-85% for whiteness and 60%-80% for the female gender (for all
trained CelebA models) and low probabilities of generating children and older
men. Modifiers and mitigators work as post-processors and change the generator
performance. For instance, attribute channel perturbation strategies modify the
embedding spaces. We quantify the influence of this change on group fairness by
measuring the impact on image quality and group features. Specifically, we use
the Fréchet Inception Distance (FID), the Face Matching Error and the
Self-Similarity score. For InterfaceGAN, we analyze one and two attribute
channel perturbations and examine the effect on the fairness distribution and
the quality of the image. Finally, we analyzed the post-processing bias
mitigators, which are the fastest and most computationally efficient way to
mitigate bias. We find that these mitigation techniques show similar results on
KL divergence and FID score, however, self-similarity scores show a different
feature concentration on the new groups of the data distribution. The
weaknesses and ongoing challenges described in this work must be considered in
the pursuit of creating fair and unbiased face generation models. |
The paper presents a novel analysis of bias in face generation models, focusing on architectures and embedding spaces to understand bias in generators, attribute modifiers, and post-processing bias mitigators. |
Understanding bias in face generation models is crucial as biased datasets can lead to unfair representations and discriminatory outcomes, especially in critical decision-making tasks affecting minorities. |
The study analyzes bias across different generators (StyleGAN2, CIPS, LDM, DDPM), attribute channel modifiers (InterfaceGAN, GANSpace, StyleSpace), and bias mitigators (StyleFlow, FairGen, FairStyle) using metrics like FID, Face Matching Error, Self-Similarity score, and KL divergence. |
Generators exhibit bias across social groups, showing preferences for whiteness (75%-85%) and female gender (60%-80%) in CelebA-trained models, and low representation of children and older men.
Attribute modifiers, while manipulating attribute boundaries, impact generator performance, as seen with InterfaceGAN and its effect on fairness distribution and image quality.
Post-processing bias mitigators, while computationally efficient, show varying results, with similar KL divergence and FID scores but differing self-similarity scores, indicating varied feature concentration in mitigated datasets. |
The study primarily uses binary classifications for certain attributes like age (Young/Adult), which might not fully capture the nuances of age representation.
Future work could explore intersectional bias across multiple attributes and develop more robust evaluation metrics for fairness in face generation models. |
bias analysis, face generation, bias mitigation, gans, diffusion models |
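One of the metrics named in the report above, KL divergence, can be illustrated on a two-group attribute distribution. This is a minimal sketch: the 80/20 split is a made-up value chosen inside the 75%-85% range quoted in the report, the uniform reference distribution and the direction of the divergence are assumptions, and `kl_divergence` is a hypothetical helper.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same groups.

    Assumes q is non-zero wherever p is non-zero.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    mask = p > 0  # terms with p == 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

generated = [0.80, 0.20]  # illustrative group shares in generated faces
balanced = [0.50, 0.50]   # assumed fairness target: equal representation

gap = kl_divergence(generated, balanced)
print(f"KL(generated || balanced) = {gap:.4f} nats")
```

A value of zero would mean the generated shares match the target exactly. As the report points out, two mitigators can reach similar KL values while concentrating features differently, so this number alone does not describe the resulting distribution.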
2302.11552
Report |
Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and MCMC |
Yilun Du, Conor Durkan, Robin Strudel, Joshua B. Tenenbaum, Sander Dieleman, Rob Fergus, Jascha Sohl-Dickstein, Arnaud Doucet, Will Grathwohl |
Since their introduction, diffusion models have quickly become the prevailing
approach to generative modeling in many domains. They can be interpreted as
learning the gradients of a time-varying sequence of log-probability density
functions. This interpretation has motivated classifier-based and
classifier-free guidance as methods for post-hoc control of diffusion models.
In this work, we build upon these ideas using the score-based interpretation of
diffusion models, and explore alternative ways to condition, modify, and reuse
diffusion models for tasks involving compositional generation and guidance. In
particular, we investigate why certain types of composition fail using current
techniques and present a number of solutions. We conclude that the sampler (not
the model) is responsible for this failure and propose new samplers, inspired
by MCMC, which enable successful compositional generation. Further, we propose
an energy-based parameterization of diffusion models which enables the use of
new compositional operators and more sophisticated, Metropolis-corrected
samplers. Intriguingly we find these samplers lead to notable improvements in
compositional generation across a wide set of problems such as
classifier-guided ImageNet modeling and compositional text-to-image generation. |
This paper investigates the compositionality of diffusion models, focusing on why typical composition methods fail and introducing solutions based on MCMC sampling and an energy-based parameterization for diffusion models. |
Compositionality in generative models is crucial for efficiently repurposing learned priors and achieving flexible generation without retraining for complex scenarios. |
The authors analyze the failure of existing composition methods in diffusion models, proposing annealed MCMC sampling and an energy-based parameterization to address the issue. They evaluate their method on various datasets, including 2D synthetic data, CLEVR, ImageNet, and text-to-image generation. |
MCMC sampling significantly improves compositional generation quality compared to reverse diffusion.
The energy-based parameterization enables more sophisticated MCMC sampling techniques with Metropolis corrections, leading to further improvements.
The proposed methods demonstrate impressive results in complex compositional tasks like text-to-image generation with multiple concepts and generating image tapestries with spatially controlled content. |
The use of sophisticated MCMC sampling increases computational cost compared to standard diffusion sampling.
The energy-based parameterization requires double the memory and compute compared to score-parameterized models. |
diffusion models, compositional generation, energy-based models, mcmc sampling, text-to-image generation |
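The score-based view of composition in the report above can be illustrated on a one-dimensional toy problem. The sketch below assumes two analytic Gaussian densities in place of trained diffusion models and runs unadjusted Langevin dynamics at a single noise level; it does not include the annealing over noise levels or the Metropolis correction that the report describes, and all function names are hypothetical.

```python
import numpy as np

def gaussian_score(x, mean, std):
    """Gradient of log N(x; mean, std^2) with respect to x."""
    return -(x - mean) / std**2

def composed_score(x):
    """A product of densities corresponds to a sum of scores."""
    return gaussian_score(x, -1.0, 1.0) + gaussian_score(x, 1.0, 1.0)

def langevin_sample(score_fn, n_chains, n_steps, step_size, rng):
    """Unadjusted Langevin dynamics run on many independent chains."""
    x = 3.0 * rng.normal(size=n_chains)  # deliberately broad initialization
    for _ in range(n_steps):
        noise = rng.normal(size=n_chains)
        x = x + step_size * score_fn(x) + np.sqrt(2.0 * step_size) * noise
    return x

rng = np.random.default_rng(0)
samples = langevin_sample(composed_score, n_chains=5000, n_steps=500,
                          step_size=0.01, rng=rng)

# The product N(-1, 1) * N(1, 1) is proportional to N(0, 0.5).
print(f"mean = {samples.mean():.3f}, variance = {samples.var():.3f}")
```

Without a correction step, the sampler carries a small discretization bias that grows with the step size. A Metropolis accept/reject step removes that bias but needs density ratios, which is the role of the energy-based parameterization described in the report.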
2302.11383
Report |
Entity-Level Text-Guided Image Manipulation |
Yikai Wang, Jianan Wang, Guansong Lu, Hang Xu, Zhenguo Li, Wei Zhang, Yanwei Fu |
Existing text-guided image manipulation methods aim to modify the appearance
of the image or to edit a few objects in a virtual or simple scenario, which is
far from practical applications. In this work, we study a novel task on
text-guided image manipulation on the entity level in the real world (eL-TGIM).
The task imposes three basic requirements, (1) to edit the entity consistent
with the text descriptions, (2) to preserve the entity-irrelevant regions, and
(3) to merge the manipulated entity into the image naturally. To this end, we
propose an elegant framework, dubbed as SeMani, forming the Semantic
Manipulation of real-world images that can not only edit the appearance of
entities but also generate new entities corresponding to the text guidance. To
solve eL-TGIM, SeMani decomposes the task into two phases: the semantic
alignment phase and the image manipulation phase. In the semantic alignment
phase, SeMani incorporates a semantic alignment module to locate the
entity-relevant region to be manipulated. In the image manipulation phase,
SeMani adopts a generative model to synthesize new images conditioned on the
entity-irrelevant regions and target text descriptions. We discuss and propose
two popular generation processes that can be utilized in SeMani, the discrete
auto-regressive generation with transformers and the continuous denoising
generation with diffusion models, yielding SeMani-Trans and SeMani-Diff,
respectively. We conduct extensive experiments on the real-world CUB,
Oxford, and COCO datasets to verify that SeMani can distinguish the
entity-relevant and -irrelevant regions and achieve more precise and flexible
manipulation in a zero-shot manner compared with baseline methods. Our codes
and models will be released at https://github.com/Yikai-Wang/SeMani. |
This paper introduces entity-Level Text-Guided Image Manipulation (eL-TGIM), a novel task aiming to manipulate specific entities within an image using text descriptions. |
eL-TGIM addresses the limitations of existing TGIM methods that struggle to precisely identify and edit entities in real-world images. |
The authors propose SeMani, a framework that decomposes eL-TGIM into semantic alignment and image manipulation phases. They present two variants: SeMani-Trans, employing discrete token-wise processing, and SeMani-Diff, utilizing continuous pixel-level manipulation with diffusion models. |
SeMani effectively distinguishes and manipulates entities based on text descriptions while preserving irrelevant image regions.
SeMani-Trans demonstrates the ability to manipulate both appearance and structure of entities.
Quantitative and qualitative evaluations on CUB, Oxford, and COCO datasets show SeMani's superiority over existing TGIM methods. |
SeMani-Trans's autoregressive generation may limit its capacity to fully leverage unmasked image regions.
Future work could explore enhancing SeMani's ability to handle complex relationships and interactions between multiple entities. |
image manipulation, text-guided image editing, semantic alignment, diffusion models, vision and language |
2302.11306
Report |
Human MotionFormer: Transferring Human Motions with Vision Transformers |
Hongyu Liu, Xintong Han, Chengbin Jin, Lihui Qian, Huawei Wei, Zhe Lin, Faqiang Wang, Haoye Dong, Yibing Song, Jia Xu, Qifeng Chen |
Human motion transfer aims to transfer motions from a target dynamic person
to a source static one for motion synthesis. An accurate matching between the
source person and the target motion in both large and subtle motion changes is
vital for improving the transferred motion quality. In this paper, we propose
Human MotionFormer, a hierarchical ViT framework that leverages global and
local perceptions to capture large and subtle motion matching, respectively. It
consists of two ViT encoders to extract input features (i.e., a target motion
image and a source human image) and a ViT decoder with several cascaded blocks
for feature matching and motion transfer. In each block, we set the target
motion feature as Query and the source person as Key and Value, calculating the
cross-attention maps to conduct a global feature matching. Further, we
introduce a convolutional layer to improve the local perception after the
global cross-attention computations. This matching process is implemented in
both warping and generation branches to guide the motion transfer. During
training, we propose a mutual learning loss to enable the co-supervision
between warping and generation branches for better motion representations.
Experiments show that our Human MotionFormer sets the new state-of-the-art
performance both qualitatively and quantitatively. Project page:
https://github.com/KumapowerLIU/Human-MotionFormer |
This paper proposes Human MotionFormer, a hierarchical Vision Transformer framework that leverages global and local perceptions for accurate motion matching in human motion transfer. |
Accurate matching between source person and target motion is crucial for high-quality motion transfer, especially in scenarios with both large and subtle motion changes. |
The method utilizes two ViT encoders for feature extraction and a ViT decoder with cascaded blocks for feature matching and motion transfer. It incorporates cross-attention for global matching and convolutional layers for local refinement. A mutual learning loss is introduced to enable co-supervision between warping and generation branches during training. |
MotionFormer achieves state-of-the-art performance both qualitatively and quantitatively on human motion transfer benchmarks.
The method effectively captures both large and subtle motion changes, resulting in more realistic and natural motion transfer results.
The proposed mutual learning loss effectively improves the quality of generated images by enhancing the complementariness of warping and generation branches. |
The model assumes a fixed background, which might limit its applicability in complex real-world scenes.
The computational cost of the model is relatively high compared to some existing methods. |
motion transfer, vision transformer, global and local matching, mutual learning, image generation |
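The global matching step in the report above (target motion as Query, source person as Key and Value) follows the usual cross-attention pattern, which can be sketched as below. This is a single-head NumPy illustration with random weights and made-up token counts, not the authors' decoder block; the convolutional layer for local perception and the warping and generation branches are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_attention(target_motion, source_person, w_q, w_k, w_v):
    """Queries come from the target motion; keys and values from the source."""
    q = target_motion @ w_q
    k = source_person @ w_k
    v = source_person @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v, attn

rng = np.random.default_rng(0)
motion_tokens = rng.normal(size=(6, 16))   # hypothetical: 6 target-motion tokens
person_tokens = rng.normal(size=(10, 16))  # hypothetical: 10 source-person tokens
w_q, w_k, w_v = (rng.normal(size=(16, 8)) for _ in range(3))

matched, attn_map = cross_attention(motion_tokens, person_tokens, w_q, w_k, w_v)
```

Each row of `attn_map` is a distribution over source tokens, so every target-motion token gathers appearance features from the whole source image, which is what makes the matching global.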
2302.10893
Report |
Fair Diffusion: Instructing Text-to-Image Generation Models on Fairness |
Felix Friedrich, Manuel Brack, Lukas Struppek, Dominik Hintersdorf, Patrick Schramowski, Sasha Luccioni, Kristian Kersting |
Generative AI models have recently achieved astonishing results in quality
and are consequently employed in a fast-growing number of applications.
However, since they are highly data-driven, relying on billion-sized datasets
randomly scraped from the internet, they also suffer from degenerated and
biased human behavior, as we demonstrate. In fact, they may even reinforce such
biases. To not only uncover but also combat these undesired effects, we present
a novel strategy, called Fair Diffusion, to attenuate biases after the
deployment of generative text-to-image models. Specifically, we demonstrate
shifting a bias, based on human instructions, in any direction yielding
arbitrarily new proportions for, e.g., identity groups. As our empirical
evaluation demonstrates, this introduced control enables instructing generative
image models on fairness, with no data filtering or additional training
required. |
The paper introduces Fair Diffusion, a novel strategy to mitigate biases in deployed text-to-image generative models by allowing users to instruct the model on fairness using textual guidance. |
Generative AI models, despite their impressive capabilities, often perpetuate and amplify biases present in their training data, leading to unfair outcomes in applications. |
Fair Diffusion builds upon classifier-free guidance and introduces a fairness guidance term that allows users to steer image generation towards desired attribute proportions, enabling the implementation of different fairness definitions. |
The study reveals significant gender and racial biases in Stable Diffusion's training dataset (LAION-5B) and its pre-trained model (CLIP), which are mirrored in the generated images.
Stable Diffusion's generated images exhibit amplification, reflection, or mitigation of biases compared to LAION-5B, with no clear tendency observed.
Fair Diffusion successfully mitigates gender occupation biases in Stable Diffusion's output, shifting attribute proportions towards user-defined fairness goals while preserving overall image composition. |
The study relies on binary gender classification due to the limitations of current tools, while acknowledging the non-binary nature of gender.
The evaluation of Fair Diffusion relies on a pre-trained classifier (FairFace) for gender classification, which may have its own inherent biases. |
fairness, bias mitigation, generative ai, text-to-image synthesis, diffusion models |
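A minimal sketch of how the fairness guidance summarized in record 2302.10893 might be wired on top of classifier-free guidance. For each generated image an attribute is sampled according to user-defined target proportions, and the noise estimate is pushed toward that attribute and away from the other. The function names, the default scales, and the simplified additive form of the fairness term are assumptions for illustration, not the authors' exact formulation.

```python
import random


def pick_attribute(target_proportions, rng):
    """Sample the attribute to guide toward, e.g. {"female": 0.5, "male": 0.5}."""
    names = list(target_proportions)
    weights = [target_proportions[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def guided_noise(eps_uncond, eps_text, eps_toward, eps_away,
                 guidance_scale=7.5, fair_scale=3.0):
    """Combine noise estimates (flat lists of floats) for one denoising step.

    eps_uncond: estimate with an empty prompt
    eps_text:   estimate conditioned on the user prompt
    eps_toward: estimate conditioned on the attribute to strengthen
    eps_away:   estimate conditioned on the attribute to weaken
    """
    combined = []
    for u, t, a, b in zip(eps_uncond, eps_text, eps_toward, eps_away):
        text_term = guidance_scale * (t - u)          # classifier-free guidance
        fair_term = fair_scale * ((a - u) - (b - u))  # assumed fairness term
        combined.append(u + text_term + fair_term)
    return combined
```

Sampling the attribute per image, rather than fixing it, is what would let the output proportions across many generations approach the user-defined target.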
2302.10781
Report |
Learning 3D Photography Videos via Self-supervised Diffusion on Single Images |
Xiaodong Wang, Chenfei Wu, Shengming Yin, Minheng Ni, Jianfeng Wang, Linjie Li, Zhengyuan Yang, Fan Yang, Lijuan Wang, Zicheng Liu, Yuejian Fang, Nan Duan |
3D photography renders a static image into a video with appealing 3D visual
effects. Existing approaches typically first conduct monocular depth
estimation, then render the input frame to subsequent frames with various
viewpoints, and finally use an inpainting model to fill those missing/occluded
regions. The inpainting model plays a crucial role in rendering quality, but it
is normally trained on out-of-domain data. To reduce the training and inference
gap, we propose a novel self-supervised diffusion model as the inpainting
module. Given a single input image, we automatically construct a training pair
of the masked occluded image and the ground-truth image with random
cycle-rendering. The constructed training samples are closely aligned to the
testing instances, without the need of data annotation. To make full use of the
masked images, we design a Masked Enhanced Block (MEB), which can be easily
plugged into the UNet and enhance the semantic conditions. Towards real-world
animation, we present a novel task: out-animation, which extends the space and
time of input objects. Extensive experiments on real datasets show that our
method achieves competitive results with existing SOTA methods. |
This paper proposes a novel self-supervised diffusion model for 3D photography that can generate high-quality 3D videos from single images, addressing the limitations of previous methods requiring large multi-view datasets. |
Existing 3D photography methods suffer from a gap between training and inference, particularly in complex scenes, leading to visual distortions. This work aims to bridge this gap and enable high-quality 3D video generation from single images. |
The proposed method uses a cycle-rendering technique to create self-supervised training pairs of masked and ground truth images. It then leverages a conditional diffusion model with a Masked Enhanced Block (MEB) to learn to inpaint the occluded regions of images, resulting in realistic 3D videos. |
The method outperforms previous state-of-the-art methods in novel view synthesis on RealEstate10k and MannequinChallenge datasets.
Qualitative results demonstrate the model's ability to generate clearer, more realistic content with better detail preservation compared to baselines.
The proposed out-animation task extends the capabilities of 3D photography by generating videos that extend the space and time of input objects, showing promising results on the MSCOCO dataset. |
The method currently relies on monocular depth estimation, which can introduce errors in complex scenes.
Further exploration is needed to improve the temporal consistency and smoothness of generated 3D videos, particularly in the out-animation task. |
3d photography, diffusion models, self-supervised learning, novel view synthesis, out-animation |
2302.10688
Report |
On Calibrating Diffusion Probabilistic Models |
Tianyu Pang, Cheng Lu, Chao Du, Min Lin, Shuicheng Yan, Zhijie Deng |
Recently, diffusion probabilistic models (DPMs) have achieved promising
results in diverse generative tasks. A typical DPM framework includes a forward
process that gradually diffuses the data distribution and a reverse process
that recovers the data distribution from time-dependent data scores. In this
work, we observe that the stochastic reverse process of data scores is a
martingale, from which concentration bounds and the optional stopping theorem
for data scores can be derived. Then, we discover a simple way for calibrating
an arbitrary pretrained DPM, with which the score matching loss can be reduced
and the lower bounds of model likelihood can consequently be increased. We
provide general calibration guidelines under various model parametrizations.
Our calibration method is performed only once and the resulting models can be
used repeatedly for sampling. We conduct experiments on multiple datasets to
empirically validate our proposal. Our code is at
https://github.com/thudzj/Calibrated-DPMs. |
This paper presents a simple calibration technique for pre-trained diffusion probabilistic models (DPMs) to enhance sample quality and model likelihood. |
Existing DPMs often suffer from mis-calibration due to dataset bias or sub-optimal training, leading to reduced performance. |
The method leverages the martingale property of data scores in DPMs and subtracts a time-dependent calibration term (expectation of the score model) from the pre-trained model's output. |
Calibrated DPMs demonstrate significantly improved sample quality (FID score) on CIFAR-10 and CelebA datasets, especially with high-order DPM-Solver samplers.
Calibration reduces the score matching objective, leading to an increased lower bound for model likelihood, as evidenced by experiments on various datasets.
The calibration term can be effectively estimated using a substantial portion of training data or generated data from the pre-trained model. |
While improving model likelihood, calibration does not always guarantee a lower FID score, highlighting the complex relationship between likelihood and sample quality.
Post-training calibration is computationally challenging for text-to-image generation due to the vast number of conditions, requiring alternative strategies like dynamic recording. |
diffusion models, generative models, score matching, calibration, model likelihood |
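The calibration step summarized in record 2302.10688 can be illustrated with a short sketch: estimate the expectation of the pretrained model's output at each timestep once, then subtract it at sampling time. The helper names and the toy `biased_model` below are hypothetical, and the Monte Carlo estimate over a fixed list of samples is an assumption about how the expectation would be approximated.

```python
def estimate_calibration_term(score_model, samples_at_t, t):
    """Monte Carlo estimate of E[score_model(x_t, t)] over noisy samples x_t."""
    outputs = [score_model(x, t) for x in samples_at_t]
    dim = len(outputs[0])
    return [sum(o[i] for o in outputs) / len(outputs) for i in range(dim)]


def calibrated_model(score_model, calibration_terms):
    """Wrap a pretrained model so its output is shifted by the stored term.

    calibration_terms maps each timestep t to the vector estimated above;
    it is computed once and reused for every subsequent sampling run.
    """
    def model(x, t):
        raw = score_model(x, t)
        shift = calibration_terms[t]
        return [r - s for r, s in zip(raw, shift)]
    return model


def biased_model(x, t):
    """Toy stand-in for a pretrained model with a constant +0.5 bias."""
    return [-xi + 0.5 for xi in x]


# Samples symmetric around zero, so the unbiased score would average to zero.
samples = [[-1.0], [1.0], [-2.0], [2.0]]
terms = {0: estimate_calibration_term(biased_model, samples, 0)}
calibrated = calibrated_model(biased_model, terms)
```

In this toy setting the estimated term recovers the constant bias, so the calibrated model's average output over the samples is zero.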
2302.10668
Report |
$PC^2$: Projection-Conditioned Point Cloud Diffusion for Single-Image 3D Reconstruction |
Luke Melas-Kyriazi, Christian Rupprecht, Andrea Vedaldi |
Reconstructing the 3D shape of an object from a single RGB image is a
long-standing and highly challenging problem in computer vision. In this paper,
we propose a novel method for single-image 3D reconstruction which generates a
sparse point cloud via a conditional denoising diffusion process. Our method
takes as input a single RGB image along with its camera pose and gradually
denoises a set of 3D points, whose positions are initially sampled randomly
from a three-dimensional Gaussian distribution, into the shape of an object.
The key to our method is a geometrically-consistent conditioning process which
we call projection conditioning: at each step in the diffusion process, we
project local image features onto the partially-denoised point cloud from the
given camera pose. This projection conditioning process enables us to generate
high-resolution sparse geometries that are well-aligned with the input image,
and can additionally be used to predict point colors after shape
reconstruction. Moreover, due to the probabilistic nature of the diffusion
process, our method is naturally capable of generating multiple different
shapes consistent with a single input image. In contrast to prior work, our
approach not only performs well on synthetic benchmarks, but also gives large
qualitative improvements on complex real-world data. |
This paper introduces Projection-Conditioned Point Cloud Diffusion, a novel method for reconstructing 3D objects from single RGB images. This is achieved by gradually denoising a randomly sampled point cloud into the shape of an object, guided by image features projected onto the points throughout the process. |
Reconstructing 3D shapes from single images is challenging but crucial for applications like AR/VR. Existing deep learning methods are often limited to low-resolution outputs or struggle with representing shape ambiguity. This method leverages the power of denoising diffusion models to produce high-resolution, diverse 3D reconstructions. |
The method utilizes a conditional denoising diffusion process on point clouds. Crucially, it introduces "projection conditioning", where image features are projected onto the intermediate point clouds at each denoising step, ensuring geometric consistency with the input image. This conditioning is also used for predicting point colors. |
The method achieves competitive results on the ShapeNet benchmark, particularly excelling in reconstructing objects with fine details.
Qualitative results on the real-world Co3D dataset demonstrate the capability to generate high-quality, detailed 3D reconstructions, outperforming previous methods like NeRF-WCE in handling shape uncertainty.
By exploiting the probabilistic nature of diffusion models, the method can produce multiple plausible 3D shapes per input image, enabling filtering strategies to select the most consistent reconstruction. |
The method currently relies on point cloud ground truth for training, although this can be obtained from multi-view data.
While filtering strategies improve results, there's room for developing more sophisticated filtering criteria to further bridge the gap to the oracle upper bound. |
3d reconstruction, diffusion models, point clouds, single-view reconstruction, conditional image synthesis |
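A rough sketch of the projection conditioning idea from record 2302.10668: each partially denoised 3D point is projected into the image, and the feature found at that pixel is attached to the point. It assumes the points are already expressed in camera coordinates, uses a simple pinhole model with a single hypothetical focal length, looks up the nearest pixel, and ignores occlusion, so it is an illustration rather than the paper's pipeline.

```python
def project_features(points, feature_map, focal, cx, cy):
    """Attach image features to 3D points via pinhole projection.

    points:      list of (x, y, z) in camera coordinates
    feature_map: nested list indexed as feature_map[row][col] -> list of floats
    focal, cx, cy: assumed pinhole intrinsics (focal length, principal point)
    Returns one list per point: [x, y, z, f_0, ..., f_{C-1}].
    """
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])
    conditioned = []
    for x, y, z in points:
        feat = [0.0] * channels  # default for points that miss the image
        if z > 0:  # only points in front of the camera can project
            u = int(round(focal * x / z + cx))  # column index
            v = int(round(focal * y / z + cy))  # row index
            if 0 <= u < width and 0 <= v < height:
                feat = list(feature_map[v][u])
        conditioned.append([x, y, z] + feat)
    return conditioned
```

In a diffusion loop this lookup would be repeated at every denoising step, since the point positions change from step to step.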
2302.10663
Report |
RealFusion: 360° Reconstruction of Any Object from a Single Image |
Luke Melas-Kyriazi, Christian Rupprecht, Iro Laina, Andrea Vedaldi |
We consider the problem of reconstructing a full 360° photographic model
of an object from a single image of it. We do so by fitting a neural radiance
field to the image, but find this problem to be severely ill-posed. We thus
take an off-the-shelf conditional image generator based on diffusion and
engineer a prompt that encourages it to "dream up" novel views of the object.
Using an approach inspired by DreamFields and DreamFusion, we fuse the given
input view, the conditional prior, and other regularizers in a final,
consistent reconstruction. We demonstrate state-of-the-art reconstruction
results on benchmark images when compared to prior methods for monocular 3D
reconstruction of objects. Qualitatively, our reconstructions provide a
faithful match of the input view and a plausible extrapolation of its
appearance and 3D shape, including to the side of the object not visible in the
image. |
Introduces RealFusion, a method for reconstructing a full 360° photographic 3D model of an object from a single image using an off-the-shelf 2D diffusion image generator as a prior. |
Solves the severely ill-posed problem of single-image 3D reconstruction by leveraging the powerful statistical model of the 3D world captured in pre-trained 2D diffusion models. |
Uses a single-image textual inversion technique to condition the diffusion model to 'dream up' novel views of the object. These views, along with the input image, are used to train a neural radiance field in a coarse-to-fine manner with additional regularization for smooth surfaces. |
Achieves state-of-the-art reconstruction results on benchmark images and in-the-wild images compared to previous single-image reconstruction methods.
Generates plausible 3D reconstructions that faithfully match the input view and provide a plausible extrapolation of appearance and 3D shape.
Demonstrates the viability of leveraging pre-trained 2D diffusion models for single-image 3D reconstruction. |
Occasionally fails to converge to a plausible geometry or copies the front view to the back of the object.
Future work includes specializing the diffusion model for new-view synthesis and incorporating dynamics for animated 3D scenes. |
3d reconstruction, diffusion models, neural radiance fields, single-image reconstruction, textual inversion |
2302.10586
Report |
Diffusion Models and Semi-Supervised Learners Benefit Mutually with Few Labels |
Zebin You, Yong Zhong, Fan Bao, Jiacheng Sun, Chongxuan Li, Jun Zhu |
In an effort to further advance semi-supervised generative and classification
tasks, we propose a simple yet effective training strategy called dual pseudo
training (DPT), built upon strong semi-supervised learners and diffusion
models. DPT operates in three stages: training a classifier on partially
labeled data to predict pseudo-labels; training a conditional generative model
using these pseudo-labels to generate pseudo images; and retraining the
classifier with a mix of real and pseudo images. Empirically, DPT consistently
achieves SOTA performance of semi-supervised generation and classification
across various settings. In particular, with one or two labels per class, DPT
achieves a Fréchet Inception Distance (FID) score of 3.08 or 2.52 on ImageNet
256x256. Besides, DPT outperforms competitive semi-supervised baselines
substantially on ImageNet classification tasks, achieving top-1 accuracies of
59.0 (+2.8), 69.5 (+3.0), and 74.4 (+2.0) with one, two, or five labels per
class, respectively. Notably, our results demonstrate that diffusion can
generate realistic images with only a few labels (e.g., <0.1%) and generative
augmentation remains viable for semi-supervised classification. Our code is
available at https://github.com/ML-GSAI/DPT. |
This paper introduces Dual Pseudo Training (DPT), a novel training method for improving semi-supervised image generation and classification by leveraging the synergy between diffusion models and semi-supervised classifiers. |
DPT addresses the challenge of limited labeled data in semi-supervised learning, aiming to improve the performance of both conditional image generation and classification tasks. |
DPT operates in three stages: 1) Training a classifier on partially labeled data to generate pseudo-labels for unlabeled data. 2) Training a conditional generative diffusion model on all data using the pseudo-labels. 3) Retraining or fine-tuning the classifier using augmented data generated by the diffusion model. |
DPT achieves state-of-the-art semi-supervised generation performance on CIFAR-10 and ImageNet, even outperforming some supervised methods.
DPT significantly improves semi-supervised classification results on ImageNet benchmarks, demonstrating the efficacy of generative augmentation for classification.
The paper provides evidence that diffusion models can generate high-quality images with very few labels (<0.1%). |
The paper acknowledges that directly using pseudo images and labels without further filtering based on semantic alignment is a limitation.
Future work can explore the integration of semantic alignment techniques like CLIP to filter noisy image-label pairs. |
semi-supervised learning, diffusion models, image generation, image classification, generative augmentation |
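The three DPT stages in record 2302.10586 can be mimicked on toy 1-D data. In this sketch a nearest-centroid classifier stands in for the semi-supervised learner and a per-class Gaussian stands in for the conditional diffusion model; both substitutions, and all function names, are assumptions chosen to keep the example self-contained.

```python
import random


def fit_centroids(xs, ys):
    """Nearest-centroid 'classifier': one mean per label."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    return min(centroids, key=lambda y: abs(x - centroids[y]))


def dual_pseudo_training(labeled_x, labeled_y, unlabeled_x, n_pseudo, rng):
    # Stage 1: train on the few labels, then pseudo-label the unlabeled data.
    classifier = fit_centroids(labeled_x, labeled_y)
    pseudo_y = [predict(classifier, x) for x in unlabeled_x]

    # Stage 2: fit a class-conditional generator on all data with pseudo-labels
    # (a Gaussian per class here, in place of a conditional diffusion model).
    all_x = labeled_x + unlabeled_x
    all_y = labeled_y + pseudo_y
    generator = {}
    for label in set(all_y):
        values = [x for x, y in zip(all_x, all_y) if y == label]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        generator[label] = (mean, var ** 0.5)

    # Generate pseudo samples for every class.
    pseudo_x, pseudo_labels = [], []
    for label, (mean, std) in sorted(generator.items()):
        for _ in range(n_pseudo):
            pseudo_x.append(rng.gauss(mean, std))
            pseudo_labels.append(label)

    # Stage 3: retrain the classifier on real labeled plus generated data.
    return fit_centroids(labeled_x + pseudo_x, labeled_y + pseudo_labels)


final = dual_pseudo_training(
    [0.0, 10.0], [0, 1], [-1.0, 1.0, 0.5, 9.0, 11.0, 10.5],
    n_pseudo=50, rng=random.Random(0),
)
```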
2302.10523
Report |
I2V: Towards Texture-Aware Self-Supervised Blind Denoising using Self-Residual Learning for Real-World Images |
Kanggeun Lee, Kyungryun Lee, Won-Ki Jeong |
Although the advances of self-supervised blind denoising are significantly
superior to conventional approaches without clean supervision in synthetic
noise scenarios, it shows poor quality in real-world images due to spatially
correlated noise corruption. Recently, pixel-shuffle downsampling (PD) has been
proposed to eliminate the spatial correlation of noise. A study combining a
blind spot network (BSN) and asymmetric PD (AP) successfully demonstrated that
self-supervised blind denoising is applicable to real-world noisy images.
However, PD-based inference may degrade texture details in the testing phase
because high-frequency details (e.g., edges) are destroyed in the downsampled
images. To avoid such an issue, we propose self-residual learning without the
PD process to maintain texture information. We also propose an order-variant PD
constraint, noise prior loss, and an efficient inference scheme (progressive
random-replacing refinement ($\text{PR}^3$)) to boost overall performance. The
results of extensive experiments show that the proposed method outperforms
state-of-the-art self-supervised blind denoising approaches, including several
supervised learning methods, in terms of PSNR, SSIM, LPIPS, and DISTS in
real-world sRGB images. |
This paper presents I2V, a self-supervised blind denoising framework for real-world sRGB images that preserves texture details better than existing methods. |
Existing self-supervised blind denoising methods struggle with real-world images due to spatially correlated noise and often degrade texture details. I2V aims to address these limitations. |
I2V leverages self-residual learning with a noise extractor network, order-variant pixel-shuffle downsampling, a noise prior loss, and a progressive random-replacing refinement (PR^3) inference scheme. |
I2V outperforms state-of-the-art self-supervised blind denoisers on SIDD, DND, and NIND datasets in terms of PSNR, SSIM, LPIPS, and DISTS.
The proposed method preserves texture details better than some supervised learning methods, as demonstrated by LPIPS and DISTS.
I2V achieves a faster inference speed compared to the AP-BSN+R^3 method. |
Training I2V requires more GPU memory than AP-BSN due to increased computational cost.
Future work includes exploring different noise extractor structures like Restormer. |
image denoising, self-supervised learning, blind denoising, texture preservation, real-world images |
2302.10326
Report |
Unsupervised Out-of-Distribution Detection with Diffusion Inpainting |
Zhenzhen Liu, Jin Peng Zhou, Yufan Wang, Kilian Q. Weinberger |
Unsupervised out-of-distribution detection (OOD) seeks to identify
out-of-domain data by learning only from unlabeled in-domain data. We present a
novel approach for this task - Lift, Map, Detect (LMD) - that leverages recent
advancement in diffusion models. Diffusion models are one type of generative
models. At their core, they learn an iterative denoising process that gradually
maps a noisy image closer to their training manifolds. LMD leverages this
intuition for OOD detection. Specifically, LMD lifts an image off its original
manifold by corrupting it, and maps it towards the in-domain manifold with a
diffusion model. For an out-of-domain image, the mapped image would have a
large distance away from its original manifold, and LMD would identify it as
OOD accordingly. We show through extensive experiments that LMD achieves
competitive performance across a broad variety of datasets. Code can be found
at https://github.com/zhenzhel/lift_map_detect. |
This paper presents Lift, Map, Detect (LMD), a novel unsupervised out-of-distribution (OOD) detection method leveraging the manifold mapping ability of diffusion models. |
Unsupervised OOD detection is crucial for deploying machine learning models in real-world settings where out-of-domain data can lead to unpredictable and potentially harmful consequences. |
LMD lifts an image off its original manifold by masking it. Then, it maps the lifted image towards the in-domain manifold using a diffusion model trained on in-domain data. Finally, it leverages the reconstruction distance between the original and mapped images to detect OOD data. |
LMD achieves competitive performance on various datasets, demonstrating its effectiveness and versatility.
Using multiple reconstructions and an alternating checkerboard masking strategy consistently enhances LMD's performance.
The LPIPS distance metric proves to be a robust choice for measuring reconstruction dissimilarity across different datasets. |
The reliance on iterative denoising in diffusion models makes LMD computationally expensive for real-time applications.
Future work could explore integrating fast diffusion model sampling techniques to improve LMD's speed. |
out-of-distribution detection, diffusion models, unsupervised learning, image inpainting, reconstruction error |
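The lift, map, detect loop from record 2302.10326 can be sketched as follows. Alternating checkerboard masks lift the image, an inpainting function maps it back, and the reconstruction distances are aggregated into an OOD score. The `inpaint` callable is a hypothetical stand-in for a diffusion inpainter trained on in-domain data, mean squared error is swapped in for LPIPS to avoid external dependencies, and averaging the distances is an assumed aggregation rule.

```python
def checkerboard_mask(height, width, block, invert=False):
    """Boolean mask; True marks pixels to be erased and inpainted."""
    mask = []
    for r in range(height):
        row = []
        for c in range(width):
            masked = ((r // block) + (c // block)) % 2 == 0
            row.append(masked != invert)
        mask.append(row)
    return mask


def mse(a, b):
    """Mean squared error between two images given as nested lists."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += (pa - pb) ** 2
            count += 1
    return total / count


def lmd_score(image, inpaint, block=1, attempts=2, distance=mse):
    """Higher score means the image is more likely out-of-domain."""
    height, width = len(image), len(image[0])
    distances = []
    for i in range(attempts):
        # Alternate the checkerboard so every pixel gets masked at some point.
        mask = checkerboard_mask(height, width, block, invert=(i % 2 == 1))
        reconstruction = inpaint(image, mask)
        distances.append(distance(image, reconstruction))
    return sum(distances) / len(distances)


def zero_inpainter(image, mask):
    """Toy inpainter whose 'in-domain manifold' is the all-zero image."""
    return [[0.0 if m else p for p, m in zip(row, mask_row)]
            for row, mask_row in zip(image, mask)]
```

With the toy inpainter, an all-zero image is reconstructed exactly, while any other image accumulates error on the masked pixels.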
2302.10305
Report |
Analyzing Multimodal Objectives Through the Lens of Generative Diffusion Guidance |
Chaerin Kong, Nojun Kwak |
Recent years have witnessed astonishing advances in the field of multimodal
representation learning, with contrastive learning being the cornerstone for
major breakthroughs. Latest works delivered further improvements by
incorporating different objectives such as masked modeling and captioning into
the frameworks, but our understanding on how these objectives facilitate
learning remains vastly incomplete. In this paper, we leverage the fact that
classifier-guided diffusion models generate images that reflect the semantic
signals provided by the classifier to study the characteristics of multimodal
learning objectives. Specifically, we compare contrastive, matching and
captioning loss in terms of their semantic signals, and introduce a simple
baseline that not only supports our analyses but also improves the quality of
generative guidance in a straightforward manner. |
This paper leverages the ability of classifier-guided diffusion models to reflect semantic signals to analyze the characteristics of various multimodal learning objectives (contrastive, matching, and captioning) for vision-language pretraining. |
Understanding how different objectives contribute to multimodal representation learning, particularly in vision-language tasks, is crucial for improving generative models and achieving better text-to-image generation. |
The authors utilize a pre-trained diffusion model and the BLIP model, which is capable of evaluating contrastive (ITC), matching (ITM), and captioning (CAP) losses. They analyze the generated images by using each objective as guidance in the diffusion process. |
ITC excels at generating fine details but struggles with global scene composition and semantic object relations, often omitting or incorrectly fusing attributes.
CAP demonstrates strong scene understanding and generates images faithful to complex prompts, but it has higher optimization complexity compared to ITC.
ITM, incorporating patch-token cross-attention, exhibits robust visual understanding and representation, generating coherent scenes with accurate object-attribute relations. |
The study primarily relies on qualitative analysis and a limited user study for evaluation.
Future work could explore incorporating more powerful generative models and diverse datasets to further validate the findings. |
multimodal learning, vision-language pretraining, contrastive learning, diffusion models, text-to-image generation |
2302.10174
Report |
Towards Universal Fake Image Detectors that Generalize Across Generative Models |
Utkarsh Ojha, Yuheng Li, Yong Jae Lee |
With generative models proliferating at a rapid rate, there is a growing need
for general purpose fake image detectors. In this work, we first show that the
existing paradigm, which consists of training a deep network for real-vs-fake
classification, fails to detect fake images from newer breeds of generative
models when trained to detect GAN fake images. Upon analysis, we find that the
resulting classifier is asymmetrically tuned to detect patterns that make an
image fake. The real class becomes a sink class holding anything that is not
fake, including generated images from models not accessible during training.
Building upon this discovery, we propose to perform real-vs-fake classification
without learning; i.e., using a feature space not explicitly trained to
distinguish real from fake images. We use nearest neighbor and linear probing
as instantiations of this idea. When given access to the feature space of a
large pretrained vision-language model, the very simple baseline of nearest
neighbor classification has surprisingly good generalization ability in
detecting fake images from a wide variety of generative models; e.g., it
improves upon the SoTA by +15.07 mAP and +25.90% acc when tested on unseen
diffusion and autoregressive models. |
This paper proposes a method for detecting fake images generated by various generative models by leveraging the feature space of a large pre-trained vision-language model (CLIP-ViT), which is not explicitly trained for real-vs-fake classification. |
Existing deep learning methods for fake image detection struggle to generalize to unseen families of generative models, often misclassifying fake images from diffusion models as real. |
The authors propose two simple methods: 1) Nearest Neighbor classification: finding the nearest neighbor of a test image in a feature bank of real and fake images embedded using CLIP-ViT, and 2) Linear Probing: training a linear classifier on top of the CLIP-ViT features. |
Both Nearest Neighbor and Linear Probing using CLIP-ViT's feature space significantly outperform state-of-the-art methods in detecting fake images from unseen generative model families.
The approach is robust to the choice of training data source (GAN or diffusion models).
The method maintains good performance even with a smaller training dataset. |
The study mainly focuses on detecting completely generated images and might not directly apply to images with localized manipulations.
Further research is needed to understand the underlying similarity between fake images from different generative models that enables their detection. |
fake image detection, generalization, generative models, clip, vision-language models |
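The nearest-neighbor variant described in record 2302.10174 reduces to a few lines once features are available. This sketch assumes the feature vectors were already extracted by a frozen pretrained encoder (CLIP-ViT in the summary) and uses cosine distance; the function names and the tie-breaking rule (ties go to "real") are assumptions.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def classify_nearest_neighbor(query, real_bank, fake_bank):
    """Label a query feature by whichever bank holds its closest neighbor.

    real_bank / fake_bank: lists of feature vectors from known real and
    known generated images. No network weights are trained here.
    """
    d_real = min(cosine_distance(query, f) for f in real_bank)
    d_fake = min(cosine_distance(query, f) for f in fake_bank)
    return "fake" if d_fake < d_real else "real"
```

Because the feature space is never tuned on the real-vs-fake task, neither class acts as a catch-all sink, which is the motivation given in the abstract.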
2302.10167
Report |
Cross-domain Compositing with Pretrained Diffusion Models |
Roy Hachnochi, Mingrui Zhao, Nadav Orzech, Rinon Gal, Ali Mahdavi-Amiri, Daniel Cohen-Or, Amit Haim Bermano |
Diffusion models have enabled high-quality, conditional image editing
capabilities. We propose to expand their arsenal, and demonstrate that
off-the-shelf diffusion models can be used for a wide range of cross-domain
compositing tasks. Among numerous others, these include image blending, object
immersion, texture-replacement and even CG2Real translation or stylization. We
employ a localized, iterative refinement scheme which infuses the injected
objects with contextual information derived from the background scene, and
enables control over the degree and types of changes the object may undergo. We
conduct a range of qualitative and quantitative comparisons to prior work, and
exhibit that our method produces higher quality and realistic results without
requiring any annotations or training. Finally, we demonstrate how our method
may be used for data augmentation of downstream tasks. |
This paper proposes a novel method for cross-domain compositing using off-the-shelf diffusion models, enabling realistic merging of image parts from different visual domains (e.g., photos and paintings). |
This technique addresses the challenge of combining objects from different visual domains while maintaining realism and coherency, expanding the capabilities of diffusion models beyond traditional image editing tasks. |
The method leverages a localized, iterative refinement scheme based on ILVR (Iterative Latent Variable Refinement). It infuses injected objects with contextual information from the background, allowing control over the degree and types of object changes while ensuring domain consistency. |
The method outperforms baselines in qualitative and quantitative comparisons for image modification, object immersion, and data augmentation for single-view 3D reconstruction (SVR).
It enables realistic blending of objects with their backgrounds, matching style and adding details while preserving object structure.
The technique effectively bridges the domain gap between synthetic and real images, improving the performance of single-view 3D reconstruction models on real-world data. |
The method faces challenges when processing small objects or semantically complex images, requiring further exploration for optimal parameter selection.
Future work includes extending the technique to video, addressing temporal consistency for cross-domain video compositing. |
diffusion models, cross-domain compositing, image editing, object immersion, data augmentation |
2302.09923
Report |
Prompt Stealing Attacks Against Text-to-Image Generation Models |
Xinyue Shen, Yiting Qu, Michael Backes, Yang Zhang |
Text-to-Image generation models have revolutionized the artwork design
process and enabled anyone to create high-quality images by entering text
descriptions called prompts. Creating a high-quality prompt that consists of a
subject and several modifiers can be time-consuming and costly. In consequence,
a trend of trading high-quality prompts on specialized marketplaces has
emerged. In this paper, we perform the first study on understanding the threat
of a novel attack, namely prompt stealing attack, which aims to steal prompts
from generated images by text-to-image generation models. Successful prompt
stealing attacks directly violate the intellectual property of prompt engineers
and jeopardize the business model of prompt marketplaces. We first perform a
systematic analysis on a dataset collected by ourselves and show that a
successful prompt stealing attack should consider a prompt's subject as well as
its modifiers. Based on this observation, we propose a simple yet effective
prompt stealing attack, PromptStealer. It consists of two modules: a subject
generator trained to infer the subject and a modifier detector for identifying
the modifiers within the generated image. Experimental results demonstrate that
PromptStealer is superior over three baseline methods, both quantitatively and
qualitatively. We also make some initial attempts to defend PromptStealer. In
general, our study uncovers a new attack vector within the ecosystem
established by the popular text-to-image generation models. We hope our results
can contribute to understanding and mitigating this emerging threat. |
This paper presents the first study on "prompt stealing attacks," which aim to steal the text prompts used to generate images from text-to-image generation models. |
Successful attacks could violate the intellectual property of prompt engineers and impact the business model of prompt marketplaces. |
The authors collect a dataset of prompt-image pairs and propose a novel attack method, PromptStealer, which uses a subject generator and a modifier detector to infer the prompt from an image. |
PromptStealer outperforms baseline methods in recovering prompts, as measured by semantic, modifier, image, and pixel similarities.
The attack remains effective on real-world prompts traded in marketplaces.
A defense method based on adversarial examples shows promise but requires strong assumptions. |
The evaluation primarily focuses on Stable Diffusion, with limited testing on other text-to-image models.
The defense method assumes white-box access to the attack model. |
prompt engineering, text-to-image generation, intellectual property, adversarial examples, ai security |
2302.09778
Report |
Composer: Creative and Controllable Image Synthesis with Composable Conditions |
Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, Jingren Zhou |
Recent large-scale generative models learned on big data are capable of
synthesizing incredible images yet suffer from limited controllability. This
work offers a new generation paradigm that allows flexible control of the
output image, such as spatial layout and palette, while maintaining the
synthesis quality and model creativity. With compositionality as the core idea,
we first decompose an image into representative factors, and then train a
diffusion model with all these factors as the conditions to recompose the
input. At the inference stage, the rich intermediate representations work as
composable elements, leading to a huge design space (i.e., exponentially
proportional to the number of decomposed factors) for customizable content
creation. It is noteworthy that our approach, which we call Composer, supports
various levels of conditions, such as text description as the global
information, depth map and sketch as the local guidance, color histogram for
low-level details, etc. Besides improving controllability, we confirm that
Composer serves as a general framework and facilitates a wide range of
classical generative tasks without retraining. Code and models will be made
available. |
This paper introduces Composer, a novel compositional generative model framework for highly controllable image synthesis, allowing flexible control over spatial layout, palette, and other image aspects while maintaining high synthesis quality and creativity. |
Existing generative models, though capable of producing high-quality images, often lack the controllability needed for practical design applications. This work addresses this limitation by proposing a compositional approach that significantly expands the control space and enables more flexible image generation. |
Composer decomposes images into representative factors (e.g., caption, semantics, color, sketch, depth) and trains a diffusion model conditioned on these factors for image recomposition. This enables flexible customization by combining different representations during inference. |
Composer enables diverse image manipulations like variations, interpolations, reconfigurations, and region-specific editing.
It can reformulate traditional generation tasks such as colorization, style transfer, image translation, and virtual try-on without retraining.
The model achieves a zero-shot FID of 9.2 on COCO for text-to-image generation, demonstrating its competitive performance. |
The joint training strategy for multiple conditions, though effective, could potentially downweight the single-conditional generation performance.
Conflicts might arise when incompatible conditions are used, requiring further investigation and potential mitigation strategies. |
image generation, controllable generation, compositionality, diffusion models, multi-modal generation |
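The Composer abstract describes a design space that grows exponentially with the number of decomposed factors. A minimal sketch of that arithmetic follows, assuming the five example factors named in the report and assuming each non-empty subset of factors counts as one conditioning configuration; the helper name `condition_subsets` is hypothetical and is not part of Composer.

```python
from itertools import combinations


def condition_subsets(factors):
    """Enumerate every non-empty subset of conditioning factors."""
    subsets = []
    for size in range(1, len(factors) + 1):
        subsets.extend(combinations(factors, size))
    return subsets


# Example factors taken from the report text; the real model may use others.
factors = ["caption", "semantics", "color", "sketch", "depth"]
subsets = condition_subsets(factors)

# n factors give 2**n - 1 non-empty combinations.
print(len(subsets))
```

Adding one more factor doubles the number of available combinations (plus one), which is the sense in which the space is exponential in the number of factors.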
2302.09554
Report |
Mixed Hierarchy Network for Image Restoration |
Hu Gao, Depeng Dang |
Image restoration is a long-standing low-level vision problem, e.g.,
deblurring and deraining. In the process of image restoration, it is necessary
to consider not only the spatial details and contextual information of
restoration to ensure the quality, but also the system complexity. Although
many methods have been able to guarantee the quality of image restoration, the
system complexity of the state-of-the-art (SOTA) methods is increasing as well.
Motivated by this, we present a mixed hierarchy network that can balance these
competing goals. Our main proposal is a mixed hierarchy architecture that
progressively recovers contextual information and spatial details from degraded
images while we design intra-blocks to reduce system complexity. Specifically,
our model first learns the contextual information using encoder-decoder
architectures, and then combines them with high-resolution branches that
preserve spatial detail. In order to reduce the system complexity of this
architecture for convenient analysis and comparison, we replace or remove the
nonlinear activation function with multiplication and use a simple network
structure. In addition, we replace spatial convolution with global
self-attention for the middle block of encoder-decoder. The resulting tightly
interlinked hierarchy architecture, named MHNet, delivers strong performance
gains on several image restoration tasks, including image deraining and
deblurring. |
This paper presents MHNet, a mixed hierarchy network for image restoration that balances high-quality restoration with low system complexity. |
Existing deep learning methods for image restoration, while achieving good performance, often suffer from high system complexity, requiring significant computational resources. |
MHNet uses a mixed hierarchy architecture. It first learns contextual information at a lower hierarchy using encoder-decoder subnetworks with a selective multi-head attention mechanism. It then refines spatial details at a higher hierarchy operating on full resolution with a full resolution subnetwork. An adaptive feature fusion mechanism facilitates information exchange between hierarchies. The network primarily utilizes nonlinear activation-free blocks to reduce system complexity. |
MHNet achieves state-of-the-art performance on several image restoration tasks, including image deraining and deblurring.
MHNet demonstrates significant performance gains with lower computational resources compared to existing methods.
The paper provides ablation studies demonstrating the contribution of each component in MHNet. |
The model's performance is limited by the reliance on a fixed hierarchy structure.
Exploring more efficient attention mechanisms or alternative feature fusion strategies could further enhance the network's performance. |
image restoration, deblurring, deraining, mixed hierarchy network, efficient deep learning |
2302.09486
Report |
LC-NeRF: Local Controllable Face Generation in Neural Randiance Field |
Wenyang Zhou, Lu Yuan, Shuyu Chen, Lin Gao, Shimin Hu |
3D face generation has achieved high visual quality and 3D consistency thanks
to the development of neural radiance fields (NeRF). Recently, several methods
have been proposed to generate and edit 3D faces with a NeRF representation,
achieving good results in decoupling geometry and texture. The latent codes of these
generative models affect the whole face, and hence modifications to these codes
cause the entire face to change. However, users usually edit a local region
when editing faces and do not want other regions to be affected. Since changes
to the latent code affect global generation results, these methods do not allow
for fine-grained control of local facial regions. To improve local
controllability in NeRF-based face editing, we propose LC-NeRF, which is
composed of a Local Region Generators Module and a Spatial-Aware Fusion Module,
allowing for local geometry and texture control of local facial regions.
Qualitative and quantitative evaluations show that our method provides better
local editing than state-of-the-art face editing methods. Our method also
performs well in downstream tasks, such as text-driven facial image editing. |
This paper introduces LC-NeRF, a novel NeRF-based face generation and editing method that provides local control over geometry and texture, enabling fine-grained modifications to specific facial regions. |
Existing NeRF-based face editing methods often struggle to modify local regions without affecting the entire face, limiting their control and potentially leading to inconsistent facial identities. This method addresses this limitation by providing more fine-grained control over local features. |
LC-NeRF utilizes a local region generator module to decompose global 3D representations into local regions for separate geometry and texture control. A spatial-aware fusion module then aggregates these regions into a final image, ensuring seamless integration. The method is trained with a double discriminator supervision strategy to ensure high-quality generation and consistency between the image and the semantic mask. |
LC-NeRF can edit local facial regions accurately without affecting non-editing regions, as demonstrated by qualitative and quantitative evaluations.
The method successfully decouples geometry and texture, allowing for independent modification of each aspect.
LC-NeRF excels in downstream tasks such as text-driven facial image editing, showing its versatility and potential for various applications. |
While LC-NeRF allows for local region and geometry/texture decoupling, it currently lacks the ability to control local internal textures finely, such as hair texture or wrinkles.
Future work will focus on enabling finer control over local texture content. |
face editing, neural radiance fields (nerf), generative adversarial networks (gans), local control, 3d face generation |
2302.09311
Report |
Temporal Interpolation Is All You Need for Dynamic Neural Radiance Fields |
Sungheon Park, Minjung Son, Seokhwan Jang, Young Chun Ahn, Ji-Yeon Kim, Nahyup Kang |
Temporal interpolation often plays a crucial role in learning meaningful
representations in dynamic scenes. In this paper, we propose a novel method to
train spatiotemporal neural radiance fields of dynamic scenes based on temporal
interpolation of feature vectors. Two feature interpolation methods are
suggested depending on underlying representations, neural networks or grids. In
the neural representation, we extract features from space-time inputs via
multiple neural network modules and interpolate them based on time frames. The
proposed multi-level feature interpolation network effectively captures
features of both short-term and long-term time ranges. In the grid
representation, space-time features are learned via four-dimensional hash
grids, which remarkably reduces training time. The grid representation trains
more than 100 times faster than previous neural-network-based
methods while maintaining rendering quality. Concatenating static and
dynamic features and adding a simple smoothness term further improve the
performance of our proposed models. Despite the simplicity of the model
architectures, our method achieved state-of-the-art performance both in
rendering quality for the neural representation and in training speed for the
grid representation. |
This paper introduces a novel method for training dynamic Neural Radiance Fields (NeRFs) using temporal interpolation of feature vectors, enabling the representation of dynamic scenes without explicit deformation or scene flow estimation. |
Existing dynamic NeRF methods struggle with ambiguities in scene changes (appearance, movement, color change) and often rely on complex deformation modules. This method provides a simpler, more effective approach for learning dynamic scene representations. |
The method uses two representations: 1) Neural Representation: Features are extracted from space-time inputs via multiple MLPs and temporally interpolated. 2) Grid Representation: Space-time features are learned using 4D hash grids. Both representations are enhanced by concatenating static and dynamic features and incorporating a smoothness term. |
The neural representation achieves state-of-the-art rendering quality on D-NeRF and competitive results on HyperNeRF datasets.
The grid representation trains over 100x faster than neural-network-based methods while maintaining competitive rendering quality.
The proposed smoothness regularizer, encouraging feature similarity between adjacent frames, consistently improves the performance of both representations. |
The method struggles to recover 3D structures of small, rapidly moving objects and unseen dynamic regions during training.
Exploring hybrid representations combining the strengths of neural and grid representations could be promising for future work. |
neural radiance fields, dynamic scene reconstruction, temporal interpolation, hash grids, novel view synthesis |
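The temporal feature interpolation and smoothness term summarized in the report above can be illustrated with a minimal sketch. It assumes features are stored at discrete keyframe times and blended linearly between the two nearest keyframes, and that the smoothness term is a mean squared difference between adjacent keyframe features. The report does not specify either detail, so both are assumptions, and the function names are hypothetical.

```python
def interpolate_feature(keyframe_features, keyframe_times, t):
    """Linearly blend the feature vectors of the two keyframes that bracket t."""
    if t <= keyframe_times[0]:
        return list(keyframe_features[0])
    if t >= keyframe_times[-1]:
        return list(keyframe_features[-1])
    for i in range(len(keyframe_times) - 1):
        t0, t1 = keyframe_times[i], keyframe_times[i + 1]
        if t0 <= t <= t1:
            w = (t - t0) / (t1 - t0)
            f0, f1 = keyframe_features[i], keyframe_features[i + 1]
            return [(1 - w) * a + w * b for a, b in zip(f0, f1)]


def smoothness_penalty(keyframe_features):
    """Mean squared difference between features of adjacent keyframes (assumed form)."""
    total, count = 0.0, 0
    for f0, f1 in zip(keyframe_features, keyframe_features[1:]):
        total += sum((a - b) ** 2 for a, b in zip(f0, f1))
        count += len(f0)
    return total / count if count else 0.0


times = [0.0, 0.5, 1.0]
features = [[0.0, 1.0], [1.0, 1.0], [1.0, 3.0]]
mid = interpolate_feature(features, times, 0.25)
penalty = smoothness_penalty(features)
```

In a trained model the keyframe features would be learned parameters or network outputs, and the penalty would be added to the rendering loss with a small weight.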
2302.09260
Report |
Attribute-Specific Manipulation Based on Layer-Wise Channels |
Yuanjie Yan, Jian Zhao, Furao Shen |
Image manipulation on the latent space of the pre-trained StyleGAN can
control the semantic attributes of the generated images. Recently, some studies
have focused on detecting channels with specific properties to directly
manipulate the latent code, which is limited by the entanglement of the latent
space. To detect the attribute-specific channels, we propose a novel detection
method in the context of pre-trained classifiers. We analyse the gradients
layer by layer on the style space. The intensities of the gradients indicate
the channel's responses to specific attributes. The latent style codes of
channels control separate attributes in the layers. We choose channels with
top-$k$ gradients to control specific attributes in the maximum response layer.
We implement single-channel and multi-channel manipulations with a certain
attribute. Our methods can accurately detect relevant channels for a large
number of face attributes. Extensive qualitative and quantitative results
demonstrate that the proposed methods outperform state-of-the-art methods in
generalization and scalability. |
This paper proposes a novel gradient-based method for detecting and manipulating attribute-specific channels in the style space of StyleGAN for semantic image editing. |
Existing methods for manipulating StyleGAN latent space are limited by entanglement, difficulty in pinpointing attribute-specific channels, and lack of flexibility in multi-attribute and continuous manipulation. |
The method leverages pre-trained classifiers to analyze gradients of style codes with respect to specific attributes. It selects the top-k channels with the largest gradients for single-channel or multi-channel manipulation. |
The method accurately detects relevant channels for a large number of face attributes (over 35), including both regions and semantic attributes.
It enables both single-channel and multi-channel manipulation, allowing for fine-grained control over attribute editing.
Quantitative and qualitative evaluations demonstrate superior performance over state-of-the-art methods in terms of generalization and scalability. |
Multi-channel manipulation requires further research on balancing the editing intensity of multiple channels.
The method focuses on facial attributes and could be extended to other domains. |
semantic manipulation, face editing, generative adversarial networks (gans), stylegan, stylespace |
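The channel selection step described in the report above (pick the maximum response layer, then its top-k channels by gradient) can be sketched as follows. This is a minimal illustration on plain lists of gradient values. It assumes a layer's response is its largest absolute gradient, which the report does not specify, and `select_channels` is a hypothetical name.

```python
def select_channels(layer_gradients, k):
    """Return (layer index, top-k channel indices) ranked by absolute gradient."""
    # Assumption: a layer's response is its largest absolute channel gradient.
    best_layer = max(
        range(len(layer_gradients)),
        key=lambda i: max(abs(g) for g in layer_gradients[i]),
    )
    grads = layer_gradients[best_layer]
    ranked = sorted(range(len(grads)), key=lambda c: abs(grads[c]), reverse=True)
    return best_layer, ranked[:k]


# Toy gradients of a classifier score with respect to style codes, per layer.
grads = [[0.1, -0.2, 0.05], [0.3, -0.9, 0.4, 0.0], [0.2, 0.1]]
layer, channels = select_channels(grads, k=2)
```

With `k=1` this corresponds to single-channel manipulation, and larger `k` to the multi-channel case.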
2302.09057
Report |
Consistent Diffusion Models: Mitigating Sampling Drift by Learning to be Consistent |
Giannis Daras, Yuval Dagan, Alexandros G. Dimakis, Constantinos Daskalakis |
Imperfect score-matching leads to a shift between the training and the
sampling distribution of diffusion models. Due to the recursive nature of the
generation process, errors in previous steps yield sampling iterates that drift
away from the training distribution. Yet, the standard training objective via
Denoising Score Matching (DSM) is only designed to optimize over non-drifted
data. To train on drifted data, we propose to enforce a consistency
property which states that predictions of the model on its own generated data
are consistent across time. Theoretically, we show that if the score is learned
perfectly on some non-drifted points (via DSM) and if the consistency property
is enforced everywhere, then the score is learned accurately everywhere.
Empirically we show that our novel training objective yields state-of-the-art
results for conditional and unconditional generation in CIFAR-10 and baseline
improvements in AFHQ and FFHQ. We open-source our code and models:
https://github.com/giannisdaras/cdm |
This paper introduces Consistent Diffusion Models (CDM), a novel method to mitigate sampling drift in diffusion models by enforcing consistency, a property ensuring model predictions on generated data remain consistent over time. |
Sampling drift, a discrepancy between training and sampling distributions due to imperfect score matching, is a major challenge in diffusion models. This drift leads to accumulated errors during the recursive generation process, impacting sample quality. CDM addresses this by improving score function accuracy, particularly in regions with low probability under the target distribution. |
The authors define a 'consistency property' based on the idea that a denoising function's output should match the expected value of the clean image generated using the learned reverse process. They then propose a new training objective that enforces this consistency property, encouraging the model to make self-consistent predictions across time. |
Theoretically, the paper proves that enforcing consistency, along with a weak form of score matching, suffices to learn the correct score function everywhere.
Empirically, CDM achieves state-of-the-art results for conditional and unconditional generation on CIFAR-10, surpassing previous benchmarks.
CDM also shows baseline improvements in image quality and reduced geometric inconsistencies on more challenging datasets like AFHQ and FFHQ. |
The proposed regularization in CDM increases training time by approximately 1.5x.
The method does not explicitly address or enforce the conservativeness of the learned vector field, a key theoretical assumption. |
diffusion models, generative models, score matching, sampling drift, consistency regularization |
2302.08908
Report |
LayoutDiffuse: Adapting Foundational Diffusion Models for Layout-to-Image Generation |
Jiaxin Cheng, Xiao Liang, Xingjian Shi, Tong He, Tianjun Xiao, Mu Li |
Layout-to-image generation refers to the task of synthesizing photo-realistic
images based on semantic layouts. In this paper, we propose LayoutDiffuse that
adapts a foundational diffusion model pretrained on large-scale image or
text-image datasets for layout-to-image generation. By adopting a novel neural
adaptor based on layout attention and task-aware prompts, our method trains
efficiently, generates images with both high perceptual quality and layout
alignment, and needs less data. Experiments on three datasets show that our
method significantly outperforms 10 other generative models based on GANs,
VQ-VAE, and diffusion models. |
Presents LayoutDiffuse, a method for adapting pretrained foundational diffusion models (trained on image-text pairs or only images) for layout-conditioned image generation. |
Addresses the limitations of existing layout-to-image generation methods, such as the inability to handle complex layouts or the need for extensive training. |
Adapts pretrained diffusion models by incorporating layout information through layout attention and task-adaptive prompts, fine-tuning the model for efficient adaptation. |
Achieves state-of-the-art results on bounding box and mask layout-to-image generation benchmarks, outperforming GAN-based and other diffusion-based methods.
Demonstrates time and data efficiency, requiring significantly less training time and data compared to training diffusion models from scratch.
Generates high-quality images that are both perceptually plausible and well-aligned with the input layouts, as evidenced by quantitative metrics and human evaluation. |
The adapted model size is larger due to the addition of layout attention layers.
Future work can explore identity-preserving image editing by combining LayoutDiffuse with textual inversion fine-tuning methods. |
layout-to-image generation, diffusion models, fine-tuning, layout attention, task-adaptive prompts |
2302.08788
Report |
MixNeRF: Modeling a Ray with Mixture Density for Novel View Synthesis from Sparse Inputs |
Seunghyeon Seo, Donghoon Han, Yeonjin Chang, Nojun Kwak |
Neural Radiance Field (NeRF) has broken new ground in novel view
synthesis due to its simple concept and state-of-the-art quality. However, it
suffers from severe performance degradation unless trained with a dense set of
images with different camera poses, which hinders its practical applications.
Although previous methods addressing this problem achieved promising results,
they relied heavily on the additional training resources, which goes against
the philosophy of sparse-input novel-view synthesis pursuing the training
efficiency. In this work, we propose MixNeRF, an effective training strategy
for novel view synthesis from sparse inputs by modeling a ray with a mixture
density model. Our MixNeRF estimates the joint distribution of RGB colors along
the ray samples by modeling it with a mixture of distributions. We also propose a
new task of ray depth estimation as a useful training objective, which is
highly correlated with 3D scene geometry. Moreover, we remodel the colors with
regenerated blending weights based on the estimated ray depth, which further
improves the robustness for colors and viewpoints. Our MixNeRF outperforms
other state-of-the-art methods in various standard benchmarks with superior
efficiency of training and inference. |
MixNeRF, a novel regularization-based neural radiance field (NeRF) training strategy for high-quality novel view synthesis from sparse inputs, addresses the limitations of previous methods that rely heavily on extra training resources, enhancing both training and inference efficiency. |
Existing NeRF models struggle with performance degradation when trained on sparse input views due to the difficulty in accurately estimating 3D geometry, which hinders their practical applications in domains like AR/VR and autonomous driving where dense training data is often unavailable. |
MixNeRF models the colors along a ray with a mixture density model, using the predicted weights as mixing coefficients for a mixture of Laplace distributions. It introduces ray depth estimation as an auxiliary task, utilizing the estimated depths to regenerate blending weights and remodel colors for enhanced robustness against viewpoint shifts. |
MixNeRF successfully learns 3D geometry from sparse views by leveraging a mixture density model, representing blending weight distributions more accurately than baselines.
It introduces ray depth estimation as an effective auxiliary task, resulting in more precise depth maps compared to methods relying on depth smoothing strategies.
MixNeRF outperforms state-of-the-art pre-training and regularization methods on LLFF, DTU, and Realistic Synthetic 360° datasets, demonstrating superior efficiency in both training and inference. |
MixNeRF may exhibit artifacts in rendered images under extremely sparse scenarios (e.g., 3-view) due to interference from non-object elements like backgrounds.
Future work could focus on developing algorithms for distinguishing between object and non-object pixels to further mitigate artifacts, particularly in datasets like DTU. |
novel view synthesis, neural radiance fields (nerf), sparse input, mixture density model, depth estimation |
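The mixture density idea in the MixNeRF report above can be sketched as a negative log-likelihood under a mixture of Laplace distributions, with the blending weights along a ray acting as mixing coefficients. This is a minimal sketch for a single scalar color channel. The per-sample scale values, the weight normalization, and the function names are assumptions for illustration, not the paper's exact formulation.

```python
import math


def laplace_pdf(x, loc, scale):
    """Density of a Laplace distribution with location loc and scale scale."""
    return math.exp(-abs(x - loc) / scale) / (2.0 * scale)


def mixture_nll(target, weights, locs, scales):
    """Negative log-likelihood of target under a Laplace mixture along one ray."""
    total = sum(weights)
    mix = [w / total for w in weights]  # assumption: normalize weights to sum to 1
    likelihood = sum(
        m * laplace_pdf(target, mu, b) for m, mu, b in zip(mix, locs, scales)
    )
    return -math.log(likelihood + 1e-12)


# Three samples on a ray: blending weights, predicted colors, assumed scales.
weights = [0.1, 0.8, 0.1]
colors = [0.2, 0.6, 0.9]
scales = [0.1, 0.1, 0.1]

nll_near = mixture_nll(0.6, weights, colors, scales)  # target matches the dominant sample
nll_far = mixture_nll(0.0, weights, colors, scales)   # target far from every sample
```

A ground-truth color close to the heavily weighted sample yields a lower loss than one far from all samples, which is the behavior a training objective of this kind relies on.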
2302.08510
Report |
Text-driven Visual Synthesis with Latent Diffusion Prior |
Ting-Hsuan Liao, Songwei Ge, Yiran Xu, Yao-Chih Lee, Badour AlBahar, Jia-Bin Huang |
There has been tremendous progress in large-scale text-to-image synthesis
driven by diffusion models enabling versatile downstream applications such as
3D object synthesis from texts, image editing, and customized generation. We
present a generic approach using latent diffusion models as powerful image
priors for various visual synthesis tasks. Existing methods that utilize such
priors fail to use these models' full capabilities. To improve this, our core
ideas are 1) a feature matching loss between features from different layers of
the decoder to provide detailed guidance and 2) a KL divergence loss to
regularize the predicted latent features and stabilize the training. We
demonstrate the efficacy of our approach on three different applications,
text-to-3D, StyleGAN adaptation, and layered image editing. Extensive results
show our method compares favorably against baselines. |
This paper introduces a novel approach that utilizes latent diffusion models as powerful image priors for various visual synthesis tasks, such as text-to-3D, StyleGAN adaptation, and layered image editing. |
Existing methods often lack a unified approach to leverage diffusion models for different visual synthesis tasks. This paper aims to address this gap and provide a more generic and effective solution. |
The proposed approach consists of two key components: (1) a feature matching loss for extracting finer-grained details from multiple decoder layers of the latent diffusion model, and (2) a KL divergence loss to regularize the predicted latent features and stabilize the training process. |
In text-to-3D synthesis, the method generates more detailed and visually appealing 3D models compared to baselines using CLIP or latent score distillation alone.
For StyleGAN adaptation, the method achieves superior FID scores and competitive CLIP and LPIPS scores, indicating improved image quality and diversity.
In layered image editing, the method demonstrates superior performance in manipulating image appearances and generating fine details compared to Text2LIVE and the latent score distillation baseline. |
The method struggles to resolve the multiple faces issue in the text-to-3D task, a common limitation in current text-to-3D methods.
Some cases exhibit color over-saturation or out-of-focus issues, despite using the KL loss for regularization. |
latent diffusion model, visual synthesis, text-to-3d, stylegan adaptation, image editing |
2302.08453
Report |
T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models |
Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, Xiaohu Qie |
The incredible generative ability of large-scale text-to-image (T2I) models
has demonstrated strong power of learning complex structures and meaningful
semantics. However, relying solely on text prompts cannot fully take advantage
of the knowledge learned by the model, especially when flexible and accurate
control (e.g., of color and structure) is needed. In this paper, we aim to
"dig out" the capabilities that T2I models have implicitly learned, and then
explicitly use them to control the generation more granularly. Specifically, we
propose to learn simple and lightweight T2I-Adapters to align internal
knowledge in T2I models with external control signals, while freezing the
original large T2I models. In this way, we can train various adapters according
to different conditions, achieving rich control and editing effects in the
color and structure of the generation results. Further, the proposed
T2I-Adapters have attractive properties of practical value, such as
composability and generalization ability. Extensive experiments demonstrate
that our T2I-Adapter has promising generation quality and a wide range of
applications. |
This paper proposes T2I-Adapter, a lightweight model designed to enhance the controllability of pre-trained text-to-image diffusion models like Stable Diffusion by aligning internal model knowledge with external control signals. |
Existing T2I models struggle to generate images that accurately reflect complex or imaginative user intentions, especially regarding structure and color, solely relying on text prompts. |
T2I-Adapters are trained to extract guidance features from various conditions like sketches, color palettes, depth maps, etc., and inject them into the encoder of the diffusion model, providing additional control signals during image generation. |
T2I-Adapters demonstrate superior generation quality and alignment compared to existing methods, evidenced by qualitative and quantitative (FID, CLIP Score) evaluations on tasks like sketch-to-image and segmentation-to-image generation.
The method supports flexible single-adapter control for various conditions, including imaginative scenarios, and exhibits promising image editing capabilities.
T2I-Adapters are composable, allowing for multi-condition control without retraining, and exhibit generalizability, enabling their use on custom models fine-tuned from the same base T2I model. |
Multi-adapter control currently requires manual adjustment of guidance feature combinations.
Future work will explore adaptive fusion of multi-modal guidance information. |
text-to-image synthesis, diffusion models, controllable image generation, adapter networks, multi-modal guidance |
2302.08374
Report |
Efficiency 360: Efficient Vision Transformers |
Badri N. Patro, Vijay Srinivas Agneeswaran |
Transformers are widely used for solving tasks in natural language
processing, computer vision, speech, and music domains. In this paper, we talk
about the efficiency of transformers in terms of memory (the number of
parameters), computation cost (number of floating-point operations), and
performance of models, including accuracy, the robustness of the model, and
fair & bias-free features. We mainly discuss the vision transformer for the
image classification task. Our contribution is to introduce an efficient 360
framework, which includes various aspects of the vision transformer, to make it
more efficient for industrial applications. By considering those applications,
we categorize them into multiple dimensions such as privacy, robustness,
transparency, fairness, inclusiveness, continual learning, probabilistic
models, approximation, computational complexity, and spectral complexity. We
compare various vision transformer models based on their performance, the
number of parameters, and the number of floating point operations (FLOPs) on
multiple datasets. |
This paper presents a comprehensive analysis of efficient transformers in the vision domain, focusing on their memory usage, computational cost, and performance across various aspects such as accuracy, robustness, fairness, and bias. |
The paper addresses the challenge of designing efficient transformer models for industrial applications, particularly in computer vision, due to the growing size and computational demands of these models. |
The paper reviews various techniques employed to enhance the efficiency of vision transformers, categorizing them into dimensions like computational complexity, spectral complexity, robustness, privacy, approximation, efficient learning, transparency, fairness, and inclusiveness. |
WaveViT demonstrates superior efficiency in terms of accuracy and parameter count compared to other transformer models.
CvT achieves comparable results on ImageNet benchmarks with a relatively small number of parameters.
CMT exhibits promising performance on ImageNet with a small parameter count and low FLOPs, especially for higher resolution images (384x384). |
The paper primarily focuses on image classification tasks, leaving the exploration of efficient transformers for other vision tasks like object detection and segmentation for future work.
Evaluating the latest vision transformer models on the Long Range Arena (LRA) benchmark, which focuses on long-range data contexts, is an open area for future research. |
vision transformers, efficient deep learning, computational complexity, model robustness, transfer learning |
2302.08242
Report |
Tuning computer vision models with task rewards |
André Susano Pinto, Alexander Kolesnikov, Yuge Shi, Lucas Beyer, Xiaohua Zhai |
Misalignment between model predictions and intended usage can be detrimental
for the deployment of computer vision models. The issue is exacerbated when the
task involves complex structured outputs, as it becomes harder to design
procedures which address this misalignment. In natural language processing,
this is often addressed using reinforcement learning techniques that align
models with a task reward. We adopt this approach and show its surprising
effectiveness across multiple computer vision tasks, such as object detection,
panoptic segmentation, colorization and image captioning. We believe this
approach has the potential to be widely useful for better aligning models with
a diverse range of computer vision tasks. |
This paper introduces a novel approach for fine-tuning computer vision models by directly optimizing task rewards using reinforcement learning, specifically the REINFORCE algorithm. |
This is important because traditional computer vision models often rely on optimizing differentiable loss functions that may not directly correlate with the desired task performance or involve complex and indirect optimization procedures. |
The methodology consists of two steps: (1) pretraining a model with maximum likelihood estimation (MLE) to learn data distribution and (2) fine-tuning the model to maximize a task-specific reward function using REINFORCE. |
Reward optimization significantly improves performance on object detection and panoptic segmentation tasks, achieving results comparable to state-of-the-art methods.
It enables control over qualitative aspects of model outputs, as demonstrated by tuning colorization models to produce vivid and colorful images.
The approach proves effective for image captioning, showing consistent improvements in CIDEr score compared to MLE pretrained models. |
Reward hacking is a potential limitation where the model might exploit weaknesses in reward definition instead of improving the intended task.
Careful reward design is crucial and often non-trivial, requiring consideration of potential biases and unintended consequences. |
computer vision, reinforcement learning, reward optimization, task alignment, mle pretraining |
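The two-step recipe described in this entry (MLE pretraining, then REINFORCE fine-tuning against a task reward) can be illustrated on a toy problem. The following is a minimal sketch and not the authors' implementation: the three-way softmax policy, the starting logits standing in for an MLE-pretrained model, the reward table, and the batch-mean baseline are all assumptions made for illustration.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, reward_fn, rng, num_samples=64, lr=0.5):
    """One REINFORCE update: sample outputs, score them, follow the
    reward-weighted gradient of the log-likelihood."""
    probs = softmax(logits)
    samples = rng.choices(range(len(logits)), weights=probs, k=num_samples)
    rewards = [reward_fn(y) for y in samples]
    # Assumption: a batch-mean baseline is used here to reduce variance.
    baseline = sum(rewards) / len(rewards)
    grad = [0.0] * len(logits)
    for y, r in zip(samples, rewards):
        advantage = r - baseline
        for k in range(len(logits)):
            # d log softmax(y) / d logit_k = 1[k == y] - probs[k]
            indicator = 1.0 if k == y else 0.0
            grad[k] += advantage * (indicator - probs[k]) / num_samples
    # Gradient ascent on expected reward.
    return [l + lr * g for l, g in zip(logits, grad)]

def tune(logits, reward_fn, steps=200, seed=0):
    """Repeat REINFORCE updates starting from the given logits."""
    rng = random.Random(seed)
    for _ in range(steps):
        logits = reinforce_step(logits, reward_fn, rng)
    return logits

# Hypothetical "pretrained" policy that prefers output 0,
# and a hypothetical task reward that prefers output 2.
pretrained_logits = [1.0, 0.5, 0.0]
toy_reward = lambda y: [0.0, 0.2, 1.0][y]

tuned_logits = tune(pretrained_logits, toy_reward)
tuned_probs = softmax(tuned_logits)
```

The reward is only evaluated on sampled outputs, so it does not need to be differentiable, which is the property that makes this style of tuning attractive for metrics such as detection or captioning scores.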
2302.08113
Report |
MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation |
Omer Bar-Tal, Lior Yariv, Yaron Lipman, Tali Dekel |
Recent advances in text-to-image generation with diffusion models present
transformative capabilities in image quality. However, user controllability of
the generated image and fast adaptation to new tasks still remain an open
challenge, currently mostly addressed by costly and long re-training and
fine-tuning or ad-hoc adaptations to specific image generation tasks. In this
work, we present MultiDiffusion, a unified framework that enables versatile and
controllable image generation, using a pre-trained text-to-image diffusion
model, without any further training or finetuning. At the center of our
approach is a new generation process, based on an optimization task that binds
together multiple diffusion generation processes with a shared set of
parameters or constraints. We show that MultiDiffusion can be readily applied
to generate high quality and diverse images that adhere to user-provided
controls, such as desired aspect ratio (e.g., panorama), and spatial guiding
signals, ranging from tight segmentation masks to bounding boxes. Project
webpage: https://multidiffusion.github.io |
Introduces MultiDiffusion, a unified framework for versatile and controllable image generation using a pre-trained text-to-image diffusion model without further training. |
Addresses the challenge of user controllability in text-to-image generation, enabling flexible adaptation to new tasks without costly retraining. |
Defines a new generation process that optimizes a shared set of parameters or constraints across multiple reference diffusion generation processes applied to different image regions. |
Generates high-quality, seamless panoramic images from text prompts.
Enables text-to-image generation with user-provided spatial guidance, from bounding boxes to tight masks.
Outperforms baselines in panorama generation quality and region-based generation accuracy. |
Quality heavily reliant on the generative prior of the reference diffusion model.
Further exploration of more general optimization problems and constraints within the MultiDiffusion framework. |
image generation, diffusion models, controllable generation, text-to-image synthesis, multidiffusion |
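One way to picture how several diffusion paths over different image regions are bound together in the MultiDiffusion entry above is to fuse per-window updates by averaging them wherever windows overlap. The sketch below does this on a one-dimensional "canvas" in plain Python. The window layout, the uniform averaging weights, and the stand-in `denoise_window` callable are assumptions for illustration; no real diffusion model is involved and the paper's optimization is not reproduced.

```python
def sliding_windows(length, window, stride):
    """Return (start, end) index pairs that together cover the canvas."""
    starts = list(range(0, max(length - window, 0) + 1, stride))
    if starts[-1] + window < length:
        # Add a final window so the right edge is always covered.
        starts.append(length - window)
    return [(s, s + window) for s in starts]

def fuse_windows(canvas, windows, denoise_window):
    """Apply the per-window update and average results where windows overlap."""
    total = [0.0] * len(canvas)
    count = [0] * len(canvas)
    for start, end in windows:
        update = denoise_window(canvas[start:end])
        for offset, value in enumerate(update):
            total[start + offset] += value
            count[start + offset] += 1
    return [t / c for t, c in zip(total, count)]

# Hypothetical stand-in for one denoising step on a crop:
# it replaces every element by the crop mean.
window_mean = lambda crop: [sum(crop) / len(crop)] * len(crop)

canvas = [float(i) for i in range(10)]
windows = sliding_windows(len(canvas), window=4, stride=3)
fused = fuse_windows(canvas, windows, window_mean)
```

Positions covered by two windows receive the average of both updates, which is what keeps neighbouring regions consistent at their seams in this toy setting.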
2302.08106
Report |
Towards Efficient Visual Adaption via Structural Re-parameterization |
Gen Luo, Minglang Huang, Yiyi Zhou, Xiaoshuai Sun, Guannan Jiang, Zhiyu Wang, Rongrong Ji |
Parameter-efficient transfer learning (PETL) is an emerging research spot
aimed at inexpensively adapting large-scale pre-trained models to downstream
tasks. Recent advances have achieved great success in saving storage costs for
various pre-trained models by updating a small number of parameters instead of
full tuning. However, we notice that most existing PETL methods still incur
non-negligible latency during inference. In this paper, we propose a
parameter-efficient and computational friendly adapter for giant vision models,
called RepAdapter. Specifically, we first prove that common adaptation modules
can also be seamlessly integrated into most giant vision models via our
structural re-parameterization, thereby achieving zero-cost during inference.
We then investigate the sparse design and effective placement of adapter
structure, helping our RepAdapter obtain other advantages in terms of parameter
efficiency and performance. To validate RepAdapter, we conduct extensive
experiments on 27 benchmark datasets of three vision tasks, i.e., image and
video classifications and semantic segmentation. Experimental results show the
superior performance and efficiency of RepAdapter over state-of-the-art
PETL methods. For instance, RepAdapter outperforms full tuning by +7.2% on
average and saves up to 25% training time, 20% GPU memory, and 94.6% storage
cost of ViT-B/16 on VTAB-1k. The generalization ability of RepAdapter is also
well validated by a bunch of vision models. Our source code is released at
https://github.com/luogen1996/RepAdapter. |
This paper proposes RepAdapter, a novel parameter-efficient transfer learning (PETL) method for adapting giant vision models to downstream tasks, which achieves zero inference cost via structural re-parameterization. |
Most existing PETL methods, while reducing storage costs, still lead to significant inference latency. This paper addresses the need for a PETL method that is both parameter-efficient and computationally friendly during inference. |
RepAdapter sequentially inserts lightweight, linear adapter networks into pre-trained models. After training, these adapters are re-parameterized into the nearby projection weights, enabling zero-cost inference. The paper also investigates a sparse adapter structure and effective placement strategies to further enhance parameter efficiency and performance. |
RepAdapter consistently outperforms state-of-the-art PETL methods on 27 benchmark datasets, including image and video classification, and semantic segmentation.
It demonstrates superior efficiency, reducing training time and GPU memory consumption compared to full fine-tuning.
The method exhibits strong generalization ability across various vision models like ConvNeXt, ViT, Swin-Transformer, and CLIP. |
The paper acknowledges that the exploration of sparse structures is limited to group-wise transformations.
Future work could investigate applying RepAdapter to more complex vision tasks and exploring automated adapter placement strategies.
Future work could also explore the theoretical aspects of why pre-inserting the adapter leads to better performance. |
parameter-efficient transfer learning, visual adapters, structural re-parameterization, vision transformer, inference efficiency |
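The zero-cost inference claim in the RepAdapter entry above rests on the fact that a linear adapter followed by a linear projection is itself a single linear map. The sketch below checks this numerically in plain Python. It assumes a dense residual adapter of the form x + s(Ax + a) placed before a projection Wx + b; the paper's actual sparse, group-wise adapter design is not reproduced, and all function and variable names are hypothetical.

```python
import random

def matvec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    """Matrix-matrix product for nested lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def adapter_then_projection(x, adapter_w, adapter_b, proj_w, proj_b, scale):
    """Training-time path: residual linear adapter, then the frozen projection."""
    hidden = matvec(adapter_w, x)
    adapted = [xi + scale * (hi + bi) for xi, hi, bi in zip(x, hidden, adapter_b)]
    return [p + b for p, b in zip(matvec(proj_w, adapted), proj_b)]

def merge_adapter(adapter_w, adapter_b, proj_w, proj_b, scale):
    """Fold the adapter into the projection:
    W' = W (I + s A),  b' = b + s W a."""
    n = len(adapter_w)
    eye_plus = [[(1.0 if i == j else 0.0) + scale * adapter_w[i][j]
                 for j in range(n)] for i in range(n)]
    merged_w = matmul(proj_w, eye_plus)
    merged_b = [b + scale * wb for b, wb in zip(proj_b, matvec(proj_w, adapter_b))]
    return merged_w, merged_b

rng = random.Random(0)
dim_in, dim_out, scale = 4, 3, 0.5
rand_vec = lambda n: [rng.uniform(-1.0, 1.0) for _ in range(n)]
adapter_w = [rand_vec(dim_in) for _ in range(dim_in)]
adapter_b = rand_vec(dim_in)
proj_w = [rand_vec(dim_in) for _ in range(dim_out)]
proj_b = rand_vec(dim_out)
x = rand_vec(dim_in)

before = adapter_then_projection(x, adapter_w, adapter_b, proj_w, proj_b, scale)
merged_w, merged_b = merge_adapter(adapter_w, adapter_b, proj_w, proj_b, scale)
after = [p + b for p, b in zip(matvec(merged_w, x), merged_b)]
```

After merging, inference uses only `merged_w` and `merged_b`, which have the same shapes as the original projection, so no extra layers or latency remain in this toy setting.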
2302.08063
Report |
MINOTAUR: Multi-task Video Grounding From Multimodal Queries |
Raghav Goyal, Effrosyni Mavroudi, Xitong Yang, Sainbayar Sukhbaatar, Leonid Sigal, Matt Feiszli, Lorenzo Torresani, Du Tran |
Video understanding tasks take many forms, from action detection to visual
query localization and spatio-temporal grounding of sentences. These tasks
differ in the type of inputs (only video, or video-query pair where query is an
image region or sentence) and outputs (temporal segments or spatio-temporal
tubes). However, at their core they require the same fundamental understanding
of the video, i.e., the actors and objects in it, their actions and
interactions. So far these tasks have been tackled in isolation with
individual, highly specialized architectures, which do not exploit the
interplay between tasks. In contrast, in this paper, we present a single,
unified model for tackling query-based video understanding in long-form videos.
In particular, our model can address all three tasks of the Ego4D Episodic
Memory benchmark which entail queries of three different forms: given an
egocentric video and a visual, textual or activity query, the goal is to
determine when and where the answer can be seen within the video. Our model
design is inspired by recent query-based approaches to spatio-temporal
grounding, and contains modality-specific query encoders and task-specific
sliding window inference that allow multi-task training with diverse input
modalities and different structured outputs. We exhaustively analyze
relationships among the tasks and illustrate that cross-task learning leads to
improved performance on each individual task, as well as the ability to
generalize to unseen tasks, such as zero-shot spatial localization of language
queries. |
Introduces MINOTAUR, a unified Transformer-based model for grounding multimodal queries (visual, textual, activity) in long-form egocentric videos. |
Existing video understanding models often tackle tasks in isolation. This work proposes a unified approach to leverage the interplay between tasks and improve performance/generalization. |
The model encodes video and query using task-specific modules, fuses them with a Transformer, and decodes spatio-temporal responses using sliding window inference and a foreground frame prediction module. |
Multi-task learning surpasses single-task models on 9 out of 12 metrics across 3 episodic memory tasks.
The model demonstrates zero-shot spatio-temporal grounding of language queries, a task it was not explicitly trained on.
Ablation studies confirm the effectiveness of each component, including modality-specific encoders and multi-scale inference. |
The model's performance could benefit from larger-scale pre-training on extensive video-text datasets.
Exploring alternative multi-task learning strategies might further enhance performance and generalization capabilities. |
video grounding, multimodal learning, egocentric vision, transformer, zero-shot learning |
2302.07979
Report |
PRedItOR: Text Guided Image Editing with Diffusion Prior |
Hareesh Ravi, Sachin Kelkar, Midhun Harikumar, Ajinkya Kale |
Diffusion models have shown remarkable capabilities in generating high
quality and creative images conditioned on text. An interesting application of
such models is structure preserving text guided image editing. Existing
approaches rely on text conditioned diffusion models such as Stable Diffusion
or Imagen and require compute intensive optimization of text embeddings or
fine-tuning the model weights for text guided image editing. We explore text
guided image editing with a Hybrid Diffusion Model (HDM) architecture similar
to DALLE-2. Our architecture consists of a diffusion prior model that generates
CLIP image embedding conditioned on a text prompt and a custom Latent Diffusion
Model trained to generate images conditioned on CLIP image embedding. We
discover that the diffusion prior model can be used to perform text guided
conceptual edits on the CLIP image embedding space without any finetuning or
optimization. We combine this with structure preserving edits on the image
decoder using existing approaches such as reverse DDIM to perform text guided
image editing. Our approach, PRedItOR does not require additional inputs,
fine-tuning, optimization or objectives and shows on par or better results than
baselines qualitatively and quantitatively. We provide further analysis and
understanding of the diffusion prior model and believe this opens up new
possibilities in diffusion models research. |
PRedItOR: a novel method for text-guided image editing using a pre-trained Hybrid Diffusion Model (HDM) similar to DALLE-2, leveraging the Diffusion Prior for conceptual edits in CLIP image embedding space. |
Existing text-guided image editing techniques based on diffusion models often require base prompts, optimization of embeddings, or fine-tuning, which PRedItOR overcomes by using a pre-trained HDM and a novel two-step editing approach. |
PRedItOR uses the Diffusion Prior to perform a "conceptual edit" by manipulating the base image's CLIP embedding based on the edit text. This is followed by a "structural edit" using reverse DDIM on the HDM's decoder, conditioned on the edited embedding. |
Conceptual editing with the Diffusion Prior effectively captures the edit text's context while preserving information from the base image.
PRedItOR achieves comparable or better qualitative results than existing baselines without requiring base prompts, optimization, or fine-tuning.
Quantitative analysis shows that PRedItOR can achieve a balance between relevance to the edit text and fidelity to the base image's structure. |
The HDM used is trained on a smaller dataset compared to models used in some baselines, limiting the scope of comparable edits.
PRedItOR relies on reverse DDIM, which, similar to SDEdit, can struggle with color-changing edits, leading to a trade-off between color accuracy and structure preservation. |
text-guided image editing, diffusion models, diffusion prior, clip embedding, hybrid diffusion model |
2302.07864
Report |
Denoising Diffusion Probabilistic Models for Robust Image Super-Resolution in the Wild |
Hshmat Sahak, Daniel Watson, Chitwan Saharia, David Fleet |
Diffusion models have shown promising results on single-image
super-resolution and other image-to-image translation tasks. Despite this
success, they have not outperformed state-of-the-art GAN models on the more
challenging blind super-resolution task, where the input images are out of
distribution, with unknown degradations. This paper introduces SR3+, a
diffusion-based model for blind super-resolution, establishing a new
state-of-the-art. To this end, we advocate self-supervised training with a
combination of composite, parameterized degradations, and
noise-conditioning augmentation during training and testing. With
these innovations, a large-scale convolutional architecture, and large-scale
datasets, SR3+ greatly outperforms SR3. It outperforms Real-ESRGAN when trained
on the same data, with a DRealSR FID score of 36.82 vs. 37.22, which further
improves to FID of 32.37 with larger models, and further still with larger
training sets. |
This paper introduces SR3+, a diffusion-based model for blind super-resolution that achieves state-of-the-art results by using self-supervised training with a combination of composite, parameterized degradations and noise-conditioning augmentation. |
Blind super-resolution, where input images have unknown degradations, is a challenging task where previous diffusion models fell short of state-of-the-art GAN models. |
SR3+ leverages a convolutional UNet architecture trained with self-supervision. The training process involves: 1) Applying a sequence of parameterized degradations to high-resolution images to mimic real-world degradations, 2) Noise conditioning augmentation during training and testing to improve robustness and generalization. |
SR3+ outperforms SR3 and Real-ESRGAN on FID-10K when trained on the same data.
Noise conditioning augmentation at test time provides a trade-off between input alignment and realistic detail hallucination.
Increasing model capacity and training set size leads to significant improvements in SR3+ performance. |
Potential failure modes, like gibberish text generation, may require more training steps or architectural improvements.
Exploration of larger models and improved architectures is left for future work. |
super-resolution, diffusion models, blind image super-resolution, noise conditioning augmentation, self-supervised learning |
2302.07848
Report |
One-Shot Face Video Re-enactment using Hybrid Latent Spaces of StyleGAN2 |
Trevine Oorloff, Yaser Yacoob |
While recent research has progressively overcome the low-resolution
constraint of one-shot face video re-enactment with the help of StyleGAN's
high-fidelity portrait generation, these approaches rely on at least one of the
following: explicit 2D/3D priors, optical flow based warping as motion
descriptors, off-the-shelf encoders, etc., which constrain their performance
(e.g., inconsistent predictions, inability to capture fine facial details and
accessories, poor generalization, artifacts). We propose an end-to-end
framework for simultaneously supporting face attribute edits, facial motions
and deformations, and facial identity control for video generation. It employs
a hybrid latent-space that encodes a given frame into a pair of latents:
Identity latent, $\mathcal{W}_{ID}$, and Facial deformation latent,
$\mathcal{S}_F$, that respectively reside in the $W+$ and $SS$ spaces of
StyleGAN2, thereby incorporating the impressive editability-distortion
trade-off of $W+$ and the high disentanglement properties of $SS$. These hybrid
latents employ the StyleGAN2 generator to achieve high-fidelity face video
re-enactment at $1024^2$. Furthermore, the model supports the generation of
realistic re-enactment videos with other latent-based semantic edits (e.g.,
beard, age, make-up, etc.). Qualitative and quantitative analyses performed
against state-of-the-art methods demonstrate the superiority of the proposed
approach. |
This paper presents a novel end-to-end framework for one-shot face video re-enactment at 1024x1024 resolution using a hybrid latent space approach with StyleGAN2. |
Existing methods for face video re-enactment either suffer from low resolution, rely on explicit 2D/3D priors that limit generalizability, or struggle to capture fine facial details. This work leverages the implicit priors and disentanglement properties of StyleGAN2's latent spaces to address these limitations. |
The framework employs an encoder-decoder architecture. The encoder maps an input frame to two latents: an Identity latent in StyleGAN2's W+ space and a Facial Deformation latent in the first 10 layers of the StyleSpace (SS). The decoder utilizes the pre-trained StyleGAN2 generator to synthesize re-enacted frames by combining these latents. A novel "Cyclic Manifold Adjustment" technique is introduced to improve identity reconstruction for out-of-domain subjects. |
The proposed method achieves state-of-the-art quantitative and qualitative results for both same-identity and cross-identity re-enactment at 1024x1024 resolution.
The hybrid latent space approach, combining W+ and SS, is shown to be superior to using W+ alone, highlighting the importance of disentanglement for encoding facial deformations.
The framework demonstrates robustness to variations in head pose and expression in source frames. |
The model inherits limitations from StyleGAN2, such as texture sticking and challenges in handling occlusions and backgrounds.
The lack of high-resolution datasets for re-enactment is acknowledged. |
face video re-enactment, stylegan2, hybrid latent space, one-shot learning, cyclic manifold adjustment |
2302.07685
Report |
Video Probabilistic Diffusion Models in Projected Latent Space |
Sihyun Yu, Kihyuk Sohn, Subin Kim, Jinwoo Shin |
Despite the remarkable progress in deep generative models, synthesizing
high-resolution and temporally coherent videos still remains a challenge due to
their high-dimensionality and complex temporal dynamics along with large
spatial variations. Recent works on diffusion models have shown their potential
to solve this challenge, yet they suffer from severe computation- and
memory-inefficiency that limit the scalability. To handle this issue, we
propose a novel generative model for videos, coined projected latent video
diffusion models (PVDM), a probabilistic diffusion model which learns a video
distribution in a low-dimensional latent space and thus can be efficiently
trained with high-resolution videos under limited resources. Specifically, PVDM
is composed of two components: (a) an autoencoder that projects a given video
as 2D-shaped latent vectors that factorize the complex cubic structure of video
pixels and (b) a diffusion model architecture specialized for our new
factorized latent space and the training/sampling procedure to synthesize
videos of arbitrary length with a single model. Experiments on popular video
generation datasets demonstrate the superiority of PVDM compared with previous
video synthesis methods; e.g., PVDM obtains an FVD score of 639.7 on the
UCF-101 long video (128 frames) generation benchmark, improving on the
prior state-of-the-art score of 1773.4. |
This paper proposes PVDM, a novel latent diffusion model for video generation that operates in a low-dimensional latent space, enabling efficient training with high-resolution videos. |
Synthesizing high-resolution and temporally coherent videos is challenging due to high dimensionality and complex temporal dynamics. Existing diffusion models, while promising, are computationally and memory intensive, limiting scalability. PVDM addresses these limitations. |
PVDM employs a two-stage framework: 1) An autoencoder projects videos into three 2D image-like latent vectors, factorizing the complex cubic structure. 2) A diffusion model tailored for this latent space synthesizes videos of arbitrary length using a joint training strategy for unconditional and frame-conditional generation. |
PVDM achieves state-of-the-art results on UCF-101 and SkyTimelapse datasets for video generation, outperforming baselines in both quantitative metrics (FVD and IS) and qualitative assessments.
The proposed method demonstrates significant computational and memory efficiency compared to pixel-based video diffusion models, enabling training and generation of high-resolution videos with limited resources.
PVDM excels in long video generation, effectively maintaining temporal coherency across extended timesteps even on challenging datasets like UCF-101. |
There is still room for improvement in bridging the gap between real and generated videos.
Exploring better latent structures or designing more specialized diffusion model architectures for triplane latents could be beneficial. |
video generation, diffusion models, latent space, autoencoder, deep learning |
2302.07577
Report |
Efficient Teacher: Semi-Supervised Object Detection for YOLOv5 |
Bowen Xu, Mingtao Chen, Wenlong Guan, Lulu Hu |
Semi-Supervised Object Detection (SSOD) has been successful in improving the
performance of both R-CNN series and anchor-free detectors. However, one-stage
anchor-based detectors lack the structure to generate high-quality or flexible
pseudo labels, leading to serious inconsistency problems in SSOD. In this
paper, we propose the Efficient Teacher framework for scalable and effective
one-stage anchor-based SSOD training, consisting of Dense Detector, Pseudo
Label Assigner, and Epoch Adaptor. Dense Detector is a baseline model that
extends RetinaNet with dense sampling techniques inspired by YOLOv5. The
Efficient Teacher framework introduces a novel pseudo label assignment
mechanism, named Pseudo Label Assigner, which makes more refined use of pseudo
labels from Dense Detector. Epoch Adaptor is a method that enables a stable and
efficient end-to-end semi-supervised training schedule for Dense Detector. The
Pseudo Label Assigner prevents the occurrence of bias caused by a large number
of low-quality pseudo labels that may interfere with the Dense Detector during
the student-teacher mutual learning mechanism, and the Epoch Adaptor utilizes
domain and distribution adaptation to allow Dense Detector to learn globally
distributed consistent features, making the training independent of the
proportion of labeled data. Our experiments show that the Efficient Teacher
framework achieves state-of-the-art results on VOC, COCO-standard, and
COCO-additional using fewer FLOPs than previous methods. To the best of our
knowledge, this is the first attempt to apply Semi-Supervised Object Detection
to YOLOv5. Code is available at
https://github.com/AlibabaResearch/efficientteacher |
This paper proposes Efficient Teacher, a novel framework for scalable and effective semi-supervised object detection (SSOD) training for one-stage anchor-based detectors. |
One-stage anchor-based detectors often struggle with SSOD due to limitations in generating high-quality pseudo labels and the inconsistency of these labels during training. This paper aims to address these challenges and improve the performance of SSOD in this detector category. |
The Efficient Teacher framework consists of three main components: Dense Detector (a RetinaNet-based detector enhanced with dense sampling techniques), Pseudo Label Assigner (PLA, for refined pseudo label assignment), and Epoch Adaptor (EA, for efficient and stable training). PLA categorizes pseudo labels into reliable and uncertain ones and utilizes soft loss for uncertain labels, while EA optimizes training by employing domain and distribution adaptation. |
Efficient Teacher achieves state-of-the-art results on VOC, COCO-standard, and COCO-additional datasets with fewer FLOPs compared to previous SSOD methods.
Pseudo Label Assigner significantly improves performance by mitigating the negative impact of uncertain pseudo labels.
Epoch Adaptor enables faster and more stable training through domain and distribution adaptation. |
The current implementation primarily focuses on object detection tasks; further research is needed to explore its applicability to instance segmentation tasks.
The computational cost of online Mosaic data augmentation during distribution adaptation could be further reduced. |
semi-supervised object detection, pseudo label assignment, one-stage detectors, anchor-based detectors, domain adaptation |
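The Pseudo Label Assigner described in the Efficient Teacher entry above splits teacher predictions into reliable and uncertain pseudo labels and treats the uncertain ones with a soft loss. A minimal sketch of that split is given below. The two threshold values, the dictionary layout of a detection, and the choice to use the teacher score as the soft loss weight are assumptions for illustration and are not taken from the paper or its repository.

```python
def assign_pseudo_labels(detections, low=0.1, high=0.6):
    """Split teacher detections by confidence.

    score >= high        -> reliable (treated like ground truth)
    low <= score < high  -> uncertain (kept, but down-weighted)
    score < low          -> discarded
    The thresholds are hypothetical values.
    """
    reliable, uncertain = [], []
    for det in detections:
        if det["score"] >= high:
            reliable.append(det)
        elif det["score"] >= low:
            uncertain.append(det)
    return reliable, uncertain

def loss_weights(reliable, uncertain):
    """Assumed weighting: full weight for reliable labels,
    the teacher score itself as a soft weight for uncertain ones."""
    weights = [(det["box"], 1.0) for det in reliable]
    weights += [(det["box"], det["score"]) for det in uncertain]
    return weights

# Hypothetical teacher outputs: box as (x1, y1, x2, y2) plus a confidence score.
teacher_detections = [
    {"box": (10, 10, 50, 50), "score": 0.9},
    {"box": (20, 30, 60, 80), "score": 0.3},
    {"box": (5, 5, 15, 15), "score": 0.05},
]

reliable, uncertain = assign_pseudo_labels(teacher_detections)
weights = loss_weights(reliable, uncertain)
```

Keeping the uncertain group with a reduced weight, rather than dropping it, is the design idea the entry attributes to PLA: low-quality pseudo labels contribute less to the student update instead of biasing it.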
2302.07483
Report |
EdgeYOLO: An Edge-Real-Time Object Detector |
Shihan Liu, Junlin Zha, Jian Sun, Zhuo Li, Gang Wang |
This paper proposes an efficient, low-complexity and anchor-free object
detector based on the state-of-the-art YOLO framework, which can be implemented
in real time on edge computing platforms. We develop an enhanced data
augmentation method to effectively suppress overfitting during training, and
design a hybrid random loss function to improve the detection accuracy of small
objects. Inspired by FCOS, a lighter and more efficient decoupled head is
proposed, and its inference speed can be improved with little loss of
precision. Our baseline model can reach the accuracy of 50.6% AP50:95 and 69.8%
AP50 in MS COCO2017 dataset, 26.4% AP50:95 and 44.8% AP50 in VisDrone2019-DET
dataset, and it meets real-time requirements (FPS>=30) on edge-computing device
Nvidia Jetson AGX Xavier. We also designed lighter models with fewer parameters
for edge computing devices with lower computing power, which also show better
performance. Our source code, hyper-parameters and model weights are all
available at https://github.com/LSH9832/edgeyolo. |
This paper proposes EdgeYOLO, an efficient and anchor-free object detector based on the YOLO framework, designed for real-time performance on edge computing platforms. |
Many state-of-the-art object detectors, while accurate, struggle to achieve real-time performance on edge devices due to their complexity. This work aims to bridge this gap by creating a model that balances high accuracy with real-time inference speed on resource-constrained hardware. |
The paper introduces several key innovations: 1) An enhanced data augmentation method combining Mosaic and Mixup to improve data richness and reduce overfitting. 2) A lightweight decoupled head design with reduced channels and layers, further optimized for inference speed using re-parameterization techniques. 3) A staged loss function utilizing Hybrid-Random Loss and cIOU loss to improve detection accuracy, particularly for small objects. |
EdgeYOLO achieves 50.6% AP on MS COCO2017 and 26.4% AP on VisDrone2019-DET, surpassing several state-of-the-art models in accuracy while maintaining real-time performance (FPS ≥ 30) on an Nvidia Jetson AGX Xavier.
The lightweight decoupled head design improves inference speed with little loss of precision compared to a traditional decoupled head.
The staged loss function with Hybrid-Random Loss and cIOU loss demonstrably boosts the detection performance, particularly for small objects. |
The paper acknowledges that while the use of segmentation labels during training can improve accuracy, it is not strictly necessary and has a minor impact on the final result.
Future work will focus on further enhancing the detection accuracy for small objects and exploring additional optimizations for edge devices. |
object detection, anchor-free, real-time, edge computing, yolo |
2302.07319
Report |
Frustratingly Simple but Effective Zero-shot Detection and Segmentation: Analysis and a Strong Baseline |
Siddhesh Khandelwal, Anirudth Nambirajan, Behjat Siddiquie, Jayan Eledath, Leonid Sigal |
Methods for object detection and segmentation often require abundant
instance-level annotations for training, which are time-consuming and expensive
to collect. To address this, the task of zero-shot object detection (or
segmentation) aims at learning effective methods for identifying and localizing
object instances for the categories that have no supervision available.
Constructing architectures for these tasks requires choosing from a myriad of
design options, ranging from the form of the class encoding used to transfer
information from seen to unseen categories, to the nature of the function being
optimized for learning. In this work, we extensively study these design
choices, and carefully construct a simple yet extremely effective zero-shot
recognition method. Through extensive experiments on the MSCOCO dataset on
object detection and segmentation, we highlight that our proposed method
outperforms existing, considerably more complex, architectures. Our findings
and method, which we propose as a competitive future baseline, point towards
the need to revisit some of the recent design trends in zero-shot detection /
segmentation. |
This paper proposes a simple yet effective method for zero-shot object detection and segmentation, achieved by carefully ablating and selecting optimal design choices for each model component. |
Current methods for object detection and segmentation require extensive instance-level annotations, which are costly and time-consuming. Zero-shot learning addresses this by transferring knowledge from seen categories to unseen categories without requiring annotations for the latter. |
The method uses a two-step training process: 1) Training a Faster R-CNN (detection) or Mask R-CNN (segmentation) on seen categories. 2) Fine-tuning a projection layer to map image features to a semantic embedding space using normalized category-name embeddings (GloVe, ConceptNet). |
Outperforms existing zero-shot detection methods by a significant margin on MSCOCO benchmark.
Shows superior performance in zero-shot instance segmentation tasks compared to baselines.
Demonstrates the importance of choosing appropriate semantic embeddings for optimal zero-shot learning performance. |
Limited exploration of more advanced semantic embedding techniques beyond GloVe and ConceptNet.
Future work could explore the impact of different architectures and training paradigms on the proposed approach. |
zero-shot learning, object detection, instance segmentation, semantic embeddings, transfer learning |
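The second training step in the zero-shot detection entry above maps image features into a semantic embedding space and compares them with normalized category-name embeddings. The sketch below shows how such a classifier can switch to unseen categories at test time simply by swapping the embedding table. The two-dimensional toy embeddings, the identity projection, and all names are hypothetical; real GloVe or ConceptNet vectors and a trained detector are not involved.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def matvec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def classify(region_feature, projection, class_embeddings):
    """Project a region feature into the semantic space and score it
    against each normalized class embedding by cosine similarity."""
    projected = l2_normalize(matvec(projection, region_feature))
    scores = {
        name: sum(p * e for p, e in zip(projected, l2_normalize(emb)))
        for name, emb in class_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical projection (identity here) that would be learned on seen classes.
projection = [[1.0, 0.0], [0.0, 1.0]]

# Toy word embeddings: seen classes for training, unseen classes for testing.
seen_embeddings = {"cat": [1.0, 0.0], "car": [0.0, 1.0]}
unseen_embeddings = {"tiger": [0.8, 0.2], "truck": [0.1, 0.9]}

region_feature = [0.9, 0.1]
seen_label, _ = classify(region_feature, projection, seen_embeddings)
unseen_label, unseen_scores = classify(region_feature, projection, unseen_embeddings)
```

Because the classifier is just a similarity against an embedding table, no weights need to change when new category names are introduced, which is what allows transfer from seen to unseen categories in this toy setup.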
2302.07121
Report |
Universal Guidance for Diffusion Models |
Arpit Bansal, Hong-Min Chu, Avi Schwarzschild, Soumyadip Sengupta, Micah Goldblum, Jonas Geiping, Tom Goldstein |
Typical diffusion models are trained to accept a particular form of
conditioning, most commonly text, and cannot be conditioned on other modalities
without retraining. In this work, we propose a universal guidance algorithm
that enables diffusion models to be controlled by arbitrary guidance modalities
without the need to retrain any use-specific components. We show that our
algorithm successfully generates quality images with guidance functions
including segmentation, face recognition, object detection, and classifier
signals. Code is available at
https://github.com/arpitbansal297/Universal-Guided-Diffusion. |
This paper proposes a universal guidance algorithm that allows diffusion models to be controlled by arbitrary guidance modalities (e.g., segmentation, face recognition, object detection) without retraining. |
Existing diffusion models are typically limited to a single conditioning modality and require retraining for new modalities, which is computationally expensive. |
The algorithm leverages pre-trained guidance models on denoised images during the sampling process, closing the domain gap between noisy latent states and clean images. It incorporates forward guidance based on predicted clean images and backward guidance to optimize the image towards the prompt. |
The algorithm successfully generates high-quality images guided by various modalities, including CLIP text embeddings, segmentation maps, face recognition embeddings, and object detection outputs.
It is effective with both unconditional diffusion models (ImageNet) and text-conditional models (Stable Diffusion).
The method can effectively combine multiple guidance functions simultaneously, as demonstrated with segmentation-guided inpainting. |
The generation process using universal guidance is slower than standard conditional generation due to multiple denoising iterations and backward guidance optimization.
Optimal hyperparameters for sampling need to be determined individually for each guidance network. |
diffusion models, guided image generation, universal guidance, multimodal conditioning, image synthesis |
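For the universal guidance entry above, the forward-guidance step can be illustrated with a minimal sketch. It assumes the usual noising relation z_t = sqrt(a) * z_0 + sqrt(1 - a) * eps, a toy quadratic guidance loss in place of a real guidance network, and a noise prediction treated as constant when differentiating. The function names and the `strength` parameter are hypothetical and are not taken from the authors' code.

```python
import math

def predict_clean(z_t, eps, alpha_bar):
    """Estimate the clean signal, assuming z_t = sqrt(a) * z_0 + sqrt(1 - a) * eps."""
    scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [(z - noise_scale * e) / scale for z, e in zip(z_t, eps)]

def forward_guided_eps(z_t, eps, alpha_bar, target, strength):
    """Shift the noise prediction along the gradient of a loss on the clean estimate.

    Toy guidance loss (an assumption): 0.5 * ||z0_hat - target||^2.
    eps is treated as constant, so d z0_hat / d z_t = 1 / sqrt(alpha_bar).
    """
    z0_hat = predict_clean(z_t, eps, alpha_bar)
    grad = [(z0 - t) / math.sqrt(alpha_bar) for z0, t in zip(z0_hat, target)]
    # Adding the loss gradient to eps moves the implied clean estimate toward the target.
    return [e + strength * g for e, g in zip(eps, grad)]
```

In a real sampler the quadratic loss would be replaced by a loss computed from a pretrained guidance model (for example a segmentation or face recognition network) applied to the clean estimate, with the gradient obtained by automatic differentiation.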
2302.06908
Report |
DiffFaceSketch: High-Fidelity Face Image Synthesis with Sketch-Guided Latent Diffusion Model |
Yichen Peng, Chunqi Zhao, Haoran Xie, Tsukasa Fukusato, Kazunori Miyata |
Synthesizing face images from monochrome sketches is one of the most
fundamental tasks in the field of image-to-image translation. However, it is
still challenging to (1) make models learn the high-dimensional face features
such as geometry and color, and (2) take into account the characteristics of
input sketches. Existing methods often use sketches as indirect inputs (or as
auxiliary inputs) to guide the models, resulting in the loss of sketch features
or the alteration of geometry information. In this paper, we introduce a
Sketch-Guided Latent Diffusion Model (SGLDM), an LDM-based network architecture
trained on the paired sketch-face dataset. We apply a Multi-Auto-Encoder (AE)
to encode the different input sketches from different regions of a face from
pixel space to a feature map in latent space, which enables us to reduce the
dimension of the sketch input while preserving the geometry-related information
of local face details. We build a sketch-face paired dataset based on the
existing method that extracts the edge map from an image. We then introduce a
Stochastic Region Abstraction (SRA), an approach to augment our dataset to
improve the robustness of SGLDM to handle sketch input with arbitrary
abstraction. The evaluation study shows that SGLDM can synthesize high-quality
face images with different expressions, facial accessories, and hairstyles from
various sketches with different abstraction levels. |
This paper introduces DiffFaceSketch (SGLDM), a Latent Diffusion Model (LDM) for synthesizing high-fidelity face images from sketches, enhancing control and detail preservation over existing sketch-to-image methods. |
Synthesizing face images from sketches is crucial for applications like character design but challenging due to the sparse nature of sketch data and the need for detailed geometry and color mapping. |
SGLDM uses a two-stage training process: 1) a Multi-Auto-Encoder (AE) encodes sketches into feature maps preserving local details, and 2) an LDM learns to generate faces conditioned on these encoded sketches. They also introduce Stochastic Region Abstraction (SRA) for data augmentation, improving robustness to different sketch abstraction levels. |
SGLDM generates more realistic faces with higher fidelity to input sketches compared to GAN-based methods like Pix2Pix and DeepFaceDrawing.
Quantitative evaluation shows SGLDM achieves superior scores in FID and LPIPS metrics, indicating better image quality and consistency with real faces.
A user study confirms that SGLDM-synthesized images are preferred for both visual quality and input consistency. |
The synthesis can be overly sensitive to sketch quality, leading to artifacts with poor sketches.
The method, while using LDM for efficiency, is still computationally intensive compared to GAN-based approaches, especially during training and sampling. |
image synthesis, sketch-to-image translation, latent diffusion model, face generation, data augmentation |
2302.06833
Report |
VQ3D: Learning a 3D-Aware Generative Model on ImageNet |
Kyle Sargent, Jing Yu Koh, Han Zhang, Huiwen Chang, Charles Herrmann, Pratul Srinivasan, Jiajun Wu, Deqing Sun |
Recent work has shown the possibility of training generative models of 3D
content from 2D image collections on small datasets corresponding to a single
object class, such as human faces, animal faces, or cars. However, these models
struggle on larger, more complex datasets. To model diverse and unconstrained
image collections such as ImageNet, we present VQ3D, which introduces a
NeRF-based decoder into a two-stage vector-quantized autoencoder. Our Stage 1
allows for the reconstruction of an input image and the ability to change the
camera position around the image, and our Stage 2 allows for the generation of
new 3D scenes. VQ3D is capable of generating and reconstructing 3D-aware images
from the 1000-class ImageNet dataset of 1.2 million training images. We achieve
an ImageNet generation FID score of 16.8, compared to 69.8 for the next best
baseline method. |
Presents VQ3D, a novel 3D-aware generative model trained on large and diverse 2D image collections (e.g., ImageNet) using a two-stage vector-quantized autoencoder with a NeRF-based decoder. |
Existing 3D generative models struggle with large, diverse datasets like ImageNet, limiting their ability to generate diverse and unconstrained 3D content. |
Combines a ViT-based encoder and a conditional NeRF decoder in a two-stage VQ-autoencoder framework. Employs a novel depth loss during training to supervise geometry learning using pseudo-GT depth and renders novel views for improved 3D consistency. |
Achieves state-of-the-art generation results on ImageNet with an FID score of 16.8, a significant improvement over the next best baseline (StyleNeRF at 69.8).
Demonstrates competitive performance on the CompCars dataset, highlighting its ability to generalize to different datasets with a simple pose sampling scheme.
Enables single-view 3D reconstruction and manipulation, allowing for novel view synthesis and image editing directly from a single RGB input. |
Large viewpoint manipulation is limited due to the autoencoder-based formulation.
Reliance on a pre-trained depth network for geometry supervision may limit applicability to datasets or domains where accurate depth estimation is challenging. |
3d generative models, nerf, vector quantization, imagenet, novel view synthesis |
2302.06793
Report |
HR-NeuS: Recovering High-Frequency Surface Geometry via Neural Implicit Surfaces |
Erich Liang, Kenan Deng, Xi Zhang, Chun-Kai Wang |
Recent advances in neural implicit surfaces for multi-view 3D reconstruction
primarily focus on improving large-scale surface reconstruction accuracy, but
often produce over-smoothed geometries that lack fine surface details. To
address this, we present High-Resolution NeuS (HR-NeuS), a novel neural
implicit surface reconstruction method that recovers high-frequency surface
geometry while maintaining large-scale reconstruction accuracy. We achieve this
by utilizing (i) multi-resolution hash grid encoding rather than positional
encoding at high frequencies, which boosts our model's expressiveness of local
geometry details; (ii) a coarse-to-fine algorithmic framework that selectively
applies surface regularization to coarse geometry without smoothing away fine
details; (iii) a coarse-to-fine grid annealing strategy to train the network.
We demonstrate through experiments on DTU and BlendedMVS datasets that our
approach produces 3D geometries that are qualitatively more detailed and
quantitatively of similar accuracy compared to previous approaches. |
This paper proposes HR-NeuS, a novel neural implicit surface reconstruction method that recovers high-frequency surface details while maintaining large-scale accuracy. |
Previous methods often produce over-smoothed geometries lacking fine details. This work addresses this limitation to achieve higher fidelity 3D reconstructions. |
The method leverages: (i) Multi-resolution hash grid encoding for enhanced local geometry detail. (ii) A coarse-to-fine framework applying surface regularization selectively to avoid over-smoothing. (iii) A coarse-to-fine grid annealing strategy for network training. |
Recovers finer surface details and textures compared to NeuS.
Achieves similar or better reconstruction accuracy compared to NeuS and NeuralWarp on the DTU dataset.
Ablation study demonstrates the individual contributions of each proposed component. |
Does not incorporate multi-view constraints used by some other methods.
Does not explicitly address ambiguity between shading and surface normals. |
3d reconstruction, neural implicit surfaces, multi-resolution hash encoding, surface regularization, coarse-to-fine training |
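For the HR-NeuS entry above, a coarse-to-fine grid annealing schedule can be sketched as a mask over hash-grid levels that enables finer levels as training proceeds. The report does not give the schedule, so `start_levels`, `steps_per_level`, and both function names are illustrative assumptions.

```python
def active_levels(step, num_levels, start_levels, steps_per_level):
    """Return a 0/1 mask over grid levels (coarse first).

    Hypothetical schedule: start with `start_levels` coarse levels and
    enable one additional finer level every `steps_per_level` steps.
    """
    n = min(num_levels, start_levels + step // steps_per_level)
    return [1.0 if i < n else 0.0 for i in range(num_levels)]

def anneal_features(features_per_level, mask):
    """Zero out the features of levels that are not yet active."""
    return [[m * f for f in feats] for feats, m in zip(features_per_level, mask)]
```

Masking rather than removing the inactive levels keeps the input dimension of the downstream network fixed throughout training.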
2302.06733
Report |
Robust Unsupervised StyleGAN Image Restoration |
Yohan Poirier-Ginter, Jean-François Lalonde |
GAN-based image restoration inverts the generative process to repair images
corrupted by known degradations. Existing unsupervised methods must be
carefully tuned for each task and degradation level. In this work, we make
StyleGAN image restoration robust: a single set of hyperparameters works across
a wide range of degradation levels. This makes it possible to handle
combinations of several degradations, without the need to retune. Our proposed
approach relies on a 3-phase progressive latent space extension and a
conservative optimizer, which avoids the need for any additional regularization
terms. Extensive experiments demonstrate robustness on inpainting, upsampling,
denoising, and deartifacting at varying degradation levels, outperforming
other StyleGAN-based inversion techniques. Our approach also favorably compares
to diffusion-based restoration by yielding much more realistic inversion
results. Code is available at https://lvsn.github.io/RobustUnsupervised/. |
This paper proposes a robust unsupervised StyleGAN image restoration method that uses a single set of hyperparameters across a wide range of degradation levels and types. |
Existing unsupervised StyleGAN image restoration methods require careful hyperparameter tuning for each task and degradation level, making them impractical for handling combinations of degradations. This paper addresses this limitation by introducing a robust approach. |
The proposed method employs a 3-phase progressive latent space extension, starting with global optimization, then expanding to layer-wise, and finally filter-wise. It leverages a conservative normalized gradient descent (NGD) optimizer and a multi-resolution loss function. |
The method achieves state-of-the-art results in most scenarios, outperforming other StyleGAN-based inversion techniques even when they are optimized for each task/level individually.
It demonstrates robustness to varying degradation levels across inpainting, upsampling, denoising, and deartifacting.
The method effectively handles compositions of these tasks without requiring hyperparameter retuning. |
The method is limited to the domain learned by the GAN.
It requires knowledge of an approximate degradation function. |
image restoration, stylegan, unsupervised learning, generative adversarial networks, robustness |
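For the StyleGAN restoration entry above, the idea of a conservative normalized gradient descent (NGD) update can be sketched as follows. This version normalizes by the global L2 norm of the gradient, which is an assumption. The report does not say how the normalization is applied (for example per latent or per layer), and `ngd_step` is a hypothetical name.

```python
import math

def ngd_step(params, grad, lr, eps=1e-12):
    """One normalized gradient descent step.

    The gradient is divided by its L2 norm, so the step length is about `lr`
    regardless of the gradient's magnitude.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    return [p - lr * g / (norm + eps) for p, g in zip(params, grad)]
```

Because the step length does not depend on the scale of the loss, one learning rate can plausibly serve degradations of very different strength, which fits the robustness claim in the report.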
2302.06608
Report |
3D-aware Blending with Generative NeRFs |
Hyunsu Kim, Gayoung Lee, Yunjey Choi, Jin-Hwa Kim, Jun-Yan Zhu |
Image blending aims to combine multiple images seamlessly. It remains
challenging for existing 2D-based methods, especially when input images are
misaligned due to differences in 3D camera poses and object shapes. To tackle
these issues, we propose a 3D-aware blending method using generative Neural
Radiance Fields (NeRF), including two key components: 3D-aware alignment and
3D-aware blending. For 3D-aware alignment, we first estimate the camera pose of
the reference image with respect to generative NeRFs and then perform 3D local
alignment for each part. To further leverage 3D information of the generative
NeRF, we propose 3D-aware blending that directly blends images on the NeRF's
latent representation space, rather than raw pixel space. Collectively, our
method outperforms existing 2D baselines, as validated by extensive
quantitative and qualitative evaluations with FFHQ and AFHQ-Cat. |
The paper proposes a novel 3D-aware image blending method using generative Neural Radiance Fields (NeRFs), enabling seamless blending of unaligned images while preserving 3D consistency. |
Existing 2D image blending methods struggle to handle misaligned images with differences in camera poses and object shapes. This work addresses these limitations by leveraging 3D information. |
The proposed method involves 1) 3D-aware alignment: estimating camera poses and aligning objects in 3D using NeRFs, and 2) 3D-aware blending: blending images in the NeRF's latent space using image-blending and density-blending losses. |
Outperforms state-of-the-art 2D image blending methods in terms of photorealism and faithfulness.
Enables disentanglement of color and geometric changes during blending.
Produces multi-view consistent results, showcasing the 3D awareness of the method. |
Performance relies on the quality of GAN inversion, which can be a bottleneck.
Real-time editing is limited due to the optimization-based approach. Future work could explore encoder-based solutions. |
image blending, generative neural radiance fields, 3d-aware image editing, gan inversion, multi-view consistency |
2302.06586
Report |
Stitchable Neural Networks |
Zizheng Pan, Jianfei Cai, Bohan Zhuang |
The public model zoo, which contains a vast number of powerful pretrained model
families (e.g., ResNet/DeiT), has reached an unprecedented scope, which
significantly contributes to the success of deep learning. As each model family
consists of pretrained models with diverse scales (e.g., DeiT-Ti/S/B), a
fundamental question naturally arises: how to efficiently assemble these
readily available models in a family for dynamic accuracy-efficiency trade-offs
at runtime. To this end, we present Stitchable Neural Networks (SN-Net), a
novel scalable and efficient framework for model deployment. It cheaply
produces numerous networks with different complexity and performance trade-offs
given a family of pretrained neural networks, which we call anchors.
Specifically, SN-Net splits the anchors across the blocks/layers and then
stitches them together with simple stitching layers to map the activations from
one anchor to another. With only a few epochs of training, SN-Net effectively
interpolates between the performance of anchors with varying scales. At
runtime, SN-Net can instantly adapt to dynamic resource constraints by
switching the stitching positions. Extensive experiments on ImageNet
classification demonstrate that SN-Net can obtain on-par or even better
performance than many individually trained networks while supporting diverse
deployment scenarios. For example, by stitching Swin Transformers, we challenge
hundreds of models in Timm model zoo with a single network. We believe this new
elastic model framework can serve as a strong baseline for further research in
wider communities. |
The paper introduces Stitchable Neural Networks (SN-Net), a novel framework that constructs a single scalable network by stitching together pre-trained models of varying sizes from a model family using simple stitching layers, enabling efficient model deployment and dynamic adaptation to resource constraints. |
Existing scalable deep learning methods like model compression and NAS are limited to single model design spaces and struggle to leverage the knowledge from pretrained model families. SN-Net aims to overcome these limitations by efficiently combining pretrained models for better flexibility and accuracy in diverse deployment scenarios. |
SN-Net strategically stitches together pre-trained models (anchors) from the same family using simple 1x1 convolutional layers. It employs a "Fast-to-Slow" stitching direction, connecting a smaller model's early layers to a larger model's later layers. It also utilizes a "nearest stitching" strategy, stitching only models with similar complexities. Training involves randomly sampling and training individual stitches, leveraging knowledge distillation for performance improvement. |
SN-Net achieves flexible accuracy-efficiency trade-offs, effectively interpolating performance between stitched models.
A single SN-Net, trained on ImageNet, achieves comparable or superior performance to individually trained models while significantly reducing training cost and storage space.
SN-Net demonstrates generalizability across different architectures, successfully stitching plain ViTs, hierarchical ViTs, CNNs, and even combining CNNs with ViTs. |
The random stitch sampling strategy during training might be suboptimal for very large stitching spaces, potentially requiring more training epochs.
Future work can explore extending SN-Net to other tasks like NLP, dense prediction, and transfer learning. |
model stitching, elastic deep learning, model deployment, pre-trained models, resource constraints |
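For the SN-Net entry above, the Fast-to-Slow stitching direction can be sketched with toy blocks: run the early blocks of the smaller anchor, map the activations with a stitching layer, then finish with the later blocks of the larger anchor. The blocks are plain Python callables on vectors, the stitching layer is a single matrix (a 1x1 convolution acts this way at each spatial position), and all names are hypothetical.

```python
def linear(weight, x):
    """Apply a matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def stitched_forward(x, small_blocks, large_blocks, stitch_weight, k_small, k_large):
    """Fast-to-Slow stitch.

    Runs the first `k_small` blocks of the small anchor, maps the activations
    into the large anchor's width, then runs the large anchor's blocks from
    index `k_large` onward.
    """
    for block in small_blocks[:k_small]:
        x = block(x)
    x = linear(stitch_weight, x)
    for block in large_blocks[k_large:]:
        x = block(x)
    return x
```

Changing `k_small` and `k_large` at runtime selects a different stitch, and therefore a different accuracy-efficiency trade-off, without loading another network.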
2302.06235
Report |
A Simple Zero-shot Prompt Weighting Technique to Improve Prompt Ensembling in Text-Image Models |
James Urquhart Allingham, Jie Ren, Michael W Dusenberry, Xiuye Gu, Yin Cui, Dustin Tran, Jeremiah Zhe Liu, Balaji Lakshminarayanan |
Contrastively trained text-image models have the remarkable ability to
perform zero-shot classification, that is, classifying previously unseen images
into categories that the model has never been explicitly trained to identify.
However, these zero-shot classifiers need prompt engineering to achieve high
accuracy. Prompt engineering typically requires hand-crafting a set of prompts
for individual downstream tasks. In this work, we aim to automate this prompt
engineering and improve zero-shot accuracy through prompt ensembling. In
particular, we ask "Given a large pool of prompts, can we automatically score
the prompts and ensemble those that are most suitable for a particular
downstream dataset, without needing access to labeled validation data?". We
demonstrate that this is possible. In doing so, we identify several pathologies
in a naive prompt scoring method where the score can be easily overconfident
due to biases in pre-training and test data, and we propose a novel prompt
scoring method that corrects for the biases. Using our proposed scoring method
to create a weighted average prompt ensemble, our method outperforms equal
average ensemble, as well as hand-crafted prompts, on ImageNet, 4 of its
variants, and 11 fine-grained classification benchmarks, all while being fully
automatic, optimization-free, and not requiring access to labeled validation
data. |
This paper proposes Zero-shot Prompt Ensembling (ZPE), an automatic and optimization-free method for selecting and weighting prompts for zero-shot classification with text-image models, eliminating the need for manual prompt engineering. |
Hand-crafting prompts for zero-shot classification in text-image models is labor-intensive and often requires labeled validation data, limiting their general applicability. Automating this process broadens the usability of these models. |
ZPE scores prompts based on normalized maximum logits over a set of test images, addressing biases from word frequency in pre-training data and spurious concept frequency in test data. It then uses these scores for weighted averaging or to select a subset of prompts. |
ZPE consistently outperforms a naive max-logit scoring baseline.
ZPE achieves higher accuracy than hand-crafted prompts on ImageNet, its variants, and a majority of tested fine-grained datasets, despite being fully automatic.
ZPE proves robust to variations in model architecture, the size of the pool set, and the number of random/test images used for score estimation. |
ZPE relies on a large, diverse pool of high-quality prompts, which is currently lacking.
The method scores prompts independently, potentially missing benefits from prompt combinations. |
zero-shot classification, prompt engineering, text-image models, prompt ensembling, bias correction |
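For the ZPE entry above, the scoring idea can be sketched in simplified form: subtract a per-class baseline logit estimated on a reference image set, take the maximum over classes for each test image, average over test images, and turn the scores into ensemble weights. This is only an approximation of the described normalization. The baseline choice, the softmax weighting, and the function names are assumptions rather than the paper's exact procedure.

```python
import math

def zpe_style_scores(test_logits, reference_logits):
    """Score each prompt by its bias-corrected maximum logit.

    test_logits[p][i][c] and reference_logits[p][j][c] hold the logit for
    prompt p, image i (or j), and class c.
    """
    scores = []
    for p_test, p_ref in zip(test_logits, reference_logits):
        num_classes = len(p_ref[0])
        # Per-class baseline: mean logit over the reference images.
        baseline = [sum(img[c] for img in p_ref) / len(p_ref) for c in range(num_classes)]
        per_image = [max(l - b for l, b in zip(img, baseline)) for img in p_test]
        scores.append(sum(per_image) / len(per_image))
    return scores

def softmax(scores):
    """Convert scores to weights that sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

A prompt that produces uniformly high logits for every image gets no credit after the baseline is subtracted, which is the kind of overconfidence the naive max-logit score suffers from.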
2302.06112
Report |
How to Use Dropout Correctly on Residual Networks with Batch Normalization |
Bum Jun Kim, Hyeyeon Choi, Hyeonah Jang, Donggeon Lee, Sang Woo Kim |
For the stable optimization of deep neural networks, regularization methods
such as dropout and batch normalization have been used in various tasks.
Nevertheless, the correct position to apply dropout has rarely been discussed,
and different positions have been employed depending on the practitioners. In
this study, we investigate the correct position to apply dropout. We
demonstrate that for a residual network with batch normalization, applying
dropout at certain positions increases the performance, whereas applying
dropout at other positions decreases the performance. Based on theoretical
analysis, we provide the following guideline for the correct position to apply
dropout: apply one dropout after the last batch normalization but before the
last weight layer in the residual branch. We provide detailed theoretical
explanations to support this claim and demonstrate them through module tests.
In addition, we investigate the correct position of dropout in the head that
produces the final prediction. Although the current consensus is to apply
dropout after global average pooling, we prove that applying dropout before
global average pooling leads to a more stable output. The proposed guidelines
are validated through experiments using different datasets and models. |
This paper investigates the optimal position to apply dropout for improved deep neural network performance, particularly within residual networks with batch normalization. |
While dropout is a widely used regularization technique, its ideal placement within network architectures, especially alongside batch normalization, remains unclear. This lack of understanding can lead to suboptimal performance. |
The authors theoretically analyze the variance inconsistency introduced by dropout and how the order of operations (dropout, ReLU, weight layers, batch normalization, skip connections) affects this inconsistency. They leverage this analysis to propose guidelines for dropout placement. |
Applying dropout after the last batch normalization but before the last weight layer in a residual branch improves performance.
Using dropout before global average pooling in the network head leads to more stable outputs compared to the common practice of applying it afterward.
The proposed guidelines are validated through experiments on various datasets (CIFAR-10, CIFAR-100, Caltech-101, Oxford-IIIT Pet, ImageNet) and models (PreResNet, ResNetV1, MobileNetV2, EfficientNet, DenseNet). |
The analysis focuses on PreResNet (ResNetV2) architecture, and while applicable to other variants, further investigation is needed for broader generalization.
The study primarily focuses on variance inconsistency as the main challenge posed by dropout, potentially overlooking other factors that might influence performance. |
dropout, batch normalization, residual networks, regularization, deep learning |
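For the dropout entry above, the basic train/test variance mismatch behind the analysis can be checked exactly for a single activation. The sketch assumes standard inverted dropout (surviving values are divided by the keep probability). It shows only this elementary mismatch and not the paper's full argument about layer ordering, and the function name is hypothetical.

```python
def inverted_dropout_moments(x, keep_prob):
    """Exact mean and second moment of one activation under inverted dropout.

    In training mode the activation becomes x / keep_prob with probability
    keep_prob and 0 otherwise.
    """
    outcomes = [(keep_prob, x / keep_prob), (1.0 - keep_prob, 0.0)]
    mean = sum(p * v for p, v in outcomes)
    second_moment = sum(p * v * v for p, v in outcomes)
    return mean, second_moment
```

The mean matches the test-time value x, but the second moment is x^2 / keep_prob rather than x^2. Any statistics gathered downstream during training therefore see a different scale than at test time, which is why the position of dropout relative to batch normalization matters.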
2302.05905
Report |
Single Motion Diffusion |
Sigal Raab, Inbal Leibovitch, Guy Tevet, Moab Arar, Amit H. Bermano, Daniel Cohen-Or |
Synthesizing realistic animations of humans, animals, and even imaginary
creatures, has long been a goal for artists and computer graphics
professionals. Compared to the imaging domain, which is rich with large
available datasets, the number of data instances for the motion domain is
limited, particularly for the animation of animals and exotic creatures (e.g.,
dragons), which have unique skeletons and motion patterns. In this work, we
present a Single Motion Diffusion Model, dubbed SinMDM, a model designed to
learn the internal motifs of a single motion sequence with arbitrary topology
and synthesize motions of arbitrary length that are faithful to them. We
harness the power of diffusion models and present a denoising network
explicitly designed for the task of learning from a single input motion. SinMDM
is designed to be a lightweight architecture, which avoids overfitting by using
a shallow network with local attention layers that narrow the receptive field
and encourage motion diversity. SinMDM can be applied in various contexts,
including spatial and temporal in-betweening, motion expansion, style transfer,
and crowd animation. Our results show that SinMDM outperforms existing methods
both in quality and time-space efficiency. Moreover, while current approaches
require additional training for different applications, our work facilitates
these applications at inference time. Our code and trained models are available
at https://sinmdm.github.io/SinMDM-page. |
This paper introduces SinMDM, a novel single motion diffusion model for synthesizing diverse and realistic motions from a single input sequence. |
Motion data, especially for non-humanoid characters, is often scarce, making traditional data-driven methods challenging. SinMDM tackles this by effectively learning motion motifs from a single sequence, enabling diverse animation generation for arbitrary skeletons. |
SinMDM leverages a shallow UNet architecture with local attention layers (QnA) to learn from a single motion sequence. This design choice, coupled with a narrow receptive field, encourages motion diversity and prevents overfitting. |
SinMDM outperforms prior art, including Ganimator, in quantitative metrics on both HumanML3D and Mixamo benchmarks.
The model effectively synthesizes long, high-quality motion sequences and demonstrates various motion manipulation capabilities, including in-betweening, style transfer, and crowd animation.
SinMDM showcases the potential of diffusion models for learning from limited data, challenging the notion that they require large datasets. |
Like all single-instance learning methods, SinMDM has limited ability to synthesize out-of-distribution motions.
The iterative nature of diffusion models results in relatively long inference times. |
motion synthesis, diffusion models, single-instance learning, character animation, computer graphics |
2302.05872
Report |
I$^2$SB: Image-to-Image Schrödinger Bridge |
Guan-Horng Liu, Arash Vahdat, De-An Huang, Evangelos A. Theodorou, Weili Nie, Anima Anandkumar |
We propose Image-to-Image Schrödinger Bridge (I$^2$SB), a new class of
conditional diffusion models that directly learn the nonlinear diffusion
processes between two given distributions. These diffusion bridges are
particularly useful for image restoration, as the degraded images are
structurally informative priors for reconstructing the clean images. I$^2$SB
belongs to a tractable class of Schrödinger bridge, the nonlinear extension
to score-based models, whose marginal distributions can be computed
analytically given boundary pairs. This results in a simulation-free framework
for nonlinear diffusions, where the I$^2$SB training becomes scalable by
adopting practical techniques used in standard diffusion models. We validate
I$^2$SB in solving various image restoration tasks, including inpainting,
super-resolution, deblurring, and JPEG restoration on ImageNet 256x256 and show
that I$^2$SB surpasses standard conditional diffusion models with more
interpretable generative processes. Moreover, I$^2$SB matches the performance
of inverse methods that additionally require the knowledge of the corruption
operators. Our work opens up new algorithmic opportunities for developing
efficient nonlinear diffusion models on a large scale. Project page and
codes: https://i2sb.github.io/ |
This paper proposes Image-to-Image Schrödinger Bridge (I$^2$SB), a new conditional diffusion model that learns nonlinear diffusion bridges directly between two given distributions, making it particularly suitable for image restoration. |
Existing diffusion models for image restoration typically start their generative denoising processes with Gaussian white noise, lacking structural information from the degraded images. I$^2$SB overcomes this limitation by directly leveraging the degraded images as informative priors, leading to more efficient and interpretable image restoration. |
I$^2$SB constructs tractable Schrödinger bridges between individual clean images and their corresponding degraded distributions. It leverages an analytic posterior given boundary pairs for training and utilizes standard DDPM for generation. The method avoids complex simulations typically required by standard Schrödinger bridge models, making it scalable to high-dimensional data. |
I$^2$SB surpasses standard conditional diffusion models like Palette and ADM in multiple image restoration tasks, including super-resolution, JPEG restoration, and inpainting.
I$^2$SB achieves competitive performance to diffusion-based inverse models without requiring knowledge of the corruption operators.
I$^2$SB exhibits more interpretable and efficient generation processes with smaller performance drops as the number of function evaluations decreases. |
The tractability of I$^2$SB relies on the availability of paired data during training, limiting its application in unpaired image translation tasks.
Exploring simulation-free diffusion bridges under more flexible setups is an interesting future direction. |
image restoration, diffusion models, schrödinger bridge, conditional generation, image-to-image translation |
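For the I$^2$SB entry above, the analytic posterior given a boundary pair can be sketched as a Gaussian bridge between a clean sample and its degraded counterpart. The constant noise rate `beta` is an illustrative assumption (the paper's schedule may differ), and `bridge_posterior` is a hypothetical name.

```python
def bridge_posterior(x0, x1, t, beta=1.0):
    """Mean and per-coordinate variance of X_t given (x0, x1), for t in [0, 1].

    x0 is the clean sample and x1 the degraded one. With a constant noise rate,
    the variance accumulated from 0 to t is beta * t and from t to 1 is
    beta * (1 - t).
    """
    var_fwd = beta * t
    var_bwd = beta * (1.0 - t)
    total = var_fwd + var_bwd
    # Each endpoint is weighted by the variance accumulated on the opposite side.
    mean = [(var_bwd * a + var_fwd * b) / total for a, b in zip(x0, x1)]
    variance = var_fwd * var_bwd / total
    return mean, variance
```

Sampling X_t this way needs no simulation of the diffusion, which is what makes training on paired clean and degraded images as cheap as standard diffusion training.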
2302.05499
Report |
CUDA: Curriculum of Data Augmentation for Long-Tailed Recognition |
Sumyeong Ahn, Jongwoo Ko, Se-Young Yun |
Class imbalance problems frequently occur in real-world tasks, and
conventional deep learning algorithms are well known for performance
degradation on imbalanced training datasets. To mitigate this problem, many
approaches have aimed to balance among given classes by re-weighting or
re-sampling training samples. These re-balancing methods increase the impact of
minority classes and reduce the influence of majority classes on the output of
models. However, the extracted representations may be of poor quality owing to
the limited number of minority samples. To handle this restriction, several
methods have been developed that increase the representations of minority
samples by leveraging the features of the majority samples. Despite extensive
recent studies, no deep analysis has been conducted on the determination of classes
to be augmented and the strength of augmentation. In this study,
we first investigate the correlation between the degree of augmentation and
class-wise performance, and find that the proper degree of augmentation must be
allocated for each class to mitigate class imbalance problems. Motivated by
this finding, we propose a simple and efficient novel curriculum, which is
designed to find the appropriate per-class strength of data augmentation,
called CUDA: CUrriculum of Data Augmentation for long-tailed recognition. CUDA
can simply be integrated into existing long-tailed recognition methods. We
present the results of experiments showing that CUDA effectively achieves
better generalization performance compared to the state-of-the-art method on
various imbalanced datasets such as CIFAR-100-LT, ImageNet-LT, and iNaturalist
2018. |
The paper proposes CUDA, a simple and efficient curriculum learning-based data augmentation method for long-tailed recognition, which adaptively finds the proper augmentation strength for each class. |
Class imbalance problems are common in real-world tasks, and traditional deep learning algorithms often perform poorly on imbalanced datasets. While existing methods address this issue by re-weighting or re-sampling training samples, they often fail to fully utilize the limited information available for minority classes. |
CUDA measures a Level-of-Learning (LoL) score for each class, reflecting the model's ability to classify augmented samples. Based on this score, it generates augmented samples with varying difficulties, gradually increasing the augmentation strength for classes the model learns well and decreasing it for those it struggles with. |
CUDA consistently outperforms state-of-the-art methods on multiple imbalanced datasets, including CIFAR-100-LT, ImageNet-LT, and iNaturalist 2018.
Analysis shows that CUDA improves both the classifier's balance, reducing the variance in weight norms between classes, and the feature extractor's ability, leading to better feature alignment.
The LoL score dynamics demonstrate that CUDA effectively adjusts augmentation strength throughout training, allowing the model to learn difficult samples without forgetting the original information. |
The impact of the number and type of predefined augmentation operations on CUDA's performance could be further investigated.
Exploring the effectiveness of CUDA in other domains beyond image classification, such as natural language processing or time series analysis, would be valuable. |
class imbalance, long-tailed recognition, data augmentation, curriculum learning, deep learning |
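The per-class curriculum described in the CUDA report above can be illustrated with a small sketch. It is not the authors' implementation: the pass threshold `gamma_pct`, the cap `max_level`, and the toy accuracy function are all hypothetical stand-ins for the Level-of-Learning check, chosen only to show how the strength rises for classes the model handles well and falls for classes it struggles with.

```python
from typing import Callable, List


def update_levels(
    levels: List[int],
    accuracy_pct: Callable[[int, int], int],
    gamma_pct: int = 60,   # hypothetical pass threshold, in percent
    max_level: int = 10,   # hypothetical cap on augmentation strength
) -> List[int]:
    """Raise a class's augmentation level if it passes the check, else lower it."""
    updated = []
    for cls, level in enumerate(levels):
        if accuracy_pct(cls, level) >= gamma_pct:
            updated.append(min(level + 1, max_level))
        else:
            updated.append(max(level - 1, 0))
    return updated


def toy_accuracy_pct(cls: int, level: int) -> int:
    """Toy stand-in for the model: accuracy falls 5 points per strength level."""
    base = [95, 80, 50][cls]  # head, medium, and tail class
    return base - 5 * level


def run_curriculum(epochs: int) -> List[int]:
    """Start every class at level 0 and apply the update once per epoch."""
    levels = [0, 0, 0]
    for _ in range(epochs):
        levels = update_levels(levels, toy_accuracy_pct)
    return levels


if __name__ == "__main__":
    print(run_curriculum(8))
```

In this toy setting the head class climbs steadily, the medium class oscillates around the strength it can just tolerate, and the tail class stays at the weakest level.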
2302.05496
Report |
MaskSketch: Unpaired Structure-guided Masked Image Generation |
Dina Bashkirova, Jose Lezama, Kihyuk Sohn, Kate Saenko, Irfan Essa |
Recent conditional image generation methods produce images of remarkable
diversity, fidelity and realism. However, the majority of these methods allow
conditioning only on labels or text prompts, which limits their level of
control over the generation result. In this paper, we introduce MaskSketch, an
image generation method that allows spatial conditioning of the generation
result using a guiding sketch as an extra conditioning signal during sampling.
MaskSketch utilizes a pre-trained masked generative transformer, requiring no
model training or paired supervision, and works with input sketches of
different levels of abstraction. We show that intermediate self-attention maps
of a masked generative transformer encode important structural information of
the input image, such as scene layout and object shape, and we propose a novel
sampling method based on this observation to enable structure-guided
generation. Our results show that MaskSketch achieves high image realism and
fidelity to the guiding structure. Evaluated on standard benchmark datasets,
MaskSketch outperforms state-of-the-art methods for sketch-to-image
translation, as well as unpaired image-to-image translation approaches. |
Introduces MaskSketch, a sketch-guided image generation method leveraging pre-trained masked generative transformers for realistic image synthesis with spatial control. |
Addresses limitations of existing methods that struggle with fine-grained spatial control in image generation, particularly in sketch-to-photo translation due to domain gaps. |
Utilizes self-attention maps of a pre-trained masked generative transformer to define structural similarity and guide image generation towards the desired layout specified by an input sketch. |
Demonstrates that self-attention maps encode structural information robust to domain shifts between sketches and photos.
Achieves high realism and structure fidelity in sketch-to-photo translation without paired supervision.
Outperforms state-of-the-art sketch-to-photo and general unpaired image translation methods in realism and structure preservation. |
Computational efficiency is a limitation due to multiple sampling iterations and rejection sampling.
Limited by the coarse granularity of transformer attention maps and the flexibility of the pre-trained ImageNet model. |
image generation, sketch-to-photo translation, generative transformers, self-attention maps, structure-guided synthesis |
2302.05486
Report |
RAFaRe: Learning Robust and Accurate Non-parametric 3D Face Reconstruction from Pseudo 2D&3D Pairs |
Longwei Guo, Hao Zhu, Yuanxun Lu, Menghua Wu, Xun Cao |
We propose a robust and accurate non-parametric method for single-view 3D
face reconstruction (SVFR). While tremendous efforts have been devoted to
parametric SVFR, a visible gap still lies between the resulting 3D shape and the
ground truth. We believe there are two major obstacles: 1) the representation
of the parametric model is limited to a certain face database; 2) 2D images and
3D shapes in the fitted datasets are distinctly misaligned. To resolve these
issues, a large-scale pseudo 2D&3D dataset is created by first rendering the
detailed 3D faces, then swapping the faces in in-the-wild images with the rendered
faces. These pseudo 2D&3D pairs are created from publicly available datasets
and eliminate the gaps between 2D and 3D data while covering diverse
appearances, poses, scenes, and illumination. We further propose a
non-parametric scheme to learn a well-generalized SVFR model from the created
dataset, and the proposed hierarchical signed distance function turns out to be
effective in predicting middle-scale and small-scale 3D facial geometry. Our
model outperforms previous methods on FaceScape-wild/lab and MICC benchmarks
and is well generalized to various appearances, poses, expressions, and
in-the-wild environments. The code is released at
http://github.com/zhuhao-nju/rafare . |
This paper presents a novel non-parametric method for single-view 3D face reconstruction (SVFR) that surpasses previous parametric methods limited by 3DMMs and inaccurate training data. |
Achieving robust and accurate SVFR is crucial for various applications, including facial editing, animation, and VR/AR. |
The authors create a large-scale pseudo 2D&3D dataset with accurate alignment by swapping in-the-wild faces with precisely reconstructed faces. They then employ a hierarchical signed distance function to train a non-parametric SVFR model on this dataset. |
The method outperforms previous approaches on FaceScape-wild/lab and MICC benchmarks, demonstrating superior accuracy.
It exhibits strong generalization to diverse appearances, poses, expressions, and in-the-wild environments.
The hierarchical SDF proves effective in recovering detailed facial geometry at different scales. |
The non-uniform mesh topology requires an additional registration step for downstream applications.
The performance on faces with large poses is relatively lower due to limited training data with extreme poses. |
3d face reconstruction, single-view reconstruction, non-parametric method, hierarchical signed distance function, data augmentation |
2302.05016
Report |
Is Multimodal Vision Supervision Beneficial to Language? |
Avinash Madasu, Vasudev Lal |
Vision (image and video)-language (VL) pre-training is a recently popular
paradigm that has achieved state-of-the-art results on multi-modal tasks like
image retrieval, video retrieval, visual question answering, etc. These models
are trained in an unsupervised way and greatly benefit from the complementary
modality supervision. In this paper, we explore if the language representations
trained using vision supervision perform better than vanilla language
representations on Natural Language Understanding and commonsense reasoning
benchmarks. We experiment with a diverse set of image-text models such as
ALBEF, BLIP, METER and video-text models like ALPRO, Frozen-in-Time (FiT),
VIOLET. We compare the performance of language representations of stand-alone
text encoders of these models to the language representations of text encoders
learnt through vision supervision. Our experiments suggest that vanilla
language representations show superior performance on most of the tasks. These
results shed light on the current drawbacks of the vision-language models. |
This paper investigates whether language representations trained with visual supervision from image-text and video-text models perform better than vanilla language representations on Natural Language Understanding (NLU) and commonsense reasoning tasks. |
Vision-language pre-training has shown success in multi-modal tasks, raising the question of its impact on language understanding capabilities. |
The study compares vanilla language models (BERT, RoBERTa, DistilBERT) pre-trained on text captions with their vision-supervised counterparts from models like ALBEF, BLIP, METER, ALPRO, FiT, and VIOLET. They are evaluated on GLUE, SuperGLUE, and commonsense reasoning benchmarks. |
Vanilla language representations outperform vision-supervised counterparts on most NLU tasks (NLI, sentence similarity, reading comprehension).
Similar trends are observed for commonsense reasoning benchmarks.
However, vision-supervised models show improvements on specific tasks like WNLI (GLUE) and COPA (SuperGLUE). |
The study primarily focuses on understanding language capabilities and doesn't evaluate multi-modal tasks.
Future work can explore the impact of different pre-training objectives and data scales on the performance difference. |
vision-language pre-training, natural language understanding, commonsense reasoning, language representation learning, multi-modal learning |
2302.04871
Report |
In-N-Out: Faithful 3D GAN Inversion with Volumetric Decomposition for Face Editing |
Yiran Xu, Zhixin Shu, Cameron Smith, Seoung Wug Oh, Jia-Bin Huang |
3D-aware GANs offer new capabilities for view synthesis while preserving the
editing functionalities of their 2D counterparts. GAN inversion is a crucial
step that seeks the latent code to reconstruct input images or videos,
subsequently enabling diverse editing tasks through manipulation of this latent
code. However, a model pre-trained on a particular dataset (e.g., FFHQ) often
has difficulty reconstructing images with out-of-distribution (OOD) objects
such as faces with heavy make-up or occluding objects. We address this issue by
explicitly modeling OOD objects from the input in 3D-aware GANs. Our core idea
is to represent the image using two individual neural radiance fields: one for
the in-distribution content and the other for the out-of-distribution object.
The final reconstruction is achieved by optimizing the composition of these two
radiance fields with carefully designed regularization. We demonstrate that our
explicit decomposition alleviates the inherent trade-off between reconstruction
fidelity and editability. We evaluate reconstruction accuracy and editability
of our method on challenging real face images and videos and showcase favorable
results against other baselines. |
This paper introduces a novel 3D-aware GAN inversion method for reconstructing and editing portrait images and videos containing out-of-distribution (OOD) objects, such as heavy makeup or accessories. |
Existing 3D GAN inversion techniques struggle to reconstruct and edit images with OOD objects due to the models being trained primarily on in-distribution data (e.g., natural faces). This limits their ability to handle challenging cases with complex textures or occlusions. |
The method decomposes the 3D representation into two neural radiance fields, one for the in-distribution face and another for the OOD object. This is achieved by leveraging the tri-plane representation of EG3D and employing a composite volume rendering scheme that combines both radiance fields for reconstruction. |
The approach achieves high-fidelity reconstruction of faces with OOD objects, outperforming existing methods on metrics such as LPIPS, PSNR, SSIM, and ID similarity.
It preserves the editability of the pre-trained GAN, allowing for semantic manipulations like changing facial expressions while leaving the OOD component intact.
The method enables 3D-aware applications such as novel view synthesis and OOD object removal. |
The method faces challenges in editing OOD regions directly, handling duplicate objects (like adding glasses to existing glasses), and dealing with extreme poses.
The current implementation primarily focuses on single-frame editing and can suffer from temporal inconsistency in video editing. |
gan inversion, 3d-aware gans, out-of-distribution data, neural radiance fields, composite volume rendering |
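The In-N-Out report above describes combining two radiance fields in a composite volume rendering scheme. The sketch below shows one common way to do this, assuming a density-weighted colour mixture at each sample followed by standard alpha compositing along the ray. Whether the paper uses exactly this mixture is an assumption, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Color = Tuple[float, float, float]


def composite_sample(
    sigma_in: float, color_in: Color, sigma_ood: float, color_ood: Color
) -> Tuple[float, Color]:
    """Merge the in-distribution and OOD fields at one 3D sample point."""
    sigma = sigma_in + sigma_ood  # densities add
    if sigma == 0.0:
        return 0.0, (0.0, 0.0, 0.0)  # empty space contributes nothing
    # Colour is the density-weighted average of the two fields (assumed form).
    color = tuple(
        (sigma_in * a + sigma_ood * b) / sigma for a, b in zip(color_in, color_ood)
    )
    return sigma, color


def render_ray(
    samples: Sequence[Tuple[float, Color]], deltas: Sequence[float]
) -> Tuple[Color, float]:
    """Alpha-composite (sigma, color) samples front to back along one ray."""
    transmittance = 1.0
    pixel: List[float] = [0.0, 0.0, 0.0]
    for (sigma, color), delta in zip(samples, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        for k in range(3):
            pixel[k] += weight * color[k]
        transmittance *= 1.0 - alpha  # light left for samples further back
    return (pixel[0], pixel[1], pixel[2]), transmittance
```

Setting the OOD density to zero everywhere renders the in-distribution field alone, which is how a decomposition of this kind can support removing the OOD object.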
2302.04869
Report |
Reversible Vision Transformers |
Karttikeya Mangalam, Haoqi Fan, Yanghao Li, Chao-Yuan Wu, Bo Xiong, Christoph Feichtenhofer, Jitendra Malik |
We present Reversible Vision Transformers, a memory efficient architecture
design for visual recognition. By decoupling the GPU memory requirement from
the depth of the model, Reversible Vision Transformers enable scaling up
architectures with efficient memory usage. We adapt two popular models, namely
Vision Transformer and Multiscale Vision Transformers, to reversible variants
and benchmark extensively across both model sizes and tasks of image
classification, object detection and video classification. Reversible Vision
Transformers achieve a reduced memory footprint of up to 15.5x at roughly
identical model complexity, parameters and accuracy, demonstrating the promise
of reversible vision transformers as an efficient backbone for hardware
resource limited training regimes. Finally, we find that the additional
computational burden of recomputing activations is more than overcome for
deeper models, where throughput can increase up to 2.3x over their
non-reversible counterparts. Full code and trained models are available at
https://github.com/facebookresearch/slowfast. A simpler, easy to understand and
modify version is also available at https://github.com/karttikeya/minREV |
This paper introduces Reversible Vision Transformers (Rev-ViT and Rev-MViT), memory-efficient versions of ViT and MViT that decouple memory usage from model depth by recomputing activations instead of storing them. |
The memory requirements of deep Vision Transformers often limit their scalability, especially in memory-intensive tasks like video recognition. Reversible architectures offer a solution by significantly reducing activation memory footprint. |
The authors adapt ViT and MViT to reversible architectures by employing reversible transformations, reconfiguring residual connections to improve stability in deep models, and developing training recipes tailored for the inherent regularization of reversible networks. |
Rev-ViT and Rev-MViT achieve comparable accuracy to their non-reversible counterparts across image classification, object detection, and video classification benchmarks.
Reversible models exhibit significant memory savings, with Rev-ViT-L using 15.5x less memory and Rev-MViT-B using 4.5x less memory per image than their non-reversible versions.
Deeper Rev-MViT models demonstrate up to 2.3x higher throughput compared to standard MViT models due to reduced memory bottlenecks. |
The stage-transition blocks in Rev-MViT, necessary for resolution changes, still require activation caching, somewhat limiting memory savings.
Further research can explore asynchronous activation recomputation and parallelization strategies to further improve the training speed of reversible transformers. |
vision transformer, reversible architecture, memory efficiency, image classification, video classification, object detection |
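To make the memory argument in the Reversible Vision Transformers report above concrete, the sketch below shows the two-stream coupling used by reversible residual networks: the inputs of a block can be recomputed exactly from its outputs, so intermediate activations need not be stored. The functions `f` and `g` are toy stand-ins for the sub-blocks (for example attention and MLP), and the plain-list vectors are an assumption made to keep the example free of dependencies.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
Fn = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def sub(a: Vector, b: Vector) -> Vector:
    return [x - y for x, y in zip(a, b)]


def reversible_forward(x1: Vector, x2: Vector, f: Fn, g: Fn) -> Tuple[Vector, Vector]:
    """y1 = x1 + f(x2); y2 = x2 + g(y1)."""
    y1 = add(x1, f(x2))
    y2 = add(x2, g(y1))
    return y1, y2


def reversible_inverse(y1: Vector, y2: Vector, f: Fn, g: Fn) -> Tuple[Vector, Vector]:
    """Recover the inputs from the outputs by undoing the two additions in reverse."""
    x2 = sub(y2, g(y1))
    x1 = sub(y1, f(x2))
    return x1, x2


def toy_f(v: Vector) -> Vector:
    """Stand-in for one sub-block; any deterministic function works."""
    return [math.tanh(t) for t in v]


def toy_g(v: Vector) -> Vector:
    """Stand-in for the other sub-block."""
    return [0.5 * t * t for t in v]


if __name__ == "__main__":
    a, b = [0.3, -1.2], [2.0, 0.7]
    y1, y2 = reversible_forward(a, b, toy_f, toy_g)
    print(reversible_inverse(y1, y2, toy_f, toy_g))  # matches (a, b) up to rounding
```

Note that neither `f` nor `g` has to be invertible. The price is that both are evaluated again during the backward pass, which is the recomputation cost the abstract refers to.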
2302.04868
Report |
MEGANE: Morphable Eyeglass and Avatar Network |
Junxuan Li, Shunsuke Saito, Tomas Simon, Stephen Lombardi, Hongdong Li, Jason Saragih |
Eyeglasses play an important role in the perception of identity. Authentic
virtual representations of faces can benefit greatly from their inclusion.
However, modeling the geometric and appearance interactions of glasses and the
face of virtual representations of humans is challenging. Glasses and faces
affect each other's geometry at their contact points, and also induce
appearance changes due to light transport. Most existing approaches do not
capture these physical interactions since they model eyeglasses and faces
independently. Others attempt to resolve interactions as a 2D image synthesis
problem and suffer from view and temporal inconsistencies. In this work, we
propose a 3D compositional morphable model of eyeglasses that accurately
incorporates high-fidelity geometric and photometric interaction effects. To
support the large variation in eyeglass topology efficiently, we employ a
hybrid representation that combines surface geometry and a volumetric
representation. Unlike volumetric approaches, our model naturally retains
correspondences across glasses, and hence explicit modification of geometry,
such as lens insertion and frame deformation, is greatly simplified. In
addition, our model is relightable under point lights and natural illumination,
supporting high-fidelity rendering of various frame materials, including
translucent plastic and metal within a single morphable model. Importantly, our
approach models global light transport effects, such as casting shadows between
faces and glasses. Our morphable model for eyeglasses can also be fit to novel
glasses via inverse rendering. We compare our approach to state-of-the-art
methods and demonstrate significant quality improvements. |
Presents MEGANE, a 3D morphable and relightable model of eyeglasses that captures geometric and photometric interactions between eyeglasses and faces. |
Existing methods for synthesizing glasses on faces either lack 3D consistency, fail to model interactions, or are not relightable, limiting their realism. |
Combines a hybrid mesh-volumetric representation for glasses with a generative human head model, leveraging physics-inspired neural relighting and multi-view data with explicit geometry guidance. |
Accurately models geometric deformations of both glasses and faces at contact points.
Achieves high-fidelity relighting under novel illuminations, supporting diverse materials including translucent plastic and metal.
Enables few-shot reconstruction of novel glasses and supports lens insertion with realistic refraction and reflection. |
Initial glasses position and subtle motion due to expressions are entangled.
Current relighting is per-point-light, limiting real-time applicability. |
neural rendering, generative model, 3d face, eyeglasses, relighting |
2302.04867
Report |
UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models |
Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, Jiwen Lu |
Diffusion probabilistic models (DPMs) have demonstrated a very promising
ability in high-resolution image synthesis. However, sampling from a
pre-trained DPM is time-consuming due to the multiple evaluations of the
denoising network, making it more and more important to accelerate the sampling
of DPMs. Despite recent progress in designing fast samplers, existing methods
still cannot generate satisfying images in many applications where fewer steps
(e.g., $<$10) are favored. In this paper, we develop a unified corrector (UniC)
that can be applied after any existing DPM sampler to increase the order of
accuracy without extra model evaluations, and derive a unified predictor (UniP)
that supports arbitrary order as a byproduct. Combining UniP and UniC, we
propose a unified predictor-corrector framework called UniPC for the fast
sampling of DPMs, which has a unified analytical form for any order and can
significantly improve the sampling quality over previous methods, especially in
extremely few steps. We evaluate our methods through extensive experiments
including both unconditional and conditional sampling using pixel-space and
latent-space DPMs. Our UniPC can achieve 3.87 FID on CIFAR10 (unconditional)
and 7.51 FID on ImageNet 256$\times$256 (conditional) with only 10 function
evaluations. Code is available at https://github.com/wl-zhao/UniPC. |
This paper proposes UniPC, a unified predictor-corrector framework for fast sampling of diffusion probabilistic models (DPMs). |
Sampling from DPMs is computationally expensive due to many evaluations of the denoising network. UniPC enables faster sampling while maintaining high image quality, particularly in extremely few steps, which is crucial in applications like prompt design for text-to-image models. |
The framework is based on a novel unified corrector (UniC) that increases the order of accuracy without extra model evaluations and a unified predictor (UniP) that supports arbitrary order. It leverages the structure of exponential integrators with respect to half log-SNR for efficient computation. |
UniPC achieves superior sampling quality in few-step sampling on various datasets, including CIFAR10, LSUN Bedroom, FFHQ, ImageNet, and MS-COCO2014, outperforming state-of-the-art methods like DPM-Solver++.
UniC consistently improves the sampling quality of existing DPM solvers with different updating methods and orders.
UniPC allows for customizable order schedules and demonstrates promising results with both noise and data prediction models. |
UniPC, being a training-free method, still lags behind training-based approaches in performance.
Further improvements are possible by exploring better choices for the function B(h), a more accurate estimation of epsilon(x_t, t), and optimal order schedules. |
diffusion probabilistic models, fast sampling, predictor-corrector, high-order solver, image synthesis |
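The general predictor-corrector idea behind the UniPC report above can be shown on a toy ODE. The sketch below is not UniPC: it uses an Euler predictor and a trapezoidal corrector on dx/dt = f(x), purely to illustrate how a corrector can raise accuracy while reusing the function evaluation made at the predicted point, so that each step still costs one evaluation. All names are hypothetical.

```python
import math
from typing import Callable, Tuple


def euler(f: Callable[[float], float], x0: float, h: float, steps: int) -> Tuple[float, int]:
    """Plain first-order Euler; returns the final state and the evaluation count."""
    x, calls = x0, 0
    for _ in range(steps):
        x = x + h * f(x)
        calls += 1
    return x, calls


def predictor_corrector(
    f: Callable[[float], float], x0: float, h: float, steps: int
) -> Tuple[float, int]:
    """Euler predictor plus trapezoidal corrector, one new evaluation per step."""
    x = x0
    fx = f(x)  # single start-up evaluation
    calls = 1
    for _ in range(steps):
        x_pred = x + h * fx               # predictor
        f_pred = f(x_pred)                # the only new evaluation in this step
        calls += 1
        x = x + 0.5 * h * (fx + f_pred)   # corrector refines the predicted state
        fx = f_pred                       # reuse it instead of evaluating f(x) again
    return x, calls


if __name__ == "__main__":
    decay = lambda x: -x                  # exact solution: x(t) = exp(-t)
    exact = math.exp(-1.0)
    x_e, n_e = euler(decay, 1.0, 0.1, 10)
    x_pc, n_pc = predictor_corrector(decay, 1.0, 0.1, 10)
    print(abs(x_e - exact), n_e)
    print(abs(x_pc - exact), n_pc)
```

On this toy problem the corrected trajectory ends much closer to the exact solution than Euler at nearly the same evaluation budget, which is the kind of trade-off that matters when each evaluation is a pass through a denoising network.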
2302.04850
Report |
Robot Synesthesia: A Sound and Emotion Guided AI Painter |
Vihaan Misra, Peter Schaldenbrand, Jean Oh |
If a picture paints a thousand words, sound may voice a million. While recent
robotic painting and image synthesis methods have achieved progress in
generating visuals from text inputs, the translation of sound into images is
vastly unexplored. Generally, sound-based interfaces and sonic interactions
have the potential to expand accessibility and control for the user and provide
a means to convey complex emotions and the dynamic aspects of the real world.
In this paper, we propose an approach for using sound and speech to guide a
robotic painting process, known here as robot synesthesia. For general sound,
we encode the simulated paintings and input sounds into the same latent space.
For speech, we decouple speech into its transcribed text and the tone of the
speech. Whereas we use the text to control the content, we estimate the
emotions from the tone to guide the mood of the painting. Our approach has been
fully integrated with FRIDA, a robotic painting framework, adding sound and
speech to FRIDA's existing input modalities, such as text and style. In two
surveys, participants correctly guessed the emotion or natural sound used to
generate a given painting at more than twice the rate of random chance.
On our sound-guided image manipulation and music-guided paintings, we discuss
the results qualitatively. |
This paper introduces Robot Synesthesia, a novel approach that incorporates sound and speech inputs into the FRIDA robotic painting system, enabling a robot to generate paintings that reflect the semantic and emotional content of auditory cues. |
This research addresses the underexplored problem of translating sound into images, expanding the accessibility and control of robotic painting systems and enabling a richer expression of human emotions in art created by robots. |
The methodology involves leveraging pre-trained audio-image encoders like CLIP_audio for natural sounds and decoupling speech into transcribed text (content) and tone (emotion) using Whisper and a Speech Emotion Recognition model. These features are then used to guide the robotic painting process in FRIDA. |
User studies showed that participants correctly guessed the emotion or natural sound used to generate a given painting at more than twice the rate of random chance.
Paintings generated from natural sounds like rain or thunder were recognizable by human observers.
Emotion-guided paintings, even with abstract appearances, successfully conveyed the intended emotion to human viewers. |
The generalization of the generated content is limited by the training data used for the audio-image and image-emotion models.
Evaluating the quality of generated artwork, especially abstract ones, remains a challenge in this field. |
robotic painting, sound-guided image generation, emotion in art, human-robot interaction, multimodal learning |
2302.04841
Report |
Is This Loss Informative? Faster Text-to-Image Customization by Tracking Objective Dynamics |
Anton Voronov, Mikhail Khoroshikh, Artem Babenko, Max Ryabinin |
Text-to-image generation models represent the next step of evolution in image
synthesis, offering a natural way to achieve flexible yet fine-grained control
over the result. One emerging area of research is the fast adaptation of large
text-to-image models to smaller datasets or new visual concepts. However, many
efficient methods of adaptation have a long training time, which limits their
practical applications, slows down experiments, and consumes excessive GPU
resources. In this work, we study the training dynamics of popular
text-to-image personalization methods (such as Textual Inversion or
DreamBooth), aiming to speed them up. We observe that most concepts are learned
at early stages and do not improve in quality later, but standard training
convergence metrics fail to indicate that. Instead, we propose a simple drop-in
early stopping criterion that only requires computing the regular training
objective on a fixed set of inputs for all training iterations. Our experiments
on Stable Diffusion for 48 different concepts and three personalization methods
demonstrate the competitive performance of our approach, which makes adaptation
up to 8 times faster with no significant drops in quality. |
This paper proposes DVAR, a novel early stopping criterion to accelerate the adaptation of text-to-image models (e.g., Textual Inversion, DreamBooth) by leveraging a deterministic training loss calculated on a fixed input batch. |
Existing adaptation methods for text-to-image models often have long training times, hindering their practical application and efficient experimentation. |
The authors analyze the training dynamics of adaptation methods and identify that fixing random components in the loss function makes its convergence more interpretable. This observation leads to the development of DVAR, which monitors the stabilization of the deterministic loss for early stopping. |
DVAR significantly reduces training time (2-8x faster) for various adaptation methods on Stable Diffusion v1.5 without compromising image quality.
The deterministic loss used in DVAR provides a more reliable convergence indicator than standard metrics like training loss or gradient norm.
The adaptive nature of DVAR helps mitigate overfitting to training images, leading to better generalization to unseen prompts. |
The study primarily focuses on Stable Diffusion v1.5, and further validation on other models and datasets is needed.
Exploring alternative early stopping criteria beyond variance-based methods could be beneficial. |
text-to-image generation, model personalization, early stopping, stable diffusion, training dynamics |
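The early-stopping idea in the DVAR report above can be sketched as a variance test on a loss series that has been made deterministic by evaluating it on a fixed batch. The exact rule below (variance of the last `window` values divided by the variance of the whole history, compared with `threshold`) is an assumption for illustration, as are the parameter names and default values.

```python
from statistics import pvariance
from typing import Optional, Sequence


def early_stop_step(
    losses: Sequence[float], window: int = 5, threshold: float = 0.02
) -> Optional[int]:
    """Return the first step at which the recent loss variance is small
    relative to the variance of the whole history, or None if it never is."""
    for step in range(window, len(losses) + 1):
        history = losses[:step]
        global_var = pvariance(history)
        if global_var == 0.0:
            return step  # a perfectly flat history counts as converged
        recent_var = pvariance(history[-window:])
        if recent_var / global_var < threshold:
            return step
    return None


if __name__ == "__main__":
    # Hypothetical deterministic loss that drops quickly and then plateaus.
    plateau = [1.0, 0.8, 0.6, 0.45, 0.35, 0.30, 0.30, 0.30, 0.30, 0.30, 0.30]
    # Hypothetical loss that is still decreasing at a constant rate.
    falling = [1.0 - 0.05 * i for i in range(12)]
    print(early_stop_step(plateau))
    print(early_stop_step(falling))
```

The test only makes sense when the batch, noise, and timesteps are held fixed across evaluations. With freshly sampled inputs the recent variance would be dominated by sampling noise rather than by training progress.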
2302.04638
Report |
Better Diffusion Models Further Improve Adversarial Training |
Zekai Wang, Tianyu Pang, Chao Du, Min Lin, Weiwei Liu, Shuicheng Yan |
It has been recognized that the data generated by the denoising diffusion
probabilistic model (DDPM) improves adversarial training. After two years of
rapid development in diffusion models, a question naturally arises: can better
diffusion models further improve adversarial training? This paper gives an
affirmative answer by employing the most recent diffusion model which has
higher efficiency ($\sim 20$ sampling steps) and image quality (lower FID
score) compared with DDPM. Our adversarially trained models achieve
state-of-the-art performance on RobustBench using only generated data (no
external datasets). Under the $\ell_\infty$-norm threat model with
$\epsilon=8/255$, our models achieve $70.69\%$ and $42.67\%$ robust accuracy on
CIFAR-10 and CIFAR-100, respectively, i.e., improving upon previous
state-of-the-art models by $+4.58\%$ and $+8.03\%$. Under the $\ell_2$-norm
threat model with $\epsilon=128/255$, our models achieve $84.86\%$ on CIFAR-10
($+4.44\%$). These results also beat previous works that use external data. We
also provide compelling results on the SVHN and TinyImageNet datasets. Our code
is available at https://github.com/wzekai99/DM-Improves-AT. |
This paper explores the impact of utilizing an advanced diffusion model, the elucidating diffusion model (EDM), to enhance adversarial training (AT) for improved robustness against adversarial attacks. |
The work is significant as it addresses the question of whether advancements in diffusion models can further improve the effectiveness of AT, a crucial defense against adversarial attacks. |
The authors generate data using the class-conditional EDM and incorporate it into the AT process, replacing the previously used DDPM-generated data. They conduct comprehensive experiments on CIFAR-10, CIFAR-100, SVHN, and TinyImageNet datasets, comparing their approach to state-of-the-art methods. |
Replacing DDPM-generated data with EDM-generated data leads to significant improvements in both clean and robust accuracy of adversarially trained models.
The authors achieve state-of-the-art results on RobustBench without using any external data, surpassing even previous methods that rely on external datasets.
The study reveals that using generated data with lower FID scores (indicating higher quality) consistently leads to enhanced model robustness. |
The study primarily focuses on ℓ∞ and ℓ2 norm-based attacks, leaving the exploration of other attack types for future work.
The work highlights the need for more efficient utilization of diffusion models in adversarial learning to address the computational demands of generating large amounts of data. |
adversarial training, diffusion models, robustness, data augmentation, edm |
2302.04440
Report |
Feature Likelihood Divergence: Evaluating the Generalization of Generative Models Using Samples |
Marco Jiralerspong, Avishek Joey Bose, Ian Gemp, Chongli Qin, Yoram Bachrach, Gauthier Gidel |
The past few years have seen impressive progress in the development of deep
generative models capable of producing high-dimensional, complex, and
photo-realistic data. However, current methods for evaluating such models
remain incomplete: standard likelihood-based metrics do not always apply and
rarely correlate with perceptual fidelity, while sample-based metrics, such as
FID, are insensitive to overfitting, i.e., inability to generalize beyond the
training set. To address these limitations, we propose a new metric called the
Feature Likelihood Divergence (FLD), a parametric sample-based metric that uses
density estimation to provide a comprehensive trichotomic evaluation accounting
for novelty (i.e., different from the training samples), fidelity, and
diversity of generated samples. We empirically demonstrate the ability of FLD
to identify overfitting problem cases, even when previously proposed metrics
fail. We also extensively evaluate FLD on various image datasets and model
classes, demonstrating its ability to match intuitions of previous metrics like
FID while offering a more comprehensive evaluation of generative models. Code
is available at https://github.com/marcojira/fld. |
Proposes Feature Likelihood Divergence (FLD), a sample-based metric for evaluating generative models that captures sample fidelity, diversity, and novelty. |
Existing sample-based metrics like FID, while correlating with sample quality and diversity, fail to detect overfitting (memorization of the training set), which is crucial for assessing generalization ability and addressing privacy concerns. |
FLD leverages a Mixture of Gaussians (MoG) density estimator in a perceptually meaningful feature space (e.g., DINOv2). It fits the MoG's variances to the training set such that overfit samples receive vanishingly small variances, negatively impacting the density estimation and resulting FLD score. |
FLD correlates strongly with sample fidelity, penalizing perceptually significant transformations more than minor ones.
It effectively captures mode coverage and diversity, with scores improving as generated samples encompass more classes and avoid redundant copies.
FLD demonstrates consistent detection of overfitting, even for subtly transformed copies of training data, outperforming FID, CT, and AuthPct metrics. |
The reliance on fixed feature spaces might not generalize to all datasets and modalities, necessitating exploration of alternative embeddings.
Future work could explore extensions of FLD for evaluating conditional generative models and its applicability in other data modalities like text, audio, and time series. |
generative models, evaluation metrics, overfitting detection, sample fidelity, sample diversity |
2302.04265
Report |
PFGM++: Unlocking the Potential of Physics-Inspired Generative Models |
Yilun Xu, Ziming Liu, Yonglong Tian, Shangyuan Tong, Max Tegmark, Tommi Jaakkola |
We introduce a new family of physics-inspired generative models termed PFGM++
that unifies diffusion models and Poisson Flow Generative Models (PFGM). These
models realize generative trajectories for $N$ dimensional data by embedding
paths in $N{+}D$ dimensional space while still controlling the progression with
a simple scalar norm of the $D$ additional variables. The new models reduce to
PFGM when $D{=}1$ and to diffusion models when $D{\to}\infty$. The flexibility
of choosing $D$ allows us to trade off robustness against rigidity as
increasing $D$ results in more concentrated coupling between the data and the
additional variable norms. We dispense with the biased large batch field
targets used in PFGM and instead provide an unbiased perturbation-based
objective similar to diffusion models. To explore different choices of $D$, we
provide a direct alignment method for transferring well-tuned hyperparameters
from diffusion models ($D{\to} \infty$) to any finite $D$ values. Our
experiments show that models with finite $D$ can be superior to previous
state-of-the-art diffusion models on CIFAR-10/FFHQ $64{\times}64$ datasets,
with FID scores of $1.91/2.43$ when $D{=}2048/128$. In class-conditional
setting, $D{=}2048$ yields current state-of-the-art FID of $1.74$ on CIFAR-10.
In addition, we demonstrate that models with smaller $D$ exhibit improved
robustness against modeling errors. Code is available at
https://github.com/Newbeeer/pfgmpp |
Presents PFGM++, a new family of physics-inspired generative models unifying diffusion models and Poisson Flow Generative Models (PFGM) by embedding data in higher dimensions and controlling generation with a scalar norm. |
Provides flexibility in balancing robustness and learning rigidity, potentially leading to improved generative models, particularly in resource-constrained settings. |
Expands PFGM's electrostatic view into higher dimensions, introduces a perturbation-based training objective, and proves equivalence to diffusion models as the augmentation dimension approaches infinity. |
Models with finite augmentation dimensions outperform state-of-the-art diffusion models on CIFAR-10/FFHQ 64x64 datasets.
An optimal augmentation dimension exists that balances robustness and learning efficiency.
Decreasing the augmentation dimension improves robustness against modeling errors like noise injection, large sampling steps, and quantization. |
Identifying the optimal augmentation dimension for various architectures and tasks requires further analysis.
Developing stochastic samplers for PFGM++ is a promising direction. |
generative models, diffusion models, poisson flow generative models, robustness, image generation |
2302.04233
Report |
SkyEye: Self-Supervised Bird's-Eye-View Semantic Mapping Using Monocular Frontal View Images |
Nikhil Gosala, Kürsat Petek, Paulo L. J. Drews-Jr, Wolfram Burgard, Abhinav Valada |
Bird's-Eye-View (BEV) semantic maps have become an essential component of
automated driving pipelines due to the rich representation they provide for
decision-making tasks. However, existing approaches for generating these maps
still follow a fully supervised training paradigm and hence rely on large
amounts of annotated BEV data. In this work, we address this limitation by
proposing the first self-supervised approach for generating a BEV semantic map
using a single monocular image from the frontal view (FV). During training, we
overcome the need for BEV ground truth annotations by leveraging the more
easily available FV semantic annotations of video sequences. Thus, we propose
the SkyEye architecture that learns based on two modes of self-supervision,
namely, implicit supervision and explicit supervision. Implicit supervision
trains the model by enforcing spatial consistency of the scene over time based
on FV semantic sequences, while explicit supervision exploits BEV pseudolabels
generated from FV semantic annotations and self-supervised depth estimates.
Extensive evaluations on the KITTI-360 dataset demonstrate that our
self-supervised approach performs on par with the state-of-the-art fully
supervised methods and achieves competitive results using only 1% of direct
supervision in the BEV compared to fully supervised approaches. Finally, we
publicly release both our code and the BEV datasets generated from the
KITTI-360 and Waymo datasets. |
This paper introduces SkyEye, the first self-supervised framework for generating semantic bird's-eye-view (BEV) maps from single monocular front-view images. |
Generating BEV semantic maps is essential for autonomous driving, but existing methods rely on large amounts of annotated BEV data, which is difficult and expensive to obtain. SkyEye addresses this by utilizing more readily available front-view annotations and self-supervision. |
SkyEye leverages implicit supervision, enforcing spatial consistency over time using front-view semantic sequences, and explicit supervision, utilizing BEV pseudolabels generated from front-view annotations and self-supervised depth estimates. |
SkyEye achieves performance comparable to state-of-the-art fully supervised methods on the KITTI-360 dataset without using BEV ground truth annotations.
The approach shows competitive results even when trained with only 1% of BEV pseudolabels compared to fully supervised approaches.
SkyEye demonstrates superior generalization capabilities compared to baselines when pretrained on KITTI-360 and evaluated on Waymo. |
The model's reliance on temporal context can impact performance in highly dynamic scenes.
Perspective distortion limits spatial observability for distant regions, a common limitation for camera-based methods. |
self-supervised learning, bev semantic mapping, autonomous driving, monocular vision, 3d representation learning |
2302.03675
Report |
Auditing Gender Presentation Differences in Text-to-Image Models |
Yanzhe Zhang, Lu Jiang, Greg Turk, Diyi Yang |
Text-to-image models, which can generate high-quality images based on textual
input, have recently enabled various content-creation tools. Despite
significantly affecting a wide range of downstream applications, the
distributions of these generated images are still not fully understood,
especially when it comes to the potential stereotypical attributes of different
genders. In this work, we propose a paradigm (Gender Presentation Differences)
that utilizes fine-grained self-presentation attributes to study how gender is
presented differently in text-to-image models. By probing gender indicators in
the input text (e.g., "a woman" or "a man"), we quantify the frequency
differences of presentation-centric attributes (e.g., "a shirt" and "a dress")
through human annotation and introduce a novel metric: GEP. Furthermore, we
propose an automatic method to estimate such differences. The automatic GEP
metric based on our approach yields a higher correlation with human annotations
than that based on existing CLIP scores, consistently across three
state-of-the-art text-to-image models. Finally, we demonstrate the
generalization ability of our metrics in the context of gender stereotypes
related to occupations. |
This paper proposes GEP, a novel metric to quantify gender presentation differences in text-to-image models by analyzing the frequency of presentation-centric attributes (e.g., clothing) in images generated with different gender indicators. |
Understanding how gender is portrayed in generated images is crucial for identifying and mitigating potential biases and stereotypes perpetuated by text-to-image models. |
The authors define gender indicators, attributes, and contexts to construct prompts for image generation. They manually annotate the frequency of attributes in generated images and propose an automatic method using cross-modal classifiers trained on CLIP embeddings to estimate these frequencies. |
Significant attribute-wise differences are observed in generated images when prompting with different genders, both with and without explicit attribute mentions.
The proposed automatic GEP metric based on cross-modal classifiers shows a stronger correlation with human annotations than using CLIP similarity scores alone.
The GEP metric can be extended to reveal attribute-based gender stereotypes related to occupations. |
The study is limited by the selected set of attributes and contexts, which may not be exhaustive or representative of all scenarios.
The lack of real-world distribution data for the studied attributes makes it difficult to determine if the observed differences are amplified compared to reality. |
text-to-image generation, gender bias, stereotype detection, clip, cross-modal classifiers |
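The frequency-difference idea behind GEP in the entry above can be sketched roughly as follows: count how often each attribute is annotated in images generated for two gender indicators, then subtract the two frequency vectors. The annotation format (one set of attributes per image), the function names, and the toy data are assumptions made for illustration; the paper's exact definition of GEP may differ.

```python
from collections import Counter

def attribute_frequencies(annotations, attributes):
    """Fraction of images in which each attribute was annotated as present.

    `annotations` is a non-empty list with one set of attribute names per image
    (a hypothetical annotation format, not the paper's).
    """
    counts = Counter()
    for present in annotations:
        counts.update(set(present) & set(attributes))
    n = len(annotations)
    return {a: counts[a] / n for a in attributes}

def attribute_frequency_difference(ann_a, ann_b, attributes):
    """Per-attribute frequency difference between two prompt groups."""
    freq_a = attribute_frequencies(ann_a, attributes)
    freq_b = attribute_frequencies(ann_b, attributes)
    return {a: freq_a[a] - freq_b[a] for a in attributes}

# Toy annotations for images generated from "a woman ..." and "a man ..." prompts.
attributes = ["shirt", "dress", "tie"]
woman_images = [{"dress"}, {"dress", "shirt"}, {"shirt"}, {"dress"}]
man_images = [{"shirt"}, {"shirt"}, {"shirt", "tie"}, set()]

diff = attribute_frequency_difference(woman_images, man_images, attributes)
print(diff)
```

A positive value means the attribute appears more often for the first indicator, a negative value more often for the second.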
2302.03594
Report |
NICER-SLAM: Neural Implicit Scene Encoding for RGB SLAM |
Zihan Zhu, Songyou Peng, Viktor Larsson, Zhaopeng Cui, Martin R. Oswald, Andreas Geiger, Marc Pollefeys |
Neural implicit representations have recently become popular in simultaneous
localization and mapping (SLAM), especially in dense visual SLAM. However,
previous works in this direction either rely on RGB-D sensors, or require a
separate monocular SLAM approach for camera tracking and do not produce
high-fidelity dense 3D scene reconstruction. In this paper, we present
NICER-SLAM, a dense RGB SLAM system that simultaneously optimizes for camera
poses and a hierarchical neural implicit map representation, which also allows
for high-quality novel view synthesis. To facilitate the optimization process
for mapping, we integrate additional supervision signals including
easy-to-obtain monocular geometric cues and optical flow, and also introduce a
simple warping loss to further enforce geometry consistency. Moreover, to
further boost performance in complicated indoor scenes, we also propose a local
adaptive transformation from signed distance functions (SDFs) to density in the
volume rendering equation. On both synthetic and real-world datasets we
demonstrate strong performance in dense mapping, tracking, and novel view
synthesis, even competitive with recent RGB-D SLAM systems. |
Presents NICER-SLAM, a novel dense RGB SLAM system that uses a hierarchical neural implicit representation for end-to-end optimization of both scene representation and camera poses. |
Addresses the limitations of existing dense SLAM systems, which either rely on RGB-D sensors or use separate tracking and mapping pipelines and therefore struggle to produce high-fidelity dense 3D reconstructions from monocular RGB input in challenging scenarios. |
The system leverages a hierarchical neural implicit representation for scene geometry and color, incorporating geometric and motion regularizations, including monocular cues and a novel warping loss. A locally adaptive SDF-to-density transformation enhances performance in complex indoor environments. |
NICER-SLAM achieves competitive 3D reconstruction and tracking accuracy compared to RGB-D SLAM methods, even without depth input.
It demonstrates superior novel view synthesis quality, surpassing both traditional and implicit-based SLAM approaches.
The system exhibits robustness in challenging scenarios with low-resolution images and motion blur. |
The current implementation is not yet real-time.
Loop closure is not incorporated, limiting long-term tracking accuracy. |
slam, neural implicit representations, 3d reconstruction, novel view synthesis, monocular rgb |
2302.03406
Report |
High-Resolution GAN Inversion for Degraded Images in Large Diverse Datasets |
Yanbo Wang, Chuming Lin, Donghao Luo, Ying Tai, Zhizhong Zhang, Yuan Xie |
The last decades are marked by massive and diverse image data, which shows
increasingly high resolution and quality. However, some images we obtained may
be corrupted, affecting the perception and the application of downstream tasks.
A generic method for generating a high-quality image from the degraded one is
in demand. In this paper, we present a novel GAN inversion framework that
utilizes the powerful generative ability of StyleGAN-XL for this problem. To
ease the inversion challenge with StyleGAN-XL, Clustering & Regularize
Inversion (CRI) is proposed. Specifically, the latent space is firstly divided
into finer-grained sub-spaces by clustering. Instead of initializing the
inversion with the average latent vector, we approximate a centroid latent
vector from the clusters, which generates an image close to the input image.
Then, an offset with a regularization term is introduced to keep the inverted
latent vector within a certain range. We validate our CRI scheme on multiple
restoration tasks (i.e., inpainting, colorization, and super-resolution) of
complex natural images, and show preferable quantitative and qualitative
results. We further demonstrate our technique is robust in terms of data and
different GAN models. To our best knowledge, we are the first to adopt
StyleGAN-XL for generating high-quality natural images from diverse degraded
inputs. Code is available at https://github.com/Booooooooooo/CRI. |
This paper proposes CRI, a novel GAN inversion framework utilizing StyleGAN-XL to generate high-quality images from diverse degraded inputs (e.g., masked, grayscale, or low-resolution images). |
Generating high-quality images from degraded images is crucial due to the uneven quality of online images and the need for high-quality images in various applications. |
CRI utilizes clustering to find a better starting point for optimization in the complex latent space of StyleGAN-XL and introduces a regularized offset to constrain the optimization process, ensuring high perceptual quality of generated images. |
CRI outperforms existing GAN inversion methods in image inpainting, colorization, and super-resolution tasks on ImageNet, CelebA-HQ, and out-of-domain datasets.
CRI with StyleGAN-XL generates higher quality images than DGP with BigGAN, highlighting the benefit of using StyleGAN-XL for this task.
Ablation studies demonstrate the effectiveness of clustering and the regularized offset in improving both quantitative and qualitative results. |
The clustering time increases with the number of clusters, posing a trade-off between performance and computational cost.
Further exploration of applying CRI to other degradation types like noise and blur is left for future work. |
gan inversion, image restoration, stylegan-xl, image inpainting, image colorization, super-resolution |
2302.03084
Report |
Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval |
Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, Tomas Pfister |
In Composed Image Retrieval (CIR), a user combines a query image with text to
describe their intended target. Existing methods rely on supervised learning of
CIR models using labeled triplets consisting of the query image, text
specification, and the target image. Labeling such triplets is expensive and
hinders broad applicability of CIR. In this work, we propose to study an
important task, Zero-Shot Composed Image Retrieval (ZS-CIR), whose goal is to
build a CIR model without requiring labeled triplets for training. To this end,
we propose a novel method, called Pic2Word, that requires only weakly labeled
image-caption pairs and unlabeled image datasets to train. Unlike existing
supervised CIR models, our model trained on weakly labeled or unlabeled
datasets shows strong generalization across diverse ZS-CIR tasks, e.g.,
attribute editing, object composition, and domain conversion. Our approach
outperforms several supervised CIR methods on the common CIR benchmark, CIRR
and Fashion-IQ. Code will be made publicly available at
https://github.com/google-research/composed_image_retrieval. |
This paper proposes a novel task, Zero-Shot Composed Image Retrieval (ZS-CIR), and introduces a method called Pic2Word to address it. Pic2Word enables CIR without requiring labeled triplets for training. |
Existing CIR methods are limited by the need for expensive labeled triplet data and often struggle to generalize to diverse CIR tasks. ZS-CIR aims to overcome these limitations by enabling CIR models to function without task-specific labeled data. |
Pic2Word leverages pre-trained vision-language contrastive learning models (e.g., CLIP) and learns a mapping network that converts image embeddings into pseudo language tokens. This allows for the composition of image and text queries within the language embedding space, effectively achieving early fusion. |
Pic2Word significantly outperforms zero-shot baselines on domain conversion, object composition, and scene manipulation tasks.
On CIRR and Fashion-IQ datasets, Pic2Word achieves performance comparable to or better than several recent supervised CIR methods trained on labeled data.
Analysis suggests that the learned pseudo language tokens effectively capture image information and that the method benefits from the early fusion strategy. |
The performance of Pic2Word on CIRR and Fashion-IQ highlights the potential dataset-specific bias in relative importance of image and text modalities.
Future work can explore the use of multiple pseudo tokens to represent images with finer details. |
composed image retrieval, zero-shot learning, vision-language models, contrastive learning, early fusion |
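For the Pic2Word entry above, the mapping from an image embedding to a pseudo language token can be sketched as below. This is a minimal illustration under stated assumptions: the toy dimensions, the two-layer MLP with random untrained weights, and the names `image_to_pseudo_token` and `compose_query` are all hypothetical, and the frozen text encoder that would consume the composed sequence is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IMG, D_TOK, HIDDEN = 8, 6, 16  # toy sizes, not the real CLIP dimensions

# Hypothetical mapping network: a small MLP with random (untrained) weights.
W1 = rng.normal(size=(D_IMG, HIDDEN)) * 0.1
W2 = rng.normal(size=(HIDDEN, D_TOK)) * 0.1

def image_to_pseudo_token(image_embedding):
    """Map an image embedding into the token-embedding space."""
    hidden = np.maximum(image_embedding @ W1, 0.0)  # ReLU
    return hidden @ W2

def compose_query(token_embeddings, placeholder_index, image_embedding):
    """Replace the placeholder token (e.g. '*') with the pseudo token."""
    composed = token_embeddings.copy()
    composed[placeholder_index] = image_to_pseudo_token(image_embedding)
    return composed

# Stand-in token embeddings for a 5-token prompt such as "a sketch of *".
prompt_tokens = rng.normal(size=(5, D_TOK))
image_embedding = rng.normal(size=D_IMG)
query = compose_query(prompt_tokens, placeholder_index=3,
                      image_embedding=image_embedding)
print(query.shape)
```

Because the image enters as a token before the text encoder runs, image and text are fused early, inside the language embedding space.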
2302.03024
Report |
AIM: Adapting Image Models for Efficient Video Action Recognition |
Taojiannan Yang, Yi Zhu, Yusheng Xie, Aston Zhang, Chen Chen, Mu Li |
Recent vision transformer based video models mostly follow the "image
pre-training then finetuning" paradigm and have achieved great success on
multiple video benchmarks. However, full finetuning such a video model could be
computationally expensive and unnecessary, given the pre-trained image
transformer models have demonstrated exceptional transferability. In this work,
we propose a novel method to Adapt pre-trained Image Models (AIM) for efficient
video understanding. By freezing the pre-trained image model and adding a few
lightweight Adapters, we introduce spatial adaptation, temporal adaptation and
joint adaptation to gradually equip an image model with spatiotemporal
reasoning capability. We show that our proposed AIM can achieve competitive or
even better performance than prior arts with substantially fewer tunable
parameters on four video action recognition benchmarks. Thanks to its
simplicity, our method is also generally applicable to different image
pre-trained models, which has the potential to leverage more powerful image
foundation models in the future. The project webpage is
https://adapt-image-models.github.io/. |
This paper proposes AIM, a novel method to adapt pre-trained image models (e.g., ViT) for efficient video understanding by adding lightweight adapters and reusing spatial attention for temporal modeling. |
Full finetuning of video models is computationally expensive and potentially unnecessary given the strong transferability of pre-trained image models. |
The method introduces spatial, temporal, and joint adaptation modules. Spatial adaptation uses adapters after self-attention for spatial feature refinement. Temporal adaptation reuses image self-attention for temporal modeling and adds adapters for temporal feature tuning. Joint adaptation utilizes adapters in parallel to MLP layers for spatiotemporal reasoning. |
AIM achieves competitive or better performance than state-of-the-art methods on K400, K700, and Diving-48 with significantly fewer tunable parameters.
AIM exhibits data efficiency, outperforming fully finetuned counterparts, especially in low-data regimes.
The method is simple, generally applicable, and reduces training cost significantly compared to full finetuning. |
The reused spatial attention for temporal modeling might not be sufficient for complex temporal relationships.
Future work could explore leveraging pre-trained weights from text or audio models for enhanced temporal adaptation. |
video understanding, action recognition, efficient finetuning, vision transformer, transfer learning |
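For the AIM entry above, a lightweight adapter and the reshaping that lets a frozen spatial attention layer operate along the time axis might look roughly like this. It is a sketch under assumptions: the bottleneck design with a skip connection, the zero-initialised up-projection, and the names `Adapter` and `to_temporal` are illustrative choices, not confirmed details of the paper's implementation.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class Adapter:
    """Bottleneck adapter: down-project, nonlinearity, up-project, skip connection."""

    def __init__(self, dim, bottleneck, rng):
        self.w_down = rng.normal(size=(dim, bottleneck)) * 0.02
        self.b_down = np.zeros(bottleneck)
        # Assumption: zero-initialised up-projection, so the adapter starts
        # as an identity map and does not disturb the frozen backbone.
        self.w_up = np.zeros((bottleneck, dim))
        self.b_up = np.zeros(dim)

    def __call__(self, x):
        return x + gelu(x @ self.w_down + self.b_down) @ self.w_up + self.b_up

def to_temporal(x):
    """(B, T, N, C) -> (B*N, T, C): each spatial location becomes a sequence
    over time, so the same attention layer can attend across frames."""
    b, t, n, c = x.shape
    return x.transpose(0, 2, 1, 3).reshape(b * n, t, c)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 5, 16))        # batch, frames, tokens, channels
adapter = Adapter(dim=16, bottleneck=4, rng=rng)
out = adapter(x)
temporal = to_temporal(x)
print(out.shape, temporal.shape)
```

Only the adapter weights would be trained in such a setup; with a bottleneck of 4 on 16 channels, the adapter has far fewer parameters than a full 16x16 projection.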
2302.03011
Report |
Structure and Content-Guided Video Synthesis with Diffusion Models |
Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, Anastasis Germanidis |
Text-guided generative diffusion models unlock powerful image creation and
editing tools. While these have been extended to video generation, current
approaches that edit the content of existing footage while retaining structure
require expensive re-training for every input or rely on error-prone
propagation of image edits across frames. In this work, we present a structure
and content-guided video diffusion model that edits videos based on visual or
textual descriptions of the desired output. Conflicts between user-provided
content edits and structure representations occur due to insufficient
disentanglement between the two aspects. As a solution, we show that training
on monocular depth estimates with varying levels of detail provides control
over structure and content fidelity. Our model is trained jointly on images and
videos which also exposes explicit control of temporal consistency through a
novel guidance method. Our experiments demonstrate a wide variety of successes;
fine-grained control over output characteristics, customization based on a few
reference images, and a strong user preference towards results by our model. |
Presents a structure and content-guided video diffusion model that edits videos based on visual or textual descriptions, addressing limitations of prior methods relying on expensive retraining or error-prone propagation. |
Improves video editing by enabling intuitive content modification while retaining structure, addressing the challenge of balancing temporal consistency and spatial detail in existing video editing tools. |
Extends latent diffusion models to video generation using temporal layers in a pretrained image model, trained jointly on images and videos with conditioning on depth estimates (structure) and CLIP embeddings (content). |
Enables diverse video edits including style changes, environment modifications, and character replacements guided by text prompts or example images.
Provides control over temporal consistency, content fidelity, and structure adherence through novel guidance methods and variable depth blurring.
Demonstrates superior performance in user studies, with a strong preference for generated results compared to baseline methods. |
Reliance on depth maps as a structure representation limits the extent of content edits, particularly those involving significant changes in object shape.
Potential for misuse of generative models for harmful purposes requires further research on mitigating abuse. |
video editing, diffusion models, generative ai, text-to-video, content and structure |
2302.02908
Report |
LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Retrieval |
Ziyang Luo, Pu Zhao, Can Xu, Xiubo Geng, Tao Shen, Chongyang Tao, Jing Ma, Qingwen Lin, Daxin Jiang |
Image-text retrieval (ITR) is a task to retrieve the relevant images/texts,
given the query from another modality. The conventional dense retrieval
paradigm relies on encoding images and texts into dense representations using
dual-stream encoders; however, it faces challenges with low retrieval speed in
large-scale retrieval scenarios. In this work, we propose the lexicon-weighting
paradigm, where sparse representations in vocabulary space are learned for
images and texts to take advantage of the bag-of-words models and efficient
inverted indexes, resulting in significantly reduced retrieval latency. A
crucial gap arises from the continuous nature of image data, and the
requirement for a sparse vocabulary space representation. To bridge this gap,
we introduce a novel pre-training framework, Lexicon-Bottlenecked
Language-Image Pre-Training (LexLIP), that learns importance-aware lexicon
representations. This framework features lexicon-bottlenecked modules between
the dual-stream encoders and weakened text decoders, allowing for constructing
continuous bag-of-words bottlenecks to learn lexicon-importance distributions.
Upon pre-training with same-scale data, our LexLIP achieves state-of-the-art
performance on two benchmark ITR datasets, MSCOCO and Flickr30k. Furthermore,
in large-scale retrieval scenarios, LexLIP outperforms CLIP with a 5.5 ~ 221.3X
faster retrieval speed and 13.2 ~ 48.8X less index storage memory. |
This paper presents LexLIP, a lexicon-weighting paradigm for large-scale image-text retrieval that leverages sparse representations in vocabulary space for faster retrieval and reduced storage compared to conventional dense retrieval methods. |
Large-scale image-text retrieval faces challenges with low retrieval speed and high storage requirements when using dense retrieval methods, limiting their practicality in real-world applications. |
LexLIP introduces a lexicon-bottlenecked pre-training framework with dual-stream encoders, lexicon-bottlenecked modules, and weakened text decoders to learn importance-aware lexicon representations for images and texts. This enables efficient retrieval using bag-of-words models and inverted indexes. |
LexLIP achieves state-of-the-art performance on MSCOCO and Flickr30k image-text retrieval benchmarks with smaller pre-training datasets compared to previous methods.
In large-scale retrieval scenarios, LexLIP demonstrates significantly faster retrieval speed (5.5x-221.3x) and reduced index storage memory (13.2x-48.8x) compared to CLIP.
Ablation studies highlight the contribution of each component in the LexLIP framework, particularly the contrastive learning objectives for aligning image and text representations in the vocabulary space. |
The large-scale benchmark is established by expanding Flickr30k with 1M random pairs from Conceptual Caption 12M, which may not fully represent real-world large-scale retrieval scenarios.
Future work includes exploring alternative sparsification strategies and applying LexLIP to other cross-modal retrieval tasks beyond image-text retrieval. |
image-text retrieval, lexicon-weighting paradigm, lexicon-bottlenecked pre-training, large-scale retrieval, sparse representation |
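For the LexLIP entry above, retrieval over sparse vocabulary-space representations with an inverted index can be sketched as follows. The term weights, document identifiers, and function names are made up for illustration; in the paper the weights would come from the trained encoders, which are not modelled here.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each vocabulary term to the (doc_id, weight) pairs that contain it."""
    index = defaultdict(list)
    for doc_id, weights in docs.items():
        for term, weight in weights.items():
            if weight > 0:
                index[term].append((doc_id, weight))
    return index

def search(index, query, top_k=3):
    """Score documents by the dot product of sparse term weights.

    Only postings of terms that occur in the query are visited, which is
    what makes the lookup cheap compared with scanning dense vectors.
    """
    scores = defaultdict(float)
    for term, query_weight in query.items():
        for doc_id, doc_weight in index.get(term, []):
            scores[doc_id] += query_weight * doc_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Hypothetical lexicon-weighted representations of three images.
images = {
    "img_dog": {"dog": 1.2, "grass": 0.5},
    "img_cat": {"cat": 1.1, "sofa": 0.7},
    "img_park": {"grass": 0.9, "tree": 0.8, "dog": 0.2},
}
index = build_inverted_index(images)
results = search(index, {"dog": 1.0, "grass": 0.4})
print(results)
```

Documents that share no term with the query are never touched, so `img_cat` does not appear in the results at all.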
2302.02693
Report |
PatchDCT: Patch Refinement for High Quality Instance Segmentation |
Qinrou Wen, Jirui Yang, Xue Yang, Kewei Liang |
High-quality instance segmentation has shown emerging importance in computer
vision. Without any refinement, DCT-Mask directly generates high-resolution
masks by compressed vectors. To further refine masks obtained by compressed
vectors, we propose for the first time a compressed vector based multi-stage
refinement framework. However, the vanilla combination does not bring
significant gains, because changes in some elements of the DCT vector will
affect the prediction of the entire mask. Thus, we propose a simple and novel
method named PatchDCT, which separates the mask decoded from a DCT vector into
several patches and refines each patch by the designed classifier and
regressor. Specifically, the classifier is used to distinguish mixed patches
from all patches, and to correct previously mispredicted foreground and
background patches. In contrast, the regressor is used for DCT vector
prediction of mixed patches, further refining the segmentation quality at
boundary locations. Experiments show that our method achieves 2.0%,
3.2%, 4.5% AP and 3.4%, 5.3%, 7.0% Boundary AP improvements over Mask-RCNN on
COCO, LVIS, and Cityscapes, respectively. It also surpasses DCT-Mask by 0.7%,
1.1%, 1.3% AP and 0.9%, 1.7%, 4.2% Boundary AP on COCO, LVIS and Cityscapes.
Besides, the performance of PatchDCT is also competitive with other
state-of-the-art methods. |
Proposes PatchDCT, a compressed vector-based instance segmentation method that refines mask patches independently using patch DCT vectors, achieving high-quality masks with fine boundaries. |
Existing methods struggle to refine high-resolution instance masks due to limitations in low-resolution representations or difficulties in refining global DCT vectors. |
Divides masks into patches, classifies them into foreground, background, or mixed, and refines mixed patches with a regressor predicting short, informative DCT vectors. |
Achieves 2.0% AP and 3.4% Boundary AP improvement over Mask-RCNN on COCO.
Outperforms DCT-Mask by 0.7% AP and 0.9% Boundary AP on COCO.
Demonstrates competitive performance with state-of-the-art methods on COCO test-dev. |
May generate masks with holes in semantically ambiguous areas.
Future work includes improving classification and regression to better handle such areas, and exploring applications in other challenging domains like aerial images. |
instance segmentation, dct, multi-stage refinement, patching, boundary attention |
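For the PatchDCT entry above, the patching step can be illustrated as below: split a binary mask into patches, label each patch as foreground, background, or mixed, and compute a 2D DCT for a mixed patch. This is a sketch under assumptions: the patch size, the orthonormal DCT-II, and all function names are illustrative, and the learned classifier and regressor are replaced by direct computation on a ground-truth mask.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an (n, n) matrix."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0] /= np.sqrt(2.0)
    return m

def split_into_patches(mask, patch):
    """(H, W) mask -> (H/patch, W/patch, patch, patch) grid of patches."""
    h, w = mask.shape
    return mask.reshape(h // patch, patch, w // patch, patch).transpose(0, 2, 1, 3)

def classify_patch(p):
    if p.all():
        return "foreground"
    if not p.any():
        return "background"
    return "mixed"

def patch_dct(p):
    """2D DCT of one patch; only mixed patches would need this refinement."""
    c = dct_matrix(p.shape[0])
    return c @ p.astype(float) @ c.T

# Toy 16x16 mask: the object fills the top-left patch and half of the one below.
mask = np.zeros((16, 16), dtype=bool)
mask[0:12, 0:8] = True

patches = split_into_patches(mask, patch=8)
labels = [[classify_patch(patches[r, c]) for c in range(2)] for r in range(2)]
coeffs = patch_dct(patches[1, 0])
c8 = dct_matrix(8)
reconstructed = c8.T @ coeffs @ c8  # inverse transform of the orthonormal DCT
print(labels)
```

Pure foreground and background patches need no DCT vector at all, which is why refinement effort concentrates on the mixed patches along object boundaries.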
2302.02615
Report |
Rethinking Out-of-distribution (OOD) Detection: Masked Image Modeling is All You Need |
Jingyao Li, Pengguang Chen, Shaozuo Yu, Zexin He, Shu Liu, Jiaya Jia |
The core of out-of-distribution (OOD) detection is to learn the
in-distribution (ID) representation, which is distinguishable from OOD samples.
Previous work applied recognition-based methods to learn the ID features, which
tend to learn shortcuts instead of comprehensive representations. In this work,
we find surprisingly that simply using reconstruction-based methods could boost
the performance of OOD detection significantly. We deeply explore the main
contributors of OOD detection and find that reconstruction-based pretext tasks
have the potential to provide a generally applicable and efficacious prior,
which benefits the model in learning intrinsic data distributions of the ID
dataset. Specifically, we take Masked Image Modeling as a pretext task for our
OOD detection framework (MOOD). Without bells and whistles, MOOD outperforms
previous SOTA of one-class OOD detection by 5.7%, multi-class OOD detection by
3.0%, and near-distribution OOD detection by 2.1%. It even defeats the
10-shot-per-class outlier exposure OOD detection, although we do not include
any OOD samples for our detection. |
This paper proposes MOOD, a novel out-of-distribution (OOD) detection framework leveraging Masked Image Modeling (MIM) as a pretext task to learn intrinsic data distributions of in-distribution data. |
Existing recognition-based methods for OOD detection often learn shortcuts instead of comprehensive representations, limiting their effectiveness. This work shows that reconstruction-based methods like MIM significantly improve OOD detection by learning the real data distribution. |
The paper explores various factors influencing OOD detection, including pretext tasks (comparing MIM with contrastive learning), architectures (evaluating ViT, BiT, and MLP-Mixer), fine-tuning processes, and OOD detection metrics. It uses MIM pre-training on ImageNet-21k, followed by fine-tuning on the ID dataset, and employs Mahalanobis distance for OOD detection. |
MOOD outperforms SOTA on one-class OOD detection by 5.7%, achieving 94.9% AUROC.
On multi-class OOD detection, MOOD surpasses SOTA by 3.0%, reaching 97.6% AUROC.
For near-distribution OOD detection, MOOD achieves 98.3% AUROC, 2.1% higher than previous SOTA. |
The paper doesn't conduct experiments with intermediate fine-tuning on ImageNet-30 for one-class OOD detection, potentially limiting performance.
Future work could explore the effectiveness of other reconstruction-based pretext tasks for OOD detection. |
out-of-distribution detection, masked image modeling, vision transformer, self-supervised learning, anomaly detection |
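For the MOOD entry above, Mahalanobis-distance scoring on extracted features can be sketched as follows. The sketch assumes class-conditional Gaussians with a shared covariance, which is a common formulation rather than a confirmed detail of the paper, and uses random 2D points as stand-ins for features from a fine-tuned backbone.

```python
import numpy as np

def fit_class_gaussians(features, labels):
    """Per-class means and a shared precision matrix from ID training features."""
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    centered = features - means[np.searchsorted(classes, labels)]
    covariance = centered.T @ centered / len(features)
    return means, np.linalg.pinv(covariance)

def mahalanobis_ood_score(x, means, precision):
    """Squared Mahalanobis distance to the closest class mean (higher = more OOD)."""
    diffs = x[None, :] - means
    distances = np.einsum("cd,de,ce->c", diffs, precision, diffs)
    return distances.min()

# Toy in-distribution features: two well-separated classes.
rng = np.random.default_rng(0)
features = np.concatenate([
    rng.normal(0.0, 0.5, size=(100, 2)),
    rng.normal(5.0, 0.5, size=(100, 2)),
])
labels = np.array([0] * 100 + [1] * 100)

means, precision = fit_class_gaussians(features, labels)
id_score = mahalanobis_ood_score(np.array([0.1, -0.2]), means, precision)
ood_score = mahalanobis_ood_score(np.array([20.0, -20.0]), means, precision)
print(id_score, ood_score)
```

A threshold on this score separates ID from OOD inputs; no OOD samples are needed to fit the Gaussians.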
2302.02550
Report |
Domain Re-Modulation for Few-Shot Generative Domain Adaptation |
Yi Wu, Ziqiang Li, Chaoyue Wang, Heliang Zheng, Shanshan Zhao, Bin Li, Dacheng Tao |
In this study, we delve into the task of few-shot Generative Domain
Adaptation (GDA), which involves transferring a pre-trained generator from one
domain to a new domain using only a few reference images. Inspired by the way
human brains acquire knowledge in new domains, we present an innovative
generator structure called Domain Re-Modulation (DoRM). DoRM not only meets the
criteria of high quality, large synthesis diversity, and cross-domain
consistency, which were achieved by previous research in GDA, but also
incorporates memory and domain association, akin to how human brains operate.
Specifically, DoRM freezes the source generator and introduces new mapping and
affine modules (M&A modules) to capture the attributes of the target domain
during GDA. This process resembles the formation of new synapses in human
brains. Consequently, a linearly combinable domain shift occurs in the style
space. By incorporating multiple new M&A modules, the generator gains the
capability to perform high-fidelity multi-domain and hybrid-domain generation.
Moreover, to maintain cross-domain consistency more effectively, we introduce a
similarity-based structure loss. This loss aligns the auto-correlation map of
the target image with its corresponding auto-correlation map of the source
image during training. Through extensive experiments, we demonstrate the
superior performance of our DoRM and similarity-based structure loss in
few-shot GDA, both quantitatively and qualitatively. The code will be available
at https://github.com/wuyi2020/DoRM. |
This paper presents DoRM, a novel generator structure for few-shot Generative Domain Adaptation (GDA) inspired by the human brain's learning mechanism, achieving high-quality, diverse, and cross-domain consistent image synthesis while enabling memory and domain association. |
Few-shot GDA aims to transfer a pre-trained generator to a new domain using limited data. Existing methods struggle with multi-domain generation and synthesizing images in unseen hybrid domains, limitations addressed by DoRM. |
DoRM freezes the source generator and introduces new mapping and affine modules to capture target domain attributes, enabling domain shift in the style space. A similarity-based structure loss is introduced to enhance cross-domain consistency. |
DoRM outperforms state-of-the-art methods in 10-shot GDA across various domains, demonstrating superior quality, diversity, and cross-domain consistency.
DoRM enables efficient multi-domain generation with a single generator, significantly reducing storage requirements compared to methods requiring full generator updates.
DoRM excels in hybrid-domain generation, effectively integrating learned domains to synthesize images in unseen hybrid domains, a capability not well-addressed by previous works. |
The strength of domain shift in DoRM currently requires manual adjustment.
The domain association in DoRM can be further improved by incorporating a new M&A module and additional consistency loss. |
generative adversarial networks, domain adaptation, few-shot learning, image synthesis, domain association |
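The similarity-based structure loss described in the DoRM report above aligns auto-correlation maps of source and target images. The sketch below shows one plausible form, assuming cosine self-similarity over spatial positions of a feature map and an L1 comparison; the feature source, the distance, and the function names are assumptions of this sketch, not the authors' stated implementation.

```python
import numpy as np

def self_similarity(feat):
    """Cosine self-similarity map of a (C, H, W) feature map, shape (H*W, H*W)."""
    channels = feat.shape[0]
    flat = feat.reshape(channels, -1).T  # one row per spatial position
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    return flat @ flat.T

def structure_loss(src_feat, tgt_feat):
    """Mean absolute difference between the two self-similarity maps."""
    return float(np.abs(self_similarity(src_feat) - self_similarity(tgt_feat)).mean())

rng = np.random.default_rng(0)
src = rng.normal(size=(8, 4, 4))
rescaled = 3.0 * src                     # same spatial structure, different magnitude
unrelated = rng.normal(size=(8, 4, 4))   # different spatial structure

loss_same = structure_loss(src, rescaled)
loss_diff = structure_loss(src, unrelated)
```

Because the map depends only on relations between positions, a target image can change appearance while keeping the loss low, which is the property a cross-domain consistency term needs.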
2302.02503
Report |
Leaving Reality to Imagination: Robust Classification via Generated Datasets |
Hritik Bansal, Aditya Grover |
Recent research on robustness has revealed significant performance gaps
between neural image classifiers trained on datasets that are similar to the
test set, and those that are from a naturally shifted distribution, such as
sketches, paintings, and animations of the object categories observed during
training. Prior work focuses on reducing this gap by designing engineered
augmentations of training data or through unsupervised pretraining of a single
large model on massive in-the-wild training datasets scraped from the Internet.
However, the notion of a dataset is also undergoing a paradigm shift in recent
years. With drastic improvements in the quality, ease-of-use, and access to
modern generative models, generated data is pervading the web. In this light,
we study the question: How do these generated datasets influence the natural
robustness of image classifiers? We find that ImageNet classifiers trained on
real data augmented with generated data achieve higher accuracy and effective
robustness than standard training and popular augmentation strategies in the
presence of natural distribution shifts. We analyze various factors influencing
these results, including the choice of conditioning strategies and the amount
of generated data. Additionally, we find that the standard ImageNet classifiers
suffer a performance degradation of up to 20% on the generated data, indicating
their fragility at accurately classifying the objects under novel variations.
Lastly, we demonstrate that the image classifiers, which have been trained on
real data augmented with generated data from the base generative model, exhibit
greater resilience to natural distribution shifts compared to the classifiers
trained on real data augmented with generated data from the finetuned
generative model on the real data. The code, models, and datasets are available
at https://github.com/Hritikbansal/generative-robustness. |
This paper investigates the impact of augmenting real image datasets with synthetic data generated by modern text-to-image models, specifically Stable Diffusion, on the robustness of image classifiers to natural distribution shifts. |
Improving the robustness of image classifiers to natural variations is crucial for real-world applications like autonomous driving and medical diagnosis, where models are often deployed in environments different from their training data. |
The authors generate synthetic datasets conditioned on ImageNet class labels using various Stable Diffusion conditioning strategies (text prompts, real images, and their combination). They train classifiers on real data, generated data, and their mixtures, evaluating their performance on ImageNet-1K and its natural distribution shift variants (ImageNet-Sketch, ImageNet-R, ImageNet-V2, ObjectNet). |
Classifiers trained on a mixture of real and generated data achieve higher accuracy and effective robustness on natural distribution shift datasets compared to those trained solely on real data or with standard augmentation techniques.
Increasing the proportion of generated data in the training mix generally improves effective robustness but might come at the cost of accuracy on the original dataset.
Standard ImageNet classifiers show significant performance degradation (up to 20%) on the generated data, highlighting their fragility to novel variations of objects. |
The study primarily focuses on ImageNet-1K, and the generalizability of the findings to other datasets and domains requires further investigation.
The ethical implications of using generated data, particularly concerning bias amplification and privacy, need careful consideration. |
robustness, generative models, data augmentation, image classification, natural distribution shift |
2302.02412
Report |
Mixture of Diffusers for scene composition and high resolution image generation |
Álvaro Barbero Jiménez |
Diffusion methods have proven very effective at generating images
conditioned on a text prompt. However, although the quality of the
generated images is unprecedented, these methods seem to struggle when trying
to generate specific image compositions. In this paper we present Mixture of
Diffusers, an algorithm that builds over existing diffusion models to provide a
more detailed control over composition. By harmonizing several diffusion
processes acting on different regions of a canvas, it allows generating larger
images, where the location of each object and style is controlled by a separate
diffusion process. |
Presents Mixture of Diffusers, a method leveraging multiple diffusion models on a single canvas to achieve fine-grained composition control and generate high-resolution images with limited GPU memory. |
Existing text-conditioned diffusion models struggle to accurately represent complex image compositions and face limitations in generating high-resolution images due to memory constraints. |
The method combines multiple diffusion models, each operating on a specific canvas region with a unique text prompt and weight. Gaussian weights are employed to ensure smooth transitions between regions. It adapts to latent space models through an approximate pixel-to-latent region mapping and supports image conditioning for outpainting and iterative image generation. |
Demonstrates superior composition control compared to single diffusion models, accurately placing objects at user-specified locations.
Enables high-resolution image generation (up to 4K) on limited memory GPUs by dividing the image into smaller regions.
Successfully implements smooth style transitions across the image by varying text prompts for different regions. |
Current implementation limits diffusion models to rectangular regions.
Further exploration of free-form region masking and integration of inpainting techniques. |
diffusion models, image generation, image composition, high-resolution images, outpainting |
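The Gaussian-weighted harmonization of overlapping regions described in the Mixture of Diffusers report above can be sketched as a weighted average on a shared canvas. In this sketch the 2D arrays stand in for per-region predictions at a single denoising step; the region layout, `sigma_frac`, and the function names are hypothetical.

```python
import numpy as np

def gaussian_weights(h, w, sigma_frac=0.3):
    """Separable Gaussian weight map that peaks at the center of an (h, w) region."""
    ys = np.arange(h) - (h - 1) / 2
    xs = np.arange(w) - (w - 1) / 2
    gy = np.exp(-0.5 * (ys / (sigma_frac * h)) ** 2)
    gx = np.exp(-0.5 * (xs / (sigma_frac * w)) ** 2)
    return np.outer(gy, gx)

def blend_regions(canvas_shape, regions):
    """Blend per-region predictions; regions is a list of (top, left, prediction)."""
    total = np.zeros(canvas_shape)
    weight_sum = np.zeros(canvas_shape)
    for top, left, pred in regions:
        h, w = pred.shape
        wts = gaussian_weights(h, w)
        total[top:top + h, left:left + w] += wts * pred
        weight_sum[top:top + h, left:left + w] += wts
    # Normalize so each pixel is a convex combination of the regions covering it.
    return total / np.maximum(weight_sum, 1e-12)

left_pred = np.zeros((8, 8))   # stand-in prediction of the left region
right_pred = np.ones((8, 8))   # stand-in prediction of the right region
canvas = blend_regions((8, 12), [(0, 0, left_pred), (0, 4, right_pred)])
```

In the overlap (columns 4 to 7) the result moves gradually from the left region's value to the right region's value, which is what avoids visible seams.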
2302.02398
Report |
Diffusion Model for Generative Image Denoising |
Yutong Xie, Minne Yuan, Bin Dong, Quanzheng Li |
In supervised learning for image denoising, usually the paired clean images
and noisy images are collected or synthesised to train a denoising model. L2
norm loss or other distance functions are used as the objective function for
training. This often leads to over-smoothed results with fewer image details. In
this paper, we regard the denoising task as a problem of estimating the
posterior distribution of clean images conditioned on noisy images. We apply
the idea of diffusion model to realize generative image denoising. According to
the noise model in denoising tasks, we redefine the diffusion process such that
it is different from the original one. Hence, the sampling of the posterior
distribution is a reverse process of dozens of steps from the noisy image. We
consider three types of noise models: Gaussian, Gamma, and Poisson noise. With
theoretical guarantees, we derive a unified strategy for model training. Our
method is verified through experiments on three types of noise models and
achieves excellent performance. |
This paper proposes a novel diffusion model specifically designed for generative image denoising, diverging from traditional supervised methods. |
Traditional supervised denoising methods, relying on L2 norm loss, often result in over-smoothed images, lacking fine details. This new method aims to estimate the posterior distribution of clean images given noisy images, leading to more realistic and detailed denoising results. |
The proposed diffusion model defines the diffusion process based on the specific noise model of the image (Gaussian, Gamma, or Poisson), allowing the reverse process to start directly from the noisy image. The training strategy employs a unified approach by minimizing the KL divergence, which is further simplified to minimizing L2 norm loss for all three noise models. |
The method generates visually pleasing denoised images with finer details compared to traditional supervised learning.
Quantitative metrics (PSNR, SSIM) demonstrate comparable performance between the average of generated samples and supervised learning.
The method effectively estimates the posterior distribution of clean images, even with fewer diffusion steps. |
There exists a gap in quantitative metrics between individual generated samples and supervised learning results, indicating potential for improvement.
Future work aims to explore the method’s applicability to other noise models and diverse datasets. |
image denoising, diffusion model, generative model, posterior distribution, noise model |
2302.02373
Report |
ShiftDDPMs: Exploring Conditional Diffusion Models by Shifting Diffusion Trajectories |
Zijian Zhang, Zhou Zhao, Jun Yu, Qi Tian |
Diffusion models have recently exhibited remarkable abilities to synthesize
striking image samples since the introduction of denoising diffusion
probabilistic models (DDPMs). Their key idea is to disrupt images into noise
through a fixed forward process and learn its reverse process to generate
samples from noise in a denoising way. For conditional DDPMs, most existing
practices relate conditions only to the reverse process and fit it to the
reversal of the unconditional forward process. We find that this confines
condition modeling and generation to a small time window. In this paper, we
propose a novel and flexible conditional diffusion model by introducing
conditions into the forward process. We utilize extra latent space to allocate
an exclusive diffusion trajectory for each condition based on some shifting
rules, which will disperse condition modeling to all timesteps and improve the
learning capacity of the model. We formulate our method, which we call
ShiftDDPMs, and provide a unified point of view on existing related
methods. Extensive qualitative and quantitative experiments on image synthesis
demonstrate the feasibility and effectiveness of ShiftDDPMs. |
The paper proposes ShiftDDPMs, a novel conditional diffusion model that introduces conditions into the forward process by shifting diffusion trajectories in latent space according to conditions. |
Existing conditional DDPM methods relate conditions only to the reverse process, limiting condition modeling to a small time window. Shifting diffusion trajectories disperses condition modeling across all timesteps, potentially improving model learning capacity. |
The method utilizes a shift coefficient schedule and a shift predictor to control the mean shift of diffusion trajectories. Different shift modes, such as Prior-Shift, Data-Normalization, and Quadratic-Shift, are explored with fixed and trainable shift predictors. |
ShiftDDPMs effectively perform conditional image synthesis, as demonstrated on MNIST and CIFAR-10 datasets.
Both Prior-Shift and Quadratic-Shift outperform traditional conditional DDPMs, showing improved learning capacity.
ShiftDDPMs successfully interpolate between different conditions and achieve competitive results on image inpainting and text-to-image synthesis. |
The choice of shift coefficient schedule (k_t) is flexible but lacks extensive empirical investigation.
The paper primarily focuses on image synthesis, leaving exploration of other data modalities for future work. |
diffusion models, conditional image synthesis, generative models, deep learning, shiftddpms |
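The idea in the ShiftDDPMs report above, a condition-dependent mean shift scaled by a coefficient schedule k_t in the forward process, can be sketched as follows. The standard DDPM forward mean sqrt(alpha_bar_t) * x0 is kept, and a term k_t * E(c) is added. The linear beta schedule, the choice k_t = 1 - sqrt(alpha_bar_t), and the constant vector standing in for the shift predictor output E(c) are assumptions of this sketch.

```python
import numpy as np

def shifted_forward_sample(x0, cond_shift, alpha_bar_t, k_t, rng):
    """Sample x_t from a forward step whose mean is shifted by k_t * cond_shift."""
    eps = rng.normal(size=x0.shape)
    mean = np.sqrt(alpha_bar_t) * x0 + k_t * cond_shift
    return mean + np.sqrt(1.0 - alpha_bar_t) * eps

# Standard linear beta schedule (assumed here for illustration).
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
# Assumed shift schedule: grows from ~0 to ~1, so the shift dominates at late timesteps.
k = 1.0 - np.sqrt(alpha_bar)

rng = np.random.default_rng(0)
x0 = np.full(10000, 2.0)          # toy "image" with constant value 2
cond_shift = np.full(10000, 5.0)  # stand-in for the shift predictor output E(c)

x_early = shifted_forward_sample(x0, cond_shift, alpha_bar[0], k[0], rng)
x_late = shifted_forward_sample(x0, cond_shift, alpha_bar[-1], k[-1], rng)
```

With this schedule, early timesteps stay near the data while the final timestep is centered on the condition-specific shift, so each condition gets its own trajectory across all timesteps.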
2302.02284
Report |
Design Booster: A Text-Guided Diffusion Model for Image Translation with Spatial Layout Preservation |
Shiqi Sun, Shancheng Fang, Qian He, Wei Liu |
Diffusion models are able to generate photorealistic images in arbitrary
scenes. However, when applying diffusion models to image translation, there
exists a trade-off between maintaining spatial structure and high-quality
content. Besides, existing methods are mainly based on test-time optimization
or fine-tuning the model for each input image, which are extremely time-consuming
for practical applications. To address these issues, we propose a new approach
for flexible image translation by learning a layout-aware image condition
together with a text condition. Specifically, our method co-encodes images and
text into a new domain during the training phase. In the inference stage, we
can choose images/text or both as the conditions for each time step, which
gives users more flexible control over layout and content. Experimental
comparisons of our method with state-of-the-art methods demonstrate our model
performs best in both style image translation and semantic image translation
and takes the shortest time. |
Proposes Design Booster, a novel diffusion-based method for flexible image translation that balances text descriptions with input image layout preservation. |
Addresses limitations of existing diffusion models for image translation, which often struggle to maintain spatial structure while achieving high-quality content generation. |
Introduces a jointly trained encoder to extract spatial information from input images and employs a flexible sampling strategy with multi-condition control (text and/or image) at each denoising step. |
Achieves superior performance in both style and semantic image translation compared to state-of-the-art methods.
Preserves spatial layout of input images while enabling text-guided modifications to style and content.
Offers fast inference speed, making it suitable for practical applications. |
Exhibits slightly weaker ability to change color under strong layout-preserving parameters.
Future work could explore more complex and adaptive strategies for condition injection during sampling. |
image translation, diffusion models, layout preservation, text-guided synthesis, multi-condition control |
2302.02272
Report |
Divide and Compose with Score Based Generative Models |
Sandesh Ghimire, Armand Comas, Davin Hill, Aria Masoomi, Octavia Camps, Jennifer Dy |
While score based generative models, or diffusion models, have found success
in image synthesis, they are often coupled with text data or image labels to be
able to manipulate and conditionally generate images. Even though manipulation
of images by changing the text prompt is possible, our understanding of the
text embedding and our ability to modify it to edit images is quite limited.
Towards the direction of having more control over image manipulation and
conditional generation, we propose to learn image components in an unsupervised
manner so that we can compose those components to generate and manipulate
images in an informed manner. Taking inspiration from energy based models, we
interpret different score components as the gradient of different energy
functions. We show how score based learning allows us to learn interesting
components and we can visualize them through generation. We also show how this
novel decomposition allows us to compose, generate and modify images in
interesting ways akin to dreaming. We make our code available at
https://github.com/sandeshgh/Score-based-disentanglement |
This paper proposes a novel method for decomposing images into interpretable score components within a score-based generative model framework, enabling controlled image manipulation and generation. |
Existing conditional score-based generative models lack interpretability and control over image generation, making targeted manipulation challenging. |
The authors leverage the connection between score functions and energy-based models, decomposing the score function into multiple components representing different energy functions. They train an autoencoder that learns to encode images into latent vectors representing these components, which can be individually manipulated to generate diverse and controlled variations. |
Score component decomposition allows for reconstruction with natural variations.
Visualizing generated samples from individual components reveals their ability to capture distinct image attributes, like shape, color, or texture.
Manipulating score components by interpolation with unconditional score functions enables controlled image editing, preserving certain features while varying others. |
The current method's ability to manipulate images is limited by the number of score components.
Future work could explore scaling the approach to a higher number of components and guiding them toward more human-interpretable representations. |
score-based generative models, diffusion models, image manipulation, disentanglement, energy-based models |
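The interpretation of score components as gradients of separate energy functions, described in the report above, implies that summing scores corresponds to multiplying the underlying densities. A toy one-dimensional sketch with two Gaussian components illustrates this; the Gaussian components, the unadjusted Langevin sampler, and all parameter values are assumptions of this sketch and not the paper's learned components.

```python
import numpy as np

def gaussian_score(x, mu, sigma):
    """Score of N(mu, sigma^2): the negative gradient of its energy."""
    return -(x - mu) / sigma**2

def composed_score(x, components):
    """Sum of component scores, i.e. the score of the product of the densities."""
    return sum(gaussian_score(x, mu, sigma) for mu, sigma in components)

def langevin_sample(components, steps=500, step_size=0.01, n=5000, seed=0):
    """Unadjusted Langevin dynamics driven by the composed score."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    for _ in range(steps):
        noise = rng.normal(size=n)
        x = x + step_size * composed_score(x, components) + np.sqrt(2 * step_size) * noise
    return x

# Two unit-variance components centered at -1 and 3.
components = [(-1.0, 1.0), (3.0, 1.0)]
samples = langevin_sample(components)
```

The product of these two Gaussians has mean 1 and variance 0.5, and the samples drawn with the summed score concentrate there, up to discretization error.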
2302.02234
Report |
Revisiting Image Deblurring with an Efficient ConvNet |
Lingyan Ruan, Mojtaba Bemana, Hans-Peter Seidel, Karol Myszkowski, Bin Chen |
Image deblurring aims to recover the latent sharp image from its blurry
counterpart and has a wide range of applications in computer vision.
Convolutional Neural Networks (CNNs) have performed well in this domain for many
years, but recently an alternative network architecture, the
Transformer, has demonstrated even stronger performance. One can attribute its
superiority to the multi-head self-attention (MHSA) mechanism, which offers a
larger receptive field and better input content adaptability than CNNs.
However, as MHSA demands high computational costs that grow quadratically with
respect to the input resolution, it becomes impractical for high-resolution
image deblurring tasks. In this work, we propose a unified lightweight CNN
network that features a large effective receptive field (ERF) and demonstrates
comparable or even better performance than Transformers at a lower
computational cost. Our key design is an efficient CNN block dubbed LaKD,
equipped with a large kernel depth-wise convolution and spatial-channel mixing
structure, attaining comparable or larger ERF than Transformers but with a
smaller parameter scale. Specifically, we achieve +0.17dB / +0.43dB PSNR over
the state-of-the-art Restormer on defocus / motion deblurring benchmark
datasets with 32% fewer parameters and 39% fewer MACs. Extensive experiments
demonstrate the superior performance of our network and the effectiveness of
each module. Furthermore, we propose a compact and intuitive ERFMeter metric
that quantitatively characterizes ERF, and shows a high correlation to the
network performance. We hope this work can inspire the research community to
further explore the pros and cons of CNN and Transformer architectures beyond
image deblurring tasks. |
This paper proposes LaKDNet, a lightweight CNN for image deblurring that achieves comparable or better performance than Transformer-based methods while being more computationally efficient. |
Image deblurring is crucial for various computer vision tasks, but Transformer-based methods, while effective, are computationally expensive, especially for high-resolution images. This work explores the potential of efficient CNNs for this task. |
The authors propose the LaKD block, featuring large kernel depth-wise convolution and spatial-channel mixing to achieve a large effective receptive field (ERF) with low computational cost. They integrate this block into a U-Net architecture. Additionally, they introduce ERFMeter, a metric to quantify ERF and correlate it with network performance. |
LaKDNet achieves state-of-the-art results on defocus deblurring benchmarks, outperforming Restormer with 32% fewer parameters and 39% fewer MACs.
For motion deblurring, LaKDNet shows competitive performance, exceeding Uformer and Restormer on the GoPro dataset by up to +0.43dB PSNR while using significantly fewer computational resources.
ERFMeter demonstrates a strong correlation (Pearson correlation coefficient r=0.8) with network performance, suggesting its potential for guiding network design. |
The network's generalization ability from synthetic to real blur is slightly weaker than Transformer-based methods, suggesting room for improvement.
The ERFMeter metric primarily focuses on ERF and might not capture the impact of other factors contributing to network performance. |
image deblurring, convolutional neural networks, effective receptive field, lightweight model, erfmeter |
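The report above attributes LaKD's efficiency to large-kernel depth-wise convolution combined with channel mixing. A quick parameter count shows why this pairing is cheap compared with a dense large-kernel convolution. The channel count of 64 and the 9x9 kernel are hypothetical values chosen for illustration, not figures from the paper.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Number of weights in a 2D convolution without bias."""
    return (c_in // groups) * c_out * k * k

channels, kernel = 64, 9  # hypothetical sizes for illustration

dense = conv_params(channels, channels, kernel)                       # full 9x9 convolution
depthwise = conv_params(channels, channels, kernel, groups=channels)  # one 9x9 filter per channel
pointwise = conv_params(channels, channels, 1)                        # 1x1 channel mixing

separable_total = depthwise + pointwise
reduction = dense / separable_total
```

Under these assumed sizes the depth-wise plus point-wise pair needs 9,280 weights against 331,776 for the dense layer, while still covering the same 9x9 spatial extent.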
2302.02181
Report |
Model Stitching and Visualization How GAN Generators can Invert Networks in Real-Time |
Rudolf Herdt, Maximilian Schmidt, Daniel Otero Baguer, Jean Le'Clerc Arrastia, Peter Maass |
In this work, we propose a fast and accurate method to reconstruct
activations of classification and semantic segmentation networks by stitching
them with a GAN generator utilizing a 1x1 convolution. We test our approach on
images of animals from the AFHQ wild dataset, ImageNet1K, and real-world
digital pathology scans of stained tissue samples. Our results show comparable
performance to established gradient descent methods but with a processing time
that is two orders of magnitude faster, making this approach promising for
practical applications. |
This paper presents a fast and accurate method to reconstruct the activations of deep neural networks used for classification and semantic segmentation. This method works by stitching the feature extractor network with a pretrained GAN generator using a 1x1 convolution. |
Reconstructing activations of deep networks is important for understanding the internal representations learned by these models and can aid in tasks like image generation and manipulation. Existing methods, such as gradient descent, are accurate but computationally expensive. This work offers a faster alternative with comparable accuracy. |
The method trains a 1x1 convolution layer to map the activations from a hidden layer of the feature extractor to a hidden layer of a pretrained GAN generator. During inference, this mapping enables the reconstruction of activations by propagating them through the stitched GAN, acting as a decoder. |
The GAN-based reconstruction method achieves comparable accuracy to gradient descent methods in terms of cosine similarity and L1 loss when evaluated on AFHQ wild, ImageNet1K, and digital pathology datasets.
The proposed method is significantly faster than gradient descent, achieving a speedup of two orders of magnitude, making it suitable for real-time applications.
The study suggests that the features learned by the GAN generator are compatible with the features learned by the feature extractor, even if they are trained independently. |
The method's performance depends on the ability of the GAN generator to understand the concepts learned by the feature extractor. If the GAN is not trained on similar data or concepts, the reconstruction might be inaccurate.
The use of class-conditional GANs for reconstruction introduces challenges when stitching into deeper layers due to the reliance on class conditioning information. |
computer vision, deep learning, gan, network inversion, activation reconstruction |
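The stitching layer in the report above is a 1x1 convolution, which is simply the same linear map over channels applied at every spatial position. A minimal sketch, with hypothetical channel counts and random weights standing in for the trained layer:

```python
import numpy as np

def conv1x1(activations, weight, bias):
    """Apply a 1x1 convolution: (C_in, H, W) -> (C_out, H, W)."""
    return np.einsum("oc,chw->ohw", weight, activations) + bias[:, None, None]

rng = np.random.default_rng(0)
acts = rng.normal(size=(16, 7, 7))   # hidden activations of the analysed network (toy sizes)
weight = rng.normal(size=(32, 16))   # random stand-in for the trained stitching weights
bias = np.zeros(32)

# Result has the channel count expected by the GAN generator's hidden layer.
stitched = conv1x1(acts, weight, bias)
```

Because the map has no spatial extent, it can only translate between the two feature spaces position by position; the spatial layout of the activations is passed to the generator unchanged.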
2302.02057
Report |
Semantic Diffusion Network for Semantic Segmentation |
Haoru Tan, Sitong Wu, Jimin Pi |
Precise and accurate predictions over boundary areas are essential for
semantic segmentation. However, the commonly-used convolutional operators tend
to smooth and blur local detail cues, making it difficult for deep models to
generate accurate boundary predictions. In this paper, we introduce an
operator-level approach to enhance semantic boundary awareness, so as to
improve the prediction of the deep semantic segmentation model. Specifically,
we first formulate the boundary feature enhancement as an anisotropic diffusion
process. We then propose a novel learnable approach called semantic diffusion
network (SDN) to approximate the diffusion process, which contains a
parameterized semantic difference convolution operator followed by a feature
fusion module. Our SDN aims to construct a differentiable mapping from the
original feature to the inter-class boundary-enhanced feature. The proposed SDN
is an efficient and flexible module that can be easily plugged into existing
encoder-decoder segmentation models. Extensive experiments show that our
approach can achieve consistent improvements over several typical and
state-of-the-art segmentation baseline models on challenging public benchmarks.
The code will be released soon. |
This paper introduces Semantic Diffusion Network (SDN), an operator-level approach to enhance semantic boundary awareness in semantic segmentation models by approximating an anisotropic diffusion process. |
Existing convolutional operators tend to smooth and blur local details, hindering accurate boundary prediction in semantic segmentation. SDN addresses this limitation by enhancing inter-class boundary features. |
SDN utilizes a learnable approach comprising a parameterized semantic difference convolution operator and a feature fusion module. The semantic difference convolution leverages semantic guidance features to enhance inter-class boundaries while suppressing intra-class ones. |
SDN achieves consistent mIoU improvements across various baseline models and datasets (ADE20K and Cityscapes).
It significantly improves boundary quality, as demonstrated by higher F-scores in boundary regions.
SDN exhibits good compatibility with other boundary-promoting methods, further enhancing segmentation performance. |
The paper only validates SDN's effectiveness on semantic segmentation; further exploration in other visual tasks is needed.
While SDN has positive applications, its potential use in inhumane surveillance needs careful consideration and regulation. |
semantic segmentation, boundary awareness, deep learning, anisotropic diffusion, convolutional neural networks |
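The report above frames boundary enhancement as an anisotropic diffusion process. For intuition, the sketch below implements one step of classic Perona-Malik diffusion on a 2D array; it is the textbook hand-crafted process, not the learned SDN operator, and the values of `kappa` and `step` are assumptions chosen for this toy example.

```python
import numpy as np

def conductance(diff, kappa):
    """Edge-stopping function: near 1 for small differences, near 0 across strong edges."""
    return np.exp(-(diff / kappa) ** 2)

def perona_malik_step(img, kappa=0.1, step=0.2):
    """One explicit Perona-Malik update using the four nearest neighbours."""
    padded = np.pad(img, 1, mode="edge")  # replicate borders so no flux leaves the image
    diffs = (
        padded[:-2, 1:-1] - img,  # north neighbour
        padded[2:, 1:-1] - img,   # south neighbour
        padded[1:-1, :-2] - img,  # west neighbour
        padded[1:-1, 2:] - img,   # east neighbour
    )
    return img + step * sum(conductance(d, kappa) * d for d in diffs)

# Toy input: a vertical step edge with mild noise.
rng = np.random.default_rng(0)
img = np.zeros((16, 16))
img[:, 8:] = 1.0
noisy = img + 0.02 * rng.normal(size=img.shape)

smoothed = noisy.copy()
for _ in range(20):
    smoothed = perona_malik_step(smoothed)
```

Noise inside each flat region is averaged away, while the large difference across the step edge drives the conductance to nearly zero and leaves the boundary sharp.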
2302.01872
Report |
MOSE: A New Dataset for Video Object Segmentation in Complex Scenes |
Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Philip H. S. Torr, Song Bai |
Video object segmentation (VOS) aims at segmenting a particular object
throughout the entire video clip sequence. The state-of-the-art VOS methods
have achieved excellent performance (e.g., 90+% J&F) on existing datasets.
However, since the target objects in these existing datasets are usually
relatively salient, dominant, and isolated, VOS under complex scenes has rarely
been studied. To revisit VOS and make it more applicable in the real world, we
collect a new VOS dataset called coMplex video Object SEgmentation (MOSE) to
study tracking and segmenting objects in complex environments. MOSE
contains 2,149 video clips and 5,200 objects from 36 categories, with 431,725
high-quality object segmentation masks. The most notable feature of MOSE
dataset is complex scenes with crowded and occluded objects. The target objects
in the videos are commonly occluded by others and disappear in some frames. To
analyze the proposed MOSE dataset, we benchmark 18 existing VOS methods under 4
different settings on the proposed MOSE dataset and conduct comprehensive
comparisons. The experiments show that current VOS algorithms cannot
perceive objects well in complex scenes. For example, under the semi-supervised VOS
setting, the highest J&F by existing state-of-the-art VOS methods is only 59.4%
on MOSE, much lower than their ~90% J&F performance on DAVIS. The results
reveal that although excellent performance has been achieved on existing
benchmarks, there are unresolved challenges under complex scenes and more
efforts are desired to explore these challenges in the future. The proposed
MOSE dataset has been released at https://henghuiding.github.io/MOSE. |
This paper introduces a new large-scale video object segmentation benchmark dataset called MOSE, specifically designed to study object tracking and segmentation in complex environments. |
Existing video object segmentation datasets often feature salient and isolated objects, while real-world scenarios frequently involve complex and occluded scenes. MOSE aims to bridge this gap and promote the development of more comprehensive and robust video object segmentation algorithms. |
The authors collected 2,149 high-resolution videos featuring crowded and occluded objects, many of which disappear and reappear throughout the video. They annotated these videos with 431,725 high-quality segmentation masks. The dataset was then used to benchmark 18 existing video object segmentation methods under 4 different settings. |
Existing video object segmentation algorithms perform significantly worse on MOSE compared to previous benchmark datasets, highlighting the difficulty of complex scenes.
The highest J&F achieved by current state-of-the-art methods under the semi-supervised setting is only 59.4% on MOSE, significantly lower than their ~90% performance on datasets like DAVIS.
Heavy occlusions, crowds, small object size, and object disappearance/reappearance pose significant challenges to existing methods. |
MOSE currently focuses on object categories common in existing image segmentation datasets, potentially limiting its generalizability.
Future work could explore incorporating more diverse object categories and even more challenging scenarios, such as extreme lighting conditions or fast camera movements. |
video object segmentation, dataset, complex scenes, occlusion, benchmark |
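The J&F score quoted in this report averages two per-object measures: region similarity J (the Jaccard index, or IoU, between predicted and ground-truth masks) and boundary accuracy F. A minimal sketch of how J and the combined score are computed is given below. The function names and the tiny masks are illustrative assumptions, not the MOSE evaluation code, and F is taken as a given number because the boundary measure is not reproduced here.

```python
def region_similarity(pred, gt):
    """Jaccard index (IoU) between two binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j_score, f_score):
    """J&F is the mean of region similarity J and boundary accuracy F."""
    return 0.5 * (j_score + f_score)


# Toy 2x3 masks (hypothetical data): 2 pixels overlap, 4 pixels in the union.
pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
j = region_similarity(pred, gt)
```

In a benchmark, these per-frame values would be averaged over all objects and frames before being reported.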
2302.01721
Report |
TEXTure: Text-Guided Texturing of 3D Shapes |
Elad Richardson, Gal Metzer, Yuval Alaluf, Raja Giryes, Daniel Cohen-Or |
In this paper, we present TEXTure, a novel method for text-guided generation,
editing, and transfer of textures for 3D shapes. Leveraging a pretrained
depth-to-image diffusion model, TEXTure applies an iterative scheme that paints
a 3D model from different viewpoints. Yet, while depth-to-image models can
create plausible textures from a single viewpoint, the stochastic nature of the
generation process can cause many inconsistencies when texturing an entire 3D
object. To tackle these problems, we dynamically define a trimap partitioning
of the rendered image into three progression states, and present a novel
elaborated diffusion sampling process that uses this trimap representation to
generate seamless textures from different views. We then show that one can
transfer the generated texture maps to new 3D geometries without requiring
explicit surface-to-surface mapping, as well as extract semantic textures from
a set of images without requiring any explicit reconstruction. Finally, we show
that TEXTure can be used to not only generate new textures but also edit and
refine existing textures using either a text prompt or user-provided scribbles.
We demonstrate that our TEXTuring method excels at generating, transferring,
and editing textures through extensive evaluation, and further close the gap
between 2D image generation and 3D texturing. |
This paper presents TEXTure, a novel method for text-guided generation, editing, and transfer of textures for 3D shapes that leverages a pretrained depth-to-image diffusion model. |
Addresses the limitations of previous 3D texturing methods by providing a fast and efficient approach for generating high-quality and consistent textures on 3D models. |
An iterative painting scheme that renders the object from different viewpoints, applies a depth-based painting using a modified diffusion model, and projects the result back to the mesh vertices or atlas. |
Generates high-quality, realistic textures consistent on both local and global scales.
Enables texture transfer from both painted meshes and sets of images to new, untextured meshes.
Supports texture editing through both text prompts and user-provided scribbles. |
Potential inconsistencies on a global scale due to occlusions.
Dependence on fixed viewpoints for painting, which may not be optimal for all geometries. |
text-guided synthesis, 3d texturing, diffusion models, texture transfer, texture editing |
2302.01579
Report |
Semantic 3D-aware Portrait Synthesis and Manipulation Based on Compositional Neural Radiance Field |
Tianxiang Ma, Bingchuan Li, Qian He, Jing Dong, Tieniu Tan |
Recently 3D-aware GAN methods with neural radiance field have developed
rapidly. However, current methods model the whole image as an overall neural
radiance field, which limits the partial semantic editability of synthetic
results. Since NeRF renders an image pixel by pixel, it is possible to split
NeRF in the spatial dimension. We propose a Compositional Neural Radiance Field
(CNeRF) for semantic 3D-aware portrait synthesis and manipulation. CNeRF
divides the image by semantic regions and learns an independent neural radiance
field for each region, and finally fuses them and renders the complete image.
Thus we can manipulate the synthesized semantic regions independently, while
fixing the other parts unchanged. Furthermore, CNeRF is also designed to
decouple shape and texture within each semantic region. Compared to
state-of-the-art 3D-aware GAN methods, our approach enables fine-grained
semantic region manipulation, while maintaining high-quality 3D-consistent
synthesis. The ablation studies show the effectiveness of the structure and
loss function used by our method. In addition, real image inversion and cartoon
portrait 3D editing experiments demonstrate the application potential of our
method. |
This paper introduces CNeRF, the first compositional neural radiance field for semantic 3D-aware portrait synthesis and manipulation. |
Current 3D-aware GAN methods with neural radiance fields lack semantic editability as they model the entire image as a single unit. CNeRF addresses this limitation. |
CNeRF divides the image into semantic regions and learns independent neural radiance fields for each region. It then fuses these fields to render the complete image. This method allows for individual manipulation of semantic regions using latent codes. |
CNeRF achieves high-quality 3D-consistent portrait synthesis comparable to state-of-the-art methods.
The proposed method allows for fine-grained semantic region manipulation in generated portraits.
CNeRF successfully decouples shape and texture within each semantic region, allowing for independent control over each attribute. |
Further improvements in 3D reconstruction quality are possible.
Future work will explore combining CNeRF with advanced 3D-aware GANs like EG3D. |
generative adversarial networks, neural radiance fields, 3d-aware image synthesis, semantic manipulation, compositional rendering |
2302.01532
Report |
INV: Towards Streaming Incremental Neural Videos |
Shengze Wang, Alexey Supikov, Joshua Ratcliff, Henry Fuchs, Ronald Azuma |
Recent works in spatiotemporal radiance fields can produce photorealistic
free-viewpoint videos. However, they are inherently unsuitable for interactive
streaming scenarios (e.g., video conferencing, telepresence) because they have an
inevitable lag even if the training is instantaneous. This is because these
approaches consume videos and thus have to buffer chunks of frames (often
seconds) before processing. In this work, we take a step towards interactive
streaming via a frame-by-frame approach naturally free of lag. Conventional
wisdom holds that per-frame NeRFs are impractical due to prohibitive
training costs and storage. We break this belief by introducing Incremental
Neural Videos (INV), a per-frame NeRF that is efficiently trained and
streamable. We designed INV based on two insights: (1) Our main finding is that
MLPs naturally partition themselves into Structure and Color Layers, which
store structural and color/texture information respectively. (2) We leverage
this property to retain and improve upon knowledge from previous frames, thus
amortizing training across frames and reducing redundant learning. As a result,
with negligible changes to NeRF, INV can achieve good quality (>28.6 dB) in
8 min/frame. It can also outperform prior SOTA in 19% less training time.
Additionally, our Temporal Weight Compression reduces the per-frame size to
0.3MB/frame (6.6% of NeRF). More importantly, INV is free from buffer lag and
is naturally fit for streaming. While this work does not achieve real-time
training, it shows that incremental approaches like INV present new
possibilities in interactive 3D streaming. Moreover, our discovery of natural
information partition leads to a better understanding and manipulation of MLPs.
Code and dataset will be released soon. |
This paper introduces Incremental Neural Videos (INV), a per-frame neural radiance field representation for efficient streaming of dynamic 3D scenes. |
Existing spatiotemporal radiance fields suffer from buffer lag, making them unsuitable for interactive streaming applications like telepresence. |
The authors leverage the discovery that MLPs naturally partition into Structure and Color Layers, enabling them to design INV which stores per-frame structure and a shared color representation. |
INV achieves state-of-the-art per-frame quality with less training than previous methods.
Temporal Weight Compression reduces the per-frame size to a streamable 0.3MB.
The paper provides evidence for the natural partitioning of information within MLPs, leading to a better understanding of these models. |
Visual stability is limited, especially for short training times.
Future work includes achieving real-time training and handling large scene changes. |
neural radiance fields, 3d video streaming, incremental learning, mlp, temporal weight compression |
2302.01384
Report |
Energy-Inspired Self-Supervised Pretraining for Vision Models |
Ze Wang, Jiang Wang, Zicheng Liu, Qiang Qiu |
Motivated by the fact that forward and backward passes of a deep network
naturally form symmetric mappings between input and output representations, we
introduce a simple yet effective self-supervised vision model pretraining
framework inspired by energy-based models (EBMs). In the proposed framework, we
model energy estimation and data restoration as the forward and backward passes
of a single network without any auxiliary components, e.g., an extra decoder.
For the forward pass, we fit a network to an energy function that assigns low
energy scores to samples that belong to an unlabeled dataset, and high energy
otherwise. For the backward pass, we restore data from corrupted versions
iteratively using gradient-based optimization along the direction of energy
minimization. In this way, we naturally fold the encoder-decoder architecture
widely used in masked image modeling into the forward and backward passes of a
single vision model. Thus, our framework now accepts a wide range of pretext
tasks with different data corruption methods, and permits models to be
pretrained from masked image modeling, patch sorting, and image restoration,
including super-resolution, denoising, and colorization. We support our
findings with extensive experiments, and show the proposed method delivers
comparable and even better performance with remarkably fewer epochs of training
compared to the state-of-the-art self-supervised vision model pretraining
methods. Our findings shed light on further exploring self-supervised vision
model pretraining and pretext tasks beyond masked image modeling. |
This paper proposes a simple yet effective self-supervised vision model pretraining framework inspired by energy-based models (EBMs), where energy estimation and data restoration are modeled as the forward and backward passes of a single network. |
This approach eliminates the need for auxiliary components like decoders, heavy data augmentations, or modifications to the network structure, simplifying self-supervised vision model pretraining. |
The forward pass trains a network to fit an energy function, assigning low energy scores to in-distribution samples and high energy to others. The backward pass uses gradient-based optimization to restore data from corrupted versions, moving towards energy minimization. This effectively folds the encoder-decoder architecture into a single vision model. |
The proposed method achieves comparable or even better performance with remarkably fewer epochs of training compared to state-of-the-art self-supervised vision model pretraining methods.
The framework's flexibility allows for a broader range of pretext tasks beyond masked image modeling, including patch sorting and image restoration (e.g., super-resolution, denoising, and colorization).
The approach demonstrates good generalization across various network architectures, including ViT, ResNet, ConvNeXt, and Swin-Transformer. |
While achieving strong finetuning results, the method doesn't directly yield strongly linearly-separable features, resulting in lower linear probing accuracy compared to contrastive learning methods.
Future work will focus on further exploring pretext tasks for self-supervised vision model pretraining and improving linear separability of the learned features. |
self-supervised learning, vision model pretraining, energy-based models, masked image modeling, image restoration |
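The restoration step described in this report (moving a corrupted sample along the direction of energy minimization) can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a hand-written quadratic energy with an analytic gradient in place of a trained network, whereas in the paper the gradient would come from the backward pass of the vision model. The names `energy`, `energy_grad`, and `restore`, and the step size, are hypothetical.

```python
def energy(x, mu):
    """Toy energy: low when x is near the 'data' point mu (stand-in for a network)."""
    return 0.5 * sum((xi - mi) ** 2 for xi, mi in zip(x, mu))


def energy_grad(x, mu):
    """Analytic gradient of the toy energy with respect to x."""
    return [xi - mi for xi, mi in zip(x, mu)]


def restore(x_corrupted, mu, step_size=0.5, num_steps=10):
    """Iteratively move a corrupted sample downhill on the energy surface."""
    x = list(x_corrupted)
    for _ in range(num_steps):
        g = energy_grad(x, mu)
        x = [xi - step_size * gi for xi, gi in zip(x, g)]
    return x


clean = [1.0, -2.0, 0.5]
corrupted = [0.0, 0.0, 0.0]  # e.g. a fully "masked" sample
restored = restore(corrupted, clean)
```

With this step size the distance to the clean sample halves at every iteration, so the restored sample has much lower energy than the corrupted one after ten steps.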
2302.01329
Report |
Dreamix: Video Diffusion Models are General Video Editors |
Eyal Molad, Eliahu Horwitz, Dani Valevski, Alex Rav Acha, Yossi Matias, Yael Pritch, Yaniv Leviathan, Yedid Hoshen |
Text-driven image and video diffusion models have recently achieved
unprecedented generation realism. While diffusion models have been successfully
applied for image editing, very few works have done so for video editing. We
present the first diffusion-based method that is able to perform text-based
motion and appearance editing of general videos. Our approach uses a video
diffusion model to combine, at inference time, the low-resolution
spatio-temporal information from the original video with new, high-resolution
information that it synthesized to align with the guiding text prompt. As
obtaining high fidelity to the original video requires retaining some of its
high-resolution information, we add a preliminary stage of finetuning the model
on the original video, significantly boosting fidelity. We propose to improve
motion editability by a new, mixed objective that jointly finetunes with full
temporal attention and with temporal attention masking. We further introduce a
new framework for image animation. We first transform the image into a coarse
video by simple image processing operations such as replication and perspective
geometric projections, and then use our general video editor to animate it. As
a further application, we can use our method for subject-driven video
generation. Extensive qualitative and numerical experiments showcase the
remarkable editing ability of our method and establish its superior performance
compared to baseline methods. |
This paper presents Dreamix, a novel method for text-based video editing that enables motion and appearance edits in real-world videos using a text-conditioned video diffusion model. |
Existing text-based editing methods are primarily image-centric and struggle to maintain temporal consistency in videos. Dreamix leverages the power of video diffusion models to achieve high-quality video editing with strong fidelity to the original content. |
Dreamix finetunes a pre-trained cascaded video diffusion model on the input video with a mixed objective, combining full temporal attention and masked temporal attention. At inference, it corrupts the video, then uses the finetuned model to guide the generation towards the text prompt while preserving original video details. |
Dreamix enables unprecedented video editing capabilities, including modifying motion, appearance, adding objects, and changing backgrounds, all while maintaining temporal consistency.
The proposed mixed finetuning significantly improves motion editing and background change scenarios compared to baselines.
Dreamix enables new applications, such as text-guided image animation by converting images to coarse videos and applying video editing, and subject-driven video generation by finetuning on a collection of subject images. |
Hyperparameter selection, such as noise strength, is currently manual and could be automated for improved user experience.
Automatic evaluation metrics for text-guided video editing are lacking and would benefit from further research to better align with human preference. |
video editing, diffusion models, text-guided generation, image animation, subject-driven video generation |
2302.01327
Report |
Dual PatchNorm |
Manoj Kumar, Mostafa Dehghani, Neil Houlsby |
We propose Dual PatchNorm: two Layer Normalization layers (LayerNorms),
before and after the patch embedding layer in Vision Transformers. We
demonstrate that Dual PatchNorm outperforms the result of exhaustive search for
alternative LayerNorm placement strategies in the Transformer block itself. In
our experiments, incorporating this trivial modification often leads to
improved accuracy over well-tuned Vision Transformers and never hurts. |
This paper introduces Dual PatchNorm (DPN), a simple modification to Vision Transformers (ViTs) that involves adding two Layer Normalization layers before and after the patch embedding layer. |
The authors aim to explore LayerNorm placement strategies beyond the standard pre-LN approach in ViTs and demonstrate that DPN consistently improves performance across various vision tasks. |
The authors conduct extensive experiments on image classification (ImageNet-1k, ImageNet-21k, JFT), contrastive learning, semantic segmentation (ADE20K), and transfer learning (VTAB). They compare DPN with various other LayerNorm placement strategies, including exhaustive search within Transformer blocks. |
DPN consistently improves accuracy over well-tuned vanilla ViT baselines on image classification tasks, achieving an average gain of 1.4% on ImageNet-1k.
DPN also shows benefits in contrastive learning and semantic segmentation, leading to improved zero-shot ImageNet accuracy and mIoU on ADE20K, respectively.
Analysis of gradient norms suggests that DPN helps stabilize training by reducing the gradient norm of the embedding layer. |
While DPN consistently shows improvements, there are a few cases where it performs on par or slightly worse than the baseline.
Future work can explore the theoretical underpinnings of DPN's effectiveness and investigate its applicability to other ViT variants. |
vision transformers, layer normalization, dual patchnorm, image classification, contrastive learning |
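The modification described in this report, one LayerNorm before and one after the patch embedding, can be sketched for a single flattened patch as follows. This is a minimal illustration under stated assumptions: the LayerNorms carry no learned scale or shift, the embedding is a plain dense projection, and the weights and patch values are made up. It is not the authors' code.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(x, weight, bias):
    """Dense projection: weight has one row per output feature."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def dual_patchnorm_embed(patch, weight, bias):
    """LayerNorm -> patch embedding -> LayerNorm for one flattened patch."""
    return layer_norm(linear(layer_norm(patch), weight, bias))


patch = [0.0, 50.0, 100.0, 150.0]  # flattened pixel values of a single patch
weight = [[0.5, -0.5, 0.0, 0.0],
          [0.0, 0.5, -0.5, 0.0],
          [1.0, 0.0, 0.0, -1.0]]  # hypothetical 4 -> 3 embedding weights
bias = [0.1, 0.0, -0.1]
token = dual_patchnorm_embed(patch, weight, bias)
```

The resulting token has approximately zero mean and unit variance regardless of the raw pixel scale, which is the property the two normalization layers are meant to provide at the input of the Transformer.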
2302.01162
Report |
Get3DHuman: Lifting StyleGAN-Human into a 3D Generative Model using Pixel-aligned Reconstruction Priors |
Zhangyang Xiong, Di Kang, Derong Jin, Weikai Chen, Linchao Bao, Shuguang Cui, Xiaoguang Han |
Fast generation of high-quality 3D digital humans is important to a vast
number of applications ranging from entertainment to professional concerns.
Recent advances in differentiable rendering have enabled the training of 3D
generative models without requiring 3D ground truths. However, the quality of
the generated 3D humans still has much room to improve in terms of both
fidelity and diversity. In this paper, we present Get3DHuman, a novel 3D human
framework that can significantly boost the realism and diversity of the
generated outcomes by only using a limited budget of 3D ground-truth data. Our
key observation is that the 3D generator can profit from human-related priors
learned through 2D human generators and 3D reconstructors. Specifically, we
bridge the latent space of Get3DHuman with that of StyleGAN-Human via a
specially-designed prior network, where the input latent code is mapped to the
shape and texture feature volumes spanned by the pixel-aligned 3D
reconstructor. The outcomes of the prior network are then leveraged as the
supervisory signals for the main generator network. To ensure effective
training, we further propose three tailored losses applied to the generated
feature volumes and the intermediate feature maps. Extensive experiments
demonstrate that Get3DHuman greatly outperforms the other state-of-the-art
approaches and can support a wide range of applications including shape
interpolation, shape re-texturing, and single-view reconstruction through
latent inversion. |
Presents Get3DHuman, a 3D human generation framework that leverages priors from 2D human generators and 3D reconstructors to synthesize high-fidelity clothed 3D humans with diverse shapes and textures. |
Generating diverse and realistic 3D humans is crucial for various applications, but current 3D generative models struggle with limited 3D data. This work overcomes these limitations by leveraging priors from well-established 2D and 3D domains. |
The framework employs a prior network (StyleGAN-Human + PIFuHD) to extract normal maps, depth maps, and shape/texture feature volumes as supervisory signals. A two-branch 3D generator (shape and texture) is trained with tailored losses, including latent prior loss and adversarial loss on feature volumes, to ensure high-quality and diverse results. A refinement module enhances the final textured mesh. |
Significantly outperforms state-of-the-art methods (EG3D, SDF-StyleGAN, GET3D) in generating high-fidelity clothed 3D humans, as evidenced by quantitative metrics (COV, MMD, FPD, FID, FID3D) and visual comparisons.
Demonstrates strong capability in various applications, including shape interpolation, shape re-texturing, and single-view reconstruction through latent inversion.
Successfully incorporates inductive bias from 2D and 3D priors, resulting in a more effective and efficient 3D human generation process. |
Currently limited to generating models in standing poses due to the constraints of the StyleGAN-Human prior.
Reliance on manual filtering of training data might introduce bias. |
3d human generation, generative adversarial networks, prior learning, differentiable rendering, shape and texture synthesis |
2302.01133
Report |
SceneScape: Text-Driven Consistent Scene Generation |
Rafail Fridman, Amit Abecasis, Yoni Kasten, Tali Dekel |
We present a method for text-driven perpetual view generation -- synthesizing
long-term videos of various scenes solely from an input text prompt
describing the scene and camera poses. We introduce a novel framework that
generates such videos in an online fashion by combining the generative power of
a pre-trained text-to-image model with the geometric priors learned by a
pre-trained monocular depth prediction model. To tackle the pivotal challenge
of achieving 3D consistency, i.e., synthesizing videos that depict
geometrically-plausible scenes, we deploy an online test-time training to
encourage the predicted depth map of the current frame to be geometrically
consistent with the synthesized scene. The depth maps are used to construct a
unified mesh representation of the scene, which is progressively constructed
along the video generation process. In contrast to previous works, which are
applicable only to limited domains, our method generates diverse scenes, such
as walkthroughs in spaceships, caves, or ice castles. |
This paper introduces SceneScape, the first text-driven perpetual view generation method that synthesizes long-term videos of diverse scenes solely from text prompts and camera poses. |
This addresses limitations of previous methods restricted to specific domains and requiring large-scale training by leveraging the power of pre-trained text-to-image and depth prediction models for zero-shot scene generation. |
SceneScape combines pre-trained models with a unified 3D mesh representation, progressively constructing the scene while ensuring 3D consistency through test-time fine-tuning of depth prediction and image inpainting. |
SceneScape generates high-quality, diverse scenes with significant parallax and complex structures from text prompts.
The method demonstrates superior 3D consistency compared to baselines like VideoFusion and GEN-1, evidenced by quantitative metrics and user studies.
Test-time fine-tuning of depth prediction and image decoding proves crucial for achieving geometric plausibility and visual quality. |
The reliance on pre-trained models can introduce biases present in their training data.
Representing scenes with triangular mesh limits the ability to depict dramatic depth discontinuities found in outdoor environments. |
text-driven generation, perpetual view generation, 3d scene synthesis, test-time optimization, zero-shot learning |
2302.01056
Report |
Beyond Pretrained Features: Noisy Image Modeling Provides Adversarial Defense |
Zunzhi You, Daochang Liu, Bohyung Han, Chang Xu |
Recent advancements in masked image modeling (MIM) have made it a prevailing
framework for self-supervised visual representation learning. The MIM
pretrained models, like most deep neural network methods, remain vulnerable to
adversarial attacks, limiting their practical application, and this issue has
received little research attention. In this paper, we investigate how this
powerful self-supervised learning paradigm can provide adversarial robustness
to downstream classifiers. During the exploration, we find that noisy image
modeling (NIM), a simple variant of MIM that adopts denoising as the pre-text
task, reconstructs noisy images surprisingly well despite severe corruption.
Motivated by this observation, we propose an adversarial defense method,
referred to as De^3, by exploiting the pretrained decoder for denoising.
Through De^3, NIM is able to enhance adversarial robustness beyond providing
pretrained features. Furthermore, we incorporate a simple modification,
sampling the noise scale hyperparameter from random distributions, and enable
the defense to achieve a better and tunable trade-off between accuracy and
robustness. Experimental results demonstrate that, in terms of adversarial
robustness, NIM is superior to MIM thanks to its effective denoising
capability. Moreover, the defense provided by NIM achieves performance on par
with adversarial training while offering the extra tunability advantage. Source
code and models are available at https://github.com/youzunzhi/NIM-AdvDef. |
This paper investigates Noisy Image Modeling (NIM), a variant of Masked Image Modeling (MIM) using denoising as a pretext task, and proposes De^3, a defense method leveraging NIM's denoising capability to enhance adversarial robustness. |
MIM models, while effective for representation learning, lack adversarial robustness, limiting their applicability in safety-critical tasks. This paper explores NIM's potential for enhancing robustness beyond pretrained features. |
The authors train NIM models, observe their strong denoising capability, and propose De^3. This method adds noise to adversarial examples during testing and uses the pretrained NIM decoder to denoise them, mitigating adversarial perturbations. They also propose randomizing the noise level during NIM pretraining for a tunable accuracy-robustness trade-off. |
NIM-pretrained classifiers, even without defense, exhibit better robustness than MIM counterparts.
NIM with De^3 significantly improves robustness against various attacks, outperforming undefended MIM while offering a tunable accuracy-robustness trade-off.
NIM's denoising capability is shown to be superior to MIM's reconstruction ability, contributing to enhanced robustness. |
The paper primarily focuses on demonstrating NIM's advantage over MIM for robustness, not achieving state-of-the-art defense.
Exploration of alternative degradation methods beyond Gaussian noise in NIM is left for future work. |
adversarial robustness, self-supervised learning, masked image modeling, denoising, vision transformers |
2302.00908
Report |
GANalyzer: Analysis and Manipulation of GANs Latent Space for Controllable Face Synthesis |
Ali Pourramezan Fard, Mohammad H. Mahoor, Sarah Ariel Lamer, Timothy Sweeny |
Generative Adversarial Networks (GANs) are capable of synthesizing
high-quality facial images. Despite their success, GANs do not provide any
information about the relationship between the input vectors and the generated
images. Currently, facial GANs are trained on imbalanced datasets, which
generate less diverse images. For example, more than 77% of 100K images that we
randomly synthesized using the StyleGAN3 are classified as Happy, and only
around 3% are Angry. The problem even becomes worse when a mixture of facial
attributes is desired: less than 1% of the generated samples are Angry Woman,
and only around 2% are Happy Black. To address these problems, this paper
proposes a framework, called GANalyzer, for the analysis and manipulation of
the latent space of well-trained GANs. GANalyzer consists of a set of
transformation functions designed to manipulate latent vectors for a specific
facial attribute such as facial Expression, Age, Gender, and Race. We analyze
facial attribute entanglement in the latent space of GANs and apply the
proposed transformation for editing the disentangled facial attributes. Our
experimental results demonstrate the strength of GANalyzer in editing facial
attributes and generating any desired faces. We also create and release a
balanced photo-realistic human face dataset. Our code is publicly available on
GitHub. |
Proposes GANalyzer, a framework to analyze and manipulate the latent space of pre-trained GANs for controllable face synthesis, enabling both facial attribute editing (preserving identity) and feature-based synthesis (specifying attributes). |
Addresses the lack of control over facial attributes in GAN-generated images and the issue of imbalanced datasets leading to less diverse outputs. |
Analyzes facial attributes of synthesized images and their latent vectors using pre-trained classifiers. Defines a transformation function based on the eigenvectors of the covariance matrix of latent vectors belonging to specific attributes. This function allows for manipulation of latent vectors to control facial features in generated images. |
Successfully edits single and multiple facial attributes like age, gender, race, and expression while preserving identity.
Enables feature-based synthesis to generate faces with specific attributes, addressing dataset imbalance.
Provides control over the intensity of the desired facial attribute in both editing and synthesis. |
Reliance on the performance of pre-trained classifiers for accurate attribute labeling.
Potential limitations due to entanglement of facial attributes in the training data of the original GAN. |
generative adversarial networks, face synthesis, latent space manipulation, facial attribute editing, feature-based synthesis |
2302.00833
Report |
RobustNeRF: Ignoring Distractors with Robust Losses |
Sara Sabour, Suhani Vora, Daniel Duckworth, Ivan Krasin, David J. Fleet, Andrea Tagliasacchi |
Neural radiance fields (NeRF) excel at synthesizing new views given
multi-view, calibrated images of a static scene. When scenes include
distractors, which are not persistent during image capture (moving objects,
lighting variations, shadows), artifacts appear as view-dependent effects or
'floaters'. To cope with distractors, we advocate a form of robust estimation
for NeRF training, modeling distractors in training data as outliers of an
optimization problem. Our method successfully removes outliers from a scene and
improves upon our baselines, on synthetic and real-world scenes. Our technique
is simple to incorporate in modern NeRF frameworks, with few hyper-parameters.
It does not assume a priori knowledge of the types of distractors, and is
instead focused on the optimization problem rather than pre-processing or
modeling transient objects. More results on our page
https://robustnerf.github.io/public. |
This paper presents RobustNeRF, a novel method to address the issue of distractors (transient objects or effects) in neural radiance fields (NeRF) by treating them as outliers during optimization. |
Distractors are common in real-world scenes and can severely degrade the quality of NeRF reconstructions. Existing methods for handling distractors have limitations, such as requiring pre-trained segmentation models or complex loss balancing. |
RobustNeRF uses a trimmed least squares loss combined with iteratively reweighted least squares (IRLS). It leverages spatial smoothness assumptions to distinguish distractors from high-frequency details, effectively ignoring distractors during training. |
RobustNeRF outperforms baselines like MipNeRF360 and D^2NeRF in terms of reconstruction quality on both synthetic and real-world datasets.
The method is robust to varying clutter levels and requires minimal hyperparameter tuning.
Qualitative and quantitative evaluations demonstrate the efficacy of RobustNeRF in ignoring distractors and producing high-quality NeRF reconstructions. |
On clean datasets, RobustNeRF may exhibit slightly lower reconstruction quality and longer training times compared to methods like MipNeRF360 due to inherent statistical inefficiency.
Future work will focus on handling very small distractors, learning neural weight functions for improved accuracy, and incorporating the robust loss into other NeRF frameworks. |
nerf, robust estimation, outlier rejection, 3d reconstruction, computer vision |
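As a rough illustration of the trimmed least squares idea in the RobustNeRF entry above, the sketch below recomputes binary IRLS-style weights from the current per-pixel residuals and averages only the residuals that are kept. The function names, the `keep_fraction` parameter, and its 0.5 default are assumptions made for this example, and the spatial smoothing mentioned in the summary is omitted.

```python
def trimmed_weights(residuals, keep_fraction=0.5):
    """Binary weights: 1.0 for the smallest residuals (inliers), 0.0 otherwise.

    keep_fraction is a hypothetical parameter for this sketch.
    """
    if not residuals:
        return []
    ordered = sorted(residuals)
    k = max(1, int(len(ordered) * keep_fraction))
    threshold = ordered[k - 1]  # largest residual still treated as an inlier
    # Ties at the threshold are all kept, so slightly more than k may survive.
    return [1.0 if r <= threshold else 0.0 for r in residuals]


def trimmed_loss(residuals, keep_fraction=0.5):
    """Mean of the residuals that receive weight 1.0."""
    if not residuals:
        raise ValueError("residuals must not be empty")
    weights = trimmed_weights(residuals, keep_fraction)
    kept = sum(weights)
    return sum(w * r for w, r in zip(weights, residuals)) / kept


# One large residual (a likely distractor) is ignored by the loss.
print(trimmed_loss([0.1, 0.2, 5.0, 0.15]))
```

In an IRLS-style loop the weights would be recomputed from fresh residuals at every training step, so a pixel rejected early can be readmitted later once the model explains it well.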
2302.00190
Report |
Neural Wavelet-domain Diffusion for 3D Shape Generation, Inversion, and Manipulation |
Jingyu Hu, Ka-Hei Hui, Zhengzhe Liu, Ruihui Li, Chi-Wing Fu |
This paper presents a new approach for 3D shape generation, inversion, and
manipulation, through a direct generative modeling on a continuous implicit
representation in wavelet domain. Specifically, we propose a compact wavelet
representation with a pair of coarse and detail coefficient volumes to
implicitly represent 3D shapes via truncated signed distance functions and
multi-scale biorthogonal wavelets. Then, we design a pair of neural networks: a
diffusion-based generator to produce diverse shapes in the form of the coarse
coefficient volumes and a detail predictor to produce compatible detail
coefficient volumes for introducing fine structures and details. Further, we
may jointly train an encoder network to learn a latent space for inverting
shapes, allowing us to enable a rich variety of whole-shape and region-aware
shape manipulations. Both quantitative and qualitative experimental results
manifest the compelling shape generation, inversion, and manipulation
capabilities of our approach over the state-of-the-art methods. |
This paper introduces a novel approach for 3D shape generation, inversion, and manipulation using a compact wavelet representation of implicit functions in the wavelet domain. |
Existing 3D shape generation methods struggle to produce diverse and realistic shapes with fine details. This work addresses these limitations by proposing a compact wavelet representation and a diffusion-based generative model operating in the wavelet domain. |
The method leverages biorthogonal wavelets to decompose the truncated signed distance field (TSDF) of a 3D shape into coarse and detail coefficient volumes. It then employs a diffusion-based generator to synthesize coarse coefficient volumes and a detail predictor to generate compatible details. An encoder network is jointly trained for shape inversion and manipulation. |
The method generates diverse and realistic 3D shapes exhibiting complex structures, fine details, and clean surfaces.
It faithfully inverts unseen shapes into latent codes, enabling high-quality shape reconstruction and interpolation.
It supports various region-aware manipulations, including part replacement, part-wise interpolation, and part-wise re-generation. |
The generated shapes, while visually plausible, may not always meet desired functionalities.
The method requires a large number of shapes for training, limiting its effectiveness for categories with few training samples. |
3d shape generation, shape manipulation, diffusion model, wavelet representation, implicit function |
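To make the coarse/detail split in the wavelet entry above concrete, here is a minimal one-level decomposition of a 1D signal. It swaps in the Haar wavelet for the biorthogonal wavelets the summary mentions and works on a list of numbers rather than a 3D TSDF volume, so it only illustrates the structure of the representation, not the authors' transform.

```python
def haar_decompose(signal):
    """Split an even-length signal into coarse (averages) and detail (differences)."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    pairs = list(zip(signal[0::2], signal[1::2]))
    coarse = [(a + b) / 2 for a, b in pairs]
    detail = [(a - b) / 2 for a, b in pairs]
    return coarse, detail


def haar_reconstruct(coarse, detail):
    """Invert haar_decompose exactly."""
    signal = []
    for c, d in zip(coarse, detail):
        signal.extend([c + d, c - d])
    return signal


coarse, detail = haar_decompose([4.0, 2.0, 1.0, 3.0])
print(coarse, detail)                    # coarse carries the overall shape
print(haar_reconstruct(coarse, detail))  # detail restores the fine structure
```

The split mirrors the division of labor described in the summary: a generator would produce the coarse coefficients, and a separate predictor would supply details compatible with them.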
2301.13823
Report |
Grounding Language Models to Images for Multimodal Inputs and Outputs |
Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried |
We propose an efficient method to ground pretrained text-only language models
to the visual domain, enabling them to process arbitrarily interleaved
image-and-text data, and generate text interleaved with retrieved images. Our
method leverages the abilities of language models learnt from large scale
text-only pretraining, such as in-context learning and free-form text
generation. We keep the language model frozen, and finetune input and output
linear layers to enable cross-modality interactions. This allows our model to
process arbitrarily interleaved image-and-text inputs, and generate free-form
text interleaved with retrieved images. We achieve strong zero-shot performance
on grounded tasks such as contextual image retrieval and multimodal dialogue,
and showcase compelling interactive abilities. Our approach works with any
off-the-shelf language model and paves the way towards an effective, general
solution for leveraging pretrained language models in visually grounded
settings. |
This paper introduces FROMAGE, an efficient method to ground pre-trained text-only language models (LLMs) to the visual domain, allowing them to process and generate interleaved image-and-text data. |
This is important because it enables LLMs to leverage visual cues, improving their performance on visually grounded tasks like multimodal dialogue and contextual image retrieval, while retaining their existing text generation abilities. |
FROMAGE leverages a frozen LLM and a frozen visual encoder, training only linear mapping layers for image-to-text and text-to-image interactions, along with a new [RET] token for image retrieval. This allows for efficient training with a multi-task objective of image captioning and image-text retrieval. |
FROMAGE demonstrates strong zero-shot performance on contextual image retrieval, outperforming CLIP, particularly when provided with long and complex descriptions or multimodal context.
The model exhibits competitive results on zero-shot Visual Dialogue, surpassing prior work in text-to-image retrieval within dialogue.
FROMAGE showcases in-context learning abilities by generating coherent and relevant multimodal stories and responses in interactive settings. |
The model's reliance on image retrieval from a fixed set limits its ability to generate novel images or handle prompts unlikely to be found in natural images.
While the introduced [RET] token enables image interleaving, further research is needed to encourage its natural generation during inference. |
vision-and-language, large language models, multimodal dialogue, contextual image retrieval, frozen model adaptation |
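The retrieval step in the FROMAGE entry above can be pictured as ranking candidate images by similarity to the embedding produced for the [RET] token. The sketch below assumes cosine similarity as the score and uses tiny hand-written vectors; the summary does not state the scoring function, and the names `retrieve` and `ret_embedding` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(ret_embedding, image_embeddings, top_k=1):
    """Return the names of the top_k images most similar to the [RET] embedding."""
    ranked = sorted(
        image_embeddings.items(),
        key=lambda item: cosine(ret_embedding, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# Toy 2-D embeddings standing in for the outputs of the linear mapping layers.
images = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "fox": [0.7, 0.7]}
print(retrieve([0.9, 0.1], images, top_k=2))
```

Because candidates come from a fixed set of embeddings, this also shows why the model cannot return an image that is absent from the retrieval pool, as noted in the limitations.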
2301.13721
Report |
DisDiff: Unsupervised Disentanglement of Diffusion Probabilistic Models |
Tao Yang, Yuwang Wang, Yan Lv, Nanning Zheng |
Aiming to understand the underlying explainable factors behind
observations and to model the conditional generation process on these factors,
we connect disentangled representation learning to Diffusion Probabilistic
Models (DPMs) to take advantage of the remarkable modeling ability of DPMs. We
propose a new task, disentanglement of DPMs: given a pre-trained DPM, without
any annotations of the factors, the task is to automatically discover the
inherent factors behind the observations and disentangle the gradient fields of
DPM into sub-gradient fields, each conditioned on the representation of each
discovered factor. With disentangled DPMs, those inherent factors can be
automatically discovered, explicitly represented, and clearly injected into the
diffusion process via the sub-gradient fields. To tackle this task, we devise
an unsupervised approach named DisDiff, achieving disentangled representation
learning in the framework of DPMs. Extensive experiments on synthetic and
real-world datasets demonstrate the effectiveness of DisDiff. |
This paper introduces a novel task: disentanglement of diffusion probabilistic models (DPMs). The goal is to automatically discover and explicitly represent inherent factors of variation within pre-trained DPMs, without any factor annotations. This is achieved by disentangling the DPM's gradient fields into sub-gradient fields, each conditioned on the representation of a discovered factor. |
Disentangling DPMs offers two main advantages: 1) It enables unsupervised control over image generation by uncovering inherent semantic factors, extending the possibilities for DPM conditioning beyond supervised methods. 2) DPMs, with their strong image generation quality and natural affinity for inversion, offer a more suitable framework for disentangled representation learning compared to VAEs or GANs. |
The authors propose DisDiff, an unsupervised approach that learns disentangled representations for each factor and their corresponding disentangled conditional sub-gradient fields. It utilizes an encoder to learn factor representations and a decoder to learn the sub-gradient fields. A novel Disentangling Loss function encourages the learned representations to satisfy disentanglement requirements while still allowing for accurate input image reconstruction. |
DisDiff significantly outperforms existing VAE-based and GAN-based disentanglement methods on benchmark datasets like Shapes3D, MPI3D, and Cars3D, as measured by FactorVAE score and DCI.
Qualitative results demonstrate DisDiff's ability to effectively disentangle factors and enable image editing by swapping factor representations.
DisDiff allows for partial condition sampling, generating images conditioned on a specific subset of factors, both in controlled and real-world datasets like CelebA. |
The unsupervised nature of DisDiff might lead to learned disentangled representations on natural image sets that are not easily interpretable by humans, requiring further exploration of methods like CLIP for guidance.
As a diffusion-based method, DisDiff's generation speed is slower compared to VAE-based and GAN-based methods, a common limitation for DPM-based approaches. |
disentangled representation learning, diffusion probabilistic models, unsupervised learning, image generation, image editing |
2301.13622
Report |
Learning Data Representations with Joint Diffusion Models |
Kamil Deja, Tomasz Trzcinski, Jakub M. Tomczak |
Joint machine learning models that allow synthesizing and classifying data
often offer uneven performance between those tasks or are unstable to train. In
this work, we start from a set of empirical observations that indicate the
usefulness of internal representations built by contemporary deep
diffusion-based generative models not only for generating but also predicting.
We then propose to extend the vanilla diffusion model with a classifier that
allows for stable joint end-to-end training with shared parameterization
between those objectives. The resulting joint diffusion model outperforms
recent state-of-the-art hybrid methods in terms of both classification and
generation quality on all evaluated benchmarks. On top of our joint training
approach, we present how we can directly benefit from shared generative and
discriminative representations by introducing a method for visual
counterfactual explanations. |
This paper introduces a joint diffusion model, combining a diffusion model and a classifier through shared parameterization, to enhance both data generation and classification tasks. |
Joint models that synthesize and classify data often suffer from uneven performance or training instability. This work leverages the representational power of diffusion models to improve both aspects within a single model. |
The authors analyze the usefulness of internal representations learned by diffusion models for prediction tasks. They propose a joint training approach where a classifier shares the encoder part of the diffusion model's UNet architecture, leading to shared representations for both generative and discriminative objectives. Additionally, they introduce a conditional sampling algorithm that optimizes internal diffusion representations using the classifier. |
The joint diffusion model outperforms stand-alone classifiers and previous joint models in classification accuracy across multiple datasets.
It demonstrates superior generative capabilities compared to vanilla diffusion models and other hybrid methods, as evidenced by improved FID scores.
The model effectively generates visual counterfactual explanations by identifying minimal changes in input images required to alter the classifier's decision. |
The conditional sampling method, while effective, relies on a step size parameter that requires tuning for optimal precision and diversity of generated samples.
Further exploration of more sophisticated domain adaptation techniques could enhance the model's performance in domain transfer scenarios. |
deep generative models, diffusion models, joint models, conditional sampling, counterfactual explanations |
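The joint training described in the entry above combines a generative and a discriminative objective computed from shared features. A minimal sketch of such a combined loss follows, using a mean squared error on predicted noise plus a cross-entropy term. The `class_weight` factor and all function names are assumptions for illustration and are not taken from the source.

```python
import math


def noise_mse(predicted_noise, true_noise):
    """Mean squared error between predicted and true noise (diffusion term)."""
    return sum((p - t) ** 2 for p, t in zip(predicted_noise, true_noise)) / len(true_noise)


def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for one example."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def joint_loss(predicted_noise, true_noise, logits, label, class_weight=1.0):
    """Sum of the two objectives; class_weight is a hypothetical knob."""
    return noise_mse(predicted_noise, true_noise) + class_weight * cross_entropy(logits, label)


print(joint_loss([1.0, 0.0], [0.0, 0.0], [0.0, 0.0], label=0))
```

In a real model both terms would backpropagate into the same encoder, which is what makes the representation shared between generation and classification.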
2301.13188
Report |
Extracting Training Data from Diffusion Models |
Nicholas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramèr, Borja Balle, Daphne Ippolito, Eric Wallace |
Image diffusion models such as DALL-E 2, Imagen, and Stable Diffusion have
attracted significant attention due to their ability to generate high-quality
synthetic images. In this work, we show that diffusion models memorize
individual images from their training data and emit them at generation time.
With a generate-and-filter pipeline, we extract over a thousand training
examples from state-of-the-art models, ranging from photographs of individual
people to trademarked company logos. We also train hundreds of diffusion models
in various settings to analyze how different modeling and data decisions affect
privacy. Overall, our results show that diffusion models are much less private
than prior generative models such as GANs, and that mitigating these
vulnerabilities may require new advances in privacy-preserving training. |
This paper demonstrates that state-of-the-art diffusion models memorize and regenerate individual training examples, posing privacy risks. |
This is important because it challenges the assumption that diffusion models generate novel images and raises concerns about data privacy, copyright infringement, and the potential for misuse with sensitive data. |
The authors devise a two-stage data extraction attack: (1) generate numerous images from pre-trained diffusion models (Stable Diffusion and Imagen) using diverse prompts and (2) identify memorized training examples by detecting near-identical generations. They also train hundreds of diffusion models on CIFAR-10 to analyze the factors influencing memorization. |
The authors extract over a thousand training examples from Stable Diffusion and Imagen, including personally identifiable information and copyrighted material.
Diffusion models are found to be less private than GANs, with stronger diffusion models exhibiting higher vulnerability to memorization.
Existing defenses like data deduplication and differentially-private training provide limited protection or cause training instability. |
The definition of memorization based on pixel-level similarity might overlook more nuanced forms of data copying.
Differentially-private training for diffusion models requires further investigation to address training instability. |
diffusion models, memorization, privacy, data extraction, generative models |
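The filtering stage of the extraction attack in the entry above looks for prompts whose generations are nearly identical to one another. The sketch below flags a generation when enough other generations lie within a small L2 distance of it. Images are represented as flat lists of floats, and the `threshold` and `min_neighbors` values are hypothetical; this neighbor count is a simplification rather than the authors' exact criterion.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two flattened images."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flag_memorized(generations, threshold, min_neighbors):
    """Indices of generations with at least min_neighbors near-identical peers."""
    flagged = []
    for i, gen_i in enumerate(generations):
        neighbors = sum(
            1
            for j, gen_j in enumerate(generations)
            if j != i and l2_distance(gen_i, gen_j) < threshold
        )
        if neighbors >= min_neighbors:
            flagged.append(i)
    return flagged


# Three near-duplicates and one unrelated sample for the same prompt.
samples = [[0.0, 0.0], [0.01, 0.0], [0.0, 0.01], [1.0, 1.0]]
print(flag_memorized(samples, threshold=0.05, min_neighbors=2))
```

The intuition is that a model sampling freely should produce varied outputs, so repeated convergence to the same image for one prompt suggests a memorized training example.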
2301.13173
Report |
Shape-aware Text-driven Layered Video Editing |
Yao-Chih Lee, Ji-Ze Genevieve Jang, Yi-Ting Chen, Elizabeth Qiu, Jia-Bin Huang |
Temporal consistency is essential for video editing applications. Existing
work on layered representation of videos allows propagating edits consistently
to each frame. These methods, however, can only edit object appearance rather
than object shape changes due to the limitation of using a fixed UV mapping
field for texture atlas. We present a shape-aware, text-driven video editing
method to tackle this challenge. To handle shape changes in video editing, we
first propagate the deformation field between the input and edited keyframe to
all frames. We then leverage a pre-trained text-conditioned diffusion model as
guidance for refining shape distortion and completing unseen regions. The
experimental results demonstrate that our method can achieve shape-aware
consistent video editing and compare favorably with the state-of-the-art. |
This paper introduces a novel shape-aware, text-driven video editing method that enables changes to both object appearance and shape in a video, ensuring temporal consistency. |
Existing video editing methods based on layered representations are limited to appearance editing due to fixed UV mapping. This method addresses the challenge of achieving consistent shape changes in videos, expanding the possibilities for creative video editing. |
The method leverages a pre-trained neural layered atlas (NLA) model to decompose the video into layers, then uses a text-to-image diffusion model to edit a keyframe. By estimating semantic correspondence between the input and edited keyframes, the method generates per-frame deformation fields. Finally, a pre-trained diffusion model guides the optimization of atlas texture and deformation, completing unseen regions and refining shape details. |
The method successfully achieves shape-aware consistent video editing, as demonstrated through visual comparisons with baseline methods.
Ablation studies confirm the importance of both UV deformation and atlas optimization for achieving high-quality results.
The proposed approach allows for shape interpolation, expanding creative possibilities for video editing. |
The method relies on the accuracy of NLA mapping, which may fail in complex motion scenarios leading to artifacts.
Inaccurate semantic correspondence initialization between different objects can hinder the optimization process, suggesting potential for user-guided improvements. |
video editing, shape editing, text-driven editing, neural layered atlas, diffusion models |
2301.13156
Report |
SeaFormer: Squeeze-enhanced Axial Transformer for Mobile Semantic Segmentation |
Qiang Wan, Zilong Huang, Jiachen Lu, Gang Yu, Li Zhang |
Since the introduction of Vision Transformers, the landscape of many computer
vision tasks (e.g., semantic segmentation), which had been overwhelmingly
dominated by CNNs, has recently been significantly revolutionized. However, the
computational cost and memory requirements render these methods unsuitable for
mobile devices, especially for the high-resolution per-pixel semantic
segmentation task. In this paper, we introduce a new method, squeeze-enhanced
Axial TransFormer (SeaFormer), for mobile semantic segmentation. Specifically,
we design a generic attention block characterized by the formulation of squeeze
Axial and detail enhancement. It can be further used to create a family of
backbone architectures with superior cost-effectiveness. Coupled with a light
segmentation head, we achieve the best trade-off between segmentation accuracy
and latency on the ARM-based mobile devices on the ADE20K and Cityscapes
datasets. Critically, we beat both the mobile-friendly rivals and
Transformer-based counterparts with better performance and lower latency
without bells and whistles. Beyond semantic segmentation, we further apply the
proposed SeaFormer architecture to the image classification problem, demonstrating
its potential to serve as a versatile mobile-friendly backbone. |
This paper introduces SeaFormer, a mobile-friendly Transformer-based model for semantic segmentation, featuring a squeeze-enhanced Axial attention block for efficient global context modeling. |
Vision Transformers are computationally expensive and memory-intensive, making them unsuitable for mobile semantic segmentation, especially with high-resolution images. |
SeaFormer uses a squeeze-enhanced Axial attention mechanism, squeezing feature maps for efficient global context aggregation and enhancing local details with a convolution kernel. |
SeaFormer outperforms mobile-friendly networks (e.g., MobileNetV3) and Transformer-based models (e.g., TopFormer) on ADE20K and Cityscapes datasets.
SeaFormer-Base achieves +7.9% mIoU improvement over MobileNetV3 with lower latency on an ARM-based mobile device.
Ablation studies demonstrate the effectiveness of each component in SeaFormer, especially the squeeze-enhanced Axial attention block. |
The system's performance might be limited due to the lack of exhaustive evaluation and testing in real-world deployments.
Future work includes extending the mobile-friendly approach to more downstream tasks and exploring its potential on GPU systems. |
semantic segmentation, vision transformer, mobile-friendly, axial attention, edge computing |
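The "squeeze" in the SeaFormer entry above can be illustrated on a single-channel feature map: averaging along each axis leaves H + W summary values instead of H × W positions, and the two summaries are then broadcast back to the full grid. This sketch assumes mean pooling as the squeeze and covers only the squeeze-and-broadcast step; the attention over the squeezed sequences and the convolutional detail enhancement are omitted, and the function names are hypothetical.

```python
def squeeze_axes(feature_map):
    """Average a 2D map along each axis: one value per row and one per column."""
    height, width = len(feature_map), len(feature_map[0])
    row_summary = [sum(row) / width for row in feature_map]
    col_summary = [
        sum(feature_map[r][c] for r in range(height)) / height for c in range(width)
    ]
    return row_summary, col_summary


def broadcast_sum(row_summary, col_summary):
    """Expand both summaries to the full grid and add them element-wise."""
    return [[r + c for c in col_summary] for r in row_summary]


rows, cols = squeeze_axes([[1.0, 3.0], [5.0, 7.0]])
print(rows, cols)
print(broadcast_sum(rows, cols))
```

Operating on the two short sequences rather than the full grid is what keeps global context modeling cheap at high resolution.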
2301.12959
Report |
GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis |
Ming Tao, Bing-Kun Bao, Hao Tang, Changsheng Xu |
Synthesizing high-fidelity complex images from text is challenging. Based on
large pretraining, the autoregressive and diffusion models can synthesize
photo-realistic images. Although these large models have shown notable
progress, there remain three flaws. 1) These models require tremendous training
data and parameters to achieve good performance. 2) The multi-step generation
design slows the image synthesis process heavily. 3) The synthesized visual
features are difficult to control and require delicately designed prompts. To
enable high-quality, efficient, fast, and controllable text-to-image synthesis,
we propose Generative Adversarial CLIPs, namely GALIP. GALIP leverages the
powerful pretrained CLIP model both in the discriminator and generator.
Specifically, we propose a CLIP-based discriminator. The complex scene
understanding ability of CLIP enables the discriminator to accurately assess
the image quality. Furthermore, we propose a CLIP-empowered generator that
induces the visual concepts from CLIP through bridge features and prompts. The
CLIP-integrated generator and discriminator boost training efficiency, and as a
result, our model only requires about 3% training data and 6% learnable
parameters, achieving comparable results to large pretrained autoregressive and
diffusion models. Moreover, our model achieves 120 times faster synthesis speed
and inherits the smooth latent space from GAN. The extensive experimental
results demonstrate the excellent performance of our GALIP. Code is available
at https://github.com/tobran/GALIP. |
This paper introduces GALIP, a novel text-to-image generation framework that integrates the pretrained CLIP model in both the discriminator and generator, enabling high-quality, efficient, fast, and controllable text-to-image synthesis. |
Existing large pretrained autoregressive and diffusion models, while impressive, require tremendous training data and parameters, have slow multi-step generation, and lack intuitive control over visual features. GALIP addresses these limitations. |
GALIP leverages a CLIP-based discriminator with a frozen CLIP image encoder and a learnable mate-discriminator to accurately assess image quality. It also uses a CLIP-empowered generator with a frozen CLIP encoder and a learnable mate-generator to induce visual concepts from CLIP via bridge features and prompts. |
GALIP achieves comparable synthesis quality to large pretrained models with significantly smaller trainable parameters and training data.
It enables ~120x faster synthesis speed compared to diffusion models like LDM.
GALIP inherits the smooth latent space from GANs, allowing for more controllable synthesis. |
The CLIP text encoder in GALIP might be improved by using more advanced large language models like T5.
Increasing the model size and pretraining dataset size could further enhance the synthesis ability, particularly for imaginary images. |
text-to-image synthesis, generative adversarial networks (gans), clip, image generation, deep learning |
2301.12914
Report |
PromptMix: Text-to-image diffusion models enhance the performance of lightweight networks |
Arian Bakhtiarnia, Qi Zhang, Alexandros Iosifidis |
Many deep learning tasks require annotations that are too time-consuming for
human operators, resulting in small dataset sizes. This is especially true for
dense regression problems such as crowd counting, which requires the location of
every person in the image to be annotated. Techniques such as data augmentation
and synthetic data generation based on simulations can help in such cases. In
this paper, we introduce PromptMix, a method for artificially boosting the size
of existing datasets, that can be used to improve the performance of
lightweight networks. First, synthetic images are generated in an end-to-end
data-driven manner, where text prompts are extracted from existing datasets via
an image captioning deep network, and subsequently introduced to text-to-image
diffusion models. The generated images are then annotated using one or more
high-performing deep networks, and mixed with the real dataset for training the
lightweight network. By extensive experiments on five datasets and two tasks,
we show that PromptMix can significantly increase the performance of
lightweight networks by up to 26%. |
This paper introduces PromptMix, a method to improve the performance of lightweight deep neural networks by augmenting training datasets with synthetic data generated using text-to-image diffusion models. |
Lightweight networks are crucial for deploying deep learning models on resource-constrained devices, but they often suffer from reduced accuracy compared to their heavyweight counterparts. PromptMix addresses this issue by generating additional training data, which is particularly beneficial for tasks where data collection and annotation are costly. |
1. **Prompt Generation:** Extract text descriptions (prompts) from existing datasets using image captioning or manually define them.
2. **Prompt Modification:** Enhance prompts with prefixes/suffixes to guide image generation.
3. **Image Generation:** Generate synthetic images using a text-to-image diffusion model (Stable Diffusion) based on the modified prompts.
4. **Image Filtering:** Filter out low-quality synthetic images based on the agreement between multiple heavyweight models' annotations.
5. **Data Mixing:** Combine a subset of the filtered synthetic data with the real dataset during training. |
PromptMix consistently enhances the performance of lightweight networks across different tasks (crowd counting, depth estimation), datasets, and architectures.
ResCSRNet, an ultra-lightweight architecture introduced in the paper, achieves comparable or even superior results to heavyweight models when trained with PromptMix.
The paper provides insights into PromptMix's hyperparameters, demonstrating that a wide range of settings leads to improvements over baseline training. |
PromptMix involves several hyperparameters that need to be tuned, although the ablation study shows its robustness to different configurations.
Generating high-quality synthetic images with faces in crowds remains a challenge due to limitations in current text-to-image diffusion models. |
lightweight deep learning, data augmentation, text-to-image synthesis, diffusion models, crowd counting, monocular depth estimation |
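The filtering and mixing steps in the PromptMix entry above can be sketched as follows for a crowd-counting setting. Each synthetic image carries counts predicted by two heavyweight models, and an image is kept only when the counts roughly agree. The relative-gap threshold, the use of the averaged count as the pseudo-label, and the `synthetic_ratio` parameter are all assumptions made for this example.

```python
import random


def filter_by_agreement(samples, max_relative_gap=0.2):
    """Keep (image_id, count_a, count_b) samples whose two counts roughly agree.

    Returns (image_id, pseudo_label) pairs, with the mean count as the label.
    """
    kept = []
    for image_id, count_a, count_b in samples:
        scale = max(abs(count_a), abs(count_b), 1e-8)
        if abs(count_a - count_b) / scale <= max_relative_gap:
            kept.append((image_id, (count_a + count_b) / 2))
    return kept


def mix_datasets(real, synthetic, synthetic_ratio=0.5, seed=0):
    """Add a random subset of synthetic samples to the real training set."""
    rng = random.Random(seed)
    n_synth = min(len(synthetic), int(len(real) * synthetic_ratio))
    mixed = list(real) + rng.sample(synthetic, n_synth)
    rng.shuffle(mixed)
    return mixed


synthetic = filter_by_agreement([("a", 100.0, 110.0), ("b", 50.0, 120.0)])
print(synthetic)
print(len(mix_datasets([("r1", 1.0), ("r2", 2.0)], synthetic)))
```

Using disagreement between annotators as a quality signal avoids needing ground truth for the synthetic images, at the cost of discarding samples where the models happen to err in different directions.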
2301.12686
Report |
GibbsDDRM: A Partially Collapsed Gibbs Sampler for Solving Blind Inverse Problems with Denoising Diffusion Restoration |
Naoki Murata, Koichi Saito, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon |
Pre-trained diffusion models have been successfully used as priors in a
variety of linear inverse problems, where the goal is to reconstruct a signal
from noisy linear measurements. However, existing approaches require knowledge
of the linear operator. In this paper, we propose GibbsDDRM, an extension of
Denoising Diffusion Restoration Models (DDRM) to a blind setting in which the
linear measurement operator is unknown. GibbsDDRM constructs a joint
distribution of the data, measurements, and linear operator by using a
pre-trained diffusion model for the data prior, and it solves the problem by
posterior sampling with an efficient variant of a Gibbs sampler. The proposed
method is problem-agnostic, meaning that a pre-trained diffusion model can be
applied to various inverse problems without fine-tuning. In experiments, it
achieved high performance on both blind image deblurring and vocal
dereverberation tasks, despite the use of simple generic priors for the
underlying linear operators. |
This paper introduces GibbsDDRM, a novel method for blind linear inverse problems that utilizes a pre-trained diffusion model as a data prior and a partially collapsed Gibbs sampler for efficient posterior sampling. |
Many real-world inverse problems are blind, meaning the measurement process is unknown, requiring estimation of both the original signal and the linear operator parameters, which poses a significant challenge. |
GibbsDDRM constructs a joint distribution of data, measurements, and linear operator parameters. It then uses a partially collapsed Gibbs sampler, alternately sampling data/latent variables and linear operator parameters, leveraging the diffusion model's representational power for accurate estimation. |
GibbsDDRM achieves high performance on blind image deblurring, surpassing competing methods in perceptual quality (LPIPS) despite using a simple prior for the blur kernel.
In vocal dereverberation, GibbsDDRM demonstrates superior performance in terms of signal quality (SI-SDR), perceptual quality (FAD), and reverberation removal (SRMR).
The method's efficacy is demonstrated even with large measurement noise and simple priors for the linear operator, highlighting its robustness and generalizability. |
GibbsDDRM's reliance on SVD computations might limit its applicability to problems involving large-scale linear operators.
Future research could explore extending GibbsDDRM to handle non-linear inverse problems. |
diffusion models, inverse problems, gibbs sampling, blind deblurring, dereverberation |
2301.12597
Report |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models |
Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi |
The cost of vision-and-language pre-training has become increasingly
prohibitive due to end-to-end training of large-scale models. This paper
proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps
vision-language pre-training from off-the-shelf frozen pre-trained image
encoders and frozen large language models. BLIP-2 bridges the modality gap with
a lightweight Querying Transformer, which is pre-trained in two stages. The
first stage bootstraps vision-language representation learning from a frozen
image encoder. The second stage bootstraps vision-to-language generative
learning from a frozen language model. BLIP-2 achieves state-of-the-art
performance on various vision-language tasks, despite having significantly
fewer trainable parameters than existing methods. For example, our model
outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable
parameters. We also demonstrate the model's emerging capabilities of zero-shot
image-to-text generation that can follow natural language instructions. |
This paper introduces BLIP-2, a new vision-language pre-training method that bootstraps from frozen pre-trained image encoders and large language models (LLMs), achieving state-of-the-art performance on various vision-language tasks while being computationally efficient. |
Vision-language pre-training (VLP) is becoming computationally expensive, and BLIP-2 leverages readily available unimodal models to improve efficiency and performance. |
BLIP-2 uses a lightweight Querying Transformer (Q-Former) to bridge the modality gap between frozen image encoders and LLMs. It employs a two-stage pre-training strategy: (1) vision-language representation learning with a frozen image encoder and (2) vision-to-language generative learning with a frozen LLM. |
BLIP-2 achieves state-of-the-art performance on zero-shot VQAv2, outperforming Flamingo80B by 8.7% with 54x fewer trainable parameters.
It demonstrates strong generalization ability to out-of-domain images, achieving impressive results on image captioning tasks.
BLIP-2 enables instructed zero-shot image-to-text generation, demonstrating capabilities like visual knowledge reasoning, visual conversation, etc. |
BLIP-2 does not currently benefit from in-context learning with LLMs due to the limitation of the pre-training dataset.
Generated image-to-text outputs may be inaccurate due to limitations of the LLM's knowledge or reasoning abilities. |
vision-language pre-training, image captioning, visual question answering, image-text retrieval, large language models |
2301.12429
Report |
Debiased Fine-Tuning for Vision-language Models by Prompt Regularization |
Beier Zhu, Yulei Niu, Saeil Lee, Minhoe Hur, Hanwang Zhang |
We present a new paradigm for fine-tuning large-scale visionlanguage
pre-trained models on downstream task, dubbed Prompt Regularization (ProReg).
Different from traditional fine-tuning which easily overfits to the downstream
task data, ProReg uses the prediction by prompting the pretrained model to
regularize the fine-tuning. The motivation is: by prompting the large model "a
photo of a [CLASS]", the fill-in answer is only dependent on the pretraining
encyclopedic knowledge while independent of the task data distribution, which
is usually biased. Specifically, given a training sample prediction during
fine-tuning, we first calculate its Kullback-Leibler loss against the prompt
prediction and its cross-entropy loss against the ground-truth label, and then combine
them with a proposed sample-wise adaptive trade-off weight, which automatically
adjusts the transfer between the pretrained and downstream domains. On various
out-of-distribution benchmarks, we show the consistently strong performance of
ProReg compared with conventional fine-tuning, zero-shot prompt, prompt tuning,
and other state-of-the-art methods. |
This paper introduces Prompt Regularization (ProReg), a novel fine-tuning paradigm for large-scale vision-language pre-trained models that leverages prompt-based predictions as regularization to mitigate overfitting to downstream task data and improve out-of-distribution generalization. |
Traditional fine-tuning methods often overfit to biased downstream data, while zero-shot prompt methods struggle with domain-specific generalization. ProReg addresses these limitations by effectively transferring knowledge from both pre-trained and downstream domains. |
ProReg combines a cross-entropy loss from ground-truth labels with a Kullback-Leibler loss between fine-tuned and prompt-based predictions. It introduces a sample-wise adaptive weight to dynamically balance the contribution of task-specific and pre-trained knowledge during training. |
ProReg consistently outperforms zero-shot prompt, conventional fine-tuning, and prompt tuning across various out-of-distribution benchmarks for image classification and visual question answering.
ProReg effectively mitigates biases from both pre-trained and downstream domains, achieving compelling performance in both out-of-distribution and in-distribution settings.
Ablation studies demonstrate the effectiveness of the sample-wise adaptive weight and the limitations of traditional knowledge distillation and model ensemble approaches. |
The performance of ProReg may be sensitive to the choice of prompt template.
The computational cost of ProReg is slightly higher than conventional fine-tuning due to the additional prompt prediction. |
prompt learning, fine-tuning, out-of-distribution generalization, vision-language models, knowledge distillation |
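The ProReg objective described above (a cross-entropy term on the ground-truth label plus a Kullback-Leibler term toward the prompt prediction, mixed by a per-sample weight) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the KL direction (prompt prediction as the reference distribution) is an assumption, and the weight is passed in as a plain number because the record does not give the formula for the sample-wise adaptive weight.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Negative log-probability of the ground-truth class."""
    return -math.log(probs[label])

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; q must be positive where p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def proreg_style_loss(finetuned_logits, prompt_logits, label, weight):
    """Mix CE (task knowledge) and KL (pretrained knowledge) for one sample.

    `weight` in [0, 1] stands in for the sample-wise adaptive trade-off;
    how it is computed is NOT specified here (assumption: supplied by caller).
    """
    p_finetuned = softmax(finetuned_logits)
    p_prompt = softmax(prompt_logits)
    ce = cross_entropy(p_finetuned, label)
    kl = kl_divergence(p_prompt, p_finetuned)  # assumed direction
    return weight * ce + (1.0 - weight) * kl
```

With `weight=1.0` the loss reduces to ordinary fine-tuning, and with `weight=0.0` it only pulls the fine-tuned prediction toward the zero-shot prompt prediction.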
2301.12276
Report |
ProtoSeg: Interpretable Semantic Segmentation with Prototypical Parts |
Mikołaj Sacha, Dawid Rymarczyk, Łukasz Struski, Jacek Tabor, Bartosz Zieliński |
We introduce ProtoSeg, a novel model for interpretable semantic image
segmentation, which constructs its predictions using similar patches from the
training set. To achieve accuracy comparable to baseline methods, we adapt the
mechanism of prototypical parts and introduce a diversity loss function that
increases the variety of prototypes within each class. We show that ProtoSeg
discovers semantic concepts, in contrast to standard segmentation models.
Experiments conducted on Pascal VOC and Cityscapes datasets confirm the
precision and transparency of the presented method. |
Presents ProtoSeg, a model for interpretable semantic segmentation using prototypical object parts. |
Most current semantic segmentation methods lack interpretability of their predictions, which is important for applications like autonomous vehicles and medical image analysis. |
The method uses a DeepLabv2 backbone with a novel Prototype Diversity Loss to ensure that prototypes learned for different object parts are semantically distinct. The model is evaluated on Cityscapes and PASCAL VOC 2012 datasets. |
The model achieves interpretable segmentation by identifying prototypical parts of objects.
The Prototype Diversity Loss successfully encourages diversity in learned prototypes.
While the method provides interpretability, it achieves lower mIoU than the baseline DeepLabv2 model. |
The model's precision is currently lower than state-of-the-art non-interpretable methods.
Future work includes improving precision, exploring different backbones (e.g., U-Net), and applying the method to other segmentation tasks. |
semantic segmentation, interpretability, prototypical parts, deep learning, computer vision |
2301.12257
Report |
Few-shot Face Image Translation via GAN Prior Distillation |
Ruoyu Zhao, Mingrui Zhu, Xiaoyu Wang, Nannan Wang |
Face image translation has made notable progress in recent years. However,
when training on limited data, the performance of existing approaches
significantly declines. Although some studies have attempted to tackle this
problem, they either failed to achieve the few-shot setting (less than 10) or
can only get suboptimal results. In this paper, we propose GAN Prior
Distillation (GPD) to enable effective few-shot face image translation. GPD
contains two models: a teacher network with GAN Prior and a student network
that fulfills end-to-end translation. Specifically, we adapt the teacher
network trained on large-scale data in the source domain to the target domain
with only a few samples, where it can learn the target domain's knowledge.
Then, we can achieve few-shot augmentation by generating source domain and
target domain images simultaneously with the same latent codes. We propose an
anchor-based knowledge distillation module that can fully use the difference
between the training and the augmented data to distill the knowledge of the
teacher network into the student network. The trained student network achieves
excellent generalization performance with the absorption of additional
knowledge. Qualitative and quantitative experiments demonstrate that our method
achieves results superior to state-of-the-art approaches in a few-shot
setting. |
This paper proposes GAN Prior Distillation (GPD), a novel framework for few-shot face image translation that leverages knowledge distillation from a teacher network pre-trained on large-scale datasets. |
Existing face image translation methods struggle with limited training data, especially in few-shot settings (less than 10 image pairs). GPD addresses this challenge by efficiently transferring knowledge from a pre-trained GAN to a smaller, faster translation network. |
GPD employs two main modules: (1) a few-shot generative augmentation module that adapts a pre-trained GAN to generate augmented image pairs for the target domain, and (2) an anchor-based knowledge distillation module that leverages the differences in realism between training data and augmented data to effectively distill knowledge into the student network. |
GPD significantly outperforms existing few-shot image translation methods in terms of visual quality and evaluation metrics, particularly in capturing complex styles.
The few-shot generative augmentation module effectively expands limited training data, leading to improved structural integrity and detail in translated images.
The anchor-based knowledge distillation module effectively mitigates overfitting and promotes generalization by strategically leveraging both training and augmented data. |
GPD's current reliance on StyleGAN, predominantly trained on face datasets, limits its applicability to other domains.
Future work will focus on extending GPD to broader image domains by exploring novel few-shot image generation models for diverse data types. |
face image translation, few-shot learning, generative adversarial networks, knowledge distillation, data augmentation |
2301.12247
Report |
SEGA: Instructing Text-to-Image Models using Semantic Guidance |
Manuel Brack, Felix Friedrich, Dominik Hintersdorf, Lukas Struppek, Patrick Schramowski, Kristian Kersting |
Text-to-image diffusion models have recently received a lot of interest for
their astonishing ability to produce high-fidelity images from text only.
However, achieving one-shot generation that aligns with the user's intent is
nearly impossible, yet small changes to the input prompt often result in very
different images. This leaves the user with little semantic control. To put the
user in control, we show how to interact with the diffusion process to flexibly
steer it along semantic directions. This semantic guidance (SEGA) generalizes
to any generative architecture using classifier-free guidance. More
importantly, it allows for subtle and extensive edits, changes in composition
and style, as well as optimizing the overall artistic conception. We
demonstrate SEGA's effectiveness on both latent and pixel-based diffusion
models such as Stable Diffusion, Paella, and DeepFloyd-IF using a variety of
tasks, thus providing strong evidence for its versatility, flexibility, and
improvements over existing methods. |
This paper introduces Semantic Guidance (SEGA), a novel method to exert fine-grained semantic control over image generation in diffusion models. |
Current text-to-image diffusion models lack granular control; small prompt tweaks yield drastically different images. SEGA addresses this by enabling subtle and extensive edits, compositional and stylistic changes, and artistic optimization. |
SEGA leverages classifier-free guidance, manipulating the noise estimates of diffusion models based on user-defined textual prompts. It identifies semantic directions within the noise-estimate space and guides the generation along these vectors. |
SEGA robustly incorporates concepts into images, demonstrated by successfully adding 'glasses' to diverse portraits.
Guidance vectors are unique and transferable, allowing a single calculated vector to be applied across multiple images.
The magnitude of the edit scales monotonically with the strength of the semantic guidance, offering intuitive control over the generation. |
Transferring guidance vectors across vastly different image compositions requires separate calculations.
The paper acknowledges potential biases inherited from the underlying diffusion model's training data. |
diffusion models, text-to-image generation, semantic control, image editing, generative ai |
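The idea of steering a classifier-free-guided noise estimate along an extra semantic direction, as described in the SEGA record above, can be sketched as follows. This is a deliberately simplified illustration under stated assumptions: the function name and default scales are hypothetical, noise estimates are plain lists of floats, and the sketch adds the edit direction to every element, omitting any element selection, warm-up, or momentum that a full method might use.

```python
def guided_noise(eps_uncond, eps_cond, eps_edit,
                 guidance_scale=7.5, edit_scale=5.0):
    """Combine three noise estimates into one guided estimate.

    eps_uncond: estimate with an empty prompt
    eps_cond:   estimate with the user's prompt
    eps_edit:   estimate with the edit concept (e.g. 'glasses')
    Both scales are illustrative defaults, not values from the paper.
    """
    out = []
    for u, c, e in zip(eps_uncond, eps_cond, eps_edit):
        cfg = u + guidance_scale * (c - u)      # standard classifier-free guidance
        out.append(cfg + edit_scale * (e - u))  # shift along the semantic direction
    return out
```

Setting `edit_scale=0.0` recovers plain classifier-free guidance, and a negative `edit_scale` pushes the sample away from the concept instead of toward it.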
2301.12141
Report |
What Decreases Editing Capability? Domain-Specific Hybrid Refinement for Improved GAN Inversion |
Pu Cao, Lu Yang, Dongxv Liu, Xiaoya Yang, Tianrui Huang, Qing Song |
Recently, inversion methods have focused on additional high-rate information
in the generator (e.g., weights or intermediate features) to refine inversion
and editing results from embedded latent codes. Although these techniques gain
reasonable improvement in reconstruction, they decrease editing capability,
especially on complex images (e.g., containing occlusions, detailed
backgrounds, and artifacts). The crux is refining inversion results without
degrading editing capability. To tackle this problem, we introduce
Domain-Specific Hybrid Refinement (DHR), which draws on the advantages and
disadvantages of two mainstream refinement techniques to maintain editing
ability with fidelity improvement. Specifically, we first propose
Domain-Specific Segmentation to segment images into two parts: in-domain and
out-of-domain parts. The refinement process aims to maintain the editability
for in-domain areas and improve two domains' fidelity. We refine these two
parts by weight modulation and feature modulation, which we call Hybrid
Modulation Refinement. Our proposed method is compatible with all latent code
embedding methods. Extensive experiments demonstrate that our approach achieves
state-of-the-art results in real image inversion and editing. Code is available at
https://github.com/caopulan/Domain-Specific_Hybrid_Refinement_Inversion. |
This paper introduces Domain-Specific Hybrid Refinement (DHR), a novel GAN inversion method that refines image inversion and editing by leveraging a hybrid approach of weight and feature modulation, addressing the issue of editing capability degradation in existing refinement techniques. |
Existing refinement methods for GAN inversion improve reconstruction fidelity but often sacrifice editing capability, especially for images with complex features. This paper addresses this by proposing DHR, a method that selectively refines different image domains to maintain a good balance between fidelity and editability. |
The proposed DHR method consists of two components: Domain-Specific Segmentation (DSS) and Hybrid Modulation Refinement (HMR). DSS automatically segments images into easy-to-invert 'in-domain' parts and challenging 'out-of-domain' parts without requiring data annotation. HMR then applies weight modulation to the 'in-domain' areas for better editing capability and feature modulation to 'out-of-domain' areas for accurate detail reconstruction. |
DHR achieves state-of-the-art performance in quantitative metrics, surpassing existing methods in MSE, LPIPS, and identity similarity.
Qualitative results demonstrate DHR's ability to preserve image details during both inversion and editing, leading to more faithful and photorealistic results.
User studies confirm the superiority of DHR, with users showing a strong preference for DHR results over existing methods in terms of both inversion quality and editing realism. |
The method currently focuses on the face domain, and future work could explore its generalization to other image domains.
The runtime of DHR, though significantly faster than some baselines, can be further improved for real-time applications. |
gan inversion, image editing, weight modulation, feature modulation, domain-specific segmentation |
2301.12025
Report |
Cross-Architectural Positive Pairs improve the effectiveness of Self-Supervised Learning |
Pranav Singh, Jacopo Cirrone |
Existing self-supervised techniques have extreme computational requirements
and suffer a substantial drop in performance with a reduction in batch size or
pretraining epochs. This paper presents Cross Architectural - Self Supervision
(CASS), a novel self-supervised learning approach that leverages Transformer
and CNN simultaneously. Compared to the existing state-of-the-art
self-supervised learning approaches, we empirically show that CASS-trained CNNs
and Transformers across four diverse datasets gained an average of 3.8% with 1%
labeled data, 5.9% with 10% labeled data, and 10.13% with 100% labeled data
while taking 69% less time. We also show that CASS is much more robust to
changes in batch size and training epochs than existing state-of-the-art
self-supervised learning approaches. We have open-sourced our code at
https://github.com/pranavsinghps1/CASS. |
This paper introduces CASS (Cross-Architectural Self-Supervision), a new self-supervised learning approach that uses both CNNs and Transformers to learn better data representations, particularly beneficial for medical image analysis where data is often limited. |
Existing self-supervised methods require large datasets and significant computational resources, hindering their application in medical imaging where data and computational power are often limited. CASS aims to address these limitations. |
CASS leverages the inherent architectural differences between CNNs and Transformers to create positive pairs from the same input image. It minimizes the cosine similarity loss between the logits of the two architectures, encouraging them to learn from each other. |
CASS outperforms the state-of-the-art self-supervised method DINO on four medical imaging datasets, achieving an average improvement of 3.8% with 1% labeled data, 5.9% with 10% labeled data, and 10.13% with 100% labeled data.
CASS is more robust to changes in batch size and training epochs compared to DINO.
CASS is computationally more efficient than DINO, taking 69% less time to train on the same hardware. |
CASS's performance has not been extensively evaluated on large-scale natural image datasets.
At inference time, without ground-truth labels, it's unclear whether to choose the CNN or the Transformer arm of CASS for optimal performance. |
self-supervised learning, medical image analysis, cnn, transformer, limited data |
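The CASS record above describes a cosine-similarity loss between the logits that a CNN and a Transformer produce for the same image. A minimal sketch of such a loss is given below. The function names are hypothetical and the exact form (`1 - cos`) is an assumption made for illustration; the paper's scaling and normalisation may differ.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cross_architecture_loss(cnn_logits, transformer_logits):
    """Loss that is 0 when both arms agree in direction and 2 when opposite.

    The two logit vectors come from different architectures fed the SAME
    image, so they form a positive pair without extra augmentation.
    """
    return 1.0 - cosine_similarity(cnn_logits, transformer_logits)
```

Because the loss depends only on the direction of the logit vectors, the two arms are free to use different output scales.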
2301.11706
Report |
Input Perturbation Reduces Exposure Bias in Diffusion Models |
Mang Ning, Enver Sangineto, Angelo Porrello, Simone Calderara, Rita Cucchiara |
Denoising Diffusion Probabilistic Models have shown an impressive generation
quality, although their long sampling chain leads to high computational costs.
In this paper, we observe that a long sampling chain also leads to an error
accumulation phenomenon, which is similar to the exposure bias problem in
autoregressive text generation. Specifically, we note that there is a
discrepancy between training and testing, since the former is conditioned on
the ground truth samples, while the latter is conditioned on the previously
generated results. To alleviate this problem, we propose a very simple but
effective training regularization, consisting in perturbing the ground truth
samples to simulate the inference time prediction errors. We empirically show
that, without affecting the recall and precision, the proposed input
perturbation leads to a significant improvement in the sample quality while
reducing both the training and the inference times. For instance, on CelebA
64$\times$64, we achieve a new state-of-the-art FID score of 1.27, while saving
37.5% of the training time. The code is publicly available at
https://github.com/forever208/DDPM-IP |
The paper proposes DDPM-IP, a novel training regularization method for Denoising Diffusion Probabilistic Models (DDPMs) using input perturbation to address the exposure bias problem. |
Existing DDPMs suffer from exposure bias due to a discrepancy between training (conditioned on ground truth) and inference (conditioned on previous predictions), leading to error accumulation and suboptimal generation quality. |
DDPM-IP perturbs the ground truth input during training with Gaussian noise, simulating inference-time prediction errors and encouraging the model to learn a smoother prediction function. The method requires no changes to network architecture or loss function. |
DDPM-IP consistently outperforms the baseline ADM model in terms of FID and sFID scores across various datasets (CIFAR10, ImageNet 32x32, LSUN tower 64x64, CelebA 64x64, FFHQ 128x128).
The method demonstrates significant acceleration in both training (converging faster) and inference (achieving comparable or better results with fewer sampling steps).
DDPM-IP does not negatively impact sample diversity, as evidenced by similar recall and precision scores compared to the baseline. |
Experiments are limited to datasets with small resolution images due to the computational demands of training DDPMs.
Future work includes exploring the effectiveness of DDPM-IP on higher resolution images and investigating dataset-specific tuning of the noise hyperparameter. |
denoising diffusion probabilistic models, exposure bias, input perturbation, image generation, generative models |
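The input perturbation described in the DDPM-IP record above can be sketched as a small change to the usual forward-noising step: extra Gaussian noise is added to the network input while the regression target stays the original noise. The sketch below is an assumption-laden illustration rather than the released code; the function name, the `gamma` default, and the list-of-floats data layout are all hypothetical.

```python
import math
import random

def perturbed_training_input(x0, alpha_bar_t, gamma=0.1, rng=None):
    """Return (network_input, target_noise) for one training sample.

    x0:          clean sample as a list of floats
    alpha_bar_t: cumulative signal coefficient at step t, in [0, 1]
    gamma:       perturbation strength (illustrative default, an assumption)
    """
    rng = rng or random.Random()
    eps = [rng.gauss(0.0, 1.0) for _ in x0]  # noise the network must predict
    xi = [rng.gauss(0.0, 1.0) for _ in x0]   # extra perturbation, never predicted
    a = math.sqrt(alpha_bar_t)
    s = math.sqrt(1.0 - alpha_bar_t)
    # Standard DDPM would use a*x + s*e; here the noise seen by the network is
    # e + gamma*z, which mimics inference-time prediction error.
    y_t = [a * x + s * (e + gamma * z) for x, e, z in zip(x0, eps, xi)]
    return y_t, eps
```

With `gamma=0.0` this is the ordinary DDPM forward process, which makes the perturbation easy to ablate; the network architecture and loss are untouched, in line with the record.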
2301.11699
Report |
Image Restoration with Mean-Reverting Stochastic Differential Equations |
Ziwei Luo, Fredrik K. Gustafsson, Zheng Zhao, Jens Sjölund, Thomas B. Schön |
This paper presents a stochastic differential equation (SDE) approach for
general-purpose image restoration. The key construction consists in a
mean-reverting SDE that transforms a high-quality image into a degraded
counterpart as a mean state with fixed Gaussian noise. Then, by simulating the
corresponding reverse-time SDE, we are able to restore the origin of the
low-quality image without relying on any task-specific prior knowledge.
Crucially, the proposed mean-reverting SDE has a closed-form solution, allowing
us to compute the ground truth time-dependent score and learn it with a neural
network. Moreover, we propose a maximum likelihood objective to learn an
optimal reverse trajectory that stabilizes the training and improves the
restoration results. The experiments show that our proposed method achieves
highly competitive performance in quantitative comparisons on image deraining,
deblurring, and denoising, setting a new state-of-the-art on two deraining
datasets. Finally, the general applicability of our approach is further
demonstrated via qualitative results on image super-resolution, inpainting, and
dehazing. Code is available at
https://github.com/Algolzw/image-restoration-sde. |
This paper introduces Image Restoration Stochastic Differential Equation (IR-SDE), a novel approach for image restoration leveraging a mean-reverting SDE to model image degradation as a diffusion process. |
Current diffusion models for image restoration often struggle to accurately restore ground truth details, especially when initialized with high-variance noise. IR-SDE tackles this by directly modeling the degradation process, leading to improved fidelity. |
The method uses a mean-reverting SDE with a closed-form solution to represent the degradation from high-quality to low-quality images. A neural network learns the time-dependent score function of this SDE using a proposed maximum likelihood objective for improved training stability. |
IR-SDE achieves state-of-the-art performance on deraining benchmarks, outperforming existing methods in both quantitative metrics and perceptual quality.
The method demonstrates strong performance on deblurring, surpassing GAN-based methods in perceptual scores while maintaining consistency with ground truths.
The authors further showcase the versatility of IR-SDE by successfully applying it to super-resolution, inpainting, and dehazing tasks. |
The exponential term in the variance calculation leads to overly smooth changes in the final steps, potentially hindering learning in that stage. Future work will explore alternative schedules to mitigate this.
The iterative nature of reverse SDE simulation increases computational cost during inference. Exploring optimization techniques for faster inference is a potential future direction. |
image restoration, stochastic differential equations, diffusion models, mean-reverting process, maximum likelihood |
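To make the "mean-reverting SDE with a closed-form solution" in the IR-SDE record above concrete, the sketch below uses the textbook Ornstein-Uhlenbeck process dx = theta (mu - x) dt + sigma dw with constant coefficients. This is a simplification chosen for illustration: the paper uses time-dependent schedules, and the function and parameter names here are hypothetical. In the restoration setting, `x0` plays the role of a high-quality pixel value and `mu` the corresponding degraded value.

```python
import math

def ou_marginal(x0, mu, theta, sigma, t):
    """Mean and variance of x(t) for dx = theta*(mu - x) dt + sigma dw, x(0)=x0.

    Closed form of the constant-coefficient Ornstein-Uhlenbeck process:
      mean(t) = mu + (x0 - mu) * exp(-theta * t)
      var(t)  = sigma^2 / (2 * theta) * (1 - exp(-2 * theta * t))
    """
    decay = math.exp(-theta * t)
    mean = mu + (x0 - mu) * decay
    var = sigma ** 2 / (2.0 * theta) * (1.0 - decay ** 2)
    return mean, var
```

At `t = 0` the marginal is a point mass at `x0`; as `t` grows the mean reverts to `mu` and the variance saturates at `sigma**2 / (2 * theta)`, matching the description of a degraded mean state with fixed Gaussian noise.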
2301.11558
Report |
Accelerating Guided Diffusion Sampling with Splitting Numerical Methods |
Suttisak Wizadwongsa, Supasorn Suwajanakorn |
Guided diffusion is a technique for conditioning the output of a diffusion
model at sampling time without retraining the network for each specific task.
One drawback of diffusion models, however, is their slow sampling process.
Recent techniques can accelerate unguided sampling by applying high-order
numerical methods to the sampling process when viewed as differential
equations. In contrast, we find that the same techniques do not work
for guided sampling, and little has been explored about its acceleration. This
paper explores the culprit of this problem and provides a solution based on
operator splitting methods, motivated by our key finding that classical
high-order numerical methods are unsuitable for the conditional function. Our
proposed method can re-utilize the high-order methods for guided sampling and
can generate images with the same quality as a 250-step DDIM baseline using
32-58% less sampling time on ImageNet256. We also demonstrate usage on a wide
variety of conditional generation tasks, such as text-to-image generation,
colorization, inpainting, and super-resolution. |
This paper proposes a solution based on operator splitting methods for accelerating guided diffusion sampling, which was previously difficult to achieve with high-order numerical methods. |
Guided diffusion models are powerful but suffer from slow sampling speed. Accelerating guided sampling enables faster generation of high-quality images for various conditional generation tasks. |
The authors first analyze the guided ODE and identify that the conditional term is the culprit behind the failure of classical high-order methods. They then apply splitting methods, namely Lie-Trotter and Strang splitting, to separate the diffusion and condition subproblems and solve them with different numerical methods. |
Only the conditional subproblem is incompatible with classical high-order numerical methods.
Strang splitting combined with 4th-order PLMS for diffusion and 1st-order PLMS for condition achieves the best performance.
The proposed method is 32-58% faster than a 250-step DDIM baseline while maintaining similar image quality on various tasks like text-to-image generation, inpainting, colorization, and super-resolution. |
The paper's findings are based on existing models and a specific sigma schedule; further investigation is needed for other schedules and models.
Improving the behavior of the conditional function itself might be a promising future direction. |
guided diffusion, sampling acceleration, operator splitting, numerical methods, conditional image generation |
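The Lie-Trotter and Strang splitting schemes named in the record above can be illustrated on a toy ODE. The sketch below splits dx/dt = -x + 1 into a decay subproblem and a constant-drift subproblem, each with an exact flow; these toy flows and all function names are assumptions for illustration and stand in for the diffusion and condition subproblems, which the paper solves with PLMS-type methods.

```python
import math

def lie_trotter_step(x, h, flow_a, flow_b):
    """First-order splitting: full step of A, then full step of B."""
    return flow_b(flow_a(x, h), h)

def strang_step(x, h, flow_a, flow_b):
    """Second-order splitting: half step of A, full step of B, half step of A."""
    x = flow_a(x, 0.5 * h)
    x = flow_b(x, h)
    return flow_a(x, 0.5 * h)

def integrate(step, x0, t_end, n_steps, flow_a, flow_b):
    """Apply a splitting step n_steps times with a uniform step size."""
    h = t_end / n_steps
    x = x0
    for _ in range(n_steps):
        x = step(x, h, flow_a, flow_b)
    return x

def decay_flow(x, h):
    """Exact flow of dx/dt = -x over time h."""
    return x * math.exp(-h)

def drift_flow(x, h):
    """Exact flow of dx/dt = 1 over time h."""
    return x + h
```

For example, `integrate(strang_step, 0.0, 1.0, 10, decay_flow, drift_flow)` lands closer to the exact value `1 - exp(-1)` than the Lie-Trotter variant with the same step count, which is the usual accuracy argument for Strang splitting.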
2301.11326
Report |
Unsupervised Volumetric Animation |
Aliaksandr Siarohin, Willi Menapace, Ivan Skorokhodov, Kyle Olszewski, Jian Ren, Hsin-Ying Lee, Menglei Chai, Sergey Tulyakov |
We propose a novel approach for unsupervised 3D animation of non-rigid
deformable objects. Our method learns the 3D structure and dynamics of objects
solely from single-view RGB videos, and can decompose them into semantically
meaningful parts that can be tracked and animated. Using a 3D autodecoder
framework, paired with a keypoint estimator via a differentiable PnP algorithm,
our model learns the underlying object geometry and parts decomposition in an
entirely unsupervised manner. This allows it to perform 3D segmentation, 3D
keypoint estimation, novel view synthesis, and animation. We primarily evaluate
the framework on two video datasets: VoxCeleb $256^2$ and TEDXPeople $256^2$.
In addition, on the Cats $256^2$ image dataset, we show it even learns
compelling 3D geometry from still images. Finally, we show our model can obtain
animatable 3D objects from a single image or a few images. Code and visual results
are available on our project website:
https://snap-research.github.io/unsupervised-volumetric-animation . |
This paper presents the first unsupervised method for 3D animation of non-rigid objects from single-view videos. |
Unsupervised 3D animation enables animating arbitrary objects in a 3D-consistent manner without requiring expensive 3D supervision or limiting animation to predefined object categories. |
The proposed approach learns a canonical 3D volumetric representation of objects and their segmentation into parts, using a differentiable PnP algorithm to estimate part poses from 2D keypoints, enabling deformation and animation. |
The method learns high-quality 3D geometry and part decompositions from unconstrained videos, outperforming 3D-GANs in novel view synthesis quality without requiring camera supervision.
Quantitative and qualitative comparisons on face and body datasets demonstrate superior novel view synthesis and comparable animation quality to state-of-the-art unsupervised 2D animation methods.
The approach generalizes to unseen objects, enabling animation and novel view synthesis from a single or a few images, highlighting its potential for diverse applications. |
Limitations include a fixed voxel grid resolution that may lead to artifacts during large camera movements and reliance on an optimization-based embedding procedure during inference.
Future work includes exploring alternative 3D representations for higher resolution and efficiency, and incorporating techniques for improving the quality and speed of inference for unseen objects. |
unsupervised learning, 3d animation, novel view synthesis, volumetric rendering, computer vision |
2301.11116
Report |
Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring |
Ruyang Liu, Jingjia Huang, Ge Li, Jiashi Feng, Xinglong Wu, Thomas H. Li |
Image-text pretrained models, e.g., CLIP, have shown impressive general
multi-modal knowledge learned from large-scale image-text data pairs, thus
attracting increasing attention for their potential to improve visual
representation learning in the video domain. In this paper, based on the CLIP
model, we revisit temporal modeling in the context of image-to-video knowledge
transferring, which is the key point for extending image-text pretrained models
to the video domain. We find that current temporal modeling mechanisms are
tailored to either high-level semantic-dominant tasks (e.g., retrieval) or
low-level visual pattern-dominant tasks (e.g., recognition), and fail to work
on the two cases simultaneously. The key difficulty lies in modeling temporal
dependency while taking advantage of both high-level and low-level knowledge in
CLIP model. To tackle this problem, we present Spatial-Temporal Auxiliary
Network (STAN) -- a simple and effective temporal modeling mechanism extending
CLIP model to diverse video tasks. Specifically, to realize both low-level and
high-level knowledge transferring, STAN adopts a branch structure with
decomposed spatial-temporal modules that enable multi-level CLIP features to be
spatial-temporally contextualized. We evaluate our method on two representative
video tasks: Video-Text Retrieval and Video Recognition. Extensive experiments
demonstrate the superiority of our model over the state-of-the-art methods on
various datasets, including MSR-VTT, DiDeMo, LSMDC, MSVD, Kinetics-400, and
Something-Something-V2. Code will be available at
https://github.com/farewellthree/STAN |
This paper revisits temporal modeling in CLIP-based image-to-video knowledge transferring and proposes Spatial-Temporal Auxiliary Network (STAN), a new branch structure for effective knowledge transfer to diverse video tasks. |
Extending image-text pretrained models like CLIP to the video domain is important but challenging. Existing temporal modeling methods fail to effectively transfer both high-level semantic and low-level visual knowledge from CLIP. |
STAN augments video frame features with spatial-temporal contexts at different CLIP output levels without altering CLIP's structure. It utilizes a branch structure with decomposed spatial-temporal modules for multi-level feature contextualization, and explores self-attention and 3D convolution based cross-frame modules. |
STAN outperforms state-of-the-art methods on video-text retrieval benchmarks like MSR-VTT, DiDeMo, and LSMDC.
STAN achieves competitive performance on video recognition tasks, including Kinetics-400 and Something-Something-V2.
Ablation studies confirm the contribution of each component in STAN and demonstrate the importance of multi-level feature learning and the branch structure. |
The performance of STAN on small-scale video-text retrieval datasets with limited training data is less competitive.
Exploring the compatibility of STAN with other advanced techniques like hierarchical video-text interaction and hard sample modeling is left for future work. |
video-text retrieval, video recognition, clip, knowledge transfer, temporal modeling |
2301.10972
Report |
On the Importance of Noise Scheduling for Diffusion Models |
Ting Chen |
We empirically study the effect of noise scheduling strategies for denoising
diffusion generative models. There are three findings: (1) the noise scheduling
is crucial for the performance, and the optimal one depends on the task (e.g.,
image sizes), (2) when increasing the image size, the optimal noise scheduling
shifts towards a noisier one (due to increased redundancy in pixels), and (3)
simply scaling the input data by a factor of $b$ while keeping the noise
schedule function fixed (equivalent to shifting the logSNR by $\log b$) is a
good strategy across image sizes. This simple recipe, when combined with
recently proposed Recurrent Interface Network (RIN), yields state-of-the-art
pixel-based diffusion models for high-resolution images on ImageNet, enabling
single-stage, end-to-end generation of diverse and high-fidelity images at
1024$\times$1024 resolution (without upsampling/cascades). |
This paper investigates the impact of noise scheduling strategies on the performance of denoising diffusion generative models for image generation. |
Noise scheduling is crucial for diffusion model performance, impacting the distribution of noise levels learned by the model. Optimal scheduling varies across tasks and image resolutions due to differing data redundancy and information density. |
The authors systematically explore two noise scheduling strategies: 1) Adjusting the noise schedule function (cosine, sigmoid, linear) and 2) Scaling the input data. They train and evaluate their methods on class-conditional ImageNet image generation at varying resolutions, using FID and Inception Score as metrics. |
Optimal noise scheduling is task- and resolution-dependent.
Scaling the input data is a simple yet effective strategy for adjusting noise scheduling, outperforming adjustments to the schedule function.
Combining input scaling with the Recurrent Interface Network (RIN) architecture achieves state-of-the-art single-stage, high-resolution image generation on ImageNet. |
The study focuses primarily on pixel-based diffusion models; the approach has not been evaluated on latent diffusion models.
Further exploration of hyperparameter tuning for high-resolution images may yield additional performance improvements. |
denoising diffusion models, noise scheduling, image generation, recurrent interface network, high-resolution images |
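The input-scaling recipe in the record above can be illustrated numerically. The following is a minimal sketch, not the paper's code: it assumes a simplified cosine schedule `gamma(t) = cos^2(pi * t / 2)` and a forward process `x_t = sqrt(gamma) * b * x_0 + sqrt(1 - gamma) * eps`, and all function names are hypothetical. It measures the log of the amplitude signal-to-noise ratio, under which scaling the input by `b` shifts the curve by `log b` at every `t`; under a power-ratio definition the same shift would read `2 log b`.

```python
import numpy as np

def gamma_cosine(t):
    """Simplified cosine schedule (assumed form): gamma(t) = cos^2(pi * t / 2)."""
    return np.cos(0.5 * np.pi * np.asarray(t, dtype=float)) ** 2

def add_noise(x0, t, b, rng):
    """Noise an input scaled by b while keeping the schedule function fixed."""
    g = gamma_cosine(t)
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(g) * b * x0 + np.sqrt(1.0 - g) * eps

def log_amplitude_snr(t, b=1.0):
    """Log ratio of signal amplitude (b * sqrt(gamma)) to noise amplitude."""
    g = gamma_cosine(t)
    return np.log(b * np.sqrt(g)) - np.log(np.sqrt(1.0 - g))

# Avoid the endpoints t=0 and t=1, where the log ratio diverges.
t = np.linspace(0.05, 0.95, 10)
shift = log_amplitude_snr(t, b=0.5) - log_amplitude_snr(t, b=1.0)
noisy = add_noise(np.ones((4, 4)), t=0.5, b=0.5, rng=np.random.default_rng(0))
```

Because the shift does not depend on `t`, a scale factor `b < 1` makes every step noisier without touching the schedule function itself.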
2301.10941
Report |
GeCoNeRF: Few-shot Neural Radiance Fields via Geometric Consistency |
Min-seop Kwak, Jiuhn Song, Seungryong Kim |
We present a novel framework to regularize Neural Radiance Field (NeRF) in a
few-shot setting with a geometry-aware consistency regularization. The proposed
approach leverages a rendered depth map at unobserved viewpoint to warp sparse
input images to the unobserved viewpoint and impose them as pseudo ground
truths to facilitate learning of NeRF. By encouraging such geometry-aware
consistency at a feature-level instead of using pixel-level reconstruction
loss, we regularize the NeRF at semantic and structural levels while allowing
for modeling view dependent radiance to account for color variations across
viewpoints. We also propose an effective method to filter out erroneous warped
solutions, along with training strategies to stabilize training during
optimization. We show that our model achieves competitive results compared to
state-of-the-art few-shot NeRF models. Project page is available at
https://ku-cvlab.github.io/GeCoNeRF/. |
This paper introduces GeCoNeRF, a novel framework that utilizes geometric consistency to enhance the performance of Neural Radiance Fields (NeRF) in few-shot novel view synthesis. |
NeRF struggles in few-shot settings due to overfitting sparse input images and failing to reconstruct accurate geometry. This work addresses this limitation by introducing geometric constraints to regularize NeRF. |
GeCoNeRF warps sparse input images to novel viewpoints guided by the depth rendered by NeRF. By enforcing consistency between warped images and those rendered at novel viewpoints, GeCoNeRF regularizes both geometry and appearance. It utilizes feature-level regularization to handle view-dependent radiance effects and employs occlusion masking to filter out erroneous warpings. |
GeCoNeRF achieves competitive results compared to state-of-the-art few-shot NeRF models, as demonstrated on synthetic and real datasets.
The proposed method effectively captures fine details and reduces artifacts in few-shot scenarios.
Ablation studies validate the contribution of each component, including feature-level consistency loss, occlusion masking, and progressive training strategies. |
The thresholding technique used for occlusion handling can be sensitive to different scenes and datasets.
The assumption of surface coverage by multiple viewpoints may not always hold, leading to unnecessary computational costs. |
neural radiance fields, nerf, few-shot learning, novel view synthesis, geometric consistency |
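The depth-guided warping described in the record above rests on a standard unproject-and-reproject step. The sketch below is illustrative only and assumes a pinhole camera; the intrinsics matrix `K`, the pose `T_src_from_tgt`, and the function name are hypothetical and not taken from the paper. For each pixel of the unobserved (target) view, it uses the rendered depth to find where that pixel lands in a source image.

```python
import numpy as np

def warp_coords(depth, K, T_src_from_tgt):
    """Return (H, W, 2) source-image (u, v) coordinates for every target pixel.

    depth: (H, W) depth rendered at the target view.
    K: (3, 3) pinhole intrinsics shared by both views (assumption).
    T_src_from_tgt: (4, 4) rigid transform from target to source camera frame.
    """
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pix @ np.linalg.inv(K).T            # back-project pixels to unit-depth rays
    pts = rays * depth.reshape(-1, 1)          # 3D points in the target camera frame
    pts_h = np.concatenate([pts, np.ones((pts.shape[0], 1))], axis=1)
    src = (pts_h @ T_src_from_tgt.T)[:, :3]    # same points in the source camera frame
    proj = src @ K.T                           # project into the source image
    uv = proj[:, :2] / proj[:, 2:3]
    return uv.reshape(h, w, 2)

K = np.array([[50.0, 0.0, 4.0], [0.0, 50.0, 3.0], [0.0, 0.0, 1.0]])
depth = np.full((6, 8), 2.0)
uv_identity = warp_coords(depth, K, np.eye(4))
```

Sampling the source image at these coordinates would give the warped pseudo ground truth; the occlusion masking mentioned in the record is not modeled here.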
2301.10916
Report |
ITstyler: Image-optimized Text-based Style Transfer |
Yunpeng Bai, Jiayue Liu, Chao Dong, Chun Yuan |
Text-based style transfer is a newly-emerging research topic that uses text
information instead of style image to guide the transfer process, significantly
extending the application scenario of style transfer. However, previous methods
require extra time for optimization or text-image paired data, leading to
limited effectiveness. In this work, we achieve a data-efficient text-based
style transfer method that does not require optimization at the inference
stage. Specifically, we convert text input to the style space of the
pre-trained VGG network to realize a more effective style swap. We also
leverage CLIP's multi-modal embedding space to learn the text-to-style mapping
with the image dataset only. Our method can transfer arbitrary new styles of
text input in real-time and synthesize high-quality artistic images. |
This paper presents ITstyler, a data-efficient text-based style transfer method that combines the style representation capability of VGG features with the multi-modal embedding space of CLIP to transfer arbitrary new styles from text input to artistic images in real time. |
Existing text-based style transfer methods suffer from limitations like requiring extra optimization time, relying on text-image paired data, or lacking effectiveness in finding a suitable style representation space. |
The method adapts CLIP's embedding space to the style space of a pre-trained VGG network. It trains a mapping network to convert text embeddings from CLIP's text encoder into style representations in VGG's feature space. This allows for efficient style swapping using a modified AdaIN operation. |
ITstyler generates stylized images that better match the text description and are more artistically pleasing compared to previous methods.
User study confirms a strong preference for ITstyler's results.
The method exhibits superior speed, enabling real-time performance for arbitrary style transfer. |
The method might not effectively convert specific words (e.g., names) into artistic styles.
Future work can explore incorporating more complex language understanding to further improve the mapping from text to style representations. |
style transfer, text-based style transfer, clip, adain, vgg |
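The record above mentions style swapping through a modified AdaIN operation. The modification itself is not described here, so the sketch below shows only standard adaptive instance normalization, with the style statistics passed in directly; in a text-driven setting they would presumably come from a mapping network, which is an assumption and is not modeled. The function name is hypothetical.

```python
import numpy as np

def adain(content, style_mean, style_std, eps=1e-5):
    """Standard AdaIN: re-normalize each content channel to the given style statistics.

    content: (C, H, W) feature map; style_mean, style_std: (C,) per-channel targets.
    """
    mu = content.mean(axis=(1, 2), keepdims=True)
    sigma = content.std(axis=(1, 2), keepdims=True)
    normalized = (content - mu) / (sigma + eps)
    return normalized * style_std[:, None, None] + style_mean[:, None, None]

rng = np.random.default_rng(0)
content = rng.normal(size=(3, 8, 8))
style_mean = np.array([0.5, -1.0, 2.0])
style_std = np.array([1.0, 0.3, 2.0])
out = adain(content, style_mean, style_std)
```

After the call, each channel of `out` carries the requested mean and (up to `eps`) standard deviation while keeping the spatial structure of the content features.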
2301.10670
Report |
Towards Arbitrary Text-driven Image Manipulation via Space Alignment |
Yunpeng Bai, Zihan Zhong, Chao Dong, Weichen Zhang, Guowei Xu, Chun Yuan |
The recent GAN inversion methods have been able to successfully invert the
real image input to the corresponding editable latent code in StyleGAN. By
combining with the language-vision model (CLIP), some text-driven image
manipulation methods are proposed. However, these methods require extra costs
to perform optimization for a certain image or a new attribute editing mode. To
achieve a more efficient editing method, we propose a new Text-driven image
Manipulation framework via Space Alignment (TMSA). The Space Alignment module
aims to align the same semantic regions in CLIP and StyleGAN spaces. Then, the
text input can be directly accessed into the StyleGAN space and be used to find
the semantic shift according to the text description. The framework can support
arbitrary image editing mode without additional cost. Our work provides the
user with an interface to control the attributes of a given image according to
text input and get the result in real time. Extensive experiments demonstrate
our superior performance over prior works. |
This paper presents TMSA, a novel text-driven image manipulation framework that aligns semantic regions between CLIP and StyleGAN latent spaces, allowing arbitrary text input to control image attributes in real-time. |
Existing text-driven image manipulation methods require expensive optimization or training for each image or attribute, limiting flexibility and efficiency. TMSA aims to overcome these limitations. |
A mapping network is trained to align image embeddings from CLIP with corresponding latent codes in StyleGAN's W+ space. The network is further fine-tuned using generated images for in-domain adjustment and adapted for specific inversion encoders. |
TMSA enables accurate attribute editing from arbitrary text descriptions while preserving image identity, outperforming previous methods qualitatively and quantitatively.
The Space Alignment module effectively maps text embeddings to precise semantic locations in the StyleGAN latent space.
TMSA is adaptable to different inversion methods and can be combined with them to achieve high-fidelity image editing. |
The reliance on pre-trained CLIP and StyleGAN models might limit the scope of editable attributes.
Future work can explore extending TMSA for more complex manipulation tasks, such as multi-attribute editing or text-guided image generation. |
image manipulation, text-driven editing, generative adversarial networks, clip, stylegan |
2301.10241
Report |
K-Planes: Explicit Radiance Fields in Space, Time, and Appearance |
Sara Fridovich-Keil, Giacomo Meanti, Frederik Warburg, Benjamin Recht, Angjoo Kanazawa |
We introduce k-planes, a white-box model for radiance fields in arbitrary
dimensions. Our model uses d choose 2 planes to represent a d-dimensional
scene, providing a seamless way to go from static (d=3) to dynamic (d=4)
scenes. This planar factorization makes adding dimension-specific priors easy,
e.g. temporal smoothness and multi-resolution spatial structure, and induces a
natural decomposition of static and dynamic components of a scene. We use a
linear feature decoder with a learned color basis that yields similar
performance as a nonlinear black-box MLP decoder. Across a range of synthetic
and real, static and dynamic, fixed and varying appearance scenes, k-planes
yields competitive and often state-of-the-art reconstruction fidelity with low
memory usage, achieving 1000x compression over a full 4D grid, and fast
optimization with a pure PyTorch implementation. For video results and code,
please see https://sarafridov.github.io/K-Planes. |
This paper introduces k-planes, an explicit, interpretable, and memory-efficient model for representing radiance fields in arbitrary dimensions. |
Existing methods for dynamic radiance fields, which require representing 4D volumes, are either memory inefficient or rely on black-box components like MLPs. K-planes offers a solution that is both memory efficient and interpretable. |
K-planes factorizes a d-dimensional scene into (d choose 2) planes, representing every pair of dimensions. Each plane stores features that are then combined using the Hadamard product and decoded into color and density using either a linear decoder with a learned color basis (explicit model) or an MLP (hybrid model). |
K-planes achieves competitive and often state-of-the-art reconstruction fidelity on a variety of static and dynamic scenes, including those with varying appearance.
The model is compact, achieving 1000x compression over a full 4D grid.
K-planes allows for fast optimization using a pure PyTorch implementation. |
The resolution of reconstructions for scenes with varying appearance is slightly lower than NeRF-W.
Future work could explore extending the planar factorization to efficiently incorporate higher-order interactions between dimensions. |
radiance fields, neural rendering, planar factorization, dynamic scenes, explicit representation |
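The planar factorization in the record above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the released implementation: it uses a single resolution, random features, and nearest-neighbor lookup as simplifications, and the names `make_planes` and `query` are hypothetical. It shows that a d-dimensional scene gets one plane per pair of axes and that the per-plane features are combined with an elementwise (Hadamard) product.

```python
import itertools
import numpy as np

def make_planes(d, resolution, channels, rng):
    """One (resolution x resolution x channels) feature grid per pair of axes."""
    return {pair: rng.normal(size=(resolution, resolution, channels))
            for pair in itertools.combinations(range(d), 2)}

def query(planes, point, resolution):
    """Combine plane features for a point with coordinates in [0, 1)."""
    feat = None
    for (i, j), grid in planes.items():
        # Nearest-neighbor lookup (simplification assumed for this sketch).
        u = min(int(point[i] * resolution), resolution - 1)
        v = min(int(point[j] * resolution), resolution - 1)
        f = grid[u, v]
        feat = f if feat is None else feat * f  # Hadamard product across planes
    return feat

rng = np.random.default_rng(0)
static_planes = make_planes(3, resolution=16, channels=8, rng=rng)   # xy, xz, yz
dynamic_planes = make_planes(4, resolution=16, channels=8, rng=rng)  # adds xt, yt, zt
feature = query(dynamic_planes, point=(0.2, 0.5, 0.9, 0.1), resolution=16)
```

The resulting feature vector would then be passed to a decoder (linear or MLP, per the record) to obtain color and density.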
2301.09879
Report |
Data Augmentation Alone Can Improve Adversarial Training |
Lin Li, Michael Spratling |
Adversarial training suffers from the issue of robust overfitting, which
seriously impairs its generalization performance. Data augmentation, which is
effective at preventing overfitting in standard training, has been observed by
many previous works to be ineffective in mitigating overfitting in adversarial
training. This work proves that, contrary to previous findings, data
augmentation alone can significantly boost accuracy and robustness in
adversarial training. We find that the hardness and the diversity of data
augmentation are important factors in combating robust overfitting. In general,
diversity can improve both accuracy and robustness, while hardness can boost
robustness at the cost of accuracy within a certain limit and degrade them both
over that limit. To mitigate robust overfitting, we first propose a new crop
transformation, Cropshift, which has improved diversity compared to the
conventional one (Padcrop). We then propose a new data augmentation scheme,
based on Cropshift, with much improved diversity and well-balanced hardness.
Empirically, our augmentation method achieves the state-of-the-art accuracy and
robustness for data augmentations in adversarial training. Furthermore, when
combined with weight averaging it matches, or even exceeds, the performance of
the best contemporary regularization methods for alleviating robust
overfitting. Code is available at:
https://github.com/TreeLLi/DA-Alone-Improves-AT. |
This paper demonstrates that data augmentation alone, contrary to prior belief, can significantly improve the accuracy and robustness of adversarial training by carefully managing the hardness and diversity of augmentations. |
Robust overfitting, a major issue in adversarial training, limits generalization performance. Previous attempts to leverage data augmentation for this problem have been largely ineffective. |
The paper analyzes the impact of augmentation hardness and diversity, proposing Cropshift, a diversity-enhancing crop operation, and IDBH, a new augmentation scheme that balances hardness and maximizes diversity. The authors conduct experiments on CIFAR-10, SVHN, and Tiny ImageNet datasets, comparing IDBH with existing augmentation and regularization techniques. |
IDBH achieves state-of-the-art accuracy and robustness among data augmentation methods in adversarial training, significantly surpassing the baseline augmentation with early stopping.
IDBH matches or exceeds the performance of state-of-the-art regularization methods designed to alleviate robust overfitting.
The study reveals that while diversity consistently improves accuracy and robustness, hardness should be carefully balanced to avoid compromising accuracy. |
The study lacked sufficient computational resources to conduct a more extensive automatic augmentation search for optimal hyperparameters.
The proposed hardness measure has limitations, as demonstrated by the exceptional behavior of certain transformations. |
adversarial training, robust overfitting, data augmentation, cropshift, idbh |
2301.09637
Report |
InfiniCity: Infinite-Scale City Synthesis |
Chieh Hubert Lin, Hsin-Ying Lee, Willi Menapace, Menglei Chai, Aliaksandr Siarohin, Ming-Hsuan Yang, Sergey Tulyakov |
Toward infinite-scale 3D city synthesis, we propose a novel framework,
InfiniCity, which constructs and renders an unconstrainedly large and
3D-grounded environment from random noises. InfiniCity decomposes the seemingly
impractical task into three feasible modules, taking advantage of both 2D and
3D data. First, an infinite-pixel image synthesis module generates
arbitrary-scale 2D maps from the bird's-eye view. Next, an octree-based voxel
completion module lifts the generated 2D map to 3D octrees. Finally, a
voxel-based neural rendering module texturizes the voxels and renders 2D
images. InfiniCity can thus synthesize arbitrary-scale and traversable 3D city
environments, and allow flexible and interactive editing from users. We
quantitatively and qualitatively demonstrate the efficacy of the proposed
framework. Project page: https://hubert0527.github.io/infinicity/ |
This paper proposes InfiniCity, a novel framework for synthesizing infinite-scale, realistic, and navigable 3D city environments from random noise. |
Existing methods are limited to bounded environments, constrained camera movements, or lack 3D consistency, highlighting the need for a scalable and realistic 3D environment generation approach. |
A three-stage pipeline: 1) Infinite-pixel satellite image synthesis generates large-scale 2D maps. 2) Octree-based voxel completion converts 2D maps to watertight 3D voxel environments. 3) Voxel-based neural rendering adds textures to the voxel world. |
Generates arbitrary-scale, coherent, and diverse city maps with plausible structures across multiple modalities.
Successfully lifts 2D maps to 3D voxel environments, ensuring watertight structures while retaining surface details.
Outperforms baseline methods in quantitative evaluations (FID, KID, P-FID) and exhibits better visual quality and 3D consistency. |
The quality of the final rendering is currently limited by the neural rendering stage.
Future work includes exploring advanced neural rendering techniques to improve visual fidelity and address convergence and efficiency challenges. |
3d city synthesis, infinite-scale generation, neural rendering, voxel completion, generative modeling |
2301.09595
Report |
Zorro: the masked multimodal transformer |
Adrià Recasens, Jason Lin, João Carreira, Drew Jaegle, Luyu Wang, Jean-baptiste Alayrac, Pauline Luc, Antoine Miech, Lucas Smaira, Ross Hemsley, Andrew Zisserman |
Attention-based models are appealing for multimodal processing because inputs
from multiple modalities can be concatenated and fed to a single backbone
network - thus requiring very little fusion engineering. The resulting
representations are however fully entangled throughout the network, which may
not always be desirable: in learning, contrastive audio-visual self-supervised
learning requires independent audio and visual features to operate, otherwise
learning collapses; in inference, evaluation of audio-visual models should be
possible on benchmarks having just audio or just video. In this paper, we
introduce Zorro, a technique that uses masks to control how inputs from each
modality are routed inside Transformers, keeping some parts of the
representation modality-pure. We apply this technique to three popular
transformer-based architectures (ViT, Swin and HiP) and show that with
contrastive pre-training Zorro achieves state-of-the-art results on most
relevant benchmarks for multimodal tasks (AudioSet and VGGSound). Furthermore,
the resulting models are able to perform unimodal inference on both video and
audio benchmarks such as Kinetics-400 or ESC-50. |
Introduces Zorro, a multimodal Transformer architecture using masks to control information flow between modalities, enabling unimodal and multimodal training and inference within a single model. |
Addresses limitations of existing multimodal models that entangle representations, hindering contrastive learning and unimodal inference. |
Preserves modality-specific representations within the Transformer by applying masks to self-attention and decoding cross-attention operations. Extends the approach to ViT, Swin, and HiP architectures. |
Achieves state-of-the-art results on AudioSet and VGGSound benchmarks with contrastive pre-training.
Enables unimodal inference on Kinetics-400 (video) and ESC-50 (audio) with a model trained on multimodal data.
Shows superior performance compared to alternative masking configurations and fusion positions. |
Unimodal performance on Kinetics-400, while strong, does not yet surpass specialized video Transformers.
Self-supervised learning methods beyond contrastive learning remain to be explored. |
multimodal learning, transformers, contrastive learning, self-supervised learning, audio-visual recognition |
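The masking idea in the record above can be sketched as a boolean attention mask. The layout below is an assumption made for illustration: tokens are ordered audio, video, fusion; audio and video tokens attend only within their own modality, while fusion tokens attend to everything. The function names are hypothetical and this is not the paper's code.

```python
import numpy as np

def zorro_style_mask(n_audio, n_video, n_fusion):
    """mask[i, j] is True when query token i may attend to key token j."""
    n = n_audio + n_video + n_fusion
    mask = np.zeros((n, n), dtype=bool)
    a = slice(0, n_audio)
    v = slice(n_audio, n_audio + n_video)
    f = slice(n_audio + n_video, n)
    mask[a, a] = True   # audio stays audio-only
    mask[v, v] = True   # video stays video-only
    mask[f, :] = True   # fusion tokens see all modalities
    return mask

def masked_softmax(scores, mask):
    """Row-wise softmax that assigns zero weight to masked-out keys."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

mask = zorro_style_mask(n_audio=2, n_video=3, n_fusion=1)
scores = np.random.default_rng(0).normal(size=mask.shape)
weights = masked_softmax(scores, mask)
```

Because the audio rows never receive video information, the audio part of the representation can be used on its own at inference time, which is the property the record attributes to the method.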
2301.09451
Report |
A Simple Recipe for Competitive Low-compute Self supervised Vision Models |
Quentin Duval, Ishan Misra, Nicolas Ballas |
Self-supervised methods in vision have been mostly focused on large
architectures as they seem to suffer from a significant performance drop for
smaller architectures. In this paper, we propose a simple self-supervised
distillation technique that can train high performance low-compute neural
networks. Our main insight is that existing joint-embedding based SSL methods
can be repurposed for knowledge distillation from a large self-supervised
teacher to a small student model. Thus, we call our method Replace one Branch
(RoB) as it simply replaces one branch of the joint-embedding training with a
large teacher model. RoB is widely applicable to a number of architectures such
as small ResNets, MobileNets and ViT, and pretrained models such as DINO, SwAV
or iBOT. When pretraining on the ImageNet dataset, RoB yields models that
compete with supervised knowledge distillation. When applied to MSN, RoB
produces students with strong semi-supervised capabilities. Finally, our best
ViT-Tiny models improve over prior SSL state-of-the-art on ImageNet by $2.3\%$
and are on par or better than a supervised distilled DeiT on five downstream
transfer tasks (iNaturalist, CIFAR, Clevr/Count, Clevr/Dist and Places). We
hope RoB enables practical self-supervision at smaller scale. |
This paper introduces "Replace one Branch" (RoB), a simple self-supervised distillation technique for training high-performance, low-compute neural networks by adapting existing joint-embedding based self-supervised learning methods. |
Existing self-supervised learning methods often underperform with smaller architectures, limiting their practical application in contexts where computational resources are constrained. RoB addresses this limitation, enabling the benefits of self-supervision in low-compute settings. |
RoB replaces one branch of a joint-embedding self-supervised method (e.g., DINO, SwAV, iBOT, MSN) with a pretrained teacher network. It removes the regularization terms used to prevent collapse in the original method and utilizes identical-view predictions instead of cross-view predictions. |
RoB produces state-of-the-art self-supervised models for ViT-Tiny, ResNet18, and ResNet34, demonstrating significant improvements over previous methods.
The distilled students outperform their supervised counterparts on transfer learning tasks and achieve competitive performance with models trained via supervised distillation.
RoB-trained models exhibit strong semi-supervised learning capabilities, particularly for low-shot image classification. |
The performance improvement with RoB is more significant for students that struggle with traditional SSL training.
Future work could explore alternative student head designs and distillation techniques to further improve RoB's performance. |
self-supervised learning, knowledge distillation, low-compute vision models, transfer learning, semi-supervised learning |
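The distillation objective described in the record above can be sketched as a cross-entropy between teacher and student predictions on the same view. This is a hedged, DINO-style illustration and not the authors' implementation: the temperatures, the function names, and the use of raw logits over a shared set of output dimensions are all assumptions. The teacher output is treated as a fixed target and no collapse-prevention term is included, in line with the record's description.

```python
import numpy as np

def log_softmax(logits, temperature):
    """Numerically stable log-softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def rob_style_loss(student_logits, teacher_logits, t_student=0.1, t_teacher=0.1):
    """Cross-entropy H(teacher, student), both computed on the identical view."""
    p_teacher = np.exp(log_softmax(teacher_logits, t_teacher))  # fixed target
    log_p_student = log_softmax(student_logits, t_student)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(4, 16))  # batch of 4 images, 16 output dims
student_logits = rng.normal(size=(4, 16))
loss_random = rob_style_loss(student_logits, teacher_logits)
loss_matched = rob_style_loss(teacher_logits, teacher_logits)
```

With equal temperatures the loss is smallest when the student reproduces the teacher's distribution, which is why `loss_matched` does not exceed `loss_random`.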
2301.09430
Report |
Rethinking Real-world Image Deraining via An Unpaired Degradation-Conditioned Diffusion Model |
Yiyang Shen, Mingqiang Wei, Yongzhen Wang, Xueyang Fu, Jing Qin |
Recent diffusion models have exhibited great potential in generative modeling
tasks. Part of their success can be attributed to their ability to train
stably on huge sets of paired synthetic data. However, adapting these models to
real-world image deraining remains difficult for two reasons. First, a
large-scale paired real-world clean/rainy dataset is unavailable, while
regular conditional diffusion models heavily rely on paired data for training.
Second, real-world rain usually reflects real-world scenarios with a variety of
unknown rain degradation types, which poses a significant challenge for the
generative modeling process. To meet these challenges, we propose RainDiff, the
first real-world image deraining paradigm based on diffusion models, serving as
a new standard bar for real-world image deraining. We address the first
challenge by introducing a stable and non-adversarial unpaired cycle-consistent
architecture that can be trained, end-to-end, with only unpaired data for
supervision; and the second challenge by proposing a degradation-conditioned
diffusion model that refines the desired output via a diffusive generative
process conditioned by learned priors of multiple rain degradations. Extensive
experiments confirm the superiority of our RainDiff over existing
unpaired/semi-supervised methods and show its competitive advantages over
several fully-supervised ones. |
This paper presents RainDiff, a novel unpaired learning paradigm for real-world image deraining leveraging a degradation-conditioned diffusion model. |
Real-world image deraining is challenging due to the lack of paired training data and the diverse degradation types in real rain. |
RainDiff utilizes a non-adversarial unpaired cycle-consistent architecture with a degradation-conditioned diffusion model. It learns degradation priors to guide the rain removal process. |
RainDiff outperforms state-of-the-art unpaired and semi-supervised methods on both synthetic and real-world datasets.
The degradation-conditioned diffusion model effectively handles multiple rain degradation types.
The proposed method achieves competitive performance against fully-supervised deraining approaches. |
Further research is needed to assess the performance of RainDiff on diverse weather conditions beyond rain.
Similar to other diffusion models, RainDiff has a longer runtime compared to single-pass image restoration models. |
image deraining, diffusion models, unpaired learning, degradation-conditioned, cycle-consistent |
2301.09376
Report |
Crowd3D: Towards Hundreds of People Reconstruction from a Single Image |
Hao Wen, Jing Huang, Huili Cui, Haozhe Lin, YuKun Lai, Lu Fang, Kun Li |
Image-based multi-person reconstruction in wide-field large scenes is
critical for crowd analysis and security alert. However, existing methods
cannot deal with large scenes containing hundreds of people, which encounter
the challenges of large number of people, large variations in human scale, and
complex spatial distribution. In this paper, we propose Crowd3D, the first
framework to reconstruct the 3D poses, shapes and locations of hundreds of
people with global consistency from a single large-scene image. The core of our
approach is to convert the problem of complex crowd localization into pixel
localization with the help of our newly defined concept, Human-scene Virtual
Interaction Point (HVIP). To reconstruct the crowd with global consistency, we
propose a progressive reconstruction network based on HVIP by pre-estimating a
scene-level camera and a ground plane. To deal with a large number of persons
and various human sizes, we also design an adaptive human-centric cropping
scheme. Besides, we contribute a benchmark dataset, LargeCrowd, for crowd
reconstruction in a large scene. Experimental results demonstrate the
effectiveness of the proposed method. The code and datasets will be made
public. |
This supplementary document provides additional details and analysis for the paper 'Crowd3D: Towards Hundreds of People Reconstruction from a Single Image', focusing on large-scale crowd reconstruction from images. |
The work addresses the limitations of existing methods that struggle with large-scene images containing many people, aiming to improve the accuracy and robustness of 3D human pose and shape estimation in crowded scenes. |
The document details the adaptive human-centric cropping scheme for dividing large images into manageable patches, the process of applying and adapting other methods for comparison, and additional ablation studies of key modules. |
The method achieves state-of-the-art results on the large-scale 'LargeCrowd' dataset, demonstrating its effectiveness.
Evaluation on small-scene datasets ('Panoptic', 'MuPoTS') shows that the method generalizes well and maintains high performance.
Ablation studies confirm the positive impact of adaptive cropping and accurate ground plane estimation on overall performance. |
Automatically obtaining cropping parameters can be sensitive to false pose detections and computationally expensive.
Future work could explore weakly-supervised or unsupervised methods to reduce reliance on ground-truth annotations for ground plane estimation. |
crowd reconstruction, 3d human pose estimation, large-scale scene understanding, adaptive cropping, depth estimation |
2301.09264
Report |
Efficient Training Under Limited Resources |
Mahdi Zolnouri, Dounia Lakhmiri, Christophe Tribes, Eyyüb Sari, Sébastien Le Digabel |
Training time budget and size of the dataset are among the factors affecting
the performance of a Deep Neural Network (DNN). This paper shows that Neural
Architecture Search (NAS), Hyper Parameters Optimization (HPO), and Data
Augmentation help DNNs perform much better while these two factors are limited.
However, searching for an optimal architecture and the best hyperparameter
values besides a good combination of data augmentation techniques under low
resources requires many experiments. We present our approach to achieving such
a goal in three steps: reducing training epoch time by compressing the model
while maintaining the performance compared to the original model, preventing
model overfitting when the dataset is small, and performing the hyperparameter
tuning. We used NOMAD, which is a blackbox optimization software based on a
derivative-free algorithm to do NAS and HPO. Our work achieved an accuracy of
86.0 % on a tiny subset of Mini-ImageNet at the ICLR 2021 Hardware Aware
Efficient Training (HAET) Challenge and won second place in the competition.
The competition results can be found at haet2021.github.io/challenge and our
source code can be found at github.com/DouniaLakhmiri/ICLR_HAET2021. |
This paper presents a method for improving Deep Neural Network (DNN) performance under limited training time and data size constraints by combining Neural Architecture Search (NAS), Hyperparameter Optimization (HPO), and Data Augmentation (DA). |
Training time budget and dataset size significantly affect DNN performance. This paper addresses the challenge of finding optimal architectures, hyperparameters, and data augmentation techniques under limited resources. |
The methodology involves: 1) Reducing training time by compressing the model via NAS while maintaining performance. 2) Using DA to mitigate overfitting on small datasets. 3) Performing HPO using the blackbox optimization software NOMAD to fine-tune hyperparameters. |
SENet-18 and ResNet-18 were identified as strong baseline models on CIFAR-10 subsets.
NAS produced smaller variants of SENet-18 and ResNet-18 with comparable accuracy.
The final model, NOMAD-NAS-SENet-18, achieved 86.0% accuracy on a Mini-ImageNet subset, securing second place in the ICLR 2021 HAET Challenge. |
The study primarily used CIFAR-10 as a proxy dataset, potentially limiting generalizability to the Mini-ImageNet evaluation dataset.
An undocumented image resizing step in the evaluation process impacted the final performance. |
neural architecture search, hyperparameter optimization, data augmentation, model compression, efficient training |
2301.09121
Report |
Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision |
Jilan Xu, Junlin Hou, Yuejie Zhang, Rui Feng, Yi Wang, Yu Qiao, Weidi Xie |
In this paper, we consider the problem of open-vocabulary semantic
segmentation (OVS), which aims to segment objects of arbitrary classes instead
of pre-defined, closed-set categories. The main contributions are as follows:
First, we propose a transformer-based model for OVS, termed as OVSegmentor,
which only exploits web-crawled image-text pairs for pre-training without using
any mask annotations. OVSegmentor assembles the image pixels into a set of
learnable group tokens via a slot-attention based binding module, and aligns
the group tokens to the corresponding caption embedding. Second, we propose two
proxy tasks for training, namely masked entity completion and cross-image mask
consistency. The former aims to infer all masked entities in the caption given
the group tokens, that enables the model to learn fine-grained alignment
between visual groups and text entities. The latter enforces consistent mask
predictions between images that contain shared entities, which encourages the
model to learn visual invariance. Third, we construct the CC4M dataset for
pre-training by filtering CC12M with frequently appearing entities, which
significantly improves training efficiency. Fourth, we perform zero-shot
transfer on three benchmark datasets, PASCAL VOC 2012, PASCAL Context, and COCO
Object. Our model achieves superior segmentation results over the
state-of-the-art method by using only 3% of the data (4M vs 134M) for pre-training.
Code and pre-trained models will be released for future research. |
This paper proposes OVSegmentor, a transformer-based model for open-vocabulary semantic segmentation that uses only image-caption pairs for pre-training and can segment arbitrary object classes. |
Existing semantic segmentation approaches suffer from costly annotations and limitations to pre-defined classes. This work explores open-vocabulary segmentation by leveraging freely available image-caption pairs and enables zero-shot transfer to unseen classes. |
OVSegmentor introduces learnable group tokens to cluster image patches and aligns them with caption embeddings. Two novel proxy tasks, masked entity completion and cross-image mask consistency, are introduced to learn entity-specific and visually invariant group semantics. |
OVSegmentor achieves superior results compared to methods using supervised finetuning on PASCAL VOC, showcasing the effectiveness of training with image-caption pairs.
The model outperforms state-of-the-art open-vocabulary segmentation approaches on PASCAL VOC while using only 3% of their pre-training data, demonstrating high training efficiency.
Ablation studies validate the contribution of each component, especially the proposed proxy tasks, in improving segmentation performance. |
The model shows limitations in recognizing stuff classes and struggles to separate co-occurring objects into distinct groups.
Future work can explore incorporating fine-grained descriptions and leveraging external knowledge bases to further enhance segmentation accuracy. |
open-vocabulary semantic segmentation, vision-language pre-training, zero-shot transfer learning, masked entity completion, cross-image mask consistency |
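The grouping step in the OVSegmentor summary above (patches bound to learnable group tokens through slot-attention-style binding) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the learned projections and recurrent update of full slot attention are omitted, and the function name `slot_attention_step` and all shapes are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention_step(patch_feats, group_tokens, eps=1e-8):
    """One simplified binding step (hypothetical helper).

    patch_feats:  (N, d) patch embeddings
    group_tokens: (K, d) group tokens
    Returns updated group tokens (K, d) and soft assignments (K, N).
    """
    d = patch_feats.shape[1]
    logits = group_tokens @ patch_feats.T / np.sqrt(d)   # (K, N)
    # Softmax over the group axis: groups compete for each patch.
    assign = softmax(logits, axis=0)
    # Each group token becomes the weighted mean of the patches it claims.
    weights = assign / (assign.sum(axis=1, keepdims=True) + eps)
    new_tokens = weights @ patch_feats                   # (K, d)
    return new_tokens, assign

rng = np.random.default_rng(0)
patches = rng.normal(size=(196, 32))   # e.g. a 14x14 patch grid (assumed)
tokens = rng.normal(size=(8, 32))      # 8 group tokens (assumed)
new_tokens, assign = slot_attention_step(patches, tokens)
patch_to_group = assign.argmax(axis=0)  # hard group index per patch
```

Reading `patch_to_group` back onto the patch grid gives a coarse segmentation mask, which is the sense in which group tokens can be matched against text embeddings for zero-shot segmentation.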
2301.08898
Report |
Recurrent Generic Contour-based Instance Segmentation with Progressive Learning |
Hao Feng, Keyi Zhou, Wengang Zhou, Yufei Yin, Jiajun Deng, Qi Sun, Houqiang Li |
Contour-based instance segmentation has been actively studied, thanks to its
flexibility and elegance in processing visual objects within complex
backgrounds. In this work, we propose a novel deep network architecture, i.e.,
PolySnake, for generic contour-based instance segmentation. Motivated by the
classic Snake algorithm, the proposed PolySnake achieves superior and robust
segmentation performance with an iterative and progressive contour refinement
strategy. Technically, PolySnake introduces a recurrent update operator to
estimate the object contour iteratively. It maintains a single estimate of the
contour that is progressively deformed toward the object boundary. At each
iteration, PolySnake builds a semantic-rich representation for the current
contour and feeds it to the recurrent operator for further contour adjustment.
Through the iterative refinements, the contour progressively converges to a
stable status that tightly encloses the object instance. Beyond the scope of
general instance segmentation, extensive experiments are conducted to validate
the effectiveness and generalizability of our PolySnake in two additional
specific task scenarios, including scene text detection and lane detection. The
results demonstrate that the proposed PolySnake outperforms the existing
advanced methods on multiple prevalent benchmarks across the three
tasks. The codes and pre-trained models are available at
https://github.com/fh2019ustc/PolySnake |
Proposes PolySnake, a deep network architecture for generic contour-based instance segmentation, which uses an iterative and progressive contour refinement strategy inspired by the classic Snake algorithm. |
Addresses limitations of existing instance segmentation methods that rely on inaccurate bounding boxes or complex contour learning strategies, aiming for more accurate and efficient instance segmentation. |
PolySnake initializes a coarse contour and iteratively deforms it using a recurrent update operator that leverages multi-scale features. A shape loss is introduced to encourage accurate object boundary adherence. The model is trained in two stages, optimizing initial contour generation and iterative deformation, followed by multi-scale contour refinement. |
PolySnake outperforms state-of-the-art contour-based methods on SBD, Cityscapes, COCO, and KINS datasets for instance segmentation.
Achieves outstanding performance in scene text detection on CTW1500, demonstrating its generalizability.
Exhibits superior accuracy in lane detection on CULane, surpassing both segmentation-based and anchor-based methods, particularly in challenging categories like dazzle and night. |
Performance improvement on KINS dataset (amodal instance segmentation) with multi-scale refinement is marginal, possibly due to limitations in leveraging vertex features from occluded parts.
Future work includes integrating PolySnake with state-of-the-art instance segmentation methods and exploring its application to other polygon or curve estimation tasks. |
instance segmentation, contour-based segmentation, progressive learning, recurrent networks, scene text detection, lane detection |
2301.08455
Report |
Spatial Steerability of GANs via Self-Supervision from Discriminator |
Jianyuan Wang, Lalit Bhagat, Ceyuan Yang, Yinghao Xu, Yujun Shen, Hongdong Li, Bolei Zhou |
Generative models have made huge progress in photorealistic image synthesis in
recent years. To enable humans to steer the image generation process and
customize the output, many works explore the interpretable dimensions of the
latent space in GANs. Existing methods edit the attributes of the output image
such as orientation or color scheme by varying the latent code along certain
directions. However, these methods usually require additional human annotations
for each pretrained model, and they mostly focus on editing global attributes.
In this work, we propose a self-supervised approach to improve the spatial
steerability of GANs without searching for steerable directions in the latent
space or requiring extra annotations. Specifically, we design randomly sampled
Gaussian heatmaps to be encoded into the intermediate layers of generative
models as spatial inductive bias. Along with training the GAN model from
scratch, these heatmaps are aligned with the emerging attention of the
GAN's discriminator in a self-supervised learning manner. During inference,
users can interact with the spatial heatmaps in an intuitive manner, enabling
them to edit the output image by adjusting the scene layout, moving, or
removing objects. Moreover, we incorporate DragGAN into our framework, which
facilitates fine-grained manipulation within a reasonable time and supports a
coarse-to-fine editing process. Extensive experiments show that the proposed
method not only enables spatial editing over human faces, animal faces, outdoor
scenes, and complicated multi-object indoor scenes but also brings improvement
in synthesis quality. Code, models, and demo video are available at
https://genforce.github.io/SpatialGAN/. |
This paper introduces SpatialGAN, a self-supervised approach to enhance the spatial steerability of GANs, enabling users to manipulate scene layouts, move or remove objects, and change local appearances without the need for extra annotations or searching for steerable directions in the latent space. |
Existing methods for controlling GAN outputs often require expensive annotations or struggle to provide fine-grained spatial control. SpatialGAN addresses these limitations by introducing a novel self-supervised framework that leverages the inherent spatial attention of the discriminator to guide the generator. |
The method encodes randomly sampled Gaussian heatmaps into the intermediate layers of the generator as spatial inductive bias. These heatmaps are then aligned with the emerging attention maps of the discriminator during training through a self-supervised learning process. For complex indoor scenes, a multi-object heatmap sampling and encoding method is introduced. Moreover, the integration of DragGAN with SpatialGAN is proposed for more efficient and flexible manipulations. |
SpatialGAN enables various spatial manipulations like moving and removing objects by altering the encoded heatmaps.
The proposed method consistently improves synthesis quality across different datasets, including LSUN Cat, FFHQ, LSUN Church, and LSUN Bedroom, outperforming both the StyleGAN2 baseline and the earlier conference version of the method.
The integration of DragGAN with SpatialGAN allows for precise object manipulation while significantly reducing computation time compared to using DragGAN alone. |
The spatial encoding operation may occasionally result in blurring at heatmap boundaries, impacting the visual quality of manipulations.
In some cases, altering one sub-heatmap may unintentionally affect the appearance of distant, unrelated areas. |
generative adversarial networks, spatial editing, self-supervision, image synthesis, interpretability |
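The randomly sampled Gaussian heatmaps mentioned in the SpatialGAN summary above can be sketched as follows. This is an illustrative sketch only: the isotropic Gaussian form, the normalized coordinate convention, the `sigma_range` values, and the function names are assumptions, not details taken from the paper.

```python
import numpy as np

def gaussian_heatmap(size, center, sigma):
    """Isotropic Gaussian heatmap on a size x size grid.

    center is (cx, cy) in normalized [0, 1] coordinates (assumed convention).
    """
    coords = (np.arange(size) + 0.5) / size   # pixel centers
    xs, ys = np.meshgrid(coords, coords)      # xs varies along columns
    sq_dist = (xs - center[0]) ** 2 + (ys - center[1]) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def sample_heatmap(size, rng, sigma_range=(0.1, 0.3)):
    # Random center and width, as a stand-in for the sampling used in training.
    center = rng.uniform(0.0, 1.0, size=2)
    sigma = rng.uniform(sigma_range[0], sigma_range[1])
    return gaussian_heatmap(size, center, sigma), center, sigma

rng = np.random.default_rng(0)
heatmap, center, sigma = sample_heatmap(32, rng)
# "Moving" content at inference corresponds to re-rendering with a new center.
moved = gaussian_heatmap(32, center=(0.25, 0.75), sigma=sigma)
```

In this picture, a user edit such as dragging an object amounts to changing `center` and feeding the re-rendered heatmap to the generator.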
2301.07969
Report |
Fast Inference in Denoising Diffusion Models via MMD Finetuning |
Emanuele Aiello, Diego Valsesia, Enrico Magli |
Denoising Diffusion Models (DDMs) have become a popular tool for generating
high-quality samples from complex data distributions. These models are able to
capture sophisticated patterns and structures in the data, and can generate
samples that are highly diverse and representative of the underlying
distribution. However, one of the main limitations of diffusion models is the
complexity of sample generation, since a large number of inference timesteps is
required to faithfully capture the data distribution. In this paper, we present
MMD-DDM, a novel method for fast sampling of diffusion models. Our approach is
based on the idea of using the Maximum Mean Discrepancy (MMD) to finetune the
learned distribution with a given budget of timesteps. This allows the
finetuned model to significantly improve the speed-quality trade-off, by
substantially increasing fidelity in inference regimes with few steps or,
equivalently, by reducing the required number of steps to reach a target
fidelity, thus paving the way for a more practical adoption of diffusion models
in a wide range of applications. We evaluate our approach on unconditional
image generation with extensive experiments across the CIFAR-10, CelebA,
ImageNet and LSUN-Church datasets. Our findings show that the proposed method
is able to produce high-quality samples in a fraction of the time required by
widely-used diffusion models, and outperforms state-of-the-art techniques for
accelerated sampling. Code is available at:
https://github.com/diegovalsesia/MMD-DDM. |
Presents MMD-DDM, a technique for fast sampling of diffusion models by finetuning the learned distribution using Maximum Mean Discrepancy (MMD) with a limited number of timesteps. |
Addresses the limitation of slow sample generation in Denoising Diffusion Models (DDMs) by significantly improving the speed-quality trade-off. |
Finetunes a pretrained DDM by minimizing the MMD between real and generated samples in a perceptually-relevant feature space (Inception-V3 or CLIP), backpropagating through the sampling chain with reparametrization and gradient checkpointing. |
Significantly reduces the number of timesteps required to achieve a target fidelity, outperforming state-of-the-art methods.
Demonstrates effectiveness on CIFAR-10, CelebA, ImageNet, and LSUN-Church datasets using FID and other metrics.
Shows improved visual quality, including sharper details and clarity, especially when using CLIP feature space for MMD. |
Memory requirements can be high when finetuning over a large number of timesteps.
Future work could explore more advanced timestep selection and optimization techniques, as well as integration with conditional DDMs. |
denoising diffusion models, generative models, fast inference, maximum mean discrepancy, image generation |
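The quantity minimized in the MMD-DDM summary above, the Maximum Mean Discrepancy between real and generated samples in a feature space, can be sketched in a few lines. This is a minimal illustration, assuming an RBF kernel and the biased estimator; the random arrays stand in for Inception-V3 or CLIP features, and the kernel choice and bandwidth are assumptions rather than the paper's settings.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    # Pairwise squared Euclidean distances between rows of a and rows of b.
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd_squared(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD between feature sets x (n, d) and y (m, d)."""
    k_xx = rbf_kernel(x, x, bandwidth)
    k_yy = rbf_kernel(y, y, bandwidth)
    k_xy = rbf_kernel(x, y, bandwidth)
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()

rng = np.random.default_rng(0)
real_feats = rng.normal(0.0, 1.0, size=(256, 8))    # stand-in for real features
close_feats = rng.normal(0.0, 1.0, size=(256, 8))   # same distribution
far_feats = rng.normal(2.0, 1.0, size=(256, 8))     # shifted distribution
mmd_close = mmd_squared(real_feats, close_feats, bandwidth=4.0)
mmd_far = mmd_squared(real_feats, far_feats, bandwidth=4.0)
```

Samples drawn from the same distribution yield a value near zero, while a shifted distribution yields a clearly larger value, which is what makes the quantity usable as a finetuning loss.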
2301.07870
Report |
Fast-BEV: Towards Real-time On-vehicle Bird's-Eye View Perception |
Bin Huang, Yangguang Li, Enze Xie, Feng Liang, Luya Wang, Mingzhu Shen, Fenggang Liu, Tianqi Wang, Ping Luo, Jing Shao |
Recently, the pure camera-based Bird's-Eye-View (BEV) perception removes
expensive Lidar sensors, making it a feasible solution for economical
autonomous driving. However, most existing BEV solutions either suffer from
modest performance or require considerable resources to execute on-vehicle
inference. This paper proposes a simple yet effective framework, termed
Fast-BEV, which is capable of performing real-time BEV perception on the
on-vehicle chips. Towards this goal, we first empirically find that the BEV
representation can be sufficiently powerful without expensive view
transformation or depth representation. Starting from M2BEV baseline, we
further introduce (1) a strong data augmentation strategy for both image and
BEV space to avoid over-fitting (2) a multi-frame feature fusion mechanism to
leverage the temporal information (3) an optimized deployment-friendly view
transformation to speed up the inference. Through experiments, we show Fast-BEV
model family achieves considerable accuracy and efficiency on edge. In
particular, our M1 model (R18@256x704) can run over 50FPS on the Tesla T4
platform, with 47.0% NDS on the nuScenes validation set. Our largest model
(R101@900x1600) establishes a new state-of-the-art 53.5% NDS on the nuScenes
validation set. The code is released at: https://github.com/Sense-GVT/Fast-BEV. |
This paper proposes Fast-BEV, a simple yet effective fully convolutional framework capable of real-time Bird’s-Eye-View (BEV) perception on resource-constrained on-vehicle chips. |
Existing BEV solutions either suffer from limited performance or are too resource-intensive for real-time on-vehicle inference, hindering economical autonomous driving. |
Building upon the M$^2$BEV baseline, Fast-BEV introduces strong image and BEV augmentation, a multi-frame feature fusion mechanism, and an optimized deployment-friendly view transformation for efficient inference. |
Fast-BEV achieves state-of-the-art 53.5% NDS on the nuScenes validation set with its largest model (R101@900x1600).
The efficient M1 model (R18@256×704) achieves a considerable 47.0% NDS while running over 50FPS on the Tesla T4 platform.
The proposed optimized view transformation achieves orders of magnitude speedup on CPU compared to the M$^2$BEV baseline. |
The study primarily focuses on the nuScenes dataset, potentially limiting generalizability to other datasets.
Further investigation into more advanced temporal fusion techniques beyond simple concatenation could potentially yield additional performance gains.
Future work will explore the generalization of Fast-BEV to other autonomous driving datasets and investigate more sophisticated temporal modeling techniques. |
autonomous driving, bev perception, 3d object detection, real-time inference, on-vehicle deployment |
2301.07584
Report |
Joint Representation Learning for Text and 3D Point Cloud |
Rui Huang, Xuran Pan, Henry Zheng, Haojun Jiang, Zhifeng Xie, Shiji Song, Gao Huang |
Recent advancements in vision-language pre-training (e.g. CLIP) have shown
that vision models can benefit from language supervision. While many models
using language modality have achieved great success on 2D vision tasks, the
joint representation learning of 3D point cloud with text remains
under-explored due to the difficulty of 3D-Text data pair acquisition and the
irregularity of 3D data structure. In this paper, we propose a novel Text4Point
framework to construct language-guided 3D point cloud models. The key idea is
utilizing 2D images as a bridge to connect the point cloud and the language
modalities. The proposed Text4Point follows the pre-training and fine-tuning
paradigm. During the pre-training stage, we establish the correspondence of
images and point clouds based on the readily available RGB-D data and use
contrastive learning to align the image and point cloud representations.
Together with the well-aligned image and text features achieved by CLIP, the
point cloud features are implicitly aligned with the text embeddings. Further,
we propose a Text Querying Module to integrate language information into 3D
representation learning by querying text embeddings with point cloud features.
For fine-tuning, the model learns task-specific 3D representations under
informative language guidance from the label set without 2D images. Extensive
experiments demonstrate that our model shows consistent improvement on various
downstream tasks, such as point cloud semantic segmentation, instance
segmentation, and object detection. The code will be available here:
https://github.com/LeapLabTHU/Text4Point |
The paper proposes Text4Point, a novel framework that bridges the gap between text and 3D point cloud representations by leveraging 2D images as an intermediate link, enabling language-guided 3D point cloud models. |
Joint representation learning of 3D point clouds and text remains under-explored due to challenges in data acquisition and the irregularity of 3D data structures. This work aims to address these challenges and utilize language information for improved 3D representation learning. |
The method uses contrastive learning to align image and point cloud representations during pre-training, leveraging readily available RGB-D data. A Text Querying Module integrates language information by querying text embeddings with point cloud features, guiding 3D representation learning. Fine-tuning adapts the model for specific tasks with language guidance from label sets. |
Text4Point achieves state-of-the-art performance on the S3DIS dataset for both semantic and instance segmentation tasks.
The method significantly outperforms previous pre-training methods for 3D object detection on SUN RGB-D and ScanNet datasets.
Ablation studies confirm the importance of each component, particularly language modality, in enhancing 3D representation learning. |
The reliance on 2D images as a bridge introduces potential limitations if the image-point cloud correspondence is inaccurate.
Further exploration of alternative methods for aligning text and point cloud representations could be beneficial. |
3d point cloud, joint representation learning, vision-language pre-training, contrastive learning, text querying |
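The pre-training step in the Text4Point summary above aligns image and point cloud representations with contrastive learning. A minimal sketch of such an objective is given below, assuming a symmetric InfoNCE loss over a batch of paired features; the symmetric form, the temperature value, and the function names are assumptions, not details confirmed by the source.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits, axis):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(image_feats, point_feats, temperature=0.07):
    """Contrastive loss for N paired (image, point cloud) features of shape (N, d).

    Row i of each array is assumed to describe the same RGB-D region.
    """
    img = l2_normalize(image_feats)
    pts = l2_normalize(point_feats)
    logits = img @ pts.T / temperature   # (N, N); matched pairs on the diagonal
    loss_i2p = -np.mean(np.diag(log_softmax(logits, axis=1)))
    loss_p2i = -np.mean(np.diag(log_softmax(logits, axis=0)))
    return 0.5 * (loss_i2p + loss_p2i)

rng = np.random.default_rng(0)
image_feats = rng.normal(size=(16, 32))
aligned_points = image_feats + 0.05 * rng.normal(size=(16, 32))  # well aligned
random_points = rng.normal(size=(16, 32))                        # unaligned
loss_aligned = symmetric_info_nce(image_feats, aligned_points)
loss_random = symmetric_info_nce(image_feats, random_points)
```

Because the image features are themselves aligned with CLIP text embeddings, lowering this loss pulls the point cloud features toward the text space indirectly.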
2301.07581
Report |
Blur Invariants for Image Recognition |
Jan Flusser, Matej Lebl, Matteo Pedone, Filip Sroubek, Jitka Kostkova |
Blur is an image degradation that is difficult to remove. Invariants with
respect to blur offer an alternative way of describing and recognizing
blurred images without any deblurring. In this paper, we present an original
unified theory of blur invariants. Unlike all previous attempts, the new theory
does not require any prior knowledge of the blur type. The invariants are
constructed in the Fourier domain by means of orthogonal projection operators
and moment expansion is used for efficient and stable computation. It is shown
that all blur invariants published earlier are just particular cases of this
approach. Experimental comparison to concurrent approaches shows the advantages
of the proposed theory. |
This paper presents a unified theory of blur invariants for image recognition, using orthogonal projection operators and moment expansion. |
Blur invariants offer a robust alternative to deblurring for image recognition, and the proposed theory provides a general framework for their construction regardless of the blur type. |
The method defines blur invariants in the Fourier domain as the ratio of the image's Fourier transform to the Fourier transform of its projection onto the blur subspace. It then utilizes moment expansion in the image domain to enable efficient and stable computation of these invariants. |
A general theorem (GTBI) for constructing blur invariants for arbitrary blur subspaces closed under convolution and correlation is presented.
The completeness theorem demonstrates that the proposed invariants can distinguish any two images belonging to different blur-equivalence classes.
A method for calculating blur invariants directly in the image domain using moments is derived, eliminating the need for explicit deconvolution. |
The theory primarily focuses on linear orthogonal projection operators, while exploring non-linear and non-orthogonal projectors is left for future work.
Extending the framework to 3D images and investigating the fusion of blur invariants with deep learning are identified as promising research directions. |
blur invariants, image recognition, projection operators, moment expansion, image blur |
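The construction in the blur-invariants summary above, the ratio of an image's Fourier transform to the Fourier transform of its projection onto the blur subspace, can be checked numerically in one dimension. The sketch below assumes the blur subspace is the set of centrosymmetric kernels and uses circular convolution on a periodic signal; both are simplifying assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def project_centrosymmetric(f):
    # Projection onto centrosymmetric signals: average f[n] with f[-n mod N].
    f_reflected = np.roll(f[::-1], 1)
    return 0.5 * (f + f_reflected)

def blur_invariant(f):
    # Ratio F(f) / F(Pf), evaluated at every discrete frequency.
    return np.fft.fft(f) / np.fft.fft(project_centrosymmetric(f))

def circular_blur(f, h):
    # Circular convolution of f with kernel h via the FFT.
    return np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(h)))

n = 64
rng = np.random.default_rng(0)
f = rng.random(n)

# Centrosymmetric kernel whose spectrum is real and strictly positive.
h_sym = np.zeros(n)
h_sym[0] = 0.5
h_sym[1] = h_sym[-1] = 0.2
h_sym[2] = h_sym[-2] = 0.05

# Kernel outside the assumed blur subspace, for contrast.
h_asym = np.zeros(n)
h_asym[0], h_asym[1] = 0.6, 0.4

inv_sharp = blur_invariant(f)
inv_sym = blur_invariant(circular_blur(f, h_sym))
inv_asym = blur_invariant(circular_blur(f, h_asym))
```

For a kernel h in the subspace, the projection commutes with the convolution, so the transform of h cancels in the ratio and `inv_sym` matches `inv_sharp`; the asymmetric kernel breaks this cancellation.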
2301.07464
Report |
CLIPTER: Looking at the Bigger Picture in Scene Text Recognition |
Aviad Aberdam, David Bensaïd, Alona Golts, Roy Ganz, Oren Nuriel, Royee Tichauer, Shai Mazor, Ron Litman |
Reading text in real-world scenarios often requires understanding the context
surrounding it, especially when dealing with poor-quality text. However,
current scene text recognizers are unaware of the bigger picture as they
operate on cropped text images. In this study, we harness the representative
capabilities of modern vision-language models, such as CLIP, to provide
scene-level information to the crop-based recognizer. We achieve this by fusing
a rich representation of the entire image, obtained from the vision-language
model, with the recognizer word-level features via a gated cross-attention
mechanism. This component gradually shifts to the context-enhanced
representation, allowing for stable fine-tuning of a pretrained recognizer. We
demonstrate the effectiveness of our model-agnostic framework, CLIPTER (CLIP
TExt Recognition), on leading text recognition architectures and achieve
state-of-the-art results across multiple benchmarks. Furthermore, our analysis
highlights improved robustness to out-of-vocabulary words and enhanced
generalization in low-data regimes. |
This paper introduces CLIPTER, a novel framework that integrates image-level context into crop-based scene text recognizers using vision-language models like CLIP. |
Current scene text recognizers lack scene context as they operate on cropped text images, hindering their performance on poor-quality text where context is crucial. |
The framework extracts rich image representations using a frozen vision-language model and fuses them with word-level features of the recognizer through a gated cross-attention mechanism. |
CLIPTER consistently improves the performance of leading text recognizers, including TRBA, ViTSTR, ABINet, and PARSeq, achieving state-of-the-art results on multiple benchmarks.
It enhances robustness to out-of-vocabulary words and improves generalization in low-data regimes.
End-to-end evaluation demonstrates a marginal latency increase while surpassing the performance of both two-stage and existing end-to-end text spotting methods. |
The optimal integration point for fusing image and word-level features is architecture-dependent, requiring empirical search.
The computational cost of late fusion increases significantly for autoregressive decoders. |
scene text recognition, vision-language models, clip, contextual information, out-of-vocabulary words |
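The gated cross-attention fusion in the CLIPTER summary above can be sketched as a residual update whose gate starts closed, so a pretrained recognizer is unchanged at the start of fine-tuning and shifts gradually toward the context-enhanced features. The sketch is single-head and assumes a scalar tanh gate initialized at zero; that gate form, the weight shapes, and the function name are assumptions, not the authors' exact design.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(word_feats, image_feats, w_q, w_k, w_v, gate):
    """word_feats: (T, d) recognizer features; image_feats: (S, d) scene features."""
    q = word_feats @ w_q
    k = image_feats @ w_k
    v = image_feats @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)   # (T, S)
    context = attn @ v                                        # (T, d)
    # tanh(0) = 0, so a zero gate returns the original recognizer features.
    return word_feats + np.tanh(gate) * context

rng = np.random.default_rng(0)
d = 16
word_feats = rng.normal(size=(5, d))    # crop-level features (assumed shape)
image_feats = rng.normal(size=(7, d))   # frozen vision-language features (assumed shape)
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

out_closed = gated_cross_attention(word_feats, image_feats, w_q, w_k, w_v, gate=0.0)
out_open = gated_cross_attention(word_feats, image_feats, w_q, w_k, w_v, gate=1.0)
```

With the gate at zero the output equals the input features, and as the learnable gate grows the scene context contributes more.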
2301.07389
Report |
Towards Models that Can See and Read |
Roy Ganz, Oren Nuriel, Aviad Aberdam, Yair Kittenplon, Shai Mazor, Ron Litman |
Visual Question Answering (VQA) and Image Captioning (CAP), which are among
the most popular vision-language tasks, have analogous scene-text versions that
require reasoning from the text in the image. Despite their obvious
resemblance, the two are treated independently and, as we show, yield
task-specific methods that can either see or read, but not both. In this work,
we conduct an in-depth analysis of this phenomenon and propose UniTNT, a
Unified Text-Non-Text approach, which grants existing multimodal architectures
scene-text understanding capabilities. Specifically, we treat scene-text
information as an additional modality, fusing it with any pretrained
encoder-decoder-based architecture via designated modules. Thorough experiments
reveal that UniTNT leads to the first single model that successfully handles
both task types. Moreover, we show that scene-text understanding capabilities
can boost vision-language models' performance on general VQA and CAP by up to
2.69% and 0.6 CIDEr, respectively. |
This paper presents UniTNT, a unified text-non-text model that grants existing multimodal architectures scene-text understanding capabilities, enabling them to excel in both general and scene-text-based VQA and CAP tasks. |
Existing vision-language models often struggle to reason jointly from visual and scene-text information, limiting their performance on tasks requiring understanding of both. |
UniTNT introduces scene-text as an additional modality, fusing it with pretrained encoder-decoder architectures via a dedicated encoder and a gated cross-attention-based mechanism. It also employs scene-text-related intermediate supervision to encourage leveraging the added information. |
UniTNT enables a single model to successfully handle both general and scene-text VQA and CAP tasks, outperforming methods trained separately for each.
Integrating scene-text understanding boosts performance on general VQA (e.g., improving BLIP by 2.69% on VQAv2) and CAP (e.g., enhancing BLIP by 0.6 CIDEr on COCO Captions).
Analysis reveals the importance of combined training on datasets containing both visual and scene-text elements and highlights the need for benchmarks focusing on questions requiring reasoning over both modalities. |
The intrinsic tradeoff between TextCaps and COCO Captions, due to their different nature of ground truth captions, requires further investigation.
Further research is needed to improve models' performance on challenging questions requiring reasoning over both scene-text and visual information simultaneously. |
vision-language, scene-text understanding, visual question answering, image captioning, multimodal learning |
2301.07301
Report |
PTA-Det: Point Transformer Associating Point cloud and Image for 3D Object Detection |
Rui Wan, Tianyun Zhao, Wei Zhao |
In autonomous driving, 3D object detection based on multi-modal data has
become an indispensable approach when facing complex environments around the
vehicle. During multi-modal detection, LiDAR and camera are simultaneously
applied for capturing and modeling. However, due to the intrinsic discrepancies
between the LiDAR point and camera image, the fusion of the data for object
detection encounters a series of problems. Most multi-modal detection methods
perform even worse than LiDAR-only methods. In this investigation, we propose a
method named PTA-Det to improve the performance of multi-modal detection.
Accompanied by PTA-Det, a Pseudo Point Cloud Generation Network is proposed,
which can represent image information, including texture and semantic features, by
pseudo points. Thereafter, through a transformer-based Point Fusion Transition
(PFT) module, the features of LiDAR points and pseudo points from image can be
deeply fused under a unified point-based representation. The combination of
these modules can conquer the major obstacle in feature fusion across
modalities and realizes a complementary and discriminative representation for
proposal generation. Extensive experiments on the KITTI dataset show that
PTA-Det achieves competitive results, supporting its effectiveness. |
This paper proposes PTA-Det, a multi-modal 3D object detection method that leverages pseudo points as an intermediate modality between images and point clouds. |
Multi-modal 3D object detection is crucial for autonomous driving, but existing methods struggle to effectively fuse image and point cloud data. |
PTA-Det employs a Pseudo Point Cloud Generation Network to transform images into pseudo points representing image features. A two-stream transformer-based network learns intra- and inter-modal features. A Point Fusion Transition (PFT) module fuses features across modalities. |
PTA-Det achieves competitive results on the KITTI dataset, particularly for car detection (77.88% mAP).
The method outperforms LiDAR-only methods with the same number of input points.
Ablation studies demonstrate the effectiveness of each module, especially the PFT module and the keypoint-based pseudo point sampling strategy. |
PTA-Det's performance on smaller objects like cyclists and pedestrians is hindered by the reduced number of input points due to transformer memory constraints.
Future work includes optimizing the model for efficiency and exploring data augmentation techniques for multi-modal scenarios. |
3d object detection, multi-modal fusion, point cloud, pseudo point cloud, transformer |
2301.07093
Report |
GLIGEN: Open-Set Grounded Text-to-Image Generation |
Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, Yong Jae Lee |
Large-scale text-to-image diffusion models have made amazing advances.
However, the status quo is to use text input alone, which can impede
controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image
Generation, a novel approach that builds upon and extends the functionality of
existing pre-trained text-to-image diffusion models by enabling them to also be
conditioned on grounding inputs. To preserve the vast concept knowledge of the
pre-trained model, we freeze all of its weights and inject the grounding
information into new trainable layers via a gated mechanism. Our model achieves
open-world grounded text2img generation with caption and bounding box condition
inputs, and the grounding ability generalizes well to novel spatial
configurations and concepts. GLIGEN's zero-shot performance on COCO and LVIS
outperforms that of existing supervised layout-to-image baselines by a large
margin. |
This paper introduces GLIGEN, a novel method that incorporates grounding inputs into pre-trained text-to-image diffusion models, enabling versatile controllability over object placement, style, and composition. |
Existing text-to-image generation models often lack precise controllability beyond textual descriptions. GLIGEN addresses this by enabling grounding with bounding boxes, keypoints, reference images, and spatially-aligned maps. |
GLIGEN freezes pre-trained model weights and introduces trainable gated Transformer layers to inject grounding information. This approach preserves existing knowledge while enabling the integration of new conditions like bounding boxes, image prompts, keypoints, and spatially-aligned maps. |
Achieves open-world grounded text-to-image generation, enabling the synthesis of novel localized concepts.
Outperforms existing supervised layout-to-image generation baselines on COCO and LVIS benchmarks, highlighting the effectiveness of building upon large pre-trained generative models.
Demonstrates generalization to unseen object categories and diverse grounding conditions. |
The generalization ability for keypoint grounding across different object categories is limited.
Further research can explore incorporating more complex grounding conditions and refining the interplay between text and grounding inputs. |
text-to-image generation, diffusion models, grounding, open-world learning, controllable image synthesis |
2301.06958
Report |
RILS: Masked Visual Reconstruction in Language Semantic Space |
Shusheng Yang, Yixiao Ge, Kun Yi, Dian Li, Ying Shan, Xiaohu Qie, Xinggang Wang |
Both masked image modeling (MIM) and natural language supervision have
facilitated the progress of transferable visual pre-training. In this work, we
seek the synergy between two paradigms and study the emerging properties when
MIM meets natural language supervision. To this end, we present a novel masked
visual Reconstruction In Language semantic Space (RILS) pre-training framework,
in which sentence representations, encoded by the text encoder, serve as
prototypes to transform the vision-only signals into patch-sentence
probabilities as semantically meaningful MIM reconstruction targets. The vision
models can therefore capture useful components with structured information by
predicting proper semantic of masked tokens. Better visual representations
could, in turn, improve the text encoder via the image-text alignment
objective, which is essential for the effective MIM target transformation.
Extensive experimental results demonstrate that our method not only enjoys the
best of previous MIM and CLIP but also achieves further improvements on various
tasks due to their mutual benefits. RILS exhibits advanced transferability on
downstream classification, detection, and segmentation, especially for low-shot
regimes. Code will be made available at https://github.com/hustvl/RILS. |
This paper introduces RILS, a novel masked visual Reconstruction In Language semantic Space pre-training framework, that combines the strengths of masked image modeling (MIM) and natural language supervision for improved visual representation learning. |
This work addresses the limitations of independently using MIM or natural language supervision, aiming to leverage their synergistic potential for more transferable and scalable visual pre-training. |
RILS utilizes a dual-encoder architecture (vision and language) with an asymmetric encoder-decoder design for the vision model. It leverages text representations as prototypes to map masked visual tokens into probability distributions within the language semantic space. RILS employs both image-text contrastive loss and a novel masked visual reconstruction loss in the language space. |
RILS achieves state-of-the-art performance on various downstream tasks, including image classification, object detection, and semantic segmentation.
RILS exhibits strong transferability, particularly in low-shot learning scenarios, demonstrating its ability to learn from limited labeled data.
RILS shows superior zero-shot image classification and image-text retrieval performance, highlighting its capacity to capture rich semantic information. |
The paper acknowledges the potential for further scaling up RILS in terms of both model and data size.
Future work could explore incorporating more sophisticated techniques like multi-crop augmentation and exponential moving average (EMA) into RILS for potential performance gains. |
masked image modeling, natural language supervision, vision-language pre-training, transfer learning, zero-shot learning |
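A rough sketch of the patch-sentence probabilities mentioned in the RILS summary above: a patch feature is compared against sentence embeddings acting as prototypes, and a softmax over the similarities gives a distribution in the language space. The cosine similarity, the temperature of 0.07, and all function names are assumptions made for illustration, not details confirmed by the summary.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def patch_sentence_probs(patch_feature, sentence_prototypes, temperature=0.07):
    """Softmax over cosine similarities between one patch and each sentence."""
    p = l2_normalize(patch_feature)
    logits = [
        sum(a * b for a, b in zip(p, l2_normalize(s))) / temperature
        for s in sentence_prototypes
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs):
    """Cross-entropy between a target and a predicted distribution."""
    return -sum(
        t * math.log(q) for t, q in zip(target_probs, predicted_probs) if t > 0
    )
```

In a masked-reconstruction setup of this kind, the distribution computed from the unmasked patch would serve as the target and the distribution predicted for the masked patch would be scored with `cross_entropy`.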
2301.06871
Report |
Denoising Diffusion Probabilistic Models as a Defense against Adversarial Attacks |
Lars Lien Ankile, Anna Midgley, Sebastian Weisshaar |
Neural Networks are infamously sensitive to small perturbations in their
inputs, making them vulnerable to adversarial attacks. This project evaluates
the performance of Denoising Diffusion Probabilistic Models (DDPM) as a
purification technique to defend against adversarial attacks. This works by
adding noise to an adversarial example before removing it through the reverse
process of the diffusion model. We evaluate the approach on the PatchCamelyon
data set for histopathologic scans of lymph node sections and find an
improvement of the robust accuracy by up to 88% of the original model's
accuracy, constituting a considerable improvement over the vanilla model and
our baselines. The project code is located at
https://github.com/ankile/Adversarial-Diffusion. |
This paper evaluates the effectiveness of Denoising Diffusion Probabilistic Models (DDPMs) as a purification technique to defend against adversarial attacks on image classification models, particularly in histopathology. |
Robustness against adversarial attacks is crucial for reliable deployment of deep learning models, especially in sensitive domains like medical image analysis, where small perturbations can lead to misdiagnosis. |
The study uses a DDPM to add noise to adversarial examples generated from the PatchCamelyon histopathology dataset and then removes the noise through the reverse diffusion process, aiming to purify the image and allow for correct classification by a ResNet model. The approach is compared to baseline methods including Gaussian noise addition and adversarial training. |
DDPM purification significantly improves robust accuracy against adversarial attacks, recovering up to 88% of the original model's accuracy on adversarial examples.
The method outperforms baseline defenses, including simple noise addition and adversarial training.
The study identifies a trade-off between standard accuracy and robust accuracy for adversarial training, highlighting the challenge of balancing performance on clean and adversarial data. |
The chosen noise level, while enabling faster inference, could be further optimized to achieve even better robust accuracy.
The study focuses on a specific dataset (PatchCamelyon) and a single attack type, and future work could explore generalizability to other datasets, attack strategies, and medical imaging modalities. |
adversarial attacks, diffusion models, image classification, histopathology, robustness |
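The purification loop described in the DDPM defense summary above (add noise to the adversarial input, then run the reverse process) can be sketched as follows. The closed-form forward step is the standard DDPM formulation; `denoise_step` is a hypothetical callback standing in for a trained reverse-diffusion step, and the beta schedule is supplied by the caller rather than taken from the paper.

```python
import math
import random


def alpha_bar(t, betas):
    """Cumulative product of (1 - beta) over the first t steps."""
    prod = 1.0
    for beta in betas[:t]:
        prod *= 1.0 - beta
    return prod


def forward_noise(x0, t, betas, rng):
    """Sample x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps, eps ~ N(0, 1)."""
    a = alpha_bar(t, betas)
    return [
        math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in x0
    ]


def purify(x_adv, t_star, betas, denoise_step, rng):
    """Noise the input up to step t_star, then denoise back down to step 0.

    denoise_step(x, t) is assumed to map x_t to x_{t-1} using a trained model.
    """
    x = forward_noise(x_adv, t_star, betas, rng)
    for t in range(t_star, 0, -1):
        x = denoise_step(x, t)
    return x
```

The purified output would then be passed to the classifier. A smaller `t_star` means fewer reverse steps and faster inference, at the risk of leaving more of the adversarial perturbation in place.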
2301.06782
Report |
A Large-Scale Outdoor Multi-modal Dataset and Benchmark for Novel View Synthesis and Implicit Scene Reconstruction |
Chongshan Lu, Fukun Yin, Xin Chen, Tao Chen, Gang YU, Jiayuan Fan |
Neural Radiance Fields (NeRF) has achieved impressive results in single
object scene reconstruction and novel view synthesis, which have been
demonstrated on many single modality and single object focused indoor scene
datasets like DTU, BMVS, and NeRF Synthetic. However, the study of NeRF on
large-scale outdoor scene reconstruction is still limited, as there is no
unified outdoor scene dataset for large-scale NeRF evaluation due to expensive
data acquisition and calibration costs. In this paper, we propose a large-scale
outdoor multi-modal dataset, OMMO dataset, containing complex land objects and
scenes with calibrated images, point clouds and prompt annotations. Meanwhile,
a new benchmark for several outdoor NeRF-based tasks is established, such as
novel view synthesis, surface reconstruction, and multi-modal NeRF. To create
the dataset, we capture and collect a large number of real fly-view videos and
select high-quality and high-resolution clips from them. Then we design a
quality review module to refine images, remove low-quality frames and
fail-to-calibrate scenes through a learning-based automatic evaluation plus
manual review. Finally, a number of volunteers are employed to add the text
descriptions for each scene and key-frame to meet the potential multi-modal
requirements in the future. Compared with existing NeRF datasets, our dataset
contains abundant real-world urban and natural scenes with various scales,
camera trajectories, and lighting conditions. Experiments show that our dataset
can benchmark most state-of-the-art NeRF methods on different tasks. We will
release the dataset and model weights very soon. |
The paper introduces OMMO, a large-scale outdoor multi-modal dataset for benchmarking Neural Radiance Fields (NeRF) in complex outdoor scenes. |
Existing NeRF datasets are limited to single objects, indoor scenes, or small-scale outdoor environments, hindering the development and evaluation of NeRF methods for large-scale outdoor scene reconstruction. |
The OMMO dataset is created by collecting and curating real fly-view videos from various sources, including YouTube and drone captures. It includes quality review, manual annotation, and scene representation generation using methods like Mega-NeRF and Colmap. |
OMMO dataset contains 33 diverse outdoor scenes with over 14K calibrated images, surpassing existing datasets in quantity and diversity.
Benchmarks for novel view synthesis demonstrate the dataset's ability to support various NeRF methods, with Mip-NeRF 360 showing superior performance.
Analysis of scene representation benchmarks highlights the challenges in reconstructing large-scale outdoor scenes with high fidelity, indicating an area for future research. |
The dataset currently has a limited number of scenes with low-light, rain, and fog conditions.
Future work involves expanding the dataset with more diverse scenes and exploring advanced reconstruction methods. |
neural radiance fields, nerf, outdoor dataset, novel view synthesis, scene representation |
2301.06281
Report |
DPE: Disentanglement of Pose and Expression for General Video Portrait Editing |
Youxin Pang, Yong Zhang, Weize Quan, Yanbo Fan, Xiaodong Cun, Ying Shan, Dong-ming Yan |
One-shot video-driven talking face generation aims at producing a synthetic
talking video by transferring the facial motion from a video to an arbitrary
portrait image. Head pose and facial expression are always entangled in facial
motion and transferred simultaneously. However, the entanglement sets up a
barrier for these methods to be used in video portrait editing directly, where
it may be necessary to modify only the expression while keeping the pose
unchanged. One challenge of decoupling pose and expression is the lack of
paired data, such as the same pose but different expressions. Only a few
methods attempt to tackle this challenge with the help of 3D Morphable Models
(3DMMs) for explicit disentanglement. But 3DMMs are not accurate enough to
capture facial details due to the limited number of blendshapes, which has side
effects on motion transfer. In this paper, we introduce a novel self-supervised
disentanglement framework to decouple pose and expression without 3DMMs and
paired data, which consists of a motion editing module, a pose generator, and
an expression generator. The editing module projects faces into a latent space
where pose motion and expression motion can be disentangled, and the pose or
expression transfer can be performed in the latent space conveniently via
addition. The two generators render the modified latent codes to images,
respectively. Moreover, to guarantee the disentanglement, we propose a
bidirectional cyclic training strategy with well-designed constraints.
Evaluations demonstrate our method can control pose or expression independently
and be used for general video editing. |
This paper presents a novel self-supervised disentanglement framework for decoupling pose and expression in facial motion for general video portrait editing. |
Existing one-shot talking face generation methods struggle to transfer pose and expression independently due to their entanglement, limiting their application in video editing tasks where modifying expression while maintaining pose is crucial. |
The framework consists of a motion editing module, a pose generator, and an expression generator. It employs a bidirectional cyclic training strategy with self-reconstruction constraints to disentangle pose and expression motion in a latent space, enabling independent editing via addition. |
The method achieves independent control over pose and expression, enabling seamless video portrait editing by pasting edited faces back into original videos.
It outperforms 3DMM-based methods in preserving facial expression details and identity during editing.
The method shows comparable performance to state-of-the-art one-shot talking face generation approaches. |
Pose preservation performance is slightly worse than the 3DMM-based method, PIRender.
Expression motion, being more local and subtle, is harder to learn than pose motion, demanding further refinement. |
talking face generation, video portrait editing, disentanglement, self-supervised learning, motion transfer |
2301.06267
Report |
Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models |
Zhiqiu Lin, Samuel Yu, Zhiyi Kuang, Deepak Pathak, Deva Ramanan |
The ability to quickly learn a new task with minimal instruction - known as
few-shot learning - is a central aspect of intelligent agents. Classical
few-shot benchmarks make use of few-shot samples from a single modality, but
such samples may not be sufficient to characterize an entire concept class. In
contrast, humans use cross-modal information to learn new concepts efficiently.
In this work, we demonstrate that one can indeed build a better visual
dog classifier by reading about dogs and listening to them
bark. To do so, we exploit the fact that recent multimodal foundation models
such as CLIP are inherently cross-modal, mapping different modalities to the
same representation space. Specifically, we propose a simple cross-modal
adaptation approach that learns from few-shot examples spanning different
modalities. By repurposing class names as additional one-shot training samples,
we achieve SOTA results with an embarrassingly simple linear classifier for
vision-language adaptation. Furthermore, we show that our approach can benefit
existing methods such as prefix tuning, adapters, and classifier ensembling.
Finally, to explore other modalities beyond vision and language, we construct
the first (to our knowledge) audiovisual few-shot benchmark and use cross-modal
training to improve the performance of both image and audio classification. |
This paper proposes a cross-modal adaptation approach for few-shot learning, leveraging multimodal models like CLIP. It utilizes examples from different modalities (e.g., text labels, audio) as additional training samples, effectively converting an "n-shot" problem into an "(n+1)-shot" problem. |
The method addresses the ambiguity inherent in traditional few-shot learning by incorporating cross-modal information, mimicking how humans utilize multiple senses for concept learning. |
The approach leverages pre-trained multimodal models (CLIP and AudioCLIP) that map different modalities to the same representation space. It treats examples from additional modalities as supplementary training data, jointly optimizing a linear classifier with data from all modalities. |
Cross-modal adaptation achieves state-of-the-art results on 11 image classification benchmarks using a simple linear classifier.
The method improves the performance of existing few-shot adaptation methods like prompting, adapters, and robust finetuning.
Audiovisual adaptation is explored, showing improvement in both image and audio classification tasks on a newly created ImageNet-ESC benchmark. |
Cross-modal adaptation may be less effective when model representations are not well-aligned or sufficiently trained (e.g. limited audio data).
The paper primarily focuses on uni-modal inference tasks (e.g. image classification), leaving exploration of multimodal test sets (e.g. video) for future work. |
cross-modal learning, few-shot learning, multimodal models, clip, audioclip |
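The idea in the cross-modal few-shot summary above, treating a class-name embedding as one more training sample in the shared space, can be sketched with a tiny softmax classifier. The 2-D vectors below are made-up stand-ins for features from a multimodal encoder, and the training loop, learning rate, and function names are illustrative assumptions rather than the authors' implementation.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_linear_classifier(samples, num_classes, lr=0.5, epochs=200):
    """Fit a bias-free linear softmax classifier with plain SGD.

    samples is a list of (feature_vector, class_index) pairs; the pairs may
    come from any modality as long as they share one embedding space.
    """
    dim = len(samples[0][0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    for _ in range(epochs):
        for x, y in samples:
            scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]
            probs = softmax(scores)
            for c in range(num_classes):
                grad = probs[c] - (1.0 if c == y else 0.0)
                for i in range(dim):
                    weights[c][i] -= lr * grad * x[i]
    return weights


def predict(weights, x):
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]
    return scores.index(max(scores))


# Toy 1-shot setup: one image feature per class ...
image_shots = [([0.9, 0.1], 0), ([0.2, 0.8], 1)]
# ... plus one text feature per class (the class name), giving "2-shot" data.
text_shots = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]

weights = train_linear_classifier(image_shots + text_shots, num_classes=2)
```

At test time only image features are classified, so the text samples act purely as extra supervision during training.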
2301.06052
Report |
T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations |
Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Shaoli Huang, Yong Zhang, Hongwei Zhao, Hongtao Lu, Xi Shen |
In this work, we investigate a simple, must-know conditional generative
framework based on Vector Quantised-Variational AutoEncoder (VQ-VAE) and
Generative Pre-trained Transformer (GPT) for human motion generation from
textual descriptions. We show that a simple CNN-based VQ-VAE with commonly
used training recipes (EMA and Code Reset) allows us to obtain high-quality
discrete representations. For GPT, we incorporate a simple corruption strategy
during the training to alleviate training-testing discrepancy. Despite its
simplicity, our T2M-GPT shows better performance than competitive approaches,
including recent diffusion-based approaches. For example, on HumanML3D, which
is currently the largest dataset, we achieve comparable performance on the
consistency between text and generated motion (R-Precision), but with FID 0.116
largely outperforming MotionDiffuse of 0.630. Additionally, we conduct analyses
on HumanML3D and observe that the dataset size is a limitation of our approach.
Our work suggests that VQ-VAE still remains a competitive approach for human
motion generation. |
This paper proposes T2M-GPT, a simple yet effective two-stage text-to-motion generation framework based on VQ-VAE and GPT that utilizes discrete representations. |
Generating human motion from text is crucial for various applications like gaming and animation but remains challenging due to the modality gap between language and motion. |
The framework first learns a mapping between motion data and discrete code sequences using a CNN-based VQ-VAE with EMA and Code Reset techniques. Subsequently, a GPT-like model generates code indices from text embeddings, which are then decoded into motions. |
T2M-GPT achieves state-of-the-art performance on HumanML3D and KIT-ML datasets, showing competitive or superior results compared to diffusion-based methods.
The study demonstrates that VQ-VAE with proper training strategies remains a strong approach for motion generation.
Analysis suggests that model performance can be further improved with larger datasets. |
The model may miss details in excessively long text descriptions.
Generated motions can exhibit slight jittering, requiring further refinement in VQ-VAE architecture. |
text-to-motion generation, vq-vae, gpt, human motion synthesis, discrete representation learning |
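Two pieces of the T2M-GPT pipeline summarized above lend themselves to a short sketch: mapping a latent vector to its nearest codebook entry, and corrupting a sequence of code indices during training. The summary does not specify the corruption rule, so random replacement with a fixed probability is an assumption here, as are the function names.

```python
import random


def quantize(latent, codebook):
    """Return the index of the codebook vector closest to `latent`."""
    dists = [
        sum((a - b) ** 2 for a, b in zip(latent, code)) for code in codebook
    ]
    return dists.index(min(dists))


def corrupt_indices(indices, codebook_size, corruption_prob, rng):
    """Replace each index with a random code with probability corruption_prob.

    Assumed form of the training-time corruption: the model sees imperfect
    histories, as it would when conditioning on its own predictions.
    """
    return [
        rng.randrange(codebook_size) if rng.random() < corruption_prob else idx
        for idx in indices
    ]
```

For example, `quantize([0.9, 0.1], [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])` selects the second entry, since it has the smallest squared distance to the latent.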
2301.06018
Report |
CMAE-V: Contrastive Masked Autoencoders for Video Action Recognition |
Cheng-Ze Lu, Xiaojie Jin, Zhicheng Huang, Qibin Hou, Ming-Ming Cheng, Jiashi Feng |
Contrastive Masked Autoencoder (CMAE), as a new self-supervised framework,
has shown its potential of learning expressive feature representations in
visual image recognition. This work shows that CMAE also trivially generalizes
well on video action recognition without modifying the architecture and the
loss criterion. By directly replacing the original pixel shift with the
temporal shift, our CMAE for visual action recognition, CMAE-V for short, can
generate stronger feature representations than its counterpart based on pure
masked autoencoders. Notably, CMAE-V, with a hybrid architecture, can achieve
82.2% and 71.6% top-1 accuracy on the Kinetics-400 and Something-something V2
datasets, respectively. We hope this report could provide some informative
inspiration for future works. |
This paper proposes CMAE-V, a novel approach for video action recognition leveraging the Contrastive Masked Autoencoder (CMAE) framework. |
The authors demonstrate that CMAE can be effectively adapted for video action recognition without architectural modifications or loss function adjustments, achieving strong performance in self-supervised representation learning. |
The key innovation lies in replacing the spatial pixel shift in the original CMAE with a temporal shift for generating correlated augmented views. This adaptation effectively captures temporal correlations in videos, enabling the model to learn temporally invariant and semantically meaningful representations. |
CMAE-V achieves state-of-the-art performance on Kinetics-400 and Something-Something V2 datasets, outperforming previous self-supervised methods.
Replacing the vanilla ViT encoder with a hybrid convolutional ViT further boosts CMAE-V's performance, setting new benchmarks for both datasets.
The results highlight the effectiveness of incorporating contrastive learning within a masked autoencoder framework for video action recognition. |
The paper acknowledges the potential limitation of using a relatively simple temporal shift augmentation strategy and suggests exploring more sophisticated augmentation techniques.
Future work could focus on extending CMAE-V to other video understanding tasks beyond action recognition. |
video action recognition, contrastive learning, masked autoencoders, self-supervised learning, computer vision |
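The temporal shift described in the CMAE-V summary above, which replaces a spatial pixel shift with an offset in time between two views, might look like the following frame-index sampler. The clamping behavior, parameter names, and the choice to return plain index lists are assumptions for illustration; the summary does not give the exact sampling scheme.

```python
def sample_clip_pair(num_video_frames, clip_len, stride, start, shift):
    """Return frame indices for an anchor clip and a temporally shifted clip.

    Both clips have clip_len frames spaced `stride` apart. The second clip
    starts `shift` frames after the first, clamped to stay inside the video.
    """
    span = (clip_len - 1) * stride
    last_start = num_video_frames - 1 - span
    if last_start < 0:
        raise ValueError("video is too short for the requested clip")
    anchor_start = min(max(start, 0), last_start)
    shifted_start = min(max(anchor_start + shift, 0), last_start)
    anchor = [anchor_start + i * stride for i in range(clip_len)]
    shifted = [shifted_start + i * stride for i in range(clip_len)]
    return anchor, shifted
```

The two index lists would be used to cut correlated views from the same video, one for the masked branch and one for the contrastive branch.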
2301.06015
Report |
Diffusion-based Generation, Optimization, and Planning in 3D Scenes |
Siyuan Huang, Zan Wang, Puhao Li, Baoxiong Jia, Tengyu Liu, Yixin Zhu, Wei Liang, Song-Chun Zhu |
We introduce SceneDiffuser, a conditional generative model for 3D scene
understanding. SceneDiffuser provides a unified model for solving
scene-conditioned generation, optimization, and planning. In contrast to prior
works, SceneDiffuser is intrinsically scene-aware, physics-based, and
goal-oriented. With an iterative sampling strategy, SceneDiffuser jointly
formulates the scene-aware generation, physics-based optimization, and
goal-oriented planning via a diffusion-based denoising process in a fully
differentiable fashion. Such a design alleviates the discrepancies among
different modules and the posterior collapse of previous scene-conditioned
generative models. We evaluate SceneDiffuser with various 3D scene
understanding tasks, including human pose and motion generation, dexterous
grasp generation, path planning for 3D navigation, and motion planning for
robot arms. The results show significant improvements compared with previous
models, demonstrating the tremendous potential of SceneDiffuser for the broad
community of 3D scene understanding. |
This paper proposes SceneDiffuser, a novel conditional generative model for 3D scene understanding that unifies scene-conditioned generation, physics-based optimization, and goal-oriented planning. |
Existing methods suffer from posterior collapse in generation and lack a unified framework to address discrepancies among generation, optimization, and planning. |
SceneDiffuser leverages a diffusion model for scene-conditioned generation and integrates physics-based and goal-oriented objectives into an iterative guided-sampling framework. |
SceneDiffuser generates significantly more plausible human poses and motions in 3D scenes compared to CVAE-based methods.
It generates diverse and successful dexterous grasps for unseen objects, outperforming baselines in success rate and collision avoidance.
It demonstrates superior performance in path planning for 3D navigation and motion planning for robot arms, exhibiting better generalization and efficiency in long-horizon tasks. |
SceneDiffuser suffers from slow training and test speed compared to non-diffusion-based generative models.
The optimization and planning modules heavily rely on objective design, demanding significant effort in hyperparameter tuning. |
3d scene understanding, conditional generation, motion planning, diffusion models, physics-based optimization |
2301.05957
Report |
Towards Spatial Equilibrium Object Detection |
Zhaohui Zheng, Yuming Chen, Qibin Hou, Xiang Li, Ming-Ming Cheng |
Semantic objects are unevenly distributed over images. In this paper, we
study the spatial disequilibrium problem of modern object detectors and propose
to quantify this "spatial bias" by measuring the detection performance over
zones. Our analysis surprisingly shows that the spatial imbalance of objects
has a great impact on the detection performance, limiting the robustness of
detection applications. This motivates us to design a more generalized
measurement, termed Spatial equilibrium Precision (SP), to better characterize
the detection performance of object detectors. Furthermore, we also present a
spatial equilibrium label assignment (SELA) to alleviate the spatial
disequilibrium problem by injecting the prior spatial weight into the
optimization process of detectors. Extensive experiments on PASCAL VOC, MS
COCO, and 3 application datasets on face mask/fruit/helmet images demonstrate
the advantages of our method. Our findings challenge the conventional sense of
object detectors and show the indispensability of spatial equilibrium. We hope
these discoveries would stimulate the community to rethink how an excellent
object detector should be. All the source code, evaluation protocols, and the
tutorials are publicly available at https://github.com/Zzh-tju/ZoneEval |
This paper reveals the spatial disequilibrium problem in object detection, where detectors perform inconsistently across different image zones due to photographer's bias in datasets (objects concentrated in the center). |
Traditional metrics like Average Precision (AP) are inflated by central object performance, masking poor detection in outer regions. This spatial bias limits robustness in real-world applications. |
The paper introduces zone evaluation, dividing images into zones and calculating metrics (ZP) for each. It proposes Spatial equilibrium Precision (SP), weighted by zone area, for a more comprehensive assessment. |
Significant performance gaps exist between central and outer zones across various detectors and datasets.
Traditional AP is inflated, failing to reflect poor outer zone detection. Even excluding a small central area drastically lowers performance.
Proposed Spatial Equilibrium Label Assignment (SELA) improves SP and reduces performance variance across zones by re-balancing sampling. |
The current zone division is preliminary, and other designs remain to be explored.
SELA is a first step; other solutions, such as data augmentation or skewed spatial weights, warrant investigation. |
object detection, spatial bias, zone evaluation, spatial equilibrium precision (sp), spatial equilibrium label assignment (sela) |
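The area weighting behind Spatial equilibrium Precision, as described in the summary above, amounts to a weighted average of per-zone precisions. The sketch below assumes the zone precisions (ZP) and zone areas are already computed; the function name and the example numbers are hypothetical.

```python
def spatial_equilibrium_precision(zone_precisions, zone_areas):
    """Area-weighted average of per-zone precision values."""
    if len(zone_precisions) != len(zone_areas):
        raise ValueError("need one area per zone")
    total_area = sum(zone_areas)
    return sum(
        zp * area / total_area for zp, area in zip(zone_precisions, zone_areas)
    )


# Hypothetical example: a strong central zone and a weak, large outer zone.
sp = spatial_equilibrium_precision([40.0, 30.0, 20.0], [1.0, 1.0, 2.0])
```

Because the outer zone covers half the area in this made-up example, its low precision pulls the score down to 27.5, well below the central zone's 40.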
2301.05586
Report |
YOLOv6 v3.0: A Full-Scale Reloading |
Chuyi Li, Lulu Li, Yifei Geng, Hongliang Jiang, Meng Cheng, Bo Zhang, Zaidan Ke, Xiaoming Xu, Xiangxiang Chu |
The YOLO community has been in high spirits since our first two releases! By
the advent of Chinese New Year 2023, which sees the Year of the Rabbit, we
refurnish YOLOv6 with numerous novel enhancements on the network architecture
and the training scheme. This release is identified as YOLOv6 v3.0. For a
glimpse of performance, our YOLOv6-N hits 37.5% AP on the COCO dataset at a
throughput of 1187 FPS tested with an NVIDIA Tesla T4 GPU. YOLOv6-S strikes
45.0% AP at 484 FPS, outperforming other mainstream detectors at the same scale
(YOLOv5-S, YOLOv8-S, YOLOX-S and PPYOLOE-S). YOLOv6-M/L also achieve
better accuracy performance (50.0%/52.8% respectively) than other detectors at
a similar inference speed. Additionally, with an extended backbone and neck
design, our YOLOv6-L6 achieves the state-of-the-art accuracy in real-time.
Extensive experiments are carefully conducted to validate the effectiveness of
each improving component. Our code is made available at
https://github.com/meituan/YOLOv6. |
This paper introduces YOLOv6 v3.0, an enhanced object detection framework with improvements to network architecture and training schemes. |
The YOLO series is important for real-time object detection in industrial applications due to its balance between speed and accuracy. This work aims to push the boundaries of real-time object detection performance. |
The authors propose: 1) Bi-directional Concatenation (BiC) module and SimCSPSPPF block for improved neck design. 2) Anchor-aided training (AAT) strategy to leverage benefits of both anchor-based and anchor-free paradigms. 3) Extended backbone and neck with an extra stage for high-resolution inputs. 4) New self-distillation strategies for small and large models. |
YOLOv6-N achieves 37.5% AP at 1187 FPS on COCO, outperforming peers at the same scale.
YOLOv6-L6 achieves state-of-the-art accuracy in real-time object detection.
Extensive experiments validate the effectiveness of individual components like BiC, AAT, and self-distillation. |
The paper primarily focuses on speed and accuracy, with limited discussion on other aspects like model robustness.
Future work could explore adapting YOLOv6 for specific tasks and datasets beyond COCO. |
object detection, yolo, real-time, deep learning, computer vision |
2301.05499
Report |
CLIP the Gap: A Single Domain Generalization Approach for Object Detection |
Vidit Vidit, Martin Engilberge, Mathieu Salzmann |
Single Domain Generalization (SDG) tackles the problem of training a model on
a single source domain so that it generalizes to any unseen target domain.
While this has been well studied for image classification, the literature on
SDG object detection remains almost non-existent. To address the challenges of
simultaneously learning robust object localization and representation, we
propose to leverage a pre-trained vision-language model to introduce semantic
domain concepts via textual prompts. We achieve this via a semantic
augmentation strategy acting on the features extracted by the detector
backbone, as well as a text-based classification loss. Our experiments evidence
the benefits of our approach, outperforming by 10% the only existing SDG object
detection method, Single-DGOD [49], on their own diverse weather-driving
benchmark. |
This paper proposes a novel single domain generalization approach for object detection that leverages a pre-trained vision-language model (CLIP) to introduce semantic domain concepts via textual prompts. |
SDG for object detection is a nascent topic and poses additional challenges compared to image classification due to the need for robust object localization. |
The method uses textual prompts related to potential target domain concepts to perform semantic augmentations on image features extracted by the detector backbone. It also incorporates a text-based classification loss during training to further leverage the vision-language model. |
Outperforms the only existing SDG object detection method, Single-DGOD, by 10% on a diverse weather-driving benchmark.
Demonstrates consistent improvements across various target domains including day-foggy, night-clear, dusk-rainy, and night-rainy.
Shows the effectiveness of semantic augmentation with relevant prompts compared to random or no augmentation. |
The method assumes some prior knowledge about the potential domain gap to generate relevant textual prompts.
Future work includes exploring techniques to learn the prompts automatically, further enhancing generalization capabilities. |
single domain generalization, object detection, vision-language models, clip, semantic augmentation |
2301.05496
Report |
Learning Transformations To Reduce the Geometric Shift in Object Detection |
Vidit Vidit, Martin Engilberge, Mathieu Salzmann |
The performance of modern object detectors drops when the test distribution
differs from the training one. Most of the methods that address this focus on
object appearance changes caused by, e.g., different illumination conditions,
or gaps between synthetic and real images. Here, by contrast, we tackle
geometric shifts emerging from variations in the image capture process, or due
to the constraints of the environment causing differences in the apparent
geometry of the content itself. We introduce a self-training approach that
learns a set of geometric transformations to minimize these shifts without
leveraging any labeled data in the new domain, nor any information about the
cameras. We evaluate our method on two different shifts, i.e., a camera's field
of view (FoV) change and a viewpoint change. Our results evidence that learning
geometric transformations helps detectors to perform better in the target
domains. |
This paper proposes a self-training approach to handle geometric shifts in object detection for unsupervised domain adaptation. The method learns a set of geometric transformations, modeled as multiple homographies, to reduce the domain gap without requiring labeled target data or camera information. |
Object detectors often suffer performance drops when tested on data with geometric differences from the training set. Existing domain adaptation methods mainly focus on appearance changes but neglect geometric shifts caused by variations in camera viewpoint, field-of-view, or object scale. |
The proposed method uses an aggregator block within a FasterRCNN architecture to combine features from multiple homography-transformed images. The model is trained in three steps: source-only detector training, aggregator training on source data with random transformations, and joint optimization of transformations using a Mean Teacher strategy on source and unlabeled target data. |
The approach achieves state-of-the-art results on car detection for field-of-view adaptation between Cityscapes and KITTI datasets, outperforming methods relying on camera information.
It generalizes well to different degrees of field-of-view changes, demonstrating consistent improvement over baselines.
The method effectively handles viewpoint adaptation for pedestrian detection, showing its applicability to diverse geometric shifts. |
The computational cost increases with the number of homographies used.
Current implementation optimizes transformations at the dataset level; image-specific adaptation could be beneficial. |
unsupervised domain adaptation, object detection, geometric shift, homography, mean teacher |
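The summary above says the learned geometric transformations are modeled as homographies. As a minimal sketch of what applying one homography to image coordinates involves, assuming NumPy: the function name `apply_homography` and the example matrix `H_zoom` are illustrative and are not taken from the paper.

```python
import numpy as np

def apply_homography(H, points):
    """Map an (N, 2) array of pixel coordinates through a 3x3 homography H."""
    points = np.asarray(points, dtype=float)
    # Lift to homogeneous coordinates: (x, y) -> (x, y, 1).
    homogeneous = np.hstack([points, np.ones((points.shape[0], 1))])
    mapped = homogeneous @ H.T
    # Divide by the last coordinate to return to pixel coordinates.
    return mapped[:, :2] / mapped[:, 2:3]

# Hypothetical example: a uniform 2x zoom about the origin, which
# loosely mimics a narrower field of view.
H_zoom = np.diag([2.0, 2.0, 1.0])
corners = np.array([[0.0, 0.0], [100.0, 0.0], [100.0, 50.0]])
warped = apply_homography(H_zoom, corners)
```

In a detection setting the same mapping would also have to be applied to box corners so that annotations stay aligned with the warped image.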
2301.05225
Report |
Domain Expansion of Image Generators |
Yotam Nitzan, Michaël Gharbi, Richard Zhang, Taesung Park, Jun-Yan Zhu, Daniel Cohen-Or, Eli Shechtman |
Can one inject new concepts into an already trained generative model, while
respecting its existing structure and knowledge? We propose a new task - domain
expansion - to address this. Given a pretrained generator and novel (but
related) domains, we expand the generator to jointly model all domains, old and
new, harmoniously. First, we note the generator contains a meaningful,
pretrained latent space. Is it possible to minimally perturb this hard-earned
representation, while maximally representing the new domains? Interestingly, we
find that the latent space offers unused, "dormant" directions, which do not
affect the output. This provides an opportunity: By "repurposing" these
directions, we can represent new domains without perturbing the original
representation. In fact, we find that pretrained generators have the capacity
to add several - even hundreds - of new domains! Using our expansion method,
one "expanded" model can supersede numerous domain-specific models, without
expanding the model size. Additionally, a single expanded generator natively
supports smooth transitions between domains, as well as composition of domains.
Code and project page available at
https://yotamnitzan.github.io/domain-expansion/. |
Introduces domain expansion, a novel task aiming to augment the image generation space of a pre-trained model with new, related domains, without overriding its original generation capabilities. |
Addresses the limitations of domain adaptation, which typically erases the original domain after adapting to a new one. Domain expansion allows a single generator to model multiple domains harmoniously, enabling applications like domain composition and fine-grained control over image generation. |
Structures the latent space by identifying dormant directions that don't affect image generation. These directions are repurposed to represent new domains by applying domain adaptation methods only to specific subspaces associated with those domains. Regularization techniques ensure the preservation of the original domain. |
Repurposed latent directions successfully encode new domains, enabling smooth transitions and extrapolations beyond trained concepts.
Expanded generator maintains high-quality generation for both original and new domains, comparable to domain-specific generators.
Disentangled representation allows for meaningful composition of multiple domains, even those learned from different adaptation tasks. |
The number of domains that can be expanded might be ultimately limited by model capacity.
Current method requires roughly linear increase in training time with the number of domains. |
generative models, domain adaptation, latent space, disentanglement, compositionality |
2301.05187
Report |
WIRE: Wavelet Implicit Neural Representations |
Vishwanath Saragadam, Daniel LeJeune, Jasper Tan, Guha Balakrishnan, Ashok Veeraraghavan, Richard G. Baraniuk |
Implicit neural representations (INRs) have recently advanced numerous
vision-related areas. INR performance depends strongly on the choice of the
nonlinear activation function employed in its multilayer perceptron (MLP)
network. A wide range of nonlinearities have been explored, but, unfortunately,
current INRs designed to have high accuracy also suffer from poor robustness
(to signal noise, parameter variation, etc.). Inspired by harmonic analysis, we
develop a new, highly accurate and robust INR that does not exhibit this
tradeoff. Wavelet Implicit neural REpresentation (WIRE) uses a continuous
complex Gabor wavelet activation function that is well-known to be optimally
concentrated in space-frequency and to have excellent biases for representing
images. A wide range of experiments (image denoising, image inpainting,
super-resolution, computed tomography reconstruction, image overfitting, and
novel view synthesis with neural radiance fields) demonstrate that WIRE defines
the new state of the art in INR accuracy, training time, and robustness. |
Introduces WIRE (Wavelet Implicit neural REpresentation), a novel INR employing a continuous complex Gabor wavelet activation function for superior signal representation. |
Addresses limitations of existing INRs, such as lack of robustness to noise, slow training times, and limitations in accuracy for fine details, especially in high-dimensional data. |
Leverages the optimal space-frequency concentration of Gabor wavelets, offering advantages over purely sinusoidal (SIREN) or Gaussian activations, with a focus on visual signal representation. |
WIRE excels in diverse applications including image denoising/inpainting, super-resolution, CT reconstruction, and NeRF.
Demonstrates faster training, higher accuracy (e.g., PSNR, SSIM), and improved robustness to noise compared to existing INR techniques.
Proposes a multi-dimensional extension of WIRE further enhancing performance in tasks like denoising and super-resolution. |
Limited exploration of WIRE for non-visual signals, focusing primarily on image-based tasks.
Computational cost of complex-valued operations, although mitigated by reducing hidden features, might still pose challenges for real-time applications.
Future work will explore applications in areas like audio processing and time series analysis. |
implicit neural representations, gabor wavelets, inverse problems, image processing, neural radiance fields |
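The record above describes a continuous complex Gabor wavelet used as the MLP activation. A minimal NumPy sketch of such an activation, assuming the common form exp(j*omega0*x) * exp(-|s0*x|^2): the parameter names `omega0` and `s0`, their default values, and the helper `wire_layer` are assumptions made for illustration.

```python
import numpy as np

def complex_gabor(x, omega0=10.0, s0=10.0):
    """Complex Gabor wavelet: a complex sinusoid under a Gaussian envelope."""
    x = np.asarray(x, dtype=complex)
    # omega0 sets the oscillation frequency, s0 the envelope width.
    return np.exp(1j * omega0 * x - np.abs(s0 * x) ** 2)

def wire_layer(x, W, b, omega0=10.0, s0=10.0):
    """One hypothetical layer: affine map followed by the Gabor activation."""
    return complex_gabor(x @ W + b, omega0, s0)

# For a real input the magnitude is just the Gaussian envelope.
magnitude = np.abs(complex_gabor(0.1, omega0=10.0, s0=10.0))
```

The Gaussian envelope is what localizes the response in space, while the sinusoid localizes it in frequency.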
2301.05065
Report |
Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks |
Xinsong Zhang, Yan Zeng, Jipeng Zhang, Hang Li |
Foundation models or pre-trained models have substantially improved the
performance of various language, vision, and vision-language understanding
tasks. However, existing foundation models perform best in only one
type of task, namely language, vision, or vision-language. It is still an open
question whether it is possible to construct a foundation model performing the
best for all the understanding tasks, which we call a general foundation model.
In this paper, we propose a new general foundation model, X-FM (the
X-Foundation Model). X-FM has one language encoder, one vision encoder, and one
fusion encoder, as well as a new training method. The training method includes
two new techniques for learning X-FM from text, image, and image-text pair
data. One is to stop gradients from the vision-language training when learning
the language encoder. The other is to leverage the vision-language training to
guide the learning of the vision encoder. Extensive experiments on benchmark
datasets show that X-FM can significantly outperform existing general
foundation models and perform better than or comparable to existing foundation
models specifically for language, vision, or vision-language understanding.
Code and pre-trained models are released at
https://github.com/zhangxinsong-nlp/XFM. |
This paper introduces X-FM (the X-Foundation Model), a novel general foundation model designed to excel in language, vision, and vision-language understanding tasks. |
Existing foundation models typically specialize in a single modality, making it challenging to achieve state-of-the-art performance across all understanding tasks with a single model. |
X-FM employs three encoders (language, vision, and fusion) and leverages two training techniques: (1) stopping gradients from vision-language training to the language encoder and (2) using vision-language training to guide masked image modeling for the vision encoder. |
X-FM significantly outperforms previous general foundation models on 23 language, vision, and vision-language understanding tasks.
X-FM achieves comparable or even superior performance to state-of-the-art models specifically designed for language, vision, or vision-language tasks.
Ablation studies demonstrate the effectiveness of the proposed training techniques in enhancing both uni-modal and multi-modal understanding capabilities. |
The training process is computationally expensive, requiring optimization for efficiency.
Future work will investigate scalability by exploring larger model sizes and datasets. |
foundation models, multimodality, vision-language understanding, masked image modeling, transfer learning |
2301.04650
Report |
Geometry-biased Transformers for Novel View Synthesis |
Naveen Venkat, Mayank Agarwal, Maneesh Singh, Shubham Tulsiani |
We tackle the task of synthesizing novel views of an object given a few input
images and associated camera viewpoints. Our work is inspired by recent
'geometry-free' approaches where multi-view images are encoded as a (global)
set-latent representation, which is then used to predict the color for
arbitrary query rays. While this representation yields (coarsely) accurate
images corresponding to novel viewpoints, the lack of geometric reasoning
limits the quality of these outputs. To overcome this limitation, we propose
'Geometry-biased Transformers' (GBTs) that incorporate geometric inductive
biases in the set-latent representation-based inference to encourage multi-view
geometric consistency. We induce the geometric bias by augmenting the
dot-product attention mechanism to also incorporate 3D distances between rays
associated with tokens as a learnable bias. We find that this, along with
camera-aware embeddings as input, allows our models to generate significantly
more accurate outputs. We validate our approach on the real-world CO3D dataset,
where we train our system over 10 categories and evaluate its view-synthesis
ability for novel objects as well as unseen categories. We empirically validate
the benefits of the proposed geometric biases and show that our approach
significantly improves over prior works. |
This paper proposes Geometry-biased Transformers (GBTs) for novel view synthesis, which incorporate geometric inductive biases into set-latent representation-based inference to improve multi-view consistency. |
Existing geometry-free methods, while able to capture global context, lack precision and struggle to render high-quality details. This work aims to address this limitation by introducing geometric reasoning into the process. |
GBTs leverage a ray-distance-based bias within the attention mechanism of Transformer layers. This bias guides both scene encoding and ray decoding stages to prioritize geometrically relevant context. The model uses a CNN for patch-level feature extraction, a GBT encoder for global scene representation, and a GBT decoder for pixel color prediction. |
GBTs outperform previous state-of-the-art methods in novel view synthesis on the CO3D dataset, demonstrating superior quality in rendering details and consistency.
The method exhibits strong generalization capabilities, achieving good performance on unseen object categories.
Analysis reveals that the geometric bias in the attention mechanism leads to more concentrated focus on relevant regions, resulting in finer details and improved rendering quality. |
Set-latent representation methods, including GBTs, still lag behind projection-based methods in predicting precise details, presenting an area for future improvement.
The reliance on camera viewpoints for inference might restrict the applicability of GBTs in real-world scenarios with unknown camera poses. |
novel view synthesis, transformers, geometric reasoning, set-latent representation, attention mechanism |
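The record above describes dot-product attention augmented with a bias based on 3D distances between the rays associated with tokens. A minimal NumPy sketch of that idea, assuming the pairwise ray distances are already computed and that the bias is subtracted from the logits with a scalar weight `gamma` (learnable in the paper, fixed here): the exact functional form and all names are assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def geometry_biased_attention(Q, K, V, ray_dist, gamma=1.0):
    """Scaled dot-product attention with a distance penalty on the logits.

    ray_dist[i, j] is the distance between the ray of query i and the ray of key j.
    """
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d) - (gamma ** 2) * ray_dist
    weights = softmax(logits, axis=-1)
    return weights @ V, weights

# Toy example: identical content features, so only geometry decides the weights.
Q = np.zeros((2, 4))
K = np.zeros((3, 4))
V = np.eye(3)
ray_dist = np.array([[0.0, 1.0, 2.0],
                     [2.0, 1.0, 0.0]])
_, weights = geometry_biased_attention(Q, K, V, ray_dist, gamma=1.0)
```

With `gamma=0` this reduces to ordinary attention, which is one way to see the bias as an additive inductive prior rather than a hard constraint.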
2301.04647
Report |
EXIF as Language: Learning Cross-Modal Associations Between Images and Camera Metadata |
Chenhao Zheng, Ayush Shrivastava, Andrew Owens |
We learn a visual representation that captures information about the camera
that recorded a given photo. To do this, we train a multimodal embedding
between image patches and the EXIF metadata that cameras automatically insert
into image files. Our model represents this metadata by simply converting it to
text and then processing it with a transformer. The features that we learn
significantly outperform other self-supervised and supervised features on
downstream image forensics and calibration tasks. In particular, we
successfully localize spliced image regions "zero shot" by clustering the
visual embeddings for all of the patches within an image. |
The paper introduces a method for learning a visual representation that captures camera properties from image patches by associating them with EXIF metadata using a contrastive learning framework. The metadata is treated as a language-like modality and processed using a transformer. |
Understanding the imaging properties of an image is crucial for various tasks, including image forensics, 3D reconstruction, and image generation, complementing the understanding of semantic content. |
The method involves training a joint embedding between image patches and EXIF metadata, which is converted to text and processed using a transformer. The model learns to associate visual features with camera properties described in the metadata. |
Camera metadata serves as effective supervision for self-supervised representation learning.
Image-metadata embeddings prove valuable for image forensics and camera understanding tasks, outperforming alternative features.
Image manipulations, such as splicing, can be detected 'zero shot' by identifying inconsistencies in patch embeddings derived from the learned representation. |
The model's performance might be limited to cameras and metadata present in the training dataset (YFCC100M).
The method's reliance on large patches might limit its effectiveness in detecting small splices. |
self-supervised learning, multimodal learning, image forensics, camera metadata, exif |
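The record above describes converting EXIF metadata to text and training a joint embedding between image patches and that text with a contrastive objective. A minimal NumPy sketch of both pieces, assuming a symmetric InfoNCE-style loss over a batch of already-computed embeddings: the text template in `exif_to_text`, the temperature value, and the function names are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

def exif_to_text(exif):
    """Flatten a metadata dict into one string (hypothetical template)."""
    return " ".join(f"{key}: {value}" for key, value in exif.items())

def log_softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities; row i of each input is a matched pair."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    idx = np.arange(logits.shape[0])
    loss_img_to_txt = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_txt_to_img = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_img_to_txt + loss_txt_to_img)

caption = exif_to_text({"Make": "Canon", "ISO": 100})
matched = contrastive_loss(np.eye(4), np.eye(4))
mismatched = contrastive_loss(np.eye(4), np.roll(np.eye(4), 1, axis=0))
```

In practice the embeddings would come from an image encoder and a text transformer; here identity matrices stand in for them so the behavior of the loss is easy to inspect.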
2301.04634
Report |
Street-View Image Generation from a Bird's-Eye View Layout |
Alexander Swerdlow, Runsheng Xu, Bolei Zhou |
Bird's-Eye View (BEV) Perception has received increasing attention in recent
years as it provides a concise and unified spatial representation across views
and benefits a diverse set of downstream driving applications. At the same
time, data-driven simulation for autonomous driving has been a focal point of
recent research but with few approaches that are both fully data-driven and
controllable. Instead of using perception data from real-life scenarios, an
ideal model for simulation would generate realistic street-view images that
align with a given HD map and traffic layout, a task that is critical for
visualizing complex traffic scenarios and developing robust perception models
for autonomous driving. In this paper, we propose BEVGen, a conditional
generative model that synthesizes a set of realistic and spatially consistent
surrounding images that match the BEV layout of a traffic scenario. BEVGen
incorporates a novel cross-view transformation with spatial attention design
which learns the relationship between cameras and map views to ensure their
consistency. We evaluate the proposed model on the challenging NuScenes and
Argoverse 2 datasets. After training, BEVGen can accurately render road and
lane lines, as well as generate traffic scenes with diverse weather
conditions and times of day. |
This paper introduces BEVGen, a novel generative model that synthesizes realistic and spatially consistent street-view images from a given Bird's-Eye View (BEV) layout. |
This work addresses the unexplored area of generative BEV perception, with applications in synthetic data generation for perception models, visualization of safety-critical situations, and editing of traffic scenes for autonomous driving development. |
BEVGen uses an autoregressive transformer with VQ-VAE encoders for images and BEV layouts. Spatial embeddings align image and BEV tokens, while a pairwise camera bias ensures image consistency and correspondence. |
BEVGen generates high-quality, diverse scenes including intersections, parking lots, and boulevards, with consistent weather and time of day across views.
Quantitative evaluation on NuScenes and Argoverse 2 shows superior performance over baselines in terms of FID score, road/vehicle mIoU, and View Consistency Score (VCS).
The model demonstrates promising applications in data augmentation for BEV segmentation and 3D detection, and in generating images from simulated BEV layouts for safety-critical scenario testing and sim2real applications. |
The model exhibits limitations in synthesizing small objects like vehicles, impacting downstream perception tasks.
Future work can explore improvements in small object synthesis, alternative architectures like diffusion models, and integration with scenario generation methods. |
generative model, bev perception, autonomous driving, data augmentation, scene synthesis |
2301.04628
Report |
Face Attribute Editing with Disentangled Latent Vectors |
Yusuf Dalva, Hamza Pehlivan, Cansu Moran, Öykü Irmak Hatipoğlu, Ayşegül Dündar |
We propose an image-to-image translation framework for facial attribute
editing with disentangled interpretable latent directions. The facial attribute
editing task faces the challenges of targeted attribute editing with
controllable strength and disentanglement in the representations of attributes
to preserve the other attributes during edits. For this goal, inspired by the
latent space factorization works of fixed pretrained GANs, we design the
attribute editing by latent space factorization, and for each attribute, we
learn a linear direction that is orthogonal to the others. We train these
directions with orthogonality constraints and disentanglement losses. To
project images to semantically organized latent spaces, we set an
encoder-decoder architecture with attention-based skip connections. We
extensively compare with previous image translation algorithms and editing with
pretrained GAN works. Our extensive experiments show that our method
significantly improves over the state of the art. Project page:
https://yusufdalva.github.io/vecgan |
This paper presents VecGAN++, an image-to-image translation framework for facial attribute editing that uses disentangled and interpretable latent directions. |
Facial attribute editing is challenging because it requires targeted modifications without affecting other attributes, and existing methods often struggle with disentanglement, controllability, or reconstruction quality. |
VecGAN++ uses an encoder-decoder architecture with latent space manipulation. It learns a linear direction for each attribute and performs translations via vector arithmetic in this space. The framework incorporates orthogonality constraints, disentanglement losses, and attention-based skip connections to improve attribute separation and image reconstruction. |
VecGAN++ achieves state-of-the-art results on facial attribute editing benchmarks, outperforming both end-to-end image translation methods and StyleGAN inversion-based editing techniques.
The learned latent directions provide controllable attribute editing, allowing for adjustments to the intensity of the modifications.
Analysis of the latent space reveals that the learned directions successfully capture semantic information, with projected style codes effectively classifying attributes like smile presence and hair color. |
While VecGAN++ demonstrates strong performance on pre-defined attributes, it requires training for each specific attribute, limiting its flexibility compared to StyleGAN inversion methods that can leverage pre-trained models for diverse edits.
The separation of hair color attributes, while improved, still presents challenges due to its continuous nature, suggesting an area for future exploration. |
image translation, generative adversarial networks, facial attribute editing, latent space manipulation, disentanglement |
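The record above describes learning one linear latent direction per attribute, keeping the directions orthogonal, and editing by vector arithmetic with controllable strength. A minimal NumPy sketch of those two ingredients, assuming a plain additive edit and a Gram-matrix penalty: both function names and the exact penalty form are assumptions made for illustration.

```python
import numpy as np

def edit_latent(e, direction, alpha):
    """Shift latent code e along a unit-normalized attribute direction by strength alpha."""
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    return e + alpha * d

def orthogonality_penalty(D):
    """Sum of squared off-diagonal cosine similarities between the rows of D."""
    Dn = D / np.linalg.norm(D, axis=1, keepdims=True)
    gram = Dn @ Dn.T
    off_diag = gram - np.diag(np.diag(gram))
    return float((off_diag ** 2).sum())

# Two orthogonal directions incur no penalty; overlapping ones do.
penalty_orthogonal = orthogonality_penalty(np.eye(3)[:2])
penalty_overlapping = orthogonality_penalty(np.array([[1.0, 0.0], [1.0, 1.0]]))
edited = edit_latent(np.zeros(3), [2.0, 0.0, 0.0], alpha=1.5)
```

When the directions are orthogonal, moving along one leaves the projection onto the others unchanged, which is the property that keeps untargeted attributes fixed during an edit.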
2301.04604
Report |
LinkGAN: Linking GAN Latents to Pixels for Controllable Image Synthesis |
Jiapeng Zhu, Ceyuan Yang, Yujun Shen, Zifan Shi, Bo Dai, Deli Zhao, Qifeng Chen |
This work presents an easy-to-use regularizer for GAN training, which helps
explicitly link some axes of the latent space to a set of pixels in the
synthesized image. Establishing such a connection facilitates a more convenient
local control of GAN generation, where users can alter the image content only
within a spatial area simply by partially resampling the latent code.
Experimental results confirm four appealing properties of our regularizer,
which we call LinkGAN. (1) The latent-pixel linkage is applicable to either a
fixed region (i.e., same for all instances) or a particular semantic
category (i.e., varying across instances), like the sky. (2) Two or more
regions can be independently linked to different latent axes, which further
supports joint control. (3) Our regularizer can improve the spatial
controllability of both 2D and 3D-aware GAN models, barely sacrificing the
synthesis performance. (4) The models trained with our regularizer are
compatible with GAN inversion techniques and maintain editability on real
images. |
Proposes LinkGAN, a regularizer for GAN training that explicitly links specific latent space axes to image pixels, enabling precise local control over generated images by resampling the linked latent codes. |
Addresses limitations of existing GAN manipulation methods that rely on posterior discovery of latent semantics, which are often unstable, inaccurate, and inflexible. Enables more convenient and reliable control for image editing applications. |
Introduces a regularizer that minimizes the impact of perturbing a subset of latent codes on designated out-of-region pixels, while encouraging changes within the linked region. This effectively disentangles the influence of specific latent axes on chosen image areas. |
Successfully links arbitrary image regions to latent axes, enabling independent control over multiple areas.
Demonstrates effectiveness for both fixed regions and semantically defined areas (e.g., sky, car).
Improves controllability of 2D and 3D-aware GAN models without significantly sacrificing synthesis quality. |
The built linkage is not perfect, sometimes resulting in slight changes outside the targeted region or image inconsistencies after resampling.
Future work includes exploring methods to further improve linkage accuracy and address the inconsistency issue, potentially through incorporating stronger priors or adversarial training strategies. |
generative adversarial networks, image synthesis, local image editing, disentanglement, latent space manipulation |
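The record above describes a regularizer that perturbs a subset of latent axes, discourages the resulting change outside a designated region, and encourages change inside it. A toy NumPy sketch of the two quantities involved, assuming a squared pixel difference and a boolean region mask: the function `linkage_terms` and both toy generators are hypothetical stand-ins, and how the two terms are weighted in the actual loss is not specified here.

```python
import numpy as np

def linkage_terms(generator, z, linked_axes, region_mask, rng):
    """Resample the linked latent axes and measure pixel change outside and inside the region."""
    z_resampled = z.copy()
    z_resampled[linked_axes] = rng.standard_normal(len(linked_axes))
    sq_diff = (generator(z_resampled) - generator(z)) ** 2
    outside = sq_diff[~region_mask].mean()  # term a regularizer would minimize
    inside = sq_diff[region_mask].mean()    # term a regularizer would encourage
    return outside, inside

# Toy "generators" mapping 4 latent axes to 4 pixels (illustrative only).
def linked_generator(z):
    return z.copy()                 # pixel i depends only on latent axis i

def entangled_generator(z):
    return np.full(4, z.sum())      # every pixel depends on every axis

rng = np.random.default_rng(0)
z = rng.standard_normal(4)
mask = np.array([True, True, False, False])
out_linked, in_linked = linkage_terms(linked_generator, z, [0, 1], mask, rng)
out_entangled, _ = linkage_terms(entangled_generator, z, [0, 1], mask, rng)
```

The perfectly linked toy generator shows zero change outside the region, while the entangled one leaks change everywhere, which is the behavior the regularizer is meant to suppress.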
2301.04075
Report |
Benchmarking Robustness in Neural Radiance Fields |
Chen Wang, Angtian Wang, Junbo Li, Alan Yuille, Cihang Xie |
Neural Radiance Field (NeRF) has demonstrated excellent quality in novel view
synthesis, thanks to its ability to model 3D object geometries in a concise
formulation. However, current approaches to NeRF-based models rely on clean
images with accurate camera calibration, which can be difficult to obtain in
the real world, where data is often subject to corruption and distortion. In
this work, we provide the first comprehensive analysis of the robustness of
NeRF-based novel view synthesis algorithms in the presence of different types
of corruptions.
We find that NeRF-based models are significantly degraded in the presence of
corruption, and are more sensitive to a different set of corruptions than image
recognition models. Furthermore, we analyze the robustness of the feature
encoder in generalizable methods, which synthesize images using neural features
extracted via convolutional neural networks or transformers, and find that it
only contributes marginally to robustness. Finally, we reveal that standard
data augmentation techniques, which can significantly improve the robustness of
recognition models, do not help the robustness of NeRF-based models. We hope
that our findings will attract more researchers to study the robustness of
NeRF-based approaches and help to improve their performance in the real world. |
This paper presents the first comprehensive benchmark for evaluating the robustness of Neural Radiance Field (NeRF) based novel view synthesis models to visual corruptions. |
NeRF models are often applied to real-world scenarios where data corruption is common, yet their robustness to such corruption has not been previously studied. |
The authors construct two datasets, LLFF-C and Blender-C, by adding various types of corruptions to existing NeRF datasets. They benchmark seven representative NeRF models on these datasets, evaluating their performance using PSNR, SSIM, and LPIPS metrics. |
All benchmarked NeRF models exhibit significant performance degradation across all types of corruptions.
Scene-specific NeRF models are generally more robust than generalizable ones.
Standard image data augmentation techniques, while effective for image recognition, do not improve the robustness of NeRF models. |
The study primarily focuses on novel view synthesis and could be extended to other NeRF applications.
Future work could investigate the development of more robust NeRF training and pose estimation techniques. |
neural radiance fields, novel view synthesis, robustness, benchmarking, data corruption |
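The benchmark above adds corruptions to existing datasets and scores renderings with PSNR, SSIM, and LPIPS. A minimal NumPy sketch of the PSNR part, with additive Gaussian noise standing in as one example corruption: the helper names, the noise levels, and the assumption that images lie in [0, 1] are illustrative and not taken from the benchmark's code.

```python
import numpy as np

def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images with values in [0, max_value]."""
    reference = np.asarray(reference, dtype=float)
    rendered = np.asarray(rendered, dtype=float)
    mse = np.mean((reference - rendered) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)

def add_gaussian_noise(image, sigma, rng):
    """Example corruption: additive Gaussian noise, clipped back to [0, 1]."""
    return np.clip(image + rng.normal(0.0, sigma, image.shape), 0.0, 1.0)

rng = np.random.default_rng(0)
clean = np.full((32, 32), 0.5)
psnr_mild = psnr(clean, add_gaussian_noise(clean, 0.05, rng))
psnr_severe = psnr(clean, add_gaussian_noise(clean, 0.2, rng))
```

Higher PSNR means the rendering is closer to the reference, so a drop in PSNR as severity increases is the kind of degradation the benchmark reports.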
2301.03992
Report |
Vision Transformers Are Good Mask Auto-Labelers |
Shiyi Lan, Xitong Yang, Zhiding Yu, Zuxuan Wu, Jose M. Alvarez, Anima Anandkumar |
We propose Mask Auto-Labeler (MAL), a high-quality Transformer-based mask
auto-labeling framework for instance segmentation using only box annotations.
MAL takes box-cropped images as inputs and conditionally generates their mask
pseudo-labels. We show that Vision Transformers are good mask auto-labelers. Our
method significantly reduces the gap between auto-labeling and human annotation
regarding mask quality. Instance segmentation models trained using the
MAL-generated masks can nearly match the performance of their fully-supervised
counterparts, retaining up to 97.4% performance of fully supervised models.
The best model achieves 44.1% mAP on COCO instance segmentation (test-dev
2017), outperforming state-of-the-art box-supervised methods by significant
margins. Qualitative results indicate that masks produced by MAL are, in some
cases, even better than human annotations. |
This paper proposes Mask Auto-Labeler (MAL), a Transformer-based mask auto-labeling framework for instance segmentation that only requires bounding box annotations. |
Creating large-scale instance segmentation datasets with mask annotations is expensive and time-consuming. MAL offers a solution to train these models without expensive mask annotations by generating high-quality masks from bounding boxes. |
MAL uses a two-phase framework: 1) Training a Vision Transformer to generate mask pseudo-labels from box-cropped images. 2) Training instance segmentation models using the generated masks. The framework utilizes techniques like box expansion, attention-based decoding, and a teacher network to achieve high-quality masks. |
MAL significantly reduces the gap between auto-labeling and human annotation quality.
Instance segmentation models trained on MAL-generated masks achieve up to 97.4% of their fully supervised performance on COCO and LVIS.
MAL demonstrates strong open-vocabulary generalization by labeling novel categories not seen during training. |
MAL faces challenges in occlusion situations where human annotations outperform it.
The authors observed saturation problems when scaling the model from ViT-Base to ViT-Large. |
instance segmentation, weakly supervised learning, mask auto-labeling, vision transformers, box-supervised learning |
2301.03786
Report |
DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation |
Shuai Shen, Wenliang Zhao, Zibin Meng, Wanhua Li, Zheng Zhu, Jie Zhou, Jiwen Lu |
Talking head synthesis is a promising approach for the video production
industry. Recently, a lot of effort has been devoted in this research area to
improve the generation quality or enhance the model generalization. However,
there are few works able to address both issues simultaneously, which is
essential for practical applications. To this end, in this paper, we turn
attention to the emerging powerful Latent Diffusion Models, and model the
talking head generation as an audio-driven temporally coherent denoising
process (DiffTalk). More specifically, instead of employing audio signals as
the single driving factor, we investigate the control mechanism of the talking
face, and incorporate reference face images and landmarks as conditions for
personality-aware generalized synthesis. In this way, the proposed DiffTalk is
capable of producing high-quality talking head videos in synchronization with
the source audio, and more importantly, it can be naturally generalized across
different identities without any further fine-tuning. Additionally, our
DiffTalk can be gracefully tailored for higher-resolution synthesis with
negligible extra computational cost. Extensive experiments show that the
proposed DiffTalk efficiently synthesizes high-fidelity audio-driven talking
head videos for generalized novel identities. For more video results, please
refer to https://sstzal.github.io/DiffTalk/. |
Presents DiffTalk, a novel conditional diffusion model for high-quality and generalized talking head synthesis that leverages audio and reference images to generate synchronized and personalized talking videos across multiple identities without fine-tuning. |
Addresses the limitations of existing talking head synthesis methods that struggle to achieve both high generation quality and strong model generalization, essential for practical applications like animation and virtual avatars. |
Models talking head generation as an audio-driven denoising process using Latent Diffusion Models (LDMs). It incorporates smooth audio features as conditions to guide lip movements and utilizes reference images, landmarks, and masked ground-truth images to ensure identity preservation and pose control. |
Achieves high-fidelity talking head video synthesis with accurate audio-lip synchronization across diverse identities without fine-tuning.
Significantly outperforms 2D-based methods in terms of generated image quality and surpasses 3D-based methods in model generalization ability.
Demonstrates the capacity for higher-resolution image generation by adjusting the downsampling factor of the image encoder and decoder. |
The iterative denoising process of DiffTalk demands more time for frame synthesis compared to GAN-based methods.
Cross-identity audio driving poses a challenge, leading to relatively less accurate audio-lip synchronization than in self-driven scenarios. |
talking head synthesis, diffusion models, generative models, audio-visual synchronization, identity preservation |
2301.03580
Report |
Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling |
Keyu Tian, Yi Jiang, Qishuai Diao, Chen Lin, Liwei Wang, Zehuan Yuan |
We identify and overcome two key obstacles in extending the success of
BERT-style pre-training, or the masked image modeling, to convolutional
networks (convnets): (i) convolution operation cannot handle irregular,
random-masked input images; (ii) the single-scale nature of BERT pre-training
is inconsistent with convnet's hierarchical structure. For (i), we treat
unmasked pixels as sparse voxels of 3D point clouds and use sparse convolution
to encode. This is the first use of sparse convolution for 2D masked modeling.
For (ii), we develop a hierarchical decoder to reconstruct images from
multi-scale encoded features. Our method called Sparse masKed modeling (SparK)
is general: it can be used directly on any convolutional model without backbone
modifications. We validate it on both classical (ResNet) and modern (ConvNeXt)
models: on three downstream tasks, it surpasses both state-of-the-art
contrastive learning and transformer-based masked modeling by similarly large
margins (around +1.0%). Improvements on object detection and instance
segmentation are more substantial (up to +3.5%), verifying the strong
transferability of features learned. We also find its favorable scaling
behavior by observing more gains on larger models. All this evidence reveals a
promising future of generative pre-training on convnets. Codes and models are
released at https://github.com/keyu-tian/SparK. |
This paper introduces SparK, a novel BERT-style pre-training method specifically designed for convolutional networks (convnets), addressing the limitations of traditional masked image modeling approaches when applied to convnets. |
Extending the success of BERT-style pre-training, highly effective in NLP and vision transformers, to convnets remained a challenge. This paper tackles this gap by overcoming convnets' limitations in processing irregular masked images and leveraging their hierarchical structure. |
SparK leverages sparse convolution to encode unmasked image patches as a 3D point cloud, effectively handling irregularly masked inputs. It also employs a hierarchical decoder that utilizes multi-scale features to reconstruct the image, harnessing convnets' inherent strengths. |
SparK-pretrained convnets outperform state-of-the-art contrastive learning methods and transformer-based masked modeling on ImageNet classification by a significant margin.
The improvements are even more pronounced on COCO object detection and instance segmentation (up to +3.5%), highlighting SparK's ability to learn highly transferable representations.
SparK exhibits favorable scaling behavior, with larger models showing more significant gains, suggesting its potential to boost a wide range of convnet architectures. |
The current implementation utilizes a simple, fixed decoder architecture for all encoder models and could benefit from exploring task-specific decoder designs.
Future work may involve investigating the integration of SparK with other pre-training techniques, like contrastive learning, to further enhance representation learning in convnets. |
self-supervised learning, masked image modeling, convolutional networks, sparse convolution, hierarchical pre-training |
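The masking step behind SparK-style pre-training can be illustrated with a minimal sketch. It assumes a hypothetical 4x4 coarse patch grid, a 50% mask ratio, and feature-map strides of 32, 16, 8 and 4; none of these values come from the report, and the helper names are illustrative only. The sketch shows how one coarse mask can be upsampled so that every stage of a hierarchical convnet sees the same visible regions. It does not implement sparse convolution itself.

```python
import random


def random_patch_mask(grid_h, grid_w, mask_ratio, seed=0):
    """Return a grid_h x grid_w grid of flags: 1 = visible patch, 0 = masked patch."""
    rng = random.Random(seed)
    n = grid_h * grid_w
    n_keep = n - int(round(n * mask_ratio))
    keep = set(rng.sample(range(n), n_keep))
    return [[1 if r * grid_w + c in keep else 0 for c in range(grid_w)]
            for r in range(grid_h)]


def upsample_mask(mask, factor):
    """Nearest-neighbour upsampling so the same mask fits a finer feature map."""
    out = []
    for row in mask:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def apply_mask(feature_map, mask):
    """Zero out masked positions (a dense stand-in for skipping them entirely)."""
    return [[f * m for f, m in zip(f_row, m_row)]
            for f_row, m_row in zip(feature_map, mask)]


# One coarse mask, shared by all (assumed) stages of the hierarchy.
coarse = random_patch_mask(4, 4, mask_ratio=0.5)
masks = {stride: upsample_mask(coarse, 32 // stride) for stride in (32, 16, 8, 4)}
```

Because every finer mask is derived from the same coarse mask, the visible fraction is identical at all scales, which is the property a multi-scale decoder relies on.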
2301.03396
Report |
Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation |
Michał Stypułkowski, Konstantinos Vougioukas, Sen He, Maciej Zięba, Stavros Petridis, Maja Pantic |
Talking face generation has historically struggled to produce head movements
and natural facial expressions without guidance from additional reference
videos. Recent developments in diffusion-based generative models allow for more
realistic and stable data synthesis and their performance on image and video
generation has surpassed that of other generative models. In this work, we
present an autoregressive diffusion model that requires only one identity image
and audio sequence to generate a video of a realistic talking human head. Our
solution is capable of hallucinating head movements, facial expressions, such
as blinks, and preserving a given background. We evaluate our model on two
different datasets, achieving state-of-the-art results on both of them. |
Presents Diffused Heads, a novel autoregressive diffusion model for generating realistic talking head videos using only a single identity image and an audio sequence. |
Addresses limitations of GAN-based talking face generation methods, which often suffer from training instability, require additional guidance (limiting originality), and can produce distortions, especially with large head movements. |
Utilizes a diffusion model conditioned on an identity frame, motion frames (past frames for smoother motion), and audio embeddings (for lip sync and expressions). Introduces a lip sync loss to enhance mouth movement accuracy and uses grayscale motion frames during sampling to prioritize motion information. |
Achieves state-of-the-art results on LRW and CREMA datasets in terms of visual quality (FID, FVD), smoothness (OFM, F-MSE), and expressiveness (blinks).
Outperforms existing methods in a human perception Turing test, demonstrating realistic and believable talking head synthesis.
Demonstrates strong generalization ability, effectively generating videos with out-of-distribution identity images and audio recordings. |
Limited to generating shorter video sequences (8-9 seconds) due to the autoregressive frame generation process.
Suffers from long generation times compared to GAN-based approaches, hindering real-time applications. |
talking face generation, diffusion models, speech-driven animation, one-shot learning, video synthesis |
2301.03110
Report |
RobArch: Designing Robust Architectures against Adversarial Attacks |
ShengYun Peng, Weilin Xu, Cory Cornelius, Kevin Li, Rahul Duggal, Duen Horng Chau, Jason Martin |
Adversarial Training is the most effective approach for improving the
robustness of Deep Neural Networks (DNNs). However, compared to the large body
of research in optimizing the adversarial training process, there are few
investigations into how architecture components affect robustness, and they
rarely constrain model capacity. Thus, it is unclear where robustness precisely
comes from. In this work, we present the first large-scale systematic study on
the robustness of DNN architecture components under fixed parameter budgets.
Through our investigation, we distill 18 actionable robust network design
guidelines that empower model developers to gain deep insights. We demonstrate
these guidelines' effectiveness by introducing the novel Robust Architecture
(RobArch) model that instantiates the guidelines to build a family of
top-performing models across parameter capacities against strong adversarial
attacks. RobArch achieves the new state-of-the-art AutoAttack accuracy on the
RobustBench ImageNet leaderboard. The code is available at
https://github.com/ShengYun-Peng/RobArch. |
This paper conducts a large-scale systematic study of the robustness of DNN architecture components under fixed parameter budgets, resulting in 18 actionable guidelines for designing robust networks. |
Architecture design plays a crucial role in deep learning, but its impact on robustness against adversarial attacks is not well understood. This paper fills this gap by systematically examining how architecture components contribute to robustness. |
The authors conduct a comprehensive study on ImageNet, controlling for model capacity to isolate the effects of individual architecture components. They train over 150 models, exploring network depth, width, stage-level modifications, and block-level designs, under both Fast-AT and Standard-AT. |
Deepening a network is generally more effective than widening it for robustness, until catastrophic overfitting occurs.
Specific modifications, like adding Squeeze-and-Excitation (SE) blocks, removing the first normalization layer in a block, and reducing downsampling in the stem stage, boost robustness.
Architectural choices that harm robustness include inverted bottlenecks, large dilation factors, Instance Normalization (IN), parametric activation functions, and reducing activation layers. |
The study is limited to ResNet-style architectures and ImageNet.
Future work could explore interactions between architectural components and optimize training recipes for robust models. |
adversarial robustness, deep neural networks, architecture design, imagenet, adversarial training |
2301.02657
Report |
TarViS: A Unified Approach for Target-based Video Segmentation |
Ali Athar, Alexander Hermans, Jonathon Luiten, Deva Ramanan, Bastian Leibe |
The general domain of video segmentation is currently fragmented into
different tasks spanning multiple benchmarks. Despite rapid progress in the
state-of-the-art, current methods are overwhelmingly task-specific and cannot
conceptually generalize to other tasks. Inspired by recent approaches with
multi-task capability, we propose TarViS: a novel, unified network architecture
that can be applied to any task that requires segmenting a set of arbitrarily
defined 'targets' in video. Our approach is flexible with respect to how tasks
define these targets, since it models the latter as abstract 'queries' which
are then used to predict pixel-precise target masks. A single TarViS model can
be trained jointly on a collection of datasets spanning different tasks, and
can hot-swap between tasks during inference without any task-specific
retraining. To demonstrate its effectiveness, we apply TarViS to four different
tasks, namely Video Instance Segmentation (VIS), Video Panoptic Segmentation
(VPS), Video Object Segmentation (VOS) and Point Exemplar-guided Tracking
(PET). Our unified, jointly trained model achieves state-of-the-art performance
on 5/7 benchmarks spanning these four tasks, and competitive performance on the
remaining two. Code and model weights are available at:
https://github.com/Ali2500/TarViS |
Presents TarViS, a novel unified architecture for target-based video segmentation that allows a single model to be jointly trained on, and perform, multiple video segmentation tasks by specifying the targets as queries. |
Existing video segmentation methods are task-specific and cannot generalize to other tasks. This work addresses the fragmentation by proposing a unified model for tasks requiring segmenting arbitrarily defined targets in videos. |
TarViS uses a temporal neck for spatiotemporal feature interaction and a transformer decoder to refine target queries. It encodes task-specific targets, such as object instances or semantic classes, as queries and jointly trains on datasets spanning different tasks like VIS, VPS, VOS, and PET. |
TarViS achieves state-of-the-art results for VIS on YouTube-VIS 2021 and OVIS, and for VPS on KITTI-STEP and VIPSeg.
The model performs competitively for VOS on DAVIS, outperforming several space-time correspondence-based methods.
For PET on BURST, TarViS significantly outperforms baselines by encoding objects as queries, effectively handling both point and mask object guidance. |
Training on multiple datasets might not always improve performance on all benchmarks due to potential class bias.
The approach of encoding objects as queries could lead to a loss of fine-grained object details. |
video segmentation, multi-task learning, transformer, query-based model, target segmentation |
2301.02280
Report |
Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training |
Filip Radenovic, Abhimanyu Dubey, Abhishek Kadian, Todor Mihaylov, Simon Vandenhende, Yash Patel, Yi Wen, Vignesh Ramanathan, Dhruv Mahajan |
Vision-language models trained with contrastive learning on large-scale noisy
data are becoming increasingly popular for zero-shot recognition problems. In
this paper we improve the following three aspects of the contrastive
pre-training pipeline: dataset noise, model initialization and the training
objective. First, we propose a straightforward filtering strategy titled
Complexity, Action, and Text-spotting (CAT) that significantly reduces dataset
size, while achieving improved performance across zero-shot vision-language
tasks. Next, we propose an approach titled Concept Distillation to leverage
strong unimodal representations for contrastive training that does not increase
training complexity while outperforming prior work. Finally, we modify the
traditional contrastive alignment objective, and propose an importance-sampling
approach to up-sample the importance of hard-negatives without adding
additional complexity. On an extensive zero-shot benchmark of 29 tasks, our
Distilled and Hard-negative Training (DiHT) approach improves on 20 tasks
compared to the baseline. Furthermore, for few-shot linear probing, we propose
a novel approach that bridges the gap between zero-shot and few-shot
performance, substantially improving over prior work. Models are available at
https://github.com/facebookresearch/diht. |
The paper proposes Distilled and Hard-negative Training (DiHT), a new vision-language pre-training method combining dataset filtering, concept distillation, and a hard-negative contrastive objective. |
Existing contrastive pre-training methods suffer from noisy datasets, suboptimal model initialization, and inefficient use of negative samples. DiHT addresses these issues to improve zero-shot performance on various vision-language tasks. |
DiHT uses the Complexity, Action, and Text-spotting (CAT) filter to clean large-scale datasets, distills object and attribute concepts from a pre-trained teacher model, and employs a hard-negative contrastive loss to focus on more informative negative samples. |
DiHT improves zero-shot performance on 20 out of 29 vision-language tasks compared to the CLIP baseline trained on LAION-2B.
Training DiHT on the smaller PMD dataset yields better performance than CLIP on 28 out of 29 tasks.
DiHT bridges the gap between zero-shot and few-shot performance, showing significant improvements in few-shot linear probing. |
The effectiveness of the hard-negative loss in very noisy settings needs further exploration.
Extending DiHT to more computationally expensive but performant encoder/decoder architectures is a promising future direction. |
vision-language pre-training, contrastive learning, dataset filtering, concept distillation, hard negative mining |
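The idea of up-weighting hard negatives in a contrastive objective can be sketched as follows. This is a generic importance-weighted InfoNCE loss in one direction (image to text), written under stated assumptions: the function name, the temperature `tau`, the hardness exponent `beta`, and the positive weight `alpha` are illustrative, and the exact DiHT objective may differ. With `beta = 0` all negatives receive weight 1 and the loss reduces to standard InfoNCE.

```python
import math


def hard_negative_infonce(sim, tau=0.1, beta=0.5, alpha=1.0):
    """Importance-weighted InfoNCE over a square similarity matrix.

    sim[i][j] is the similarity between image i and text j; the diagonal
    holds the positive pairs. Negatives with higher similarity get larger
    weights, and the weights for each row average to 1.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = math.exp(sim[i][i] / tau)
        neg_idx = [j for j in range(n) if j != i]
        raw = [math.exp(beta * sim[i][j] / tau) for j in neg_idx]
        z = sum(raw)
        weights = [(n - 1) * r / z for r in raw]
        neg = sum(w * math.exp(sim[i][j] / tau) for w, j in zip(weights, neg_idx))
        total += -math.log(pos / (alpha * pos + neg))
    return total / n


# Toy similarity matrix (made-up values, for illustration only).
sim = [[0.9, 0.2, 0.1],
       [0.3, 0.8, 0.4],
       [0.0, 0.5, 0.7]]
plain_loss = hard_negative_infonce(sim, beta=0.0)
hard_loss = hard_negative_infonce(sim, beta=0.5)
```

Since the weights are normalised per row, the reweighting changes only how the penalty is distributed over negatives and adds no extra forward passes.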
2301.02240
Report |
Skip-Attention: Improving Vision Transformers by Paying Less Attention |
Shashanka Venkataramanan, Amir Ghodrati, Yuki M. Asano, Fatih Porikli, Amirhossein Habibian |
This work aims to improve the efficiency of vision transformers (ViT). While
ViTs use computationally expensive self-attention operations in every layer, we
identify that these operations are highly correlated across layers -- a key
redundancy that causes unnecessary computations. Based on this observation, we
propose SkipAt, a method to reuse self-attention computation from preceding
layers to approximate attention at one or more subsequent layers. To ensure
that reusing self-attention blocks across layers does not degrade the
performance, we introduce a simple parametric function, which outperforms the
baseline transformer's performance while running computationally faster. We
show the effectiveness of our method in image classification and
self-supervised learning on ImageNet-1K, semantic segmentation on ADE20K, image
denoising on SIDD, and video denoising on DAVIS. We achieve improved throughput
at the same-or-higher accuracy levels in all these tasks. |
This paper introduces Skip-Attention (SkipAt), a method to improve the efficiency of Vision Transformers (ViT) by reusing self-attention computations from previous layers. |
Self-attention operations in ViTs are computationally expensive and exhibit high correlation across layers, leading to redundancy. |
SkipAt introduces a parametric function inspired by ResNeXt that reuses and refines attention information from preceding layers to approximate attention in subsequent layers. |
SkipAt achieves a superior accuracy-efficiency trade-off compared to the baseline ViT and other state-of-the-art efficiency-focused methods on ImageNet-1K image classification.
SkipAt demonstrates strong generalization by improving performance on various tasks such as semantic segmentation, image denoising, and video denoising.
The paper provides extensive ablations to analyze the impact of different components and configurations of SkipAt. |
The paper primarily focuses on skipping attention in the encoder part of the models and suggests exploring its application in decoders as future work.
Investigating the effectiveness of applying the parametric function directly to the self-attention map instead of the entire MSA block is left for future exploration. |
vision transformer, self-attention, efficiency, image classification, semantic segmentation |
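The reuse of attention across layers can be illustrated with a toy sketch. It assumes single-head attention with identity projections and replaces the paper's parametric function with a plain elementwise affine map (`cheap_refine`), which is a deliberate simplification; the function names and the choice of skipped layers are hypothetical. The point is only that skipped layers avoid the quadratic attention computation by transforming a cached output instead.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def self_attention(tokens):
    """Toy single-head attention with identity Q/K/V projections."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, tokens)) for j in range(d)])
    return out


def cheap_refine(cached, scale, shift):
    """Stand-in for a learned parametric function applied to cached attention output."""
    return [[scale * x + shift for x in row] for row in cached]


def forward(tokens, num_layers, skip_layers, scale=1.0, shift=0.0):
    """Run the toy encoder; return final tokens and the number of attention calls."""
    calls = 0
    cached = None
    x = tokens
    for layer in range(num_layers):
        if layer in skip_layers and cached is not None:
            attn_out = cheap_refine(cached, scale, shift)  # reuse, no attention
        else:
            attn_out = self_attention(x)
            calls += 1
        cached = attn_out
        x = [[a + b for a, b in zip(r, o)] for r, o in zip(x, attn_out)]  # residual
    return x, calls


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
_, full_calls = forward(tokens, num_layers=6, skip_layers=set())
skipped_out, skip_calls = forward(tokens, num_layers=6, skip_layers={2, 3, 4})
```

In this toy setup, skipping three of six layers halves the number of attention calls while keeping the output shape unchanged.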
2301.02239
Report |
Robust Dynamic Radiance Fields |
Yu-Lun Liu, Chen Gao, Andreas Meuleman, Hung-Yu Tseng, Ayush Saraf, Changil Kim, Yung-Yu Chuang, Johannes Kopf, Jia-Bin Huang |
Dynamic radiance field reconstruction methods aim to model the time-varying
structure and appearance of a dynamic scene. Existing methods, however, assume
that accurate camera poses can be reliably estimated by Structure from Motion
(SfM) algorithms. These methods, thus, are unreliable as SfM algorithms often
fail or produce erroneous poses on challenging videos with highly dynamic
objects, poorly textured surfaces, and rotating camera motion. We address this
robustness issue by jointly estimating the static and dynamic radiance fields
along with the camera parameters (poses and focal length). We demonstrate the
robustness of our approach via extensive quantitative and qualitative
experiments. Our results show favorable performance over the state-of-the-art
dynamic view synthesis methods. |
This paper introduces RoDynRF, an algorithm for reconstructing dynamic radiance fields from casual monocular videos without requiring known camera poses and camera intrinsics as input. |
Existing dynamic radiance field reconstruction methods rely on accurate camera poses typically derived from SfM algorithms, which are prone to failure in challenging videos with dynamic objects, textureless surfaces, and complex camera motion. This work addresses this robustness issue. |
The method jointly estimates camera poses, focal length, and two separate radiance fields for static and dynamic elements. It employs a coarse-to-fine strategy for static scene reconstruction, late viewing direction conditioning for improved geometry estimation, and incorporates auxiliary losses (reprojection, disparity, monocular depth) for both static and dynamic components. The dynamic reconstruction utilizes a deformation MLP to handle temporal information and a scene flow MLP for motion modeling. |
RoDynRF achieves state-of-the-art results on dynamic view synthesis benchmarks, including the Dynamic Scene and iPhone datasets.
The method demonstrates superior camera pose estimation accuracy compared to existing NeRF-based methods and performs favorably against learning-based visual odometry techniques on the MPI Sintel dataset.
Qualitative results showcase RoDynRF's ability to reconstruct high-fidelity dynamic scenes and synthesize novel views from challenging videos where traditional SfM techniques fail. |
The method assumes a fixed focal length throughout the video, limiting its applicability in scenarios with zooming effects.
Fast camera motion or inaccurate flow estimation can still lead to failure in pose estimation.
Future work could explore handling dynamic camera intrinsics and improving robustness in extreme motion scenarios. |
dynamic radiance fields, view synthesis, camera pose estimation, neural rendering, computer vision |
2301.01802
Report |
MonoEdge: Monocular 3D Object Detection Using Local Perspectives |
Minghan Zhu, Lingting Ge, Panqu Wang, Huei Peng |
We propose a novel approach for monocular 3D object detection by leveraging
local perspective effects of each object. While the global perspective effect
shown as size and position variations has been exploited for monocular 3D
detection extensively, the local perspective has long been overlooked. We
design a local perspective module to regress a newly defined variable named
keyedge-ratios as the parameterization of the local shape distortion to account
for the local perspective, and derive the object depth and yaw angle from it.
Theoretically, this module does not rely on the pixel-wise size or position in
the image of the objects, therefore independent of the camera intrinsic
parameters. By plugging this module in existing monocular 3D object detection
frameworks, we incorporate the local perspective distortion with global
perspective effect for monocular 3D reasoning, and we demonstrate the
effectiveness and superior performance over strong baseline methods in multiple
datasets. |
Presents MonoEdge, a novel monocular 3D object detection approach leveraging local perspective effects within objects to estimate depth and yaw angle. |
Exploits previously overlooked local perspective cues, offering camera-intrinsics-free depth and direct global yaw angle estimation, unlike common allocentric angle-based methods. |
Introduces 'keyedge-ratios' to quantify local shape distortions, deriving depth and yaw angle independent of camera intrinsics. Employs camera-centric keyedge indexing and grouped regression heads for effective learning, and integrates uncertainty-based depth fusion. |
Achieves consistent improvements across all evaluation metrics on KITTI and nuScenes datasets over baseline methods.
Demonstrates the value of incorporating local perspective distortion with existing approaches for enhanced 3D object detection.
Shows robustness in handling objects with varying viewpoints, including those with minimal apparent local distortion. |
Relies on combining with existing methods based on visual size and position for optimal performance.
Limited effectiveness for distant objects with diminished local perspective distortion. |
3d object detection, monocular vision, local perspective, keyedge-ratios, camera-intrinsics-free |
2301.01795
Report |
PACO: Parts and Attributes of Common Objects |
Vignesh Ramanathan, Anmol Kalia, Vladan Petrovic, Yi Wen, Baixue Zheng, Baishan Guo, Rui Wang, Aaron Marquez, Rama Kovvuri, Abhishek Kadian, Amir Mousavi, Yiwen Song, Abhimanyu Dubey, Dhruv Mahajan |
Object models are gradually progressing from predicting just category labels
to providing detailed descriptions of object instances. This motivates the need
for large datasets which go beyond traditional object masks and provide richer
annotations such as part masks and attributes. Hence, we introduce PACO: Parts
and Attributes of Common Objects. It spans 75 object categories, 456
object-part categories and 55 attributes across image (LVIS) and video (Ego4D)
datasets. We provide 641K part masks annotated across 260K object boxes, with
roughly half of them exhaustively annotated with attributes as well. We design
evaluation metrics and provide benchmark results for three tasks on the
dataset: part mask segmentation, object and part attribute prediction and
zero-shot instance detection. Dataset, models, and code are open-sourced at
https://github.com/facebookresearch/paco. |
The paper introduces PACO, a large-scale dataset for common objects with annotations for part masks, object attributes, and part attributes, aiming to enable research on fine-grained object understanding beyond category-level labels. |
Existing object datasets lack comprehensive annotations for parts and attributes, limiting research on tasks requiring detailed object understanding such as open-vocabulary detection, visual question answering, and referring expressions. |
The dataset is constructed from LVIS and Ego4D, with careful selection of 75 object categories, 456 object-part categories, and 55 attributes. The annotation pipeline involves object bounding box and mask annotation, part mask annotation, object and part attribute annotation, and instance ID annotation, ensuring high quality through user studies and manual curation. |
Part segmentation results show lower AP for object-parts compared to objects due to smaller size, with larger backbones improving performance.
Attribute prediction is more challenging than object detection, with larger models showing better performance, and a significant gap between lower and upper bounds highlighting room for improvement.
Zero-shot instance detection performance suggests a trade-off between query complexity and compounded errors from multiple attribute predictions, with object and part attributes both contributing significantly to performance. |
The study primarily focuses on visual attributes and does not explicitly consider shape attributes due to annotation challenges.
Future work includes exploring more sophisticated models for joint object, part, and attribute detection, as well as investigating the role of shape attributes. |
computer vision, object detection, part segmentation, attribute prediction, zero-shot learning |
2301.01413
Report |
Attribute-Centric Compositional Text-to-Image Generation |
Yuren Cong, Martin Renqiang Min, Li Erran Li, Bodo Rosenhahn, Michael Ying Yang |
Despite the recent impressive breakthroughs in text-to-image generation,
generative models have difficulty in capturing the data distribution of
underrepresented attribute compositions while over-memorizing overrepresented
attribute compositions, which raises public concerns about their robustness and
fairness. To tackle this challenge, we propose ACTIG, an attribute-centric
compositional text-to-image generation framework. We present an
attribute-centric feature augmentation and a novel image-free training scheme,
which greatly improves model's ability to generate images with underrepresented
attributes. We further propose an attribute-centric contrastive loss to avoid
overfitting to overrepresented attribute compositions. We validate our
framework on the CelebA-HQ and CUB datasets. Extensive experiments show that
the compositional generalization of ACTIG is outstanding, and our framework
outperforms previous works in terms of image quality and text-image
consistency. |
This paper proposes ACTIG, an attribute-centric compositional text-to-image generation framework that excels in generating high-fidelity images with accurate attribute compositions even for underrepresented combinations. |
Current text-to-image models struggle to capture the data distribution of underrepresented attribute combinations, leading to biased or inaccurate generations, which raises concerns about robustness and fairness. |
ACTIG introduces: (1) attribute-centric feature augmentation to generate training data with underrepresented attributes, (2) image-free training using augmented text features, and (3) attribute-centric contrastive loss to disentangle attribute distributions and prevent overfitting to popular combinations. |
ACTIG achieves state-of-the-art results on CelebA-HQ and CUB datasets in terms of image quality (FID) and text-image consistency (R-Precision).
ACTIG effectively generates images matching complex attribute compositions, including those not seen during training.
User studies confirm ACTIG's superiority in generating high-quality images with accurate attribute representations compared to other state-of-the-art models. |
The image features from the text-to-image mapping network used in image-free training might not perfectly represent visual appearance, potentially limiting image quality.
The attribute parser based on dependency matching may not accurately extract attributes from complex sentences, requiring further improvement. |
text-to-image generation, compositional generalization, attribute-centric, image-free training, contrastive learning |
2301.01296
Report |
TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models |
Sucheng Ren, Fangyun Wei, Zheng Zhang, Han Hu |
Masked image modeling (MIM) performs strongly in pre-training large vision
Transformers (ViTs). However, small models that are critical for real-world
applications cannot or only marginally benefit from this pre-training approach.
In this paper, we explore distillation techniques to transfer the success of
large MIM-based pre-trained models to smaller ones. We systematically study
different options in the distillation framework, including distilling targets,
losses, input, network regularization, sequential distillation, etc, revealing
that: 1) Distilling token relations is more effective than CLS token- and
feature-based distillation; 2) Using an intermediate layer of the teacher network as the
target performs better than using the last layer when the depth of the
student mismatches that of the teacher; 3) Weak regularization is preferred;
etc. With these findings, we achieve significant fine-tuning accuracy
improvements over the scratch MIM pre-training on ImageNet-1K classification,
using all the ViT-Tiny, ViT-Small, and ViT-base models, with +4.2%/+2.4%/+1.4%
gains, respectively. Our TinyMIM model of base size achieves 52.2 mIoU in ADE20K
semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM
model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image
classification, which sets a new record for small vision models of the same
size and computation budget. This strong performance suggests an alternative
way for developing small vision Transformer models, that is, by exploring
better training methods rather than introducing inductive biases into
architectures as in most previous works. Code is available at
https://github.com/OliverRensu/TinyMIM. |
This paper presents TinyMIM, a method that leverages knowledge distillation to enable masked image modeling (MIM) pre-training for small vision transformers (ViTs), significantly improving their performance on downstream tasks. |
While MIM pre-training has proven effective for large ViTs, small ViTs struggle to benefit from this approach due to their limited capacity. This hinders their applicability in real-world scenarios where efficiency is crucial. |
TinyMIM distills knowledge from a larger, MIM pre-trained teacher ViT to a smaller student ViT. The paper systematically investigates various design choices for distillation, including targets (token relations, features, CLS token), input (raw or masked images), network regularization, and sequential distillation. |
Distilling token relations is more effective than using CLS token or feature-based distillation.
Intermediate teacher layers as targets often outperform the last layer, particularly when student and teacher depths mismatch.
TinyMIM significantly boosts the accuracy of small ViTs on ImageNet-1K classification and ADE20K segmentation, setting new records for their size and computational budget. |
The study primarily focuses on transferring knowledge from MIM pre-trained teachers, leaving other pre-training methods unexplored for distillation.
Future work could explore different teacher architectures or pre-training tasks to further enhance TinyMIM's performance. |
knowledge distillation, masked image modeling, vision transformers, self-supervised learning, model compression |
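Token-relation distillation can be sketched in a few lines. The example below assumes that a relation is the row-wise softmax of scaled dot products between token vectors and that the student is trained to match the teacher's relations under a KL divergence; the function names and toy inputs are hypothetical, and the paper's exact choice of relations and loss may differ. One useful property is visible directly: relation matrices have shape N x N for N tokens, so teacher and student may use different embedding widths.

```python
import math


def relation(queries, keys):
    """Row-wise softmax of scaled dot products: an N x N token-relation matrix."""
    d = len(queries[0])
    rows = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        rows.append([e / z for e in exps])
    return rows


def kl_relation_loss(teacher_rel, student_rel, eps=1e-12):
    """Mean KL(teacher || student) over the rows of two relation matrices."""
    total = 0.0
    for t_row, s_row in zip(teacher_rel, student_rel):
        total += sum(t * math.log((t + eps) / (s + eps)) for t, s in zip(t_row, s_row))
    return total / len(teacher_rel)


# Toy tokens: the teacher is wider (dim 4) than the student (dim 2).
teacher_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
student_tokens = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

teacher_rel = relation(teacher_tokens, teacher_tokens)
student_rel = relation(student_tokens, student_tokens)
loss = kl_relation_loss(teacher_rel, student_rel)
```

No projection layer is needed to align feature dimensions here, which is one practical reason relation targets are convenient when the student is much smaller than the teacher.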
2301.01206
Report |
Speed up the inference of diffusion models via shortcut MCMC sampling |
Gang Chen |
Diffusion probabilistic models have recently achieved high-quality image
synthesis. However, one pain point is the notoriously slow inference, which
gradually obtains clear images over thousands of steps and is time-consuming
compared to other generative models. In this paper, we present a shortcut MCMC sampling
algorithm, which balances training and inference, while keeping the generated
data's quality. In particular, we add the global fidelity constraint with
shortcut MCMC sampling to combat the local fitting from diffusion models. We do
some initial experiments and show very promising results. Our implementation is
available at https://github.com/vividitytech/diffusion-mcmc.git. |
Presents a shortcut MCMC sampling algorithm for diffusion models that balances training and inference time while maintaining generated data quality by adding a global fidelity constraint to combat local fitting. |
Diffusion models, while effective for high-quality image synthesis, suffer from slow inference times compared to other generative models due to the need for thousands of steps to obtain clear images. |
Introduces shortcut MCMC sampling and incorporates a fidelity term in the loss function to match synthesized images with original data, acting as a global constraint for quality control during shortcut generation. |
The approach achieves fast convergence and better reconstruction compared to traditional diffusion models.
Generates high-quality samples with significantly fewer inference steps.
Demonstrates superior performance on synthetic datasets, showcasing its potential for fast and accurate image synthesis. |
Currently explored on synthetic datasets; further validation on more complex and diverse datasets is needed.
The impact of varying the number of shortcut steps (K) on the balance between inference speed and generated image quality requires further investigation. |
diffusion models, mcmc sampling, image synthesis, generative models, fast inference |
2301.01156
Report |
Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation |
Yue Han, Jiangning Zhang, Zhucun Xue, Chao Xu, Xintian Shen, Yabiao Wang, Chengjie Wang, Yong Liu, Xiangtai Li |
Few Shot Instance Segmentation (FSIS) requires models to detect and segment
novel classes with limited several support examples. In this work, we explore a
simple yet unified solution for FSIS as well as its incremental variants, and
introduce a new framework named Reference Twice (RefT) to fully explore the
relationship between support/query features based on a Transformer-like
framework. Our key insights are two folds: Firstly, with the aid of support
masks, we can generate dynamic class centers more appropriately to re-weight
query features. Secondly, we find that support object queries have already
encoded key factors after base training. In this way, the query features can be
enhanced twice from two aspects, i.e., feature-level and instance-level. In
particular, we firstly design a mask-based dynamic weighting module to enhance
support features and then propose to link object queries for better calibration
via cross-attention. After the above steps, the novel classes can be improved
significantly over our strong baseline. Additionally, our new framework can be
easily extended to incremental FSIS with minor modification. When benchmarking
results on the COCO dataset for FSIS, gFSIS, and iFSIS settings, our method
achieves a competitive performance compared to existing approaches across
different shots, e.g., we boost nAP by noticeable +8.2/+9.4 over the current
state-of-the-art FSIS method for 10/30-shot. We further demonstrate the
superiority of our approach on Few Shot Object Detection. Code and model will
be available. |
This paper proposes RefT, a simple and unified Transformer-based framework for Few-Shot Instance Segmentation (FSIS) and its variants, by leveraging support set information on both feature and instance levels. |
FSIS is challenging as it requires both detection and segmentation of novel instances with limited data. Existing methods often under-explore cues from support data. This paper aims to address this issue with a more powerful framework and unify FSIS, gFSIS, and iFSIS. |
RefT uses a two-stage meta-learning pipeline with two key novelties: 1) mask-based dynamic prototype generation for feature-level enhancement, and 2) cross-attention linking of query and support object queries for instance-level guidance. |
RefT significantly outperforms previous FSIS methods, achieving state-of-the-art results on COCO benchmarks across different shots.
The method generalizes well to gFSIS and iFSIS, consistently outperforming recent state-of-the-art approaches.
Ablation studies demonstrate the effectiveness of each component, highlighting the importance of both feature-level and instance-level enhancements. |
The model doesn't perform well in a one-shot setting.
Future work will focus on improving performance in the one-shot setting. |
few-shot instance segmentation, few-shot learning, vision transformer, object detection, image segmentation |
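The mask-based prototype generation described in the RefT summary above can be illustrated with masked average pooling. This is a minimal sketch, not the authors' implementation: the function names `masked_average_pool` and `reweight_query` are hypothetical, and channel-wise multiplication is only an assumed form of the re-weighting step.

```python
def masked_average_pool(features, mask):
    """Average each channel of `features` (C x H x W nested lists)
    over the pixels where the binary support `mask` (H x W) is 1."""
    area = sum(sum(row) for row in mask)
    if area == 0:
        raise ValueError("support mask is empty")
    height, width = len(mask), len(mask[0])
    prototype = []
    for channel in features:
        total = sum(channel[i][j] * mask[i][j]
                    for i in range(height) for j in range(width))
        prototype.append(total / area)
    return prototype


def reweight_query(query, prototype):
    """Assumed re-weighting: scale every query channel by the matching
    prototype value (one weight per channel)."""
    return [[[value * weight for value in row] for row in channel]
            for channel, weight in zip(query, prototype)]


support = [[[1.0, 2.0], [3.0, 4.0]], [[10.0, 20.0], [30.0, 40.0]]]
mask = [[1, 0], [0, 1]]
prototype = masked_average_pool(support, mask)
query = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
enhanced = reweight_query(query, prototype)
```

Pooling only inside the mask keeps background pixels of the support image out of the class prototype, which is the motivation the summary gives for using support masks.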
2301.01146
Report |
Rethinking Mobile Block for Efficient Attention-based Models |
Jiangning Zhang, Xiangtai Li, Jian Li, Liang Liu, Zhucun Xue, Boshen Zhang, Zhengkai Jiang, Tianxin Huang, Yabiao Wang, Chengjie Wang |
This paper focuses on developing modern, efficient, lightweight models for
dense predictions while trading off parameters, FLOPs, and performance.
Inverted Residual Block (IRB) serves as the infrastructure for lightweight
CNNs, but no counterpart has been recognized by attention-based studies. This
work rethinks lightweight infrastructure from efficient IRB and effective
components of Transformer from a unified perspective, extending CNN-based IRB
to attention-based models and abstracting a one-residual Meta Mobile Block
(MMB) for lightweight model design. Following simple but effective design
criterion, we deduce a modern Inverted Residual Mobile Block (iRMB) and build a
ResNet-like Efficient MOdel (EMO) with only iRMB for down-stream tasks.
Extensive experiments on ImageNet-1K, COCO2017, and ADE20K benchmarks
demonstrate the superiority of our EMO over state-of-the-art methods, e.g.,
EMO-1M/2M/5M achieve 71.5, 75.1, and 78.4 Top-1 that surpass equal-order
CNN-/Attention-based models, while trading-off the parameter, efficiency, and
accuracy well: running 2.8-4.0x faster than EdgeNeXt on iPhone14. |
This paper introduces the Meta Mobile Block (MMB), a novel one-residual block design for lightweight attention-based models, and presents the Efficient MOdel (EMO), built solely from Inverted Residual Mobile Blocks (iRMBs) deduced from the MMB. |
Existing efficient models either struggle to achieve high accuracy (CNN-based) or require complex structures and high computational costs (attention-based). This work aims to bridge this gap by creating a simple yet effective lightweight model design. |
The authors extend the concept of Inverted Residual Blocks (IRBs) used in lightweight CNNs to attention-based models. They abstract a unified MMB that can be instantiated into IRB, MHSA, and FFN by adjusting expansion ratio and efficient operator. By deducing MMB with specific components, they propose the Inverted Residual Mobile Block (iRMB) and build a ResNet-like EMO with only iRMBs. |
EMO outperforms state-of-the-art lightweight attention-based models on ImageNet-1K, COCO2017, and ADE20K benchmarks.
EMO achieves a better balance between accuracy, parameters, and FLOPs compared to counterparts, running 2.8-4.0x faster than EdgeNeXt on iPhone14.
Ablation studies demonstrate the effectiveness of the iRMB design and the importance of component choices and configurations. |
Exploration of more complex and potentially more effective operators within the iRMB structure is left for future work.
Further performance improvements could be achieved by utilizing higher resolution input, NAS, knowledge distillation, larger datasets, and stronger training strategies. |
lightweight model, efficient architecture, attention mechanism, meta mobile block, inverted residual block |
2301.00950
Report |
Class-Continuous Conditional Generative Neural Radiance Field |
Jiwook Kim, Minhyeok Lee |
The 3D-aware image synthesis focuses on conserving spatial consistency
besides generating high-resolution images with fine details. Recently, Neural
Radiance Field (NeRF) has been introduced for synthesizing novel views with low
computational cost and superior performance. While several works investigate a
generative NeRF and show remarkable achievement, they cannot handle conditional
and continuous feature manipulation in the generation procedure. In this work,
we introduce a novel model, called Class-Continuous Conditional Generative NeRF
($\text{C}^{3}$G-NeRF), which can synthesize conditionally manipulated
photorealistic 3D-consistent images by projecting conditional features to the
generator and the discriminator. The proposed $\text{C}^{3}$G-NeRF is evaluated
with three image datasets, AFHQ, CelebA, and Cars. As a result, our model shows
strong 3D-consistency with fine details and smooth interpolation in conditional
feature manipulation. For instance, $\text{C}^{3}$G-NeRF exhibits a Fréchet
Inception Distance (FID) of 7.64 in 3D-aware face image synthesis with a
$128^{2}$ resolution. Additionally, we provide FIDs of generated
3D-aware images of each class of the datasets as it is possible to synthesize
class-conditional images with $\text{C}^{3}$G-NeRF. |
This paper introduces C³G-NeRF, a novel model for conditional and continuous feature manipulation in 3D-aware image generation using Neural Radiance Fields (NeRF). |
Existing generative NeRF methods lack the ability to handle conditional and continuous feature control during generation, limiting their application in fields like avatar customization in the metaverse. |
C³G-NeRF projects conditional features onto both the generator and discriminator, enabling fine-grained control over image synthesis. It utilizes a generative neural feature field for each object and the background, composed using a density-weighted mean-based composition operator. The model employs volume rendering to generate feature images and a neural rendering network to produce the final high-resolution image. Residual modules are incorporated to improve training efficiency and performance. |
C³G-NeRF achieves state-of-the-art results in conditional 3D-aware image generation, outperforming baseline models in terms of FID and KID scores across various datasets (AFHQ, CelebA, Cars).
The model exhibits robust 3D consistency, preserving spatial coherence under object rotations, translations, and additions.
C³G-NeRF enables smooth interpolation and extrapolation of conditional input values, allowing for continuous feature manipulation in generated images. |
FID scores evaluated at different view angles across datasets might not be directly comparable.
Future work could explore incorporating higher-resolution feature images or alternative neural rendering techniques to further enhance generation quality. |
generative adversarial networks (gans), neural radiance fields (nerf), 3d-aware image synthesis, conditional image generation, continuous feature manipulation |
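The density-weighted mean composition mentioned in the C³G-NeRF summary above can be sketched for a single 3D sample point. The sketch assumes the common convention that the composed density is the sum of the per-entity densities and the composed feature is their density-weighted average; the function name `compose` is hypothetical and the summary does not state the exact formula.

```python
def compose(densities, features):
    """Combine per-entity predictions at one 3D point.

    densities: one non-negative density per entity (objects and background).
    features:  one feature vector per entity, all of the same length.
    Returns (total_density, density_weighted_mean_feature).
    """
    total = sum(densities)
    dim = len(features[0])
    if total == 0:
        # Empty space: no entity contributes, so return a zero feature.
        return 0.0, [0.0] * dim
    mean_feature = [
        sum(d * feat[k] for d, feat in zip(densities, features)) / total
        for k in range(dim)
    ]
    return total, mean_feature


sigma, feature = compose([1.0, 3.0], [[1.0, 0.0], [0.0, 1.0]])
```

Under this convention the denser entity dominates the composed feature, so an object in front of a nearly empty background is rendered with the object's appearance.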
2301.00808
Report |
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders |
Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie |
Driven by improved architectures and better representation learning
frameworks, the field of visual recognition has enjoyed rapid modernization and
performance boost in the early 2020s. For example, modern ConvNets, represented
by ConvNeXt, have demonstrated strong performance in various scenarios. While
these models were originally designed for supervised learning with ImageNet
labels, they can also potentially benefit from self-supervised learning
techniques such as masked autoencoders (MAE). However, we found that simply
combining these two approaches leads to subpar performance. In this paper, we
propose a fully convolutional masked autoencoder framework and a new Global
Response Normalization (GRN) layer that can be added to the ConvNeXt
architecture to enhance inter-channel feature competition. This co-design of
self-supervised learning techniques and architectural improvement results in a
new model family called ConvNeXt V2, which significantly improves the
performance of pure ConvNets on various recognition benchmarks, including
ImageNet classification, COCO detection, and ADE20K segmentation. We also
provide pre-trained ConvNeXt V2 models of various sizes, ranging from an
efficient 3.7M-parameter Atto model with 76.7% top-1 accuracy on ImageNet, to a
650M Huge model that achieves a state-of-the-art 88.9% accuracy using only
public training data. |
This paper introduces ConvNeXt V2, an improved ConvNeXt model designed for enhanced performance with masked autoencoders (MAE). It features a fully convolutional masked autoencoder framework and a novel Global Response Normalization (GRN) layer. |
This co-design of self-supervised learning techniques and architecture boosts the performance of pure ConvNets in image recognition tasks, exceeding previous ConvNeXt versions and rivaling transformer-based models. |
The authors develop a fully convolutional MAE framework with sparse convolutions for ConvNets. They analyze feature collapse in ConvNeXt with MAE pre-training and address it by incorporating the GRN layer to enhance feature diversity. |
ConvNeXt V2 significantly outperforms prior ConvNeXt models and achieves state-of-the-art accuracy on ImageNet classification with public data (88.9%).
The proposed method demonstrates consistent improvement across a wide range of model sizes, from efficient (3.7M parameters) to high-capacity (650M parameters) variants.
ConvNeXt V2 excels in transfer learning, surpassing Swin transformer-based models in object detection and semantic segmentation tasks on COCO and ADE20K. |
The largest model shows a slight performance gap compared to ViT in the huge model regime, potentially due to ViT benefiting more from self-supervised pre-training.
The efficiency of sparse convolution libraries can be further optimized for modern hardware. |
image recognition, convolutional neural networks, self-supervised learning, masked autoencoders, transfer learning |
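The Global Response Normalization layer named in the ConvNeXt V2 summary above can be sketched in three steps: per-channel global aggregation, divisive normalization across channels, and calibration with a residual connection. This is a plain-Python illustration under the assumption that the aggregation is an L2 norm over spatial positions; the flat per-channel layout and the name `global_response_norm` are choices made for this sketch only.

```python
import math


def global_response_norm(x, gamma, beta, eps=1e-6):
    """x: list of C channels, each a flat list of spatial activations.
    gamma, beta: one learnable scalar per channel."""
    # 1) Global aggregation: L2 norm of each channel over all positions.
    norms = [math.sqrt(sum(v * v for v in channel)) for channel in x]
    # 2) Divisive normalization: compare each channel with the mean norm.
    mean_norm = sum(norms) / len(norms)
    scales = [n / (mean_norm + eps) for n in norms]
    # 3) Calibration: rescale the input and add it back as a residual.
    return [[g * v * s + b + v for v in channel]
            for channel, s, g, b in zip(x, scales, gamma, beta)]


x = [[3.0, 4.0], [0.0, 0.0]]
out = global_response_norm(x, gamma=[1.0, 1.0], beta=[0.0, 0.0])
identity = global_response_norm(x, gamma=[0.0, 0.0], beta=[0.0, 0.0])
```

With `gamma` and `beta` at zero the layer reduces to the identity, so it can be added to an existing block without changing its initial behavior; channels with above-average response are amplified relative to weak ones, which is the inter-channel competition the abstract refers to.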
2301.00805
Report |
Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation |
Jianzong Wu, Xiangtai Li, Henghui Ding, Xia Li, Guangliang Cheng, Yunhai Tong, Chen Change Loy |
In this work, we focus on open vocabulary instance segmentation to expand a
segmentation model to classify and segment instance-level novel categories.
Previous approaches have relied on massive caption datasets and complex
pipelines to establish one-to-one mappings between image regions and words in
captions. However, such methods build noisy supervision by matching non-visible
words to image regions, such as adjectives and verbs. Meanwhile, context words
are also important for inferring the existence of novel objects as they show
high inter-correlations with novel categories. To overcome these limitations,
we devise a joint Caption Grounding and Generation (CGG) framework,
which incorporates a novel grounding loss that only focuses on matching object
nouns to improve learning efficiency. We also introduce a caption generation
head that enables additional supervision and contextual modeling as a
complementation to the grounding loss. Our analysis and results demonstrate
that grounding and generation components complement each other, significantly
enhancing the segmentation performance for novel classes. Experiments on the
COCO dataset with two settings: Open Vocabulary Instance Segmentation (OVIS)
and Open Set Panoptic Segmentation (OSPS) demonstrate the superiority of the
CGG. Specifically, CGG achieves a substantial improvement of 6.8% mAP for novel
classes without extra data on the OVIS task and 15% PQ improvements for novel
classes on the OSPS benchmark. |
This paper presents a joint Caption Grounding and Generation (CGG) framework to address the challenge of open vocabulary instance segmentation, where the goal is to enable segmentation models to classify and segment novel object categories not seen during training. |
Existing methods often rely on noisy supervision from matching non-visual words in captions to image regions or struggle to effectively leverage contextual information for novel object inference. CGG addresses these limitations by employing a novel grounding loss focused on matching object nouns and introducing a caption generation head for contextual modeling and additional supervision. |
CGG uses a Mask2Former baseline and incorporates two main components: (1) a caption grounding module that extracts object nouns from captions and aligns them with object queries in the segmentation model using a dedicated loss function and (2) a caption generation module that leverages multi-modal embeddings from the segmentation model to generate image captions, providing additional supervision and contextual understanding. |
CGG outperforms previous state-of-the-art methods on the OVIS benchmark, achieving a significant improvement of 6.8% mAP for novel classes without using additional data.
On the OSPS benchmark, CGG demonstrates superior performance, achieving a 15% PQ improvement for novel classes compared to previous approaches.
Ablation studies confirm the effectiveness of both the caption grounding and generation components, highlighting their complementary roles in enhancing the model's ability to identify and segment novel object categories. |
The study is limited by computational resources, preventing pre-training on larger caption datasets or using VLMs like CLIP for distillation/supervision.
Future work will focus on exploring these avenues and evaluating CGG on more extensive datasets such as LVIS and Open Images. |
open vocabulary instance segmentation, caption grounding, caption generation, mask2former, open set panoptic segmentation |
2301.00592
Report |
Edge Enhanced Image Style Transfer via Transformers |
Chiyu Zhang, Jun Yang, Zaiyan Dai, Peng Cao |
In recent years, arbitrary image style transfer has attracted more and more
attention. Given a pair of content and style images, a stylized one is hoped
that retains the content from the former while catching style patterns from the
latter. However, it is difficult to simultaneously keep well the trade-off
between the content details and the style features. To stylize the image with
sufficient style patterns, the content details may be damaged and sometimes the
objects of images can not be distinguished clearly. For this reason, we present
a new transformer-based method named STT for image style transfer and an edge
loss which can enhance the content details apparently to avoid generating
blurred results for excessive rendering on style features. Qualitative and
quantitative experiments demonstrate that STT achieves comparable performance
to state-of-the-art image style transfer methods while alleviating the content
leak problem. |
This paper proposes STT, a novel Transformer-based image style transfer network that generates high-quality stylized images while preserving fine content details. |
Existing image style transfer methods struggle to balance content details and style features, often resulting in blurred outputs with indistinguishable objects. |
STT utilizes a Transformer-based encoder-transfer-decoder architecture with a content-aware positional encoding (Conv PE). It incorporates a novel edge loss to enhance content details and prevent blurred stylizations. |
STT demonstrates superior performance in preserving content details and transferring style features compared to state-of-the-art methods.
The proposed Conv PE effectively encodes positional information, outperforming traditional functional and parametric approaches.
The edge loss significantly improves the clarity of stylized images, particularly in cases where the original results are blurred. |
The edge loss in STT is only applied when the initial results are noticeably blurred.
Further research could explore the integration of the content-aware positional encoding (CAPE) within the STT framework. |
image style transfer, transformer, edge enhancement, content leak, deep learning |
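As an illustration of the edge loss idea in the STT summary above, the sketch below compares edge maps of the content image and the stylized image. The summary does not specify the edge operator or the distance, so the Sobel filter, the mean absolute difference, and the names `sobel_edges` and `edge_loss` are all assumptions of this sketch.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_edges(img):
    """Gradient magnitude of a grayscale image (H x W nested lists),
    computed on interior pixels only (no padding)."""
    height, width = len(img), len(img[0])
    edges = []
    for i in range(1, height - 1):
        row = []
        for j in range(1, width - 1):
            gx = sum(SOBEL_X[a][b] * img[i - 1 + a][j - 1 + b]
                     for a in range(3) for b in range(3))
            gy = sum(SOBEL_Y[a][b] * img[i - 1 + a][j - 1 + b]
                     for a in range(3) for b in range(3))
            row.append(math.hypot(gx, gy))
        edges.append(row)
    return edges


def edge_loss(content, stylized):
    """Mean absolute difference between the two edge maps."""
    edges_c, edges_s = sobel_edges(content), sobel_edges(stylized)
    diffs = [abs(a - b)
             for row_c, row_s in zip(edges_c, edges_s)
             for a, b in zip(row_c, row_s)]
    return sum(diffs) / len(diffs)


step = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
flat = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
loss_same = edge_loss(step, step)
loss_blurred = edge_loss(step, flat)
```

A stylized result that washes out the content's edges yields a large loss, while one that preserves them yields a loss near zero.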
2301.00527
Report |
Diffusion Probabilistic Models for Scene-Scale 3D Categorical Data |
Jumin Lee, Woobin Im, Sebin Lee, Sung-Eui Yoon |
In this paper, we learn a diffusion model to generate 3D data on a
scene-scale. Specifically, our model crafts a 3D scene consisting of multiple
objects, while recent diffusion research has focused on a single object. To
realize our goal, we represent a scene with discrete class labels, i.e.,
categorical distribution, to assign multiple objects into semantic categories.
Thus, we extend discrete diffusion models to learn scene-scale categorical
distributions. In addition, we validate that a latent diffusion model can
reduce computation costs for training and deploying. To the best of our
knowledge, our work is the first to apply discrete and latent diffusion for 3D
categorical data on a scene-scale. We further propose to perform semantic scene
completion (SSC) by learning a conditional distribution using our diffusion
model, where the condition is a partial observation in a sparse point cloud. In
experiments, we empirically show that our diffusion models not only generate
reasonable scenes, but also perform the scene completion task better than a
discriminative model. Our code and models are available at
https://github.com/zoomin-lee/scene-scale-diffusion |
This paper introduces the first application of discrete and latent diffusion models for generating scene-scale 3D semantic segmentation maps. |
Existing 3D diffusion models focus on single-object generation, whereas this model generates entire 3D scenes with multiple objects, which enables broader applications such as semantic scene completion. |
The authors extend discrete diffusion models to handle 3D categorical voxel data, representing scenes with discrete class labels. They also validate the use of latent diffusion models to reduce computation costs during training and deployment. |
Both discrete and latent diffusion models successfully generate diverse and plausible 3D scenes.
Latent diffusion significantly reduces training and sampling time compared to discrete diffusion.
The proposed method outperforms a discriminative model in the semantic scene completion task, demonstrating its ability to complete scenes from partial observations. |
The performance of VQ-VAE in latent diffusion can be limited by codebook size and resolution.
Future work can explore more sophisticated network architectures and training strategies specifically designed for 3D scene generation. |
diffusion models, 3d scene generation, semantic scene completion, latent diffusion, categorical data |
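The forward (noising) process of a discrete diffusion model over categorical voxel labels, as referred to in the summary above, can be sketched as follows. The sketch assumes a uniform transition kernel, in which a label is kept with probability 1 - beta and otherwise resampled uniformly over all classes; the summary does not state which kernel the authors use, and the function names are hypothetical.

```python
import random


def forward_step_probs(label, num_classes, beta):
    """Categorical distribution of the next label given the current one,
    under an assumed uniform transition kernel."""
    probs = [beta / num_classes] * num_classes
    probs[label] += 1.0 - beta
    return probs


def noise_voxels(labels, num_classes, beta, rng):
    """Apply one forward step independently to every voxel label."""
    noisy = []
    for label in labels:
        probs = forward_step_probs(label, num_classes, beta)
        noisy.append(rng.choices(range(num_classes), weights=probs, k=1)[0])
    return noisy


probs = forward_step_probs(label=2, num_classes=4, beta=0.2)
unchanged = noise_voxels([0, 1, 2, 3], 4, 0.0, random.Random(0))
fully_noised = noise_voxels([0, 1, 2, 3], 4, 1.0, random.Random(0))
```

With `beta = 0` the scene is left untouched, and with `beta = 1` every voxel is drawn uniformly at random, which is the fully noised end of the chain that a learned reverse process would start from.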
2301.00411
Report |
Detachable Novel Views Synthesis of Dynamic Scenes Using Distribution-Driven Neural Radiance Fields |
Boyu Zhang, Wenbo Xu, Zheng Zhu, Guan Huang |
Representing and synthesizing novel views in real-world dynamic scenes from
casual monocular videos is a long-standing problem. Existing solutions
typically approach dynamic scenes by applying geometry techniques or utilizing
temporal information between several adjacent frames without considering the
underlying background distribution in the entire scene or the transmittance
over the ray dimension, limiting their performance on static and occlusion
areas. Our approach $\textbf{D}$istribution-$\textbf{D}$riven neural radiance
fields offers high-quality view synthesis and a 3D solution to
$\textbf{D}$etach the background from the entire $\textbf{D}$ynamic scene,
which is called $\text{D}^4$NeRF. Specifically, it employs a neural
representation to capture the scene distribution in the static background and a
6D-input NeRF to represent dynamic objects, respectively. Each ray sample is
given an additional occlusion weight to indicate the transmittance lying in the
static and dynamic components. We evaluate $\text{D}^4$NeRF on public dynamic
scenes and our urban driving scenes acquired from an autonomous-driving
dataset. Extensive experiments demonstrate that our approach outperforms
previous methods in rendering texture details and motion areas while also
producing a clean static background. Our code will be released at
https://github.com/Luciferbobo/D4NeRF. |
This paper proposes D⁴NeRF, a novel method using Distribution-Driven Neural Radiance Fields for Detachable Novel Views Synthesis of Dynamic Scenes from casual monocular videos. |
Existing methods often neglect the underlying background distribution and ray transmittance in dynamic scenes, limiting their performance on static and occlusion areas. This work addresses these limitations. |
The method uses a parallel structure with a background pipeline capturing the scene distribution and a 6D-input NeRF representing dynamic objects. An occlusion weight module is introduced to learn the transmittance between static and dynamic components, and multiple regularization losses optimize the training. |
D⁴NeRF outperforms state-of-the-art methods in novel view synthesis quality, achieving higher PSNR and SSIM and lower LPIPS.
The method effectively decouples static backgrounds from dynamic scenes in a self-supervised manner.
Quantitative and qualitative evaluations on NVIDIA dynamic scenes and a new urban driving scenes dataset demonstrate the effectiveness of D⁴NeRF. |
The performance relies on accurate estimations of camera poses and optical flow.
Future work includes exploring high-quality decomposition and editor models for 3D dynamic scenes, and potentially integrating deformable latent code for finer dynamic object representation. |
novel view synthesis, neural radiance fields, dynamic scenes, monocular videos, occlusion handling |
2301.00184
Report |
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? |
Wenhao Wu, Haipeng Luo, Bo Fang, Jingdong Wang, Wanli Ouyang |
Most existing text-video retrieval methods focus on cross-modal matching
between the visual content of videos and textual query sentences. However, in
real-world scenarios, online videos are often accompanied by relevant text
information such as titles, tags, and even subtitles, which can be utilized to
match textual queries. This insight has motivated us to propose a novel
approach to text-video retrieval, where we directly generate associated
captions from videos using zero-shot video captioning with knowledge from
web-scale pre-trained models (e.g., CLIP and GPT-2). Given the generated
captions, a natural question arises: what benefits do they bring to text-video
retrieval? To answer this, we introduce Cap4Video, a new framework that
leverages captions in three ways: i) Input data: video-caption pairs can
augment the training data. ii) Intermediate feature interaction: we perform
cross-modal feature interaction between the video and caption to produce
enhanced video representations. iii) Output score: the Query-Caption matching
branch can complement the original Query-Video matching branch for text-video
retrieval. We conduct comprehensive ablation studies to demonstrate the
effectiveness of our approach. Without any post-processing, Cap4Video achieves
state-of-the-art performance on four standard text-video retrieval benchmarks:
MSR-VTT (51.4%), VATEX (66.6%), MSVD (51.8%), and DiDeMo (52.0%). The code is
available at https://github.com/whwu95/Cap4Video . |
Proposes Cap4Video, a novel framework leveraging automatically generated captions to enhance text-video retrieval. |
Existing methods focus on visual-textual matching, neglecting the valuable textual information often associated with videos. |
1. Generates captions from videos using zero-shot video captioning (CLIP + GPT-2). 2. Leverages the captions in three ways: (i) data augmentation with video-caption pairs, (ii) feature interaction between video and caption representations, and (iii) output score fusion of query-video and query-caption matching. |
Achieves state-of-the-art performance on MSR-VTT (51.4%), VATEX (66.6%), MSVD (51.8%), and DiDeMo (52.0%).
Demonstrates significant improvements over baselines, particularly in retrieving precise ground-truth videos.
Shows effectiveness of caption-based data augmentation, feature interaction, and score fusion through ablation studies. |
Relies on the quality of generated captions, which can be improved with more advanced captioning methods.
Current implementation focuses on global caption embedding; exploring fine-grained caption information could be beneficial. |
text-video retrieval, video captioning, cross-modal learning, clip, gpt-2 |
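The output score fusion described in the Cap4Video summary above can be sketched as a weighted sum of two similarity scores per candidate video. The cosine similarity, the `caption_weight` parameter, and the function names are assumptions for illustration; the summary only states that the query-caption branch complements the query-video branch.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    norm_a = math.sqrt(sum(v * v for v in a))
    norm_b = math.sqrt(sum(v * v for v in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def fused_scores(query, video_embs, caption_embs, caption_weight=0.5):
    """Query-video similarity plus weighted query-caption similarity."""
    return [cosine(query, v) + caption_weight * cosine(query, c)
            for v, c in zip(video_embs, caption_embs)]


def rank(scores):
    """Candidate indices sorted from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


query = [1.0, 0.0]
videos = [[1.0, 0.0], [0.0, 1.0]]
captions = [[0.0, 1.0], [1.0, 0.0]]
scores = fused_scores(query, videos, captions, caption_weight=0.5)
ranking = rank(scores)
caption_heavy = rank(fused_scores(query, videos, captions, caption_weight=2.0))
```

The weight controls how much the generated caption can override the visual match; in this toy example a large caption weight flips the ranking.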
2301.00182
Report |
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models |
Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang |
Vision-language models (VLMs) pre-trained on large-scale image-text pairs
have demonstrated impressive transferability on various visual tasks.
Transferring knowledge from such powerful VLMs is a promising direction for
building effective video recognition models. However, current exploration in
this field is still limited. We believe that the greatest value of pre-trained
VLMs lies in building a bridge between visual and textual domains. In this
paper, we propose a novel framework called BIKE, which utilizes the cross-modal
bridge to explore bidirectional knowledge: i) We introduce the Video Attribute
Association mechanism, which leverages the Video-to-Text knowledge to generate
textual auxiliary attributes for complementing video recognition. ii) We also
present a Temporal Concept Spotting mechanism that uses the Text-to-Video
expertise to capture temporal saliency in a parameter-free manner, leading to
enhanced video representation. Extensive studies on six popular video datasets,
including Kinetics-400 & 600, UCF-101, HMDB-51, ActivityNet and Charades, show
that our method achieves state-of-the-art performance in various recognition
scenarios, such as general, zero-shot, and few-shot video recognition. Our best
model achieves a state-of-the-art accuracy of 88.6% on the challenging
Kinetics-400 using the released CLIP model. The code is available at
https://github.com/whwu95/BIKE . |
This paper proposes BIKE, a novel framework that utilizes bidirectional cross-modal knowledge from pre-trained VLMs for enhanced video recognition. |
Existing methods utilizing VLMs for video recognition often only leverage unidirectional video-to-text matching, not fully exploiting VLMs' potential. |
BIKE comprises two branches: 1) Attributes branch: employs Video Attribute Association to generate textual attributes from videos, complementing recognition. 2) Video branch: utilizes Temporal Concept Spotting to generate temporal saliency from category descriptions, enhancing video representation. |
BIKE achieves state-of-the-art accuracy on Kinetics-400 (88.6%) using CLIP, outperforming methods with larger pre-training datasets.
It also demonstrates superior performance on ActivityNet, UCF-101, and HMDB-51, showing strong generalization ability.
BIKE exhibits promising results in few-shot and zero-shot video recognition settings, showcasing its effectiveness in data-scarce scenarios. |
The complementary effect of the Attributes branch diminishes with larger backbones.
Future work can explore automatically generating better lexicon for attribute generation. |
video recognition, vision-language models, cross-modal learning, temporal saliency, attribute generation |
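The parameter-free temporal saliency idea in the BIKE summary above can be illustrated by weighting frames according to their similarity to a text embedding. This is a simplified sketch, not the authors' mechanism: it assumes a single sentence-level text embedding, dot-product similarity on pre-normalized vectors, and a softmax with a hypothetical `temperature` parameter.

```python
import math


def temporal_saliency(frame_embs, text_emb, temperature=0.1):
    """Softmax over frame-to-text similarities; involves no learned weights."""
    sims = [sum(f * t for f, t in zip(frame, text_emb)) for frame in frame_embs]
    peak = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(frame_embs, weights):
    """Saliency-weighted sum of frame embeddings as the video embedding."""
    dim = len(frame_embs[0])
    return [sum(w * frame[k] for w, frame in zip(weights, frame_embs))
            for k in range(dim)]


frames = [[1.0, 0.0], [0.0, 1.0]]
text = [1.0, 0.0]
weights = temporal_saliency(frames, text)
video = aggregate(frames, weights)
uniform = temporal_saliency([[1.0, 0.0], [1.0, 0.0]], text)
```

Frames that match the category description receive higher weight than frames that do not, replacing plain average pooling over time without adding trainable parameters.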
2301.00157
Report |
Ponder: Point Cloud Pre-training via Neural Rendering |
Di Huang, Sida Peng, Tong He, Honghui Yang, Xiaowei Zhou, Wanli Ouyang |
We propose a novel approach to self-supervised learning of point cloud
representations by differentiable neural rendering. Motivated by the fact that
informative point cloud features should be able to encode rich geometry and
appearance cues and render realistic images, we train a point-cloud encoder
within a devised point-based neural renderer by comparing the rendered images
with real images on massive RGB-D data. The learned point-cloud encoder can be
easily integrated into various downstream tasks, including not only high-level
tasks like 3D detection and segmentation, but low-level tasks like 3D
reconstruction and image synthesis. Extensive experiments on various tasks
demonstrate the superiority of our approach compared to existing pre-training
methods. |
This paper introduces Ponder, a novel self-supervised point cloud representation learning framework that leverages differentiable neural rendering. |
Learning effective 3D point cloud representations is crucial for various applications, but existing pre-training methods have limitations such as reliance on contrastive learning or difficulty in handling point cloud irregularity. Ponder addresses these limitations by connecting 2D and 3D data through rendering, enabling the learning of rich geometry and appearance cues. |
Ponder takes RGB-D images as input, constructs point clouds via back-projection, encodes point features, and organizes them into a 3D feature volume. It then reconstructs a neural scene representation using SDF and utilizes differentiable rendering to generate color and depth images. The network is trained by minimizing the difference between rendered and real images. |
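The back-projection step named in the method can be illustrated with a minimal sketch. It assumes a standard pinhole camera model with intrinsics fx, fy, cx, cy and treats non-positive depth as missing; the function name back_project and these conventions are illustrative assumptions, not taken from the Ponder code.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def back_project(depth: List[List[float]],
                 fx: float, fy: float,
                 cx: float, cy: float) -> List[Point]:
    """Lift a depth map to 3D camera-frame points (pinhole model, assumed)."""
    points: List[Point] = []
    for v, row in enumerate(depth):        # v: pixel row index
        for u, d in enumerate(row):        # u: pixel column index
            if d <= 0:                     # assumption: non-positive depth is missing
                continue
            x = (u - cx) * d / fx          # horizontal offset scaled by depth
            y = (v - cy) * d / fy          # vertical offset scaled by depth
            points.append((x, y, d))
    return points


if __name__ == "__main__":
    toy_depth = [[2.0, 0.0],
                 [0.0, 4.0]]
    print(back_project(toy_depth, fx=2.0, fy=2.0, cx=0.0, cy=0.0))
```

In a full pipeline each point would also carry the RGB value of its source pixel; the encoder, feature volume, and SDF-based renderer described above are outside the scope of this sketch.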
Ponder significantly outperforms existing pre-training methods on 3D object detection and semantic segmentation tasks.
It demonstrates strong transfer learning ability to low-level 3D tasks, including scene reconstruction and image synthesis from point clouds, which is a first for pre-training methods.
The pre-trained Ponder model can be directly applied to 3D reconstruction and image synthesis from sparse point clouds, producing high-fidelity meshes and realistic images. |
The current Ponder model could be improved by integrating more recent advancements in neural representations for better rendering quality.
The flexible architecture design of Ponder presents potential for expansion to other self-supervised learning areas, like 2D image backbone pre-training, and different downstream tasks. |
self-supervised learning, point cloud representation, neural rendering, 3d object detection, 3d scene reconstruction |
2301.00135
Report |
TeViS: Translating Text Synopses to Video Storyboards |
Xu Gu, Yuchong Sun, Feiyue Ni, Shizhe Chen, Xihua Wang, Ruihua Song, Boyuan Li, Xiang Cao |
A video storyboard is a roadmap for video creation which consists of
shot-by-shot images to visualize key plots in a text synopsis. Creating video
storyboards, however, remains challenging which not only requires cross-modal
association between high-level texts and images but also demands long-term
reasoning to make transitions smooth across shots. In this paper, we propose a
new task called Text synopsis to Video Storyboard (TeViS) which aims to
retrieve an ordered sequence of images as the video storyboard to visualize the
text synopsis. We construct a MovieNet-TeViS dataset based on the public
MovieNet dataset. It contains 10K text synopses each paired with keyframes
manually selected from corresponding movies by considering both relevance and
cinematic coherence. To benchmark the task, we present strong CLIP-based
baselines and a novel VQ-Trans. VQ-Trans first encodes text synopsis and images
into a joint embedding space and uses vector quantization (VQ) to improve the
visual representation. Then, it auto-regressively generates a sequence of
visual features for retrieval and ordering. Experimental results demonstrate
that VQ-Trans significantly outperforms prior methods and the CLIP-based
baselines. Nevertheless, there is still a large gap compared to human
performance suggesting room for promising future work. The code and data are
available at: https://ruc-aimind.github.io/projects/TeViS/ |
This paper introduces TeViS, a novel task focused on automatically generating video storyboards from text synopses by retrieving and ordering relevant images. |
The task addresses the challenge faced by amateur video creators in translating their creative ideas into professional-looking visual sequences. |
The authors build a dataset, MovieNet-TeViS, derived from the MovieNet dataset, containing 10k text synopses paired with manually selected keyframes. They propose VQ-Trans, a decoder-only model that uses CLIP for text and image encoding and leverages vector quantization to improve visual representation for sequence generation. |
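The vector quantization step can be illustrated with a generic nearest-codebook lookup. This is a minimal sketch assuming squared Euclidean distance and a fixed, hand-written codebook; the function name quantize and the toy values are hypothetical, and the paper's actual quantizer (how the codebook is learned, which distance it uses) may differ.

```python
from typing import List, Sequence, Tuple


def quantize(feature: Sequence[float],
             codebook: List[Sequence[float]]) -> Tuple[int, Sequence[float]]:
    """Return (index, entry) of the codebook entry nearest to the feature."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        # squared L2 distance between two equal-length vectors
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)),
                key=lambda i: sq_dist(feature, codebook[i]))
    return index, codebook[index]


if __name__ == "__main__":
    toy_codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
    print(quantize([0.9, 1.2], toy_codebook))
```

Replacing each continuous image feature with a discrete codebook index in this way is what lets an auto-regressive decoder predict a sequence of visual tokens, which can then be matched against candidate keyframes for retrieval and ordering.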
VQ-Trans significantly outperforms several baselines, including those based on CLIP and existing story-to-image retrieval models.
Employing Vector Quantization for image discretization and a decoder-only architecture shows substantial improvement in ordering accuracy.
A considerable gap still exists between VQ-Trans and human performance, indicating a large space for future development. |
The current dataset and model don't fully encapsulate intricate cinematic styles like camera angles and movements, which limits the professional quality of generated storyboards.
Future work could focus on incorporating these nuanced elements and explore more advanced generative models to further bridge the gap to human performance. |
video storyboarding, text-to-image retrieval, sequence generation, vector quantization, movienet |